Working with cavities¶
Introduction¶
The comparison of putative binding sites is used in various areas of
pharmaceutical drug development such as side effect (off-target) prediction,
protein function prediction, polypharmacology, drug repurposing, or
bioisosteric replacements. The ccdc.cavity
module provides an API for
cavity search and comparison.
Cavities can be extracted from a protein using the algorithm described in “LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins”, M. Hendlich, F. Rippmann, G. Barnickel, J. Mol. Graph. Model., 15, 359-363, 1997 DOI: 10.1016/S1093-3263(98)00002-3.
Once extracted, they can be stored in an XML-based representation, and compared using three different approaches.
1) A cavity graph comparison method can be used, which is an implementation of the CavBase algorithm described in “A New Method to Detect Related Function Among Proteins Independent of Sequence and Fold Homology”, S. Schmitt, D. Kuhn and G. Klebe, J. Mol. Biol. 323, 387-406, 2002 DOI: 10.1016/S0022-2836(02)00811-2,
2) A fast cavity graph comparison can be used, which is derived from the cavity graph comparison method, with enhanced pseudocenter descriptors and accelerated local clique detection to provide faster comparisons with comparable accuracy. This is described in “Extended Graph-Based Models for Enhanced Similarity Search in Cavbase”, T. Krotzky, T. Fober, E. Hüllermeier’ and G. Klebe, IEEE/ACM Trans. Comput. Biol. Bioinf. 11, 878-890, 2014 DOI: 10.1109/TCBB.2014.2325020,
3) A cavity histograms comparison can be used, which is an implementation of the RAPMAD method described in “Large-Scale Mining for Similar Protein Binding Pockets: With RAPMAD Retrieval on the Fly Becomes Real”, T. Krotzky, C. Grunwald, U. Egerland and G. Klebe, J. Chem. Inf. Model., 55, 165-179, 2015 DOI: 10.1021/ci5005898.
Accessing cavity properties¶
First of all, let us import the required class from the appropriate module:
>>> from ccdc.cavity import Cavity
Cavities can be detected starting from a PDB file format using the LIGSITE algorithm:
>>> fxa = 'pdb1fax.pdb'
>>> fxa_cavities = Cavity.from_pdb_file(fxa)
Otherwise, cavities can be read from an XML file format using the methods
ccdc.cavity.Cavity.from_xml()
and
ccdc.cavity.Cavity.from_xml_file()
.
The structure of Factor Xa (PDB entry: 1fax) has got two cavities labelled by unique identifiers. By default, cavities are sorted by volume with the largest first.
>>> print(len(fxa_cavities))
2
>>> print(', '.join(c.identifier for c in fxa_cavities))
pdb1fax.2, pdb1fax.1
>>> print(', '.join('%.3f' % c.volume for c in fxa_cavities))
704.875, 534.500
Ligands which occur in the cavity are indicated by an identifier consisting of the chain ID followed by the ligand name:
>>> for id in fxa_cavities[1].ligand_identifiers:
... print(id)
A:DX9
If the cavity has been created with reference to a protein, the ligands themselves may be obtained:
>>> ligands = fxa_cavities[1].ligands
>>> for l in ligands:
... print('%s: %d' % (l.identifier, len(l.atoms)))
A:DX9265: 61
Each cavity can be described by a number of ccdc.cavity.Cavity.Feature
which are a representation of potential interaction sites. Seven different
feature types are supported:
>>> print(list(sorted(Cavity.feature_types.keys())))
['acceptor', 'aliphatic', 'aromatic', 'donor', 'donor_acceptor', 'metal', 'pi']
For example, we can access the interaction sites of the biggest cavity of Factor Xa as follows:
>>> print(len(fxa_cavities[0].features))
50
>>> print(fxa_cavities[0].features)
(Feature(acceptor at Coordinates(x=30.506, y=20.947, z=20.879)), ...
It is possible to find out a number of properties associated with each feature. If a PDB file was used to generate the cavity, then the information about the corresponding amino acid can be retrieved, otherwise it will appear as unknown (i.e., ‘UNK’). Also, the 3D coordinates of the feature can be accessed. This position corresponds to that of a protein atom for donor, acceptor, pi and metal types, while it will be the centroids of the atoms used to define aromatic or aliphatic features. Other properties include the protein vector describing the ideal direction for interacting with a given feature.
>>> f = fxa_cavities[0].features[0]
>>> print(f.type)
acceptor
>>> print(f.amino_acid_code)
LEU
>>> print(f.coordinates)
Coordinates(x=30.506, y=20.947, z=20.879)
>>> print(f.protein_vector)
Vector(0.906, -0.077, 0.416)
>>> print(f.surface_vector)
Vector(0.907, -0.420, 0.031)
>>> print(f.surface_points[0])
Coordinates(x=30.139, y=18.850, z=22.370)
Features also contain a burial property that represents its degree of burial. Those values range from 0 (not buried at all) to 7 (deeply buried) and are the means of the depth values of all surface points that are associated with the feature. The burial properties can be used to identify the most buried areas of a cavity.
>>> print(round(f.burial, 3))
5.329
If the cavity was detected with protein information available, i.e., from a
PDB format file, then a feature has properties to return the
ccdc.protein.Protein.Chain
, ccdc.protein.Protein.Residue
and
ccdc.molecule.Atom
from which the feature was derived:
>>> print('Chain: %s, Residue: %s, Atom: %s' % (f.chain.identifier, f.residue, f.atom))
Chain: A, Residue: Residue(A:LEU47), Atom: Atom(O)
If no information about the protein is available, i.e. the cavity was created
from an XML representation, these data will be None
. Aromatic and
aliphatic features do not have an associated atom; in this case the value of
the property will be None
.
It is possible to select subsets of the features of a cavity:
>>> cavity = fxa_cavities[1]
>>> print(cavity.features_by_type('acceptor'))
set([Feature(acceptor at Coordinates(...), Feature(acceptor at Coordinates(...), ...
This sort of selection can be useful to check that a ligand is making appropriate contact with the features of a cavity, for example:
>>> ligand = cavity.ligands[0]
>>> for a in ligand.atoms:
... if a.is_acceptor:
... nearby_features = cavity.features_by_type('acceptor') & cavity.features_by_distance(a.coordinates, 3.0)
... dist_features = sorted((f.point_distance(a.coordinates), f) for f in nearby_features)
... print('\n'.join('%.2f: %s' % (d, str(f)) for d, f in dist_features))
... break
2.65: Feature(acceptor at Coordinates(x=7.460, y=6.381, z=28.294))
2.80: Feature(acceptor at Coordinates(x=11.334, y=6.497, z=30.911))
Also available is the subset of features subtended by a list of residues:
>>> print(cavity.features_by_residues([f.residue for f in nearby_features]))
set([Feature(pi at Coordinates(x=12.016, y=6.608, z=31.921)), Feature(donor at Coordinates(x=5.792, y=4.545, z=27.031)), Feature(acceptor at Coordinates(x=7.460, y=6.381, z=28.294)), Feature(acceptor at Coordinates(x=11.334, y=6.497, z=30.911))])
Cavities can be written out using an XML-based representation:
>>> s = fxa_cavities[0].to_xml()
>>> print(s)
<?xml version='1.0'?> <CAVITY> ... </CAVITY>
Cavities, when they have been created from a PDB file, may be written in a Hermes-compatible rlbcoor file format:
>>> fxa_cavites[0].write('fxa_cavity.rlbcoor')
Subcavities¶
There are different ways to create a subcavity on the basis of an existing cavity. One is by defining a discrete point p and a radius around it to represent a sphere in which the cavity features will be selected for the subcavity generation.
>>> print(len(fxa_cavities[1].features))
61
>>> p = (12.171, 7.574, 28.170)
>>> subcav = fxa_cavities[1].subcavity(fxa_cavities[1].features_by_distance(p, 5.0))
>>> print(len(subcav.features))
13
Also, a set of residues can be defined and only features arising from those amino acids will be used. Therefore, we are also making use of the Protein class.
>>> from ccdc.protein import Protein
>>> protein_file = 'pdb1fax.pdb'
>>> pdb = Protein.from_file(protein_file)
>>> cav_residues = [r for r in pdb.residues if r.identifier in ["A:TYR228", "A:VAL213", "A:ASP189", "A:ALA190"]]
>>> subcav = fxa_cavities[1].subcavity(fxa_cavities[1].features_by_residues(cav_residues))
>>> print(len(subcav.features))
10
Comparing cavities¶
Cavities can be compared using a fast cavity graph comparison, returning a
ccdc.cavity.Cavity.FastCavityGraphComparison
instance:
>>> comp = fxa_cavities[0].compare(fxa_cavities[1])
This supports some useful properties, including a similarity score for the two cavities. These scores range from 0.0 to 1.0, from the least similar to the most similar cavities.
>>> print(round(comp.score, 3))
0.232
>>> print(comp.largest_clique_size)
13
>>> print(comp.product_graph_size)
462
The fast cavity graph comparison method can be slow for comparisons of large
cavities due to the generation of very large product graphs. A default limit of
36000 nodes is applied to the size of the product graph, beyond which comparisons
will not be performed. This value can be adjusted by supplying
an extra argument to the ccdc.cavity.Cavity.compare()
comparison function, specifying the maximum allowed size of the product graph.
Cavities can also be compared by using the cavity graph comparison method, returning a
ccdc.cavity.Cavity.CavityGraphComparison
instance:
>>> comp = fxa_cavities[0].compare(fxa_cavities[1], comparison_method=Cavity.ComparisonMethod.CAVITY_GRAPH_COMPARISON)
This also supports some useful properties, such as a similarity score for the two cavities. The score is always positive, and the larger the score, the more similar the two cavities being compared.
>>> print(round(comp.score, 3))
4.532
>>> print(round(comp.clique_rmsd, 3))
1.158
>>> print(comp.n_matches)
6
>>> print(comp.n_cliques)
9
Alternatively, one can use the even faster cavity histogram comparison method, which is based on feature distance histograms, and returns a similarity score. These scores range from 0.0 to 100.0, from the least similar to the most similar cavities.
>>> cav = Cavity.from_xml(s)
>>> print(cav.compare(cav, comparison_method=Cavity.ComparisonMethod.CAVITY_HISTOGRAMS_COMPARISON))
100.0
>>> print(round(cav.compare(fxa_cavities[1], comparison_method=Cavity.ComparisonMethod.CAVITY_HISTOGRAMS_COMPARISON), 2))
7.05
The histogram set for a cavity may be instantiated with any combination of
reference points from which to construct the distance histograms for cavity
similarity computation. These are ‘centroid’, ‘centroid_closest’ (which are the
default settings), ‘centroid_furthest’ and ‘centroid_furthest_furthest’, and
can be specifed through a further argument to the
ccdc.cavity.Cavity.compare()
method.
The histograms generated during the comparison are a a tuple of
ccdc.utilities.Histogram
, one for each type of the cavity features,
repeated for each reference point measure. For example, using two reference
points for the histogram creation would therefore end up in 7 · 2 = 14
histograms.
Cavity databases¶
For associated collaborators, we supply a database of cavity information to allow comparatively fast searching for similar cavities. This is a database of LIGSITE-detected cavities from all X-ray crystallography and electron microscopy structures in the PDB, in a size range between 20 residues and approximately 13000 residues (1000 SEQRES lines in the PDB), and with a resolution better than 3 Angstroms. Only one model is included in the case of NMR structures.
Note
These databases must be considered highly provisional: the underlying SQL schema, and indeed the underlying database technology are likely to change, though the public API exemplified here is likely to remain unaffected.
A utility script is provided, create_cavity_database.py
in the
ccdc/utilities/create_cavity_database
directory, to create cavity
databases from a user-provided set of proteins. These should be within a
single directory, in PDB format or, if Relibase+ is available, a directory of
cavities in XML format. The script is passed the location of the database
file, via the --database-file
argument, and the location of the input data
via the --pdb-directory
or --xml-directory
arguments.
We can use the CavityDatabase API to construct a test database:
>>> protein_dir = 'cavity_proteins'
>>> from ccdc.cavity import CavityDatabase
>>> import tempfile
>>> temp = tempfile.mkstemp()
>>> db = CavityDatabase(temp[1])
>>> db.populate_all(protein_dir)
Skipping pdb1e7b because protein chain(s) with C-alpha coordinates only
Loaded 7 cavities in ...
Loaded 196 histograms in ...
Loaded info in ...
Loaded 2 ligands in ...
Note that one protein was skipped in the above analysis as some of the residues
in one of the chains analysed was incomplete. By default, the database processing is
strict, and requires that all protein residues are fully specified, but you can set this
to be more relaxed by specifying a value of maximum_allowed_incomplete_residues
greater than zero when calling ccdc.cavity.CavityDatabase.populate_all()
.
We can see some of the sizes of the database:
>>> print('Number of cavities: %d, Number of ligands: %d' % (db.get_number_of_cavities(), db.get_number_of_ligands()))
Number of cavities: 7, Number of ligands: 2
We can get access to the cavities with iterators:
>>> for cav in db.cavities():
... print(cav.identifier)
pdb1fax.1
pdb1fax.2
pdb3tnj_full_entry.1
pdb5boj_full_entry.1
pdb5boj_full_entry.2
pdb5boj_full_entry.3
pdb5boj_full_entry.4
We can get access to a single cavity through its identifier or to all cavities in a structure through the PDB identifier:
>>> cavities = db.get_cavities_by_pdb_code('1fax')
>>> print(len(cavities))
2
We can extract the ligands for a PDB code, or for a cavity identifier, or the cavities containing a particular ligand identifier:
>>> ligands = db.get_ligands_by_pdb_code('1fax')
>>> print(', '.join(ligands))
DX9
>>> print(db.get_ligands_by_cavity_identifier('pdb1fax.2'))
[]
>>> cavs = db.get_cavities_by_ligand('DX9')
>>> print(', '.join(c.identifier for c in cavs))
pdb1fax.1
We can get information about a cavity in the form of a dictionary of useful cavity attributes:
>>> info = db.get_info_for_cavity('pdb5boj_full_entry.1')
>>> print('NLigands: %d, NDonors: %d, NAcceptors: %d, Volume: %.3f' % (info['nligands'], info['ndonors'], info['nacceptors'], info['volume']))
NLigands: 0, NDonors: 8, NAcceptors: 9, Volume: 411.125
If further data from the database prove useful, a subsequent release will be able to make them available.
The primary purpose for the database is to be able to perform cavity
comparisons. To achieve this we have a
ccdc.cavity.CavityDatabase.Settings
class that can be used to constrain
a search, and a method ccdc.cavity.CavityDatabase.search()
that can
take an optional ccdc.cavity.Cavity
instance, an optional
ccdc.cavity.CavityDatabase.Settings
instance, and an argument
specifying which comparison method should be used. Fast cavity graph comparison is
the default comparison method. If the probe structure is None, the list of
cavities matching the settings will be returned. Otherwise, the appropriate
comparisons will be performed and the result will be a list of pairs,
comparison scores and cavity identifiers, sorted by similarity.
For example, to return all cavities containing the ligand with identifier DX9:
>>> settings = db.Settings()
>>> settings.with_ligands = ['DX9']
>>> cavs = db.search(settings=settings)
>>> print(', '.join(c.identifier for c in cavs))
pdb1fax.1
or to perform similarity comparisons using the cavity histograms comparison method on cavities containing few aromatic features:
>>> settings = db.Settings()
>>> settings.aromatic_range = [0, 1]
>>> cav = db.cavity('pdb5boj_full_entry.1')
>>> hits = db.search(cav, settings=settings, comparison_method=Cavity.ComparisonMethod.CAVITY_HISTOGRAMS_COMPARISON)
>>> for score, identifier in hits:
... print('Score: %5.2f, %s' % (score, identifier))
Score: 100.00, pdb5boj_full_entry.1
Score: 8.41, pdb5boj_full_entry.3
Score: 7.81, pdb1fax.2
or to perform similarity comparisons using the cavity graph comparison method on cavities with no more than five donor features:
>>> settings = db.Settings()
>>> settings.donor_range = [0,5]
>>> cav = db.cavity('pdb5boj_full_entry.1')
>>> hits = db.search(cav, settings=settings, comparison_method=Cavity.ComparisonMethod.CAVITY_GRAPH_COMPARISON)
>>> for score, identifier in hits:
... print('Score: %.2f, Identifier: %s' % (round(score, 2), identifier))
Score: 4.05, Identifier: pdb5boj_full_entry.3
or to perform similarity comparisons using the fast cavity graph comparison method on cavities within a specified volume range:
>>> settings = db.Settings()
>>> settings.volume_range = [0, 425]
>>> cav = db.cavity('pdb5boj_full_entry.1')
>>> hits = db.search(cav, settings=settings)
>>> for score, identifier in hits:
... print('Score: %.2f, Identifier: %s' % (round(score, 2), identifier))
Score: 1.00, Identifier: pdb5boj_full_entry.1
Score: 0.58, Identifier: pdb5boj_full_entry.2
or to use the cavity histograms comparison method on cavities containing one ligand:
>>> settings = db.Settings()
>>> settings.ligand_range = 1
>>> cav = db.cavity('pdb5boj_full_entry.1')
>>> hits = db.search(cav, settings=settings, comparison_method=Cavity.ComparisonMethod.CAVITY_HISTOGRAMS_COMPARISON)
>>> for score, identifier in hits:
... print('Score: %.2f, Identifier: %s' % (round(score, 2), identifier))
Score: 7.41, Identifier: pdb3tnj_full_entry.1
Score: 7.25, Identifier: pdb1fax.1
See the definition of ccdc.cavity.CavityDatabase.Settings
for the full
range of constraints which may be placed on the search.