Working with cavities


The comparison of putative binding sites is used in various areas of pharmaceutical drug development such as side effect (off-target) prediction, protein function prediction, polypharmacology, drug repurposing, or bioisosteric replacements. The ccdc.cavity module provides an API for cavity search and comparison.

Cavities can be extracted from a protein using the algorithm described in “LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins”, M. Hendlich, F. Rippmann, G. Barnickel, J. Mol. Graph. Model., 15, 359-363, 1997 DOI: 10.1016/S1093-3263(98)00002-3.

Once extracted, they can be stored in an XML-based representation, and compared using three different approaches.

1) A cavity graph comparison method can be used, which is an implementation of the CavBase algorithm described in “A New Method to Detect Related Function Among Proteins Independent of Sequence and Fold Homology”, S. Schmitt, D. Kuhn and G. Klebe, J. Mol. Biol. 323, 387-406, 2002 DOI: 10.1016/S0022-2836(02)00811-2,

2) A fast cavity graph comparison can be used, which is derived from the cavity graph comparison method, with enhanced pseudocenter descriptors and accelerated local clique detection to provide faster comparisons with comparable accuracy. This is described in “Extended Graph-Based Models for Enhanced Similarity Search in Cavbase”, T. Krotzky, T. Fober, E. Hüllermeier’ and G. Klebe, IEEE/ACM Trans. Comput. Biol. Bioinf. 11, 878-890, 2014 DOI: 10.1109/TCBB.2014.2325020,

3) A cavity histograms comparison can be used, which is an implementation of the RAPMAD method described in “Large-Scale Mining for Similar Protein Binding Pockets: With RAPMAD Retrieval on the Fly Becomes Real”, T. Krotzky, C. Grunwald, U. Egerland and G. Klebe, J. Chem. Inf. Model., 55, 165-179, 2015 DOI: 10.1021/ci5005898.

Accessing cavity properties

First of all, let us import the required class from the appropriate module:

>>> from ccdc.cavity import Cavity

Cavities can be detected starting from a PDB file format using the LIGSITE algorithm:

>>> fxa = 'pdb1fax.pdb'
>>> fxa_cavities = Cavity.from_pdb_file(fxa)

Otherwise, cavities can be read from an XML file format using the methods ccdc.cavity.Cavity.from_xml() and ccdc.cavity.Cavity.from_xml_file().

The structure of Factor Xa (PDB entry: 1fax) has got two cavities labelled by unique identifiers. By default, cavities are sorted by volume with the largest first.

>>> print(len(fxa_cavities))
>>> print(', '.join(c.identifier for c in fxa_cavities))
pdb1fax.2, pdb1fax.1
>>> print(', '.join('%.3f' % c.volume for c in fxa_cavities))
704.875, 534.500

Ligands which occur in the cavity are indicated by an identifier consisting of the chain ID followed by the ligand name:

>>> for id in fxa_cavities[1].ligand_identifiers:
...     print(id)

If the cavity has been created with reference to a protein, the ligands themselves may be obtained:

>>> ligands = fxa_cavities[1].ligands
>>> for l in ligands:
...     print('%s: %d' % (l.identifier, len(l.atoms)))
A:DX9265: 61

Each cavity can be described by a number of ccdc.cavity.Cavity.Feature which are a representation of potential interaction sites. Seven different feature types are supported:

>>> print(list(sorted(Cavity.feature_types.keys())))
['acceptor', 'aliphatic', 'aromatic', 'donor', 'donor_acceptor', 'metal', 'pi']

For example, we can access the interaction sites of the biggest cavity of Factor Xa as follows:

>>> print(len(fxa_cavities[0].features))
>>> print(fxa_cavities[0].features) 
(Feature(acceptor at Coordinates(x=30.506, y=20.947, z=20.879)), ...

It is possible to find out a number of properties associated with each feature. If a PDB file was used to generate the cavity, then the information about the corresponding amino acid can be retrieved, otherwise it will appear as unknown (i.e., ‘UNK’). Also, the 3D coordinates of the feature can be accessed. This position corresponds to that of a protein atom for donor, acceptor, pi and metal types, while it will be the centroids of the atoms used to define aromatic or aliphatic features. Other properties include the protein vector describing the ideal direction for interacting with a given feature.

>>> f = fxa_cavities[0].features[0]
>>> print(f.type)
>>> print(f.amino_acid_code)
>>> print(f.coordinates)
Coordinates(x=30.506, y=20.947, z=20.879)
>>> print(f.protein_vector)
Vector(0.906, -0.077, 0.416)
>>> print(f.surface_vector)
Vector(0.907, -0.420, 0.031)
>>> print(f.surface_points[0])
Coordinates(x=30.139, y=18.850, z=22.370)

Features also contain a burial property that represents its degree of burial. Those values range from 0 (not buried at all) to 7 (deeply buried) and are the means of the depth values of all surface points that are associated with the feature. The burial properties can be used to identify the most buried areas of a cavity.

>>> print(round(f.burial, 3))

If the cavity was detected with protein information available, i.e., from a PDB format file, then a feature has properties to return the ccdc.protein.Protein.Chain, ccdc.protein.Protein.Residue and ccdc.molecule.Atom from which the feature was derived:

>>> print('Chain: %s, Residue: %s, Atom: %s' % (f.chain.identifier, f.residue, f.atom))
Chain: A, Residue: Residue(A:LEU47), Atom: Atom(O)

If no information about the protein is available, i.e. the cavity was created from an XML representation, these data will be None. Aromatic and aliphatic features do not have an associated atom; in this case the value of the property will be None.

It is possible to select subsets of the features of a cavity:

>>> cavity = fxa_cavities[1]
>>> print(cavity.features_by_type('acceptor')) 
set([Feature(acceptor at Coordinates(...), Feature(acceptor at Coordinates(...), ...

This sort of selection can be useful to check that a ligand is making appropriate contact with the features of a cavity, for example:

>>> ligand = cavity.ligands[0]
>>> for a in ligand.atoms:
...     if a.is_acceptor:
...         nearby_features = cavity.features_by_type('acceptor') & cavity.features_by_distance(a.coordinates, 3.0)
...         dist_features = sorted((f.point_distance(a.coordinates), f) for f in nearby_features)
...         print('\n'.join('%.2f: %s' % (d, str(f)) for d, f in dist_features))
...         break
2.65: Feature(acceptor at Coordinates(x=7.460, y=6.381, z=28.294))
2.80: Feature(acceptor at Coordinates(x=11.334, y=6.497, z=30.911))

Also available is the subset of features subtended by a list of residues:

>>> print(cavity.features_by_residues([f.residue for f in nearby_features])) 
set([Feature(pi at Coordinates(x=12.016, y=6.608, z=31.921)), Feature(donor at Coordinates(x=5.792, y=4.545, z=27.031)), Feature(acceptor at Coordinates(x=7.460, y=6.381, z=28.294)), Feature(acceptor at Coordinates(x=11.334, y=6.497, z=30.911))])

Cavities can be written out using an XML-based representation:

>>> s = fxa_cavities[0].to_xml()
>>> print(s) 
<?xml version='1.0'?> <CAVITY> ... </CAVITY>

Cavities, when they have been created from a PDB file, may be written in a Hermes-compatible rlbcoor file format:

>>> fxa_cavites[0].write('fxa_cavity.rlbcoor') 


There are different ways to create a subcavity on the basis of an existing cavity. One is by defining a discrete point p and a radius around it to represent a sphere in which the cavity features will be selected for the subcavity generation.

>>> print(len(fxa_cavities[1].features))
>>> p = (12.171, 7.574, 28.170)
>>> subcav = fxa_cavities[1].subcavity(fxa_cavities[1].features_by_distance(p, 5.0))
>>> print(len(subcav.features))

Also, a set of residues can be defined and only features arising from those amino acids will be used. Therefore, we are also making use of the Protein class.

>>> from ccdc.protein import Protein
>>> protein_file = 'pdb1fax.pdb'
>>> pdb = Protein.from_file(protein_file)
>>> cav_residues = [r for r in pdb.residues if r.identifier in ["A:TYR228", "A:VAL213", "A:ASP189", "A:ALA190"]]
>>> subcav = fxa_cavities[1].subcavity(fxa_cavities[1].features_by_residues(cav_residues))
>>> print(len(subcav.features))

Comparing cavities

Cavities can be compared using a fast cavity graph comparison, returning a ccdc.cavity.Cavity.FastCavityGraphComparison instance:

>>> comp = fxa_cavities[0].compare(fxa_cavities[1])

This supports some useful properties, including a similarity score for the two cavities. These scores range from 0.0 to 1.0, from the least similar to the most similar cavities.

>>> print(round(comp.score, 3))
>>> print(comp.largest_clique_size)
>>> print(comp.product_graph_size)

The fast cavity graph comparison method can be slow for comparisons of large cavities due to the generation of very large product graphs. A default limit of 36000 nodes is applied to the size of the product graph, beyond which comparisons will not be performed. This value can be adjusted by supplying an extra argument to the comparison function, specifying the maximum allowed size of the product graph.

Cavities can also be compared by using the cavity graph comparison method, returning a ccdc.cavity.Cavity.CavityGraphComparison instance:

>>> comp = fxa_cavities[0].compare(fxa_cavities[1], comparison_method=Cavity.ComparisonMethod.CAVITY_GRAPH_COMPARISON)

This also supports some useful properties, such as a similarity score for the two cavities. The score is always positive, and the larger the score, the more similar the two cavities being compared.

>>> print(round(comp.score, 3))
>>> print(round(comp.clique_rmsd, 3))
>>> print(comp.n_matches)
>>> print(comp.n_cliques)

Alternatively, one can use the even faster cavity histogram comparison method, which is based on feature distance histograms, and returns a similarity score. These scores range from 0.0 to 100.0, from the least similar to the most similar cavities.

>>> cav = Cavity.from_xml(s)
>>> print(, comparison_method=Cavity.ComparisonMethod.CAVITY_HISTOGRAMS_COMPARISON))
>>> print(round([1], comparison_method=Cavity.ComparisonMethod.CAVITY_HISTOGRAMS_COMPARISON), 2))

The histogram set for a cavity may be instantiated with any combination of reference points from which to construct the distance histograms for cavity similarity computation. These are ‘centroid’, ‘centroid_closest’ (which are the default settings), ‘centroid_furthest’ and ‘centroid_furthest_furthest’, and can be specifed through a further argument to the method.

The histograms generated during the comparison are a a tuple of ccdc.utilities.Histogram, one for each type of the cavity features, repeated for each reference point measure. For example, using two reference points for the histogram creation would therefore end up in 7 · 2 = 14 histograms.

Cavity databases

For associated collaborators, we supply a database of cavity information to allow comparatively fast searching for similar cavities. This is a database of LIGSITE-detected cavities from all X-ray crystallography and electron microscopy structures in the PDB, in a size range between 20 residues and approximately 13000 residues (1000 SEQRES lines in the PDB), and with a resolution better than 3 Angstroms. Only one model is included in the case of NMR structures.


These databases must be considered highly provisional: the underlying SQL schema, and indeed the underlying database technology are likely to change, though the public API exemplified here is likely to remain unaffected.

A utility script is provided, in the ccdc/utilities/create_cavity_database directory, to create cavity databases from a user-provided set of proteins. These should be within a single directory, in PDB format or, if Relibase+ is available, a directory of cavities in XML format. The script is passed the location of the database file, via the --database-file argument, and the location of the input data via the --pdb-directory or --xml-directory arguments.

We can use the CavityDatabase API to construct a test database:

>>> protein_dir = 'cavity_proteins'
>>> from ccdc.cavity import CavityDatabase
>>> import tempfile
>>> temp = tempfile.mkstemp()
>>> db = CavityDatabase(temp[1])
>>> db.populate_all(protein_dir) 
Skipping pdb1e7b because protein chain(s) with C-alpha coordinates only
Loaded 7 cavities in ...
Loaded 196 histograms in ...
Loaded info in ...
Loaded 2 ligands in ...

Note that one protein was skipped in the above analysis as some of the residues in one of the chains analysed was incomplete. By default, the database processing is strict, and requires that all protein residues are fully specified, but you can set this to be more relaxed by specifying a value of maximum_allowed_incomplete_residues greater than zero when calling ccdc.cavity.CavityDatabase.populate_all().

We can see some of the sizes of the database:

>>> print('Number of cavities: %d, Number of ligands: %d' % (db.get_number_of_cavities(), db.get_number_of_ligands()))
Number of cavities: 7, Number of ligands: 2

We can get access to the cavities with iterators:

>>> for cav in db.cavities():
...     print(cav.identifier)

We can get access to a single cavity through its identifier or to all cavities in a structure through the PDB identifier:

>>> cavities = db.get_cavities_by_pdb_code('1fax')
>>> print(len(cavities))

We can extract the ligands for a PDB code, or for a cavity identifier, or the cavities containing a particular ligand identifier:

>>> ligands = db.get_ligands_by_pdb_code('1fax')
>>> print(', '.join(ligands))
>>> print(db.get_ligands_by_cavity_identifier('pdb1fax.2'))
>>> cavs = db.get_cavities_by_ligand('DX9')
>>> print(', '.join(c.identifier for c in cavs))

We can get information about a cavity in the form of a dictionary of useful cavity attributes:

>>> info = db.get_info_for_cavity('pdb5boj_full_entry.1')
>>> print('NLigands: %d, NDonors: %d, NAcceptors: %d, Volume: %.3f' % (info['nligands'], info['ndonors'], info['nacceptors'], info['volume']))
NLigands: 0, NDonors: 8, NAcceptors: 9, Volume: 411.125

If further data from the database prove useful, a subsequent release will be able to make them available.

The primary purpose for the database is to be able to perform cavity comparisons. To achieve this we have a ccdc.cavity.CavityDatabase.Settings class that can be used to constrain a search, and a method that can take an optional ccdc.cavity.Cavity instance, an optional ccdc.cavity.CavityDatabase.Settings instance, and an argument specifying which comparison method should be used. Fast cavity graph comparison is the default comparison method. If the probe structure is None, the list of cavities matching the settings will be returned. Otherwise, the appropriate comparisons will be performed and the result will be a list of pairs, comparison scores and cavity identifiers, sorted by similarity.

For example, to return all cavities containing the ligand with identifier DX9:

>>> settings = db.Settings()
>>> settings.with_ligands = ['DX9']
>>> cavs =
>>> print(', '.join(c.identifier for c in cavs))

or to perform similarity comparisons using the cavity histograms comparison method on cavities containing few aromatic features:

>>> settings = db.Settings()
>>> settings.aromatic_range = [0, 1]
>>> cav = db.cavity('pdb5boj_full_entry.1')
>>> hits =, settings=settings, comparison_method=Cavity.ComparisonMethod.CAVITY_HISTOGRAMS_COMPARISON) 
>>> for score, identifier in hits:
...     print('Score: %5.2f, %s' % (score, identifier))
Score: 100.00, pdb5boj_full_entry.1
Score:  8.41, pdb5boj_full_entry.3
Score:  7.81, pdb1fax.2

or to perform similarity comparisons using the cavity graph comparison method on cavities with no more than five donor features:

>>> settings = db.Settings()
>>> settings.donor_range = [0,5]
>>> cav = db.cavity('pdb5boj_full_entry.1')
>>> hits =, settings=settings, comparison_method=Cavity.ComparisonMethod.CAVITY_GRAPH_COMPARISON)
>>> for score, identifier in hits:
...     print('Score: %.2f, Identifier: %s' % (round(score, 2), identifier))
Score: 4.05, Identifier: pdb5boj_full_entry.3

or to perform similarity comparisons using the fast cavity graph comparison method on cavities within a specified volume range:

>>> settings = db.Settings()
>>> settings.volume_range = [0, 425]
>>> cav = db.cavity('pdb5boj_full_entry.1')
>>> hits =, settings=settings)
>>> for score, identifier in hits:
...     print('Score: %.2f, Identifier: %s' % (round(score, 2), identifier))
Score: 1.00, Identifier: pdb5boj_full_entry.1
Score: 0.58, Identifier: pdb5boj_full_entry.2

or to use the cavity histograms comparison method on cavities containing one ligand:

>>> settings = db.Settings()
>>> settings.ligand_range = 1
>>> cav = db.cavity('pdb5boj_full_entry.1')
>>> hits =, settings=settings, comparison_method=Cavity.ComparisonMethod.CAVITY_HISTOGRAMS_COMPARISON)
>>> for score, identifier in hits:
...     print('Score: %.2f, Identifier: %s' % (round(score, 2), identifier))
Score: 7.41, Identifier: pdb3tnj_full_entry.1
Score: 7.25, Identifier: pdb1fax.1

See the definition of ccdc.cavity.CavityDatabase.Settings for the full range of constraints which may be placed on the search.