Pharmacophore searching


The ccdc.pharmacophore module is available only to CSD-Discovery and CSD-Enterprise users.


The ccdc.pharmacophore module can be used to define pharmacophore models and to search a CSD-CrossMiner database of protein structures and CSD crystal structures. It is assumed that you have a working knowledge of the CSD-CrossMiner application. Details of its operation may be found here: CSD-CrossMiner User Guide.


For more information on the CSD-CrossMiner methodology please see: “Interactive and Versatile Navigation of Structural Databases” O. Korb, B. Kuhn, J. Hert, N. Taylor, J. Cole, C. Groom and M. Stahl, J. Med. Chem., 2016 , 59, 4257-4266, DOI: 10.1021/acs.jmedchem.5b01756.

Features and Pharmacophore Queries

Let us import the module and its principal classes:

>>> from ccdc.pharmacophore import Pharmacophore
>>> from ccdc.descriptors import GeometricDescriptors
>>> from ccdc.utilities import Colour

Reading a Pharmacophore Query from File

A pharmacophore query can be defined in several ways, the easiest of which is to read in a pre-prepared file. For this example we will use the same pharmacophore file discussed in the CSD-CrossMiner tutorial 1 which was derived from a Human cathepsin L:

>>> file_name = ''

This existing pharmacophore query can be read into a ccdc.pharmacophore.Pharmacophore.Query:

>>> catl_s3_query = Pharmacophore.Query.from_file(file_name)

Inspecting the Features of a Pharmacophore Query

This query is essentially a collection of ccdc.pharmacophore.Pharmacophore.Feature instances, describing the chemical features of the molecular structures to be found in a pharmacophore search.


In this example, we have six pharmacophore points: a projected donor, two projected acceptors, a hydrophobe and two heavy atoms. You can access them as follows:

>>> features = catl_s3_query.features
>>> print(len(features))
>>> print('\n'.join(str(f) for f in features))

It is possible to inspect each individual pharmacophore point, find out if it belongs to a protein or a small molecule, retrieve the coordinates of the spere(s) representing it, etc:

>>> acc_proj = features[0]
>>> print(acc_proj.description)
>>> print(acc_proj.component_label)
>>> print(acc_proj.spheres)
(Sphere(Point(7.046, 15.714, 0.296) radius 1.00), Sphere(Point(6.46684, 14.4252, -2.12134) radius 1.20))

As you will can see from the picture above, an acceptor projected point is made by two spheres: the first one indicates the location of the acceptor atom while the second one indicates the location of a putative donor.

Each feature is derived from a feature definition, ccdc.pharmacophore.Pharmacophore.FeatureDefinition. This is a class which can recognise itself within a feature database, or within a crystal structure. Several standard definitions are supplied by the CCDC and may be retrieved by:

>>> Pharmacophore.read_feature_definitions()

If you have your own feature definitions, these can be added to the existing ones by providing a directory containing them (in any subdirectory) to ccdc.pharmacophore.Pharmacophore.read_feature_definitions(). The results are placed in a dictionary keyed by the identifier of the feature definition:

>>> keys = list(Pharmacophore.feature_definitions.keys())
>>> keys.sort()
>>> print('\n'.join(keys)) 

We can inspect a feature definition to see what SMARTS patterns it will recognise, and what point generation schemes are to be used. For example, an acceptor point is described by 27 SMARTS strings. The point generator information for the first of them dictates that pharmacophore points will be generated for a SP2 acceptor for the fourth (index 3) and second (index 1) atoms of the SMARTS pattern:

>>> acceptor = Pharmacophore.feature_definitions['acceptor']
>>> SMARTS_defs = acceptor.SMARTS_definitions
>>> print(len(SMARTS_defs))
>>> first = SMARTS_defs[0]
>>> print(first.SMARTS)
>>> for pg in first.point_generators:
...     print('Method: %s, atoms: %s' % pg)
Method: ACCEPTOR_SP2, atoms: (3,)
Method: ACCEPTOR_SP2, atoms: (1,)

Creating a Pharmacophore Query from a Reference Structure

A pharmacophore query can also be constructed from scratch starting from a list of features. One simple way to obtain features is by reference to a crystal structure. Let us read in a crystal structure from the CSD and identify some possible features from it (e.g. ring, donor projected):

>>> from ccdc import io
>>> csd = io.EntryReader('csd')
>>> crystal = csd.crystal('AABHTZ')
>>> ring_feature_def = Pharmacophore.feature_definitions['ring']
>>> ring_features = ring_feature_def.detect_features(crystal)
>>> print(len(ring_features))
>>> donor_proj_def = Pharmacophore.feature_definitions['donor_projected']
>>> donor_proj_features = donor_proj_def.detect_features(crystal)
>>> print(len(donor_proj_features))

This has created a pair of pharmacophore points located at the centroids of the two rings in AABHTZ and a donor projected feature at the donor atom, N6.

>>> print([a for a in crystal.molecule.atoms if a.is_donor])
>>> crystal.molecule.atom("N6").coordinates
Coordinates(x=0.162, y=-2.740, z=-0.491)
>>> print(donor_proj_features[0].spheres[0].centre)
Point(0.161958, -2.74013, -0.491269)

Each sphere will have a tolerance radius which is set to 1.0 Angstrom by default, but can be changed by the user. By modifying its value the user will increase or decrease the uncertainty in the position of this pharmacophore point in the overall pharmacophore query.

If you do not want all of the features detected by this method, they can be filtered using geometrical properties of the molecule. For example, the feature corresponding to the aromatic ring of AABHTZ may be found by checking the centroid of the aromatic ring and the location of the feature point:

>>> from ccdc.descriptors import MolecularDescriptors
>>> mol = crystal.molecule
>>> arom_ring = [r for r in mol.rings if r.is_aromatic][0]
>>> centroid = MolecularDescriptors.atom_centroid(*arom_ring.atoms)
>>> for arom_ring_feature in ring_features:
...     ring_centre = arom_ring_feature.spheres[0].centre
...     if all(abs(centroid[i] - ring_centre[i]) < 1e-6 for i in range(3)):
...         break

This approach allows the user to pick and choose which features of the crystal should be retained.

Then, a pharmacophore query may be created from any features thus created:

>>> aabhtz_query = Pharmacophore.Query(ring_features + donor_proj_features)
>>> print('\n'.join(str(f) for f in aabhtz_query.features))

When a query is created distance constraints are created between all the spheres of the features in the query. These can be inspected by ccdc.pharmacophore.Pharmacophore.Query.distance_constraints:

>>> for dc in aabhtz_query.distance_constraints:
...     print(dc)
DistanceConstraint((1, 0), (0, 0) (4.385, 8.385) which='Any')
DistanceConstraint((2, 0), (0, 0) (0.541, 4.541) which='Any')
DistanceConstraint((2, 1), (0, 0) (2.722, 6.722) which='Any')
DistanceConstraint((2, 0), (1, 0) (4.930, 8.930) which='Any')
DistanceConstraint((2, 1), (1, 0) (7.237, 11.237) which='Any')

You can see that by default the distance constraints are defined from the spheres of the features, with a range of plus or minus 2 Angstroms. Also the constraints are for a match with any atom. We can change the distance ranges by using ccdc.pharmacophore.Pharmacophore.Query.add_distance_constraint():

>>> aabhtz_query.add_distance_constraint((1, 0), (0, 0), (5.0, 8.0), 'Intramolecular')
>>> for dc in aabhtz_query.distance_constraints:
...     print(dc)
DistanceConstraint((1, 0), (0, 0) (5.000, 8.000) which='Intramolecular')
DistanceConstraint((2, 0), (0, 0) (0.541, 4.541) which='Any')
DistanceConstraint((2, 1), (0, 0) (2.722, 6.722) which='Any')
DistanceConstraint((2, 0), (1, 0) (4.930, 8.930) which='Any')
DistanceConstraint((2, 1), (1, 0) (7.237, 11.237) which='Any')

It is possible to set intramolecular constraints between all pharmacophore points if you want them to occur in the same structure. This can be done by setting the ccdc.pharmacophore.Pharmacophore.Query.intra_only attribute to True. These intra-molecular constraints will be applied only to features whose component label is ‘small_molecule’, so first we will make the appropriate change:

>>> for f in aabhtz_query.features:
...     f.component_label = 'small_molecule'
>>> aabhtz_query.intra_only = True
>>> for dc in aabhtz_query.distance_constraints:
...     print(dc)
DistanceConstraint((1, 0), (0, 0) (5.000, 8.000) which='Intramolecular')
DistanceConstraint((2, 0), (0, 0) (0.541, 4.541) which='Intramolecular')
DistanceConstraint((2, 1), (0, 0) (2.722, 6.722) which='Intramolecular')
DistanceConstraint((2, 0), (1, 0) (4.930, 8.930) which='Intramolecular')
DistanceConstraint((2, 1), (1, 0) (7.237, 11.237) which='Intramolecular')

Creating a New Pharmacophore Query

Pharmacophore points can also be created starting from feature definitions and arbitrary spheres, though care must be taken to make sure the geometry of the features is such that hits may be found. The most usual of these is when specifying an excluded volume, though any of the feature definitions may be used in this way. Here we can specify an excluded volume around the C11 atom AABHTZ and set its tolerance radius to 2.0 Angstrom, and similarly we can add a heavy atom features on the secong chlorine atom:

>>> aabhtz_mol = crystal.molecule
>>> excluded = Pharmacophore.ExcludedVolume(GeometricDescriptors.Sphere(aabhtz_mol.atom('C11').coordinates, 2.0))
>>> aabhtz_query.add_feature(excluded)
>>> heavy_def = Pharmacophore.feature_definitions['heavy_atom']
>>> cl2_feat = Pharmacophore.Feature(heavy_def, GeometricDescriptors.Sphere(aabhtz_mol.atom('Cl2').coordinates, 1.0))
>>> aabhtz_query.add_feature(cl2_feat)
>>> print('\n'.join(str(f) for f in aabhtz_query.features))

Creating Feature definitions

It is possible to create your own feature definitions from SMARTS definitions and point generators. Let us exemplify this by creating a feature definition for carboxylic acid:

>>> sm1 = 'C(=O)[OH1]'
>>> sd1 = Pharmacophore.SMARTSDefinition(sm1, (('POINT', (0,)),))
>>> sm2 = 'C(=O)[O-]'
>>> sd2 = Pharmacophore.SMARTSDefinition(sm2, (('POINT', (0,)),))
>>> pg = Pharmacophore.PointPointGenerator('POINT')
>>> sfd = Pharmacophore.FeatureDefinition.from_SMARTS_definitions('COOH', (sd1, sd2), (pg,))
>>> print(sfd)
>>> for d in sfd.SMARTS_definitions:
...     print(d.SMARTS)

Here we have defined SMARTS patterns for the two forms a carboxylic acid may take. For each of these we will wish to place a sphere at the position of the carbon atom, so we associate the SMARTS definition with a point generator named POINT, indicating that it should be applied to the first atom (index 0) of the match. With some patterns, such as the acceptor definition shown above, more than one atom may be decorated with spheres, so we provide a tuple of point generators. Then, when constructing the feature definition, we provide a tuple of ccdc.pharmacophore.Pharmacophore.SMARTSDefinition instances, and a tuple of ccdc.pharmacophore.Pharmacophore.PointGenerator instances, which must contain all point generator names referenced in the SMARTSDefinitions. The feature definition may be written for later use.

The result is a ccdc.pharmacophore.Pharmacophore.FeatureDefinition instance which may be used exactly as the CCDC-provided feature definitions, for example to detect_features in a crystal. If you want to use this feature definition in a search then it will need to be added to the feature database of the search.

When queries are created, distance constraints are constructed between the features to bound the range of the search. These can be inspected:

>>> constraints = aabhtz_query.distance_constraints
>>> print(len(constraints))
>>> print('\n'.join(str(dc) for dc in constraints)) 

If you wish you may add distance constraints to the query, using ccdc.pharmacophore.Pharmacophore.Query.add_distance_constraint().

For a more fine-grained approach to feature creation, it is possible to create features from atoms and a point generation scheme. This allows one to specify, for example, a trigonal geometry for an acceptor. Here we can pick any atom, not necessarily one which might be matched by the feature definition we are using. So, for example, we can create an SP2 acceptor feature sited on the H2 atom of AABHTZ:

>>> acceptor_projected = Pharmacophore.feature_definitions['acceptor_projected']
>>> for pgn in acceptor_projected.point_generator_names:
...     print(pgn)
>>> features = acceptor_projected.features_from_atoms('ACCEPTOR_SP2', aabhtz_mol.atom('H2'))
>>> print(features)
(Feature(acceptor_projected), Feature(acceptor_projected))

With the ring feature definitions, three or more atoms need to be provided to this method. As before, these atoms need not form part of a ring. Some point generation schemes will use the connectivity of the atoms to determine the geometry of the generated spheres.

Feature Databases

Once we have a pharmacophore query, we can use it to search a feature database. For brevity, in this documentation we will use a smaller feature database than that installed by default.

>>> feature_db_file = 'catl_s3.feat'
>>> feature_db = Pharmacophore.FeatureDatabase.from_file(feature_db_file)

This database was created from a set of catl_s3 hits, and 1000 CSD entries, with all the standard feature definitions enabled:

>>> print(len(feature_db))
>>> print(len(feature_db.databases))
>>> print('\n'.join(str(db) for db in feature_db.databases))
DatabaseInfo(file_name='catl_s3.csdsqlx', n_entries=641, colour=Colour(r=0, g=255, b=0, a=255))
DatabaseInfo(file_name='as543be_ASER.sqlite', n_entries=1000, colour=Colour(r=255, g=0, b=0, a=255))
>>> feature_defs = list(feature_db.feature_definitions)
>>> import operator
>>> feature_defs.sort(key=operator.attrgetter('identifier'))
>>> print('\n'.join(str(fd) for fd in feature_defs)) 

The colour parameter is useful if the database will be loaded into CSD-CrossMiner as it will be shown in the Feature Databases window and it will be used for the hit colouring in the Results Hitlist browser when a search is performed against this feature database. This is useful if multiple structure databases have been used to generate a single feature database.

We can check that the database contains feature definitions for each of the features in the pharmacophore query: this assures us that a search will be possible.

>>> db_feature_ids = set(fd.identifier for fd in feature_db.feature_definitions)
>>> print(all(f.identifier in db_feature_ids for f in catl_s3_query.features))

Feature databases may have user-specified annotations. From the database we can find out which annotations are present:

>>> keys = list(feature_db.annotation_keys)
>>> keys.sort()
>>> print('\n'.join(keys))
CSD Refcode
r factor

We can obtain individual entries, either by index or by identifier, from the database and inspect their annotations:

>>> entry = feature_db[0]
>>> print(entry.annotations['ec_number'])
>>> print(entry.annotations['pdb_class'])
Serine protease
>>> csd_entry = feature_db.entry('AABHTZ')
>>> print('\n'.join('%s: %s' % (k, v) for k, v in sorted(csd_entry.annotations.items())))
formula: C13 H12 Cl2 N6 O2
r factor: 4.1

Annotations may be added to the database. Here the first argument will be used as a glob-style pattern to match identifiers, and the remaining keyword arguments will be set as annotations in the database. The database can be written if you wish these annotations to be preserved.

>>> feature_db.annotate('A*', csd_entry="Yes", pdb_entry="No")
>>> print('csd_entry' in feature_db.annotation_keys)
>>> print(feature_db.entry('AABHTZ').annotations['csd_entry'])

Pharmacophore Searching

Like all searching classes in the API, the pharmacophore search class, ccdc.pharmacophore.Pharmacophore.Search has a nested settings class to control optional parameters of the search. In this example we will use one to control how many hits we would like, and a tighter criterion for RMSD of hit structures:

>>> settings = Pharmacophore.Search.Settings()
>>> settings.max_hit_structures = 20
>>> settings.max_hits_per_structure = 1
>>> settings.max_hit_rmsd = 1.0
>>> searcher = Pharmacophore.Search(settings)

And now we are ready to perform a search. We will use the cathepsin L query created above. The database argument is optional, and by default will search the CSD-CrossMiner feature database provided as part of your installation.

>>> hits =, database=feature_db)

The hits are a tuple of ccdc.pharmacophore.Pharmacophore.SearchHit instances. Their properties may be inspected, including the hit molecule and the transformation required to overlay the hit structures.

>>> print(len(hits))
>>> h = hits[0]
>>> print(round(h.overlay_rmsd, 3))
>>> print(h.identifier)
>>> print(h.annotations['pdb_class'])
Electron transport

The last component of the identifier is a hit index, catering for multiple hits on the same structure. We can recover the entry from the database by stripping this component. This may be useful, although most of the properties of the database entry are directly available from the hit.

>>> entry_id = h.identifier[:h.identifier.rindex('_')]

The hit will give access to the whole molecule of the hit, the protein portion (where relevant) and the small_molecule portion:

>>> whole_mol = h.molecule
>>> protein = h.protein
>>> small_mol = h.small_molecule
>>> print(len(whole_mol.atoms))
>>> print(len(protein.atoms))
>>> print(len(small_mol.atoms))

You can access the pharmacophore points matched by a hit. They will be in the order of the features used in the pharmacophore search.

>>> points = h.points

and the points associated with any given feature may be obtained:

>>> donor_projected_points = h.feature_points(catl_s3_query.features[1])
>>> print('%s -> %s' % (donor_projected_points[0], donor_projected_points[1]))
Point(53.6776, 13.1161, 17.386) -> Point(55.8518, 11.9577, 16.0552)

The values of the distance constraints of the query may be inspected:

>>> for dc, val in zip(catl_s3_query.distance_constraints, h.constraint_values()): 
...     print('(%d, %d) (%d, %d) %.3f' % (
...         dc.feature_point1[0], dc.feature_point1[1], dc.feature_point2[0], dc.feature_point2[1], val))
(1, 0) (0, 0) 5.947
(1, 0) (0, 1) 3.268
(1, 1) (0, 0) 8.599
(5, 0) (2, 1) 4.668
(5, 0) (3, 0) 4.880
(5, 0) (4, 0) 1.546

The molecules of the hits may be superimposed using the overlay_transformation property of the hit:

>>> for h in hits:
...    h.molecule.transform(h.overlay_transformation)

and they can then be written using a instance.

>>> from import MoleculeWriter
>>> with MoleculeWriter('hits.mol2') as writer:
...     for h in hits:
...             writer.write(h.molecule)

We can change the tolerances with which the search can be performed, as was done in the tutorial, by changing the radius of the spheres. Because we don’t have the full database for the purpose of this example, we will change only the heavy atom and hydrophobe features to ensure some hits are found:

>>> for feature in catl_s3_query.features:
...     if len(feature.spheres) == 1:
...         feature.spheres = [GeometricDescriptors.Sphere(s.centre, 0.5) for s in feature.spheres]
>>> hits =, database=feature_db)
>>> print(len(hits))  
>>> print(round(h.overlay_rmsd, 3))
>>> print(h.identifier)
>>> for h in hits:
...     print(h.annotations['pdb_class']) 
Electron transport
Electron transport
Electron transport
Electron transport

Searches may be filtered by the annotations found in the feature database. Only hits matching the annotation filters will be found. For example, we may constrain the search above to find only hydrolases:

>>> catl_s3_query.add_feature(Pharmacophore.AnnotationFilter('pdb_class', 'Hydrolase'))
>>> hits =, database=feature_db)
>>> for h in hits:
...     print(h.annotations['pdb_class']) 

It is possible to write the hits, together with any annotations they possess using a instance.

>>> from ccdc.entry import Entry
>>> with io.EntryWriter('hits_annotated.mol2') as writer:
...     for h in hits:
...         mol = h.molecule
...         mol.transform(h.overlay_transformation)
...         dic = { 'overlay_rmsd' : h.overlay_rmsd, 'cluster_number' : h.cluster_number }
...         dic.update(h.annotations)
...         e = Entry.from_molecule(mol, **dic)
...         writer.write(e)

Creating Feature Databases

The ccdc.pharmacophore may be used to create feature databases from molecular structure databases and from the CSD. Here we will show the simplest form of feature database creation. Firstly we shall construct the database information from which the database will be created:

>>> mol2_file = 'catl_s3.mol2'
>>> mol2_info = Pharmacophore.FeatureDatabase.DatabaseInfo(mol2_file, 0, Colour(0, 255, 0, 255))
>>> csd_info = Pharmacophore.FeatureDatabase.DatabaseInfo(csd.file_name, 1000, Colour(255, 0, 0, 255))

The first of these will indicate that all structures will be read from the given mol2 file. The second of these indicates that up to 1000 structures passing the default set of structural filters will be read from the CSD. Two different colours will be assigned to the structures retrieved from the mol2 file and those obtained from the CSD. These colours will be shown when the database is loaded into CSD-CrossMiner and will be used for the hit colouring.

Then we make structure database instances:

>>> import tempfile, os
>>> tempdir = tempfile.mkdtemp()
>>> csdsqlx = os.path.join(tempdir, os.path.basename(mol2_file).replace('.mol2', '.csdsqlx'))
>>> mol2_sdb = Pharmacophore.FeatureDatabase.Creator.StructureDatabase(mol2_info, use_crystal_symmetry=False, structure_database_path=csdsqlx)
>>> csd_sdb = Pharmacophore.FeatureDatabase.Creator.StructureDatabase(csd_info, use_crystal_symmetry=True)

The first of these specifies a location to store the generated csdsqlx file, and that crystal symmetry will not be used. When use_crystal_symmetry is set to True, symmetry-related copies of the molecule will be created based on the spacegroup information supplied for each structure. This is not recommended unless you are using CSD structures. The second specifies that the CSD will be searched using crystal symmetry and with a default set of filters (see below): should you wish to have your own set of filters, provide a instance to control the filtering of the database.

Now we can create the database:

>>> creator = Pharmacophore.FeatureDatabase.Creator()
>>> print(creator.StructureDatabase.default_csd_filters()) 
    has_3d_coordinates = True
    max_hit_structures = 0
    max_r_factor = 10.0
    must_have_elements = []
    must_not_have_elements = [He (2), Be (4), Ne (10), Al (13), Si (14), Ar (18), Sc (21), Ti (22), V (23), Cr (24), Ga (31), Ge (32), As (33), Se (34), Kr (36), Rb (37), Y (39), Zr (40), Nb (41), Mo (42), Tc (43), Ru (44), Rh (45), Pd (46), Ag (47), Cd (48), In (49), Sn (50), Sb (51), Te (52), Xe (54), Cs (55), Ba (56), La (57), Ce (58), Pr (59), Nd (60), Pm (61), Sm (62), Eu (63), Gd (64), Tb (65), Dy (66), Ho (67), Er (68), Tm (69), Yb (70), Lu (71), Hf (72), Ta (73), W (74), Re (75), Os (76), Ir (77), Pt (78), Au (79), Hg (80), Tl (81), Pb (82), Bi (83), Po (84), Rn (86), Fr (87), Ra (88), Ac (89), Th (90), Pa (91), U (92)]
    no_disorder = Non-hydrogen
    no_errors = False
    no_ions = False
    no_metals = False
    no_powder = False
    not_polymeric = True
    only_organic = False
    only_organometallic = False
>>> db = creator.create((mol2_sdb, csd_sdb))

This will by default set the indexed features of the database to be all those which have been read in by ccdc.pharmacophore.Pharmacophore.read_feature_definitions(). If you wish to control the features more finely, or to add your own features, you can provide an iterable of ccdc.pharmacophore.Pharmacophore.FeatureDefinition instances to the create method.

The feature database can now be written to a file:

>>> out_feature_db = os.path.join(tempdir, 'catl_s3.feat')
>>> db.write(out_feature_db)