Substructure searching


In order to be able to set up a substructure search we will need to import the module. Let us also import the module to allow us to read in and write out molecules.

>>> import
>>> import

As a preamble let us set up a variable for a temporary directory.

>>> from ccdc.utilities import _test_output_dir
>>> tempdir = _test_output_dir()

Let us also get a testosterone molecule out of the CSD.

>>> entry_reader ='CSD')
>>> teston10 = entry_reader.molecule('TESTON10')
>>> testosterone = teston10.components[0]

Setting up a substructure

There are several ways to set up a

It can be created from a molecule.

>>> testosterone_substructure =

It can also be created from a SMARTS string.

>>> hydroxyl_substructure ="OH")
>>> ketone_substructure ="[CD4][CD3](=[OD1])[CD4]")

There is a small extension to Daylight SMARTS to allow quadruple, delocalised and pi bonds to be represented, using the characters ‘_’, ‘”’ and ‘|’ respectively.

A substructure can also be read in from a ConQuest Connser file.

>>> filepath = 'monochloropyridine.con'

To achieve this we make use of the class.

>>> connser_substructure =

Substructure search hits

Let us print out the hit identifiers of the original 10 hits identified by the testosterone_search search on the CSD.

>>> for hit in hits:
...     print(hit.identifier)

These can be superimposed using the matched atoms of the hits.

>>> mols = hits.superimpose()
>>> print(len(mols))


Notice that we can only superimpose the structures where all atoms have coordinates.

Let us write out the molecules from these hits as a multi-mol2 file.

>>> output_file = os.path.join(tempdir, 'testosterone_hits.mol2')
>>> with as writer:
...     for m in mols:
...         writer.write(m)

We can also write the result data into a ConQuest to Mercury interchange file, so that the results, with constraints and measurements may be analysed in the data analysis module of Mercury.

>>> hits.write_c2m_file(os.path.join(tempdir, 'testosterone_hits.c2m'))

We can find the matched atoms for each hit, where the hit atoms have coordinates:

>>> print(hits[1].match_atoms())  
[Atom(C1), Atom(C2), Atom(C3), ...]

All forms of search hit support molecule, crystal and entry properties:

>>> entry = hits[1].entry
>>> print(entry.chemical_name == u'17\u03b1-Hydroxyandrost-4-en-3-one')

>>> crystal = hits[1].crystal
>>> print(crystal.spacegroup_symbol)

>>> molecule = hits[1].molecule
>>> print(len(molecule.atoms))


The chemical name from the entry ccdc.entry.Entry.chemical_name is in Unicode format and \u03b1 is the encoding for the ‘GREEK SMALL LETTER ALPHA’ For more information on how to work with Unicode in Python please see

Create a substructure from scratch

To create a substructure from scracth we will use a populated with instances of and

Query Atoms

A may be made to represent a single element, any of a set of elements or a wild card matching any element:

>>> c1 ='C')
>>> c_or_n =['N', 'O'])
>>> any_atom =

Atom constraints

Constraints may be put on to ensure that certain properties of a matched atom are fulfilled. The complete list of constraints are:

  • acceptor: that the atom be a hydrogen bond acceptor
  • donor: that the atom be a hydrogen bond donor
  • aromatic: that the atom be in an aromatic ring
  • cyclic: that the atom be in a ring
  • formal_charge: that the atom have a certain formal charge
  • formal_valency: that the atom have a certain formal valency
  • cyclic_bonds: that the atom have a certain number of cyclic bonds
  • smallest_ring: that the smallest ring in which the atom lies is of a certain size
  • num_bonds: that the atom have a certain number of neighbours
  • num_hydrogens: that the atom have a certain number of hydrogens bonded to it
  • nimplicit_hydrogens: that the atom have a certain number of implicit hydrogens bonded to it
  • has_3d_coordinates: that the atom have 3D coordinates
  • unfused_unbridged_ring: that the atom be or not be in an unfused/unbridged ring

Additionally there is a method which may be used to specify counts of neighbouring atoms of certain element types. For use when searching protein structures there is a method This takes any number of string arguments drawn from ‘AMINO_ACID’, ‘LIGAND’, ‘WATER’, ‘METAL’, ‘UNKNOWN’, and can be used to constrain an atom match to one of these classes. When searching non-protein structures all atoms will be of ‘UNKNOWN’ type.

Constraint conditions

Those constraints which have a boolean value may be assigned to with either True, False or None: if None any constraint of that type will be removed; if True the constraint will be matched only if the atom matches the constraint; if False the constraint will be matched only if the atom does not match the constraint.

Those constraints which take a numeric value may be assigned one of the following:

  • a single numeric value: the constraint will match only if the value of the constraint equals the value
  • a pair of numeric values: the constraint will match if the value of the constraint is in the inclusive range of the values
  • a pair of a string representation of an operator and a numeric value
    • (‘==’, value)
    • (‘>’, value)
    • (‘<’, value)
    • (‘<=’, value)
    • (‘>=’, value)
    • (‘!=’, value)
    • (‘in’, list_of_values)

So, for example, we can add constraints such as:

>>> a ='N')
>>> a.formal_charge = ('in', [1, 2])
>>> a.num_bonds = ('>', 2)
>>> print(a)
QueryAtom(N)[charge: one of 1, 2, number of connected atoms: greater than 2]

Query Bonds

A represents a bond which must be matched in a substructure search. It may represent one or more specific bond types, or any bond type. It may have constraints applied to it, too. The constraints applicable to a are:

  • cyclic
  • bond_length
  • bond_polymeric
  • bond_smallest_ring
  • bond_unfused_unbridged_ring

The bond length constraint may take any of the forms given above; the rest are boolean valued.

Query Substructure

Instances of and may be added to a to form the subject of a substructure search. The following shows a query which may be used to search for abnormally short C=O bond lengths:

>>> c ='C')
>>> o ='O')
>>> b ='Double')
>>> b.bond_length = ('<', 0.94)
>>> short_CO_query =
>>> _ = short_CO_query.add_atom(c)
>>> _ = short_CO_query.add_atom(o)
>>> _ = short_CO_query.add_bond(b, c, o)

And this can be used to search for carboxylic acids:

>>> cooh_substructure =
>>> c = cooh_substructure.add_atom('C')
>>> o1 = cooh_substructure.add_atom('O')
>>> o2 = cooh_substructure.add_atom('O')
>>> b1 = cooh_substructure.add_bond('Double', c, o1)
>>> b2 = cooh_substructure.add_bond('Single', c, o2)
>>> o1.num_hydrogens = 0
>>> o2.num_hydrogens = 1

With any of these substructures, a test may be made that a particular atom matches an atom of the substructure, within the context of the whole substructure. This is equivalent to having a constraint on a of the entire substructure. For example, using the ketone_substructure defined above, and the molecule ABINUU,

>>> abinuu = entry_reader.molecule('ABINUU')
>>> print(ketone_substructure.match_atom(abinuu.atom('C8')))
>>> print(ketone_substructure.match_atom(abinuu.atom('C15')))

Note that by default this match will use the first atom of the substructure, which will be the faster case. If required an arbitrary atom of the query substructure may be used:

>>> print(ketone_substructure.match_atom(abinuu.atom('C15'), ketone_substructure.atoms[1]))

Substructure searching with geometric measurements

It is possible to define geometric objects which may be used as part of a geometric measurement, a constraint or another geometric object. These are defined by:

Each of these methods takes a name, by which the geometric object is known and a sequence of point definitions. The dummy point definition takes an additional argument, before the two point definitions defining the distance along the vector to place the point.

A point can be defined as either a pair, (substructure_index, atom_index), or by a named centroid or dummy point. For example:

>>> guanidino ='[NHX3][CH0X3](=[NH2X3+])[NH2X3]')
>>> carboxylate ='[C][C]([O])[O]')
>>> searcher =
>>> sub1 = searcher.add_substructure(guanidino)
>>> sub2 = searcher.add_substructure(carboxylate)
>>> searcher.add_plane('PLANE1', (sub1, 1), (sub1, 2), (sub1, 3))
>>> searcher.add_plane('PLANE2', (sub2, 0), (sub2, 1), (sub2, 2))
>>> searcher.add_centroid('CENT1', (sub1, 2), (sub1, 3))
>>> searcher.add_centroid('CENT2', (sub2, 2), (sub2, 3))

It is possible to add geometric measurements to searches.

The geometric measurements are added to the using the functions:

>>> searcher.add_distance_measurement('DIST1', 'CENT1', 'CENT2')

Each of the measurement methods above (except for the constant value measurement) has a similarly named constraint method, so for example one can require that the angle between the planes defined above should be between 0 and 20 degrees:

>>> searcher.add_plane_angle_constraint('P1_P2', 'PLANE1', 'PLANE2', (0, 20))

Then we can find some hits in the CSD.

>>> hits =, max_hits_per_structure=1)

With these hits we can inspect the measurements and constraints:

>>> for h in hits:
...     print('DIST1: %.2f' % h.measurements['DIST1'])
...     print('P1_P2: %.2f' % h.constraints['P1_P2'])
DIST1: 6.47
P1_P2: 7.95

One can also determine how the constraints and measurements were defined:

>>> print('%s - %s' % hits[0].measurement_objects('DIST1'))
>>> print(hits[0].centroid_objects('CENT1'))
(Atom(N5), Atom(N6))

and similarly for constraints.

All these measurement functions require a name by which the measurement will be accessed. The geometric measurements require an appropriate set of point definitions as defined above for geometric objects. The atom property measurement allows aspects of a matched atom to be measured. These are:

  • AtomicNumber
  • VdwRadius
  • CovalentRadius
  • TotalCoordinationNumber

A constant value measurement allows constants to be part of a calculated measurement, The unary and binary transform measurements allow arithmetic to be performed on measured values. Unary operators are:

  • ABS
  • LOG
  • LOG10
  • EXP
  • COS
  • SIN
  • TAN
  • ACOS
  • ASIN
  • ATAN
  • SQRT
  • NEG

Binary operators are:

  • MIN
  • MAX
  • ADD
  • POW
  • RSIN
  • RCOS

Arguments to the arithmetic operators are the name of a measurement or the name of a constraint.

For each of the measurements above (except the constant value) there is a corresponding constraint which may be applied to the substructure matches.

Say, for example, that we were interested in understanding the intra-molecular geometry of an aromatic methoxy group. In particular how the preference of the methoxy group to lie in the plane of the aromatic ring affects the Ph-C-O angle.

Let us first create the substructure of interest using a The substructure can then be used to to set up a

>>> ar_methoxy_sub ='[CH3:1][O:2][c:3]1[cH:4]ccc[cH:5]1')
>>> ar_methoxy_search =
>>> ar_methoxy_sub_id = ar_methoxy_search.add_substructure(ar_methoxy_sub)

We can now add the measurements of interest using the indices of the atoms of interest in the SMARTS pattern.

Aromatic methoxy query.

Figure illustrating the aromatic methoxy query.

>>> ar_methoxy_search.add_angle_measurement('ANG1',
...     ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(2),
...     ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(3),
...     ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(4))
>>> ar_methoxy_search.add_angle_measurement('ANG2',
...     ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(2),
...     ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(3),
...     ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(5))
>>> ar_methoxy_search.add_torsion_angle_measurement('TOR1',
...     ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(1),
...     ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(2),
...     ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(3),
...     ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(4))


Here we are making use of the function to convert the reaction SMARTS labels into zero based substructure atom indices.

>>> ar_methoxy_sub.label_to_atom_index(1)

Now we are ready to carry out the search. Because this aromatic methoxy substructure is quite common in the CSD, we will limit our maximum number of hits to 200. Further, to avoid bias by picking multiple observations from the same structure we will limit the number of hits per structure to 1.

>>> ar_methoxy_hits =, max_hits_per_structure=1)
>>> len(ar_methoxy_hits)

To get the data out of the list of instances we can make use of list comprehension and Python’s built in zip functionality.

>>> measurements = [ (h.measurements['ANG1'],
...                   h.measurements['ANG2'],
...                   abs(h.measurements['TOR1']))
...                 for h in ar_methoxy_hits ]
>>> ang1, ang2, abstor1 = zip(*measurements)

See also

For more information on Python’s built in zip function please see the Python on-line documentation


None will be used as the value of a measurement value that is not well defined. For example a torsion measurement on colinear atoms is undefined and will be returned as None.

Finally, we can plot the data using matplotlib (a powerful third party python library for graphical visualisation).


More information on the matplotlib module can be found at

>>> import matplotlib
>>> matplotlib.use('Agg')
>>> import matplotlib.pyplot as plt
>>> plt.scatter(ang1, ang2, c=abstor1, vmin=0, vmax=180)  
<matplotlib.collections.PathCollection object at ...>
>>> cbar = plt.colorbar() 
>>> cbar.set_label('TOR1') 
>>> plt.title('Aromatic methoxy geometry')  
<matplotlib.text.Text object at ...>
>>> plt.xlabel('ANG1')  
<matplotlib.text.Text object at ...>
>>> plt.ylabel('ANG2')  
<matplotlib.text.Text object at ...>
Aromatic methoxy geometry.

Figure illustrating the effect of steric crowding induced by the methyl group lying in the plane of the aromatic ring for a methoxy group attached to an aromatic ring. Note the deviation from the “ideal” of 120°.

Substructure searching with geometric constraints

It is also possible to add geometric constraints to substructure searches. These can be added using:

Suppose, for example, that we wanted to understand the interaction geometry of an aromatic iodine and the nitrogen atom of a pyridine ring. Specifically, does the C-I···N angle tend towards zero as the I···N distance becomes shorter?

First of all let us define the substructures to search for.

>>> ar_I_sub ='Ic1ccccc1')  # I: index 0
>>> pyridine_sub ='n1ccccc1')  # n: index 0

Using these substructures we can set up the search with a distance constraint between the iodine and the pyridine nitrogen atoms.

Aromatic iodine pyridine query.

Figure illustrating the aromatic iodine ··· pyridine query.

>>> halogen_bond_search =
>>> ar_I_sub_id = halogen_bond_search.add_substructure(ar_I_sub)
>>> pyridine_sub_id = halogen_bond_search.add_substructure(pyridine_sub)

Note that at this point we have added two substructures to the search and we can add a distance constraint between them. This requires us to specify both the substructure and atom identifiers of interest. Incidentally this is why the function returns the substructure identifier.

>>> halogen_bond_search.add_distance_constraint('DIST1',
...     ar_I_sub_id, 0,
...     pyridine_sub_id, 0,
...     (0.0, 3.4),  # distance constraint range
...     'Intermolecular')


We could have specified the distance with respect to the van der Waals radii of the atoms using the by setting the vdw_corrected parameter of the function to True.

Rather than just measure the C-I···N angle we can add it as a angular constraint, ensuring that it be greater than 120° for a match.

>>> halogen_bond_search.add_angle_constraint('ANG1',
...     ar_I_sub_id, 1,  # the carbon that the I is attached to
...     ar_I_sub_id, 0,
...     pyridine_sub_id, 0,
...     (120.0, 180.0))  # the angle constraint range

We can now carry out the search.

>>> halogen_bond_hits =
>>> len(halogen_bond_hits)  

To get the data out of the list of instances we can make use of list comprehension and Python’s built in zip functionality.

>>> dist1_ang1 = [(h.constraints['DIST1'], h.constraints['ANG1'])
...               for h in halogen_bond_hits ]
>>> dist1, ang1 = zip(*dist1_ang1)

We can use matplotlib to check for any geometric preferences of the aromatic iodine ··· pyridine halogen bond.

>>> plt.clf()  # Clear the figure from the previous plot 
>>> plt.scatter(dist1, ang1)  
<matplotlib.collections.PathCollection object at ...>
>>> plt.title('Aromatic iodine - pyridine halogen bond geometry')  
<matplotlib.text.Text object at ...>
>>> plt.xlabel('DIST1')  
<matplotlib.text.Text object at ...>
>>> plt.ylabel('ANG1')  
<matplotlib.text.Text object at ...>
Halogen bond distance against angle.

Plotting the angle versus the distance reveals that there a weak negative correlation, i.e., as the contact distance becomes shorter the angle tends towards 180°.

If we would like to perform further geometrical calculations on the hits of a substructure search, we can get access to the matched substructures. These will be instances of ccdc.molecule.Molecule, one for each substructure of the query, containing the atoms matched by the query and the bonds between them. The structures will take account of any symmetry operators involved in the construction of the match.

>>> h = halogen_bond_hits[0]
>>> print(h.match_atoms()) 
[Atom(I1), Atom(C12), ... Atom(C5), Atom(C4)]
>>> print(', '.join("'%s'" % symmop for symmop in h.match_symmetry_operators())) 
'x,y,z', 'x,y,z', ... '1/2-x,-y,-1/2+z', '1/2-x,-y,-1/2+z'
>>> sub_matches = h.match_substructures()
>>> print(sub_matches[0].atoms) 
[Atom(I1), Atom(C12), ..., Atom(C8), Atom(C13)]
>>> print(sub_matches[1].atoms) 
[Atom(N1), Atom(C3), ..., Atom(C5), Atom(C4)]

Both and will return atoms in the order in which they were specified by the substructures added to the SubstructureSearch, in the former case as a single tuple of atoms, in the latter separated into molecules corresponding to the substructures.

The difference between and is that in the former case, the atoms may all be found in, and so will not have any symmetry operators applied to them. This is useful to be able to determine which atoms have participated in the match. In some case where a symmetry operator has had to be applied an atom can appear more than once in the returned tuple of atom. In the latter case separate molecules are returned, hence the atoms are not directly comparable. Symmetry operators are applied, so the atoms are in exactly the same positions as were found by the search. Hence, geometrical operations on the substructures are consonant with the measurements and constraints of the search.

Search filters

It is possible to constrain the hits of a substructure search by various criteria. This is achieved by setting constraints on the For example to restrict a search to provide only organic compounds, with no disorder and with a good R-factor,

>>> halogen_bond_search.settings.only_organic = True
>>> halogen_bond_search.settings.no_disorder = 'all'
>>> halogen_bond_search.settings.max_r_factor = 5.0

Then the search may be performed as before. There are also methods to ensure that particular elements are not present in a hit, or to ensure that particular elements must be present in a hit. For example to prohibit hits containing metalloid elements (which may be present in organic compounds):

>>> halogen_bond_search.settings.must_not_have_elements = ['B','Si','Ge','As','Sb','Te','At']

In additon to such substructure searches, these search filters can also be applied to, and searches.


To configure how disorder of structures mediates the search, may take any of three values:

  • None (or anything that evaluates to False) to indicate no filtering of any disordered structures,
  • ‘all’ to indicate filtering of structures with any disordered atoms, or
  • ‘Non-hydrogen’ (or any string apart from ‘all’) to indicate filtering of structures with heavy atom disorder.

The last option, ‘Non-hydrogen’, is compatible with the ConQuest ‘no disorder’ selector.

Interactive Hit Processing

There is a new mechanism which allows the hits of a substructure search to be processed as they are found by the underlying method instead of collecting all the hits, then processing them later. This may reduce the memory requirements of the search if the interesting aspects of the hit can be extracted without storing the whole hit. In other circumstances it may provide a mechanism for early termination of the search if hits with the right data may be determined before the search has processed the entire database.

Firstly let us import the right modules and set up a search for carboxylate-carboxylate contacts:

>>> import time
>>> from import SMARTSSubstructure, SubstructureSearch
>>> searcher = SubstructureSearch()
>>> searcher.add_substructure(SMARTSSubstructure('C(=O)OH'))
>>> searcher.add_substructure(SMARTSSubstructure('C(=O)OH'))
>>> searcher.add_distance_constraint('DIST1', 0, 1, 1, 3, (0, 2.0), 'Intermolecular')
>>> searcher.add_distance_constraint('DIST2', 0, 3, 1, 1, (0, 2.0), 'Intermolecular')
>>> searcher.add_angle_measurement('ANG1', 0, 1, 1, 3, 1, 2)
>>> searcher.add_angle_measurement('ANG2', 0, 2, 0, 3, 1, 1)

Normally we would call, wait until all hits had been found, then process them to extract the information we sought. Now we can process the hits as they are found, allowing much greater memory efficiency in cases where we do not need all of the hit data.

Let us set up a processor which will record the closest contacts found, by summing the distance constraints. The two methods we will need to override are __init__, to provide space to record data, and add_hit(), to process the hits as they are found. For this example, I will also terminate after a few seconds for brevity.

>>> class ClosestContacts(SubstructureSearch.HitProcessor):
...     '''Record the closest 5 contacts, and terminate after a short time.'''
...     def __init__(self, max_hits=5, timeout=20.):
...         self.max_hits = max_hits
...         self.hits = []
...         self.refcode = None
...         self.start_time = time.time()
...         self.timeout = timeout
...         self.count = 0
...     def add_hit(self, hit):
...         score = hit.constraints['DIST1'] + hit.constraints['DIST2']
...         if len(self.hits) == self.max_hits:
...             self.hits.sort()
...             if score < self.hits[-1][0]:
...                 self.hits.pop()
...                 self.hits.append((score, hit))
...         else:
...             self.hits.append((score, hit))
...         if time.time() - self.start_time > self.timeout:
...             self.cancel()
...         if hit.identifier != self.refcode:
...             self.refcode = hit.identifier
...             self.count += 1

Now we can use it to search:

>>> closest = ClosestContacts()

After 20 seconds the search will be terminated, and we can inspect the hits:

>>> hits = closest.hits
>>> hits.sort()
>>> print(' '.join('%.2f' % h[0] for h in hits)) # doctest: +SKIP
2.71 2.82 2.98 3.06 3.08
>>> print(' '.join(h[1].identifier for h in hits)) # doctest: +SKIP