Substructure searching¶
Introduction¶
In order to be able to set up a substructure search we will need to
import the ccdc.search
module. Let us also import the
ccdc.io
module to allow us to read in and write out molecules.
>>> import ccdc.search
>>> import ccdc.io
As a preamble let us set up a variable for a temporary directory.
>>> from ccdc.utilities import _test_output_dir
>>> tempdir = _test_output_dir()
Let us also get a testosterone molecule out of the CSD.
>>> entry_reader = ccdc.io.EntryReader('CSD')
>>> teston10 = entry_reader.molecule('TESTON10')
>>> testosterone = teston10.components[0]
Setting up a substructure¶
There are several ways to set up a
ccdc.search.QuerySubstructure
.
It can be created from a molecule.
>>> testosterone_substructure = ccdc.search.MoleculeSubstructure(testosterone)
It can also be created from a SMARTS string.
>>> hydroxyl_substructure = ccdc.search.SMARTSSubstructure("OH")
>>> ketone_substructure = ccdc.search.SMARTSSubstructure("[CD4][CD3](=[OD1])[CD4]")
There is a small extension to Daylight SMARTS to allow quadruple, delocalised and pi bonds to be represented, using the characters ‘_’, ‘”’ and ‘|’ respectively.
A substructure can also be read in from a ConQuest Connser file.
>>> filepath = 'monochloropyridine.con'
To achieve this we make use of the ccdc.search.ConnserSubstructure
class.
>>> connser_substructure = ccdc.search.ConnserSubstructure(filepath)
Setting up and running a substructure search¶
The substructure instances created earlier can be used to set up substructure searches. Substructure searches can be used to search many different objects. By default a substructure search will search the CSD.
>>> substructure_search = ccdc.search.SubstructureSearch()
>>> sub_id = substructure_search.add_substructure(connser_substructure)
>>> hits = substructure_search.search()
>>> print(len(hits))
942
Note that there may be multiple hits in one structure. The multiplicity
of hits can be controlled by optional arguments to the search method:
max_hit_structures
controlling how many structures are returned, and
max_hits_per_structure
. These control how many matches in a
structure should be returned. By default these parameters are set to
100000 and 5, respectively. A value of None
represents an unlimited
setting.
>>> testosterone_search = ccdc.search.SubstructureSearch()
>>> sub_id = testosterone_search.add_substructure(testosterone_substructure)
>>> hits = testosterone_search.search()
>>> len(hits)
13
>>> len(testosterone_search.search(max_hit_structures=4))
4
>>> len(testosterone_search.search(max_hit_structures=4, max_hits_per_structure=1))
4
>>> unique_hits = testosterone_search.search(max_hits_per_structure=1)
>>> len(unique_hits)
11
It is also possible to search the molecules in a (multi) molecule file.
>>> file_path = 'testosterone_hits.mol2'
This can be achieved using a ccdc.io.MoleculeReader
instance.
>>> hydroxyl_search = ccdc.search.SubstructureSearch()
>>> sub_id = hydroxyl_search.add_substructure(hydroxyl_substructure)
>>> hydroxyl_hits = hydroxyl_search.search(ccdc.io.MoleculeReader(file_path))
>>> len(hydroxyl_hits)
20
Or, for convenience, by supplying the file path directly to the
search
function.
>>> hydroxyl_hits = hydroxyl_search.search(file_path)
>>> len(hydroxyl_hits)
26
The difference between the numbers of hits found in the preceding two searches is because of the difference between searching an explicit ccdc.io.MoleculeReader
, which will not find hits containing suppressed atoms, and searching an implicitly defined ccdc.io.EntryReader
, which will.
It is also possible to search an individual molecule.
>>> print(len( hydroxyl_search.search(testosterone) ))
1
Or a list of identifiers from the CSD:
>>> print(len( hydroxyl_search.search(['ABEBUF', 'AABHTZ', 'WOSKOE']) ))
3
Substructure search hits¶
Let us print out the hit identifiers of the original 11 hits identified
by the testosterone_search
search on the CSD.
>>> for hit in hits:
... print(hit.identifier)
EFEJAD
EPITES
HOGRUT
ISTEST
ISTEST
PIZZAD
TESBRP
TESTHG
TESTOM
TESTOM01
TESTON10
TESTON10
WOSKOE
These can be superimposed using the matched atoms of the hits.
>>> mols = hits.superimpose()
>>> print(len(mols))
13
Warning
Notice that we can only superimpose the structures where all atoms have coordinates.
Let us write out the molecules from these hits as a multi-mol2 file.
>>> output_file = os.path.join(tempdir, 'testosterone_hits.mol2')
>>> with ccdc.io.MoleculeWriter(output_file) as writer:
... for m in mols:
... writer.write(m)
We can also write the result data into a ConQuest to Mercury interchange file, so that the results, with constraints and measurements may be analysed in the data analysis module of Mercury.
>>> hits.write_c2m_file(os.path.join(tempdir, 'testosterone_hits.c2m'))
We can find the matched atoms for each hit, where the hit atoms have coordinates:
>>> print(hits[1].match_atoms())
[Atom(C1), Atom(C2), Atom(C3), ...]
All forms of search hit support molecule, crystal and entry properties:
>>> entry = hits[1].entry
>>> print(entry.chemical_name == u'17\u03b1-Hydroxyandrost-4-en-3-one')
True
>>> crystal = hits[1].crystal
>>> print(crystal.spacegroup_symbol)
P212121
>>> molecule = hits[1].molecule
>>> print(len(molecule.atoms))
49
Note
The chemical name from the entry
ccdc.entry.Entry.chemical_name
is in Unicode format
and \u03b1 is the encoding for the ‘GREEK SMALL
LETTER ALPHA’
www.fileformat.info/info/unicode/char/3b1/index.htm.
For more information on how to work with Unicode in Python
please see docs.python.org/3.7/howto/unicode.html
Create a substructure from scratch¶
To create a substructure from scracth we will use a ccdc.search.QuerySubstructure
populated with
instances of ccdc.search.QueryAtom
and ccdc.search.QueryBond
.
Query Atoms¶
A ccdc.search.QueryAtom
may be made to represent a single element, any of a set of elements or a wild card matching any element:
>>> c1 = ccdc.search.QueryAtom('C')
>>> n_or_o = ccdc.search.QueryAtom(['N', 'O'])
>>> any_atom = ccdc.search.QueryAtom()
Atom constraints¶
Constraints may be put on ccdc.search.QueryAtom
to ensure that certain properties of a matched atom are fulfilled. The complete list of constraints are:
acceptor: that the atom be a hydrogen bond acceptor
donor: that the atom be a hydrogen bond donor
aromatic: that the atom be in an aromatic ring
cyclic: that the atom be in a ring
formal_charge: that the atom have a certain formal charge
formal_valency: that the atom have a certain formal valency
cyclic_bonds: that the atom have a certain number of cyclic bonds
smallest_ring: that the smallest ring in which the atom lies is of a certain size
num_bonds: that the atom have a certain number of neighbours
num_hydrogens: that the atom have a certain number of hydrogens bonded to it
nimplicit_hydrogens: that the atom have a certain number of implicit hydrogens bonded to it
has_3d_coordinates: that the atom have 3D coordinates
unfused_unbridged_ring: that the atom be or not be in an unfused/unbridged ring
Additionally there is a method ccdc.search.QueryAtom.add_connected_element_count()
which may be used to specify counts of neighbouring atoms of certain element types. For use when searching protein structures there is a method ccdc.search.QueryAtom.add_protein_atom_type_constraint()
. This takes any number of string arguments drawn from ‘AMINO_ACID’, ‘LIGAND’, ‘WATER’, ‘METAL’, ‘UNKNOWN’, and can be used to constrain an atom match to one of these classes. When searching non-protein structures all atoms will be of ‘UNKNOWN’ type.
Constraint conditions¶
Those constraints which have a boolean value may be assigned to with either True, False or None: if None any constraint of that type will be removed; if True the constraint will be matched only if the atom matches the constraint; if False the constraint will be matched only if the atom does not match the constraint.
Those constraints which take a numeric value may be assigned one of the following:
a single numeric value: the constraint will match only if the value of the constraint equals the value
a pair of numeric values: the constraint will match if the value of the constraint is in the inclusive range of the values
a pair of a string representation of an operator and a numeric value
(‘==’, value)
(‘>’, value)
(‘<’, value)
(‘<=’, value)
(‘>=’, value)
(‘!=’, value)
(‘in’, list_of_values)
So, for example, we can add constraints such as:
>>> a = ccdc.search.QueryAtom('N')
>>> a.formal_charge = ('in', [1, 2])
>>> a.num_bonds = ('>', 2)
>>> print(a)
QueryAtom(N)[charge: one of 1, 2, number of connected atoms: greater than 2]
Query Bonds¶
A ccdc.search.QueryBond
represents a bond which must be matched in a substructure search. It may represent one or more specific bond types, or any bond type. It may have constraints applied to it, too. The constraints applicable to a ccdc.search.QueryBond
are:
cyclic
bond_length
bond_polymeric
bond_smallest_ring
bond_unfused_unbridged_ring
The bond length constraint may take any of the forms given above; the rest are boolean valued.
Query Substructure¶
Instances of ccdc.search.QueryAtom
and ccdc.search.QueryBond
may be added to a ccdc.search.QuerySubstructure
to form the subject of a substructure search. The following shows a query which may be used to search for abnormally short C=O bond lengths:
>>> c = ccdc.search.QueryAtom('C')
>>> o = ccdc.search.QueryAtom('O')
>>> b = ccdc.search.QueryBond('Double')
>>> b.bond_length = ('<', 0.94)
>>> short_CO_query = ccdc.search.QuerySubstructure()
>>> _ = short_CO_query.add_atom(c)
>>> _ = short_CO_query.add_atom(o)
>>> _ = short_CO_query.add_bond(b, c, o)
And this can be used to search for carboxylic acids:
>>> cooh_substructure = ccdc.search.QuerySubstructure()
>>> c = cooh_substructure.add_atom('C')
>>> o1 = cooh_substructure.add_atom('O')
>>> o2 = cooh_substructure.add_atom('O')
>>> b1 = cooh_substructure.add_bond('Double', c, o1)
>>> b2 = cooh_substructure.add_bond('Single', c, o2)
>>> o1.num_hydrogens = 0
>>> o2.num_hydrogens = 1
With any of these substructures, a test may be made that a particular atom matches an atom of the
substructure, within the context of the whole substructure. This is equivalent to having a
constraint on a ccdc.search.QueryAtom
of the entire substructure. For example, using the
ketone_substructure defined above, and the molecule ABINUU,
>>> abinuu = entry_reader.molecule('ABINUU')
>>> print(ketone_substructure.match_atom(abinuu.atom('C8')))
True
>>> print(ketone_substructure.match_atom(abinuu.atom('C15')))
False
Note that by default this match will use the first atom of the substructure, which will be the faster case. If required an arbitrary atom of the query substructure may be used:
>>> print(ketone_substructure.match_atom(abinuu.atom('C15'), ketone_substructure.atoms[1]))
True
Substructure searching with geometric measurements¶
It is possible to define geometric objects which may be used as part of a geometric measurement, a constraint or another geometric object. These are defined by:
ccdc.search.SubstructureSearch.add_centroid()
: the centroid of a set of at least two pointsccdc.search.SubstructureSearch.add_dummy_point()
: a point projected along a vector between two pointsccdc.search.SubstructureSearch.add_group()
: a set of atomsccdc.search.SubstructureSearch.add_vector()
: a vector subtended by two pointsccdc.search.SubstructureSearch.add_plane()
: a plane defined by a set of at least three points
Each of these methods takes a name, by which the geometric object is known and a sequence of point definitions. The dummy point definition takes an additional argument, before the two point definitions defining the distance along the vector to place the point.
A point can be defined as either a pair, (substructure_index, atom_index), or by a named centroid or dummy point. For example:
>>> guanidino = ccdc.search.SMARTSSubstructure('[NHX3][CH0X3](=[NH2X3+])[NH2X3]')
>>> carboxylate = ccdc.search.SMARTSSubstructure('[C][C]([O])[O]')
>>> searcher = ccdc.search.SubstructureSearch()
>>> sub1 = searcher.add_substructure(guanidino)
>>> sub2 = searcher.add_substructure(carboxylate)
>>> searcher.add_plane('PLANE1', (sub1, 1), (sub1, 2), (sub1, 3))
>>> searcher.add_plane('PLANE2', (sub2, 0), (sub2, 1), (sub2, 2))
>>> searcher.add_centroid('CENT1', (sub1, 2), (sub1, 3))
>>> searcher.add_centroid('CENT2', (sub2, 2), (sub2, 3))
It is possible to add geometric measurements to
ccdc.search.SubstructureSearch
searches.
The geometric measurements are added to the
ccdc.search.SubstructureSearch
using the functions:
ccdc.search.SubstructureSearch.add_torsion_angle_measurement()
ccdc.search.SubstructureSearch.add_vector_angle_measurement()
ccdc.search.SubstructureSearch.add_plane_angle_measurement()
ccdc.search.SubstructureSearch.add_point_plane_distance_measurement()
ccdc.search.SubstructureSearch.add_vector_plane_angle_measurement()
ccdc.search.SubstructureSearch.add_atom_property_measurement()
ccdc.search.SubstructureSearch.add_constant_value_measurement()
ccdc.search.SubstructureSearch.add_unary_transform_measurement()
ccdc.search.SubstructureSearch.add_binary_transform_measurement()
>>> searcher.add_distance_measurement('DIST1', 'CENT1', 'CENT2')
Each of the measurement methods above (except for the constant value measurement) has a similarly named constraint method, so for example one can require that the angle between the planes defined above should be between 0 and 20 degrees:
>>> searcher.add_plane_angle_constraint('P1_P2', 'PLANE1', 'PLANE2', (0, 20))
Then we can find some hits in the CSD.
>>> hits = searcher.search(max_hit_structures=5, max_hits_per_structure=1)
With these hits we can inspect the measurements and constraints:
>>> for h in hits:
... print('DIST1: %.2f' % h.measurements['DIST1'])
... print('P1_P2: %.2f' % h.constraints['P1_P2'])
DIST1: 6.47
P1_P2: 7.95
One can also determine how the constraints and measurements were defined:
>>> print('%s - %s' % hits[0].measurement_objects('DIST1'))
CENT1 - CENT2
>>> print(hits[0].centroid_objects('CENT1'))
(Atom(N5), Atom(N6))
and similarly for constraints.
All these measurement functions require a name by which the measurement will be accessed. The geometric measurements require an appropriate set of point definitions as defined above for geometric objects. The atom property measurement allows aspects of a matched atom to be measured. These are:
AtomicNumber
VdwRadius
CovalentRadius
TotalCoordinationNumber
A constant value measurement allows constants to be part of a calculated measurement, The unary and binary transform measurements allow arithmetic to be performed on measured values. Unary operators are:
ABS
LOG
LOG10
EXP
COS
SIN
TAN
ACOS
ASIN
ATAN
FLOOR
ROUND
SQRT
NEG
Binary operators are:
MIN
MAX
ADD
SUBTRACT
MULTIPLY
DIVIDE
POW
RSIN
RCOS
Arguments to the arithmetic operators are the name of a measurement or the name of a constraint.
For each of the measurements above (except the constant value) there is a corresponding constraint which may be applied to the substructure matches.
Say, for example, that we were interested in understanding the intra-molecular geometry of an aromatic methoxy group. In particular how the preference of the methoxy group to lie in the plane of the aromatic ring affects the Ph-C-O angle.
Let us first create the substructure of interest using a
ccdc.search.SMARTSSubstructure
. The substructure can then be
used to to set up a ccdc.search.SubstructureSearch
.
>>> ar_methoxy_sub = ccdc.search.SMARTSSubstructure('[CH3:1][O:2][c:3]1[cH:4]ccc[cH:5]1')
>>> ar_methoxy_search = ccdc.search.SubstructureSearch()
>>> ar_methoxy_sub_id = ar_methoxy_search.add_substructure(ar_methoxy_sub)
We can now add the measurements of interest using the indices of the atoms of interest in the SMARTS pattern.
>>> ar_methoxy_search.add_angle_measurement('ANG1',
... ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(2),
... ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(3),
... ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(4))
>>> ar_methoxy_search.add_angle_measurement('ANG2',
... ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(2),
... ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(3),
... ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(5))
>>> ar_methoxy_search.add_torsion_angle_measurement('TOR1',
... ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(1),
... ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(2),
... ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(3),
... ar_methoxy_sub_id, ar_methoxy_sub.label_to_atom_index(4))
Note
Here we are making use of the
ccdc.search.SMARTSSubstructure.label_to_atom_index()
function to
convert the reaction SMARTS labels into zero based
substructure atom indices.
>>> ar_methoxy_sub.label_to_atom_index(1)
0
Now we are ready to carry out the search. Because this aromatic methoxy substructure is quite common in the CSD, we will limit our maximum number of hits to 200. Further, to avoid bias by picking multiple observations from the same structure we will limit the number of hits per structure to 1.
>>> ar_methoxy_hits = ar_methoxy_search.search(max_hit_structures=200, max_hits_per_structure=1)
>>> len(ar_methoxy_hits)
200
To get the data out of the list of ccdc.search.SubstructureHit
instances we can make use of list comprehension and Python’s built in zip
functionality.
>>> measurements = [ (h.measurements['ANG1'],
... h.measurements['ANG2'],
... abs(h.measurements['TOR1']))
... for h in ar_methoxy_hits ]
>>> ang1, ang2, abstor1 = zip(*measurements)
See also
For more information on Python’s built in zip
function please
see the Python on-line documentation
docs.python.org/3/library/functions.html#zip.
Note
None will be used as the value of a measurement value that is not well defined. For example a torsion measurement on colinear atoms is undefined and will be returned as None.
Finally, we can plot the data using matplotlib (a powerful third party python library for graphical visualisation).
Note
More information on the matplotlib module can be found at matplotlib.org.
>>> import matplotlib
>>> matplotlib.use('Agg')
>>> import matplotlib.pyplot as plt
>>> plt.scatter(ang1, ang2, c=abstor1, vmin=0, vmax=180)
<matplotlib.collections.PathCollection object at ...>
>>> cbar = plt.colorbar()
>>> cbar.set_label('TOR1')
>>> plt.title('Aromatic methoxy geometry')
<matplotlib.text.Text object at ...>
>>> plt.xlabel('ANG1')
<matplotlib.text.Text object at ...>
>>> plt.ylabel('ANG2')
<matplotlib.text.Text object at ...>
>>> plt.show()
Substructure searching with geometric constraints¶
It is also possible to add geometric constraints to substructure searches. These can be added using:
Suppose, for example, that we wanted to understand the interaction geometry of an aromatic iodine and the nitrogen atom of a pyridine ring. Specifically, does the C-I···N angle tend towards zero as the I···N distance becomes shorter?
First of all let us define the substructures to search for.
>>> ar_I_sub = ccdc.search.SMARTSSubstructure('Ic1ccccc1') # I: index 0
>>> pyridine_sub = ccdc.search.SMARTSSubstructure('n1ccccc1') # n: index 0
Using these substructures we can set up the search with a distance constraint between the iodine and the pyridine nitrogen atoms.
>>> halogen_bond_search = ccdc.search.SubstructureSearch()
>>> ar_I_sub_id = halogen_bond_search.add_substructure(ar_I_sub)
>>> pyridine_sub_id = halogen_bond_search.add_substructure(pyridine_sub)
Note that at this point we have added two substructures to the search
and we can add a distance constraint between them. This requires us to
specify both the substructure and atom identifiers of interest.
Incidentally this is why the
ccdc.search.SubstructureSearch.add_substructure()
function returns
the substructure identifier.
>>> halogen_bond_search.add_distance_constraint('DIST1',
... ar_I_sub_id, 0,
... pyridine_sub_id, 0,
... (0.0, 3.4), # distance constraint range
... 'Intermolecular')
Note
We could have specified the distance with respect to the van
der Waals radii of the atoms using the by setting the
vdw_corrected parameter of the
ccdc.search.SubstructureSearch.add_distance_constraint()
function to True.
Rather than just measure the C-I···N angle we can add it as a angular constraint, ensuring that it be greater than 120° for a match.
>>> halogen_bond_search.add_angle_constraint('ANG1',
... ar_I_sub_id, 1, # the carbon that the I is attached to
... ar_I_sub_id, 0,
... pyridine_sub_id, 0,
... (120.0, 180.0)) # the angle constraint range
We can now carry out the search.
>>> halogen_bond_hits = halogen_bond_search.search(max_hits_per_structure=1)
>>> len(halogen_bond_hits)
197
To get the data out of the list of ccdc.search.SubstructureHit
instances we can make use of list comprehension and Python’s built in zip
functionality.
>>> dist1_ang1 = [(h.constraints['DIST1'], h.constraints['ANG1'])
... for h in halogen_bond_hits ]
>>> dist1, ang1 = zip(*dist1_ang1)
We can use matplotlib to check for any geometric preferences of the aromatic iodine ··· pyridine halogen bond.
>>> plt.clf() # Clear the figure from the previous plot
>>> plt.scatter(dist1, ang1)
<matplotlib.collections.PathCollection object at ...>
>>> plt.title('Aromatic iodine - pyridine halogen bond geometry')
<matplotlib.text.Text object at ...>
>>> plt.xlabel('DIST1')
<matplotlib.text.Text object at ...>
>>> plt.ylabel('ANG1')
<matplotlib.text.Text object at ...>
>>> plt.show()
If we would like to perform further geometrical calculations on the hits of a substructure search, we can get access to the matched substructures. These will be instances of ccdc.molecule.Molecule
, one for each substructure of the query, containing the atoms matched by the query and the bonds between them. The structures will take account of any symmetry operators involved in the construction of the match.
>>> h = halogen_bond_hits[0]
>>> print(h.match_atoms())
[Atom(I1), Atom(C12), ... Atom(C5), Atom(C4)]
>>> print(', '.join("'%s'" % symmop for symmop in h.match_symmetry_operators()))
'x,y,z', 'x,y,z', ... '1/2-x,-y,-1/2+z', '1/2-x,-y,-1/2+z'
>>> sub_matches = h.match_substructures()
>>> print(sub_matches[0].atoms)
[Atom(I1), Atom(C12), ..., Atom(C8), Atom(C13)]
>>> print(sub_matches[1].atoms)
[Atom(N1), Atom(C3), ..., Atom(C5), Atom(C4)]
Both ccdc.search.SubstructureSearch.SearchHit.match_atoms()
and ccdc.search.SubstructureSearch.SearchHit.match_substructures()
will return atoms in the order in which they were specified by the substructures added to the SubstructureSearch, in the former case as a single tuple of atoms, in the latter separated into molecules corresponding to the substructures.
The difference between ccdc.search.SubstructureSearch.SearchHit.match_atoms()
and ccdc.search.SubstructureSearch.SearchHit.match_substructures()
is that in the former case, the atoms may all be found in ccdc.search.SubstructureSearch.SearchHit.molecule
, and so will not have any symmetry operators applied to them. This is useful to be able to determine which atoms have participated in the match. In some case where a symmetry operator has had to be applied an atom can appear more than once in the returned tuple of atom. In the latter case separate molecules are returned, hence the atoms are not directly comparable. Symmetry operators are applied, so the atoms are in exactly the same positions as were found by the search. Hence, geometrical operations on the substructures are consonant with the measurements and constraints of the search.
Search filters¶
It is possible to constrain the hits of a substructure search by various criteria. This is achieved by setting constraints on the ccdc.search.Search.Settings
. For example to restrict a search to provide only organic compounds, with no disorder and with a good R-factor,
>>> halogen_bond_search.settings.only_organic = True
>>> halogen_bond_search.settings.no_disorder = 'all'
>>> halogen_bond_search.settings.max_r_factor = 5.0
Then the search may be performed as before. There are also methods to ensure that particular elements are not present in a hit, or to ensure that particular elements must be present in a hit. For example to prohibit hits containing metalloid elements (which may be present in organic compounds):
>>> halogen_bond_search.settings.must_not_have_elements = ['B','Si','Ge','As','Sb','Te','At']
In additon to such substructure searches, these search filters can also be applied to
ccdc.search.TextNumericSearch
, ccdc.search.SimilaritySearch
and
ccdc.search.ReducedCellSearch
searches.
Disorder¶
To configure how disorder of structures mediates the search,
ccdc.search.Search.Settings.no_disorder
may take any of three values:
None (or anything that evaluates to False) to indicate no filtering of any disordered structures,
‘all’ to indicate filtering of structures with any disordered atoms, or
‘Non-hydrogen’ (or any string apart from ‘all’) to indicate filtering of structures with heavy atom disorder.
The last option, ‘Non-hydrogen’, is compatible with the ConQuest ‘no disorder’ selector.
Interactive Hit Processing¶
There is a new mechanism which allows the hits of a substructure search to be processed as they are found by the underlying ccdc.search.Search.search()
method instead of collecting all the hits, then processing them later. This may reduce the memory requirements of the search if the interesting aspects of the hit can be extracted without storing the whole hit. In other circumstances it may provide a mechanism for early termination of the search if hits with the right data may be determined before the search has processed the entire database.
Firstly let us import the right modules and set up a search for carboxylate-carboxylate contacts:
>>> import time
>>> from ccdc.search import SMARTSSubstructure, SubstructureSearch
>>> searcher = SubstructureSearch()
>>> searcher.add_substructure(SMARTSSubstructure('C(=O)OH'))
0
>>> searcher.add_substructure(SMARTSSubstructure('C(=O)OH'))
1
>>> searcher.add_distance_constraint('DIST1', 0, 1, 1, 3, (0, 2.0), 'Intermolecular')
>>> searcher.add_distance_constraint('DIST2', 0, 3, 1, 1, (0, 2.0), 'Intermolecular')
>>> searcher.add_angle_measurement('ANG1', 0, 1, 1, 3, 1, 2)
>>> searcher.add_angle_measurement('ANG2', 0, 2, 0, 3, 1, 1)
Normally we would call ccdc.search.Search.search()
, wait until all hits had been found, then process them to extract the information we sought. Now we can process the hits as they are found, allowing much greater memory efficiency in cases where we do not need all of the hit data.
Let us set up a processor which will record the closest contacts found, by summing the distance constraints. The two methods we will need to override are __init__, to provide space to record data, and add_hit(), to process the hits as they are found. For this example, I will also terminate after a few seconds for brevity.
>>> class ClosestContacts(SubstructureSearch.HitProcessor):
... '''Record the closest 5 contacts, and terminate after a short time.'''
... def __init__(self, max_hits=5, timeout=20.):
... self.max_hits = max_hits
... self.hits = []
... self.refcode = None
... self.start_time = time.time()
... self.timeout = timeout
... self.count = 0
... def add_hit(self, hit):
... score = hit.constraints['DIST1'] + hit.constraints['DIST2']
... if len(self.hits) == self.max_hits:
... self.hits.sort()
... if score < self.hits[-1][0]:
... self.hits.pop()
... self.hits.append((score, hit))
... else:
... self.hits.append((score, hit))
... if time.time() - self.start_time > self.timeout:
... self.cancel()
... if hit.identifier != self.refcode:
... self.refcode = hit.identifier
... self.count += 1
Now we can use it to search:
>>> closest = ClosestContacts()
>>> closest.search(searcher)
After 20 seconds the search will be terminated, and we can inspect the hits:
>>> hits = closest.hits
>>> hits.sort()
>>> print(' '.join('%.2f' % h[0] for h in hits))
2.71 2.82 2.98 3.06 3.08
>>> print(' '.join(h[1].identifier for h in hits))
ABUPES ABUNOA ADIPAC ABUPES ACSALA