Combined searches

Queries from different search types can be combined using the set-theoretic combinators ‘And’, ‘Or’ and ‘Not’. This is done with the class ccdc.search.CombinedSearch, whose implementation overloads the Python special methods __neg__, __and__ and __or__, so that queries are combined with the operators -, & and |. For more information, see the Python online documentation: docs.python.org/3/reference/datamodel.html#emulating-numeric-type.
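To make the operator overloading concrete, here is a minimal, self-contained sketch of how the three special methods can build a query expression. The classes Query, Term, And, Or and Not are hypothetical stand-ins invented for this illustration; they are not part of ccdc and say nothing about how CombinedSearch is implemented internally.

```python
# Toy illustration only: these classes are hypothetical and are not part of ccdc.
class Query:
    """Base class giving every query the &, | and unary - operators."""

    def __and__(self, other):
        return And(self, other)

    def __or__(self, other):
        return Or(self, other)

    def __neg__(self):
        return Not(self)


class Term(Query):
    """A leaf query that matches a fixed set of identifiers."""

    def __init__(self, matches):
        self.matches = set(matches)

    def evaluate(self, universe):
        return self.matches & universe


class And(Query):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, universe):
        return self.left.evaluate(universe) & self.right.evaluate(universe)


class Or(Query):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, universe):
        return self.left.evaluate(universe) | self.right.evaluate(universe)


class Not(Query):
    def __init__(self, operand):
        self.operand = operand

    def evaluate(self, universe):
        # Negation is taken relative to everything that could be matched.
        return universe - self.operand.evaluate(universe)


# Made-up identifiers standing in for database entries.
universe = {'A', 'B', 'C', 'D'}
drugs = Term({'A', 'B', 'C'})
motif = Term({'B', 'C', 'D'})
unwanted = Term({'C'})

# Unary minus binds more tightly than &, so this is (drugs & motif) & (-unwanted).
expression = drugs & motif & -unwanted
result = expression.evaluate(universe)
print(sorted(result))
```

Because unary minus has higher precedence than & and |, a negated query can be written inline without extra parentheses, which is the style used in the examples in this section.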

We will exemplify this by looking for structures of drug molecules containing various motifs. Let’s start by using ccdc.search.TextNumericSearch to find which CSD entries contain drug molecules. All approved drug molecules obtained from DrugBank are cross-referenced to CSD entries if an exact match was found.

>>> from ccdc import search
>>> tns = search.TextNumericSearch()
>>> tns.add_synonym('DRUGBANK')
>>> drug_hits = tns.search()
>>> print(len(drug_hits))
3664

Suppose we want to find all drug molecules containing the sulfonamide functional group, but only those where it is involved in forming a hydrogen bond between the sulfoxide and the primary amine groups.

../_images/SULAMD04.png

We can combine the text numeric search with a substructure search for sulfoxide-amine hydrogen bonds. Let’s define the latter query:

>>> sulfoxide_amine_search = search.SubstructureSearch()
>>> sulfoxide = sulfoxide_amine_search.add_substructure(search.SMARTSSubstructure('S=O'))
>>> amine = sulfoxide_amine_search.add_substructure(search.SMARTSSubstructure('N(H)(H)'))
>>> sulfoxide_amine_search.add_distance_constraint('DIST1', (sulfoxide, 1), (amine, 0), (-5, 0), vdw_corrected=True, type='any')
>>> sulfoxide_amine_search.settings.max_hits_per_structure = 1

Now, we can combine the two queries:

>>> combined = search.CombinedSearch(tns & sulfoxide_amine_search)
>>> combined_hits = combined.search()
>>> print(len(combined_hits))
208

We can refine this search further to exclude structures in which other hydrogen bonds occur:

>>> O_N_hbonds = search.SubstructureSearch()
>>> sub1 = O_N_hbonds.add_substructure(search.SMARTSSubstructure('C~O'))
>>> sub2 = O_N_hbonds.add_substructure(search.SMARTSSubstructure('NH'))
>>> O_N_hbonds.add_distance_constraint('DIST2', (0, 1), (1, 0), (-5, 0.0), vdw_corrected=True, type='any')
>>> N_N_hbonds = search.SubstructureSearch()
>>> sub3 = N_N_hbonds.add_substructure(search.SMARTSSubstructure('NH'))
>>> sub4 = N_N_hbonds.add_substructure(search.SMARTSSubstructure('N'))
>>> N_N_hbonds.add_distance_constraint('DIST3', (0, 1), (1, 0), (-5, 0.5), vdw_corrected=True, type='any')
>>> O_O_hbonds = search.SubstructureSearch()
>>> sub5 = O_O_hbonds.add_substructure(search.SMARTSSubstructure('OH'))
>>> sub6 = O_O_hbonds.add_substructure(search.SMARTSSubstructure('O'))
>>> O_O_hbonds.add_distance_constraint('DIST4', (0, 1), (1, 0), (-5, 0.0), vdw_corrected=True, type='any')
>>> combined = search.CombinedSearch(tns & sulfoxide_amine_search & -O_N_hbonds & -N_N_hbonds & -O_O_hbonds)
>>> hits = combined.search()
>>> print(len(hits))
15

This has found all drug molecules in the CSD with a sulfonamide and the specific interaction motif. Let’s retrieve more information about them:

>>> for h in hits:
...     print(', '.join(str(s) for s in h.entry.synonyms))
4,5-Dichlorophenamide, Diclofenamide, Daranide, Oratrol, DrugBank: DB01144, PDB Chemical Component code: I7A
4,4'-Sulfonyldianiline, Dapsone, Aczone, DrugBank: DB00250
Valdecoxib, Bextra, PDB Chemical Component code: COX, DrugBank: DB00580
Homotaurine, DrugBank: DB06527
DrugBank: DB06527
DrugBank: DB06821
Sulfisoxazole, Gantrisin, Neoxazol, DrugBank: DB00263
Sulfanilamide, sulphanilamide, DrugBank: DB00259
Sulfanilamide, sulphanilamide, DrugBank: DB00259
Sulfanilamide, sulphanilamide, DrugBank: DB00259
Sulfanilamide, sulphanilamide, DrugBank: DB00259
Sulfadiazine, adiazine, DrugBank: DB00359
Sulthiame, DrugBank: DB08329
Sulthiame, DrugBank: DB08329
5-Amino-2-naphthalenesulfonic acid, 1,6-Cleve's acid, DrugBank: DB08238

If you were looking for possible analogues of these structures, you could omit the TextNumericSearch so that the search is no longer restricted to drug molecules:

>>> analogue_search = search.CombinedSearch(sulfoxide_amine_search & -O_N_hbonds & -N_N_hbonds & -O_O_hbonds)
>>> analogue_hits = analogue_search.search()
>>> print(len(analogue_hits))
1384

Note that the alternative formulation of this query:

>>> analogue_search = search.CombinedSearch(sulfoxide_amine_search & -(O_N_hbonds | N_N_hbonds | O_O_hbonds))

runs much more slowly, as it finds many hits containing one of the hydrogen bond types we wish to exclude.
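The two formulations are logically equivalent by De Morgan's law, so they select the same structures; only the amount of work differs. The sketch below checks the equivalence on plain Python sets. The identifiers and set contents are made up for illustration and are not CSD search results, and the complement helper is a hypothetical stand-in for negation.

```python
# Stand-in data: plain sets of made-up identifiers, not CSD search results.
all_entries = {'E1', 'E2', 'E3', 'E4', 'E5', 'E6'}
motif = {'E1', 'E2', 'E3', 'E4'}   # entries containing the wanted motif
o_n = {'E2', 'E5'}                 # entries with the first unwanted contact
n_n = {'E3'}                       # entries with the second unwanted contact
o_o = {'E3', 'E6'}                 # entries with the third unwanted contact


def complement(entries):
    """Hypothetical helper: everything in all_entries that is not in entries."""
    return all_entries - entries


# A & -B & -C & -D
chained = motif & complement(o_n) & complement(n_n) & complement(o_o)

# A & -(B | C | D)
grouped = motif & complement(o_n | n_n | o_o)

print(sorted(chained))
print(sorted(grouped))
```

Both expressions keep only E1 and E4 here. The grouped form first has to collect every entry matching any of the unwanted contacts before negating the union, which mirrors the reason given above for the slower run time.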

The hits returned from a ccdc.search.CombinedSearch search are instances of ccdc.search.CombinedSearch.CombinedHit. If a substructure search is part of the combined search, the hits have all the attributes of a ccdc.search.SubstructureSearch.SubstructureHit; if two or more substructure searches are conjoined, the attributes of all the conjoined substructure search hits are present in the dictionaries and methods of the hit. If similarity searches are present, the similarity values are available in the hit’s similarities dictionary, keyed by the identifier of the molecule against which similarity is measured. TextNumericSearch hits carry no information apart from the identifier of the hit structure, and hits resulting from a negation likewise provide no additional information.

>>> hit = analogue_hits[0]
>>> print(hit.identifier)
ABACOX
>>> print('%.2f' % hit.constraints['DIST1'])
2.77
>>> print(hit.match_atoms())
[Atom(S1), Atom(O2), Atom(N1), Atom(H8), Atom(H9)]

The other forms of search, SimilaritySearch and ReducedCellSearch, may be combined in the same way. For example, we can search for structures similar to two of the structures found in the previous search:

>>> from ccdc import io
>>> csd = io.EntryReader('csd')
>>> suldaz = csd.molecule('SULDAZ')
>>> suldaz_sim = search.SimilaritySearch(suldaz, 0.7)
>>> wenrox = csd.molecule('WENROX')
>>> wenrox_sim = search.SimilaritySearch(wenrox, 0.7)
>>> combined = search.CombinedSearch(suldaz_sim & wenrox_sim)
>>> hits = combined.search()
>>> print(len(hits))
161
>>> hit = hits[0]
>>> print('\n'.join('%8s: %8s: %.2f, %8s: %.2f' % (hit.identifier, 'SULDAZ', hit.similarities['SULDAZ'], 'WENROX', hit.similarities['WENROX']) for hit in hits)) 
  AKOBUZ:   SULDAZ: 0.89,   WENROX: 0.79
  ANIKEQ:   SULDAZ: 1.00,   WENROX: 0.70
  ANILER:   SULDAZ: 1.00,   WENROX: 0.70
ANILER01:   SULDAZ: 1.00,   WENROX: 0.70
  ANILIV:   SULDAZ: 1.00,   WENROX: 0.70
  ...
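The one-line report above can be unpacked into a small helper that is easier to reuse when more than two reference molecules are involved. This is a sketch only: format_similarity_row is a hypothetical name, and the sample values are copied from the first row of the output above rather than produced by a live search.

```python
def format_similarity_row(identifier, similarities, references):
    """Return one report line: the hit identifier, then 'REF: value' per reference.

    similarities is a mapping from reference identifier to similarity value,
    shaped like the similarities dictionary described above.
    """
    columns = ', '.join(
        '%8s: %.2f' % (ref, similarities[ref]) for ref in references
    )
    return '%8s: %s' % (identifier, columns)


# Sample values taken from the first row of the output shown above.
sample = {'SULDAZ': 0.89, 'WENROX': 0.79}
row = format_similarity_row('AKOBUZ', sample, ['SULDAZ', 'WENROX'])
print(row)
```

With real results, the same helper could be called as format_similarity_row(hit.identifier, hit.similarities, ['SULDAZ', 'WENROX']) for each hit in hits.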