Text-numeric searching

Introduction

The CSD supports searching of the textual and numerical data associated with individual entries.

Before beginning, we must import the necessary module.

>>> from ccdc.search import TextNumericSearch

Warning

This class may only be used to search the CSD or other crystal structure database files.

Searching the CSD using text-numeric searches

Let us start off by creating an empty query.

>>> query = TextNumericSearch()

Suppose that we want to search a particular journal. The first step is to check that the string used to specify the journal is a valid journal name in the CSD.

>>> query.is_journal_valid('J.Med.Chem.')
True
>>> query.is_journal_valid('Journal of Medicinal Chemistry')
False

It is possible to search the known journal names programmatically if you know only part of the title:

>>> print('\n'.join(sorted(k for k in query.journals if 'med.chem' in k.lower())))
ACS.Med.Chem.Lett.
Ann.Med.Chem.Res.
Bioisosteres in Med.Chem.
Bioorg.Med.Chem.
Bioorg.Med.Chem.Lett.
Comprehen.Med.Chem. II
Curr.Top.Med.Chem.
Eur.J.Med.Chem.
Fut.Med.Chem.
Indian J.Chem.,Sect.B:Org.Chem.Incl.Med.Chem.
J.Enzyme Inhib.Med.Chem.
J.Med.Chem.
Med.Chem.
Med.Chem.Res.
Org.Med.Chem.Lett.
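
The one-line filter above can be wrapped in a small helper. The sketch below is plain Python and runs without a CSD installation; find_journals is a hypothetical name, and the sample list is copied from journal names that appear on this page. With a live query, the iterable passed in would presumably be query.journals.

```python
def find_journals(journal_names, fragment):
    """Return the sorted names containing `fragment`, ignoring case."""
    needle = fragment.lower()
    return sorted(name for name in journal_names if needle in name.lower())


# Sample names copied from this page; with a CSD installation the
# iterable would be query.journals instead (an assumption of this sketch).
sample = ['J.Med.Chem.', 'Med.Chem.Res.', 'Organometallics', 'Eur.J.Med.Chem.']

matches = find_journals(sample, 'med.chem')
print('\n'.join(matches))
```

Any name returned this way can then be checked with is_journal_valid() before it is used in a citation search.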

If we have exact details of a citation we can find the structures published in the paper.

>>> query = TextNumericSearch()
>>> query.add_citation(journal='Organometallics', year=2000, volume=19, first_page=3354)
>>> print('\n'.join(h.identifier for h in query.search()))
XAGMUN
XAGNAU
XAGNEY
XAGNIC
XAGNOI
XAGNUO
XAGPAW

Not all the fields of a citation are required. One can, for example, find out how many structures were published in J.Med.Chem. in 2008:

>>> query = TextNumericSearch()
>>> query.add_citation(journal='J.Med.Chem.', year=2008)
>>> print(len(query.search()))
833

We can even chart the growth of the CSD by accumulating the number of hits for each publication year from 1921 to 2014.

>>> nhits = []
>>> tot = 0
>>> for i in range(1921, 2015):
...     query = TextNumericSearch()
...     query.add_citation(year=i)
...     tot += len(query.search())
...     nhits.append((i, tot))
>>> print(nhits) 
[(1921, 0), (1922, 0), (1923, 2), (1924, 3), (1925, 8), ...
>>> print(tot)
770387
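
The running total in the loop above can also be computed with itertools.accumulate once the per-year counts are known. The sketch below runs without the CSD: cumulative_counts is a hypothetical helper, and the per-year counts are the ones implied by the first five cumulative totals printed above.

```python
from itertools import accumulate


def cumulative_counts(years, per_year):
    """Pair each year with the running total of hits up to that year."""
    return list(zip(years, accumulate(per_year)))


# Per-year counts implied by the cumulative totals 0, 0, 2, 3, 8 shown
# above; with a CSD installation each count would be len(query.search())
# for a query restricted to that year.
years = range(1921, 1926)
per_year = [0, 0, 2, 1, 5]

running = cumulative_counts(years, per_year)
print(running)

# A crude text histogram of the running total, one '#' per structure.
for year, total in running:
    print(year, '#' * total)
```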

Other kinds of text-numeric search are available. The snippet below searches the synonym field for a common chemical name.

>>> query = TextNumericSearch()
>>> query.add_synonym('aspirin')
>>> print('\n'.join(h.identifier for h in query.search())) 
ACMEBZ
ACSALA
ACSALA01
ACSALA02...

It is also possible to find out how many structures a specified author, e.g. Greg Shields, published.

>>> query.clear()
>>> query.add_author('G.P.Shields')
>>> print(len(query.search()))
115

Text queries may take an optional mode, which will modify the search. Available modes are accessible from the query:

>>> print('\n'.join(sorted(k for k in query.modes)))
anywhere
exact
is_null
not_null
separate
start
start_of_word

‘Separate’ means that the text must occur as a separate, space-delimited word within the field.

>>> query.clear()
>>> query.add_color('red', mode='exact')
>>> print('There are %d red compounds in the CSD' % len(query.search()))
There are 107098 red compounds in the CSD

Text-numeric searches allow all forms of search filter given by ccdc.search.Search.Settings. See Search filters for examples of use. To search, for example, for organic red compounds we would specify:

>>> query.settings.only_organic = True
>>> print('There are %d organic red compounds in the CSD' % len(query.search()))
There are 18808 organic red compounds in the CSD
>>> query.settings.only_organic = False

How many entries do not have a DOI?

>>> query.clear()
>>> query.add_doi('', mode='is_null')
>>> print(len(query.search()))
121411

How many entries contain the word “dihydrofolate”?

>>> query.clear()
>>> query.add_all_text('dihydrofolate', mode='exact')
>>> print(len(query.search()))
37

There is also an optional argument, ignore_non_alpha_num, that ignores non-alphabetic characters (such as digits and punctuation) in the text being searched. With this option set to True, “butadiene” will match “buta-1,3-diene”. To illustrate this with a compound name search:

>>> query.clear()
>>> query.add_compound_name('azabicyclononane')
>>> hits = query.search()
>>> len(hits)
1
>>> print(hits[0].entry.chemical_name)
Dimethyl (6RS,8RS)-8-phenyl-9-oxa-1-azabicyclononane-5,5-dicarboxylate
>>> query.clear()
>>> query.add_compound_name('azabicyclononane', ignore_non_alpha_num=True)
>>> hits = query.search()
>>> len(hits)
733
>>> print(hits[-1].entry.chemical_name)
9-(benzylammonio)-1,5-dimethyl-3,7-diazabicyclo[3.3.1]nonane-3,7-di-ium trichloride benzene methanol solvate
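
To see why the second search finds so many more names, it can help to model the effect of the option in plain Python. The sketch below is an assumption about the behaviour, inferred from the examples above, and not the library's implementation; strip_non_alpha and name_matches are hypothetical helpers, and keeping whitespace while stripping is a choice made for this sketch.

```python
import re


def strip_non_alpha(text):
    """Drop every character that is neither a letter nor whitespace."""
    return re.sub(r'[^A-Za-z\s]', '', text)


def name_matches(fragment, name, ignore_non_alpha_num=False):
    """Case-insensitive substring test, optionally on the stripped name."""
    if ignore_non_alpha_num:
        name = strip_non_alpha(name)
    return fragment.lower() in name.lower()


name = '3,7-diazabicyclo[3.3.1]nonane'
print(strip_non_alpha(name))
print(name_matches('azabicyclononane', name))
print(name_matches('azabicyclononane', name, ignore_non_alpha_num=True))
```

Without the option the locants and brackets interrupt the fragment, so only names that contain it verbatim match.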

Adding several criteria to the same query joins them with an implicit boolean ‘and’, so a hit must satisfy all of them.

>>> query.clear()
>>> query.add_author('F.H.Allen')
>>> query.add_author('J.Trotter')
>>> print(len(query.search()))
9
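
Since criteria within one query are combined with ‘and’, one way to obtain an ‘or’ is to run separate searches and merge the results by identifier. The sketch below is an assumption about how one might do this, not a documented feature: hits_in_all and hits_in_any are hypothetical helpers, and the hit lists are stand-ins with made-up identifiers so that the example runs without the CSD. It relies only on the identifier attribute of a hit.

```python
from types import SimpleNamespace


def identifiers(hits):
    """Collect the identifiers of a sequence of hits into a set."""
    return {hit.identifier for hit in hits}


def hits_in_all(*hit_lists):
    """Identifiers present in every list: the 'and' of the searches."""
    sets = [identifiers(hits) for hits in hit_lists]
    return sorted(set.intersection(*sets)) if sets else []


def hits_in_any(*hit_lists):
    """Identifiers present in at least one list: the 'or' of the searches."""
    return sorted(set().union(*(identifiers(hits) for hits in hit_lists)))


# Stand-ins for the results of two separate query.search() calls;
# the identifiers are invented placeholders, not real CSD refcodes.
first = [SimpleNamespace(identifier=i) for i in ('AAAAAA', 'BBBBBB')]
second = [SimpleNamespace(identifier=i) for i in ('BBBBBB', 'CCCCCC')]

print(hits_in_all(first, second))
print(hits_in_any(first, second))
```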

A numeric query may take a pair of values, interpreted as an inclusive range. This can be used to, for example, find out whether there are any recent aspirin structures.

>>> query.clear()
>>> query.add_compound_name('aspirin')
>>> query.add_citation(year=[2011,2013])
>>> print('\n'.join(h.identifier for h in query.search()))
ACSALA19
ACSALA20
ACSALA21
ARIFOX
EYOMEL
EYOMIP
EYOMOV
EYOMUB
EYONAI
IBOBUY
IBOBUY01
IBOCEJ
IBOCEJ01
IBOCOT
IBOCOT01
KICVUP
NINFUN
NUWTIJ01
NUWTOP01
NUWTOP02
UTUCIW
YIRPEW
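
The inclusive-range convention for numeric values can be stated as a short predicate. This is a plain-Python illustration of the rule described above, under the assumption that a single value means equality; in_numeric_range is a hypothetical name, not part of the API.

```python
def in_numeric_range(value, spec):
    """True if value satisfies spec.

    spec is either a single number (equality) or a [low, high] pair,
    in which case both end points are included.
    """
    if isinstance(spec, (list, tuple)):
        low, high = spec
        return low <= value <= high
    return value == spec


# Both 2011 and 2013 are included, matching year=[2011,2013] above.
print([year for year in range(2009, 2016) if in_numeric_range(year, [2011, 2013])])
```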

The difference between ‘exact’ and ‘anywhere’ is that an ‘exact’ query of “cat” would only match “cat”, but an ‘anywhere’ query would also match “catty”. Let us illustrate this with a search on compound name. The default behaviour is to use ‘anywhere’ mode.

>>> query.clear()
>>> query.add_compound_name('acetylcholine')
>>> print(len(query.search()))
20

When using the ‘exact’ mode we get two fewer hits.

>>> query.clear()
>>> query.add_compound_name('acetylcholine', mode='exact')
>>> print(len(query.search()))
18
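
The distinction between the two modes can be mimicked in plain Python. The sketch below treats ‘anywhere’ as a case-insensitive substring test and ‘exact’ as a whole-word test. That reading is consistent with the examples on this page but remains an assumption; it is not the library's implementation, and text_matches is a hypothetical helper.

```python
import re


def text_matches(term, text, mode='anywhere'):
    """Test term against text in a simplified 'anywhere' or 'exact' mode."""
    if mode == 'anywhere':
        # Substring match: 'cat' matches 'catty'.
        return term.lower() in text.lower()
    if mode == 'exact':
        # Whole-word match: the term must not be flanked by letters or digits.
        pattern = r'(?<![A-Za-z0-9])' + re.escape(term) + r'(?![A-Za-z0-9])'
        return re.search(pattern, text, flags=re.IGNORECASE) is not None
    raise ValueError('unsupported mode: %r' % mode)


for text in ('cat', 'catty'):
    print(text, text_matches('cat', text, 'anywhere'), text_matches('cat', text, 'exact'))
```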

Suppose that we were interested in finding polymorphic ibuprofens. Below is a code snippet illustrating one way of performing such a search.

>>> query.clear()
>>> query.add_synonym('ibuprofen', 'exact')
>>> print(len(query.search()))
70
>>> query.add_polymorph('', 'not_null')
>>> print('\n'.join(h.identifier for h in query.search()))
IBPRAC
IBPRAC01
IBPRAC02
IBPRAC03
IBPRAC04
IBPRAC05
IBPRAC06
IBPRAC07
IBPRAC08
IBPRAC09
IBPRAC10
IBPRAC11
IBPRAC12
IBPRAC13
IBPRAC14
IBPRAC15
IBPRAC16
IBPRAC17
IBPRAC18
IBPRAC19
IBPRAC20
IBPRAC21
IBPRAC22

It is also possible to perform a search using the bioactivity field in the CSD.

>>> query.clear()
>>> query.add_bioactivity('antiinflammatory')
>>> print(len(query.search()))
515
>>> query.add_bioactivity('analgesic')
>>> print(len(query.search()))
139

Let us look for CSD entries that have the exact words “backbone” and “ligand” in their disorder description.

>>> query.clear()
>>> query.add_disorder('backbone', 'exact')
>>> query.add_disorder('ligand', 'exact')
>>> hits = query.search()
>>> print('\n'.join(h.identifier for h in hits))
AJEGIF
COSZEP
DASMOB
HINPAV
IHENUG
YIWRIF
YIWRIG

To show what this means let us print the disorder details of the entry AJEGIF. We can obtain the entry directly from the hit.

>>> ajegif = hits[0].entry
>>> print(ajegif.disorder_details) 
The ligand backbone exhibits a racemic twinning disorder in which
the molecule is disordered over two sites in a 3:1 ratio. One isopropyl
C atom is disordered over two sites with occupancies 0.51:0.49.

For convenience, each hit also has properties giving its crystal and its molecule.
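
The hit-to-entry access shown above extends naturally to a loop over all hits. The sketch below builds a one-line summary per hit from the identifier and entry.disorder_details attributes used on this page; disorder_summary is a hypothetical helper, and the hit is a stand-in built with SimpleNamespace from the AJEGIF text quoted above so that the example runs without the CSD.

```python
from types import SimpleNamespace


def disorder_summary(hits, width=60):
    """Return one 'IDENTIFIER: start of disorder text' line per hit."""
    lines = []
    for hit in hits:
        details = hit.entry.disorder_details or ''
        # Collapse line breaks and repeated spaces, then truncate.
        flattened = ' '.join(details.split())[:width]
        lines.append('%s: %s' % (hit.identifier, flattened))
    return lines


# Stand-in for a real search hit (an assumption of this sketch).
ajegif_hit = SimpleNamespace(
    identifier='AJEGIF',
    entry=SimpleNamespace(
        disorder_details='The ligand backbone exhibits a racemic twinning disorder in which\n'
                         'the molecule is disordered over two sites in a 3:1 ratio.'),
)

print('\n'.join(disorder_summary([ajegif_hit])))
```

With a CSD installation, the list returned by query.search() could be passed in place of the stand-in.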

A ccdc.search.TextNumericSearch will display its component queries in a human-readable form:

>>> query.clear()
>>> query.add_compound_name('aspirin')
>>> query.add_citation(year=[2011,2013])
>>> print('\n'.join(q for q in query.queries))
Compound name aspirin anywhere
Journal year in range 2011-2013

Peptide sequence searches

In the CSD a peptide sequence code is given for all entries with more than one alpha-amino carboxy skeleton in an independent residue. Residues having an alpha-amino acid complexed to a main group element or a transition metal atom are not given a peptide sequence code. These sequences can be searched by using the add_peptide_sequence() method.

The code consists of two parts: the first section reports the overall sequence, and the second reports the identity and order of individual amino acids in the peptide.

The first part has the format ‘A=n’ or ‘C=n’, where ‘A’ and ‘C’ denote acyclic and cyclic peptides respectively and ‘n’ is the number of amino acid residues present (e.g., ‘A=3’ refers to a tripeptide chain).

The second part gives the peptide sequence using the common three letter codes for amino acid residues (e.g., ‘GLY’ for glycine). A residue code with an additional * shows the residue is a modified version of the amino acid (e.g., Homoproline has the code ‘PRO*’). Any residue in a peptide that does not correspond to an alpha-amino acid will have the code ‘UND’.

A separator of ‘-’ between residues shows that the amino acids are linked via a typical -C(O)-NH- peptide linkage; any other linkage between residues will have a ‘,’ separator.

Any branched peptide is reported using an ‘!’ separator (e.g., a code ‘C=2 PHE!-PRO- A=2 ALA*-PHE!’ corresponds to a tripeptide where a modified alanine residue is branched from the phenylalanine residue of a proline-phenylalanine unit.)
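
The structure of a sequence code can be made concrete with a small parser. The sketch below is inferred only from the description above and is not part of the CSD Python API: parse_headers and parse_residues are hypothetical helpers, and the order of ‘*’ before ‘!’ when both follow a residue is an assumption.

```python
import re
from collections import namedtuple

Residue = namedtuple('Residue', 'code modified branched')

# 'A=n' or 'C=n': acyclic or cyclic, followed by the residue count.
HEADER = re.compile(r'\b([AC])=(\d+)')
# Three-letter residue code, optional '*' (modified), optional '!' (branch).
RESIDUE = re.compile(r'([A-Z]{3})(\*?)(!?)')


def parse_headers(sequence_code):
    """Return (kind, count) pairs such as ('acyclic', 3) for 'A=3'."""
    kinds = {'A': 'acyclic', 'C': 'cyclic'}
    return [(kinds[k], int(n)) for k, n in HEADER.findall(sequence_code)]


def parse_residues(sequence_code):
    """Return the residues in order, flagging modified and branched ones."""
    return [Residue(code, bool(star), bool(bang))
            for code, star, bang in RESIDUE.findall(sequence_code)]


code = 'C=2 PHE!-PRO- A=2 ALA*-PHE!'
print(parse_headers(code))
for residue in parse_residues(code):
    print(residue)
```

For the branched example above, the parser reports one cyclic and one acyclic part of two residues each, with both PHE residues flagged as branch points and ALA flagged as modified.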