Text-numeric searching

Introduction

The CSD supports searching of the textual and numerical data associated with individual entries.

Before beginning, we must import the necessary module.

>>> from ccdc.search import TextNumericSearch

Warning

This class may only be used to search the CSD.

Searching the CSD using text numeric searches

Let us start off by creating an empty query.

>>> query = TextNumericSearch()

Suppose that we want to search a particular journal. The first step is to check that the string used to specify the journal is a name recognised by the CSD.

>>> query.is_journal_valid('J.Med.Chem.')
True
>>> query.is_journal_valid('Journal of Medicinal Chemistry')
False

If you know only part of a journal's title, you can search the known journal names programmatically:

>>> print('\n'.join(sorted(k for k in query.journals if 'med.chem' in k.lower())))
ACS.Med.Chem.Lett.
Ann.Med.Chem.Res.
Bioisosteres in Med.Chem.
Bioorg.Med.Chem.
Bioorg.Med.Chem.Lett.
Comprehen.Med.Chem. II
Curr.Top.Med.Chem.
Eur.J.Med.Chem.
Fut.Med.Chem.
Indian J.Chem.,Sect.B:Org.Chem.Incl.Med.Chem.
J.Enzyme Inhib.Med.Chem.
J.Med.Chem.
Med.Chem.
Med.Chem.Res.
Org.Med.Chem.Lett.
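
The same lookup can be wrapped in a small function for reuse. This is only a sketch: find_journals is a hypothetical helper name, not part of the CSD Python API, and the sample list stands in for query.journals so that the snippet runs without a CSD installation.

```python
def find_journals(journal_names, fragment):
    """Return the sorted journal names that contain `fragment`, ignoring case."""
    needle = fragment.lower()
    return sorted(name for name in journal_names if needle in name.lower())


# Stand-in names for demonstration; in practice pass `query.journals`.
sample = ['J.Med.Chem.', 'Organometallics', 'Eur.J.Med.Chem.']
print('\n'.join(find_journals(sample, 'med.chem')))
# Eur.J.Med.Chem.
# J.Med.Chem.
```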

If we have exact details of a citation we can find the structures published in the paper.

>>> query = TextNumericSearch()
>>> query.add_citation(journal='Organometallics', year=2000, volume=19, first_page=3354)
>>> print('\n'.join(h.identifier for h in query.search()))
XAGMUN
XAGNAU
XAGNEY
XAGNIC
XAGNOI
XAGNUO
XAGPAW
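
One way to guard against a misspelt journal name is to combine the validity check and the citation in a single step. The sketch below assumes only the is_journal_valid and add_citation methods shown above; add_checked_citation is a hypothetical helper, not part of the API.

```python
def add_checked_citation(query, journal, **citation_fields):
    """Add a citation to `query` only if `journal` is a recognised journal name.

    Hypothetical helper: it relies solely on `is_journal_valid` and
    `add_citation`, as used in this section.
    """
    if not query.is_journal_valid(journal):
        raise ValueError('Unrecognised journal name: %r' % journal)
    # Any remaining fields (year, volume, first_page, ...) pass through unchanged.
    query.add_citation(journal=journal, **citation_fields)
    return query
```

With a real query this might be called as add_checked_citation(TextNumericSearch(), 'Organometallics', year=2000), so that a mistyped name fails loudly before the search is run.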

Not all the fields of a citation are required. One can, for example, find out how many structures were published in J.Med.Chem. in 2008:

>>> query = TextNumericSearch()
>>> query.add_citation(journal='J.Med.Chem.', year=2008)
>>> print(len(query.search()))
833

We can even tabulate the cumulative growth of the CSD over the years.

>>> nhits = []
>>> tot = 0
>>> for i in range(1921, 2015):
...     query = TextNumericSearch()
...     query.add_citation(year=i)
...     tot += len(query.search())
...     nhits.append((i, tot))
>>> print(nhits)
[(1921, 0), (1922, 0), (1923, 2), (1924, 3), (1925, 8), ...
>>> print(tot)
769192
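
The running total can also be separated from the searching itself. The following is a minimal sketch: cumulative_counts and its count_for_year argument are hypothetical names, and the stand-in counts are simply the year-on-year differences of the first totals printed above.

```python
from itertools import accumulate


def cumulative_counts(years, count_for_year):
    """Pair each year with the running total of entries up to and including it."""
    years = list(years)
    per_year = [count_for_year(year) for year in years]
    return list(zip(years, accumulate(per_year)))


# Stand-in per-year counts; a real callable would run one search per year.
fake_counts = {1921: 0, 1922: 0, 1923: 2, 1924: 1, 1925: 5}
print(cumulative_counts(range(1921, 1926), fake_counts.get))
# [(1921, 0), (1922, 0), (1923, 2), (1924, 3), (1925, 8)]
```

In practice count_for_year would create a fresh TextNumericSearch, call add_citation(year=year) and return len(query.search()), exactly as in the loop above.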

Other text-numeric searches are available. The code snippet below illustrates a search by chemical synonym.

>>> query = TextNumericSearch()
>>> query.add_synonym('aspirin')
>>> print('\n'.join(h.identifier for h in query.search()))
ACMEBZ
ACSALA
ACSALA01
ACSALA02
ACSALA03
ACSALA04
ACSALA05
ACSALA06
ACSALA07
ACSALA08
ACSALA09
ACSALA10
ACSALA11
ACSALA12
ACSALA13
ACSALA14
ACSALA15
ACSALA16
ACSALA17
ACSALA18
ACSALA19
ACSALA20
ACSALA21
ACSALA22
ACSALA23
ACSALA24
ACSALA25
ACSALA26
ACSALA28
ARIFOX
BEHWOA
DIFHOP
DIFQAK
DIPJAQ
DISXOU
EYOMEL
EYOMIP
EYOMOV
EYOMUB
EYONAI
FOJYOV
FOJZUC
HUNJEH
HUPPOX
HUPPOX01
IBOBUY
IBOBUY01
IBOCEJ
IBOCEJ01
IBOCOT
IBOCOT01
JIRNEE
KEWNOQ
KEWNOQ01
LAJVUO01
NINFUN
NUKXOH
NUWTIJ01
NUWTOP01
NUWTOP02
PIKYOA
PIKYUG
PIKZAN
SIBYUA
SIBYUA01
TAZRAO01
TAZRAO02
TORQUM02
UTUCIW
VUGMIT
XOJMOZ
YIRPEW
YOSMOI
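
A long list of hits such as this is easier to read when grouped by refcode family. The sketch below assumes that a CSD identifier consists of a six-letter family code optionally followed by two digits; group_by_family is a hypothetical helper, and the sample list is a handful of the identifiers printed above.

```python
from collections import defaultdict


def group_by_family(identifiers):
    """Group identifiers by their first six characters (the refcode family)."""
    families = defaultdict(list)
    for identifier in identifiers:
        families[identifier[:6]].append(identifier)
    return dict(families)


# A few identifiers from the output above; in practice use the search hits.
sample = ['ACMEBZ', 'ACSALA', 'ACSALA01', 'ACSALA02', 'ARIFOX']
for family, members in sorted(group_by_family(sample).items()):
    print(family, len(members))
# ACMEBZ 1
# ACSALA 3
# ARIFOX 1
```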

It is also possible to find out how many structures a specified author, e.g. Greg Shields, has published.

>>> query.clear()
>>> query.add_author('G.P.Shields')
>>> print(len(query.search()))
115

Text queries may take an optional mode, which will modify the search. Available modes are accessible from the query:

>>> print('\n'.join(sorted(k for k in query.modes)))
anywhere
exact
is_null
not_null
separate
start
start_of_word

‘Separate’ means a separate, space-delimited word within the field.

>>> query.clear()
>>> query.add_color('red', mode='exact')
>>> print('There are %d red compounds in the CSD' % len(query.search()))
There are 94224 red compounds in the CSD

Text-numeric searches allow all forms of search filter given by ccdc.search.Search.Settings. See Search filters for examples of use. To search, for example, for organic red compounds we would specify:

>>> query.settings.only_organic = True
>>> print('There are %d organic red compounds in the CSD' % len(query.search()))
There are 15515 organic red compounds in the CSD
>>> query.settings.only_organic = False
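
Resetting a filter by hand is easy to forget. One possible way to make the change temporary is a context manager, sketched below; temporary_setting is a hypothetical helper and assumes only that the filter is a plain attribute of query.settings, as only_organic is used above.

```python
from contextlib import contextmanager


@contextmanager
def temporary_setting(query, name, value):
    """Set `query.settings.<name>` to `value`, restoring the old value on exit."""
    previous = getattr(query.settings, name)
    setattr(query.settings, name, value)
    try:
        yield query
    finally:
        # Restore the original value even if the search raises an exception.
        setattr(query.settings, name, previous)
```

The organic-only count could then be obtained inside a block opened with temporary_setting(query, 'only_organic', True), after which the query behaves as before.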

How many entries do not have a DOI?

>>> query.clear()
>>> query.add_doi('', mode='is_null')
>>> print(len(query.search()))
104611

How many entries contain the word “dihydrofolate”?

>>> query.clear()
>>> query.add_all_text('dihydrofolate', mode='exact')
>>> print(len(query.search()))
36

There is also an optional argument that can be used to ignore non-alphanumeric parts of a hit. Let us illustrate this with a compound name search.

>>> query.clear()
>>> query.add_compound_name('azabicyclononane')
>>> hits = query.search()
>>> len(hits)
1
>>> print(hits[0].entry.chemical_name)
Dimethyl (6RS,8RS)-8-phenyl-9-oxa-1-azabicyclononane-5,5-dicarboxylate
>>> query.clear()
>>> query.add_compound_name('azabicyclononane', ignore_non_alpha_num=True)
>>> hits = query.search()
>>> len(hits)
661
>>> print(hits[-1].entry.chemical_name)
3-Azabicyclo(3.3.1)nonane hydrochloride

Several criteria added to the same query are joined with an implicit boolean ‘and’, so only entries matching all of them are returned.

>>> query.clear()
>>> query.add_author('F.H.Allen')
>>> query.add_author('J.Trotter')
>>> print(len(query.search()))
9
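
Because criteria within one query are combined with ‘and’, an ‘either author’ search needs separate queries whose results are then merged. The following is one way to do this, assuming only that each hit has an identifier attribute as used throughout this section; union_of_identifiers is a hypothetical helper.

```python
def union_of_identifiers(*hit_lists):
    """Return the sorted identifiers that occur in at least one list of hits."""
    identifiers = set()
    for hits in hit_lists:
        identifiers.update(hit.identifier for hit in hits)
    return sorted(identifiers)
```

Each argument would be the result of a separate search(), for example one query with add_author('F.H.Allen') and another with add_author('J.Trotter').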

A numeric query may take a pair of values, interpreted as an inclusive range. This can be used, for example, to find out whether there are any recent aspirin structures.

>>> query.clear()
>>> query.add_compound_name('aspirin')
>>> query.add_citation(year=[2011,2013])
>>> print('\n'.join(h.identifier for h in query.search()))
ACSALA19
ACSALA20
ACSALA21
ARIFOX
EYOMEL
EYOMIP
EYOMOV
EYOMUB
EYONAI
IBOBUY
IBOBUY01
IBOCEJ
IBOCEJ01
IBOCOT
IBOCOT01
KICVUP
NINFUN
NUWTIJ01
NUWTOP01
NUWTOP02
UTUCIW
YIRPEW

The difference between ‘exact’ and ‘anywhere’ is that an ‘exact’ query of “cat” would only match “cat”, but an ‘anywhere’ query would also match “catty”. Let us illustrate this with a search on compound name. The default behaviour is to use ‘anywhere’ mode.

>>> query.clear()
>>> query.add_compound_name('acetylcholine')
>>> print(len(query.search()))
20

When using the ‘exact’ mode we get two fewer hits.

>>> query.clear()
>>> query.add_compound_name('acetylcholine', mode='exact')
>>> print(len(query.search()))
18
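
To compare several modes side by side, the two searches above can be folded into a loop. This is a sketch under stated assumptions: count_by_mode is a hypothetical helper, and make_query is any callable that returns a fresh query offering add_compound_name and search (in practice the TextNumericSearch class itself).

```python
def count_by_mode(make_query, text, modes):
    """Return a dict mapping each mode to the number of compound-name hits."""
    counts = {}
    for mode in modes:
        query = make_query()  # a fresh query per mode, so criteria never accumulate
        query.add_compound_name(text, mode=mode)
        counts[mode] = len(query.search())
    return counts
```

Called as count_by_mode(TextNumericSearch, 'acetylcholine', ['anywhere', 'exact']) against the same CSD release, this would be expected to reproduce the two counts shown above.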

Suppose that we were interested in finding polymorphic ibuprofens. Below is a code snippet illustrating one way of performing such a search.

>>> query.clear()
>>> query.add_synonym('ibuprofen', 'exact')
>>> print(len(query.search()))
48
>>> query.add_polymorph('', 'not_null')
>>> print('\n'.join(h.identifier for h in query.search()))
IBPRAC
IBPRAC01
IBPRAC02
IBPRAC03
IBPRAC04
IBPRAC05
IBPRAC06
IBPRAC07
IBPRAC08
IBPRAC09
IBPRAC10
IBPRAC11
IBPRAC12
IBPRAC13
IBPRAC14
IBPRAC15
IBPRAC16
IBPRAC17
IBPRAC18
IBPRAC19
IBPRAC20

It is also possible to perform a search using the bioactivity field in the CSD.

>>> query.clear()
>>> query.add_bioactivity('antiinflammatory')
>>> print(len(query.search()))
493
>>> query.add_bioactivity('analgesic')
>>> print(len(query.search()))
135

Let us look for CSD entries that have the exact words “backbone” and “ligand” in their disorder description.

>>> query.clear()
>>> query.add_disorder('backbone', 'exact')
>>> query.add_disorder('ligand', 'exact')
>>> hits = query.search()
>>> print('\n'.join(h.identifier for h in hits))
AJEGIF
COSZEP
DASMOB
HINPAV
IHENUG
YIWRIF
YIWRIG

To show what this means let us print the disorder details of the entry AJEGIF. We can obtain the entry directly from the hit.

>>> ajegif = hits[0].entry
>>> print(ajegif.disorder_details)
The ligand backbone exhibits a racemic twinning disorder in which
the molecule is disordered over two sites in a 3:1 ratio. One isopropyl
C atom is disordered over two sites with occupancies 0.51:0.49.
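
The same details can be collected for every hit rather than just the first. The sketch below uses only the identifier, entry and disorder_details attributes shown above; disorder_report is a hypothetical helper, and the fallback text for an entry without disorder details is an assumption about how a missing value might be presented.

```python
def disorder_report(hits):
    """Return one 'IDENTIFIER: details' line per hit."""
    lines = []
    for hit in hits:
        # Assumption: a missing description is falsy (None or an empty string).
        details = hit.entry.disorder_details or '(no disorder details)'
        # Collapse the line breaks inside the description onto a single line.
        lines.append('%s: %s' % (hit.identifier, ' '.join(details.split())))
    return lines
```

Printing '\n'.join(disorder_report(hits)) would then give one line per entry for the seven hits found above.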

For convenience, each hit also has properties for its crystal and its molecule.

A ccdc.search.TextNumericSearch will display its component queries in a human-readable form:

>>> query.clear()
>>> query.add_compound_name('aspirin')
>>> query.add_citation(year=[2011,2013])
>>> print('\n'.join(q for q in query.queries))
Compound name aspirin anywhere
Journal year in range 2011-2013