Quick primer to using the CSD Python API¶
Introduction¶
This section is a quick primer for getting familiar with the CSD Python API. It is not, however, a complete reference to the CSD Python API. That can be found as the API documentation.
Once you have read through this primer and want some more information on how to make use of the CSD Python API it is recommended that you read through the descriptive documentation and look at the examples included in the cookbook documentation.
The CSD Python API is described in the following article. This offers an illustration of the use of the CSD Python API in the community with some example cases.
“What has scripting ever done for us? The CSD Python application programming interface (API)”Richard A. Sykes, Natalie T. Johnson, Christopher J. Kingsbury, Jürgen Harter, Andrew G. P. Maloney,Elna Pidcock, Isaac J. Sugden, Suzanna C. Ward, Ian J. Bruno, Stewart A. Adcock, Peter A. Wood,Patrick McCabe, Alexandru A. Moldovan, Francis Atkinson, Ilenia Giangreco and Jason C. ColeJournal of Applied Crystallography, in press.
Note
Example files referenced within this documentation can be found in the doc/example_files directory within the distributed archive.
Note
In order to use data from the CSD and any feature dependent on CSD-derived data the CSD itself must be installed.
Starting a Python interactive shell¶
This document will make use of the Python interactive shell:
bash-3.2$ python
Python 2.7.3 (default, Jun 21 2012, 13:00:25)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
See also
For more information on Python’s interactive mode please see the Python on-line documentation docs.python.org/3/tutorial/interpreter.html#interactive-mode.
Accessing database entries, crystals and molecules from the CSD¶
The CSD Python API makes a distinction between
database entries (ccdc.entry.Entry
), crystals
(ccdc.crystal.Crystal
) and molecules
(ccdc.molecule.Molecule
). To illustrate the difference let us
get access to the database entry, crystal and molecule of ABEBUF.
>>> from ccdc import io
>>> csd_reader = io.EntryReader('CSD')
>>> entry_abebuf = csd_reader.entry('ABEBUF')
>>> cryst_abebuf = csd_reader.crystal('ABEBUF')
>>> mol_abebuf = csd_reader.molecule('ABEBUF')
Molecules¶
A ccdc.molecule.Molecule
contains molecular properties such as
the molecular weight (ccdc.molecule.Molecule.molecular_weight
)
and descriptors such as whether or not the molecule is organic
(ccdc.molecule.Molecule.is_organic
).
>>> round(mol_abebuf.molecular_weight, 3) # only want 3 significant figures
317.341
>>> mol_abebuf.is_organic
True
See also
In the example above we make use of Python’s built in
round
function. For more information on the round
function please see the Python on-line documentation
docs.python.org/3/library/functions.html#round.
The molecule also contains a ccdc.molecule.Molecule.smiles
representation and a
ccdc.molecule.Molecule.to_string()
function for returning a Mol2 or SDF formatted string.
>>> print(mol_abebuf.smiles)
O=C1Nc2ccccc2C(=O)Nc2ccccc12.c1ccncc1
>>> print(mol_abebuf.to_string('mol2'))
@<TRIPOS>MOLECULE
ABEBUF
...
Note that from the SMILES string it is apparent that this molecule contains two disconnected components:
O=C1Nc2ccccc2C(=O)Nc2ccccc12
c1ccncc1
The individual components are molecules as well and can be accessed
through the ccdc.molecule.Molecule.components
list and the
individual molecule with the greatest molecular weight can be accessed
through the ccdc.molecule.Molecule.heaviest_component
attribute.
>>> mol_abebuf.components
[<ccdc.molecule.Molecule object at ...>,
<ccdc.molecule.Molecule object at ...>]
>>> [round(mol.molecular_weight, 3) for mol in mol_abebuf.components]
[238.241, 79.1]
>>> print(mol_abebuf.heaviest_component.smiles)
O=C1Nc2ccccc2C(=O)Nc2ccccc12
See also
The descriptive molecule documentation and the
ccdc.molecule.Molecule
API documentation.
Crystals¶
Crystals contain crystallographic descriptors such as Z’
(ccdc.crystal.Crystal.z_prime
) and the cell volume
(ccdc.crystal.Crystal.cell_volume
).
>>> cryst_abebuf.z_prime
1.0
>>> round(cryst_abebuf.cell_volume, 3) # only want 3 significant figures
3229.582
The molecule in the crystal structure can be accessed from the
ccdc.crystal.Crystal.molecule
attribute.
>>> cryst_abebuf.molecule
<ccdc.molecule.Molecule object at ...>
>>> print(cryst_abebuf.molecule.smiles)
O=C1Nc2ccccc2C(=O)Nc2ccccc12.c1ccncc1
See also
The descriptive crystal documentation and the
ccdc.crystal.Crystal
API documentation.
Entries¶
CSD entries contains additional information about a crystal structure
added during the editorial process. Examples include the chemical name
(ccdc.entry.Entry.chemical_name
) and the publication details
(ccdc.entry.Entry.publication
).
>>> print(entry_abebuf.chemical_name)
5H,11H-Dibenzo(b,f)(1,5)diazocine-6,12-dione pyridine clathrate
>>> print(entry_abebuf.publication)
Citation(authors='S.W.Gordon-Wylie, E.Teplin, J.C.Morris,
M.I.Trombley, S.M.McCarthy, W.M.Cleaver, G.R.Clark',
journal='Journal(Crystal Growth and Design)',
volume='4', year=2004, first_page='789',
doi='10.1021/cg049957u')
The crystal and molecule in an entry can be accessed from the
ccdc.entry.Entry.crystal
and
ccdc.entry.Entry.molecule
attributes respectively.
>>> entry_abebuf.crystal
<ccdc.crystal.Crystal object at ...>
>>> entry_abebuf.crystal.z_prime
1.0
>>> entry_abebuf.molecule
<ccdc.molecule.Molecule object at ...>
>>> print(entry_abebuf.molecule.smiles)
O=C1Nc2ccccc2C(=O)Nc2ccccc12.c1ccncc1
Reading and writing molecules¶
Suppose that we wanted to write out our molecule to a file named
abebuf.mol2
.
>>> file_name = 'abebuf.mol2'
The ccdc.io
module has three types of writer:
ccdc.io.EntryWriter
, ccdc.io.CrystalWriter
,
ccdc.io.MoleculeWriter
. These are all context managers that can
make use of the Python with
syntax. This syntax automatically ensures that the writer is
closed automatically when the with
block of code is exited. Let us illustrate this by
writing the molecule to the file.
See also
For more information on Python’s with
syntax please see the Python on-line
documentation docs.python.org/3/reference/compound_stmts.html#with.
>>> with io.MoleculeWriter(file_name) as mol_writer:
... mol_writer.write(mol_abebuf)
...
Note
In the example above the file format was determined by the
file name extension (‘mol2’). The file format can also be set
explicitly using the format
argument when instantiating a
writer.
Now that we have a molecule written to the file system we can illustrate how to read in a molecule from a file.
>>> mol_reader = io.MoleculeReader(file_name)
>>> mol = mol_reader[0] # get the first molecule in the file
>>> print(mol.identifier)
ABEBUF
>>> print(mol.smiles)
O=C1Nc2ccccc2C(=O)Nc2ccccc12.c1ccncc1
>>> len(mol.components)
2
Note
A ccdc.io.MoleculeReader
is an iterator that can
contain zero or more molecules. This is why we had to
explicitly get the first molecule from it.
Note also that the concept of a molecule file containing multiple molecules is distinct from the notion that a molecule can contain multiple distinct components. Hence the first molecule extracted from the molecule reader having two components.
Searching the CSD¶
There are several main ways of searching the CSD:
text numeric searching based on the data corresponding to entries (
ccdc.search.TextNumericSearch
);similarity searching for chemically-similar entries (
ccdc.search.SimilaritySearch
);substructure searching including searches with 3D geometrical constraints and measurements (
ccdc.search.SubstructureSearch
);reduced cell searching based on crystal cell parameters (
ccdc.search.ReducedCellSearch
).
Textual and numeric searching¶
Suppose that one was interested in the physical properties of aspirin and as such wanted to find publications containing aspirin crystal structures with recorded melting points.
This can be achieved using a ccdc.search.TextNumericSearch
.
>>> from ccdc.search import TextNumericSearch
>>> text_numeric_search = TextNumericSearch()
>>> text_numeric_search.add_compound_name('aspirin')
>>> identifiers = [h.identifier for h in text_numeric_search.search()]
Over the years the assignment of the common and chemical names have somewhat arbitrarily been assigned to the synonyms and compound name fields of the CSD. As such we also need to search the synonyms field for the search to be comprehensive.
>>> text_numeric_search.clear()
>>> text_numeric_search.add_synonym('aspirin')
>>> identifiers.extend([h.identifier for h in text_numeric_search.search()])
Finally, we loop over the identifiers. To ensure that we avoid duplicates we can make use of Python’s built in set functionality. For each identifier we get the relevant entry from the CSD and check if it has a melting point associated with it or not and if it has we write out the identifier, the DOI and the melting point information.
>>> for identifier in sorted(set(identifiers)):
... e = csd_reader.entry(identifier)
... if e.melting_point:
... print('%-8s https://doi.org/%-25s %s' % (e.identifier,
... e.publication.doi,
... e.melting_point))
...
ACMEBZ https://doi.org/10.1107/S0567740881006729 385K
ACSALA13 https://doi.org/10.1021/ja056455b 408.5 K
ACSALA28 https://doi.org/10.1039/D0CC03157G 415 K
ACSALA31 https://doi.org/None 409 K
ACSALA32 https://doi.org/None 409 K
ACSALA34 https://doi.org/None 409 K
ACSALA37 https://doi.org/10.1039/D3CE00611E 409 K
BEHWOA https://doi.org/10.1107/S0567740882005731 401K
CUASPR01 https://doi.org/10.1107/S1600536803026126 above 573 K
HUPPOX https://doi.org/10.1039/b208574g 91-96 deg.C
HUPPOX01 https://doi.org/None 91-96 deg.C
NUWTOP01 https://doi.org/10.1039/c2ce06313a 420 K
PIKYUG https://doi.org/10.1021/acs.cgd.8b00718 502 K
WIHROY https://doi.org/None 409 K
See also
In this example we make use of Python’s built-in set object. Please see the Python on-line documentation for more information docs.python.org/3/library/stdtypes.html.
See also
The descriptive text numeric search documentation and the
ccdc.search.TextNumericSearch
,
ccdc.search.TextNumericHit
API documentation.
Molecular similarity searching¶
Suppose that one wanted to find out what space groups were occupied by
CSD entries that contained compounds similar to the heaviest component
of ABEBUF
. This type of information may be of interest if one was
trying to solve the structure of a new compound using powder data.
Having knowledge of which space groups are likely to be occupied by a
compound can help the interpretation of the data.
This can be achieved using a ccdc.search.SimilaritySearch
.
>>> from ccdc.search import SimilaritySearch
>>> sim_search = SimilaritySearch(mol_abebuf.heaviest_component)
>>> hits = sim_search.search()
>>> len(hits)
41
>>> for h in hits:
... print('%8s %.3f %s' % (h.identifier, h.similarity, h.crystal.spacegroup_symbol))
...
ABEBIT 1.000 P-1
...
See also
In this example we make use of a for
loop. If this is
unfamiliar to you a basic introduction to the for
statement can be found in the Python on-line documentation
docs.python.org/3/tutorial/controlflow.html#for-statements.
See also
The descriptive similarity search documentation and the
ccdc.search.SimilaritySearch
,
ccdc.search.SimilaritySearch.SimilarityHit
API documentation.
Substructure searching with 3D measurements¶
Identifying identical molecules¶
Suppose that we wanted to find out if there were any other structures in
the CSD that contained 5H,11H-Dibenzo(b,f)(1,5)diazocine-6,12-dione
(the heaviest component of ABEBUF
).
This can be achieved by setting up a
ccdc.search.SubstructureSearch
with a
ccdc.search.MoleculeSubstructure
.
>>> from ccdc.search import SubstructureSearch, MoleculeSubstructure
>>> mol_substructure = MoleculeSubstructure(mol_abebuf.heaviest_component)
>>> substructure_search = SubstructureSearch()
>>> sub_id = substructure_search.add_substructure(mol_substructure)
>>> hits = substructure_search.search()
>>> len(hits)
8
>>> for h in hits:
... print(h.identifier)
...
ABEBIT
ABEBOZ
ABEBUF
ISOHAZ
ISOHAZ
ISOHED
ISOHIH
ISOHIH
Note that some CSD entries were hit more than once by this substructure.
If we only want to get one hit per structure we can make use of the
max_hits_per_structure
argument. The example below also illustrates
how one can get the indices of the matched atoms out of a substructure
hit.
>>> hits = substructure_search.search(max_hits_per_structure=1)
>>> len(hits)
6
>>> for h in hits:
... print('%s: (%s)' % (h.identifier, ', '.join('%d' % i for i in h.match_atoms(indices=True))))
ABEBIT: (26, 27, 0, ..., 23, 21, 11)
ABEBOZ: (0, 1, 2, ..., 22, 24, 26)
ABEBUF: (0, 1, 2, ..., 25, 26, 27)
ISOHAZ: (0, 1, 4, ..., 22, 24, 26)
ISOHED: (10, 11, 14, ..., 32, 34, 36)
ISOHIH: (10, 11, 14, ..., 32, 34, 36)
Torsional preferences of an acyclic fragment¶
Suppose that we were interested in finding out the torsional preferences of a 2-ethoxyethoxy fragment. Is it 180° like an alkyl chain or not?
This can be achieved by setting up a
ccdc.search.SubstructureSearch
, for example from a
ccdc.search.SMARTSSubstructure
, and adding a torsion
measurement to the search
ccdc.search.SubstructureSearch.add_torsion_measurement()
.
First we create a ccdc.search.SMARTSSubstructure
using a SMARTS pattern.
>>> from ccdc.search import SMARTSSubstructure
>>> ethoxyethoxy = SMARTSSubstructure('[OD2][CH2]!@[CH2][OD2]')
>>> for qatom in ethoxyethoxy.atoms:
... print('%d: %s' % (qatom.index, qatom))
...
0: QueryAtom(O)[atom aromaticity: equal to 0, count connected elements equal to 2 from [He...Og]]
1: QueryAtom(C)[atom aromaticity: equal to 0, hydrogen count, including deuterium: equal to 2]
2: QueryAtom(C)[atom aromaticity: equal to 0, hydrogen count, including deuterium: equal to 2]
3: QueryAtom(O)[atom aromaticity: equal to 0, count connected elements equal to 2 from [He...Og]]
Then we create a ccdc.search.SubstructureSearch
and add the
substructure to it.
>>> substructure_search = SubstructureSearch()
>>> sub_id = substructure_search.add_substructure(ethoxyethoxy)
We can then add a torsion measurement to the substructure using the
substructure identifier and the ccdc.search.QueryAtom
indices.
>>> substructure_search.add_torsion_angle_measurement('TOR1',
... sub_id, 0,
... sub_id, 1,
... sub_id, 2,
... sub_id, 3)
For the purposes of illustrating this search we will stop the search after the first five structures have been hit.
>>> hits = substructure_search.search(max_hit_structures=5)
>>> for h in hits:
... print('%8s %7.2f' % (h.identifier,
... h.measurements['TOR1']))
...
ABIGEY 176.23
...
Interaction geometry of a pair of functional groups¶
Suppose that we were interested in working out if there were any preferences in geometry for a charged pyridine fragment interacting with the carbonyl oxygen of an amide group.
This can be achieved by setting up a
ccdc.search.SubstructureSearch
, using for example a pair of
ccdc.search.SMARTSSubstructure
instances, and adding a distance
constraint between the atoms of interest
(ccdc.search.SubstructureSearch.add_distance_constraint()
).
Using the ccdc.search.SubstructureSearch.add_angle_measurement()
one could then set up measurements for the two angles of interest.
>>> charged_pyridine = SMARTSSubstructure('c1ccn(H)cc1') # indices N:3 H:4
>>> amide = SMARTSSubstructure('C[NH]C(=O)C') # indices C:2 O:3
>>> substructure_search = SubstructureSearch()
>>> sub1 = substructure_search.add_substructure(charged_pyridine)
>>> sub2 = substructure_search.add_substructure(amide)
Note that we have now added two distinct substructures to this particular substructure search. At this point we can set up the distance constraint between the atoms of interest in the two substructures as well as the angle measurements.
>>> substructure_search.add_distance_constraint('D1',
... sub1, 4,
... sub2, 3,
... (0, 2.5), # distance constraint range
... 'Intermolecular')
>>> substructure_search.add_angle_measurement('A1',
... sub1, 3,
... sub1, 4,
... sub2, 3)
>>> substructure_search.add_angle_measurement('A2',
... sub2, 2,
... sub2, 3,
... sub1, 4)
For the purposes of illustrating this search we will stop the search after the first five structures have been hit.
>>> hits = substructure_search.search(max_hit_structures=5)
>>> for h in hits:
... print('%s (D1 %.2f) (A1 %.2f) (A2 %.2f)' % (h.identifier, h.constraints['D1'], h.measurements['A1'], h.measurements['A2']))
...
AMEHUX (D1 2.12) (A1 130.21) (A2 124.72)
CUWWEX (D1 1.99) (A1 157.26) (A2 142.09)
DOKPAU (D1 1.93) (A1 133.90) (A2 138.39)
DOKPEY (D1 1.89) (A1 133.08) (A2 138.74)
FEZBUK (D1 2.16) (A1 168.21) (A2 122.84)
Conformational geometry analysis¶
The CSD Software provides functionality that enables rapid validation of the
complete geometry of a given query molecule. This can, for example, be
used to validate the binding modes of ligands from protein-ligand
structures. It can also provide valuable information in the assessment
of the relative stability of conformational polymorphs. In this example
we will compare two crystal structures of the HIV-protease inhibitor
Ritonavir, identifiers YIGPIO01
and YIGPIO02
.
To perform a molecular geometry analysis on a molecule one needs to create a geometry analysis engine.
>>> from ccdc import conformer
>>> engine = conformer.GeometryAnalyser()
We are only interested in analysing the torsion angles so we will turn off the analysis of bonds, valence angles and rings. In this example we will also turn off search generalisation.
>>> engine.settings.bond.analyse = False
>>> engine.settings.angle.analyse = False
>>> engine.settings.ring.analyse = False
>>> engine.settings.generalisation = False
To facilitate the analysis we will use a simple function that takes a list of identifiers and prints out the identifier and the number of unusual torsion angles in the structure.
>>> def conformation_analysis(identifiers):
... print('REFCODE, UnusualTorsions')
... with io.MoleculeReader('CSD') as reader:
... for identifer in identifiers:
... mol = reader.molecule(identifer)
... checked_mol = engine.analyse_molecule(mol)
... num_unusual = len([t for t in checked_mol.analysed_torsions if t.unusual])
... print("%s, %i" % (identifer, num_unusual))
...
See also
Here we make use of the concept of a function. If this is unfamiliar to you a basic introduction to functions can be found in the Python on-line documentation docs.python.org/3/tutorial/controlflow.html#defining-functions.
The ccdc.conformer.GeometryAnalyser.analyse_molecule()
function returns a
ccdc.molecule.Molecule
with dynamically added lists of
attributes for features analysed, in this case analysed_torsions
. These
list contain ccdc.conformer.Analysis
instances, which
contains descriptors such as ccdc.conformer.GeometryAnalyser.Analysis.unusual
. We
can use this to work out the number of unusual torsion angles observed
in a crystal structures.
>>> conformation_analysis(['YIGPIO01', 'YIGPIO02'])
REFCODE, UnusualTorsions
YIGPIO01, 3
YIGPIO02, 1
See also
The descriptive molecular geometry analysis
documentation and the
ccdc.conformer
API documentation.
Interaction preference analysis¶
The CSD Portfolio contains knowledge-based libraries of intermolecular
interactions. In this example we will use these libraries to help us
understand what interactions would be preferable in a crystal structure
of paracetamol/acetaminophen (HXACAN
).
First of all we create the central and contact group libraries.
>>> from ccdc.interaction import InteractionLibrary
>>> central_lib = InteractionLibrary.CentralGroupLibrary()
>>> contact_lib = InteractionLibrary.ContactGroupLibrary()
These libraries can be used to identify which central and contact groups are present in the paracetamol molecule.
>>> paracetamol = csd_reader.molecule('HXACAN')
>>> central_hits = central_lib.search_molecule(paracetamol)
>>> contact_hits = contact_lib.search_molecule(paracetamol)
We can then loop over the central and contact group hits to identify any pairs that have a relative density greater than 1.0.
Note
The code below will take a little time to complete.
>>> potential_interaction = []
>>> for central_hit in central_hits:
... for contact_hit in contact_hits:
... data = central_hit.group.interaction_data(contact_hit.group)
... if data is None:
... continue
... rel_density, est_std_dev = data.relative_density
... if rel_density > 2.0:
... interaction = (rel_density,
... data.ncontacts,
... central_hit.name,
... [i+1 for i in central_hit.match_atoms(indices=True)],
... contact_hit.name,
... [i+1 for i in contact_hit.match_atoms(indices=True)])
... potential_interaction.append(interaction)
...
Finally, we sort the interactions by their relative density and print them out.
>>> for interaction in sorted(potential_interaction, reverse=True):
... print("%.2f | %i | %s %s | %s %s" % interaction)
...
3.82 | 949 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any OH [19, 13]
3.00 | 112 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | phenol OH [3, 5, 4, 19, 13]
2.97 | 135 | acetylamino [1, 18, 7, 20, 8, 15, 16, 17, 14] | phenol OH [3, 5, 4, 19, 13]
2.79 | 2759 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any polar X-H (X= N,O or S) [19, 13]
2.79 | 2759 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any polar X-H (X= N,O or S) [18, 14]
2.54 | 1661 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any uncharged N-H [18, 14]
2.53 | 1626 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any uncharged N(sp2)-H [18, 14]
2.50 | 1535 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | uncharged C-NH-C [1, 7, 18, 14]
2.48 | 1429 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | amide N-H [8, 1, 7, 20, 18, 14]
2.41 | 1810 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any N-H [18, 14]
See also
The descriptive interaction library documentation and
the ccdc.interaction
API documentation.