Quick primer to using the CSD Python API¶
This section is a quick primer for getting familiar with the CSD Python API. It is not, however, a complete reference to the CSD Python API. That can be found as the API documentation.
Once you have read through this primer and want some more information on how to make use of the CSD Python API it is recommended that you read through the descriptive documentation and look at the examples included in the cookbook documentation.
Example files referenced within this documentation can be found in the doc/example_files directory within the distributed archive.
In order to use data from the CSD and any feature dependent on CSD-derived data the CSD itself must be installed.
Starting a Python interactive shell¶
This document will make use of the Python interactive shell:
bash-3.2$ python Python 2.7.3 (default, Jun 21 2012, 13:00:25) [GCC 4.1.2 20080704 (Red Hat 4.1.2-52)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>>
For more information on Python’s interactive mode please see the Python on-line documentation docs.python.org/3/tutorial/interpreter.html#interactive-mode.
Accessing database entries, crystals and molecules from the CSD¶
The CSD Python API makes a distinction between
database entries (
ccdc.crystal.Crystal) and molecules
ccdc.molecule.Molecule). To illustrate the difference let us
get access to the database entry, crystal and molecule of ABEBUF.
>>> from ccdc import io >>> csd_reader = io.EntryReader('CSD') >>> entry_abebuf = csd_reader.entry('ABEBUF') >>> cryst_abebuf = csd_reader.crystal('ABEBUF') >>> mol_abebuf = csd_reader.molecule('ABEBUF')
ccdc.molecule.Molecule contains molecular properties such as
the molecular weight (
and descriptors such as whether or not the molecule is organic
>>> round(mol_abebuf.molecular_weight, 3) # only want 3 significant figures 317.341 >>> mol_abebuf.is_organic True
In the example above we make use of Python’s built in
round function. For more information on the
function please see the Python on-line documentation
>>> print(mol_abebuf.smiles) O=C1Nc2ccccc2C(=O)Nc2ccccc12.c1ccncc1 >>> print(mol_abebuf.to_string('mol2')) @<TRIPOS>MOLECULE ABEBUF ...
Note that from the SMILES string it is apparent that this molecule contains two disconnected components:
The individual components are molecules as well and can be accessed
ccdc.molecule.Molecule.components list and the
individual molecule with the greatest molecular weight can be accessed
>>> mol_abebuf.components [<ccdc.molecule.Molecule object at ...>, <ccdc.molecule.Molecule object at ...>] >>> [round(mol.molecular_weight, 3) for mol in mol_abebuf.components] [238.241, 79.1] >>> print(mol_abebuf.heaviest_component.smiles) O=C1Nc2ccccc2C(=O)Nc2ccccc12
>>> cryst_abebuf.z_prime 1.0 >>> round(cryst_abebuf.cell_volume, 3) # only want 3 significant figures 3229.582
The molecule in the crystal structure can be accessed from the
>>> cryst_abebuf.molecule <ccdc.molecule.Molecule object at ...> >>> print(cryst_abebuf.molecule.smiles) O=C1Nc2ccccc2C(=O)Nc2ccccc12.c1ccncc1
CSD entries contains additional information about a crystal structure
added during the editorial process. Examples include the chemical name
ccdc.entry.Entry.chemical_name) and the publication details
>>> print(entry_abebuf.chemical_name) 5H,11H-Dibenzo(b,f)(1,5)diazocine-6,12-dione pyridine clathrate >>> print(entry_abebuf.publication) Citation(authors='S.W.Gordon-Wylie, E.Teplin, J.C.Morris, M.I.Trombley, S.M.McCarthy, W.M.Cleaver, G.R.Clark', journal='Journal(Crystal Growth and Design)', volume='4', year=2004, first_page='789', doi='10.1021/cg049957u')
>>> entry_abebuf.crystal <ccdc.crystal.Crystal object at ...> >>> entry_abebuf.crystal.z_prime 1.0 >>> entry_abebuf.molecule <ccdc.molecule.Molecule object at ...> >>> print(entry_abebuf.molecule.smiles) O=C1Nc2ccccc2C(=O)Nc2ccccc12.c1ccncc1
Reading and writing molecules¶
Suppose that we wanted to write out our molecule to a file named
>>> file_name = 'abebuf.mol2'
ccdc.io module has three types of writer:
ccdc.io.MoleculeWriter. These are all context managers that can
make use of the Python
with syntax. This syntax automatically ensures that the writer is
closed automatically when the
with block of code is exited. Let us illustrate this by
writing the molecule to the file.
For more information on Python’s
with syntax please see the Python on-line
>>> with io.MoleculeWriter(file_name) as mol_writer: ... mol_writer.write(mol_abebuf) ...
In the example above the file format was determined by the
file name extension (‘mol2’). The file format can also be set
explicitly using the
format argument when instantiating a
Now that we have a molecule written to the file system we can illustrate how to read in a molecule from a file.
>>> mol_reader = io.MoleculeReader(file_name) >>> mol = mol_reader # get the first molecule in the file >>> print(mol.identifier) ABEBUF >>> print(mol.smiles) O=C1Nc2ccccc2C(=O)Nc2ccccc12.c1ccncc1 >>> len(mol.components) 2
ccdc.io.MoleculeReader is an iterator that can
contain zero or more molecules. This is why we had to
explicitly get the first molecule from it.
Note also that the concept of a molecule file containing multiple molecules is distinct from the notion that a molecule can contain multiple distinct components. Hence the first molecule extracted from the molecule reader having two components.
Searching the CSD¶
There are several main ways of searching the CSD:
text numeric searching based on the data corresponding to entries (
similarity searching for chemically-similar entries (
substructure searching including searches with 3D geometrical constraints and measurements (
reduced cell searching based on crystal cell parameters (
Textual and numeric searching¶
Suppose that one was interested in the physical properties of aspirin and as such wanted to find publications containing aspirin crystal structures with recorded melting points.
This can be achieved using a
>>> from ccdc.search import TextNumericSearch >>> text_numeric_search = TextNumericSearch() >>> text_numeric_search.add_compound_name('aspirin') >>> identifiers = [h.identifier for h in text_numeric_search.search()]
Over the years the assignment of the common and chemical names have somewhat arbitrarily been assigned to the synonyms and compound name fields of the CSD. As such we also need to search the synonyms field for the search to be comprehensive.
>>> text_numeric_search.clear() >>> text_numeric_search.add_synonym('aspirin') >>> identifiers.extend([h.identifier for h in text_numeric_search.search()])
Finally, we loop over the identifiers. To ensure that we avoid duplicates we can make use of Python’s built in set functionality. For each identifier we get the relevant entry from the CSD and check if it has a melting point associated with it or not and if it has we write out the identifier, the DOI and the melting point information.
>>> for identifier in sorted(set(identifiers)): ... e = csd_reader.entry(identifier) ... if e.melting_point: ... print('%-8s http://dx.doi.org/%-25s %s' % (e.identifier, ... e.publication.doi, ... e.melting_point)) ... ACMEBZ http://dx.doi.org/10.1107/S0567740881006729 385K ACSALA13 http://dx.doi.org/10.1021/ja056455b 135.5deg.C ACSALA28 http://dx.doi.org/10.1039/D0CC03157G 415 K BEHWOA http://dx.doi.org/10.1107/S0567740882005731 401K CUASPR01 http://dx.doi.org/10.1107/S1600536803026126 above 573 K HUPPOX http://dx.doi.org/10.1039/b208574g 91-96 deg.C HUPPOX01 http://dx.doi.org/None 91-96 deg.C NUWTOP01 http://dx.doi.org/10.1039/c2ce06313a 147 deg.C PIKYUG http://dx.doi.org/10.1021/acs.cgd.8b00718 502 K
In this example we make use of Python’s built-in set object. Please see the Python on-line documentation for more information docs.python.org/3/library/stdtypes.html.
Molecular similarity searching¶
Suppose that one wanted to find out what space groups were occupied by
CSD entries that contained compounds similar to the heaviest component
ABEBUF. This type of information may be of interest if one was
trying to solve the structure of a new compound using powder data.
Having knowledge of which space groups are likely to be occupied by a
compound can help the interpretation of the data.
This can be achieved using a
>>> from ccdc.search import SimilaritySearch >>> sim_search = SimilaritySearch(mol_abebuf.heaviest_component) >>> hits = sim_search.search() >>> len(hits) 40 >>> for h in hits: ... print('%8s %.3f %s' % (h.identifier, h.similarity, h.crystal.spacegroup_symbol)) ... ABEBIT 1.000 P-1 ABEBOZ 1.000 P-1 ABEBUF 1.000 Pbca ISOHAZ 1.000 P-1 ISOHED 1.000 P212121 ISOHIH 1.000 P21/n LIGREY 0.981 P21/n LIGREY01 0.981 P21/c DMECDC 0.954 P21 MEANLD 0.954 P21/a EQOTIQ 0.849 P21/n UCELIX 0.847 P21/c DBZCDC 0.828 P21/a WUZPAL 0.809 P21/n ADOHIM 0.806 P21/c WUZNUD 0.792 P21/c WUZPEP 0.790 Pbca MANTRN 0.788 P212121 WUZNOX 0.788 Pbca ASITOL 0.775 Pca21 ASITUR 0.775 P-1 ASIVAZ 0.775 P-1 TOVZEJ 0.759 P-1 MUWSOP 0.754 P21/c QAFMET 0.750 P21/n LOSKAI 0.741 P21/c NUJPUE 0.739 P21/c TOKZAW 0.739 P-1 HIRROR 0.735 P21/c HIRROR01 0.735 P21/n LOSKEM 0.706 P21/n ARUGOM 0.706 P-1 JONWOA 0.705 Pbca GAVBIS 0.870 Pbca GATZAG 0.847 C2/c GAVBAK 0.790 P21/c GATZUA 0.775 P21/c GATYUZ 0.757 P-1 GATZOU 0.730 P212121 REKYOZ 0.857 P21/c
In this example we make use of a
for loop. If this is
unfamiliar to you a basic introduction to the
statement can be found in the Python on-line documentation
Substructure searching with 3D measurements¶
Identifying identical molecules¶
Suppose that we wanted to find out if there were any other structures in
the CSD that contained
(the heaviest component of
>>> from ccdc.search import SubstructureSearch, MoleculeSubstructure >>> mol_substructure = MoleculeSubstructure(mol_abebuf.heaviest_component) >>> substructure_search = SubstructureSearch() >>> sub_id = substructure_search.add_substructure(mol_substructure) >>> hits = substructure_search.search() >>> len(hits) 8 >>> for h in hits: ... print(h.identifier) ... ABEBIT ABEBOZ ABEBUF ISOHAZ ISOHAZ ISOHED ISOHIH ISOHIH
Note that some CSD entries were hit more than once by this substructure.
If we only want to get one hit per structure we can make use of the
max_hits_per_structure argument. The example below also illustrates
how one can get the indices of the matched atoms out of a substructure
>>> hits = substructure_search.search(max_hits_per_structure=1) >>> len(hits) 6 >>> for h in hits: ... print('%s: (%s)' % (h.identifier, ', '.join('%d' % i for i in h.match_atoms(indices=True)))) ABEBIT: (26, 27, 0, ..., 23, 21, 11) ABEBOZ: (0, 1, 2, ..., 22, 24, 26) ABEBUF: (0, 1, 2, ..., 25, 26, 27) ISOHAZ: (0, 1, 4, ..., 22, 24, 26) ISOHED: (10, 11, 14, ..., 32, 34, 36) ISOHIH: (10, 11, 14, ..., 32, 34, 36)
Torsional preferences of an acyclic fragment¶
Suppose that we were interested in finding out the torsional preferences of a 2-ethoxyethoxy fragment. Is it 180° like an alkyl chain or not?
This can be achieved by setting up a
ccdc.search.SubstructureSearch, for example from a
ccdc.search.SMARTSSubstructure, and adding a torsion
measurement to the search
First we create a
ccdc.search.SMARTSSubstructure using a SMARTS pattern.
>>> from ccdc.search import SMARTSSubstructure >>> ethoxyethoxy = SMARTSSubstructure('[OD2][CH2]!@[CH2][OD2]') >>> for qatom in ethoxyethoxy.atoms: ... print('%d: %s' % (qatom.index, qatom)) ... 0: QueryAtom(O)[atom aromaticity: equal to 0, count connected elements equal to 2 from [0: Li ... 116: Rn ]] 1: QueryAtom(C)[atom aromaticity: equal to 0, hydrogen count, including deuterium: equal to 2] 2: QueryAtom(C)[atom aromaticity: equal to 0, hydrogen count, including deuterium: equal to 2] 3: QueryAtom(O)[atom aromaticity: equal to 0, count connected elements equal to 2 from [0: Li ... 116: Rn ]]
Then we create a
ccdc.search.SubstructureSearch and add the
substructure to it.
>>> substructure_search = SubstructureSearch() >>> sub_id = substructure_search.add_substructure(ethoxyethoxy)
We can then add a torsion measurement to the substructure using the
substructure identifier and the
>>> substructure_search.add_torsion_angle_measurement('TOR1', ... sub_id, 0, ... sub_id, 1, ... sub_id, 2, ... sub_id, 3)
For the purposes of illustrating this search we will stop the search after the first five structures have been hit.
>>> hits = substructure_search.search(max_hit_structures=5) >>> for h in hits: ... print('%8s %7.2f' % (h.identifier, ... h.measurements['TOR1'])) ... ABIGEY 176.23 ABIGEY 162.75 ABOKAD 64.74 ABOKAD 58.64 ABOKAD -63.53 ABOKAD 67.31 ABOKAD 66.00 ABOKAD 73.18 ACEJIE 41.31 ACEJIE -68.13 ACEJIE -64.21 ACEJIE 67.41 ACEJIE -67.65 ACEJIE 66.83 ACEJIE -70.26 ACEJIE 67.18 ACIFEZ 70.50 ACIFEZ 55.07 ACIFEZ -70.58 ACIJOM 180.00
Interaction geometry of a pair of functional groups¶
Suppose that we were interested in working out if there were any preferences in geometry for a charged pyridine fragment interacting with the carbonyl oxygen of an amide group.
This can be achieved by setting up a
ccdc.search.SubstructureSearch, using for example a pair of
ccdc.search.SMARTSSubstructure instances, and adding a distance
constraint between the atoms of interest
one could then set up measurements for the two angles of interest.
>>> charged_pyridine = SMARTSSubstructure('c1ccn(H)cc1') # indices N:3 H:4 >>> amide = SMARTSSubstructure('C[NH]C(=O)C') # indices C:2 O:3 >>> substructure_search = SubstructureSearch() >>> sub1 = substructure_search.add_substructure(charged_pyridine) >>> sub2 = substructure_search.add_substructure(amide)
Note that we have now added two distinct substructures to this particular substructure search. At this point we can set up the distance constraint between the atoms of interest in the two substructures as well as the angle measurements.
>>> substructure_search.add_distance_constraint('D1', ... sub1, 4, ... sub2, 3, ... (0, 2.5), # distance constraint range ... 'Intermolecular') >>> substructure_search.add_angle_measurement('A1', ... sub1, 3, ... sub1, 4, ... sub2, 3) >>> substructure_search.add_angle_measurement('A2', ... sub2, 2, ... sub2, 3, ... sub1, 4)
For the purposes of illustrating this search we will stop the search after the first five structures have been hit.
>>> hits = substructure_search.search(max_hit_structures=5) >>> for h in hits: ... print('%s (D1 %.2f) (A1 %.2f) (A2 %.2f)' % (h.identifier, h.constraints['D1'], h.measurements['A1'], h.measurements['A2'])) ... AMEHUX (D1 2.12) (A1 130.21) (A2 124.72) CUWWEX (D1 1.99) (A1 157.26) (A2 142.09) DOKPAU (D1 1.93) (A1 133.90) (A2 138.39) DOKPEY (D1 1.89) (A1 133.08) (A2 138.74) FEZBUK (D1 2.16) (A1 168.21) (A2 122.84)
Conformational geometry analysis¶
The CSD Software provides functionality that enables rapid validation of the
complete geometry of a given query molecule. This can, for example, be
used to validate the binding modes of ligands from protein-ligand
structures. It can also provide valuable information in the assessment
of the relative stability of conformational polymorphs. In this example
we will compare two crystal structures of the HIV-protease inhibitor
To perform a molecular geometry analysis on a molecule one needs to create a geometry analysis engine.
>>> from ccdc import conformer >>> engine = conformer.GeometryAnalyser()
We are only interested in analysing the torsion angles so we will turn off the analysis of bonds, valence angles and rings. In this example we will also turn off search generalisation.
>>> engine.settings.bond.analyse = False >>> engine.settings.angle.analyse = False >>> engine.settings.ring.analyse = False >>> engine.settings.generalisation = False
To facilitate the analysis we will use a simple function that takes a list of identifiers and prints out the identifier and the number of unusual torsion angles in the structure.
>>> def conformation_analysis(identifiers): ... print('REFCODE, UnusualTorsions') ... with io.MoleculeReader('CSD') as reader: ... for identifer in identifiers: ... mol = reader.molecule(identifer) ... checked_mol = engine.analyse_molecule(mol) ... num_unusual = len([t for t in checked_mol.analysed_torsions if t.unusual]) ... print("%s, %i" % (identifer, num_unusual)) ...
Here we make use of the concept of a function. If this is unfamiliar to you a basic introduction to functions can be found in the Python on-line documentation docs.python.org/3/tutorial/controlflow.html#defining-functions.
ccdc.conformer.GeometryAnalyser.analyse_molecule() function returns a
ccdc.molecule.Molecule with dynamically added lists of
attributes for features analysed, in this case
ccdc.conformer.Analysis instances, which
contains descriptors such as
can use this to work out the number of unusual torsion angles observed
in a crystal structures.
>>> conformation_analysis(['YIGPIO01', 'YIGPIO02']) REFCODE, UnusualTorsions YIGPIO01, 3 YIGPIO02, 0
The descriptive molecular geometry analysis
documentation and the
ccdc.conformer API documentation.
Interaction preference analysis¶
The CSD Portfolio contains knowledge-based libraries of intermolecular
interactions. In this example we will use these libraries to help us
understand what interactions would be preferable in a crystal structure
of paracetamol/acetaminophen (
First of all we create the central and contact group libraries.
>>> from ccdc.interaction import InteractionLibrary >>> central_lib = InteractionLibrary.CentralGroupLibrary() >>> contact_lib = InteractionLibrary.ContactGroupLibrary()
These libraries can be used to identify which central and contact groups are present in the paracetamol molecule.
>>> paracetamol = csd_reader.molecule('HXACAN') >>> central_hits = central_lib.search_molecule(paracetamol) >>> contact_hits = contact_lib.search_molecule(paracetamol)
We can then loop over the central and contact group hits to identify any pairs that have a relative density greater than 1.0.
The code below will take a little time to complete.
>>> potential_interaction =  >>> for central_hit in central_hits: ... for contact_hit in contact_hits: ... data = central_hit.group.interaction_data(contact_hit.group) ... if data is None: ... continue ... rel_density, est_std_dev = data.relative_density ... if rel_density > 2.0: ... interaction = (rel_density, ... data.ncontacts, ... central_hit.name, ... [i+1 for i in central_hit.match_atoms(indices=True)], ... contact_hit.name, ... [i+1 for i in contact_hit.match_atoms(indices=True)]) ... potential_interaction.append(interaction) ...
Finally, we sort the interactions by their relative density and print them out.
>>> for interaction in sorted(potential_interaction, reverse=True): ... print("%.2f | %i | %s %s | %s %s" % interaction) ... 3.82 | 949 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any OH [19, 13] 3.00 | 112 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | phenol OH [3, 5, 4, 19, 13] 2.97 | 135 | acetylamino [1, 18, 7, 20, 8, 15, 16, 17, 14] | phenol OH [3, 5, 4, 19, 13] 2.79 | 2759 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any polar X-H (X= N,O or S) [19, 13] 2.79 | 2759 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any polar X-H (X= N,O or S) [18, 14] 2.54 | 1661 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any uncharged N-H [18, 14] 2.53 | 1626 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any uncharged N(sp2)-H [18, 14] 2.50 | 1535 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | uncharged C-NH-C [1, 7, 18, 14] 2.48 | 1429 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | amide N-H [8, 1, 7, 20, 18, 14] 2.41 | 1810 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any N-H [18, 14]