Quick primer to using the CSD Python API

Introduction

This section is a quick primer for getting familiar with the CSD Python API. It is not, however, a complete reference to the CSD Python API. That can be found as the API documentation.

Once you have read through this primer and want some more information on how to make use of the CSD Python API it is recommended that you read through the descriptive documentation and look at the examples included in the cookbook documentation.

Note

Example files referenced within this documentation can be found in the doc/example_files directory within the distributed archive.

Note

In order to use data from the CSD and any feature dependent on CSD-derived data the CSD itself must be installed.

Starting a Python interactive shell

This document will make use of the Python interactive shell:

bash-3.2$ python
Python 2.7.3 (default, Jun 21 2012, 13:00:25)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

See also

For more information on Python’s interactive mode please see the Python on-line documentation docs.python.org/3/tutorial/interpreter.html#interactive-mode.

Accessing database entries, crystals and molecules from the CSD

The CSD Python API makes a distinction between database entries (ccdc.entry.Entry), crystals (ccdc.crystal.Crystal) and molecules (ccdc.molecule.Molecule). To illustrate the difference let us get access to the database entry, crystal and molecule of ABEBUF.

>>> from ccdc import io
>>> csd_reader = io.EntryReader('CSD')
>>> entry_abebuf = csd_reader.entry('ABEBUF')
>>> cryst_abebuf = csd_reader.crystal('ABEBUF')
>>> mol_abebuf = csd_reader.molecule('ABEBUF')

Molecules

A ccdc.molecule.Molecule contains molecular properties such as the molecular weight (ccdc.molecule.Molecule.molecular_weight) and descriptors such as whether or not the molecule is organic (ccdc.molecule.Molecule.is_organic).

>>> round(mol_abebuf.molecular_weight, 3)  # only want 3 significant figures
317.341
>>> mol_abebuf.is_organic
True

See also

In the example above we make use of Python’s built in round function. For more information on the round function please see the Python on-line documentation docs.python.org/3/library/functions.html#round.

The molecule also contains a ccdc.molecule.Molecule.smiles representation and a ccdc.molecule.Molecule.to_string() function for returning a Mol2 or SDF formatted string.

>>> print(mol_abebuf.smiles)
O=C1Nc2ccccc2C(=O)Nc2ccccc12.c1ccncc1
>>> print(mol_abebuf.to_string('mol2'))  
@<TRIPOS>MOLECULE
ABEBUF
...

Note that from the SMILES string it is apparent that this molecule contains two disconnected components:

  • O=C1Nc2ccccc2C(=O)Nc2ccccc12

  • c1ccncc1

The individual components are molecules as well and can be accessed through the ccdc.molecule.Molecule.components list and the individual molecule with the greatest molecular weight can be accessed through the ccdc.molecule.Molecule.heaviest_component attribute.

>>> mol_abebuf.components  
[<ccdc.molecule.Molecule object at ...>,
 <ccdc.molecule.Molecule object at ...>]
>>> [round(mol.molecular_weight, 3) for mol in mol_abebuf.components]
[238.241, 79.1]
>>> print(mol_abebuf.heaviest_component.smiles)
O=C1Nc2ccccc2C(=O)Nc2ccccc12

See also

The descriptive molecule documentation and the ccdc.molecule.Molecule API documentation.

Crystals

Crystals contain crystallographic descriptors such as Z’ (ccdc.crystal.Crystal.z_prime) and the cell volume (ccdc.crystal.Crystal.cell_volume).

>>> cryst_abebuf.z_prime
1.0
>>> round(cryst_abebuf.cell_volume, 3)  # only want 3 significant figures
3229.582

The molecule in the crystal structure can be accessed from the ccdc.crystal.Crystal.molecule attribute.

>>> cryst_abebuf.molecule 
<ccdc.molecule.Molecule object at ...>
>>> print(cryst_abebuf.molecule.smiles)
O=C1Nc2ccccc2C(=O)Nc2ccccc12.c1ccncc1

See also

The descriptive crystal documentation and the ccdc.crystal.Crystal API documentation.

Entries

CSD entries contains additional information about a crystal structure added during the editorial process. Examples include the chemical name (ccdc.entry.Entry.chemical_name) and the publication details (ccdc.entry.Entry.publication).

>>> print(entry_abebuf.chemical_name)
5H,11H-Dibenzo(b,f)(1,5)diazocine-6,12-dione pyridine clathrate
>>> print(entry_abebuf.publication)  
Citation(authors='S.W.Gordon-Wylie, E.Teplin, J.C.Morris,
          M.I.Trombley, S.M.McCarthy, W.M.Cleaver, G.R.Clark',
         journal='Journal(Crystal Growth and Design)',
         volume='4', year=2004, first_page='789',
         doi='10.1021/cg049957u')

The crystal and molecule in an entry can be accessed from the ccdc.entry.Entry.crystal and ccdc.entry.Entry.molecule attributes respectively.

>>> entry_abebuf.crystal 
<ccdc.crystal.Crystal object at ...>
>>> entry_abebuf.crystal.z_prime
1.0
>>> entry_abebuf.molecule 
<ccdc.molecule.Molecule object at ...>
>>> print(entry_abebuf.molecule.smiles)
O=C1Nc2ccccc2C(=O)Nc2ccccc12.c1ccncc1

Reading and writing molecules

Suppose that we wanted to write out our molecule to a file named abebuf.mol2.

>>> file_name = 'abebuf.mol2'

The ccdc.io module has three types of writer: ccdc.io.EntryWriter, ccdc.io.CrystalWriter, ccdc.io.MoleculeWriter. These are all context managers that can make use of the Python with syntax. This syntax automatically ensures that the writer is closed automatically when the with block of code is exited. Let us illustrate this by writing the molecule to the file.

See also

For more information on Python’s with syntax please see the Python on-line

documentation docs.python.org/3/reference/compound_stmts.html#with.

>>> with io.MoleculeWriter(file_name) as mol_writer:
...     mol_writer.write(mol_abebuf)
...

Note

In the example above the file format was determined by the file name extension (‘mol2’). The file format can also be set explicitly using the format argument when instantiating a writer.

Now that we have a molecule written to the file system we can illustrate how to read in a molecule from a file.

>>> mol_reader =  io.MoleculeReader(file_name)
>>> mol = mol_reader[0]  # get the first molecule in the file
>>> print(mol.identifier)
ABEBUF
>>> print(mol.smiles)
O=C1Nc2ccccc2C(=O)Nc2ccccc12.c1ccncc1
>>> len(mol.components)
2

Note

A ccdc.io.MoleculeReader is an iterator that can contain zero or more molecules. This is why we had to explicitly get the first molecule from it.

Note also that the concept of a molecule file containing multiple molecules is distinct from the notion that a molecule can contain multiple distinct components. Hence the first molecule extracted from the molecule reader having two components.

Searching the CSD

There are several main ways of searching the CSD:

Textual and numeric searching

Suppose that one was interested in the physical properties of aspirin and as such wanted to find publications containing aspirin crystal structures with recorded melting points.

This can be achieved using a ccdc.search.TextNumericSearch.

>>> from ccdc.search import TextNumericSearch
>>> text_numeric_search = TextNumericSearch()
>>> text_numeric_search.add_compound_name('aspirin')
>>> identifiers = [h.identifier for h in text_numeric_search.search()]

Over the years the assignment of the common and chemical names have somewhat arbitrarily been assigned to the synonyms and compound name fields of the CSD. As such we also need to search the synonyms field for the search to be comprehensive.

>>> text_numeric_search.clear()
>>> text_numeric_search.add_synonym('aspirin')
>>> identifiers.extend([h.identifier for h in text_numeric_search.search()])

Finally, we loop over the identifiers. To ensure that we avoid duplicates we can make use of Python’s built in set functionality. For each identifier we get the relevant entry from the CSD and check if it has a melting point associated with it or not and if it has we write out the identifier, the DOI and the melting point information.

>>> for identifier in sorted(set(identifiers)):
...     e = csd_reader.entry(identifier)
...     if e.melting_point:
...         print('%-8s https://doi.org/%-25s %s' % (e.identifier,
...                                                    e.publication.doi,
...                                                    e.melting_point))
...
ACMEBZ   https://doi.org/10.1107/S0567740881006729 385K
ACSALA13 https://doi.org/10.1021/ja056455b         135.5deg.C
ACSALA28 https://doi.org/10.1039/D0CC03157G        415 K
BEHWOA   https://doi.org/10.1107/S0567740882005731 401K
CUASPR01 https://doi.org/10.1107/S1600536803026126 above 573 K
HUPPOX   https://doi.org/10.1039/b208574g          91-96 deg.C
HUPPOX01 https://doi.org/None                      91-96 deg.C
NUWTOP01 https://doi.org/10.1039/c2ce06313a        147 deg.C
PIKYUG   https://doi.org/10.1021/acs.cgd.8b00718   502 K

See also

In this example we make use of Python’s built-in set object. Please see the Python on-line documentation for more information docs.python.org/3/library/stdtypes.html.

See also

The descriptive text numeric search documentation and the ccdc.search.TextNumericSearch, ccdc.search.TextNumericHit API documentation.

Molecular similarity searching

Suppose that one wanted to find out what space groups were occupied by CSD entries that contained compounds similar to the heaviest component of ABEBUF. This type of information may be of interest if one was trying to solve the structure of a new compound using powder data. Having knowledge of which space groups are likely to be occupied by a compound can help the interpretation of the data.

This can be achieved using a ccdc.search.SimilaritySearch.

>>> from ccdc.search import SimilaritySearch
>>> sim_search = SimilaritySearch(mol_abebuf.heaviest_component)
>>> hits = sim_search.search()
>>> len(hits)
40
>>> for h in hits:
...     print('%8s %.3f %s' % (h.identifier, h.similarity, h.crystal.spacegroup_symbol))
...
  ABEBIT 1.000 P-1
  ABEBOZ 1.000 P-1
  ABEBUF 1.000 Pbca
  ISOHAZ 1.000 P-1
  ISOHED 1.000 P212121
  ISOHIH 1.000 P21/n
  LIGREY 0.981 P21/n
LIGREY01 0.981 P21/c
  DMECDC 0.954 P21
  MEANLD 0.954 P21/a
  EQOTIQ 0.849 P21/n
  UCELIX 0.847 P21/c
  DBZCDC 0.828 P21/a
  WUZPAL 0.809 P21/n
  ADOHIM 0.806 P21/c
  WUZNUD 0.792 P21/c
  WUZPEP 0.790 Pbca
  MANTRN 0.788 P212121
  WUZNOX 0.788 Pbca
  ASITOL 0.775 Pca21
  ASITUR 0.775 P-1
  ASIVAZ 0.775 P-1
  TOVZEJ 0.759 P-1
  MUWSOP 0.754 P21/c
  QAFMET 0.750 P21/n
  LOSKAI 0.741 P21/c
  NUJPUE 0.739 P21/c
  TOKZAW 0.739 P-1
  HIRROR 0.735 P21/c
HIRROR01 0.735 P21/n
  LOSKEM 0.706 P21/n
  ARUGOM 0.706 P-1
  JONWOA 0.705 Pbca
  GAVBIS 0.870 Pbca
  GATZAG 0.847 C2/c
  GAVBAK 0.790 P21/c
  GATZUA 0.775 P21/c
  GATYUZ 0.757 P-1
  GATZOU 0.730 P212121
  REKYOZ 0.857 P21/c

See also

In this example we make use of a for loop. If this is unfamiliar to you a basic introduction to the for statement can be found in the Python on-line documentation docs.python.org/3/tutorial/controlflow.html#for-statements.

Substructure searching with 3D measurements

Identifying identical molecules

Suppose that we wanted to find out if there were any other structures in the CSD that contained 5H,11H-Dibenzo(b,f)(1,5)diazocine-6,12-dione (the heaviest component of ABEBUF).

This can be achieved by setting up a ccdc.search.SubstructureSearch with a ccdc.search.MoleculeSubstructure.

>>> from ccdc.search import SubstructureSearch, MoleculeSubstructure
>>> mol_substructure = MoleculeSubstructure(mol_abebuf.heaviest_component)
>>> substructure_search = SubstructureSearch()
>>> sub_id = substructure_search.add_substructure(mol_substructure)
>>> hits = substructure_search.search()
>>> len(hits)
8
>>> for h in hits:
...     print(h.identifier)
...
ABEBIT
ABEBOZ
ABEBUF
ISOHAZ
ISOHAZ
ISOHED
ISOHIH
ISOHIH

Note that some CSD entries were hit more than once by this substructure. If we only want to get one hit per structure we can make use of the max_hits_per_structure argument. The example below also illustrates how one can get the indices of the matched atoms out of a substructure hit.

>>> hits = substructure_search.search(max_hits_per_structure=1)
>>> len(hits)
6
>>> for h in hits:
...     print('%s: (%s)' % (h.identifier, ', '.join('%d' % i for i in h.match_atoms(indices=True))))  
ABEBIT: (26, 27, 0, ..., 23, 21, 11)
ABEBOZ: (0, 1, 2, ..., 22, 24, 26)
ABEBUF: (0, 1, 2, ..., 25, 26, 27)
ISOHAZ: (0, 1, 4, ..., 22, 24, 26)
ISOHED: (10, 11, 14, ..., 32, 34, 36)
ISOHIH: (10, 11, 14, ..., 32, 34, 36)

Torsional preferences of an acyclic fragment

Suppose that we were interested in finding out the torsional preferences of a 2-ethoxyethoxy fragment. Is it 180° like an alkyl chain or not?

This can be achieved by setting up a ccdc.search.SubstructureSearch, for example from a ccdc.search.SMARTSSubstructure, and adding a torsion measurement to the search ccdc.search.SubstructureSearch.add_torsion_measurement().

First we create a ccdc.search.SMARTSSubstructure using a SMARTS pattern.

>>> from ccdc.search import SMARTSSubstructure
>>> ethoxyethoxy = SMARTSSubstructure('[OD2][CH2]!@[CH2][OD2]')
>>> for qatom in ethoxyethoxy.atoms:  
...     print('%d: %s' % (qatom.index, qatom))
...
0: QueryAtom(O)[atom aromaticity: equal to 0, count connected elements equal to 2 from [He...Og]]
1: QueryAtom(C)[atom aromaticity: equal to 0, hydrogen count, including deuterium: equal to 2]
2: QueryAtom(C)[atom aromaticity: equal to 0, hydrogen count, including deuterium: equal to 2]
3: QueryAtom(O)[atom aromaticity: equal to 0, count connected elements equal to 2 from [He...Og]]

Then we create a ccdc.search.SubstructureSearch and add the substructure to it.

>>> substructure_search = SubstructureSearch()
>>> sub_id = substructure_search.add_substructure(ethoxyethoxy)

We can then add a torsion measurement to the substructure using the substructure identifier and the ccdc.search.QueryAtom indices.

>>> substructure_search.add_torsion_angle_measurement('TOR1',
...     sub_id, 0,
...     sub_id, 1,
...     sub_id, 2,
...     sub_id, 3)

For the purposes of illustrating this search we will stop the search after the first five structures have been hit.

>>> hits = substructure_search.search(max_hit_structures=5)
>>> for h in hits:
...     print('%8s %7.2f' % (h.identifier,
...                         h.measurements['TOR1']))
...
  ABIGEY  176.23
  ABIGEY  162.75
  ABOKAD   64.74
  ABOKAD   58.64
  ABOKAD  -63.53
  ABOKAD   67.31
  ABOKAD   66.00
  ABOKAD   73.18
  ACEJIE   41.31
  ACEJIE  -68.13
  ACEJIE  -64.21
  ACEJIE   67.41
  ACEJIE  -67.65
  ACEJIE   66.83
  ACEJIE  -70.26
  ACEJIE   67.18
  ACIFEZ   70.50
  ACIFEZ   55.07
  ACIFEZ  -70.58
  ACIJOM  180.00

Interaction geometry of a pair of functional groups

Suppose that we were interested in working out if there were any preferences in geometry for a charged pyridine fragment interacting with the carbonyl oxygen of an amide group.

This can be achieved by setting up a ccdc.search.SubstructureSearch, using for example a pair of ccdc.search.SMARTSSubstructure instances, and adding a distance constraint between the atoms of interest (ccdc.search.SubstructureSearch.add_distance_constraint()).

Using the ccdc.search.SubstructureSearch.add_angle_measurement() one could then set up measurements for the two angles of interest.

Charged pyridine to amide contact query.

Substructure search to investigate the geometric preferences of charged pyridine groups interacting with an amide carbonyl oxygen.

>>> charged_pyridine = SMARTSSubstructure('c1ccn(H)cc1')  # indices N:3 H:4
>>> amide = SMARTSSubstructure('C[NH]C(=O)C')  # indices C:2 O:3
>>> substructure_search = SubstructureSearch()
>>> sub1 = substructure_search.add_substructure(charged_pyridine)
>>> sub2 = substructure_search.add_substructure(amide)

Note that we have now added two distinct substructures to this particular substructure search. At this point we can set up the distance constraint between the atoms of interest in the two substructures as well as the angle measurements.

>>> substructure_search.add_distance_constraint('D1',
...     sub1, 4,
...     sub2, 3,
...     (0, 2.5),  # distance constraint range
...     'Intermolecular')
>>> substructure_search.add_angle_measurement('A1',
...     sub1, 3,
...     sub1, 4,
...     sub2, 3)
>>> substructure_search.add_angle_measurement('A2',
...     sub2, 2,
...     sub2, 3,
...     sub1, 4)

For the purposes of illustrating this search we will stop the search after the first five structures have been hit.

>>> hits = substructure_search.search(max_hit_structures=5)
>>> for h in hits:
...     print('%s (D1 %.2f) (A1 %.2f) (A2 %.2f)' % (h.identifier, h.constraints['D1'], h.measurements['A1'], h.measurements['A2']))
...
AMEHUX (D1 2.12) (A1 130.21) (A2 124.72)
CUWWEX (D1 1.99) (A1 157.26) (A2 142.09)
DOKPAU (D1 1.93) (A1 133.90) (A2 138.39)
DOKPEY (D1 1.89) (A1 133.08) (A2 138.74)
FEZBUK (D1 2.16) (A1 168.21) (A2 122.84)

Conformational geometry analysis

The CSD Software provides functionality that enables rapid validation of the complete geometry of a given query molecule. This can, for example, be used to validate the binding modes of ligands from protein-ligand structures. It can also provide valuable information in the assessment of the relative stability of conformational polymorphs. In this example we will compare two crystal structures of the HIV-protease inhibitor Ritonavir, identifiers YIGPIO01 and YIGPIO02.

To perform a molecular geometry analysis on a molecule one needs to create a geometry analysis engine.

>>> from ccdc import conformer
>>> engine = conformer.GeometryAnalyser()

We are only interested in analysing the torsion angles so we will turn off the analysis of bonds, valence angles and rings. In this example we will also turn off search generalisation.

>>> engine.settings.bond.analyse = False
>>> engine.settings.angle.analyse = False
>>> engine.settings.ring.analyse = False
>>> engine.settings.generalisation = False

To facilitate the analysis we will use a simple function that takes a list of identifiers and prints out the identifier and the number of unusual torsion angles in the structure.

>>> def conformation_analysis(identifiers):
...     print('REFCODE, UnusualTorsions')
...     with io.MoleculeReader('CSD') as reader:
...         for identifer in identifiers:
...             mol = reader.molecule(identifer)
...             checked_mol = engine.analyse_molecule(mol)
...             num_unusual = len([t for t in checked_mol.analysed_torsions if t.unusual])
...             print("%s, %i" % (identifer, num_unusual))
...

See also

Here we make use of the concept of a function. If this is unfamiliar to you a basic introduction to functions can be found in the Python on-line documentation docs.python.org/3/tutorial/controlflow.html#defining-functions.

The ccdc.conformer.GeometryAnalyser.analyse_molecule() function returns a ccdc.molecule.Molecule with dynamically added lists of attributes for features analysed, in this case analysed_torsions. These list contain ccdc.conformer.Analysis instances, which contains descriptors such as ccdc.conformer.GeometryAnalyser.Analysis.unusual. We can use this to work out the number of unusual torsion angles observed in a crystal structures.

>>> conformation_analysis(['YIGPIO01', 'YIGPIO02'])
REFCODE, UnusualTorsions
YIGPIO01, 3
YIGPIO02, 0

Interaction preference analysis

The CSD Portfolio contains knowledge-based libraries of intermolecular interactions. In this example we will use these libraries to help us understand what interactions would be preferable in a crystal structure of paracetamol/acetaminophen (HXACAN).

First of all we create the central and contact group libraries.

>>> from ccdc.interaction import InteractionLibrary
>>> central_lib = InteractionLibrary.CentralGroupLibrary()
>>> contact_lib = InteractionLibrary.ContactGroupLibrary()

These libraries can be used to identify which central and contact groups are present in the paracetamol molecule.

>>> paracetamol = csd_reader.molecule('HXACAN')
>>> central_hits = central_lib.search_molecule(paracetamol)
>>> contact_hits = contact_lib.search_molecule(paracetamol)

We can then loop over the central and contact group hits to identify any pairs that have a relative density greater than 1.0.

Note

The code below will take a little time to complete.

>>> potential_interaction = []
>>> for central_hit in central_hits: 
...     for contact_hit in contact_hits:
...         data = central_hit.group.interaction_data(contact_hit.group)
...         if data is None:
...             continue
...         rel_density, est_std_dev = data.relative_density
...         if rel_density > 2.0:
...             interaction = (rel_density,
...                            data.ncontacts,
...                            central_hit.name,
...                            [i+1 for i in central_hit.match_atoms(indices=True)],
...                            contact_hit.name,
...                            [i+1 for i in contact_hit.match_atoms(indices=True)])
...             potential_interaction.append(interaction)
...

Finally, we sort the interactions by their relative density and print them out.

>>> for interaction in sorted(potential_interaction, reverse=True):  
...     print("%.2f | %i | %s %s | %s %s" % interaction)
...
3.82 | 949 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any OH [19, 13]
3.00 | 112 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | phenol OH [3, 5, 4, 19, 13]
2.97 | 135 | acetylamino [1, 18, 7, 20, 8, 15, 16, 17, 14] | phenol OH [3, 5, 4, 19, 13]
2.79 | 2759 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any polar X-H (X= N,O or S) [19, 13]
2.79 | 2759 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any polar X-H (X= N,O or S) [18, 14]
2.54 | 1661 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any uncharged N-H [18, 14]
2.53 | 1626 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any uncharged N(sp2)-H [18, 14]
2.50 | 1535 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | uncharged C-NH-C [1, 7, 18, 14]
2.48 | 1429 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | amide N-H [8, 1, 7, 20, 18, 14]
2.41 | 1810 | aromatic-aliphatic amide [2, 6, 1, 8, 18, 7, 20, 14] | any N-H [18, 14]
Atom indices of HXACAN.

Atom indices of the paracetamol molecule in HXACAN.

See also

The descriptive interaction library documentation and the ccdc.interaction API documentation.