Molecular geometry analysis


The CSD Portfolio contains a knowledge-base of intramolecular geometries. This knowledge-base provides easy and rapid access to millions of chemically classified bond lengths, valence angles, acyclic torsion angles, and ring conformations derived from the CSD.

This enables you to rapidly validate the complete geometry of a given query structure and identify any unusual features without the need to construct complex search queries, or carry out detailed data analyses.

Let us import in the ccdc.conformer module.

>>> from ccdc import conformer

Let us also import the module so that we can read in molecules of interest.

>>> from ccdc import io


For more information on these knowledge-based geometry libraries please see: “Retrieval of Crystallographically-Derived Molecular Geometry Information”, I. J. Bruno, J. C. Cole, M. Kessler, Jie Luo, W. D. S. Motherwell, L. H. Purkis, B. R. Smith, R. Taylor, R. I. Cooper, S. E. Harris and A. G. Orpen, J. Chem. Inf. Comput. Sci., 44, 2133-2144, 2004 DOI: 10.1021/ci049780b.

The molecular geometry analysis engine

In order to be able to carry out a molecular geometry analysis one requires an instance of the ccdc.conformer.GeometryAnalyser class. Let us create one.

>>> engine = conformer.GeometryAnalyser()


The most important function of an instance of the ccdc.conformer.GeometryAnalyser class is the ccdc.conformer.GeometryAnalyser.analyse_molecule() function, which when supplied with a molecule returns a geometry analysed molecule.

Geometry analysis settings

The settings used in a geometry analysis are stored in the settings attribute, which is an instance of the ccdc.conformer.GeometryAnalyser.Settings class.

To find out what the current settings are one can make use of the ccdc.conformer.GeometryAnalyser.Settings.summary() function.

>>> print(engine.settings.summary()) 
Generalisation: True
Impose upper limits: False
Filter rfactor: any
Filter heaviest element: Unknown
Filter solvent: include_solvent
Filter organometallic: all
Filter powder: False
Type: bond
    Analyse: True
    Classification measure: Z-score
    Classification measure threshold: 2.00
    Min. Obs. Generalised: 15
    Min. Obs. Exact: 15
    Min. Relevance: 0.75
    Few hits threshold: 5
Type: angle
    Analyse: True
    Classification measure: Z-score
    Classification measure threshold: 2.00
    Min. Obs. Generalised: 15
    Min. Obs. Exact: 15
    Min. Relevance: 0.75
    Few hits threshold: 5
Type: torsion
    Analyse: True
    Classification measure: Local density
    Classification measure threshold: 5.00 interval = 10.00
    Min. Obs. Generalised: 40
    Min. Obs. Exact: 40
    Min. Relevance: 0.75
    Few hits threshold: 15
Type: ring
    Analyse: True
    Classification measure: Local density
    Classification measure threshold: 5.00 interval = 10.00
    Min. Obs. Generalised: 15
    Min. Obs. Exact: 15
    Min. Relevance: 0.75
    Few hits threshold: 15

It is also possible to find out the value of a particular setting by accessing it directly. For example to check the minimum number observations required for a generalised ring search one would:

>>> engine.settings.ring.min_obs_generalised

It is possible to change any of the settings directly. For example, let us turn off generalisation.

>>> engine.settings.generalisation = False

For more information on the setting options available have a look at the ccdc.conformer.GeometryAnalyser.Settings API documentation.

Performing a geometry analysis on a molecule

Let us read in a PDB ligand, from the structure 1hak, that has had hydrogen atoms added to it and its bond types standardised to CSD conventions.

>>> ligand_1hak = '1hak-lig.mol2'
>>> mol_reader = io.MoleculeReader(ligand_1hak)
>>> mol = mol_reader[0]
>>> mol_reader.close()

After reading a molecule in, it is recommended to assign any unknown bond types, standardise bond types to CSD conventions, and add missing hydrogens.

>>> mol.assign_bond_types(which='unknown')
>>> mol.standardise_aromatic_bonds()
>>> mol.standardise_delocalised_bonds()
>>> mol.add_hydrogens()

We can now analyse the geometry of this molecule.

>>> geometry_analysed_mol = engine.analyse_molecule(mol)

Note that this returns a molecule that has had a geometry analysis applied to it and has had additional attributes added to it (for example lists of angles and torsions). We can now find out how many unusual bonds, angles, torsion and rings this molecule has.

Analysing the results

Let us now check how many unusual bonds, angles, torsions and rings were found.

>>> len([b for b in geometry_analysed_mol.analysed_bonds if b.unusual])
>>> len([a for a in geometry_analysed_mol.analysed_angles if a.unusual])
>>> len([t for t in geometry_analysed_mol.analysed_torsions if t.unusual])
>>> len([r for r in geometry_analysed_mol.analysed_rings if r.unusual])

If you only wanted the unusual features that have enough hits use the ccdc.conformer.GeometryAnalyser.Analysis.enough_hits attribute of a feature.

>>> len([t for t in geometry_analysed_mol.analysed_torsions if t.unusual and t.enough_hits])

If you only wanted the unusual features where there were few hits in the CSD you can use the ccdc.conformer.GeometryAnalyser.Analysis.few_hits attribute.

>>> len([t for t in geometry_analysed_mol.analysed_torsions if t.unusual and t.few_hits])

To find out which torsions are unusual we can loop over them and print out the atom names.

>>> for torsion in geometry_analysed_mol.analysed_torsions:
...     if torsion.unusual and torsion.enough_hits:
...         print(', '.join(label for label in torsion.atom_labels))
O23, C21, C22, C24
C21, C22, C24, N27
C31, C32, C43, C52
C33, C32, C43, C52

To illustrate some of the more advanced functionality exposed let us get the unusual torsion around C32.

>>> tors = geometry_analysed_mol.analysed_torsions
>>> unusual_tors = [t for t in tors if t.unusual and t.enough_hits]
>>> t_c32 = next(t for t in unusual_tors if t.atom_labels[1]=='C32')

To access the underlying histogram data one can use the histogram function.

>>> t_c32.histogram()  
(0, 1, 0, ..., 20, 25, 39)

The individual CSD structures used in the analysis can be retrieved using the hit function.

>>> hit_mols = t_c32.hit_molecules

To write these out to a SDF file one would use a molecule writer.

>>> torsion_identifier = '_'.join(torsion.atom_labels)
>>> torsion_filename = os.path.join(tempdir, torsion_identifier + '.sdf')
>>> writer = io.MoleculeWriter(torsion_filename)
>>> for molecule in hit_mols:
...     writer.write(molecule)
>>> writer.close()

Now let us look at the angles where there are no hits.

>>> no_hits = [(i, a) for i, a in enumerate(geometry_analysed_mol.analysed_angles) if a.no_hits]
>>> print(len(no_hits))
>>> for index, hit in no_hits:
...     print('Index: %d, Label: %s' % (index, hit.fragment_label))
Index: 14, Label: C10_N11_C21
Index: 17, Label: C12_C13_S14

Let us see if we can find some data for them by using generalisation.

>>> engine.settings.generalisation = True

In order to speed up the analysis we will only analyse angles. To do this we turn the bond, torsion and ring analysis off.

>>> = False
>>> engine.settings.torsion.analyse = False
>>> engine.settings.ring.analyse = False

Finally, we do the analysis of the original molecule.

>>> analysed_angles = engine.analyse_molecule(mol)

We now have data for all the angles.

>>> print(len([a for a in analysed_angles.analysed_angles if a.no_hits]))

To check the classification of one of the angles for which we previously did not have any hits we can make use of the index in the no_hits tuple we created earlier.

>>> for index, hit in no_hits:
...     new_hit = analysed_angles.analysed_angles[index]
...     print('Label: %s, Classification: %s' % (new_hit.fragment_label, new_hit.classification))
Label: C10_N11_C21, Classification: Not unusual (enough hits)
Label: C12_C13_S14, Classification: Not unusual (enough hits)

We can determine which angles required generalisation and which did not by inspecting the angles’ generalised attribute:

>>> print(analysed_angles.analysed_angles[6].generalised)
>>> print(analysed_angles.analysed_angles[7].generalised)

We can see the similarity of matched fragments to the query fragment by inspecting the individual hits. Where the hit was an exact match this will give a score of 1.0, and the hit was not from generalisation. This allows one to get a measure of the accuracy of the calculation:

>>> hits = analysed_angles.analysed_angles[6].hits
>>> scores = [h.similarity_score for h in hits]
>>> print('%d exact matches, %d generalised matches, %.2f mean match, %.2f worst match' % (scores.count(1.0), len(scores) - scores.count(1.0), sum(scores)/len(scores), min(scores)))
2 exact matches, 23 generalised matches, 0.94 mean match, 0.90 worst match

It is possible to analyse the geometry of molecules with no 3D information, for example AACFAZ in the CSD:

>>> aacfaz = io.MoleculeReader('csd').molecule('AACFAZ')
>>> = True
>>> engine.settings.angle.analyse = True
>>> engine.settings.torsion.analyse = True
>>> engine.settings.ring.analyse = True
>>> checked = engine.analyse_molecule(aacfaz)

The checked results will have no values, no classifications, hence no z-score or local density information. There will be no ring information. Checked results will have data associated with the distributions, such as mean, standard_deviation, histograms etc.

>>> print(checked.analysed_bonds[0].classification)
(enough hits)
>>> print(checked.analysed_bonds[0].value)
>>> print(round(checked.analysed_bonds[0].mean, 2))
>>> print(checked.analysed_rings)

It is also possible to analyse individual bonds, angles, torsions, rings or fragments of a molecule. For example, if we use AABHTZ in the CSD:

>>> aabhtz = io.MoleculeReader('csd').molecule('AABHTZ')
>>> res_bond = engine.analyse_bond(aabhtz.atom('N1'), aabhtz.atom('N2'))
>>> print(len(res_bond.hits))
>>> res_angle = engine.analyse_angle(*list(map(aabhtz.atom, ('C7', 'N1', 'N2'))))
>>> print(len(res_angle.hits))

Molecular geometry analysis with multiple data libraries

It is now possible to perform a molecular geometry analysis using multiple data libraries simultaneously.

To set up a geometry analysis engine with multiple data libraries the molecular geometry analysis engine should be instantiated with a list of mogul.path file names. The path below points to a small example data library.

>>> small_geometry_library = 'Small_mogul/mogul.path' 

Note that the distinguished name CSD is used for the molecular geometry library included in the CSD Portfolio.

To set up a molecular geometry analysis engine that made use of the CSD as well as the small example library one would use the syntax below.

>>> engine = conformer.GeometryAnalyser(databases=['CSD', small_geometry_library]) 

This molecular geometry analysis engine can be used to analyse molecules as per usual.

>>> csd_molecule_reader = io.MoleculeReader('CSD') 
>>> mol = csd_molecule_reader.molecule('AWEGOY01') 
>>> analysed_mol = engine.analyse_molecule(mol) 

Hits from the analysis can be drawn from either library. Molecules will be accessed from the corresponding crystal database, either the CSD or the database from which the small example library data files were constructed. The source crystal database may be identified from the hits.

>>> hits = analysed_mol.analysed_torsions[0].hits 
>>> set([h.source_name for h in hits]) 
set([u'CSD', u'Aug18_ASER', u'Small_mogul'])
>>> small_geometry_library_hits = [h for h in hits if h.source_name == 'Small_mogul'] 
>>> first_small_geometry_library_hit = small_geometry_library_hits[0] 
>>> print(first_small_geometry_library_hit.molecule.identifier) 


The location of the crystal database used to retrieve the crystal structure for a particular hit is specified in the mogul.path file. The content of the mogul.path file will look something along the lines of:

Mogul Data
Date    : 21 02 2013
Version : Unknown
CSD     :
Name    : NoSourceDatabase

where is the relative path to the database in question. Alternatively, the absolute path of the crystal database can be provided.

It is possible to use a molecular geometry library with no associated database by providing a mogul.path file with no entry for CSD. For example:

Mogul Data
Date    : 21 02 2013
Version : Unknown
CSD     :
Name    : NoSourceDatabase

In this case no molecule can be retrieved and ring analysis is not possible. This feature should be considered experimental.