Molecular geometry analysis¶
Introduction¶
The CSD Portfolio contains a knowledge-base of intramolecular geometries. This knowledge-base provides easy and rapid access to millions of chemically classified bond lengths, valence angles, acyclic torsion angles, and ring conformations derived from the CSD.
This enables you to rapidly validate the complete geometry of a given query structure and identify any unusual features without the need to construct complex search queries, or carry out detailed data analyses.
Let us import in the ccdc.conformer
module.
>>> from ccdc import conformer
Let us also import the ccdc.io
module so that we can read in
molecules of interest.
>>> from ccdc import io
Note
For more information on these knowledge-based geometry libraries please see: “Retrieval of Crystallographically-Derived Molecular Geometry Information”, I. J. Bruno, J. C. Cole, M. Kessler, Jie Luo, W. D. S. Motherwell, L. H. Purkis, B. R. Smith, R. Taylor, R. I. Cooper, S. E. Harris and A. G. Orpen, J. Chem. Inf. Comput. Sci., 44, 2133-2144, 2004 DOI: 10.1021/ci049780b.
The molecular geometry analysis engine¶
In order to be able to carry out a molecular geometry analysis one
requires an instance of the ccdc.conformer.GeometryAnalyser
class. Let us
create one.
>>> engine = conformer.GeometryAnalyser()
Note
The most important function of an instance of the
ccdc.conformer.GeometryAnalyser
class is the
ccdc.conformer.GeometryAnalyser.analyse_molecule()
function, which when
supplied with a molecule returns a geometry analysed molecule.
Geometry analysis settings¶
The settings used in a geometry analysis are stored in the settings
attribute, which is an instance of the ccdc.conformer.GeometryAnalyser.Settings
class.
To find out what the current settings are one can make use of the
ccdc.conformer.GeometryAnalyser.Settings.summary()
function.
>>> print(engine.settings.summary())
Generalisation: True
Impose upper limits: False
Filter rfactor: any
Filter heaviest element: Unknown
Filter solvent: include_solvent
Filter organometallic: all
Filter powder: False
Type: bond
Analyse: True
Classification measure: Z-score
Classification measure threshold: 2.00
Min. Obs. Generalised: 15
Min. Obs. Exact: 15
Min. Relevance: 0.75
Few hits threshold: 5
Type: angle
Analyse: True
Classification measure: Z-score
Classification measure threshold: 2.00
Min. Obs. Generalised: 15
Min. Obs. Exact: 15
Min. Relevance: 0.75
Few hits threshold: 5
Type: torsion
Analyse: True
Classification measure: Local density
Classification measure threshold: 5.00 interval = 10.00
Min. Obs. Generalised: 40
Min. Obs. Exact: 40
Min. Relevance: 0.75
Few hits threshold: 15
Type: ring
Analyse: True
Classification measure: Local density
Classification measure threshold: 5.00 interval = 10.00
Min. Obs. Generalised: 15
Min. Obs. Exact: 15
Min. Relevance: 0.75
Few hits threshold: 15
It is also possible to find out the value of a particular setting by accessing it directly. For example to check the minimum number observations required for a generalised ring search one would:
>>> engine.settings.ring.min_obs_generalised
15
It is possible to change any of the settings directly. For example, let us turn off generalisation.
>>> engine.settings.generalisation = False
For more information on the setting options available have a look at
the ccdc.conformer.GeometryAnalyser.Settings
API documentation.
Performing a geometry analysis on a molecule¶
Let us read in a PDB ligand, from the structure 1hak, that has had hydrogen atoms added to it and its bond types standardised to CSD conventions.
>>> ligand_1hak = '1hak-lig.mol2'
>>> mol_reader = io.MoleculeReader(ligand_1hak)
>>> mol = mol_reader[0]
>>> mol_reader.close()
After reading a molecule in, it is recommended to assign any unknown bond types, standardise bond types to CSD conventions, and add missing hydrogens.
>>> mol.assign_bond_types(which='unknown')
>>> mol.standardise_aromatic_bonds()
>>> mol.standardise_delocalised_bonds()
>>> mol.add_hydrogens()
We can now analyse the geometry of this molecule.
>>> geometry_analysed_mol = engine.analyse_molecule(mol)
Note that this returns a molecule that has had a geometry analysis applied to it and has had additional attributes added to it (for example lists of angles and torsions). We can now find out how many unusual bonds, angles, torsion and rings this molecule has.
Analysing the results¶
Let us now check how many unusual bonds, angles, torsions and rings were found.
>>> len([b for b in geometry_analysed_mol.analysed_bonds if b.unusual])
3
>>> len([a for a in geometry_analysed_mol.analysed_angles if a.unusual])
13
>>> len([t for t in geometry_analysed_mol.analysed_torsions if t.unusual])
6
>>> len([r for r in geometry_analysed_mol.analysed_rings if r.unusual])
1
If you only wanted the unusual features that have enough hits use the
ccdc.conformer.GeometryAnalyser.Analysis.enough_hits
attribute of a feature.
>>> len([t for t in geometry_analysed_mol.analysed_torsions if t.unusual and t.enough_hits])
4
If you only wanted the unusual features where there were few hits in the
CSD you can use the ccdc.conformer.GeometryAnalyser.Analysis.few_hits
attribute.
>>> len([t for t in geometry_analysed_mol.analysed_torsions if t.unusual and t.few_hits])
2
To find out which torsions are unusual we can loop over them and print out the atom names.
>>> for torsion in geometry_analysed_mol.analysed_torsions:
... if torsion.unusual and torsion.enough_hits:
... print(', '.join(label for label in torsion.atom_labels))
...
O23, C21, C22, C24
C21, C22, C24, N27
C31, C32, C43, C52
C33, C32, C43, C52
To illustrate some of the more advanced functionality exposed let us get the unusual torsion around C32.
>>> tors = geometry_analysed_mol.analysed_torsions
>>> unusual_tors = [t for t in tors if t.unusual and t.enough_hits]
>>> t_c32 = next(t for t in unusual_tors if t.atom_labels[1]=='C32')
To access the underlying histogram data one can use the histogram function.
>>> t_c32.histogram()
(0, 1, 0, ..., 20, 25, 39)
The individual CSD structures used in the analysis can be retrieved using the hit function.
>>> hit_mols = t_c32.hit_molecules
To write these out to a SDF file one would use a molecule writer.
>>> torsion_identifier = '_'.join(torsion.atom_labels)
>>> torsion_filename = os.path.join(tempdir, torsion_identifier + '.sdf')
>>> writer = io.MoleculeWriter(torsion_filename)
>>> for molecule in hit_mols:
... writer.write(molecule)
>>> writer.close()
Now let us look at the angles where there are no hits.
>>> no_hits = [(i, a) for i, a in enumerate(geometry_analysed_mol.analysed_angles) if a.no_hits]
>>> print(len(no_hits))
2
>>> for index, hit in no_hits:
... print('Index: %d, Label: %s' % (index, hit.fragment_label))
Index: 14, Label: C10_N11_C21
Index: 17, Label: C12_C13_S14
Let us see if we can find some data for them by using generalisation.
>>> engine.settings.generalisation = True
In order to speed up the analysis we will only analyse angles. To do this we turn the bond, torsion and ring analysis off.
>>> engine.settings.bond.analyse = False
>>> engine.settings.torsion.analyse = False
>>> engine.settings.ring.analyse = False
Finally, we do the analysis of the original molecule.
>>> analysed_angles = engine.analyse_molecule(mol)
We now have data for all the angles.
>>> print(len([a for a in analysed_angles.analysed_angles if a.no_hits]))
0
To check the classification of one of the angles for which we
previously did not have any hits we can make use of the index in the
no_hits
tuple we created earlier.
>>> for index, hit in no_hits:
... new_hit = analysed_angles.analysed_angles[index]
... print('Label: %s, Classification: %s' % (new_hit.fragment_label, new_hit.classification))
Label: C10_N11_C21, Classification: Not unusual (enough hits)
Label: C12_C13_S14, Classification: Not unusual (enough hits)
We can determine which angles required generalisation and which did not by inspecting the angles’ generalised attribute:
>>> print(analysed_angles.analysed_angles[6].generalised)
True
>>> print(analysed_angles.analysed_angles[7].generalised)
False
We can see the similarity of matched fragments to the query fragment by inspecting the individual hits. Where the hit was an exact match this will give a score of 1.0, and the hit was not from generalisation. This allows one to get a measure of the accuracy of the calculation:
>>> hits = analysed_angles.analysed_angles[6].hits
>>> scores = [h.similarity_score for h in hits]
>>> print('%d exact matches, %d generalised matches, %.2f mean match, %.2f worst match' % (scores.count(1.0), len(scores) - scores.count(1.0), sum(scores)/len(scores), min(scores)))
2 exact matches, 23 generalised matches, 0.94 mean match, 0.90 worst match
It is possible to analyse the geometry of molecules with no 3D information, for example AACFAZ in the CSD:
>>> aacfaz = io.MoleculeReader('csd').molecule('AACFAZ')
>>> engine.settings.bond.analyse = True
>>> engine.settings.angle.analyse = True
>>> engine.settings.torsion.analyse = True
>>> engine.settings.ring.analyse = True
>>> checked = engine.analyse_molecule(aacfaz)
The checked results will have no values, no classifications, hence no z-score or local density information. There will be no ring information. Checked results will have data associated with the distributions, such as mean, standard_deviation, histograms etc.
>>> print(checked.analysed_bonds[0].classification)
(enough hits)
>>> print(checked.analysed_bonds[0].value)
None
>>> print(round(checked.analysed_bonds[0].mean, 2))
1.51
>>> print(checked.analysed_rings)
[]
It is also possible to analyse individual bonds, angles, torsions, rings or fragments of a molecule. For example, if we use AABHTZ in the CSD:
>>> aabhtz = io.MoleculeReader('csd').molecule('AABHTZ')
>>> res_bond = engine.analyse_bond(aabhtz.atom('N1'), aabhtz.atom('N2'))
>>> print(len(res_bond.hits))
412
>>> res_angle = engine.analyse_angle(*list(map(aabhtz.atom, ('C7', 'N1', 'N2'))))
>>> print(len(res_angle.hits))
333
Molecular geometry analysis with multiple data libraries¶
It is now possible to perform a molecular geometry analysis using multiple data libraries simultaneously.
To set up a geometry analysis engine with multiple data libraries the
molecular geometry analysis engine should be instantiated with a list of
mogul.path
file names. The path below points to a small example
data library.
>>> small_geometry_library = 'Small_mogul/mogul.path'
Note that the distinguished name CSD
is used for the molecular
geometry library included in the CSD Portfolio.
To set up a molecular geometry analysis engine that made use of the CSD as well as the small example library one would use the syntax below.
>>> engine = conformer.GeometryAnalyser(databases=['CSD', small_geometry_library])
This molecular geometry analysis engine can be used to analyse molecules as per usual.
>>> csd_molecule_reader = io.MoleculeReader('CSD')
>>> mol = csd_molecule_reader.molecule('AWEGOY01')
>>> analysed_mol = engine.analyse_molecule(mol)
Hits from the analysis can be drawn from either library. Molecules will be accessed from the corresponding crystal database, either the CSD or the database from which the small example library data files were constructed. The source crystal database may be identified from the hits.
>>> hits = analysed_mol.analysed_torsions[0].hits
>>> set([h.source_name for h in hits])
set([u'CSD', u'Aug18_ASER', u'Small_mogul'])
>>> small_geometry_library_hits = [h for h in hits if h.source_name == 'Small_mogul']
>>> first_small_geometry_library_hit = small_geometry_library_hits[0]
>>> print(first_small_geometry_library_hit.molecule.identifier)
AWEGOY01
Note
The location of the crystal database used to retrieve the
crystal structure for a particular hit is specified in the
mogul.path
file. The content of the mogul.path
file will look
something along the lines of:
Mogul Data
Date : 21 02 2013
Version : Unknown
CSD : Small.ss
Name : NoSourceDatabase
where Small.ss
is the relative path to the database in question.
Alternatively, the absolute path of the crystal database can be provided.
It is possible to use a molecular geometry library with no associated
database by providing a mogul.path file with no entry for CSD
.
For example:
Mogul Data
Date : 21 02 2013
Version : Unknown
CSD :
Name : NoSourceDatabase
In this case no molecule can be retrieved and ring analysis is not possible. This feature should be considered experimental.