Molecular geometry analysis¶
The CSD Portfolio contains a knowledge-base of intramolecular geometries. This knowledge-base provides easy and rapid access to millions of chemically classified bond lengths, valence angles, acyclic torsion angles, and ring conformations derived from the CSD.
This enables you to rapidly validate the complete geometry of a given query structure and identify any unusual features without the need to construct complex search queries, or carry out detailed data analyses.
Let us import in the
>>> from ccdc import conformer
Let us also import the
ccdc.io module so that we can read in
molecules of interest.
>>> from ccdc import io
For more information on these knowledge-based geometry libraries please see: “Retrieval of Crystallographically-Derived Molecular Geometry Information”, I. J. Bruno, J. C. Cole, M. Kessler, Jie Luo, W. D. S. Motherwell, L. H. Purkis, B. R. Smith, R. Taylor, R. I. Cooper, S. E. Harris and A. G. Orpen, J. Chem. Inf. Comput. Sci., 44, 2133-2144, 2004 DOI: 10.1021/ci049780b.
The molecular geometry analysis engine¶
In order to be able to carry out a molecular geometry analysis one
requires an instance of the
ccdc.conformer.GeometryAnalyser class. Let us
>>> engine = conformer.GeometryAnalyser()
The most important function of an instance of the
ccdc.conformer.GeometryAnalyser class is the
ccdc.conformer.GeometryAnalyser.analyse_molecule() function, which when
supplied with a molecule returns a geometry analysed molecule.
Geometry analysis settings¶
The settings used in a geometry analysis are stored in the settings
attribute, which is an instance of the
To find out what the current settings are one can make use of the
>>> print(engine.settings.summary()) Generalisation: True Impose upper limits: False Filter rfactor: any Filter heaviest element: Unknown Filter solvent: include_solvent Filter organometallic: all Filter powder: False Type: bond Analyse: True Classification measure: Z-score Classification measure threshold: 2.00 Min. Obs. Generalised: 15 Min. Obs. Exact: 15 Min. Relevance: 0.75 Few hits threshold: 5 Type: angle Analyse: True Classification measure: Z-score Classification measure threshold: 2.00 Min. Obs. Generalised: 15 Min. Obs. Exact: 15 Min. Relevance: 0.75 Few hits threshold: 5 Type: torsion Analyse: True Classification measure: Local density Classification measure threshold: 5.00 interval = 10.00 Min. Obs. Generalised: 40 Min. Obs. Exact: 40 Min. Relevance: 0.75 Few hits threshold: 15 Type: ring Analyse: True Classification measure: Local density Classification measure threshold: 5.00 interval = 10.00 Min. Obs. Generalised: 15 Min. Obs. Exact: 15 Min. Relevance: 0.75 Few hits threshold: 15
It is also possible to find out the value of a particular setting by accessing it directly. For example to check the minimum number observations required for a generalised ring search one would:
>>> engine.settings.ring.min_obs_generalised 15
It is possible to change any of the settings directly. For example, let us turn off generalisation.
>>> engine.settings.generalisation = False
For more information on the setting options available have a look at
ccdc.conformer.GeometryAnalyser.Settings API documentation.
Performing a geometry analysis on a molecule¶
Let us read in a PDB ligand, from the structure 1hak, that has had hydrogen atoms added to it and its bond types standardised to CSD conventions.
>>> ligand_1hak = '1hak-lig.mol2'
>>> mol_reader = io.MoleculeReader(ligand_1hak) >>> mol = mol_reader >>> mol_reader.close()
After reading a molecule in, it is recommended to assign any unknown bond types, standardise bond types to CSD conventions, and add missing hydrogens.
>>> mol.assign_bond_types(which='unknown') >>> mol.standardise_aromatic_bonds() >>> mol.standardise_delocalised_bonds() >>> mol.add_hydrogens()
We can now analyse the geometry of this molecule.
>>> geometry_analysed_mol = engine.analyse_molecule(mol)
Note that this returns a molecule that has had a geometry analysis applied to it and has had additional attributes added to it (for example lists of angles and torsions). We can now find out how many unusual bonds, angles, torsion and rings this molecule has.
Analysing the results¶
Let us now check how many unusual bonds, angles, torsions and rings were found.
>>> len([b for b in geometry_analysed_mol.analysed_bonds if b.unusual]) 3 >>> len([a for a in geometry_analysed_mol.analysed_angles if a.unusual]) 12 >>> len([t for t in geometry_analysed_mol.analysed_torsions if t.unusual]) 6 >>> len([r for r in geometry_analysed_mol.analysed_rings if r.unusual]) 1
If you only wanted the unusual features that have enough hits use the
ccdc.conformer.GeometryAnalyser.Analysis.enough_hits attribute of a feature.
>>> len([t for t in geometry_analysed_mol.analysed_torsions if t.unusual and t.enough_hits]) 4
If you only wanted the unusual features where there were few hits in the
CSD you can use the
>>> len([t for t in geometry_analysed_mol.analysed_torsions if t.unusual and t.few_hits]) 2
To find out which torsions are unusual we can loop over them and print out the atom names.
>>> for torsion in geometry_analysed_mol.analysed_torsions: ... if torsion.unusual and torsion.enough_hits: ... print(', '.join(label for label in torsion.atom_labels)) ... O23, C21, C22, C24 C21, C22, C24, N27 C31, C32, C43, C52 C33, C32, C43, C52
To illustrate some of the more advanced functionality exposed let us get the unusual torsion around C32.
>>> tors = geometry_analysed_mol.analysed_torsions >>> unusual_tors = [t for t in tors if t.unusual and t.enough_hits] >>> t_c32 = next(t for t in unusual_tors if t.atom_labels=='C32')
To access the underlying histogram data one can use the histogram function.
>>> t_c32.histogram() (0, 1, 0, ..., 20, 25, 39)
The individual CSD structures used in the analysis can be retrieved using the hit function.
>>> hit_mols = t_c32.hit_molecules
To write these out to a SDF file one would use a molecule writer.
>>> torsion_identifier = '_'.join(torsion.atom_labels) >>> torsion_filename = os.path.join(tempdir, torsion_identifier + '.sdf') >>> writer = io.MoleculeWriter(torsion_filename) >>> for molecule in hit_mols: ... writer.write(molecule) >>> writer.close()
Now let us look at the angles where there are no hits.
>>> no_hits = [(i, a) for i, a in enumerate(geometry_analysed_mol.analysed_angles) if a.no_hits] >>> print(len(no_hits)) 2 >>> for index, hit in no_hits: ... print('Index: %d, Label: %s' % (index, hit.fragment_label)) Index: 14, Label: C10_N11_C21 Index: 17, Label: C12_C13_S14
Let us see if we can find some data for them by using generalisation.
>>> engine.settings.generalisation = True
In order to speed up the analysis we will only analyse angles. To do this we turn the bond, torsion and ring analysis off.
>>> engine.settings.bond.analyse = False >>> engine.settings.torsion.analyse = False >>> engine.settings.ring.analyse = False
Finally, we do the analysis of the original molecule.
>>> analysed_angles = engine.analyse_molecule(mol)
We now have data for all the angles.
>>> print(len([a for a in analysed_angles.analysed_angles if a.no_hits])) 0
To check the classification of one of the angles for which we
previously did not have any hits we can make use of the index in the
no_hits tuple we created earlier.
>>> for index, hit in no_hits: ... new_hit = analysed_angles.analysed_angles[index] ... print('Label: %s, Classification: %s' % (new_hit.fragment_label, new_hit.classification)) Label: C10_N11_C21, Classification: Not unusual (enough hits) Label: C12_C13_S14, Classification: Not unusual (enough hits)
We can determine which angles required generalisation and which did not by inspecting the angles’ generalised attribute:
>>> print(analysed_angles.analysed_angles.generalised) True >>> print(analysed_angles.analysed_angles.generalised) False
We can see the similarity of matched fragments to the query fragment by inspecting the individual hits. Where the hit was an exact match this will give a score of 1.0, and the hit was not from generalisation. This allows one to get a measure of the accuracy of the calculation:
>>> hits = analysed_angles.analysed_angles.hits >>> scores = [h.similarity_score for h in hits] >>> print('%d exact matches, %d generalised matches, %.2f mean match, %.2f worst match' % (scores.count(1.0), len(scores) - scores.count(1.0), sum(scores)/len(scores), min(scores))) 2 exact matches, 23 generalised matches, 0.94 mean match, 0.90 worst match
It is possible to analyse the geometry of molecules with no 3D information, for example AACFAZ in the CSD:
>>> aacfaz = io.MoleculeReader('csd').molecule('AACFAZ') >>> engine.settings.bond.analyse = True >>> engine.settings.angle.analyse = True >>> engine.settings.torsion.analyse = True >>> engine.settings.ring.analyse = True >>> checked = engine.analyse_molecule(aacfaz)
The checked results will have no values, no classifications, hence no z-score or local density information. There will be no ring information. Checked results will have data associated with the distributions, such as mean, standard_deviation, histograms etc.
>>> print(checked.analysed_bonds.classification) (enough hits) >>> print(checked.analysed_bonds.value) None >>> print(round(checked.analysed_bonds.mean, 2)) 1.51 >>> print(checked.analysed_rings) 
It is also possible to analyse individual bonds, angles, torsions, rings or fragments of a molecule. For example, if we use AABHTZ in the CSD:
>>> aabhtz = io.MoleculeReader('csd').molecule('AABHTZ') >>> res_bond = engine.analyse_bond(aabhtz.atom('N1'), aabhtz.atom('N2')) >>> print(len(res_bond.hits)) 398 >>> res_angle = engine.analyse_angle(*list(map(aabhtz.atom, ('C7', 'N1', 'N2')))) >>> print(len(res_angle.hits)) 322
Molecular geometry analysis with multiple data libraries¶
It is now possible to perform a molecular geometry analysis using multiple data libraries simultaneously.
To set up a geometry analysis engine with multiple data libraries the
molecular geometry analysis engine should be instantiated with a list of
mogul.path file names. The path below points to a small example
>>> small_geometry_library = 'Small_mogul/mogul.path'
Note that the distinguished name
CSD is used for the molecular
geometry library included in the CSD Portfolio.
To set up a molecular geometry analysis engine that made use of the CSD as well as the small example library one would use the syntax below.
>>> engine = conformer.GeometryAnalyser(databases=['CSD', small_geometry_library])
This molecular geometry analysis engine can be used to analyse molecules as per usual.
>>> csd_molecule_reader = io.MoleculeReader('CSD') >>> mol = csd_molecule_reader.molecule('AWEGOY01') >>> analysed_mol = engine.analyse_molecule(mol)
Hits from the analysis can be drawn from either library. Molecules will be accessed from the corresponding crystal database, either the CSD or the database from which the small example library data files were constructed. The source crystal database may be identified from the hits.
>>> hits = analysed_mol.analysed_torsions.hits >>> set([h.source_name for h in hits]) set([u'CSD', u'Aug18_ASER', u'Small_mogul']) >>> small_geometry_library_hits = [h for h in hits if h.source_name == 'Small_mogul'] >>> first_small_geometry_library_hit = small_geometry_library_hits >>> print(first_small_geometry_library_hit.molecule.identifier) AWEGOY01
The location of the crystal database used to retrieve the
crystal structure for a particular hit is specified in the
mogul.path file. The content of the
mogul.path file will look
something along the lines of:
Mogul Data Date : 21 02 2013 Version : Unknown CSD : Small.ss Name : NoSourceDatabase where ``Small.ss`` is the relative path to the database in question. It is possible to use a molecular geometry library with no associated database by providing a mogul.path file with no entry for ``CSD``. For example:
Mogul Data Date : 21 02 2013 Version : Unknown CSD : Name : NoSourceDatabase In this case no molecule can be retrieved and ring analysis is not possible. This feature should be considered experimental.