Descriptors

Introduction

The ccdc.descriptors module contains functionality for generating geometric, molecular and crystallographic descriptors.

Let us create an ccdc.io.EntryReader so that we can read in molecules and crystals from the CSD.

>>> from ccdc.io import EntryReader
>>> entry_reader = EntryReader('CSD')

As a preamble let us also set up a variable for a temporary directory.

>>> import tempfile
>>> tempdir = tempfile.mkdtemp()

Molecular geometry

The module ccdc.descriptors.MolecularDescriptors contains a number of simple static methods for inspecting the geometry of a molecule. For example,

>>> from ccdc.descriptors import MolecularDescriptors as MD, GeometricDescriptors as GD
>>> aabhtz = entry_reader.molecule('AABHTZ')
>>> print(round(MD.atom_distance(aabhtz.atom('Cl1'), aabhtz.atom('Cl2')), 2))
5.47
>>> print(round(MD.atom_angle(aabhtz.atom('Cl1'), aabhtz.atom('C6'), aabhtz.atom('Cl2')), 2))
178.25
>>> print(round(MD.atom_torsion_angle(aabhtz.atom('Cl1'), aabhtz.atom('Cl2'), aabhtz.atom('O2'), aabhtz.atom('O1')), 2))
12.45

Note that the atoms do not have to be connected for the geometry to be determined.

One can construct centroids of sets of atoms, vectors from two atoms and RMSD fitted planes. For convenience one can construct centroids and planes directly from a ring:

>>> centroid = MD.atom_centroid(*tuple(a for a in aabhtz.atoms))
>>> r1_centroid = MD.ring_centroid(aabhtz.rings[0])
>>> r2_centroid = MD.ring_centroid(aabhtz.rings[1])
>>> r1_plane = MD.ring_plane(aabhtz.rings[0])
>>> r2_plane = MD.ring_plane(aabhtz.rings[1])

The plane methods construct instances of ccdc.descriptors.GeometricDescriptors.Plane. This object supports a few methods for geometric analysis as well as the normal vector and distance from the origin:

>>> print(round(GD.point_distance(r1_centroid, r2_centroid), 3))
6.385
>>> print(round(r1_plane.plane_angle(r2_plane), 3))
88.576
>>> print(round(r1_plane.point_distance(r1_centroid), 3))
0.0
>>> print(r1_plane.normal)
Vector(0.608, -0.050, 0.792)
>>> print(round(r1_plane.distance, 3))
-1.785

Vectors and planes may be constructed directly from points:

>>> x_axis = GD.Vector(1, 0, 0)
>>> vec = GD.Vector.from_points(r1_centroid, centroid)
>>> plane = GD.Plane.from_points(centroid, r1_centroid, r2_centroid)

With the module ccdc.descriptors.MolecularDescriptors is also possible to calculate the adjacency matrix of a molecule. For instance, the following code allows one to calculate the logarithm of the number of walks of length k that start and end at the same atom.

>>> abeboz = entry_reader.molecule('ABEBOZ')
>>> adjacency_matrix = MD.AdjacencyMatrixDescriptorCalculator(abeboz)
>>> print(round(adjacency_matrix.self_returning_walk_ln(10), 3))
9.838

It is possible to count the number of times a pair of elements appear with a specified minimum path length. For example:

>>> atom_pair_distance = MD.AtomPairDistanceDescriptorCalculator(abeboz)
>>> print(atom_pair_distance.element_pair_count("C","N",5))
2.0

Finally, one can look at the connectivity of a molecule and calculate a connectivity index:

>>> connectivity_index = MD.ConnectivityIndices(abeboz)
>>> print(round(connectivity_index.connectivity_index(4), 3))
5.861

Superimposing molecules

Suppose that we want to superimpose a set of conformers using a substructure of the input molecule rather than all the atoms in it.

This can be achieved using the ccdc.descriptors.MolecularDescriptors.overlay() static method.

Let us generate a set of 25 conformers of Rosuvastatin using the PDB ligand from 1hwl as the starting conformation.

>>> from ccdc.io import MoleculeReader
>>> filepath = '1hwl-lig.mol2'

In this instance we will get the first molecule in the file.

>>> mol_reader = MoleculeReader(filepath)
>>> mol = mol_reader[0]
>>> mol_reader.close()

Now we set up a conformer generator and generate the 25 conformations.

>>> from ccdc.conformer import ConformerGenerator
>>> conformer_generator = ConformerGenerator()  
>>> conformer_generator.settings.max_conformers = 25  
>>> conformers = conformer_generator.generate(mol)  

In this instance we want to superimpose the conformers on the pyrimidine ring of the input molecule. We therefore generate a pyrimidine substructure search.

>>> from ccdc.search import SubstructureSearch, SMARTSSubstructure
>>> searcher = SubstructureSearch()
>>> sub_id = searcher.add_substructure( SMARTSSubstructure('c1ncncc1') )

We can use this to identify the atoms to match in the input molecule.

>>> input_hits = searcher.search(mol)
>>> print(len(input_hits))
1
>>> input_match_atoms = input_hits[0].match_atoms()

Finally, we can use this information to write out a file with conformers superimposed on the pyridine ring of the input structure using the ccdc.descriptors.MolecularDescriptors.overlay().

>>> from ccdc.descriptors import MolecularDescriptors
>>> from ccdc.io import MoleculeWriter
>>> mol_writer = MoleculeWriter(os.path.join(tempdir, 'superpos.mol2'))
>>> for c in conformers:  
...     cmol = c.molecule
...     chits = searcher.search(cmol)
...     match_atoms = chits[0].match_atoms()
...     atom_pairs = zip(input_match_atoms, match_atoms)
...     overlaid_mol = MolecularDescriptors.overlay(mol, cmol, atoms=atom_pairs)
...     mol_writer.write(overlaid_mol)
...

Another function of interest in this context is the ccdc.descriptors.MolecularDescriptors.rmsd().

Maximum Common Substructure

ccdc.descriptors.MolecularDescriptors.MaximumCommonSubstructure may be used to find a maximum common substructure of two ccdc.molecule.Molecule instances. We can create an instance of the appropriate class:

>>> mcss = MolecularDescriptors.MaximumCommonSubstructure()

This instance contains a nested class, ccdc.descriptors.MolecularDescriptors.MaximumCommonSubstructure.Settings allowing the configuration of graph search. The meaning of the properties should be evident from their names. Their default settings are as follows:

>>> print(mcss.settings.ignore_hydrogens)
False
>>> print(mcss.settings.check_bond_count)
False
>>> print(mcss.settings.check_bond_polymeric)
False
>>> print(mcss.settings.check_bond_type)
True
>>> print(mcss.settings.check_charge)
True
>>> print(mcss.settings.check_element)
True
>>> print(mcss.settings.check_hydrogen_count)
False

Let us read in two molecules with a non-trivial common substructure:

>>> csd = MoleculeReader('CSD')
>>> hxacan = csd.molecule('HXACAN')
>>> wexwut = csd.molecule('WEXWUT')

We can now determine the maximum common substructure of these two molecules. With default settings only the phenyl ring is common, so we will relax the conditions of the search:

>>> mcss.settings.check_element = False
>>> mcss.settings.ignore_hydrogens = True
>>> atoms, bonds = mcss.search(hxacan, wexwut)

The atoms returned is a tuple of pairs of atoms matched, the first from HXACAN, the second from WEXWUT. Similarly for matched bonds. We can extract the atoms for either molecule using python’s zip function:

>>> hxacan_atoms = list(zip(*atoms))[0]
>>> wexwut_atoms = list(zip(*atoms))[1]

For example, let us make a diagram, highlighting the common substructure in each molecule. Now we can see that with the relaxed search conditions all of HXACAN is matched with a similar group in WEXWUT:

>>> from ccdc.diagram import DiagramGenerator
>>> generator = DiagramGenerator()
>>> generator.settings.highlight_color = 'cyan'
>>> generator.settings.shrink_symbols = False
>>> hxacan_image = generator.image(hxacan, highlight_atoms=hxacan_atoms)
>>> wexwut_image = generator.image(wexwut, highlight_atoms=wexwut_atoms)
>>> hxacan_image.save('hxacan_mcss.png')  
>>> wexwut_image.save('wexwut_mcss.png')  
HXACAN WEXWUT
../_images/hxacan_mcss.png
../_images/wexwut_mcss.png

There is a class, ccdc/descriptors.MolecularDescriptors.AtomDistanceSearch which allows rapid proximity detection for atoms. This is of most use when dealing with proteins, but can be used for small molecule proximity detection. For example to look for possible close contacts:

>>> searcher = MD.AtomDistanceSearch(aabhtz)
>>> for a in aabhtz.heavy_atoms:
...     contacts = set(searcher.atoms_within_range(a.coordinates, 2.0))
...     contacts = contacts - set(a.neighbours) - set([a])
...     if contacts:
...         print(a, contacts) 
Atom(N1) set([Atom(H4)])
Atom(N5) set([Atom(H6)])
Atom(C10) set([Atom(H9), Atom(H6), Atom(H7)])

See also

For more information on the available molecular descriptors please have a look at the API documentation: ccdc.descriptors.MolecularDescriptors.

Powder patterns

Note

The powder pattern features are available only to CSD-Materials and CSD-Enterprise users.

Simulating powder patterns

To illustrate the use of the ccdc.descriptors.CrystalDescriptors.PowderPattern class let us read in a crystal from the CSD.

>>> crystal = entry_reader.crystal('AFUSEZ')

To create a powder pattern we simply import the ccdc.descriptors.CrystalDescriptors.PowderPattern class and use its ccdc.descriptors.CrystalDescriptors.PowderPattern.from_crystal() static method.

>>> from ccdc.descriptors import CrystalDescriptors
>>> pattern = CrystalDescriptors.PowderPattern.from_crystal(crystal)

The pattern created is an instance of the ccdc.descriptors.CrystalDescriptors.PowderPattern class.

To write the pattern to as a xye or Bruker raw file one can use the functions ccdc.descriptors.CrystalDescriptors.PowderPattern.write_xye_file() and ccdc.descriptors.CrystalDescriptors.PowderPattern.write_raw_file() respectively.

>>> pattern.write_xye_file(os.path.join(tempdir, 'afusez.xye'))

Comparing powder patterns

It is possible to read in a pattern stored in xye format using the ccdc.descriptors.CrystalDescriptors.PowderPattern.from_xye_file() function.

>>> file_pattern = CrystalDescriptors.PowderPattern.from_xye_file(os.path.join(tempdir, 'afusez.xye'))

Let us determine the similarity between the AFUSEZ and AFUSEA crystals’ calculated powder patterns.

>>> afusea_pattern = CrystalDescriptors.PowderPattern.from_crystal(entry_reader.crystal('AFUSEA'))
>>> round( file_pattern.similarity(afusea_pattern), 3 )
0.912

Settings can be used to control the various parameters of the simulation of a powder pattern. For example to use a non-default wavelength and non-standard two theta range:

>>> settings = CrystalDescriptors.PowderPattern.Settings()
>>> settings.wavelength = CrystalDescriptors.PowderPattern.Wavelength(CrystalDescriptors.PowderPattern.Wavelength.Wavelength_FeKa1)
>>> settings.two_theta_minimum = 10.
>>> settings.two_theta_maximum = 70.
>>> pattern = CrystalDescriptors.PowderPattern.from_crystal(crystal, settings)

Statistical descriptors

There is a namespace, ccdc.descriptors.StatisticalDescriptors for statistical measures which we have found useful. There is no intent to make this a complete statistical package such as Rpy but it has been found convenient to have implementations of enrichment metrics when analysing results from virtual screening and docking studies.

Within this namespace is a class, ccdc.descriptors.StatisticalDescriptors.RankStatistics for calculating various rank-based enrichment metrics.

Firstly let us import the appropriate module:

>>> import operator
>>> import collections
>>> from ccdc.descriptors import StatisticalDescriptors

Next we will set up some dummy activity data.

>>> activity_data_file = 'rank_statistics.csv'

We will now read the file and format the appropriate data:

>>> with open(activity_data_file) as f:
...     rows = [l.split(',') for l in f]
>>> scores = [(float(row[0].strip()), bool(int(row[1].strip())), row[2]) for row in rows]

The ccdc.descriptors.StatisticalDescriptors.RankStatistics may now be initialised with these data:

>>> rank_stats = StatisticalDescriptors.RankStatistics(scores, activity_column=operator.itemgetter(1))

The argument, activity_column indicates which column of the scores data should be interpreted as an activity classification. In this instance we have used operator.itemgetter. We can instead use operator.attrgettr to extract activity data from class instances. An artificial example is given below:

>>> ActivityRecord = collections.namedtuple('ActivityRecord', ['score', 'active', 'identifier'])
>>> activity_records = [ActivityRecord(*row) for row in scores]
>>> rank_stats = StatisticalDescriptors.RankStatistics(scores, activity_column=operator.attrgetter('activity'))

For convenience, if the scores data is an indexed collection, as it is in this example, then a simple integer index may be used to specify the column with the activity classification. Similarly, the activity column may be specified as a string field name for activity data in namedtuples.

>>> rank_stats = StatisticalDescriptors.RankStatistics(scores, activity_column=1)

When the scores are simply an iterable of booleans the activity_column need not be specified:

>>> activities = [row[1] for row in scores]
>>> rank_stats = StatisticalDescriptors.RankStatistics(activities)

The methods of the ccdc.descriptors.StatisticalDescriptors.RankStatistics now return the various enrichment metrics:

>>> print(round(rank_stats.AUC(), 3))
0.725
>>> print(round(rank_stats.EF(0.1), 3))
4.468
>>> print(round(rank_stats.ACC(0.1), 3))
0.897
>>> print(round(rank_stats.PPV(0.1), 3))
0.099
>>> print(round(rank_stats.EF(0.5), 3))
1.362
>>> print(round(rank_stats.ACC(0.5), 3))
0.507
>>> print(round(rank_stats.PPV(0.5), 3))
0.03

Principle axes aligned molecules

There is a class, ccdc.descriptors.MolecularDescriptors.PrincipleAxesAlignedBox which allows aligning of a molecule on its principle axes, and deriving a minimal bounding box around it.

Let us take an example, paracetamol, from the CSD and explore its packing shell to see how it grows as hydrogen bonds are expanded.

>>> hxacan_crystal = entry_reader.crystal('HXACAN')
>>> hxacan_mol = hxacan_crystal.molecule

We can now form its minimal bounding box:

>>> box = MolecularDescriptors.PrincipleAxesAlignedBox(hxacan_mol)

The primary, secondary and tertiary axes, and the volume of the box are accessible as properties:

>>> print('%s, %s, %s' % (box.x_vector, box.y_vector, box.z_vector))
Vector(7.448, 5.724, 6.326), Vector(4.769, 3.794, 2.816), Vector(3.210, 1.755, 1.971)
>>> print(round(box.volume, 3))
315.943

The other property of import is the aligned molecule:

>>> aligned_mol = box.aligned_molecule