Utilities

Introduction

The ccdc.utilities module contains a number of general purpose classes.

Logger

The ccdc.utilities.Logger class provides a means for the script writer to control output messages. It is a thin wrapper around Python’s logging module.

>>> from ccdc.utilities import Logger
>>> logger = Logger()

Five categories of message can be emitted: debug, info, warning, error and critical.

>>> logger.debug('Generating another conformer.')

We do not see anything since the logger’s default filter is set at INFO.

>>> logger.info('Conformer generation finished.')
INFO <doctest utilities.rst[7]>:1 Conformer generation finished.

Because this documentation is tested using doctest, the file and line number are not very informative. We can turn them off.

>>> logger.ignore_line_numbers(True)
>>> logger.warning('Sampling limit reached.')
WARNING Sampling limit reached.
>>> logger.error('Invalid input molecule format.')
ERROR Invalid input molecule format.
>>> logger.critical("The conformer generator failed unexpectedly.")
CRITICAL The conformer generator failed unexpectedly.

Messages can be filtered at any level; by default, debug messages are filtered out.

>>> logger.set_log_level(Logger.DEBUG)
>>> logger.debug('And now we see debug messages.') 
DEBUG And now we see debug messages.
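
Since the Logger is described as a thin wrapper around Python’s logging, the filtering behaviour can be illustrated with the standard library alone. The following is a minimal sketch, not the ccdc implementation; the logger name 'utilities_sketch' and the in-memory stream are assumptions made for the example.

```python
import io
import logging

# Collect messages in memory so the result can be inspected afterwards.
stream = io.StringIO()
log = logging.getLogger('utilities_sketch')  # hypothetical logger name
log.propagate = False
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s %(message)s'))
log.addHandler(handler)

log.setLevel(logging.INFO)
log.debug('Generating another conformer.')    # below INFO, so filtered out
log.info('Conformer generation finished.')    # emitted
log.setLevel(logging.DEBUG)
log.debug('And now we see debug messages.')   # emitted after lowering the level

output = stream.getvalue()
print(output, end='')
```

Only the second and third messages reach the stream, mirroring the behaviour shown above.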

The class operates as a singleton, i.e., all instances share the same underlying data.

>>> another_logger = Logger()
>>> another_logger.info('Still no line numbers.')
INFO Still no line numbers.

Any changes made to either instance will be reflected in the other instance.

>>> another_logger.ignore_line_numbers(False)
>>> logger.warning('Logger has line numbers turned back on.')
WARNING <doctest utilities.rst[17]>:1 Logger has line numbers turned back on.
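
One common way to obtain this kind of sharing in Python is to let every instance use the same attribute dictionary. The sketch below illustrates that general pattern only; whether ccdc.utilities.Logger is implemented this way is not stated here, and the class and attribute names are hypothetical.

```python
class SharedStateLogger:
    """Hypothetical illustration of instances sharing state."""

    # One dictionary shared by every instance of the class.
    _shared = {'level': 'INFO', 'show_line_numbers': True}

    def __init__(self):
        # Each new instance adopts the shared dictionary as its attributes.
        self.__dict__ = self._shared


first = SharedStateLogger()
second = SharedStateLogger()
second.show_line_numbers = False   # change made through one instance...
print(first.show_line_numbers)     # ...is visible through the other
```

The two objects are distinct, but a setting changed on one is seen by the other.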

Log messages may be redirected to a file, rather than the default stdout.

>>> file_name = 'logfile.log'
>>> another_logger.set_output_file(file_name)
>>> logger.info('This message will go to logfile.log')

CCDC code can generate warnings, and these will be sent to the logger’s destination, whether that is the default stream or a file.

CCDC code can also issue informative messages. These are enabled by:

>>> from ccdc.io import EntryReader
>>> entry_reader = EntryReader('csd')
>>> logger.set_ccdc_log_level(6)
>>> logger.set_ccdc_minimum_log_level(3)
>>> e = entry_reader[0]

When run outside doctest, this will produce the following on the logger’s output stream:

DEBUG 3: AserDatabase::entry extracted: AABHTZ

Be warned: setting the ccdc_minimum_log_level to 1 will produce a lot of output.

There is one final method of the logger, ccdc.utilities.Logger.fatal(), which will emit a critical error message and exit.

There is a tidy way to redirect messages to a file using a context manager:

>>> import os, tempfile
>>> from ccdc.utilities import FileLogger
>>> tempdir = tempfile.mkdtemp()
>>> with FileLogger(os.path.join(tempdir, 'redirected.log')) as logger:
...     logger.warning('This message will be in redirected.log')

On entry to the block the logger will be set to redirect to a file; on exit the logger will be reset to the default.
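
The enter and exit behaviour described here can be sketched with the standard library. This is an illustration of the context manager pattern under stated assumptions, not the FileLogger implementation; the helper redirect_to_file and the logger name are hypothetical.

```python
import contextlib
import logging
import os
import tempfile


@contextlib.contextmanager
def redirect_to_file(log, path):
    """Hypothetical helper: send records from log to path inside the block."""
    saved = log.handlers[:]                  # remember the current destinations
    file_handler = logging.FileHandler(path)
    file_handler.setFormatter(logging.Formatter('%(levelname)s %(message)s'))
    log.handlers = [file_handler]            # on entry: redirect to the file
    try:
        yield log
    finally:
        file_handler.close()
        log.handlers = saved                 # on exit: restore the previous state


log = logging.getLogger('redirect_sketch')   # hypothetical logger name
log.propagate = False
log.setLevel(logging.INFO)
path = os.path.join(tempfile.mkdtemp(), 'redirected.log')
with redirect_to_file(log, path) as redirected:
    redirected.warning('This message will be in redirected.log')

with open(path) as f:
    contents = f.read()
print(contents, end='')
```

The try/finally ensures the previous handlers are restored even if the block raises an exception.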

Histograms

The ccdc.utilities.Histogram class may be used to construct histograms of data points. We can create one by specifying the starting value, the ending value and the width of a bin:

>>> from ccdc.utilities import Histogram
>>> h = Histogram(-1.0, 1.0, 0.1)

It may be populated by adding values:

>>> h.add_values([-1.0 + 0.01*i for i in range(200)])
>>> print(h.frequencies) 
(10, 10, 10, ...)

The histogram copes gracefully with out of range values:

>>> h.add_value(-2.0)
>>> h.add_value(3.2)
>>> print('Underflow: %d, overflow: %d' % (h.nunderflow, h.noverflow))
Underflow: 1, overflow: 1
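
To make the bookkeeping concrete, here is a minimal sketch of fixed-width binning with underflow and overflow counters. It is not the ccdc.utilities.Histogram implementation: the class BinCounter is hypothetical, and treating a value equal to the end value as overflow is an assumption of the sketch.

```python
class BinCounter:
    """Hypothetical stand-in illustrating fixed-width binning."""

    def __init__(self, start, end, width):
        self.start, self.end, self.width = start, end, width
        self.counts = [0] * int(round((end - start) / width))
        self.underflow = 0
        self.overflow = 0

    def add(self, value):
        if value < self.start:
            self.underflow += 1
        elif value >= self.end:
            # Assumption: the end value itself counts as overflow.
            self.overflow += 1
        else:
            index = int((value - self.start) / self.width)
            # Guard against rounding placing a value just below end past the last bin.
            self.counts[min(index, len(self.counts) - 1)] += 1


counter = BinCounter(0.0, 1.0, 0.25)
for v in [0.1, 0.3, 0.3, 0.9, -2.0, 3.2]:
    counter.add(v)
print(counter.counts, counter.underflow, counter.overflow)
```

Out of range values never raise an error; they are simply tallied separately from the bins.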

The histogram gives access to the usual properties:

>>> print('Start: %.1f, End: %.1f, NBins: %d, Bin width: %.1f' % (h.start_value, h.end_value, h.nbins, h.bin_width))
Start: -1.0, End: 1.0, NBins: 20, Bin width: 0.1

There is also a comparison method, giving a degree of dissimilarity between two histograms:

>>> h2 = Histogram(-1.5, 1.5, 0.1)
>>> h2.add_values([-1.5 + 0.1*i for i in range(300)])
>>> print(round(h.compare(h2), 2))
46.2

Grids

The ccdc.utilities.Grid class allows the creation, manipulation, reading and writing of orthonormal grid data.

Firstly we need to import the appropriate class:

>>> from ccdc.utilities import Grid

The Grid constructor takes the coordinates of the origin of the grid and the far corner of the grid. Optionally it may take a grid spacing, defaulting to 0.2, and a default value to place in the grid, defaulting to 0.0. For example we can construct a grid representing the hydrophobic field of a molecule, and score other molecules against it. This approach might be used to derive additional scoring terms for the analysis of docking solutions.

>>> from ccdc import io
>>> csd = io.EntryReader('csd')
>>> mol = csd.molecule('ABABEL')
>>> cog = mol.centre_of_geometry()
>>> mol.translate((-cog[0], -cog[1], -cog[2]))
>>> xmin = min(a.coordinates.x for a in mol.atoms)
>>> xmax = max(a.coordinates.x for a in mol.atoms)
>>> ymin = min(a.coordinates.y for a in mol.atoms)
>>> ymax = max(a.coordinates.y for a in mol.atoms)
>>> zmin = min(a.coordinates.z for a in mol.atoms)
>>> zmax = max(a.coordinates.z for a in mol.atoms)
>>> grid = Grid(origin=(xmin-2., ymin-2., zmin-2.), far_corner=(xmax+2., ymax+2., zmax+2.))

This has now constructed an empty grid whose dimensions enclose the molecule with a margin of two Angstroms. We can populate it from the atoms of the molecule:

>>> def ring_centroid(r):
...     x = sum(a.coordinates.x for a in r.atoms)
...     y = sum(a.coordinates.y for a in r.atoms)
...     z = sum(a.coordinates.z for a in r.atoms)
...     return (x/len(r.atoms), y/len(r.atoms), z/len(r.atoms))
>>> for r in mol.rings:
...     if r.is_aromatic:
...         grid.set_sphere(ring_centroid(r), 2.5, len(r.atoms))

The grid has now been populated with spheres of radius 2.5 around the centroid of the aromatic rings of ABABEL. We can now score molecules against this to get a measure of lipophilic overlap of the two molecules. This score will be a dictionary keyed by the atoms of the probe molecule. In this instance we are interested only in the scores of the aromatic atoms, but in other cases we might wish to penalise polar atoms in positions with high lipophilic scores.

>>> other_mol = csd.molecule('ABABAI')
>>> other_cog = other_mol.centre_of_geometry()
>>> other_mol.translate((-other_cog[0], -other_cog[1], -other_cog[2]))
>>> d = grid.score_molecule(other_mol)
>>> lipophilic_score = sum(d[a] for r in other_mol.rings for a in r.atoms if r.is_aromatic)
>>> print(round(lipophilic_score, 2))
2.63
>>> polar_score = sum(d[a] for a in other_mol.atoms if a.is_donor or a.is_acceptor)
>>> print(round(polar_score, 2))
0.72

The Grid class supports various methods to inspect the values in the grid: the extrema, the number of steps in each dimension, the number of non-zero values in the grid, the value at a point (the linear interpolation of the nearby grid values), and the value at given indices:

>>> low, high = grid.extrema
>>> print('(%.1f, %.1f)' % (round(low, 2), round(high, 2)))
(0.0, 5.8)
>>> print(grid.nsteps)
(72, 53, 37)
>>> print(grid.count_grid())
15938
>>> v = grid.value_at_point(mol.atom('C10').coordinates)
>>> print(round(v, 2))
2.64
>>> v = grid.value(20, 25, 10)
>>> print(round(v, 2))
0.66
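
For readers unfamiliar with interpolation on a regular grid, the idea behind a value at an arbitrary point can be sketched as trilinear interpolation over the eight surrounding grid points. This is an assumption about what "linear interpolation of the nearby grid values" means in practice, not the Grid implementation; the function trilinear and its nested-list input are hypothetical.

```python
import math


def trilinear(values, origin, spacing, point):
    """Interpolate values[i][j][k] at a point strictly inside the grid."""
    indices, fractions = [], []
    for p, o in zip(point, origin):
        t = (p - o) / spacing          # position in units of grid steps
        i = int(math.floor(t))
        indices.append(i)
        fractions.append(t - i)        # fractional distance to the next grid point
    i, j, k = indices
    fx, fy, fz = fractions
    total = 0.0
    # Weight each of the eight surrounding corners by its closeness to the point.
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                weight = ((fx if di else 1.0 - fx) *
                          (fy if dj else 1.0 - fy) *
                          (fz if dk else 1.0 - fz))
                total += weight * values[i + di][j + dj][k + dk]
    return total


# A 2x2x2 grid whose value equals the x index, with spacing 0.5.
values = [[[float(i) for _ in range(2)] for _ in range(2)] for i in range(2)]
midpoint_value = trilinear(values, (0.0, 0.0, 0.0), 0.5, (0.25, 0.1, 0.3))
print(round(midpoint_value, 2))
```

The point lies halfway between the two x planes, so the interpolated value is halfway between 0.0 and 1.0.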

One can iterate through the values of the grid using the nsteps property to provide the ranges over which to iterate.
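
Such a loop might look like the sketch below. To keep it runnable without a database, nsteps and value here are stand-ins for grid.nsteps and grid.value; with a real grid one would use those in their place.

```python
import itertools

nsteps = (3, 2, 2)   # stand-in for grid.nsteps


def value(i, j, k):
    # Stand-in for grid.value(i, j, k).
    return float(i + j + k)


# Visit every index triple once and count the non-zero values.
nonzero = sum(
    1
    for i, j, k in itertools.product(*(range(n) for n in nsteps))
    if value(i, j, k) != 0.0
)
print(nonzero)
```

itertools.product avoids three nested for loops while visiting the same index triples.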

There is a method to copy a grid, so we will do this prior to performing some operations on the grid:

>>> copy = grid.copy()

There is a fairly complete set of arithmetic and logical operations on grids. These accept either grid or scalar arguments, and either assign in place or return a fresh grid. They can be used with the masked_set method of a grid to make grids picking out high or low values of an original grid. For example, to construct a grid which has values only where the original grid has values above some threshold:

>>> g2 = grid > (high - low) / 2.
>>> g2 = -g2
>>> g3 = grid.masked_set(g2, 0.0)
>>> print(g3.count_grid())
2231

There are methods to perform some smoothing operations on the grid: dilate, where any point with a non-zero neighbour will be set to 1.0; contract, where any point with a zero neighbour will be set to 0.0; mean_value_of_neighbours, where grid points are replaced with the average of neighbouring points’ values; and max_value_of_neighbours and min_value_of_neighbours, similarly.
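
The dilate and contract rules can be illustrated on a small two-dimensional list. This is a sketch of the rules as stated, not the Grid implementation: it assumes that neighbours are the edge-adjacent points only and that points outside the grid are ignored, and the function names are hypothetical.

```python
def neighbours(grid, i, j):
    """Yield values of the edge-adjacent points that lie inside the grid."""
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < len(grid) and 0 <= nj < len(grid[0]):
            yield grid[ni][nj]


def dilate(grid):
    # Any point with a non-zero neighbour becomes 1.0; others are unchanged.
    return [[1.0 if any(v != 0.0 for v in neighbours(grid, i, j)) else grid[i][j]
             for j in range(len(grid[0]))] for i in range(len(grid))]


def contract(grid):
    # Any point with a zero neighbour becomes 0.0; others are unchanged.
    return [[0.0 if any(v == 0.0 for v in neighbours(grid, i, j)) else grid[i][j]
             for j in range(len(grid[0]))] for i in range(len(grid))]


seed = [[0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0]]
grown = dilate(seed)
shrunk = contract(grown)
print(grown)
print(shrunk)
```

In this small case contracting the dilated grid recovers the original single point.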

There is a method to flood fill a grid. This will take a grid to fill with values, a value to set, a threshold above which the value will be set, and initial starting indices. This will return a sub-grid containing the flood-filled region. A periodic flag can be passed as a bool for each axis to give the flood fill a periodic boundary condition.
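
The general flood fill technique can be sketched as a breadth-first search from the starting indices. The sketch below works on a two-dimensional list, uses edge-adjacent neighbours, has no periodic boundaries and returns a full-size grid rather than a sub-grid; all of these are simplifying assumptions, and the function flood_fill is hypothetical rather than the Grid method.

```python
from collections import deque


def flood_fill(source, start, threshold, fill_value):
    """Mark every point above threshold that is connected to start."""
    rows, cols = len(source), len(source[0])
    filled = [[0.0] * cols for _ in range(rows)]
    queue = deque([start])
    seen = {start}
    while queue:
        i, j = queue.popleft()
        if source[i][j] <= threshold:
            continue                      # the fill does not pass through this point
        filled[i][j] = fill_value
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols and (ni, nj) not in seen:
                seen.add((ni, nj))
                queue.append((ni, nj))
    return filled


source = [[5.0, 5.0, 0.0, 5.0],
          [0.0, 5.0, 0.0, 5.0],
          [0.0, 0.0, 0.0, 5.0]]
region = flood_fill(source, (0, 0), 1.0, 1.0)
print(region)
```

The right-hand column is above the threshold but is never reached, because it is separated from the starting point by low values.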

There is a method to extract from a grid the set of islands containing values at or above a threshold. This will return a tuple of subgrids of the original. This may be useful when looking for high-scoring regions of a grid. For example, with the grid constructed earlier one can extract the regions of maximal lipophilicity:

>>> islands = g3.islands(5.0)
>>> print(len(islands))
2
>>> first = islands[0]
>>> print(first.nsteps)
(4, 4, 5)
>>> print(first.bounding_box)
(Coordinates(x=-3.396, y=-1.615, z=-0.456), Coordinates(x=-2.794, y=-1.007, z=0.347))
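
Conceptually, islands are connected regions of points at or above the threshold. A minimal sketch of finding them on a two-dimensional list follows; it assumes edge-adjacent connectivity and returns lists of indices rather than sub-grids, and the function find_islands is hypothetical, not the Grid method.

```python
def find_islands(source, threshold):
    """Return the connected regions whose values are at or above threshold."""
    rows, cols = len(source), len(source[0])
    seen = set()
    regions = []
    for i in range(rows):
        for j in range(cols):
            if (i, j) in seen or source[i][j] < threshold:
                continue
            # Start a new region and collect everything connected to it.
            stack = [(i, j)]
            seen.add((i, j))
            cells = []
            while stack:
                ci, cj = stack.pop()
                cells.append((ci, cj))
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = ci + di, cj + dj
                    if (0 <= ni < rows and 0 <= nj < cols
                            and (ni, nj) not in seen
                            and source[ni][nj] >= threshold):
                        seen.add((ni, nj))
                        stack.append((ni, nj))
            regions.append(sorted(cells))
    return regions


source = [[5.0, 5.0, 0.0, 5.0],
          [0.0, 5.0, 0.0, 5.0],
          [0.0, 0.0, 0.0, 5.0]]
regions = find_islands(source, 5.0)
print(len(regions))
```

Each region could then be reduced to its bounding box, which is the role the sub-grids play above.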

One can construct a grid from a set of smaller grids, as long as these grids have the same spacing. This super-grid will be the smallest grid containing all the argument grids. For example, we can make the super-grid containing the islands of maximal lipophilicity, with a margin of 2 Angstroms:

>>> super_grid = Grid.super_grid(2.0, *islands)
>>> print(super_grid.bounding_box)
(Coordinates(x=-5.396, y=-3.615, z=-2.456), Coordinates(x=5.025, y=4.028, z=2.548))
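
The bounding box of such a super-grid can be thought of as the union of the argument bounding boxes, widened by the margin on every side. The sketch below illustrates that arithmetic only, with made-up boxes; the function super_bounding_box is hypothetical, and any snapping of the corners to grid points is ignored.

```python
def super_bounding_box(margin, *boxes):
    """Smallest box containing all boxes, expanded by margin on each side.

    Each box is a pair (low_corner, high_corner) of (x, y, z) tuples.
    """
    low = tuple(min(box[0][axis] for box in boxes) - margin for axis in range(3))
    high = tuple(max(box[1][axis] for box in boxes) + margin for axis in range(3))
    return low, high


box_a = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))     # illustrative values only
box_b = ((2.0, -1.0, 0.5), (3.0, 0.0, 2.0))
low, high = super_bounding_box(2.0, box_a, box_b)
print(low, high)
```

The same reasoning is consistent with the output above, where the lower corner of the super-grid is the lower corner of the first island shifted by 2.0 along each axis.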

Grids may be written to and read from three formats: ACNT, CCP4 and GRD. Converting between file formats is a matter of reading from one file and writing to another. The format will be deduced from the suffix of the file, or it may be explicitly specified via a format parameter:

>>> grd_file = 'grid.grd'
>>> grid.write(grd_file)
>>> copy = Grid.from_file(grd_file)
>>> ccp4_file = 'grid.ccp4'
>>> copy.write(ccp4_file)

Licence

The ccdc.utilities.Licence class provides information about the current licence for the CSD Software and its modules.

>>> from ccdc.utilities import Licence

Licence information can be obtained and examined as a string:

>>> licence = Licence()
>>> licence  
Licensing
195 days remaining
Modules: CSD-Core, CSD-Materials, CSD-Discovery

The licence properties may be examined individually:

>>> licence.modules  
['CSD-Core', 'CSD-Materials', 'CSD-Discovery']
>>> licence.days_remaining  
195