CSD Landscape Generator

The use of module ccdc.csp.csd_landscape_generator is documented here.

The crystal structure prediction method implemented is the one described in "Generation of crystal structures using known crystal structures as analogues". The method uses observed crystal structures from a crystal structure database as analogues for the generated crystal structures of the query molecule. The analogue structures are discovered and ranked by shape and chemical similarity. New predicted crystal structures are generated by overlaying the query molecule onto the observed molecules in the analogue crystal structures, then optimising the resulting structure.

CSD Landscape Generator API Usage

The CSD Landscape Generator API is used to create crystal structure predictions of a query molecule.

Let’s start with a conformer of the molecule Adrenaline from the CSD.

>>> from ccdc import io
>>> csd = io.MoleculeReader('csd')
>>> mol = csd.molecule('ADRENL')

The class ccdc.csp.csd_landscape_generator.CSDLandscapeGenerator contains the settings and functionality needed to generate crystal structure landscapes of a query molecule. Below we create an instance of CSDLandscapeGenerator and set it to generate 3 crystal structures. Note that the output in this documentation was generated with settings fixed to use a small set of template crystals. Real use can take advantage of many more template crystals in the CSD or other databases.

>>> from ccdc.csp.csd_landscape_generator import CSDLandscapeGenerator
>>> landscape_generator = CSDLandscapeGenerator()
>>> landscape_generator.settings.nstructures=3

The landscape generator produces crystal structure predictions from its generate method, which is a Python generator function that yields crystal structures until the requested number of structures has been generated. The predicted crystal structures of the input molecule can be collected into a list in one line as follows:

>>> landscape = list(landscape_generator.generate(mol))

The resulting list contains CSD Python API Entry objects which have been extended with scoring information about the prediction results.

>>> for prediction in landscape:
...     print(f"Prediction {prediction.identifier} is in spacegroup {prediction.crystal.spacegroup_symbol}")
Prediction Predicted_ADRENL_on_HXACAN44_0_0-0 is in spacegroup Pbca
Prediction Predicted_ADRENL_on_MORVAR_0_0-0 is in spacegroup P21/n
Prediction Predicted_ADRENL_on_GUKWER_0_0-0 is in spacegroup P-1
>>> min_score = min(landscape, key=lambda prediction: prediction.score)
>>> print(f"{min_score.identifier} has the lowest score") 
Predicted_ADRENL_on_... has the lowest score

The prediction score is approximately equivalent to crystal lattice energy in kJ/mol. When a prediction run is complete, the predictions also have a relative_score property, which is their score relative to the lowest (best) scoring prediction in the landscape.

>>> print("Number of structures with the lowest score:", sum(1 for prediction in landscape if prediction.relative_score == 0))
Number of structures with the lowest score: 1

Generated crystal structures are written to a working directory, which also contains the input molecule as a mol2 file, a .csv file of results information, a .gcd file of template crystal structure identifiers, and log files. The working_directory setting can be set to a target output directory if required.

>>> print(f"Output is in {landscape_generator.settings.working_directory}") 
Output is in ...
>>> import os
>>> print(f"Generated files are {', '.join(sorted(os.listdir(landscape_generator.settings.working_directory)))}")
Generated files are ADRENL.mol2, Predicted_ADRENL.csv, Predicted_ADRENL.gcd, Predicted_ADRENL_on_GUKWER_0_0-0.cif, Predicted_ADRENL_on_HXACAN44_0_0-0.cif, Predicted_ADRENL_on_MORVAR_0_0-0.cif, ...
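When a run produces many structures, it can help to group the contents of the working directory by file type. The following is a minimal sketch using only the standard library. The helper group_by_suffix is a hypothetical name introduced here, and the temporary directory with empty files merely imitates some of the file names listed above so that the example runs without a prediction.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_suffix(directory):
    """Map each file extension to the sorted file names that carry it."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            groups[path.suffix].append(path.name)
    return dict(groups)


# Build a throwaway directory with empty files named as in the listing above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("ADRENL.mol2", "Predicted_ADRENL.csv", "Predicted_ADRENL.gcd",
                 "Predicted_ADRENL_on_GUKWER_0_0-0.cif",
                 "Predicted_ADRENL_on_MORVAR_0_0-0.cif"):
        (Path(tmp) / name).touch()
    files_by_type = group_by_suffix(tmp)

print(files_by_type[".cif"])
```

In real use the argument would be landscape_generator.settings.working_directory rather than a temporary directory.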

Further control over landscape generation

There are further settings and ways of using the API that give more control over the results.

Generation of crystal structure landscapes for enantiopure substances should be limited to the Sohncke space groups. This is controlled by the sohncke_only setting, which is False by default.

>>> print(landscape_generator.settings.sohncke_only)
False
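To see why this setting matters, note that the three space groups predicted earlier (Pbca, P21/n and P-1) all contain improper symmetry operations and so are not Sohncke groups. The sketch below is a rough heuristic for spotting this from a compact Hermann-Mauguin symbol such as those printed by spacegroup_symbol above. The function looks_sohncke is a hypothetical helper, not part of the CSD Python API, and it assumes symbols without setting suffixes or other decorations.

```python
def looks_sohncke(symbol):
    """Heuristic: True if a Hermann-Mauguin symbol has no improper operations.

    After the leading lattice letter, a Sohncke space group symbol contains
    only rotation and screw axes, so it consists of digits alone. Mirror
    planes (m), glide planes (a, b, c, d, e, n), rotoinversions (-) and the
    "/" of a perpendicular mirror or glide all indicate a non-Sohncke group.
    """
    body = symbol.replace(" ", "")[1:]  # drop spaces and the lattice letter
    return body.isdigit()


for symbol in ("Pbca", "P21/n", "P-1", "P212121", "P21"):
    print(symbol, looks_sohncke(symbol))
```

A check like this could be used to filter an existing landscape after the fact, although setting sohncke_only before generation avoids spending time on structures that would be discarded.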

Generation of crystal structure landscapes with multiple components is possible by supplying an input molecule with multiple components. For example, CSD entry HUMJEE gives us a multi-component molecule containing one molecule of paracetamol and one water molecule. The landscape generated from this input will contain both molecules in the asymmetric unit of every generated crystal.

>>> paracetamol_hydrate = csd.molecule('HUMJEE')
>>> landscape = list(landscape_generator.generate(paracetamol_hydrate))
>>> print([len(m.atoms) for m in landscape[0].molecule.components])
[3, 20]

Generated crystal structures are written as CIF by default. These CIF files contain all the information necessary to be imported into a CSP landscape database (see ccdc.csp.database and ccdc.csp.prediction). The generated crystal structures can also be written as mol2 files.

>>> print(landscape_generator.settings.format)
>>> landscape_generator.settings.format = 'mol2'
>>> print(landscape_generator.settings.format)

Landscapes are generated using template crystal structures from the CSD and an associated shape database. Alternative crystal structure and shape databases can be used by setting the database_file and shape_database_location settings. If doing this, the shape database used must have been created from the crystal structure database.

>>> print(f"Using crystal structure database {landscape_generator.settings.database_file}")
Using crystal structure database ...
>>> print(f"With shape database {landscape_generator.settings.shape_database_location}") 
With shape database ...

Crystal structure landscape generation is a resource-intensive activity, likely to use a large amount of CPU power and RAM. By default the CSD Landscape Generator will try to use nearly all of the CPU power available, which also affects how much RAM is used. The nthreads setting can be used to reduce or increase the resources used.

>>> landscape_generator.settings.nthreads = 1
>>> print(f"Number of threads used {landscape_generator.settings.nthreads}")
Number of threads used 1
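On a shared machine it may be preferable to derive the thread count from the number of cores rather than hard-coding it. The following is a minimal sketch using only the standard library. The helper pick_thread_count and its reserve parameter are hypothetical names introduced here, and the choice of how many cores to leave free is an assumption to adapt to the machine in question.

```python
import os


def pick_thread_count(reserve=1):
    """Return a thread count that leaves `reserve` cores free, never below 1."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, available - reserve)


nthreads = pick_thread_count(reserve=2)
print(f"Would request {nthreads} threads")
# With the generator created earlier, the value would be applied with:
#     landscape_generator.settings.nthreads = nthreads
```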

Landscape generation can take a long time, particularly for large numbers of structures and large molecules. We have already seen how the whole generated landscape can be retrieved in one line. However, it may be more useful to report ongoing progress by iterating over the results in a loop as they are generated.

>>> landscape = []
>>> for prediction in landscape_generator.generate(mol):
...     print(prediction.identifier)
...     landscape.append(prediction)
>>> print(f"Landscape generation complete, {len(landscape)} structures generated")
Landscape generation complete, 3 structures generated

Reporting progress like this allows the user to see what is happening. If generation is interrupted, for example with a CTRL-C in a terminal, the processing won’t stop immediately. Instead, the currently active generation will complete over the next few seconds and the relative scores will be updated, and then finally the process will stop.
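The progress loop can be wrapped in a small reusable helper that also reports elapsed time. The sketch below works on any iterable, so it is demonstrated with a hypothetical stand-in generator, fake_generate, instead of a real prediction run. The helper name collect_with_progress and its parameters are assumptions made for illustration and are not part of the CSD Python API.

```python
import time


def collect_with_progress(predictions, total=None, label=str, report=print):
    """Consume an iterable of predictions, reporting each one as it arrives."""
    collected = []
    start = time.perf_counter()
    for count, prediction in enumerate(predictions, start=1):
        elapsed = time.perf_counter() - start
        suffix = f"/{total}" if total is not None else ""
        report(f"[{count}{suffix}] {label(prediction)} after {elapsed:.1f} s")
        collected.append(prediction)
    return collected


def fake_generate(n):
    """Hypothetical stand-in for a slow generator; yields placeholder names."""
    for i in range(n):
        time.sleep(0.01)  # pretend each structure takes time to produce
        yield f"structure_{i}"


messages = []
results = collect_with_progress(fake_generate(3), total=3, report=messages.append)
print("\n".join(messages))
```

With a real run, the first argument would be landscape_generator.generate(mol), total could be taken from landscape_generator.settings.nstructures, and label could be lambda p: p.identifier so that each line shows the prediction identifier.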

Shape Database Creation

The crystal structure prediction method uses a shape database to find similarly shaped molecules in a crystal structure database. The create_shape_database.py script can be used to create a shape database from a crystal structure database. The help text for create_shape_database.py can be shown as follows:

$ python ccdc/utilities/csp/create_shape_database/create_shape_database.py -h
usage: create_shape_database.py [-h] structure_database output_database

Create a shape database from a crystal structure database.

For example:

python create_shape_database.py my_structures.csdsqlx my_structures_shapes.sqlite

positional arguments:
  structure_database  Input crystal structure database path e.g.
  output_database     Output shape database path e.g.

optional arguments:
  -h, --help          show this help message and exit

The following example creates a shape database from a crystal structure database. The input crystal structure database can be any type recognised by CCDC software, including .csdsql, .csdsqlx and .sqlite. While the script is running it reports progress information every 1000 structures, and reports an issue whenever it cannot generate the shape information for a structure. These failed structures are usually unimportant for crystal structure prediction and can be ignored.

$ python ccdc/utilities/csp/create_shape_database/create_shape_database.py my_structures.csdsqlx my_shapes.sqlite
Reading structure database my_structures.csdsqlx
Exception (OQABAM01) Too many steps in ChemicalGraphSearch: search abandoned
Exception (OQACIV02) Too many steps in ChemicalGraphSearch: search abandoned
Exception (OQACIV03) Too many steps in ChemicalGraphSearch: search abandoned
Exception (OQACIV04) Too many steps in ChemicalGraphSearch: search abandoned
Creating shape database my_shapes.sqlite
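The script can also be launched from Python, for example as one step in a larger workflow. The sketch below only assembles the command line shown above, with the two positional arguments taken from the help text, and defines (but does not call) a function that would run it. The helper names build_command and create_shape_database are hypothetical, the script path is the relative path used in the commands above and may need adjusting to the local installation, and the assumption that the script signals failure through a non-zero exit status has not been verified.

```python
import subprocess
import sys
from pathlib import Path

# Script location as given in the commands above, relative to the installation.
SCRIPT = Path("ccdc/utilities/csp/create_shape_database/create_shape_database.py")


def build_command(structure_database, output_database, script=SCRIPT):
    """Return the argument list: interpreter, script, input path, output path."""
    return [sys.executable, str(script),
            str(structure_database), str(output_database)]


def create_shape_database(structure_database, output_database, script=SCRIPT):
    """Run the script; raises CalledProcessError on a non-zero exit status."""
    command = build_command(structure_database, output_database, script)
    return subprocess.run(command, check=True)


command = build_command("my_structures.csdsqlx", "my_shapes.sqlite")
print(" ".join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when database paths contain spaces.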