Reading and writing molecules and crystals

Introduction

The ccdc.io module is used to read and write molecules and crystals.

>>> from ccdc import io

Let us also set up a variable for a temporary directory to write files to.

>>> import os
>>> import tempfile
>>> tempdir = tempfile.mkdtemp()
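
A directory created with tempfile.mkdtemp() stays on disk until it is deleted explicitly. As a minimal standard-library sketch (not part of the ccdc API), tempfile.TemporaryDirectory can be used instead when automatic clean-up is preferred. The file name scratch.txt is a placeholder chosen for this illustration.

```python
import os
import tempfile

# TemporaryDirectory removes the directory and its contents when the block
# exits, whereas a directory made with mkdtemp() has to be removed by hand
# (for example with shutil.rmtree).
with tempfile.TemporaryDirectory() as scratch_dir:
    scratch_file = os.path.join(scratch_dir, 'scratch.txt')  # placeholder name
    with open(scratch_file, 'w') as handle:
        handle.write('ACEHAR\n')
    existed_inside_block = os.path.exists(scratch_file)

removed_after_block = not os.path.exists(scratch_dir)
print(existed_inside_block, removed_after_block)
```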

Supported file formats

Several file formats are supported for reading or writing molecules, crystals and entries. The supported formats are listed below in alphabetical order.

cif csdsql csdsqlx identifiers mol mol2 pdb res sdf sqlite sqlmol2

The Crystallographic Information File (cif) is a standard text file format for representing crystallographic information, and it’s the format in which structures are deposited into the CSD. However, in order to access the molecular information associated with those structures, it’s more appropriate to use one of the well-known chemistry file formats (e.g. mol2 and sdf). Mol is a synonym for sdf.

The Protein Data Bank (pdb) file format is a textual file format describing the three-dimensional structures of molecules held in the Protein Data Bank. It contains description and annotation of protein and nucleic acid structures including atomic coordinates, secondary structure assignments, as well as atomic connectivity.

The res file format is a well-known crystallographic format first introduced by the refinement program SHELX.

Identifiers is a list of molecule identifiers, and can only be used with an accompanying database. In the special case of the CSD this is known as a gcd file. The gcd file format is essentially a list of CSD refcodes and as such is commonly used in CSD Software to access lists of CSD structures. Below is an example of a valid .gcd file:

ACEHAR
ALPLNI
BAXMET
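
Since a gcd file is plain text, it can be produced with ordinary file operations. The sketch below assumes one identifier per line and no header, as in the example above. The helper names write_gcd and read_gcd are hypothetical and are not part of the ccdc API.

```python
import os
import tempfile


def write_gcd(path, refcodes):
    """Write one refcode per line (hypothetical helper, not ccdc API)."""
    with open(path, 'w') as handle:
        for refcode in refcodes:
            handle.write(refcode.strip() + '\n')


def read_gcd(path):
    """Return the non-empty lines of a gcd-style file as a list."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as scratch_dir:
    gcd_path = os.path.join(scratch_dir, 'some_refcodes.gcd')
    write_gcd(gcd_path, ['ACEHAR', 'ALPLNI', 'BAXMET'])
    refcodes = read_gcd(gcd_path)

print(refcodes)
```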

The csdsql, csdsqlx and sqlite formats are those in which the CSD is, or was, distributed. The sqlmol2 and csdsqlx formats are also used for the CSD-CrossMiner structural database providing cavity or full protein structures. The sqlmol2 format is an optimised variant of the MOL2 format that can be written by other CCDC software but is not recommended for general use. The csdsql format can also be written by the API and offers much greater read and search performance than the other supported formats.

Reading molecules

Let us start by finding out what the supported “file” formats for a molecule reader are. The known formats are stored in a dictionary called known_formats in the ccdc.io.MoleculeReader class.

>>> reader_formats = sorted(io.MoleculeReader.known_formats.keys())
>>> print('\n'.join(reader_formats))
cif
csdsql
csdsqlx
identifiers
mol
mol2
res
sdf
sqlite
sqlmol2

These file formats were introduced in the Supported file formats section.

The distinguished word ‘CSD’ opens the user’s installed CSD database, that is, the main CSD database together with any update databases (such as Feb18) present in the main CSD database directory. Note that a ccdc.io.MoleculeReader will return molecules from the CSD rather than crystals.

>>> csd_reader = io.MoleculeReader('CSD')
>>> first_molecule = csd_reader[0]
>>> print(first_molecule.identifier)
AABHTZ
>>> abebuf_mol = csd_reader.molecule('ABEBUF')
>>> print(abebuf_mol.identifier)
ABEBUF

It is worth noting that all three of the reader classes, ccdc.io.EntryReader, ccdc.io.CrystalReader and ccdc.io.MoleculeReader, support the methods entry, crystal and molecule.
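
Because these lookup methods are shared, a function can be written to accept any of the readers. The sketch below is illustrative only: summarise is a hypothetical helper, and the stand-in reader built from types.SimpleNamespace exists purely so that the example runs without a CSD installation.

```python
from types import SimpleNamespace


def summarise(reader, refcode):
    """Look up one refcode as an entry, a crystal and a molecule.

    `reader` may be any object offering entry(), crystal() and molecule().
    """
    entry = reader.entry(refcode)
    crystal = reader.crystal(refcode)
    molecule = reader.molecule(refcode)
    return entry.identifier, crystal.spacegroup_symbol, molecule.identifier


# Stand-in reader for demonstration only; it mimics the three lookup methods.
fake_reader = SimpleNamespace(
    entry=lambda refcode: SimpleNamespace(identifier=refcode),
    crystal=lambda refcode: SimpleNamespace(spacegroup_symbol='Pbca'),
    molecule=lambda refcode: SimpleNamespace(identifier=refcode),
)

summary = summarise(fake_reader, 'ABEBUF')
print(summary)
```

With a CSD installation, the same function should accept a real reader, for example summarise(io.MoleculeReader('CSD'), 'ABEBUF').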

Now let us create a molecule file reader from a mol2 file named pde5_inhibitors.mol2.

>>> pde5_filepath = 'pde5_inhibitors.mol2'

To get access to the molecules in this file we make use of a ccdc.io.MoleculeReader.

>>> mol_reader = io.MoleculeReader(pde5_filepath)

A molecule reader is iterable, and individual molecules can also be accessed by index.

>>> mol = mol_reader[0]
>>> print(mol.identifier)
1XP0-lig

Clearly it is also possible to loop over the molecule reader iterator.

>>> for mol in mol_reader:
...     print(mol.identifier)
...
1XP0-lig
1XOZ-lig
1T9S-lig

Note that in the example above the MoleculeReader deduced the file type from the file extension. It is also possible to provide the file type as an optional argument to the MoleculeReader.
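
The idea of deducing a format from a file extension can be sketched as follows. This is an illustration only and is not how ccdc.io implements it: guess_format is a hypothetical helper, and the set of names is copied from the reader formats listed earlier in this section.

```python
import os

# Format names copied from the MoleculeReader listing in this document.
READER_FORMATS = {
    'cif', 'csdsql', 'csdsqlx', 'identifiers', 'mol',
    'mol2', 'res', 'sdf', 'sqlite', 'sqlmol2',
}


def guess_format(file_name):
    """Return the lower-cased extension if it names a known format, else None."""
    extension = os.path.splitext(file_name)[1].lstrip('.').lower()
    return extension if extension in READER_FORMATS else None


print(guess_format('pde5_inhibitors.mol2'))   # mol2
print(guess_format('some_refcodes.txt'))      # None
```

When the extension is not informative, as with a .txt file of refcodes, the format has to be stated explicitly through the format parameter of the reader.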

In the case where one has a string containing the molecular data, for instance from a webservice, it is now possible to read the molecules directly, rather than writing the string to a file and opening it. As a somewhat contrived example one may now do:

>>> with open(pde5_filepath) as handle:
...     text = handle.read()
...
>>> with io.MoleculeReader(text) as reader:
...     for m in reader:
...         print(m.identifier)
...
1XP0-lig
1XOZ-lig
1T9S-lig

Note that the format is determined automatically.

Suppose that we had a text file with refcodes.

>>> filepath = 'some_refcodes.txt'

In order to treat this file as a gcd file when it is read in we make use of the format parameter.

>>> mol_reader = io.MoleculeReader(filepath, format='identifiers')
>>> for mol in mol_reader:
...     print(mol.identifier)
ACEHAR
ALPLNI
BAXMET

Finally, let us make sure that we have closed the molecule reader.

>>> mol_reader.close()

We can also provide refcodes to the ccdc.io.MoleculeReader as an iterable of identifiers:

>>> with io.MoleculeReader(['ABEBUF', 'HXACAN', 'VUSDIX04']) as reader:
...     for mol in reader:
...         print(mol.identifier)
ABEBUF
HXACAN
VUSDIX04

Reading crystals

Let us start by finding out what the supported “file” formats for a crystal reader are. The known formats are stored in a dictionary called known_formats in the ccdc.io.CrystalReader class.

>>> reader_formats = sorted(io.CrystalReader.known_formats.keys())
>>> print('\n'.join(reader_formats))
cif
csdsql
csdsqlx
identifiers
mol
mol2
res
sdf
sqlite
sqlmol2

These file formats were introduced in the Supported file formats section.

Note

Although the mol2 file format is usually used for molecules, it can also store crystallographic information using the @<TRIPOS>CRYSIN record. The sdf file format, on the other hand, does not support crystallographic information. If an sdf file, or a mol2 file without the @<TRIPOS>CRYSIN record, is read in using a ccdc.io.CrystalReader, a default crystal will be created for the molecule (see Default crystal).
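
To find out in advance whether a mol2 file carries cell information, one can look for the record tag in its text. The following is a minimal standard-library sketch. The helper has_crysin_record is hypothetical, and the two mol2 fragments are abbreviated, made-up strings used only to exercise it.

```python
def has_crysin_record(mol2_text):
    """Return True if any line starts with the @<TRIPOS>CRYSIN record tag."""
    return any(
        line.strip().upper().startswith('@<TRIPOS>CRYSIN')
        for line in mol2_text.splitlines()
    )


# Abbreviated, made-up mol2 fragments; they are not complete mol2 files.
with_cell = (
    '@<TRIPOS>MOLECULE\nexample\n'
    '@<TRIPOS>CRYSIN\n10.0 10.0 10.0 90.0 90.0 90.0 1 1\n'
)
without_cell = '@<TRIPOS>MOLECULE\nexample\n@<TRIPOS>ATOM\n'

print(has_crysin_record(with_cell), has_crysin_record(without_cell))
```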

Let us create a crystal reader using the some_refcodes.txt file in the Reading molecules example.

>>> filepath = 'some_refcodes.txt'

Again, in order to tell the reader that the input file is in gcd file format we make use of the format parameter.

>>> crystal_reader = io.CrystalReader(filepath, format='identifiers')
>>> for cryst in crystal_reader:
...     print(cryst.spacegroup_symbol)
Pbcn
P21/a
P21/a

Let us close the crystal reader.

>>> crystal_reader.close()

Next, let us read crystals from the installed CSD.

>>> crystal_reader = io.CrystalReader('CSD')
>>> first_crystal = crystal_reader[0]
>>> print(first_crystal.spacegroup_symbol)
P-1
>>> abebuf_crystal = crystal_reader.crystal('ABEBUF')
>>> print(abebuf_crystal.spacegroup_symbol)
Pbca
>>> crystal_reader.close()

Finally, let us read crystals from a res file:

>>> res_filepath = 'three_structures.res'
>>> crystal_reader = io.CrystalReader(res_filepath)
>>> print(', '.join(crystal.spacegroup_symbol for crystal in crystal_reader))
Pbca, P21/c, P212121

Writing molecules

Let us start by finding out what the supported “file” formats for a molecule writer are. The known formats are stored in a dictionary called known_formats in the ccdc.io.MoleculeWriter class.

>>> writer_formats = sorted(io.MoleculeWriter.known_formats.keys())
>>> print('\n'.join(writer_formats))
cif
csdsql
identifiers
mol
mol2
pdb
res
sdf

See Supported file formats for a brief description of each individual file format.

To illustrate how the molecule writer works let us read in molecules from a gcd file and write them out into an sdf file.

>>> filepath = 'some_refcodes.txt'

In order to tell the reader that the input file is in gcd file format we make use of the format parameter.

>>> mol_reader = io.MoleculeReader(filepath, format='identifiers')
>>> with io.MoleculeWriter(os.path.join(tempdir, 'some_refcodes.sdf')) as mol_writer:
...     for mol in mol_reader:
...         mol_writer.write(mol)
...

Note

The Python with syntax ensures that the mol_writer is closed automatically once the with block of code is exited. For more information please see PEP 343.
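
What the with statement does can be sketched with a stand-in class. DemoWriter below is hypothetical and only records what it is given. It is not a ccdc class, but it shows that a with block behaves like an explicit try/finally that calls close().

```python
class DemoWriter:
    """Stand-in writer that records items and whether it has been closed."""

    def __init__(self):
        self.items = []
        self.closed = False

    def write(self, item):
        self.items.append(item)

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions raised inside the block


# Using `with` ...
with DemoWriter() as writer_a:
    writer_a.write('ACEHAR')

# ... behaves like this explicit try/finally.
writer_b = DemoWriter()
try:
    writer_b.write('ACEHAR')
finally:
    writer_b.close()

print(writer_a.closed, writer_b.closed)
```

The readers in this document can be used in a with block in the same way, which saves a separate call to close().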

Writing crystals

Let us start by finding out what the supported “file” formats for a crystal writer are. The known formats are stored in a dictionary called known_formats in the ccdc.io.CrystalWriter class.

>>> writer_formats = sorted(io.CrystalWriter.known_formats.keys())
>>> print('\n'.join(writer_formats))
cif
csdsql
identifiers
mol
mol2
pdb
res
sdf

These file formats were introduced in the Supported file formats section.

To illustrate how the crystal writer works let us read in crystals from a gcd file and append them to a cif file.

>>> filepath = 'some_refcodes.txt'

In order to tell the reader that the input file is in gcd file format we make use of the format parameter. In order to append crystals to an existing cif file we use the append parameter.

>>> crystal_reader = io.CrystalReader(filepath, format='identifiers')
>>> with io.CrystalWriter(os.path.join(tempdir, 'some_refcodes.cif'), append=True) as crystal_writer:
...     for crystal in crystal_reader:
...         crystal_writer.write(crystal)
...

Reading entries

CSD crystallographic database entries can be read from the CSD to gain access to data, such as publication data, not accessible via the crystal. To do this one can create a ccdc.io.EntryReader:

>>> entry_reader = io.EntryReader('CSD')

which has all the methods of the other reader classes. For example:

>>> first_entry = entry_reader[0]
>>> print(first_entry.publication)  
Citation(authors='P.-E.Werner',
    journal='Journal(Crystal Structure Communications)',
    volume='5', year=1976, first_page='873', doi=None)
>>> abebuf_entry = entry_reader.entry('ABEBUF')
>>> print(abebuf_entry.publication)  
Citation(authors='S.W.Gordon-Wylie, E.Teplin, J.C.Morris, M.I.Trombley, S.M.McCarthy, W.M.Cleaver, G.R.Clark',
    journal='Journal(Crystal Growth and Design)',
    volume='4', year=2004, first_page='789', doi='10.1021/cg049957u')

EntryReaders also give access to the sd-tags from sdf files and the cif-tags from cif files as strings in a dictionary-like object named attributes. This means that one can get access to any property included in an sdf or cif file. To illustrate this let us read in a cif file with many caffeine structures.

>>> caffeine_file_name = 'caffeine.cif'

To get access to the raw CIF data we need to open the file as a ccdc.io.EntryReader.

>>> reader = io.EntryReader(caffeine_file_name)
>>> entry_from_cif = reader[0]
>>> print(entry_from_cif.attributes['_exptl_crystal_colour'])
orange

This method is particularly useful when reading files from a docking experiment using GOLD, where docking data are written to the output files in the form of sd-tags in either sdf or mol2 file formats. For example:

>>> gold_file_name = 'gold_output.sdf'
>>> with io.EntryReader(gold_file_name) as reader:
...     for e in reader:
...         print('%s: %.3f' % (e.identifier, float(e.attributes['Gold.Chemscore.Fitness']))) 
ZINC02871146: 15.275
...
ZINC02871189: 12.767
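
Since the tag values are strings, they need converting before they can be compared numerically. The sketch below sorts entries by one tag in descending order. The helper rank_by_tag is hypothetical, it assumes that the attributes object supports the in operator like a dictionary, and the stand-in entries only exist so that the example runs without GOLD output.

```python
from types import SimpleNamespace


def rank_by_tag(entries, tag, reverse=True):
    """Sort identifiers by the float value of one tag; skip entries lacking it."""
    scored = [
        (float(entry.attributes[tag]), entry.identifier)
        for entry in entries
        if tag in entry.attributes
    ]
    return [identifier for _, identifier in sorted(scored, reverse=reverse)]


# Stand-in entries for demonstration; with real data, pass an io.EntryReader
# opened on the docking output file instead.
entries = [
    SimpleNamespace(identifier='ZINC02871189',
                    attributes={'Gold.Chemscore.Fitness': '12.767'}),
    SimpleNamespace(identifier='ZINC02871146',
                    attributes={'Gold.Chemscore.Fitness': '15.275'}),
]

ranking = rank_by_tag(entries, 'Gold.Chemscore.Fitness')
print(ranking)
```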

Writing entries

Entries may be written using a ccdc.io.EntryWriter.

See Entry documentation for further details.

Default crystal

When reading in a file without crystallographic information as a crystal, or when writing out a molecule without crystallographic information, a default crystal will be created.

The default crystal is described in the table below.

Default crystal
Space group              Unknown
Cell lengths    a        1.0
                b        1.0
                c        1.0
Cell angles     alpha    90.0
                beta     90.0
                gamma    90.0

To illustrate this let us read in a mol2 file without crystallographic information.

>>> filepath = '1hak-lig.mol2'

To read in the first and only crystal from this file we make use of a ccdc.io.CrystalReader.

>>> crystal_reader = io.CrystalReader(filepath)
>>> crystal = crystal_reader[0]

We can now check the default values.

>>> print(crystal.spacegroup_symbol)
Unknown
>>> print(crystal.cell_lengths)
CellLengths(a=1.0, b=1.0, c=1.0)
>>> print(crystal.cell_angles)
CellAngles(alpha=90.0, beta=90.0, gamma=90.0)
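
The same check can be wrapped in a small function, for instance to warn when a structure has no real cell. This is a sketch under stated assumptions: looks_like_default_crystal is a hypothetical helper, and the namedtuples are stand-ins that mirror the printed representations above so that the example runs without the ccdc package.

```python
from collections import namedtuple

# Stand-ins mirroring the attribute names shown in the output above.
CellLengths = namedtuple('CellLengths', 'a b c')
CellAngles = namedtuple('CellAngles', 'alpha beta gamma')
Crystal = namedtuple('Crystal', 'spacegroup_symbol cell_lengths cell_angles')


def looks_like_default_crystal(crystal):
    """Return True if the crystal carries the default cell and space group."""
    lengths = crystal.cell_lengths
    angles = crystal.cell_angles
    return (
        crystal.spacegroup_symbol == 'Unknown'
        and (lengths.a, lengths.b, lengths.c) == (1.0, 1.0, 1.0)
        and (angles.alpha, angles.beta, angles.gamma) == (90.0, 90.0, 90.0)
    )


default_like = Crystal('Unknown', CellLengths(1.0, 1.0, 1.0),
                       CellAngles(90.0, 90.0, 90.0))
# Made-up cell values for a crystal that does have a space group.
real_like = Crystal('Pbca', CellLengths(7.5, 12.1, 18.3),
                    CellAngles(90.0, 90.0, 90.0))

print(looks_like_default_crystal(default_like),
      looks_like_default_crystal(real_like))
```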

Working with multiple databases

It is possible to create readers that work with multiple databases simultaneously. Any database which can be opened with a ccdc.io.EntryReader, ccdc.io.CrystalReader or ccdc.io.MoleculeReader instance may be read into a single, compound database.

To illustrate this in a rather contrived example, let us create a reader that reads a collection of structures from some of the files we have used in this document:

>>> db = io.EntryReader([
...     pde5_filepath, res_filepath, caffeine_file_name, gold_file_name, ['HXACAN', 'VUSDIX04']
... ])
>>> print(len(db))
177

The file_name attribute of the compound database will show the absolute path of each component of the database, so we can check how many entries are in each database:

>>> for file_name in db.file_name:
...     er = io.EntryReader(file_name)
...     print('%s: %d' % (os.path.basename(file_name) if isinstance(file_name, str) else file_name, len(er)))
pde5_inhibitors.mol2: 3
three_structures.res: 3
caffeine.cif: 139
gold_output.sdf: 30
['HXACAN', 'VUSDIX04']: 2
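
As a quick consistency check, the per-component counts printed above add up to the length reported for the compound database. The dictionary below simply restates those numbers and does not query any database.

```python
# Counts copied from the output above; keys are the component names as printed.
component_counts = {
    'pde5_inhibitors.mol2': 3,
    'three_structures.res': 3,
    'caffeine.cif': 139,
    'gold_output.sdf': 30,
    "['HXACAN', 'VUSDIX04']": 2,
}

total = sum(component_counts.values())
print(total)  # 177
```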