Reading and writing molecules and crystals

Introduction

The ccdc.io module is used to read and write molecules and crystals.

>>> from ccdc import io

Let us also set up a variable for a temporary directory to write files to.

>>> import os
>>> import tempfile
>>> tempdir = tempfile.mkdtemp()

Supported file formats

Several file formats are supported for reading or writing molecules, crystals and entries. Below is a complete list of these formats in alphabetical order.

CIF csdsql csdsqlx identifiers mmCIF Mol Mol2 PDB res SDF sqlite sqlmol2

The Crystallographic Information File (CIF) is a standard text file format for representing crystallographic information, and it’s the format in which structures are deposited into the CSD. However, in order to access the molecular information associated with those structures, it’s more appropriate to use one of the well-known chemistry file formats (e.g., Mol2 and SDF). Mol is a synonym for SDF.

The Protein Data Bank (PDB) file format is a textual file format describing the three-dimensional structures of molecules held in the Protein Data Bank. It contains description and annotation of protein and nucleic acid structures including atomic coordinates, secondary structure assignments, as well as atomic connectivity.

The CSD Python API supports reading and writing of mmCIF more fully than it does the PDB file format. Please note, however, that the CSD portfolio does not currently support every data field; the supported categories are listed below. There are additional fields common to CIF and mmCIF formats that are not listed here:

_entry.id

_chemical.absolute_configuration

_cell.angle_alpha
_cell.angle_beta
_cell.angle_gamma
_cell.length_a
_cell.length_b
_cell.length_c
_cell.volume

_symmetry.Int_Tables_number
_symmetry.cell_setting
_symmetry.space_group_name_H-M
_symmetry.space_group_name_Hall

_pdbx_struct_oper_list.id
_pdbx_struct_oper_list.symmetry_operation

_diffrn.ambient_pressure
_diffrn.ambient_temp
_diffrn.details
_diffrn_radiation.probe
_diffrn_radiation_wavelength.wavelength
_diffrn_reflns.theta_max
_diffrn_source.source

_exptl_crystal.density_diffrn
_exptl_crystal.density_meas

_refine.details
_refine.diff_density_max
_refine.diff_density_min
_refine.ls_R_factor_all
_refine.ls_R_factor_obs
_refine.ls_goodness_of_fit_ref
_refine.ls_number_constraints
_refine.ls_number_parameters
_refine.ls_number_restraints
_refine.ls_shift_over_su_max
_refine.ls_wR_factor_R_free

_atom_site.B_iso_or_equiv
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.U_iso_or_equiv
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.auth_comp_id
_atom_site.auth_seq_id
_atom_site.fract_x
_atom_site.fract_y
_atom_site.fract_z
_atom_site.group_PDB
_atom_site.id
_atom_site.label_alt_id
_atom_site.label_asym_id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_seq_id
_atom_site.occupancy
_atom_site.pdbx_PDB_ins_code
_atom_site.pdbx_PDB_model_num
_atom_site.pdbx_formal_charge
_atom_site.type_symbol

_atom_site_anisotrop.U[1][1]
_atom_site_anisotrop.U[1][2]
_atom_site_anisotrop.U[1][3]
_atom_site_anisotrop.U[2][2]
_atom_site_anisotrop.U[2][3]
_atom_site_anisotrop.U[3][3]
_atom_site_anisotrop.id
_atom_site_anisotrop.type_symbol

_struct_conn.conn_type_id
_struct_conn.id
_struct_conn.pdbx_ptnr1_label_alt_id
_struct_conn.pdbx_ptnr2_label_alt_id
_struct_conn.pdbx_value_order
_struct_conn.ptnr1_label_asym_id
_struct_conn.ptnr1_label_atom_id
_struct_conn.ptnr1_label_comp_id
_struct_conn.ptnr1_label_seq_id
_struct_conn.ptnr2_label_asym_id
_struct_conn.ptnr2_label_atom_id
_struct_conn.ptnr2_label_comp_id
_struct_conn.ptnr2_label_seq_id

The RES file format is a well-known crystallographic format first introduced by the refinement program SHELX.

The identifiers format is a list of molecule identifiers, and can only be used with an accompanying database. In the special case of the CSD this is known as a GCD file. The GCD file format is essentially a list of CSD refcodes and as such is commonly used in CSD Software to access lists of CSD structures. Below is an example of a valid GCD file:

ACEHAR
ALPLNI
BAXMET
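Since a GCD file is just one identifier per line, it can be produced with a few lines of standard Python. The following is a minimal sketch under that assumption; the helper name write_gcd and the file location are illustrative and not part of the API.

```python
import os
import tempfile


def write_gcd(refcodes, path):
    """Write one identifier per line, the layout of the GCD example above."""
    with open(path, 'w') as handle:
        for refcode in refcodes:
            handle.write(refcode.strip() + '\n')
    return path


# The directory and file name are illustrative only.
gcd_path = write_gcd(
    ['ACEHAR', 'ALPLNI', 'BAXMET'],
    os.path.join(tempfile.mkdtemp(), 'some_refcodes.txt'),
)

# Read the file back to confirm its layout.
with open(gcd_path) as handle:
    gcd_lines = handle.read().splitlines()

print(gcd_lines)  # ['ACEHAR', 'ALPLNI', 'BAXMET']
```

A file written this way can be handed to a reader with format='identifiers', as described in the Reading molecules section.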

Csdsql, csdsqlx, and sqlite are formats in which the CSD is, or was, distributed. The sqlmol2 and csdsqlx formats are also used for the CSD-CrossMiner structural database providing cavity or full protein structures. Sqlmol2 is an optimised variant of the Mol2 format that can be written by other CCDC software but is not recommended for general use. The csdsql format can also be written by the API and offers much greater read and search performance than other supported formats.

Reading molecules

Let us start by finding out what the supported “file” formats for a molecule reader are. The known formats are stored in a dictionary called known_formats in the ccdc.io.MoleculeReader class.

>>> reader_formats = sorted(io.MoleculeReader.known_formats.keys())
>>> print('\n'.join(reader_formats))
cif
csdsql
csdsqlx
identifiers
mmcif
mol
mol2
res
sdf
sqlite
sqlmol2

These file formats were introduced in the Supported file formats section.

The distinguished word ‘CSD’ will open the user’s installed CSD database: the main CSD database together with any update databases (such as Feb18) present in the main CSD database directory. Note that the ccdc.io.MoleculeReader will return molecules from the CSD rather than crystals.

>>> csd_reader = io.MoleculeReader('CSD')
>>> first_molecule = csd_reader[0]
>>> print(first_molecule.identifier)
AABHTZ
>>> abebuf_mol = csd_reader.molecule('ABEBUF')
>>> print(abebuf_mol.identifier)
ABEBUF

It is worth noting that all three of the reader classes, ccdc.io.EntryReader, ccdc.io.CrystalReader and ccdc.io.MoleculeReader support the methods entry, crystal and molecule.

Now let us create a molecule file reader from a Mol2 file named pde5_inhibitors.mol2.

>>> pde5_filepath = 'pde5_inhibitors.mol2'

To get access to the molecules in this file we make use of a ccdc.io.MoleculeReader.

>>> mol_reader = io.MoleculeReader(pde5_filepath)

A molecule reader is an iterator from which individual molecules can be accessed by an index.

>>> mol = mol_reader[0]
>>> print(mol.identifier)
1XP0-lig

It is also possible to loop over the molecule reader iterator.

>>> for mol in mol_reader:
...     print(mol.identifier)
...
1XP0-lig
1XOZ-lig
1T9S-lig

Note that in the example above the MoleculeReader deduced the file type from the file extension. It is also possible to provide the file type as an optional argument to the MoleculeReader.
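As a rough illustration of supplying the file type explicitly, the sketch below builds the keyword arguments for a reader from the file extension. The helper reader_kwargs and its extension table are hypothetical and not part of ccdc.io; only the format keyword and the format names come from the listing above, and treating '.gcd' and '.txt' as identifier lists is an assumption made for this example.

```python
import os

# Hypothetical table of extensions that should not be left to automatic
# detection; the values are format names from known_formats.
EXPLICIT_FORMATS = {
    '.txt': 'identifiers',  # a plain text list of refcodes
    '.gcd': 'identifiers',
}


def reader_kwargs(path):
    """Return keyword arguments for a reader, adding format only when needed."""
    extension = os.path.splitext(path)[1].lower()
    if extension in EXPLICIT_FORMATS:
        return {'format': EXPLICIT_FORMATS[extension]}
    return {}  # let the reader deduce the type from the extension


print(reader_kwargs('some_refcodes.txt'))     # {'format': 'identifiers'}
print(reader_kwargs('pde5_inhibitors.mol2'))  # {}

# With the API installed, the result would be used as:
#     io.MoleculeReader(path, **reader_kwargs(path))
```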

In the case where one has a string containing the molecular data, for instance from a webservice, it is now possible to read the molecules directly, rather than writing the string to a file and opening it. As a somewhat contrived example one may now do:

>>> with open(pde5_filepath) as mol2_file:
...     text = mol2_file.read()
...
>>> with io.MoleculeReader(text) as reader:
...     for m in reader:
...         print(m.identifier)
...
1XP0-lig
1XOZ-lig
1T9S-lig

Note that the format is determined automatically.

Suppose that we had a text file with refcodes.

>>> filepath = 'some_refcodes.txt'

In order to treat this file as a GCD file when it is read in we make use of the format parameter.

>>> mol_reader = io.MoleculeReader(filepath, format='identifiers')
>>> for mol in mol_reader:
...     print(mol.identifier)
ACEHAR
ALPLNI
BAXMET

Finally, let us make sure that we have closed the molecule reader.

>>> mol_reader.close()

We can also provide refcodes to the ccdc.io.MoleculeReader as an iterable of identifiers:

>>> with io.MoleculeReader(['ABEBUF', 'HXACAN', 'VUSDIX04']) as reader:
...     for mol in reader:
...         print(mol.identifier)
ABEBUF
HXACAN
VUSDIX04

Reading crystals

Let us start by finding out what the supported “file” formats for a crystal reader are. The known formats are stored in a dictionary called known_formats in the ccdc.io.CrystalReader class.

>>> reader_formats = sorted(io.CrystalReader.known_formats.keys())
>>> print('\n'.join(reader_formats))
cif
csdsql
csdsqlx
identifiers
mmcif
mol
mol2
res
sdf
sqlite
sqlmol2

These file formats were introduced in the Supported file formats section.

Note

Although the mol2 file format is usually used for molecules, it can store crystallographic information using the @<TRIPOS>CRYSIN record. The sdf file format, on the other hand, does not support crystallographic information. If an sdf file or a mol2 file (without the @<TRIPOS>CRYSIN record) is read in using a ccdc.io.CrystalReader, a default crystal will be created for the molecule (see Default crystal).

Let us create a crystal reader using the some_refcodes.txt file from the Reading molecules example.

>>> filepath = 'some_refcodes.txt'

Again, in order to tell the reader that the input file is in the GCD file format we make use of the format parameter.

>>> crystal_reader = io.CrystalReader(filepath, format='identifiers')
>>> for cryst in crystal_reader:
...     print(cryst.spacegroup_symbol)
Pbcn
P21/a
P21/a

Let us close the crystal reader.

>>> crystal_reader.close()

Next, let us read crystals from the installed CSD.

>>> crystal_reader = io.CrystalReader('CSD')
>>> first_crystal = crystal_reader[0]
>>> print(first_crystal.spacegroup_symbol)
P-1
>>> abebuf_crystal = crystal_reader.crystal('ABEBUF')
>>> print(abebuf_crystal.spacegroup_symbol)
Pbca
>>> crystal_reader.close()

Finally, let us read crystals from a res file:

>>> res_filepath = 'three_structures.res'
>>> crystal_reader = io.CrystalReader(res_filepath)
>>> print(', '.join(crystal.spacegroup_symbol for crystal in crystal_reader))
Pbca, P21/c, P212121

Writing molecules

Let us start by finding out what the supported “file” formats for a molecule writer are. The known formats are stored in a dictionary called known_formats in the ccdc.io.MoleculeWriter class.

>>> writer_formats = sorted(io.MoleculeWriter.known_formats.keys())
>>> print('\n'.join(writer_formats))
cif
csdsql
csdsqlx
identifiers
mmcif
mol
mol2
pdb
res
sdf

See Supported file formats for a brief description of each individual file format.
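The writer formats listed above differ slightly from the reader formats given in the Reading molecules section. The sketch below works out the difference from the two printed listings alone, so it runs without the API; with the API installed, the same sets could presumably be built from the keys of the two known_formats dictionaries.

```python
# Format names copied from the reader and writer listings in this document.
reader_formats = {
    'cif', 'csdsql', 'csdsqlx', 'identifiers', 'mmcif', 'mol', 'mol2',
    'res', 'sdf', 'sqlite', 'sqlmol2',
}
writer_formats = {
    'cif', 'csdsql', 'csdsqlx', 'identifiers', 'mmcif', 'mol', 'mol2',
    'pdb', 'res', 'sdf',
}

# Formats that appear in only one of the two listings.
read_only = sorted(reader_formats - writer_formats)
write_only = sorted(writer_formats - reader_formats)

print(read_only)   # ['sqlite', 'sqlmol2']
print(write_only)  # ['pdb']
```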

To illustrate how the molecule writer works let us read in molecules from a GCD file and write them out into an SDF file.

>>> filepath = 'some_refcodes.txt'

In order to tell the reader that the input file is in the GCD file format we make use of the format parameter.

>>> mol_reader = io.MoleculeReader(filepath, format='identifiers')
>>> with io.MoleculeWriter(os.path.join(tempdir, 'some_refcodes.sdf')) as mol_writer:
...     for mol in mol_reader:
...         mol_writer.write(mol)
...

Note

The Python with syntax ensures that the mol_writer is closed automatically once the with block of code is exited. For more information please see PEP 343.

A ccdc.io.MoleculeWriter will determine the file format from an optional format parameter; otherwise it will use the file extension. However, the file extension “.cif” is used for both CIF and mmCIF files. In order to distinguish between the two file formats the format parameter should be used; otherwise the format used will depend on the molecule written. For example, CSD molecules would be written as CIF files and PDB protein molecules would be written as mmCIF files.
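One way to avoid the ambiguity is to always pass format for “.cif” paths. The following is a minimal sketch of that idea; the helpers writer_format and open_writer are hypothetical wrappers, not part of ccdc.io, and the only API feature relied upon is the optional format parameter described above.

```python
import os


def writer_format(path, mmcif=False):
    """Pick an explicit format name for ambiguous '.cif' paths.

    Returns None for any other extension, meaning the extension can be
    left to decide. This helper is illustrative, not part of ccdc.io.
    """
    if os.path.splitext(path)[1].lower() == '.cif':
        return 'mmcif' if mmcif else 'cif'
    return None


def open_writer(path, mmcif=False):
    """Open a MoleculeWriter, passing format only for '.cif' paths."""
    from ccdc import io  # imported lazily; needs the CSD Python API

    fmt = writer_format(path, mmcif)
    if fmt is None:
        return io.MoleculeWriter(path)
    return io.MoleculeWriter(path, format=fmt)


print(writer_format('out.cif'))              # cif
print(writer_format('out.cif', mmcif=True))  # mmcif
print(writer_format('out.sdf'))              # None
```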

Writing crystals

Let us start by finding out what the supported “file” formats for a crystal writer are. The known formats are stored in a dictionary called known_formats in the ccdc.io.CrystalWriter class.

>>> writer_formats = sorted(io.CrystalWriter.known_formats.keys())
>>> print('\n'.join(writer_formats))
cif
csdsql
csdsqlx
identifiers
mmcif
mol
mol2
pdb
res
sdf

These file formats were introduced in the Supported file formats section.

To illustrate how the crystal writer works let us read in crystals from a GCD file and append them to a CIF file.

>>> filepath = 'some_refcodes.txt'

In order to tell the reader that the input file is in the GCD file format we make use of the format parameter. In order to append crystals to an existing CIF file we use the append parameter.

>>> crystal_reader = io.CrystalReader(filepath, format='identifiers')
>>> with io.CrystalWriter(os.path.join(tempdir, 'some_refcodes.cif'), append=True) as crystal_writer:
...     for crystal in crystal_reader:
...         crystal_writer.write(crystal)
...
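Because appending adds to whatever the file already holds, it can be useful to check how many structures a CIF file contains. The sketch below counts data blocks using the CIF convention that each block opens with a data_ header line. It uses only the standard library, the demonstration file contains placeholder values rather than real structures, and a data_ line inside a multi-line text field would be miscounted, so treat it as a rough check.

```python
import os
import tempfile


def count_cif_blocks(path):
    """Count lines that open a CIF data block (they start with 'data_')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.lower().startswith('data_'))


# Self-contained demonstration on a tiny hand-written file.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.cif')
with open(demo_path, 'w') as handle:
    handle.write('data_first\n_cell_length_a 1.0\n')
with open(demo_path, 'a') as handle:  # plain append mode, for illustration
    handle.write('data_second\n_cell_length_a 1.0\n')

block_count = count_cif_blocks(demo_path)
print(block_count)  # 2
```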

A ccdc.io.CrystalWriter will determine the file format from an optional format parameter; otherwise it will use the file extension. However, the file extension “.cif” is used for both CIF and mmCIF files. In order to distinguish between the two file formats the format parameter should be used; otherwise the format used will depend on the crystal written. For example, CSD crystals would be written as CIF files and PDB protein crystals would be written as mmCIF files.

Reading entries

CSD crystallographic database entries can be read from the CSD to gain access to data, such as publication data, not accessible via the crystal. To do this one can create a ccdc.io.EntryReader:

>>> entry_reader = io.EntryReader('CSD')

which has all the methods of the other reader classes. For example:

>>> first_entry = entry_reader[0]
>>> print(first_entry.publication)  
Citation(authors='P.-E.Werner',
    journal='Journal(Crystal Structure Communications)',
    volume='5', year=1976, first_page='873', doi=None)
>>> abebuf_entry = entry_reader.entry('ABEBUF')
>>> print(abebuf_entry.publication)  
Citation(authors='S.W.Gordon-Wylie, E.Teplin, J.C.Morris, M.I.Trombley, S.M.McCarthy, W.M.Cleaver, G.R.Clark',
    journal='Journal(Crystal Growth and Design)',
    volume='4', year=2004, first_page='789', doi='10.1021/cg049957u')

EntryReaders also give access to sd-tags for SDF format files and the cif-tags from CIF and mmCIF files as strings in a dictionary-like object named attributes. This means that one can get access to any property included in those files. To illustrate this let us read in a CIF file with many caffeine structures.

>>> caffeine_file_name = 'caffeine.cif'

To get access to the raw CIF data we need to open the file as a ccdc.io.EntryReader.

>>> reader = io.EntryReader(caffeine_file_name)
>>> entry_from_cif = reader[0]
>>> print(entry_from_cif.attributes['_exptl_crystal_colour'])
orange

This method is particularly useful when reading files from a docking experiment using GOLD, where docking data are written to the output files in the form of sd-tags in either SDF or Mol2 file formats. For example:

>>> gold_file_name = 'gold_output.sdf'
>>> with io.EntryReader(gold_file_name) as reader:
...     for e in reader:
...         print('%s: %.3f' % (e.identifier, float(e.attributes['Gold.Chemscore.Fitness']))) 
ZINC02871146: 15.275
...
ZINC02871189: 12.767
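Since the attribute values are strings, they need converting before they can be compared numerically. The following sketch sorts entries by a score tag; the helper rank_by_tag is hypothetical, and the stand-in objects only mimic the identifier and attributes members used above so that the example runs without the API.

```python
from types import SimpleNamespace


def rank_by_tag(entries, tag='Gold.Chemscore.Fitness'):
    """Return (identifier, score) pairs sorted from highest to lowest score.

    entries may be any iterable of objects exposing identifier and a
    dictionary-like attributes, such as an open io.EntryReader.
    """
    scored = [(e.identifier, float(e.attributes[tag])) for e in entries]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stand-in objects so the sketch runs without the API; the scores echo the
# two values printed above.
stand_ins = [
    SimpleNamespace(identifier='ZINC02871189',
                    attributes={'Gold.Chemscore.Fitness': '12.767'}),
    SimpleNamespace(identifier='ZINC02871146',
                    attributes={'Gold.Chemscore.Fitness': '15.275'}),
]

ranking = rank_by_tag(stand_ins)
print(ranking[0])  # ('ZINC02871146', 15.275)
```

With the API installed, an open io.EntryReader could be passed in place of stand_ins.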

Writing entries

Entries may be written using a ccdc.io.EntryWriter.

See Entry documentation for further details.

Default crystal

When reading in a file without crystallographic information as a crystal or when writing out a molecule without crystallographic information a default crystal will be created.

The default crystal is described in the table below.

Default crystal

Space group              Unknown
Cell lengths   a         1.0
               b         1.0
               c         1.0
Cell angles    alpha     90.0
               beta      90.0
               gamma     90.0

To illustrate this let us read in a mol2 file without crystallographic information.

>>> filepath = '1hak-lig.mol2'

To read in the first and only crystal from this file we make use of a ccdc.io.CrystalReader.

>>> crystal_reader = io.CrystalReader(filepath)
>>> crystal = crystal_reader[0]

We can now check the default values.

>>> print(crystal.spacegroup_symbol)
Unknown
>>> print(crystal.cell_lengths)
CellLengths(a=1.0, b=1.0, c=1.0)
>>> print(crystal.cell_angles)
CellAngles(alpha=90.0, beta=90.0, gamma=90.0)
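When processing files of mixed origin it may be helpful to detect crystals that only carry these default values. The sketch below is one possible check; the function looks_like_default_crystal is hypothetical, and the namedtuple stand-ins simply mirror the printed representations above so that the example runs without the API. The second crystal uses arbitrary illustrative values.

```python
from collections import namedtuple

# Stand-ins mirroring the printed representations above, so the sketch runs
# without the API installed.
CellLengths = namedtuple('CellLengths', 'a b c')
CellAngles = namedtuple('CellAngles', 'alpha beta gamma')
Crystal = namedtuple('Crystal', 'spacegroup_symbol cell_lengths cell_angles')


def looks_like_default_crystal(crystal):
    """True if the crystal carries the default cell described above."""
    lengths = crystal.cell_lengths
    angles = crystal.cell_angles
    return (
        crystal.spacegroup_symbol == 'Unknown'
        and (lengths.a, lengths.b, lengths.c) == (1.0, 1.0, 1.0)
        and (angles.alpha, angles.beta, angles.gamma) == (90.0, 90.0, 90.0)
    )


default = Crystal('Unknown', CellLengths(1.0, 1.0, 1.0),
                  CellAngles(90.0, 90.0, 90.0))
# Arbitrary illustrative values, not taken from a real structure.
other = Crystal('Pbca', CellLengths(7.5, 9.1, 12.3),
                CellAngles(90.0, 90.0, 90.0))

print(looks_like_default_crystal(default))  # True
print(looks_like_default_crystal(other))    # False
```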

Working with multiple databases

It is possible to create readers that work with multiple databases simultaneously. Any database which can be created with a ccdc.io.EntryReader, ccdc.io.CrystalReader or ccdc.io.MoleculeReader instance may be read into a single, compound database.

To illustrate this in a rather contrived example, let us create a reader that reads a collection of structures from some of the files we have used in this document:

>>> db = io.EntryReader([
...     pde5_filepath, res_filepath, caffeine_file_name, gold_file_name, ['HXACAN', 'VUSDIX04']
... ])
>>> print(len(db))
177

The file_name attribute of the compound database will show the absolute path of each component of the database, so we can check how many entries are in each database:

>>> for file_name in db.file_name:
...     er = io.EntryReader(file_name)
...     label = os.path.basename(file_name) if isinstance(file_name, str) else file_name
...     print('%s: %d' % (label, len(er)))
pde5_inhibitors.mol2: 3
three_structures.res: 3
caffeine.cif: 139
gold_output.sdf: 30
['HXACAN', 'VUSDIX04']: 2
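The per-component counts above add up to the size of the compound database. The sketch below reproduces that bookkeeping with plain Python; the helper component_label is hypothetical, the directory in the paths is made up, and the counts are copied from the output above.

```python
import os


def component_label(component):
    """Base name for file paths; other components (lists) pass through."""
    if isinstance(component, str):
        return os.path.basename(component)
    return component


# Counts copied from the output above; the paths are hypothetical.
counts = [
    ('/data/pde5_inhibitors.mol2', 3),
    ('/data/three_structures.res', 3),
    ('/data/caffeine.cif', 139),
    ('/data/gold_output.sdf', 30),
    (['HXACAN', 'VUSDIX04'], 2),
]

labels = [component_label(component) for component, _ in counts]
total = sum(count for _, count in counts)
print(total)  # 177
```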