Working with molecules, atoms and bonds


The classes ccdc.molecule.Atom, ccdc.molecule.Bond and ccdc.molecule.Molecule are all stored in the ccdc.molecule module.

However, you are unlikely to use these classes to create objects directly. You are much more likely to come across them after having read in a molecule from a file or after having accessed one from a database. Let us therefore import the module to allow us to get access to some molecules.

>>> from ccdc import io
>>> from ccdc.molecule import Molecule
>>> csd = io.MoleculeReader('CSD')

A ccdc.molecule.Molecule can contain several molecules

What - a molecule can contain several molecules??? Yes, in the CSD Python API we refer to these sub-molecules as components.

The reason why this arises is that most molecular file formats allow the inclusion of molecular species which are not covalently bonded to each other to be included in the same entry. Below for example is a SMILES representation of the CSD structure ABEBUF.


This “molecule” contains two components 5H,11H-Dibenzo(b,f)(1,5)diazocine-6,12-dione and pyridine.


It is similarly possible to represent several molecules within single entries of mol2 and sdf files. In mol2 file format these are referred to as substructures.

Let us illustrate this using the CSD structure ABEBUF.

>>> filepath = 'ABEBUF.mol2'

To get access to the molecule in this file we make use of a

>>> mol_reader = io.MoleculeReader(filepath)
>>> mol = mol_reader[0]
>>> print(mol.identifier)
>>> print(mol.smiles)
>>> len(mol.components)
>>> for component in mol.components:
...     print(component.smiles)

This has implications for molecular properties, which are calculated using all the components within a molecule. Let us illustrate this using the molecular weight.

>>> print(round(mol.molecular_weight, 4))
>>> for component in mol.components:
...     print(round(component.molecular_weight, 4))

Finally, it is worth noting that in many cases when dealing with molecules containing salts and solvents one is not particularly interested in the salts or solvents per se, but rather the larger organic molecule in the structure. This “heaviest” component can be accessed as illustrated below.

>>> heaviest_component = mol.heaviest_component
>>> print('Identifier: %s, Mol.Wt: %.3f' % (heaviest_component.identifier, heaviest_component.molecular_weight))
Identifier: 01, Mol.Wt: 238.241

Standardising and editing molecules

There are some functions available for standardising and editing molecules. Let us start by illustrating how one can remove and add hydrogen atoms.

>>> print(round(mol.molecular_weight, 4))
>>> mol.remove_hydrogens()
>>> print(round(mol.molecular_weight, 4))
>>> mol.add_hydrogens()
>>> print(round(mol.molecular_weight, 4))

To illustrate some of the more advanced molecule standardisation functions let us read in a PDB ligand (from the structure 1f0r), which has not got any hydrogen atoms added to it, in sdf file format.

>>> filepath = '1f0r-lig.sdf'

To get access to the molecule in this file we make use of a

>>> mol_reader = io.MoleculeReader(filepath)
>>> mol = mol_reader[0]
>>> print(mol.identifier)
>>> print(round(mol.molecular_weight, 4))

First of all if I am unsure about the bond typing I will need to assign this first. Note that this bond typing is based on the geometry of the molecule.

>>> mol.assign_bond_types()

By default, the ccdc.molecule.Molecule.assign_bond_types() function guesses the bond types of all bonds. However, if you had a molecule that only had the bond types set for some of the bonds and you had confidence in these then you could guess the bond types of only those bonds with “unknown” types by setting the parameter which to unknown.

If you prefer to have alternating single and double bonds rather than aromatic bonds you can call:

>>> mol.kekulize()

After the bond types have been assigned one can add hydrogen atoms to the molecule.

>>> mol.add_hydrogens()

Let us check if the molecule has had its molecular weight increased

>>> print(mol.molecular_weight)

If you had a molecule with bond types that you had confidence in, but you wanted to standardise the aromatic and delocalised bonds to CSD conventions then you could make use of the functions:

Finally, let us come back to the concept of a molecule having multiple components. Suppose that we wanted to add the heaviest component from the ABEBUF molecule to the 1f0r ligand. This could be achieved with the ccdc.molecule.Molecule.add_molecule() function.

>>> mol.add_molecule(heaviest_component)
>>> for component in mol.components:
...     print(component.identifier)


Operations which change the underlying molecule of an entry or a crystal, such as adding hydrogens, or assigning bond types will permanently affect the molecule. Any subsequent access of the molecule of an entry or crystal will return the mutated molecule. Reading the entry or crystal afresh from a database will, of course, return the original, unmutated molecule.

Accessing molecular properties

Let us now have a look at the molecular properties available to us using the ABEBUF structure.

>>> filepath = 'ABEBUF.mol2'

To get access to the molecule in this file we make use of a

>>> mol_reader = io.MoleculeReader(filepath)
>>> mol = mol_reader[0]

Lists of the atoms and bonds in a molecule are accessed using the attributes with these names.

>>> mol.atoms  
[Atom(O1), Atom(O2), Atom(N1), ... Atom(H13), Atom(H14), Atom(H15)]
>>> len(mol.atoms)
>>> mol.heavy_atoms 
[Atom(O1), Atom(O2), ... Atom(C17), Atom(C18), Atom(C19)]
>>> mol.bonds  
[Bond(Double Atom(O1) Atom(C1)), ... Bond(Single Atom(H15) Atom(C19))]
>>> len(mol.bonds)

we can gain access to particular atoms and bonds of the molecule if we know their index in the lists. Alternatively, we can gain access to particular atoms and bonds of the molecule by atom labels, provided the labels used are unique within the molecule:

>>> mol.atom('O2')
>>>'C1', 'O1')
Bond(Double Atom(O1) Atom(C1))

The atoms and bonds have a Sybyl type:

>>> print(mol.atom('N1').sybyl_type)
>>> print('N1', 'C1').sybyl_type)

An atom has a Van der Waals radius:

>>> mol.atom('O1').vdw_radius
>>> mol.atom('C1').vdw_radius

An atom also has an atomic weight:

>>> mol.atom('O1').atomic_weight
>>> mol.atom('C1').atomic_weight

We can also access the formal charge of the molecule as well as the molecular weight, the molecular volume (if all the atoms of the molecule have 3D coordinates) and the molecular formula.

>>> mol.formal_charge
>>> round(mol.molecular_weight, 2)
>>> round(mol.molecular_volume, 2)
>>> mol.formula
'C19 H15 N3 O2'

Note that the molecular formula is the sum of all the components in the molecule. To get a component-wise formula one can for example:

>>> '.'.join(c.formula for c in mol.components)
'C14 H10 N2 O2.C5 H5 N1'

There is also a property which tells you whether or not the molecule is organic (as opposed to metal-organic).

>>> mol.is_organic

It is also worth mentioning that the molecule identifier and SMILES representation can be accessed through the molecule.

>>> print(mol.identifier)
>>> print(mol.smiles)


We can get access to the rings of a molecule. For example, let us read in ‘CEMGUY’ from the CSD which has an interesting set of rings:

>>> cemguy = csd.molecule('CEMGUY')
CEMGUY 2D diagram.

CEMGUY 2D diagram, to exemplify ring analysis.

We can immediately get access to the size of the largest and smallest ring of this molecule:

>>> print('Largest ring: %d, Smallest ring: %d' % (cemguy.largest_ring_size, cemguy.smallest_ring_size))
Largest ring: 6, Smallest ring: 3

The molecule has a property which will give access to the basic rings of the structure. The set of rings returned will be all simple rings, not an enclosing, larger ring. In this example, there will be a 3-membered ring, a 5-membered ring and two six-membered rings, but not the six-membered ring containing the 3- and 5-membered rings:

>>> rings = cemguy.rings
>>> print(len(rings))
>>> print([len(r) for r in rings])
[3, 5, 6, 6]

We can get the atoms and bonds of each ring:

>>> print(rings[0].atoms) 
[Atom(...), Atom(...), Atom(...)]
>>> print(rings[0].bonds) 
(Bond(Single Atom(...) Atom(...)), Bond(Single Atom(...) Atom(...)), Bond(Single Atom(...) Atom(...)))

Note that the atoms of the ring are ordered such that there is a bond between consecutive atoms. We can test for containment in a ring:

>>> print(cemguy.atom('C1') in rings[0])
>>> print('C6', 'C1') in rings[0])

There are a couple of predicates on rings:

>>> print(rings[0].is_aromatic)
>>> print(rings[-1].is_aromatic)
>>> print(rings[0].is_fused)
>>> print(rings[-1].is_fused)
>>> print(rings[0].is_fully_conjugated)
>>> print(rings[-1].is_fully_conjugated)

The predicate ‘is_fully_conjugated’ is true for aromatic ring systems and for ring systems where every single bond is conjugated, for example the 5-membered ring in ‘POCTOS’.

POCTOS 2D diagram

POCTOS 2D diagram to exemplify conjugated rings.

>>> poctos = csd.molecule('POCTOS').components[0]
>>> poctos_rings = poctos.rings
>>> print(poctos_rings[0].is_aromatic)
>>> print(poctos.rings[0].is_fully_conjugated)

The atom ‘C6’ in ‘CEMGUY’ is a spiro atom and is in two rings; the atom ‘C5’ is not spiro but is also in two rings. The Atom property, rings, gives access to the list of all rings in which the atom lies, and the Bond property, rings, similarly gives access to the rings of which the bond forms part:

>>> print(cemguy.atom('C6').is_spiro)
>>> print(len(cemguy.atom('C6').rings))
>>> print(cemguy.atom('C5').is_spiro)
>>> print(cemguy.atom('C5').rings) 
[Atom(...)-Atom(...)-Atom(...), Atom(...)-Atom(...)-Atom(...)-Atom(...)-Atom(...)]
>>> print(cemguy.atom('C12').rings)
>>> print(len('C1', 'C5').rings))

Some of the bonds of ‘POCTOS’ are conjugated:

>>>'C6', 'C7').is_conjugated
>>>'C1', 'O1').is_conjugated

We are not analysing the geometry of the bonded system in order to determine conjugation: this may produce false positives.

Working with string representations of mol2 and sdf files

It is sometimes useful to have access to a string representation of a molecule within a Python script, for example to convert a CCDC molecule into an RDKit molecule by making use of a SDF formatted string as an intermediate.

The ccdc.molecule.Molecule.to_string() method is therefore provided for a molecule to be returned as a string in either ‘mol2’ or ‘sdf’ file format.

>>> s = mol.to_string(format='mol2')
>>> print(s) 
    39    41     2     0     0
Generated from the CSD

     1 O1       1.9287   5.1676   1.7209   O.2       1 RES1   0.0000

Similarly the ccdc.molecule.Molecule.from_string() method reads in a molecule from a ‘mol2’, ‘sdf’, ‘mol’, or ‘cif’ formatted string.

>>> m = Molecule.from_string(s, format='mol2')
>>> print('Identifier: %s, Mol.Wt: %.4f' % (m.identifier, m.molecular_weight))
Identifier: ABEBUF, Mol.Wt: 317.3407

The format parameter may be one of ‘mol2’, ‘sdf’, ‘mol’, or ‘cif’.


A SMILES representation of the molecule is accessible through its ccdc.molecule.Molecule.smiles attribute.

Close Contacts

Close contacts, pairs of atoms closer than the sum of there Van der Waals radii, can be identified within a molecule.

>>> abelup = csd.molecule('ABELUP')
>>> print(abelup.contacts())
(Contact(Atom(O1)-Atom(O2)), Contact(Atom(O1)-Atom(O7)), Contact(Atom(O3)-Atom(H2)), Contact(Atom(O4)-Atom(H2)), Contact(Atom(O5)-Atom(H3)), Contact(Atom(O6)-Atom(H3)), Contact(Atom(O7)-Atom(H1)), Contact(Atom(C7)-Atom(H1)), Contact(Atom(O11)-Atom(C9)), Contact(Atom(O11)-Atom(H6)), Contact(Atom(O11)-Atom(H17)), Contact(Atom(C17)-Atom(H6)), Contact(Atom(H5)-Atom(H16)))

The ccdc.molecule.Molecule.Contact instance has a couple of useful properties:

>>> contact = abelup.contacts()[0]
>>> print(contact.atoms)
(Atom(O1), Atom(O2))
>>> print(round(contact.length, 3))
>>> print(round(contact.strength, 3))
>>> print(contact.type)
Van der Waals

Hydrogen Bonds

The hydrogen bonds of the molecule may be obtained by the method ccdc.molecule.Molecule.hbonds().

>>> hbonds = abelup.hbonds()
>>> print(len(hbonds))
>>> print(hbonds)

There is only one hydrogen bond detected. This is an instance of ccdc.molecule.Molecule.HBond which is a derived class from ccdc.molecule.Molecule.Contact and has some additional properties:

>>> hb = hbonds[0]
>>> print(hb.atoms)
(Atom(O7), Atom(H1), Atom(O1))
>>> print('D-H-A: %.3f, H-A: %.3f, D-A: %.3f, Strength: %.3f' % (hb.angle, hb.length, hb.da_distance, hb.strength))
D-H-A: 146.797, H-A: 1.612, D-A: 2.438, Strength: 1.612

HBond Criterion

There is another possible hydrogen bond in this structure between atoms O1 and O2. There is, though, no hydrogen to be donated, but use of the keywork argument ‘hbond_criterion’ to ccdc.molecule.Molecule.hbonds() will allow this case to be detected:

>>> criterion = Molecule.HBondCriterion()
>>> hbonds = abelup.hbonds(hbond_criterion=criterion)
>>> print(hbonds)
(HBond(Atom(O1)-Atom(O2)), HBond(Atom(O1)-Atom(O7)))

The default constructed ccdc.molecule.Molecule.HBondCriterion does not require the hydrogens to be present, and in consequence both potential hydrogen bonds are detected. Note that the hydrogen bonds returned do not contain a hydrogen even though it is present in one of the hydrogen bonds. In consequence the ccdc.molecule.Molecule.HBond.length and ccdc.molecule.Molecule.HBond.angle are both None.

Let us look more closely at the ccdc.molecule.Molecule.HBondCriterion class. We can retrieve the default behaviour of the method by setting the attribute ccdc.molecule.Molecule.HBondCriterion.require_hydrogens to True:

>>> criterion.require_hydrogens = True
>>> print(abelup.hbonds(hbond_criterion=criterion))

We can specify the geometry of the hydrogen bonds detected:

>>> abacuf = csd.molecule('ABACUF01')
>>> hbonds = abacuf.hbonds(hbond_criterion=criterion)
>>> print(hbonds)
(HBond(Atom(O2)-Atom(H7A)-Atom(O8A)), HBond(Atom(O6)-Atom(H5A)-Atom(O7A)), HBond(Atom(O2A)-Atom(H7)-Atom(O8)), HBond(Atom(O6A)-Atom(H5)-Atom(O7)))
>>> print([round(hb.angle, 3) for hb in hbonds])
[147.972, 160.882, 147.972, 160.882]
>>> print([round(hb.da_distance, 3) for hb in hbonds])
[2.788, 2.879, 2.788, 2.879]
>>> criterion.angle_tolerance = 150.0
>>> hbonds = abacuf.hbonds(hbond_criterion=criterion)
>>> print(hbonds)
(HBond(Atom(O6)-Atom(H5A)-Atom(O7A)), HBond(Atom(O6A)-Atom(H5)-Atom(O7)))
>>> print([round(hb.angle, 3) for hb in hbonds])
[160.882, 160.882]
>>> criterion.angle_tolerance = 120.0
>>> criterion.distance_range = (-5.0, -0.67)
>>> hbonds = abacuf.hbonds(hbond_criterion=criterion)
>>> print(hbonds)
(HBond(Atom(O2)-Atom(H7A)-Atom(O8A)), HBond(Atom(O2A)-Atom(H7)-Atom(O8)))
>>> print([round(hb.da_distance, 3) for hb in hbonds])
[2.788, 2.788]

If we wish we can work with distances, rather than Van der Waals corrected distances:

>>> criterion = Molecule.HBondCriterion()
>>> criterion.vdw_corrected = False
>>> print(criterion.distance_range)
(-5.0, 0.0)
>>> criterion.distance_range = (0, 2.8)
>>> hbonds = abacuf.hbonds(hbond_criterion=criterion)
>>> print(hbonds)
(HBond(Atom(O2)-Atom(O8A)), HBond(Atom(O2A)-Atom(O8)))
>>> print([round(hb.da_distance, 3) for hb in hbonds])
[2.788, 2.788]

The :attr::ccdc.molecule.Molecule.HBondCriterion.path_length_range may be used to control whether hydrogen bonds may span distinct components of a molecule. By default this range is set to (4, 999), i.e. hydrogen bonding atoms must be on the same component and separated by at least four atoms. We can change this to detect hydrogen bonds between the components of a molecule:

>>> catkit = csd.molecule('CATKIT')
>>> print(len(catkit.hbonds()))
>>> criterion.path_length_range = (-1, 999)
>>> print(len(catkit.hbonds(hbond_criterion=criterion)))

Within the ccdc.molecule.Molecule.HBondCriterion we have access to the atom types required for forming hydrogen bonds. These may be changed to include or exclude particular hydrogen bonds. The atom types are exposed as dictionary-like attributes of ccdc.molecule.Molecule.HBondCriterion:

>>> criterion = Molecule.HBondCriterion()
>>> donor_types = criterion.donor_types
>>> print(', '.join(k for k in donor_types.keys()))
+nitrogen, metal bound N, imine N, aromatic (6-ring) N, amide or thioamide N, planar N, pyramidal N, ammonium N (NH4+, RNH3+ etc.), unclassified  N, -nitrogen, +oxygen, metal bound O, water O, hydroxyl O, unclassified  O, -oxygen, +sulphur, metal bound S, non metal bound S, -sulphur, +carbon, sp C, sp2 C, sp3 C, aromatic C, carboncationic C, -carbon

As with all dictionary-like objects, care must be taken to use the exact string as a key. The values are ‘YES’, ‘NO’ and ‘NEVER’. The latter is purely heuristic, indicating that it is highly unlikely that this should ever be considered a hydrogen-bonding atom. When setting, the values True and False will work as expected:

>>> print(donor_types['carboncationic C'])
>>> donor_types['carboncationic C'] = True
>>> print(donor_types['carboncationic C'])

This interface is somewhat fiddly and error prone, so there are methods using atoms as exemplars of their type:

>>> donor_atom = abacuf.atoms[5]
>>> print(criterion.is_donor(donor_atom))
>>> print(criterion.donor_atom_type(donor_atom))
metal bound O
>>> criterion.set_donor_type(donor_atom, False)
>>> print(criterion.is_donor(donor_atom))

Now no hydrogen bonds will be found using this criterion:

>>> print(abacuf.hbonds(hbond_criterion=criterion))

Accessing atomic properties

Let us have a look at the atomic properties available to us using the ABEBUF structure. To do this we will need an atom. Let us just use the first one in the molecule.

>>> atom = mol.atoms[0]

We can identify the atom using the index and/or the label. Atoms may have residue labels where these have been set in the molecule.

>>> print(atom.index)
>>> print(atom.label)
>>> print(atom.residue_label)
>>> atom.residue_label = 'RES'
>>> print(atom.residue_label)

There is a convenience method, ccdc.molecule.Molecule.set_residue_by_component() which sets the residue labels of disconnected components of the molecule to distinct values.

To understand the chemistry we can inspect the atomic symbol and the formal charge.

>>> print(atom.atomic_symbol)
>>> print(atom.formal_charge)

There is a property to inspect the partial charge of an atom, where available, and a method of the molecule to calculate partial charges. The method used is that of Gasteiger J.; Marsili M., Tetrahedron 36 (22) 3219-3228 1980. This is comparatively fast, and has good convergence properties, but is applicable only to organic molecules. Accordingly a bool is returned indicating the success of failure of the assignment.

>>> print(atom.partial_charge)
>>> print(mol.assign_partial_charges())
>>> print(round(atom.partial_charge, 2))

If a formal charge is explicitly set for any atom of the molecule, the partial charges will be invalid and so will be removed.

>>> atom.formal_charge = -1
>>> print(atom.partial_charge)

In common with many cheminformatics toolkits, the partial charge properties of atoms may be abused to store an arbitrary value:

>>> atom.partial_charge = 3.14
>>> print(round(atom.partial_charge, 2))

Finally, partial charges may be removed, allowing atomic formal charges to be represented in a mol2 format charge field:

>>> mol.remove_partial_charges()
>>> print(atom.partial_charge)

Furthermore, there are also descriptors to check whether or not an atom is a metal, a hydrogen bond donor, a hydrogen bond acceptor and whether or not it is in a ring system.

>>> atom.is_metal
>>> atom.is_donor
>>> atom.is_acceptor
>>> atom.is_cyclic

Clearly, we also have access to the atomic coordinates.

>>> print(atom.coordinates)
Coordinates(x=1.929, y=5.168, z=1.721)
>>> print(', '.join('%.4f' % a for a in atom.fractional_coordinates))
0.1799, 0.4741, 0.0623


The coordinates returned by ccdc.molecule.Atom.coordinates() is in orthogonal space. If the atom has no coordinates, None will be returned. Similarly fractional_coordinates will return None if no coordinate information is available.

Where the uncertainty of the determination of the coordinates has been recorded, this information may be obtained from the atom. In the case of ABEBUF read from the molecule file, this information is not available:

>>> print(atom.fractional_uncertainties)

However the CSD entry for ABEBUF does contain the requisite data:

>>> abebuf = io.MoleculeReader('CSD').molecule('ABEBUF')
>>> a = abebuf.atoms[0]
>>> print(a.fractional_uncertainties)
(0.17989(8), 0.47410(8), 0.06227(3))
>>> uncertain_val = a.fractional_uncertainties[0]
>>> print(uncertain_val.value)
>>> print(uncertain_val.precision)
>>> print(uncertain_val.uncertainty)

Atoms read from some sources, such as PDB files or CIF files may have atomic displacement parameters recorded. For such atoms there is a property, ccdc.molecule.Atom.displacement_parameters giving access to a class, ccdc.molecule.Atom.DisplacementParameters from which these data may be extracted. Atoms read from the CSD do not yet contain atomic displacement parameters, so the property returns None:

>>> print(atom.displacement_parameters)

However, an atom read from a PDB file will have this information:

>>> file_name = 'pdb1fax.pdb'
>>> from ccdc.protein import Protein
>>> protein = Protein.from_file(file_name)
>>> atom_with_adps = protein.atoms[0]
>>> adps = atom_with_adps.displacement_parameters
>>> print(adps.type)
>>> print(adps.temperature_factor)

There are other, more specialised data available: please see the documentation of the class for more details.

There is a convenience property to determine whether all the heavy atoms of a molecule have 3D coordinate information:

>>> mol.is_3d

We have access to the neighbouring atoms:

>>> atom.neighbours

and to the bonds to this atom:

>>> atom.bonds
(Bond(Double Atom(O1) Atom(C1)),)

We can access R/S chirality information for atoms:

>>> duxyea = io.MoleculeReader('CSD').molecule('DUXYEA')
>>> print(', '.join('(%s, %s)' % (a.label, a.chirality) for a in duxyea.atoms if a.is_chiral)) 
(C1, S), (C3, R), (C6, R), (C7, S)...

We will also detect para-chiral centres where the chirality of the atom depends only on the chirality of connected atoms:

>>> hptrol = io.MoleculeReader('csd').molecule('HPTROL')
>>> hptrol.add_hydrogens()
>>> print(', '.join('(%s, %s)' % (a.label, a.chirality) for a in hptrol.atoms if a.is_chiral)) # doctest: +NORMALIZE_WHITESPACE
(C2, R), (C4, S), (C6, S)
HPTROL 2D diagram.

HPTROL 2D diagram, to exemplify para-chiral centres.

Here the central C4 atom is para-chiral since the atoms C2 and C6 have opposite chirality.

Accessing bond properties

Let us have a look at the bond properties available to us using the ABEBUF structure. To do this we will need a bond. Let us just use the first one in the molecule.

>>> bond = mol.bonds[0]

To identify the bond we can access the two atoms connected by it.

>>> atom1, atom2 = bond.atoms
>>> print(atom1.label)
>>> print(atom2.label)

To understand the chemistry we can check the bond type.

>>> print(bond.bond_type)

Note that we can also use integers to check the type of a bond.

>>> bond.bond_type == 1
>>> bond.bond_type == 2

The integers used to represent bond types according to the CSD convention are.

Type Integer
Unknown 0
Single 1
Double 2
Triple 3
Quadruple 4
Aromatic 5
Delocalised 7
Pi 9

It is also possible to check whether or not a bond is cyclic and/or rotatable.

>>> bond.is_cyclic
>>> bond.is_rotatable