Working with proteins


The ccdc.protein module contains a single ccdc.protein.Protein class and two nested ccdc.protein.Protein.Chain, ccdc.protein.Protein.Residue classes which are used to load and manipulate proteins in the CSD Python API.

Proteins can be read from files in PDB, Mol2 or mmCIF format.

>>> from ccdc.protein import Protein
>>> protein_fax = '1fax_protein.mol2'
>>> protein_3cen = 'pdb3cen.ent'
>>> protein_1crn = '1crn.cif'
>>> p_1fax = Protein.from_file(protein_fax)
>>> p_3cen = Protein.from_file(protein_3cen)
>>> p_1crn = Protein.from_file(protein_1crn)

Accessing protein properties

Once we have a protein, we can access its chains, residues, waters, metals and any bound ligands or cofactors. As an example, we can get the number of chains and the number of residues in each chain for the protein 1fax.

>>> print("Protein has {} chains and {} residues.".format(len(p_1fax.chains),len(p_1fax.residues)))
Protein has 2 chains and 290 residues.
>>> for chain in p_1fax.chains:
...     print("Chain {} has {} residues.".format(chain.identifier, len(chain.residues)))
Chain A has 235 residues.
Chain L has 55 residues.

We can look into the detail of the residues of 1crn.

>>> print('1CRN has sequence', '-'.join(r.three_letter_code for r in p_1crn.residues))
>>> print(f'1CRN has {len([r for r in p_1crn.residues if r.cysteine_sulphur])} cysteine sulfur atoms')
1CRN has 6 cysteine sulfur atoms

Also, we can access the identifier of the first ligand in protein 3cen.

>>> print('{} has a ligand with identifier {}'.format(p_3cen.identifier, p_3cen.ligands[0].identifier))
3CEN has a ligand with identifier A:FXA1

Or we can get the number of water molecules and metal atoms in protein 3cen.

>>> print('{} has {} waters and {} metals'.format(p_3cen.identifier, len(p_3cen.waters), len(p_3cen.metals)))
3CEN has 279 waters and 0 metals

Similarly, to access the nucleic acids and their nucleotides.

>>> p_1aay = Protein.from_file(protein_1aay)
>>> print("Protein has {} nucleic acids and {} nucleotides.".format(len(p_1aay.nucleic_acids),len(p_1aay.nucleotides)))
Protein has 2 nucleic acids and 22 nucleotides.
>>> for nucleic_acid in p_1aay.nucleic_acids:
...     print("Nucleic Acid {} has {} nucleotides.".format(nucleic_acid.identifier, len(nucleic_acid.nucleotides)))
Nucleic Acid B has 11 nucleotides.
Nucleic Acid C has 11 nucleotides.
>>> nucleic_acid = p_1aay['B']
>>> print("The base sequence for Nucleic Acid {} is {}".format(nucleic_acid.identifier, nucleic_acid.sequence))
The base sequence for Nucleic Acid B is AGCGTGGGCGT

There are some attributes on ccdc.molecule.Atom which may be helpful in categorising the atoms of the protein. For example:

>>> p_1aay.nucleotides[0].atoms[0].is_in_protein
>>> print(p_1aay.nucleotides[0].atoms[0].protein_atom_type)
>>> print(p_1aay.waters[0].atoms[0].protein_atom_type)
>>> print(p_1fax.atoms[0].protein_atom_type)
>>> print(p_3cen.ligands[0].atoms[0].protein_atom_type)

Manipulating proteins

As well as drilling down to individual residues and atoms in the protein, we can perform a number of useful operations on the protein as a whole. For example, we might want to remove unnecessary waters, metals and ligands to prepare the protein for docking.

>>> for metal in p_1fax.metals:
...     p_1fax.remove_metal(metal)
>>> for ligand in p_1fax.ligands:
...     p_1fax.remove_ligand(ligand.identifier)
>>> p_1fax.remove_all_waters()
>>> p_1fax.remove_chain('L')

Before using a protein structure in any structure-based drug design project it is important to add all hydrogen atoms. The ccdc.protein.Protein.add_hydrogens() method can be used to protonate a protein and its components (e.g., ligands, cofactors and waters). By default, all metal bonds will be removed, atom charges will be set to zero, the correct ionisation and tautomeric state will be assigned to residues such as Asp, Glu and Arg, a file containing SMARTS-based rules will be used to protonate the ligand molecules. View the GOLD documentation for further details.

>>> p_3cen.add_hydrogens()

Note that adding hydrogens adds the atoms to the end of the list of atoms. For some situations (for example, when writing out PDB format files) it is preferrable to have all the atoms in a given residue in a single block of atoms. If this is desirable, the ccdc.protein.Protein.sort_atoms_by_residue() will sort the atoms so this is so.

>>> p_3cen.sort_atoms_by_residue()

A ccdc.protein.Protein is a ccdc.molecule.Molecule too

Since proteins are also molecules, we can manipulate the protein with any of the molecule methods. As an example we can write the modified protein into a mol2 file.

>>> from import MoleculeWriter
>>> with MoleculeWriter('3cen_protonated.mol2') as protein_writer:
...     protein_writer.write(p_3cen)

Superposing protein chains and binding sites

Two protein chains may be superposed using sequence alignment followed by atom overlay. The Smith-Waterman algorithm is used to generate global sequence alignments which give a pair-wise matching of one residue to another which can then be used for overlay.

>>> protein_file_1e2h = '1e2h.pdb'
>>> protein_file_1of1 = '1of1.pdb'
>>> protein_1e2h = Protein.from_file(protein_file_1e2h)
>>> protein_1of1 = Protein.from_file(protein_file_1of1)
>>> chain_superposition = Protein.ChainSuperposition()
>>> (rmsd, transformation) = chain_superposition.superpose(protein_1of1.chains[0], protein_1e2h.chains[0])

The root-mean-square deviation of the atom overlay and overlay transformation matrix are returned.

>>> print('chain overlay rmsd is {:6.4f}'.format(rmsd))
chain overlay rmsd is 0.2129

Alternatively, a binding site of a given radius can be specified for the first protein chain.

>>> ligand_1of1 = protein_1of1.ligands[0]
>>> print(ligand_1of1.identifier)
>>> binding_site_1of1 = Protein.BindingSiteFromMolecule(protein_1of1, ligand_1of1, 6.)
>>> (rmsd, transformation) = chain_superposition.superpose(protein_1of1.chains[0], protein_1e2h.chains[0], binding_site_1of1)
>>> print('binding site overlay rmsd is {:6.4f}'.format(rmsd))
binding site overlay rmsd is 0.1225