Working with proteins¶
Introduction¶
The ccdc.protein
module contains a single ccdc.protein.Protein
class and two nested ccdc.protein.Protein.Chain
, ccdc.protein.Protein.Residue
classes which are used to load and manipulate proteins in the CSD Python API.
Proteins can be read from files in PDB, Mol2 or mmCIF format.
>>> from ccdc.protein import Protein
>>> protein_fax = '1fax_protein.mol2'
>>> protein_3cen = 'pdb3cen.ent'
>>> protein_1crn = '1crn.cif'
>>> p_1fax = Protein.from_file(protein_fax)
>>> p_3cen = Protein.from_file(protein_3cen)
>>> p_1crn = Protein.from_file(protein_1crn)
Accessing protein properties¶
Once we have a protein, we can access its chains, residues, waters, metals and any bound ligands or cofactors. As an example, we can get the number of chains and the number of residues in each chain for the protein 1fax.
>>> print("Protein has {} chains and {} residues.".format(len(p_1fax.chains),len(p_1fax.residues)))
Protein has 2 chains and 290 residues.
>>> for chain in p_1fax.chains:
... print("Chain {} has {} residues.".format(chain.identifier, len(chain.residues)))
Chain A has 235 residues.
Chain L has 55 residues.
We can look into the detail of the residues of 1crn.
>>> print('1CRN has sequence', '-'.join(r.three_letter_code for r in p_1crn.residues))
1CRN has sequence THR-THR-CYS-CYS-PRO-SER-ILE-VAL-ALA-ARG-SER-ASN-PHE-ASN-VAL-CYS-ARG-LEU-PRO-GLY-THR-PRO-GLU-ALA-ILE-CYS-ALA-THR-TYR-THR-GLY-CYS-ILE-ILE-ILE-PRO-GLY-ALA-THR-CYS-PRO-GLY-ASP-TYR-ALA-ASN
>>> print(f'1CRN has {len([r for r in p_1crn.residues if r.cysteine_sulphur])} cysteine sulfur atoms')
1CRN has 6 cysteine sulfur atoms
Also, we can access the identifier of the first ligand in protein 3cen.
>>> print('{} has a ligand with identifier {}'.format(p_3cen.identifier, p_3cen.ligands[0].identifier))
3CEN has a ligand with identifier A:FXA1
Or we can get the number of water molecules and metal atoms in protein 3cen.
>>> print('{} has {} waters and {} metals'.format(p_3cen.identifier, len(p_3cen.waters), len(p_3cen.metals)))
3CEN has 279 waters and 0 metals
Similarly, to access the nucleic acids and their nucleotides.
>>> p_1aay = Protein.from_file(protein_1aay)
>>> print("Protein has {} nucleic acids and {} nucleotides.".format(len(p_1aay.nucleic_acids),len(p_1aay.nucleotides)))
Protein has 2 nucleic acids and 22 nucleotides.
>>> for nucleic_acid in p_1aay.nucleic_acids:
... print("Nucleic Acid {} has {} nucleotides.".format(nucleic_acid.identifier, len(nucleic_acid.nucleotides)))
Nucleic Acid B has 11 nucleotides.
Nucleic Acid C has 11 nucleotides.
>>> nucleic_acid = p_1aay['B']
>>> print("The base sequence for Nucleic Acid {} is {}".format(nucleic_acid.identifier, nucleic_acid.sequence))
The base sequence for Nucleic Acid B is AGCGTGGGCGT
There are some attributes on ccdc.molecule.Atom
which may be helpful in categorising the atoms of the protein. For example:
>>> p_1aay.nucleotides[0].atoms[0].is_in_protein
True
>>> print(p_1aay.nucleotides[0].atoms[0].protein_atom_type)
Nucleotide
>>> print(p_1aay.waters[0].atoms[0].protein_atom_type)
Water
>>> print(p_1fax.atoms[0].protein_atom_type)
Amino_acid
>>> print(p_3cen.ligands[0].atoms[0].protein_atom_type)
Ligand
Manipulating proteins¶
As well as drilling down to individual residues and atoms in the protein, we can perform a number of useful operations on the protein as a whole. For example, we might want to remove unnecessary waters, metals and ligands to prepare the protein for docking.
>>> for metal in p_1fax.metals:
... p_1fax.remove_metal(metal)
>>>
>>> for ligand in p_1fax.ligands:
... p_1fax.remove_ligand(ligand.identifier)
>>>
>>> p_1fax.remove_all_waters()
>>> p_1fax.remove_chain('L')
Before using a protein structure in any structure-based drug design project it is important to add all hydrogen atoms.
The ccdc.protein.Protein.add_hydrogens()
method can be used to protonate a protein and its components (e.g.,
ligands, cofactors and waters). By default, all metal bonds will be removed, atom charges will be set to zero, the correct
ionisation and tautomeric state will be assigned to residues such as Asp, Glu and Arg, a file containing SMARTS-based
rules will be used to protonate the ligand molecules. View
the GOLD documentation for further details.
>>> p_3cen.add_hydrogens()
Note that adding hydrogens adds the atoms to the end of the list of atoms. For some situations (for example, when writing out PDB format files) it is preferrable to have all the atoms in a given residue in a single block of atoms. If this is desirable, the ccdc.protein.Protein.sort_atoms_by_residue()
will sort the atoms so this is so.
>>> p_3cen.sort_atoms_by_residue()
A ccdc.protein.Protein
is a ccdc.molecule.Molecule
too¶
Since proteins are also molecules, we can manipulate the protein with any of the molecule methods. As an example we can write the modified protein into a mol2 file.
>>> from ccdc.io import MoleculeWriter
>>> with MoleculeWriter('3cen_protonated.mol2') as protein_writer:
... protein_writer.write(p_3cen)
...
Superposing protein chains and binding sites¶
Two protein chains may be superposed using sequence alignment followed by atom overlay. The Smith-Waterman algorithm is used to generate global sequence alignments which give a pair-wise matching of one residue to another which can then be used for overlay.
>>> protein_file_1e2h = '1e2h.pdb'
>>> protein_file_1of1 = '1of1.pdb'
>>> protein_1e2h = Protein.from_file(protein_file_1e2h)
>>> protein_1of1 = Protein.from_file(protein_file_1of1)
>>> chain_superposition = Protein.ChainSuperposition()
>>> (rmsd, transformation) = chain_superposition.superpose(protein_1of1.chains[0], protein_1e2h.chains[0])
The root-mean-square deviation of the atom overlay and overlay transformation matrix are returned.
>>> print('chain overlay rmsd is {:6.4f}'.format(rmsd))
chain overlay rmsd is 0.2129
Alternatively, a binding site of a given radius can be specified for the first protein chain.
>>> ligand_1of1 = protein_1of1.ligands[0]
>>> print(ligand_1of1.identifier)
A:SCT400
>>> binding_site_1of1 = Protein.BindingSiteFromMolecule(protein_1of1, ligand_1of1, 6.)
>>> (rmsd, transformation) = chain_superposition.superpose(protein_1of1.chains[0], protein_1e2h.chains[0], binding_site_1of1)
>>> print('binding site overlay rmsd is {:6.4f}'.format(rmsd))
binding site overlay rmsd is 0.1225