Protein API

Introduction

The main class of the ccdc.protein module is ccdc.protein.Protein.

A ccdc.protein.Protein contains attributes and functions that relate to protein structures.

API

class ccdc.protein.Protein(identifier, _molecule=None, _protein_structure=None, _cell=None)[source]
class BindingSite(protein, whole_residues=True)[source]

A binding site in the protein.

property atoms

The atoms of the cavity.

property cofactors

The cofactors of the cavity.

property formula

Return the chemical formula of the molecules in the binding site.

property ligands

The ligands of the cavity.

property metals

The metals in the cavity.

property nucleotides

The nucleotides of the cavity.

property residues

The residues of the cavity.

property waters

The waters of the cavity.

class BindingSiteFromAtom(protein, atom, distance)[source]

A binding site defined from a protein atom.

class BindingSiteFromListOfAtoms(protein, atoms)[source]

A binding site defined from a list of protein atoms.

class BindingSiteFromListOfResidues(protein, list_of_residues)[source]

A binding site from a list of residues.

class BindingSiteFromMolecule(protein, molecule, distance, whole_residues=True)[source]

A binding site defined from an arbitrary molecule.

class BindingSiteFromPoint(protein, origin=(0, 0, 0), distance=12.0)[source]

A cavity defined from a point.

class BindingSiteFromResidue(protein, residue, distance)[source]

A binding site defined from a protein residue.

class Chain(index, _protein_structure=None)[source]

A chain of a protein.

property author_identifier

The author provided identifier of the chain.

property identifier

The identifier of the chain.

property index

The index of the chain in the protein.

property residues

The residues of a chain.

property sequence

The sequence of amino acid one letter codes in this chain.

class ChainSuperposition(settings=None)[source]

Class for superposition of protein chains using sequence alignment.

class Settings[source]

Configuration options for the superposition of protein chains.

overlay_convergence_tolerance

tolerance for convergence in overlay

overlay_minimum_cycles

minimum number of cycles in overlay

overlay_weighting_factor

weighting factor to use in overlay

sequence_alignment_tool

external sequence alignment program

sequence_search_tool

external sequence search program

superposition_atoms

protein chain atoms to use in overlay (RIGID, BACKBONE or CALPHA)

superpose(chain1, chain2, binding_site1=None)[source]

Superpose two protein chains or binding sites.

An implementation of the Smith-Waterman algorithm is used unless an external sequence alignment tool is specified in the settings.

If a binding site is supplied for the first chain, only the atoms in the binding site will be overlaid.

Parameters:
  • chain1 – a ccdc.protein.Chain instance

  • chain2 – a ccdc.protein.Chain instance

  • binding_site1 – a ccdc.protein.BindingSite instance for the first chain

Returns:

the root-mean-square deviation (RMSD) of the overlay and the transformation matrix
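For readers unfamiliar with the alignment step, the following is a minimal sketch of Smith-Waterman local alignment in plain Python. It is not the implementation used by ChainSuperposition: the function name smith_waterman and the scoring values (match, mismatch, gap) are assumptions chosen for illustration, and real protein alignments normally score pairs with a substitution matrix rather than a flat match/mismatch value.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned1, aligned2) for the best local alignment.

    Illustrative only: flat match/mismatch scores and a linear gap penalty.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; local alignment never lets a cell drop below 0.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair,  # align the two letters
                score[i - 1][j] + gap,       # gap in seq2
                score[i][j - 1] + gap,       # gap in seq1
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    aligned1, aligned2 = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            aligned1.append(seq1[i - 1])
            aligned2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned1.append(seq1[i - 1])
            aligned2.append("-")
            i -= 1
        else:
            aligned1.append("-")
            aligned2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(aligned1)), "".join(reversed(aligned2))
```

With the sequence property of two chains as input, the aligned positions would identify which residues to pair up before the overlay; how the library maps the alignment onto atoms is not described here.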

class EntitySrcGen(_src_gen)[source]

A single entity_src_gen entry of a protein, read from PDBx/mmCIF format.

property entity_src_gen_entity_id: str

Returns the value of entity_src_gen_entity_id.

property entity_src_gen_expression_system_id: str

Returns the value of entity_src_gen_expression_system_id.

property entity_src_gen_gene_src_common_name: str

Returns the value of entity_src_gen_gene_src_common_name.

property entity_src_gen_gene_src_details: str

Returns the value of entity_src_gen_gene_src_details.

property entity_src_gen_gene_src_genus: str

Returns the value of entity_src_gen_gene_src_genus.

property entity_src_gen_gene_src_species: str

Returns the value of entity_src_gen_gene_src_species.

property entity_src_gen_gene_src_strain: str

Returns the value of entity_src_gen_gene_src_strain.

property entity_src_gen_gene_src_tissue: str

Returns the value of entity_src_gen_gene_src_tissue.

property entity_src_gen_gene_src_tissue_fraction: str

Returns the value of entity_src_gen_gene_src_tissue_fraction.

property entity_src_gen_host_org_common_name: str

Returns the value of entity_src_gen_host_org_common_name.

property entity_src_gen_host_org_details: str

Returns the value of entity_src_gen_host_org_details.

property entity_src_gen_host_org_genus: str

Returns the value of entity_src_gen_host_org_genus.

property entity_src_gen_host_org_species: str

Returns the value of entity_src_gen_host_org_species.

property entity_src_gen_pdbx_alt_source_flag: str

Returns the value of entity_src_gen_pdbx_alt_source_flag.

property entity_src_gen_pdbx_beg_seq_num: int

Returns the value of entity_src_gen_pdbx_beg_seq_num.

property entity_src_gen_pdbx_description: str

Returns the value of entity_src_gen_pdbx_description.

property entity_src_gen_pdbx_end_seq_num: int

Returns the value of entity_src_gen_pdbx_end_seq_num.

property entity_src_gen_pdbx_gene_src_atcc: str

Returns the value of entity_src_gen_pdbx_gene_src_atcc.

property entity_src_gen_pdbx_gene_src_cell: str

Returns the value of entity_src_gen_pdbx_gene_src_cell.

property entity_src_gen_pdbx_gene_src_cell_line: str

Returns the value of entity_src_gen_pdbx_gene_src_cell_line.

property entity_src_gen_pdbx_gene_src_cellular_location: str

Returns the value of entity_src_gen_pdbx_gene_src_cellular_location.

property entity_src_gen_pdbx_gene_src_fragment: str

Returns the value of entity_src_gen_pdbx_gene_src_fragment.

property entity_src_gen_pdbx_gene_src_gene: str

Returns the value of entity_src_gen_pdbx_gene_src_gene.

property entity_src_gen_pdbx_gene_src_ncbi_taxonomy_id: str

Returns the value of entity_src_gen_pdbx_gene_src_ncbi_taxonomy_id.

property entity_src_gen_pdbx_gene_src_organ: str

Returns the value of entity_src_gen_pdbx_gene_src_organ.

property entity_src_gen_pdbx_gene_src_organelle: str

Returns the value of entity_src_gen_pdbx_gene_src_organelle.

property entity_src_gen_pdbx_gene_src_scientific_name: str

Returns the value of entity_src_gen_pdbx_gene_src_scientific_name.

property entity_src_gen_pdbx_gene_src_variant: str

Returns the value of entity_src_gen_pdbx_gene_src_variant.

property entity_src_gen_pdbx_host_org_atcc: str

Returns the value of entity_src_gen_pdbx_host_org_atcc.

property entity_src_gen_pdbx_host_org_cell: str

Returns the value of entity_src_gen_pdbx_host_org_cell.

property entity_src_gen_pdbx_host_org_cell_line: str

Returns the value of entity_src_gen_pdbx_host_org_cell_line.

property entity_src_gen_pdbx_host_org_cellular_location: str

Returns the value of entity_src_gen_pdbx_host_org_cellular_location.

property entity_src_gen_pdbx_host_org_culture_collection: str

Returns the value of entity_src_gen_pdbx_host_org_culture_collection.

property entity_src_gen_pdbx_host_org_gene: str

Returns the value of entity_src_gen_pdbx_host_org_gene.

property entity_src_gen_pdbx_host_org_ncbi_taxonomy_id: str

Returns the value of entity_src_gen_pdbx_host_org_ncbi_taxonomy_id.

property entity_src_gen_pdbx_host_org_organ: str

Returns the value of entity_src_gen_pdbx_host_org_organ.

property entity_src_gen_pdbx_host_org_organelle: str

Returns the value of entity_src_gen_pdbx_host_org_organelle.

property entity_src_gen_pdbx_host_org_scientific_name: str

Returns the value of entity_src_gen_pdbx_host_org_scientific_name.

property entity_src_gen_pdbx_host_org_strain: str

Returns the value of entity_src_gen_pdbx_host_org_strain.

property entity_src_gen_pdbx_host_org_tissue: str

Returns the value of entity_src_gen_pdbx_host_org_tissue.

property entity_src_gen_pdbx_host_org_tissue_fraction: str

Returns the value of entity_src_gen_pdbx_host_org_tissue_fraction.

property entity_src_gen_pdbx_host_org_variant: str

Returns the value of entity_src_gen_pdbx_host_org_variant.

property entity_src_gen_pdbx_host_org_vector: str

Returns the value of entity_src_gen_pdbx_host_org_vector.

property entity_src_gen_pdbx_host_org_vector_type: str

Returns the value of entity_src_gen_pdbx_host_org_vector_type.

property entity_src_gen_pdbx_seq_type: str

Returns the value of entity_src_gen_pdbx_seq_type.

property entity_src_gen_pdbx_src_id: int

Returns the value of entity_src_gen_pdbx_src_id.

property entity_src_gen_plasmid_details: str

Returns the value of entity_src_gen_plasmid_details.

property entity_src_gen_plasmid_name: str

Returns the value of entity_src_gen_plasmid_name.

class NucleicAcid(index, _protein_structure=None)[source]

A nucleic acid of a protein.

property author_identifier

The author provided identifier of the nucleic acid.

property identifier

The identifier of the nucleic acid.

property index

The index of the nucleic acid in the protein.

property nucleotides

The nucleotides of the nucleic acid.

property sequence

The sequence of nucleotide one letter codes in this nucleic acid.

class Nucleotide(index, _nucleotide)[source]

A single nucleotide of a nucleic acid.

property atoms

The atoms of the nucleotide.

property author_identifier

The author provided identifier of this nucleotide.

property code

The PDB nucleotide code.

property identifier

The identifier of this nucleotide.

property index

The index of this nucleotide in its nucleic acid.

property nucleic_acid_author_identifier

The author provided identifier of the nucleic acid of which this nucleotide is a part.

property nucleic_acid_identifier

The identifier of the nucleic acid of which this nucleotide is a part.

property one_letter_code

The nucleotide one letter code.

class Residue(i, _residue)[source]

A single amino acid residue of a protein.

property atoms

The atoms of the residue.

property author_identifier

The author provided identifier of this residue.

property backbone_atoms

The backbone atoms of the amino acid.

property c_alpha

The C alpha atom of the residue.

property c_beta

The C beta atom, or None if there is no C beta atom.

property c_terminus

The C terminus atom.

property carbonyl_oxygen

The carbonyl oxygen atom.

property chain_author_identifier

The author provided identifier of the chain of which this residue is a part.

property chain_identifier

The identifier of the chain of which this residue is a part.

property cysteine_sulphur

The sulphur of a cysteine residue, or None if not a cysteine.

property identifier

The identifier of this residue.

property index

The index of this residue in its chain.

property ins_code

The PDB insertion code.

property is_acidic

Whether the residue is acidic.

property is_basic

Whether the residue is basic.

property is_hydrophilic

Whether the residue is hydrophilic.

property is_hydrophobic

Whether the residue is hydrophobic.

property n_terminus

The N terminus atom.

property one_letter_code

The one letter code of the amino acid.

property sidechain_atoms

The sidechain atoms of this amino acid.

property three_letter_code

The three letter code of the amino acid.
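As an illustration of how the boolean residue properties might be used together, here is a small sketch that tallies residues by class. The helper name residue_class_counts is hypothetical, and the demo objects are stand-ins with made-up flag values so that the example runs without the ccdc package; it relies only on the four is_* properties listed above.

```python
from collections import Counter
from types import SimpleNamespace


def residue_class_counts(residues):
    """Count residues by the four documented boolean class properties."""
    counts = Counter()
    for residue in residues:
        for name in ("is_acidic", "is_basic", "is_hydrophilic", "is_hydrophobic"):
            if getattr(residue, name):
                counts[name] += 1
    return dict(counts)


# Stand-in residues (hypothetical flag values, not taken from the library).
# With the real API these would come from the residues property of a
# Protein or Chain.
demo_residues = [
    SimpleNamespace(is_acidic=True, is_basic=False,
                    is_hydrophilic=True, is_hydrophobic=False),
    SimpleNamespace(is_acidic=False, is_basic=False,
                    is_hydrophilic=False, is_hydrophobic=True),
    SimpleNamespace(is_acidic=False, is_basic=False,
                    is_hydrophilic=False, is_hydrophobic=True),
]
demo_counts = residue_class_counts(demo_residues)
```

A residue can contribute to more than one count (for example acidic and hydrophilic), so the totals need not sum to the number of residues.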

add_cofactor(molecule)[source]

Add a molecule to the protein as a cofactor.

add_hydrogens(mode='All', rules_file=None)[source]

Add hydrogens to the protein structure.

This method protonates the protein structure by performing the following operations:

  • Remove metal bonds

  • Assign ligand and cofactor bond types and standardise aromatic and delocalised bonds to CSD conventions

  • Set atom charges to zero

  • Set bond types for ARG, GLU, ASP appropriately

  • Apply protonation rules to ligands and cofactors

  • Add hydrogens to protein, ligands, cofactors, nucleic acids and waters where necessary

  • Set any remaining unknown bond type to single

Parameters:
  • mode – ‘all’ to generate all hydrogens (removing any existing hydrogens first) or ‘missing’ to generate only the hydrogens deemed to be missing.

  • rules_file – file of rules that express special cases. If None, a default version will be used.

Raises:
  • FileNotFoundError – if the rules_file passed in does not exist

  • ValueError – if mode is neither ‘all’ nor ‘missing’
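A minimal sketch of choosing between the two documented modes is given below. The wrapper name protonate and its keep_existing flag are hypothetical; the function accepts any object that exposes add_hydrogens(mode, rules_file) as documented above, so it can be exercised without the ccdc package.

```python
def protonate(protein, keep_existing=True, rules_file=None):
    """Call add_hydrogens with 'missing' or 'all' and return the mode used.

    keep_existing=True  -> 'missing': only add hydrogens deemed to be missing.
    keep_existing=False -> 'all': existing hydrogens are removed and regenerated.
    Any FileNotFoundError or ValueError raised by add_hydrogens propagates.
    """
    mode = "missing" if keep_existing else "all"
    protein.add_hydrogens(mode=mode, rules_file=rules_file)
    return mode


# Intended use with the real API (the file name is a placeholder):
#     from ccdc.protein import Protein
#     protein = Protein.from_file("structure.pdb")
#     protonate(protein)
```

Passing the mode explicitly avoids depending on the default, which regenerates every hydrogen.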

add_ligand(molecule)[source]

Add a molecule to the protein as a ligand.

are_labels_pdb_compliant()[source]

Are labels writeable in PDB format?

assign_unique_chain_identifiers(mode='all')[source]

Assigns unique chain identifiers to disconnected chain fragments and/or non-chain components.

Occasionally, for example when merging two proteins, a protein can end up with multiple chains that share the same chain identifier. This can cause problems later: for example, file writers may mix up the order of residues, because the code will ‘see’ the residues as belonging to the same chain and sort them by sequence number.

Calling this method will reassign chain identifiers to all components. Each disconnected chain will get a unique chain ID. Note that there is no guarantee that a chain keeps its old ID. If, say, the first chain was previously B and the second A, the chain identifiers may be reversed.

Ligands, cofactors, metals and waters will also get new chain IDs, derived from the chain they are closest to overall (i.e. a ligand is given the chain ID of the chain that has the most atoms within 5.0 Angstroms).

Note that calling this method will probably invalidate any pre-existing indexes into the atom list, since changing IDs can re-order atoms and residues.

Parameters:

mode – ‘all’ (the default) will assign new IDs to the chains and then give each non-chain component (e.g. cofactors, waters, ligands and metals) the chain ID of the chain it is nearest to. ‘nonchain’ will only re-assign the chain IDs of cofactors, waters, ligands and metals, based on chain proximity.
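The proximity rule for non-chain components can be sketched in plain Python as follows. This is one reading of "the chain that has the most atoms within 5.0 Angstroms", not the library's code: the function name nearest_chain_id, the input layout, the choice to count a chain atom if it lies within the cutoff of any ligand atom, and the alphabetical tie-break are all assumptions.

```python
import math
from collections import Counter


def nearest_chain_id(ligand_coords, chain_atoms, cutoff=5.0):
    """Pick the chain with the most atoms within `cutoff` of the ligand.

    ligand_coords: list of (x, y, z) tuples for the ligand atoms.
    chain_atoms:   iterable of (chain_id, (x, y, z)) for the chain atoms.
    Returns None if no chain atom lies within the cutoff.
    """
    counts = Counter()
    for chain_id, xyz in chain_atoms:
        # Count a chain atom once if it is close to any ligand atom.
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_coords):
            counts[chain_id] += 1
    if not counts:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    return max(sorted(counts), key=counts.get)
```

For example, a ligand at the origin with two atoms of chain A and one atom of chain B inside the cutoff would be assigned to chain A.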

property cavity_atoms

The atoms making up the binding site, if this was read from a GOLD protein.

property cavity_residues

The residues making up the cavity.

property chains

A tuple of ccdc.protein.Protein.Chain.

property cofactors

The tuple of cofactors in the protein.

The identifier of the molecule is of the form chain_id:residue_name.

Note that hydrogen atoms are added automatically to the returned molecules; however, these are not added to the parent protein.

copy()[source]

Copies the protein.

detect_ligand_bonds(covalent_links='include')[source]

Removes all bonds between ligand or cofactor atoms, and redetects them based on distance between atoms.

This can be useful if the bonds given by the CONECT records in the PDB file are missing or undesirable.

Parameters:

covalent_links – ‘include’ to include covalent links between the protein and the ligand (the default), or ‘exclude’ to remove them.

property entity_src_gens: list[EntitySrcGen]

The source and expression details for a protein entity in PDBx/mmCIF format.

static from_entry(entry)[source]

Constructs a protein from a given ccdc.entry.Entry.

Parameters:

entry – Entry from which to construct the protein.

static from_file(file_name)[source]

Reads a protein from a file, and constructs the protein.

static known_cofactor_codes()[source]

Provide access to the list of known cofactor codes in the underlying library.

property ligands

The tuple of ligands in the protein.

The identifier of the molecule is of the form chain_id:residue_name.

Note that hydrogen atoms are added automatically to the returned molecules; however, these are not added to the parent protein.

property metals

The metal atoms of the protein.

normalise_labels(mode='pdb')[source]

Normalise labels of atoms in the protein structure.

Parameters:

mode – the normalisation mode:

  • ‘pdb’ (the default) will try to normalise the labels to PDB compliance if possible (i.e. no longer than 4 characters). Labels that are already compliant will not be changed.

  • ‘force’ will run the normalisation regardless of whether the labels are already compliant.

  • ‘molecule’ will normalise using ccdc.molecule.Molecule.normalise_labels.

property nucleic_acids

A tuple of ccdc.protein.Protein.NucleicAcid.

property nucleotides

The nucleotides of the protein.

remove_all_metals()[source]

Removes all metals from the protein.

remove_all_waters()[source]

Removes all waters from the protein.

remove_chain(chain_id)[source]

Remove the chain with the given identifier.

remove_cofactor(cofactor_id)[source]

Remove the specified cofactor.

Parameters:

cofactor_id – str, of the form chain_id:cofactor_id.

remove_hydrogens()[source]

Remove all hydrogens from the protein.

remove_ligand(ligand_id)[source]

Remove the specified ligand.

Parameters:

ligand_id – str, of the form chain_id:ligand_id.
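Because ligand identifiers have the form chain_id:residue_name, ligands can be selected by chain before removal. The sketch below assumes only what is documented here: a ligands property whose items carry an identifier, and remove_ligand taking that identifier. The helper name remove_ligands_in_chain is hypothetical.

```python
def remove_ligands_in_chain(protein, chain_id):
    """Remove every ligand whose identifier starts with `chain_id:`.

    Returns the list of identifiers that were removed.
    """
    # Collect the identifiers first, so the ligands tuple is not being
    # read while the protein is modified.
    targets = [
        ligand.identifier
        for ligand in protein.ligands
        if ligand.identifier.split(":", 1)[0] == chain_id
    ]
    for identifier in targets:
        protein.remove_ligand(identifier)
    return targets
```

The same pattern would apply to cofactors and remove_cofactor, whose identifiers follow the same form.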

remove_metal(atom)[source]

Remove the given metal atom.

remove_metal_bonds(bonds=None)[source]

Removes metal bonds.

Parameters:

bonds – iterable of ccdc.molecule.Bond instances. If None all metal bonds will be removed.

remove_nucleic_acid(nucleic_acid_chain_id)[source]

Remove the nucleic acid with the given identifier.

remove_nucleotide(nucleotide_id)[source]

Remove the specified nucleotide.

remove_residue(residue_id)[source]

Remove the specified residue.

remove_water(water_mols)[source]

Remove the given water or waters. If water_mols is a list or tuple of water objects, every water in it is removed.

property residues

The amino acid residues of the protein.

property sequence

The one-letter code sequence.

sort_atoms_by_residue()[source]

Sorts atoms by residue.

After editing, sometimes the underlying atom list in a protein is not sorted by residue so atoms in a single residue are not in a single block of atoms. In particular, adding hydrogens will add new hydrogen atoms to the end of the atom list.

Calling this method will re-order the atoms in the protein so that each residue is in a single atom block in the atom list. This is useful in particular if you are writing PDB files where having residues as single blocks of ATOM lines is desirable.

Note that calling this method will mean that any pre-existing indexes into the atom list will probably be invalidated.
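To make the re-ordering and the index invalidation concrete, here is a plain-Python sketch of the idea. It is not the library's implementation: the function name group_atoms_by_residue and the (residue_key, atom_label) tuples are assumptions, and residues are kept in order of first appearance.

```python
def group_atoms_by_residue(atoms):
    """Re-order atoms so that each residue forms one contiguous block.

    atoms: list of (residue_key, atom_label) tuples.
    Returns (sorted_atoms, old_to_new), where old_to_new maps each
    original list index to its position after sorting.
    """
    # Rank residues by the position at which they first appear.
    first_seen = {}
    for residue_key, _ in atoms:
        first_seen.setdefault(residue_key, len(first_seen))

    # sorted() is stable, so atoms keep their relative order within a residue.
    order = sorted(range(len(atoms)), key=lambda i: first_seen[atoms[i][0]])
    sorted_atoms = [atoms[i] for i in order]
    old_to_new = {old: new for new, old in enumerate(order)}
    return sorted_atoms, old_to_new


# Hydrogens appended at the end of the list, as after adding hydrogens.
demo_atoms = [
    ("ALA1", "N"), ("ALA1", "CA"),
    ("GLY2", "N"), ("GLY2", "CA"),
    ("ALA1", "H"), ("GLY2", "H"),
]
demo_sorted, demo_map = group_atoms_by_residue(demo_atoms)
```

In the demo, the hydrogen of ALA1 moves from index 4 to index 2, which is why indexes stored before sorting should be looked up again afterwards.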

property waters

The waters of the protein.

Returns:

a tuple of ccdc.molecule.Molecule, representing the oxygens of the water.

ccdc.protein.ProteinWriter

Writes a protein as a crystal structure.

This type is an alias of ccdc.io.CrystalWriter. If the file format supports it, certain protein-specific information will be included, such as the _entity and _entity_src_gen tables in mmCIF files, which would not be included when writing with ccdc.io.MoleculeWriter.