Docking and scoring

Note

The ccdc.docking module is available only to CSD-Discovery, CSDS+GOLD and CSD-Enterprise users.

Introduction

Molecular docking is a widely-used computational tool for understanding molecular recognition, which aims to predict the interactions established in a complex formed by two or more constituent molecules with known structures. A key type of molecular docking is protein-ligand docking because of its therapeutic applications in modern structure-based drug design. In this context, the docking process involves the prediction of the binding conformation of small-molecule ligands to the targeted protein binding site. The ccdc.docking module provides an API to protein-ligand docking.

The module contains a single class ccdc.docking.Docker which, like other classes of the CSD Python API, contains a nested ccdc.docking.Docker.Settings class which will be used to specify the desired docking. Once the nested settings class is appropriately configured, ccdc.docking.Docker.dock() can be called to perform the docking.

Note

For more information on the docking algorithm please see: “Molecular recognition of receptor sites using a genetic algorithm with a description of desolvation”, G. Jones, P. Willett and R. C. Glen, J. Mol. Biol., 245, 43-53, 1995, DOI: 10.1016/S0022-2836(95)80037-9.

Setting up a docking run

Essential steps

Let us import the appropriate ccdc.docking module and the ccdc.io modules so that we can read in molecules of interest.

>>> from ccdc.docking import Docker
>>> from ccdc.io import MoleculeReader, EntryReader

A docking requires one or more protein files, one or more ligand files, and a binding site definition. There are many other optional settings which can be passed to the docking process, but these are essential. Let us set them up:

>>> docker = Docker()
>>> settings = docker.settings

Now get the protein:

>>> MLL1_protein_file = '2w5y_protein_prepared.mol2'

and load it into the settings:

>>> settings.add_protein_file(MLL1_protein_file)

If more than one protein is added, an ensemble docking will be performed, where the best protein into which to dock is selected by the genetic algorithm of GOLD. View the GOLD documentation for further details.

Next we can load the native ligand bound to the protein and used it to define the binding site:

>>> MLL1_native_ligand_file = 'SAH_native.mol2'
>>> MLL1_native_ligand = MoleculeReader(MLL1_native_ligand_file)[0]
>>> MLL1_protein = settings.proteins[0]
>>> settings.binding_site = settings.BindingSiteFromLigand(MLL1_protein, MLL1_native_ligand, 8.0)

This will define a binding site of radius 8 Angstroms around the ligand. If the distance optional argument is not provided a radius of 6.0 Angstroms will be used. Please see Binding Site Definition for further details on the possible ways to define the binding site for docking.

We can set some other useful docking parameters, for example the fitness function to be used or the output directory where docking results will be saved:

>>> settings.fitness_function = 'plp'
>>> settings.autoscale = 10.
>>> settings.early_termination = False
>>> import tempfile
>>> batch_tempd = tempfile.mkdtemp()
>>> settings.output_directory = batch_tempd
>>> settings.output_file = 'docked_ligands.mol2'

All that now remains is to select some ligands to dock, and to choose between interactive or batch docking of ligands. In the latter case we should set the ligands in the ccdc.docking.Docker.Settings instance; in the former we will provide the ligands to an interactive ccdc.docking.Docker.InteractiveResults() instance returned by the ccdc.docking.Docker.dock() method. Here we will use the batch method to dock the S-adenosyl-L-homocysteine (SAH) ligand back into the protein:

>>> MLL1_ligand_file = 'SAH.mol2'
>>> MLL1_ligand = MoleculeReader(MLL1_ligand_file)[0]
>>> settings.add_ligand_file(MLL1_ligand_file, 10)

The second argument gives the number of docking runs per ligand.

Now we launch the docker:

>>> results = docker.dock() 
Starting GOLD with conf file ...

This will write the settings to a file called api_gold.conf file - see the GOLD conf file documentation for details - and launch a GOLD process to perform the docking, blocking on the completion of the process. The returned value is a ccdc.docking.Docker.Results instance.

We can check whether the docking ran to completion:

>>> print results.return_code
0

The location of the written GOLD conf file can be obtained from the settings. This is by default the user working directory:

>>> batch_conf_file = settings.conf_file

Inspecting results

It is useful to be able to use the ccdc.docking.Docker.Results class to inspect a previously completed docking. For example we can inspect the batch job we ran earlier by creating a settings instance with the right configuration file:

>>> settings = Docker.Settings.from_file(batch_conf_file)
>>> results = Docker.Results(settings)
>>> ligands = results.ligands
>>> print len(ligands)
10

The results instance can give access to the docked ligands, in the form of a ccdc.docking.Docker.Results.DockedLigandReader instance. This class is a subclass of ccdc.io.EntryReader but provides ccdc.docking.Docker.Results.DockedLigand instances, which are derivations of ccdc.entry.Entry, providing all the methods of the entry, with a couple of convenience method for extracting fitness values and scoring terms. Please note that docked ligands are not reanked by score.

>>> ligands = results.ligands

Each entry of this reader will contain the SDTag fields provided by GOLD, in the form of an attributes dictionary. This allows access to the component energy terms of the scoring function. The precise names of the terms will, of course, depend on the scoring function used. They may be obtained by inspection of the attributes dictionary of the entry. For the current PLP scoring function we will have, for example:

>>> first_dock = ligands[0]
>>> 'Gold.PLP.Fitness' in first_dock.attributes
True
>>> 'Gold.PLP.part.hbond' in first_dock.attributes
True
>>> 'Gold.PLP.part.buried' in first_dock.attributes
True

These terms may be extracted more conveniently as:

>>> print first_dock.fitness() 
68.42

If a rescore has been performed, there will be more than one fitness value in the docked ligand; in this case it is necessary to specify which fitness value to return:

>>> print first_dock.fitness('plp') 
68.42

or, perhaps more generally:

>>> print first_dock.fitness(settings.fitness_function) 
68.42
>>> print first_dock.fitness(settings.rescore_function) 
...

Individual scoring terms may be retrieved analogously: where the set of provided filters yields a single term, that term will be returned as a float; where more than one term matches, a dict of matched terms will be returned. Thus, providing no filters will allow a simple inspection of the terms available in the ligand. Filters will be matched in a case-insensitive fashion with an arbitrary order.

>>> print first_dock.scoring_term() 
{u'Gold.PLP.Chemscore.Hbond': 4.574 ...
>>> print first_dock.scoring_term('plp', 'chemscore', 'hbond') 
4.574

The values present in this dictionary will differ from run to run, so we will not exemplify them here.

We can also extract the hydrogen bonds which GOLD has detected in a docked ligand using ccdc.docking.Docker.Results.DockedLigand.HBond. This takes an optional argument: if absent the docking’s fitness function will be used. In the case that a rescore has been run, there may be two sets of hydrogen bonds: one from the fitness function (if it is ‘chemscore’, ‘goldscore’ or ‘plp’) and one from the rescore function. Here we can request a specific set of hydrogen bonds by the name of the scoring function. Where the fitness function is ‘asp’, or if an explicitly requested set of hydrogen bonds can not be found, None will be returned. Otherwise the result will be a tuple of ccdc.molecule.Molecule.HBond instances with a strength attribute the score that GOLD has assigned to the hydrogen bond.

>>> print first_dock.hbonds() 
(HBond(Atom(N...)-Atom(H...)-Atom(O)), ...)

One can form complexes of docked ligands with the protein into which they docked, adjusting the geometry of the protein for the rotations of sidechains incurred by the induced fit of the ligand. If the docking involved more than one protein, i.e. an ensemble docking run, the correct protein for the ligand will be selected. This will return a ccdc.protein.Protein instance:

>>> protein = results.make_complex(ligands[0])
>>> print len(protein.ligands)
1

Binding Site Definition

It is necessary to define the protein binding site. This can be done in several ways, e.g. by specifying the approximate centre of the binding site and taking all atoms that lie within a specified radius of this point. Only those atoms specifically included in the binding site definition will be considered during docking. The binding site definition should therefore be large enough to contain any possible binding mode of the ligand, and include all atoms or residues that might be involved in ligand binding.

Defining a binding site from a reference ligand

To define a binding site including all protein atoms located within 8 Angstroms of any atom of a bound reference ligand, we use the statements:

>>> MLL1_native_ligand_file = 'SAH_native.mol2'
>>> MLL1_native_ligand = MoleculeReader(MLL1_native_ligand_file)[0]
>>> settings.binding_site = settings.BindingSiteFromLigand(MLL1_protein, MLL1_native_ligand, 8.0)

By default, all protein atoms within 6.0 Angstroms of each selected ligand are used for the binding site definition. We changed this by setting the second argument to 8.0.

Defining a binding site from a protein atom, residue(s) or point

Alternatively, we can define our binding site from a single protein atom:

>>> HIS3839CB = [a for a in MLL1_protein["A:HIS3839"].atoms if a.label=="CB"][0]
>>> settings.binding_site = settings.BindingSiteFromAtom(MLL1_protein, HIS3839CB, 10.0)

a single protein residue:

>>> HIS3839 = [r for r in MLL1_protein.residues if r.identifier == 'A:HIS3839'][0]
>>> settings.binding_site = settings.BindingSiteFromResidue(MLL1_protein,HIS3839, 10.0)

or a list of protein residues:

>>> active_site_string =  str('A:SER3836,A:ILE3838,A:HIS3839,A:GLY3840,A:ARG3841,A:GLY3842,A:TYR3883,')
>>> active_site_string += str('A:ARG3903,A:PHE3904,A:ILE3905,A:ASN3906,A:HIS3907,A:SER3908,A:CYS3909,')
>>> active_site_string += str('A:TYR3944,A:PHE3946,A:PRO3956,A:CYS3957,A:ASN3958,A:CYS3959,A:LEU3968,A:ZN4970')
>>> active_site = str(active_site_string).split(',')
>>> active_site_residues = [r for r in MLL1_protein.residues if r.identifier in active_site]
>>> settings.binding_site = settings.BindingSiteFromListOfResidues(MLL1_protein, active_site_residues)

or from the coordinates of a single point, that is NOT within the van der Waals sphere of any protein atom:

>>> settings.binding_site = settings.BindingSiteFromPoint(MLL1_protein,(-41.2, 7.5, 18.7), 10.0)

Optionally, the atom indices, in a form suitable for GOLD may be written when the settings file is written. This is controlled by the settings attribute ccdc.docking.Docker.Settings.save_binding_site_atoms.

A binding site instance supports attributes giving access to the atoms, residues, waters, metals and ligands subtended by the definition.

>>> atoms = settings.binding_site.atoms
>>> print(len(atoms))
557
>>> residues = settings.binding_site.residues
>>> print(len(residues))
32
>>> waters = settings.binding_site.waters
>>> print(len(waters))
0
>>> print(len(settings.binding_site.metals))
0
>>> print(len(settings.binding_site.ligands))
0

Docking Constraints

Docking constraints may be applied to the docking to influence the selection of preferred binding modes.

Distance, substucture-based distance, HBond and Protein HBond constraints bias solutions towards ones that show desired orientations with respect to the protein binding site.

Region constraints and template similarity constraints bias solutions with respect to features defined as absolute coordinates in space.

Scaffold match constraints enforce a part of the ligand to be placed on top of the coordinates of a predefined scaffold.

Setting up protein HBond constraints

A frequently used constraint is the Protein HBond constraint, biasing poses towards those that involve standard HBond acceptor atoms and the hydrogen atom linked to standard HBond donors to form hydrogen bonds.

Individual Protein HBond constraints bias towards solutions involving a single protein donor (specified as the hydrogen) or acceptor group in hydrogen bonding with the ligand. For example, if we expect hydrogen bond donor groups in the ligand to interact with ND1 of His3839 of MLL1 we would use the code below:

>>> nd1_his3839 = [a for a in MLL1_protein["A:HIS3839"].atoms if a.label=="ND1"]
>>> settings.add_constraint(settings.ProteinHBondConstraint(nd1_his3839))

For a given Protein Hbond constraint more than one protein atom can be used. This will instruct the program to use an either-or type of constraint during docking and the constraint will be considered satisfied if at least one of the specified atoms is involved in a hydrogen bond with the docked ligand. To require at least one of backbone N-H of His3907, and backbone N-H of Asn3958 to form an hydrogen bond with SAH use:

>>> atoms = []
>>> backbone_nh_asn3958 = [a for a in MLL1_protein["A:ASN3958"].atoms if a.label=="H"]
>>> backbone_nh_his3907 = [a for a in MLL1_protein["A:HIS3907"].atoms if a.label=="H"]
>>> atoms.extend(backbone_nh_asn3958)
>>> atoms.extend(backbone_nh_his3907)
>>> settings.add_constraint(settings.ProteinHBondConstraint(atoms))

Setting up distance or HBond constraints

Distance constraints and HBond constraints penalise solutions where specific protein and ligand atoms fail to be within a certain distance from one another or form an hydrogen bond, respectively.

To set a HBond constraint between the nitrogen ND1 of His3839 and the primary amino group of SAH (you must specify the hydrogen atom, not the O or N atom to which it is attached), use:

>>> protein_atom = nd1_his3839[0]
>>> ligand_atom_H = [a for a in MoleculeReader(MLL1_ligand_file)[0].atoms if a.atomic_symbol=="H"][0]
>>> settings.add_constraint(settings.HBondConstraint(protein_atom, ligand_atom_H))

To set a distance constraint between the same protein atom and the nitrogen atom of the primary amino group of SAH, use:

>>> ligand_atom_N = [a for a in MoleculeReader(MLL1_ligand_file)[0].atoms if a.label=="N1"][0]
>>> settings.add_constraint(settings.DistanceConstraint(protein_atom, ligand_atom_N))

Setting up substructure-based distance constraints

Substructure-based distance constraints allow automatic application to multiple ligands which have a common functional group. The constraint forces the docking algorithm to limit the distance between a protein atom and one atom of this functional group. Docking solutions will be biased towards the specified distance range. Please note that the substructure must be a sub-graph rather than a complete molecule.

Let’s use a glycinyl fragment as the substructure template to define a constraint between its nitrogen atom and the same protein atom as above.

>>> fragment_file = 'glycinyl_fragment.mol2'
>>> fragment = MoleculeReader(fragment_file)[0]
>>> fragment_atom_N = [a for a in fragment.atoms if a.label=="N1"][0]
>>> settings.add_constraint(settings.SubstructureConstraint(protein_atom, fragment, fragment_atom_N, [3.5, 2.5], 5, False))

This finishes the instructions for constraints related to the protein binding site. Let’s now focus on protein-independent constraints.

Setting up scaffold match constraints

The scaffold match constraint can be used to enforce the placement of a fragment at an exact specified position in the binding site. The geometry of this fragment will not be altered during docking. The scaffold can, for example, be a common core or it may just be a substructure known to adopt a certain binding position. The scaffold must be supplied as a mol2 file containing the scaffold fragment in its docked position. Let’s load an adenine scaffold template and set up the constraint:

>>> scaffold_file = 'adenine_scaffold.mol2'
>>> scaffold =  MoleculeReader(scaffold_file)[0]
>>> settings.add_constraint(settings.ScaffoldMatchConstraint(scaffold))

Setting up similarity constraints

Template similarity constraints do not require exact match to a given fragment and will only bias solutions towards the template without enforcing it. The similarity between ligand and template is evaluated as a Gaussian overlap term. The similarity constraint can be applied in three ways:

  • by using the overlap between all donor atoms in the template and the ligand being docked
  • by using the overlap between all acceptor atoms in the template and the ligand being docked
  • by using the overlap of all atoms of the template (this can be regarded as a ligand shape constraint)

Let’s re-use the same scaffold as in the above example to illustrate the different behaviour of those two constraints.

>>> settings.add_constraint(settings.TemplateSimilarityConstraint('acceptor', scaffold))

We have specified an overlap of all hydrogen bond acceptor atoms between the ligand and the template, but you can choose ‘donor’ or ‘all’ instead of ‘acceptor’ to change the behaviour of the constraint. Make sure your fragment includes hydrogen atoms if you wish to use ‘donor’ or ‘all’.

Setting up region constraints

It is possible to bias the docking towards solutions where particular regions of the binding site are occupied by specific types of ligand atoms (e.g. all hydrophobic atoms or all aromatic ring atoms).

>>> settings.add_constraint(settings.RegionConstraint((-40.8, 7.3, 23.4), 2.0, 'arom_ring_atoms', weight=1.0))

Additional settings

Early termination options may be inspected and set. By default they will be set active, with number of ligands set to 3 and RMSD set to 1.5:

>>> settings = Docker().settings
>>> print(settings.early_termination)
(True, 3, 1.5)
>>> settings.early_termination = False
>>> print(settings.early_termination)
(False, None, None)
>>> settings.early_termination = (True, 5, 2.5)
>>> print(settings.early_termination)
(True, 5, 2.5)

When setting early termination, number of ligands and RMSD will assume their default values if not specified.

Diverse solutions may be requested for a docking. By default this option is not enabled. If not specified, diverse solution cluster size and RMSD will be given default values.

>>> print(settings.diverse_solutions)
(False, None, None)
>>> settings.diverse_solutions = True
>>> print(settings.diverse_solutions)
(True, 1, 1.5)
>>> settings.diverse_solutions = (True, 5, 2.5)
>>> print(settings.diverse_solutions)
(True, 5, 2.5)

Interactive docking

The easiest way to set up a protein-ligand docking is to read in a previously prepared docking configuration file (e.g. MLL1_PHB_Constraints.conf). This file can be used to run a docking directly, or amended through the API before launching a docking process.

>>> conf_file = 'MLL1_PHB_Constraints.conf'
>>> settings = Docker.Settings.from_file(conf_file)
>>> docker = Docker(settings=settings)

If the settings contains all the right information the docking may be launched immediately. For speed of running this test we will scale the search to 10% of the normal run, set the number of docking poses to 5, set the results directory to a temporary directory and exemplify interactive docking:

>>> settings.autoscale = 10
>>> tempdir = tempfile.mkdtemp()
>>> settings.output_directory = tempdir
>>> settings.set_hostname(ndocks=5)
>>> session = docker.dock(mode='interactive')  
Starting GOLD with conf file ...
CONNECTED TO localhost ...

Now we have an interactive docking session, connecting to a docking process via sockets. We can send ligands to the process as ccdc.entry.Entry instances, so we will construct an entry reader:

>>> SAH_file = 'SAH.mol2'
>>> SAH_entry = EntryReader(SAH_file)[0]
>>> poses = session.dock(SAH_entry)

The returned poses will be a tuple of ccdc.entry.Entry instances, containing as attributes a dictionary of values set by the docking process. For visual inspection of the results we may want to save the poses in the output directory.

>>> from ccdc.io import EntryWriter
>>> import os
>>> for inx, pose in enumerate(poses):
...     pose_name = 'gold_soln_%s_m1_%d.mol2' % (os.path.basename(SAH_file).split('.')[0], inx+1)
...     with EntryWriter(os.path.join(settings.output_directory, pose_name)) as w:
...         w.write(pose)
...

During the docking calculation ligand log files are produced and written into the output directory. They will be named as gold_<ligand_name>_m#.log where m# is an index to the number of the ligand parsed to the interactive docker. Because this naming scheme uses ligand names, it is preferable if they do not contain any special character.

The docking process remains connected, so further ligands may be sent to it, allowing the results of a docking to influence the choice of further dockings. The process will be terminated when the ccdc.docking.Docker.InteractiveResults instance is deleted. The example program, Interactive docking shows an example of using this mechanism to select candidate ligands on the basis of similarity to successfully docked ligands.

Rescoring

The ranking of ligand docking poses according to certain scoring functions is the most important step in virtual screening for drug discovery. An ideal scoring function is expected to to identify the correct binding pose for a ligand and to rank the docked compounds in order to discriminate active from inactive ligands. This capability is challenging in virtual screening, although a particular scoring function may perform better than another for selected class of proteins. Therefore, consensus approaches, using multiple scoring functions, have been proposed to overcome the limitations of individual scoring functions.

The CSD Python API allows you to rescore a single ligand or a set of ligands with the four scoring functions available in GOLD (i.e., GoldScore, ChemScore, CHEMPLP and ASP). Rescoring can be performed automatically after a docking run. This will result in the solutions from the docking being automatically scored with another scoring function. Alternatively, rescoring can be performed independently of the docking, e.g., against an existing set of GOLD solution files (or ligand poses from an alternative source).

To automatically rescore the results of a docking run with another scoring function you will need to first set up the docking in the normal way. You will therefore need to:

  • Provide a prepared protein input file
  • Define the binding site (preferably the same definition that was used for the original docking)
  • Specify the ligand(s) you wish to rescore
>>> from ccdc.docking import Docker
>>> docker = Docker()
>>> settings = docker.settings
>>> MLL1_protein = '2w5y_protein_prepared.mol2'
>>> SAH_ligand = 'SAH.mol2'
>>> settings.add_protein_file(os.path.abspath(MLL1_protein))
>>> ligand = MoleculeReader(SAH_ligand)[0]
>>> settings.binding_site = settings.BindingSiteFromPoint(None, ligand.centre_of_geometry(), 10.)
>>> settings.add_ligand_file(SAH_ligand, 10)
>>> settings.output_directory = 'results'

Then select the required scoring functions to be used for the docking and the rescore:

>>> settings.fitness_function = 'goldscore'
>>> settings.rescore_function = 'plp'
>>> settings.output_file = 'ligand_rescore.mol2'

By default, the docked ligand pose are minimised before rescoring. This is important as, due to the nature of scoring functions, one finds that small changes in location or conformation of the pose can have large effects on the calculated score.

To rescore an existing set of GOLD solution files or ligand poses from an alternative source (i.e., without first running a docking) we should set the fitness function to None and specify the scoring function to be used for rescoring.

>>> settings.fitness_function = None
>>> settings.rescore_function = 'plp'

Protein and Ligand Preparation

It is often necessary to prepare the protein and ligand molecules to make them more appropriate for docking. For example, it might be necessary to protonate a protein or assign bond types to a ligand.

A protein can be protonated using the ccdc.protein.Protein.add_hydrogens() method. View the descriptive protein documentation for further details.

The CSD Python API allows you to create a ccdc.docking.Docker.LigandPreparation object in order to automatically prepare ligands for docking. By default, the following operations are performed:

  • Unknown and dummy atoms are removed
  • Unknown bond types are assigned
  • SMARTS based protonation rules are applied
  • Missing hydrogen atoms are added

The following example shows the default preparation of the SAH ligand.

>>> SAH_ligand = 'SAH.mol2'
>>> ligand_prep = Docker.LigandPreparation()

>>> prepared_lig = ligand_prep.prepare(EntryReader(SAH_ligand)[0])
>>> from ccdc.io import MoleculeWriter
>>> with MoleculeWriter('prepared_lig.mol2') as mol_writer:
...     mol_writer.write(prepared_lig.molecule)
...

Each ligand preparation object has a settings attribute, which can be used to control the preparation behaviour.

>>> ligand_prep = Docker.LigandPreparation()
>>> ligand_prep.settings.remove_unknown_atoms = True
>>> ligand_prep.settings.add_hydrogens = False

When molecules are processed in interactive mode, the default ligand preparation steps are automatically enabled. However, it is possible to amend the default settings, or if ligands are pre-prepared we can switch the ligand preparation off. The following example shows how to add the standardisation of bond types according to CSD conventions.

>>> conf_file = 'MLL1_PHB_Constraints.conf'
>>> settings = Docker.Settings.from_file(conf_file)
>>> docker = Docker(settings=settings)
>>> ligand_prep = docker.LigandPreparation()
>>> ligand_prep.settings.standardise_bond_types = True
>>> session = docker.dock(mode='interactive')  
Starting GOLD with conf file ...
CONNECTED TO localhost ...
>>> session.ligand_preparation = ligand_prep

If we don’t want the ligand preparation to be performed then:

>>> session.ligand_preparation = None