Docking and scoring¶
Note
The ccdc.docking
module is available only to licensed CSD-Discovery, GOLD and CSD-Enterprise users.
Introduction¶
Molecular docking is a widely-used computational tool for understanding molecular recognition, which aims to predict the interactions established in a complex formed by two or more constituent molecules with known structures. A key type of molecular docking is protein-ligand docking because of its therapeutic applications in modern structure-based drug design. In this context, the docking process involves the prediction of the binding conformation of small-molecule ligands to the targeted protein binding site.
The ccdc.docking
module provides an API to protein-ligand docking.
The module contains a single class ccdc.docking.Docker
which, like other classes of the CSD Python API, contains a nested ccdc.docking.Docker.Settings
class which will be used to specify the desired docking. Once the nested settings class is appropriately configured, ccdc.docking.Docker.dock()
can be called to perform the docking.
Note
For more information on the docking algorithm please see: “Molecular recognition of receptor sites using a genetic algorithm with a description of desolvation”, G. Jones, P. Willett and R. C. Glen, J. Mol. Biol., 245, 43-53, 1995, DOI: 10.1016/S0022-2836(95)80037-9.
Setting up a docking run¶
Essential steps¶
Let us import the appropriate ccdc.docking
module and the ccdc.io
modules so that we can read in molecules of interest.
>>> from ccdc.docking import Docker
>>> from ccdc.io import MoleculeReader, EntryReader
A docking requires one or more protein files, one or more ligand files, and a binding site definition. There are many other optional settings which can be passed to the docking process, but these are essential. Let us set them up:
>>> docker = Docker()
>>> settings = docker.settings
Now get the protein:
>>> MLL1_protein_file = '2w5y_protein_prepared.mol2'
and load it into the settings:
>>> settings.add_protein_file(MLL1_protein_file)
If more than one protein is added, an ensemble docking will be performed, where the best protein into which to dock is selected by the genetic algorithm of GOLD. View the GOLD documentation for further details.
Next we can load the native ligand bound to the protein and used it to define the binding site:
>>> MLL1_native_ligand_file = 'SAH_native.mol2'
>>> MLL1_native_ligand = MoleculeReader(MLL1_native_ligand_file)[0]
>>> MLL1_protein = settings.proteins[0]
>>> settings.binding_site = settings.BindingSiteFromLigand(MLL1_protein, MLL1_native_ligand, 8.0)
This will define a binding site of radius 8 Angstroms around the ligand. If the distance optional argument is not provided a radius of 6.0 Angstroms will be used. Please see Binding Site Definition for further details on the possible ways to define the binding site for docking.
We can set some other useful docking parameters, for example the fitness function to be used or the output directory where docking results will be saved:
>>> settings.fitness_function = 'plp'
>>> settings.autoscale = 10.
>>> settings.early_termination = False
>>> import tempfile
>>> batch_tempd = tempfile.mkdtemp()
>>> settings.output_directory = batch_tempd
>>> settings.output_file = 'docked_ligands.mol2'
All that now remains is to select some ligands to dock, and to choose between interactive or batch docking of ligands. In the latter case we should set the ligands in the ccdc.docking.Docker.Settings
instance; in the former we will provide the ligands to an interactive ccdc.docking.Docker.InteractiveResults()
instance returned by the ccdc.docking.Docker.dock()
method. Here we will use the batch method to dock the S-adenosyl-L-homocysteine (SAH) ligand back into the protein:
>>> MLL1_ligand_file = 'SAH.mol2'
>>> MLL1_ligand = MoleculeReader(MLL1_ligand_file)[0]
>>> settings.add_ligand_file(MLL1_ligand_file, 10)
The second argument gives the number of docking runs per ligand.
Now we launch the docker:
>>> results = docker.dock()
This will write the settings to a file called api_gold.conf file
- see the GOLD conf file documentation for details -
and launch a GOLD process to perform the docking, blocking on the completion of the process. The returned value is a ccdc.docking.Docker.Results
instance.
We can check whether the docking ran to completion:
>>> print(results.return_code)
0
The location of the written GOLD conf file can be obtained from the settings. This is by default the user working directory:
>>> batch_conf_file = settings.conf_file
Inspecting results¶
It is useful to be able to use the ccdc.docking.Docker.Results
class to inspect a previously completed docking. For example we can inspect the batch job we ran earlier by creating a settings instance with the right configuration file:
>>> settings = Docker.Settings.from_file(batch_conf_file)
>>> results = Docker.Results(settings)
>>> ligands = results.ligands
>>> print(len(ligands))
10
The results instance can give access to the docked ligands, in the form of a ccdc.docking.Docker.Results.DockedLigandReader
instance. This class is a subclass of ccdc.io.EntryReader
but provides ccdc.docking.Docker.Results.DockedLigand
instances, which are derivations of ccdc.entry.Entry
, providing all the methods of the entry, with a couple of convenience method for extracting fitness values and scoring terms. Please note that docked ligands are not reanked by score.
>>> ligands = results.ligands
Each entry of this reader will contain the SDTag fields provided by GOLD, in the form of an attributes dictionary. This allows access to the component energy terms of the scoring function. The precise names of the terms will depend on the scoring function used. They may be obtained by inspection of the attributes dictionary of the entry. For the current PLP scoring function we will have, for example:
>>> first_dock = ligands[0]
>>> 'Gold.PLP.Fitness' in first_dock.attributes
True
>>> 'Gold.PLP.part.hbond' in first_dock.attributes
True
>>> 'Gold.PLP.part.buried' in first_dock.attributes
True
These terms may be extracted more conveniently as:
>>> print(first_dock.fitness())
68.42
If a rescore has been performed, there will be more than one fitness value in the docked ligand; in this case it is necessary to specify which fitness value to return:
>>> print(first_dock.fitness('plp'))
68.42
or, perhaps more generally:
>>> print(first_dock.fitness(settings.fitness_function))
68.42
>>> print(first_dock.fitness(settings.rescore_function))
...
Individual scoring terms may be retrieved analogously: where the set of provided filters yields a single term, that term will be returned as a float; where more than one term matches, a dict of matched terms will be returned. Thus, providing no filters will allow a simple inspection of the terms available in the ligand. Filters will be matched in a case-insensitive fashion with an arbitrary order.
>>> print(first_dock.scoring_term())
{u'Gold.PLP.Chemscore.Hbond': 4.574 ...
>>> print(first_dock.scoring_term('plp', 'chemscore', 'hbond'))
4.574
The values present in this dictionary will differ from run to run, so we will not exemplify them here.
We can also extract the hydrogen bonds which GOLD has detected in a docked ligand using ccdc.docking.Docker.Results.DockedLigand.HBond
.
This takes an optional argument: if absent the docking’s fitness function will be used. In the case that a rescore has been run, there may be two sets of hydrogen bonds: one from the fitness function (if it is ‘chemscore’, ‘goldscore’ or ‘plp’) and one from the rescore function. Here we can request a specific set of hydrogen bonds by the name of the scoring function. Where the fitness function is ‘asp’, or if an explicitly requested set of hydrogen bonds can not be found, None will be returned. Otherwise the result will be a tuple of ccdc.molecule.Molecule.HBond
instances with a strength attribute the score that GOLD has assigned to the hydrogen bond.
>>> print(first_dock.hbonds())
(HBond(Atom(N...)-Atom(H...)-Atom(O)), ...)
One can form complexes of docked ligands with the protein into which they docked, adjusting the geometry of the protein for the rotations of sidechains incurred by the induced fit of the ligand. If the docking involved more than one protein, i.e. an ensemble docking run, the correct protein for the ligand will be selected. This will return a ccdc.protein.Protein
instance:
>>> protein = results.make_complex(ligands[0])
>>> print(len(protein.ligands))
1
Binding Site Definition¶
It is necessary to define the protein binding site. This can be done in several ways, e.g. by specifying the approximate centre of the binding site and taking all atoms that lie within a specified radius of this point. Only those atoms specifically included in the binding site definition will be considered during docking. The binding site definition should therefore be large enough to contain any possible binding mode of the ligand, and include all atoms or residues that might be involved in ligand binding.
Defining a binding site from a reference ligand¶
To define a binding site including all protein atoms located within 8 Angstroms of any atom of a bound reference ligand, we use the statements:
>>> MLL1_native_ligand_file = 'SAH_native.mol2'
>>> MLL1_native_ligand = MoleculeReader(MLL1_native_ligand_file)[0]
>>> settings.binding_site = settings.BindingSiteFromLigand(MLL1_protein, MLL1_native_ligand, 8.0)
By default, all protein atoms within 6.0 Angstroms of each selected ligand are used for the binding site definition. We changed this by setting the second argument to 8.0.
Defining a binding site from a protein atom, residue(s) or point¶
Alternatively, we can define our binding site from a single protein atom:
>>> HIS3839CB = [a for a in MLL1_protein["A:HIS3839"].atoms if a.label=="CB"][0]
>>> settings.binding_site = settings.BindingSiteFromAtom(MLL1_protein, HIS3839CB, 10.0)
a single protein residue:
>>> HIS3839 = [r for r in MLL1_protein.residues if r.identifier == 'A:HIS3839'][0]
>>> settings.binding_site = settings.BindingSiteFromResidue(MLL1_protein,HIS3839, 10.0)
or a list of protein residues:
>>> active_site_string = str('A:SER3836,A:ILE3838,A:HIS3839,A:GLY3840,A:ARG3841,A:GLY3842,A:TYR3883,')
>>> active_site_string += str('A:ARG3903,A:PHE3904,A:ILE3905,A:ASN3906,A:HIS3907,A:SER3908,A:CYS3909,')
>>> active_site_string += str('A:TYR3944,A:PHE3946,A:PRO3956,A:CYS3957,A:ASN3958,A:CYS3959,A:LEU3968,A:ZN4970')
>>> active_site = str(active_site_string).split(',')
>>> active_site_residues = [r for r in MLL1_protein.residues if r.identifier in active_site]
>>> settings.binding_site = settings.BindingSiteFromListOfResidues(MLL1_protein, active_site_residues)
or from the coordinates of a single point, that is NOT within the van der Waals sphere of any protein atom:
>>> settings.binding_site = settings.BindingSiteFromPoint(MLL1_protein,(-41.2, 7.5, 18.7), 10.0)
NB! If an mmCIF file is used as an input file, residues’ IDs have to match the following fields: _atom_site.label_asym_id
, _atom_site.label_comp_id
, _atom_site.label_seq_id
.
Optionally, the atom indices, in a form suitable for GOLD may be written when the settings file is written. This is controlled by the settings attribute ccdc.docking.Docker.Settings.save_binding_site_atoms
.
A binding site instance supports attributes giving access to the atoms, residues, waters, metals and ligands subtended by the definition.
>>> atoms = settings.binding_site.atoms
>>> print(len(atoms))
557
>>> residues = settings.binding_site.residues
>>> print(len(residues))
32
>>> waters = settings.binding_site.waters
>>> print(len(waters))
0
>>> print(len(settings.binding_site.metals))
0
>>> print(len(settings.binding_site.ligands))
0
Docking Constraints¶
Docking constraints may be applied to the docking to influence the selection of preferred binding modes.
Distance, substructure-based distance, HBond and Protein HBond constraints bias solutions towards ones that show desired orientations with respect to the protein binding site.
Region constraints and template similarity constraints bias solutions with respect to features defined as absolute coordinates in space.
Scaffold match constraints enforce a part of the ligand to be placed on top of the coordinates of a predefined scaffold.
Setting up protein HBond constraints¶
A frequently used constraint is the Protein HBond constraint, biasing poses towards those that involve standard HBond acceptor atoms and the hydrogen atom linked to standard HBond donors to form hydrogen bonds.
Individual Protein HBond constraints bias towards solutions involving a single protein donor (specified as the hydrogen) or acceptor group in hydrogen bonding with the ligand. For example, if we expect hydrogen bond donor groups in the ligand to interact with ND1 of His3839 of MLL1 we would use the code below:
>>> nd1_his3839 = [a for a in MLL1_protein["A:HIS3839"].atoms if a.label=="ND1"]
>>> settings.add_constraint(settings.ProteinHBondConstraint(nd1_his3839))
For a given Protein Hbond constraint more than one protein atom can be used. This will instruct the program to use an either-or type of constraint during docking and the constraint will be considered satisfied if at least one of the specified atoms is involved in a hydrogen bond with the docked ligand. To require at least one of backbone N-H of His3907, and backbone N-H of Asn3958 to form an hydrogen bond with SAH use:
>>> atoms = []
>>> backbone_nh_asn3958 = [a for a in MLL1_protein["A:ASN3958"].atoms if a.label=="H"]
>>> backbone_nh_his3907 = [a for a in MLL1_protein["A:HIS3907"].atoms if a.label=="H"]
>>> atoms.extend(backbone_nh_asn3958)
>>> atoms.extend(backbone_nh_his3907)
>>> settings.add_constraint(settings.ProteinHBondConstraint(atoms))
Setting up distance or HBond constraints¶
Distance constraints and HBond constraints penalise solutions where specific protein and ligand atoms fail to be within a certain distance from one another or form an hydrogen bond, respectively.
To set a HBond constraint between the nitrogen ND1 of His3839 and the primary amino group of SAH (you must specify the hydrogen atom, not the O or N atom to which it is attached), use:
>>> protein_atom = nd1_his3839[0]
>>> ligand_atom_H = [a for a in MoleculeReader(MLL1_ligand_file)[0].atoms if a.atomic_symbol=="H"][0]
>>> settings.add_constraint(settings.HBondConstraint(protein_atom, ligand_atom_H))
To set a distance constraint between the same protein atom and the nitrogen atom of the primary amino group of SAH, use:
>>> ligand_atom_N = [a for a in MoleculeReader(MLL1_ligand_file)[0].atoms if a.label=="N1"][0]
>>> settings.add_constraint(settings.DistanceConstraint(protein_atom, ligand_atom_N))
Setting up substructure-based distance constraints¶
Substructure-based distance constraints allow automatic application to multiple ligands which have a common functional group. The constraint forces the docking algorithm to limit the distance between a protein atom and one atom of this functional group. Docking solutions will be biased towards the specified distance range. Please note that the substructure must be a sub-graph rather than a complete molecule.
Let’s use a glycinyl fragment as the substructure template to define a constraint between its nitrogen atom and the same protein atom as above.
>>> fragment_file = 'glycinyl_fragment.mol2'
>>> fragment = MoleculeReader(fragment_file)[0]
>>> fragment_atom_N = [a for a in fragment.atoms if a.label=="N1"][0]
>>> settings.add_constraint(settings.SubstructureConstraint(protein_atom, fragment, fragment_atom_N, [3.5, 2.5], 5, False))
This finishes the instructions for constraints related to the protein binding site. Let’s now focus on protein-independent constraints.
Setting up scaffold match constraints¶
The scaffold match constraint can be used to enforce the placement of a fragment at an exact specified position in the binding site. The geometry of this fragment will not be altered during docking. The scaffold can, for example, be a common core or it may just be a substructure known to adopt a certain binding position. The scaffold must be supplied as a mol2 file containing the scaffold fragment in its docked position. Let’s load an adenine scaffold template and set up the constraint:
>>> scaffold_file = 'adenine_scaffold.mol2'
>>> scaffold = MoleculeReader(scaffold_file)[0]
>>> settings.add_constraint(settings.ScaffoldMatchConstraint(scaffold))
Setting up similarity constraints¶
Template similarity constraints do not require exact match to a given fragment and will only bias solutions towards the template without enforcing it. The similarity between ligand and template is evaluated as a Gaussian overlap term. The similarity constraint can be applied in three ways:
by using the overlap between all donor atoms in the template and the ligand being docked
by using the overlap between all acceptor atoms in the template and the ligand being docked
by using the overlap of all atoms of the template (this can be regarded as a ligand shape constraint)
Let’s re-use the same scaffold as in the above example to illustrate the different behaviour of those two constraints.
>>> settings.add_constraint(settings.TemplateSimilarityConstraint('acceptor', scaffold))
We have specified an overlap of all hydrogen bond acceptor atoms between the ligand and the template, but you can choose ‘donor’ or ‘all’ instead of ‘acceptor’ to change the behaviour of the constraint. Make sure your fragment includes hydrogen atoms if you wish to use ‘donor’ or ‘all’.
Setting up region constraints¶
It is possible to bias the docking towards solutions where particular regions of the binding site are occupied by specific types of ligand atoms (e.g. all hydrophobic atoms or all aromatic ring atoms).
>>> settings.add_constraint(settings.RegionConstraint((-40.8, 7.3, 23.4), 2.0, 'arom_ring_atoms', weight=1.0))
Setting up pharmacophore constraints¶
As with the region constraint, the pharmacophore constraint can be used to bias the docking towards solutions in which particular regions of the binding site are occupied by specific types of ligand atom (e.g. hydrogen-bond donor, hydrogen-bond acceptor, ring centre). Pharmacophore points are defined as spheres with a user-defined radius, which by default, is set to 0.7 Å. Five types of pharmacophore point can be defined: H-bond acceptor, H-bond donor, H-bond donor or acceptor, aromatic ring centre and ring centre. The pharmacophore sphere will be placed on the atom coordinates of H-bond donor or acceptor atoms and at the centroid of a ring. A contribution (determined according to a user-specified weighting) is then added to the score for each specified pharmacophore point. The contribution is added if the distance between the pharmacophore point and an atom of the same type in the molecule is less than the specified radius of the sphere. If a pharmacophore constraint point is not matched in the molecule, it will not be used during scoring.
>>> settings.add_constraint(settings.PharmacophoreConstraint((-40.8, 7.2, 23.6),
... 'aromatic_center',radius=0.7))
By default, a pharmacophore constraint point is also used as a docking fitting point. The fitting point weight of this fitting point can be changed. This behaviour can optionally be switched off if needs be.
>>> settings.add_constraint(settings.PharmacophoreConstraint((-40.4, 6.3, 17.7),
... 'donor',radius=0.7,weight=5.0,fitting_point_weight=3.0))
>>> settings.add_constraint(settings.PharmacophoreConstraint((-38.4,-0.2, 10.4),
... 'acceptor',radius=0.7,weight=5.0,use_as_fitting_point = False))
Flexible side chains¶
It is possible to specify that one or more protein side chains are to be treated as flexible during docking. This means that each flexible side chain will be allowed to undergo torsional rotation around one or more of its acyclic bonds. However, making a side chain flexible can make docking more difficult because it increases the search space that must be explored, and it may also increase the chance of false positives. A side chain should be made flexible only if there is a good reason to believe (e.g. from X-ray data) that it is likely to move in response to ligand binding.
By default, all side chains will be treated as rigid, that means they will be held fixed at their input conformation during the docking. To make a side chain flexible one or more allowed rotamers need to be defined. Each rotamer specifies the torsion angles that are permitted to vary, and the allowed values or ranges of values for those torsion angles. Each rotamer therefore describes one allowed conformation of the side chain as defined by the torsion angles values (chi1, chi2, etc.) and their allowed ranges (delta1, delta2, etc.). Rotamers can be specified in different ways. First of all let’s identify a residue that must be set as flexible.
>>> phe3946 = [r for r in settings.binding_site.residues if r.identifier=="A:PHE3946"][0]
Then, let’s create a rotamer library associated with this residue:
>>> phe3946_rotamer_library = settings.RotamerLibrary(settings.protein_files[0], phe3946)
It is possible to use rotamers from a pre-canned rotamer library read by a default parameter file which contains the most commonly observed side chain conformations for the naturally occurring amino acids.
>>> phe3946_rotamer_library.add_default_rotamers()
Alternatively, one can define a rotamer in which all rotatable torsions in the side chain will be allowed to vary over the range (delta - chi) to (delta + chi), where chi values are taken from the protein input file.
>>> phe3946_rotamer_library.add_crystal_rotamer()
A side chain can also be set to be freely rotatable. This will define a single rotamer where all rotatable torsions are permitted to vary over the range -180 to +180. Any previously defined rotamers will be lost.
>>> phe3946_rotamer_library.add_free_rotamer()
Finally, rotamers can be specified directly by the user.
>>> rotamer = settings.Rotamer(phe3946_rotamer_library.number_of_torsions)
>>> rotamer.chi1 = (-76,10,20)
>>> rotamer.chi2 = (150,10)
>>> rotamer.energy = 5
>>> phe3946_rotamer_library.add_rotamer(rotamer)
Once the rotamers have been specified and added to the library, this can be added to the settings object.
>>> settings.add_rotamer_library(settings.proteins[0], phe3946_rotamer_library)
To reset a residue as rigid:
>>> phe3946_rotamer_library.remove_rotamers()
Additional settings¶
Early termination options may be inspected and set. By default they will be set active, with number of ligands set to 3 and RMSD set to 1.5:
>>> settings = Docker().settings
>>> print(settings.early_termination)
(True, 3, 1.5)
>>> settings.early_termination = False
>>> print(settings.early_termination)
(False, None, None)
>>> settings.early_termination = (True, 5, 2.5)
>>> print(settings.early_termination)
(True, 5, 2.5)
When setting early termination, number of ligands and RMSD will assume their default values if not specified.
Diverse solutions may be requested for a docking. By default this option is not enabled. If not specified, diverse solution cluster size and RMSD will be given default values.
>>> print(settings.diverse_solutions)
(False, None, None)
>>> settings.diverse_solutions = True
>>> print(settings.diverse_solutions)
(True, 1, 1.5)
>>> settings.diverse_solutions = (True, 5, 2.5)
>>> print(settings.diverse_solutions)
(True, 5, 2.5)
Interactive docking¶
The easiest way to set up a protein-ligand docking is to read in a previously prepared docking configuration file (e.g. MLL1_PHB_Constraints.conf). This file can be used to run a docking directly, or amended through the API before launching a docking process.
>>> conf_file = 'MLL1_PHB_Constraints.conf'
>>> settings = Docker.Settings.from_file(conf_file)
>>> docker = Docker(settings=settings)
If the settings contains all the right information the docking may be launched immediately. For speed of running this test we will scale the search to 10% of the normal run, set the number of docking poses to 5, set the results directory to a temporary directory and exemplify interactive docking:
>>> settings.autoscale = 10
>>> tempdir = tempfile.mkdtemp()
>>> settings.output_directory = tempdir
>>> settings.set_hostname(ndocks=5)
>>> session = docker.dock(mode='interactive')
Now we have an interactive docking session, connecting to a docking process via sockets. We can send ligands to the process as ccdc.entry.Entry
instances, so we will construct an entry reader:
>>> SAH_file = 'SAH.mol2'
>>> SAH_entry = EntryReader(SAH_file)[0]
>>> poses = session.dock(SAH_entry)
The returned poses will be a tuple of ccdc.entry.Entry
instances, containing as attributes a dictionary of values set by the docking process. For visual inspection of the results we may want to save the poses in the output directory.
>>> from ccdc.io import EntryWriter
>>> import os
>>> for inx, pose in enumerate(poses):
... pose_name = 'gold_soln_%s_m1_%d.mol2' % (os.path.basename(SAH_file).split('.')[0], inx+1)
... with EntryWriter(os.path.join(settings.output_directory, pose_name)) as w:
... w.write(pose)
...
During the docking calculation ligand log files are produced and written into the output directory. They will be named as gold_<ligand_name>_m#.log where m# is an index to the number of the ligand parsed to the interactive docker. Because this naming scheme uses ligand names, it is preferable if they do not contain any special character.
The docking process remains connected, so further ligands may be sent to it, allowing the results of a docking to influence the choice of further dockings. The process will be terminated when the ccdc.docking.Docker.InteractiveResults
instance is deleted. The example program, Interactive docking shows an example of using this mechanism to select candidate ligands on the basis of similarity to successfully docked ligands.
Rescoring¶
The ranking of ligand docking poses according to certain scoring functions is the most important step in virtual screening for drug discovery. An ideal scoring function is expected to identify the correct binding pose for a ligand and to rank the docked compounds in order to discriminate active from inactive ligands. This capability is challenging in virtual screening, although a particular scoring function may perform better than another for selected class of proteins. Therefore, consensus approaches, using multiple scoring functions, have been proposed to overcome the limitations of individual scoring functions.
The CSD Python API allows you to rescore a single ligand or a set of ligands with the four scoring functions available in GOLD (i.e., GoldScore, ChemScore, CHEMPLP and ASP). Rescoring can be performed automatically after a docking run. This will result in the solutions from the docking being automatically scored with another scoring function. Alternatively, rescoring can be performed independently of the docking, e.g., against an existing set of GOLD solution files (or ligand poses from an alternative source).
To automatically rescore the results of a docking run with another scoring function you will need to first set up the docking in the normal way. You will therefore need to:
Provide a prepared protein input file
Define the binding site (preferably the same definition that was used for the original docking)
Specify the ligand(s) you wish to rescore
>>> from ccdc.docking import Docker
>>> docker = Docker()
>>> settings = docker.settings
>>> MLL1_protein = '2w5y_protein_prepared.mol2'
>>> SAH_ligand = 'SAH.mol2'
>>> settings.add_protein_file(os.path.abspath(MLL1_protein))
>>> ligand = MoleculeReader(SAH_ligand)[0]
>>> settings.binding_site = settings.BindingSiteFromPoint(None, ligand.centre_of_geometry(), 10.)
>>> settings.add_ligand_file(SAH_ligand, 10)
>>> settings.output_directory = 'results'
Then select the required scoring functions to be used for the docking and the rescore:
>>> settings.fitness_function = 'goldscore'
>>> settings.rescore_function = 'plp'
>>> settings.output_file = 'ligand_rescore.mol2'
By default, the docked ligand pose are minimised before rescoring. This is important as, due to the nature of scoring functions, one finds that small changes in location or conformation of the pose can have large effects on the calculated score.
To rescore an existing set of GOLD solution files or ligand poses from an alternative source (i.e., without first running a docking) we should set the fitness function to None and specify the scoring function to be used for rescoring.
>>> settings.fitness_function = None
>>> settings.rescore_function = 'plp'
Protein and Ligand Preparation¶
It is often necessary to prepare the protein and ligand molecules to make them more appropriate for docking. For example, it might be necessary to protonate a protein or assign bond types to a ligand.
A protein can be protonated using the ccdc.protein.Protein.add_hydrogens()
method. View the descriptive protein documentation for further details.
The CSD Python API allows you to create a ccdc.docking.Docker.LigandPreparation
object in order to automatically prepare ligands for docking.
By default, the following operations are performed:
Unknown and dummy atoms are removed
Unknown bond types are assigned
SMARTS based protonation rules are applied
Missing hydrogen atoms are added
The following example shows the default preparation of the SAH ligand.
>>> SAH_ligand = 'SAH.mol2'
>>> ligand_prep = Docker.LigandPreparation()
>>> prepared_lig = ligand_prep.prepare(EntryReader(SAH_ligand)[0])
>>> from ccdc.io import MoleculeWriter
>>> with MoleculeWriter('prepared_lig.mol2') as mol_writer:
... mol_writer.write(prepared_lig.molecule)
...
Each ligand preparation object has a settings attribute, which can be used to control the preparation behaviour.
>>> ligand_prep = Docker.LigandPreparation()
>>> ligand_prep.settings.remove_unknown_atoms = True
>>> ligand_prep.settings.add_hydrogens = False
When molecules are processed in interactive mode, the default ligand preparation steps are automatically enabled. However, it is possible to amend the default settings, or if ligands are pre-prepared we can switch the ligand preparation off. The following example shows how to add the standardisation of bond types according to CSD conventions.
>>> conf_file = 'MLL1_PHB_Constraints.conf'
>>> settings = Docker.Settings.from_file(conf_file)
>>> docker = Docker(settings=settings)
>>> ligand_prep = docker.LigandPreparation()
>>> ligand_prep.settings.standardise_bond_types = True
>>> session = docker.dock(mode='interactive')
>>> session.ligand_preparation = ligand_prep
If we don’t want the ligand preparation to be performed then:
>>> session.ligand_preparation = None
Output File Write Options¶
In some use cases, not all of GOLD’s output files are necessary nor desirable. For example, for high-throughput virtual screening, GOLD’s default output options will be expensive in both computation and disk space.
As a result, you may wish to use GOLD’s write options to disable the output of certain files.
The following example demonstrates a simple use of GOLD’s write_options:
>>> settings = Docker().settings
>>> settings.write_options = 'MIN_OUT'
>>> settings.write('gold.conf')
In this example, ‘MIN_OUT’ is allocated to write_options in order to disable all files except for gold.log and bestranking.lst. The following is a list of all available write options:
MIN_OUT: Use this to write only the gold.log and bestranking.lst files. This is the recommended option for high-throughput virtual screening
NO_LOG_FILES: Use this to disable the writing of all ligand log files and the gold_protein.log file.
NO_LINK_FILES: Use this to disable the writing of ranked pose shortcut files to solution files. By default, one file is written per solution file.
NO_RNK_FILES: Use this to disable the writing of the ranked fitness lists (.rnk extension) for each molecule. By default, one file is written per ligand.
NO_BESTRANKING_LST_FILE: Use this to disable the writing of the bestranking.lst file which includes a list of the highest scoring pose for each ligand.
NO_GOLD_SOLN_LIGAND_MOL2_FILES: Use this to disable the writing of all solution files. As there would be nothing to point to, this option also disables the writing of the ranked pose shortcut files.
NO_GOLD_LIGAND_MOL2_FILE: Use this to disable the writing of all ligand files. By default, one file is written per ligand.
NO_GOLD_PROTEIN_MOL2_FILE: Use this to disable the writing of the protein file. By default, one file is written per target protein.
NO_LGFNAME_FILE: Use this to disable the writing of the .lgfname file.
NO_PLP_MOL2_FILES: If using the ChemPLP scoring function, use this to disable the writing of plp_ligand.mol2 and plp_protein.mol2.
NO_PID_FILE: Use this to disable the writing of the gold.pid file.
NO_SEED_LOG_FILE: Use this to disable the writing of the gold.seed_log file.
NO_GOLD_ERR_FILE: Use this to disable the writing of the gold.err file.
NO_FIT_PTS_FILES: Use this to disable the writing of all files related to fitting points including, but not limited to, fit_pts.mol2 and fit_pts_merged.mol2.
NO_ASP_MOL2_FILES: If using the ASP scoring function, use this to disable the writing of asp_ligand.mol2 and asp_protein.mol2.
NO_GOLD_LOG_FILE: Use this to disable the writing of gold.log.
If you wish to pass more than one write option, they must be allocated together. This can be in a single string, or in a list of strings. For example, the following lines are equivalent:
>>> settings.write_options = 'NO_RNK_FILES NO_LOG_FILES NO_GOLD_ERR_FILE'
>>> settings.write_options = ['NO_RNK_FILES', 'NO_LOG_FILES', 'NO_GOLD_ERR_FILE']
Also, the write options listed above are case insensitive. For example, the following lines are equivalent:
>>> settings.write_options = 'NO_LOG_FILES'
>>> settings.write_options = 'no_log_files'
Setting up ensemble docking with waters¶
It is possible to configure advanced docking settings such as ensemble docking and docking with water molecules. This example will show how to use them both. Ensemble docking is one of the methodologies designed to address the issue of protein flexibility by adding multiple conformations of the target protein rather than just the single rigid receptor structure used in standard docking. Here we will show how to perform a non-native docking of a potent antiviral compound into an ensemble of four different protein conformers of Thymidine kinase (TK) (PDB entry codes 1e2k, 1e2i, 1of1 and 4ivq). All protein have been superimposed onto chain A of 1of1. First of all, let’s add the proteins individually to the settings object and let’s create a merged protein object from the four different protein files.
>>> from ccdc.molecule import Molecule
>>> from ccdc.protein import Protein
>>> docker = Docker()
>>> settings = docker.settings
>>> tk_1e2h = '1E2H_protein.mol2'
>>> tk_1e2i = '1E2I_protein.mol2'
>>> tk_1of1 = '1OF1_protein.mol2'
>>> tk_4ivq = '4IVQ_protein.mol2'
>>> protein_files = [tk_1e2h, tk_1e2i, tk_1of1, tk_4ivq]
>>> merged_protein = Molecule()
>>> for protein_file in protein_files:
... settings.add_protein_file(os.path.abspath(protein_file))
... protein = Protein.from_file(os.path.abspath(protein_file))
... merged_protein.add_molecule(protein)
Then, let’s define the binding site from a reference ligand. This binding site definition will be applicable across the whole ensemble.
>>> ligand_file = '1E2K_ligand.mol2'
>>> ligand = MoleculeReader(ligand_file)[0]
>>> settings.binding_site = settings.BindingSiteFromLigand(merged_protein, ligand)
The same ligand will also be used for predicting its binding conformation.
>>> settings.add_ligand_file(ligand_file, 10)
We can set some other useful docking parameters, such as the fitness function to be used or the output directory where docking results will be saved and the ensemble docking is ready to be run:
>>> settings.fitness_function = 'plp'
>>> settings.autoscale = 10.
>>> settings.early_termination = False
>>> import tempfile
>>> batch_tempd = tempfile.mkdtemp()
>>> settings.output_directory = batch_tempd
However, before proceeding with the docking run, we will define some active water molecules that should be considered during the ensemble docking. There are three active water molecules: two waters that coordinate hydrogen bonds between the thymidine ring of the ligand and Arg176 of TK, and a third water molecule that coordinates hydrogen bonds between the nucleobase of the ligand and the side chain of Gln125. This water molecule can compete with the thymidine ring of the antiviral compound to form direct hydrogen bonds with Gln125 and therefore can be displaced. Waters must be provided in separate files, one water molecule per file.
Settings can be customised for specific water molecules. For example, with the toggle_state parameter:
Toggle state leaves the docking program to decide whether the water should be present or absent (bound or displaced by the ligand) during the docking.
On state sets the water to be always present in the binding site and allows the hydrogen positions to vary during docking in order to maximise the hydrogen bonding score both from interactions with the protein and the ligand.
The Off water state option allows a water to be removed from consideration during docking.
The orientation of the waters can be also changed with the spin_state parameter:
the spin option will automatically optimise the orientation of the hydrogen atoms.
the trans_spin option will make the docking algorithm spin and translate the water molecule to optimise the orientation of the hydrogen atoms as well as the water molecule’s position within a defined radius. Note that the distance value must be between 0 and 2 Å.
the fix option will use the orientation specified in the input file.
Here we will set each water molecule to toggle state, and will allow them to trans-spin up to 1 Å.
>>> water_1 = 'water_1.mol2'
>>> water_2 = 'water_2.mol2'
>>> water_3 = 'water_3.mol2'
>>> settings.add_water_file(water_1, toggle_state='toggle', spin_state='trans_spin', movable_distance=1.0)
>>> settings.add_water_file(water_2, toggle_state='toggle', spin_state='trans_spin', movable_distance=1.0)
>>> settings.add_water_file(water_3, toggle_state='toggle', spin_state='trans_spin', movable_distance=1.0)
Finally we launch the docker:
>>> results = docker.dock()
We can now inspect the results to see which protein conformer has been selected, and how many waters have been used during the docking calculation.
>>> docked_ligands = Docker.Results(settings).ligands
>>> scores = [l.fitness() for l in docked_ligands]
>>> i = scores.index(max(scores))
>>> top_ranked = docked_ligands[i]
>>> print("Fitness: {}, EnsembleProteinID: {}, Number of docked waters: {}.".format(top_ranked.fitness(),top_ranked.scoring_term('ensemble', 'id'), len(top_ranked.docked_waters)))
Fitness: 84.7412, EnsembleProteinID: 3.0, Number of docked waters: 2.
The top-ranked solution uses proteins 3 (1of1) and displaces one of the three water molecules, as only two are used for docking.