Field-based virtual screening

Note

The ccdc.screening module is available only to CSD-Discovery and CSD-Enterprise users.

Introduction

The CSD Python API provides programmatic access to the CSD data and the CSD Software functionality, as well as to features that have never been exposed within an interface. One of these features is the ccdc.screening module that can be used to screen a library of compounds against a pharmacophore query obtained from one or multiple overlaid ligands. The algorithm generalises the 3D pharmacophore definition using atom property fields that are created around the query based on user-defined atom types and potentials.

Three main steps are performed:

  • Generation of a field potential from the query and creation of fitting points in hot-spots

  • Global optimisation of the translation and rotation of each ligand by generating conformer libraries and then fitting the ligand atoms to precalculated fitting points

  • Scoring of each screened ligand with numerical gradients.

The methodology requires two input files (i.e., a query and the molecules to screen) where all atoms have 3D coordinates.

Screening Workflow

The ccdc.screening.Screener is available in the ccdc.screening module. Let us import this module.

>>> from ccdc.screening import Screener

Let us also import the ccdc.io module so that we can read in the input files.

>>> from ccdc import io

>>> query_file = 'P28845.sdf'
>>> screen_set_file = 'P28845_actives.sdf'
>>> query = [m for m in io.MoleculeReader(query_file)]
>>> screen_set = [m for m in io.MoleculeReader(screen_set_file)]

The ccdc.screening.Screener.Settings allows the output directory to be defined. This is where all generated files will be stored.

>>> import os

>>> settings = Screener.Settings()
>>> settings.output_directory = os.path.join(os.getcwd(), "screen_data")

It is also possible to define the parameter directory if one wishes to use customised atom types definition and potentials. Here we use the default.

The ccdc.screening.Screener.screen() performs the field-based ligand screening, returning a ccdc.screening.Screener.ScreenHitList. Let us generate the field potentials around the query. Field maps for the donor, acceptor and non-polar fields will be saved to the user defined output directory in .ACNT format (e.g., grid_ACCEPTOR_MEDIUM.acnt). The corresponding fitting points in Mol2 format (e.g., fitpts_ACCEPTOR_MEDIUM.mol2) will be saved also to the output directory.

>>> screener = Screener(query, settings=settings)

Molecules in screen_set are then screened using their input conformation.

>>> results = screener.screen([[m] for m in screen_set]) 

It is also possible to use conformer libraries and select the conformation of each ligand that best matches the query fields.

See also

For more information on the conformer generation please see the descriptive conformer generation and molecular minimisation documentation and the ccdc.conformer.ConformerGenerator API documentation.

Each result is represented in a ccdc.screening.Screener.ScreenHitList.ScreenHit instance. Here we sort the list of scores in ascending order (the more negative the score the better) and write out the screened molecules to a Mol2 file.

>>> scores = sorted([(r.score, r.identifier) for r in results]) 
>>> from ccdc.io import MoleculeWriter 
>>> molwriter = MoleculeWriter('P28845_results.mol2') 
>>> for r in results: 
...    molwriter.write(r.molecule) 

Library Screening of actives and decoys

Before committing to a virtual screening campaign of millions of compounds, it is best practise to validate the virtual screening tool that you are going to use. This is often done by screening a library of known active and decoy molecules against the pharmacophore model, followed by the calculation of several enrichment metrics. A ROC curve is also plotted to assess the success of the model in scoring and ranking the active molecules earlier than the decoys.

Here we will use a collection of 47 actives and 2070 decoys for Cyclin-dependent kinase 2 (CDK2). They are publicly available from DUD LIB VS 1.0. The pharmacophore query will be generated from a subset of 13 CDK2 inhibitors taken from the AZ test set, a benchmarking data set for pharmacophore programs. These molecules were arbitrarily chosen but cover all different chemotypes represented in the whole set. The CSD Ligand Overlay program was used to flexibly align these molecules. Here we are going to use one of the resulting solutions to generate our pharmacophore query.

First of all, let’s read in the query file.

>>> query_file = 'solution_02.sdf'
>>> query = [m for m in io.MoleculeReader(query_file)]

Then, let’s setup the screener settings.

>>> import os
>>> settings = Screener.Settings()
>>> settings.output_directory = os.path.join(os.getcwd(), "screen_data")
>>> screener = Screener(query, settings=settings, nthreads=4)

The field maps are now generated around the input query and four threads were used for the calculation.

../_images/maps.png

Let’s now screen the active molecules. Note that we are not generating conformers for them but it would be good to generate at least 25 conformations. The program will then optimise the placement of every conformer onto the fields, and only the top-scoring conformation will be retained and used for ranking.

>>> actives_to_screen = 'cdk2_clustered_3D_MM.sdf'
>>> output_name_actives = os.path.join(settings.output_directory, "actives_screened.mol2") 
>>> screen_set = [m for m in io.MoleculeReader(actives_to_screen)] 
>>> scores=[]
>>> with io.MoleculeWriter(output_name_actives) as molwriter:
...    for mol in screen_set:
...        mol.standardise_aromatic_bonds()
...        mol.standardise_delocalised_bonds()
...        res = screener.screen([ [mol] ])   
...        scores.extend([(r.score, 1, r.identifier) for r in res]) 
...        for r in res: 
...            molwriter.write(r.molecule) 
>>> actives_scores = sorted(scores)

Similarly, we screen the decoy molecules.

>>> decoys_to_screen = 'DUD_cdk2_decoys_ID_pass_MWPass_I_MM.sdf'
>>> output_name_decoys = os.path.join(settings.output_directory, "decoys_screened.mol2") 
>>> screen_set = [m for m in io.MoleculeReader(decoys_to_screen)]
>>> scores=[]
>>> with io.MoleculeWriter(output_name_decoys) as molwriter:
...    for mol in screen_set:
...        mol.standardise_aromatic_bonds() 
...        mol.standardise_delocalised_bonds() 
...        res = screener.screen([ [mol] ])   
...        scores.extend([(r.score, 0, r.identifier) for r in res]) 
...        for r in res:
...            molwriter.write(r.molecule)
>>> decoys_scores = sorted(scores)

We can now combine the scores of the actives and decoys.

>>> all_data = actives_scores
>>> all_data.extend(decoys_scores)
>>> screening_scores = sorted(all_data)
>>> output_name_scores = os.path.join(settings.output_directory, "screening_scores.csv")
>>> with open(output_name_scores, 'w') as f:
...    for r in screening_scores:
...        score, activity, identifier = r
...        _ = f.write('%.3f, %d, %s\n' %(score, activity, identifier))

Finally, we can calculate some enrichment metrics (e.g. the AUC).

>>> from ccdc.descriptors import StatisticalDescriptors
>>> import operator
>>> rank_stats = StatisticalDescriptors.RankStatistics(screening_scores, activity_column=operator.itemgetter(1))  
>>> print(round(rank_stats.AUC(), 3)) 
0.753

Note that you may get slightly different results for the AUC value as the algorithm is stochastic.

The argument, activity_column indicates which column of the scores data should be interpreted as an activity classification. We have assigned an activity equal to 1 to active molecules and 0 to decoys.

See also

The “Field-based virtual screening” example in the cookbook documentation.