Protein-ligand searching

Note

This functionality is available only to CSD-Discovery and CSD-Enterprise users.

Drug discovery scientists are often required to data mine 3D information. Here we discuss how one can combine 3D substructure searching with searching of protein-ligand binding sites derived from the Protein Data Bank (PDB) to, for example, find ligands with specific/similar structures; investigate patterns of interactions between protein binding sites, waters, metals and small molecules; or understand geometric patterns within small molecules bound to proteins.

See also

For more information on substructure searching and how to define geometric measurements and/or constraints please see Setting up and running a substructure search

Let us import the relevant modules:

>>> import ccdc.io
>>> import ccdc.search

These searches can be performed on the structure database (i.e., pdb_crossminer.csdsqlx) released with CSD-CrossMiner alongside the feature database.

Metalloproteins, proteins containing a metal ion in their structures, have a wide variety of functions in vivo. The family of Zinc metalloproteins contains many attractive drug targets. It would be therefore interesting to find all Zinc-containing proteins in the database. To do so, one can define a substructure search for the Zinc ion by specifying that it should belong to the ‘metal’ component of a protein-ligand complex.

>>> from ccdc.pharmacophore import Pharmacophore
>>> db_file = os.path.join(os.path.dirname(Pharmacophore.default_feature_database_location()), 'pdb_crossminer.csdsqlx')
>>> db = ccdc.io.EntryReader(db_file)
>>> print(len(db))
355744
>>> query = ccdc.search.SubstructureSearch()
>>> zinc = ccdc.search.SMARTSSubstructure('[Zn]')
>>> zinc.atoms[0].add_protein_atom_type_constraint('Metal')
>>> sub = query.add_substructure(zinc)
>>> query.settings.max_hits_per_structure = 1
>>> hits = query.search(db)
>>> print(len(hits))
10465

Similarly, we can search for organometallic ligands in the database where the Zinc is covalently bound to another atom. The search will be set up in a very similar way, but this time the atoms in the SMART pattern will be constrained to belong to the ‘ligand’ component of the database entry.

>>> query = ccdc.search.SubstructureSearch()
>>> zinc = ccdc.search.SMARTSSubstructure('[Zn]~*')
>>> zinc.atoms[0].add_protein_atom_type_constraint('Ligand')
>>> zinc.atoms[1].add_protein_atom_type_constraint('Ligand')
>>> sub = query.add_substructure(zinc)
>>> query.settings.max_hits_per_structure = 1
>>> print(len(query.search(db)))
64

It is also possible to add geometric constraints to the search. For example, one might want to find all entries in the database where the Zinc cation coordinates the carboxylate group of the protein residues.

>>> query = ccdc.search.SubstructureSearch()
>>> zinc = ccdc.search.SMARTSSubstructure('[Zn]')
>>> zinc.atoms[0].add_protein_atom_type_constraint('Metal')
>>> sub = query.add_substructure(zinc)
>>> carboxylate = ccdc.search.SMARTSSubstructure('c(~o)~o')
>>> for a in carboxylate.atoms:
...    a.add_protein_atom_type_constraint('Amino')
>>> sub = query.add_substructure(carboxylate)
>>> query.add_distance_constraint('DIST1', (0, 0), (1, 1), (-5, 2), vdw_corrected=True, type='any')
>>> hits = query.search(db)
>>> print(len(hits))
28171

All measurements and/or contraints values can be retrieved from the individual hits:

>>> for h in hits:
...    for k, v in h.constraints.items():
...        print("Identifier: {}, {}:  {}".format(h.identifier, k, round(v, 3))) 
Identifier: 1B3D_m1_B_bs_S27_B_401, DIST1:  4.804
Identifier: 1B3D_m1_B_bs_S27_B_401, DIST1:  4.18
Identifier: 1A7T_m1_A_bs_MES_A_250, DIST1:  3.936
...
Identifier: 966C_m1_A_bs_RS2_A_1, DIST1:  4.332
Identifier: 8CPA_m1_A_bs_AGF_A_309, DIST1:  2.247
Identifier: 8CPA_m1_A_bs_AGF_A_309, DIST1:  2.215

Hits can be superimposed onto the matching atoms. Here we show how to superimpose the top 10 hits onto the first one.

>>> from ccdc.descriptors import MolecularDescriptors
>>> for h in hits[:10]:
...    ov = MolecularDescriptors.Overlay(hits[0].molecule, h.molecule, atoms=zip(hits[0].match_atoms(), h.match_atoms()), with_symmetry=False)
...    mol = ov.molecule

It’s possible to define complex query. For example, let’s find all interactions between a ligand carboxylate group and a protein guanidino group.

../_images/3D_query.png
>>> query = ccdc.search.SubstructureSearch()

>>> carboxylate = ccdc.search.SMARTSSubstructure('c~(o)~o')
>>> for a in carboxylate.atoms:
...    a.add_protein_atom_type_constraint('Ligand')

>>> guanidino = ccdc.search.SMARTSSubstructure('n:c(:n):n')
>>> for a in guanidino.atoms:
...    a.add_protein_atom_type_constraint('Amino')

>>> sub1 = query.add_substructure(guanidino)
>>> sub2 = query.add_substructure(carboxylate)

>>> query.add_plane('PLANE1', (sub1, 1), (sub1, 2), (sub1, 3))
>>> query.add_plane('PLANE2', (sub2, 0), (sub2, 1), (sub2, 2))

>>> query.add_centroid('CENT1', (sub1, 2), (sub1, 3))
>>> query.add_centroid('CENT2', (sub2, 1), (sub2, 2))

>>> query.add_distance_constraint('DIST1', 'CENT1', 'CENT2', (0, 3.0))

>>> query.add_plane_angle_constraint('ANG1', 'PLANE1', 'PLANE2', (0, 25))
>>> query.settings.max_hits_per_structure = 1

In such cases, it is recommended to use the ccdc.search.SubstructureSearch.HitProcessor to iterate over the hits.

>>> class WriteHit(ccdc.search.SubstructureSearch.HitProcessor):
...    '''Print out the hit id each time we find one'''
...    def __init__(self, max_hits=1000):
...        self.hits = [] # ccdc.search.SubstructureHit objects
...        self.max_hits = max_hits
...
...    def add_hit(self, hit):
...        self.hits.append(hit)
...        print('%s: nhits %d' % (hit.identifier, len(self.hits)))
...        if len(self.hits) == self.max_hits:
...            self.cancel()

>>> hit_writer = WriteHit()
>>> hit_writer.search(query, db) 
1A2N_m1_A_bs_TET_A_420: nhits 1
1A59_m1_A_bs_CIT_A_379: nhits 2
1A4Q_m1_B_bs_DPC_B_466: nhits 3
...
2PJC_m1_C_bs_343_C_601: nhits 998
2PJC_m1_B_bs_343_B_501: nhits 999
2PJC_m1_A-B_bs_343_A_401: nhits 1000