Similarity searching
Introduction
To set up searches we need to import the ccdc.search module. Let us also import the ccdc.io module so that we can read in and write out molecules.
>>> import ccdc.search
>>> import ccdc.io
As a preamble, let us also set up a variable for a temporary directory and the path to a file containing a testosterone molecule.
>>> import tempfile
>>> tempdir = tempfile.mkdtemp()
>>> filepath = 'testosterone.mol2'
To get access to the molecule in the testosterone mol2 file we make use of a ccdc.io.MoleculeReader.
>>> reader = ccdc.io.MoleculeReader(filepath)
>>> testosterone = reader[0]
Similarity search
To run a similarity search one must first create a ccdc.search.SimilaritySearch, whose initialiser takes a ccdc.molecule.Molecule and a similarity threshold between 0.0 and 1.0. By default the similarity threshold is set to 0.7.
>>> similarity_query = ccdc.search.SimilaritySearch(testosterone)
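To give a feel for what a threshold between 0.0 and 1.0 means, here is a minimal sketch that scores two fingerprints with a Tanimoto coefficient. This is an illustration only: the text above does not say which similarity measure or fingerprint is used, so the Tanimoto choice, the `tanimoto` helper, the bit sets, and the `>=` comparison against the threshold are all assumptions made for this example and are not part of the ccdc API.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Two empty fingerprints: treated as identical here by convention.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical fingerprints, invented purely for illustration.
query_bits = {1, 4, 7, 9, 12}
close_bits = {1, 4, 7, 9, 12, 15}   # shares 5 of 6 distinct bits with the query
distant_bits = {2, 3, 12}           # shares 1 of 7 distinct bits with the query

threshold = 0.7  # same value as the default quoted above
for name, bits in [('close', close_bits), ('distant', distant_bits)]:
    score = tanimoto(query_bits, bits)
    # Assumption: a candidate counts as a hit when its score reaches the threshold.
    verdict = 'hit' if score >= threshold else 'rejected'
    print('%8s: %.2f %s' % (name, score, verdict))
```

Under these assumptions the first candidate scores 5/6 and passes a threshold of 0.7, while the second scores 1/7 and is rejected.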
The similarity search can then be run by calling the search() method.
>>> sim_hits = similarity_query.search()
>>> print(len(sim_hits))
796
To reduce the number of hits we can increase the similarity threshold.
>>> similarity_query.threshold = 0.9
>>> sim_hits = similarity_query.search()
>>> print(len(sim_hits))
83
An alternative approach to reducing the number of hits is to cap the number of hit structures returned.
>>> sim_hits = similarity_query.search(max_hit_structures=10)
>>> print(len(sim_hits))
10
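The two ways of trimming a hit list shown above, raising the threshold and capping the number of structures, can be pictured with a small pure-Python sketch. The `Hit` tuple, the `filter_hits` helper, and the scores below are hypothetical and exist only for this illustration. The sketch keeps the first matches in input order, which is an assumption: the text above does not say how the library decides which hits to keep when a cap is applied.

```python
from collections import namedtuple
from itertools import islice

# Hypothetical stand-in for a search hit; not the ccdc hit class.
Hit = namedtuple('Hit', ['identifier', 'similarity'])


def filter_hits(scored, threshold=0.7, max_hit_structures=None):
    """Keep entries scoring at least `threshold`, stopping after `max_hit_structures`."""
    passing = (Hit(ident, score) for ident, score in scored if score >= threshold)
    # islice with a stop of None places no cap on the number of hits.
    return list(islice(passing, max_hit_structures))


# Invented identifiers and scores, for illustration only.
scored = [('MOL_A', 1.00), ('MOL_B', 0.93), ('MOL_C', 0.74),
          ('MOL_D', 0.91), ('MOL_E', 0.42)]

print(len(filter_hits(scored)))                 # default threshold of 0.7
print(len(filter_hits(scored, threshold=0.9)))  # stricter threshold
print(len(filter_hits(scored, threshold=0.9, max_hit_structures=2)))  # capped
```

Because the cap stops the generator early, a capped search in this sketch does less work than filtering everything and slicing afterwards.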
Let us find out what these structures are and what their similarity to the query is.
>>> for hit in sim_hits:
...     print('%9s: %.2f' % (hit.identifier, hit.similarity))
   BAWMAN: 1.00
   BEJVAN: 1.00
   BOKVUS: 1.00
   CERVAX: 1.00
   DIGRIV: 1.00
   EFEJAD: 1.00
   EPITES: 1.00
   GIXVIW: 1.00
   GIXVOC: 1.00
   HANSTO: 1.00
Similarity searches allow all forms of search filter given by ccdc.search.Search.Settings. See Search filters for examples of use.
Similarity queries allow all the forms of search that a ccdc.search.SubstructureSearch does.
See also
For more information, please see Setting up and running a substructure search.
For example, we can run the query against an instance of a ccdc.molecule.Molecule directly.
>>> file_path = 'ABEBUF.mol2'
To get access to the first molecule in the ABEBUF.mol2 file we make use of a ccdc.io.MoleculeReader, and pass that molecule to search_molecule().
>>> h = similarity_query.search_molecule(ccdc.io.MoleculeReader(file_path)[0])
>>> print('Identifier: %s, similarity: %.3f' % (h.identifier, h.similarity))
Identifier: ABEBUF, similarity: 0.106