This module enables the calculation of probabilities for the formation of hydrogen bonds based on a statistical model built from relevant structures in the CSD. The model encapsulates information regarding the environment of the functional groups, which ensures the prediction is specific to the target molecule.
To run it we will import the appropriate module:
>>> from ccdc.descriptors import CrystalDescriptors
Like most complex classes of the API, it has a nested class
ccdc.descriptors.CrystalDescriptors.HBondPropensities.Settings to control the operation of the HBond propensity calculation. Typically the default construction is appropriate, but there are methods and properties to fine-tune the prediction. Specifically it has a
ccdc.molecule.Molecule.HBondCriterion instance to specify the geometry and the atom types which constitute a hydrogen bond. See HBond Criterion for details of its API.
The databases property may be changed. This requires a list of CSD format databases: currently the HBond propensity API does not support other format databases.
It is possible to set the functional groups directory for the calculation. It is not normally necessary to do this, unless unusual hydrogen bonds need to be analysed.
The working directory for the calculation may be set: by default it will be an arbitrary temporary directory.
First, let us create an
>>> hbp = CrystalDescriptors.HBondPropensities()
The first thing to do is to set a target crystal; here we have chosen VAYXOI which is the refcode of Omeprazole, which is used in the treatment of gastric acid disorders.
>>> from ccdc import io >>> csd = io.CrystalReader('csd') >>> crystal = csd.crystal('VAYXOI') >>> hbp.set_target(crystal)
Having set a target, the HBondPropensities instance now has a
ccdc.descriptors.CrystalDescriptors.HBondPropensities.functional_groups attribute, which is a tuple of
ccdc.descriptors.CrystalDescriptors.HBondPropensities.FunctionalGroup instances, each of which matches a hydrogen donor or acceptor of the target structure.
>>> print(hbp.functional_groups) (FunctionalGroup(ar_N_2), FunctionalGroup(sulfinyl_1), FunctionalGroup(imidazole_1), FunctionalGroup(ar_methoxy)) >>> fg = hbp.functional_groups >>> print(fg.identifier) sulfinyl_1
>>> hits = fg.matches(crystal) >>> print(len(hits)) 1 >>> print(hits.match_atoms()) [Atom(O1), Atom(S1), Atom(C6), Atom(C7)]
There will also be attributes
ccdc.descriptors.HBondPropensities.acceptors containing instances of
ccdc.descriptors.HBondPropensities.HBondAcceptor respectively. These have a few useful properties:
>>> print(hbp.donors) (HBondDonor(NH1(atom 3 of imidazole_1)) no evidence,) >>> print(hbp.acceptors) (HBondAcceptor(N(atom 1 of ar_N_2)) no evidence, HBondAcceptor...) >>> d = hbp.donors >>> print(d.functional_group_identifier) imidazole_1 >>> print(round(d.accessible_surface_area, 2)) 0.17 >>> print(d.nlone_pairs) 1 >>> print(d.identifier) NH1(atom 3 of imidazole_1)
There is currently no evidence for these HBondAtoms: this will be supplied after we construct and analyse the fitting data. The method ccdc.descriptors.CrystalDescriptors.HBondPropensities.hbond_atoms may be used to determine the hydrogen bonding atoms of an arbitrary crystal, not just the target of the HBondPropensities calculation.
The fitting data are candidate structures from the databases of the settings object which have the potential, as measured by fingerprint screening, to match one or more of the functional groups of the target crystal. This can be generated from the HBondPropensities instance or loaded from a file, most usually and computationally cheapest, a GCD file.
These objects are instances of
To limit the candidate structures via a gcd file, use the static from_file method
>>> file_name = 'fitting_data.gcd'
>>> fitting_data = CrystalDescriptors.HBondPropensities.FittingData.from_file(file_name, hbp.settings.databases) >>> print(len(fitting_data)) 49894 >>> print(fitting_data.advice_comment()) good size
The fitting data may then be assigned to the HBondPropensities instance for use in the rest of the calculation:
>>> hbp.fitting_data = fitting_data
The fitting data have not yet been matched against the functional groups of the targets. This will happen once the match_fitting_data method is called.
The above steps are optional and the generation can be performed on the entire database by only running the following method:
>>> hbp.match_fitting_data(count=50) 137 structures ar_N_2: 67 (could do with more) sulfinyl_1: 50 (not enough) imidazole_1: 51 (could do with more) ar_methoxy: 56 (could do with more)
The method, match_fitting_data, takes two keyword arguments. The first, count, is how many instances to find for each functional group; the second, verbose, will periodically print a brief summary of how many matching structures have been found for each functional group. The default for the former is 300, which is deemed to be a good number for the group. For speed in running this test this has been reduced to 50, which is rather minimal.
After calling this method there should be at least the specified number of matched fitting data items for each functional group. This is confirmed by the call to match_fitting_data, but can be checked explicitly by calling:
>>> hbp.show_fitting_data_counts() 137 structures ar_N_2: 67 (could do with more) sulfinyl_1: 50 (not enough) imidazole_1: 51 (could do with more) ar_methoxy: 56 (could do with more)
If these are not deemed to be sufficient, match_fitting_data may be called with a larger count.
The fitting data have now been pruned to a more manageable level. We can now analyse the hydrogen bonds in the fitting data set. After this analysis donors and acceptors have been augmented with the positive and negative instances in the fitting data set.
>>> hbp.analyse_fitting_data() >>> print(hbp.donors) (HBondDonor(NH1(atom 3 of imidazole_1)) 61 positive 80 negative,) >>> print(hbp.acceptors) (HBondAcceptor(N(atom 1 of ar_N_2)) 24 positive 53 negative, HBondAcceptor(O(atom 0 of sulfinyl_1)) 41 positive 15 negative, HBondAcceptor(N(atom 0 of imidazole_1)) 42 positive 45 negative, HBondAcceptor(O(atom 1 of ar_methoxy)) 6 positive 49 negative, HBondAcceptor(O(atom 1 of ar_methoxy)) no evidence)
There are other attributes of the donors and acceptors: accessible_surface_area, functional_group_identifier, nlone_pairs, identifier, label (of the atom in the crystal structure’s molecule) and the atom itself. For donors the donor_atom_type property is the type as determined by the HBondCriterion class. Similarly for acceptors.
Where there is insufficient evidence for the logistic regression to use, these will be excluded. In this case there is enough evidence for all features, but if you wish you can explicitly exclude features by setting the
ccdc.descriptors.CrystalDescriptors.HBondPropensities.HBondAtom.exclude to True. Here the last acceptor, atom_1_of_ar_methoxy, has very little positive evidence, so it will be a more robust regression if we exclude it:
>>> hbp.acceptors[-1].exclude = True
We are now ready to perform the regression. This uses the statistical package, R (R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, R project, which is included in your CSD distribution.
When we perform the logistic regression various data files are written to the working directory. The files written are:
VAYXOI_training_dataset.csv - the dataset on which the logistic regression will be performed.
prediction_molecule_VAYXOI.mol2 - the molecule whose crystal was used for the prediction
fitting_data_VAYXOI.gcd - the fitting data generated (or read) for the prediction
VAYXOI_counts.html - a table of the evidence for the hydrogen bonds
*.con - connser files matching the functional groups of the crystal
>>> model = hbp.perform_regression()
The returned value is an instance of
ccdc.descriptors.CrystalDescriptors.HBondPropensities.Model. It can report the regression equation of the model, the area under the ROC curve and a qualitative measure of how well the model fits the data:
>>> print(model.equation) 2.13 + 0.25*Donorother + 1.05*Acceptoratom_0_of_sulfinyl_1 + -3.39*Acceptoratom_1_of_ar_methoxy + -0.68*Acceptoratom_1_of_ar_N_2 + -0.02*Acceptorother + 0.09*competition + -0.02*Donor_steric_density + -0.03*Acceptor_steric_density + 0.06*Donor_aromaticity + 0.26*Acceptor_aromaticity >>> print(round(model.area_under_roc_curve, 3)) 0.825 >>> print(model.advice_comment) good
We can choose which parameters are to be used as explanatory variables in the calculation of the regression. Here we will remove the aromaticity parameters from the model and re-run the regression:
>>> params = hbp.all_parameters >>> print('\n'.join(p.identifier for p in params)) Donor atom of group Acceptor atom of group Competition Donor steric density Acceptor steric density Donor aromaticity Acceptor aromaticity >>> params[-1].include = params[-2].include = False >>> model = hbp.perform_regression() >>> print(model.advice_comment) good >>> print(round(model.area_under_roc_curve, 1)) 0.8 >>> print(model.equation) 3.13 + 0.28*Donorother + 0.92*Acceptoratom_0_of_sulfinyl_1 + -2.30*Acceptoratom_1_of_ar_methoxy + -0.79*Acceptoratom_1_of_ar_N_2 + 0.90*Acceptorother + -0.02*Competition + -0.03*Donor_steric_density + -0.04*Acceptor_steric_density
The excluded parameters are no longer present in the regression equation, and the quality of the regression has declined by a small amount.
Now we can calculate the HBond propensities.
>>> propensities = hbp.calculate_propensities()
This will calculate the propensity for each of the hydrogen bonds in the analysed fitting data.
>>> inters = hbp.inter_propensities >>> intras = hbp.intra_propensities
or to have them all at once:
>>> propensities = hbp.propensities
These are tuples of
instances, specialised as InterPropensity and IntraPropensity. They have properties for the donor and acceptor
(references to HBondDonor and HBondAcceptor atoms of the HBondPropensities class, hence with access to the underlying
atom, the functional group information, the evidence etc.), the predicted propensity, the bounds, uncertainty and
predictive error, and a dictionary of quantitative scores for the hbond in question. If the hbond is observed in the
target crystal, there will be a crystallographic HBond property; otherwise this property is
>>> hbond_groups = hbp.generate_hbond_groupings()
These are tuples of
ccdc.descriptors.CrystalDescriptors.HBondPropensities.HBondGrouping instances. Each
of these groupings can be refered to as a putative structure, with a unique
H-bonding network. Each grouping has access to the underlying donors and acceptors
as well as the underlying hydrogen bonds that comprise the grouping. Each grouping
also has a H-bond score based on the average propensities of the corresponding H-bonds
and a group coordination score based on the average of each donor and acceptor atoms
Where there are many donors or acceptors, the number of hbond groupings can become enormous very quickly. For example, in SUCROS01 with 8 donors and 11 acceptors, there will be something over 68 million possible hbond groupings. The method
ccdc.descriptors.CrystalDescriptors.HBondPropensities.generate_hbond_groupings() takes optional parameters, min_donor_prob, min_acceptor_prob to reject groupings containing donors or acceptors with low coordination probabilities. If these are not provided, and the target structure contains more than 8 donors or acceptors, the method will calculate values for these parameters in order to give a reasonable number of groupings.
The donors and acceptors of the grouping will have been augmented with the outcome, i.e. the coordination of the donor or acceptor in the putative structure.
>>> donors = hbond_groups.donors >>> print(donors.outcome_label) donates once