HBond Propensities

This module enables the calculation of probabilities for the formation of hydrogen bonds based on a statistical model built from relevant structures in the CSD. The model encapsulates information regarding the environment of the functional groups, which ensures the prediction is specific to the target molecule.

To run it we will import the appropriate module:

>>> from ccdc.descriptors import CrystalDescriptors

Settings

Like most complex classes of the API, it has a nested class ccdc.descriptors.CrystalDescriptors.HBondPropensities.Settings to control the operation of the HBond propensity calculation. Typically the default construction is appropriate, but there are methods and properties to fine-tune the prediction. Specifically it has a ccdc.molecule.Molecule.HBondCriterion instance to specify the geometry and the atom types which constitute a hydrogen bond. See HBond Criterion for details of its API.

The databases property may be changed. This requires a list of CSD format databases: currently the HBond propensity API does not support other format databases.

It is possible to set the functional groups directory for the calculation. It is not normally necessary to do this, unless unusual hydrogen bonds need to be analysed.

The working directory for the calculation may be set: by default it will be an arbitrary temporary directory.

Basic operation

First, let us create an ccdc.descriptors.CrystalDescriptors.HBondPropensities instance:

>>> hbp = CrystalDescriptors.HBondPropensities()

The first thing to do is to set a target crystal; here we have chosen VAYXOI which is the refcode of Omeprazole, which is used in the treatment of gastric acid disorders.

>>> from ccdc import io
>>> csd = io.CrystalReader('csd')
>>> crystal = csd.crystal('VAYXOI')
>>> hbp.set_target(crystal)

Functional Groups

Having set a target, the HBondPropensities instance now has a ccdc.descriptors.CrystalDescriptors.HBondPropensities.functional_groups attribute, which is a tuple of ccdc.descriptors.CrystalDescriptors.HBondPropensities.FunctionalGroup instances, each of which matches a hydrogen donor or acceptor of the target structure.

>>> print(hbp.functional_groups)
(FunctionalGroup(ar_N_2), FunctionalGroup(sulfinyl_1), FunctionalGroup(imidazole_1), FunctionalGroup(ar_methoxy))
>>> fg = hbp.functional_groups[1]
>>> print(fg.identifier)
sulfinyl_1

We can check where the functional group matches on a crystal. The returned values will be instances of ccdc.search.SubstructureSearch.SubstructureHit. See Substructure search hits for details.

>>> hits = fg.matches(crystal)
>>> print(len(hits))
1
>>> print(hits[0].match_atoms())
[Atom(O1), Atom(S1), Atom(C6), Atom(C7)]

There will also be attributes ccdc.descriptors.HBondPropensities.donors and ccdc.descriptors.HBondPropensities.acceptors containing instances of ccdc.descriptors.HBondPropensities.HBondDonor and ccdc.descriptors.HBondPropensities.HBondAcceptor respectively. These have a few useful properties:

>>> print(hbp.donors)
(HBondDonor(NH1(atom 3 of imidazole_1)) no evidence,)
>>> print(hbp.acceptors) 
(HBondAcceptor(N(atom 1 of ar_N_2)) no evidence, HBondAcceptor...)
>>> d = hbp.donors[0]
>>> print(d.functional_group_identifier)
imidazole_1
>>> print(round(d.accessible_surface_area, 2)) 
0.17
>>> print(d.nlone_pairs)
1
>>> print(d.identifier)
NH1(atom 3 of imidazole_1)

There is currently no evidence for these HBondAtoms: this will be supplied after we construct and analyse the fitting data. The method ccdc.descriptors.CrystalDescriptors.HBondPropensities.hbond_atoms may be used to determine the hydrogen bonding atoms of an arbitrary crystal, not only the target of the HBondPropensities calculation.

Fitting data

The fitting data are candidate structures from the databases of the settings object which have the potential, as measured by fingerprint screening, to match one or more of the functional groups of the target crystal. This can be generated from the HBondPropensities instance or loaded from a file, most usually and computationally cheapest, a GCD file. These objects are instances of ccdc.descriptors.CrystalDescriptors.HBondPropensities.FittingData.

To limit the candidate structures via a GCD file, use the static from_file method

>>> file_name = 'fitting_data.gcd'
>>> fitting_data = CrystalDescriptors.HBondPropensities.FittingData.from_file(file_name, hbp.settings.databases)
>>> print(len(fitting_data))
49894
>>> print(fitting_data.advice_comment())
good size

The fitting data may then be assigned to the HBondPropensities instance for use in the rest of the calculation:

>>> hbp.fitting_data = fitting_data

The fitting data have not yet been matched against the functional groups of the targets. This will happen once the match_fitting_data method is called.

The above steps are optional and the generation can be performed on the entire database by only running the following method:

>>> hbp.match_fitting_data(count=50)
137 structures
     ar_N_2:    67 (could do with more)
 sulfinyl_1:    50 (not enough)
imidazole_1:    51 (could do with more)
 ar_methoxy:    56 (could do with more)

The method, match_fitting_data, takes two keyword arguments. The first, count, is how many instances to find for each functional group; the second, verbose, will periodically print a brief summary of how many matching structures have been found for each functional group. The default for the former is 300, which is deemed to be a good number for the group. For speed in running this test this has been reduced to 50, which is rather minimal.

After calling this method there should be at least the specified number of matched fitting data items for each functional group. This is confirmed by the call to match_fitting_data, but can be checked explicitly by calling:

>>> hbp.show_fitting_data_counts()
137 structures
     ar_N_2:    67 (could do with more)
 sulfinyl_1:    50 (not enough)
imidazole_1:    51 (could do with more)
 ar_methoxy:    56 (could do with more)

If these are not deemed to be sufficient, match_fitting_data may be called with a larger count.

Analysis

The fitting data have now been pruned to a more manageable level. We can now analyse the hydrogen bonds in the fitting data set. After this analysis donors and acceptors have been augmented with the positive and negative instances in the fitting data set.

>>> hbp.analyse_fitting_data()
>>> print(hbp.donors)
(HBondDonor(NH1(atom 3 of imidazole_1)) 61 positive 80 negative,)
>>> print(hbp.acceptors)
(HBondAcceptor(N(atom 1 of ar_N_2)) 24 positive 53 negative, HBondAcceptor(O(atom 0 of sulfinyl_1)) 41 positive 15 negative, HBondAcceptor(N(atom 0 of imidazole_1)) 42 positive 45 negative, HBondAcceptor(O(atom 1 of ar_methoxy)) 6 positive 49 negative, HBondAcceptor(O(atom 1 of ar_methoxy)) no evidence)

There are other attributes of the donors and acceptors: accessible_surface_area, functional_group_identifier, nlone_pairs, identifier, label (of the atom in the crystal structure’s molecule) and the atom itself. For donors the donor_atom_type property is the type as determined by the HBondCriterion class. Similarly for acceptors.

Where there is insufficient evidence for the logistic regression to use, these will be excluded. In this case there is enough evidence for all features, but if you wish you can explicitly exclude features by setting the ccdc.descriptors.CrystalDescriptors.HBondPropensities.HBondAtom.exclude to True. Here the last acceptor, atom_1_of_ar_methoxy, has very little positive evidence, so it will be a more robust regression if we exclude it:

>>> hbp.acceptors[-1].exclude = True

Regression

We are now ready to perform the regression. This uses the statistical package, R (R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, R project, which is included in your CSD distribution.

When we perform the logistic regression various data files are written to the working directory. The files written are:

  • VAYXOI_training_dataset.csv - the dataset on which the logistic regression will be performed.

  • prediction_molecule_VAYXOI.mol2 - the molecule whose crystal was used for the prediction

  • fitting_data_VAYXOI.gcd - the fitting data generated (or read) for the prediction

  • VAYXOI_counts.html - a table of the evidence for the hydrogen bonds

  • *.con - connser files matching the functional groups of the crystal

>>> model = hbp.perform_regression()

The returned value is an instance of ccdc.descriptors.CrystalDescriptors.HBondPropensities.Model. It can report the regression equation of the model, the area under the ROC curve and a qualitative measure of how well the model fits the data:

>>> print(model.equation) 
2.13 + 0.25*Donorother + 1.05*Acceptoratom_0_of_sulfinyl_1 + -3.39*Acceptoratom_1_of_ar_methoxy + -0.68*Acceptoratom_1_of_ar_N_2 + -0.02*Acceptorother + 0.09*competition + -0.02*Donor_steric_density + -0.03*Acceptor_steric_density + 0.06*Donor_aromaticity + 0.26*Acceptor_aromaticity
>>> print(round(model.area_under_roc_curve, 3)) 
0.825
>>> print(model.advice_comment) 
good

We can choose which parameters are to be used as explanatory variables in the calculation of the regression. Here we will remove the aromaticity parameters from the model and re-run the regression:

>>> params = hbp.all_parameters
>>> print('\n'.join(p.identifier for p in params))
Donor atom of group
Acceptor atom of group
Competition
Donor steric density
Acceptor steric density
Donor aromaticity
Acceptor aromaticity
>>> params[-1].include = params[-2].include = False
>>> model = hbp.perform_regression()
>>> print(model.advice_comment)
good
>>> print(round(model.area_under_roc_curve, 1))
0.8
>>> print(model.equation) 
3.13 + 0.28*Donorother + 0.92*Acceptoratom_0_of_sulfinyl_1 + -2.30*Acceptoratom_1_of_ar_methoxy + -0.79*Acceptoratom_1_of_ar_N_2 + 0.90*Acceptorother + -0.02*Competition + -0.03*Donor_steric_density + -0.04*Acceptor_steric_density

The excluded parameters are no longer present in the regression equation, and the quality of the regression has declined by a small amount.

Propensities

Now we can calculate the HBond propensities.

>>> propensities = hbp.calculate_propensities()

This will calculate the propensity for each of the hydrogen bonds in the analysed fitting data.

>>> inters = hbp.inter_propensities
>>> intras = hbp.intra_propensities

or to have them all at once:

>>> propensities = hbp.propensities

These are tuples of ccdc.descriptors.CrystalDescriptors.HBondPropensities.Propensity instances, specialised as InterPropensity and IntraPropensity. They have properties for the donor and acceptor (references to HBondDonor and HBondAcceptor atoms of the HBondPropensities class, hence with access to the underlying atom, the functional group information, the evidence etc.), the predicted propensity, the bounds, uncertainty and predictive error, and a dictionary of quantitative scores for the hbond in question. If the hbond is observed in the target crystal, there will be a crystallographic HBond property; otherwise this property is None.

Propensity Groupings

>>> hbond_groups = hbp.generate_hbond_groupings()

These are tuples of ccdc.descriptors.CrystalDescriptors.HBondPropensities.HBondGrouping instances. Each of these groupings can be refered to as a putative structure, with a unique H-bonding network. Each grouping has access to the underlying donors and acceptors as well as the underlying hydrogen bonds that comprise the grouping. Each grouping also has a H-bond score based on the average propensities of the corresponding H-bonds and a group coordination score based on the average of each donor and acceptor atoms coordination score.

Where there are many donors or acceptors, the number of hbond groupings can become enormous very quickly. For example, in SUCROS01 with 8 donors and 11 acceptors, there will be something over 68 million possible hbond groupings. The method ccdc.descriptors.CrystalDescriptors.HBondPropensities.generate_hbond_groupings() takes optional parameters, min_donor_prob, min_acceptor_prob to reject groupings containing donors or acceptors with low coordination probabilities. If these are not provided, and the target structure contains more than 8 donors or acceptors, the method will calculate values for these parameters in order to give a reasonable number of groupings.

The donors and acceptors of the grouping will have been augmented with the outcome, i.e. the coordination of the donor or acceptor in the putative structure.

>>> donors = hbond_groups[0].donors
>>> print(donors[0].outcome_label)
donates once