SMARTS implementation

SMARTS is a language for specifying substructures using rules that extend SMILES (Simplified Molecular Input Line Entry System). The CSD Python API implements a subset of the full SMARTS functionality.

Extensions to the SMARTS Language

CCDC supports a small extension to Daylight SMARTS that allows quadruple, delocalised and pi bonds to be represented using the characters ‘_’, ‘"’ and ‘|’ respectively.

Atoms in SMARTS patterns can have a numeric mapping added. This is essentially cosmetic in searching, but is useful in post-processing of patterns. For example, C1[N,H2:5]CCCC1 will parse; the :5 has no effect on the search.
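As an illustration of the kind of post-processing that atom maps make possible, the following is a minimal sketch that reads map numbers out of a pattern string with regular expressions. It uses only the Python standard library; the helper names atom_maps and strip_maps are hypothetical and are not part of the CSD Python API. It only looks at innermost bracket atoms, so recursive SMARTS containing nested brackets are not handled fully.

```python
import re

# Innermost bracket atoms only: text between '[' and ']' with no nested brackets.
BRACKET_ATOM = re.compile(r"\[([^\[\]]*)\]")
# A map label is a colon followed by digits at the very end of the bracket body.
MAP_LABEL = re.compile(r":(\d+)$")


def atom_maps(smarts):
    """Return (bracket_body, map_number_or_None) for each bracket atom."""
    found = []
    for match in BRACKET_ATOM.finditer(smarts):
        body = match.group(1)
        label = MAP_LABEL.search(body)
        found.append((body, int(label.group(1)) if label else None))
    return found


def strip_maps(smarts):
    """Return the pattern with the map labels removed from bracket atoms."""
    return BRACKET_ATOM.sub(
        lambda m: "[" + MAP_LABEL.sub("", m.group(1)) + "]", smarts
    )


print(atom_maps("C1[N,H2:5]CCCC1"))
print(strip_maps("C1[N,H2:5]CCCC1"))
```

Because the mapping does not affect the search, the stripped and unstripped patterns should be interchangeable as queries; the labels matter only to code that processes the pattern text.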

Supported features since version 3.0.13

Several previously unsupported features are now available: recursive SMARTS, most atom constraints and ‘dot disconnect’ searching.

  • Recursive SMARTS support
    • SMARTS patterns such as [$(OCO),$(OCCO)] are now supported

  • Atom properties
    • These are almost fully supported; the exceptions are listed under Unsupported features below.

  • Bond properties
    • These are almost fully supported; the remaining exceptions are listed under Unsupported features below.

  • Dot disconnect support
    • Searches for a single SMARTS pattern with disconnected fragments are now supported. For example, to find structures containing both a C(O)NH and a CC(=O)(C) fragment in a single molecule, one could use (C(=O)N(H).CC(=O)(C))

  • Searching on the 3D stereochemistry of atoms
    • Stereochemical descriptors can now be expressed in a SMARTS string. Note, however, that a search of the Cambridge Structural Database only considers the molecules in the stored asymmetric unit. Centrosymmetric structures containing molecules with chiral centres are racemates, so they also contain the inverted stereoisomers; for this reason a stereochemical descriptor is not guaranteed to hit all of them. When searching the CSD, the stereochemistry of an atom is determined from its 3D coordinates.

  • Atomic mass constraints
    • In the current version of the Cambridge Structural Database this is only useful for finding deuterium ([2H]), but it could be more generally useful for other data sources where atomic masses have been annotated onto atoms.
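The recursive and dot-disconnect forms listed above follow a regular textual shape, so they can be assembled from simpler fragments. The sketch below shows that assembly using plain string handling; the function names recursive_any and same_molecule are hypothetical helpers and not part of the CSD Python API, and the fragments passed in are assumed to be valid SMARTS already.

```python
def recursive_any(*environments):
    """Build one atom that matches any of the given recursive environments.

    recursive_any("OCO", "OCCO") gives "[$(OCO),$(OCCO)]".
    """
    return "[" + ",".join("$(%s)" % env for env in environments) + "]"


def same_molecule(*fragments):
    """Join disconnected fragments with '.' inside one pair of parentheses,
    the form used above to require all fragments in a single molecule."""
    return "(" + ".".join(fragments) + ")"


print(recursive_any("OCO", "OCCO"))
print(same_molecule("C(=O)N(H)", "CC(=O)(C)"))
```

Neither helper validates its input; they only reproduce the two patterns quoted in the list.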

Unsupported features

  • General:
    • Reaction SMARTS, e.g. [CC>>CC]

  • Atom properties:
    • h<n>: implicit hydrogens.

    • ‘Stereochemical or unspecified’ stereochemistry (for example [C@?]) is not supported, as stereochemistry is treated as a 3D property

  • Bond properties:
    • The following constructs are not supported as they would not lead to any hits:
      • NOT any bond, e.g. !~

      • different bond types combined with the AND operator, e.g. -&= (single and double)

    • Variable stereochemistry on bonds is unsupported; for example, */,\[R]=;@[R]/,\* will not parse
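A pattern can be screened for some of the constructs listed above before it is submitted as a query. The following is a rough, text-based sketch of such a check; it is a heuristic written for illustration, not the parser used by the CSD Python API, and the function name unsupported_constructs is hypothetical. It covers reaction arrows, !~, [..@?..] and the lowercase h<n> primitive, and it does not attempt to detect AND-combined bond types or variable bond stereochemistry.

```python
import re

BRACKET_ATOM = re.compile(r"\[([^\[\]]*)\]")
# Lowercase 'h' not preceded by a letter, so element symbols such as Rh are skipped.
IMPLICIT_H = re.compile(r"(?<![A-Za-z])h\d*")


def unsupported_constructs(smarts):
    """Return a list of descriptions of unsupported constructs found in the text."""
    problems = []

    def report(message):
        if message not in problems:
            problems.append(message)

    if ">>" in smarts:
        report("reaction SMARTS (>>)")
    if "!~" in smarts:
        report("NOT any bond (!~)")
    for body in BRACKET_ATOM.findall(smarts):
        if "@?" in body:
            report("stereochemical or unspecified (@?)")
        if IMPLICIT_H.search(body):
            report("implicit hydrogen count (h<n>)")
    return problems


for pattern in ["[CC>>CC]", "[C@?](F)(Cl)Br", "[C;h2]", "[Rh]", "c1nncn1"]:
    print(pattern, unsupported_constructs(pattern))
```

An empty list means only that none of these four textual markers was seen; the pattern may still fail to parse for other reasons.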

Matching to aromatic and aliphatic atoms in the CSD will correspond to the representations curated in the CSD rather than the canonical representations defined by the SMILES specification.

Aromaticity and its relationship to SMARTS searching

Cheminformatics toolkits handle aromaticity in different ways. Some effectively define an aromatic atom as one in a ring that obeys Hückel's rule.

CCDC chooses not to do this, as our underlying data is curated. Choices on whether a given bond is represented as aromatic have been made at the point of structural editing.

In particular, when searching using a SMARTS pattern, we define an aromatic atom as an atom that is in a ring of aromatic bonds; that is, the search does not kekulize rings.

This can make it difficult to search comprehensively for a set of similar fragments. For example, a 1,2,4-triazole ring in the Cambridge Structural Database (CSD) can be represented either as entirely aromatic or with localised single and double bonds. Consider the following two SMARTS patterns:

C1=NN=CN1

c1nncn1

They will match different subsets in the CSD.
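One way to see the difference is to run both patterns and compare the refcodes they return. The sketch below assumes a working CSD Python API installation and uses the ccdc.search classes SubstructureSearch and SMARTSSubstructure; the helper names refcodes_matching and compare_hit_sets are hypothetical. The ccdc import is made inside the function so that the set comparison can be used on its own.

```python
def refcodes_matching(smarts, max_structures=None):
    """Return the set of CSD refcodes hit by one SMARTS pattern.

    Assumes the CSD Python API is installed and licensed. Leaving
    max_structures as None searches without a cap, which may take a while.
    """
    # Imported lazily so compare_hit_sets works without a CSD installation.
    from ccdc.search import SMARTSSubstructure, SubstructureSearch

    search = SubstructureSearch()
    search.add_substructure(SMARTSSubstructure(smarts))
    hits = search.search(max_hit_structures=max_structures)
    return {hit.identifier for hit in hits}


def compare_hit_sets(first, second):
    """Split two collections of refcodes into exclusive and shared members."""
    first, second = set(first), set(second)
    return {
        "only_first": first - second,
        "only_second": second - first,
        "both": first & second,
    }


# Placeholder labels, not real refcodes, to show the shape of the result.
print(compare_hit_sets({"REF_A", "REF_B"}, {"REF_B", "REF_C"}))
```

With the CSD available, compare_hit_sets(refcodes_matching("C1=NN=CN1"), refcodes_matching("c1nncn1")) would show which entries each representation finds. If max_structures is set, each set is only a sample, so the comparison is no longer exhaustive.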

The most comprehensive way to write a SMARTS pattern for searching the CSD in such cases is to express the ring using variable attachment counts and variable bond types, and to avoid specifying elements as ‘aliphatic’ or ‘aromatic’. For example:

[#6;X3]1:,=[#7;X2,X3]:,-[#7;X2,X3]:,=[#6;X3]:,-[#7;X2,X3]1

This is a very generalised SMARTS pattern that will find 1,2,4-triazole moieties in CSD structures. Note that it matches entries such as NATRUV, where the triazole ring is part of a 1,2,4-triazolo(1,5-a)pyrimidine fragment. As the pyrimidine is expressed as aromatic, neither of the simple patterns would have matched this entry.
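The generalised pattern above has a regular structure: each atom is written as an atomic number with its allowed attachment counts, and each bond as "aromatic or" its localised order. As a sketch of how such a pattern could be generated for other rings, the hypothetical helper generalise_ring below builds the string from a localised (Kekulé-style) description. It assumes a single ring whose closure bond is single, which it leaves implicit because the default SMARTS bond already means single or aromatic.

```python
def generalise_ring(atoms, bonds):
    """Build an aromaticity-agnostic ring pattern.

    atoms: list of (atomic_number, [allowed attachment counts]) in ring order.
    bonds: localised bond symbols ('-' or '=') between consecutive atoms;
           the ring-closure bond is assumed single and is not listed.
    """
    if len(bonds) != len(atoms) - 1:
        raise ValueError("expected one bond between each consecutive pair of atoms")
    parts = []
    for index, (atomic_number, counts) in enumerate(atoms):
        atom = "[#%d;%s]" % (atomic_number, ",".join("X%d" % c for c in counts))
        if index == 0:
            atom += "1"  # open the ring on the first atom
        parts.append(atom)
        if index < len(bonds):
            # ':' (aromatic) OR the localised bond order
            parts.append(":," + bonds[index])
    parts.append("1")  # close the ring on the last atom
    return "".join(parts)


triazole = generalise_ring(
    [(6, [3]), (7, [2, 3]), (7, [2, 3]), (6, [3]), (7, [2, 3])],
    ["=", "-", "=", "-"],
)
print(triazole)
```

For the 1,2,4-triazole description given here, the helper reproduces the pattern quoted above. Fused or bridged systems would need ring-closure handling that this sketch does not attempt.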

Hydrogen treatment in CSD Searching

As the CCDC’s toolkit is primarily aimed at use with structures in the CSD, its search implementation does not support implicit hydrogens. Many toolkits treat hydrogens as implicit counts on the parent atom. In many CSD entries the hydrogen atoms have explicit positions, so we choose to treat them on a par with any other element rather than as a property of the atom they are attached to. Hydrogens without coordinates (siteless hydrogens) can still be searched successfully.