  • Poster presentation
  • Open access

Complexity effects in fingerprint similarity searching

Similarity searching using fingerprint representations of molecules is widely applied to mine chemical databases [1]. Known active compounds serve as templates in the search for novel hits, and similarity measures are used to compare their bit strings quantitatively. A variety of similarity metrics are used for this purpose, including the popular Tanimoto coefficient [1] and the Tversky coefficients [2].
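As a concrete reading of the two coefficients, the following minimal sketch computes them for fingerprints stored as Python sets of on-bit positions. The function names, the set representation, and the example bit positions are illustrative assumptions and are not taken from the cited work.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient: shared on-bits divided by on-bits set in either fingerprint."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def tversky(a: set, b: set, alpha: float, beta: float) -> float:
    """Tversky coefficient; alpha weights bits unique to a, beta weights bits unique to b."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


# Hypothetical on-bit positions for a reference and a database compound.
reference = {1, 4, 7, 9}
candidate = {1, 4, 8, 9, 12}

print(tanimoto(reference, candidate))                       # 0.5
print(tversky(reference, candidate, alpha=1.0, beta=1.0))   # 0.5, same as Tanimoto
print(tversky(reference, candidate, alpha=1.0, beta=0.0))   # 0.75
```

With alpha = beta = 1 the Tversky coefficient reduces to the Tanimoto coefficient; unequal weights make the result depend on which fingerprint is treated as the reference.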

Differences in molecular complexity and size are known to bias the evaluation of fingerprint similarity [3]. Complex molecules tend to produce fingerprints with higher bit density than simpler ones, which often leads to artificially high similarity values in search calculations. For example, we have thoroughly analyzed similarity value distributions and demonstrated that apparent asymmetry in Tversky similarity search calculations is a direct consequence of differences in fingerprint bit densities [4].
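The effect of bit density can be illustrated with a small simulation. The sketch below is an assumption-laden toy model, not the analysis reported in [4]: it draws random fingerprints with independent bits (the helper names, the 1024-bit length, and the densities 0.1 and 0.4 are all hypothetical choices), compares the mean chance Tanimoto similarity at two densities, and then shows how a Tversky calculation with unequal weights changes when a sparse and a dense fingerprint swap roles.

```python
import random


def tanimoto(a: set, b: set) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def tversky(a: set, b: set, alpha: float, beta: float) -> float:
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


def random_fingerprint(n_bits: int, density: float, rng: random.Random) -> set:
    """Toy model: each bit is set on independently with probability `density`."""
    return {i for i in range(n_bits) if rng.random() < density}


def mean_random_tanimoto(density: float, n_bits: int = 1024,
                         n_pairs: int = 200, seed: int = 0) -> float:
    """Average Tanimoto similarity between unrelated random fingerprints."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        a = random_fingerprint(n_bits, density, rng)
        b = random_fingerprint(n_bits, density, rng)
        total += tanimoto(a, b)
    return total / n_pairs


sparse = mean_random_tanimoto(0.1)
dense = mean_random_tanimoto(0.4)
print(f"mean chance Tanimoto at density 0.1: {sparse:.3f}")
print(f"mean chance Tanimoto at density 0.4: {dense:.3f}")

# Hypothetical sparse fingerprint whose on-bits are all contained in a dense one.
simple_fp = {1, 2, 3}
complex_fp = set(range(1, 11))
forward = tversky(simple_fp, complex_fp, alpha=1.0, beta=0.0)
reverse = tversky(complex_fp, simple_fp, alpha=1.0, beta=0.0)
print(forward, reverse)
```

In this toy setting the denser fingerprints score higher against each other purely by chance, and the Tversky value depends strongly on whether the sparse or the dense fingerprint is used as the reference.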

In principle, there are two approaches to balancing complexity effects: designing fingerprints that have constant bit density, regardless of the nature of the test molecules, or introducing similarity metrics that weight bit positions equally whether they are set on or off. We have shown that a size-independent fingerprint with constant bit density does not produce asymmetrical search results [4]. In addition, a novel similarity metric has been developed that not only balances complexity effects but also further improves search performance compared to conventional Tanimoto similarity calculations [5]. However, highly complex molecules are generally much less suitable as reference compounds for fingerprint searching than active compounds whose complexity is comparable to that of the screening database [5]. Random deletion of bits that are set on in complex templates has been shown to increase compound recall, despite the associated loss of chemical information content [6]. Taking the relative chemical complexity of reference and database compounds into account thus makes it possible to increase the success rates of fingerprint similarity searching.
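Two of the ideas in this paragraph can be sketched in a few lines. The code below is only an illustration under stated assumptions: `balanced_similarity` is a hypothetical measure that averages the Tanimoto coefficient over on-bits and over off-bits, and it is not the metric published in [5]; `delete_random_on_bits` is a hypothetical helper that thins a dense template, and it is not the exact protocol of [6]. The fingerprint length and bit positions are made up for the example.

```python
import random


def tanimoto(a: set, b: set) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def balanced_similarity(a: set, b: set, n_bits: int) -> float:
    """Hypothetical measure: mean of Tanimoto on on-bits and Tanimoto on off-bits."""
    all_bits = set(range(n_bits))
    on_part = tanimoto(a, b)
    off_part = tanimoto(all_bits - a, all_bits - b)
    return 0.5 * (on_part + off_part)


def delete_random_on_bits(fp: set, fraction: float, rng: random.Random) -> set:
    """Return a copy of fp with the given fraction of its on-bits removed at random."""
    n_remove = int(len(fp) * fraction)
    removed = set(rng.sample(sorted(fp), n_remove))
    return fp - removed


# Hypothetical 16-bit fingerprints: a sparse one and a denser one.
simple_fp = {0, 1, 2, 3}
complex_fp = {0, 1, 2, 3, 4, 5, 6, 7}
print(tanimoto(simple_fp, complex_fp))                  # on-bits only
print(balanced_similarity(simple_fp, complex_fp, 16))   # on- and off-bits

# Thin a hypothetical dense template by removing half of its on-bits.
rng = random.Random(0)
complex_template = set(range(40))
thinned = delete_random_on_bits(complex_template, 0.5, rng)
print(len(complex_template), len(thinned))              # 40 20
```

The thinned template has a bit density closer to that of simpler database compounds; whether this helps recall in practice depends on the fingerprint and the data, as discussed in [6].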

References

  1. Willett P, et al: J Chem Inf Comput Sci. 1998, 38 (6): 983-96.


  2. Chen X, Brown F: Chem Med Chem. 2007, 2 (2): 180-2.


  3. Flower D: J Chem Inf Comput Sci. 1998, 38 (3): 379-86.


  4. Wang Y, et al: Chem Med Chem. 2007, 2 (7): 1037-42.


  5. Wang Y, Bajorath J: J Chem Inf Model. 2008, 48 (1): 75-84. 10.1021/ci700314x.


  6. Wang Y, et al: Chem Biol Drug Design. 2008, 71 (6): 511-7. 10.1111/j.1747-0285.2008.00664.x.



Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License ( https://creativecommons.org/licenses/by-nc/2.0 ), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.


Cite this article

Wang, Y., Geppert, H. & Bajorath, J. Complexity effects in fingerprint similarity searching. Chemistry Central Journal 3 (Suppl 1), P5 (2009). https://doi.org/10.1186/1752-153X-3-S1-P5


  • DOI: https://doi.org/10.1186/1752-153X-3-S1-P5
