Volume 3 Supplement 1

4th German Conference on Chemoinformatics: 22. CIC-Workshop

Open Access

Distance-dependent: characterizing virtual screening datasets

  • C Anthes¹,
  • SG Rohrer¹ and
  • K Baumann¹
Chemistry Central Journal 2009, 3(Suppl 1):P19

DOI: 10.1186/1752-153X-3-S1-P19

Published: 05 June 2009

Many reports evaluating ligand-based virtual screening methods show that the results depend strongly on the composition of the benchmark datasets employed. Recently, it became apparent that two causes of overoptimistic validation results need to be avoided: artificial enrichment and analogue bias. Artificial enrichment is observed when the decoy set (i.e., the background) differs significantly from the set of actives in "simple" molecular properties. Analogue bias arises when certain scaffolds are over-represented in the set of actives. Both phenomena render the retrieval of actives trivial.

Several techniques have been proposed in the literature to cope with these problems. Most of them use the mean of pairwise distances or the mean of pairwise similarity coefficients to characterize dataset diversity [1]. These measures depend not only on the dataset but also on the structure descriptor and the distance/similarity measure employed.
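
As an illustration of such a measure, the following minimal sketch computes the mean pairwise Tanimoto similarity of a small set of actives. It assumes that each molecule is represented as a set of on-bit indices of a binary fingerprint; the function names and the toy fingerprints are hypothetical and are not taken from the cited work.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Two empty fingerprints are treated as identical (an assumption of this sketch).
    return len(a & b) / union if union else 1.0


def mean_pairwise_similarity(fingerprints):
    """Mean Tanimoto similarity over all unordered pairs of fingerprints."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy fingerprints (hypothetical): two close analogues and one unrelated molecule.
fps = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({7, 8})]

mean_sim = mean_pairwise_similarity(fps)
mean_dist = 1.0 - mean_sim  # mean pairwise Tanimoto distance
```

Replacing the fingerprint or the similarity coefficient changes the resulting value for the same set of molecules, which is the dependence noted above.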

The goal of this study was to assess whether commonly employed measures of diversity reasonably characterize benchmark dataset composition. To this end, previously published diversity measures were compared with recently introduced spatial statistics-based figures of dataset topology [2]. The relative advantages and disadvantages of the studied figures are contrasted. Interestingly, figures based on neighbours more distant than just the nearest one performed very well. From a detailed analysis of the findings, a guideline for characterizing ligand-based virtual screening datasets is derived.
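
To make the idea of looking beyond the nearest neighbour concrete, the sketch below lists, for each molecule, the distance to its k-th nearest neighbour within the same set. It is only an illustration under stated assumptions (set-based binary fingerprints, Tanimoto distance, hypothetical function names and toy data) and does not reproduce the specific figures of [2].

```python
def tanimoto_distance(a, b):
    """Tanimoto distance (1 - similarity) of two sets of on-bit indices."""
    union = len(a | b)
    similarity = len(a & b) / union if union else 1.0
    return 1.0 - similarity


def kth_neighbour_distances(fingerprints, k=1):
    """For each fingerprint, the distance to its k-th nearest neighbour in the set."""
    if not 1 <= k < len(fingerprints):
        raise ValueError("k must be between 1 and len(fingerprints) - 1")
    result = []
    for i, a in enumerate(fingerprints):
        # Sorted distances from molecule i to every other molecule.
        dists = sorted(
            tanimoto_distance(a, b)
            for j, b in enumerate(fingerprints)
            if j != i
        )
        result.append(dists[k - 1])
    return result


# Toy set of actives (hypothetical): a cluster of three analogues plus one singleton.
actives = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 6}),
    frozenset({10, 11, 12}),
]

nearest = kth_neighbour_distances(actives, k=1)
third = kth_neighbour_distances(actives, k=3)
```

In this toy set the nearest-neighbour distances of the three analogues are small, whereas their third-neighbour distances reach the singleton. Distances to more distant neighbours therefore reflect the size of the analogue cluster, which the nearest-neighbour distance alone does not show.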

Authors’ Affiliations

Institut für Pharmazeutische Chemie, Technische Universität Braunschweig


