Distance-dependent: characterizing virtual screening datasets
© Anthes et al; licensee BioMed Central Ltd. 2009
Published: 05 June 2009
Many reports evaluating ligand-based virtual screening methods show that the results depend strongly on the composition of the benchmark datasets employed. Recently, it has become apparent that two causes of overoptimistic validation results need to be avoided: artificial enrichment and analogue bias. Artificial enrichment is observed when the decoy set (i.e. the background) differs significantly from the set of actives with regard to "simple" molecular properties. Analogue bias describes the over-representation of certain scaffolds in the set of actives. Both phenomena render retrieval of actives trivial.
Several techniques have been proposed in the literature to cope with these problems. Most of them use the mean of pairwise distances or the mean of pairwise similarity coefficients to characterize dataset diversity. These measures depend not only on the dataset but also on the structure descriptor and the distance or similarity measure employed.
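As a minimal sketch of such a mean pairwise measure, the following Python snippet computes the mean pairwise Tanimoto distance over a set of fingerprints. The representation of a fingerprint as a set of "on" bit indices, the choice of the Tanimoto coefficient, the function names and the toy data are all assumptions made for illustration; they are not taken from the study.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    # Two empty fingerprints are treated as identical (an assumption of this sketch).
    return len(a & b) / union if union else 1.0


def mean_pairwise_distance(fingerprints) -> float:
    """Mean of 1 - Tanimoto similarity over all unordered pairs of fingerprints."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: three close analogues versus three unrelated compounds.
analogues = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), frozenset({1, 2, 3, 6})]
diverse = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]

print(mean_pairwise_distance(analogues))  # low value: scaffold-dominated set
print(mean_pairwise_distance(diverse))    # high value: no shared bits at all
```

Swapping in a different descriptor or distance function changes the resulting value for the same compounds, which is the dependence noted above.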
The goal of this study was to assess whether commonly employed measures of diversity reasonably characterize benchmark dataset composition. To this end, previously published diversity measures were compared to recently introduced spatial statistics-based figures of dataset topology. The relative advantages and disadvantages of the studied figures are contrasted. Interestingly, figures based on neighbours more distant than just the nearest one performed very well. From a detailed analysis of the findings, a guideline for characterizing ligand-based virtual screening datasets is derived.
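To illustrate why looking beyond the nearest neighbour can be informative, the sketch below computes, for each compound, the distance to its k-th nearest neighbour. This is only a generic illustration and not necessarily one of the figures examined in the study; the set-based fingerprints, the Tanimoto distance, the function names and the toy data are assumptions.

```python
def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """1 - Tanimoto similarity for fingerprints given as sets of on-bits."""
    union = len(a | b)
    similarity = len(a & b) / union if union else 1.0
    return 1.0 - similarity


def kth_neighbour_distances(fingerprints, k: int):
    """For every fingerprint, return the distance to its k-th nearest neighbour."""
    if not 1 <= k < len(fingerprints):
        raise ValueError("k must satisfy 1 <= k < number of fingerprints")
    result = []
    for i, fp in enumerate(fingerprints):
        # Distances from compound i to all other compounds, smallest first.
        dists = sorted(
            tanimoto_distance(fp, other)
            for j, other in enumerate(fingerprints)
            if j != i
        )
        result.append(dists[k - 1])
    return result


# Hypothetical toy data: two tight analogue pairs that share no bits with each other.
compounds = [
    frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}),
    frozenset({7, 8, 9, 10}), frozenset({7, 8, 9, 11}),
]

print(kth_neighbour_distances(compounds, k=1))  # nearest neighbour: the close analogue
print(kth_neighbour_distances(compounds, k=2))  # second neighbour: the other cluster
```

In this toy set the nearest-neighbour distances are uniformly small, whereas the second-neighbour distances show that the two pairs are far apart, a structure that the first neighbour alone does not reveal.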