Validation of predictive modelling techniques in drug design – influence of test set composition
© Matz et al; licensee BioMed Central Ltd. 2009
Published: 05 June 2009
In chemoinformatics and in the analysis of Quantitative Structure-Activity Relationships (QSAR), experimental data on a molecular property of interest are routinely related mathematically to a set of carefully chosen structure descriptors that represent the molecules under study. Many different mathematical techniques can be used for this purpose. For the sake of simplicity, we focus here on the simplest technique for predicting continuous properties, linear regression.

In a typical model building process the data analyst needs to make several decisions with respect to model complexity and model parameters. Mostly, these decisions are data-driven, which makes some form of internal validation necessary to determine how the model complexity or model parameters should be set to achieve good predictivity. After the model building process is finished, a final validation step with fresh data needs to be carried out to ensure that the developed model is truly predictive. The gold standard for doing this is to set aside a portion of the data for this final validation step (the so-called test set validation).

Several methods exist to split the original data set into a training set and a test set. Common techniques comprise statistical design applied to the matrix of structure descriptors, such as the Kennard-Stone algorithm or the DUPLEX algorithm, using the property vector under study to achieve a uniform distribution in property space, or plain random sampling. All of the aforementioned techniques have advantages and disadvantages. Here, we highlight the influence of the data splitting algorithm on the assessment of predictivity. In recent publications, several authors focused solely on the size of the prediction error without accounting for the bias introduced by the splitting algorithm. This is corrected here, and it turns out that the commonly accepted best practice is actually not the optimal technique for estimating the true prediction error.
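As an illustration of the splitting techniques discussed above, the following is a minimal sketch of the Kennard-Stone maximin selection followed by a test-set validation of an ordinary linear regression model. The function name, the synthetic descriptor matrix, and the choice of Euclidean distance are our assumptions for the sake of the example, not specifics taken from the article.

```python
import numpy as np

def kennard_stone_split(X, n_train):
    """Select n_train training rows from descriptor matrix X (rows = molecules)
    by the Kennard-Stone maximin procedure; the remaining rows form the test set."""
    n = X.shape[0]
    # Pairwise Euclidean distances between all molecules.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two most distant molecules.
    i, j = np.unravel_index(np.argmax(d), d.shape)
    train = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in train]
    while len(train) < n_train:
        # Pick the candidate whose minimum distance to the already
        # selected training molecules is largest (maximin criterion).
        dmin = d[np.ix_(remaining, train)].min(axis=1)
        nxt = remaining[int(np.argmax(dmin))]
        train.append(nxt)
        remaining.remove(nxt)
    return np.array(train), np.array(remaining)

# Synthetic example: 40 molecules, 3 descriptors, a linear property plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.5, -0.7, 0.3]) + 0.1 * rng.normal(size=40)

train_idx, test_idx = kennard_stone_split(X, n_train=30)

# Fit linear regression (with intercept) on the training set only.
A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

# Test-set validation: prediction error on the held-out molecules.
A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
rmsep = np.sqrt(np.mean((A_test @ coef - y[test_idx]) ** 2))
print(f"RMSEP on held-out set: {rmsep:.3f}")
```

Note that, as the abstract argues, the resulting error estimate is not neutral: because Kennard-Stone places the most distant (boundary) molecules in the training set, the held-out set tends to lie inside the covered descriptor space, which biases the estimate relative to a random split.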
- Baumann K, Albert H, von Korff M: A systematic evaluation of the benefits and hazards of variable selection in latent variable regression. Part I. J Chemom. 2002, 16: 339-350. doi:10.1002/cem.730
- Snee RD: Validation of regression models, methods and examples. Technometrics. 1977, 19: 415-428. doi:10.2307/1267881
- Kennard RW, Stone LA: Computer aided design of experiments. Technometrics. 1969, 11: 137-148. doi:10.2307/1266770