A quantitative structure- property relationship of gas chromatographic/mass spectrometric retention data of 85 volatile organic compounds as air pollutant materials by multivariate methods
© Sarkhosh et al; licensee BioMed Central Ltd. 2012
Published: 2 May 2012
A quantitative structure-property relationship (QSPR) study is suggested for the prediction of retention times of volatile organic compounds. Various kinds of molecular descriptors were calculated to represent the molecular structure of compounds. Modeling of retention times of these compounds as a function of the theoretically derived descriptors was established by multiple linear regression (MLR) and artificial neural network (ANN). The stepwise regression was used for the selection of the variables which gives the best-fitted models. After variable selection ANN, MLR methods were used with leave-one-out cross validation for building the regression models. The prediction results are in very good agreement with the experimental values. MLR as the linear regression method shows good ability in the prediction of the retention times of the prediction set. This provided a new and effective method for predicting the chromatography retention index for the volatile organic compounds.
Volatile organic compounds (VOCs) are molecules that have a high vapor pressure and low water solubility, like organic chemicals in general. There are many different compounds which may be classified as volatile organic compounds (VOCs). The compounds the nose detects as smells are generally VOCs. Many VOCs are human-made chemicals that are used and produced in the manufacture of paints, pharmaceuticals, and refrigerants. VOCs typically are industrial solvents, such as trichloroethylene; fuel oxygenates, such as methyl tert-butyl ether (MTBE), or by-products produced by chlorination in water treatment, such as chloroform.
VOCs are common ground-water contaminants. Many volatile organic compounds are also hazardous air pollutants. VOCs also play a major role in the formation of various secondary pollutants through photochemical reactions in the presence of sunlight and nitrogen oxides. Furthermore, some VOCs could contribute to the atmospheric ozone depletion and the build-up persistent pollutions in remote areas. Therefore, these compounds have been an important environmental issue over the last two decades and have attracted significant attention from different research groups. Although scientist are provided a sensitive and specific analytical method for identifying and measuring the VOCs [1–5], but there are some limitations for these methods. For example, sorbents are normally selective in, although not limited to, adsorbing/absorbing certain classes of VOCs or sometimes high sample volumes of air are needed for identification or measuring the VOCs. Besides the above mentioned method, the experimental determination of retention time is time-consuming and expensive. Alternatively, quantitative structure–retention relationship (QSRR) provides a promising method for the estimation of retention time based on descriptors derived solely from the molecular structure to fit experimental data [6–8]. A QSRR study involves the prediction of chromatographic retention parameters using molecular structure. QSRR studies are widely investigated in gas chromatography (GC) [9–11] and high-performance liquid chromatography (HPLC) . The chromatographic parameters are expected to be proportional to a free energy change that is related to the solute distribution on the column.
Model development in QSAR/QSPR studies comprises different critical steps as (1) descriptor generation, (2) data splitting to calibration (or training) and prediction (or validation) sets, (3) variable selection, (4) finding appropriate model between selected variables and activity/property and (5) model validation.
In the present work, a QSRR study has been carried out on the GC/MS system retention times (tR’s) for 85 volatile organic compounds by using structural molecular descriptors. We applied a linear MLR and a nonlinear algorithms ANN along with stepwise regression as variable selection method.
Computer hardware and software
All calculations were run on a Toshiba personal computer with a Pentium IV as CPU and windows XP as operating system. The molecular structures of data set were sketched using ChemDraw (Ver. 11, supplied by Cambridge Software Company). The sketched structures were exported to Chem3D module in order to create their 3D structures. Energy minimization was performed using MM+ molecular mechanics and AM1 semi-empirical to obtain the root mean square (RMS) gradient below 0.01 kcal/mol Å. The next step in developing a model is generation of the corresponding numerical descriptors of the molecular structures. Molecular descriptors define the molecular structure and physicochemical properties of molecules by a single number. A wide variety of descriptors have been reported for using in QSAR/QSPR analysis. Here, 477 descriptors were generated for each compound, using Dragon (ver. 3) software (Milano Chemometrics and QSAR research group, http://www.disat.unimib.it/chm/). SPSS (ver.16, http://www.spss.com) other calculations were performed in PLS_Toolbox (ver. 4.1, Eigenvector Company) and MATLAB (version 7.8, Math Works, Inc.) environment. SPSS (ver.16, http://www.spss.com) was used in variable selection step to select the most relevant descriptors. Other calculations were performed in PLS_Toolbox (ver. 4.1, Eigenvector Company) and MATLAB (version 7.8, Math Works, Inc.) environment.
The data set and the corresponding experimental and predicted tR values by MLR and ANN methods.
tR (min) Exp.
tR (min) Exp.
2-Chloroethyl vinyl ether
Results and discussion
Principal component analysis
477 descriptors were calculated by Dragon for each molecule, therefore we have 85×447 data matrix X. The rows and columns of this matrix are the number of molecules and molecular descriptors respectively. All descriptors have been checked to ensure: (a) that values of each descriptor are available for each structure (the investigation of missing values) and (b) that there is a significant variation in these values over studied molecules. Descriptors which has zero value for each structure or shows constant values over molecules in the under study data set are discarded. X and y(dependent variable) were preprocessed by autoscaling. We used stepwise regression  for variable selection. PCA  is applied for reduction of the descriptor space (variable space).
In the variable selection step, feature selection (FS) [17–19] and feature extraction (FE) are commonly used methods to handle a large number of calculated descriptors in QSAR/QSPR studies. In FE methods, by the use of principal component analysis (PCA), the information contained in the descriptors data matrix is extracted into new orthogonal variables with lower dimensions.
Multivariate regression models
Multiple linear regression (MLR), is the most frequently used modeling method in QSAR/QSPRs. MLR yields models that are simpler and easier to interpret than factor based methods like PCR(principal component regression) [20, 21] and PLS(partial least squares) [22–27]. However, due to the co- linearity between structural descriptors, MLR is not able to extract useful information from structural data, and over fitting problem is encountered. ANN model used to handle the probable nonlinear relationship between descriptors and retention times. One of the most striking point of ANN is that its free from collinearity problem.
We have tested the stability of models by random selection of train and test sets to assure the resulting models are not just chance correlations. 100 models by random selection are constructed and the average of R2 and RMSEC are 0.971 and 0.162, respectively. The results show that the different selected test sets from the data have very similar R2 values. This indicates that the final model is reliable and has a satisfactory stability.
Multiple linear Regression (MLR)
Multiple linear regression  are the most widely used and known modeling methods.
where ŷi is the estimated value of the ith object , y i is the corresponding reference value of this object and N is the number of the objects.
The statistical parameters of MLR model
Variable selection method
best MLR model for the prediction of VOCs retention times
salvation connectivity index
average connectivity index
3D MoRSE- signal 10/weighted by atomic masses
3D MoRSE- signal 21/weighted by atomic masses
broto-moreau autocorrelation of a topological structure- log2/weighted by atomic masses
Moran autocorrelation− lag 2/weighted by atomic masses
To show the absence of correlation by chance between independent and dependents variables randomization test was performed. The MLR modeling was performed on the randomized data, the elements of the response vector were randomly mixed and the modeling was performed on the matrix X and new response vector . The squared correlation coefficient of the model with the randomly selected data (R2CR) was much lower than those of the original model, it could be considered that the model was reliable, and had not been obtained by chance. The low mean of correlation value (R2CR=0.091) obtained by chance confirms this result.
Artificial neural network
Artificial neural networks (ANNs) [31–33] are biologically inspired algorithms designed to simulate the way in which the human brain processes information. ANNs are parallel computational models which are able, at least in principle, to map any nonlinear functional relationship between an input and an output hyperspace to the desired accuracy. The network receives a set of inputs, which are multiplied by each neuron’s weight. These products are summed for each neuron, and a nonlinear transfer function is applied. A set of six descriptors that were appeared in the MLR model, were used as input parameter of the network.
In this investigation, the logsig function was used as a transfer function of hidden layer, traingdx function as a training function and a linear function for the output layer. We optimized the parameters such as number of nodes, momentum (α) and learning rate (η). These values were obtained during the network training. The initial weights were randomly selected between (-0.3 and 0.3). A neural network with 6-6-1 topology was developed with optimum momentum and learning rate.
Parameters of the ANN model
optimized parameters of the ANN
Number of input neurons
Number of hidden neurons
Number of output neurons
R 2 cal
R 2 pre
Description of models descriptors
Molecular descriptors will probably play an increasing role in scientific growth. In fact, the availability of large numbers of theoretical descriptors containing diverse source of chemical information would be useful to better understand relationship between molecular structure and experimental evidence.
where L a is the principal quantum number (2 for C, N, O atoms,3 for Si, S, Cl, …)of the a th atom in the k th subgraph and δa the corresponding vertex degree; k is the total number of m th order subgrah; n is the number of vertices in subgraph. The normalization factor 1/ (2 m + 1 ) is defined in such a way that the indices m χ and m χ s for compounds containing only second-row atom coincide. The X1sol shows a regression coefficient (0.823), which is the largest among the descriptors appearing in the model. This parameter can be considered as entropy of solvation and somehow indicates the dispersion interactions occurring in the solutions. X5A also is a measure of branching of the molecules. The large contributions of this parameter in the retention behavior of VOCs are in agreement with the contribution that one would expect for the interaction of a stationary phase with the VOCs. X5A is the average connectivity index that can be obtained from molecular graph.
where f(x) is any function of the variable x and l is the lag representing an interval of x; a and b defined the total studied interval of the function. The function f(x) is usually a time dependent function such s a time – dependent electrical signal, or spatial dependent function such as the population density in space.
Mor21m (3D MoRSE- signal 21/weighted by atomic masses) and Mor10m (3D MoRSE- signal 10/weighted by atomic masses) are among the 3D-MoRSE descriptors. 3D-MoRSE  descriptors (Mor21m, Mor10m) are molecule atom projections along different angles, such as in electron diffraction. They represent different views of the whole molecule structure, although their meaning remains not too clear. 3D-MoRSE descriptors are based on the idea of obtaining information from the 3D atomic coordinates by the transform used in electron diffraction studies for preparing theoretical scattering curves. 3D-MoRSE descriptors are important because they take into account the 3D arrangement of the atoms without ambiguities (in contrast with those coming from chemical graphs), and also because they do not depend on the molecular size, thus being applicable to a large number of molecules with great structural variance and being a characteristic common to all of them.
All of these descriptors are individually important to increase or decrease the retention times but the final effect is not result in each descriptor individually and all of the selected descriptors as a group can influenced on the dependent variable. This problem in many cases led to the hard interpretation of the model.
QSRR analysis was performed on a series of volatile organic compounds, for prediction the retention times. Stepwise regression was used to as variable selection method. The comparison of the models that were developed by selected descriptors shows that MLR can present the satisfactory result for predicting of retention time of VOCs. Comparison of the linear (MLR) and nonlinear (ANN) methods showed the little superiority of the ANN over the MLR model for the prediction of the retention time of VOCs, so the results suggest that a relation between the molecular descriptor and the retention times is linear. The QSRR models proposed with the simply calculated molecular descriptors can be used to estimate the chromatographic retention times of new compound even in the absence of the standard candidates. The most important selected descriptors are topological which can capture the variance in the retention times what related to the size and shape of the molecules.
This article has been published as part of Chemistry Central Journal Volume 6 Supplement 2, 2012: Proceedings of CMA4CH 2010: Application of Multivariate Analysis and Chemometry to Cultural Heritage and Environment. The full contents of the supplement are available online at http://journal.chemistrycentral.com/supplements/6/S2.
- Sin WM, Wong YC, Sham WC, Wang D: Development of an analytical technique and stability evaluation of 143 C3-C12 volatile organic compounds in summa canisters by gas chromatography mass spectrometry. Analyst. 2001, 126 (3): 310-321. 10.1039/b008746g.View ArticleGoogle Scholar
- Oliver KD, Pleil JD, McClenny WA: Sample integrity of trace level volatile organic compounds in ambient air stored in SUMMA polished canisters. Atmos Environ. 1986, 20 (7): 1403-1411. 10.1016/0004-6981(86)90011-9.View ArticleGoogle Scholar
- Gholson AR, Jayant RKM, Storm JF: Evaluation of aluminum canisters for the collection and storage of air toxics. Anal Chem. 1990, 62 (17): 1899-1902. 10.1021/ac00216a032.View ArticleGoogle Scholar
- Hsu JP, Miller G, Moran V: Analytical method for determination of trace organics in gas samples collected by canister. J Chromatogr Sci. 1991, 29 (2): 83-88.View ArticleGoogle Scholar
- Pankow JF, Luo W, Isabelle LM, Bender DA, Baker RJ: Determination of a wide range of volatile organic compounds in ambient air using multisorbent adsorption/thermal Desorption and gas chromatography/mass spectrometry. J Anal Chem. 1998, 70 (24): 5313-5221.View ArticleGoogle Scholar
- Jalali-Heravi M, Kyani A: Use of computer-assisted methods for the modeling of the retention time of a variety of volatile organic compounds: a PCA-MLR-ANN approach. J Chem Inf Comput Sci. 2004, 44 (4): 1328-1335. 10.1021/ci0342270.View ArticleGoogle Scholar
- Luan F, Xue C, Zhang R, Zhao C, Liu M, Hu Z, Fan B: Prediction of retention time of a variety of volatile organic compounds based on the heuristic method and support vector machine. Anal Chim Acta. 2005, 537 (1-2): 101-110. 10.1016/j.aca.2004.12.085.View ArticleGoogle Scholar
- Fragkaki AG, Tsantili-Kakoulidou A, Angelis YS, Koupparis M, Georgakopoulos C: Gas chromatographic quantitative structure–retention relationships of trimethylsilylated anabolic androgenic steroids by multiple linear regression and partial least squares. J Chromatogram A. 2009, 1216 (47): 8404-8420. 10.1016/j.chroma.2009.09.066.View ArticleGoogle Scholar
- Katritzky AR, Chen K: QSPR correlation and predictions of GC retention indexes for methyl-branched hydrocarbons produced by insects. Anal Chem. 2000, 72 (1): 101-109. 10.1021/ac990800w.View ArticleGoogle Scholar
- Hu RJ, Liu HX, Zhang RS, Xue CX, Yao XJ, Liu MC, Hu ZD, Fan BT: QSPR prediction of GC retention indices for nitrogen-containing polycyclic aromatic compounds from heuristically computed molecular descriptors. Talanta. 2005, 68 (1): 31-39. 10.1016/j.talanta.2005.04.034.View ArticleGoogle Scholar
- Liu F, Liang Y, Cao C, Zhou N: QSPR study of GC retention indices for saturated esters on seven stationary phases based on novel topological indices. Talanta. 2007, 72 (4): 1307-1315. 10.1016/j.talanta.2007.01.038.View ArticleGoogle Scholar
- Zhao M, Li Z, Wu Y, Tang YR, Wang C, Zhang Z, Peng S: Studies on logP, retention time and QSAR of 2-substituted phenylnitronyl nitroxides as free radical scavengers. Eur J Med Chem. 2007, 42 (7): 955-965. 10.1016/j.ejmech.2006.12.027.View ArticleGoogle Scholar
- EPA Method 8260C. Volatile organic compounds by gas chromatography/mass spectroscopy (GC/MS). 2007Google Scholar
- Hu R, Doucet JP, Delamar M, Zhang R: QSAR models for 2-amino-6 arylsulfonylbenzonitriles and congeners HIV-1 reverse transcriptase inhibitors based on linear and nonlinear regression methods. Eur J Med Chem. 2009, 44 (5): 2158-2171. 10.1016/j.ejmech.2008.10.021.View ArticleGoogle Scholar
- Xu L, Zhang W: Comparison of different methods for variable selection. Anal Chim Acta. 2001, 446 (1-2): 477-483.View ArticleGoogle Scholar
- Hemmateenejad B, Yazdani M: QSPR models for half-wave reduction potential of steroids: A comparative study between feature selection and feature extraction from subsets of or entire set of descriptors. Anal Chim Acta. 2009, 634 (1): 27-35. 10.1016/j.aca.2008.11.062.View ArticleGoogle Scholar
- Hancock T, Put R, Coomans D, Vander Heyden Y, Everingham Y: A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies. Chemometr Intell Lab Syst. 2005, 76 (2): 185-196. 10.1016/j.chemolab.2004.11.001.View ArticleGoogle Scholar
- Leardi R: Application of a genetic algorithm to feature selection under full validation conditions and to outlier detection. J Chemometr. 1994, 8 (1): 65-79. 10.1002/cem.1180080107.View ArticleGoogle Scholar
- Leardi R, Gonzalez AL: Genetic algorithm applied to feather selection in PLS-Regression: How and when to use them. Chemomet Intell Lab Syst. 1998, 41 (2): 195-207. 10.1016/S0169-7439(98)00051-3.View ArticleGoogle Scholar
- Torrecilla JS, Garcia J, Rojo E, Rodriguez F: Estimation of toxicity of ionic liquids in Leukemia Rat Cell Line and Acetylcholinesterase enzyme by principal component analysis, neural networks and multiple lineal regressions. J Hazard Mater. 2009, 164 (1): 182-194. 10.1016/j.jhazmat.2008.08.022.View ArticleGoogle Scholar
- Haaland DM, Thomas EV: Partial least squares methods for spectral analyses.2.application to simulated and glass spectral data. Anal chem. 1988, 60 (11): 1202-1208. 10.1021/ac00162a021.View ArticleGoogle Scholar
- Vanyúr R, Héberger K, Jakus J: Prediction of anti-HIV-1 activity of a series of tetraphyrrole molecules. J Chem Inf Comput Sci. 2003, 43 (6): 1829-1836. 10.1021/ci0304627.View ArticleGoogle Scholar
- Hasegawa K, Funatsu K: GA strategy for variable selection in QSAR studies: GAPLS and D-optimal designs for predictive QSAR model. J Mol Struct (Theochem). 1998, 425 (3): 255-262. 10.1016/S0166-1280(97)00205-4.View ArticleGoogle Scholar
- Sagrado S, Cronin MTD: Application of the modelling power approach to variable subset selection for GA-PLS QSAR models. Anal Chim Acta. 2008, 609 (2): 169-174. 10.1016/j.aca.2008.01.013.View ArticleGoogle Scholar
- Hemmateenejad B, Miri R, Akhond M, Shamsipur M: QSAR study of the calcium channel antagonist activity of some recently synthesized dihydropyridine derivatives. An application of genetic algorithm for variable selection in MLR and PLS methods. Chemometr Intell Lab Syst. 2002, 64 (1): 91-99. 10.1016/S0169-7439(02)00068-0.View ArticleGoogle Scholar
- Pan Y, Jiang J, Wang R, Cao H, Cui Y: Predicting the auto-ignition temperatures of organic compounds from molecular structure using support vector machine. J Hazard Mater. 2009, 164 (2-3): 1242-1249. 10.1016/j.jhazmat.2008.09.031.View ArticleGoogle Scholar
- Ghasemi J, Asadpour S, Abdolmaleki A: Prediction of gas chromatography/electron capture detector retention times of chlorinated pesticides, herbicides, and organohalides by multivariate chemometrics methods. Anal Chim Acta. 2007, 588 (2): 200-206. 10.1016/j.aca.2007.02.027.View ArticleGoogle Scholar
- Weisberg S: Applied Linear Regression. 1985, John Wiley & Sons, New York, 2, 9780471879572Google Scholar
- Pourbasheer E, Riahi S, Ganjali MR, Norouzi P: Application of genetic algorithm-support vector machine (GA-SVM) for prediction of BK-channels activity. Eur J Med Chem. 2009, 44 (12): 5023-5028. 10.1016/j.ejmech.2009.09.006.View ArticleGoogle Scholar
- Dragos H, Gilles M, Alexandre V: Predicting the predictability: a unified approach to the applicability domain problem of QSAR models. J Chem Inf Model. 2009, 49: 1762-1776. 10.1021/ci9000579.View ArticleGoogle Scholar
- Zupan J, Gasteiger J: Neural Networks in Chemistry and Drug Design. 1999, Wiley VCH Ed., Weinheim, 2, 9783527297795Google Scholar
- Dreyfus G: Neural Networks - Methodology and Applications. 2005, Springer, Heidelberg, 9783540229803Google Scholar
- Chauvin Y, Rumelheart D: Backpropagation: theory, architectures and applications. 1995, Lawrence Erlbaum Associates, Hillsdale, New Jersey Hove, UK, 9780805812596Google Scholar
- Fernandez M, Caballero J, Tundidor-Camba A: Linear and nonlinear QSAR study of N-hydroxy-2 [(phenylsulfonyl) amino] acetamide derivatives as matrix metalloproteinase inhibitors. Bioorg Med Chem. 2006, 14 (12): 4137-4150. 10.1016/j.bmc.2006.01.072.View ArticleGoogle Scholar
- Saiz-Urra L, Gonzalez MP, Teijeira M: QSAR studies about cytotoxicity of benzophenazines with dual inhibition toward both topoisomerases I and II: 3D-MoRSE descriptors and statistical considerations about variable selection. Bioorg Med Chem. 2006, 14 (21): 7347-7358. 10.1016/j.bmc.2006.05.081.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.