4 Mining Exhaled Volatile Organic Compounds for Breast Cancer Detection

Kichun Sky Lee¹, M. Forrest Abouelnasr¹, Charlene W. Bayer¹, Sheryl G.A. Gabram², Boris Mizaikoff³, André Rogatko², and Brani Vidakovic¹,²
¹Georgia Institute of Technology, Atlanta, USA; ²Emory University, Atlanta, USA; ³University of Ulm, Germany

CONTENTS
4.1 Introduction
4.2 What are Volatile Organic Compounds?
4.3 Dimension Reduction
4.4 Conclusions
References

This paper provides a new application of a nonlinear dimension reduction technique to the analysis of volatile organic compounds (VOCs) in human breath, which are measured as high-dimensional, sparse data. With a view toward the early detection of breast cancer and the monitoring of disease recurrence, we are interested in classifying subjects into cases and controls based on their exhaled VOC content. We demonstrate that, in capturing the relationship between VOCs and the incidence of breast cancer, principal component analysis is inferior to a nonlinear dimension reduction methodology that takes into account the geometric manifold of the observed VOCs. To incorporate chemical properties, our dimension reduction approach introduces measures of chemical closeness between VOCs. We demonstrate that, in our context, the nonlinear dimension reduction technique yields a weak classifier that, in conjunction with other diagnostic modalities, may improve the accuracy of breast cancer diagnosis.
1-58488-344-8/04/$0.00+$1.50 © 2004 by CRC Press LLC, Statistical Data Mining and Knowledge Discovery

4.1 Introduction

Women in the U.S. have a one-in-eight lifetime chance of developing invasive breast cancer and a one-in-33 chance of dying from it (ACS, 2007). However, if the disease is detected at an early stage, it is highly treatable. Mortality can be significantly reduced with multi-modal therapy that includes surgery, targeted medical oncology treatments, and radiation therapy (Berry et al., 2005). A major problem with early detection of breast cancer is that mammography can be uncomfortable, and it is not available worldwide owing to a lack of financial and technical resources. A promising new diagnostic method is the analysis of exhaled VOCs (Cao and Duan, 2006; Poli et al., 2005). While the exact reactions and chemical processes associated with breast cancer are not accurately known, a correlation may be found between the presence of disease and the occurrence of certain exhaled VOCs. Recent research involving canine scent detection of various types of cancer demonstrated the high sensitivity and specificity of these nature-made classifiers (Pickel et al., 2004; Dobson, 2003; Willis et al., 2004; McCulloch et al., 2006; Gordon et al., 2008).

This study was performed to investigate this link, with summaries of breath mass-spectrometric analysis as informative descriptors. Two groups were examined: women who had recently been diagnosed with breast cancer (but who had not yet started any treatment) and a cancer-free control group. Using the mass-spectrometric summaries in the context of Laplacian eigenmap dimension reduction, we were able to predictively distinguish between the two groups with a correct classification rate of 0.7580, on average.
One of the novelties of our research is the guidance of this classification scheme by chemical closeness between the VOCs, expressed through a so-called closeness matrix that is critical in tuning the Laplacian eigenmaps.

4.2 What are Volatile Organic Compounds?

Volatile organic compounds are organic chemical compounds that have a high vapor pressure under normal conditions, vaporizing significantly and entering the atmosphere (EPA, 2007). Common VOC-containing products include paint thinners, pharmaceuticals, refrigerants, dry-cleaning solvents, and constituents of petroleum fuels (e.g., gasoline and natural gas). Flora and fauna are also important biological sources of VOCs; for example, trees are known to emit large amounts of VOCs, especially isoprene and terpenes. Humans, too, are sources of VOCs, from their skin and breath. In this research, human-exhaled VOCs were collected for the diagnostics of breast cancer in the following way.

Human subjects: Two groups of subjects were examined. The case group consisted of women who had recently been diagnosed with breast cancer at Stage II, III, or IV and had not yet received any treatment. The control group consisted of healthy women in a similar age range, confirmed to be cancer-free by a mammogram taken less than one year prior to the sample collection. Subjects were not allowed to eat or drink for at least two hours prior to breath sample collection.

Breath collection and assay: Human alveolar breath and background air were sampled separately. The alveolar breath of each subject was collected by the subject breathing five times at five-minute intervals into a valved Teflon® sampling bulb (Markes Bio-VOC® Sampler) containing a Radiello® passive sampler. Analysis was via thermal desorption/gas chromatography/mass spectrometry (Markes International Ltd.
ULTRA thermal desorber / Thermo Trace GC ULTRA / Thermo Trace DSQ mass spectrometer) (Lechner and Rieder, 2007; Manini et al., 2006). Background air was collected passively with Radiello samplers placed in the rooms during the breath sample collection period. The breath data were corrected for the background concentration data.

4.2.1 Description of Data

Our observations consist of expressions of 378 VOCs for each of 41 subjects (x_i ∈ R^378, i = 1, ..., 41). Of the 41 subjects, 17 came from the case group (label y_i = +1) and 24 from the control group (label y_i = -1); that is,

    (x_i, y_i) ∈ R^p × {-1, +1},  i = 1, ..., n,    (4.1)

with p = 378 and n = 41. Our goal is to construct a classifying function C,

    C : x ∈ R^p → {-1, +1},    (4.2)

where the label for a new observation x_new is predicted as C(x_new). Since the dimension of each observation x_i exceeds the number of observations, for inference purposes we have a so-called "small n, large p" problem. In addition, the data are sparse; that is, many VOCs for a particular subject are not observed (Figure 4.1(a-d)).

FIGURE 4.1 Plots of VOC data for the case and control groups: (a) scaled image of the VOC data for the case group; (b) scaled image for the control group; (c) magnitudes of the VOCs for the 2nd subject in the case group (2nd column of the matrix in (a)); (d) the 10th subject in the control group (10th column of the matrix in (b)).

4.3 Dimension Reduction

Two approaches to dimension reduction, linear and nonlinear, are discussed and compared. The construction of the classifying function C is discussed in the subsequent sections.

4.3.1 Linear Dimension Reduction

Principal component analysis (PCA), also called the Karhunen-Loève decomposition in engineering fields, is a widely used linear dimension reduction technique in multivariate data analysis (Jolliffe, 2002). In PCA, linear combinations u_i^T x are iteratively chosen so that the directions u_i are mutually orthogonal and "compress" the variance. Formally, the principal components u_i are defined as

    u_i = arg max var(u_i^T x),  subject to ||u_i|| = 1 and u_i ⊥ u_j, j = 1, ..., i-1,    (4.3)

where for u_1 the orthogonality constraint is removed by definition. These u_i give the optimal linear representation, in the sense of minimizing the squared representation error. By defining X = [x_1 x_2 ... x_n], an empirical version of equation (4.3) can be written as

    u_i = arg max u_i^T X X^T u_i,  subject to ||u_i|| = 1 and u_i ⊥ u_j, j = 1, ..., i-1,    (4.4)

which is solved by the eigenvector associated with the i-th largest eigenvalue of X X^T. Dimension reduction is achieved by taking a subset {u_1, u_2, ..., u_k}, k < min{n, p}, as a basis. An illustration for x ∈ R^2 is shown in Figure 4.2: the first principal component corresponds to the direction of largest variability in x, and the second is orthogonal to the first. In this case, dimension reduction can be achieved by replacing the two-dimensional data with the scores along the first principal component.

FIGURE 4.2 Illustration of PCA with two-dimensional data.

The scores on the first and second principal components for the VOC data are shown in Figure 4.3. As is evident, the discriminative power of the first component is highly influenced by outliers.
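The empirical construction in equations (4.3) and (4.4) can be sketched in a few lines of code. The following is a minimal illustration on random stand-in data of the same shape as the VOC matrix (378 VOCs by 41 subjects), not the study's actual measurements; centering of each VOC across subjects is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the VOC matrix X = [x_1 ... x_n]: p = 378 VOCs, n = 41 subjects
p, n = 378, 41
X = rng.random((p, n))

# Center each VOC across subjects so that XX^T is proportional to the sample covariance
Xc = X - X.mean(axis=1, keepdims=True)

# Principal directions u_i are the eigenvectors of XX^T, ordered by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(Xc @ Xc.T)  # eigh returns eigenvalues in ascending order
U = eigvecs[:, np.argsort(eigvals)[::-1]]     # columns u_1, u_2, ...

# Dimension reduction: scores of each subject on the first k = 2 components
k = 2
scores = U[:, :k].T @ Xc                      # k x n matrix of principal scores
```

The eigenvectors returned by `eigh` are orthonormal, so the constraint u_i ⊥ u_j of equation (4.3) holds by construction.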
FIGURE 4.3 Scores of the principal components for the VOC data; a few outliers dominate the geometry. (a) 1st vs. 2nd principal coordinates; (b) 1st vs. 3rd principal coordinates.

4.3.2 Nonlinear Dimension Reduction

Many nonlinear dimension reduction techniques, guided by a nonlinear manifold in which the "data live," have been proposed. Among them are locally linear embedding (LLE) (Roweis and Saul, 2000), ISOMAP (Tenenbaum et al., 2000), Hessian eigenmaps (Donoho and Grimes, 2003), Laplacian eigenmaps (Belkin and Niyogi, 2003), and diffusion maps (Lafon, 2004), to name a few. Unlike PCA, these nonlinear methods transform the data in such a way as to preserve its distance structure, thus capturing its intrinsic dimensionality. These learning algorithms can be unified by the following framework describing the computation of an embedding for a given data set.

Algorithm
1. Start from the data set {x_1, ..., x_n} of size n, where each observation belongs to R^p. Construct an n × n "neighborhood" or similarity matrix W.
2. Optionally normalize W as W̃.
3. Compute the k largest nontrivial eigenvalues and eigenvectors v_j of W̃.
4. The embedding of each observation x_i is the vector y_i with elements y_ij representing the i-th element of the j-th eigenvector v_j of W̃.

In the following we focus on Laplacian eigenmaps. They can be described under the above framework with an appropriate definition of the neighborhood matrix W. We will discuss their properties and application to our data set in detail.

TABLE 4.1 Chemical grouping of the 378 VOCs:
Nitrogen-containing compounds
Alcohols
Alkanes
Aldehydes
Ketones
Cyclo-ethers or furans
Alkenes
Halocarbons
Esters
Organosulfur compounds
Organosilicon compounds
Carboxylic acids
Aromatic rings
Miscellaneous

4.3.2.1 Distance Between VOCs

Defining a neighborhood for each observation requires a notion of distance between two VOCs. Each VOC is presumed to be chemically related, or connected, to some other VOCs from the set of 378 VOCs.
This consideration led us to the following distance metric between two observations:

    ||x_i - x_j||^2 := Σ_{k=1}^{378} Σ_{l=1}^{378} a_{kl} (x_{i,k} - x_{j,l})^2,    (4.5)

where a_{kl} represents the connection strength between the k-th and l-th VOCs. The a_{kl} together form the VOC closeness matrix A, which is derived from the chemical properties of the VOCs. The closeness matrix A was found by sorting all compounds into fourteen groups based on their chemical properties; the proposed groups are shown in Table 4.1. Each compound's closeness factor with itself is one, its closeness to other compounds in the same group is some 0 < λ < 1, and its closeness to compounds outside its group is zero. In each case where a compound may belong to more than one group, the group with the most active or significant component was chosen. Then, we define

    a_{kl} = 1,  if k = l,
             λ,  if the k-th and l-th VOCs are in the same group,
             0,  otherwise,

with λ between 0 and 1. The parameter λ can be interpreted as the effect of group membership on the calculated distance. The closeness matrix for the 378 VOCs and λ = 0.5 is shown in Figure 4.4. Note that the VOCs are ordered by retention time from the chromatographic analysis of the breath collection tubes.

FIGURE 4.4 Closeness matrix; shades of gray express the affinity between two VOCs.

This grouping is justifiable from a chemical standpoint for several reasons. Most compounds in a group (e.g., alkanes, alcohols) behave similarly chemically. For those which do not behave similarly, like the nitrogen-containing compounds, their presence may indicate the utilization of similar biochemical substrates or pathways, which may themselves be symptoms of the disease. Finally, several members of each group may be in equilibrium with each other.
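The closeness matrix A and the distance of equation (4.5) can be sketched as follows; the group labels below are hypothetical stand-ins for the fourteen chemical groups of Table 4.1.

```python
import numpy as np

def closeness_matrix(groups, lam=0.5):
    """Closeness matrix A of the text: a_kl = 1 if k = l,
    lam if VOCs k and l share a chemical group, and 0 otherwise."""
    g = np.asarray(groups)
    A = np.where(g[:, None] == g[None, :], lam, 0.0)
    np.fill_diagonal(A, 1.0)
    return A

def voc_distance_sq(xi, xj, A):
    """Squared distance of equation (4.5): sum_{k,l} a_kl (x_{i,k} - x_{j,l})^2."""
    d = xi[:, None] - xj[None, :]
    return float((A * d**2).sum())

# Toy example: six VOCs assigned to three hypothetical chemical groups
A = closeness_matrix(["alkane", "alkane", "alcohol", "ketone", "alcohol", "ketone"],
                     lam=0.5)
```

With A equal to the identity matrix (λ = 0), the distance reduces to the ordinary squared Euclidean distance.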
4.3.2.2 Laplacian Eigenmaps

The embedding provided by Laplacian eigenmaps preserves local geometric information optimally, by mapping neighboring points as closely together as possible. Building neighborhoods amounts to introducing an "adjacency" metric between the points. The adjacency matrix consists of entries W_ij representing the closeness between the i-th and j-th observations,

    W_ij = exp( -||x_i - x_j||^2 / t ),    (4.6)

where t is a "diffusion" parameter. Alternatively, one can simply set W_ij = 1 if two nodes are connected or lie within a k-nearest-neighbor range.

A mapping from x_1, ..., x_n to y_1, ..., y_n is obtained by solving the following optimization problem:

    min Σ_{i,j} (y_i - y_j)^2 W_ij,  subject to y^T D y = 1 and y^T D 1 = 0,    (4.7)

where the diagonal matrix D is defined via D_ii = Σ_j W_ij. A penalty proportional to the adjacency W_ij is incurred if x_i and x_j are mapped far apart. By spectral decomposition, equation (4.7) reduces to the following generalized eigenvalue problem:

    (D - W) f = μ D f.    (4.8)

The eigenvectors f can be viewed as optimal "cuts" in the geometric structure of the x_i. The matrix D - W is called the graph Laplacian of x_1, ..., x_n. By taking the eigenvectors associated with the l smallest nonzero eigenvalues, one gets

    x_i → (f_1(i), ..., f_l(i)).    (4.9)

The Laplacian eigenmaps for the VOCs are depicted in Figure 4.5. It is clear that the first three eigenvectors are more discriminative than those of PCA. We employ several classification techniques on the summaries in (4.9); the results are provided in the following section.

4.3.3 Classification and Comparisons

The classification function C can be approximated by a linear combination of the new transformed coordinates, since eigenfunctions of the Laplacian constitute an orthogonal basis for the Hilbert space L² defined on the manifold of the data (Belkin and Niyogi, 2004).
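The Laplacian eigenmap coordinates f_1(i), ..., f_l(i) that feed the classifiers can be computed by solving the generalized eigenproblem (4.8). The sketch below uses a fully connected heat-kernel graph (equation (4.6)) on random stand-in data rather than the VOC measurements and their closeness-based distance.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

def laplacian_eigenmap(X, n_components=2, t=1.0):
    """Rows of X are observations. Builds W via the heat kernel (eq. 4.6),
    forms the graph Laplacian L = D - W, and solves (D - W) f = mu D f (eq. 4.8).
    Returns the eigenvectors for the smallest nonzero eigenvalues (eq. 4.9)."""
    W = np.exp(-squareform(pdist(X, "sqeuclidean")) / t)
    D = np.diag(W.sum(axis=1))
    L = D - W
    mu, F = eigh(L, D)               # generalized symmetric eigenproblem, ascending mu
    return F[:, 1:n_components + 1]  # skip the trivial constant eigenvector (mu = 0)

Y = laplacian_eigenmap(np.random.default_rng(1).random((41, 5)), n_components=2)
```

In the chapter's setting, the squared distances inside the kernel would be the closeness-weighted distances of equation (4.5), and W could instead be restricted to k nearest neighbors.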
Four classification methods have been applied to the VOC data in the transformed domain: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), support vector machines (SVM), and the least squares estimate (LSE). The assessment of the PCA and Laplacian eigenmap classifiers was performed by simulation. For each classification method, the observations were randomly split into a training set (60% of the cases and 60% of the controls) and a validation set (the remaining 40% of the cases and controls). Two dominant PCA or Laplacian components served as classifying descriptors. The discriminatory boundaries were constructed using the training sets, and the correct classification rates were assessed on the validation sets. This was repeated 1000 times.

The Laplacian eigenmap classification was done in a semi-supervised manner. All observations, irrespective of their labels (training and validation sets), were used to assess the manifold geometry and guide the transformation. Once the data were transformed, the labeled dominant components (from the training set) produced discrimination boundaries used to classify the unlabeled components corresponding to the validation set. Of course, the true labels in the validation set are known, so we are able to calculate the rate of correct classification.

FIGURE 4.5 Laplacian eigenmap coordinate system spanned by the dominant components; the intrinsic geometry of the VOCs is better expressed in the transformed domain. (a) 1st vs. 2nd components with λ = 0, 5 neighbors; (b) 1st vs. 3rd with λ = 0, 5 neighbors; (c) 1st vs. 2nd with λ = 0.005, 4 neighbors; (d) 2nd vs. 3rd with λ = 0.005, 4 neighbors.

For LSE, the classifying function can be put in closed form. Suppose a training set has L pairs, (x_1, y_1), ..., (x_L, y_L), and there are U unlabeled observations x_{L+1}, ..., x_{L+U}, where x_i ∈ R^378 and y_i ∈ {-1, +1}. After the transformation we obtain f^(i) ∈ R^l corresponding to x_i:

    (x_i, y_i) → (f^(i), y_i),  i = 1, ..., L,
    x_i → f^(i),  i = L+1, ..., L+U.

The classification function C(x) is approximated by a linear combination of the l components that serve as the new coordinates:

    C(x) = Σ_{k=1}^{l} c_k f_k^(x),    (4.10)

where f^(x) is the transformation of x. Let f = [f^(1) f^(2) ... f^(L)], and let y = [y_1 ... y_L] correspond only to the training samples. Then the c_k are estimated by c = (f f^T)^{-1} f y, by the least squares criterion. Since the labels are either -1 or +1, the classification of x can be based on the sign of C(x) in (4.10).

FIGURE 4.6 Illustration of classifications with PCA and Laplacian eigenmaps with λ = 0.1; points represent a random training set, and responses for the validation set are estimated by the decision boundary inferred from the training set. (a) LDA with PCA; (b) LDA with Laplacian eigenmaps; (c) QDA with PCA; (d) QDA with Laplacian eigenmaps; (e) SVM with PCA; (f) SVM with Laplacian eigenmaps.
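The closed-form least squares fit c = (f f^T)^{-1} f y and the sign rule for the labels can be sketched as follows, on two-dimensional toy clusters that stand in for the embedded training components (not the study's data).

```python
import numpy as np

def lse_fit(F_train, y_train):
    """Least squares estimate of the coefficients in eq. (4.10).
    F_train: l x L matrix whose columns are embedded training points f^(i);
    y_train: length-L vector of labels in {-1, +1}.
    Closed form: c = (F F^T)^{-1} F y."""
    return np.linalg.solve(F_train @ F_train.T, F_train @ y_train)

def lse_predict(c, F_new):
    """Label is +1 when sum_k c_k f_k >= 0, and -1 otherwise."""
    return np.where(c @ F_new >= 0, 1, -1)

# Toy illustration: two separable clusters in l = 2 embedded dimensions
rng = np.random.default_rng(2)
F_train = np.hstack([rng.normal(-2, 0.3, (2, 10)),   # controls, label -1
                     rng.normal(+2, 0.3, (2, 10))])  # cases, label +1
y_train = np.r_[np.full(10, -1), np.full(10, +1)]
c = lse_fit(F_train, y_train)
predicted = lse_predict(c, F_train)
```

In the chapter's evaluation, such a rule was fit on a random 60% training split and assessed on the remaining 40%, repeated 1000 times.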
Explicitly, the predicted label is

    y = +1,  if Σ_{k=1}^{l} c_k f_k^(x) ≥ 0,
    y = -1,  if Σ_{k=1}^{l} c_k f_k^(x) < 0.

Figure 4.6 illustrates decision boundaries for the PCA mapping and the Laplacian eigenmaps with LDA, QDA, and SVM. Notice that LDA produces a linear separation, whereas QDA has a quadratic separation boundary, based on the assumption that the variances in the two groups differ. The SVM boundary varies according to its kernel properties: in Figure 4.6(e) a radial basis kernel with width parameter 10^5, penalty parameter 100, and 5 neighbors was used, while in Figure 4.6(f) a radial basis kernel with width 0.4, penalty 10, and 5 neighbors was applied.

Table 4.2 shows the average rate of correct classification for each method. As is evident, the average rates for the Laplacian eigenmaps exceed those for PCA.

TABLE 4.2 Average correct classification rate for the four classification methods, based on 1000 random partitions of the data into training and validation sets.

                          LDA      QDA      SVM      LSE
    PCA                   0.6126   0.6030   0.7000   0.6224
    Laplacian eigenmaps   0.7668   0.7648   0.7346   0.7658

The classification rate as a function of the parameter λ in the closeness matrix is given in Figure 4.7, which illustrates the benefit of the closeness matrix in the classification process.

4.4 Conclusions

This study demonstrated a significant link between the presence of breast cancer and exhaled VOCs. As a consequence, predictive screening may be possible with this method. While the sensitivity and specificity of this breath test may be low, it is an inexpensive, noninvasive test that may have wide applicability worldwide. As a "weak classifier," the VOC analysis may be combined with other independent classifiers to increase the confidence level of the diagnostic decision. The acceptance of breath volatiles as a viable clinical screening technique has been slow due to a number of issues: 1.
largely unsuccessful attempts to find a single or small number of low-level VOCs as biomarker compounds for individual diseases; 2. lack of understanding of the physiological meaning of the detected volatiles; and 3. results and methods that cannot easily be converted into a clinically useful form.

FIGURE 4.7 Classification rate with two Laplacian components (LDA, QDA, SVM, and LSE) as a function of λ; the rate is maximized at λ = 0.085.

With a larger sample size, we plan to focus attention on those compounds found to be the strongest indicators of breast cancer, and on the relationships found between compounds. This will help us better understand the specific chemical and metabolic changes involved with the disease. While this method might be made more accurate using the equipment and techniques employed by Phillips (1997), the collecting devices used in this study are portable and easier to use, making the technology more practicable in application.

4.4.1 Chemistry Considerations

Several assumptions and steps taken in the analysis of the data must be justified from a chemical perspective. The proposed methodology operates under the assumption that a few linear or nonlinear combinations of the compounds dominate the predictive power of the data for diagnosing cancer. This is inherent, for example, in PCA, where the dimension of the data is reduced from 378 to two. While the inherent mathematical/chemical space of disease-related VOCs may have many dimensions, two optimized dimensions can model this link well. Chemically, this may be because only a few reactions dominate the influence that the disease has on the presence of each compound. This suggests that a small number of nonlinear combinations can effectively model the entire system. The chemical "closeness" used in the described classification could be refined.
With multiple reactions and reactants, many co-products and co-reactants would exist, and the dominant reactions would most closely control these relationships. Chemical closeness is an effective way to model this behavior, and it proved to be a useful tool for sharpening the computational analysis. However, if any two compounds are found to have little correlation, then their "closeness" can be set to zero, and that relationship will not contribute to the calculation step. In this way, the inclusion of the closeness step in the calculation does not preclude "non-closeness" interactions.

Acknowledgment. The research of Lee, Abouelnasr, and Vidakovic was supported by NSF Grants 0505490 and 0724524 at the Georgia Institute of Technology.

References

ACS, American Cancer Society (2007), What Are the Key Statistics for Breast Cancer?, Detailed Guide: Breast Cancer, available at http://www.cancer.org/.

Belkin, M., and Niyogi, P. (2003), "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation," Neural Computation, 15, 1373-1396.

Belkin, M., and Niyogi, P. (2004), "Semi-supervised Learning on Riemannian Manifolds," Machine Learning, 56, 209-239.

Berry, D. A., Cronin, K. A., Plevritis, S. K., Fryback, D. G., Clarke, L., Zelen, M., Mandelblatt, J. S., Yakovlev, A. Y., Habbema, J. D. F., and Feuer, E. J. (2005), "Effect of Screening and Adjuvant Therapy on Mortality from Breast Cancer," The New England Journal of Medicine, 353, 1784-1792.

Cao, W., and Duan, Y. (2006), "Breath Analysis: Potential for Clinical Diagnosis and Exposure Assessment," Clinical Chemistry, 52, 800-811.

Dobson, R. (2003), "Dogs Can Sniff Out First Signs of Men's Cancer," Sunday Times, April 2003, 27, 5.

Donoho, D., and Grimes, C. (2003), "Hessian Eigenmaps: New Tools for Nonlinear Dimensionality Reduction," Proceedings of the National Academy of Sciences, 100, 5591-5596.
EPA, Environmental Protection Agency (2007), Indoor Air Quality: Organic Gases (Volatile Organic Compounds - VOCs), available at http://www.epa.gov/iaq/voc.html.

Gordon, R. T., Shatz, C. B., Myers, L. J., Kosty, M., Gonczy, C., Kroener, J., Tran, M., Kurtzhals, P., Heath, S., Koziol, J. A., Arthur, N., Gabriel, M., Hemping, J., Hemping, G., Nesbitt, S., Tucker-Clark, L., and Zaayer, J. (2008), "The Use of Canines in the Detection of Human Cancers," The Journal of Alternative and Complementary Medicine, 14, 61-67.

Henderson, C. (1991), Breast Cancer (12th ed.), New York: McGraw-Hill.

Jolliffe, I. T. (2002), Principal Component Analysis (2nd ed.), New York: Springer.

Lafon, S. (2004), "Diffusion Maps and Geometric Harmonics," Ph.D. dissertation, Yale University.

Lechner, M., and Rieder, J. (2007), "Mass Spectrometric Profiling of Low-molecular-weight Volatile Compounds: Diagnostic Potential and Latest Applications," Current Medical Chemistry, 14, 987-995.

Manini, P., De Palma, G., Andreoli, R., Poli, D., Mozzoni, P., Folesani, G., Mutti, A., and Apostoli, P. (2006), "Environmental and Biological Monitoring of Benzene Exposure in a Cohort of Italian Taxi Drivers," Toxicology Letters, 167, 142-151.

McCulloch, M., Jezierski, T., Broffman, M., Hubbard, A., Turner, K., and Janecki, T. (2006), "Diagnostic Accuracy of Canine Scent Detection in Early- and Late-stage Lung and Breast Cancers," Integrative Cancer Therapies, 5, 30-39.

Phillips, M. (1997), "Method for the Collection and Assay of Volatile Organic Compounds in Breath," Analytical Biochemistry, 247, 272-278.

Pickel, D., Manucy, G., Walker, D., Hall, S., and Walker, J. (2004), "Evidence for Canine Olfactory Detection of Melanoma," Applied Animal Behaviour Science, 89, 107-116.

Poli, D., Carbognani, P., Corradi, M., Goldoni, M., Acampa, O., Balbi, B., Bianchi, L., Rusca, M., and Mutti, A.
(2005), "Exhaled Volatile Organic Compounds in Patients with Non-small Cell Lung Cancer: Cross Sectional and Nested Short-term Follow-up Study," Respiratory Research, 6, 71.

Roweis, S. T., and Saul, L. K. (2000), "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, 290, 2323-2326.

Tenenbaum, J. B., Silva, V. D., and Langford, J. C. (2000), "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, 290, 2319-2323.

Willis, C. M., Church, S. M., Guest, C. M., Cook, W. A., McCarthy, N., Bransbury, A. J., Church, M. R. T., and Church, J. C. T. (2004), "Olfactory Detection of Human Bladder Cancer by Dogs: Proof of Principle Study," BMJ, 329, 712.