4
Mining Exhaled Volatile Organic Compounds for Breast Cancer Detection
Kichun Sky Lee1, M. Forrest Abouelnasr1, Charlene W. Bayer1, Sheryl G.A. Gabram2, Boris Mizaikoff3, André Rogatko2, and Brani Vidakovic1,2
1 Georgia Institute of Technology, Atlanta, USA; 2 Emory University, Atlanta, USA; 3 University of Ulm, Germany
CONTENTS
4.1 Introduction
4.2 What are Volatile Organic Compounds?
4.3 Dimension Reduction
4.4 Conclusions
References
This paper provides a new application of a nonlinear dimension reduction technique to the analysis of volatile organic compounds (VOCs) in human breath, measured as high-dimensional and sparse data. Motivated by the potential for early detection of breast cancer and for monitoring disease recurrence, we are interested in classifying subjects as cases or controls based on exhaled VOC content.
It is demonstrated that principal component analysis captures the relationship between VOCs and the incidence of breast cancer less well than a nonlinear dimension reduction methodology that considers the geometric manifold of the observed VOCs. To take chemical properties into account, our dimension reduction approach introduces measures of chemical closeness between VOCs. We demonstrate that the nonlinear dimension reduction technique in our context results in a weak classifier that, in conjunction with other diagnostic modalities, may improve the accuracy of breast cancer diagnosis.
4.1 Introduction
Women in the U.S. have a one-in-eight lifetime chance of developing invasive breast cancer and a one-in-33 chance of dying from it (ACS, 2007). However, if the disease is detected at an early stage, it is highly treatable.
1-58488-344-8/04/$0.00+$1.50
© 2004 by CRC Press LLC
Statistical Data Mining and Knowledge Discovery
Mortality can be significantly reduced with multi-modal therapy that includes surgery, targeted medical oncology treatments, and radiation therapy (Berry et al., 2005). A major problem with early detection of breast cancer is that mammography can be uncomfortable, or unavailable worldwide due to a lack of financial and technical resources. A promising new diagnostic method is the analysis of exhaled VOCs (Cao and Duan, 2006; Poli et al., 2005). While the exact reactions and chemical processes associated with breast cancer are not accurately known, a correlation may be found between the presence of the disease and the occurrence of certain exhaled VOCs. Recent research involving canine scent detection of various types of cancer demonstrated high sensitivity and specificity of these nature-made classifiers (Pickel et al., 2004; Dobson, 2003; Willis et al., 2004; McCulloch et al., 2006; Gordon et al., 2008).
This study was performed to investigate this link, using summaries of breath mass-spectrometric analysis as informative descriptors. Two groups were examined: a group of women who had recently been diagnosed with breast cancer (but who had not yet started any treatment) and a cancer-free control group. Using the mass-spectrometric summaries in the context of Laplacian eigenmap dimension reduction, we were able to predictively distinguish between the two groups with a correct classification rate of 0.7580, on average. One of the novelties of our research is the guidance of this classification scheme by the chemical closeness between VOCs, expressed by the so-called closeness matrix, which is critical in tuning the Laplacian eigenmaps.
4.2 What are Volatile Organic Compounds?
Volatile organic compounds are organic chemical compounds with a vapor pressure high enough that, under normal conditions, they significantly vaporize and enter the atmosphere (EPA, 2007). Common VOC-containing products include paint thinners, pharmaceuticals, refrigerants, dry-cleaning solvents, and constituents of petroleum fuels (e.g., gasoline and natural gas). Flora and fauna are also important biological sources of VOCs; for example, it is known that trees emit large amounts of VOCs, especially isoprene and terpenes. Humans are also sources of VOCs, from their skin and breath. In this research, human-exhaled VOCs are collected for the diagnosis of breast cancer in the following way.
Human subjects: Two groups of subjects have been examined. The case group consisted of women who had recently been diagnosed with breast cancer at Stages II, III,
or IV, and prior to receiving any treatment. The control group consisted of healthy
women in a similar age range confirmed to be cancer-free by a mammogram taken
less than one year prior to the sample collection. Subjects were not allowed to eat or
drink for at least two hours prior to breath sample collection.
Breath Collection and Assay: Human alveolar breath and background air were sampled separately. The alveolar breath of each subject was collected by the subject breathing five times at five-minute intervals into a valved Teflon® sampling bulb (Markes Bio-VOC® Sampler) containing a Radiello® passive sampler. Analysis was via thermal desorption/gas chromatography/mass spectrometry (Markes International Ltd. ULTRA thermal desorber/Thermo Trace GC ULTRA/Thermo Trace DSQ mass spectrometer) (Lechner and Rieder, 2007; Manini et al., 2006). Background air was collected passively with Radiello samplers placed in the rooms during the breath sample collection time period. The breath data were corrected for the background concentration data.
4.2.1 Description of Data
Our observations consist of expressions of 378 VOCs for each of 41 subjects, x_i ∈ R^378, i = 1, ..., 41. Of the 41 subjects, 17 came from the case group (label y_i = +1) and 24 from the control group (label y_i = -1); that is,

(x_i, y_i) ∈ R^p × {-1, +1},  i = 1, ..., n,   (4.1)

with p = 378 and n = 41. Our goal is to construct a classifying function C,

C : x ∈ R^p → {-1, +1},   (4.2)

where the label for a new observation x_new is predicted as C(x_new). Since the dimension of each observation x_i exceeds the number of observations, for inference purposes we have a so-called "small n, large p" type of problem. In addition, the data are sparse; that is, many VOCs for a particular subject are not observed (Figure 4.1(a-d)).
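A synthetic stand-in for this data layout can make the setting of (4.1) concrete. The values below are illustrative placeholders (random draws), not the study's measurements; only the dimensions and label coding follow the text:

```python
import numpy as np

# Synthetic stand-in for (4.1): 378 VOC expressions for 41 subjects
# (17 cases, 24 controls).  Values are placeholders, not real data.
rng = np.random.default_rng(0)
p, n_case, n_control = 378, 17, 24
n = n_case + n_control

X = rng.exponential(scale=200.0, size=(p, n))   # nonnegative VOC magnitudes
X[rng.random((p, n)) < 0.8] = 0.0               # sparse: most VOCs unobserved

y = np.concatenate([np.ones(n_case), -np.ones(n_control)])  # +1 case, -1 control
print(X.shape, y.shape)   # (378, 41) (41,)
```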
4.3 Dimension Reduction
Two approaches to dimension reduction (linear and nonlinear) are discussed and compared. The construction of the classifying function C is discussed in the subsequent sections.
4.3.1 Linear Dimension Reduction
Principal component analysis (PCA), also called the Karhunen–Loève decomposition in engineering fields, is a widely used linear dimension reduction technique in multivariate data analysis (Jolliffe, 2002). In PCA, linear combinations of x_i, denoted u_i, are iteratively chosen so that they are mutually orthogonal and "compress" the variance. Formally, the principal components u_i are defined as

u_i = argmax_{||u_i|| = 1; u_i ⊥ u_j, j = 1, ..., i-1} var(u_i^T x),   (4.3)
where for u_1 the orthogonality constraint is removed by definition. These u_i are the optimal linear representation, optimized by minimizing the squared errors. By defining X = [x_1 x_2 ... x_n], an empirical version of equation (4.3) can be written as

u_i = argmax_{||u_i|| = 1; u_i ⊥ u_j, j = 1, ..., i-1} u_i^T X X^T u_i,   (4.4)

which is solved by the eigenvector associated with the i-th largest eigenvalue of XX^T. The dimension reduction is achieved by taking a subset {u_1, u_2, ..., u_k}, k < min{n, p}, as a basis. An illustration for x ∈ R^2 is shown in Figure 4.2. The first principal component corresponds to the direction of largest variability in x; the second is orthogonal to the first. In this case, dimension reduction can be achieved by replacing the two-dimensional data with the scores along the first principal component.

FIGURE 4.1
Plots of VOC data for the Case and Control groups: (a) scaled image of VOC data for the case group; (b) scaled image for the control group; (c) 2nd subject in the case group (2nd column of the matrix in (a)); (d) 10th subject in the control group (10th column of the matrix in (b)).
The scores of the first and second principal components for the VOC data are shown in Figure 4.3. As is evident, the discriminative power of the first component is highly influenced by outliers.
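The spectral form of (4.4) can be sketched directly as an eigendecomposition of XX^T. This is a minimal illustration, not the chapter's code; it assumes each VOC (row) should be mean-centered across subjects first, which the spectral form leaves implicit:

```python
import numpy as np

def pca_scores(X, k=2):
    """Principal component scores via the eigenvectors of X X^T, as in (4.4).

    X is p x n with subjects as columns.  A minimal sketch: each VOC (row)
    is centered across subjects before the eigendecomposition.  Returns a
    k x n matrix of scores along u_1, ..., u_k.
    """
    Xc = X - X.mean(axis=1, keepdims=True)     # center each row (VOC)
    vals, vecs = np.linalg.eigh(Xc @ Xc.T)     # eigenvalues in ascending order
    U = vecs[:, ::-1][:, :k]                   # u_1, ..., u_k as columns
    return U.T @ Xc                            # scores along each u_i

# Toy data with the chapter's dimensions (values are synthetic).
rng = np.random.default_rng(1)
X = rng.normal(size=(378, 41))
S = pca_scores(X, k=2)
print(S.shape)   # (2, 41)
```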
FIGURE 4.2
Illustration of PCA with 2-dimensional data.
FIGURE 4.3
Scores of the principal components for the VOC data; a few outliers dominate the geometry. (a) 1st vs. 2nd principal coordinate; (b) 1st vs. 3rd principal coordinate.
4.3.2 Nonlinear Dimension Reduction
Many nonlinear dimension reduction techniques, guided by a nonlinear manifold in which the "data live," have been proposed: local linear embedding (LLE) (Roweis and Saul, 2000), ISOMAP (Tenenbaum et al., 2000), Hessian eigenmaps (Donoho and Grimes, 2003), Laplacian eigenmaps (Belkin and Niyogi, 2003), and diffusion maps (Lafon, 2004), to name a few. Unlike PCA, these nonlinear methods transform the data in such a way as to preserve its distance structure, thus capturing its intrinsic dimensionality. These learning algorithms can be unified by the following framework, describing the computation of an embedding for a given data set.
Algorithm
1. Start from the data set {x_1, ..., x_n} of size n, where each observation belongs to R^p. Construct an n × n "neighborhood" or similarity matrix W.
2. Optionally normalize W as W̃.
3. Compute the k largest nontrivial eigenvalues and eigenvectors v_j of W̃.
4. The embedding of each observation x_i is the vector y_i with elements y_ij representing the i-th element of the j-th eigenvector v_j of W̃.

TABLE 4.1
Chemical grouping of 378 VOCs: nitrogen-containing compounds; alcohols; alkanes; aldehydes; ketones; cyclo-ethers or furans; alkenes; halocarbons; esters; organosulfur compounds; organosilicon compounds; carboxylic acids; aromatic rings; miscellaneous.

In the following we focus on Laplacian eigenmaps. They can be described under the above framework with an appropriate definition of the neighborhood matrix W. We will discuss their properties and application to our data set in detail.
4.3.2.1 Distance Between VOCs
Defining a neighborhood for each observation requires a notion of distance between two VOCs. Each VOC is presumed to be chemically related, or connected, to some other VOCs from the set of 378. This consideration led us to the following distance metric between two observations,

||x_i - x_j||^2 := Σ_{k=1}^{378} Σ_{l=1}^{378} a_kl (x_{i,k} - x_{j,l})^2,   (4.5)

where a_kl represents the connection strength between the k-th and l-th VOCs. All a_kl form the VOC closeness matrix A, which is derived from the chemical properties of the VOCs. The closeness matrix A was found by grouping all compounds into fourteen groups. Each compound's closeness factor with itself is one, and each compound's closeness to others in the same group is 0 < λ < 1. A compound's closeness to compounds outside of its group is zero. The compounds are sorted into groups based upon their chemical properties. The proposed groups are shown in Table 4.1.
In each case where a compound may belong to more than one group, the group with the most active or significant component was chosen. Then, we define

a_kl = 1 if k = l;  a_kl = λ if the k-th and l-th VOCs are in the same group;  a_kl = 0 otherwise,

with λ between 0 and 1. The parameter λ can be interpreted as the effect of the group in calculating the distance. The closeness matrix for the 378 VOCs and λ = 0.5 is shown in Figure 4.4. Notice that the VOCs are ordered by retention time from the chromatographic analysis of the breath collection tubes.
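The construction of A and the weighted distance (4.5) can be sketched as follows. This is an illustrative sketch with a toy group assignment, not the chapter's actual fourteen-group labeling:

```python
import numpy as np

def closeness_matrix(groups, lam=0.5):
    """Closeness matrix A of Section 4.3.2.1: 1 on the diagonal, lam for
    two distinct VOCs sharing a chemical group (Table 4.1), 0 otherwise."""
    groups = np.asarray(groups)
    A = lam * (groups[:, None] == groups[None, :]).astype(float)
    np.fill_diagonal(A, 1.0)
    return A

def voc_distance_sq(xi, xj, A):
    """Squared distance of equation (4.5): sum_{k,l} a_kl (x_ik - x_jl)^2."""
    d = xi[:, None] - xj[None, :]    # d[k, l] = x_ik - x_jl
    return float(np.sum(A * d * d))

# Toy illustration: 5 VOCs assigned to two hypothetical groups.
A = closeness_matrix([0, 0, 1, 1, 1], lam=0.5)
print(A[0, 1], A[0, 2])   # 0.5 0.0
```

With A equal to the identity, (4.5) reduces to the ordinary squared Euclidean distance, so λ controls how much the group structure perturbs the plain geometry.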
FIGURE 4.4
Closeness matrix; shades of gray express the affinity between two VOCs.
This grouping is justifiable from a chemical standpoint for several reasons. Most compounds in a group (e.g., alkanes, alcohols) behave similarly chemically. For those which do not behave similarly, like the nitrogen-containing compounds, their presence may indicate the utilization of similar biochemical substrates or pathways, which may themselves be symptoms of the disease. Finally, several members of each group may be in equilibrium with each other.
4.3.2.2 Laplacian Eigenmaps
The embedding provided by the Laplacian eigenmaps preserves local geometric information optimally by mapping neighboring points as closely as possible. Building neighborhoods amounts to introducing an "adjacency" metric between the points. The adjacency matrix consists of entries W_ij representing the closeness between the i-th and j-th observations,

W_ij = exp( -||x_i - x_j||^2 / t ),   (4.6)

where t is a "diffusion" parameter. One can simply set W_ij = 1 if two nodes are connected or within k-nearest-neighbors range.
A mapping from x_1, ..., x_n to y_1, ..., y_n is obtained by solving the following optimization problem,

min_{y: y^T D y = 1, y^T D 1 = 0}  Σ_{i,j} (y_i - y_j)^2 W_ij,   (4.7)

where the diagonal matrix D is defined via D_ii = Σ_j W_ij. We put a penalty proportional to the adjacency W_ij if x_i and x_j are mapped apart. By spectral decomposition, equation (4.7) reduces to the following generalized eigenvalue problem:

(D - W) f = μ D f.   (4.8)

The f can be viewed as optimal "cuts" in the geometric structure of the x_i. The matrix D - W is called the graph Laplacian of x_1, ..., x_n. By taking the eigenvectors associated with the l smallest nonzero eigenvalues, one gets

x_i → ( f_1(i), ..., f_l(i) ).   (4.9)
The Laplacian eigenmaps for VOCs are depicted in Figure 4.5. It is clear that the first
three eigenvectors are more discriminative than those in PCA. We employ several
classification techniques on the summaries in (4.9). The results are provided in the
following section.
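Equations (4.6)-(4.9) can be sketched compactly. The sketch below uses heat-kernel weights on the full graph and plain Euclidean distances; the chapter's k-nearest-neighbor truncation and closeness-weighted distance (4.5) are omitted here, so this is an assumption-laden simplification rather than the exact procedure:

```python
import numpy as np

def laplacian_eigenmap(X, t=1.0, l=2):
    """Laplacian eigenmap embedding, a sketch of equations (4.6)-(4.9).

    X is p x n with observations as columns.  Heat-kernel weights
    W_ij = exp(-||x_i - x_j||^2 / t) are used on the full graph.
    """
    sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)  # ||x_i - x_j||^2
    W = np.exp(-sq / t)
    d = W.sum(axis=1)                          # D_ii = sum_j W_ij
    dm12 = 1.0 / np.sqrt(d)
    # Solve (D - W) f = mu D f via the symmetrized form
    # D^{-1/2} (D - W) D^{-1/2} v = mu v, with f = D^{-1/2} v.
    L_sym = dm12[:, None] * (np.diag(d) - W) * dm12[None, :]
    mu, V = np.linalg.eigh(L_sym)              # ascending eigenvalues
    F = dm12[:, None] * V
    return F[:, 1:l + 1]                       # drop the trivial mu = 0 vector

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 41))                  # 41 observations in R^10
Y = laplacian_eigenmap(X, t=50.0, l=2)
print(Y.shape)   # (41, 2)
```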
4.3.3 Classification and Comparisons
The classification function C can be approximated by a linear combination of the new transformed coordinates, since eigenfunctions of the Laplacian constitute an orthogonal basis for the Hilbert space L^2 defined on the manifold of the data (Belkin and Niyogi, 2004). Four classification methods, namely linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), support vector machines (SVM), and the least squares estimate (LSE), have been applied to the VOC data in the transformed domain.
The assessment of the PCA and Laplacian eigenmap classifiers was performed by simulation. For each classification method, observations were randomly split into a training set (60% of cases and 60% of controls) and a validation set (the remaining 40% of cases and controls). Two dominant PCA or Laplacian components served as classifying descriptors. The discriminatory boundaries were constructed using the training sets, and correct classification rates were assessed on the validation sets. This was repeated 1000 times.
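The stratified split-and-score protocol can be sketched as below. The nearest-class-mean rule here is a hypothetical stand-in for the four classifiers actually used (LDA, QDA, SVM, LSE); only the 60/40 stratified resampling mirrors the text:

```python
import numpy as np

def split_rate(scores, y, n_rep=1000, train_frac=0.6, seed=0):
    """Average correct classification rate over random stratified splits.

    `scores` is an n x 2 array of two dominant components; the classifier
    is a simple nearest-class-mean rule standing in for LDA/QDA/SVM/LSE.
    """
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == -1)
    rates = []
    for _ in range(n_rep):
        p_idx, n_idx = rng.permutation(pos), rng.permutation(neg)
        cut_p, cut_n = int(train_frac * len(pos)), int(train_frac * len(neg))
        tr = np.concatenate([p_idx[:cut_p], n_idx[:cut_n]])   # 60% of each group
        va = np.concatenate([p_idx[cut_p:], n_idx[cut_n:]])   # remaining 40%
        m_pos = scores[tr][y[tr] == 1].mean(axis=0)           # case class mean
        m_neg = scores[tr][y[tr] == -1].mean(axis=0)          # control class mean
        d_pos = np.linalg.norm(scores[va] - m_pos, axis=1)
        d_neg = np.linalg.norm(scores[va] - m_neg, axis=1)
        pred = np.where(d_pos < d_neg, 1, -1)
        rates.append(float(np.mean(pred == y[va])))
    return float(np.mean(rates))

# Toy check: 17 cases and 24 controls from well-separated synthetic clusters.
rng = np.random.default_rng(3)
y = np.concatenate([np.ones(17), -np.ones(24)])
scores = rng.normal(size=(41, 2)) + np.where(y == 1, 3.0, 0.0)[:, None]
print(round(split_rate(scores, y, n_rep=200), 2))
```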
The Laplacian eigenmap classification was done in a semi-supervised manner. All observations, irrespective of their labels (training and validation sets), were used to assess the manifold geometry and guide the transformation. Once transformed, the labeled dominant components (from the training set) produced discrimination boundaries to classify the unlabeled components corresponding to the validation set. Of course, the true labels in the validation set are known, so we are able to calculate the rate of correct classification.
For LSE, the classifying function can be put in a closed form. Suppose a training
set has L pairs, (x1 , y1 ), . . . , (xL , yL ), and U unlabeled observations xL+1 , . . . , xL+U ,
FIGURE 4.5
Laplacian eigenmap coordinate system spanned by two dominant components; the intrinsic geometry of the VOCs is better expressed in the transformed domain. (a) 1st vs. 2nd component with λ = 0, 5 neighbors; (b) 1st vs. 3rd component with λ = 0, 5 neighbors; (c) 1st vs. 2nd component with λ = 0.005, 4 neighbors; (d) 2nd vs. 3rd component with λ = 0.005, 4 neighbors.
where x_i ∈ R^378 and y_i ∈ {-1, +1}. After the transformation we obtain f^(i) ∈ R^l corresponding to x_i:

(x_i, y_i) → ( f^(i), y_i ),  i = 1, ..., L,
x_i → f^(i),  i = L+1, ..., L+U.

The classification function C(x) is approximated by a linear combination of the l components that serve as the new coordinates:

C(x) = Σ_{k=1}^{l} c_k f_k^(x),   (4.10)

where f^(x) is the transformation of x. Let f = [ f^(1) f^(2) ... f^(L) ] and let y = [ y_1 ... y_L ] correspond only to the training samples. Then the c_k are estimated by the least squares criterion as

c = (f f^T)^{-1} f y.

Since the labels are either -1 or +1, the classification of x can be based on Σ_{k=1}^{l} c_k f_k^(x) as

y = +1 if Σ_{k=1}^{l} c_k f_k^(x) ≥ 0, and y = -1 if Σ_{k=1}^{l} c_k f_k^(x) < 0.

FIGURE 4.6
Illustration of classifications with PCA and Laplacian eigenmaps with λ = 0.1; points represent a random training set. Responses for the validation set are estimated by the decision boundary inferred from the training set. (a) LDA with PCA; (b) LDA with Laplacian eigenmaps; (c) QDA with PCA; (d) QDA with Laplacian eigenmaps; (e) SVM with PCA; (f) SVM with Laplacian eigenmaps.

TABLE 4.2
Average correct classification rate for four classification methods based on 1000 random partitions of the data into training and validation sets.

                       LDA      QDA      SVM      LSE
PCA                    0.6126   0.6030   0.7000   0.6224
Laplacian eigenmaps    0.7668   0.7648   0.7346   0.7658
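The closed-form LSE fit and the sign rule above can be sketched directly; the toy one-dimensional embedding below is purely illustrative:

```python
import numpy as np

def lse_classifier(F_train, y_train):
    """Least squares coefficients c = (f f^T)^{-1} f y of Section 4.3.3.

    F_train is l x L (columns are transformed training points f^(i));
    y_train holds the +/-1 labels.  Returns c and a sign-rule classifier.
    """
    f = F_train
    c = np.linalg.solve(f @ f.T, f @ y_train)     # (f f^T)^{-1} f y
    def classify(F_new):                          # columns of F_new are f^(x)
        return np.where(c @ F_new >= 0, 1, -1)    # sign rule on sum_k c_k f_k
    return c, classify

# Toy check: a 1-D embedding where the sign of f already separates labels.
F = np.array([[-2.0, -1.0, 1.0, 2.0]])
y = np.array([-1, -1, 1, 1])
c, classify = lse_classifier(F, y)
print(classify(np.array([[-0.5, 0.5]])))   # [-1  1]
```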
Figure 4.6 illustrates decision boundaries for the PCA mapping and the Laplacian eigenmaps with LDA, QDA, and SVM. Notice that LDA produces a linear separation, whereas QDA has a quadratic separation boundary based on the assumption that the variances in the two groups differ. The SVM boundary varies according to its kernel properties. In Figure 4.6(e), a radial basis kernel with width parameter 10^5, penalty parameter 100, and 5 neighbors was used, while in Figure 4.6(f) a radial basis kernel with width 0.4, penalty 10, and 5 neighbors is applied.
Table 4.2 shows the average ratio of correct classifications for each method. As is evident, the average rates for the Laplacian eigenmaps exceed those for PCA. The classification rate as a function of the parameter λ in the closeness matrix is given in Figure 4.7, which illustrates the benefits of the closeness matrix in the classification process.
4.4 Conclusions
This study demonstrated a significant link between the presence of breast cancer and exhaled VOCs. As a consequence, predictive screening may be possible with this method. While the sensitivity and specificity of this breath test may be low, it is an inexpensive, noninvasive test that may have wide applicability worldwide. As a "weak classifier," the VOC analysis may be combined with other independent classifiers to increase the confidence level of the diagnostic decision. The acceptance of breath volatiles as a viable clinical screening technique has been slow due to a number of issues:

1. largely unsuccessful attempts to find a single or small number of low-level VOCs as biomarker compounds for individual diseases;

2. lack of understanding of the physiological meaning of the detected volatiles; and
FIGURE 4.7
Classification rate with two Laplacian components as a function of λ. The rate is maximized at λ = 0.085.
3. results and methods cannot be easily converted into a clinically useful form.

With a larger sample size, we plan to focus attention on those compounds which are found to be the strongest indicators of breast cancer, or on the relationships found between compounds. This will help to better understand the specific chemical and metabolic changes involved with the disease. While this method might be made more accurate using the equipment and techniques employed by Phillips et al. (1997), the collecting devices used in this study are portable and easier to use, giving a more practicable application for this technology.
4.4.1 Chemistry Considerations
Several assumptions and steps taken in the analysis of the data must be justified from a chemical perspective.
The proposed methodology operates under the assumption that a few linear or nonlinear combinations of the compounds dominate the predictive power of the data to diagnose cancer. This is inherent, for example, in PCA, where the dimension of the data is reduced from 378 to two. While the inherent mathematical/chemical space of disease-related VOCs may have many dimensions, two optimized dimensions can model this link well. Chemically, this may be because only a few reactions dominate the influence that the disease has on the presence of each compound. This suggests that a small number of nonlinear combinations can effectively model the entire system.
The chemical "closeness" used in the described classification could be refined. With multiple reactions and reactants, many co-products and co-reactants would exist, and the dominant reactions would most closely control these relationships. Chemical closeness is an effective way to model this behavior, and it proved to be a useful tool to sharpen the computational analysis. However, if any two compounds are found to have little correlation, then their "closeness" can be set to zero, and that relationship will not contribute to the calculation step. In this way, the inclusion of the closeness step in the calculation does not preclude "non-closeness" interactions.
Acknowledgment. The research of Lee, Abouelnasr and Vidakovic was supported
by NSF Grants 0505490 and 0724524 at Georgia Institute of Technology.
References
ACS, American Cancer Society (2007), What Are the Key Statistics for Breast Cancer?, Detailed Guide: Breast Cancer, available at http://www.cancer.org/.
Belkin, M., and Niyogi, P. (2003), “Laplacian Eigenmaps for Dimensionality Reduction and Data Representation,” Neural Computation, 15, 1373-1396.
Belkin, M., and Niyogi, P. (2004), “Semi-supervised Learning on Riemannian Manifolds,” Machine Learning, 56, 209-239.
Berry, D. A., Cronin, K. A., Plevritis, S. K., Fryback, D. G., Clarke, L., Zelen,
M., Mandelblatt, J. S., Yakovlev, A. Y., Habbema, J. D. F., and Feuer, E. J. (2005),
“Effect of Screening and Adjuvant Therapy on Mortality from Breast Cancer,” The
New England Journal of Medicine, 353, 1784-1792.
Cao, W., and Duan, Y. (2006), “Breath Analysis: Potential for Clinical Diagnosis
and Exposure Assessment,” Clinical Chemistry, 52, 800-811.
Dobson, R. (2003), “Dogs Can Sniff Out First Signs of Men’s Cancer,” Sunday
Times, April 2003, 27, 5.
Donoho, D., and Grimes, C. (2003), "Hessian Eigenmaps: New Tools for Nonlinear Dimensionality Reduction," Proceedings of the National Academy of Sciences, 100, 5591-5596.
EPA, Environmental Protection Agency, (2007), Indoor Air Quality: Organic Gases
(Volatile Organic Compounds - VOCs), available at http://www.epa.gov/iaq/voc.html.
Gordon, R. T., Shatz, C. B., Myers, L. J., Kosty, M., Gonczy, C., Kroener, J., Tran,
M., Kurtzhals, P., Heath, S., Koziol, J. A., Arthur, N., Gabriel, M., Hemping, J.,
Hemping, G., Nesbitt, S., Tucker-Clark, L., and Zaayer, J. (2008), “The use of Canines in the Detection of Human Cancers,” The Journal of Alternative and Complementary Medicine, 14, 61-67.
Henderson, C. (1991), Breast Cancer (12th ed.), New York: McGraw-Hill.
Jolliffe, I.T. (2002), Principal Component Analysis (2nd ed.), NY:Springer.
Lafon, S. (2004), “Diffusion Maps and Geometric Harmonics,” Yale University,
Ph.D. dissertation.
Lechner, M., and Rieder, J. (2007), "Mass Spectrometric Profiling of Low-molecular-weight Volatile Compounds: Diagnostic Potential and Latest Applications," Current Medical Chemistry, 14, 987-995.
Manini, P., De Palma, G., Andreoli, R., Poli, D., Mozzoni, P., Folesani, G., Mutti,
A., and Apostoli, P. (2006), “Environmental and Biological Monitoring of Benzene
Exposure in a Cohort of Italian Taxi Drivers,” Toxicology Letters, 167, 142-151.
McCulloch, M., Jezierski, T., Broffman, M., Hubbard, A., Turner, K., and Janecki,
T. (2006), “Diagnostic Accuracy of Canine Scent Detection in Early- and Late-stage
Lung and Breast Cancers,” Integrative Cancer Therapies, 5, 30-39.
Phillips, M. (1997), “Method for the Collection and Assay of Volatile Organic Compounds in Breath,” Analytical Biochemistry, 247, 272-278.
Pickel, D., Manucy, G., Walker, D., Hall, S., and Walker, J. (2004), “Evidence for
Canine Olfactory Detection of Melanoma,” Applied Animal Behaviour Science, 89,
107-116.
Poli, D., Carbognani, P., Corradi, M., Goldoni, M., Acampa, O., Balbi, B., Bianchi,
L., Rusca, M., and Mutti, A. (2005), “Exhaled Volatile Organic Compounds in Patients with Non-small Cell Lung Cancer: Cross Sectional and Nested Short-term
Follow-up Study,” Respiratory Research, 6, 71.
Roweis, S. T., and Saul, L. K. (2000), “Nonlinear Dimensionality Reduction by Locally Linear Embedding,” Science, 290, 2323-2326.
Tenenbaum, J. B., Silva, V. D., and Langford, J. C. (2000), “A Global Geometric
Framework for Nonlinear Dimensionality Reduction,” Science, 290, 2319-2323.
Willis, C. M., Church, S. M., Guest, C. M., Cook, W. A., McCarthy, N., Bransbury,
A. J., Church, M. R. T., Church, J. C. T. (2004), “Olfactory Detection of Human
Bladder Cancer by Dogs: Proof of Principle Study,” BMJ, 329, 712.