A comparison of model-based and regression classification techniques applied to near infrared spectroscopic data in food authentication studies

Deirdre Toher∗†‡, Gerard Downey∗ and Thomas Brendan Murphy†

∗ Ashtown Food Research Centre, Teagasc, Dublin 15, Ireland
† Department of Statistics, School of Computer Science and Statistics, Trinity College Dublin, Dublin 2, Ireland
‡ Corresponding Author: email [email protected]; tel +353.1.8059500; fax +353.1.8059550
Abstract
Classification methods can be used to classify samples of unknown
type into known types. Many classification methods have been proposed in the chemometrics, statistical and computer science literature.
Model-based classification methods have been developed from a
statistical modelling viewpoint. This approach allows for uncertainty
in the classification procedure to be quantified using probabilities.
Linear discriminant analysis and quadratic discriminant analysis are
particular model-based classification methods.
Partial least squares discriminant analysis is commonly used in
food authentication studies based on spectroscopic data. This method
uses partial least squares regression with a binary outcome variable
for two-group classification problems.
In this paper, model-based classification is compared to partial
least squares discriminant analysis for its ability to correctly classify
pure and adulterated honey samples when the honey has been extended by three different adulterants. The methods are compared
using the classification performance, the range of applicability of the
methods and the interpretability of the results.
In addition, since the percentage of adulterated samples in any
given sample set is unlikely to be known in a real-life setting, the ability
of updating procedures within model-based clustering to accurately
predict the adulterated samples, even when the proportion of pure to
adulterated samples in the training data is grossly unrepresentative of
the true situation, is studied in detail.
1 Introduction
The main aim of food authenticity studies[1] is to detect when foods are not
what they claim to be and thereby prevent economic fraud or possible damage to health. Foods that are susceptible to such fraud are those which are
expensive and subject to the vagaries of weather during growth or harvesting, e.g. coffee, various fruits, herbs and spices. Food fraud can generate significant amounts of money (e.g. several million US dollars) for unscrupulous
traders so the risk of adulteration is real. Honey is defined by the EU[2] as
“the natural, sweet product produced by Apis mellifera bees from the nectar
of plants or from secretions of living plants, which bees collect, transform
by combining with specific substances of their own, deposit, dehydrate, store
and leave in honeycombs to ripen and mature”. As it is a relatively expensive product to produce and extremely variable in nature, honey is prone to
adulteration for economic gain. Instances of honey adulteration have been
recorded since Roman times when concentrated grape juice was sometimes
added, although nowadays industrial syrups are more likely to be used as
honey extenders. False claims may also be made in relation to the geographic origin of the honey but this study concentrated on attempting to
classify samples as either pure or adulterated. In this study, artisanal honeys
were adulterated in the laboratory using three adulterants – fructose:glucose
mixtures, fully-inverted beet syrup and high fructose corn syrup – in various
ratios and weight percentages.
Model-based classification[3] is a classification method based on the Gaussian
mixture model with parsimonious covariance structure. This method models
data within groups using a Gaussian distribution, and each group occurs with some fixed probability. This classification method has been shown
to give excellent classification performance in a wide range of applications[4] .
A recent extension of model-based classification that uses data with unknown
group membership in the model-fitting procedure has been developed[5] . A
detailed review of model-based classification and its extensions is given in
Section 4.1.
Partial least squares regression is a method that seeks to optimise both the
variance explained and correlation with the response variable[6] . In a previous
study[7] it was found to outperform other chemometric methods commonly
used in the study of near-infrared transflectance spectra. It has the advantage that it can utilise highly-correlated variables for classification purposes.
Both model-based classification and partial least squares discriminant
analysis require training on data with known group or class labels. The
collection of training data in food authenticity can be very expensive and
time-consuming so methods that require few training data observations are
particularly useful.
We show that both methods give excellent classification performance,
even when few training data values are available. We also find that model-based classification is robust in situations where the training and test data
are quite different.
2 Materials and Methods

2.1 Honey Samples
Honey samples (157 in total) were obtained directly from beekeepers throughout the island of Ireland. Samples were from the years 2000 and 2001; they
were stored unrefrigerated from time of production and were not filtered after receipt in the laboratory. Prior to spectral collection, honeys were incubated at 40 °C overnight to dissolve any crystalline material, manually stirred to ensure homogeneity and adjusted to a standard solids content (70 °Brix)
to avoid spectral complications from naturally-occurring variations in sugar
concentration.
Collecting, extending and recording spectra of the honey was done at
time points several months apart; the first phase involved extending some
of the authentic samples of honey with fructose:glucose mixtures, the second
phase involved extending some of the remaining authentic samples with fully-inverted beet syrup and high fructose corn syrup. All adulterant solutions were also produced at 70 °Brix. Brix standardisation of honeys and adulterant solutions meant that any adulteration detected would not be simply on
the basis of gross added solids.
The fructose:glucose mixtures were produced by dissolving fructose and
glucose (Analar grade; Merck) in distilled water in the following ratios: 0.7:1,
1.2:1 and 2.3:1 w/w. Twenty-five of the pure honeys were adulterated with
each of the three fructose:glucose adulterant solutions at three levels, i.e. 7, 14 and 21% w/w, thus producing 225 adulterated honeys.
The other adulterant solutions were generated by diluting commercially-sourced fully-inverted beet syrup (50:50 fructose:glucose; Irish Sugar, Carlow,
Ireland) and high fructose corn syrup (45% fructose and 55% glucose) with
distilled water. Eight authentic honeys were chosen at random and adulterated with beet invert syrup at levels of 7, 10, 14, 21, 30, 50 and 70% w/w;
high fructose corn syrup was added to ten different, randomly-selected honeys at 10, 30, 50 and 70% w/w. This produced 56 BI-adulterated and 40
HFCS-adulterated samples.
On visual inspection, the spectra of the pure honey from the two measurement phases displayed an offset, with those recorded in the first phase
exhibiting a higher mean absorbance value. To remove this offset, the difference of the mean absorbance values at each wavelength was subtracted
from spectra collected in the first phase. This was done to ensure that the
detection of the fructose:glucose adulterants was not a by-product of the
offset.
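In code, this correction is a one-line subtraction per wavelength. A minimal sketch (Python is used for illustration throughout; the function and array layout are assumptions, not the authors' code; rows are spectra, columns are wavelengths):

import numpy as np

def remove_phase_offset(phase1: np.ndarray, phase2: np.ndarray) -> np.ndarray:
    """Subtract the per-wavelength difference of mean absorbances
    (phase-one mean minus phase-two mean) from the phase-one spectra."""
    offset = phase1.mean(axis=0) - phase2.mean(axis=0)
    return phase1 - offset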
2.2 Spectral Collection
In the first measurement phase, transflectance spectra were collected between
400 and 2498 nm in 2 nm steps on a NIRSystems 6500 scanning monochromator (FOSS NIRSystems, Silver Springs, MD) fitted with a sample transport
module. The second phase collected transflectance spectra between 700 and
2498 nm, again in 2 nm steps, using the same monochromator. Samples were
presented to the instrument in a camlock cell fitted with a gold-plated backing
plate (0.1 mm sample thickness; part no. IH-0355-1). Between samples, this
cell was washed with detergent, rinsed thoroughly with tepid water and dried
using lens tissue. In the absence of any manufacturer-supplied temperature
control for this accessory, an ad hoc procedure was devised to minimise variation in the temperature of honey samples prior to spectral collection. This
procedure involved placing each sample in the instrument and scanning it 15
times without storing the resultant spectra. On the sixteenth scan, the spectrum was stored for analysis. During the pre-scanning process, each sample
was presumed to have equilibrated to instrument temperature. While instrumental temperature does vary (K. Norris, personal communication), this
variation is less than that likely to occur in the ambient temperature in a laboratory which is not equipped with an air-conditioning system. Preliminary
experimental work had indicated 15 scans to be an appropriate compromise number on the basis of visual examination of test sample spectra. All spectra (478 in total) were collected in duplicate (including re-sampling); mean
spectra were used in all subsequent calculations.
3 Dimension Reduction and Preprocessing
Spectral data between 1100 and 2498 nm were used in this study. Each
spectrum therefore contains 700 wavelengths with adjacent absorption values being highly correlated. Therefore before using model-based classification methods, a dimension reduction step is required – this avoids singular
covariance matrices, improves computational efficiency and increases statistical modelling possibilities. The technique chosen for data reduction in this
study is wavelet analysis.

Figure 1: Spectra of Pure and Adulterated Honey Samples (intensity versus wavelength, 1100–2500 nm; pure and adulterated spectra overlaid).
3.1 Wavelet Analysis
Wavelet analysis is a technique commonly used in image and signal processing in order to compress data. Here, it is used to decompose each spectrum
into a series of wavelet coefficients. Without any thresholding of these coefficients, the original spectra can be exactly reconstructed from the coefficients.
However, many of the coefficients in the wavelet analysis are zero or close
to zero. By thresholding the coefficients that are zero or close to zero, it is
possible to dramatically reduce the dimensionality of the dataset. The resulting recomposed spectra are then approximations of each of the individual
spectra. Ogden[8] gives a good practical introduction to wavelet analysis.
3.1.1 Daubechies' Wavelet
Daubechies' wavelet is a consistently reliable type to use and is the default within wavethresh[9]. To efficiently carry out wavelet analysis, the data dimension should be of order 2^k, where k is an integer. Unfortunately, this can result in quite a lot of information being set aside. Techniques for extending the data to bring them up to the nearest 2^k are available, but in this case these methods result in problems when carrying out the model-based discriminant analysis – the associated variance structures are often singular. Thus the central 2^9 = 512 observations were chosen, covering the range 1290 nm – 2312 nm.
3.1.2
Thresholding techniques
Various methods of thresholding[10] wavelet coefficients in order to achieve dimension reduction have been published. As a default procedure, universal hard thresholding is used. Under universal thresholding, the threshold is

λ = σ̂ √(2 log 2^k),

where σ̂ is a robust estimate of the standard deviation of the coefficients and there are 2^k coefficients. Other thresholding techniques may sometimes provide better approximations of the spectra, but their use adds another decision into the process that may lead to overfitting rather than a general procedure.
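The paper performs this step in R with wavethresh; the sketch below uses the PyWavelets library as an illustrative stand-in (the 'db4' filter and the MAD-based estimate of σ̂ are assumptions, not the paper's exact settings), applying universal hard thresholding to a 2^k-point spectrum:

import numpy as np
import pywt

def universal_hard_threshold(spectrum: np.ndarray, wavelet: str = "db4") -> np.ndarray:
    """Decompose a 2^k-length spectrum, hard-threshold the detail
    coefficients at lambda = sigma_hat * sqrt(2 * log(2^k)), and reconstruct."""
    n = len(spectrum)                       # should be 2^k, e.g. 512
    coeffs = pywt.wavedec(spectrum, wavelet)
    # robust estimate of the coefficient standard deviation (MAD of finest details)
    sigma_hat = np.median(np.abs(coeffs[-1])) / 0.6745
    lam = sigma_hat * np.sqrt(2.0 * np.log(n))
    kept = [coeffs[0]] + [pywt.threshold(c, lam, mode="hard") for c in coeffs[1:]]
    return pywt.waverec(kept, wavelet)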
Figure 2: Actual and Thresholded Spectra (intensity versus wavelength, 1290–2290 nm).
3.2 Preprocessing: Savitzky-Golay
The Savitzky-Golay[11] algorithm is commonly used in spectroscopic analysis to calculate derivatives of the data. This concentrates the analysis on the shape rather than the height of the spectra. We therefore compare this method of preprocessing the data with the alternative of no preprocessing; a sketch of the derivative calculation follows.
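A sketch using SciPy's implementation; the window length and polynomial order shown are illustrative choices, not settings reported in the paper:

import numpy as np
from scipy.signal import savgol_filter

def sg_first_derivative(X: np.ndarray, window: int = 11, order: int = 2) -> np.ndarray:
    """First Savitzky-Golay derivative of each spectrum (rows of X),
    emphasising spectral shape over absolute absorbance level."""
    return savgol_filter(X, window_length=window, polyorder=order, deriv=1, axis=1)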
4 Classification Techniques
The classification techniques used on this data set were based on Gaussian
mixture models; under these models each group was modelled using a Gaussian
distribution. The covariance of each of the Gaussian models is structured in
a parsimonious manner using constraints on the eigen decomposition of the
covariance matrix[4] . This approach offers the ability to model groups that
have distinct volume, shape and orientation properties.
We introduce mixture models in general terms in Section 4.1 and then
show the development of clustering and classification methods from mixtures.
4.1 Mixture Models
The mixture model assumes that observations come from one of G groups, that observations within each group g are modelled by a density f(· | θ_g), where θ_g are unknown parameters, and that the probability of coming from group g is p_g.
Therefore, given data y with independent multivariate observations y_1, …, y_M, a mixture model with G groups has likelihood function

L_mix(θ_1, …, θ_G; p_1, …, p_G | y) = ∏_{m=1}^{M} ∑_{g=1}^{G} p_g f(y_m | θ_g).   (1)
The Gaussian mixture model further assumes the density f to be a multivariate Gaussian density φ, parameterized by a mean µ_g and covariance matrix Σ_g:

φ(y_m | µ_g, Σ_g) ≡ exp{−(1/2)(y_m − µ_g)^T Σ_g^{−1} (y_m − µ_g)} / √(det(2πΣ_g)).
The multivariate Gaussian densities imply that the groups are centered at the means µ_g, with the shape, orientation and volume of the scatter of observations within each group depending on the covariances Σ_g.
The Σ_g can be decomposed using an eigen decomposition into the form

Σ_g = λ_g D_g A_g D_g^T,

where λ_g is a constant of proportionality, D_g is an orthogonal matrix of eigenvectors and A_g is a diagonal matrix whose elements are proportional to the eigenvalues, as described by Fraley and Raftery[4].
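As a concrete illustration, λ_g, D_g and A_g can be recovered from a fitted covariance matrix; a sketch assuming the usual convention det(A_g) = 1, so that λ_g = det(Σ_g)^(1/d):

import numpy as np

def volume_shape_orientation(Sigma: np.ndarray):
    """Decompose Sigma = lam * D @ A @ D.T with det(A) = 1.
    lam controls volume, A (diagonal) the shape, D the orientation."""
    eigvals, D = np.linalg.eigh(Sigma)      # ascending eigenvalues
    eigvals, D = eigvals[::-1], D[:, ::-1]  # sort descending
    lam = np.prod(eigvals) ** (1.0 / len(eigvals))
    A = np.diag(eigvals / lam)
    return lam, D, A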
The parameters λ_g, A_g and D_g have interpretations in terms of the volume, shape and orientation of the scatter for the component: λ_g controls the volume, A_g the shape and D_g the orientation. Constraining some or all of these parameters to be equal across groups gives great modelling flexibility. Some of the options for constraining the covariance parameters are given in Table 1.
Table 1: Parametrizations of the covariance matrix Σ_g (Model ID letters: E = Equal, V = Variable, I = Identity)

Model ID  Decomposition              Structure
EII       Σ_g = λI                   Spherical
VII       Σ_g = λ_g I                Spherical
EEI       Σ_g = λA                   Diagonal
VEI       Σ_g = λ_g A                Diagonal
EVI       Σ_g = λA_g                 Diagonal
VVI       Σ_g = λ_g A_g              Diagonal
EEE       Σ_g = λDAD^T               Ellipsoidal
EEV       Σ_g = λD_g A D_g^T         Ellipsoidal
VEV       Σ_g = λ_g D_g A D_g^T      Ellipsoidal
VVV       Σ_g = λ_g D_g A_g D_g^T    Ellipsoidal
The letters in the Model ID denote the volume, shape and orientation respectively. For example, EEV represents equal volume and shape with variable orientation.
The mixture model (1) can be fitted to multivariate observations y_1, y_2, …, y_M by maximizing the corresponding log-likelihood using the EM algorithm[12]. The resulting output from the EM algorithm includes estimates of the probability of group membership for each observation; these can be used to cluster the observations into their most probable groups. This procedure is the basis of model-based clustering[4] and is easily implemented in the mclust library[13] for the statistics package R[14].
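The paper fits these models with mclust in R, which supports the full Table 1 family. As a rough analogue only (an assumption, not the authors' code), scikit-learn's GaussianMixture offers a subset of the constraints: 'spherical', 'diag', 'tied' and 'full' correspond loosely to VII, VVI, EEE and VVV.

from sklearn.mixture import GaussianMixture

def cluster_spectra(X, n_groups=2, covariance="full"):
    """Fit a Gaussian mixture by EM; return hard cluster labels and
    posterior membership probabilities for each observation."""
    gmm = GaussianMixture(n_components=n_groups,
                          covariance_type=covariance,
                          n_init=5).fit(X)
    return gmm.predict(X), gmm.predict_proba(X)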
However, in model-based discriminant analysis (also known as eigenvalue discriminant analysis)[3], the model is fitted to data which consist of multivariate observations w_n, n = 1, 2, …, N, and labels l_n, where l_ng = 1 if observation n belongs to group g and 0 otherwise.
Therefore, the resulting likelihood function is

L_disc(p_1, …, p_G; θ_1, …, θ_G | w, l) = ∏_{n=1}^{N} ∏_{g=1}^{G} [p_g f(w_n | θ_g)]^(l_ng).   (2)
The log-likelihood function corresponding to (2) is maximized, yielding parameter estimates p̂_1, p̂_2, …, p̂_G and θ̂_1, θ̂_2, …, θ̂_G. For stability, equal probabilities, p̂_1 = … = p̂_G = 1/G, are often assumed.
The posterior probability of group membership for an observation y whose label is unknown can be estimated as

P(Group g | y) ≈ p̂_g f(y | θ̂_g) / ∑_{q=1}^{G} p̂_q f(y | θ̂_q)   (3)
and observations thus can be classified into their most probable group. The
mclust[13] package can also be used to perform the model-based discriminant
analysis. This again allows for the possibility of the models demonstrated
in Table 1. It is worth noting that Linear Discriminant Analysis (LDA) and
Quadratic Discriminant Analysis (QDA) are special cases of model-based
discriminant analysis and they correspond to the EEE and VVV models
respectively.
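A minimal sketch of the unconstrained (VVV, i.e. QDA) case, fitting one Gaussian per labelled group by maximizing (2) and scoring a new observation with the posterior (3); this is an illustration, not the mclust implementation:

import numpy as np
from scipy.stats import multivariate_normal

def fit_da(W, labels, G):
    """Per-group mean and covariance from labelled training data (VVV model)."""
    return [(W[labels == g].mean(axis=0),
             np.cov(W[labels == g], rowvar=False)) for g in range(G)]

def posterior(y, params, p=None):
    """Posterior group-membership probabilities, equation (3)."""
    G = len(params)
    p = np.full(G, 1.0 / G) if p is None else p   # equal priors for stability
    dens = np.array([p[g] * multivariate_normal.pdf(y, mean=mu, cov=S)
                     for g, (mu, S) in enumerate(params)])
    return dens / dens.sum()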
4.2 Discriminant Analysis With Updating
Model-based discriminant analysis as developed in[3] only uses the observations with known group membership in the model fitting procedure. Once the
model is fitted, the observations with unknown group labels can be classified
into their most probable groups.
An alternative approach is to model both the labelled data (w, l) and the
unlabelled data y and to maximize the resulting log-likelihood for the combined model. The likelihood function for the combined data is a product of
the likelihood functions given in (2) and (1). This classification approach was
recently developed in[5] and was demonstrated to give improved classification
performance over the classical model-based discriminant analysis.
With this modelling approach the likelihood function is of the form
L_update(p, θ | w, l, y) = L_disc(p, θ | w, l) L_mix(p, θ | y)
         = ∏_{n=1}^{N} ∏_{g=1}^{G} [p_g f(w_n | θ_g)]^(l_ng) × ∏_{m=1}^{M} ∑_{g=1}^{G} p_g f(y_m | θ_g).   (4)
The log-likelihood (4) is maximized using the EM algorithm (Section 4.3)
to find estimates for p (if estimated) and θ. Output from the EM algorithm
includes estimates of the probability of group membership for the unlabelled
observations y, as given in (3).
The EM algorithm for maximizing the log-likelihood (4) proceeds iteratively substituting the unknown labels with their estimated values. At
each iteration the estimated labels are updated and new parameter estimates
are produced. By passing the estimated values of the unknown labels into
the EM algorithm it is possible to “update” the classification results with
some of the knowledge gained from fitting the model to all of the data. Indeed, even with small training sets containing unrepresentative proportions
of pure/adulterated honey samples, updating shows consistency.
4.3 EM Algorithm
The Expectation Maximization Algorithm[12] is ideally suited to the problem of maximizing the log-likelihood function when some of the data have
unknown group labels; this arises in the model-based clustering likelihood
(1) and the model-based discriminant analysis with updating likelihood (4).
In this section, we show the steps involved in the EM algorithm for model-based discriminant analysis with updating; the model-based clustering steps
are shown in[4] .
Consider the data to be classified as consisting of M multivariate observations, each with two parts: a known part y_m and an unknown part z_m. In this context the spectroscopic data are observed, and thus known, so are treated as the y_m. The labels (pure or adulterated) are unknown and thus are treated as the z_m. Additionally, N labelled observations are available, which consist of two parts: known data w_n and known labels l_n.
The unobserved portion of the data, z, is a matrix of indicator variables, so that z_m = (z_m1, …, z_mG), where z_mg is 1 if y_m is from group g and 0 otherwise.
Then the observed data likelihood can be written in the form

L_O(p, θ | w, l, y) = ∏_{n=1}^{N} ∏_{g=1}^{G} [p_g f(w_n | θ_g)]^(l_ng) × ∏_{m=1}^{M} ∑_{g=1}^{G} p_g f(y_m | θ_g)   (5)
and the complete data likelihood is

L_C(p, θ | w, l, y, z) = ∏_{n=1}^{N} ∏_{g=1}^{G} [p_g f(w_n | θ_g)]^(l_ng) × ∏_{m=1}^{M} ∏_{g=1}^{G} [p_g f(y_m | θ_g)]^(z_mg).   (6)
Initial estimates of p̂ (if estimated) and θ̂ = (µ̂, Σ̂) are taken from classical
model-based discriminant analysis, by maximizing (2).
The expected values of the unknown labels are calculated as

ẑ_mg^(k+1) ← p̂_g^(k) f(y_m | θ̂_g^(k)) / ∑_{q=1}^{G} p̂_q^(k) f(y_m | θ̂_q^(k))   (7)

for g = 1, …, G and m = 1, …, M, and the parameters p and θ can then be estimated by

p̂_g^(k+1) ← [∑_{n=1}^{N} l_ng + ∑_{m=1}^{M} ẑ_mg^(k+1)] / (N + M)   (if estimated)

µ̂_g^(k+1) ← [∑_{n=1}^{N} l_ng w_n + ∑_{m=1}^{M} ẑ_mg^(k+1) y_m] / [∑_{n=1}^{N} l_ng + ∑_{m=1}^{M} ẑ_mg^(k+1)].   (8)
The estimates of Σ_g again depend on the constraints placed on the eigenvalue decomposition; details of the calculations are given in[3, 15].
The iterative process continues until convergence is achieved. The use of
an Aitken acceleration-based convergence criterion is discussed in[5] .
Updating can take two forms: soft and hard updating. With soft updating (EM), updates of the missing labels are made using (7), so the unknown labels are replaced by probabilities (a soft classification). Hard updating (CEM) instead replaces the probabilities given in (7) with an indicator vector for the most probable group. The hard classification algorithm does not maximize (4) but instead tries to maximize (6) (although local maxima are a possibility).
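The soft-updating iteration can be sketched directly from (7) and (8). The outline below is an illustration with unconstrained (VVV) covariances and a fixed iteration count in place of the Aitken-based stopping rule; CEM would replace z with a one-hot indicator of the most probable group at each iteration.

import numpy as np
from scipy.stats import multivariate_normal

def em_with_updating(W, l, Y, n_iter=50):
    """Soft updating: W (N,d) labelled data, l (N,G) one-hot labels,
    Y (M,d) unlabelled data. Unconstrained (VVV) covariances."""
    (N, G), M = l.shape, Y.shape[0]
    # initial estimates from the labelled data alone, i.e. maximizing (2)
    p = l.sum(axis=0) / N
    mu = np.array([W[l[:, g] == 1].mean(axis=0) for g in range(G)])
    Sig = np.array([np.cov(W[l[:, g] == 1], rowvar=False) for g in range(G)])
    for _ in range(n_iter):
        # E-step, equation (7): posterior membership for unlabelled data
        dens = np.column_stack([p[g] * multivariate_normal.pdf(Y, mu[g], Sig[g])
                                for g in range(G)])
        z = dens / dens.sum(axis=1, keepdims=True)
        # M-step, equation (8): pool labelled and unlabelled contributions
        n_g = l.sum(axis=0) + z.sum(axis=0)
        p = n_g / (N + M)
        for g in range(G):
            mu[g] = (l[:, g] @ W + z[:, g] @ Y) / n_g[g]
            Wc, Yc = W - mu[g], Y - mu[g]
            Sig[g] = (Wc.T @ (Wc * l[:, g, None]) +
                      Yc.T @ (Yc * z[:, g, None])) / n_g[g]
    return p, mu, Sig, z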
5 Regression-Based Techniques
Commonly used in near infrared spectroscopy on food samples, discriminant
partial least squares was found to be a reliable method of classifying Irish
honey samples[7] .
5.1 Partial Least Squares Regression
Partial least squares regression (PLSR) was developed by Wold[16, 17] and
is based on the assumption of a linear relationship between the observed
variables (e.g. the spectroscopy measurements) and the outcome variable
(e.g. pure or adulterated). It is similar to principal components regression
(PCR). Let X be an n × p matrix of observed values (n samples on p measurement points) and let y be a vector of length n representing the outcome variable/label.
If S is the sample covariance matrix of X, the similarity between PCR and PLSR can be seen by examining what is being maximised in each situation. PCR obtains the mth principal component v_m by maximising Var(Xα) subject to ||α|| = 1 and v_l^T S α = 0 for l = 1, …, m − 1, while PLSR obtains the mth PLS direction φ̂_m by maximising Corr²(y, Xα) Var(Xα) subject to ||α|| = 1 and φ̂_l^T S α = 0 for l = 1, …, m − 1. However, within the PLS maximization problem the variance term tends to dominate.
5.1.1 Algorithm for PLSR
Each predictor x_j is standardized to have mean 0 and variance 1. Initialize ŷ^(0) = ȳ1 and x_j^(0) = x_j for j = 1, …, p, with φ̂_mj = ⟨x_j^(m−1), y⟩ the univariate regression coefficient of y on x_j^(m−1).
For m = 1, …, p:
• h_m = ∑_{j=1}^{p} φ̂_mj x_j^(m−1)
• θ̂_m = ⟨h_m, y⟩ / ⟨h_m, h_m⟩
• ŷ^(m) = ŷ^(m−1) + θ̂_m h_m
• x_j^(m) = x_j^(m−1) − [⟨h_m, x_j^(m−1)⟩ / ⟨h_m, h_m⟩] h_m for j = 1, …, p, so that each x_j^(m) is orthogonalized with respect to h_m
It has been noted that the sequence of PLS coefficients for m = 1, . . . , p
represents the conjugate gradient sequence for computing least squares solutions.
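A direct transcription of the algorithm above, as a sketch (assuming the columns of X have already been standardized):

import numpy as np

def pls_fit(X, y, m):
    """Sketch of the Section 5.1.1 algorithm. X: (n, p) with columns
    standardized to mean 0 and variance 1; y: (n,); m: number of
    directions. Returns the fitted values y_hat^(m)."""
    n, p = X.shape
    Xj = X.copy()
    y_hat = np.full(n, y.mean())            # y_hat^(0) = ybar * 1
    for _ in range(m):
        phi = Xj.T @ y                      # phi_mj = <x_j^(m-1), y>
        h = Xj @ phi                        # h_m = sum_j phi_mj x_j^(m-1)
        theta = (h @ y) / (h @ h)           # theta_m
        y_hat = y_hat + theta * h
        # orthogonalize each x_j with respect to h_m
        Xj = Xj - np.outer(h, (h @ Xj) / (h @ h))
    return y_hat

For two-group discriminant PLS, y is coded as a binary indicator and the fitted values ŷ^(m) are thresholded (e.g. at 0.5) to classify.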
5.1.2 Number of Parameters
PLSR uses m relevant loadings/components in the model. However, deciding on m is not trivial[18]. Even when m is known, the number of parameters in the model is open for debate, and there is a problem in calculating the degrees of freedom of a model using PLS[19]. Calculating the number of parameters in a model is especially relevant when using a complexity penalty as part of the model selection criterion. For the purposes of this study, the number of parameters in the population model was assumed to be

p(p + 1)/2 + m + 1,

so that m = 0 corresponds to no correlation between X and y and m = p gives the full least squares model; this agrees with [18].
6 Model Selection and Verification
The model with the best Bayesian Information Criterion value was chosen
in each case – so that a consistent criterion could be applied to all models.
It is also possible to choose models based on leave-one-out cross validation,
but this is computationally expensive.
The certainty of the classification decisions is measured using Brier's score[20].
6.1 Bayesian Information Criterion (BIC)
The results given are for models selected using the Bayesian Information Criterion (BIC), where

BIC = 2 log L − d log(N + M),

with log L the maximized log-likelihood, N + M the total number of observations and d the number of parameters used in the model.
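In this convention a larger BIC is better; as a sketch:

import numpy as np

def bic(log_likelihood: float, d: int, n_obs: int) -> float:
    """BIC = 2 log L - d log(N + M); n_obs = N + M, d = number of parameters."""
    return 2.0 * log_likelihood - d * np.log(n_obs)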
6.2 Brier's Score
Brier[20] developed a continuous performance measure in which perfect prediction gives a Brier's score of zero. Given G groups, N samples and forecast probabilities ẑ_n1, …, ẑ_nG for sample n belonging to groups 1, …, G respectively, the Brier's score B is

B = (1 / 2N) ∑_{n=1}^{N} ∑_{g=1}^{G} (ẑ_ng − z^true_ng)²,   (9)

where z^true_ng is an indicator variable for the actual group membership. It is
especially useful for determining the certainty of predictions. Some observations may be just barely put into the correct group, or indeed just miss out
on correct classification. Observations that are barely classified correctly will
add more to the Brier’s score than those where a more certain classification
is made.
A trait of PLSR is that some regression outputs may fall outside the zero–one scale. For the purposes of calculating a pseudo-Brier score, such results were set equal to either zero or one so that these certain classifications do not add to the total; a sketch of the calculation follows.
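A sketch of (9), including the clipping used for the pseudo-Brier score:

import numpy as np

def brier_score(z_hat: np.ndarray, z_true: np.ndarray) -> float:
    """Equation (9): z_hat and z_true are (N, G) arrays of forecast
    probabilities and one-hot true group memberships."""
    z_hat = np.clip(z_hat, 0.0, 1.0)   # pseudo-Brier: clip scores outside [0, 1]
    return float(((z_hat - z_true) ** 2).sum() / (2 * z_hat.shape[0]))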
7 Comparison of Methods
As an example of the data input to the classification problem, Figure 1 shows transflectance spectra of authentic and adulterated honey samples. Visual examination of these spectra reveals a sloping baseline, the absence of fine structure and the impossibility of visually differentiating between samples, features which are typical of near infrared spectral collections. For these reasons, multivariate statistical methods are required to efficiently and effectively extract useful information from such datasets.
The data consist of 157 unadulterated samples, 225 extended with fructose:glucose mixtures, 56 extended with beet invert syrup and 40 samples
extended with high fructose corn syrup. Let fg, bi and cs represent adulteration by fructose:glucose mixtures, beet invert syrup and high fructose
corn syrup respectively and let adult represent samples adulterated with any
of the three adulterant types. The figures reported are the mean correct
classification rates (as percentages) over 400 random splits of the data into
training and test sets and their associated standard deviations (in brackets).
Three different types of training data were examined:
a) correct proportions of pure and each type of adulterant
b) correct proportions of pure and adulterated
c) unrepresentative proportions of pure and adulterated
The performance of model-based discriminant analysis (DA), soft updating
(EM), hard updating (CEM) and partial least squares regression (PLSR) was
measured at three different ratios of training/test data – 50/50%, 25/75%
and 10/90% in order to fully examine the robustness of each technique to
varying sample size.
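A sketch of this evaluation loop (model, X, y and the stratification variable are placeholders; the paper reports means and standard deviations over 400 random splits):

import numpy as np
from sklearn.model_selection import train_test_split

def mean_correct_rate(model, X, y, strata, train_size=0.5, n_splits=400):
    """Mean and standard deviation of the correct classification rate
    over repeated random stratified train/test splits."""
    rates = []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_size, stratify=strata, random_state=seed)
        rates.append((model.fit(X_tr, y_tr).predict(X_te) == y_te).mean())
    return np.mean(rates), np.std(rates)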
7.1 Correct Proportions of Pure and of Each Type of Adulterant in Training Set
The training sets, at 50%, 25% and 10% of the data, are completely representative of the entire population studied.
BIC selection method; counts given as training/test.

Training/Test counts          Method  Classification Rate  Brier's Score
pure 79/78, fg 112/113,       DA      94.86 (1.01)         0.060 (0.014)
bi 28/28, cs 20/20            EM      93.02 (2.02)         0.060 (0.016)
                              CEM     93.95 (1.56)         0.053 (0.009)
                              PLSR    94.66 (1.04)         0.053 (0.007)
pure 39/118, fg 56/169,       DA      93.51 (1.07)         0.071 (0.012)
bi 14/42, cs 10/30            EM      84.49 (12.72)        0.140 (0.120)
                              CEM     91.60 (3.94)         0.076 (0.037)
                              PLSR    93.88 (1.14)         0.060 (0.011)
pure 16/141, fg 22/203,       DA      90.81 (1.70)         0.104 (0.045)
bi 6/50, cs 4/36              EM      52.38 (15.58)        0.465 (0.157)
                              CEM     69.80 (20.85)        0.294 (0.209)
                              PLSR    92.46 (3.75)         0.075 (0.031)
There is no significant difference in the performance of DA and PLSR, but
both outperform the updating procedures. All methods have decreasing performance as the training percentage decreases. However, the decrease is not
large for DA and PLSR.
As the percentage in the training set decreases the number of factors selected by PLSR becomes more variable (8–14 factors at 50% to 1–30 factors
at 10%). DA chooses more parsimonious models as the training set percentage decreases, but the updating procedures (EM and CEM) choose models
with more parameters as the size of the training set decreases relative to the
test set.
7.2 Correct Proportions of Pure/Adulterated in Training Set
Only the proportion of pure to adulterated samples is kept fixed – the proportion of each type of adulterant is allowed to vary between simulations.
BIC selection method; counts given as training/test.

Training/Test counts          Method  Classification Rate  Brier's Score
pure 79/78,                   DA      94.66 (1.18)         0.062 (0.015)
adult 160/161                 EM      93.84 (2.41)         0.059 (0.020)
                              CEM     93.84 (1.57)         0.054 (0.015)
                              PLSR    94.58 (0.91)         0.054 (0.006)
pure 39/118,                  DA      93.45 (1.01)         0.073 (0.013)
adult 80/241                  EM      83.86 (12.18)        0.145 (0.117)
                              CEM     90.85 (7.17)         0.083 (0.069)
                              PLSR    93.85 (1.01)         0.060 (0.010)
pure 16/141,                  DA      90.63 (1.74)         0.115 (0.057)
adult 32/289                  EM      51.79 (15.00)        0.470 (0.151)
                              CEM     68.14 (20.99)        0.312 (0.210)
                              PLSR    92.45 (4.14)         0.074 (0.032)
As in Section 7.1, there is no significant difference between DA and PLSR,
which both outperform the updating methods. Classification performance is
slightly worse (as expected) but is actually quite robust to type of adulterant.
The number of factors chosen by PLSR again becomes more variable
as the proportion of training data to test data decreases. The pattern of
DA choosing more parsimonious models while updating procedures choose
models with more parameters again occurs with decreasing training data
proportion.
7.3 Unrepresentative Proportions of Pure/Adulterated
In this section the ratios of pure to adulterated samples in the training sets
(and hence also in the test sets) are varied in order to examine the performance of each method when the training data does not accurately reflect
the entire population structure.

Table 2: 50% Training Data / 50% Test Data (BIC selection method; counts given as training/test)

Training/Test counts          Method  Classification Rate  Brier's Score
pure 40/117,                  DA      92.47 (1.52)         0.079 (0.018)
adult 199/122                 EM      93.13 (1.55)         0.061 (0.015)
                              CEM     92.45 (1.81)         0.067 (0.017)
                              PLSR    95.15 (1.15)         0.058 (0.006)
pure 20/102,                  DA      90.01 (1.60)         0.101 (0.017)
adult 219/137                 EM      90.32 (3.35)         0.089 (0.031)
                              CEM     89.24 (2.90)         0.100 (0.029)
                              PLSR    97.08 (1.28)         0.050 (0.008)
pure 99/58,                   DA      94.98 (1.21)         0.058 (0.013)
adult 140/181                 EM      92.69 (1.80)         0.064 (0.016)
                              CEM     93.82 (1.48)         0.055 (0.014)
                              PLSR    94.66 (1.22)         0.054 (0.007)
pure 119/38,                  DA      95.34 (0.92)         0.064 (0.016)
adult 120/201                 EM      90.82 (4.41)         0.081 (0.036)
                              CEM     92.87 (2.91)         0.064 (0.027)
                              PLSR    94.87 (1.11)         0.047 (0.004)

In the 50% training data situation, PLSR
marginally outperforms DA and updating methods when the number of pure
samples in the training set is less than expected. There is no significant
difference between PLSR and DA when the number of pure samples in the
training set is greater than expected. In the 25% training data there is almost no difference in classification performance between DA and PLSR, with
DA marginally outperforming PLSR when the number of pure samples in the
training data set is greater than expected (> 39 pure samples). With 10%
of the data used as a training set, DA shows its robustness to both training sample size and composition. It should be noted that PLSR again also
does well in terms of classification performance but does have a much higher
standard deviation than DA.
Where there are fewer than the expected number of pure samples (e.g. <
79 in Table 2) in the training set, DA almost always chooses EEE, while when
there are more than the expected number of pure samples in the training set,
DA chooses VEV most frequently. The range of factors chosen by PLSR for
when the number of pure samples in the training set is higher than expected
is smaller than when the number of pure samples in the training set is less
than expected.
Table 3: 25% Training Data / 75% Test Data (BIC selection method; counts given as training/test)

Training/Test counts          Method  Classification Rate  Brier's Score
pure 20/137,                  DA      92.25 (1.68)         0.079 (0.017)
adult 99/222                  EM      72.37 (18.16)        0.256 (0.178)
                              CEM     82.16 (16.19)        0.169 (0.159)
                              PLSR    93.12 (3.59)         0.067 (0.023)
pure 10/147,                  DA      91.12 (1.63)         0.089 (0.014)
adult 109/212                 EM      68.93 (18.38)        0.291 (0.183)
                              CEM     82.28 (15.57)        0.168 (0.151)
                              PLSR    92.68 (6.09)         0.070 (0.037)
pure 49/108,                  DA      93.98 (0.98)         0.068 (0.012)
adult 70/251                  EM      85.05 (10.34)        0.135 (0.099)
                              CEM     90.79 (6.48)         0.084 (0.063)
                              PLSR    93.50 (1.35)         0.067 (0.012)
pure 59/98,                   DA      93.73 (1.00)         0.072 (0.015)
adult 60/261                  EM      84.92 (9.30)         0.137 (0.088)
                              CEM     90.83 (3.49)         0.084 (0.032)
                              PLSR    93.41 (1.59)         0.068 (0.014)
Table 4: 10% Training Data / 90% Test Data (BIC selection method; counts given as training/test)

Training/Test counts          Method  Classification Rate  Brier's Score
pure 8/149,                   DA      89.87 (2.08)         0.110 (0.047)
adult 40/281                  EM      61.02 (17.37)        0.368 (0.171)
                              CEM     85.91 (8.41)         0.132 (0.080)
                              PLSR    88.17 (7.33)         0.106 (0.047)
pure 4/153,                   DA      88.66 (2.37)         0.115 (0.025)
adult 44/277                  EM      66.37 (15.83)        0.302 (0.142)
                              CEM     91.88 (0.94)         0.074 (0.009)
                              PLSR    85.95 (8.67)         0.120 (0.054)
pure 20/137,                  DA      90.89 (1.70)         0.099 (0.023)
adult 28/293                  EM      74.45 (16.39)        0.241 (0.164)
                              CEM     85.15 (11.28)        0.140 (0.112)
                              PLSR    89.75 (5.42)         0.098 (0.035)
pure 24/133,                  DA      90.92 (1.74)         0.104 (0.029)
adult 24/297                  EM      80.03 (11.17)        0.187 (0.112)
                              CEM     86.05 (8.56)         0.131 (0.084)
                              PLSR    89.83 (4.97)         0.097 (0.033)
8 Conclusions
Both PLSR and model-based discriminant analysis achieve good classification
results. Unlike previous studies[5] the updating techniques do not improve
classification performance. This may be due to the structure, or lack thereof,
within the adulterated samples. Comparing the Brier’s scores to the proportion of samples misclassified shows that classification decisions made are
quite definitive.
Model-based discriminant analysis is more flexible and more robust than PLSR. As the training set proportion decreases and the make-up of the training set becomes unrepresentative of the population, model-based discriminant analysis shows its strengths.
The results from model-based discriminant analysis are in the form of the
probability of being a member of each group rather than, as in PLSR, a score.
Thus a cost function can be easily imposed at the end of the classification
process and the results have a natural interpretation.
9 Acknowledgements
The work reported in this paper is funded by Teagasc under the Walsh Fellowship Scheme. Thanks are due to individual beekeepers for the provision
of honey samples and to the Irish Department of Agriculture & Food for
financial support (FIRM programme) which enabled collection and spectral
analysis of the samples. This work is also supported under the Science Foundation of Ireland Basic Research Grant scheme (Grant 04/BR/M0057).
References
[1] M. Lees, editor. Food Authenticity: Issues and Methodologies. Eurofins
Scientific, Nantes, France, 1998.
[2] European Commission. Council Directive 2001/110/EC of 20 June 2001,
relating to honey, 2002.
[3] H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis
through eigenvalue decomposition. Journal of the American Statistical
Association, 91:1743–1748, 1996.
[4] C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97:611–631, 2002.
[5] N. Dean, T. B. Murphy, and G. Downey. Using Unlabelled Data To Update Classification Rules With Applications In Food Authenticity Studies. Journal of the Royal Statistical Society, Series C, 55:1–14, 2006.
[6] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical
Learning. Springer, 2001.
[7] G. Downey, V. Fouratier, and J. D. Kelly. Detection of honey adulteration by addition of fructose and glucose using near infrared transflectance spectroscopy. Journal of Near Infrared Spectroscopy, 11:447–
456, 2003.
[8] R. T. Ogden. Essential Wavelets for Statistical Applications and Data
Analysis. Birkhauser, 1997.
[9] A. Kovac, M. Maechler, and G. Nason (R-port). wavethresh: Software to perform wavelet statistics and transforms, 2004. R package version 2.2-8.
[10] D. L. Donoho and I. M. Johnstone. Ideal Spatial Adaptation by Wavelet
Shrinkage. Biometrika, 81(3):425–455, 1994.
[11] A. Savitzky and M. J. E. Golay. Smoothing and differentiation of data
by simplified least squares procedures. Analytical Chemistry, 36:1627–
1639, 1964.
[12] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood
for incomplete data via the EM algorithm (with discussion). Journal of
the Royal Statistical Society, Series B, 39:1–38, 1977.
[13] C. Fraley, A. E. Raftery, Dept. of Statistics, University of Washington,
and R. Wehrens (R-port). mclust: Model-based cluster analysis. R
package version 2.1-8.
[14] R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna,
Austria, 2004. ISBN 3-900051-07-0.
[15] G. Celeux and G. Govaert. Gaussian parsimonious clustering models.
Pattern Recognition, 28:781–793, 1995.
[16] H. Wold. Estimation of principal components and related models by
iterative least squares. In Multivariate Analysis, pages 391–420, 1966.
[17] H. Wold. Nonlinear estimation by iterative least square procedures. In
Research Papers in Statistics: Festschrift for J. Neyman, pages 411–444,
1966.
[18] I. S. Helland. Some theoretical aspects of partial least squares regression. Chemometrics Intell. Lab. Syst., 58:97–107, 2001.
[19] H. Van Der Voet. Pseudo-degrees of freedom for partial least squares.
J. Chemometrics, 13:195–208, 1999.
[20] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78:1–3, 1950.