A comparison of model-based and regression classification techniques applied to near infrared spectroscopic data in food authentication studies

Deirdre Toher∗†‡, Gerard Downey∗ and Thomas Brendan Murphy†

∗ Ashtown Food Research Centre, Teagasc, Dublin 15, Ireland
† Department of Statistics, School of Computer Science and Statistics, Trinity College Dublin, Dublin 2, Ireland
‡ Corresponding Author: email [email protected]; tel +353.1.8059500; fax +353.1.8059550

Abstract

Classification methods can be used to classify samples of unknown type into known types. Many classification methods have been proposed in the chemometrics, statistical and computer science literature. Model-based classification methods have been developed from a statistical modelling viewpoint. This approach allows for uncertainty in the classification procedure to be quantified using probabilities. Linear discriminant analysis and quadratic discriminant analysis are particular model-based classification methods. Partial least squares discriminant analysis is commonly used in food authentication studies based on spectroscopic data. This method uses partial least squares regression with a binary outcome variable for two-group classification problems. In this paper, model-based classification is compared to partial least squares discriminant analysis for its ability to correctly classify pure and adulterated honey samples when the honey has been extended by three different adulterants. The methods are compared using the classification performance, the range of applicability of the methods and the interpretability of the results.
In addition, since the percentage of adulterated samples in any given sample set is unlikely to be known in a real-life setting, the ability of updating procedures within model-based clustering to accurately predict the adulterated samples, even when the proportion of pure to adulterated samples in the training data is grossly unrepresentative of the true situation, is studied in detail.

1 Introduction

The main aim of food authenticity studies[1] is to detect when foods are not what they claim to be and thereby prevent economic fraud or possible damage to health. Foods that are susceptible to such fraud are those which are expensive and subject to the vagaries of weather during growth or harvesting, e.g. coffee, various fruits, herbs and spices. Food fraud can generate significant amounts of money (e.g. several million US dollars) for unscrupulous traders so the risk of adulteration is real. Honey is defined by the EU[2] as “the natural, sweet product produced by Apis mellifera bees from the nectar of plants or from secretions of living plants, which bees collect, transform by combining with specific substances of their own, deposit, dehydrate, store and leave in honeycombs to ripen and mature”. As it is a relatively expensive product to produce and extremely variable in nature, honey is prone to adulteration for economic gain. Instances of honey adulteration have been recorded since Roman times when concentrated grape juice was sometimes added, although nowadays industrial syrups are more likely to be used as honey extenders. False claims may also be made in relation to the geographic origin of the honey but this study concentrated on attempting to classify samples as either pure or adulterated. In this study, artisanal honeys were adulterated in the laboratory using three adulterants – fructose:glucose mixtures, fully-inverted beet syrup and high fructose corn syrup – in various ratios and weight percentages.
Model-based classification[3] is a classification method based on the Gaussian mixture model with a parsimonious covariance structure. This method models data within groups using a Gaussian distribution, and the abundance of each group has some fixed probability. This classification method has been shown to give excellent classification performance in a wide range of applications[4]. A recent extension of model-based classification that uses data with unknown group membership in the model-fitting procedure has been developed[5]. A detailed review of model-based classification and its extensions is given in Section 4.1.

Partial least squares regression is a method that seeks to optimise both the variance explained and the correlation with the response variable[6]. In a previous study[7] it was found to outperform other chemometric methods commonly used in the study of near-infrared transflectance spectra. It has the advantage that it can utilise highly-correlated variables for classification purposes. Both model-based classification and partial least squares discriminant analysis require training on data with known group or class labels. The collection of training data in food authenticity can be very expensive and time-consuming, so methods that require few training data observations are particularly useful. We show that both methods give excellent classification performance, even when few training data values are available. We also find that model-based classification is robust in situations where the training and test data are quite different.

2 Materials and Methods

2.1 Honey Samples

Honey samples (157 samples) were obtained directly from beekeepers throughout the island of Ireland. Samples were from the years 2000 and 2001; they were stored unrefrigerated from time of production and were not filtered after receipt in the laboratory.
Prior to spectral collection, honeys were incubated at 40°C overnight to dissolve any crystalline material, manually stirred to ensure homogeneity and adjusted to a standard solids content (70° Brix) to avoid spectral complications from naturally-occurring variations in sugar concentration. Collecting, extending and recording spectra of the honey was done at time points several months apart; the first phase involved extending some of the authentic samples of honey with fructose:glucose mixtures, the second phase involved extending some of the remaining authentic samples with fully-inverted beet syrup and high fructose corn syrup. All adulterant solutions were also produced at 70° Brix. Brix standardisation of honeys and adulterant solutions meant that any adulteration detected would not be simply on the basis of gross added solids. The fructose:glucose mixtures were produced by dissolving fructose and glucose (Analar grade; Merck) in distilled water in the following ratios: 0.7:1, 1.2:1 and 2.3:1 w/w. Twenty-five of the pure honeys were adulterated with each of the three fructose:glucose adulterant solutions at three levels, i.e. 7, 14 and 21% w/w, thus producing 225 adulterated honeys. The other adulterant solutions were generated by diluting commercially-sourced fully-inverted beet syrup (50:50 fructose:glucose; Irish Sugar, Carlow, Ireland) and high fructose corn syrup (45% fructose and 55% glucose) with distilled water. Eight authentic honeys were chosen at random and adulterated with beet invert syrup at levels of 7, 10, 14, 21, 30, 50 and 70% w/w; high fructose corn syrup was added to ten different, randomly-selected honeys at 10, 30, 50 and 70% w/w. This produced 56 BI-adulterated and 40 HFCS-adulterated samples. On visual inspection, the spectra of the pure honey from the two measurement phases displayed an offset, with those recorded in the first phase exhibiting a higher mean absorbance value.
To remove this offset, the difference of the mean absorbance values at each wavelength was subtracted from spectra collected in the first phase. This was done to ensure that the detection of the fructose:glucose adulterants was not a by-product of the offset.

2.2 Spectral Collection

In the first measurement phase, transflectance spectra were collected between 400 and 2498 nm in 2 nm steps on a NIRSystems 6500 scanning monochromator (FOSS NIRSystems, Silver Springs, MD) fitted with a sample transport module. The second phase collected transflectance spectra between 700 and 2498 nm, again in 2 nm steps, using the same monochromator. Samples were presented to the instrument in a camlock cell fitted with a gold-plated backing plate (0.1 mm sample thickness; part no. IH-0355-1). Between samples, this cell was washed with detergent, rinsed thoroughly with tepid water and dried using lens tissue. In the absence of any manufacturer-supplied temperature control for this accessory, an ad hoc procedure was devised to minimise variation in the temperature of honey samples prior to spectral collection. This procedure involved placing each sample in the instrument and scanning it 15 times without storing the resultant spectra. On the sixteenth scan, the spectrum was stored for analysis. During the pre-scanning process, each sample was presumed to have equilibrated to instrument temperature. While instrumental temperature does vary (K. Norris, personal communication), this variation is less than that likely to occur in the ambient temperature in a laboratory which is not equipped with an air-conditioning system. Preliminary experimental work had indicated 15 scans to be an appropriate compromise number on the basis of visual examination of test sample spectra. All spectra (478 in total) were collected in duplicate (including re-sampling); mean spectra were used in all subsequent calculations.
3 Dimension Reduction and Preprocessing

Spectral data between 1100 and 2498 nm were used in this study. Each spectrum therefore contains 700 wavelengths, with adjacent absorption values being highly correlated. Therefore, before using model-based classification methods, a dimension reduction step is required – this avoids singular covariance matrices, improves computational efficiency and increases statistical modelling possibilities. The technique chosen for data reduction in this study is wavelet analysis.

[Figure 1: Spectra of Pure and Adulterated Honey Samples – intensity versus wavelength (1100–2500 nm).]

3.1 Wavelet Analysis

Wavelet analysis is a technique commonly used in image and signal processing in order to compress data. Here, it is used to decompose each spectrum into a series of wavelet coefficients. Without any thresholding of these coefficients, the original spectra can be exactly reconstructed from the coefficients. However, many of the coefficients in the wavelet analysis are zero or close to zero. By thresholding the coefficients that are zero or close to zero, it is possible to dramatically reduce the dimensionality of the dataset. The resulting recomposed spectra are then approximations of each of the individual spectra. Ogden[8] gives a good practical introduction to wavelet analysis.

3.1.1 Daubechies’ Wavelet

Daubechies’ wavelet is a consistently reliable type to use and is the default within wavethresh[9]. To efficiently carry out wavelet analysis, the data dimension should be of the order 2^k, where k is an integer. Unfortunately, this can result in quite a lot of information being set aside. Techniques of extending the data to bring them up to the nearest 2^k are available, but in this case these methods result in problems when carrying out the model-based discriminant analysis – the associated variance structures are often singular.
Thus the central 2^9 = 512 observations were chosen – the range 1290 nm – 2312 nm.

3.1.2 Thresholding techniques

Various methods of thresholding[10] wavelet functions in order to achieve dimension reduction have been published. As a default procedure, universal hard thresholding is used. When using universal thresholding, the threshold is

λ = σ̂ √(2 log 2^k),

where σ̂ is a robust estimate of the standard deviation of the coefficients and there are 2^k coefficients. Other thresholding techniques may sometimes provide better approximations of the spectra but their use adds another decision into the process that may lead to overfitting rather than a general procedure.

[Figure 2: Actual and Thresholded Spectra – intensity versus wavelength (1290–2290 nm).]

3.2 Preprocessing: Savitzky-Golay

Using the Savitzky-Golay[11] algorithm to calculate derivatives of the data is commonly used in spectroscopic analysis. This concentrates analysis on the shape rather than the height of the spectra. Thus we compare this method of pre-processing the data with the alternative of no preprocessing.

4 Classification Techniques

The classification techniques used on this data set were based on Gaussian mixture models; under these models each group was modelled using a Gaussian distribution. The covariance of each of the Gaussian models is structured in a parsimonious manner using constraints on the eigen decomposition of the covariance matrix[4]. This approach offers the ability to model groups that have distinct volume, shape and orientation properties. We introduce mixture models in general terms in Section 4.1 and then show the development of clustering and classification methods from mixtures.

4.1 Mixture Models

The mixture model assumes that observations come from one of G groups, that observations within each group g are modelled by a density f(·|θ_g), where θ_g are unknown parameters, and that the probability of coming from group g is p_g.
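The decompose-and-threshold procedure described above can be sketched in a few lines. The following is a simplified illustration using the Haar wavelet (the study itself uses Daubechies’ wavelet via the wavethresh package in R), with a MAD-based robust estimate of σ̂; the signal, noise level and random seed are hypothetical.

```python
import numpy as np

def haar_dwt(x):
    """Full Haar wavelet decomposition of a length-2^k signal.
    Returns the final scaling coefficient and all detail coefficients."""
    x = np.asarray(x, dtype=float)
    details = []
    while len(x) > 1:
        s = (x[0::2] + x[1::2]) / np.sqrt(2)   # smooth (scaling) coefficients
        d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail (wavelet) coefficients
        details.append(d)
        x = s
    return x, details

def universal_hard_threshold(details):
    """Universal hard threshold lambda = sigma_hat * sqrt(2 log n), with
    sigma_hat a robust (MAD-based) estimate from the finest-scale details."""
    n = sum(len(d) for d in details)
    finest = details[0]
    sigma_hat = np.median(np.abs(finest - np.median(finest))) / 0.6745
    lam = sigma_hat * np.sqrt(2 * np.log(n))
    return [np.where(np.abs(d) > lam, d, 0.0) for d in details]

# Toy "spectrum": smooth signal plus noise, length 2^9 = 512 as in the paper.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 512)
spectrum = np.exp(-(t - 0.5) ** 2 / 0.02) + 0.01 * rng.standard_normal(512)
smooth, details = haar_dwt(spectrum)
thresholded = universal_hard_threshold(details)
kept = sum(int(np.count_nonzero(d)) for d in thresholded)
print(kept, "of", sum(len(d) for d in thresholded), "detail coefficients survive")
```

Most detail coefficients are set to zero, so each 512-point spectrum is represented by a much shorter coefficient vector, which is the dimension reduction exploited in the study.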
Therefore, given data y with independent multivariate observations y_1, …, y_M, a mixture model with G groups has likelihood function

L_mix(θ_1, …, θ_G; p_1, …, p_G | y) = ∏_{m=1}^{M} ∑_{g=1}^{G} p_g f(y_m | θ_g).  (1)

The Gaussian mixture model further assumes the density f to be a multivariate Gaussian density φ, parameterized by a mean μ_g and covariance matrix Σ_g:

φ_g(y_m | μ_g, Σ_g) ≡ exp{−½ (y_m − μ_g)^T Σ_g^{−1} (y_m − μ_g)} / √det(2π Σ_g).

The multivariate Gaussian densities imply that the groups are centered at the means μ_g, with the shape, orientation and volume of the scatter of observations within the group depending on the covariances Σ_g. The Σ_g can be decomposed using an eigen decomposition into the form

Σ_g = λ_g D_g A_g D_g^T,

where λ_g is a constant of proportionality, D_g an orthogonal matrix of eigenvectors and A_g a diagonal matrix whose elements are proportional to the eigenvalues, as described by Fraley and Raftery[4]. The parameters λ_g, A_g and D_g have interpretations in terms of volume, shape and orientation of the scatter for the component: λ_g controls the volume, A_g the shape of the scatter and D_g the orientation. Constraining the parameters to be equal across groups gives great modelling flexibility. Some of the options for constraining the covariance parameters are given in Table 1.

Table 1: Parametrizations of the covariance matrix Σ_g (Model ID: E = Equal, V = Variable, I = Identity)

Model ID | Decomposition           | Structure
EII      | Σ_g = λI                | Spherical
VII      | Σ_g = λ_g I             | Spherical
EEI      | Σ_g = λA                | Diagonal
VEI      | Σ_g = λ_g A             | Diagonal
EVI      | Σ_g = λA_g              | Diagonal
VVI      | Σ_g = λ_g A_g           | Diagonal
EEE      | Σ_g = λDAD^T            | Ellipsoidal
EEV      | Σ_g = λD_g A D_g^T      | Ellipsoidal
VEV      | Σ_g = λ_g D_g A D_g^T   | Ellipsoidal
VVV      | Σ_g = λ_g D_g A_g D_g^T | Ellipsoidal

The letters in the Model ID denote the volume, shape and orientation respectively. For example, EEV represents equal volume and shape with variable orientation. The mixture model (1) can be fitted to multivariate observations y_1, y_2, . . .
, y_N by maximizing the log-likelihood of (1) using the EM algorithm[12]. The resulting output from the EM algorithm includes estimates of the probability of group membership for each observation; these can be used to cluster the observations into their most probable groups. This procedure is the basis of model-based clustering[4] and is easily implemented in the mclust library[13] for the statistics package R[14].

However, in model-based discriminant analysis (also known as eigenvalue decomposition discriminant analysis)[3], the model is fitted to data which consist of multivariate observations w_n, where n = 1, 2, …, N, and labels l_n, where l_ng = 1 if observation n belongs to group g and 0 otherwise. Therefore, the resulting likelihood function is

L_disc(p_1, …, p_G; θ_1, …, θ_G | w, l) = ∏_{n=1}^{N} ∏_{g=1}^{G} [p_g f(w_n | θ_g)]^{l_ng}.  (2)

The log-likelihood corresponding to (2) is maximized, yielding parameter estimates p̂_1, …, p̂_G and θ̂_1, …, θ̂_G. For stability, equal probabilities, p̂_1 = … = p̂_G = 1/G, are often assumed.

The posterior probability of group membership for an observation y whose label is unknown can be estimated as

P(Group g | y) ≈ p̂_g f(y | θ̂_g) / ∑_{q=1}^{G} p̂_q f(y | θ̂_q)  (3)

and observations can thus be classified into their most probable group. The mclust[13] package can also be used to perform model-based discriminant analysis. This again allows for the models listed in Table 1. It is worth noting that Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) are special cases of model-based discriminant analysis; they correspond to the EEE and VVV models respectively.

4.2 Discriminant Analysis With Updating

Model-based discriminant analysis as developed in[3] only uses the observations with known group membership in the model-fitting procedure. Once the model is fitted, the observations with unknown group labels can be classified into their most probable groups.
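A minimal sketch of model-based discriminant analysis along these lines: per-group Gaussian parameters are estimated from labelled data and a new observation is classified via the posterior (3) with equal priors. Only the EEE (pooled covariance, LDA-like) and VVV (separate covariances, QDA-like) models are shown, and the two-group data are hypothetical, not from the honey study.

```python
import numpy as np

def fit_groups(W, labels, model="VVV"):
    """Estimate (mu_g, Sigma_g) per group from labelled data.
    'EEE' pools a common covariance (LDA); 'VVV' keeps separate ones (QDA)."""
    groups = np.unique(labels)
    mus = {g: W[labels == g].mean(axis=0) for g in groups}
    if model == "EEE":
        pooled = sum(np.cov(W[labels == g].T, bias=True) * (labels == g).sum()
                     for g in groups) / len(W)
        Sigmas = {g: pooled for g in groups}
    else:
        Sigmas = {g: np.cov(W[labels == g].T, bias=True) for g in groups}
    return groups, mus, Sigmas

def posterior(y, groups, mus, Sigmas):
    """Posterior group membership, equation (3), with equal priors 1/G."""
    def dens(mu, S):
        diff = y - mu
        return np.exp(-0.5 * diff @ np.linalg.solve(S, diff)) / \
               np.sqrt(np.linalg.det(2 * np.pi * S))
    f = np.array([dens(mus[g], Sigmas[g]) for g in groups])
    return f / f.sum()

# Hypothetical two-group 2-D data (e.g. "pure" vs "adulterated").
rng = np.random.default_rng(2)
W = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(3, 1, (60, 2))])
labels = np.repeat([0, 1], 60)
groups, mus, Sigmas = fit_groups(W, labels, model="EEE")
p = posterior(np.array([3.0, 3.0]), groups, mus, Sigmas)
print(p)  # a point near group 1's mean gets high posterior for group 1
```

Classifying into the most probable group is then simply `groups[p.argmax()]`.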
An alternative approach is to model both the labelled data (w, l) and the unlabelled data y and to maximize the resulting log-likelihood for the combined model. The likelihood function for the combined data is the product of the likelihood functions given in (2) and (1). This classification approach was recently developed in[5] and was demonstrated to give improved classification performance over classical model-based discriminant analysis. With this modelling approach the likelihood function is of the form

L_update(p, θ | w, l, y) = L_disc(p, θ | w, l) · L_mix(p, θ | y)
  = ∏_{n=1}^{N} ∏_{g=1}^{G} [p_g f(w_n | θ_g)]^{l_ng} × ∏_{m=1}^{M} ∑_{g=1}^{G} p_g f(y_m | θ_g).  (4)

The log-likelihood corresponding to (4) is maximized using the EM algorithm (Section 4.3) to find estimates for p (if estimated) and θ. Output from the EM algorithm includes estimates of the probability of group membership for the unlabelled observations y, as given in (3). The EM algorithm for maximizing this log-likelihood proceeds by iteratively substituting the unknown labels with their estimated values. At each iteration the estimated labels are updated and new parameter estimates are produced. By passing the estimated values of the unknown labels into the EM algorithm it is possible to “update” the classification results with some of the knowledge gained from fitting the model to all of the data. Indeed, even with small training sets containing unrepresentative proportions of pure/adulterated honey samples, updating shows consistency.

4.3 EM Algorithm

The Expectation-Maximization algorithm[12] is ideally suited to the problem of maximizing the log-likelihood function when some of the data have unknown group labels; this arises in the model-based clustering likelihood (1) and the model-based discriminant analysis with updating likelihood (4). In this section, we show the steps involved in the EM algorithm for model-based discriminant analysis with updating; the model-based clustering steps are shown in[4].
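Before detailing the EM steps, note that the objective (4) factorizes into a labelled part (2) and a mixture part (1), so its log is a simple sum of the two. A minimal numpy sketch with hypothetical one-dimensional Gaussian groups (labels are given as group indices rather than indicator vectors, purely for brevity):

```python
import numpy as np

def log_gauss(x, mu, var):
    """Univariate Gaussian log-density."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def log_l_update(w, l, y, probs, mus, varis):
    """log of equation (4): labelled log-likelihood (2) plus
    mixture log-likelihood (1) for the unlabelled data."""
    log_disc = sum(np.log(probs[g]) + log_gauss(wn, mus[g], varis[g])
                   for wn, g in zip(w, l))              # labelled part, eq. (2)
    log_mix = sum(np.log(sum(p * np.exp(log_gauss(ym, mu, v))
                             for p, mu, v in zip(probs, mus, varis)))
                  for ym in y)                          # unlabelled part, eq. (1)
    return log_disc + log_mix

# Hypothetical labelled data w with labels l, plus unlabelled data y.
w = np.array([-1.0, 0.5, 4.2, 5.1]); l = [0, 0, 1, 1]
y = np.array([0.1, 4.8, 5.5])
ll = log_l_update(w, l, y, probs=[0.5, 0.5], mus=[0.0, 5.0], varis=[1.0, 1.0])
print(ll)
```

Maximizing this quantity over the parameters, with the unlabelled contributions handled via expected labels, is exactly what the EM algorithm of Section 4.3 does.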
Consider the data to be classified as consisting of M multivariate observations, each with two parts: a known part y_m and an unknown part z_m. In this context the spectroscopic data are observed and thus known, so they are treated as the y_m; the labels (pure or adulterated) are unknown and thus are treated as the z_m. Additionally, N labelled observations are available, each consisting of two known parts: the observation w_n and its label l_n. The unobserved portion of the data, z, is a matrix of indicator variables, so that z_m = (z_m1, …, z_mG), where each z_mg is 1 if y_m is from group g and 0 otherwise. Then the observed data likelihood can be written in the form

L_O(p, θ | w, l, y) = ∏_{n=1}^{N} ∏_{g=1}^{G} [p_g f(w_n | θ_g)]^{l_ng} × ∏_{m=1}^{M} ∑_{g=1}^{G} p_g f(y_m | θ_g)  (5)

and the complete data likelihood is

L_C(p, θ | w, l, y, z) = ∏_{n=1}^{N} ∏_{g=1}^{G} [p_g f(w_n | θ_g)]^{l_ng} × ∏_{m=1}^{M} ∏_{g=1}^{G} [p_g f(y_m | θ_g)]^{z_mg}.  (6)

Initial estimates of p̂ (if estimated) and θ̂ = (μ̂, Σ̂) are taken from classical model-based discriminant analysis, by maximizing (2). The expected values of the unknown labels are calculated as

ẑ_mg^(k+1) ← p̂_g^(k) f(y_m | θ̂_g^(k)) / ∑_{q=1}^{G} p̂_q^(k) f(y_m | θ̂_q^(k))  (7)

for g = 1, …, G and m = 1, …, M, and the parameters p and θ can then be estimated by:

p̂_g^(k+1) ← (∑_{n=1}^{N} l_ng + ∑_{m=1}^{M} ẑ_mg^(k+1)) / (N + M)  (if estimated)

μ̂_g^(k+1) ← (∑_{n=1}^{N} l_ng w_n + ∑_{m=1}^{M} ẑ_mg^(k+1) y_m) / (∑_{n=1}^{N} l_ng + ∑_{m=1}^{M} ẑ_mg^(k+1))  (8)

The estimates of Σ_g again depend on the constraints placed on the eigenvalue decomposition; details of the calculations are given in[3, 15]. The iterative process continues until convergence is achieved. The use of an Aitken acceleration-based convergence criterion is discussed in[5].

Updating can take two forms – soft and hard updating. With soft updating (EM), updates of the missing labels are made using (7), so the unknown labels are replaced by probabilities (or a soft classification).
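The updating steps (7) and (8) can be sketched as a small EM loop. This is a deliberate simplification – equal group probabilities and a fixed spherical covariance (EII-like) are assumed, whereas the paper estimates Σ_g under the chosen eigen decomposition constraints – and all data here are hypothetical.

```python
import numpy as np

def em_update(W, L, Y, mus, sigma2, n_iter=20):
    """Simplified updating EM of Section 4.3: spherical common covariance,
    equal group probabilities. W: labelled data, L: label indicator matrix,
    Y: unlabelled data, mus: initial group means from labelled data only."""
    G = L.shape[1]
    for _ in range(n_iter):
        # E-step, equation (7): posterior membership of unlabelled data.
        d2 = ((Y[:, None, :] - np.array(mus)[None, :, :]) ** 2).sum(-1)
        f = np.exp(-0.5 * d2 / sigma2)
        Z = f / f.sum(axis=1, keepdims=True)
        # M-step, equation (8): means from labels plus soft assignments.
        mus = [(L[:, g] @ W + Z[:, g] @ Y) / (L[:, g].sum() + Z[:, g].sum())
               for g in range(G)]
    return np.array(mus), Z

# Small labelled set, large unlabelled set, two well-separated groups.
rng = np.random.default_rng(3)
W = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])
L = np.repeat(np.eye(2), 10, axis=0)
Y = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(5, 1, (90, 2))])
mus, Z = em_update(W, L, Y, mus=[W[:10].mean(0), W[10:].mean(0)], sigma2=1.0)
pred = Z.argmax(axis=1)
print((pred[:90] == 0).mean(), (pred[90:] == 1).mean())
```

Replacing `Z` with a 0/1 indicator of its row-wise maximum at each iteration would give the hard-updating (CEM) variant discussed next.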
Hard updating (CEM), in contrast, replaces the probabilities given in (7) with an indicator vector of the most probable group. The hard classification algorithm does not maximize (4) but instead attempts to maximize (6) (although local maxima are a possibility).

5 Regression-Based Techniques

Commonly used in near infrared spectroscopy on food samples, discriminant partial least squares was found to be a reliable method of classifying Irish honey samples[7].

5.1 Partial Least Squares Regression

Partial least squares regression (PLSR) was developed by Wold[16, 17] and is based on the assumption of a linear relationship between the observed variables (e.g. the spectroscopy measurements) and the outcome variable (e.g. pure or adulterated). It is similar to principal components regression (PCR). Let X be an n × p matrix of observed values (n samples on p measurement points) and let y be a vector of length n representing the outcome variable/label. If S is the sample covariance matrix of X, the similarity between PCR and PLSR can be seen by examining what is maximised in each case, both subject to the condition ||α|| = 1. PCR obtains the mth principal component v_m by maximising Var(Xα) subject to v_l^T S α = 0 for l = 1, …, m−1, while PLSR obtains the mth PLS direction φ̂_m by maximising Corr²(y, Xα) Var(Xα) subject to φ̂_l^T S α = 0 for l = 1, …, m−1. However, within the PLS maximization problem, the variance term does dominate.

5.1.1 Algorithm for PLSR

Each observation x_j is standardized to have mean 0 and variance 1. Initialize ŷ^(0) = ȳ1 and x_j^(0) = x_j for j = 1, …, p. For m = 1, …, p:

• h_m = ∑_{j=1}^{p} φ̂_mj x_j^(m−1), where φ̂_mj = ⟨x_j^(m−1), y⟩ – the univariate regression coefficient of y on each x_j^(m−1)
• θ̂_m = ⟨h_m, y⟩ / ⟨h_m, h_m⟩
• ŷ^(m) = ŷ^(m−1) + θ̂_m h_m
• x_j^(m) = x_j^(m−1) − [⟨h_m, x_j^(m−1)⟩ / ⟨h_m, h_m⟩] h_m for j = 1, …, p, so that each x_j is orthogonalized with respect to h_m

It has been noted that the sequence of PLS coefficients for m = 1, …, p represents the conjugate gradient sequence for computing least squares solutions.

5.1.2 Number of Parameters

PLSR uses m relevant loadings/components in the model. However, deciding on m is not trivial[18]. Even when m is known, the number of parameters in the model is open to debate; there is a difficulty in calculating the degrees of freedom of a model using PLS[19]. Calculating the number of parameters in a model is especially relevant when using a complexity penalty as part of the model selection criterion. For the purposes of this study, the number of parameters in the population model was assumed to be

p(p + 1)/2 + m + 1,

so that m = 0 corresponds to no correlation between X and y and m = p is the full least squares model; this agrees with [18].

6 Model Selection and Verification

The model with the best Bayesian Information Criterion value was chosen in each case, so that a consistent criterion could be applied to all models. It is also possible to choose models based on leave-one-out cross-validation, but this is computationally expensive. The certainty of the classification decisions is measured using Brier’s score[20].

6.1 Bayesian Information Criterion (BIC)

The results given are for models selected using the Bayesian Information Criterion (BIC), where

BIC = 2 × log-likelihood − d log(N + M),

given that N + M is the number of observations and d is the number of parameters used in the model.

6.2 Brier’s Score

Brier[20] developed a method of producing a continuous performance measure where perfect prediction gives a Brier’s score of zero. Given G groups, N samples and forecast probabilities ẑ_n1, …, ẑ_nG for sample n belonging to group 1, . . .
, G respectively, the Brier’s score, B, is

B = (1/2N) ∑_{g=1}^{G} ∑_{n=1}^{N} (ẑ_gn − z_gn^true)²  (9)

where z_gn^true is an indicator variable for the actual group membership. It is especially useful for determining the certainty of predictions. Some observations may be only barely put into the correct group, or indeed just miss out on correct classification. Observations that are barely classified correctly add more to the Brier’s score than those where a more certain classification is made. A trait of PLSR is that some regression outputs may in fact lie beyond the zero-one scale. For the purposes of calculating a pseudo-Brier score, such results were set equal to either zero or one, so that these certain classifications do not add to the total.

7 Comparison of Methods

As an example of the data input to the classification problem, Figure 1 shows transflectance spectra of authentic and adulterated honey samples. Visual examination of these spectra reveals a sloping baseline, the absence of fine structure and the impossibility of visually differentiating between samples – features which are typical of near infrared spectral collections. For these reasons, multivariate statistical methods are required to efficiently and effectively extract useful information from such datasets.

The data consist of 157 unadulterated samples, 225 extended with fructose:glucose mixtures, 56 extended with beet invert syrup and 40 samples extended with high fructose corn syrup. Let fg, bi and cs represent adulteration by fructose:glucose mixtures, beet invert syrup and high fructose corn syrup respectively, and let adult represent samples adulterated with any of the three adulterant types. The figures reported are the mean correct classification rates (as percentages) over 400 random splits of the data into training and test sets and their associated standard deviations (in brackets).
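The pseudo-Brier computation described in Section 6.2 – equation (9) with out-of-range PLSR outputs clipped to zero or one – can be sketched as follows; the forecast values below are hypothetical.

```python
import numpy as np

def pseudo_brier(Z_hat, z_true):
    """Brier's score, equation (9): B = (1/2N) * sum_g sum_n (zhat - ztrue)^2.
    PLSR outputs outside [0, 1] are clipped first, as described in the text."""
    Z = np.clip(Z_hat, 0.0, 1.0)
    N = z_true.shape[0]
    return ((Z - z_true) ** 2).sum() / (2 * N)

# Hypothetical forecasts for 3 samples, 2 groups (pure / adulterated).
Z_hat = np.array([[0.9, 0.1],    # confidently (and correctly) pure
                  [0.6, 0.4],    # barely correct -> larger contribution
                  [1.2, -0.2]])  # beyond [0, 1] -> clipped, contributes zero
z_true = np.array([[1, 0], [1, 0], [1, 0]])
print(pseudo_brier(Z_hat, z_true))
```

As the comments illustrate, the barely-correct forecast contributes far more to the score than the confident ones, which is exactly why the score is used here as a measure of classification certainty.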
Three different types of training data were examined:

a) correct proportions of pure and each type of adulterant
b) correct proportions of pure and adulterated
c) unrepresentative proportions of pure and adulterated

The performance of model-based discriminant analysis (DA), soft updating (EM), hard updating (CEM) and partial least squares regression (PLSR) was measured at three different ratios of training/test data – 50/50%, 25/75% and 10/90% – in order to fully examine the robustness of each technique to varying sample size.

7.1 Correct Proportions of Pure and of Each Type of Adulterant in Training Set

The training sets, at 50%, 25% and 10% of the data, are completely representative of the entire population studied. (BIC selection method; columns are Classification Rate and Brier’s Score.)

Training: pure 79, fg 112, bi 28, cs 20 / Test: pure 78, fg 113, bi 28, cs 20
  DA    94.86 (1.01)   0.060 (0.014)
  EM    93.02 (2.02)   0.060 (0.016)
  CEM   93.95 (1.56)   0.053 (0.009)
  PLSR  94.66 (1.04)   0.053 (0.007)
Training: pure 39, fg 56, bi 14, cs 10 / Test: pure 118, fg 169, bi 42, cs 30
  DA    93.51 (1.07)   0.071 (0.012)
  EM    84.49 (12.72)  0.140 (0.120)
  CEM   91.60 (3.94)   0.076 (0.037)
  PLSR  93.88 (1.14)   0.060 (0.011)
Training: pure 16, fg 22, bi 6, cs 4 / Test: pure 141, fg 203, bi 50, cs 36
  DA    90.81 (1.70)   0.104 (0.045)
  EM    52.38 (15.58)  0.465 (0.157)
  CEM   69.80 (20.85)  0.294 (0.209)
  PLSR  92.46 (3.75)   0.075 (0.031)

There is no significant difference in the performance of DA and PLSR, but both outperform the updating procedures. All methods have decreasing performance as the training percentage decreases. However, the decrease is not large for DA and PLSR. As the percentage in the training set decreases, the number of factors selected by PLSR becomes more variable (8–14 factors at 50% to 1–30 factors at 10%). DA chooses more parsimonious models as the training set percentage decreases, but the updating procedures (EM and CEM) choose models with more parameters as the size of the training set decreases relative to the test set.
7.2 Correct Proportions of Pure/Adulterated in Training Set

Only the proportion of pure to adulterated samples is kept fixed – the proportion of each type of adulterant is allowed to vary between simulations. (BIC selection method; columns are Classification Rate and Brier’s Score.)

Training: pure 79, adult 160 / Test: pure 78, adult 161
  DA    94.66 (1.18)   0.062 (0.015)
  EM    93.84 (2.41)   0.059 (0.020)
  CEM   93.84 (1.57)   0.054 (0.015)
  PLSR  94.58 (0.91)   0.054 (0.006)
Training: pure 39, adult 80 / Test: pure 118, adult 241
  DA    93.45 (1.01)   0.073 (0.013)
  EM    83.86 (12.18)  0.145 (0.117)
  CEM   90.85 (7.17)   0.083 (0.069)
  PLSR  93.85 (1.01)   0.060 (0.010)
Training: pure 16, adult 32 / Test: pure 141, adult 289
  DA    90.63 (1.74)   0.115 (0.057)
  EM    51.79 (15.00)  0.470 (0.151)
  CEM   68.14 (20.99)  0.312 (0.210)
  PLSR  92.45 (4.14)   0.074 (0.032)

As in Section 7.1, there is no significant difference between DA and PLSR, which both outperform the updating methods. Classification performance is slightly worse (as expected) but is actually quite robust to type of adulterant. The number of factors chosen by PLSR again becomes more variable as the proportion of training data to test data decreases. The pattern of DA choosing more parsimonious models while updating procedures choose models with more parameters again occurs with decreasing training data proportion.
7.3 Unrepresentative Proportions of Pure/Adulterated

In this section the ratios of pure to adulterated samples in the training sets (and hence also in the test sets) are varied in order to examine the performance of each method when the training data does not accurately reflect the entire population structure.

Table 2: 50% Training Data / 50% Test Data (BIC selection method; columns are Classification Rate and Brier’s Score)

Training: pure 40, adult 199 / Test: pure 117, adult 122
  DA    92.47 (1.52)   0.079 (0.018)
  EM    93.13 (1.55)   0.061 (0.015)
  CEM   92.45 (1.81)   0.067 (0.017)
  PLSR  95.15 (1.15)   0.058 (0.006)
Training: pure 20, adult 219 / Test: pure 102, adult 137
  DA    90.01 (1.60)   0.101 (0.017)
  EM    90.32 (3.35)   0.089 (0.031)
  CEM   89.24 (2.90)   0.100 (0.029)
  PLSR  97.08 (1.28)   0.050 (0.008)
Training: pure 99, adult 140 / Test: pure 58, adult 181
  DA    94.98 (1.21)   0.058 (0.013)
  EM    92.69 (1.80)   0.064 (0.016)
  CEM   93.82 (1.48)   0.055 (0.014)
  PLSR  94.66 (1.22)   0.054 (0.007)
Training: pure 119, adult 120 / Test: pure 38, adult 201
  DA    95.34 (0.92)   0.064 (0.016)
  EM    90.82 (4.41)   0.081 (0.036)
  CEM   92.87 (2.91)   0.064 (0.027)
  PLSR  94.87 (1.11)   0.047 (0.004)

In the 50% training data situation, PLSR marginally outperforms DA and the updating methods when the number of pure samples in the training set is less than expected. There is no significant difference between PLSR and DA when the number of pure samples in the training set is greater than expected. In the 25% training data situation there is almost no difference in classification performance between DA and PLSR, with DA marginally outperforming PLSR when the number of pure samples in the training data set is greater than expected (> 39 pure samples). With 10% of the data used as a training set, DA shows its robustness to both training sample size and composition. It should be noted that PLSR again also does well in terms of classification performance but does have a much higher standard deviation than DA. Where there are fewer than the expected number of pure samples (e.g.
< 79 in Table 2) in the training set, DA almost always chooses EEE, while when there are more than the expected number of pure samples in the training set DA chooses VEV most frequently. The range of factors chosen by PLSR is smaller when the number of pure samples in the training set is higher than expected than when it is lower than expected.

Table 3: 25% Training Data / 75% Test Data (BIC selection method)

  Training      Test          Method  Classification Rate  Brier's Score
  pure:   20    pure:  137    DA      92.25 (1.68)         0.079 (0.017)
  adult:  99    adult: 222    EM      72.37 (18.16)        0.256 (0.178)
                              CEM     82.16 (16.19)        0.169 (0.159)
                              PLSR    93.12 (3.59)         0.067 (0.023)
  pure:   10    pure:  147    DA      91.12 (1.63)         0.089 (0.014)
  adult: 109    adult: 212    EM      68.93 (18.38)        0.291 (0.183)
                              CEM     82.28 (15.57)        0.168 (0.151)
                              PLSR    92.68 (6.09)         0.070 (0.037)
  pure:   49    pure:  108    DA      93.98 (0.98)         0.068 (0.012)
  adult:  70    adult: 251    EM      85.05 (10.34)        0.135 (0.099)
                              CEM     90.79 (6.48)         0.084 (0.063)
                              PLSR    93.50 (1.35)         0.067 (0.012)
  pure:   59    pure:   98    DA      93.73 (1.00)         0.072 (0.015)
  adult:  60    adult: 261    EM      84.92 (9.30)         0.137 (0.088)
                              CEM     90.83 (3.49)         0.084 (0.032)
                              PLSR    93.41 (1.59)         0.068 (0.014)

Table 4: 10% Training Data / 90% Test Data (BIC selection method)

  Training      Test          Method  Classification Rate  Brier's Score
  pure:    8    pure:  149    DA      89.87 (2.08)         0.110 (0.047)
  adult:  40    adult: 281    EM      61.02 (17.37)        0.368 (0.171)
                              CEM     85.91 (8.41)         0.132 (0.080)
                              PLSR    88.17 (7.33)         0.106 (0.047)
  pure:    4    pure:  153    DA      88.66 (2.37)         0.115 (0.025)
  adult:  44    adult: 277    EM      66.37 (15.83)        0.302 (0.142)
                              CEM     91.88 (0.94)         0.074 (0.009)
                              PLSR    85.95 (8.67)         0.120 (0.054)
  pure:   20    pure:  137    DA      90.89 (1.70)         0.099 (0.023)
  adult:  28    adult: 293    EM      74.45 (16.39)        0.241 (0.164)
                              CEM     85.15 (11.28)        0.140 (0.112)
                              PLSR    89.75 (5.42)         0.098 (0.035)
  pure:   24    pure:  133    DA      90.92 (1.74)         0.104 (0.029)
  adult:  24    adult: 297    EM      80.03 (11.17)        0.187 (0.112)
                              CEM     86.05 (8.56)         0.131 (0.084)
                              PLSR    89.83 (4.97)         0.097 (0.033)

8 Conclusions

Both PLSR and model-based discriminant analysis achieve good classification results.
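The two-group PLSR classifier compared throughout codes the classes as a 0/1 response and regresses it on the spectra via partial least squares. A minimal NIPALS PLS1 sketch on hypothetical toy data (the function, the toy problem, and the 0.5 decision threshold are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """NIPALS PLS1: regress a single response y on X using n_comp
    factors. Returns the regression coefficients and the intercept."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xc.T @ yc                 # weight vector for this factor
        w /= np.linalg.norm(w)
        t = Xc @ w                    # scores
        tt = t @ t
        p = Xc.T @ t / tt             # X loadings
        qk = yc @ t / tt              # y loading
        Xc = Xc - np.outer(t, p)      # deflate X and y
        yc = yc - qk * t
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, y_mean - x_mean @ beta

# Hypothetical two-group toy problem: code the groups as 0/1, regress,
# then classify by thresholding the fitted response at 0.5.
rng = np.random.default_rng(1)
n, p = 60, 10
group = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p)) + group[:, None] * np.linspace(2, 0, p)
beta, b0 = pls1_fit(X, group, n_comp=3)
pred = ((X @ beta + b0) > 0.5).astype(int)
```

In practice the number of factors (fixed at 3 here) would be selected by a validation procedure, and the thresholded score carries no probabilistic interpretation, which is the contrast drawn with the model-based methods below.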
Unlike in previous studies[5], the updating techniques do not improve classification performance. This may be due to the structure, or lack thereof, within the adulterated samples. Comparing the Brier's scores to the proportion of samples misclassified shows that the classification decisions made are quite definitive. Model-based discriminant analysis is more flexible and more robust than PLSR. As the training set proportion decreases and the make-up of the training set becomes unrepresentative of the population, model-based discriminant analysis shows its strengths. The results from model-based discriminant analysis take the form of the probability of membership of each group rather than, as in PLSR, a score. Thus a cost function can easily be imposed at the end of the classification process and the results have a natural interpretation.

9 Acknowledgements

The work reported in this paper is funded by Teagasc under the Walsh Fellowship Scheme. Thanks are due to individual beekeepers for the provision of honey samples and to the Irish Department of Agriculture & Food for financial support (FIRM programme) which enabled collection and spectral analysis of the samples. This work is also supported under the Science Foundation Ireland Basic Research Grant scheme (Grant 04/BR/M0057).

References

[1] M. Lees, editor. Food Authenticity: Issues and Methodologies. Eurofins Scientific, Nantes, France, 1998.
[2] European Commission. Council Directive 2001/110/EC of 20 June 2001, relating to honey, 2002.
[3] H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91:1743–1748, 1996.
[4] C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97:611–631, 2002.
[5] N. Dean, T. B. Murphy, and G. Downey.
Using unlabelled data to update classification rules with applications in food authenticity studies. Journal of the Royal Statistical Society, Series C, 55:1–14, 2006.
[6] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.
[7] G. Downey, V. Fouratier, and J. D. Kelly. Detection of honey adulteration by addition of fructose and glucose using near infrared transflectance spectroscopy. Journal of Near Infrared Spectroscopy, 11:447–456, 2003.
[8] R. T. Ogden. Essential Wavelets for Statistical Applications and Data Analysis. Birkhäuser, 1997.
[9] A. Kovac, M. Maechler, and G. Nason (R port). wavethresh: Software to perform wavelet statistics and transforms, 2004. R package version 2.2-8.
[10] D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
[11] A. Savitzky and M. J. E. Golay. Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, 36:1627–1639, 1964.
[12] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
[13] C. Fraley, A. E. Raftery, and R. Wehrens (R port). mclust: Model-based cluster analysis. Dept. of Statistics, University of Washington. R package version 2.1-8.
[14] R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2004. ISBN 3-900051-07-0.
[15] G. Celeux and G. Govaert. Gaussian parsimonious clustering models. Pattern Recognition, 28:781–793, 1995.
[16] H. Wold. Estimation of principal components and related models by iterative least squares. In Multivariate Analysis, pages 391–420, 1966.
[17] H. Wold. Nonlinear estimation by iterative least square procedures. In Research Papers in Statistics: Festschrift for J. Neyman, pages 411–444, 1966.
[18] I. S. Helland.
Some theoretical aspects of partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 58:97–107, 2001.
[19] H. van der Voet. Pseudo-degrees of freedom for partial least squares. Journal of Chemometrics, 13:195–208, 1999.
[20] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78:1–3, 1950.