USING FINITE MIXTURE OF MULTIVARIATE POISSON DISTRIBUTIONS FOR
DETECTION OF MEASUREMENT ERRORS IN COUNT DATA
Author: Bernardo João Rota
Supervisor: Thomas Laitila
Abstract
This study is concerned with the identification of measurement errors in multivariate count data
using finite mixtures of multivariate Poisson models. Two approaches for constructing the
multivariate Poisson model are considered: the multivariate Poisson with a common covariance
term and the restricted covariance Poisson model. The restricted covariance model is used in the
empirical application. The dataset comes from a labour market survey conducted at Statistics
Sweden concerning employment, job vacancies and wages. Two methods for classifying an
observation as outlier or normal are described: the modified likelihood ratio statistic and the
caselikelihood. The former was applied in the empirical study and found to be effective in
identifying measurement errors. A new method for the detection of measurement errors based on
the Mahalanobis distance is proposed.
Keywords: Measurement errors, Multivariate Poisson, Finite Mixture Model, EM-algorithm,
Modified likelihood ratio test, Reduced sub-ordering, Caselikelihood
To my Parents, my brothers, my friends and my teachers at Örebro University
Acknowledgements
The author would like to thank all professors in the Applied Statistics Master programme at Örebro
University for their insightful lectures, which enabled a deep understanding of the programme
contents.
I would also like to thank Prof. Sune Karlsson for his supervision during the four semesters of
studies at this university.
Special thanks are addressed to Prof. Thomas Laitila, who kindly supervised the thesis.
Finally, I would like to thank Statistics Sweden, which made available the data used in the
empirical application. Part of the thesis was written at this institution.
INDEX OF CONTENTS
1. Introduction
2. Poisson models
2.1 Univariate Poisson Model
2.2 Multivariate Poisson Model
3. Finite Mixture Models (FMM)
3.1 Finite mixture of multivariate Poisson distribution
3.2 Estimation of FMM
3.3 The expectation-maximization (EM) algorithm
4. Measures of classification of observed values
4.1 Modified likelihood ratio test
4.2 The Caselikelihood
4.3 The reduced sub-ordering method
5. Application
5.1 Data description
5.2 Methods
5.3 Empirical Results
6. Conclusions
References
Appendix
INDEX OF FIGURES
Fig 1. Pairwise plot of the variables
Fig 2. Plot of the log likelihood vs the number of components
Fig 3. Boxplots for each cluster illustrating some suspicious observations
Fig 4. Histograms for each cluster illustrating suspicious observations
INDEX OF TABLES
Table 1. Information criteria
Table 2. Variable means in the data set
Table 3. Covariances among the variables in the data set
Table 4. MLE of model parameters (std. errors in parentheses)
Table 5. Number of observations and number of suspicious observations for each component
Table 6. Accuracy in detection of errors by method
1. Introduction
Statistical information from surveys is essential for decision making, and the quality of this
information is of great importance since it affects the quality of the decisions. Taking accuracy as
a measure of quality, it is worth working to achieve it. The accuracy of the statistical information
(estimates) can be measured via the mean squared error (MSE), a measure that incorporates both
the variance and the bias of the estimator; a small mean squared error is always desired in order
to produce good estimates. Among the issues that make it hard to attain the desired minimum mean
squared error are the errors arising from all the processes involved in the survey. A survey
operation includes three main stages, namely sample selection, data collection and data processing
(Särndal et al., 1992), and each of these stages is subject to error. In this study the errors arising
in the data collection stage are of concern.
The errors mentioned above are called measurement errors and are defined as the difference
between the value recorded on a study variable for a sampled element and the true (in general
unknown) value. These errors may increase the bias and/or the variance of the estimated
parameters, which are undesirable effects. The effect of measurement errors depends on their
magnitude: the greater the difference between the observed values and the unobserved true values,
the greater the negative influence on the properties of the estimator. The paper considers
identification of measurement errors using techniques for outlier detection.
Among several definitions of an outlier, consider the simplest one, connected with the term
anomalous measures, which states that an observation considerably different from the remainder,
as if it were generated by a different mechanism, can be thought of as an outlier (Elahi et al.,
2009). Such observations can occur by chance, but they are often indicative of measurement
errors. Identifying them adds value to the process of obtaining accurate, or nearly accurate,
statistics. If these anomalous observations are well identified and well corrected, estimation may
safely use simple linear estimators instead of robust ones. Identification of anomalous
observations is of interest in various fields, such as fraud detection, network intrusion detection
and the monitoring of criminal activities.
Several methods for the identification of anomalous measures have been proposed in the literature.
They are essentially based on supervised or unsupervised learning approaches. A supervised
approach learns a classification model over a set of sample data already labelled as outlier or not,
and by using this model a new observation can be classified as an outlier or not. In unsupervised
approaches the detection of outliers is carried out without training samples (Ye et al., 2009); most
of these algorithms are distance based, density based or distribution based.
In distribution-based methods the data are fitted with a statistical model and an observation is
classified as anomalous or not on the basis of that model. Yamanishi et al. (2004) applied this idea
in the SmartSifter program, a program for outlier identification using finite mixture models.
Finite mixture models can be used for the identification of anomalous measures as an alternative
to earlier approaches, which become inefficient as the dimension of the variables and the amount
of data increase.
This study uses a finite mixture of multivariate Poisson distributions as the probabilistic model for
the detection of outliers in multivariate count data. The parameters of the mixture are estimated
via a version of the expectation-maximization (EM) algorithm of Dempster et al. (1977).
This paper aims to:
1. Investigate and estimate mixtures of multivariate Poisson distributions.
2. Use the mixture model as the probabilistic model for the identification of measurement errors.
3. Propose a new measure for the identification of outliers in FMM modelling.
Two models are considered here. Model 1 assumes a common covariance term between all
variables of interest. Model 2 assumes pairwise interactions between all variables of interest in
order to reduce the complexity of implementing the multivariate Poisson distribution. To simplify
the structure of model 2, the dataset can be used to test for significant relations between pairs of
variables, giving rise to what is known as the restricted covariance Poisson model. This approach
allows non-correlated pairs to be removed from the model, which in turn reduces the number of
free parameters to be estimated.
The EM algorithm version for finite mixtures of multivariate Poisson distributions of Karlis (2003)
and Brijs et al. (2004) is used to estimate the parameters of the model. The latter authors derive a
modified version of the EM algorithm for the multivariate Poisson of Karlis (2003) for the case of
four variables; this algorithm is extended here to the case of m variables.
The rest of the paper is organized as follows. Section 2, with two subsections, presents theory
concerning the univariate and multivariate Poisson distributions. Section 3 consists of three
subsections and presents theory on finite mixture models, in particular finite mixtures of
multivariate Poisson distributions, the estimation of this mixture, and the expectation-maximization
algorithm. Section 4 is concerned with measures of classification and discusses the modified
likelihood ratio test, the caselikelihood and the proposed approach. Section 5 presents an empirical
study; it is composed of three subsections presenting, respectively, the data description, the
methods applied and the results obtained, with discussion. Finally, Section 6 presents a brief
conclusion.
2. Poisson models
2.1 Univariate Poisson Model
The most commonly used distributions for modelling count data are the binomial, the negative
binomial and the Poisson. Count data can be defined as data taking non-negative integer values.
Cameron and Trivedi (1996) discuss the Poisson model as the benchmark for modelling count
data. If the variable z follows a univariate Poisson distribution with parameter θ, then its
probability mass function (pmf) can be written as

Po(z; \theta) = \frac{e^{-\theta}\,\theta^{z}}{z!},

where the parameter θ is both the mean and the variance of the distribution. Karlis and
Meligkotsidou (2007) note that the Poisson distribution has played a prominent role in modelling
univariate count data. However, its multivariate counterpart has rarely been used; a reason for this
is the complexity associated with the multivariate distribution in applications.
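As a small numerical check (in R, the language used in the appendix), the pmf above can be evaluated directly or with the built-in dpois() function, and simulated draws have mean and variance both close to θ; the parameter value below is only illustrative.

# Univariate Poisson pmf: direct formula vs. built-in dpois()
theta <- 3.2
z <- 0:10
pmf_manual <- exp(-theta) * theta^z / factorial(z)
all.equal(pmf_manual, dpois(z, lambda = theta))   # TRUE

# mean and variance of simulated draws are both approximately theta
y <- rpois(1e5, theta)
c(mean = mean(y), var = var(y))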
2.2 Multivariate Poisson Model
Earlier approaches to dealing with multivariate count data were mostly based on approximations
with continuous models, e.g. the normal distribution. However, these approximations can be
misleading, since count data are often characterized by many zero counts and/or small counts.
Let Z = (Z_1, Z_2, \dots, Z_m)' denote a vector of discrete random variables, and suppose there is a
vector Y = (Y_1, Y_2, \dots, Y_q)' of independent random variables, each following a Poisson
distribution with parameter θ_k, that is Y_k ~ Po(θ_k), k = 1, \dots, q. Furthermore, consider a
sequence of m-dimensional binary vectors (Δ_1, Δ_2, \dots, Δ_q). Writing Z as

Z = \sum_{k=1}^{q} \Delta_k Y_k,

Z is distributed as an m-variate Poisson variable with parameter vector θ = (θ_1, θ_2, \dots, θ_q)', mean

E(Z; \theta) = E(\Delta Y; \theta) = \Delta\theta

and covariance matrix

\mathrm{var}(Z; \theta) = \mathrm{var}(\Delta Y; \theta) = \Delta \Sigma \Delta',

where Σ = diag(θ_1, θ_2, \dots, θ_q) and Δ is the m × q matrix with columns Δ_i, i = 1, \dots, q. Each
Z_j, j = 1, 2, \dots, m, marginally follows a univariate Poisson distribution (Brijs et al., 2002).
Karlis and Meligkotsidou (2005) state that a multivariate Poisson distribution in its fully
parameterized form can be constructed with a matrix Δ which can be split as

\tilde{\Delta} = (\tilde{\Delta}_1, \tilde{\Delta}_2, \dots, \tilde{\Delta}_m),

where each \tilde{\Delta}_r, r = 1, \dots, m, is a sub-matrix of dimension m × C_r^m with exactly r ones
and m − r zeros in each column; here C_r^m denotes the binomial coefficient. Redefining the
vector Y as \tilde{Y} = (\tilde{Y}_1, \tilde{Y}_2, \dots, \tilde{Y}_m)', where \tilde{Y}_j, j = 1, \dots, m, is a C_j^m-dimensional
column vector, Z can be written as

Z = \sum_{k=1}^{q} \Delta_k Y_k = \sum_{j=1}^{m} \tilde{\Delta}_j \tilde{Y}_j.
The structure of the matrix \tilde{\Delta} makes it possible to depict the covariance between each pair
Z_i and Z_j. Let, for instance, Z = (Z_1, Z_2, Z_3)' and Y = (Y_1, Y_2, Y_3, Y_{12}, Y_{13}, Y_{23}, Y_{123})',
where Y_{ij} (i < j) induces the covariance between (Z_i, Z_j) and Y_{123} the threefold covariance of
(Z_1, Z_2, Z_3). Define \tilde{Y} = [\tilde{Y}_1' \; \tilde{Y}_2' \; \tilde{Y}_3']', with \tilde{Y}_1 = [Y_1 \; Y_2 \; Y_3]',
\tilde{Y}_2 = [Y_{12} \; Y_{13} \; Y_{23}]' and \tilde{Y}_3 = [Y_{123}]'.
The Δ matrix then becomes

\Delta = \begin{pmatrix} 1 & 0 & 0 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 & 1 \end{pmatrix}.

Splitting this matrix into three sub-matrices yields

\tilde{\Delta}_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
\tilde{\Delta}_2 = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}, \quad
\tilde{\Delta}_3 = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},

and Z = \sum_{k=1}^{7} \Delta_k Y_k = \sum_{j=1}^{3} \tilde{\Delta}_j \tilde{Y}_j, whereby

Z_1 = Y_1 + Y_{12} + Y_{13} + Y_{123}
Z_2 = Y_2 + Y_{12} + Y_{23} + Y_{123}
Z_3 = Y_3 + Y_{13} + Y_{23} + Y_{123}.

Let θ = (θ_1, θ_2, θ_3, θ_{12}, θ_{13}, θ_{23}, θ_{123})'; then the mean of the trivariate Poisson vector Z is

E(Z) = \Delta\theta = \begin{pmatrix} \theta_1 + \theta_{12} + \theta_{13} + \theta_{123} \\ \theta_2 + \theta_{12} + \theta_{23} + \theta_{123} \\ \theta_3 + \theta_{13} + \theta_{23} + \theta_{123} \end{pmatrix}
and the covariance matrix equals

\mathrm{var}(Z) = \Delta \Sigma \Delta' =
\begin{pmatrix}
\theta_1 + \theta_{12} + \theta_{13} + \theta_{123} & \theta_{12} + \theta_{123} & \theta_{13} + \theta_{123} \\
\theta_{12} + \theta_{123} & \theta_2 + \theta_{12} + \theta_{23} + \theta_{123} & \theta_{23} + \theta_{123} \\
\theta_{13} + \theta_{123} & \theta_{23} + \theta_{123} & \theta_3 + \theta_{13} + \theta_{23} + \theta_{123}
\end{pmatrix},

where Σ = diag(θ_1, θ_2, θ_3, θ_{12}, θ_{13}, θ_{23}, θ_{123}). Here Σ is the covariance matrix of Y; it is
diagonal because the random variables Y_i, i = 1, \dots, q, are independent. The derivation so far
(Johnson et al., 1997) leads to the following trivariate pmf:
P(z_1, z_2, z_3) = \exp\big(-(\theta_1 + \theta_2 + \theta_3) + (\theta_{12} + \theta_{13} + \theta_{23}) - \theta_{123}\big)
\sum_{l=0}^{\min(z_1, z_2)} \sum_{m=0}^{\min(z_2, z_3)} \sum_{j=0}^{\min(z_1, z_3)} \sum_{i=0}^{\min(l, m, j)} H(z; \theta),

where

H(z; \theta) = \big\{ i!\,(l-i)!\,(j-i)!\,(m-i)!\,(z_1 - l - j + i)!\,(z_2 - l - m + i)!\,(z_3 - m - j + i)! \big\}^{-1}
\times \theta_{123}^{\,i}\,(\theta_{12} - \theta_{123})^{l-i}\,(\theta_{13} - \theta_{123})^{j-i}\,(\theta_{23} - \theta_{123})^{m-i}
\,(\theta_1 - \theta_{12} - \theta_{13} + \theta_{123})^{z_1 - l - j + i}\,(\theta_2 - \theta_{12} - \theta_{23} + \theta_{123})^{z_2 - l - m + i}\,(\theta_3 - \theta_{13} - \theta_{23} + \theta_{123})^{z_3 - m - j + i}.
The multivariate Poisson distribution has several shortcomings due to the complicated form of its
joint probability function (Brijs et al., 2002). Overcoming these shortcomings leads to simplified
approaches, for instance the assumption of a common covariance term for all variables of interest.
Under this framework Johnson et al. (1997) consider a sequence of univariate Poisson variables
Y = (Y_0, Y_1, \dots, Y_m)', each with parameter θ_j, j = 0, \dots, m, and the random vector
Z = (Z_1, \dots, Z_m)', where Z_k = Y_0 + Y_k, k = 1, \dots, m. Then Z is distributed as a multivariate
Poisson with joint probability function

P(z) = P(z_1, z_2, \dots, z_m) = \exp\Big(-\sum_{i=0}^{m}\theta_i\Big)\,\prod_{t=1}^{m}\frac{\theta_t^{z_t}}{z_t!}\,
\sum_{k=0}^{\min\{z_1,\dots,z_m\}}\,\prod_{l=1}^{m}\binom{z_l}{k}\,k!\left(\frac{\theta_0}{\prod_{h=1}^{m}\theta_h}\right)^{k}. \quad (1)

The marginal distribution of Z_k is a univariate Poisson with parameter θ_0 + θ_k, where θ_0 is the
common covariance term for all pairs of variables. Model (1) has been widely used by many
authors. Assuming a common covariance reduces the computational burden, but the assumption
makes the model less realistic.
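A minimal simulation sketch of this construction in R, with illustrative parameter values: each Z_k is built as Y_0 + Y_k, so the marginal means come out as θ_0 + θ_k and every pair of variables shares covariance θ_0.

# Simulate from the common-covariance multivariate Poisson: Z_k = Y0 + Y_k
set.seed(1)
n      <- 1e5
theta0 <- 0.5                    # common covariance term
theta  <- c(2, 1, 3)             # theta_1, ..., theta_m

Y0 <- rpois(n, theta0)
Z  <- sapply(theta, function(th) Y0 + rpois(n, th))

colMeans(Z)   # approximately theta0 + theta: 2.5, 1.5, 3.5
cov(Z)        # off-diagonal entries approximately theta0 = 0.5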
Another approach is to consider only pairwise dependence between the variables, which implies
excluding the third- and higher-order interactions from the Δ matrix. Following the previous
three-variable example, the Δ matrix as shown by Brijs et al. (2002) becomes

\Delta = \begin{pmatrix} 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 \end{pmatrix},

so that

Z_1 = Y_1 + Y_{12} + Y_{13}
Z_2 = Y_2 + Y_{12} + Y_{23}   (2)
Z_3 = Y_3 + Y_{13} + Y_{23}

with Y = (Y_1, Y_2, Y_3, Y_{12}, Y_{13}, Y_{23})'.
The computational burden associated with model (1) and with the model generated in (2) is high,
especially when the dimension of the variables and/or the size of the database increases. Karlis
(2003) and Johnson et al. (1997) suggest the recurrence relations presented in Kawamura (1973,
1985, 1987) and Kano and Kawamura (1991) for the computation of the multivariate Poisson
density. The multivariate Poisson model satisfies the following recurrence relations:

z_i \, P(z) = \theta_i \, P(z - \pi_i) + \theta_0 \, P(z - \iota), \quad i = 1, \dots, m,

and

P(z_1, \dots, z_p, 0, \dots, 0) = P\Big(z - \sum_{i=1}^{p}\pi_i\Big)\,\prod_{j=1}^{p}\frac{\theta_j}{z_j}, \quad p = 1, \dots, m - 1,

where π_i is a vector of zeros except for the i-th element, which is 1, and ι is a vector with all
elements equal to 1. The order of the z_i's and 0's can be interchanged to cover all cases, and
P(0) = \exp(-\sum_{j=0}^{m}\theta_j). When \min(z_1, \dots, z_m) = 0,

P(z) = \exp\Big(-\sum_{j=0}^{m}\theta_j\Big)\,\prod_{j=1}^{m}\frac{\theta_j^{z_j}}{z_j!}.
The recurrence relations presented so far correspond to the general scheme; the following is the
scheme for the bivariate case, which is the case required in this paper. Detailed schemes for higher
dimensions can be found in Kawamura (1973, 1985, 1987).
The bivariate Poisson, as presented in Kawamura (1985), can be written as

P(z_1, z_2) = \sum_{i=0}^{\min(z_1, z_2)} Po(z_1 - i; \theta_1)\, Po(z_2 - i; \theta_2)\, Po(i; \theta_{12})
            = \sum_{i=0}^{\min(z_1, z_2)} \exp(-\theta_1 - \theta_2 - \theta_{12})\,\frac{\theta_1^{z_1 - i}\,\theta_2^{z_2 - i}\,\theta_{12}^{i}}{(z_1 - i)!\,(z_2 - i)!\,i!}.
This density satisfies the following recurrence equations:

z_1 = z_2 = 0:        P(0, 0) = \exp(-\theta_1 - \theta_2 - \theta_{12})
z_1 > 0, z_2 = 0:     z_1\,P(z_1, 0) = \theta_1\,P(z_1 - 1, 0)
z_1 = 0, z_2 > 0:     z_2\,P(0, z_2) = \theta_2\,P(0, z_2 - 1)
z_1, z_2 > 0:         z_1\,P(z_1, z_2) = \theta_1\,P(z_1 - 1, z_2) + \theta_{12}\,P(z_1 - 1, z_2 - 1)
z_1, z_2 > 0:         z_2\,P(z_1, z_2) = \theta_2\,P(z_1, z_2 - 1) + \theta_{12}\,P(z_1 - 1, z_2 - 1)

These recurrence relations reduce the complexity of computing the probability mass function of
the multivariate Poisson, even though a high level of complexity remains.
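The following R sketch (function names are illustrative) evaluates the bivariate Poisson pmf both by the direct summation formula and by building it up with the recurrence relations above; the two routes give the same value.

# Bivariate Poisson pmf by direct summation
dbvpois <- function(z1, z2, t1, t2, t12) {
  k <- 0:min(z1, z2)
  sum(exp(-t1 - t2 - t12) * t1^(z1 - k) * t2^(z2 - k) * t12^k /
        (factorial(z1 - k) * factorial(z2 - k) * factorial(k)))
}

# The same pmf obtained with the recurrence relations
dbvpois_rec <- function(z1, z2, t1, t2, t12) {
  P <- matrix(0, z1 + 1, z2 + 1)                  # P[i + 1, j + 1] = P(i, j)
  P[1, 1] <- exp(-t1 - t2 - t12)
  if (z1 > 0) for (i in 1:z1) P[i + 1, 1] <- t1 * P[i, 1] / i
  if (z2 > 0) for (j in 1:z2) P[1, j + 1] <- t2 * P[1, j] / j
  if (z1 > 0 && z2 > 0)
    for (i in 1:z1) for (j in 1:z2)
      P[i + 1, j + 1] <- (t1 * P[i, j + 1] + t12 * P[i, j]) / i
  P[z1 + 1, z2 + 1]
}

dbvpois(3, 2, 1.2, 0.8, 0.4)       # direct sum
dbvpois_rec(3, 2, 1.2, 0.8, 0.4)   # recurrence, same value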
The Δ matrix in (2) can be represented as Δ = (Δ_1, Δ_2), where Δ_1 is an m × m identity matrix
and Δ_2 is an m × C_m^2 matrix with exactly two ones and m − 2 zeros in each column. In general
its structure is

\Delta = \big(\, I_m \;\big|\; \Delta_2 \,\big), \quad (3)

an m × (m + C_m^2) matrix composed of the identity block and the block of all pairwise indicator
columns.
The product of (3) with an (m + C_m^2)-dimensional vector of independent Poisson random
variables Y = (Y_1, Y_2, \dots, Y_m, Y_{12}, Y_{13}, \dots, Y_{1m}, Y_{23}, Y_{24}, \dots, Y_{2m}, \dots, Y_{m-1,m})',
where Y_i ~ Po(θ_i), i = 1, \dots, m, and Y_{ij} ~ Po(θ_{ij}), i = 1, \dots, m-1, j = 2, \dots, m, i < j, leads to
the following multivariate Poisson random variable:

Z_1 = Y_1 + Y_{12} + Y_{13} + \dots + Y_{1m}
Z_2 = Y_2 + Y_{12} + Y_{23} + \dots + Y_{2m}
........................................
Z_i = Y_i + Y_{1i} + Y_{2i} + \dots + Y_{im}
........................................
Z_m = Y_m + Y_{1m} + Y_{2m} + \dots + Y_{m-1,m}
Since the above structure assumes only pairwise interactions between the variables, it simplifies
the general structure of the multivariate Poisson (Brijs et al., 2002), which can be reduced to a
product of bivariate Poisson models:

P(z_1, z_2, \dots, z_m) = \prod_{i<j} P(z_i, z_j), \quad i = 1, \dots, m-1, \; j = i+1, \dots, m,

where

P(z_i, z_j) = \sum_{k=0}^{\min(z_i, z_j)} \exp(-\theta_i - \theta_j - \theta_{ij})\,\frac{\theta_i^{z_i - k}\,\theta_j^{z_j - k}\,\theta_{ij}^{k}}{(z_i - k)!\,(z_j - k)!\,k!}.

Thus

P(z_1, z_2, \dots, z_m) = \prod_{i<j} \sum_{k=0}^{\min(z_i, z_j)} \exp(-\theta_i - \theta_j - \theta_{ij})\,\frac{\theta_i^{z_i - k}\,\theta_j^{z_j - k}\,\theta_{ij}^{k}}{(z_i - k)!\,(z_j - k)!\,k!}.

The usefulness of this model lies in the fact that its complexity can be greatly reduced by analysing
the interaction effects between the variables of interest (Brijs et al., 2004). Through log-linear
analysis, for instance, non-significant interactions can be used to cancel out free parameters in the
covariance structure of the multivariate Poisson.
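A sketch in R of this pairwise (restricted covariance) construction, with illustrative function and parameter names: the joint pmf is computed as the product of bivariate Poisson pmfs over all pairs i < j, exactly as written above, and a pair judged non-significant simply has its θ_ij set to zero.

# Bivariate Poisson pmf (direct summation), used as a building block
dbvpois <- function(z1, z2, t1, t2, t12) {
  k <- 0:min(z1, z2)
  sum(exp(-t1 - t2 - t12) * t1^(z1 - k) * t2^(z2 - k) * t12^k /
        (factorial(z1 - k) * factorial(z2 - k) * factorial(k)))
}

# Pairwise (restricted covariance) multivariate Poisson pmf
# theta  : vector of main parameters theta_1, ..., theta_m
# theta2 : symmetric m x m matrix of pairwise parameters theta_ij
#          (set an entry to 0 to drop a non-significant pair)
dpairpois <- function(z, theta, theta2) {
  m <- length(z)
  p <- 1
  for (i in 1:(m - 1))
    for (j in (i + 1):m)
      p <- p * dbvpois(z[i], z[j], theta[i], theta[j], theta2[i, j])
  p
}

theta  <- c(1.5, 0.7, 2.0)
theta2 <- matrix(c(0,   0.3, 0,
                   0.3, 0,   0.2,
                   0,   0.2, 0), 3, 3, byrow = TRUE)
dpairpois(c(2, 1, 3), theta, theta2)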
3. Finite Mixture Models (FMM)
A finite mixture model is a probabilistic model for density estimation using a mixture distribution.
It is a widely used approach for modelling data that exhibit heterogeneity (Rouse, 2005). The
advantage of mixture models lies in the fact that they can model quite complex distributions
through an effective choice of components that accurately represent local areas of support of the
true distribution. The components of the mixture are chosen for their tractability, so that inferences
about the phenomenon being modelled can easily be made.
Let Z be a random variable with density function f(z). A finite mixture model (Frühwirth-Schnatter,
2006; Schlattmann, 2009; McLachlan and Peel, 2000) decomposes the density f into a sum of L
component probability density functions. Denoting by f_l the l-th component density, the finite
mixture density can be written as

f(z) = \sum_{l=1}^{L} \upsilon_l f_l(z),

where the υ_l are mixing proportions with υ_l ≥ 0 and \sum_{l=1}^{L}\upsilon_l = 1. If Z is a scalar
random variable with density function f(z; Ψ) parameterized by the vector Ψ, and z is the
corresponding realization of Z, then Z is an outcome of an L-component finite mixture distribution
if its density can be written as

f(z; \Psi) = \sum_{l=1}^{L} \upsilon_l f_l(z; \theta_l),

where θ_l is the parameter of the l-th pdf.
3.1 Finite mixture of multivariate Poisson distribution
A frequent problem in applications of the Poisson model is overdispersion. The model assumes
equal mean and variance, but in many cases the sample variance is greater than the sample mean,
giving rise to the phenomenon known as overdispersion. The presence of overdispersion is an
indication that a simple Poisson model (univariate or multivariate) is not appropriate for the data.
One alternative is a mixture distribution, since a mixture takes the extra variability into account.
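A quick numerical illustration of this point in R, with made-up parameter values: data drawn from a two-component Poisson mixture have a sample variance clearly larger than the sample mean, whereas a single Poisson would have them equal.

# Overdispersion: a Poisson mixture has variance greater than its mean
set.seed(2)
n    <- 1e5
comp <- rbinom(n, 1, 0.3)                          # mixing proportion 0.3
z    <- ifelse(comp == 1, rpois(n, 10), rpois(n, 2))
c(mean = mean(z), var = var(z))                    # roughly 4.4 vs 17.8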
Under (2) and with Z = (Z_1, Z_2, \dots, Z_m)' defined as before, the parameterized probability mass
function is

P(z \mid \theta) = \prod_{i<j} P(z_i, z_j \mid \theta_i, \theta_j, \theta_{ij}).

Representing the pmf as a mixture of L pmfs of the multivariate Poisson results in

P(z \mid \Psi) = \sum_{l=1}^{L} \upsilon_l \prod_{i<j} P(z_i, z_j \mid \theta_{il}, \theta_{jl}, \theta_{ijl}), \qquad
\Psi = (\upsilon_1, \upsilon_2, \dots, \upsilon_{L-1}, \theta_1', \theta_2', \dots, \theta_L')',

where the υ_l, as defined above, are the mixing proportions and
θ_l = (θ_{1l}, θ_{2l}, \dots, θ_{ml}, θ_{12l}, θ_{13l}, \dots, θ_{1ml}, θ_{23l}, θ_{24l}, \dots, θ_{(m-1)ml})' are the
component parameters. The full mixture model is

P(z \mid \Psi) = \sum_{l=1}^{L} \upsilon_l \prod_{i<j} \sum_{k=0}^{\min(z_i, z_j)} \exp(-\theta_{il} - \theta_{jl} - \theta_{ijl})\,\frac{\theta_{il}^{z_i - k}\,\theta_{jl}^{z_j - k}\,\theta_{ijl}^{k}}{(z_i - k)!\,(z_j - k)!\,k!}. \quad (4)

The expected mean and variance of this mixture model (Karlis and Meligkotsidou, 2007) are,
respectively,

E(Z) = \sum_{l=1}^{L} \upsilon_l \Delta \theta_l

and

\mathrm{Var}(Z) = \Delta\left[\sum_{l=1}^{L}\upsilon_l(\Sigma_l + \theta_l\theta_l') - \left(\sum_{l=1}^{L}\upsilon_l\theta_l\right)\left(\sum_{l=1}^{L}\upsilon_l\theta_l\right)'\right]\Delta',

where Σ_l = diag(θ_{1l}, θ_{2l}, \dots, θ_{ql}) and Δ is as defined in (3).
3.2 Estimation of FMM
There are a number of approaches to the estimation of mixture distributions, including graphical
methods, the method of moments, minimum-distance methods, maximum likelihood and Bayesian
approaches. This variety is motivated by the fact that explicit formulas for the parameter estimates
are not available (Titterington et al., 1985). Maximum likelihood estimation via the EM algorithm
is the most widely used approach.
Let an unobservable L × 1 random vector X_n be associated with observation z_n, and let x_n be a
realization of X_n. The vector x_n takes the value one in its l-th element if z_n belongs to component
l, and all remaining elements of x_n are zero. The probability that z_n comes from the l-th
component is υ_l; X_n is said to come from a multinomial distribution with one trial and L outcomes
with respective probabilities υ_l. Its probability mass function can be written as

P(x_n) = \prod_{l=1}^{L} \upsilon_l^{[x_n]_l},

where [x_n]_l is the l-th element of x_n. Then f(z; Ψ) and f_l(z; θ_l) are, respectively, the unconditional
density of Z and the conditional density of Z given [X_n]_l = 1. The posterior probability that Z
belongs to the l-th component given z and Ψ is

\omega_l(z; \Psi) = P([X]_l = 1 \mid z, \Psi) = \frac{\upsilon_l f_l(z; \theta_l)}{\sum_{q=1}^{L} \upsilon_q f_q(z; \theta_q)}.

Here υ_l is the prior probability that z belongs to the l-th class of the mixture. McLachlan and Peel
(2000) refer to Z as the incomplete data and to Z_c = (Z, X) as the complete data. The complete-data
log-likelihood is

\log(L_c(\Psi)) = \sum_{n=1}^{N}\sum_{l=1}^{L}[x_n]_l \log(\upsilon_l f_l(z_n; \theta_l)).

Solving \hat{\Psi} = \arg\max_{\Psi} \log(L_c(\Psi)) yields the maximum likelihood estimator (MLE) of
the parameter vector Ψ. In general the problem of finding \hat{\Psi} is hard, since explicit analytical
expressions are lacking.
3.3 The expectation-maximization (EM) algorithm
The EM algorithm is an important tool for the estimation of complex models, but the ordinary EM
algorithm of Dempster et al. (1977) still needs adaptation for particular models, since the generic
procedure makes the computations tedious, especially for models involving high-dimensional data
sets. Schlattmann (2009) points to this as the reason why its application and modification for finite
mixture models is still an active area of research. Karlis (2003) developed a version of the EM
algorithm for the multivariate Poisson distribution, and Brijs et al. (2004) extended this version to
a mixture of multivariate Poisson distributions.
In the present case of a mixture of multivariate Poissons, the complete data are Z_c = (Z, Y, X),
where Z = (Z_1, \dots, Z_m) is the observed data, X_n = (X_{n1}, \dots, X_{nL}) indicates component
membership, and Y_{nl} = (Y_{n1l}, \dots, Y_{nql}) is the component-specific latent variable, with l
indexing the component and Y the latent variable used in (2). The complete log-likelihood is

\log(L_c(\Psi)) = \sum_{n=1}^{N}\sum_{l=1}^{L}[x_n]_l \log(\upsilon_l f(y_{nl}; \theta_l))
= \sum_{n=1}^{N}\sum_{l=1}^{L}[x_n]_l \log\upsilon_l + \sum_{n=1}^{N}\sum_{l=1}^{L}[x_n]_l \sum_{p=1}^{q}\log\!\left(\frac{\exp(-\theta_{pl})\,\theta_{pl}^{y_{npl}}}{y_{npl}!}\right)
= \sum_{n=1}^{N}\sum_{l=1}^{L}[x_n]_l \log\upsilon_l + \sum_{n=1}^{N}\sum_{l=1}^{L}[x_n]_l \sum_{p=1}^{q}\big(-\theta_{pl} + y_{npl}\log\theta_{pl} - \log y_{npl}!\big).

Selecting suitable starting values Ψ^{(η-1)} = (υ_1^{(η-1)}, υ_2^{(η-1)}, \dots, υ_L^{(η-1)}, θ^{(η-1)})', the EM
algorithm first (E-step) computes the conditional expected value of the indicator variables X, that
is, ω_n = E(X_n | Z_n, Ψ^{(η-1)}). Under the model arising from (2), with Ψ = Ψ^{(η-1)}, this yields
\omega_{nl} = \frac{\upsilon_l^{(\eta-1)} f_l(z_n; \theta_l^{(\eta-1)})}{\sum_{q=1}^{L}\upsilon_q^{(\eta-1)} f_q(z_n; \theta_q^{(\eta-1)})},
\qquad
f_l(z_n; \theta_l^{(\eta-1)}) = \prod_{i<j}\sum_{k=0}^{\min(z_{in},z_{jn})}\exp\big(-\theta_{il}^{(\eta-1)}-\theta_{jl}^{(\eta-1)}-\theta_{ijl}^{(\eta-1)}\big)\,
\frac{(\theta_{il}^{(\eta-1)})^{z_{in}-k}\,(\theta_{jl}^{(\eta-1)})^{z_{jn}-k}\,(\theta_{ijl}^{(\eta-1)})^{k}}{(z_{in}-k)!\,(z_{jn}-k)!\,k!}.
This is followed by the computation of the expected value of the latent variables Y, given the
observation's group membership and the current parameters:

\varepsilon_{ijnl} = E(Y_{ijnl} \mid z_n, [x_n]_l = 1, \Psi^{(\eta-1)})
= \sum_{k=0}^{\min(z_{in}, z_{jn})} k\, P(Y_{ijnl} = k \mid z_{in}, z_{jn}, [x_n]_l = 1, \Psi^{(\eta-1)})
= \sum_{k=0}^{\min(z_{in}, z_{jn})} k\, \frac{P(Y_{ijnl} = k, z_{in}, z_{jn} \mid [x_n]_l = 1, \Psi^{(\eta-1)})}{P(z_{in}, z_{jn} \mid [x_n]_l = 1, \Psi^{(\eta-1)})}
= \frac{\sum_{k=0}^{\min(z_{in}, z_{jn})} k\, Po(z_{in} - k \mid \theta_{il}^{(\eta-1)})\, Po(z_{jn} - k \mid \theta_{jl}^{(\eta-1)})\, Po(k \mid \theta_{ijl}^{(\eta-1)})}{P(z_{in}, z_{jn} \mid \theta_{il}^{(\eta-1)}, \theta_{jl}^{(\eta-1)}, \theta_{ijl}^{(\eta-1)})},

for i = 1, \dots, m-1, j = 2, \dots, m, i < j, where both numerator and denominator are evaluated with
the bivariate Poisson pmf at the current parameters, and

\varepsilon_{inl} = E(Y_{inl} \mid z_n, [x_n]_l = 1, \Psi^{(\eta-1)}) = z_{in} - \sum_{a \in \varpi(i)} \varepsilon_{ianl}, \quad i = 1, \dots, m,

where \varpi(i) = \{a : \varepsilon_{ianl} \text{ exists}\}. Define
ε = (ε_1, ε_2, \dots, ε_m, ε_{12}, ε_{13}, \dots, ε_{1m}, ε_{23}, ε_{24}, \dots, ε_{m-1,m}).
The second step (M-step) updates the parameters:

\upsilon_l = \frac{\sum_{n=1}^{N}\omega_{nl}}{N} \qquad \text{and} \qquad
\theta_{hl} = \frac{\sum_{n=1}^{N}\omega_{nl}\,\varepsilon_{hnl}}{\sum_{n=1}^{N}\omega_{nl}}, \qquad l = 1, \dots, L, \; h \in \varepsilon.

If some convergence criterion is satisfied, stop the algorithm; otherwise return to the E-step.
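To make the E- and M-steps concrete, the following R sketch implements the algorithm for the simplest case covered by (2), a mixture of L bivariate Poisson components; the starting values and the fixed number of iterations are only illustrative, and this is a simplified sketch rather than the full m-variate implementation used later.

# EM for a mixture of L bivariate Poisson components (illustrative sketch)
dbvpois <- function(z1, z2, t1, t2, t12) {
  k <- 0:min(z1, z2)
  sum(exp(-t1 - t2 - t12) * t1^(z1 - k) * t2^(z2 - k) * t12^k /
        (factorial(z1 - k) * factorial(z2 - k) * factorial(k)))
}
# conditional expectation of the latent pair variable Y12 given (z1, z2)
e_y12 <- function(z1, z2, t1, t2, t12) {
  k <- 0:min(z1, z2)
  w <- dpois(z1 - k, t1) * dpois(z2 - k, t2) * dpois(k, t12)
  sum(k * w) / sum(w)
}

em_mix_bvpois <- function(Z, L, iter = 200) {
  n   <- nrow(Z)
  ups <- rep(1 / L, L)                                  # mixing proportions
  th  <- matrix(runif(3 * L, 0.5, 3), L, 3)             # columns: t1, t2, t12
  for (it in 1:iter) {
    ## E-step: posterior component probabilities (omega) and latent expectations
    f <- sapply(1:L, function(l)
      mapply(dbvpois, Z[, 1], Z[, 2],
             MoreArgs = list(t1 = th[l, 1], t2 = th[l, 2], t12 = th[l, 3])))
    omega <- sweep(f, 2, ups, "*")
    omega <- omega / rowSums(omega)
    eps12 <- sapply(1:L, function(l)
      mapply(e_y12, Z[, 1], Z[, 2],
             MoreArgs = list(t1 = th[l, 1], t2 = th[l, 2], t12 = th[l, 3])))
    ## M-step: update weights and component parameters
    ups <- colMeans(omega)
    for (l in 1:L) {
      th[l, 3] <- sum(omega[, l] * eps12[, l]) / sum(omega[, l])
      th[l, 1] <- sum(omega[, l] * (Z[, 1] - eps12[, l])) / sum(omega[, l])
      th[l, 2] <- sum(omega[, l] * (Z[, 2] - eps12[, l])) / sum(omega[, l])
    }
  }
  list(weights = ups, theta = th)
}

# usage on toy data:
# Z <- cbind(rpois(500, 2), rpois(500, 3))
# em_mix_bvpois(Z, L = 2)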
If model (1) is employed, the EM algorithm takes a slightly different form due to the structure of
the model. The conditional expectation of the indicator variables X becomes

\omega_{nl} = \frac{\upsilon_l^{(\eta-1)} f_l(z_n; \theta_l^{(\eta-1)})}{\sum_{q=1}^{L}\upsilon_q^{(\eta-1)} f_q(z_n; \theta_q^{(\eta-1)})},

where, from (1),

f_l(z_n; \theta_l^{(\eta-1)}) = \exp\Big(-\sum_{i=0}^{m}\theta_{il}^{(\eta-1)}\Big)\,\prod_{t=1}^{m}\frac{(\theta_{tl}^{(\eta-1)})^{z_{tn}}}{z_{tn}!}\,
\sum_{k=0}^{\min\{z_{1n},\dots,z_{mn}\}}\,\prod_{p=1}^{m}\binom{z_{pn}}{k}\,k!\left(\frac{\theta_{0l}^{(\eta-1)}}{\prod_{h=1}^{m}\theta_{hl}^{(\eta-1)}}\right)^{k}.
The expected value of the latent variable Y_0, given the observation's group membership and the
current parameters, is

\varepsilon_{0nl} = E(Y_{0nl} \mid z_n, [x_n]_l = 1, \Psi^{(\eta-1)})
= \sum_{k=0}^{\min(z_{1n},\dots,z_{mn})} k\, P(Y_{0nl} = k \mid z_{1n}, \dots, z_{mn}, [x_n]_l = 1, \Psi^{(\eta-1)})
= \sum_{k=0}^{\min(z_{1n},\dots,z_{mn})} k\, \frac{P(Y_{0nl} = k, z_{1n}, \dots, z_{mn} \mid [x_n]_l = 1, \Psi^{(\eta-1)})}{P(z_{1n}, \dots, z_{mn} \mid [x_n]_l = 1, \Psi^{(\eta-1)})}
= \frac{\sum_{k=0}^{\min(z_{1n},\dots,z_{mn})} k \prod_{i=1}^{m} Po(z_{in} - k \mid \theta_{il}^{(\eta-1)})\, Po(k \mid \theta_{0l}^{(\eta-1)})}{P(z_{1n}, \dots, z_{mn} \mid \theta_{0l}^{(\eta-1)}, \dots, \theta_{ml}^{(\eta-1)})},

with the denominator given by the pmf in (1) evaluated at the current parameters. Then

\varepsilon_{inl} = E(Y_{inl} \mid z_n, [x_n]_l = 1, \Psi^{(\eta-1)}) = z_{in} - \varepsilon_{0nl}, \quad i = 1, \dots, m,

and ε = (ε_0, ε_1, ε_2, \dots, ε_m). The second step (M-step) updates the parameters:

\upsilon_l = \frac{\sum_{n=1}^{N}\omega_{nl}}{N} \qquad \text{and} \qquad
\theta_{hl} = \frac{\sum_{n=1}^{N}\omega_{nl}\,\varepsilon_{hnl}}{\sum_{n=1}^{N}\omega_{nl}}, \qquad l = 1, \dots, L, \; h \in \varepsilon.

If some convergence criterion is satisfied, stop the algorithm; otherwise return to the E-step.
4. Measures of classification of observed values
The literature provides numerous measures for classifying observations as anomalous or normal.
The modified likelihood ratio test, the caselikelihood and the proposed approach based on ordering
principles are the techniques considered in this paper for anomaly detection.
4.1 Modified likelihood ratio test
Wang et al. (1997) suggested the modified likelihood ratio test, based on the value of the likelihood
ratio statistic λ for testing the null hypothesis H_0: z_new ∈ P against the alternative hypothesis
H_1: z_new ∉ P, where P is the population from which the training data were obtained and z_new is
a new observation. If g(z_new; φ) is the density of z_new under the alternative hypothesis, there is
only one observation, z_new, from which to estimate φ, since φ is not functionally related to Ψ. One
cannot therefore estimate g(z_new; φ) reasonably; a nonparametric strategy could in principle be
used, but since there is a single observation, g(z_new; φ) is constant in the maximization (Woodward
and Sain, 2003). The constant g(z_new; φ) can be dropped, giving rise to the modified version λ̃ of
λ, known as the modified likelihood ratio statistic:
\tilde{\lambda} = \frac{\sup_{\theta \in \Psi} L_{new}(\theta)}{\sup_{\theta \in \Psi} L(\theta)}
= \frac{\sup_{\theta \in \Psi} \prod_{i=1}^{n} f(z_i; \theta)\, f(z_{new}; \theta)}{\sup_{\theta \in \Psi} \prod_{i=1}^{n} f(z_i; \theta)},
where θ̂ is the MLE for the mixture density obtained from the training data set z = (z_1, z_2, \dots, z_n).
Inspection of this formula shows that under the alternative hypothesis λ̃ is expected to be small,
which indicates that the new observation does not come from the same mixture distribution. The
null distribution of λ̃ can be assessed using bootstrap methods with either parametric or
nonparametric resampling. This bootstrap approach relies on the fact that asymptotically, i.e. as
n → ∞, λ̃ ≈ f(z_new; θ̂) (Wang et al., 1997); this result is important in the sense that it makes it
feasible to test for anomalous observations in large databases. The threshold for λ̃, denoted here
λ̃_C, is the (100c)-th percentile of λ̃*_b, the λ̃ statistic obtained with the b-th bootstrap sample. Wang
et al. (1997) state that if α is the desired significance level for the test, then α = p/(B+1), where B
is the number of bootstrap samples, and λ̃_C is taken to be the p-th smallest value of λ̃*_b, b = 1, ..., B.
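An outline of this screening in R, leaning on the asymptotic result λ̃ ≈ f(z_new; θ̂): for brevity the fitted mixture is taken to be a two-component univariate Poisson mixture with assumed (illustrative) parameter values, B parametric bootstrap draws from the fitted model give the reference distribution, and the p-th smallest density is used as the threshold λ̃_C.

# Modified likelihood ratio screening via its large-sample form:
# flag z_new if its fitted mixture density falls below a bootstrap threshold.
set.seed(3)
ups   <- c(0.7, 0.3)     # assumed fitted mixing proportions (illustrative)
theta <- c(2, 9)         # assumed fitted component means (illustrative)

dmix <- function(z) ups[1] * dpois(z, theta[1]) + ups[2] * dpois(z, theta[2])

B     <- 9999
comp  <- sample(1:2, B, replace = TRUE, prob = ups)   # parametric bootstrap
zboot <- rpois(B, theta[comp])
lamb  <- dmix(zboot)                                  # bootstrap lambda-tilde values

alpha    <- 0.01
p        <- round(alpha * (B + 1))                    # alpha = p / (B + 1)
lambda_C <- sort(lamb)[p]                             # threshold: p-th smallest value

z_new <- 25
dmix(z_new) <= lambda_C     # TRUE here, so z_new is classified as suspicious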
4.2 The Caselikelihood
As with the likelihood ratio statistic, the caselikelihood (Tang and MacLennan, 2005) is another
measure that can be used for anomaly detection. It returns a nonzero probability value and works
in two different frameworks, normalized and non-normalized. In the non-normalized framework
the returned value is the product of the single probabilities corresponding to each variable value
of the suspicious observation. As the number of variables increases this product tends to decrease,
which makes it hard to interpret. To overcome this issue, in the normalized framework the
probability of the observation under the model learned by the algorithm is divided by the
probability computed without the model, using raw statistics.
The approach can be used as follows: compute the statistic

\pi = \frac{\eta(\tilde{z})}{\eta(z) + \frac{1}{n}},

where η(z) and η(z̃) are the likelihoods computed with the training data only and with the training
data together with the new observation, respectively, and 1/n represents the empirical probability
of the new observation. This statistic returns a nonzero probability value such that a value greater
than 0.5 is a strong indication that the new observation is distributed as the training data, while a
value less than 0.5 indicates that the new observation is not distributed as the training data.
4.3 The reduced sub-ordering method
An important principle for expressing the features of data is ordering. Ordering principles can be
used for ease, speed and efficiency in statistical analysis. Features such as extreme values,
variability and contamination may be detected via an ordering principle applied to the data. For
univariate data (Barnett, 1976) the principle of ordering is clear and unambiguous: let
z_1, z_2, \dots, z_n be the realizations of the random variables Z_1, Z_2, \dots, Z_n; then
z_{(1)}, z_{(2)}, \dots, z_{(n)} can be regarded as realizations of the random variables
Z_{(1)}, Z_{(2)}, \dots, Z_{(n)}, an ordering of Z_1, Z_2, \dots, Z_n according to a specific rule (e.g.
z_{(1)} < z_{(2)} < \dots < z_{(n)}). The ordering principles for univariate data do not in general carry
over to multivariate data sets, where there are no natural order properties. However, some effort
has resulted in several types of ordering principles termed sub-ordering principles; among them,
marginal ordering, reduced ordering, partial ordering and conditional ordering stand out.
For the purposes of this paper, reduced ordering is the suitable method. It has two phases: first,
each multivariate observation z_i, i = 1, \dots, n, is transformed into a scalar c_i, yielding n univariate
observations; this transformation is usually made with a distance metric (Laurikkala et al., 2000).
Univariate ordering principles can then be applied to the transformed data. In this paper the metric
is the Mahalanobis distance

c_i^2 = (z_i - \omega)'\,\Pi^{-1}\,(z_i - \omega),

where the parameter ω is the population mean vector and Π is the covariance matrix; these
parameters can be replaced by their sample counterparts. This distance has the appealing property
that it incorporates the dependencies between attributes, which is important for anomaly
identification in multivariate data. Another important property is that the units of the observations
have no influence on the estimated distance, since each attribute is effectively standardized to zero
mean and unit variance.
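In R the squared Mahalanobis distance is available directly through the base function mahalanobis(); a minimal sketch on toy data, plugging the sample mean and covariance in for ω and Π:

# Squared Mahalanobis distance of each observation to the sample centre
set.seed(4)
Z  <- cbind(rpois(200, 5), rpois(200, 3))        # toy bivariate count data
d2 <- mahalanobis(Z, center = colMeans(Z), cov = cov(Z))
head(sort(d2, decreasing = TRUE))                # largest distances = most extreme rows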
Gamma-type probability plots can be used for anomaly identification with the Mahalanobis
distance (Laurikkala et al., 2000), but this requires the observations to follow a multivariate normal
distribution. Since that is not the case here, the transformed data are instead used informally for
anomaly identification through box plots. The boxplot is a graphical method for visualizing data
features such as symmetry, extremeness and unusual data points, and it uses objective rules for
identifying unusual observations. The main contribution of this paper is the following: the mixture
model is used to cluster the observations; the mean and covariance are then obtained for each
cluster, followed by the computation of the Mahalanobis distance from each observation to each
of the clusters. Based on the obtained distances, observations are assigned to the cluster at smallest
distance, giving rise to a new grouping. Boxplots are then drawn for each group of data and
suspicious observations are visualized.
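A schematic R version of this proposal, assuming a data matrix Z and a vector cl of cluster labels obtained from the fitted mixture (both names are illustrative): compute each cluster's mean and covariance, the Mahalanobis distance of every observation to every cluster, reassign each observation to its nearest cluster, and draw boxplots of the distances per new group.

# Proposed reduced sub-ordering step, given mixture-based cluster labels `cl`
mahal_to_clusters <- function(Z, cl) {
  sapply(sort(unique(cl)), function(g) {
    Zg <- Z[cl == g, , drop = FALSE]
    mahalanobis(Z, center = colMeans(Zg), cov = cov(Zg))
  })
}

# toy data and labels standing in for the mixture clustering
set.seed(5)
Z  <- rbind(cbind(rpois(150, 2),  rpois(150, 1)),
            cbind(rpois(150, 15), rpois(150, 10)))
cl <- rep(1:2, each = 150)

D     <- mahal_to_clusters(Z, cl)      # n x (number of clusters) distance matrix
newcl <- apply(D, 1, which.min)        # reassign each observation to the nearest cluster
dmin  <- D[cbind(seq_len(nrow(D)), newcl)]
boxplot(dmin ~ newcl, xlab = "cluster", ylab = "squared Mahalanobis distance")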
5. Application
5.1 Data description
The dataset contains observations from a labour market survey conducted at Statistics Sweden
concerning employment, job vacancies and wages. The survey is conducted each quarter using
stratified sampling, with stratification by industry and size of establishment (number of employees
according to the business register). The sampling frame is constructed from the Swedish business
register. Establishments with more than 100 employees are selected with probability one and
report each month in the quarter. Establishments in strata with fewer than 100 employees are
selected using an SI design and report only for a randomly selected month in the quarter. In total
nearly 19000 establishments are sampled each quarter. The non-response rate is about 7%,
depending on the quarter. Data are collected with a mail questionnaire, with the option of using an
internet questionnaire. For this study the data for the fourth quarter of 2008 were used.
Among the several variables reported, four variables are selected here:
Number of persons with posts with conditional tenure (male (Z_1)/female (Z_2)), also called
long-term employees.
Number of persons employed for a limited period (male (Z_3)/female (Z_4)), also called short-term
employees.
5.2 Methods
The dataset described above totals 22763 observations divided into two groups: group A,
composed of companies with more than 100 employees, and group B, composed of companies
with fewer than 100 employees. Group B, with a total of 13745 observations, is used here. Two
versions of the dataset are available, the original and the edited version; based on a comparison of
the edited and unedited sets there are 658 measurement errors. There are 515 missing observations
in group B, corresponding to 3.7% of the data in that group. Prior to model estimation two
extremely abnormal observations, z_8062 = (8888, 1, 0, 1) and z_10454 = (2, 8888, 0, 0), were removed
from the database. Each detected observation was further inspected as to whether it was present in
the edited set or not; detected observations that are not found in the edited set are effectively
errors.
The characteristics of those 658 measurement errors are as follows: 239 observations consist of
zero counts for all variables, 113 observations have large counts on at least one item, and the
remaining observations have nonzero but small counts on at least one of their items.
As mentioned in Section 1, the pairwise correlations between the variables were computed, and
only variables 3 and 4 were found to have a significant correlation; the other pairwise correlations
are not significant. Fig. 1 shows the pairwise plot of the variables.
Fig 1. Pairwise plot of the variables
Based on the relations between the variables, the mixture model in (4) becomes

P(z_1, z_2, z_3, z_4) = \sum_{l=1}^{L} \upsilon_l \cdot Po(z_1; \theta_{1l}) \cdot Po(z_2; \theta_{2l}) \cdot BP(z_3, z_4; \theta_{3l}, \theta_{4l}, \theta_{34l}), \quad (5)

where

Po(z_i; \theta_{il}) = \frac{\exp(-\theta_{il})\,\theta_{il}^{z_i}}{z_i!}

and the bivariate factor is

BP(z_3, z_4; \theta_{3l}, \theta_{4l}, \theta_{34l}) = \sum_{k=0}^{\min(z_3, z_4)} \exp(-\theta_{3l} - \theta_{4l} - \theta_{34l})\,\frac{\theta_{3l}^{z_3 - k}\,\theta_{4l}^{z_4 - k}\,\theta_{34l}^{k}}{(z_3 - k)!\,(z_4 - k)!\,k!}.
The dataset described above was used to estimate model (5). The EM algorithm of Section 3.3 was
used to fit models with L = 1 component up to L = 16 components sequentially. For each model,
a set of n (= sample size) uniform numbers was generated on the interval [1, L] and rounded; the
L distinct values define L groups and each observation was assigned to the group corresponding
to its generated number. The proportion of observations in each group was used as the initial value
for the component proportion parameters. The mixture of Poisson models involves two groups of
parameters, the component proportions υ_1, \dots, υ_L and the θ_1', \dots, θ_L' parameters. For the
second group of parameters, five uniform random sets were generated within the range of the data
points for each L; the algorithm was run for 50 iterations for each set, and the parameter values
yielding the largest likelihood were taken as initial values.
Given the starting values, the algorithm was run until the relative change in the log likelihood was
less than 10^{-6} for each L. Model selection was based on BIC and AIC, computed respectively as

BIC(L) = -2\log(\text{likelihood}) + L\log(n)

and

AIC(L) = -2\log(\text{likelihood}) + 2L,

where log(likelihood) is the maximized log likelihood of the mixture model with L components.
The log likelihood continued to increase for values of L greater than 16, but only slightly. To avoid
very small groupings, which may distort the inferences, and to limit the computational burden
given the large database, the sixteen-component mixture model was chosen.
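A small helper matching the criteria exactly as defined above (note that the penalty is written in terms of the number of components L, as in the text, rather than the total number of free parameters); the numbers in the call are only illustrative.

# Information criteria as defined above, for a fitted L-component model
ic <- function(loglik, L, n) {
  c(BIC = -2 * loglik + L * log(n),
    AIC = -2 * loglik + 2 * L)
}
ic(loglik = -111900, L = 16, n = 13230)   # illustrative values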
The likelihood ratio statistic and the caselikelihood basically lead to the same conclusions (Wang,
2009), and since the database is large, which would require running the algorithm many times for
the caselikelihood test, only the likelihood ratio statistic is used to test whether an observation is
suspicious or normal.
The classification based on the modified likelihood ratio statistic was carried out as follows:
1. λ̃ was obtained for each observation in the database.
2. 10000 bootstrap resamples of the (n+1)-st observation were obtained.
3. For each bootstrap element b = 1, ..., 10000, λ̃*_b was calculated.
4. With α = p/(B+1), where α is the significance level for the test and B is the number of
resamples, the threshold λ̃_C is the p-th smallest value λ̃*_b. If λ̃_i ≤ λ̃_C, i = 1, ..., n+1, the
observation is classified as suspicious; otherwise the observation is normal.
For the proposed approach based on ordering principles, the estimated mixture model was used as
a basis for clustering the observations (Brijs et al., 2004). Under the mixture framework each
observation has a probability of belonging to each component and is assigned to the cluster
(component) with the highest probability. The mean and covariance were obtained for each cluster,
followed by the computation of the Mahalanobis distance from each observation to each of the
clusters. Observations are classified as suspicious if they are suspicious for all clusters. A second
assignment was then made: the observations were reassigned to the cluster at smallest distance.
Boxplots and histograms were plotted for each cluster and suspicious observations could be
visualized.
5.3 Empirical Results
Fig. 2 shows the plot of the log likelihood for each number of components. The log likelihood
clearly levels off in the neighbourhood of 50000.
Fig 2. Plot of the log likelihood vs the number of components
Table 1 shows the information criteria BIC and AIC for each model, from one component through
16 components. According to these criteria the model with 16 components, which has the smallest
BIC and AIC, is the best.
Table 1. Information criteria

Component        BIC         AIC
 1          524753.5    524746.0
 2          344244.6    344229.6
 3          304623.7    304601.2
 4          282832.0    282802.0
 5          268784.4    268747.0
 6          258883.3    258838.4
 7          248982.2    248929.8
 8          242665.7    242605.8
 9          239039.4    238972.0
10          234753.1    234678.2
11          233756.8    233674.4
12          229949.1    229859.2
13          228409.0    228311.6
14          227236.4    227131.6
15          225539.3    225427.0
16          223853.6    223733.8
Table 2. Variable means in the data set

Z1       Z2       Z3       Z4
9.714    6.044    1.208    1.116
Table 3. Covariances among the variables in the data set

Variable       Z1             Z2             Z3             Z4
Z1          9.714         -1.631*10^-7   -6.566*10^-9   -3.860*10^-8
Z2         -1.631*10^-7    6.044         -7.358*10^-10  -4.326*10^-9
Z3         -6.566*10^-9   -7.358*10^-10   1.208          0.221
Z4         -3.860*10^-8   -4.326*10^-9    0.221          1.116
Table 2 above reports the mean of each variable. It can be seen that on average the companies have
more long-term employed men than women, but this difference is not observed for short-term
employment. Table 3 reports the covariance matrix of the four variables. The results agree with
the previous analysis: there is a small positive covariance between variables 3 and 4, while the
other estimated pairwise covariances can be neglected.
Table 4. MLE of model parameters (std. errors in parentheses)

Cluster   θ1                θ2                θ3                θ4                θ34                   υL
 1        38.155 (4.547)     4.276 (0.566)     0.100 (0.023)     0.940 (0.213)     0.072 (0.020)        0.043
 2        60.192 (2.060)    12.019 (2.825)     0.585 (0.205)     2.558 (0.659)     0.330 (0.110)        0.025
 3         1.694 (0.244)     5.677 (0.927)     0.810 (0.128)     0.093 (0.036)     0.103 (0.036)        0.119
 4         8.212 (0.855)    34.522 (8.125)    12.565 (2.131)     1.652 (0.346)     1.888 (0.305)        0.014
 5         1.284 (0.160)     0.681 (0.065)     0.037 (0.062)     0.062 (0.024)     0.016 (0.014)        0.297
 6         6.213 (0.221)     1.351 (0.266)     0.038 (0.043)     0.188 (0.070)     0.012 (0.005)        0.167
 7        32.274 (6.472)    32.382 (6.458)    11.290 (5.151)     9.531 (3.753)     3.939 (2.992)        0.007
 8         9.115 (4.294)    16.031 (6.132)    39.464 (9.891)    17.648 (4.634)     5.553 (1.196)        0.005
 9        26.881 (8.421)     5.052 (1.775)     2.257 (0.800)    18.082 (5.571)     1.517 (0.827)        0.011
10        17.932 (0.908)     2.455 (0.322)     0.072 (0.081)     0.726 (0.240)     0.088 (0.029)        0.069
11         3.231 (0.694)     1.128 (0.176)     1.614 (0.243)     5.057 (1.116)     1.750*10^-9 (3.00*10^-10)  0.026
12         8.041 (0.412)     8.214 (0.506)     0.218 (0.062)     0.241 (0.082)     0.084 (0.028)        0.080
13         6.964 (1.512)    21.851 (2.458)     1.688 (0.167)     0.304 (0.130)     3.840 (0.195)        0.039
14        33.644 (10.870)   40.766 (3.145)     1.994 (0.604)     1.082 (0.248)     0.584 (1.743)        0.023
15        22.844 (1.375)    15.885 (1.168)     0.568 (0.133)     0.642 (0.208)     0.533 (0.412)        0.048
16         6.140 (1.158)     7.763 (1.136)     6.841 (1.172)     3.579 (0.781)     1.054 (0.314)        0.028
Table 4 presents the estimated parameters and their respective standard errors for the model with
16 components. θ1, θ2, θ3 and θ4 are the parameters for variables 1 through 4, and θ34 represents
the covariance between variables 3 and 4. The υL are the component proportions. Component 8
contributes least to the model, while component 5 contributes most.
Table 5 reports the number of observations and the number of suspicious observations per cluster
according to the two methods. Observations are assigned to the cluster with the highest conditional
probability, as stated in subsection 5.2. Under the modified likelihood ratio statistic (method 1)
each observation is classified as suspicious or normal. Under method 2, after the first assignment
used in method 1, each observation is further assigned to its closest cluster according to the
smallest Mahalanobis distance. For example, cluster 1 contains 573 observations, of which 72
departed from this cluster under the Mahalanobis reassignment, while cluster 4, with 187
observations, received 48 new observations under method 2.
Under method 1, at the 1% significance level 136 observations are detected as suspicious and 86
of these are effectively edited observations. At the 5% significance level 472 observations are
detected and 117 are found to contain measurement errors. At the 10% significance level 821
observations are detected and 221 are measurement errors. Employing the second method, 441
observations are found to be suspicious; after inspecting whether each of these appears in the
edited set or not, 197 are found to be effectively measurement errors. The success in detecting
measurement errors is summarized in Table 6.
Figures 3 and 4 show the boxplots and the corresponding histograms constructed from the
distances obtained under method 2. These figures confirm the existence of outliers in each cluster.
Some observations greatly affect the data distribution, suggesting extremely abnormal values; this
is the case for observations 7782 and 11278 in cluster 2, which completely distort the boxplot and
histogram, as can be seen in the boxplot/histogram plotted after removing these observations. The
same happens in cluster 14, where observations 9880 and 10063 greatly distort the boxplot and the
histogram for that cluster.
Table 5. Number of observations and number of suspicious observations for each component

           Number of observations         Number of suspicious observations per cluster
           per cluster                    Method 1                            Method 2
Cluster    Method 1    Method 2           1%       5%       10%
 1            573         501              0        6        32                 16
 2            336         412             12       81       109                 24
 3           1535        1563              0        0        13                 29
 4            187         235             15       90       129                 17
 5           3994        3163              0        0         0                 21
 6           2243        1956              0        0         0                 89
 7             88         167             18       43        68                 17
 8             64          81             32       23        44                  9
 9            141         233             23       37        61                 18
10            909         933              0        5        26                 26
11            329         725              4       23        33                 21
12           1012        1079              0        0         4                 31
13            516         546              1        9        91                  8
14            314         335             23       72       107                 35
15            622         675              0       21        58                 39
16            364         626              8       62        46                 31
Total       13227       13230            136      472       821                441
Table 6. Accuracy in detection of errors by method

                         Method 1                      Method 2
Sig. level         1%        5%        10%
Detected          136       472       821                441
M. errors          86       117       221                197
In general the detection accuracy of the two methods differs considerably. The difference may be
explained by the fact that under method 1 an observation is classified as an error or not on the basis
of a probability value, while the second method relies on the distance of the element to its nearest
cluster centre. In the latter case, observations that are not effectively measurement errors can be
detected as errors because they lie relatively far from the centre of their cluster. For instance the
observation z = (0, 3, 3, 0) is detected as an error while it is effectively not one, because its
associated cluster is composed of elements with smaller values. There are two different types of
outliers in boxplots, namely mild outliers and extreme outliers: the former are values lying between
Q3 + 1.5·IQR and Q3 + 3·IQR, while the latter are values above Q3 + 3·IQR, where Q3 is the third
quartile and IQR the interquartile range. Most of the 244 observations flagged by method 2 that
are not confirmed as errors may be mild outliers, observations that differ only a little from the
remainder and are not effectively outliers. The boxplots help to understand this feature: for
instance in cluster 5, method 1 finds no abnormal observation at any significance level while
method 2 finds 21 suspicious observations, yet the boxplot of this cluster suggests no outliers at
all (see Fig. 3, cluster 5), which reinforces the idea that the detected observations deserve further
checking before being declared outliers or normal. The B set contains some observations with
extremely high counts, and the second method detects all of them. As discussed in Section 5.2, of
the 658 measurement errors present in the dataset almost half consist of zero counts for all items;
most of these were associated with cluster 5, which is composed of small counts, and they may
not have a very negative impact on the estimated parameters compared with the general pattern of
the data. Finding 86 measurement errors among the 136 observations flagged as suspicious at the
1% significance level constitutes a high rate of measurement error detection, which suggests that
the method is effective. The same argument can be made for the second method, which flagged
441 suspicious observations of which 197 were effectively measurement errors.
Fig 3. Boxplots for each cluster illustrating some suspicious observations
Fig 4. Histograms for each cluster illustrating suspicious observations
6. Conclusions
The paper is concerned with the identification of measurement errors in multivariate count data
using finite mixtures of multivariate Poisson models. Two methods for constructing a multivariate
Poisson distribution are discussed: the multivariate Poisson distribution with common covariance
structure and the multivariate Poisson distribution with restricted covariance structure; the latter
is used in the empirical application. Two strategies for outlier identification are also discussed,
namely the modified likelihood ratio test and the caselikelihood. The caselikelihood is not practical
for large databases, which is the case in this study, so only the modified likelihood ratio test is
applied in the empirical study. A new approach based on ordering principles is proposed and used
in the empirical application.
The mixture model was used for clustering the observations, and the BIC and AIC criteria selected
the 16-component mixture model. A preliminary analysis of the dependence between the variables
revealed that only variables 3 and 4 have a significant correlation; the other pairwise associations
were not significant, which simplified the structure of the multivariate Poisson. After fitting the
model, the estimated covariances between the variables support the same conclusion.
According to the results obtained it can be concluded that the approaches are indeed useful for the
detection of erroneous observations, since all observations that differ greatly from the remainder
were detected by the two strategies applied. However, Table 6 suggests that the proposed approach
(method 2) is more effective in identifying measurement errors. This method is also simpler than
the modified likelihood ratio, which requires the use of bootstrap techniques.
The multivariate Poisson model construction and the EM algorithm scheme applied were both
designed to minimize computational effort; nevertheless the empirical study showed that the
computations are still tedious. To reduce the computational effort further, acceleration techniques
should be considered, for instance the IEM algorithm of Neal and Hinton (1998) described in
McLachlan and Peel (2000).
The results obtained here also strengthen the view that there is no need to impose a complex
covariance structure in the multivariate Poisson construction, which would make the computation
tedious.
References:
Barnett, V. (1976). The ordering of multivariate data. Journal of the Royal Statistical Society
A, 139, 318-354.
Brijs, T., Karlis, D., Swinnen, G., Vanhoof, K. and Wets, G. (2002). Tuning the Multivariate
Poisson Mixture Model for Clustering Supermarket Shoppers. Department of Statistics,
Athens University of Economics. Available at:
http://alpha.luc.ac.be/~brijs/./pubs/louvain2002.p
Brijs, T., Karlis, D., Swinnen, G., Vanhoof, K., Wets, G. and Manchanda, P. (2004). A
multivariate Poisson mixture model for marketing applications. Statistica Neerlandica, 58,
322-348.
Cameron, A. C. and Trivedi, P. K. (1996). Count Data Models for Financial Data. Handbook
of Statistics, Vol. 14, Statistical Methods in Finance, 363-392. Amsterdam, North-Holland.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum Likelihood from
Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B,
39, 1-38.
Elahi, M., Li, K., Nisar, W., Lv, X. and Wang, H. (2009). Detection of Local Outlier over
Dynamic Data Streams using Efficient Partitioning Method. WRI World Congress on
Computer Science and Information Engineering, 76-81.
Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer, New
York.
Johnson, N. L., Kotz, S. and Balakrishnan, N. (1997). Discrete Multivariate Distributions.
Wiley, New York.
Kano, K. and Kawamura, K. (1991). On recurrence relations for the probability function of
multivariate generalized Poisson distribution. Communications in Statistics, 20(1), 165-178.
Karlis, D. and Meligkotsidou, L. (2007). Finite mixtures of multivariate Poisson
distributions with application. Journal of Statistical Planning and Inference, 137, 1942-1960.
Karlis, D. (2003). An EM algorithm for multivariate Poisson distribution and related models.
Journal of Applied Statistics, 30(1), 63-77.
Kawamura, K. (1985). A note on the recurrent relations for the bivariate Poisson distribution.
Kodai Mathematical Journal, 8, 70-78.
Kawamura, K. (1987). Calculation of density for the multivariate Poisson distribution.
Kodai Mathematical Journal, 10, 231-241.
Kawamura, K. (1973). The structure of bivariate Poisson distribution. Kodai Mathematical
Seminar Reports, 25(2), 246-256.
Laurikkala, J., Juhola, M. and Kentala, E. (2000). Informal identification of outliers in
medical data. The Fifth Workshop on Intelligent Data Analysis in Medicine and
Pharmacology, Berlin, pp. 20-24.
McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.
Rouse, D. M. (2005). Estimation of Finite Mixture Models. Available at:
www.lib.ncsu.edu/theses/available/etd-11282005-140114/.../etd.pdf
Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling.
Springer, New York.
Schlattmann, P. (2009). Medical Applications of Finite Mixture Models. Springer, New York.
Tang, Z. and MacLennan, J. (2005). Data Mining with SQL Server 2005. John Wiley and
Sons.
Titterington, D. M., Smith, A. F. M. and Makov, U. E. (1985). Statistical Analysis of Finite
Mixture Distributions. Wiley, New York.
Tsiamyrtzis, P. and Karlis, D. (2004). Strategies for Efficient Computation of Multivariate
Poisson Probabilities. Communications in Statistics - Simulation and Computation, 33(2),
271-292.
Wang, L. (2009). Outlier Detection with Finite Mixture Model. Örebro University, Master
thesis.
Wang, S., Woodward, W. A., Gray, H. L., Wiechecki, S. and Sain, S. R. (1997). A New Test
for Outlier Detection from a Multivariate Mixture Distribution. Journal of Computational and
Graphical Statistics, 6, 285-299.
Woodward, W. A. and Sain, S. R. (2003). Testing for outliers from a mixture distribution
when some data are missing. Computational Statistics and Data Analysis, 44, 193-210.
Yamanishi, K. and Takeuchi, J. (2004). On-Line Unsupervised Outlier Detection Using
Finite Mixtures with Discounting Learning Algorithms. Data Mining and Knowledge
Discovery, 8, 275-300.
Ye, M., Li, X. and Orlowska, E. M. (2009). Projected outlier detection in high-dimensional
mixed-attributes data set. Expert Systems with Applications, 7104-7113.
Appendix: R code
R code for parameter estimation
rm(list=ls())
library(bivpois) #assumed source of pbivpois() used below; the Bp() function further down computes the same bivariate Poisson probability
data<-read.table("C:\\Users\\User\\Desktop\\dataw.txt",header=TRUE)
EIJnl<-NULL
Q<-NULL
m<-4 #number of variables
n<-length(data$v1) #number of observations
z<-matrix(c(data$v1,data$v2,data$v3,data$v4),n,m)
data1<-matrix(c(data$v1,data$v2,data$v3,data$v4),n,m)
fact<-function(y){x<-1
if(y==0) x<-1
else {
for(i in 1:y) x<-i*x
}
return (x)
}
q<-m+1 #number of rate parameters per component (4 rates + 1 covariance term)
L<-16 #number of mixture components
data$v5<-runif(n,1,L) #random initial component labels
Wnl<-matrix(,n,L)
lambda<-matrix(,L,q)
Eps<-matrix(,n,L)
Eij<-matrix(,n,L)
Bij<-matrix(,n,L)
v<-NULL
V<-function(L){
for(i in 1:L) v[i]<-sum(data$v5==i)/n
return(v)
}
R<-function(y){if(y-as.integer(y)<0.5)y<-as.integer(y)
else y<-as.integer(y)+1
return(y)}
for(i in 1:n){data$v5[i]<-R(data$v5[i]) }
v<-V(L)# vector of components weights
v
div<-function(L){ #random starting values for the rate parameters
for(i in 1:L){ datadiv<-matrix(,sum(data$v5==i),m);l<-0
for(j in 1:n){
if(data$v5[j]==i){l<-l+1
datadiv[l,]<-z[j,]
}
}
lambda[i,]<-c(runif(1,0,max(data$v1)),runif(1,0,max(data$v2)),runif(1,0,max(data$v3)),runif(1,0,max(data$v4)),runif(1,0,100))
}
return(lambda)
}
lambda<-div(L)
lambda
v
P<-NULL
Po<-function(x,l){ po<-(exp(-l)*l^x)/fact(x)
return(po)
}
Bp<-function(y1,y2,lambda1,lambda2,lambda12){initial<-0 #bivariate Poisson probability
expn<-exp(-(lambda1+lambda2+lambda12))
for(i in 0:min(y1,y2)){
initial<-initial+(lambda1^(y1-i)*lambda2^(y2-i)*lambda12^(i))/(fact(y1-i)*fact(y2-i)*fact(i))
}
bp<-expn*initial
return(bp)
}
Condf<-function(x,l){ #density of observation x under component l
product<-Po(z[x,1],lambda[l,1])*Po(z[x,2],lambda[l,2])*pbivpois(z[x,3],z[x,4],c(lambda[l,3],lambda[l,4],lambda[l,5]))
return(product)
}
Uncondf<-function(x){ #mixture density of observation x
prodt<-0
for(s in 1:L){
prodt<-prodt+v[s]*Condf(x,s)
}
return(prodt)
}
f<-function(x,l){
wnl<-v[l]*Condf(x,l)/Uncondf(x)
return(wnl)
}
WNL<-function(){
for(x in 1:n){
for(l in 1:L){
Wnl[x,l]<-f(x,l)
}
}
return(Wnl)
}
g<-0 #iteration counter
repeat{
g<-g+1
wnl<-WNL() #E-step: posterior component probabilities
wnl
eijl<-function(z1,z2,li,lj,lij){eps<-0
for(r in 0:min(z1,z2)){ eps<-eps+r*Po(z1-r,li)*Po(z2-r,lj)*Po(r,lij)
}
return(eps)
}
Eijnl<-function(x,l){ij<-0; t<-NULL;B<-1;BP<-NULL #conditional expectation of the common latent term under component l
g<-eijl(z[x,3],z[x,4],lambda[l,3],lambda[l,4],lambda[l,5])/pbivpois(z[x,3],z[x,4],c(lambda[l,3],lambda[l,4],lambda[l,5]))
return(g)
}
EIJNL<-function(x){ p<-NULL
for(l in 1:L) p<-c(p,Eijnl(x,l))
return(p)
}
Bijn<-function(){
for(x in 1:n) Bij[x,]<-EIJNL(x)
return(Bij)
}
epsl<-Bijn()
epsl
Eijl<-function(x){ e1<-NULL;e2<-NULL;e3<-NULL;e4<-NULL
for(l in 1:L){
e1<-c(e1,z[x,1])
e2<-c(e2,z[x,2])
e3<-c(e3,z[x,3]-epsl[x,l])
e4<-c(e4,z[x,4]-epsl[x,l])
}
e<-c(e1,e2,e3,e4)
return(e)
}
Epsl<-function(){Eps<-matrix(,n,m*L)
for(x in 1:n)Eps[x,]<-Eijl(x)
return(Eps)
}
Epij<-Epsl()
Epij
vnew<- function(){ for(l in 1:L) v[l]<-sum(wnl[,l])/n #M-step: update mixing proportions
return(v)
}
v<-vnew()
Lambdas<-function(){ L1<-NULL;L2<-NULL;L3<-NULL;L4<-NULL;L5<-NULL #M-step: update rate parameters
Lambda<-matrix(,L,q)
for(l in 1:L){
L1<-c(L1,sum(wnl[,l]*Epij[,l])/sum(wnl[,l]))
L2<-c(L2,sum(wnl[,l]*Epij[,L+l])/sum(wnl[,l]))
L3<-c(L3,sum(wnl[,l]*Epij[,2*L+l])/sum(wnl[,l]))
L4<-c(L4,sum(wnl[,l]*Epij[,3*L+l])/sum(wnl[,l]))
L5<-c(L5,sum(wnl[,l]*epsl[,l])/sum(wnl[,l]))
}
Lambda<-matrix(c(L1,L2,L3,L4,L5),L,q)
return(Lambda)
}
lambda<-Lambdas()
F<-function(){Lik<-0 #observed-data log-likelihood of the current parameters
for(x in 1:n){ f<-0 #reset the mixture density for each observation
for(l in 1:L){ f<-f+v[l]*Condf(x,l)
}
Lik<-Lik+log(f)
}
return(Lik)
}
P<-c(P,F()) #log-likelihood path over the EM iterations
if(g>1){Q[g-1]<-P[g]-P[g-1]} #change in log-likelihood between successive iterations
if(length(P)>20)break #stop after a fixed number of iterations
}
R code for bootstrap of standard errors
rm(list=ls())
library(bivpois) #assumed source of pbivpois() used below
B<-matrix(,100,5*16 ) #100 bootstrap replicates of the 16x5 rate parameters
data<-read.table("C:\\Users\\User\\Desktop\\datawb.txt",header=TRUE)
g<-0
repeat{
g<-g+1
Lambda16<-matrix(c( #estimated rate parameters of the 16-component model (column by column)
38.15450799, 60.1916087, 1.69377401, 8.212057, 1.28443760, 6.21332185, 32.273688, 9.115402,
26.881270, 17.93242178, 3.230533, 8.0414025, 6.9642155, 33.6442291, 22.8443448, 6.139962,
4.27558684, 12.0189718, 5.67742617, 34.521507, 0.68127276, 1.35062338, 32.382003, 16.031309,
5.052374, 2.45507732, 1.128142, 8.2137431, 21.8512490, 40.7658999, 15.8852785, 7.763343,
0.10049911, 0.5855036, 0.80964569, 12.565228, 0.03695881, 0.03759575, 11.290171, 39.463898,
2.257400, 0.07172701, 1.614032, 0.2182159, 1.6884718, 1.9941723, 0.5676319, 6.840999,
0.94018625, 2.5575498, 0.09286526, 1.652264, 0.06172529, 0.18817737, 9.530685, 17.648157,
18.082073, 0.72565410, 5.056689, 0.2409899, 0.3035150, 1.0822242, 0.6416134, 3.579238,
0.07160081, 0.3303317, 0.10266641, 1.887709, 0.01592727, 0.01227602, 3.939290, 5.553087,
1.517195, 0.08755417, 1.750189e-09, 0.0835878, 0.3840347, 0.5838264, 0.5328981, 1.054430),16,5)
v16<-c(0.042842779, 0.025215228, 0.119230845, 0.014196339, 0.296965265, 0.166934240, #estimated mixing proportions
0.006602024, 0.004643517, 0.010663367, 0.069124459, 0.026006180, 0.079718448,
0.038710435, 0.023024375, 0.048251016, 0.027871484)
EIJnl<-NULL
m<-4 #number of variables
n<-length(data$X1) #number of observations
z0<-matrix(c(data$X1,data$X2,data$X3,data$X4),n,m) #original data
z<-z0 #resampled data used in this bootstrap replicate
for(i in 1:n){
j<-sample(1:n,1)
z[i,]<-z0[j,] #draw rows with replacement from the original data
}
fact<-function(y){x<-1
if(y==0) x<-1
else {
for(i in 1:y) x<-i*x
}
return (x)
}
q<-m+1
L<-16
Wnl<-matrix(,n,L)
Eps<-matrix(,n,L)
Eij<-matrix(,n,L)
Bij<-matrix(,n,L)
R<-function(y){if(y-as.integer(y)<0.5)y<-as.integer(y)
else y<-as.integer(y)+1
return(y)}
lambda<-Lambda16 #start the single EM update from the estimated parameters
v<-v16
Po<-function(x,l){ po<-(exp(-l)*l^x)/fact(x)
return(po)
}
Bp<-function(y1,y2,lambda1,lambda2,lambda12){initial<-0 #bivariate Poisson probability
expn<-exp(-(lambda1+lambda2+lambda12))
for(i in 0:min(y1,y2)){
initial<-initial+(lambda1^(y1-i)*lambda2^(y2-i)*lambda12^(i))/(fact(y1-i)*fact(y2-i)*fact(i))
}
bp<-expn*initial
return(bp)
}
Condf<-function(x,l){ #density of observation x under component l
product<-Po(z[x,1],lambda[l,1])*Po(z[x,2],lambda[l,2])*pbivpois(z[x,3],z[x,4],c(lambda[l,3],lambda[l,4],lambda[l,5]))
return(product)
}
Uncondf<-function(x){
prodt<-0
for(s in 1:L){
prodt<-prodt+v[s]*Condf(x,s)
}
return(prodt)
}
f<-function(x,l){
wnl<-v[l]*Condf(x,l)/Uncondf(x)
return(wnl)
}
WNL<-function(){
for(x in 1:n){
for(l in 1:L){
Wnl[x,l]<-f(x,l)
}
}
return(Wnl)
}
wnl<-WNL()
eijl<-function(z1,z2,li,lj,lij){eps<-0
for(r in 0:min(z1,z2)){ eps<-eps+r*Po(z1-r,li)*Po(z2-r,lj)*Po(r,lij)
}
return(eps)
}
Eijnl<-function(x,l){ij<-0; t<-NULL;B<-1;BP<-NULL #conditional expectation of the common latent term under component l
g<-eijl(z[x,3],z[x,4],lambda[l,3],lambda[l,4],lambda[l,5])/pbivpois(z[x,3],z[x,4],c(lambda[l,3],lambda[l,4],lambda[l,5]))
return(g)
}
EIJNL<-function(x){ p<-NULL
for(l in 1:L) p<-c(p,Eijnl(x,l))
return(p)
}
Bijn<-function(){
for(x in 1:n) Bij[x,]<-EIJNL(x)
return(Bij)
}
epsl<-Bijn()
Eijl<-function(x){ e1<-NULL;e2<-NULL;e3<-NULL;e4<-NULL
for(l in 1:L){
e1<-c(e1,z[x,1])
e2<-c(e2,z[x,2])
e3<-c(e3,z[x,3]-epsl[x,l])
e4<-c(e4,z[x,4]-epsl[x,l])
}
e<-c(e1,e2,e3,e4)
return(e)
}
Epsl<-function(){Eps<-matrix(,n,m*L)
for(x in 1:n)Eps[x,]<-Eijl(x)
return(Eps)
}
Epij<-Epsl()
vnew<- function(){ for(l in 1:L) v[l]<-sum(wnl[,l])/n
return(v)
}
v<-vnew()
Lambdas<-function(){ L1<-NULL;L2<-NULL;L3<-NULL;L4<-NULL;L5<-NULL
Lambda<-matrix(,L,q)
for(l in 1:L){
L1<-c(L1,sum(wnl[,l]*Epij[,l])/sum(wnl[,l]))
L2<-c(L2,sum(wnl[,l]*Epij[,L+l])/sum(wnl[,l]))
L3<-c(L3,sum(wnl[,l]*Epij[,2*L+l])/sum(wnl[,l]))
L4<-c(L4,sum(wnl[,l]*Epij[,3*L+l])/sum(wnl[,l]))
L5<-c(L5,sum(wnl[,l]*epsl[,l])/sum(wnl[,l]))
}
Lambda<-matrix(c(L1,L2,L3,L4,L5),L,q)
return(Lambda)
}
lambda<-Lambdas()
C<-NULL
for(i in 1:L)C<-c(C,lambda[i,])
B[g,]<-C
if(g>99)break
}
B
SD<-NULL
for(i in 1: length(B[1,]))
SD[i]<-sd(B[,i]) #bootstrap standard error of each parameter estimate
R code for computation of the threshold of modified likelihood ratio test
rm(list=ls())
library(bivpois) #assumed source of pbivpois() used below
data<-read.table("C:\\Users\\User\\Desktop\\datar.txt",header=TRUE)
#data<-read.table("C:\\Users\\User\\Desktop\\dataw.txt",header=TRUE)
Lambda16<-matrix(c( #estimated rate parameters of the 16-component model (column by column)
38.15450799, 60.1916087, 1.69377401, 8.212057, 1.28443760, 6.21332185, 32.273688, 9.115402,
26.881270, 17.93242178, 3.230533, 8.0414025, 6.9642155, 33.6442291, 22.8443448, 6.139962,
4.27558684, 12.0189718, 5.67742617, 34.521507, 0.68127276, 1.35062338, 32.382003, 16.031309,
5.052374, 2.45507732, 1.128142, 8.2137431, 21.8512490, 40.7658999, 15.8852785, 7.763343,
0.10049911, 0.5855036, 0.80964569, 12.565228, 0.03695881, 0.03759575, 11.290171, 39.463898,
2.257400, 0.07172701, 1.614032, 0.2182159, 1.6884718, 1.9941723, 0.5676319, 6.840999,
0.94018625, 2.5575498, 0.09286526, 1.652264, 0.06172529, 0.18817737, 9.530685, 17.648157,
18.082073, 0.72565410, 5.056689, 0.2409899, 0.3035150, 1.0822242, 0.6416134, 3.579238,
0.07160081, 0.3303317, 0.10266641, 1.887709, 0.01592727, 0.01227602, 3.939290, 5.553087,
1.517195, 0.08755417, 1.750189e-09, 0.0835878, 0.3840347, 0.5838264, 0.5328981, 1.054430),16,5)
v16<-c(0.042842779, 0.025215228, 0.119230845, 0.014196339, 0.296965265, 0.166934240, #estimated mixing proportions
0.006602024, 0.004643517, 0.010663367, 0.069124459, 0.026006180, 0.079718448,
0.038710435, 0.023024375, 0.048251016, 0.027871484)
m<-4 #number of variables
n<-length(data$v1) #number of observations
z<-matrix(c(data$v1,data$v2,data$v3,data$v4),n,m)
R<-function(y){if(y-as.integer(y)<0.5)y<-as.integer(y)
else y<-as.integer(y)+1
return(y)}
fact<-function(y){x<-1
if(y==0) x<-1
else {
for(i in 1:y) x<-i*x
}
return (x)
}
q<-m+1
L<-16
lambda<-Lambda16
v<-v16
Po<-function(x,l){ po<-(exp(-l)*l^x)/fact(x)
return(po)
}
Bp<-function(y1,y2,lambda1,lambda2,lambda12){initial<-0 #bivariate Poisson probability
expn<-exp(-(lambda1+lambda2+lambda12))
for(i in 0:min(y1,y2)){
initial<-initial+(lambda1^(y1-i)*lambda2^(y2-i)*lambda12^(i))/(fact(y1-i)*fact(y2-i)*fact(i))
}
bp<-expn*initial
return(bp)
}
Condf<-function(x,l){ #density of observation x under component l
product<-Po(z[x,1],lambda[l,1])*Po(z[x,2],lambda[l,2])*pbivpois(z[x,3],z[x,4],c(lambda[l,3],lambda[l,4],lambda[l,5]))
return(product)
}
F<-function(x){ f<-0 #mixture density of observation x
for(l in 1:L){ f<-f+v[l]*Condf(x,l)
}
return(f)
}
B<-NULL
NB<-10000 #number of bootstrap draws
nn<-c(seq(n))
for(i in 1:NB){
j<-sample(nn,1) #draw an observation at random with replacement
B<-c(B,F(j)) #store its mixture density value
}
alf<-0.1 #significance level used for the threshold
j<-alf*(NB+1)
p<-R(j) #index of the empirical alf-quantile
B<-sort(B)
Wc<-B[p] #critical value for the modified likelihood ratio test
NN<-NULL
for(i in 1:n){
if(F(i)<Wc) NN<-c(NN,i) #flag observations whose mixture density falls below the threshold
}
NN #indices of the suspicious observations