USING FINITE MIXTURE OF MULTIVARIATE POISSON DISTRIBUTIONS FOR
DETECTION OF MEASUREMENT ERRORS IN COUNT DATA
Author: Bernardo João Rota
Supervisor: Thomas Laitila
Abstract
This study is concerned with the identification of measurement errors in multivariate count data
using finite mixtures of multivariate Poisson models. Two approaches for constructing the
multivariate Poisson model are considered: the multivariate Poisson with a common covariance
term and the restricted covariance Poisson model. The restricted covariance model is used in the
empirical application. The dataset comes from a labour market survey conducted at Statistics
Sweden concerning employment, job vacancies and wages. Two methods for classifying an
observation as outlier or normal are described: the modified likelihood ratio statistic and the
caselikelihood. The former was applied in the empirical study and found to be effective in
identifying measurement errors. A new method for the detection of measurement errors based on
the Mahalanobis distance is proposed.
Keywords: Measurement errors, Multivariate Poisson, Finite Mixture Model, EM-algorithm,
Modified likelihood ratio test, Reduced sub-ordering, Caselikelihood
To my Parents, my brothers, my friends and my teachers at Örebro University
Acknowledgements
The author would like to thank all professors in the Applied Statistics Master programme at Örebro
University for their insightful lectures, which enabled a deep understanding of the programme
contents.
I would also like to thank Prof. Sune Karlsson for his supervision during the four semesters of
studies at this university.
Special thanks are addressed to Prof. Thomas Laitila, who kindly supervised the thesis.
Finally, I would like to thank Statistics Sweden, which made available the data used in the
empirical application. Part of the thesis was written at this institution.
INDEX OF CONTENTS
1. Introduction
2. Poisson models
2.1 Univariate Poisson Model
2.2 Multivariate Poisson Model
3. Finite Mixture Models (FMM)
3.1 Finite mixture of multivariate Poisson distribution
3.2 Estimation of FMM
3.3 The expectation-maximization (EM) algorithm
4. Measures of classification of observed values
4.1 Modified likelihood ratio test
4.2 The Caselikelihood
4.3 The reduced sub-ordering method
5. Application
5.1 Data description
5.2 Methods
5.3 Empirical Results
6. Conclusions
References
Appendix
INDEX OF FIGURES
Fig 1. Pairwise plot of the variables
Fig 2. Plot of the log likelihood vs the number of components
Fig 3. Boxplots for each cluster illustrating some suspicious observations
Fig 4. Histograms for each cluster illustrating suspicious observations
INDEX OF TABLES
Table 1. Information criteria
Table 2. Variable means in the data set
Table 3. Covariances among the variables in the data set
Table 4. MLE of model parameters (std. errors in parentheses)
Table 5. Number of observations and number of suspicious observations for each component
Table 6. Accuracy in detection of errors by method
1. Introduction
Statistical information from surveys is essential for decision making, and the quality of this
information is of great importance since it affects the quality of the decisions. Taking accuracy as
a measure of quality, it is worth working to achieve it. The accuracy of the statistical information
(estimates) can be measured via the mean squared error (MSE), a measure that incorporates both
the variance and the bias of the estimator; a small mean squared error is always desired in order
to produce good estimates. Among the issues that make it hard to attain the desired minimum mean
squared error are the errors arising from all the processes involved in the survey. A survey
operation includes three main stages, namely sample selection, data collection and data processing
(Särndal et al., 1992), and each of these stages is subject to error. In this study the errors arising
in the data collection stage are of concern.
The errors mentioned above are called measurement errors and are defined as the difference
between the value recorded on a study variable for a sampled element and the true (in general
unknown) value. These errors may increase the bias and/or the variance of the estimated
parameters, which are undesirable effects. The effect of measurement errors depends on their
magnitude: the greater the difference between the observed values and the unobserved true values,
the greater the negative influence on the properties of the estimator. The paper considers
identification of measurement errors using techniques for outlier detection.
Among several definitions of an outlier, consider the simplest one, connected with the term
anomalous measures, which states that an observation considerably different from the remainder,
as if it were generated by a different mechanism, can be thought of as an outlier (Elahi et al.,
2009). Such observations can occur by chance, but they are often indicative of measurement
errors. Identifying them adds value to the process of obtaining accurate, or nearly accurate,
statistics. If these anomalous observations are well identified and well corrected, estimation may
safely use simple linear estimators instead of robust ones. Identification of anomalous
observations is of interest in various fields, such as fraud detection, network intrusion detection
and the monitoring of criminal activities.
Several methods for the identification of anomalous measures have been proposed in the literature.
They are essentially based on supervised or unsupervised learning approaches. A supervised
approach learns a classification model over a set of sample data already labelled as outlier or not,
and by using this model a new observation can be classified as an outlier or not. In unsupervised
approaches the detection of outliers is carried out without training samples (Ye et al., 2009); most
of these algorithms are distance based, density based or distribution based.
In distribution-based methods the data are fitted with a statistical model and an observation is
classified as anomalous or not on the basis of that model. Yamanishi et al. (2004) applied this idea
in the SmartSifter program, a program for outlier identification using finite mixture models.
Finite mixture models can be used for the identification of anomalous measures as an alternative
to earlier approaches, which become inefficient as the dimension of the variables and the amount
of data increase.
This study uses a finite mixture of multivariate Poisson distributions as the probabilistic model for
the detection of outliers in multivariate count data. The parameters of the mixture are estimated
via a version of the expectation-maximization (EM) algorithm of Dempster et al. (1977).
This paper aims to:
1. Investigate and estimate mixtures of multivariate Poisson distributions.
2. Use the mixture model as the probabilistic model for the identification of measurement errors.
3. Propose a new measure for the identification of outliers in FMM modelling.
Two models are considered here. Model 1 assumes a common covariance term between all
variables of interest. Model 2 assumes pairwise interactions between all variables of interest in
order to reduce the complexity of implementing the multivariate Poisson distribution. To simplify
the structure of model 2, the dataset can be used to test for significant relations between pairs of
variables, giving rise to what is known as the restricted covariance Poisson model. This approach
allows non-correlated pairs to be removed from the model, which in turn reduces the number of
free parameters to be estimated.
The EM algorithm version for finite mixtures of multivariate Poisson distributions of Karlis (2003)
and Brijs et al. (2004) is used to estimate the parameters of the model. The latter authors derive a
modified version of the EM algorithm for the multivariate Poisson of Karlis (2003) for the case of
four variables; this algorithm is extended here to the case of m variables.
The rest of the paper is organized as follows. Section 2, with two subsections, presents theory
concerning the univariate and multivariate Poisson distributions. Section 3 consists of three
subsections and presents theory on finite mixture models, in particular finite mixtures of
multivariate Poisson distributions, the estimation of this mixture, and the expectation-maximization
algorithm. Section 4 is concerned with measures of classification and discusses the modified
likelihood ratio test, the caselikelihood and the proposed approach. Section 5 presents an empirical
study; it is composed of three subsections presenting, respectively, the data description, the
methods applied and the results obtained, with discussion. Finally, Section 6 presents a brief
conclusion.
2. Poisson models
2.1 Univariate Poisson Model
The most commonly used distributions for modelling count data are the binomial, the negative
binomial and the Poisson. Count data can be defined as data taking non-negative integer values.
Cameron and Trivedi (1996) discuss the Poisson model as the benchmark for modelling count
data. If the variable z follows a univariate Poisson distribution with parameter θ, then its
probability mass function (pmf) can be written as

Po(z; \theta) = \frac{e^{-\theta}\,\theta^{z}}{z!},

where the parameter θ is both the mean and the variance of the distribution. Karlis and
Meligkotsidou (2007) note that the Poisson distribution has played a prominent role in modelling
univariate count data. However, its multivariate counterpart has rarely been used; a reason for this
is the complexity associated with the multivariate distribution in applications.
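As a small numerical check (in R, the language used in the appendix), the pmf above can be evaluated directly or with the built-in dpois() function, and simulated draws have mean and variance both close to θ; the parameter value below is only illustrative.

# Univariate Poisson pmf: direct formula vs. built-in dpois()
theta <- 3.2
z <- 0:10
pmf_manual <- exp(-theta) * theta^z / factorial(z)
all.equal(pmf_manual, dpois(z, lambda = theta))   # TRUE

# mean and variance of simulated draws are both approximately theta
y <- rpois(1e5, theta)
c(mean = mean(y), var = var(y))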
2.2 Multivariate Poisson Model
Earlier approaches to dealing with multivariate count data were mostly based on approximations
with continuous models, e.g. the normal distribution. However, these approximations can be
misleading, since count data are often characterized by many zero counts and/or small counts.
Let Z = (Z_1, Z_2, \dots, Z_m)' denote a vector of discrete random variables, and suppose there is a
vector Y = (Y_1, Y_2, \dots, Y_q)' of independent random variables, each following a Poisson
distribution with parameter θ_k, that is Y_k ~ Po(θ_k), k = 1, \dots, q. Furthermore, consider a
sequence of m-dimensional binary vectors (Δ_1, Δ_2, \dots, Δ_q). Writing Z as

Z = \sum_{k=1}^{q} \Delta_k Y_k,

Z is distributed as an m-variate Poisson variable with parameter vector θ = (θ_1, θ_2, \dots, θ_q)', mean

E(Z; \theta) = E(\Delta Y; \theta) = \Delta\theta

and covariance matrix

\mathrm{var}(Z; \theta) = \mathrm{var}(\Delta Y; \theta) = \Delta \Sigma \Delta',

where Σ = diag(θ_1, θ_2, \dots, θ_q) and Δ is the m × q matrix with columns Δ_i, i = 1, \dots, q. Each
Z_j, j = 1, 2, \dots, m, marginally follows a univariate Poisson distribution (Brijs et al., 2002).
Karlis and Meligkotsidou (2005) state that a multivariate Poisson distribution in its fully
parameterized form can be constructed with a matrix Δ which can be split as

\tilde{\Delta} = (\tilde{\Delta}_1, \tilde{\Delta}_2, \dots, \tilde{\Delta}_m),

where each \tilde{\Delta}_r, r = 1, \dots, m, is a sub-matrix of dimension m × C_r^m with exactly r ones
and m − r zeros in each column; here C_r^m denotes the binomial coefficient. Redefining the
vector Y as \tilde{Y} = (\tilde{Y}_1, \tilde{Y}_2, \dots, \tilde{Y}_m)', where \tilde{Y}_j, j = 1, \dots, m, is a C_j^m-dimensional
column vector, Z can be written as

Z = \sum_{k=1}^{q} \Delta_k Y_k = \sum_{j=1}^{m} \tilde{\Delta}_j \tilde{Y}_j.
The structure of the matrix \tilde{\Delta} makes it possible to depict the covariance between each pair
Z_i and Z_j. Let, for instance, Z = (Z_1, Z_2, Z_3)' and Y = (Y_1, Y_2, Y_3, Y_{12}, Y_{13}, Y_{23}, Y_{123})',
where Y_{ij} (i < j) induces the covariance between (Z_i, Z_j) and Y_{123} the threefold covariance of
(Z_1, Z_2, Z_3). Define \tilde{Y} = [\tilde{Y}_1' \; \tilde{Y}_2' \; \tilde{Y}_3']', with \tilde{Y}_1 = [Y_1 \; Y_2 \; Y_3]',
\tilde{Y}_2 = [Y_{12} \; Y_{13} \; Y_{23}]' and \tilde{Y}_3 = [Y_{123}]'.
The Δ matrix then becomes

\Delta = \begin{pmatrix} 1 & 0 & 0 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 & 1 \end{pmatrix}.

Splitting this matrix into three sub-matrices yields

\tilde{\Delta}_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
\tilde{\Delta}_2 = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}, \quad
\tilde{\Delta}_3 = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},

and Z = \sum_{k=1}^{7} \Delta_k Y_k = \sum_{j=1}^{3} \tilde{\Delta}_j \tilde{Y}_j, whereby

Z_1 = Y_1 + Y_{12} + Y_{13} + Y_{123}
Z_2 = Y_2 + Y_{12} + Y_{23} + Y_{123}
Z_3 = Y_3 + Y_{13} + Y_{23} + Y_{123}.

Let θ = (θ_1, θ_2, θ_3, θ_{12}, θ_{13}, θ_{23}, θ_{123})'; then the mean of the trivariate Poisson vector Z is

E(Z) = \Delta\theta = \begin{pmatrix} \theta_1 + \theta_{12} + \theta_{13} + \theta_{123} \\ \theta_2 + \theta_{12} + \theta_{23} + \theta_{123} \\ \theta_3 + \theta_{13} + \theta_{23} + \theta_{123} \end{pmatrix}
and the covariance matrix equals

\mathrm{var}(Z) = \Delta \Sigma \Delta' =
\begin{pmatrix}
\theta_1 + \theta_{12} + \theta_{13} + \theta_{123} & \theta_{12} + \theta_{123} & \theta_{13} + \theta_{123} \\
\theta_{12} + \theta_{123} & \theta_2 + \theta_{12} + \theta_{23} + \theta_{123} & \theta_{23} + \theta_{123} \\
\theta_{13} + \theta_{123} & \theta_{23} + \theta_{123} & \theta_3 + \theta_{13} + \theta_{23} + \theta_{123}
\end{pmatrix},

where Σ = diag(θ_1, θ_2, θ_3, θ_{12}, θ_{13}, θ_{23}, θ_{123}). Here Σ is the covariance matrix of Y; it is
diagonal because the random variables Y_i, i = 1, \dots, q, are independent. The derivation so far
(Johnson et al., 1997) leads to the following trivariate pmf:
P(z_1, z_2, z_3) = \exp\big(-(\theta_1 + \theta_2 + \theta_3) + (\theta_{12} + \theta_{13} + \theta_{23}) - \theta_{123}\big)
\sum_{l=0}^{\min(z_1, z_2)} \sum_{m=0}^{\min(z_2, z_3)} \sum_{j=0}^{\min(z_1, z_3)} \sum_{i=0}^{\min(l, m, j)} H(z; \theta),

where

H(z; \theta) = \big\{ i!\,(l-i)!\,(j-i)!\,(m-i)!\,(z_1 - l - j + i)!\,(z_2 - l - m + i)!\,(z_3 - m - j + i)! \big\}^{-1}
\times \theta_{123}^{\,i}\,(\theta_{12} - \theta_{123})^{l-i}\,(\theta_{13} - \theta_{123})^{j-i}\,(\theta_{23} - \theta_{123})^{m-i}
\,(\theta_1 - \theta_{12} - \theta_{13} + \theta_{123})^{z_1 - l - j + i}\,(\theta_2 - \theta_{12} - \theta_{23} + \theta_{123})^{z_2 - l - m + i}\,(\theta_3 - \theta_{13} - \theta_{23} + \theta_{123})^{z_3 - m - j + i}.
The multivariate Poisson distribution has several shortcomings due to the complicated form of its
joint probability function (Brijs et al., 2002). Overcoming these shortcomings leads to simplified
approaches, for instance the assumption of a common covariance term for all variables of interest.
Under this framework Johnson et al. (1997) consider a sequence of univariate Poisson variables
Y = (Y_0, Y_1, \dots, Y_m)', each with parameter θ_j, j = 0, \dots, m, and the random vector
Z = (Z_1, \dots, Z_m)', where Z_k = Y_0 + Y_k, k = 1, \dots, m. Then Z is distributed as a multivariate
Poisson with joint probability function

P(z) = P(z_1, z_2, \dots, z_m) = \exp\Big(-\sum_{i=0}^{m}\theta_i\Big)\,\prod_{t=1}^{m}\frac{\theta_t^{z_t}}{z_t!}\,
\sum_{k=0}^{\min\{z_1,\dots,z_m\}}\,\prod_{l=1}^{m}\binom{z_l}{k}\,k!\left(\frac{\theta_0}{\prod_{h=1}^{m}\theta_h}\right)^{k}. \quad (1)

The marginal distribution of Z_k is a univariate Poisson with parameter θ_0 + θ_k, where θ_0 is the
common covariance term for all pairs of variables. Model (1) has been widely used by many
authors. Assuming a common covariance reduces the computational burden, but the assumption
makes the model less realistic.
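A minimal simulation sketch of this construction in R, with illustrative parameter values: each Z_k is built as Y_0 + Y_k, so the marginal means come out as θ_0 + θ_k and every pair of variables shares covariance θ_0.

# Simulate from the common-covariance multivariate Poisson: Z_k = Y0 + Y_k
set.seed(1)
n      <- 1e5
theta0 <- 0.5                    # common covariance term
theta  <- c(2, 1, 3)             # theta_1, ..., theta_m

Y0 <- rpois(n, theta0)
Z  <- sapply(theta, function(th) Y0 + rpois(n, th))

colMeans(Z)   # approximately theta0 + theta: 2.5, 1.5, 3.5
cov(Z)        # off-diagonal entries approximately theta0 = 0.5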
Another approach is to consider only pairwise dependence between the variables, which implies
excluding the third- and higher-order interactions from the Δ matrix. Following the previous
three-variable example, the Δ matrix as shown by Brijs et al. (2002) becomes

\Delta = \begin{pmatrix} 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 \end{pmatrix},

so that

Z_1 = Y_1 + Y_{12} + Y_{13}
Z_2 = Y_2 + Y_{12} + Y_{23}   (2)
Z_3 = Y_3 + Y_{13} + Y_{23}

with Y = (Y_1, Y_2, Y_3, Y_{12}, Y_{13}, Y_{23})'.
The computational burden associated with model (1) and with the model generated in (2) is high,
especially when the dimension of the variables and/or the size of the database increases. Karlis
(2003) and Johnson et al. (1997) suggest the recurrence relations presented in Kawamura (1973,
1985, 1987) and Kano and Kawamura (1991) for the computation of the multivariate Poisson
density. The multivariate Poisson model satisfies the following recurrence relations:

z_i \, P(z) = \theta_i \, P(z - \pi_i) + \theta_0 \, P(z - \iota), \quad i = 1, \dots, m,

and

P(z_1, \dots, z_p, 0, \dots, 0) = P\Big(z - \sum_{i=1}^{p}\pi_i\Big)\,\prod_{j=1}^{p}\frac{\theta_j}{z_j}, \quad p = 1, \dots, m - 1,

where π_i is a vector of zeros except for the i-th element, which is 1, and ι is a vector with all
elements equal to 1. The order of the z_i's and 0's can be interchanged to cover all cases, and
P(0) = \exp(-\sum_{j=0}^{m}\theta_j). When \min(z_1, \dots, z_m) = 0,

P(z) = \exp\Big(-\sum_{j=0}^{m}\theta_j\Big)\,\prod_{j=1}^{m}\frac{\theta_j^{z_j}}{z_j!}.
The recurrence relations presented so far correspond to the general scheme; the following is the
scheme for the bivariate case, which is the case required in this paper. Detailed schemes for higher
dimensions can be found in Kawamura (1973, 1985, 1987).
The bivariate Poisson, as presented in Kawamura (1985), can be written as

P(z_1, z_2) = \sum_{i=0}^{\min(z_1, z_2)} Po(z_1 - i; \theta_1)\, Po(z_2 - i; \theta_2)\, Po(i; \theta_{12})
            = \sum_{i=0}^{\min(z_1, z_2)} \exp(-\theta_1 - \theta_2 - \theta_{12})\,\frac{\theta_1^{z_1 - i}\,\theta_2^{z_2 - i}\,\theta_{12}^{i}}{(z_1 - i)!\,(z_2 - i)!\,i!}.
This density satisfies the following recurrence equations:

z_1 = z_2 = 0:        P(0, 0) = \exp(-\theta_1 - \theta_2 - \theta_{12})
z_1 > 0, z_2 = 0:     z_1\,P(z_1, 0) = \theta_1\,P(z_1 - 1, 0)
z_1 = 0, z_2 > 0:     z_2\,P(0, z_2) = \theta_2\,P(0, z_2 - 1)
z_1, z_2 > 0:         z_1\,P(z_1, z_2) = \theta_1\,P(z_1 - 1, z_2) + \theta_{12}\,P(z_1 - 1, z_2 - 1)
z_1, z_2 > 0:         z_2\,P(z_1, z_2) = \theta_2\,P(z_1, z_2 - 1) + \theta_{12}\,P(z_1 - 1, z_2 - 1)

These recurrence relations reduce the complexity of computing the probability mass function of
the multivariate Poisson, even though a high level of complexity remains.
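The following R sketch (function names are illustrative) evaluates the bivariate Poisson pmf both by the direct summation formula and by building it up with the recurrence relations above; the two routes give the same value.

# Bivariate Poisson pmf by direct summation
dbvpois <- function(z1, z2, t1, t2, t12) {
  k <- 0:min(z1, z2)
  sum(exp(-t1 - t2 - t12) * t1^(z1 - k) * t2^(z2 - k) * t12^k /
        (factorial(z1 - k) * factorial(z2 - k) * factorial(k)))
}

# The same pmf obtained with the recurrence relations
dbvpois_rec <- function(z1, z2, t1, t2, t12) {
  P <- matrix(0, z1 + 1, z2 + 1)                  # P[i + 1, j + 1] = P(i, j)
  P[1, 1] <- exp(-t1 - t2 - t12)
  if (z1 > 0) for (i in 1:z1) P[i + 1, 1] <- t1 * P[i, 1] / i
  if (z2 > 0) for (j in 1:z2) P[1, j + 1] <- t2 * P[1, j] / j
  if (z1 > 0 && z2 > 0)
    for (i in 1:z1) for (j in 1:z2)
      P[i + 1, j + 1] <- (t1 * P[i, j + 1] + t12 * P[i, j]) / i
  P[z1 + 1, z2 + 1]
}

dbvpois(3, 2, 1.2, 0.8, 0.4)       # direct sum
dbvpois_rec(3, 2, 1.2, 0.8, 0.4)   # recurrence, same value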
The Δ matrix in (2) can be represented as Δ = (Δ_1, Δ_2), where Δ_1 is an m × m identity matrix
and Δ_2 is an m × C_m^2 matrix with exactly two ones and m − 2 zeros in each column. In general
its structure is

\Delta = \big(\, I_m \;\big|\; \Delta_2 \,\big), \quad (3)

an m × (m + C_m^2) matrix composed of the identity block and the block of all pairwise indicator
columns.
The product of (3) with an (m + C_m^2)-dimensional vector of independent Poisson random
variables Y = (Y_1, Y_2, \dots, Y_m, Y_{12}, Y_{13}, \dots, Y_{1m}, Y_{23}, Y_{24}, \dots, Y_{2m}, \dots, Y_{m-1,m})',
where Y_i ~ Po(θ_i), i = 1, \dots, m, and Y_{ij} ~ Po(θ_{ij}), i = 1, \dots, m-1, j = 2, \dots, m, i < j, leads to
the following multivariate Poisson random variable:

Z_1 = Y_1 + Y_{12} + Y_{13} + \dots + Y_{1m}
Z_2 = Y_2 + Y_{12} + Y_{23} + \dots + Y_{2m}
........................................
Z_i = Y_i + Y_{1i} + Y_{2i} + \dots + Y_{im}
........................................
Z_m = Y_m + Y_{1m} + Y_{2m} + \dots + Y_{m-1,m}
Since the above structure assumes only pairwise interactions between the variables, it simplifies
the general structure of the multivariate Poisson (Brijs et al., 2002), which can be reduced to a
product of bivariate Poisson models:

P(z_1, z_2, \dots, z_m) = \prod_{i<j} P(z_i, z_j), \quad i = 1, \dots, m-1, \; j = i+1, \dots, m,

where

P(z_i, z_j) = \sum_{k=0}^{\min(z_i, z_j)} \exp(-\theta_i - \theta_j - \theta_{ij})\,\frac{\theta_i^{z_i - k}\,\theta_j^{z_j - k}\,\theta_{ij}^{k}}{(z_i - k)!\,(z_j - k)!\,k!}.

Thus

P(z_1, z_2, \dots, z_m) = \prod_{i<j} \sum_{k=0}^{\min(z_i, z_j)} \exp(-\theta_i - \theta_j - \theta_{ij})\,\frac{\theta_i^{z_i - k}\,\theta_j^{z_j - k}\,\theta_{ij}^{k}}{(z_i - k)!\,(z_j - k)!\,k!}.

The usefulness of this model lies in the fact that its complexity can be greatly reduced by analysing
the interaction effects between the variables of interest (Brijs et al., 2004). Through log-linear
analysis, for instance, non-significant interactions can be used to cancel out free parameters in the
covariance structure of the multivariate Poisson.
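A sketch in R of this pairwise (restricted covariance) construction, with illustrative function and parameter names: the joint pmf is computed as the product of bivariate Poisson pmfs over all pairs i < j, exactly as written above, and a pair judged non-significant simply has its θ_ij set to zero.

# Bivariate Poisson pmf (direct summation), used as a building block
dbvpois <- function(z1, z2, t1, t2, t12) {
  k <- 0:min(z1, z2)
  sum(exp(-t1 - t2 - t12) * t1^(z1 - k) * t2^(z2 - k) * t12^k /
        (factorial(z1 - k) * factorial(z2 - k) * factorial(k)))
}

# Pairwise (restricted covariance) multivariate Poisson pmf
# theta  : vector of main parameters theta_1, ..., theta_m
# theta2 : symmetric m x m matrix of pairwise parameters theta_ij
#          (set an entry to 0 to drop a non-significant pair)
dpairpois <- function(z, theta, theta2) {
  m <- length(z)
  p <- 1
  for (i in 1:(m - 1))
    for (j in (i + 1):m)
      p <- p * dbvpois(z[i], z[j], theta[i], theta[j], theta2[i, j])
  p
}

theta  <- c(1.5, 0.7, 2.0)
theta2 <- matrix(c(0,   0.3, 0,
                   0.3, 0,   0.2,
                   0,   0.2, 0), 3, 3, byrow = TRUE)
dpairpois(c(2, 1, 3), theta, theta2)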
3. Finite Mixture Models (FMM)
A finite mixture model is a probabilistic model for density estimation using a mixture distribution.
It is a widely used approach for modelling data that exhibit heterogeneity (Rouse, 2005). The
advantage of mixture models lies in the fact that they can model quite complex distributions
through an effective choice of components that accurately represent local areas of support of the
true distribution. The components of the mixture are chosen for their tractability, so that inferences
about the phenomenon being modelled can easily be made.
Let Z be a random variable with density function f(z). A finite mixture model (Frühwirth-Schnatter,
2006; Schlattmann, 2009; McLachlan and Peel, 2000) decomposes the density f into a sum of L
component probability density functions. Denoting by f_l the l-th component density, the finite
mixture density can be written as

f(z) = \sum_{l=1}^{L} \upsilon_l f_l(z),

where the υ_l are mixing proportions with υ_l ≥ 0 and \sum_{l=1}^{L}\upsilon_l = 1. If Z is a scalar
random variable with density function f(z; Ψ) parameterized by the vector Ψ, and z is the
corresponding realization of Z, then Z is an outcome of an L-component finite mixture distribution
if its density can be written as

f(z; \Psi) = \sum_{l=1}^{L} \upsilon_l f_l(z; \theta_l),

where θ_l is the parameter of the l-th pdf.
3.1 Finite mixture of multivariate Poisson distribution
A frequent problem in applications of the Poisson model is overdispersion. The model assumes
equal mean and variance, but in many cases the sample variance is greater than the sample mean,
giving rise to the phenomenon known as overdispersion. The presence of overdispersion is an
indication that a simple Poisson model (univariate or multivariate) is not appropriate for the data.
One alternative is a mixture distribution, since a mixture takes the extra variability into account.
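A quick numerical illustration of this point in R, with made-up parameter values: data drawn from a two-component Poisson mixture have a sample variance clearly larger than the sample mean, whereas a single Poisson would have them equal.

# Overdispersion: a Poisson mixture has variance greater than its mean
set.seed(2)
n    <- 1e5
comp <- rbinom(n, 1, 0.3)                          # mixing proportion 0.3
z    <- ifelse(comp == 1, rpois(n, 10), rpois(n, 2))
c(mean = mean(z), var = var(z))                    # roughly 4.4 vs 17.8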
Under (2) and with Z = (Z_1, Z_2, \dots, Z_m)' defined as before, the parameterized probability mass
function is

P(z \mid \theta) = \prod_{i<j} P(z_i, z_j \mid \theta_i, \theta_j, \theta_{ij}).

Representing the pmf as a mixture of L pmfs of the multivariate Poisson results in

P(z \mid \Psi) = \sum_{l=1}^{L} \upsilon_l \prod_{i<j} P(z_i, z_j \mid \theta_{il}, \theta_{jl}, \theta_{ijl}), \qquad
\Psi = (\upsilon_1, \upsilon_2, \dots, \upsilon_{L-1}, \theta_1', \theta_2', \dots, \theta_L')',

where the υ_l, as defined above, are the mixing proportions and
θ_l = (θ_{1l}, θ_{2l}, \dots, θ_{ml}, θ_{12l}, θ_{13l}, \dots, θ_{1ml}, θ_{23l}, θ_{24l}, \dots, θ_{(m-1)ml})' are the
component parameters. The full mixture model is

P(z \mid \Psi) = \sum_{l=1}^{L} \upsilon_l \prod_{i<j} \sum_{k=0}^{\min(z_i, z_j)} \exp(-\theta_{il} - \theta_{jl} - \theta_{ijl})\,\frac{\theta_{il}^{z_i - k}\,\theta_{jl}^{z_j - k}\,\theta_{ijl}^{k}}{(z_i - k)!\,(z_j - k)!\,k!}. \quad (4)

The expected mean and variance of this mixture model (Karlis and Meligkotsidou, 2007) are,
respectively,

E(Z) = \sum_{l=1}^{L} \upsilon_l \Delta \theta_l

and

\mathrm{Var}(Z) = \Delta\left[\sum_{l=1}^{L}\upsilon_l(\Sigma_l + \theta_l\theta_l') - \left(\sum_{l=1}^{L}\upsilon_l\theta_l\right)\left(\sum_{l=1}^{L}\upsilon_l\theta_l\right)'\right]\Delta',

where Σ_l = diag(θ_{1l}, θ_{2l}, \dots, θ_{ql}) and Δ is as defined in (3).
3.2 Estimation of FMM
There are a number of approaches to the estimation of mixture distributions, including graphical
methods, the method of moments, minimum-distance methods, maximum likelihood and Bayesian
approaches. This variety is motivated by the fact that explicit formulas for the parameter estimates
are not available (Titterington et al., 1985). Maximum likelihood estimation via the EM algorithm
is the most widely used approach.
Let an unobservable L × 1 random vector X_n be associated with observation z_n, and let x_n be a
realization of X_n. The vector x_n takes the value one in its l-th element if z_n belongs to component
l, and all remaining elements of x_n are zero. The probability that z_n comes from the l-th
component is υ_l; X_n is said to come from a multinomial distribution with one trial and L outcomes
with respective probabilities υ_l. Its probability mass function can be written as

P(x_n) = \prod_{l=1}^{L} \upsilon_l^{[x_n]_l},

where [x_n]_l is the l-th element of x_n. Then f(z; Ψ) and f_l(z; θ_l) are, respectively, the unconditional
density of Z and the conditional density of Z given [X_n]_l = 1. The posterior probability that Z
belongs to the l-th component given z and Ψ is

\omega_l(z; \Psi) = P([X]_l = 1 \mid z, \Psi) = \frac{\upsilon_l f_l(z; \theta_l)}{\sum_{q=1}^{L} \upsilon_q f_q(z; \theta_q)}.

Here υ_l is the prior probability that z belongs to the l-th class of the mixture. McLachlan and Peel
(2000) refer to Z as the incomplete data and to Z_c = (Z, X) as the complete data. The complete-data
log-likelihood is

\log(L_c(\Psi)) = \sum_{n=1}^{N}\sum_{l=1}^{L}[x_n]_l \log(\upsilon_l f_l(z_n; \theta_l)).

Solving \hat{\Psi} = \arg\max_{\Psi} \log(L_c(\Psi)) yields the maximum likelihood estimator (MLE) of
the parameter vector Ψ. In general the problem of finding \hat{\Psi} is hard, since explicit analytical
expressions are lacking.
3.3 The expectation-maximization (EM) algorithm
The EM algorithm is an important tool for the estimation of complex models, but the ordinary EM
algorithm of Dempster et al. (1977) still needs adaptation for particular models, since the generic
procedure makes the computations tedious, especially for models involving high-dimensional data
sets. Schlattmann (2009) points to this as the reason why its application and modification for finite
mixture models is still an active area of research. Karlis (2003) developed a version of the EM
algorithm for the multivariate Poisson distribution, and Brijs et al. (2004) extended this version to
a mixture of multivariate Poisson distributions.
In the present case of a mixture of multivariate Poissons, the complete data are Z_c = (Z, Y, X),
where Z = (Z_1, \dots, Z_m) is the observed data, X_n = (X_{n1}, \dots, X_{nL}) indicates component
membership, and Y_{nl} = (Y_{n1l}, \dots, Y_{nql}) is the component-specific latent variable, with l
indexing the component and Y the latent variable used in (2). The complete log-likelihood is

\log(L_c(\Psi)) = \sum_{n=1}^{N}\sum_{l=1}^{L}[x_n]_l \log(\upsilon_l f(y_{nl}; \theta_l))
= \sum_{n=1}^{N}\sum_{l=1}^{L}[x_n]_l \log\upsilon_l + \sum_{n=1}^{N}\sum_{l=1}^{L}[x_n]_l \sum_{p=1}^{q}\log\!\left(\frac{\exp(-\theta_{pl})\,\theta_{pl}^{y_{npl}}}{y_{npl}!}\right)
= \sum_{n=1}^{N}\sum_{l=1}^{L}[x_n]_l \log\upsilon_l + \sum_{n=1}^{N}\sum_{l=1}^{L}[x_n]_l \sum_{p=1}^{q}\big(-\theta_{pl} + y_{npl}\log\theta_{pl} - \log y_{npl}!\big).

Selecting suitable starting values Ψ^{(η-1)} = (υ_1^{(η-1)}, υ_2^{(η-1)}, \dots, υ_L^{(η-1)}, θ^{(η-1)})', the EM
algorithm first (E-step) computes the conditional expected value of the indicator variables X, that
is, ω_n = E(X_n | Z_n, Ψ^{(η-1)}). Under the model arising from (2), with Ψ = Ψ^{(η-1)}, this yields
\omega_{nl} = \frac{\upsilon_l^{(\eta-1)} f_l(z_n; \theta_l^{(\eta-1)})}{\sum_{q=1}^{L}\upsilon_q^{(\eta-1)} f_q(z_n; \theta_q^{(\eta-1)})},
\qquad
f_l(z_n; \theta_l^{(\eta-1)}) = \prod_{i<j}\sum_{k=0}^{\min(z_{in},z_{jn})}\exp\big(-\theta_{il}^{(\eta-1)}-\theta_{jl}^{(\eta-1)}-\theta_{ijl}^{(\eta-1)}\big)\,
\frac{(\theta_{il}^{(\eta-1)})^{z_{in}-k}\,(\theta_{jl}^{(\eta-1)})^{z_{jn}-k}\,(\theta_{ijl}^{(\eta-1)})^{k}}{(z_{in}-k)!\,(z_{jn}-k)!\,k!}.
This is followed by the computation of the expected value of the latent variables Y, given the
observation's group membership and the current parameters:

\varepsilon_{ijnl} = E(Y_{ijnl} \mid z_n, [x_n]_l = 1, \Psi^{(\eta-1)})
= \sum_{k=0}^{\min(z_{in}, z_{jn})} k\, P(Y_{ijnl} = k \mid z_{in}, z_{jn}, [x_n]_l = 1, \Psi^{(\eta-1)})
= \sum_{k=0}^{\min(z_{in}, z_{jn})} k\, \frac{P(Y_{ijnl} = k, z_{in}, z_{jn} \mid [x_n]_l = 1, \Psi^{(\eta-1)})}{P(z_{in}, z_{jn} \mid [x_n]_l = 1, \Psi^{(\eta-1)})}
= \frac{\sum_{k=0}^{\min(z_{in}, z_{jn})} k\, Po(z_{in} - k \mid \theta_{il}^{(\eta-1)})\, Po(z_{jn} - k \mid \theta_{jl}^{(\eta-1)})\, Po(k \mid \theta_{ijl}^{(\eta-1)})}{P(z_{in}, z_{jn} \mid \theta_{il}^{(\eta-1)}, \theta_{jl}^{(\eta-1)}, \theta_{ijl}^{(\eta-1)})},

for i = 1, \dots, m-1, j = 2, \dots, m, i < j, where both numerator and denominator are evaluated with
the bivariate Poisson pmf at the current parameters, and

\varepsilon_{inl} = E(Y_{inl} \mid z_n, [x_n]_l = 1, \Psi^{(\eta-1)}) = z_{in} - \sum_{a \in \varpi(i)} \varepsilon_{ianl}, \quad i = 1, \dots, m,

where \varpi(i) = \{a : \varepsilon_{ianl} \text{ exists}\}. Define
ε = (ε_1, ε_2, \dots, ε_m, ε_{12}, ε_{13}, \dots, ε_{1m}, ε_{23}, ε_{24}, \dots, ε_{m-1,m}).
The second step (M-step) updates the parameters:

\upsilon_l = \frac{\sum_{n=1}^{N}\omega_{nl}}{N} \qquad \text{and} \qquad
\theta_{hl} = \frac{\sum_{n=1}^{N}\omega_{nl}\,\varepsilon_{hnl}}{\sum_{n=1}^{N}\omega_{nl}}, \qquad l = 1, \dots, L, \; h \in \varepsilon.

If some convergence criterion is satisfied, stop the algorithm; otherwise return to the E-step.
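To make the E- and M-steps concrete, the following R sketch implements the algorithm for the simplest case covered by (2), a mixture of L bivariate Poisson components; the starting values and the fixed number of iterations are only illustrative, and this is a simplified sketch rather than the full m-variate implementation used later.

# EM for a mixture of L bivariate Poisson components (illustrative sketch)
dbvpois <- function(z1, z2, t1, t2, t12) {
  k <- 0:min(z1, z2)
  sum(exp(-t1 - t2 - t12) * t1^(z1 - k) * t2^(z2 - k) * t12^k /
        (factorial(z1 - k) * factorial(z2 - k) * factorial(k)))
}
# conditional expectation of the latent pair variable Y12 given (z1, z2)
e_y12 <- function(z1, z2, t1, t2, t12) {
  k <- 0:min(z1, z2)
  w <- dpois(z1 - k, t1) * dpois(z2 - k, t2) * dpois(k, t12)
  sum(k * w) / sum(w)
}

em_mix_bvpois <- function(Z, L, iter = 200) {
  n   <- nrow(Z)
  ups <- rep(1 / L, L)                                  # mixing proportions
  th  <- matrix(runif(3 * L, 0.5, 3), L, 3)             # columns: t1, t2, t12
  for (it in 1:iter) {
    ## E-step: posterior component probabilities (omega) and latent expectations
    f <- sapply(1:L, function(l)
      mapply(dbvpois, Z[, 1], Z[, 2],
             MoreArgs = list(t1 = th[l, 1], t2 = th[l, 2], t12 = th[l, 3])))
    omega <- sweep(f, 2, ups, "*")
    omega <- omega / rowSums(omega)
    eps12 <- sapply(1:L, function(l)
      mapply(e_y12, Z[, 1], Z[, 2],
             MoreArgs = list(t1 = th[l, 1], t2 = th[l, 2], t12 = th[l, 3])))
    ## M-step: update weights and component parameters
    ups <- colMeans(omega)
    for (l in 1:L) {
      th[l, 3] <- sum(omega[, l] * eps12[, l]) / sum(omega[, l])
      th[l, 1] <- sum(omega[, l] * (Z[, 1] - eps12[, l])) / sum(omega[, l])
      th[l, 2] <- sum(omega[, l] * (Z[, 2] - eps12[, l])) / sum(omega[, l])
    }
  }
  list(weights = ups, theta = th)
}

# usage on toy data:
# Z <- cbind(rpois(500, 2), rpois(500, 3))
# em_mix_bvpois(Z, L = 2)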
If model (1) is employed, the EM algorithm takes a slightly different form due to the structure of
the model. The conditional expectation of the indicator variables X becomes

\omega_{nl} = \frac{\upsilon_l^{(\eta-1)} f_l(z_n; \theta_l^{(\eta-1)})}{\sum_{q=1}^{L}\upsilon_q^{(\eta-1)} f_q(z_n; \theta_q^{(\eta-1)})},

where, from (1),

f_l(z_n; \theta_l^{(\eta-1)}) = \exp\Big(-\sum_{i=0}^{m}\theta_{il}^{(\eta-1)}\Big)\,\prod_{t=1}^{m}\frac{(\theta_{tl}^{(\eta-1)})^{z_{tn}}}{z_{tn}!}\,
\sum_{k=0}^{\min\{z_{1n},\dots,z_{mn}\}}\,\prod_{p=1}^{m}\binom{z_{pn}}{k}\,k!\left(\frac{\theta_{0l}^{(\eta-1)}}{\prod_{h=1}^{m}\theta_{hl}^{(\eta-1)}}\right)^{k}.
The expected value of the latent variable Y_0, given the observation's group membership and the
current parameters, is

\varepsilon_{0nl} = E(Y_{0nl} \mid z_n, [x_n]_l = 1, \Psi^{(\eta-1)})
= \sum_{k=0}^{\min(z_{1n},\dots,z_{mn})} k\, P(Y_{0nl} = k \mid z_{1n}, \dots, z_{mn}, [x_n]_l = 1, \Psi^{(\eta-1)})
= \sum_{k=0}^{\min(z_{1n},\dots,z_{mn})} k\, \frac{P(Y_{0nl} = k, z_{1n}, \dots, z_{mn} \mid [x_n]_l = 1, \Psi^{(\eta-1)})}{P(z_{1n}, \dots, z_{mn} \mid [x_n]_l = 1, \Psi^{(\eta-1)})}
= \frac{\sum_{k=0}^{\min(z_{1n},\dots,z_{mn})} k \prod_{i=1}^{m} Po(z_{in} - k \mid \theta_{il}^{(\eta-1)})\, Po(k \mid \theta_{0l}^{(\eta-1)})}{P(z_{1n}, \dots, z_{mn} \mid \theta_{0l}^{(\eta-1)}, \dots, \theta_{ml}^{(\eta-1)})},

with the denominator given by the pmf in (1) evaluated at the current parameters. Then

\varepsilon_{inl} = E(Y_{inl} \mid z_n, [x_n]_l = 1, \Psi^{(\eta-1)}) = z_{in} - \varepsilon_{0nl}, \quad i = 1, \dots, m,

and ε = (ε_0, ε_1, ε_2, \dots, ε_m). The second step (M-step) updates the parameters:

\upsilon_l = \frac{\sum_{n=1}^{N}\omega_{nl}}{N} \qquad \text{and} \qquad
\theta_{hl} = \frac{\sum_{n=1}^{N}\omega_{nl}\,\varepsilon_{hnl}}{\sum_{n=1}^{N}\omega_{nl}}, \qquad l = 1, \dots, L, \; h \in \varepsilon.

If some convergence criterion is satisfied, stop the algorithm; otherwise return to the E-step.
4. Measures of classification of observed values
The literature provides numerous measures for classifying observations as anomalous or normal.
The modified likelihood ratio test, the caselikelihood and the proposed approach based on ordering
principles are the techniques considered in this paper for anomaly detection.
4.1 Modified likelihood ratio test
Wang et al. (1997) suggested the modified likelihood ratio test, based on the value of the likelihood
ratio statistic λ for testing the null hypothesis H_0: z_new ∈ P against the alternative hypothesis
H_1: z_new ∉ P, where P is the population from which the training data were obtained and z_new is
a new observation. If g(z_new; φ) is the density of z_new under the alternative hypothesis, there is
only one observation, z_new, from which to estimate φ, since φ is not functionally related to Ψ. One
cannot therefore estimate g(z_new; φ) reasonably; a nonparametric strategy could in principle be
used, but since there is a single observation, g(z_new; φ) is constant in the maximization (Woodward
and Sain, 2003). The constant g(z_new; φ) can be dropped, giving rise to the modified version λ̃ of
λ, known as the modified likelihood ratio statistic:
\tilde{\lambda} = \frac{\sup_{\theta \in \Psi} L_{new}(\theta)}{\sup_{\theta \in \Psi} L(\theta)}
= \frac{\sup_{\theta \in \Psi} \prod_{i=1}^{n} f(z_i; \theta)\, f(z_{new}; \theta)}{\sup_{\theta \in \Psi} \prod_{i=1}^{n} f(z_i; \theta)},
where θ̂ is the MLE for the mixture density obtained from the training data set z = (z_1, z_2, \dots, z_n).
Inspection of this formula shows that under the alternative hypothesis λ̃ is expected to be small,
which indicates that the new observation does not come from the same mixture distribution. The
null distribution of λ̃ can be assessed using bootstrap methods with either parametric or
nonparametric resampling. This bootstrap approach relies on the fact that asymptotically, i.e. as
n → ∞, λ̃ ≈ f(z_new; θ̂) (Wang et al., 1997); this result is important in the sense that it makes it
feasible to test for anomalous observations in large databases. The threshold for λ̃, denoted here
λ̃_C, is the (100c)-th percentile of λ̃*_b, the λ̃ statistic obtained with the b-th bootstrap sample. Wang
et al. (1997) state that if α is the desired significance level for the test, then α = p/(B+1), where B
is the number of bootstrap samples, and λ̃_C is taken to be the p-th smallest value of λ̃*_b, b = 1, ..., B.
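An outline of this screening in R, leaning on the asymptotic result λ̃ ≈ f(z_new; θ̂): for brevity the fitted mixture is taken to be a two-component univariate Poisson mixture with assumed (illustrative) parameter values, B parametric bootstrap draws from the fitted model give the reference distribution, and the p-th smallest density is used as the threshold λ̃_C.

# Modified likelihood ratio screening via its large-sample form:
# flag z_new if its fitted mixture density falls below a bootstrap threshold.
set.seed(3)
ups   <- c(0.7, 0.3)     # assumed fitted mixing proportions (illustrative)
theta <- c(2, 9)         # assumed fitted component means (illustrative)

dmix <- function(z) ups[1] * dpois(z, theta[1]) + ups[2] * dpois(z, theta[2])

B     <- 9999
comp  <- sample(1:2, B, replace = TRUE, prob = ups)   # parametric bootstrap
zboot <- rpois(B, theta[comp])
lamb  <- dmix(zboot)                                  # bootstrap lambda-tilde values

alpha    <- 0.01
p        <- round(alpha * (B + 1))                    # alpha = p / (B + 1)
lambda_C <- sort(lamb)[p]                             # threshold: p-th smallest value

z_new <- 25
dmix(z_new) <= lambda_C     # TRUE here, so z_new is classified as suspicious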
4.2 The Caselikelihood
As with the likelihood ratio statistic, the caselikelihood (Tang and MacLennan, 2005) is another
measure that can be used for anomaly detection. It returns a nonzero probability value and works
in two different frameworks, normalized and non-normalized. In the non-normalized framework
the returned value is the product of the single probabilities corresponding to each variable value
of the suspicious observation. As the number of variables increases this product tends to decrease,
which makes it hard to interpret. To overcome this issue, in the normalized framework the
probability of the observation under the model learned by the algorithm is divided by the
probability computed without the model, using raw statistics.
The approach can be used as follows: compute the statistic

\pi = \frac{\eta(\tilde{z})}{\eta(z) + \frac{1}{n}},

where η(z) and η(z̃) are the likelihoods computed with the training data only and with the training
data together with the new observation, respectively, and 1/n represents the empirical probability
of the new observation. This statistic returns a nonzero probability value such that a value greater
than 0.5 is a strong indication that the new observation is distributed as the training data, while a
value less than 0.5 indicates that the new observation is not distributed as the training data.
4.3 The reduced sub-ordering method
An important principle for expressing the features of data is ordering. Ordering principles can be
used for ease, speed and efficiency in statistical analysis. Features such as extreme values,
variability and contamination may be detected via an ordering principle applied to the data. For
univariate data (Barnett, 1976) the principle of ordering is clear and unambiguous: let
z_1, z_2, \dots, z_n be the realizations of the random variables Z_1, Z_2, \dots, Z_n; then
z_{(1)}, z_{(2)}, \dots, z_{(n)} can be regarded as realizations of the random variables
Z_{(1)}, Z_{(2)}, \dots, Z_{(n)}, an ordering of Z_1, Z_2, \dots, Z_n according to a specific rule (e.g.
z_{(1)} < z_{(2)} < \dots < z_{(n)}). The ordering principles for univariate data do not in general carry
over to multivariate data sets, where there are no natural order properties. However, some effort
has resulted in several types of ordering principles termed sub-ordering principles; among them,
marginal ordering, reduced ordering, partial ordering and conditional ordering stand out.
For the purposes of this paper, reduced ordering is the suitable method. It has two phases: first,
each multivariate observation z_i, i = 1, \dots, n, is transformed into a scalar c_i, yielding n univariate
observations; this transformation is usually made with a distance metric (Laurikkala et al., 2000).
Univariate ordering principles can then be applied to the transformed data. In this paper the metric
is the Mahalanobis distance

c_i^2 = (z_i - \omega)'\,\Pi^{-1}\,(z_i - \omega),

where the parameter ω is the population mean vector and Π is the covariance matrix; these
parameters can be replaced by their sample counterparts. This distance has the appealing property
that it incorporates the dependencies between attributes, which is important for anomaly
identification in multivariate data. Another important property is that the units of the observations
have no influence on the estimated distance, since each attribute is effectively standardized to zero
mean and unit variance.
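In R the squared Mahalanobis distance is available directly through the base function mahalanobis(); a minimal sketch on toy data, plugging the sample mean and covariance in for ω and Π:

# Squared Mahalanobis distance of each observation to the sample centre
set.seed(4)
Z  <- cbind(rpois(200, 5), rpois(200, 3))        # toy bivariate count data
d2 <- mahalanobis(Z, center = colMeans(Z), cov = cov(Z))
head(sort(d2, decreasing = TRUE))                # largest distances = most extreme rows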
Gamma-type probability plots can be used for anomaly identification with the Mahalanobis
distance (Laurikkala et al., 2000), but this requires the observations to follow a multivariate normal
distribution. Since that is not the case here, the transformed data are instead used informally for
anomaly identification through box plots. The boxplot is a graphical method for visualizing data
features such as symmetry, extremeness and unusual data points, and it uses objective rules for
identifying unusual observations. The main contribution of this paper is the following: the mixture
model is used to cluster the observations; the mean and covariance are then obtained for each
cluster, followed by the computation of the Mahalanobis distance from each observation to each
of the clusters. Based on the obtained distances, observations are assigned to the cluster at smallest
distance, giving rise to a new grouping. Boxplots are then drawn for each group of data and
suspicious observations are visualized.
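A schematic R version of this proposal, assuming a data matrix Z and a vector cl of cluster labels obtained from the fitted mixture (both names are illustrative): compute each cluster's mean and covariance, the Mahalanobis distance of every observation to every cluster, reassign each observation to its nearest cluster, and draw boxplots of the distances per new group.

# Proposed reduced sub-ordering step, given mixture-based cluster labels `cl`
mahal_to_clusters <- function(Z, cl) {
  sapply(sort(unique(cl)), function(g) {
    Zg <- Z[cl == g, , drop = FALSE]
    mahalanobis(Z, center = colMeans(Zg), cov = cov(Zg))
  })
}

# toy data and labels standing in for the mixture clustering
set.seed(5)
Z  <- rbind(cbind(rpois(150, 2),  rpois(150, 1)),
            cbind(rpois(150, 15), rpois(150, 10)))
cl <- rep(1:2, each = 150)

D     <- mahal_to_clusters(Z, cl)      # n x (number of clusters) distance matrix
newcl <- apply(D, 1, which.min)        # reassign each observation to the nearest cluster
dmin  <- D[cbind(seq_len(nrow(D)), newcl)]
boxplot(dmin ~ newcl, xlab = "cluster", ylab = "squared Mahalanobis distance")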
5. Application
5.1 Data description
The dataset contains observations from a labour market survey conducted at Statistics Sweden
concerning employment, job vacancies and wages. The survey is conducted each quarter using
stratified sampling, with stratification by industry and size of establishment (number of employees
according to the business register). The sampling frame is constructed from the Swedish business
register. Establishments with more than 100 employees are selected with probability one and
report each month in the quarter. Establishments in strata with fewer than 100 employees are
selected using an SI design and report only for a randomly selected month in the quarter. In total
nearly 19000 establishments are sampled each quarter. The non-response rate is about 7%,
depending on the quarter. Data are collected with a mail questionnaire, with the option of using an
internet questionnaire. For this study the data for the fourth quarter of 2008 were used.
Among the several variables reported, four variables are selected here:
Number of persons with posts with conditional tenure (male (Z_1)/female (Z_2)), also called
long-term employees.
Number of persons employed for a limited period (male (Z_3)/female (Z_4)), also called short-term
employees.
5.2 Methods
The dataset described above totals 22763 observations divided into two groups: group A,
composed of companies with more than 100 employees, and group B, composed of companies
with fewer than 100 employees. Group B, with a total of 13745 observations, is used here. Two
versions of the dataset are available, the original and the edited version; based on a comparison of
the edited and unedited sets there are 658 measurement errors. There are 515 missing observations
in group B, corresponding to 3.7% of the data in that group. Prior to model estimation two
extremely abnormal observations, z_8062 = (8888, 1, 0, 1) and z_10454 = (2, 8888, 0, 0), were removed
from the database. Each detected observation was further inspected as to whether it was present in
the edited set or not; detected observations that are not found in the edited set are effectively
errors.
The characteristics of those 658 measurement errors are as follows: 239 observations consist of
zero counts for all variables, 113 observations have large counts on at least one item, and the
remaining observations have nonzero but small counts on at least one of their items.
As mentioned in Section 1, the pairwise correlations between the variables were computed, and
only variables 3 and 4 were found to have a significant correlation; the other pairwise correlations
are not significant. Fig. 1 shows the pairwise plot of the variables.
Fig 1. Pairwise plot of the variables
Based on the relations between the variables, the mixture model in (4) becomes

P(z_1, z_2, z_3, z_4) = \sum_{l=1}^{L} \upsilon_l \cdot Po(z_1; \theta_{1l}) \cdot Po(z_2; \theta_{2l}) \cdot BP(z_3, z_4; \theta_{3l}, \theta_{4l}, \theta_{34l}), \quad (5)

where

Po(z_i; \theta_{il}) = \frac{\exp(-\theta_{il})\,\theta_{il}^{z_i}}{z_i!}

and the bivariate factor is

BP(z_3, z_4; \theta_{3l}, \theta_{4l}, \theta_{34l}) = \sum_{k=0}^{\min(z_3, z_4)} \exp(-\theta_{3l} - \theta_{4l} - \theta_{34l})\,\frac{\theta_{3l}^{z_3 - k}\,\theta_{4l}^{z_4 - k}\,\theta_{34l}^{k}}{(z_3 - k)!\,(z_4 - k)!\,k!}.
The dataset described above was used to estimate model (5). The EM algorithm of Section 3.3 was
used to fit models with L = 1 component up to L = 16 components sequentially. For each model,
a set of n (= sample size) uniform numbers was generated on the interval [1, L] and rounded; the
L distinct values define L groups and each observation was assigned to the group corresponding
to its generated number. The proportion of observations in each group was used as the initial value
for the component proportion parameters. The mixture of Poisson models involves two groups of
parameters, the component proportions υ_1, \dots, υ_L and the θ_1', \dots, θ_L' parameters. For the
second group of parameters, five uniform random sets were generated within the range of the data
points for each L; the algorithm was run for 50 iterations for each set, and the parameter values
yielding the largest likelihood were taken as initial values.
Given the starting values, the algorithm was run until the relative change in the log likelihood was
less than 10^{-6} for each L. Model selection was based on BIC and AIC, computed respectively as

BIC(L) = -2\log(\text{likelihood}) + L\log(n)

and

AIC(L) = -2\log(\text{likelihood}) + 2L,

where log(likelihood) is the maximized log likelihood of the mixture model with L components.
The log likelihood continued to increase for values of L greater than 16, but only slightly. To avoid
very small groupings, which may distort the inferences, and to limit the computational burden
given the large database, the sixteen-component mixture model was chosen.
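A small helper matching the criteria exactly as defined above (note that the penalty is written in terms of the number of components L, as in the text, rather than the total number of free parameters); the numbers in the call are only illustrative.

# Information criteria as defined above, for a fitted L-component model
ic <- function(loglik, L, n) {
  c(BIC = -2 * loglik + L * log(n),
    AIC = -2 * loglik + 2 * L)
}
ic(loglik = -111900, L = 16, n = 13230)   # illustrative values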
The likelihood ratio statistic and the caselikelihood basically lead to the same conclusions (Wang,
2009), and since the database is large, which would require running the algorithm many times for
the caselikelihood test, only the likelihood ratio statistic is used to test whether an observation is
suspicious or normal.
The classification based on the modified likelihood ratio statistic was carried out as follows:
1. λ̃ was obtained for each observation in the database.
2. 10000 bootstrap resamples of the (n+1)-st observation were obtained.
3. For each bootstrap element b = 1, ..., 10000, λ̃*_b was calculated.
4. With α = p/(B+1), where α is the significance level for the test and B is the number of
resamples, the threshold λ̃_C is the p-th smallest value λ̃*_b. If λ̃_i ≤ λ̃_C, i = 1, ..., n+1, the
observation is classified as suspicious; otherwise the observation is normal.
For the proposed approach based on ordering principles, the estimated mixture model was used as
a basis for clustering the observations (Brijs et al., 2004). Under the mixture framework each
observation has a probability of belonging to each component and is assigned to the cluster
(component) with the highest probability. The mean and covariance were obtained for each cluster,
followed by the computation of the Mahalanobis distance from each observation to each of the
clusters. Observations are classified as suspicious if they are suspicious for all clusters. A second
assignment was then made: the observations were reassigned to the cluster at smallest distance.
Boxplots and histograms were plotted for each cluster and suspicious observations could be
visualized.
5.3 Empirical Results
Fig. 2 shows the plot of the log likelihood for each number of components. The log likelihood
clearly levels off in the neighbourhood of 50000.
Fig 2. Plot of the log likelihood vs the number of components
Table 1 shows the information criteria BIC and AIC for each model, from one component through
16 components. According to these criteria the model with 16 components, which has the smallest
BIC and AIC, is the best.
Table 1. Information criteria

Component        BIC         AIC
 1          524753.5    524746.0
 2          344244.6    344229.6
 3          304623.7    304601.2
 4          282832.0    282802.0
 5          268784.4    268747.0
 6          258883.3    258838.4
 7          248982.2    248929.8
 8          242665.7    242605.8
 9          239039.4    238972.0
10          234753.1    234678.2
11          233756.8    233674.4
12          229949.1    229859.2
13          228409.0    228311.6
14          227236.4    227131.6
15          225539.3    225427.0
16          223853.6    223733.8
Table 2. Variable means in the data set

Z1       Z2       Z3       Z4
9.714    6.044    1.208    1.116
Table 3. Covariances among the variables in the data set

Variable       Z1             Z2             Z3             Z4
Z1          9.714         -1.631*10^-7   -6.566*10^-9   -3.860*10^-8
Z2         -1.631*10^-7    6.044         -7.358*10^-10  -4.326*10^-9
Z3         -6.566*10^-9   -7.358*10^-10   1.208          0.221
Z4         -3.860*10^-8   -4.326*10^-9    0.221          1.116
Table 2 above reports the mean of each variable. It can be seen that on average the companies have
more long-term employed men than women, but this difference is not observed for short-term
employment. Table 3 reports the covariance matrix of the four variables. The results agree with
the previous analysis: there is a small positive covariance between variables 3 and 4, while the
other estimated pairwise covariances can be neglected.
Table 4. MLE of model parameters (std. errors in parentheses)

Cluster   θ1                θ2                θ3                θ4                θ34                   υL
 1        38.155 (4.547)     4.276 (0.566)     0.100 (0.023)     0.940 (0.213)     0.072 (0.020)        0.043
 2        60.192 (2.060)    12.019 (2.825)     0.585 (0.205)     2.558 (0.659)     0.330 (0.110)        0.025
 3         1.694 (0.244)     5.677 (0.927)     0.810 (0.128)     0.093 (0.036)     0.103 (0.036)        0.119
 4         8.212 (0.855)    34.522 (8.125)    12.565 (2.131)     1.652 (0.346)     1.888 (0.305)        0.014
 5         1.284 (0.160)     0.681 (0.065)     0.037 (0.062)     0.062 (0.024)     0.016 (0.014)        0.297
 6         6.213 (0.221)     1.351 (0.266)     0.038 (0.043)     0.188 (0.070)     0.012 (0.005)        0.167
 7        32.274 (6.472)    32.382 (6.458)    11.290 (5.151)     9.531 (3.753)     3.939 (2.992)        0.007
 8         9.115 (4.294)    16.031 (6.132)    39.464 (9.891)    17.648 (4.634)     5.553 (1.196)        0.005
 9        26.881 (8.421)     5.052 (1.775)     2.257 (0.800)    18.082 (5.571)     1.517 (0.827)        0.011
10        17.932 (0.908)     2.455 (0.322)     0.072 (0.081)     0.726 (0.240)     0.088 (0.029)        0.069
11         3.231 (0.694)     1.128 (0.176)     1.614 (0.243)     5.057 (1.116)     1.750*10^-9 (3.00*10^-10)  0.026
12         8.041 (0.412)     8.214 (0.506)     0.218 (0.062)     0.241 (0.082)     0.084 (0.028)        0.080
13         6.964 (1.512)    21.851 (2.458)     1.688 (0.167)     0.304 (0.130)     3.840 (0.195)        0.039
14        33.644 (10.870)   40.766 (3.145)     1.994 (0.604)     1.082 (0.248)     0.584 (1.743)        0.023
15        22.844 (1.375)    15.885 (1.168)     0.568 (0.133)     0.642 (0.208)     0.533 (0.412)        0.048
16         6.140 (1.158)     7.763 (1.136)     6.841 (1.172)     3.579 (0.781)     1.054 (0.314)        0.028
Table 4 presents the estimated parameters and their respective standard errors for the model with
16 components. θ1, θ2, θ3 and θ4 are the parameters for variables 1 through 4, and θ34 represents
the covariance between variables 3 and 4. The υL are the component proportions. Component 8
contributes least to the model, while component 5 contributes most.
Table 5 reports the number of observations and the number of suspicious observations per cluster
according to the two methods. Observations are assigned to the cluster with the highest conditional
probability, as stated in subsection 5.2. Under the modified likelihood ratio statistic (method 1)
each observation is classified as suspicious or normal. Under method 2, after the first assignment
used in method 1, each observation is further assigned to its closest cluster according to the
smallest Mahalanobis distance. For example, cluster 1 contains 573 observations, of which 72
departed from this cluster under the Mahalanobis reassignment, while cluster 4, with 187
observations, received 48 new observations under method 2.
Under method 1, at the 1% significance level 136 observations are detected as suspicious and 86
of these are effectively edited observations. At the 5% significance level 472 observations are
detected and 117 are found to contain measurement errors. At the 10% significance level 821
observations are detected and 221 are measurement errors. Employing the second method, 441
observations are found to be suspicious; after inspecting whether each of these appears in the
edited set or not, 197 are found to be effectively measurement errors. The success in detecting
measurement errors is summarized in Table 6.
Figures 3 and 4 show the boxplots and the corresponding histograms constructed from the
distances obtained under method 2. These figures confirm the existence of outliers in each cluster.
Some observations greatly affect the data distribution, suggesting extremely abnormal values; this
is the case for observations 7782 and 11278 in cluster 2, which completely distort the boxplot and
histogram, as can be seen in the boxplot/histogram plotted after removing these observations. The
same happens in cluster 14, where observations 9880 and 10063 greatly distort the boxplot and the
histogram for that cluster.
Table 5. Number of observations and number of suspicious observations for each component

           Number of observations         Number of suspicious observations per cluster
           per cluster                    Method 1                            Method 2
Cluster    Method 1    Method 2           1%       5%       10%
 1            573         501              0        6        32                 16
 2            336         412             12       81       109                 24
 3           1535        1563              0        0        13                 29
 4            187         235             15       90       129                 17
 5           3994        3163              0        0         0                 21
 6           2243        1956              0        0         0                 89
 7             88         167             18       43        68                 17
 8             64          81             32       23        44                  9
 9            141         233             23       37        61                 18
10            909         933              0        5        26                 26
11            329         725              4       23        33                 21
12           1012        1079              0        0         4                 31
13            516         546              1        9        91                  8
14            314         335             23       72       107                 35
15            622         675              0       21        58                 39
16            364         626              8       62        46                 31
Total       13227       13230            136      472       821                441
Table 6. Accuracy in detection of errors by method

                         Method 1                      Method 2
Sig. level         1%        5%        10%
Detected          136       472       821                441
M. errors          86       117       221                197
In general the detection accuracy of the two methods differs considerably. The difference may be
explained by the fact that under method 1 an observation is classified as an error or not on the basis
of a probability value, while the second method relies on the distance of the element to its nearest
cluster centre. In the latter case, observations that are not effectively measurement errors can be
detected as errors because they lie relatively far from the centre of their cluster. For instance the
observation z = (0, 3, 3, 0) is detected as an error while it is effectively not one, because its
associated cluster is composed of elements with smaller values. There are two different types of
outliers in boxplots, namely mild outliers and extreme outliers: the former are values lying between
Q3 + 1.5·IQR and Q3 + 3·IQR, while the latter are values above Q3 + 3·IQR, where Q3 is the third
quartile and IQR the interquartile range. Most of the 244 observations flagged by method 2 that
are not confirmed as errors may be mild outliers, observations that differ only a little from the
remainder and are not effectively outliers. The boxplots help to understand this feature: for
instance in cluster 5, method 1 finds no abnormal observation at any significance level while
method 2 finds 21 suspicious observations, yet the boxplot of this cluster suggests no outliers at
all (see Fig. 3, cluster 5), which reinforces the idea that the detected observations deserve further
checking before being declared outliers or normal. The B set contains some observations with
extremely high counts, and the second method detects all of them. As discussed in Section 5.2, of
the 658 measurement errors present in the dataset almost half consist of zero counts for all items;
most of these were associated with cluster 5, which is composed of small counts, and they may
not have a very negative impact on the estimated parameters compared with the general pattern of
the data. Finding 86 measurement errors among the 136 observations flagged as suspicious at the
1% significance level constitutes a high rate of measurement error detection, which suggests that
the method is effective. The same argument can be made for the second method, which flagged
441 suspicious observations of which 197 were effectively measurement errors.
Fig 3. Boxplots for each cluster illustrating some suspicious observations
Fig 4. Histograms for each cluster illustrating suspicious observations
6. Conclusions
The paper is concerned with the identification of measurement errors in multivariate count data
using finite mixtures of multivariate Poisson models. Two methods for constructing a multivariate
Poisson distribution are discussed: the multivariate Poisson distribution with common covariance
structure and the multivariate Poisson distribution with restricted covariance structure; the latter
is used in the empirical application. Two strategies for outlier identification are also discussed,
namely the modified likelihood ratio test and the caselikelihood. The caselikelihood is not practical
for large databases, which is the case in this study, so only the modified likelihood ratio test is
applied in the empirical study. A new approach based on ordering principles is proposed and used
in the empirical application.
The mixture model was used for clustering the observations, and the BIC and AIC criteria selected
the 16-component mixture model. A preliminary analysis of the dependence between the variables
revealed that only variables 3 and 4 have a significant correlation; the other pairwise associations
were not significant, which simplified the structure of the multivariate Poisson. After fitting the
model, the estimated covariances between the variables support the same conclusion.
According to the results obtained it can be concluded that the approaches are indeed useful for the
detection of erroneous observations, since all observations that differ greatly from the remainder
were detected by the two strategies applied. However, Table 6 suggests that the proposed approach
(method 2) is more effective in identifying measurement errors. This method is also simpler than
the modified likelihood ratio, which requires the use of bootstrap techniques.
The multivariate Poisson model construction and the EM algorithm scheme applied were both
designed to minimize computational effort; nevertheless the empirical study showed that the
computations are still tedious. To reduce the computational effort further, acceleration techniques
should be considered, for instance the IEM algorithm of Neal and Hinton (1998) described in
McLachlan and Peel (2000).
The results obtained here also strengthen the view that there is no need to impose a complex
covariance structure in the multivariate Poisson construction, which would make the computation
tedious.
References:
Barnett, V. (1976). The ordering of multivariate data. Journal of the Royal Statistical Society
A, 139, 318-354.
Brijs, T., Karlis, D., Swinnen, G., Vanhoof, K. and Wets, G. (2002). Tuning the Multivariate
Poisson Mixture Model for Clustering Supermarket Shoppers. Department of Statistics,
Athens University of Economics. Available at:
http://alpha.luc.ac.be/~brijs/./pubs/louvain2002.p
Brijs, T., Karlis, D., Swinnen, G., Vanhoof, K., Wets, G. and Manchanda, P. (2004). A
multivariate Poisson mixture model for marketing applications. Statistica Neerlandica, 58,
322-348.
Cameron, A. C. and Trivedi, P. K. (1996). Count Data Models for Financial Data. Handbook
of Statistics, Vol. 14, Statistical Methods in Finance, 363-392. Amsterdam, North-Holland.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum Likelihood from
Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B,
39, 1-38.
Elahi, M., Li, K., Nisar, W., Lv, X. and Wang, H. (2009). Detection of Local Outlier over
Dynamic Data Streams using Efficient Partitioning Method. WRI World Congress on
Computer Science and Information Engineering, 76-81.
Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer, New
York.
Johnson, N. L., Kotz, S. and Balakrishnan, N. (1997). Discrete Multivariate Distributions.
Wiley, New York.
Kano, K. and Kawamura, K. (1991). On recurrence relations for the probability function of
multivariate generalized Poisson distribution. Communications in Statistics, 20(1), 165-178.
Karlis, D. and Meligkotsidou, L. (2007). Finite mixtures of multivariate Poisson
distributions with application. Journal of Statistical Planning and Inference, 137, 1942-1960.
Karlis, D. (2003). An EM algorithm for multivariate Poisson distribution and related models.
Journal of Applied Statistics, 30(1), 63-77.
Kawamura, K. (1985). A note on the recurrent relations for the bivariate Poisson distribution.
Kodai Mathematical Journal, 8, 70-78.
Kawamura, K. (1987). Calculation of density for the multivariate Poisson distribution.
Kodai Mathematical Journal, 10, 231-241.
Kawamura, K. (1973). The structure of bivariate Poisson distribution. Kodai Mathematical
Seminar Reports, 25(2), 246-256.
Laurikkala, J., Juhola, M. and Kentala, E. (2000). Informal identification of outliers in
medical data. The Fifth Workshop on Intelligent Data Analysis in Medicine and
Pharmacology, Berlin, pp. 20-24.
McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.
Rouse, D. M. (2005). Estimation of Finite Mixture Models. Available at:
www.lib.ncsu.edu/theses/available/etd-11282005-140114/.../etd.pdf
Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling.
Springer, New York.
Schlattmann, P. (2009). Medical Applications of Finite Mixture Models. Springer, New York.
Tang, Z. and MacLennan, J. (2005). Data Mining with SQL Server 2005. John Wiley and
Sons.
Titterington, D. M., Smith, A. F. M. and Makov, U. E. (1985). Statistical Analysis of Finite
Mixture Distributions. Wiley, New York.
Tsiamyrtzis, P. and Karlis, D. (2004). Strategies for Efficient Computation of Multivariate
Poisson Probabilities. Communications in Statistics - Simulation and Computation, 33(2),
271-292.
Wang, L. (2009). Outlier Detection with Finite Mixture Model. Örebro University, Master
thesis.
Wang, S., Woodward, W. A., Gray, H. L., Wiechecki, S. and Sain, S. R. (1997). A New Test
for Outlier Detection from a Multivariate Mixture Distribution. Journal of Computational and
Graphical Statistics, 6, 285-299.
Woodward, W. A. and Sain, S. R. (2003). Testing for outliers from a mixture distribution
when some data are missing. Computational Statistics and Data Analysis, 44, 193-210.
Yamanishi, K. and Takeuchi, J. (2004). On-Line Unsupervised Outlier Detection Using
Finite Mixtures with Discounting Learning Algorithms. Data Mining and Knowledge
Discovery, 8, 275-300.
Ye, M., Li, X. and Orlowska, E. M. (2009). Projected outlier detection in high-dimensional
mixed-attributes data set. Expert Systems with Applications, 7104-7113.
Appendix: R code
R code for parameter estimation
rm(list=ls())
library(bivpois) #assumed source of pbivpois() used below; the Bp() function further down computes the same bivariate Poisson probability
data<-read.table("C:\\Users\\User\\Desktop\\dataw.txt",header=TRUE)
EIJnl<-NULL
Q<-NULL
m<-4 #number of variables
n<-length(data$v1) #number of observations
z<-matrix(c(data$v1,data$v2,data$v3,data$v4),n,m)
data1<-matrix(c(data$v1,data$v2,data$v3,data$v4),n,m)
fact<-function(y){x<-1
if(y==0) x<-1
else {
for(i in 1:y) x<-i*x
}
return (x)
}
q<-m+1 #number of rate parameters per component (4 rates + 1 covariance term)
L<-16 #number of mixture components
data$v5<-runif(n,1,L) #random initial component labels
Wnl<-matrix(,n,L)
lambda<-matrix(,L,q)
Eps<-matrix(,n,L)
Eij<-matrix(,n,L)
Bij<-matrix(,n,L)
v<-NULL
V<-function(L){
for(i in 1:L) v[i]<-sum(data$v5==i)/n
return(v)
}
R<-function(y){if(y-as.integer(y)<0.5)y<-as.integer(y)
else y<-as.integer(y)+1
return(y)}
for(i in 1:n){data$v5[i]<-R(data$v5[i]) }
v<-V(L)# vector of components weights
v
div<-function(L){ #random starting values for the rate parameters
for(i in 1:L){ datadiv<-matrix(,sum(data$v5==i),m);l<-0
for(j in 1:n){
if(data$v5[j]==i){l<-l+1
datadiv[l,]<-z[j,]
}
}
lambda[i,]<-c(runif(1,0,max(data$v1)),runif(1,0,max(data$v2)),runif(1,0,max(data$v3)),runif(1,0,max(data$v4)),runif(1,0,100))
}
return(lambda)
}
lambda<-div(L)
lambda
v
P<-NULL
Po<-function(x,l){ po<-(exp(-l)*l^x)/fact(x)
return(po)
}
Bp<-function(y1,y2,lambda1,lambda2,lambda12){initial<-0 #bivariate Poisson probability
expn<-exp(-(lambda1+lambda2+lambda12))
for(i in 0:min(y1,y2)){
initial<-initial+(lambda1^(y1-i)*lambda2^(y2-i)*lambda12^(i))/(fact(y1-i)*fact(y2-i)*fact(i))
}
bp<-expn*initial
return(bp)
}
Condf<-function(x,l){ #density of observation x under component l
product<-Po(z[x,1],lambda[l,1])*Po(z[x,2],lambda[l,2])*pbivpois(z[x,3],z[x,4],c(lambda[l,3],lambda[l,4],lambda[l,5]))
return(product)
}
Uncondf<-function(x){ #mixture density of observation x
prodt<-0
for(s in 1:L){
prodt<-prodt+v[s]*Condf(x,s)
}
return(prodt)
}
f<-function(x,l){
wnl<-v[l]*Condf(x,l)/Uncondf(x)
return(wnl)
}
WNL<-function(){
for(x in 1:n){
for(l in 1:L){
Wnl[x,l]<-f(x,l)
}
}
return(Wnl)
}
g<-0 #iteration counter
repeat{
g<-g+1
wnl<-WNL() #E-step: posterior component probabilities
wnl
eijl<-function(z1,z2,li,lj,lij){eps<-0
for(r in 0:min(z1,z2)){ eps<-eps+r*Po(z1-r,li)*Po(z2-r,lj)*Po(r,lij)
}
return(eps)
}
Eijnl<-function(x,l){ij<-0; t<-NULL;B<-1;BP<-NULL #conditional expectation of the common latent term under component l
g<-eijl(z[x,3],z[x,4],lambda[l,3],lambda[l,4],lambda[l,5])/pbivpois(z[x,3],z[x,4],c(lambda[l,3],lambda[l,4],lambda[l,5]))
return(g)
}
EIJNL<-function(x){ p<-NULL
for(l in 1:L) p<-c(p,Eijnl(x,l))
return(p)
}
Bijn<-function(){
for(x in 1:n) Bij[x,]<-EIJNL(x)
return(Bij)
}
epsl<-Bijn()
epsl
Eijl<-function(x){ e1<-NULL;e2<-NULL;e3<-NULL;e4<-NULL
for(l in 1:L){
e1<-c(e1,z[x,1])
e2<-c(e2,z[x,2])
e3<-c(e3,z[x,3]-epsl[x,l])
e4<-c(e4,z[x,4]-epsl[x,l])
}
e<-c(e1,e2,e3,e4)
return(e)
}
Epsl<-function(){Eps<-matrix(,n,m*L)
for(x in 1:n)Eps[x,]<-Eijl(x)
return(Eps)
}
Epij<-Epsl()
Epij
vnew<- function(){ for(l in 1:L) v[l]<-sum(wnl[,l])/n #M-step: update mixing proportions
return(v)
}
v<-vnew()
Lambdas<-function(){ L1<-NULL;L2<-NULL;L3<-NULL;L4<-NULL;L5<-NULL #M-step: update rate parameters
Lambda<-matrix(,L,q)
for(l in 1:L){
L1<-c(L1,sum(wnl[,l]*Epij[,l])/sum(wnl[,l]))
L2<-c(L2,sum(wnl[,l]*Epij[,L+l])/sum(wnl[,l]))
L3<-c(L3,sum(wnl[,l]*Epij[,2*L+l])/sum(wnl[,l]))
L4<-c(L4,sum(wnl[,l]*Epij[,3*L+l])/sum(wnl[,l]))
L5<-c(L5,sum(wnl[,l]*epsl[,l])/sum(wnl[,l]))
}
Lambda<-matrix(c(L1,L2,L3,L4,L5),L,q)
return(Lambda)
}
lambda<-Lambdas()
F<-function(){Lik<-0 #observed-data log-likelihood of the current parameters
for(x in 1:n){ f<-0 #reset the mixture density for each observation
for(l in 1:L){ f<-f+v[l]*Condf(x,l)
}
Lik<-Lik+log(f)
}
return(Lik)
}
P<-c(P,F()) #log-likelihood path over the EM iterations
if(g>1){Q[g-1]<-P[g]-P[g-1]} #change in log-likelihood between successive iterations
if(length(P)>20)break #stop after a fixed number of iterations
}
R code for bootstrap of standard errors
rm(list=ls())
library(bivpois) #assumed source of pbivpois() used below
B<-matrix(,100,5*16 ) #100 bootstrap replicates of the 16x5 rate parameters
data<-read.table("C:\\Users\\User\\Desktop\\datawb.txt",header=TRUE)
g<-0
repeat{
g<-g+1
Lambda16<-matrix(c( #estimated rate parameters of the 16-component model (column by column)
38.15450799, 60.1916087, 1.69377401, 8.212057, 1.28443760, 6.21332185, 32.273688, 9.115402,
26.881270, 17.93242178, 3.230533, 8.0414025, 6.9642155, 33.6442291, 22.8443448, 6.139962,
4.27558684, 12.0189718, 5.67742617, 34.521507, 0.68127276, 1.35062338, 32.382003, 16.031309,
5.052374, 2.45507732, 1.128142, 8.2137431, 21.8512490, 40.7658999, 15.8852785, 7.763343,
0.10049911, 0.5855036, 0.80964569, 12.565228, 0.03695881, 0.03759575, 11.290171, 39.463898,
2.257400, 0.07172701, 1.614032, 0.2182159, 1.6884718, 1.9941723, 0.5676319, 6.840999,
0.94018625, 2.5575498, 0.09286526, 1.652264, 0.06172529, 0.18817737, 9.530685, 17.648157,
18.082073, 0.72565410, 5.056689, 0.2409899, 0.3035150, 1.0822242, 0.6416134, 3.579238,
0.07160081, 0.3303317, 0.10266641, 1.887709, 0.01592727, 0.01227602, 3.939290, 5.553087,
1.517195, 0.08755417, 1.750189e-09, 0.0835878, 0.3840347, 0.5838264, 0.5328981, 1.054430),16,5)
v16<-c(0.042842779, 0.025215228, 0.119230845, 0.014196339, 0.296965265, 0.166934240, #estimated mixing proportions
0.006602024, 0.004643517, 0.010663367, 0.069124459, 0.026006180, 0.079718448,
0.038710435, 0.023024375, 0.048251016, 0.027871484)
EIJnl<-NULL
m<-4 #number of variables
n<-length(data$X1) #number of observations
z0<-matrix(c(data$X1,data$X2,data$X3,data$X4),n,m) #original data
z<-z0 #resampled data used in this bootstrap replicate
for(i in 1:n){
j<-sample(1:n,1)
z[i,]<-z0[j,] #draw rows with replacement from the original data
}
fact<-function(y){x<-1
if(y==0) x<-1
else {
for(i in 1:y) x<-i*x
}
return (x)
}
q<-m+1
L<-16
Wnl<-matrix(,n,L)
Eps<-matrix(,n,L)
Eij<-matrix(,n,L)
Bij<-matrix(,n,L)
R<-function(y){if(y-as.integer(y)<0.5)y<-as.integer(y)
else y<-as.integer(y)+1
return(y)}
lambda<-Lambda16 #start the single EM update from the estimated parameters
v<-v16
Po<-function(x,l){ po<-(exp(-l)*l^x)/fact(x)
return(po)
}
Bp<-function(y1,y2,lambda1,lambda2,lambda12){initial<-0 #bivariate Poisson probability
expn<-exp(-(lambda1+lambda2+lambda12))
for(i in 0:min(y1,y2)){
initial<-initial+(lambda1^(y1-i)*lambda2^(y2-i)*lambda12^(i))/(fact(y1-i)*fact(y2-i)*fact(i))
}
bp<-expn*initial
return(bp)
}
Condf<-function(x,l){ #density of observation x under component l
product<-Po(z[x,1],lambda[l,1])*Po(z[x,2],lambda[l,2])*pbivpois(z[x,3],z[x,4],c(lambda[l,3],lambda[l,4],lambda[l,5]))
return(product)
}
Uncondf<-function(x){
prodt<-0
for(s in 1:L){
prodt<-prodt+v[s]*Condf(x,s)
}
return(prodt)
}
f<-function(x,l){
wnl<-v[l]*Condf(x,l)/Uncondf(x)
return(wnl)
}
WNL<-function(){
for(x in 1:n){
for(l in 1:L){
Wnl[x,l]<-f(x,l)
}
}
return(Wnl)
}
wnl<-WNL()
eijl<-function(z1,z2,li,lj,lij){eps<-0
for(r in 0:min(z1,z2)){ eps<-eps+r*Po(z1-r,li)*Po(z2-r,lj)*Po(r,lij)
}
return(eps)
}
Eijnl<-function(x,l){ij<-0; t<-NULL;B<-1;BP<-NULL #conditional expectation of the common latent term under component l
g<-eijl(z[x,3],z[x,4],lambda[l,3],lambda[l,4],lambda[l,5])/pbivpois(z[x,3],z[x,4],c(lambda[l,3],lambda[l,4],lambda[l,5]))
return(g)
}
EIJNL<-function(x){ p<-NULL
for(l in 1:L) p<-c(p,Eijnl(x,l))
return(p)
}
Bijn<-function(){
for(x in 1:n) Bij[x,]<-EIJNL(x)
return(Bij)
}
epsl<-Bijn()
Eijl<-function(x){ e1<-NULL;e2<-NULL;e3<-NULL;e4<-NULL
for(l in 1:L){
e1<-c(e1,z[x,1])
e2<-c(e2,z[x,2])
e3<-c(e3,z[x,3]-epsl[x,l])
e4<-c(e4,z[x,4]-epsl[x,l])
}
e<-c(e1,e2,e3,e4)
return(e)
}
Epsl<-function(){Eps<-matrix(,n,m*L)
for(x in 1:n)Eps[x,]<-Eijl(x)
return(Eps)
}
Epij<-Epsl()
vnew<- function(){ for(l in 1:L) v[l]<-sum(wnl[,l])/n
return(v)
}
v<-vnew()
Lambdas<-function(){ L1<-NULL;L2<-NULL;L3<-NULL;L4<-NULL;L5<-NULL
Lambda<-matrix(,L,q)
for(l in 1:L){
L1<-c(L1,sum(wnl[,l]*Epij[,l])/sum(wnl[,l]))
L2<-c(L2,sum(wnl[,l]*Epij[,L+l])/sum(wnl[,l]))
L3<-c(L3,sum(wnl[,l]*Epij[,2*L+l])/sum(wnl[,l]))
L4<-c(L4,sum(wnl[,l]*Epij[,3*L+l])/sum(wnl[,l]))
L5<-c(L5,sum(wnl[,l]*epsl[,l])/sum(wnl[,l]))
}
Lambda<-matrix(c(L1,L2,L3,L4,L5),L,q)
return(Lambda)
}
lambda<-Lambdas()
C<-NULL
for(i in 1:L)C<-c(C,lambda[i,])
B[g,]<-C
if(g>99)break
}
B
SD<-NULL
for(i in 1: length(B[1,]))
SD[i]<-sd(B[,i]) #bootstrap standard error of each parameter estimate
R code for computation of the threshold of modified likelihood ratio test
rm(list=ls())
library(bivpois) #assumed source of pbivpois() used below
data<-read.table("C:\\Users\\User\\Desktop\\datar.txt",header=TRUE)
#data<-read.table("C:\\Users\\User\\Desktop\\dataw.txt",header=TRUE)
Lambda16<-matrix(c( #estimated rate parameters of the 16-component model (column by column)
38.15450799, 60.1916087, 1.69377401, 8.212057, 1.28443760, 6.21332185, 32.273688, 9.115402,
26.881270, 17.93242178, 3.230533, 8.0414025, 6.9642155, 33.6442291, 22.8443448, 6.139962,
4.27558684, 12.0189718, 5.67742617, 34.521507, 0.68127276, 1.35062338, 32.382003, 16.031309,
5.052374, 2.45507732, 1.128142, 8.2137431, 21.8512490, 40.7658999, 15.8852785, 7.763343,
0.10049911, 0.5855036, 0.80964569, 12.565228, 0.03695881, 0.03759575, 11.290171, 39.463898,
2.257400, 0.07172701, 1.614032, 0.2182159, 1.6884718, 1.9941723, 0.5676319, 6.840999,
0.94018625, 2.5575498, 0.09286526, 1.652264, 0.06172529, 0.18817737, 9.530685, 17.648157,
18.082073, 0.72565410, 5.056689, 0.2409899, 0.3035150, 1.0822242, 0.6416134, 3.579238,
0.07160081, 0.3303317, 0.10266641, 1.887709, 0.01592727, 0.01227602, 3.939290, 5.553087,
1.517195, 0.08755417, 1.750189e-09, 0.0835878, 0.3840347, 0.5838264, 0.5328981, 1.054430),16,5)
v16<-c(0.042842779, 0.025215228, 0.119230845, 0.014196339, 0.296965265, 0.166934240, #estimated mixing proportions
0.006602024, 0.004643517, 0.010663367, 0.069124459, 0.026006180, 0.079718448,
0.038710435, 0.023024375, 0.048251016, 0.027871484)
m<-4 #number of variables
n<-length(data$v1) #number of observations
z<-matrix(c(data$v1,data$v2,data$v3,data$v4),n,m)
R<-function(y){if(y-as.integer(y)<0.5)y<-as.integer(y)
else y<-as.integer(y)+1
return(y)}
fact<-function(y){x<-1
if(y==0) x<-1
else {
for(i in 1:y) x<-i*x
}
return (x)
}
q<-m+1
L<-16
lambda<-Lambda16
v<-v16
Po<-function(x,l){ po<-(exp(-l)*l^x)/fact(x)
return(po)
}
Bp<-function(y1,y2,lambda1,lambda2,lambda12){initial<-0 #bivariate Poisson probability
expn<-exp(-(lambda1+lambda2+lambda12))
for(i in 0:min(y1,y2)){
initial<-initial+(lambda1^(y1-i)*lambda2^(y2-i)*lambda12^(i))/(fact(y1-i)*fact(y2-i)*fact(i))
}
bp<-expn*initial
return(bp)
}
Condf<-function(x,l){ #density of observation x under component l
product<-Po(z[x,1],lambda[l,1])*Po(z[x,2],lambda[l,2])*pbivpois(z[x,3],z[x,4],c(lambda[l,3],lambda[l,4],lambda[l,5]))
return(product)
}
F<-function(x){ f<-0 #mixture density of observation x
for(l in 1:L){ f<-f+v[l]*Condf(x,l)
}
return(f)
}
B<-NULL
NB<-10000 #number of bootstrap draws
nn<-c(seq(n))
for(i in 1:NB){
j<-sample(nn,1) #draw an observation at random with replacement
B<-c(B,F(j)) #store its mixture density value
}
alf<-0.1 #significance level used for the threshold
j<-alf*(NB+1)
p<-R(j) #index of the empirical alf-quantile
B<-sort(B)
Wc<-B[p] #critical value for the modified likelihood ratio test
NN<-NULL
for(i in 1:n){
if(F(i)<Wc) NN<-c(NN,i) #flag observations whose mixture density falls below the threshold
}
NN #indices of the suspicious observations