Published as Chapter 16.1.2 in: W. Kloesgen, Jan Zytkow (eds.): Handbook of Data Mining and Knowledge Discovery.
Oxford University Press, 2002, 258-267
Handbook, Section C5.1.2: Classification methodology, H.H. Bock, (Version 23/12/99)
C5.1.2: Classification methodology
Summary: This section surveys a range of classification (discrimination) methods which are based on probabilistic or statistical models, such as Bayes methods, maximum likelihood, nearest-neighbour classifiers, non-parametric kernel density methods, plug-in rules, etc. Additionally, we point to various algorithmic approaches to classification, such as neural networks, support vector machines and decision trees, which are, however, fully discussed in subsequent sections of this handbook. A major part is devoted to the specification and estimation of various types of recovery rates and misclassification probabilities of a (fixed or data-dependent) classifier. Finally, we describe preprocessing methods for the selection of 'most informative' variables.
Keywords: Bayesian classification, discrimination methods, nonparametric classification, nearest-neighbour classification, error probabilities, selection of variables,
cross-validation, bootstrapping.
This section is devoted to the discrimination or classification problem, i.e., the problem of assigning an object O to one of m given classes (populations) Π1, ..., Πm on the basis of a data vector x = (x(1), ..., x(k))′ from a sample space X (e.g., Rk) which collects the values of k explanatory or 'predictive' variables X(1), ..., X(k) observed for this object O [link to Section C5.1.1].
Depending on the available information on the classes and the type of classification rule, we may distinguish the following cases:
(1) Probabilistic approach with known parameters [link to C5.1.2.1]
(2) Probabilistic approach with estimated parameters [link to C5.1.2.2]
(3) k-nearest-neighbour rules, Fisher’s geometrical approach [link to C5.1.2.2.c,
link to C5.1.6, link to C5.1.2.4]
(4) Neural network classification [link to C5.1.2.2.e, link to C5.1.8]
(5) AI algorithms such as: decision trees [link to C5.1.3], rule-based approaches,
etc.
The performance of classifiers is often measured in terms of (various types of) recovery rates and misclassification probabilities. Their specification and estimation are addressed in Section C5.1.2.3.
C5.1.2.1 Probabilistic approach with known parameters
The decision-oriented approach for classification starts from a probabilistic
model and minimizes either an expected loss or a total misclassification probability. The basic model assumes:
(1) An object O is randomly sampled from a heterogeneous population Π with m subpopulations (classes) Π1, ..., Πm and corresponding prior probabilities (class frequencies) π1, ..., πm > 0 with π1 + · · · + πm = 1. Thus πi is the probability that the object O is actually sampled from the class Πi.
(2) The observable feature vector X = (X(1), ..., X(k))′ for O is a random vector with values in a space X (e.g., X = Rk) and with a class-specific distribution density fi(x), where i ∈ {1, ..., m} denotes the index of the population Πi to which O belongs. In many cases fi(x) is taken from a parametric density family f(x; ϑ) such that fi(x) = f(x; ϑi) with a class-specific parameter ϑ = ϑi, e.g., from a normal distribution Nk(µi, Σi) where the unknown parameter vector ϑi = (µi, Σi) comprises the class mean µi ∈ Rk and the covariance matrix Σi of X (for i = 1, ..., m). For discrete data, fi(x) is the probability that X takes the value x (in the i-th class).
A (non-randomized) decision rule (classifier) is a function δ : X → {1, ..., m} which specifies, for each x ∈ X, the index δ(x) = i of the class Πi to which an object O with data x is assigned. In contrast, a randomized decision rule φ, or φ(x) = (φ1(x), ..., φm(x)), specifies, for each x ∈ X and each class Πi, a probability or plausibility φi(x) ≥ 0 for assigning the observation x to the i-th class Πi (with φ1(x) + · · · + φm(x) = 1).
In the (unrealistic) case where all πi, fi(·) or ϑi are known, there are some well-established methods for defining 'optimum classifiers' φ which are then typically used in practice and which implicitly underlie many algorithms of pattern recognition, artificial intelligence and supervised learning.
a) The Bayesian classification rule for a general loss function:
The Bayesian approach assumes that each decision is related to a specified loss (or gain): Let Lti ≥ 0 be the loss incurred when assigning an object from Πt to the class Πi, which means a misclassification if t ≠ i and a correct classification if t = i. Then each classification rule φ = (φ1, ..., φm) has an expected (average) loss, also termed the Bayesian loss of φ, which is given by:
r(\phi, \pi) := \int_{\mathbb{R}^k} \sum_{i=1}^{m} \phi_i(x) \Big[ \sum_{t=1}^{m} L_{ti}\, \pi_t f_t(x) \Big] dx \;=\; \sum_{i=1}^{m} \sum_{t=1}^{m} L_{ti}\, \pi_t\, \alpha_{ti}(\phi).        (1)

Here αti(φ) = ∫ φi(x) ft(x) dx denotes the probability of assigning an object from Πt to Πi when using φ. For t ≠ i this is the error probability of type (t, i), whereas for t = i, αii(φ) is termed the recovery probability of type i.
A decision rule φ∗ which minimizes the loss (1) is called a Bayesian classifier. It can be shown that φ∗ is essentially given by the rule:
h_i(x) := \sum_{t=1}^{m} L_{ti}\, \pi_t f_t(x) \;\to\; \min_{i \in \{1,...,m\}}.        (2)
Thus, an object O with observation vector x is assigned to the class Πi with minimum value hi(x) among h1(x), ..., hm(x). This is equivalent to saying that the class index i minimizes the a posteriori loss given by Li(x) := hi(x)/[π1 f1(x) + · · · + πm fm(x)]. Here the denominator f(x) := π1 f1(x) + · · · + πm fm(x) is the marginal density of the random data vector X; it is a mixture of the m class densities f1, ..., fm. Formally, we may put φ∗i(x) = 1 (and φ∗j(x) = 0 for all classes j ≠ i) and see that φ∗ is a non-randomized classifier.
In particular, the set Ai := { x ∈ Rk | hi(x) = min{h1(x), ..., hm(x)} } = { x ∈ Rk | φ∗i(x) = 1 } of all data vectors x which are assigned to the same class Πi is called the acceptance region for Πi. The form of these regions Ai and of their joint boundaries ∂Ai depends primarily on the densities f1, ..., fm. If these boundaries ∂Ai are linear, quadratic, ..., we speak of a linear, quadratic, ... classifier; the practical usefulness of a classifier depends largely on the substantive interpretability of these boundaries.
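To make rule (2) concrete, the following minimal sketch (in Python with NumPy; not part of the original chapter) evaluates hi(x) for two hypothetical univariate normal classes with known parameters and an asymmetric loss matrix. All numbers, names and the univariate setting are illustrative assumptions.

    import numpy as np

    # Hypothetical two-class example with known parameters (class-conditional
    # normal densities f_i, priors pi_i, and a general loss matrix L[t, i]).
    priors = np.array([0.6, 0.4])                      # pi_1, pi_2
    means, sds = np.array([0.0, 2.0]), np.array([1.0, 1.5])
    L = np.array([[0.0, 1.0],                          # L[t, i]: loss of assigning an object from class t to class i
                  [5.0, 0.0]])                         # missing class 2 is penalized more heavily

    def density(x, i):
        """Class-conditional density f_i(x) for a univariate normal class."""
        return np.exp(-0.5 * ((x - means[i]) / sds[i]) ** 2) / (sds[i] * np.sqrt(2 * np.pi))

    def bayes_classify(x):
        """Rule (2): choose the class i minimizing h_i(x) = sum_t L[t, i] * pi_t * f_t(x)."""
        f = np.array([density(x, t) for t in range(len(priors))])
        h = L.T @ (priors * f)          # vector of h_i(x) for all i
        return int(np.argmin(h)) + 1    # class label 1..m

    print([bayes_classify(x) for x in (-1.0, 0.5, 1.0, 3.0)])

Note how the asymmetric loss shifts the boundary towards class 1: observations are assigned to class 2 earlier than under a symmetric loss.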
b) The Bayesian rule for a 0-1 loss function:
In the case of the two-valued 0-1 loss function Lti = 1 or 0 for t ≠ i and t = i, respectively, the expected loss (1) reduces to the total error probability
r(\phi, \pi) = \sum_{i=1}^{m} \sum_{t \neq i} \pi_t\, \alpha_{ti}(\phi) = 1 - \sum_{i=1}^{m} \pi_i\, \alpha_{ii}(\phi) =: 1 - \alpha(\phi)        (3)

where α(φ) := π1 α11(φ) + · · · + πm αmm(φ) = 1 − r(φ, π) is the total probability of a correct decision (overall recovery rate, hitting probability). Since the corresponding Bayesian classifier φ∗ minimizes (3), it maximizes the recovery rate α(φ).
Substituting Lti into (2) yields hi(x) = f(x) − πi fi(x) → mini; therefore φ∗ assigns an observed vector x ∈ X to the class Πi with maximum value of πi fi(x) or, equivalently, with maximum a posteriori probability pi(x) for the class Πi:

p_i(x) := \frac{\pi_i f_i(x)}{f(x)} \;\to\; \max_{i \in \{1,...,m\}}.        (4)
The minimum attainable total error probability is then given by r(φ∗, π) = 1 − α(φ∗) = 1 − ∫ m(x) dx with the maximum m(x) := maxi{πi fi(x)}. Note that the m posterior probabilities (p1(x), ..., pm(x)) define a fuzzy classification for each data point x: in this interpretation pi(x) is the degree of membership of x in the i-th class, and the 'fuzzy class' i is characterized by the function pi(x) from X to the unit interval [0, 1] (see, e.g., Bandemer and Gottwald 1995).
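As an illustration of (4) and of this fuzzy-membership reading of the posteriors, here is a small hypothetical Python/NumPy sketch with the same assumed two-class normal setup as above; parameter values are again arbitrary.

    import numpy as np

    priors = np.array([0.6, 0.4])
    means, sds = np.array([0.0, 2.0]), np.array([1.0, 1.5])

    def densities(x):
        """Vector of class-conditional densities (f_1(x), f_2(x))."""
        return np.exp(-0.5 * ((x - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))

    def posterior(x):
        """p_i(x) = pi_i f_i(x) / f(x); rule (4) picks the argmax, the values give a fuzzy membership."""
        w = priors * densities(x)
        return w / w.sum()

    for x in (-1.0, 1.0, 3.0):
        p = posterior(x)
        print(x, p, "-> class", int(np.argmax(p)) + 1)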
5
Published as Chapter 16.1.2 in: W. Kloesgen, Jan Zytkow (eds.): Handbook of Data Mining and Knowledge Discovery.
Oxford University Press, 2002, 258-267
c) The Bayesian rule for uniform priors: the maximum-likelihood classifier
If all classes are equally likely, i.e., πi = 1/m for all i, then (4) reduces to the maximum likelihood (m.l.) discrimination rule:

f_i(x) \;\to\; \max_{i \in \{1,...,m\}}.        (5)
d) Detection of ’unclassifiable’ objects
Various modifications are possible in the formulation of the classification or discrimination problem, in particular the consideration of an (m + 1)-th decision category i = 0 which corresponds to 'postponing the decision' and thus collects the 'unclassifiable' objects. If, in the case of uniform priors, all losses Lt0 for 'postponing' are assumed to have the same value d with 0 = Ltt < Lt0 ≡ d < Lti = 1 (for all t, i ≥ 1 and t ≠ i), then the corresponding Bayes rule φ∗ is formulated in terms of the maximum function m(x):

Decide for 'x is unclassifiable' if m(x) < (1 − d) f(x).
Assign x to the class Πi if fi(x) = m(x) ≥ (1 − d) f(x).
e) Normal distribution models:
A commonly used (but often inappropriate) distribution model is provided by the normal distribution: each Πi is characterized by a k-dimensional normal distribution fi(x; ϑi) ≙ Nk(µi, Σ) with the class-specific mean ϑi ≡ µi ∈ Rk and a common positive definite covariance matrix Σ. The corresponding Bayesian classifier (4) reduces (after some elementary calculations) to the rule:

d_i(x) := \|x - \mu_i\|^2_{\Sigma^{-1}} - 2 \log \pi_i \;\to\; \min_{i \in \{1,...,m\}},        (6)
i.e., it minimizes the squared Mahalanobis distance \|x - \mu_i\|^2_{\Sigma^{-1}} := (x - \mu_i)' \Sigma^{-1} (x - \mu_i), and can be written equivalently with Fisher's linear discrimination functions Lti(x): Decide for class Πi if

L_{ti}(x) := \tfrac{1}{2}\,\big[\,\|x - \mu_t\|^2_{\Sigma^{-1}} - \|x - \mu_i\|^2_{\Sigma^{-1}}\,\big] = \Big(x - \frac{\mu_i + \mu_t}{2}\Big)' \Sigma^{-1} (\mu_i - \mu_t) \;\ge\; \log\frac{\pi_t}{\pi_i}        (7)
for t = 1, ..., m. For a uniform prior with πi ≡ 1/m, the rules (6) and (7) both reduce to the minimum-distance rule:

\tilde{d}_i(x) := \|x - \mu_i\|^2_{\Sigma^{-1}} \;\to\; \min_{i \in \{1,...,m\}}, \quad \text{i.e., } L_{ti}(x) \ge 0 \text{ for all } t.        (8)
If, more generally, each class has its specific covariance matrix such that Πi is described by a normal density fi(x) ≙ Nk(µi, Σi) with the parameter ϑi = (µi, Σi), the Bayes rule (4) is given by:

d_i(x) := \|x - \mu_i\|^2_{\Sigma_i^{-1}} + \log|\Sigma_i| - 2 \log \pi_i \;\to\; \min_{i \in \{1,...,m\}}        (9)
and may be formulated with quadratic discriminant functions Lti(x): Decide for the class Πi if

L_{ti}(x) := \|x - \mu_t\|^2_{\Sigma_t^{-1}} - \|x - \mu_i\|^2_{\Sigma_i^{-1}} \;\ge\; \log\frac{\pi_t}{\pi_i} \quad \text{for } t = 1, ..., m.        (10)
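A hedged Python/NumPy sketch of the normal-theory rules with known parameters: it evaluates di(x) of rule (9) directly; with a common covariance matrix the log|Σi| term is constant and the rule reduces to (6). The two-class, two-dimensional parameters are hypothetical.

    import numpy as np

    def gaussian_discriminant(x, means, covs, priors):
        """Rule (9): d_i(x) = ||x - mu_i||^2_{Sigma_i^{-1}} + log|Sigma_i| - 2 log pi_i -> min.
        With a common covariance matrix the log-determinant term is constant and (6) results."""
        scores = []
        for mu, Sigma, pi in zip(means, covs, priors):
            diff = x - mu
            d = diff @ np.linalg.solve(Sigma, diff) + np.log(np.linalg.det(Sigma)) - 2 * np.log(pi)
            scores.append(d)
        return int(np.argmin(scores)) + 1

    # Hypothetical two-class example in R^2 with known parameters.
    means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
    covs = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
    priors = [0.5, 0.5]
    print(gaussian_discriminant(np.array([1.5, 0.5]), means, covs, priors))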
C5.1.2.2 Probabilistic approach with estimated parameters
In practical applications, the priors πi, the class-specific densities fi(x), and/or the class parameters ϑi are unknown or only partially known. Then the typical approach consists in (a) estimating the unknown densities or parameters from appropriate training data (learning samples) and (b) using the previously described optimum classification rules [link to C5.1.2.1] with the proviso that unknown densities or parameters are substituted by their estimates (plug-in rules). In the simplest case, the data consist of n pairs (x1, z1), ..., (xn, zn) where xj ∈ X are the data points and zj ∈ {1, ..., m} is the known class membership of data point xj (which is, e.g., provided by a 'teacher': 'learning with a teacher', 'supervised classification'). Collecting all data from the same class Πi (i.e., with zj = i) in a set Ci, we obtain m samples or training sets Ci = {xi1, ..., xini} where ni is the number of data points originating from the i-th class Πi, with n = n1 + · · · + nm. Basically, we have to distinguish between parametric and non-parametric models.
a) Parametric models:
In the case of a parametric density model fi(x) = f(x; ϑi), the unknown parameters ϑi can be replaced, e.g., by their m.l. estimates ϑ̂i obtained from the training data in C1, ..., Cm. This yields the plug-in version φ(n) of the m.l. classifier (5):

\hat{f}_i(x) := f(x; \hat{\vartheta}_i) \;\to\; \max_{i \in \{1,...,m\}}.        (11)
For a normal distribution model with fi ≙ Nk(µi, Σi) or fi ≙ Nk(µi, Σ) with unknown µi, Σi and Σ, the estimates are given by the class centroids µ̂i = x̄i· := (1/ni) Σ_{j=1}^{ni} xij and the empirical covariance matrices Σ̂i = (1/ni) Σ_{j=1}^{ni} (xij − x̄i·)(xij − x̄i·)′, or the pooled estimate Σ̂ = (1/n) Σ_{i=1}^{m} ni Σ̂i, respectively, and lead to minimum-distance rules of the type:

\hat{d}_i(x) := \|x - \bar{x}_{i\cdot}\|^2_{\hat{\Sigma}_i^{-1}} \;\to\; \min_i \qquad \text{or} \qquad \hat{d}_i(x) := \|x - \bar{x}_{i\cdot}\|^2_{\hat{\Sigma}^{-1}} \;\to\; \min_i        (12)

with quadratic and linear discrimination functions, respectively.
Plug-in versions of Bayesian rules with unknown priors πi require a sampling scheme which allows for the estimation of the parameters πi as well (which is not possible when fixing the sizes n1, ..., nm of the training samples Ci beforehand). This can be attained by sampling the n training objects randomly from the entire population Π = Π1 + · · · + Πm such that πi is the probability of membership in Πi (this is the probability model in [link to C5.1.2.1]; mixture sampling). Then, if Ni is the random number of objects sampled from Πi (with N1 + · · · + Nm = n and a joint multinomial distribution for N1, ..., Nm), the relative frequency π̂i := Ni/n provides an unbiased and consistent estimate of πi which can be used for a Bayesian plug-in rule corresponding, e.g., to (4): π̂i f(x; ϑ̂i) → maxi.
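The following Python/NumPy sketch illustrates the plug-in idea under mixture sampling: class centroids, a pooled covariance matrix and relative class frequencies are estimated from a hypothetical labelled training sample and inserted into a minimum-distance rule of type (6)/(12). The simulated data and all function names are assumptions, not part of the chapter.

    import numpy as np

    def fit_plugin_lda(X, z, m):
        """Estimate class centroids, a pooled covariance matrix, and priors from labelled
        training data (mixture sampling), as needed for the plug-in rules (11)/(12)."""
        n, k = X.shape
        means = np.array([X[z == i].mean(axis=0) for i in range(m)])
        pooled = np.zeros((k, k))
        for i in range(m):
            D = X[z == i] - means[i]
            pooled += D.T @ D                 # n_i * Sigma_hat_i
        pooled /= n                           # Sigma_hat = (1/n) sum_i n_i Sigma_hat_i
        priors = np.array([(z == i).mean() for i in range(m)])
        return means, pooled, priors

    def classify(x, means, pooled, priors):
        """Minimum Mahalanobis distance to the centroids, corrected by -2 log(prior) as in (6)."""
        d = [(x - mu) @ np.linalg.solve(pooled, x - mu) - 2 * np.log(p)
             for mu, p in zip(means, priors)]
        return int(np.argmin(d))

    # Hypothetical training sample with known class memberships z in {0, ..., m-1}.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([0, 0], 1.0, (40, 2)), rng.normal([3, 1], 1.0, (60, 2))])
    z = np.array([0] * 40 + [1] * 60)
    means, pooled, priors = fit_plugin_lda(X, z, 2)
    print(classify(np.array([2.0, 0.5]), means, pooled, priors))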
b) Nonparametric models:
In cases where a parametric distribution model is inappropriate (e.g., since the boundaries between the acceptance regions A1, ..., Am for the classes are expected to be nonlinear or irregular), a nonparametric density estimate will be used, e.g., the Parzen or kernel density estimator given by:

\hat{f}_i(x) := \frac{1}{n_i h^k} \sum_{j=1}^{n_i} K\Big(\frac{x - x_{ij}}{h}\Big), \qquad x \in \mathbb{R}^k.        (13)
Here the kernel function K(·) is typically a distribution density such as K(x) = (2π)^{−k/2} exp{−||x||²/2} (other options: the uniform density on the unit cube or the unit ball, the Epanechnikov kernel, etc.). It is common experience that the performance of a density-based classifier depends not so much on the choice of the kernel function, but mainly on the choice of the bandwidth h > 0, which specifies the neighbourhood of data points and can also be chosen in a data-dependent way. For a large dimension k, a primary requirement is a sufficiently large number ni of samples for each class ('curse of dimensionality'). In order to attain consistent rules when ni → ∞, the bandwidth h must approach 0, but sufficiently slowly such that ni h^k → ∞.
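A minimal Python/NumPy sketch of a kernel (Parzen) classifier: the Gaussian-kernel estimate (13) is computed for each class and plugged into the m.l. rule (5). The training sets and the bandwidth h are hypothetical; in practice h would be tuned, e.g., by cross-validation.

    import numpy as np

    def kernel_density(x, sample, h):
        """Parzen/kernel estimate (13) with a Gaussian kernel: average of K((x - x_ij)/h) / h^k."""
        k = sample.shape[1]
        u = (x - sample) / h                                       # shape (n_i, k)
        K = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (k / 2)
        return K.mean() / h ** k

    def kernel_classify(x, class_samples, h):
        """Plug the m kernel estimates into the maximum-likelihood rule (5)."""
        fhat = [kernel_density(x, C, h) for C in class_samples]
        return int(np.argmax(fhat)) + 1

    # Hypothetical training sets C_1, C_2 and a bandwidth h.
    rng = np.random.default_rng(1)
    C1 = rng.normal(0.0, 1.0, (50, 2))
    C2 = rng.normal(2.0, 1.0, (50, 2))
    print(kernel_classify(np.array([1.2, 0.8]), [C1, C2], h=0.5))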
c) Nearest-neighbour discrimination, k-nearest-neighbour classifier
The minimum-distance classifiers (12) can be generalized in many ways in order to comply with geometrical intuition or with non-quantitative data (qualitative or mixed data, symbolic data, etc.). Common methods are provided by the nearest-neighbour classification rule (NN classifier) and the k-nearest-neighbour (k-NN) classifier.
Both methods start from a measure d(x, y) for the dissimilarity or distance between two elements x, y of the sample space X (e.g., the Euclidean or Mahalanobis distance). Let d(x, Ci) := min_{y∈Ci} d(x, y) denote the minimum distance between a data point x and the i-th training sample Ci (i = 1, ..., m). Then the inverse 1/d(x, Ci) is a measure of the density of points from Ci in the neighbourhood of x and thus an 'estimate' of fi(x). With this interpretation, the NN classification rule d(x, Ci) → mini is a plug-in version of the m.l. classifier (5): the NN rule assigns a data point x to the class Πi with minimum distance d(x, Ci) (i.e., to the class of the nearest neighbour of x among all training data).
A generalized version considers, for a fixed integer k (typically 1 ≤ k ≤ 10), the set S(k)(x) which contains the k nearest neighbours of x within the total set X(n) = C1 + · · · + Cm of all n data. Denote by ki(x) = |Ci ∩ S(k)(x)| the number of data points from the learning sample Ci which are among those k nearest neighbours of x, such that ki(x)/n might be considered as an estimate for fi(x). Then the k-NN classifier assigns a new data point x to the class Πi which contains the maximum number of the k nearest neighbours in the training samples, i.e., with i := argmax_{j=1,...,m} {kj(x)}. The theory of minimum-distance and k-NN classifiers is surveyed, e.g., in Devroye et al. (1997), where consistency and error bounds are derived (e.g., for n → ∞ with k = kn → ∞).
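A short Python/NumPy sketch of the k-NN rule with Euclidean distance on hypothetical training data; setting k = 1 recovers the NN rule d(x, Ci) → mini.

    import numpy as np

    def knn_classify(x, X_train, z_train, k=5):
        """k-NN rule: count how many of the k nearest training points fall into each class
        (the numbers k_i(x)) and assign x to the class with the largest count."""
        dist = np.linalg.norm(X_train - x, axis=1)        # Euclidean distances d(x, x_j)
        nearest = np.argsort(dist)[:k]                    # indices of the k nearest neighbours
        labels, counts = np.unique(z_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]

    # Hypothetical labelled training data; k = 1 gives the NN rule.
    rng = np.random.default_rng(2)
    X_train = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
    z_train = np.array([1] * 30 + [2] * 30)
    print(knn_classify(np.array([1.4, 1.1]), X_train, z_train, k=5))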
d) Mixture models
A special plug-in rule derives from the probabilistic model described in Section C5.1.2.1 if we consider the marginal density f(x; π, θ) = Σ_{i=1}^{m} πi fi(x) = Σ_{i=1}^{m} πi f(x; ϑi) (mixture density) of an observation X = (X(1), ..., X(k))′ (i.e., without considering its class membership). The unknown parameter vector θ := (ϑ1, ..., ϑm) and the prior π = (π1, ..., πm) are estimated from a training sample {x1, ..., xn} (without class memberships) by the maximum likelihood method, i.e., by minimizing the negative log-likelihood:

-\mathrm{loglik}(\pi, \theta) := \sum_{j=1}^{n} -\log\Big(\sum_{i=1}^{m} \pi_i f(x_j; \vartheta_i)\Big) \;\to\; \min_{\pi, \theta}.        (14)

Optimum parameter values are found by iterative numerical methods such as the EM or SEM algorithm (SEM stands for Stochastic Expectation Maximization; see McLachlan and Krishnan 1997). The resulting estimates π̂, ϑ̂i are used for obtaining plug-in rules as previously described.
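A compact Python/NumPy sketch of the EM iteration for a univariate normal mixture, i.e., for minimizing (14): it alternates posterior 'responsibilities' (E-step) and weighted maximum-likelihood updates (M-step). The univariate setting, the initialization and all numbers are simplifying assumptions.

    import numpy as np

    def em_gmm_1d(x, m=2, n_iter=100, seed=0):
        """EM for a univariate normal mixture: alternate posterior responsibilities (E-step)
        and weighted ML updates of priors, means and variances (M-step), which decreases
        the negative log-likelihood (14)."""
        rng = np.random.default_rng(seed)
        pi = np.full(m, 1.0 / m)
        mu = rng.choice(x, m, replace=False)
        var = np.full(m, x.var())
        for _ in range(n_iter):
            # E-step: r[j, i] = pi_i f(x_j; theta_i) / f(x_j)
            dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
            r = pi * dens
            r /= r.sum(axis=1, keepdims=True)
            # M-step: weighted estimates
            nk = r.sum(axis=0)
            pi = nk / len(x)
            mu = (r * x[:, None]).sum(axis=0) / nk
            var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        return pi, mu, var

    # Hypothetical unlabelled sample drawn from two overlapping normal classes.
    rng = np.random.default_rng(3)
    x = np.concatenate([rng.normal(0, 1, 200), rng.normal(3, 1, 300)])
    print(em_gmm_1d(x))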
e) Neural networks
Neural networks are often used for classification purposes, e.g., in pattern recognition, credit scoring, robot control, etc. Relevant types of neural networks can be seen as devices or algorithms for approximating an unknown function y = g(x) by a semi-parametric ansatz function ŷ = ĝ(x; w) with an unknown (and typically high-dimensional) parameter vector w (usually termed a 'weight' vector), e.g., in the form of a radial basis function or a multilayer network. The optimum approximation to g(·) is found by observing data points of the form yj = g(xj) or yj = g(xj) + Uj (with a random error Uj) and minimizing the deviation between the data y1, ..., yn and their 'predictions' ĝ(x1; w), ..., ĝ(xn; w) with respect to w (various deviation measures can be used). This process is often performed in a recursive way such that x1, x2, ... are observed sequentially and, after observing xn+1, the previous estimate w(n) for w is suitably updated (sequential learning).
Neural network classifiers result if such a procedure is applied to a classification problem where y denotes, e.g., the observed class membership and ĝ(x; w) is a discrimination rule. Corresponding methods use a neural network in order to estimate the unknown class densities g(x) = fi(x) or the posterior probabilities g(x) = pi(x) of the classes, and then use the estimates f̂i(x) or p̂i(x) in the formulas for the classical (Bayes, maximum-likelihood, minimum-distance) classifiers such as (4) or (5). For details see [link to C5.1.8].
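Purely as an illustration of the sequential-learning idea sketched above (and not of the network architectures treated in [link to C5.1.8]), the following hypothetical Python/NumPy sketch updates the weight vector w of a single sigmoid unit after each observation by a gradient step on the squared deviation; the unit then approximates the posterior probability of class 1.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def sequential_fit(X, y, lr=0.1, epochs=20, seed=0):
        """Sequential (online) learning sketch: after each observation (x_j, y_j) the weight
        vector w is updated by a gradient step on the squared deviation (y_j - g_hat(x_j; w))^2,
        where g_hat is a single sigmoid unit approximating the posterior probability of class 1."""
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1] + 1)                               # weights plus a bias term
        for _ in range(epochs):
            for j in rng.permutation(len(X)):
                xj = np.append(X[j], 1.0)
                pred = sigmoid(w @ xj)
                w += lr * (y[j] - pred) * pred * (1 - pred) * xj   # gradient step on the squared error
        return w

    # Hypothetical two-class data; thresholding sigmoid(w @ (x, 1)) at 1/2 gives the classifier.
    rng = np.random.default_rng(7)
    X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(2, 1, (40, 2))])
    y = np.array([0.0] * 40 + [1.0] * 40)
    w = sequential_fit(X, y)
    print(sigmoid(w @ np.append(np.array([1.5, 1.0]), 1.0)))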
C5.1.2.3 Estimating the error probabilities of a classification rule
The performance of a (fixed, non-randomized) classification rule φ with acceptance regions A1, ..., Am ⊂ Rk is characterized by
• the m true recovery rates αii(φ) := ∫ φi(x) fi(x) dx = ∫_{Ai} fi(x) dx, and
• the m(m − 1) true error probabilities αti(φ) := ∫ φi(x) ft(x) dx = ∫_{Ai} ft(x) dx (for t ≠ i).
If the densities fi(x) or the parameters ϑi in f(x; ϑi) are estimated from a training sample X(n) = {xij | i = 1, ..., m, j = 1, ..., ni} of size n, these formulas will depend on n and on the data set X(n). A similar remark holds in the case of a data-dependent classifier φ(·) = φ(n)(·) = φ(n)(· ; X(n)) with data-dependent acceptance regions Ai(n) (e.g., a plug-in rule). As a consequence, we must distinguish among various conceptually different specifications of an 'error or recovery probability' and also among various different error estimates:
1. The plug-in estimate α̂ti(φ) of the probability αti(φ), for a fixed known classifier φ:

\hat{\alpha}_{ti}(\phi) = \int \phi_i(x)\, \hat{f}_t(x)\, dx = \int_{A_i} \hat{f}_t(x)\, dx.        (15)
2. The actual (true) error/recovery rate of a data-dependent classifier φ(n):

\alpha_{ti}(\phi^{(n)}) = \int \phi_i^{(n)}(x)\, f_t(x)\, dx = \int_{A_i^{(n)}} f_t(x)\, dx        (16)
3. The estimated error/recovery rate of φ(n), obtained by substituting an estimated density (e.g., with estimated parameters):

\hat{\alpha}_{ti}(\phi^{(n)}) = \int \phi_i^{(n)}(x)\, \hat{f}_t(x)\, dx = \int_{A_i^{(n)}} \hat{f}_t(x)\, dx        (17)
4. The apparent error/recovery rate of φ(n) (also termed the resubstitution estimate):

\alpha_{ti,\mathrm{app}}(\phi^{(n)}) = U_{ti}/n_t, \qquad \text{where } U_{ti} = \sum_{j=1}^{n_t} \phi_i^{(n)}(x_{tj}) = \sum_{j=1}^{n_t} I_{A_i^{(n)}}(x_{tj})        (18)

is the number of data points from the t-th training sample Ct which are assigned to the class Πi by the classifier φ(n) (the matrix (Uti) is the confusion matrix); a sketch of this computation is given after this list. Unfortunately, this estimator is much too optimistic: for example, the apparent recovery rate αii,app(φ(n)) is typically considerably larger than the true value αii(φ(n)) to be estimated, since it is computed from the same data X(n) which have been used to tune the classifier φ(n).
5. The expected error/recovery probability ET[α̂ti]:
For any of the data-dependent estimates α̂ti given before, ET[α̂ti] is defined as the expected value of α̂ti with respect to all training samples X(n) (e.g., under the mixture model). This is a fixed (i.e., non-data-dependent) probability which characterizes the overall quality of the entire classification process, including the 'learning' of the (plug-in) classifier φ(n) from the fluctuating data.
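As announced in item 4, here is a small Python/NumPy sketch of the resubstitution computation (18): the trained classifier is applied to its own training data, the confusion matrix (Uti) is formed, and row-normalizing gives the apparent recovery/error rates. The threshold rule and the simulated data are hypothetical.

    import numpy as np

    def confusion_and_apparent_rates(classify, X, z, m):
        """Resubstitution (18): apply the trained classifier to its own training data, form the
        confusion matrix U[t, i], and row-normalize to obtain the apparent rates alpha_{ti,app};
        these are typically optimistic estimates of the true rates (16)."""
        U = np.zeros((m, m), dtype=int)
        for x, t in zip(X, z):
            U[t, classify(x)] += 1
        rates = U / U.sum(axis=1, keepdims=True)       # row t: apparent alpha_ti = U_ti / n_t
        return U, rates

    # Hypothetical illustration: a fixed one-dimensional threshold rule on simulated data.
    rng = np.random.default_rng(4)
    X = np.concatenate([rng.normal(0, 1, 50), rng.normal(2, 1, 50)])
    z = np.array([0] * 50 + [1] * 50)
    U, rates = confusion_and_apparent_rates(lambda x: int(x > 1.0), X, z, 2)
    print(U)
    print(rates)      # diagonal: apparent recovery rates; off-diagonal: apparent error rates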
In practice, these definitions are often confounded or uncritically used.
Considering a Bayesian classifier φ∗ and a corresponding plug-in rule φ(n), it is expected that for an increasing sample size n → ∞, and when using consistent parameter estimates ϑ̂i(n) or f̂i(x), the true and estimated error/recovery rates αti(φ∗), αti(φ(n)), α̂ti(φ(n)), etc. should all be close to each other. In this context, there exists a range of convergence theorems, e.g., for α̂ti(φ(n)) → αti(φ∗), and finite-sample inequalities for the (maximum) deviation between true and estimated error/recovery probabilities. Similar results hold for the empirical Bayesian risk r(φ(n), π), which converges to the minimum risk r∗ = r(φ∗, π) of the Bayesian rule φ∗. These topics are investigated in the context of computational learning theory, which also yields bounds for the risk difference r(φ(n), π) − r∗ formulated in terms of the so-called Vapnik-Chervonenkis dimension (Devroye et al. 1997).
The quality of a classification rule should not be evaluated by considering error rates exclusively (which can be large even for an optimum classifier if the underlying populations Πi are not well separable). Other properties of a classifier, such as its generalization ability, the ease of application, its stability, or its robustness (against departures from the underlying model; Kharin 1996), will sometimes be equally or even more important when selecting an 'appropriate' decision rule (see Hand 1997).
Test samples and cross-validation methods
A commonly used method for obtaining an unbiased, consistent estimate of the actual error/recovery probabilities αti(φ(n)) of a classifier φ(n) (obtained from the training data in X(n)) proceeds as follows: we observe, in addition to the training data, from each population Πi a new test sample Ti = {yi1, ..., yimi} which is independent of X(n), and calculate the relative error/recovery frequencies inside Tt:

\hat{\alpha}_{ti}(\phi^{(n)}) = (1/m_t) \sum_{j=1}^{m_t} \phi_i^{(n)}(y_{tj}) = (1/m_t) \sum_{j=1}^{m_t} I_{A_i^{(n)}}(y_{tj})        (19)
for t = 1, ..., m. This approach is typically realized by randomly splitting the original n data points into a training set (of size 2n/3, say, with m training classes Ci) which yields the classifier, and a test set (the remaining n/3 data points with test samples Ti) which is used for evaluation afterwards. Since this splitting process uses up (and thereby 'spoils') a lot of data, some more refined and economical tools have been designed under the heading of cross-validation methods, where a single element or a few elements are singled out in turn.
A common approach is the jackknife or leave-one-out method for evaluating a data-dependent (plug-in) classifier φ(n) (e.g., a plug-in version of the Bayesian rule φ∗): Given n training data xj with known class memberships zj ∈ {1, ..., m} (for j = 1, ..., n), we eliminate in turn each data point xj, build the decision rule φ(n−1,j) from the remaining n − 1 data points (in the same way as φ(n) has been constructed from the entire data set) and classify the j-th point using φ(n−1,j). Denote by dj the index of the class obtained for xj. Then we estimate the true probabilities αti(φ∗), αti(φ(n)), or ET[αti(φ(n))] by the relative frequency Ũti/nt, where Ũti := #{j ∈ {1, ..., n} | zj = t, dj = i} is the number of data points xj from Πt which were classified by the rule φ(n−1,j) into the class Πi. Similarly, α̂ := Σ_{i=1}^{m} Ũii/n provides an estimate of the overall recovery rate α(φ(n)). Theoretical results as well as simulations show that this method yields quite precise estimates, even for moderate sample sizes.
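A minimal leave-one-out sketch in Python/NumPy: each point is held out in turn, the rule is rebuilt from the remaining n − 1 points, and the held-out point is classified; the fraction of misclassifications estimates the overall error rate 1 − α̂. The 1-NN 'fit'/'predict' pair used for illustration is an assumption.

    import numpy as np

    def leave_one_out_error(fit, predict, X, z):
        """Leave-one-out (jackknife): refit the rule on the remaining n - 1 points, classify the
        held-out point, and count misclassifications; returns the estimated overall error rate."""
        n = len(X)
        wrong = 0
        for j in range(n):
            mask = np.arange(n) != j
            model = fit(X[mask], z[mask])                    # build phi^(n-1, j)
            wrong += int(predict(model, X[j]) != z[j])
        return wrong / n

    # Hypothetical usage with a 1-NN rule: 'fit' stores the data, 'predict' finds the nearest point.
    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
    z = np.array([0] * 30 + [1] * 30)
    fit = lambda X, z: (X, z)
    predict = lambda model, x: model[1][np.argmin(np.linalg.norm(model[0] - x, axis=1))]
    print(leave_one_out_error(fit, predict, X, z))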
The bootstrapping method:
This method is commonly used for estimating the error rates αti(φ(n)) in (16) of a (plug-in) rule φ(n). Basically, it works by replacing in (16) the unknown distribution (density fi(·)) of X by the empirical distribution of data points from Πi, e.g., taken from the training set Ci. Thus, for given sample sizes m1, ..., mm (often mi ≡ ni), the method repeatedly draws (N times, say) a random subsample of size mi from each training set Ci and iterates the following steps (1) to (3) for ν = 1, ..., N:
(1) Sample mi data points (with replacement) from Ci, obtaining a bootstrap data set Ci[ν] (typically with repetitions) for each i = 1, ..., m.
(2) Using the data in C1[ν], ..., Cm[ν], construct the corresponding plug-in rule φ[ν] (in the same way as φ(n) is constructed from C1, ..., Cm).
(3) Let Uti[ν] and Ũti[ν] denote the number of data points from Ct and Ct[ν], respectively, which were assigned to the class Πi by φ[ν]. Calculate Dti[ν] := Uti[ν]/nt − Ũti[ν]/mt, i.e., the difference of the corresponding resubstitution estimates.
Finally, the bootstrap estimator for αti(φ(n)) is given by:

\hat{\alpha}_{ti,\mathrm{boot}} := \alpha_{ti,\mathrm{app}}(\phi^{(n)}) + (1/N) \sum_{\nu=1}^{N} D_{ti}^{[\nu]}        (20)

where αti,app(φ(n)) = Uti/nt is the resubstitution estimator for φ(n). This estimator yields quite accurate estimates of the actual error probability αti(φ(n)), even for close classes and small class sizes ni.
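A hedged Python/NumPy sketch of the bias-corrected bootstrap estimate (20), aggregated to the overall error rate rather than the full (t, i) matrix: the apparent error of φ(n) is corrected by the average difference D[ν] between each bootstrap rule's error on the original training data and on its own bootstrap sample. The 1-NN fit/predict interface and the simulated data are hypothetical.

    import numpy as np

    def bootstrap_error(fit, predict, X, z, m, N=100, seed=0):
        """Bootstrap estimate in the spirit of (20), aggregated to the overall error rate:
        apparent error of the rule built on the full data, plus the average difference D[nu]
        between each bootstrap rule's error on the original data and on its bootstrap sample."""
        rng = np.random.default_rng(seed)

        def apparent_error(model, Xe, ze):
            return float(np.mean([predict(model, x) != t for x, t in zip(Xe, ze)]))

        app = apparent_error(fit(X, z), X, z)
        D = []
        for _ in range(N):
            idx = np.concatenate([rng.choice(np.flatnonzero(z == i), size=(z == i).sum())
                                  for i in range(m)])         # resample with replacement within each class C_i
            model = fit(X[idx], z[idx])
            D.append(apparent_error(model, X, z) - apparent_error(model, X[idx], z[idx]))
        return app + float(np.mean(D))

    # Hypothetical usage with a 1-NN rule on simulated two-class data.
    rng = np.random.default_rng(6)
    X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
    z = np.array([0] * 30 + [1] * 30)
    fit = lambda X, z: (X, z)
    predict = lambda model, x: model[1][np.argmin(np.linalg.norm(model[0] - x, axis=1))]
    print(bootstrap_error(fit, predict, X, z, m=2, N=50))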
C5.1.2.4 Fisher’s classical linear discriminant analysis
A geometrical point of view underlies the classical linear discriminant theory developed by R.A. Fisher. This approach can be interpreted as searching for an s-dimensional hyperplane H in Rk such that the known classification C1, ..., Cm of the training data xij is best reproduced by the n projection points yij = πH(xij) of xij onto H, in the sense that the variance between the classes, Σ_{i=1}^{m} ni ||ȳi· − ȳ||², is maximized by H. It appears that the optimum hyperplane is spanned by the s first eigenvectors v1, ..., vs ∈ Rk of the between-class covariance matrix B := Σ_{i=1}^{m} ni (x̄i· − x̄)(x̄i· − x̄)′ and that, insofar, the discrimination process may be based on the n reduced, s-dimensional feature vectors zij := (v1′xij, ..., vs′xij)′ (with j = 1, ..., ni; i = 1, ..., m). For s = 2 dimensions these vectors can easily be shown on the screen, and the separating boundaries Lti(z) = 0 of the projected classes (for the minimum-distance classifier (12)) can be displayed simultaneously in R2.
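A Python/NumPy sketch of the projection step as described above: the between-class matrix B is formed from hypothetical labelled data, its s leading eigenvectors are extracted, and the reduced feature vectors zij are returned (e.g., for a two-dimensional display).

    import numpy as np

    def fisher_projection(X, z, m, s=2):
        """Project the data onto the s leading eigenvectors of the between-class matrix
        B = sum_i n_i (xbar_i - xbar)(xbar_i - xbar)', as described above; the rows of the
        result are the reduced feature vectors z_ij = (v_1'x_ij, ..., v_s'x_ij)'."""
        xbar = X.mean(axis=0)
        B = np.zeros((X.shape[1], X.shape[1]))
        for i in range(m):
            Xi = X[z == i]
            d = Xi.mean(axis=0) - xbar
            B += len(Xi) * np.outer(d, d)
        eigval, eigvec = np.linalg.eigh(B)                 # eigen-decomposition of the symmetric B
        V = eigvec[:, np.argsort(eigval)[::-1][:s]]        # s leading eigenvectors v_1, ..., v_s
        return X @ V

    # Hypothetical three-class sample in R^4, reduced to s = 2 dimensions for plotting.
    rng = np.random.default_rng(8)
    X = np.vstack([rng.normal(mu, 1.0, (30, 4)) for mu in ([0, 0, 0, 0], [3, 0, 1, 0], [0, 3, 0, 1])])
    z = np.repeat([0, 1, 2], 30)
    print(fisher_projection(X, z, m=3).shape)     # (90, 2)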
C5.1.2.5 Selection and preprocessing of variables
A major step in classification and pattern recognition concerns the specification of variables which are able to distinguish the underlying or conjectured classes Π1, ..., Πm. In data mining, this problem is compounded by the fact that most databases in enterprises or marketing institutions typically store so much information and detail about the underlying subjects (customers, products, bank transfers) that it makes no sense to use all these variables for the classification process (for technical as well as economic reasons). Therefore, any classification process is usually preceded by the selection of a sufficiently small number of (hopefully) 'informative', 'discriminating' or 'predictive' variables.
There exist various statistical methods for selecting or transforming variables: principal components (identical or related to the Karhunen-Loève expansion), Fisher's projection method described in [link to C5.1.2.4], projection pursuit methods, etc. These methods typically yield a small set of, say, s linear combinations of the k original variables X(1), ..., X(k). In contrast, the decision tree approach constructs a highly non-linear classification rule and acceptance regions Ai in a recursive, 'monothetic' way: first by dissecting the entire set of training samples on the basis of a single, optimally selected variable (i.e., such that the m classes are best separated in the training set), and then by repeating this dissection process for each of the resulting sub-samples (until a stopping criterion precludes further splitting). As a result we obtain a decision tree where the acceptance regions Ai result from a recursive combination of rules relating to single variables only. Further details are described in [link to C5.1.3] and [link to C5.1.4].
Another type of method concentrates on the selection of a suitable subset of variables S ⊂ {1, ..., k} of a given size s = |S| from the original k variables (with 2 ≤ s < k) and then uses these variables, instead of the original ones, in the previously described Bayesian, maximum likelihood and NN classifiers. This classical model selection approach typically works with information measures (Akaike, Schwarz, ICOMP) which are to be optimized. A more recent approach looks for a selection S which minimizes (an estimate of) the total error probability of the corresponding (optimum or plug-in) classifier built from the s selected variables. Both approaches proceed by successively eliminating a single variable from the entire set of all k variables (backward method), or by successively adding one more variable to an initial choice of one variable (forward method). The method is locally optimum insofar as, in each step, it eliminates (includes) the variable which leads to the smallest (estimated) total error probability of the corresponding (optimum, Bayes, plug-in, or empirical) decision rule, until the given dimension s is attained or the error becomes too large.
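A Python/NumPy sketch of the greedy forward variant: starting from the empty set, the variable whose inclusion minimizes a supplied error estimate is added until s variables are chosen. The error-estimate callback (and the toy criterion in the example) are hypothetical placeholders for a cross-validated error of the chosen classifier.

    import numpy as np

    def forward_selection(estimate_error, k, s):
        """Greedy forward selection: starting from the empty set, repeatedly add the variable
        whose inclusion gives the smallest (estimated) total error probability, until s
        variables are chosen. `estimate_error(subset)` is any error estimate for the
        classifier built from that subset (e.g., a cross-validated error rate)."""
        selected = []
        while len(selected) < s:
            candidates = [j for j in range(k) if j not in selected]
            errors = [estimate_error(selected + [j]) for j in candidates]
            selected.append(candidates[int(np.argmin(errors))])
        return selected

    # Hypothetical toy criterion: pretend variables 2 and 5 are the informative ones.
    toy_error = lambda S: 1.0 - 0.4 * (2 in S) - 0.3 * (5 in S) + 0.01 * len(S)
    print(forward_selection(toy_error, k=8, s=3))

The backward variant works analogously, starting from the full set {1, ..., k} and removing the variable whose elimination increases the estimated error the least.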
Finally, we point to the approaches developed under the headings of 'support vector machines' and the 'potential function method'. In this approach, the acceptance regions Ai for the underlying classes Πi are defined by linear boundaries in Rk which are computationally optimized in order to attain small error rates for the training samples. Since, however, many practical applications require the consideration of non-linear class boundaries, this approach is combined with a suitable non-linear transformation ψ(x) of the original data vector x: the linear support vector classifier is then formulated in terms of the transformed data ψ(x) (instead of using non-linear boundaries for the original data x). This approach is also investigated in the framework of 'computational learning theory'; a good reference is provided by Cristianini and Shawe-Taylor (2000).
Classical sources on classification, discrimination and pattern recognition are
Young and Calvert (1974), Lachenbruch (1975), Krishnaiah & Kanal (1982),
Fukunaga (1990), Goldstein & Dillon (1978), Niemann (1990) and Hand (1986).
Modern viewpoints are considered, e.g., in Breiman et al. (1984), McLachlan
(1992), Ripley (1996), Kharin (1996), Devroye et al. (1997), Bock and Diday
(1999). Nonparametric density estimation is presented in Tapia and Thompson
(1978), Devroye (1987) and Silverman (1986).
References:
Bandemer, H., Gottwald, S. 1995. Fuzzy sets, fuzzy logic, fuzzy methods with
application. Wiley, Chichester.
Bock, H.H., Diday, E. 1999. Analysis of symbolic data. Exploratory methods
for extracting information from complex data. Springer Verlag, Heidelberg –
Berlin. Presents classification and data analysis methods for set-valued and
probabilistic data.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, Ch. J. 1984. Classification
and regression trees. Wadsworth, Belmont. Describes rule-based classification
methods (decision trees).
Cristianini, N., Shawe-Taylor, J. 2000. An introduction to support vector machines. Cambridge University Press, Cambridge.
Devroye, L. 1987. A course on density estimation. Birkhäuser, Boston – Basel.
Devroye, L., Györfi, L. and Lugosi, G. 1997. A probabilistic theory of pattern
recognition. Springer, New York. An excellent and comprehensive survey on
classical and recent results in pattern recognition, classification, computational
learning theory and neural networks.
Fukunaga, K. 1990. Introduction to statistical pattern recognition. Academic
Press, New York. An engineering-oriented presentation.
Goldstein, M. and Dillon, W.R. 1978. Discrete discriminant analysis. Wiley,
New York. Concentrates on categorical data.
Hand, D.J. 1986. Discrimination and classification. Wiley, New York.
Hand, D.J. 1997. Construction and assessment of classification rules. Wiley,
New York. A survey on discrimination methods and the evaluation of their
performance with applications and practical comments.
Kharin, Y. 1996. Robustness in statistical pattern recognition. Kluwer Academic Publishers, Dordrecht. Concentrates on robustness studies for discrimination and clustering methods.
Krishnaiah, P.R. and Kanal, L.N. (eds.). 1982. Classification, pattern recognition and reduction of dimensionality. Handbook of Statistics, vol. 2. North
Holland/Elsevier, Amsterdam. With survey articles on many special topics in
classification.
Lachenbruch, P.A. 1975. Discriminant analysis. Hafner/Macmillan, London. A classical reference with an emphasis on normal theory discrimination.
McLachlan, G.J. 1992. Discriminant analysis and statistical pattern recognition. Wiley, New York. An invaluable source for all topics of discriminant analysis and its applications, with an emphasis on the statistical background.
McLachlan, G.J., Krishnan, Th. 1997. The EM algorithm and extensions. Wiley, New York.
Läuter, J. 1992. Stabile multivariate Verfahren. Diskriminanzanalyse, Regressionsanalyse, Faktoranalyse. Akademie Verlag, Berlin. Discusses error probabilities as a function of n and k; contains discrimination methods for badly behaved data, e.g., using penalty functions and ridge methods.
Niemann, H. 1990. Pattern analysis and understanding. Springer Verlag,
Berlin. The new version of a classical survey on pattern recognition and its
applications in technology and knowledge processing.
Ripley, B.D. 1996. Pattern recognition and neural networks. Cambridge University Press, Cambridge, UK. A very appealing survey on the statistical bases
of neural network classification methods.
Silverman, B.W. 1986. Density estimation for statistics and data analysis.
Chapman and Hall, London – New York.
Tapia, R.A., Thompson, J.R. 1978. Nonparametric probability density estimation. Johns Hopkins Univ. Press, Baltimore.
Young, T.Y. and Calvert, T.W. 1974. Classification, estimation and pattern recognition. American Elsevier, New York. A classical standard reference.