Published as Chapter 16.1.2 in: W. Kloesgen, J. Zytkow (eds.): Handbook of Data Mining and Knowledge Discovery. Oxford University Press, 2002, 258-267.

C5.1.2: Classification methodology
H.H. Bock (Version 23/12/99)

Summary: This section surveys a range of classification (discrimination) methods that are based on probabilistic or statistical models, such as Bayes methods, maximum likelihood, nearest-neighbour classifiers, non-parametric kernel density methods, and plug-in rules. Additionally, we point to various algorithmic approaches to classification, such as neural networks, support vector machines and decision trees, which are, however, fully discussed in subsequent sections of this handbook. A major part is devoted to the specification and estimation of various types of recovery rates and misclassification probabilities of a (fixed or data-dependent) classifier. Finally, we describe preprocessing methods for the selection of 'most informative' variables.

Keywords: Bayesian classification, discrimination methods, nonparametric classification, nearest-neighbour classification, error probabilities, selection of variables, cross-validation, bootstrapping.

This section is devoted to the discrimination or classification problem, i.e., the problem of assigning an object O to one of m given classes (populations) Π_1, ..., Π_m on the basis of a data vector x = (x^(1), ..., x^(k))′ from a sample space X (e.g., R^k) which collects the values of k explanatory or 'predictive' variables X^(1), ..., X^(k) observed for this object O [link to Section C5.1.1]. Depending on the available information on the classes and the type of classification rule, we may distinguish the following cases:

(1) Probabilistic approach with known parameters [link to C5.1.2.1]
(2) Probabilistic approach with estimated parameters [link to C5.1.2.2]
(3) k-nearest-neighbour rules, Fisher's geometrical approach [link to C5.1.2.2.c, link to C5.1.6, link to C5.1.2.4]
(4) Neural network classification [link to C5.1.2.2.e, link to C5.1.8]
(5) AI algorithms such as decision trees [link to C5.1.3], rule-based approaches, etc.

The performance of classifiers is often measured in terms of (various types of) recovery rates and misclassification probabilities. Their specification and estimation is addressed in Section C5.1.2.3.

C5.1.2.1 Probabilistic approach with known parameters

The decision-oriented approach to classification starts from a probabilistic model and minimizes either an expected loss or a total misclassification probability. The basic model assumes:

(1) An object O is randomly sampled from a heterogeneous population Π with m subpopulations (classes) Π_1, ..., Π_m and corresponding prior probabilities (class frequencies) π_1, ..., π_m > 0 with Σ_{i=1}^m π_i = 1. Thus π_i is the probability that the object O is actually sampled from the class Π_i.

(2) The observable feature vector X = (X^(1), ..., X^(k))′ for O is a random vector with values in a space X (e.g., X = R^k) and with a class-specific distribution density f_i(x), where i ∈ {1, ..., m} denotes the index of the population Π_i to which O belongs. In many cases f_i(x) is taken from a parametric density family f(x; ϑ) such that f_i(x) = f(x; ϑ_i) with a class-specific parameter ϑ = ϑ_i, e.g., from a normal distribution N_k(μ_i, Σ_i), where the unknown parameter vector ϑ_i = (μ_i, Σ_i) comprises the class mean μ_i ∈ R^k and the covariance matrix Σ_i of X (for i = 1, ..., m).
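The two-stage sampling model (1)-(2) above can be sketched in a few lines of code. This is only an illustration: the priors, the one-dimensional normal class densities, and all numerical values below are assumed for the example and do not come from the text.

```python
import random

# Illustrative two-stage model: first draw the class index i with prior
# probabilities pi_i, then draw the feature x from the class-specific
# density f_i -- here 1-D normals N(mu_i, sigma_i^2) (assumed values).
PRIORS = [0.5, 0.3, 0.2]                        # pi_1, ..., pi_m, sum to 1
PARAMS = [(0.0, 1.0), (3.0, 1.0), (6.0, 2.0)]   # (mu_i, sigma_i) per class

def sample_object(rng: random.Random) -> tuple[int, float]:
    """Return (class index i in 1..m, observed feature value x)."""
    u, cum = rng.random(), 0.0
    for i, pi in enumerate(PRIORS, start=1):
        cum += pi
        if u < cum:
            mu, sigma = PARAMS[i - 1]
            return i, rng.gauss(mu, sigma)
    return len(PRIORS), rng.gauss(*PARAMS[-1])   # guard against rounding

rng = random.Random(0)
data = [sample_object(rng) for _ in range(1000)]
```

With many draws, the relative frequency of class i approaches its prior π_i, which is exactly the interpretation of the π_i given above.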
For discrete data, f_i(x) is the probability that X takes the value x (in the i-th class).

A (non-randomized) decision rule (classifier) is a function δ: X → {1, ..., m} which specifies, for each x ∈ X, the index δ(x) = i of the class Π_i to which an object O with data x is assigned. In contrast, a randomized decision rule φ or φ(x) = (φ_1(x), ..., φ_m(x)) specifies, for each x ∈ X and each class Π_i, a probability or plausibility φ_i(x) ≥ 0 for assigning the observation x to the i-th class Π_i (with φ_1(x) + ... + φ_m(x) = 1).

In the (unrealistic) case where all π_i, f_i(·) or ϑ_i are known, there are some well-established methods for defining 'optimum classifiers' φ which are then typically used in practice and implicitly underlie many algorithms of pattern recognition, artificial intelligence and supervised learning.

a) The Bayesian classification rule for a general loss function:
The Bayesian approach assumes that each decision is related to a specified loss (or gain): Let L_{ti} ≥ 0 be the loss incurred when assigning an object from Π_t to the class Π_i; this means a misclassification if t ≠ i and a correct classification if t = i. Then each classification rule φ = (φ_1, ..., φ_m) has an expected (average) loss, also termed the Bayesian loss of φ, which is given by:

r(φ, π) := ∫_{R^k} Σ_{i=1}^m φ_i(x) [ Σ_{t=1}^m L_{ti} π_t f_t(x) ] dx = Σ_{i=1}^m Σ_{t=1}^m L_{ti} π_t · α_{ti}(φ).    (1)

Here α_{ti}(φ) = ∫ φ_i(x) f_t(x) dx denotes the probability of assigning an object from Π_t to Π_i when using φ. For t ≠ i this is the error probability of type (t, i), whereas for t = i, α_{ii}(φ) is termed the recovery probability of type i. A decision rule φ* which minimizes the loss (1) is called a Bayesian classifier. It appears that φ* is essentially given by the rule:

h_i(x) := Σ_{t=1}^m L_{ti} π_t f_t(x)  →  min_{i∈{1,...,m}}.
(2)

Thus, an object O with observation vector x is assigned to the class Π_i with minimum value h_i(x) among h_1(x), ..., h_m(x). This is equivalent to saying that the class index i minimizes the a posteriori loss given by L_i(x) := h_i(x)/[Σ_{t=1}^m π_t f_t(x)]. Here the denominator f(x) := Σ_{t=1}^m π_t f_t(x) is the marginal density of the random data vector X; it is a mixture of the m class densities f_1, ..., f_m. Formally, we may put φ*_i(x) = 1 (whereas φ*_j(x) = 0 for all classes j ≠ i) and see that φ* is a non-randomized classifier. In particular, the set

A_i := { x ∈ R^k | h_i(x) = min{h_1(x), ..., h_m(x)} } = { x ∈ R^k | φ*_i(x) = 1 }

of all data vectors x which are assigned to the same class Π_i is called the acceptance region for Π_i. The form of these regions A_i and of their joint boundaries ∂A_i depends primarily on the densities f_1, ..., f_m. If these boundaries ∂A_i are linear, quadratic, ..., we speak of a linear, quadratic, ... classifier; the practical usefulness of a classifier depends largely on the substantive interpretability of these boundaries.

b) The Bayesian rule for a 0-1 loss function:
In the case of the two-valued 0-1 loss function, L_{ti} = 1 for t ≠ i and L_{ti} = 0 for t = i, the expected loss (1) reduces to the total error probability

r(φ, π) = Σ_{i=1}^m Σ_{t≠i} π_t α_{ti}(φ) = 1 − Σ_{i=1}^m π_i α_{ii}(φ) =: 1 − α(φ)    (3)

where α(φ) := Σ_{i=1}^m π_i α_{ii}(φ) = 1 − r(φ, π) is the total probability of a correct decision (overall recovery rate, hitting probability). Since the corresponding Bayesian classifier φ* minimizes (3), it maximizes the recovery rate α(φ). Substitution of L_{ti} into (2)
yields h_i(x) = f(x) − π_i f_i(x) → min_i; therefore φ* assigns an observed vector x ∈ X to the class Π_i with maximum value of π_i f_i(x) or, equivalently, with maximum a posteriori probability p_i(x) for the class Π_i:

p_i(x) := π_i f_i(x) / f(x)  →  max_{i∈{1,...,m}}.    (4)

The minimum attainable total error probability is then given by r(φ*, π) = 1 − α(φ*) = 1 − ∫ m(x) dx with the maximum m(x) := max_i {π_i f_i(x)}. Note that the m posterior probabilities (p_1(x), ..., p_m(x)) define a fuzzy classification for each data point x: in this interpretation p_i(x) is the degree of membership of x in the i-th class, and the 'fuzzy class' i is characterized by the function p_i(x) from X to the unit interval [0, 1] (see, e.g., Bandemer and Gottwald 1995).

c) The Bayesian rule for uniform priors: the maximum-likelihood classifier
If all classes are equally likely, i.e. π_i = 1/m for all i, then (4) reduces to the maximum likelihood (m.l.) discrimination rule:

f_i(x)  →  max_{i∈{1,...,m}}.    (5)

d) Detection of 'unclassifiable' objects
Various modifications are possible in the formulation of the classification or discrimination problem, in particular the consideration of an (m + 1)-th decision category i = 0 which corresponds to 'postponing the decision' and thus collects 'unclassifiable objects'. If, in the case of uniform priors, all losses L_{t0} for 'postponing' are assumed to have the same value d with 0 = L_{tt} < L_{t0} ≡ d < L_{ti} = 1 (for all t, i ≥ 1 and t ≠ i), then the corresponding Bayes rule φ* is formulated in terms of the maximum function m(x):

Decide for 'x is unclassifiable'   if m(x) < (1 − d) f(x).
Assign x to the class Π_i          if π_i f_i(x) = m(x) ≥ (1 − d) f(x).
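The reduction from the general Bayes rule (2) to the MAP rule (4) under 0-1 loss can be checked numerically. The following sketch uses two illustrative one-dimensional normal classes; the priors, means, and loss matrix are assumptions for the example, not values from the text.

```python
import math

# Bayes rule (2): assign x to the class minimizing
#   h_i(x) = sum_t L[t][i] * pi_t * f_t(x).
# With the 0-1 loss this coincides with the MAP rule (4).
PRIORS = [0.6, 0.4]                      # assumed priors pi_1, pi_2
MEANS, SIGMAS = [0.0, 2.0], [1.0, 1.0]   # assumed 1-D normal class densities

def f(t: int, x: float) -> float:
    """Density f_t(x) of the 1-D normal class t."""
    z = (x - MEANS[t]) / SIGMAS[t]
    return math.exp(-0.5 * z * z) / (SIGMAS[t] * math.sqrt(2 * math.pi))

def bayes_classify(x: float, loss) -> int:
    """Minimize h_i(x); classes are indexed 0..m-1 here."""
    m = len(PRIORS)
    h = [sum(loss[t][i] * PRIORS[t] * f(t, x) for t in range(m))
         for i in range(m)]
    return min(range(m), key=h.__getitem__)

ZERO_ONE = [[0, 1], [1, 0]]              # 0-1 loss matrix L_ti

def map_classify(x: float) -> int:
    """MAP rule (4): maximize pi_i * f_i(x)."""
    return max(range(len(PRIORS)), key=lambda i: PRIORS[i] * f(i, x))
```

For the 0-1 loss the two functions agree at every x, which is exactly the equivalence derived above; an asymmetric loss matrix would shift the decision boundary towards the cheaper error.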
e) Normal distribution models:
A commonly used (but often inappropriate) distribution model is provided by the normal distribution: each Π_i is characterized by a k-dimensional normal distribution, f_i(x; ϑ_i) ≙ N_k(μ_i, Σ), with the class-specific mean ϑ_i ≡ μ_i ∈ R^k and a common positive definite covariance matrix Σ. The corresponding Bayesian classifier (4) reduces (after some elementary calculations) to the rule:

d_i(x) := ||x − μ_i||²_{Σ⁻¹} − 2 log π_i  →  min_{i∈{1,...,m}},    (6)

i.e., it essentially minimizes the squared Mahalanobis distance ||x − μ_i||²_{Σ⁻¹} := (x − μ_i)′ Σ⁻¹ (x − μ_i) and can be equivalently written with Fisher's linear discrimination functions L_{ti}(x): Decide for class Π_i if

L_{ti}(x) := (1/2) [ ||x − μ_t||²_{Σ⁻¹} − ||x − μ_i||²_{Σ⁻¹} ] = (x − (μ_i + μ_t)/2)′ Σ⁻¹ (μ_i − μ_t) ≥ log(π_t/π_i)    (7)

for t = 1, ..., m. For a uniform prior with π_i ≡ 1/m, the rules (6) and (7) both reduce to the minimum-distance rule:

d̃_i(x) := ||x − μ_i||²_{Σ⁻¹}  →  min_{i∈{1,...,m}},   i.e., L_{ti}(x) ≥ 0 for all t.    (8)

If, more generally, each class has its own covariance matrix, so that Π_i is described by a normal density f_i(x) ≙ N_k(μ_i, Σ_i) with the parameter ϑ_i = (μ_i, Σ_i), the Bayes rule (4) is given by:

d_i(x) := ||x − μ_i||²_{Σ_i⁻¹} + log |Σ_i| − 2 log π_i  →  min_{i∈{1,...,m}}    (9)

and may be formulated with quadratic discriminant functions L_{ti}(x): Decide for the class Π_i if

L_{ti}(x) := ||x − μ_t||²_{Σ_t⁻¹} − ||x − μ_i||²_{Σ_i⁻¹} ≥ log(|Σ_i|/|Σ_t|) + 2 log(π_t/π_i)   for t = 1, ..., m.    (10)

C5.1.2.2 Probabilistic approach with estimated parameters

In practical applications, the priors π_i, the class-specific densities f_i(x), and/or the class parameters ϑ_i are unknown or only partially known. The typical approach then consists in (a) estimating the unknown densities or parameters from appropriate training data (learning samples) and (b) using the previously described optimum classification rules [link to C5.1.2.1] with the proviso that unknown densities or parameters are substituted by their estimates (plug-in rules).

In the simplest case, the data consist of n pairs (x_1, z_1), ..., (x_n, z_n) where x_j ∈ X are the data points and z_j ∈ {1, ..., m} is the known class membership of data point x_j (which is, e.g., provided by a 'teacher': 'learning with a teacher', 'supervised classification'). Collecting all data from the same class Π_i (i.e., with z_j = i) in a set C_i, this yields m samples or training sets C_i = {x_i1, ..., x_in_i}, where n_i is the number of data points originating from the i-th class Π_i, with n = n_1 + ... + n_m. Basically, we have to distinguish between parametric and non-parametric models.

a) Parametric models:
In the case of a parametric density model f_i(x) = f(x; ϑ_i), the unknown parameters ϑ_i can be replaced, e.g., by their m.l. estimates ϑ̂_i obtained from the training data in C_1, ..., C_m. This yields the plug-in version φ^(n) of the m.l. classifier (5):

f̂_i(x) := f(x; ϑ̂_i)  →  max_{i∈{1,...,m}}.    (11)

For a normal distribution model with f_i ≙ N_k(μ_i, Σ_i) or f_i ≙ N_k(μ_i, Σ) with unknown μ_i, Σ_i and Σ, the estimates are given by the class centroids μ̂_i = x̄_i := (1/n_i) Σ_{j=1}^{n_i} x_ij and the empirical covariance matrices Σ̂_i = (1/n_i) Σ_{j=1}^{n_i} (x_ij − x̄_i)(x_ij − x̄_i)′ or Σ̂ = (1/n) Σ_{i=1}^m n_i Σ̂_i, respectively, and lead to minimum-distance rules of the type:

d̂_i(x) := ||x − x̄_i||²_{Σ̂_i⁻¹}  →  min_i    or    d̂_i(x) := ||x − x̄_i||²_{Σ̂⁻¹}  →  min_i    (12)

with quadratic and linear discrimination functions, respectively.
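The linear variant of the plug-in minimum-distance rule (12) can be sketched directly: estimate the class centroids and a pooled covariance matrix from labelled training sets, then assign a point to the class with the smallest squared Mahalanobis distance. The two-dimensional training data below are synthetic toy values, not from the text.

```python
import numpy as np

# Plug-in minimum-distance rule (12), linear case: centroids x_bar_i and a
# pooled covariance estimate Sigma-hat from two toy Gaussian training samples.
rng = np.random.default_rng(0)
C = [rng.normal([0.0, 0.0], 1.0, size=(50, 2)),   # training sample C_1
     rng.normal([4.0, 4.0], 1.0, size=(50, 2))]   # training sample C_2

centroids = [Ci.mean(axis=0) for Ci in C]
n = sum(len(Ci) for Ci in C)
pooled = sum(len(Ci) * np.cov(Ci, rowvar=False, bias=True) for Ci in C) / n
prec = np.linalg.inv(pooled)                       # Sigma-hat^{-1}

def classify(x):
    """Return the index i minimizing (x - x_bar_i)' Sigma-hat^{-1} (x - x_bar_i)."""
    d = [float((x - mu) @ prec @ (x - mu)) for mu in centroids]
    return int(np.argmin(d))
```

Replacing the pooled `prec` by one inverse covariance per class (and adding the log-determinant and prior terms of (9)) would give the quadratic variant instead.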
Plug-in versions of Bayesian rules with unknown priors π_i require a sampling scheme which also allows the estimation of the parameters π_i (which is not possible when the sizes n_1, ..., n_m of the training samples C_i are fixed beforehand). This can be attained by sampling the n training objects randomly from the entire population Π = Π_1 + ... + Π_m such that π_i is the probability of membership in Π_i (this is the probability model in [link to C5.1.2.1]; mixture sampling). Then, if N_i is the random number of objects sampled from Π_i (with Σ_{i=1}^m N_i = n and a joint multinomial distribution for N_1, ..., N_m), the relative frequency π̂_i := N_i/n provides an unbiased and consistent estimate of π_i which can be used for a Bayesian plug-in rule corresponding, e.g., to (4): π̂_i f(x; ϑ̂_i) → max_i.

b) Nonparametric models:
In cases where a parametric distribution model is inappropriate (e.g., since the boundaries between the acceptance regions A_1, ..., A_m for the classes are expected to be nonlinear or irregular), a nonparametric density estimate will be used, e.g., the Parzen or kernel density estimator given by:

f̂_i(x) := (1/(n_i h^k)) Σ_{j=1}^{n_i} K((x − x_ij)/h),    x ∈ R^k.    (13)

Here the kernel function K(·) is typically a distribution density such as K(x) = (2π)^{−k/2} exp{−||x||²/2} (other options: the uniform density on the unit cube or on the unit ball, the Epanechnikov kernel, etc.). It is common experience that the performance of a density-based classifier depends not so much on the choice of the kernel function, but mainly on the choice of the bandwidth h > 0, which specifies the neighbourhood of the data points and can also be chosen in dependence on the data. For a large dimension k, a primary requirement is a sufficiently large number n_i of samples for each class ('curse of dimensionality').
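A one-dimensional sketch of the kernel plug-in classifier follows: estimate f̂_i(x) by (13) with a Gaussian kernel, then apply the maximum-likelihood rule (5) to the estimated densities. The training points and the bandwidth h below are illustrative choices, not prescribed by the text.

```python
import math

def kernel_density(x: float, sample: list, h: float) -> float:
    """Parzen estimate (13) in one dimension with a Gaussian kernel."""
    K = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(K((x - xj) / h) for xj in sample) / (len(sample) * h)

def kde_classify(x: float, samples: list, h: float = 0.5) -> int:
    """Plug-in m.l. rule (5): pick the class with the largest f-hat_i(x)."""
    dens = [kernel_density(x, Ci, h) for Ci in samples]
    return max(range(len(samples)), key=dens.__getitem__)

# Toy training samples for two classes (assumed values).
C1 = [-0.3, 0.1, 0.4, -0.2, 0.0]
C2 = [2.8, 3.1, 3.4, 2.9, 3.2]
```

As noted above, the result is far more sensitive to h than to the kernel: a very small h produces ragged, overfitted acceptance regions, a very large h oversmooths the class densities.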
In order to attain consistent rules as n_i → ∞, the bandwidth h must approach 0, but sufficiently slowly such that n_i h^k → ∞.

c) Nearest-neighbour discrimination, k-nearest-neighbour classifier
The minimum-distance classifiers (12) can be generalized in many ways in order to comply with geometrical intuition or with non-quantitative data (qualitative or mixed data, symbolic data, etc.). Common methods are provided by the nearest-neighbour classification rule (NN classifier) and the k-nearest-neighbour (k-NN) classifier. Both methods start from a measure d(x, y) for the dissimilarity or distance between two elements x, y of the sample space X (e.g., the Euclidean or Mahalanobis distance). Let d(x, C_i) := min_{y∈C_i} d(x, y) denote the minimum distance between a data point x and the i-th training sample C_i (i = 1, ..., m). Then the inverse 1/d(x, C_i) is a measure for the density of points from C_i in the neighbourhood of x, and thus an 'estimate' of f_i(x). With this interpretation, the NN classification rule, d(x, C_i) → min_i, is a plug-in version of the m.l. classifier (5): the NN rule assigns a data point x to the class Π_i with minimum distance d(x, C_i) (i.e., to the class of the nearest neighbour of x among all training data).

A generalized version considers, for a fixed integer k (typically 1 ≤ k ≤ 10), the set S^(k)(x) which contains the k nearest neighbours of x within the total set X_(n) = C_1 + ... + C_m of all n data points. Denote by k_i(x) = |C_i ∩ S^(k)(x)| the number of data points from the learning sample C_i which are among those k nearest neighbours of x, such that k_i(x)/n might be considered as an estimate for f_i(x).
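Both rules can be sketched in a few lines with the Euclidean distance on the real line; the training sets below are toy examples, and the choice k = 3 is an arbitrary illustration.

```python
# NN rule: assign x to the class attaining min_i d(x, C_i);
# k-NN rule: count neighbours per class among the k closest training points.
def nn_classify(x, samples):
    """Class of the single nearest training point (1-D Euclidean distance)."""
    return min(range(len(samples)),
               key=lambda i: min(abs(x - y) for y in samples[i]))

def knn_classify(x, samples, k=3):
    """Class holding the most of the k nearest neighbours of x."""
    pool = [(abs(x - y), i) for i, Ci in enumerate(samples) for y in Ci]
    nearest = sorted(pool)[:k]
    counts = [sum(1 for _, i in nearest if i == j) for j in range(len(samples))]
    return max(range(len(samples)), key=counts.__getitem__)

# Toy training samples (assumed values).
C1 = [0.0, 0.5, 1.0]
C2 = [5.0, 5.5, 6.0]
```

For general feature spaces only the distance function needs to change (e.g., Mahalanobis distance, or a dissimilarity measure for mixed or symbolic data), which is precisely the flexibility emphasized above.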
Then the k-NN classifier assigns a new data point x to the class Π_i which contains the maximum number of the k nearest neighbours in the training samples, i.e., with i := argmax_{j=1,...,m} {k_j(x)}. The theory of minimum-distance and k-NN classifiers is surveyed, e.g., in Devroye et al. (1996), where consistency and error bounds are derived (e.g., for n → ∞ with k = k_n → ∞).

d) Mixture models
A special plug-in rule derives from the probabilistic model described in Section C5.1.2.1 if we consider the marginal density f(x; π, θ) = Σ_{i=1}^m π_i f_i(x) = Σ_{i=1}^m π_i f(x; ϑ_i) (mixture density) of an observation X = (X^(1), ..., X^(k))′ (i.e., without considering its class membership). The unknown parameter vector θ := (ϑ_1, ..., ϑ_m) and the prior π = (π_1, ..., π_m) are estimated from a training sample {x_1, ..., x_n} (without class memberships) by the maximum likelihood method, i.e., by minimizing the negative log-likelihood:

−loglik(π, θ) := Σ_{j=1}^n − log( Σ_{i=1}^m π_i f(x_j; ϑ_i) )  →  min_{π,θ}.    (14)

Optimum parameter values are found by iterative numerical methods such as the EM or SEM algorithm (SEM ≙ Stochastic Expectation Maximization; see McLachlan and Krishnan 1997). The resulting estimates π̂, ϑ̂_i are used for obtaining plug-in rules as previously described.

e) Neural networks
Neural networks are often used for classification purposes, e.g., in pattern recognition, credit scoring, robot control, etc. The relevant types of neural networks can be seen as devices or algorithms for approximating an unknown function y = g(x) by a semi-parametric ansatz function ŷ = ĝ(x; w) with an unknown (and typically high-dimensional) parameter vector w (typically termed a 'weight' vector), e.g., in the form of a radial basis function or a multilayer network.
The optimum approximation to g(·) is found by observing data points of the form y_j = g(x_j) or y_j = g(x_j) + U_j (with a random error U_j) and minimizing the deviation between the data y_1, ..., y_n and their 'predictions' ĝ(x_1; w), ..., ĝ(x_n; w) with respect to w (various deviation measures can be used). This process is often performed in a recursive way such that x_1, x_2, ... are observed sequentially and, after observing x_{n+1}, the previous estimate w^(n) for w is suitably updated (sequential learning).

Neural network classifiers result if such a procedure is applied to a classification problem where y denotes, e.g., the observed class membership and ĝ(x; w) is a discrimination rule. Corresponding methods use a neural network in order to estimate the unknown class densities g(x) = f_i(x) or the posterior probabilities g(x) = p_i(x) of the classes, and then use the estimates f̂_i(x) or p̂_i(x) in the formulas for the classical (Bayes, maximum-likelihood, minimum-distance) classifiers such as (4) or (5). For details see [link to C5.1.8].

C5.1.2.3 Estimating the error probabilities of a classification rule

The performance of a (fixed, non-randomized) classification rule φ with acceptance regions A_1, ..., A_m ⊂ R^k is characterized by

• the m true recovery rates α_{ii}(φ) := ∫ φ_i(x) f_i(x) dx = ∫_{A_i} f_i(x) dx, and
• the m(m−1) true error probabilities α_{ti}(φ) := ∫ φ_i(x) f_t(x) dx = ∫_{A_i} f_t(x) dx (for t ≠ i).

If the densities f_i(x) or the parameters ϑ_i in f(x; ϑ_i) are estimated from a training sample X_(n) = {x_ij | i = 1, ..., m, j = 1, ..., n_i} of size n, these formulas will depend on n and on the data set X_(n). A similar remark holds in the case of a data-dependent classifier φ(·) = φ^(n)(·) = φ^(n)(·; X_(n)) with data-dependent acceptance regions A_i^(n) (e.g., a plug-in rule).
As a consequence, we must distinguish among various conceptually different specifications of an 'error or recovery probability' and also among various different error estimates:

1. The plug-in estimate α̂_{ti}(φ) of the probability α_{ti}(φ), for a fixed known classifier φ:

α̂_{ti}(φ) = ∫ φ_i(x) f̂_t(x) dx = ∫_{A_i} f̂_t(x) dx.    (15)

2. The actual (true) error/recovery rate of a data-dependent classifier φ^(n):

α_{ti}(φ^(n)) = ∫ φ_i^(n)(x) f_t(x) dx = ∫_{A_i^(n)} f_t(x) dx.    (16)

3. The estimated error/recovery rate of φ^(n), obtained by substituting an estimated density (e.g., with estimated parameters):

α̂_{ti}(φ^(n)) = ∫ φ_i^(n)(x) f̂_t(x) dx = ∫_{A_i^(n)} f̂_t(x) dx.    (17)

4. The apparent error/recovery rate of φ^(n) (also termed the resubstitution estimate):

α_{ti,app}(φ^(n)) = U_{ti}/n_t   where   U_{ti} = Σ_{j=1}^{n_t} φ_i^(n)(x_tj) = Σ_{j=1}^{n_t} I_{A_i^(n)}(x_tj)    (18)

is the number of data points from the t-th training sample C_t which are assigned to the class Π_i by the classifier φ^(n) (confusion matrix). Unfortunately, this estimator is much too optimistic: for example, the estimated recovery rate α_{ii,app}(φ^(n)) is typically considerably larger than the true value α_{ii}(φ^(n)) to be estimated, since the latter is estimated from the same data X_(n) which have tuned the classifier φ^(n).

5. The expected error/recovery probability E_T[α̂_{ti}]:
For any of the data-dependent estimates α̂_{ti} given before, E_T[α̂_{ti}] is defined as the expected value of α̂_{ti} with respect to all training samples X_(n) (e.g., under the mixture model).
This is a fixed (i.e., not data-dependent) probability which characterizes the overall quality of the entire classification process, including the 'learning' of the (plug-in) classifier φ^(n) from the fluctuating data.

In practice, these definitions are often confounded or used uncritically. Considering a Bayesian classifier φ* and a corresponding plug-in rule φ^(n), it is expected that for an increasing sample size n → ∞, and when using consistent parameter estimates ϑ̂_i^(n) or f̂_i(x), the true and estimated error/recovery rates α_{ti}(φ*), α_{ti}(φ^(n)), α̂_{ti}(φ^(n)), etc. should all be close to each other. In this context, there exists a range of convergence theorems, e.g. for α̂_{ti}(φ^(n)) → α_{ti}(φ*), and finite-sample inequalities for the (maximum) deviation between true and estimated error/recovery probabilities. Similar results hold for the empirical Bayesian risk r(φ^(n), π), which converges to the minimum risk r* = r(φ*, π) of the Bayesian rule φ*. These topics are investigated in the context of computational learning theory, which also yields bounds for the risk difference r(φ^(n), π) − r* formulated in terms of the so-called Vapnik-Chervonenkis dimension (Devroye et al. 1997).

The quality of a classification rule should not be evaluated by considering error rates exclusively (which can be large even for an optimum classifier if the underlying populations Π_i are not well separable). Other properties of a classifier, such as its generalization ability, its ease of application, its stability, or its robustness (against departures from the underlying model; Kharin 1996), will sometimes be equally or even more important when selecting an 'appropriate' decision rule (see Hand 1997).
Test samples and cross-validation methods
A commonly used method for obtaining an unbiased, consistent estimate of the actual error/recovery probabilities α_{ti}(φ^(n)) of a classifier φ^(n) (obtained from the training data in X_(n)) proceeds as follows: we observe, in addition to the training data, from each population Π_i a new test sample T_i = {y_i1, ..., y_im_i} which is independent of X_(n), and calculate the relative error/recovery frequencies inside T_t:

α̂_{ti}(φ^(n)) = (1/m_t) Σ_{j=1}^{m_t} φ_i^(n)(y_tj) = (1/m_t) Σ_{j=1}^{m_t} I_{A_i^(n)}(y_tj)    (19)

for t = 1, ..., m. This approach is typically realized by randomly splitting the original n data points into a training set (of size 2n/3, say, with m training classes C_i) which yields the classifier, and a test set (the remaining n/3 data points with test samples T_i) which is used for evaluation afterwards. Since this splitting process needs (spoils) a lot of data, some more refined and economical tools have been designed under the heading of cross-validation methods, where one single or a few elements are singled out in turn. A common approach is the jackknife or leave-one-out method for evaluating a data-dependent (plug-in) classifier φ^(n) (e.g., a plug-in version of the Bayesian rule φ*): given n training data points x_j with known class memberships z_j ∈ {1, ..., m} (for j = 1, ..., n), we eliminate in turn each data point x_j, build the decision rule φ^(n−1,j) from the remaining n − 1 data points (in the same way as φ^(n) has been constructed from the entire data set) and classify the j-th point using φ^(n−1,j). Denote by d_j the index of the obtained class for x_j.
Then we estimate the true probabilities α_{ti}(φ*), α_{ti}(φ^(n)), or E_T[α_{ti}(φ^(n))] by the relative frequency Ũ_{ti}/n, where Ũ_{ti} := #{ j ∈ {1, ..., n} | z_j = t, d_j = i } is the number of data points x_j from Π_t which were classified by the rule φ^(n−1,j) into the class Π_i. Similarly, α̂ := Σ_{i=1}^m Ũ_{ii}/n provides an estimate of the overall recovery rate α(φ^(n)). Theoretical results as well as simulations show that this method yields quite precise estimates, even for moderate sample sizes.

The bootstrapping method:
This method is commonly used for estimating the error rates α_{ti}(φ^(n)), (16), of a (plug-in) rule φ^(n). Basically, it works by replacing in (16) the unknown distribution (density f_i(·)) of X by the empirical distribution of the data points from Π_i, e.g., taken from the training set C_i. Thus, for given sample sizes m_1, ..., m_m (often: m_i ≡ n_i), the method repeatedly (N times, say) takes a random subsample of size m_i from each training set C_i and iterates the following steps (1) to (3) for ν = 1, ..., N:

(1) Sample m_i data points (with replacement) from C_i, obtaining a bootstrap data set C_i^[ν] (typically with repetitions) for each i = 1, ..., m.

(2) Using the data in C_1^[ν], ..., C_m^[ν], construct the corresponding plug-in rule φ^[ν] (in the same way as φ^(n) is constructed from C_1, ..., C_m).

(3) Let U_{ti}^[ν] and Ũ_{ti}^[ν] denote the numbers of data points from C_t and C_t^[ν], respectively, which were assigned to the class Π_i by φ^[ν]. Calculate D_{ti}^[ν] := U_{ti}^[ν]/n_t − Ũ_{ti}^[ν]/m_t, i.e., the difference of the corresponding resubstitution estimates.

Finally, the bootstrap estimator for α_{ti}(φ^(n)) is given by:

α̂_{ti,boot} := α_{ti,app}(φ^(n)) + (1/N) Σ_{ν=1}^N D_{ti}^[ν]    (20)

where α_{ti,app}(φ^(n)) = U_{ti}/n_t is the resubstitution estimator for φ^(n).
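The leave-one-out procedure just described can be sketched for a simple plug-in rule. The base classifier here is a one-dimensional nearest-centroid rule chosen for brevity, and the data and labels are illustrative; in the notation above, each point x_j is classified by the rule φ^(n−1,j) rebuilt without it.

```python
# Leave-one-out (jackknife) estimate of the overall recovery rate alpha-hat,
# sketched for a 1-D nearest-centroid plug-in rule on toy data.
def nearest_centroid(x, data, labels, skip):
    """Classify x by the rule rebuilt from all points except index `skip`."""
    classes = sorted(set(labels))
    cent = {}
    for c in classes:
        pts = [xj for j, (xj, zj) in enumerate(zip(data, labels))
               if zj == c and j != skip]
        cent[c] = sum(pts) / len(pts)
    return min(classes, key=lambda c: abs(x - cent[c]))

def loo_recovery_rate(data, labels):
    """Fraction of points x_j correctly classified by their own rule phi^(n-1,j)."""
    hits = sum(1 for j, (xj, zj) in enumerate(zip(data, labels))
               if nearest_centroid(xj, data, labels, skip=j) == zj)
    return hits / len(data)

data = [0.0, 0.2, 0.4, 3.0, 3.2, 3.4]   # toy sample
labels = [1, 1, 1, 2, 2, 2]             # known class memberships z_j
```

Unlike the resubstitution estimate (18), each point is judged by a rule that never saw it, which removes the optimistic bias discussed above at the cost of refitting the rule n times.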
This estimator yields quite accurate estimates of the actual error probability α_{ti}(φ^(n)), even for close classes and small class sizes n_i.

C5.1.2.4 Fisher's classical linear discriminant analysis

A geometrical point of view underlies the classical linear discriminant theory developed by R.A. Fisher. This approach can be interpreted as the search for an s-dimensional hyperplane H in R^k such that the known classification C_1, ..., C_m of the training data x_ij is best reproduced by the n projection points y_ij = π_H(x_ij) of the x_ij onto H, in the sense that the variance between the classes, Σ_{i=1}^m n_i ||ȳ_i − ȳ||², is maximized by H. It appears that the optimum hyperplane is spanned by the s first eigenvectors v_1, ..., v_s ∈ R^k of the between-class covariance matrix B := Σ_{i=1}^m n_i (x̄_i − x̄)(x̄_i − x̄)′ and that, in this sense, the discrimination process may be based on the n reduced, s-dimensional feature vectors z_ij := (v_1′ x_ij, ..., v_s′ x_ij)′ (with j = 1, ..., n_i; i = 1, ..., m). For s = 2 dimensions these vectors can easily be shown on the screen, and the separating boundaries L_{ti}(z) = 0 for the projected classes (for the minimum-distance classifier (8)) can be simultaneously displayed in R².

C5.1.2.5 Selection and preprocessing of variables

A major step in classification and pattern recognition concerns the specification of variables which are able to distinguish the underlying or conjectured classes Π_1, ..., Π_m. In data mining, this problem is compounded by the fact that most databases in enterprises or marketing institutions typically store so much information and detail about the underlying subjects (customers, products, bank transfers) that it makes no sense to use all these variables in the classification process (for technical as well as economic reasons).
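Before turning to concrete selection methods, the projection step of Fisher's approach in Section C5.1.2.4 can be sketched numerically. This is a simplified illustration on toy data: it projects onto the leading eigenvectors of the between-class scatter matrix B only, omitting the whitening by the within-class covariance that the full method would apply first.

```python
import numpy as np

# Simplified Fisher-style reduction: project the data onto the s leading
# eigenvectors of B = sum_i n_i (x_bar_i - x_bar)(x_bar_i - x_bar)'.
# Three toy classes in R^3, separated along the first coordinate only.
rng = np.random.default_rng(1)
X = [rng.normal(mu, 0.3, size=(30, 3))
     for mu in ([0, 0, 0], [2, 0, 0], [4, 0, 0])]

grand = np.concatenate(X).mean(axis=0)           # grand mean x_bar
B = sum(len(Xi) * np.outer(Xi.mean(0) - grand, Xi.mean(0) - grand)
        for Xi in X)                             # between-class scatter
eigval, eigvec = np.linalg.eigh(B)               # eigenvalues ascending
V = eigvec[:, ::-1][:, :2]                       # s = 2 leading eigenvectors
Z = [Xi @ V for Xi in X]                         # reduced feature vectors z_ij
```

Since the class means differ only in the first coordinate, the leading eigenvector essentially recovers that axis, and the two-dimensional Z can be plotted directly, as described above for s = 2.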
Therefore, any classification process is usually preceded by the selection of a sufficiently small number of (hopefully) 'informative', 'discriminating' or 'predictive' variables. There exist various statistical methods for selecting or transforming variables: principal components (identical or related to the Karhunen-Loève expansion), Fisher's projection method described in [link to C5.1.2.4], projection pursuit methods, etc. These methods typically yield a small set of s, say, linear combinations of the k original variables X^(1), ..., X^(k).

In contrast, the decision tree approach constructs a highly non-linear classification rule and acceptance regions A_i in a recursive, 'monothetic' way: first by dissecting the entire set of training samples on the basis of a single, optimally selected variable (i.e., such that the m classes are best separated in the training set), and then by repeating this dissection process for each of the resulting sub-samples (until a stopping criterion precludes further splitting). As a result we obtain a decision tree where the acceptance regions A_i result from a recursive combination of rules relating to single variables only. Further details are described in [link to C5.1.3] and [link to C5.1.4].

Another type of method concentrates on the selection of a suitable subset of variables S ⊂ {1, ..., k} of a given size s = |S| from the original k variables (with 2 ≤ s < k), and then uses these variables instead of the original ones in the previously described Bayesian, maximum likelihood and NN classifiers. This classical model selection approach typically works with information measures (Akaike, Schwarz, ICOMP) which are to be optimized.
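A greedy subset search of this kind can be sketched as follows. The scoring rule (resubstitution error of a nearest-centroid classifier) and the data are illustrative stand-ins; in practice an information measure or a cross-validated error estimate would be preferable, for the reasons discussed in C5.1.2.3.

```python
# Greedy forward selection: starting from the empty set, repeatedly add the
# single variable whose inclusion gives the lowest resubstitution error of
# a nearest-centroid rule (an illustrative scoring choice).
def resub_error(data, labels, subset):
    """Resubstitution error rate using only the variables in `subset`."""
    classes = sorted(set(labels))
    cent = {c: [sum(row[v] for row, z in zip(data, labels) if z == c) /
                labels.count(c) for v in subset] for c in classes}
    def predict(row):
        return min(classes, key=lambda c: sum((row[v] - m) ** 2
                   for v, m in zip(subset, cent[c])))
    return sum(predict(r) != z for r, z in zip(data, labels)) / len(data)

def forward_select(data, labels, s):
    """Add variables one at a time until |S| = s (forward method)."""
    chosen, k = [], len(data[0])
    while len(chosen) < s:
        best = min((v for v in range(k) if v not in chosen),
                   key=lambda v: resub_error(data, labels, chosen + [v]))
        chosen.append(best)
    return chosen

# Toy data: variable 0 is noise, variable 1 separates the two classes.
data = [[0.9, 0.0], [0.1, 0.1], [0.5, 0.2], [0.3, 3.0], [0.7, 3.1], [0.2, 2.9]]
labels = [1, 1, 1, 2, 2, 2]
```

The backward method works symmetrically, deleting in each step the variable whose removal increases the (estimated) error least.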
A more recent approach looks for a selection $S$ which minimizes (an estimate of) the total error probability of the corresponding (optimum or plug-in) classifier built from the $s$ selected variables. Both approaches proceed by successively eliminating a single variable from the entire set of all $k$ variables (backward method), or by successively adding one more variable to an initial choice of one variable (forward method). The method is locally optimum insofar as, in each step, it eliminates (includes) the variable which leads to the smallest (estimated) total error probability of the corresponding (optimum, Bayes, plug-in, or empirical) decision rule, until the given dimension $s$ is attained or the error becomes too large.

Finally, we point to the approaches developed under the headings of 'support vector machines' and 'potential function method'. In this approach, the acceptance regions $A_i$ for the underlying classes $\Pi_i$ are defined by linear boundaries in $R^k$ which are computationally optimized in order to attain small error rates for the training samples. Since, however, many practical applications require the consideration of non-linear class boundaries, this approach is combined with a suitable non-linear transformation $\psi(x)$ of the original data vector $x$: the linear support vector classifier is then formulated in terms of the transformed data $\psi(x)$ (instead of using non-linear boundaries for the original data $x$). This approach is also investigated in the framework of 'computational learning theory'; a good reference is provided by Cristianini and Shawe-Taylor (2000).

Classical sources on classification, discrimination and pattern recognition are Young and Calvert (1974), Lachenbruch (1975), Krishnaiah and Kanal (1982), Fukunaga (1990), Goldstein and Dillon (1978), Niemann (1990) and Hand (1986).
Modern viewpoints are considered, e.g., in Breiman et al. (1984), McLachlan (1992), Ripley (1996), Kharin (1996), Devroye et al. (1997), and Bock and Diday (1999). Nonparametric density estimation is presented in Tapia and Thompson (1978), Devroye (1987) and Silverman (1986).

References:

Bandemer, H., Gottwald, S. 1995. Fuzzy sets, fuzzy logic, fuzzy methods with applications. Wiley, Chichester.

Bock, H.H., Diday, E. 1999. Analysis of symbolic data. Exploratory methods for extracting information from complex data. Springer Verlag, Heidelberg – Berlin. Presents classification and data analysis methods for set-valued and probabilistic data.

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, Ch.J. 1984. Classification and regression trees. Wadsworth, Belmont. Describes rule-based classification methods (decision trees).

Cristianini, N., Shawe-Taylor, J. 2000. An introduction to support vector machines. Cambridge University Press, Cambridge.

Devroye, L. 1987. A course in density estimation. Birkhäuser, Boston – Basel.

Devroye, L., Györfi, L., Lugosi, G. 1997. A probabilistic theory of pattern recognition. Springer, New York. An excellent and comprehensive survey of classical and recent results in pattern recognition, classification, computational learning theory and neural networks.

Fukunaga, K. 1990. Introduction to statistical pattern recognition. Academic Press, New York. An engineering-oriented presentation.

Goldstein, M., Dillon, W.R. 1978. Discrete discriminant analysis. Wiley, New York. Concentrates on categorical data.

Hand, D.J. 1986. Discrimination and classification. Wiley, New York.

Hand, D.J. 1997. Construction and assessment of classification rules. Wiley, New York. A survey of discrimination methods and the evaluation of their performance, with applications and practical comments.

Kharin, Y. 1996. Robustness in statistical pattern recognition. Kluwer Academic Publishers, Dordrecht. Concentrates on robustness studies for discrimination and clustering methods.

Krishnaiah, P.R., Kanal, L.N. (eds.). 1982. Classification, pattern recognition and reduction of dimensionality. Handbook of Statistics, vol. 2. North Holland/Elsevier, Amsterdam. With survey articles on many special topics in classification.

Lachenbruch, P.A. 1975. Discriminant analysis. Hafner/Macmillan, London. A classical reference with an emphasis on normal theory discrimination.

Läuter, J. 1992. Stabile multivariate Verfahren. Diskriminanzanalyse, Regressionsanalyse, Faktoranalyse. Akademie Verlag, Berlin. Discusses error probabilities as a function of n and p; contains discrimination methods for badly behaved data, e.g., using penalty functions and ridge methods.

McLachlan, G.J. 1992. Discriminant analysis and statistical pattern recognition. Wiley, New York. An invaluable source for all topics of discriminant analysis and its applications, with an emphasis on the statistical background.

McLachlan, G.J., Krishnan, Th. 1997. The EM algorithm and extensions. Wiley, New York.

Niemann, H. 1990. Pattern analysis and understanding. Springer Verlag, Berlin. The new version of a classical survey of pattern recognition and its applications in technology and knowledge processing.

Ripley, B.D. 1996. Pattern recognition and neural networks. Cambridge University Press, Cambridge, UK. A very appealing survey of the statistical bases of neural network classification methods.

Silverman, B.W. 1986. Density estimation for statistics and data analysis. Chapman and Hall, London – New York.
Tapia, R.A., Thompson, J.R. 1978. Nonparametric probability density estimation. Johns Hopkins University Press, Baltimore.

Young, T.Y., Calvert, T.W. 1974. Classification, estimation and pattern recognition. American Elsevier, New York. A classical standard reference.