Bios 760R, Lecture 1 Overview
Overview of the course; classification and clustering; the "curse of dimensionality"; reminder of some background knowledge.
Tianwei Yu, RSPH Room 334, [email protected]

Course Outline
Instructor: Tianwei Yu, Office: GCR Room 334, Email: [email protected], Office Hours: by appointment.
Teaching Assistant: Mr. Qingpo Cai, Office Hours: TBA.
Course Website: http://web1.sph.emory.edu/users/tyu8/740/index.htm

Lecture 1: Overview
Lecture 2: Bayesian decision theory
Lecture 3: Similarity measures, Clustering (1)
Lecture 4: Clustering (2)
Lecture 5: Density estimation and classification
Lecture 6: Linear machine
Lecture 7: Support vector machine
Lecture 8: Generalized additive model
Lecture 9: Boosting
Lecture 10: Tree and forest
Lecture 11: Bump hunting; Neural network (1)
Lecture 12: Neural network (2)
Lecture 13: Model generalization
Lecture 14: Dimension reduction (1)
Lecture 15: Dimension reduction (2)
Lecture 16: Some applications
Focus of the course: classification, clustering, and dimension reduction. The clustering lectures will be given by Dr. Elizabeth Chong.

References
Textbook: The Elements of Statistical Learning. Hastie, Tibshirani & Friedman. Free at: http://statweb.stanford.edu/~tibs/ElemStatLearn/
Other references: Pattern Classification. Duda, Hart & Stork. Data Clustering: Theory, Algorithms and Applications. Gan, Ma & Wu. Applied Multivariate Statistical Analysis. Johnson & Wichern.
Evaluation: three projects (30% each).

Overview
Machine learning / data mining divides broadly into supervised learning ("direct data mining": classification, estimation, prediction), unsupervised learning ("indirect data mining": clustering, association rules, description, dimension reduction and visualization), and semi-supervised learning. (Modified from Figure 1.1 of Data Clustering by Gan, Ma and Wu.)

In supervised learning, the problem is well-defined: given a set of observations {xi, yi}, estimate the density Pr(Y, X). Usually the goal is to find the model/parameters that minimize a loss. A common choice is the expected prediction error, $EPE(f) = E[Y - f(X)]^2$, which is minimized by $f(x) = E(Y \mid X = x)$. Objective criteria exist to measure the success of a supervised learning mechanism.

In unsupervised learning there is no output variable; all we observe is a set {xi}. The goal is to infer Pr(X) and/or some of its properties. When the dimension is low, nonparametric density estimation is possible; when the dimension is high, we may need to find simple properties without density estimation, or apply strong assumptions to estimate the density. There are no objective criteria from the data itself; to justify a result we rely on:
> Heuristic arguments,
> External information,
> Evaluation based on properties of the data.

Classification
The general scheme; an example.
In most cases, a single feature is not enough to generate a good classifier.
Two extremes: overly rigid and overly flexible classifiers.
Goal: an optimal trade-off between model simplicity and training-set performance. This is similar to AIC / BIC model selection in regression.
An example of the overall scheme involving classification; a classification project viewed systematically.

Clustering
Assign observations into clusters, such that those within each cluster are more closely related to one another than to objects assigned to different clusters. Clustering can detect data relations, find a natural hierarchy, and ascertain whether the data consist of distinct subgroups.
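To make this concrete, here is a minimal sketch (not from the slides) that assigns simulated observations to clusters with scikit-learn's KMeans; the simulated two-group data and the choice of k = 2 are assumptions for illustration.

```python
# Illustrative k-means sketch (assumed setup, not from the lecture):
# assign n observations to k clusters so that points within a cluster
# are closer to their own cluster center than to the other centers.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Simulate two loose groups in 2 dimensions (hypothetical data).
x = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=4.0, scale=1.0, size=(50, 2)),
])

k = 2  # number of clusters, assumed known here; in practice k must be estimated
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)

labels = km.labels_            # hard membership: one cluster index per observation
centers = km.cluster_centers_  # the "center + spread" summary uses these centers
print(labels[:10], centers)
```

The resulting labels vector corresponds to a hard membership matrix U with a single 1 per row; fuzzy clustering, discussed next, relaxes the memberships to values in [0, 1].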
Mathematically, we hope to estimate the number of clusters k and the membership matrix U, whose entries $u_{ij} \in \{0, 1\}$ indicate whether observation i belongs to cluster j. In fuzzy clustering, we instead have $u_{ij} \in [0, 1]$ with $\sum_j u_{ij} = 1$.
Some clusters are well represented by a center + spread model; some are not.

Curse of Dimensionality
Bellman R.E., 1961. In p dimensions, to get a hypercube with volume r, the edge length needed is $r^{1/p}$. In 10 dimensions, to capture 1% of the data to get a local average, we need $0.01^{1/10} \approx 63\%$ of the range of each input variable.
In other words, to get a "dense" sample, if we need N = 100 samples in 1 dimension, then we need N = 100^10 samples in 10 dimensions. In high dimensions the data are always sparse and do not support density estimation. Most data points are closer to the boundary of the sample space than to any other data point, so prediction is much harder near the edges of the training sample.
Example: estimating a 1D density with 40 data points from a standard normal distribution, versus estimating a 2D density with 40 data points from a 2D normal distribution with zero mean and identity covariance matrix.
Another example: the EPE of the nearest-neighbor predictor. To estimate E(Y|X=x), take the average of the data points close to a given x, i.e., the top k nearest neighbors of x. This assumes f(x) is well approximated by a locally constant function. When N is large, the neighborhood is small and the prediction is accurate.
Just a reminder: the expected prediction error contains variance and bias components. Under the model Y = f(X) + ε,
$$
\begin{aligned}
EPE(x_0) &= E[(Y - \hat f(x_0))^2] \\
&= E[\varepsilon^2 + 2\varepsilon(f(x_0) - \hat f(x_0)) + (f(x_0) - \hat f(x_0))^2] \\
&= \sigma^2 + E[(f(x_0) - \hat f(x_0))^2] \\
&= \sigma^2 + E[\hat f(x_0) - E(\hat f(x_0))]^2 + [E(\hat f(x_0)) - f(x_0)]^2 \\
&= \sigma^2 + \mathrm{Var}(\hat f(x_0)) + \mathrm{Bias}^2(\hat f(x_0)).
\end{aligned}
$$
Example data: uniform in [−1, 1]^p.
We have talked about the curse of dimensionality in the sense of density estimation. In a classification problem, we do not necessarily need density estimation. A generative model cares about the mechanism (the class density functions): it learns p(X, y) and predicts using p(y|X); in high dimensions this is difficult. A discriminative model cares about the boundary: it learns p(y|X) directly, potentially with only a subset of X.
(Diagram: a generative model relates the inputs X1, X2, X3, … to the class y; a discriminative model maps the inputs directly to y.) Example: classifying belt fish and carp. Looking at the length/width ratio is enough. Why should we care how many teeth each kind of fish has, or what shape fins they have?

Reminder of some results for random vectors
The multivariate Gaussian distribution (figure source: Wikipedia):
$$ f(x) = (2\pi)^{-p/2}\,|\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\right). $$
The covariance matrix, not limited to the Gaussian: $\Sigma = E[(X - \mu)(X - \mu)']$.
* The Gaussian is fully defined by its mean vector and covariance matrix (the first and second moments).
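As a quick illustration of that last point (not part of the original slides), the following sketch simulates from a multivariate Gaussian and recovers its mean vector and covariance matrix from the sample; the particular μ and Σ used are hypothetical.

```python
# Sketch: a multivariate Gaussian is fully defined by its mean vector and
# covariance matrix, so the sample moments recover the distribution's parameters.
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([1.0, -2.0])            # assumed mean vector
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])        # assumed covariance matrix (positive definite)

x = rng.multivariate_normal(mean=mu, cov=sigma, size=5000)  # n x p sample

mu_hat = x.mean(axis=0)                  # estimate of the mean vector
sigma_hat = np.cov(x, rowvar=False)      # sample covariance (divides by n - 1)
corr_hat = np.corrcoef(x, rowvar=False)  # corresponding correlation matrix

print(mu_hat)
print(sigma_hat)
print(corr_hat)
```

The sample covariance computed here is exactly the estimator $\hat\Sigma = \frac{1}{n-1}\sum_i (X_i - \bar X)(X_i - \bar X)'$ reviewed later in this section, and the correlation matrix is its rescaled version discussed next.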
The correlation matrix ρ has entries $\rho_{ij} = \sigma_{ij} / \sqrt{\sigma_{ii}\,\sigma_{jj}}$. Relationship with the covariance matrix:
$$ V^{1/2}\, \rho\, V^{1/2} = \Sigma, \qquad V^{1/2} = \mathrm{diag}\!\left(\sqrt{\sigma_{ii}}\right). $$
Quadratic form: $x'Ax = \sum_{i=1}^{k}\sum_{j=1}^{k} a_{ij} x_i x_j$.
For a linear combination c'X of the elements of a random vector with mean μ and variance-covariance matrix Σ, $E(c'X) = c'\mu$ and $\mathrm{Var}(c'X) = c'\Sigma c$. A 2-D example:
$$
\begin{aligned}
\mathrm{Var}(aX_1 + bX_2) &= E[(aX_1 + bX_2) - (a\mu_1 + b\mu_2)]^2 \\
&= E[a(X_1 - \mu_1) + b(X_2 - \mu_2)]^2 \\
&= a^2\sigma_{11} + b^2\sigma_{22} + 2ab\,\sigma_{12}
= [a\ \ b]\begin{bmatrix}\sigma_{11} & \sigma_{12}\\ \sigma_{12} & \sigma_{22}\end{bmatrix}\begin{bmatrix}a\\ b\end{bmatrix}
= c'\Sigma c.
\end{aligned}
$$
A "new" random vector generated from linear combinations of a random vector, Z = CX, has mean $E(Z) = C\mu$ and covariance matrix $\mathrm{Cov}(Z) = C\Sigma C'$.
Let A be a k×k symmetric matrix; then it has k pairs of eigenvalues and eigenvectors, and A can be decomposed as
$$ A = \lambda_1 e_1 e_1' + \lambda_2 e_2 e_2' + \cdots + \lambda_k e_k e_k' = P\Lambda P'. $$
A positive-definite matrix satisfies $x'Ax > 0$ for all $x \neq 0$, equivalently $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k > 0$. Note that $x'Ax = \lambda_1 (x'e_1)^2 + \cdots + \lambda_k (x'e_k)^2$.
Inverse of a positive-definite matrix: $A^{-1} = P\Lambda^{-1}P' = \sum_{i=1}^{k} \frac{1}{\lambda_i}\, e_i e_i'$.
Square root matrix of a positive-definite matrix: $A^{1/2} = P\Lambda^{1/2}P' = \sum_{i=1}^{k} \sqrt{\lambda_i}\, e_i e_i'$.
(Slide: proof of the first and second points of the previous slide.)
With a sample X_1, …, X_n of the random vector, we estimate the mean vector and covariance matrix by
$$ \hat\mu = \bar X, \qquad \hat\Sigma = S = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)(X_i - \bar X)'. $$
A numerical sketch of these matrix results appears at the end of this transcript.

Reminder of the ROC curve
(See J Clin Pathol 2009;62:1-5.)
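As a companion to the ROC-curve reminder above, here is a minimal sketch (an illustration, not material from the cited paper) that computes an ROC curve with scikit-learn; the labels and classifier scores are made up.

```python
# Sketch: an ROC curve traces the true-positive rate against the false-positive
# rate as the classification threshold on a score is varied.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])   # hypothetical class labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2,
                    0.7, 0.55, 0.9, 0.65, 0.3])      # hypothetical classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)    # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)                 # area under the ROC curve

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC = {auc:.3f}")
```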
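And here is the numerical sketch of the matrix results referenced above: the spectral decomposition of a positive-definite matrix, and its inverse and square root built from the eigenvalues and eigenvectors. The matrix used is an assumption for illustration.

```python
# Sketch: for a symmetric positive-definite A, A = P Lambda P', and the inverse
# and square-root matrices follow by transforming the eigenvalues.
import numpy as np

A = np.array([[2.0, 0.6],
              [0.6, 1.0]])                  # assumed positive-definite (covariance-like) matrix

lam, P = np.linalg.eigh(A)                  # eigenvalues (ascending) and orthonormal eigenvectors
assert np.all(lam > 0)                      # positive definiteness: all eigenvalues > 0

A_rebuilt = P @ np.diag(lam) @ P.T          # A = P Lambda P'
A_inv     = P @ np.diag(1.0 / lam) @ P.T    # A^{-1} = P Lambda^{-1} P'
A_sqrt    = P @ np.diag(np.sqrt(lam)) @ P.T # A^{1/2} = P Lambda^{1/2} P'

print(np.allclose(A_rebuilt, A))            # True
print(np.allclose(A_inv, np.linalg.inv(A))) # True
print(np.allclose(A_sqrt @ A_sqrt, A))      # True: the square root squares back to A
```

The same construction applies to a sample covariance matrix S, which is symmetric and, for continuous data with n larger than the dimension, typically positive definite.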