Object Orie'd Data Analysis, Last Time
• OODA in Image Analysis
  – Landmarks, Boundary Rep'ns, Medial Rep'ns
• Mildly Non-Euclidean Spaces
  – M-rep data on manifolds
  – Geodesic Mean
  – Principal Geodesic Analysis
  – Limitations – Cautions

Return to Big Picture
Main statistical goals of OODA:
• Understanding population structure
  – Low dim'al Projections, PCA, PGA, …
• Classification (i.e. Discrimination)
  – Understanding 2+ populations
• Time Series of Data Objects
  – Chemical Spectra, Mortality Data

Classification – Discrimination
Background, Two Class (Binary) version:
Using "training data" from Class +1 and Class -1, develop a "rule" for assigning new data to a class.
Canonical Example: Disease Diagnosis
• New patients are "Healthy" or "Ill"
• Determined based on measurements

Classification – Discrimination
Important Distinction: Classification vs. Clustering
• Classification: class labels are known; goal: understand the differences
• Clustering: goal: find the class labels (groups of similar data)
Both are about clumps of similar data, but with much different goals.

Classification – Discrimination
Useful terminology:
• Classification: supervised learning
• Clustering: unsupervised learning

Classification – Discrimination
Terminology: for statisticians, these are synonyms.
For biologists, classification means:
• Constructing taxonomies
• And sorting organisms into them
(maybe this is why "discrimination" was used, until politically incorrect…)

Classification (i.e. discrimination)
There are a number of:
• Approaches
• Philosophies
• Schools of Thought
Too often cast as: Statistics vs. EE–CS

Classification (i.e. discrimination)
EE–CS variations:
• Pattern Recognition
• Artificial Intelligence
• Neural Networks
• Data Mining
• Machine Learning

Classification (i.e. discrimination)
Differing viewpoints:
• Statistics: model the classes with probability distributions; use these to study class differences & find rules
• EE–CS: data are just sets of numbers; rules distinguish between them
Current thought: combine these.

Classification (i.e. discrimination)
Important overview reference: Duda, Hart and Stork (2001)
• Too much about neural nets???
• Pizer disagrees…
• Update of Duda & Hart (1973)

Classification (i.e. discrimination)
For a more classical statistical view: McLachlan (2004)
• Likelihood theory, etc.
• Not well tuned to HDLSS data

Classification Basics
Personal viewpoint: point clouds

Classification Basics
Simple and natural approach: Mean Difference, a.k.a. Centroid Method
Find the "skewer through two meatballs"

Classification Basics
For the simple toy example: project onto the mean difference direction & split at the center.

Classification Basics
Why not use PCA? Reasonable result? But it doesn't use the class labels…
• Good?
• Bad?

Classification Basics
Harder example (slanted clouds)

Classification Basics
PCA for slanted clouds:
• PC1 terrible
• PC2 better? Still misses the right dir'n
• Doesn't use the class labels

Classification Basics
Mean Difference for slanted clouds:
• A little better? Still misses the right dir'n
• Want to account for covariance

Classification Basics
Mean Difference & Covariance, simplest approach:
Rescale (standardize) the coordinate axes, i.e. replace the (full) data matrix:
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{d1} & \cdots & x_{dn} \end{pmatrix} \;\longrightarrow\; \begin{pmatrix} 1/s_1 & & 0 \\ & \ddots & \\ 0 & & 1/s_d \end{pmatrix} X = \begin{pmatrix} x_{11}/s_1 & \cdots & x_{1n}/s_1 \\ \vdots & & \vdots \\ x_{d1}/s_d & \cdots & x_{dn}/s_d \end{pmatrix}$$
Then do Mean Difference. Called the "Naïve Bayes Approach".

Classification Basics
Naïve Bayes reference: Domingos & Pazzani (1997)
Most sensible contexts:
• Non-comparable data
• E.g. different units

Classification Basics
Problem with Naïve Bayes: only adjusts variances, not covariances
• Doesn't solve this problem

Classification Basics
Better solution: Fisher Linear Discrimination
• Gets the right dir'n
• How does it work?
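The two baseline rules above (mean difference, and its Naïve Bayes rescaling) can be sketched in a few lines of numpy. This is my own illustrative code, not from the slides; the function names are made up, and `s` is the per-coordinate standard deviation of the class-centered data. Note the slides write data vectors as columns of a d×n matrix, while this sketch uses rows; nothing essential changes.

```python
import numpy as np

def mean_difference_rule(X_pos, X_neg):
    """Centroid method: the 'skewer through two meatballs'.
    Project onto the mean-difference direction and split at
    the midpoint between the two class means."""
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    direction = m_pos - m_neg
    midpoint = (m_pos + m_neg) / 2
    # Class +1 iff x falls on the +1 side of the midpoint
    return lambda x: 1 if (x - midpoint) @ direction >= 0 else -1

def naive_bayes_rule(X_pos, X_neg):
    """'Naive Bayes approach': rescale each coordinate axis by its
    within-class standard deviation (variances only -- covariances
    are NOT adjusted), then apply the mean-difference rule."""
    centered = np.vstack([X_pos - X_pos.mean(axis=0),
                          X_neg - X_neg.mean(axis=0)])
    s = centered.std(axis=0)          # per-axis scales s_1, ..., s_d
    base = mean_difference_rule(X_pos / s, X_neg / s)
    return lambda x: base(x / s)
```

For slanted clouds, rescaling by `s` only fixes the marginal variances, which is exactly why the slides move on to Fisher Linear Discrimination.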
Fisher Linear Discrimination
Other common terminology (for FLD): Linear Discriminant Analysis (LDA)
Original paper: Fisher (1936)

Fisher Linear Discrimination
Careful development. Useful notation (data vectors of length $d$):
• Class +1: $X_1^{(+1)}, \ldots, X_{n_{+1}}^{(+1)}$
• Class -1: $X_1^{(-1)}, \ldots, X_{n_{-1}}^{(-1)}$
• Centerpoints: $\bar{X}^{(+1)} = \frac{1}{n_{+1}} \sum_{i=1}^{n_{+1}} X_i^{(+1)}$ and $\bar{X}^{(-1)} = \frac{1}{n_{-1}} \sum_{i=1}^{n_{-1}} X_i^{(-1)}$

Fisher Linear Discrimination
Covariances: $\hat{\Sigma}^{(k)} = \tilde{X}^{(k)} \tilde{X}^{(k)t}$ for $k = +1, -1$ (outer products), based on the centered, normalized data matrices
$$\tilde{X}^{(k)} = \frac{1}{\sqrt{n_k}} \left( X_1^{(k)} - \bar{X}^{(k)}, \ldots, X_{n_k}^{(k)} - \bar{X}^{(k)} \right)$$
Note: use the "MLE" version of the estimated covariance matrices, for simpler notation.

Fisher Linear Discrimination
Major assumption: class covariances are the same (or "similar")
[figures: like this / not this]

Fisher Linear Discrimination
Good estimate of the (common) within-class covariance? Pooled (weighted average) within-class covariance:
$$\hat{\Sigma}_w = \frac{n_{+1}\hat{\Sigma}^{(+1)} + n_{-1}\hat{\Sigma}^{(-1)}}{n_{+1} + n_{-1}} = \tilde{X}\tilde{X}^t$$
based on the combined full data matrix
$$\tilde{X} = \frac{1}{\sqrt{n}}\left(\sqrt{n_{+1}}\,\tilde{X}^{(+1)}, \; \sqrt{n_{-1}}\,\tilde{X}^{(-1)}\right), \qquad n = n_{+1} + n_{-1}$$

Fisher Linear Discrimination
Note: $\hat{\Sigma}_w$ is similar to $\hat{\Sigma}$ from before, i.e. the covariance matrix ignoring class labels.
Important difference: class-by-class centering. Will be important later.

Fisher Linear Discrimination
Simple way to find the "correct covariance adjustment": individually transform the subpopulations so they are "spherical" about their means. For $k = +1, -1$ define
$$Y_i^{(k)} = \hat{\Sigma}_w^{-1/2} X_i^{(k)}$$

Fisher Linear Discrimination
Then, in the transformed space, the best separating hyperplane is the perpendicular bisector of the line between the means.

Fisher Linear Discrimination
In the transformed space, the separating hyperplane has:
• Transformed normal vector: $n_{TFLD} = \hat{\Sigma}_w^{-1/2}\bar{X}^{(+1)} - \hat{\Sigma}_w^{-1/2}\bar{X}^{(-1)} = \hat{\Sigma}_w^{-1/2}\left(\bar{X}^{(+1)} - \bar{X}^{(-1)}\right)$
• Transformed intercept: $\mu_{TFLD} = \frac{1}{2}\hat{\Sigma}_w^{-1/2}\bar{X}^{(+1)} + \frac{1}{2}\hat{\Sigma}_w^{-1/2}\bar{X}^{(-1)} = \hat{\Sigma}_w^{-1/2}\,\frac{1}{2}\left(\bar{X}^{(+1)} + \bar{X}^{(-1)}\right)$
The separating hyperplane has equation: $\left\{ y : \langle y, n_{TFLD}\rangle = \langle \mu_{TFLD}, n_{TFLD}\rangle \right\}$

Fisher Linear Discrimination
Thus the discrimination rule is: given a new data vector $X_0$, choose Class +1 when
$$\langle \hat{\Sigma}_w^{-1/2} X_0, \; n_{TFLD}\rangle \ge \langle \mu_{TFLD}, \; n_{TFLD}\rangle$$
i.e. (transforming back to the original space)
$$\langle X_0, \; n_{FLD}\rangle \ge \langle \mu_{FLD}, \; n_{FLD}\rangle$$
where
$$n_{FLD} = \hat{\Sigma}_w^{-1/2}\, n_{TFLD} = \hat{\Sigma}_w^{-1}\left(\bar{X}^{(+1)} - \bar{X}^{(-1)}\right), \qquad \mu_{FLD} = \hat{\Sigma}_w^{1/2}\, \mu_{TFLD} = \frac{1}{2}\left(\bar{X}^{(+1)} + \bar{X}^{(-1)}\right)$$

Fisher Linear Discrimination
So (in the original space) we have a separating hyperplane with:
• FLD normal vector: $n_{FLD}$
• Intercept: $\mu_{FLD}$

Fisher Linear Discrimination
Relationship to Mahalanobis distance. For $X_1, X_2 \sim N(\mu, \Sigma)$, a natural distance measure is
$$d_M(X_1, X_2) = \left( (X_1 - X_2)^t\, \Sigma^{-1} (X_1 - X_2) \right)^{1/2}$$
Idea:
• "unit free", i.e. "standardized"
• essentially mods out the covariance structure
• Euclidean distance applied to $\Sigma^{-1/2}X_1$ & $\Sigma^{-1/2}X_2$
• Same as the key transformation for FLD
• I.e. FLD is the mean difference in Mahalanobis space

Classical Discrimination
The above derivation of FLD was:
• Nonstandard
• Not in any textbooks(?)
• Nonparametric (don't need Gaussian data)
• I.e. used no probability distributions
• More Machine Learning than Statistics

Classical Discrimination
FLD Likelihood View. Assume the class distributions are multivariate $N\left(\mu^{(k)}, \Sigma_w\right)$ for $k = +1, -1$
• strong distributional assumption + common covariance

Classical Discrimination
FLD Likelihood View (cont.) At a location $x_0$, the likelihood ratio, for choosing between Class +1 and Class -1, is
$$LR\left(x_0, \mu^{(+1)}, \mu^{(-1)}, \Sigma_w\right) = \varphi_{\mu^{(+1)}, \Sigma_w}(x_0) \,/\, \varphi_{\mu^{(-1)}, \Sigma_w}(x_0)$$
where $\varphi_{\mu, \Sigma_w}$ is the Gaussian density with mean $\mu$ and covariance $\Sigma_w$.

Classical Discrimination
FLD Likelihood View (cont.) Simplifying, using the Gaussian density
$$\varphi_{\mu, \Sigma_w}(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_w|^{1/2}}\; e^{-(x-\mu)^t \Sigma_w^{-1} (x-\mu)/2}$$
gives (critically using the common covariance):
$$-2\log LR\left(x_0, \mu^{(+1)}, \mu^{(-1)}, \Sigma_w\right) = \left(x_0 - \mu^{(+1)}\right)^t \Sigma_w^{-1} \left(x_0 - \mu^{(+1)}\right) - \left(x_0 - \mu^{(-1)}\right)^t \Sigma_w^{-1} \left(x_0 - \mu^{(-1)}\right)$$

Classical Discrimination
FLD Likelihood View (cont.) But
$$\left(x_0 - \mu^{(k)}\right)^t \Sigma_w^{-1} \left(x_0 - \mu^{(k)}\right) = x_0^t \Sigma_w^{-1} x_0 - 2\, x_0^t \Sigma_w^{-1} \mu^{(k)} + \mu^{(k)t} \Sigma_w^{-1} \mu^{(k)}$$
so
$$2\log LR\left(x_0, \mu^{(+1)}, \mu^{(-1)}, \Sigma_w\right) = 2\, x_0^t \Sigma_w^{-1} \left(\mu^{(+1)} - \mu^{(-1)}\right) - \mu^{(+1)t}\Sigma_w^{-1}\mu^{(+1)} + \mu^{(-1)t}\Sigma_w^{-1}\mu^{(-1)}$$
Thus $LR\left(x_0, \mu^{(+1)}, \mu^{(-1)}, \Sigma_w\right) \ge 1$ when
$$x_0^t \Sigma_w^{-1} \left(\mu^{(+1)} - \mu^{(-1)}\right) \ge \frac{1}{2}\left(\mu^{(+1)} + \mu^{(-1)}\right)^t \Sigma_w^{-1} \left(\mu^{(+1)} - \mu^{(-1)}\right)$$

Classical Discrimination
FLD Likelihood View (cont.) Replacing $\mu^{(+1)}$, $\mu^{(-1)}$ and $\Sigma_w$ by the maximum likelihood estimates $\bar{X}^{(+1)}$, $\bar{X}^{(-1)}$ and $\hat{\Sigma}_w$ gives the likelihood ratio discrimination rule: choose Class +1 when
$$x_0^t \hat{\Sigma}_w^{-1} \left(\bar{X}^{(+1)} - \bar{X}^{(-1)}\right) \ge \frac{1}{2}\left(\bar{X}^{(+1)} + \bar{X}^{(-1)}\right)^t \hat{\Sigma}_w^{-1} \left(\bar{X}^{(+1)} - \bar{X}^{(-1)}\right)$$
Same as above, so: FLD can be viewed as the Likelihood Ratio Rule.

Classical Discrimination
FLD Generalization I: Gaussian Likelihood Ratio Discrimination (a.k.a. "nonlinear discriminant analysis")
Idea: assume the class distributions are $N\left(\mu^{(k)}, \Sigma^{(k)}\right)$
• Different covariances!
The likelihood ratio rule is a straightforward numerical calculation (thus can easily implement, and do discrimination).

Classical Discrimination
Gaussian Likelihood Ratio Discrimination (cont.)
No longer have a separating hyperplane representation (instead, regions determined by quadratics) (fairly complicated case-wise calculations).
Graphical display: each point is colored as:
• Yellow if assigned to Class +1
• Cyan if assigned to Class -1
(intensity is the strength of the assignment)

Classical Discrimination
• FLD for Tilted Point Clouds – works well
• GLR for Tilted Point Clouds – works well
• FLD for Donut – poor, no plane can work
• GLR for Donut – works well (good quadratic)
• FLD for X – poor, no plane can work
• GLR for X – better, but not great

Classical Discrimination
Summary of FLD vs. GLR:
• Tilted Point Cloud Data: FLD good, GLR good
• Donut Data: FLD bad, GLR good
• X Data: FLD bad, GLR OK, not great
Classical conclusion: GLR is generally better (will see a different answer for HDLSS data).
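As a concrete companion to the comparison above, here is a minimal numpy sketch of the two rules, using the MLE ("divide by n") covariance estimates as in the notes. This is my own illustrative code: `fld_rule` and `glr_rule` are made-up names, and rows (not columns) hold the data vectors.

```python
import numpy as np

def fld_rule(X_pos, X_neg):
    """FLD: mean difference after adjusting by the pooled
    within-class covariance Sigma_w (MLE version)."""
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    n_pos, n_neg = len(X_pos), len(X_neg)
    # pooled (weighted average) within-class covariance
    S_pos = np.cov(X_pos, rowvar=False, bias=True)
    S_neg = np.cov(X_neg, rowvar=False, bias=True)
    Sw = (n_pos * S_pos + n_neg * S_neg) / (n_pos + n_neg)
    n_fld = np.linalg.solve(Sw, m_pos - m_neg)   # Sw^{-1} (mean difference)
    mu_fld = (m_pos + m_neg) / 2                 # midpoint of the means
    return lambda x: 1 if (x - mu_fld) @ n_fld >= 0 else -1

def glr_rule(X_pos, X_neg):
    """Gaussian likelihood ratio: a separate covariance per class,
    so the decision boundary is quadratic, not a hyperplane."""
    def log_density(x, m, S):
        diff = x - m
        return -0.5 * (len(m) * np.log(2 * np.pi)
                       + np.log(np.linalg.det(S))
                       + diff @ np.linalg.solve(S, diff))
    m_pos, S_pos = X_pos.mean(axis=0), np.cov(X_pos, rowvar=False, bias=True)
    m_neg, S_neg = X_neg.mean(axis=0), np.cov(X_neg, rowvar=False, bias=True)
    return lambda x: (1 if log_density(x, m_pos, S_pos)
                           >= log_density(x, m_neg, S_neg) else -1)
```

When the two class covariances happen to be equal, the quadratic terms in `glr_rule` cancel and the two rules agree, which is the likelihood-ratio view of FLD derived above; on donut-shaped data only the quadratic rule can succeed.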