Object Orie’d Data Analysis, Last Time
• OODA in Image Analysis
– Landmarks, Boundary Rep’ns, Medial Rep’ns
• Mildly Non-Euclidean Spaces
– M-rep data on manifolds
– Geodesic Mean
– Principal Geodesic Analysis
– Limitations
– Cautions
Return to Big Picture
Main statistical goals of OODA:
• Understanding population structure
– Low dim’al Projections, PCA, PGA, …
• Classification (i.e. Discrimination)
– Understanding 2+ populations
• Time Series of Data Objects
– Chemical Spectra, Mortality Data
Classification - Discrimination
Background: Two Class (Binary) version:
Using “training data” from
Class +1 and
Class -1
Develop a “rule” for
assigning new data to a Class
Canonical Example: Disease Diagnosis
• New Patients are “Healthy” or “Ill”
• Determined based on measurements
Classification - Discrimination
Important Distinction:
Classification vs. Clustering
Classification:
Class labels are known,
Goal: understand differences
Clustering:
Goal: find class labels (grouping similar data)
Both are about clumps of similar data,
but with much different goals
Classification - Discrimination
Important Distinction:
Classification vs. Clustering
Useful terminology:
Classification: supervised learning
Clustering: unsupervised learning
Classification - Discrimination
Terminology:
For statisticians, these are synonyms
For biologists, classification means:
• Constructing taxonomies
• And sorting organisms into them
(maybe this is why “discrimination”
was used, until politically incorrect…)
Classification (i.e. discrimination)
There are a number of:
• Approaches
• Philosophies
• Schools of Thought
Too often cast as:
Statistics vs. EE - CS
Classification (i.e. discrimination)
EE – CS variations:
• Pattern Recognition
• Artificial Intelligence
• Neural Networks
• Data Mining
• Machine Learning
Classification (i.e. discrimination)
Differing Viewpoints:
Statistics
• Model Classes with Probability Distribut’ns
• Use to study class diff’s & find rules
EE – CS
• Data are just Sets of Numbers
• Rules distinguish between these
Current thought: combine these
Classification (i.e. discrimination)
Important Overview Reference:
Duda, Hart and Stork (2001)
• Too much about neural nets???
• Pizer disagrees…
• Update of Duda & Hart (1973)
Classification (i.e. discrimination)
For a more classical statistical view:
McLachlan (2004).
• Likelihood theory, etc.
• Not well tuned to HDLSS data
Classification Basics
Personal Viewpoint:
Point Clouds
Classification Basics
Simple and Natural Approach:
Mean Difference
a.k.a.
Centroid Method
Find “skewer through two meatballs”
Classification Basics
For Simple Toy Example:
Project on MD direction & split at center
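The following is a minimal NumPy sketch of this centroid rule (the function name and toy data are illustrative, not from the slides): compute the two class means, project a new point onto the mean-difference direction, and split at the midpoint.

```python
import numpy as np

def mean_difference_classify(X_plus, X_minus, x_new):
    """Centroid (mean difference) rule: project onto the line joining the
    class means and split at its midpoint. Rows are observations."""
    m_plus = X_plus.mean(axis=0)     # mean of Class +1
    m_minus = X_minus.mean(axis=0)   # mean of Class -1
    direction = m_plus - m_minus     # the "skewer through two meatballs"
    midpoint = (m_plus + m_minus) / 2
    # Choose Class +1 when the projection lies on the +1 side of the midpoint
    return +1 if np.dot(x_new - midpoint, direction) >= 0 else -1

# Illustrative toy data (not the slides' example)
rng = np.random.default_rng(0)
X_plus = rng.normal(loc=[2.0, 0.0], size=(50, 2))
X_minus = rng.normal(loc=[-2.0, 0.0], size=(50, 2))
print(mean_difference_classify(X_plus, X_minus, np.array([1.5, 0.3])))  # expect +1
```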
Classification Basics
Why not use PCA?
Reasonable result?
Doesn’t use class labels…
• Good?
• Bad?
Classification Basics
Harder Example (slanted clouds):
Classification Basics
PCA for slanted clouds:
PC1 terrible
PC2 better?
Still misses right dir’n
Doesn’t use Class Labels
Classification Basics
Mean Difference for slanted clouds:
A little better?
Still misses right dir’n
Want to account for covariance
Classification Basics
Mean Difference & Covariance,
Simplest Approach:
Rescale (standardize) coordinate axes
i.e. replace the (full) data matrix:
$$
X = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{d1} & \cdots & x_{dn} \end{pmatrix}
\;\longrightarrow\;
\begin{pmatrix} 1/s_1 & & 0 \\ & \ddots & \\ 0 & & 1/s_d \end{pmatrix} X
= \begin{pmatrix} x_{11}/s_1 & \cdots & x_{1n}/s_1 \\ \vdots & & \vdots \\ x_{d1}/s_d & \cdots & x_{dn}/s_d \end{pmatrix}
$$
Then do Mean Difference
Called “Naïve Bayes Approach”
Classification Basics
Naïve Bayes Reference:
Domingos & Pazzani (1997)
Most sensible contexts:
• Non-comparable data
• E.g. different units
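Below is a hedged NumPy sketch of this rescale-then-mean-difference idea; taking each scale $s_j$ as the pooled per-coordinate standard deviation is an assumption (one reasonable reading of the slide), and all names are illustrative.

```python
import numpy as np

def naive_bayes_md(X_plus, X_minus):
    """Rescale each coordinate by a standard deviation s_j, then apply the
    mean-difference rule in the rescaled space. Rows are observations
    (the transpose of the slides' d x n layout)."""
    s = np.vstack([X_plus, X_minus]).std(axis=0)   # per-coordinate scales s_1,...,s_d
    Xp, Xm = X_plus / s, X_minus / s               # replace x_ji by x_ji / s_j
    direction = Xp.mean(axis=0) - Xm.mean(axis=0)  # mean difference in rescaled space
    midpoint = (Xp.mean(axis=0) + Xm.mean(axis=0)) / 2
    return s, direction, midpoint

def naive_bayes_classify(x_new, s, direction, midpoint):
    # Choose Class +1 when the rescaled point is on the +1 side of the midpoint
    return +1 if np.dot(x_new / s - midpoint, direction) >= 0 else -1
```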
Classification Basics
Problem with Naïve Bayes:
Only adjusts Variances,
Not Covariances
Doesn’t solve this problem
Classification Basics
Better Solution: Fisher Linear Discrimination
Gets the right dir’n
How does it work?
Fisher Linear Discrimination
Other common terminology (for FLD):
Linear Discriminant Analysis (LDA)
Original Paper:
Fisher (1936)
Fisher Linear Discrimination
Careful development:
Useful notation (data vectors of length $d$):
Class +1: $X_1^{(+1)}, \ldots, X_{n_{+1}}^{(+1)}$
Class -1: $X_1^{(-1)}, \ldots, X_{n_{-1}}^{(-1)}$
Centerpoints:
$\bar{X}^{(+1)} = \frac{1}{n_{+1}} \sum_{i=1}^{n_{+1}} X_i^{(+1)}$
and
$\bar{X}^{(-1)} = \frac{1}{n_{-1}} \sum_{i=1}^{n_{-1}} X_i^{(-1)}$
Fisher Linear Discrimination
Covariances: $\hat{\Sigma}^{(k)} = \tilde{X}^{(k)} \tilde{X}^{(k)\,t}$ for $k = +1, -1$
(outer products)
Based on centered, normalized data matrices:
$\tilde{X}^{(k)} = \frac{1}{\sqrt{n_k}} \left( X_1^{(k)} - \bar{X}^{(k)}, \ldots, X_{n_k}^{(k)} - \bar{X}^{(k)} \right)$
Note: use “MLE” version of estimated
covariance matrices, for simpler notation
Fisher Linear Discrimination
Major Assumption:
Class covariances are the same (or “similar”)
Like this:
Not this:
Fisher Linear Discrimination
Good estimate of (common) within-class cov?
Pooled (weighted average) within-class cov:
$\hat{\Sigma}_w = \frac{n_{+1} \hat{\Sigma}^{(+1)} + n_{-1} \hat{\Sigma}^{(-1)}}{n_{+1} + n_{-1}} = \tilde{X} \tilde{X}^t$
based on the combined full data matrix:
$\tilde{X} = \frac{1}{\sqrt{n}} \left( \sqrt{n_{+1}} \, \tilde{X}^{(+1)}, \; \sqrt{n_{-1}} \, \tilde{X}^{(-1)} \right)$
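A short NumPy sketch of the pooled within-class covariance defined above, using the “MLE” (divide by $n_k$) normalization the slides adopt; the function name is illustrative.

```python
import numpy as np

def pooled_within_class_cov(X_plus, X_minus):
    """Sigma_hat_w = (n_+1 Sigma_hat^(+1) + n_-1 Sigma_hat^(-1)) / (n_+1 + n_-1),
    with each class covariance in its MLE version (divide by n_k).
    Rows of X_plus and X_minus are observations."""
    n_p, n_m = len(X_plus), len(X_minus)
    cov_p = np.cov(X_plus, rowvar=False, bias=True)   # Sigma_hat^(+1)
    cov_m = np.cov(X_minus, rowvar=False, bias=True)  # Sigma_hat^(-1)
    return (n_p * cov_p + n_m * cov_m) / (n_p + n_m)
```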
Fisher Linear Discrimination
Note: $\hat{\Sigma}_w$ is similar to $\hat{\Sigma}$ from before,
i.e. the covariance matrix ignoring class labels
Important Difference: Class-by-Class Centering
Will be important later
Fisher Linear Discrimination
Simple way to find “correct cov. adjustment”:
Individually transform subpopulations so
“spherical” about their means
For $k = +1, -1$ define $Y_i^{(k)} = \hat{\Sigma}_w^{-1/2} X_i^{(k)}$
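One way to realize the $\hat{\Sigma}_w^{-1/2}$ transform is via an eigendecomposition, as in this sketch (assumes $\hat{\Sigma}_w$ is symmetric positive definite; names are illustrative).

```python
import numpy as np

def inv_sqrt(Sigma):
    """Inverse matrix square root of a symmetric positive definite matrix,
    computed from its eigendecomposition."""
    vals, vecs = np.linalg.eigh(Sigma)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

# Transform each data vector: Y_i^(k) = Sigma_w^{-1/2} X_i^(k).
# With an n x d data matrix (rows = observations) this is a right-multiplication,
# e.g.  Y_plus = X_plus @ inv_sqrt(Sigma_w)   (inv_sqrt(Sigma_w) is symmetric).
```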
Fisher Linear Discrimination
Then:
In Transformed Space,
Best separating hyperplane
is
Perpendicular bisector of
line between means
Fisher Linear Discrimination
In Transformed Space,
Separating Hyperplane has:
Transformed Normal Vector:
$n_{TFLD} = \hat{\Sigma}_w^{-1/2} \bar{X}^{(+1)} - \hat{\Sigma}_w^{-1/2} \bar{X}^{(-1)} = \hat{\Sigma}_w^{-1/2} \left( \bar{X}^{(+1)} - \bar{X}^{(-1)} \right)$
Transformed Intercept:
$\mu_{TFLD} = \frac{1}{2} \hat{\Sigma}_w^{-1/2} \bar{X}^{(+1)} + \frac{1}{2} \hat{\Sigma}_w^{-1/2} \bar{X}^{(-1)} = \hat{\Sigma}_w^{-1/2} \left( \frac{1}{2} \bar{X}^{(+1)} + \frac{1}{2} \bar{X}^{(-1)} \right)$
Sep. Hyperp. has Equation:
$\left\{ y : \langle y, n_{TFLD} \rangle = \langle \mu_{TFLD}, n_{TFLD} \rangle \right\}$
Fisher Linear Discrimination
Thus discrimination rule is:
Given a new data vector $X_0$,
Choose Class +1 when:
$\langle \hat{\Sigma}_w^{-1/2} X_0, \, n_{TFLD} \rangle \ge \langle \mu_{TFLD}, \, n_{TFLD} \rangle$
i.e. (transforming back to original space)
$\langle X_0, \, \hat{\Sigma}_w^{-1/2} n_{TFLD} \rangle \ge \langle \hat{\Sigma}_w^{1/2} \mu_{TFLD}, \, \hat{\Sigma}_w^{-1/2} n_{TFLD} \rangle$
i.e.
$\langle X_0, \, n_{FLD} \rangle \ge \langle \mu_{FLD}, \, n_{FLD} \rangle$
where:
$n_{FLD} = \hat{\Sigma}_w^{-1/2} n_{TFLD} = \hat{\Sigma}_w^{-1} \left( \bar{X}^{(+1)} - \bar{X}^{(-1)} \right)$
$\mu_{FLD} = \hat{\Sigma}_w^{1/2} \mu_{TFLD} = \frac{1}{2} \bar{X}^{(+1)} + \frac{1}{2} \bar{X}^{(-1)}$
Fisher Linear Discrimination
So (in orig’l space) have separating hyperplane with:
Normal vector: $n_{FLD}$
Intercept: $\mu_{FLD}$
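Putting the pieces together, a sketch (not a definitive implementation) of the FLD rule in the original space; all names are illustrative and rows are taken as observations.

```python
import numpy as np

def fld_rule(X_plus, X_minus):
    """Return (n_FLD, mu_FLD): normal vector and intercept point of the FLD
    separating hyperplane in the original space."""
    m_p, m_m = X_plus.mean(axis=0), X_minus.mean(axis=0)
    n_p, n_m = len(X_plus), len(X_minus)
    # Pooled within-class covariance (MLE normalization), as defined earlier
    Sigma_w = (n_p * np.cov(X_plus, rowvar=False, bias=True)
               + n_m * np.cov(X_minus, rowvar=False, bias=True)) / (n_p + n_m)
    n_fld = np.linalg.solve(Sigma_w, m_p - m_m)   # Sigma_w^{-1} (Xbar^(+1) - Xbar^(-1))
    mu_fld = (m_p + m_m) / 2                      # midpoint of the class means
    return n_fld, mu_fld

def fld_classify(x_new, n_fld, mu_fld):
    # Choose Class +1 when <x_new, n_FLD> >= <mu_FLD, n_FLD>
    return +1 if np.dot(x_new, n_fld) >= np.dot(mu_fld, n_fld) else -1
```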
Fisher Linear Discrimination
Relationship to Mahalanobis distance
For $X_1, X_2 \sim N(\mu, \Sigma)$, a natural distance measure is:
$d_M(X_1, X_2) = \left( (X_1 - X_2)^t \, \Sigma^{-1} \, (X_1 - X_2) \right)^{1/2}$
Idea:
• “unit free”, i.e. “standardized”
• essentially mod out covariance structure
• Euclidean dist. applied to $\Sigma^{-1/2} X_1$ & $\Sigma^{-1/2} X_2$
• Same as key transformation for FLD
• I.e. FLD is mean difference in Mahalanobis space
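A small sketch of $d_M$ for a given covariance $\Sigma$ (illustrative; solving a linear system avoids forming $\Sigma^{-1}$ explicitly).

```python
import numpy as np

def mahalanobis(x1, x2, Sigma):
    """d_M(x1, x2) = ((x1 - x2)^t Sigma^{-1} (x1 - x2))^{1/2}."""
    diff = np.asarray(x1) - np.asarray(x2)
    return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))
```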
Classical Discrimination
Above derivation of FLD was:
• Nonstandard
• Not in any textbooks(?)
• Nonparametric (don’t need Gaussian data)
• I.e. Used no probability distributions
• More Machine Learning than Statistics
Classical Discrimination
FLD Likelihood View
Assume: Class distributions are multivariate
$N\!\left( \mu^{(k)}, \Sigma_w \right)$ for $k = +1, -1$
• strong distributional assumption
+ common covariance
Classical Discrimination
FLD Likelihood View (cont.)
At a location $x_0$, the likelihood ratio, for
choosing between Class +1 and Class -1, is:
$LR\!\left( x_0, \mu^{(+1)}, \mu^{(-1)}, \Sigma_w \right) = \phi_{\Sigma_w}\!\left( x_0 - \mu^{(+1)} \right) \big/ \phi_{\Sigma_w}\!\left( x_0 - \mu^{(-1)} \right)$
where $\phi_{\Sigma_w}$ is the Gaussian density
with covariance $\Sigma_w$
Classical Discrimination
FLD Likelihood View (cont.)
Simplifying, using the Gaussian density:
$\phi_{\Sigma_w}(x) = \frac{1}{(2\pi)^{d/2} \left| \Sigma_w \right|^{1/2}} \, e^{-x^t \Sigma_w^{-1} x / 2}$
Gives (critically using common covariances):
$-2 \log LR\!\left( x_0, \mu^{(+1)}, \mu^{(-1)}, \Sigma_w \right) = \left( x_0 - \mu^{(+1)} \right)^t \Sigma_w^{-1} \left( x_0 - \mu^{(+1)} \right) - \left( x_0 - \mu^{(-1)} \right)^t \Sigma_w^{-1} \left( x_0 - \mu^{(-1)} \right)$
Classical Discrimination
FLD Likelihood View (cont.)
But:
$\left( x_0 - \mu^{(k)} \right)^t \Sigma_w^{-1} \left( x_0 - \mu^{(k)} \right) = x_0^t \Sigma_w^{-1} x_0 - 2 x_0^t \Sigma_w^{-1} \mu^{(k)} + \mu^{(k)\,t} \Sigma_w^{-1} \mu^{(k)}$
so:
$-2 \log LR\!\left( x_0, \mu^{(+1)}, \mu^{(-1)}, \Sigma_w \right) = -2 x_0^t \Sigma_w^{-1} \left( \mu^{(+1)} - \mu^{(-1)} \right) + \mu^{(+1)\,t} \Sigma_w^{-1} \mu^{(+1)} - \mu^{(-1)\,t} \Sigma_w^{-1} \mu^{(-1)}$
Thus $LR\!\left( x_0, \mu^{(+1)}, \mu^{(-1)}, \Sigma_w \right) \ge 1$ when
$-2 \log LR\!\left( x_0, \mu^{(+1)}, \mu^{(-1)}, \Sigma_w \right) \le 0$,
i.e.
$x_0^t \Sigma_w^{-1} \left( \mu^{(+1)} - \mu^{(-1)} \right) \ge \frac{1}{2} \left( \mu^{(+1)} + \mu^{(-1)} \right)^t \Sigma_w^{-1} \left( \mu^{(+1)} - \mu^{(-1)} \right)$
Classical Discrimination
FLD Likelihood View (cont.)
( 1)
( 1)
Replacing  ,  and  w
by maximum likelihood estimates:
(1)
(1)
w
X , X
and ̂
Gives the likelihood ratio discrimination rule:
Choose Class +1, when

 
 
1 ( 1)
( 1)
( 1)
( 1)
( 1)
( 1)
w 1
w 1
ˆ
ˆ
x  X
X
 X
X
 X
X
2
0t
Same as above, so: FLD can be viewed as
Likelihood Ratio Rule

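The plug-in likelihood ratio rule above can be coded directly, as in this illustrative sketch; by the derivation it reproduces the FLD rule sketched earlier.

```python
import numpy as np

def lr_rule_classify(x0, Xbar_p, Xbar_m, Sigma_w_hat):
    """Plug-in likelihood ratio rule: choose Class +1 when
    x0^t Sigma_w^{-1}(Xbar^(+1) - Xbar^(-1))
        >= (1/2)(Xbar^(+1) + Xbar^(-1))^t Sigma_w^{-1}(Xbar^(+1) - Xbar^(-1))."""
    w = np.linalg.solve(Sigma_w_hat, Xbar_p - Xbar_m)  # Sigma_w^{-1} (mean difference)
    threshold = 0.5 * (Xbar_p + Xbar_m) @ w
    return +1 if x0 @ w >= threshold else -1
```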
Classical Discrimination
FLD Generalization I
Gaussian Likelihood Ratio Discrimination
(a.k.a. “nonlinear discriminant analysis”)
Idea: Assume class distributions are
$N\!\left( \mu^{(k)}, \Sigma^{(k)} \right)$ (Different covariances!)
Likelihood Ratio rule is a straightf’d num’l calc.
(thus can easily implement, and do discrim’n)
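A sketch of this Gaussian likelihood ratio (quadratic) rule, assuming equal class priors and per-class MLE covariances (assumptions beyond what the slide states); it uses scipy.stats.multivariate_normal for the log-densities, and all names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def glr_classify(x0, X_plus, X_minus):
    """Gaussian likelihood ratio rule with class-specific (MLE) covariances:
    choose Class +1 when its Gaussian log-likelihood at x0 is at least as large.
    Assumes equal class priors; rows of X_plus / X_minus are observations."""
    ll_p = multivariate_normal.logpdf(
        x0, mean=X_plus.mean(axis=0),
        cov=np.cov(X_plus, rowvar=False, bias=True))
    ll_m = multivariate_normal.logpdf(
        x0, mean=X_minus.mean(axis=0),
        cov=np.cov(X_minus, rowvar=False, bias=True))
    return +1 if ll_p >= ll_m else -1
```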
Classical Discrimination
Gaussian Likelihood Ratio Discrim’n (cont.)
No longer have separ’g hyperplane repr’n
(instead regions determined by quadratics)
(fairly complicated case-wise calculations)
Graphical display: for each point, color as:
Yellow if assigned to Class +1
Cyan if assigned to Class -1
(intensity is strength of assignment)
Classical Discrimination
FLD for Tilted Point Clouds – Works well
Classical Discrimination
GLR for Tilted Point Clouds – Works well
Classical Discrimination
FLD for Donut – Poor, no plane can work
Classical Discrimination
GLR for Donut – Works well (good quadratic)
Classical Discrimination
FLD for X – Poor, no plane can work
Classical Discrimination
GLR for X – Better, but not great
Classical Discrimination
Summary of FLD vs. GLR:
• Tilted Point Clouds Data
– FLD good
– GLR good
• Donut Data
– FLD bad
– GLR good
• X Data
– FLD bad
– GLR OK, not great
Classical Conclusion: GLR generally better
(will see a different answer for HDLSS data)