Download Linear Classification Classification Representing Classes What is a

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Data assimilation wikipedia , lookup

German tank problem wikipedia , lookup

Regression analysis wikipedia , lookup

Choice modelling wikipedia , lookup

Time series wikipedia , lookup

Linear regression wikipedia , lookup

Transcript
Classification
Linear Classification
• Supervised learning framework
• Features can be anything
• Targets are discrete classes:
– Safe mushrooms vs. poisonous
– Malignant vs. benign
– Good credit risk vs. bad
Ron Parr
CPS 271
• Can we treat classes as numbers?
– Single class?
– Multi class?
With content adapted from Andrew Ng, Lise Getoor, and Tom Dietterich
Figures from textbook courtesy of Chris Bishop and © Chris Bishop
Representing Classes
• Interpret t(i) as the probability that the ith element
is in a particular class
• Classes usually disjoint
• For multiclass, t(i) is a vector
• t(i)[j]=t(i)j=1 if ith element is in class j, 0 OTW
• Notation: For convenience, we will sometimes
refer to the “raw” variables x, rather than the
features as seen through the lens of our features, φ
Decision Boundaries
• A classifier can be viewed as partitioning the input space or feature
space X into decision regions
What is a Linear Disciminant?
• Simplest kind of classifer, a linear threshold unit (LTU):
1 if w1 x1 + L + θ n wn ≥ w0
y ( x) = 
0 otherwise
•
•
•
•
We sometimes assume w0=1, so y(x)=wTx
A linear discriminant is an n-1 dimensional hyperplane
w is orthogonal to this
We’ll look at three algorithms, all of which learn linear decision
boundaries:
– Directly learn the LTU: Using Least Mean Square (LMS) algorithm
– Learn the conditional distribution: Logistic regression
– Learn the joint distribution: Linear discriminant analysis (LDA)
What can be expressed?
• Examples of things that can be expressed
(Assume n boolean (0/1 features)
– Conjunctions:
0
x2
0
0
• x1^x3^x4 : 1⋅x1 + 0⋅x2 +1⋅x3 + 1⋅x4 ≥ 3
• x1^¬x3^x4: 1⋅x1 + 0⋅x2 +-1⋅x3 + 1⋅x4 ≥ 2
0
– at-least-m-of-n
0
0
• at-least-2-of(x1,x2,x4)
• 1⋅x1 + 1⋅x2 + 0⋅x3 + 1⋅x4 ≥ 2
1
1
0
• Examples of things that cannot be expressed:
1
1
– Non-trivial disjunctions:
1
• (x1^x3) + (x3^x4)
x1
• A linear threshold unit always produces a linear decision boundary. A
set of points that can be separated by a linear decision boundary is
linearly separable.
– Exclusive-Or
• (x1^¬x2) + (¬x1^x2)
Multiclass
Non-linearly separable example
• k classes
• O(k2) one vs. one classifiers
x2
– Expensive
– May not be consistent
0
• k-1 one vs. rest classifiers
1
1
– Less expensive
– Still may not be consistent
0
• K linear functions
x1
– Assign x to class j if wjTx> wiTx for all i
– Gives convex, singly connected decision regions
– How to pick the linear functions?
Why not use regression?
The “Neural” Story (Part I)
• Regression minimizes sum of squared
errors on target function
• Gives strong influence to outliers
• Nice to justify machine learning w/nature
• Naïve introspection works badly
• Neural model biologically plausible
• Single neuron, linear threshold unit = perceptron
• (Longer rant on this later…)
Perceptron
xj
Perceptron Learning
wj
node/
neuron
Y
f
f is a simple step function (sgn)
•
•
•
•
•
•
We are given a set of inputs x(1)…x(n)
t(1)…t(n) is a set of target outputs (boolean) {-1,1}
w is our set of weights
output of perceptron = wTx
Perceptron_error(x(i), w) = -net(x(i),w) t(i)
Goal: Pick w to optimize:
min
w
∑ perceptron _ error(x
(i )
, w)
i∈misclassif ied
Perceptron Learning Properties
(LTU Properties)
• Good news:
– If there exists a set of weights that will correctly
classify every example, the perceptron learning rule
will find it
– Does not depend on step size
(i )
∀ : w j ← w j + αx j t ( i )
∀
i∈misclassified j
• Bad news:
– Perceptrons can represent only a small class of
functions, “linearly separable,” functions
– May oscillate if not separable
– No obvious generalization for multiclass
!
""
! ## #""$" %&#!
Why this form?
Logistic Regression
• One reason: transforms a linear function in the range (-∞,+∞) to be
positive and sum to 1 so that it can represent a probability
• In logistic regression, we learn the conditional distribution P(t|x)
• Let pt(x; w) be our estimate of P(t|x), where w is a vector of
adjustable parameters.
• Assume there are two classes, t = 0 and t = 1 and
p1 (x; w ) =
ew
T
x
ew
T
1 + ew x
p0 (x; w ) = 1 − p1 (x; w )
T
x
1 + ew
T
x
1.0
• This is equivalent to
log
p1 (x; w )
= wT x
p0 (x; w )
0.0
-10.0
0.0
10.0
• IOW, the log odds of class 1 is a linear function of x
Log Likelihood for Conditional
Probability Estimators
Constructing a Learning Algorithm
•
Find the probability distribution h that is most likely, given the data.
P( X | hw ) P (hw )
arg max P (hw | X ) = arg max
P( X )
hw
hw
= arg max P( X | hw ) P (hw )
•
•
– if y(i) = 0, the log likelihood is log(1-p1(x; w))
– if y(i) = 1, the log likelihood is log p1(x; θ)
by Bayes’ Rule
because P(X) doesn’t depend on h
hw
= arg max P( X | hw )
if we assume P(h) is uniform
We can express the log likelihood in a compact from called the cross-entropy
Take an example (x(i),t(i))
•
These two are mutually exclusive, so we can combine them to get:
[
]
l(w; x ( i ) , t ) = log P (t ( i ) | x (i ) , w ) = (1 − t ( i ) ) log 1 − p1 ( x ( i ) ; w ) + t ( i ) log p1 ( x ( i ) ; w )
hw
= arg max log P( X | hw )
because log is monotone
•
The goal of our learning algorithm will be to find w to maximize:
hw
•
•
•
The likelihood function views P(X|hw) as a function of the parameters in the
model. In this case, our parameters are the weights, w.
L(w;X) = P(X|hw)
The log likelihood is a commonly used objective function for learning
algorithms. It is denoted l(w;X)
The w that maximizes the likelihood of the training data is called the
maximum likelihood estimator
J (w ) = l( w; X, t )
Gradient cont.
Computing the Gradient
•
∂J (w )
∂
=∑
l(θ ; t ( i ) , x ( i ) )
∂w j
i ∂w j
p1 ( x (i ) ; w ) =
∂
∂
l ( w; t ( i ) ; x ( i ) ) =
((1 − t (i ) ) log 1 − p1 (x ( i ) ; w ) + t ( i ) log p1 (x (i ) ; w ))
∂w j
∂w j
[
]
 ∂p (x (i ) ; w )  ( i )
+t
= (1 − t i ) 1− p ( x1( i ) ;w )  − 1


1
∂w j


[
=
=
t(i)
p1 ( x ( i ) ; w )
[
=
 ∂p1 (x (i ) ; w ) 
1


p1 ( x ( i ) ;w ) 

∂w j


]
p1 ( x ( i ) ;w )(1− p1 ( x ( i ) ;w ))
]
T (i )
x
T (i)
∂
1
∂p1 (x ( i ) ; w )
(1 + e − w x )
=
T (i)
(1 + e − w x ) 2 ∂w j
∂w j
1
(1 + e − w
T (i )
x
)2
1
(1 + e − w
e −w
T (i )
x
∂
(− w T x (i ) )
∂w j
T i
T (i)
x
)2
e −w x (− x (i) j )
= p1 (x ( i ) ; w )(1 − p1 (x ( i ) ; w )) x (i ) j
] ∂p (∂xw ; w) 
(i)
t ( i ) − p1 ( x ( i ) ;w )
1
p1 ( x ( i ) ; w )(1− p1 ( x ( i ) ;w ))


j
Summary of Logistic Regression
The gradient of the loglikelihood for a single point is:
∂
l ( w; x ( i ) , t ( i ) )
∂w j
[
=[
=
t ( i ) − p1 ( x ( i ) ; w )
] ∂p (∂xw ; w) 


]p (x ; w)(1 − p (x
t ( i ) − p1 ( x ( i ) ; w )
• Learns the Conditional Probability Distribution P(t|x)
• No closed form solution
• Very simple expression for gradient
(i )
1
p1 ( x ( i ) ; w )(1− p1 ( x ( i ) ; w ))
p1 ( x ( i ) ; w )(1− p1 ( x ( i ) ; w ))
j
(i )
1
(i )
1
; w )) x ( i ) j
= (t ( i ) − p1 ( x (i ) ; w )) x (i ) j
•
1
1 + e −w
So we get:
=
 ∂p1 (x (i ) ; w ) 




∂w j


Gradient cont.
•
•
=
 ∂p ( x ( i ) ; w ) 
( i)

− 1− p(1−( xt ( i ) ); w )  1


1
∂w j


t ( i ) (1− p1 ( x ( i ) ;w )) −(1− t ( i ) ) p1 ( x ( i ) ;w )
[
Another way of writing the logistic regression function is:
The overall gradient is:
∂J (w )
= ∑ (t ( i ) − p1 (x i ; w )) x (i ) j
∂w j
i
Compare w/percepton rule!
What We Already Know
• Linear Threshold Unit (LTU)
– Tries to discover a linear function (in feature
space) that separates positive and negative
examples
• Logistic Regression
– Solve by local search:
• Begins with initial weight vector.
• Gradient ascent to maximize objective function.
• Objective function is the log likelihood of the data
• Algorithm seeks the probability distribution P(t|x) that is most
likely given the data.
• May be done online or in batch
• Can be used with acceleration methods (NewtonRaphson, etc.)
Density Estimation
• Basic unsupervised learning technique
• Discussed here in context of classification
• Idea: Estimate joint probability of features and
class labels
– Uses regression to estimate the function
log
p1 (x; w )
= wT x
p0 (x; w )
Discrete Case
• Suppose we know P(X1…XnT)
• How do we get this?
• Maximum likelihood estimate comes from
counting (relative frequency)
• Bernouli distribution
Betting on y
• Assuming:
– Binary loss function
– Choices: t0, t1
• Favor t0 when P(t0|x1…xn) > P(t1|x1…xn)
• Use definition of conditional probability:
• We see a new x1…xn
• What is our guess for t?
P (t 0 | x1...xn ) =
Naïve Bayes in Action
So, are we done???
• How many parameters needed for joint?
• Is this practical?
• Simplification (Naïve Bayes):
•
•
•
•
Spam filtering
X1…Xn: Spam related features
t: Spam label
Combine Bayes Rule w/Naïve Bayes:
P(t | X 1...X n ) =
P( X 1... X n | t ) = ∏ P( X i | t )
i
Q: How is this more practical?
Is Naïve Bayes Reasonable?
• Are features correlated within classes?
• How would it hurt us if they were?
P (t 0 x1...xn )
P ( x1...xn )
∏ P(X
=
i
P ( X 1...X n | t ) P (t )
P( X 1... X n )
| t ) P(t )
i
P ( X 1...X n )
Things to note:
Do we worry about P(X1…Xn)?
Influence of P(t)?
Linear Discriminant Analysis
• In LDA, we learn the distribution P(x|t)
• We assume that x is continuous
• We assume P(x|t) is distributed according to a
multivariate normal distribution and P(t) is a discrete
distribution
• More on this when we discuss Bayesian
networks
Estimating the MVG parameters
• Also called Gaussian Discriminant Analysis
• Here
• Given a set of data points {x1,…, xN}, the maximum likelihood
estimates for the parameters of the MVG are:
µˆ =
1
N
∑x
Putting it all together in LDA
– t ∼ Bernoulli(w)
– x|t=0 ∼ Ν(µ1, Σ)
– x|t=1 ∼ Ν(µ2, Σ)
• Writing this out, we get:
(i )
i
p ( x | t = 0) =
1
Σˆ =
N
∑ (x
(i)
− µˆ )( x (i) − µˆ )T
p( x | t = 1) =
i
Picking A Class
1
(2π )n / 2 Σ 1 / 2
 1

T
exp − ( x − µ1 ) Σ −1 (x − µ1 )
 2

• Recall we assumed Σ same for all classes
• When is P(y0|x)>P(y1|x)???
Prior class
probability
MVG conditional
feature probability
1
( 2π )n / 2 Σ 1 / 2
P( X | t ) P (t )
P( X )
1
( 2π )n / 2 Σ 1 / 2
Prior feature
probability (ignored)
Posterior
label probability
 1

T
exp − (x − µ 0 ) Σ −1 ( x − µ0 )
 2

The Beauty of Homoscedasticity
• We again use Bayes rule:
P (t | X ) =
1
( 2π )n / 2 Σ 1 / 2
Example
 1

T
exp − ( x − µ 0 ) Σ −1 (x − µ 0 ) p ( y 0) >
 2


 1
T
exp − ( x − µ1 ) Σ −1 ( x − µ1 ) p ( y1)

 2
(x − µ0 )T Σ −1 (x − µ0 ) > (x − µ1 )T Σ −1 (x − µ1 ) + k
Linear!!!
Homoscedastic LDA Discussion
1
0
• For multiclass, this gives convex decision
boundaries
• Satisfies desiderata for multiclass decision
boundaries
−1
−2
−3
−4
−5
−6
−7
−2
−1
0
1
2
3
4
5
6
7
• How realistic is this?
• What do we give up?
The decision boundary is at p(y=1|x) = 0.5
Heteroscedastic Distributions
Comparing LTU, LR, LDA
• Big debate about the relative merits of
– direct classifiers (like LTU) versus
– conditional models (like LR) versus
– generative models (like LDA)
'())*+,-. *-,/01+ 23()) 41,01)5 ,- 67,) 89(+438:
•
LDA vs LR
Issues
What is the relationship?
– In LDA, it turns out the p(t|x) can be expressed as a logistic function
where the weights are some function of µ1, µ2, and Σ!
– But, the converse is NOT true. If p(t|x) is a logistic function, that does
not imply p(x|t) is MVG
• Statistical efficiency: if the generative model is correct, then it usually
gives better accuracy, especially for small training sets.
• Computational efficiency: generative models typically are the easiest
to compute. In LDA, we estimated the parameters directly, no need
for gradient ascent
• Robustness to changing loss function: Both generative and
conditional models allow the loss function to change without reestimating the model. This is not true for direct LTU methods
• Robustness to model assumptions: The generative model usually
performs poorly when the assumptions are violated.
• Robustness to missing values and noise: In many applications, some
of the features x(i)j may be missing or corrupted for some training
examples. Generative models provide better ways of handling this
than non-generative models.
• LDA makes stronger modeling assumptions than LR
– when these modeling assumptions are correct, LDA will perform better
• LDA is asymptotically efficient: in the limit of very large training sets, there is
no algorithm that is strictly better than LDA
– however, when these assumptions are incorrect, LR is more robust
• weaker assumptions, more robust to deviations from modeling assumptions
• if the data are non-Gaussian, then in the limit, logistic outperforms LDA
• For this reason, LR is a more commonly used algorithm