Linear Classifiers
Pattern Recognition and
Image Analysis
Dr. Manal Helal – Fall 2014
Lecture 7
Linear Models
• Linear models
• Perceptron
• Naïve Bayes
• Logistic regression
Linear Models for Classification
• Linear models for classification separate input vectors into classes using linear (hyperplane) decision boundaries.
• Example:
  • 2D input vector x
  • Two discrete classes C1 and C2
[Figure: scatter plot of the two classes in the (x1, x2) plane, separated by a linear decision boundary.]
An Example
• Positive examples are blank, negative are filled.
• Which line describes the decision boundary?
• Higher dimensions?
• Think of training examples as points in d-dimensional space; each dimension corresponds to one feature.
• A linear binary classifier defines a plane in the space which separates positive from negative examples.
Linear Decision Boundary
• A hyper-plane is a generalization of a straight line to more than 2 dimensions.
• A hyper-plane contains all the points in a d-dimensional space satisfying the equation:
  w1 x1 + w2 x2 + ... + wd xd + w0 = 0
• Each coefficient wi can be thought of as a weight on the corresponding feature.
• The vector containing all the weights, w = (w0, ..., wd), is the parameter vector or weight vector.
Normal Vector
• Geometrically, the weight vector w is a normal vector of the separating hyper-plane.
• A normal vector of a surface is any vector which is perpendicular to it.
Hyper-plane as a Classifier
• Let g(x) = w1 x1 + w2 x2 + ... + wd xd + w0
• Then assign x to the positive class if g(x) ≥ 0, and to the negative class otherwise.
Bias
• The slope of the hyper-plane is determined by w1...wd; its location (intercept) is determined by the bias w0.
• Include the bias in the weight vector and add a dummy component to the feature vector, setting this component to x0 = 1. Then:
  g(x) = w · x
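As a minimal sketch of this augmentation in Python/NumPy (the names and toy data here are illustrative, not from the slides):

import numpy as np

def augment(X):
    # Prepend a dummy x0 = 1 component to every feature vector,
    # so the bias w0 becomes just another entry of the weight vector.
    return np.hstack([np.ones((X.shape[0], 1)), X])

X = np.array([[2.0, -1.0], [0.5, 3.0]])  # two 2-D examples
w = np.array([0.1, 0.7, -0.4])           # (w0, w1, w2)
g = augment(X) @ w                       # g(x) = w . x for each example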
Separating Hyper-planes in 2 Dimensions: Two-Class Discriminant Function
• y(x) = wᵀx + w0
• y(x) ≥ 0 → x assigned to C1
• y(x) < 0 → x assigned to C2
• Thus y(x) = 0 defines the decision boundary.
[Figure: the regions R1 (y > 0) and R2 (y < 0) in the (x1, x2) plane; w is normal to the decision boundary, the signed distance of a point x from the boundary is y(x)/‖w‖, and the distance of the boundary from the origin is −w0/‖w‖.]
Learning
• The goal of the learning process is to come up with a "good" weight vector w.
• The learning process will use examples to guide the search for a "good" w.
• Different notions of "goodness" exist, which yield different learning algorithms.
• We will describe some of these algorithms in the following slides.
Perceptron
Perceptron Training
• How do we find a set of weights that separate our classes?
• Perceptron: a simple mistake-driven online algorithm.
  1. Start with a zero weight vector and process each training example in turn.
  2. If the current weight vector classifies the current example incorrectly, move the weight vector in the right direction.
  3. If the weights stop changing, stop.
• If the examples are linearly separable, this algorithm is guaranteed to converge to a solution vector.
Fixed Increment Online Perceptron Algorithm
• Binary classification, with classes +1 and −1. Decision function: y′ = sign(w · x)

Perceptron(x1:N, y1:N, I):
1: w ← 0
2: for i = 1...I do
3:   for n = 1...N do
4:     if y(n)(w · x(n)) ≤ 0 then
5:       w ← w + y(n) x(n)
6: return w
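The same algorithm as a runnable Python/NumPy sketch (the helper name and the toy data are illustrative):

import numpy as np

def perceptron(X, y, epochs=10):
    # X: (N, d) feature matrix, with a dummy x0 = 1 column for the bias.
    # y: (N,) labels in {+1, -1}.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n) <= 0:   # mistake (or exactly on the boundary)
                w += y_n * x_n         # move w in the right direction
    return w

# Toy linearly separable data:
X = np.array([[1, 2.0, 2.0], [1, 1.0, 3.0], [1, -1.0, -1.0], [1, -2.0, -1.5]])
y = np.array([1, 1, -1, -1])
print(np.sign(X @ perceptron(X, y)))   # reproduces y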
Or More Explicitly

1: w ← 0
2: for i = 1...I do
3:   for n = 1...N do
4:     if y(n) = sign(w · x(n)) then
5:       pass
6:     else if y(n) = +1 ∧ sign(w · x(n)) = −1 then
7:       w ← w + x(n)
8:     else if y(n) = −1 ∧ sign(w · x(n)) = +1 then
9:       w ← w − x(n)
10: return w

Tracing an example for a NAND function is found in:
http://en.wikipedia.org/wiki/Perceptron
Weight Averaging
• Although the algorithm is guaranteed to converge, the solution is not unique!
• It is sensitive to the order in which examples are processed.
• Separating the training sample does not guarantee good accuracy on unseen data.
• Empirically, weight averaging gives better generalization performance.
  • A method of avoiding overfitting.
  • As the final weight vector, use the mean of all the weight vector values at each step of the algorithm.
• (cf. regularization in a following session)
Efficient Averaged Perceptron Algorithm

Perceptron(x1:N, y1:N, I):
1: w ← 0; wa ← 0
2: b ← 0; ba ← 0
3: c ← 1
4: for i = 1...I do
5:   for n = 1...N do
6:     if y(n)(w · x(n) + b) ≤ 0 then
7:       w ← w + y(n) x(n); b ← b + y(n)
8:       wa ← wa + c y(n) x(n); ba ← ba + c y(n)
9:     c ← c + 1
10: return (w − wa/c, b − ba/c)
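A direct Python transcription of this pseudocode (a sketch; the function name is mine):

import numpy as np

def averaged_perceptron(X, y, epochs=10):
    # Accumulate c-weighted updates in (wa, ba) so the average over all
    # steps can be recovered at the end without storing every w.
    d = X.shape[1]
    w, wa = np.zeros(d), np.zeros(d)
    b, ba, c = 0.0, 0.0, 1.0
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n + b) <= 0:
                w += y_n * x_n
                b += y_n
                wa += c * y_n * x_n
                ba += c * y_n
            c += 1
    return w - wa / c, b - ba / c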
Naïve Bayes
Probabilistic Model
• Instead of thinking in terms of multidimensional space...
• Classification can be approached as a probability estimation problem.
• We will try to find a probability distribution which
  • describes our training data well, and
  • allows us to make accurate predictions.
• We'll look at Naive Bayes as the simplest example of a probabilistic classifier.
Representation of Examples
• We are trying to classify documents. Let's represent a document as the sequence of terms (words) it contains: t = (t1...tn).
• For (binary) classification we want to find the most probable class:
  ŷ = argmax_{y ∈ {−1,+1}} P(Y = y | t)
• But documents are close to unique: we cannot reliably condition Y | t.
• Bayes' rule to the rescue.
Bayes' Rule
• Bayes' rule determines how joint and conditional probabilities are related:
  P(Y = y | t) = P(t | Y = y) P(Y = y) / P(t)
Prior and Likelihood
• With Bayes' rule we can invert the direction of conditioning:
  ŷ = argmax_{y} P(t | Y = y) P(Y = y)
• This decomposes the task into estimating the prior P(Y) (easy) and the likelihood P(t | Y = y).
Conditional Independence
• How do we estimate P(t | Y = y)?
• Naively assume that the occurrence of each word in the document is independent of the others when conditioned on the class:
  P(t | Y = y) = ∏_{i=1}^{n} P(ti | Y = y)
Naive Bayes
• Putting it all together:
  ŷ = argmax_{y} P(Y = y) ∏_{i=1}^{n} P(ti | Y = y)
Decision Function
• For binary classification, assign class +1 iff:
  g(t) = [ P(Y = +1) ∏_i P(ti | Y = +1) ] / [ P(Y = −1) ∏_i P(ti | Y = −1) ] ≥ 1
Documents in Vector Notation
• Let's represent documents as vocabulary-size-dimensional count vectors.
• Dimension i indicates how many times the ith vocabulary item appears in document x.
Naive Bayes in Vector Notation
• The counts appear as exponents:
  g(x) = [ P(Y = +1) / P(Y = −1) ] ∏_{i=1}^{V} [ P(i | Y = +1) / P(i | Y = −1) ]^{xi}
• If we take the logarithm of the threshold (ln 1 = 0) and of g, we get the same decision function.
Linear Classifier
• Remember the linear classifier? Taking logs turns Naive Bayes into one:
  ln g(x) = ln [ P(Y = +1) / P(Y = −1) ] + Σ_i xi ln [ P(i | Y = +1) / P(i | Y = −1) ] = w · x + w0
• The log prior ratio corresponds to the bias term.
• The log likelihood ratios correspond to the feature weights.
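A sketch of this correspondence in Python/NumPy, assuming a multinomial model with add-alpha smoothing (the function name and the smoothing choice are mine; the slides only mention "ad-hoc smoothing"):

import numpy as np

def nb_linear_weights(X, y, alpha=1.0):
    # X: (N, V) word-count matrix; y: (N,) labels in {+1, -1}.
    pos, neg = X[y == 1], X[y == -1]
    V = X.shape[1]
    p_pos = (pos.sum(axis=0) + alpha) / (pos.sum() + alpha * V)
    p_neg = (neg.sum(axis=0) + alpha) / (neg.sum() + alpha * V)
    w = np.log(p_pos) - np.log(p_neg)                # log likelihood ratios
    w0 = np.log((y == 1).mean() / (y == -1).mean())  # log prior ratio
    return w, w0

# Decide with the linear rule: sign(X @ w + w0).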
What Is the Difference?
• Training criterion and procedure.
• Perceptron:
  • zero-one loss function;
  • error-driven algorithm.
• Naive Bayes:
  • Maximum Likelihood criterion;
  • find the parameters which maximize the log likelihood;
  • the parameters reduce to relative counts, plus ad-hoc smoothing;
  • alternatives exist (e.g. maximum a posteriori).
Comparison
Logistic Regression
Probabilistic Conditional Model
• Let's try to come up with a probabilistic model which has some of the advantages of the perceptron.
• Model P(y|x) directly, not via P(x|y) and Bayes' rule as in Naïve Bayes.
• This avoids the issue of dependencies between the features of x.
• We'll take linear regression as a starting point.
  • The goal is to adapt regression to model the class-conditional probability.
Linear Regression
• Training data: observations paired with outcomes (y ∈ ℝ).
• Observations have features (predictors, typically also real numbers).
• The model is a regression line y = ax + b which best fits the observations:
  • a is the slope;
  • b is the intercept;
  • this model has two parameters (or weights) and one feature, x.
• Example:
  • x = number of vague adjectives in property descriptions;
  • y = amount the house sold for over the asking price.
Multiple Linear Regression
• More generally:
  y = w0 + Σ_{j=1}^{d} wj xj
  where y is the outcome, w0 the intercept, x1..xd the feature vector and w1..wd the weight vector.
• Get rid of the bias by adding a dummy feature x0 = 1, as before: g(x) = w · x
• Linear regression uses g(x) directly.
• A linear classifier uses sign(g(x)).
Learning Linear Regression
• Minimize the sum of squared errors over the N training examples:
  E(w) = Σ_{n=1}^{N} (y(n) − w · x(n))²
• Closed-form formula for choosing the best weights w:
  w = (XᵀX)⁻¹ Xᵀ y
  where the matrix X contains the training example features, and y is the vector of outcomes.
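Checking the closed form in NumPy (illustrative data; in practice a least-squares solver is preferred to an explicit inverse):

import numpy as np

X = np.array([[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0]])  # x0 = 1 bias column
y = np.array([0.1, 0.9, 2.1, 2.9])

w_normal = np.linalg.solve(X.T @ X, X.T @ y)    # w = (X^T X)^{-1} X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None) # least-squares solver
print(w_normal, w_lstsq)                        # the two solutions agree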
Logistic Regression
• In logistic regression we use the linear model to assign probabilities to class labels.
• For binary classification, predict P(Y = 1|x). But the predictions of the linear regression model are in ℝ, whereas P(Y = 1|x) ∈ [0, 1].
• Instead, predict the logit function of the probability:
  ln [ P(Y = 1|x) / (1 − P(Y = 1|x)) ] = w · x
• Solving for P(Y = 1|x) we obtain:
  P(Y = 1|x) = 1 / (1 + exp(−w · x))
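The two formulas in Python (a direct transcription; "sigmoid" is the standard name for the inverse logit):

import numpy as np

def sigmoid(z):
    # Maps a real-valued score w . x to a probability in [0, 1].
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    return sigmoid(w @ x)   # P(Y = 1 | x)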
Logistic Regression – Classification
• Example x belongs to class 1 if:
  P(Y = 1|x) ≥ 0.5, which holds exactly when w · x ≥ 0
• The equation w · x = 0 defines a hyper-plane, with the points above it belonging to class 1.
Multinomial Logistic Regression
• Logistic regression generalized to more than two classes:
  P(Y = y | x, W) = exp(w_y · x) / Σ_{y′} exp(w_{y′} · x)
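This normalization is the softmax; a small sketch (the stabilizing shift by the maximum is a standard numerical precaution, not from the slides):

import numpy as np

def softmax_proba(W, x):
    # W: (K, d) weight matrix, one row w_y per class; x: (d,) features.
    z = W @ x
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()       # P(Y = y | x, W) for every class y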
Learning Parameters
• Conditional likelihood estimation: choose the weights which make the probability of the observed values y(n) the highest, given the observations x(n).
• For a training set with N examples:
  Ŵ = argmax_W ∏_{n=1}^{N} P(y(n) | x(n), W)
Error Function
• Equivalently, we seek the value of the parameters which minimizes the error function:
  E(W) = − Σ_{n=1}^{N} ln P(y(n) | x(n), W)
A Problem in Convex Optimization
• L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno method)
• Gradient descent
• Conjugate gradient
• Iterative scaling algorithms
Gradient Descent Learning Rule
• Weight update rule:
  wj ← wj + η ( t(i) − s[f(i)] ) s[f(i)] xj(i)
• Can be rewritten as:
  wj ← wj + η · error · c · xj(i)
The Basic Idea
• A gradient is the slope of a function; that is, a set of partial derivatives, one for each dimension (parameter).
• By following the gradient of a convex function downhill we can descend to the bottom (minimum).
What Is the Gradient?
• The partial derivatives of E(w0, w1, w2), e.g. ∂E(w0, w1, w2)/∂w1.
• The gradient is defined as the vector of partial derivatives:
  D(w) = [ ∂E(w)/∂w0, ∂E(w)/∂w1, ∂E(w)/∂w2 ]
  i.e. the gradient of E at w, a vector of derivatives (defined here in 3 dimensions).
• E(w) and D(w) can be evaluated at any particular point w.
• The components of the gradient D(w) tell us how fast E(w) is changing in each direction.
• Interpreted as a vector, D(w) is the direction of steepest increase, so −D(w) is the direction of steepest decrease.
Gradient Descent Rule in Multiple Dimensions
• Gradient Descent Rule:
  w_new = w_old − η D(w)
  where D(w) is the gradient and η is the learning rate (small, positive).
• Notes:
  1. This moves us downhill in the direction −D(w) (the steepest downhill direction).
  2. How far we go is determined by the value of η.
  3. The perceptron learning rule is a special case of this general method.
Illustration of Gradient Descent
[Figures: the error surface E(w) over the weights (w0, w1); the direction of steepest descent is the direction of the negative gradient, and each step moves from the original point in weight space to a new point further downhill.]
Comments on the Gradient Descent Algorithm
• Equivalent to a hill-climbing heuristic search.
• Works on any objective function E(w)
  • as long as we can evaluate the gradient D(w);
  • this can be very useful for minimizing complex functions E.
• Local minima:
  • E can have multiple local minima (note: for the perceptron, E(w) has only a single global minimum, so this is not a problem);
  • gradient descent goes to the closest local minimum;
  • solution: random restarts from multiple places in weight space.
General Gradient Descent Algorithm
• Define an objective function E(w) (the function to be minimized).
• We want to find the vector of values w that minimizes E(w).
• Algorithm:
  • pick an initial set of weights w, e.g. randomly;
  • evaluate D(w) at w (note: this can be done numerically or in closed form);
  • update all the weights: w_new = w_old − η D(w);
  • check whether D(w) is approximately 0:
    • if so, we have converged to a "flat minimum";
    • if not, we move again in weight space.
• For perceptron learning, D(w) is ( t(i) − s[f(i)] ) s[f(i)] xj(i).
Minimization of Mean Squared Error, E(w)
[Figure: E(w) plotted against w1, marking the minimum of E(w) and the derivative dE(w)/dw1 at the current point.]
Moving downhill means moving in the direction of the negative derivative:
• If dE(w)/dw1 > 0, the rule w1 ← w1 − η dE(w)/dw1 decreases w1, decreasing E(w).
• If dE(w)/dw1 < 0, the same rule increases w1, again decreasing E(w).
Gradient Descent Example
• Find argmin_θ f(θ) where f(θ) = θ².
• Initial value: θ(1) = −1.
• Gradient function: ∇f(θ) = 2θ.
• Update: θ(n+1) = θ(n) − η ∇f(θ(n)).
• The learning rate η (= 0.2) controls the speed of the descent.
• After the first iteration: θ(2) = −1 − 0.2 · (−2) = −0.6.

Five Iterations of Gradient Descent
[Figure: f(θ) = θ² with five iterates stepping toward the minimum at θ = 0.]
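The five iterations can be traced with a few lines of Python (a direct transcription of the update rule above):

def gradient_descent(theta=-1.0, eta=0.2, iterations=5):
    # Minimize f(theta) = theta**2 by repeated gradient steps.
    for n in range(iterations):
        grad = 2 * theta                # gradient of f at theta
        theta = theta - eta * grad      # update rule
        print(f"theta({n + 2}) = {theta}")
    return theta

gradient_descent()   # -0.6, -0.36, -0.216, -0.1296, -0.07776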
Stochastic Gradient Descent (SGD)
• We could compute the gradient of the error for the full dataset before each update. Instead:
  • compute the gradient of the error for a single example;
  • update the weights;
  • move on to the next example.
• On average, we'll move in the right direction.
• An efficient, online algorithm.
• However, it is stochastic: it uses random samples, which helps accommodate lots of local maxima/minima.
Error Gradient
• The gradient of the error function is the set of partial derivatives of the error function with respect to the parameters w_yi.
Single Training Example
• The error contributed by a single training example is its negative log-probability:
  E(n)(W) = − ln P(y(n) | x(n), W)

Update
• Stochastic gradient update step:
  W ← W − η ∇E(n)(W)
Update: Explicit
• For the correct class (y = y(n)):
  w_y ← w_y + η ( 1 − P(Y = y | x(n), W) ) x(n)
  where 1 − P(Y = y | x(n), W) is the residual.
• For all other classes (y ≠ y(n)):
  w_y ← w_y − η P(Y = y | x(n), W) x(n)
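Both cases collapse into one line of code if we write the residual as (indicator − probability); a sketch for the multinomial model (the function name is mine):

import numpy as np

def sgd_step(W, x, y_true, eta=0.1):
    # W: (K, d) float weights, one row per class; x: (d,) features;
    # y_true: index of the correct class.
    z = W @ x
    p = np.exp(z - z.max())
    p = p / p.sum()                        # P(Y = y | x, W) for every class
    for y in range(W.shape[0]):
        residual = (1.0 if y == y_true else 0.0) - p[y]
        W[y] += eta * residual * x         # matches both explicit updates above
    return W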
Logistic Regression SGD vs Perceptron
• Very similar update!
• The perceptron is simply an instantiation of SGD for a particular error function.
• The perceptron criterion: zero error for a correctly classified example; −y(n) w · x(n) for a misclassified example.
Comparison
Exercise
• Do Examples 2.2.1:2 and 2.3.1, not on randomly generated data as shown, but on your project's dataset.