* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Lecture7 linear File - Dr. Manal Helal Moodle Site
Linear least squares (mathematics) wikipedia , lookup
System of linear equations wikipedia , lookup
Euclidean vector wikipedia , lookup
Four-vector wikipedia , lookup
Laplace–Runge–Lenz vector wikipedia , lookup
Covariance and contravariance of vectors wikipedia , lookup
Vector space wikipedia , lookup
Matrix calculus wikipedia , lookup
Coefficient of determination wikipedia , lookup
Linear Classifiers
Pattern Recognition and
Image Analysis
Dr. Manal Helal – Fall 2014
Lecture 7
2
Linear Models
Linear models
Perceptron
Naïve Bayes
Logistic regression
3
Linear
Models for Classification
Linear
Models
Probability & Bayesian Inference
6
!
Linear models for classification separate input vectors into
Linear models for classification separate input vectors into
classes using linear (hyperplane) decision boundaries.
classes using linear (hyperplane) decision boundaries.
!
Example:
Example:
2D Input vector x
2D Input vector x
Two discrete classes C1 and C2
Two discrete classes C1 and C2
4
2
0
x2 −2
−4
−6
−8
−4
−2
0
2
4
x1
CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
6
8
4
An Example
Positive examples are blank, negative are filled
Which line describe the decision
boundary?
Higher dimensions?!
Think of training examples as
points in d-dimensional space.
Each dimension corresponds to one
feature.
A linear binary classifier defines a
plane in the space which separates
positive from negative examples.
5
Linear Decision Boundary
A hyper-plane is a generalization of a straight line to > 2
dimensions
A hyper-plane contains all the points in a d dimensional
space satisfying the following equation:
w1x1 +w2x2,...,+wdxd +w0 =0
Each coefficient wi can be thought of as a weight on the
corresponding feature
The vector containing all the weights
w = (w0, . . . , wd) is the parameter vector or weight vector
6
Normal Vector
Geometrically, the weight vector w is a normal vector of the
separating hyper-plane
A normal vector of a surface is any vector which is
perpendicular to it
7
Hyper-plane as a classifier
Let
g(x)=w1x1 +w2x2,...,+wdxd +w0
Then
8
Bias
The slope of the hyper-plane is determined by w1...wd. The
location (intercept) is determined by bias w0
Include bias in the weight vector and add a dummy
component to the feature vector
Set this component to x0 = 1 Then
g(x) = w . x
9
Separating hyper-planes in 2
dimensions
o Class Discriminant Function
t
10
Probability & Bayesian Inference
x2
y> 0
y= 0
y< 0
R1
R2
+y(x)
w 0 = wt x +w0
y(x) ≥ 0 → x assigned to C1
< 0 → x assigned
to C2
0 ®y(x)assigned
to C
x
w
y(x)
⇤w ⇤
1
0®
assigned to C2
x⇥
x1
) = 0 defines the decision boundary
Thus y(x) = 0 defines the decision
boundary
− w0
⇤w ⇤
11
Learning
The goal of the learning process is to come up with a “good”
weight vector w
The learning process will use examples to guide the search
of a “good” w
Different notions of “goodness” exist, which yield different
learning algorithms
We will describe some of these algorithms in the following
slides
Perceptron
12
13
Perceptron Training
How do we find a set of weights that separate our classes
Perceptron: A simple mistake-driven online algorithm
1.
Start with a zero weight vector and process each training
example in turn.
2.
If the current weight vector classifies the current
example incorrectly, move the weight vector in the right
direction.
3.
If weights stop changing, stop
If examples are linearly separable, then this algorithm is
guaranteed to converge to the solution vector
14
Fixed increment online perceptron
algorithm
Binary classification, with classes +1 and −1 Decision function
y′ = sign(w · x)
Perceptron(x1:N , y1:N , I):
1:w←0
2: for i=1...I do
3:
for n = 1...N do
4:
5:
6: return w
if y(n)(w · x(n)) ≤ 0 then
w ← (w · x(n)) ≤ 0 then
15
Or more explicitly
1:
w←0
2:
for i=1...I do
3:
4:
5:
6:
7:
8:
9:
10: return w
Tracing an example for a NAND function is found in:
http://en.wikipedia.org/wiki/Perceptron
for n = 1...N do
if y(n) = sign(w · x(n)) then
pass
elseif y(n)=+1∧sign(w·x(n))=−1 then
w ← w + x(n)
elseif y(n)=−1∧sign(w·x(n))=+1 then
w ← w − x(n)
16
Weight averaging
Although the algorithm is guaranteed to converge, the solution
is not unique!
Sensitive to the order in which examples are processed
Separating the training sample does not equal good accuracy
on unseen data
Empirically, better generalization performance with weight
averaging
A method of avoiding overfitting
As final weight vector, use the mean of all the weight vector values
for each step of the algorithm
(cf. regularization in a following session)
Efficient averaged perceptron
algorithm
Perceptron(x1:N , y1:N , I):
1: w←0; wa←0
2: b←0;ba←0
3: c←1
4: for i=1...I do
5:
6:
for n = 1...N do
if y(n)(w · x(n) + b) ≤ 0 then
7:
w←w+y(n)x(n) ;b←b+y(n)
8:
wa ← wa + cy(n)x(n) ; ba ← ba + cy(n)
9:
c←c+1
10: return (w − wa/c, b − ba/c)
17
Naïve Bayes
18
19
Probabilistic Model
Instead of thinking in terms of multidimensional space...
Classification can be approached as a probability estimation
problem
We will try to find a probability distribution which
Describes well our training data
Allows us to make accurate predictions
We’ll look at Naive Bayes as a simplest example of a
probabilistic classifier
20
Representation of Examples
We
are trying to classify documents. Let’s
represent a document as a sequence of terms
(words) it contains t = (t1...tn)
For
(binary) classification we want to find the
most probable class:
ŷ= argmax P(Y =y|t)
But
y∈{−1,+1}
documents are close to unique: we cannot
reliably condition Y |t
Bayes’ rule
to the rescue
21
Bayes Rule
Bayes rule determines how joint and conditional
probabilities are related
22
Prior and likelihood
With Bayes’ rule we can invert the direction of conditioning
Decomposed the task into estimating the prior P (Y ) (easy)
and the likelihood P (t|Y = y)
23
Conditional Independence
How
to estimate P (t|Y = y)?
Naively
assume the occurrence of each word in the
document is independent of the others, when
conditioned on the class
24
Naive Bayes
Putting it all together
25
Decision Function
For binary classification:
26
Documents in Vector Notation
Let’s represent documents as vocabulary-size-dimensional
binary vectors
Dimension i indicates how many times the ith vocabulary item
appears in document x
27
Naive Bayes in Vector Notation
Counts appear as exponents:
If we take the logarithm of the threshold (ln 1 = 0) and g, we’ll
get the same decision function
28
Linear Classifier
Remember the linear classifier?
Log prior ratio corresponds to the bias term
Log likelihood ratios correspond to feature weights
29
What is the difference
Training criterion and procedure
Perceptron
Zero-one loss function
Error-driven algorithm
30
Naive Bayes
Maximum
Find
Likelihood criterion
parameters which maximize the log likelihood
Parameters reduce to relative counts
+
Ad-hoc smoothing
Alternatives
(e.g. maximum a posteriori)
31
Comparison
Logistic Regression
32
33
Probabilistic Conditional Model
Let’s try to come up with a probabilistic model, which has some
of the advantages of perceptron
Model P(y|x) directly, and not via P(x|y) and Bayes rule as in
Naïve Bayes
Avoid issue of dependencies between features of x
We’ll take linear regression as a starting point
The goal is to adapt regression to model class-conditional probability
34
Linear Regression
Training data: observations paired with outcomes (n ∈ R)
Observations have features (predictors, typically also real
numbers)
The model is a regression line y = ax + b which best fits the
observations
a is the slope
b is the intercept
This model has two parameters (or weights)
One feature = x
Example:
x = number of vague adjectives in property descriptions
y = amount house sold over asking price
35
36
Multiple Linear Regression
More
generally
y
where
= outcome
w0 = intercept
x1..xd = features vector and w1..wd weight vector
Get rid of bias:
Linear
regression: uses g(x) directly
Linear
classifier: uses sign(g(x))
37
Learning Linear Regression
Minimize
sum squared error over N training examples
Closed-form
formula for choosing the best weights w:
where the matrix X contains training example features,
and y is the vector of outcomes.
38
Logistic Regression
In
logistic regression we use the linear model to
assign probabilities to class labels
For
binary classification, predict P (Y = 1|x). But
predictions of linear regression model are ∈ R,
whereas P(Y =1|x)∈[0,1]
Instead
predict logit function of the probability:
Solving for P (Y = 1|x) we obtain:
39
40
Logistic Regression – Classification
Example x belongs to class 1 if:
Equation w · x = 0 defines a hyper-plane with points above
belonging to class 1
41
Multinomial Logistic Regression
Logistic regression generalized to more than two classes
42
Learning Parameters
Conditional likelihood estimation: choose the weights which
make the probability of the observed values y be the highest,
given the observations xi
For the training set with N examples:
43
Error Function
Equivalently, we seek the value of the parameters which
minimize the error function:
44
A problem in convex optimization
L-BFGS (Limited-memory Broyden-Fletcher-GoldfarbShanno method)
gradient descent
conjugate gradient
iterative scaling algorithms
Gradient Descent Learning Rule
Weight
update rule:
wj =
wj + h ( t(i) – s[f(i)] ) s[f(i)] xj(i)
can rewrite as:
wj =
wj + h * error * c * xj(i)
The Basic idea
A gradient is a slope of a function
That is, a set of partial derivatives, one for each dimension
(parameter)
By following the gradient of a convex function we can descend
to the bottom (minimum)
What is the gradient?
Partial derivatives of E(w0, w1,w2):
e.g., d E(w0, w1,w2) /d w1
Gradient is defined as the vector of partial derivatives:
Dw = [ d E(w) /d w0, d E(w) /d w1, d E(w) /d w2 ]
= gradient of w
= vector of derivatives (defined here on 3 dimensions)
1. E(w) and Dw can be evaluated at any particular point w
2. The components of the gradient Dw tell us how fast E(w) is
changing in each direction
3. When interpreted as a vector, Dw is the direction of steepest
increase => - Dw is the direction of steepest decrease
Gradient Descent Rule in Multiple
Dimensions
Gradient Descent Rule:
w new = w old - h D (w)
where
D (w) is the gradient and
h is the learning rate (small, positive)
Notes:
1. This moves us downhill in direction D (w) (steepest downhill direction)
2. How far we go is determined by the value of h
3. The perceptron learning rule is a special case of this general method
Illustration of Gradient Descent
E(w)
w1
w0
Illustration of Gradient Descent
E(w)
w1
w0
Illustration of Gradient Descent
E(w)
w1
Direction of steepest
descent = direction of
negative gradient
w0
Illustration of Gradient Descent
E(w)
w1
Original point in
weight space
New point in
w0
weight space
Comments on the Gradient Descent
Algorithm
Equivalent to hill-climbing heuristic search
Works on any objective function E(w)
as long as we can evaluate the gradient D (w)
this can be very useful for minimizing complex functions E
Local minima
can have multiple local minima (note: for perceptron, E(w) only has a
single global minimum, so this is not a problem)
gradient descent goes to the closest local minimum:
solution: random restarts from multiple places in weight space
(note: no local minima for perceptron learning)
General Gradient Descent
Algorithm
Define an objective function E(w) (function to be minimized)
We want to find the vector of values w that minimize E(w)
Algorithm:
pick an initial set of weights w, e.g. randomly
evaluate D (w) at w
update all the weights
note: this can be done numerically or in closed form
w new = w old - h D (w)
check if D (w) is approximately 0
if so, we have converged to a “flat minimum”
if not, we move again in weight space
For perceptron learning, D (w) is ( t(i) – s[f(i)] ) s[f(i)] xj(i)
Minimization of Mean Squared
Error, E(w)
E(w)
w1
Minimum of
function E(w)
Minimization of Mean Squared
Error, E(w)
E(w)
w1
d E(w)/
dw1
w1
Moving Downhill: Move in direction of
negative derivative
E(w)
Decreasing E(w)
w1
d E(w)/
dw1
w1
d E(w)/dw1 > 0
w1 <= w1 - h d E(w)/dw1
i.e., the rule decreases w1
Moving Downhill: Move in direction of
negative derivative
E(w)
Decreasing E(w)
w1
d E(w)/
dw1
w1
d E(w)/dw1 < 0
w1 <= w1 - h d E(w)/dw1
i.e., the rule increases w1
58
Gradient Descent Example
Find argminθ f(θ) where f(θ) = θ2
Initial value of θ1 = −1
Gradient function: ∇f(θ) = 2θ
Update: θ(n+1) = θ(n) − η∇f (θ(n))
The learning rate η (= 0.2) controls the speed of the descent
After first iteration: θ(2) = −1 − 0.2(2) = −0.6
59
Five Iterations of Gradient Descent
60
Stochastic Gradient Descent (SGD)
We could compute the gradient of error for the full dataset
before each update Instead
Compute the gradient of the error for a single example
update the weight
Move on to the next example
On average, we’ll move in the right direction
Efficient, online algorithm
However, stochastic, uses random samples to accommodate
for lots of local maxima/minima
61
Error gradient
The gradient of the error function is the set of partial derivatives
of the error function with respect to the parameters Wyi
62
Single training example
63
Update
Stochastic gradient update step
64
Update: Explicit
For the correct class (y = y(n))
where 1 − P (Y = y|x(n), W) is the residual
For all other classes (y≠y(n))
65
66
Logistics Regression SGD vs
Perceptron
Very similar update!
Perceptron is simply an instantiation of SGD for a particular
error function
The perceptron criterion: for a correctly classified example
zero error; for a misclassified example −y(n)w · x(n)
67
Comparison
68
Exercise
Do Example 2.2.1:2, and 2.3.1, not using a randomly
generated data as shown, but on your project’s dataset.