Linear Classifiers
Pattern Recognition and
Image Analysis
Dr. Manal Helal – Fall 2014
Lecture 7
Linear Models
• Linear models
• Perceptron
• Naïve Bayes
• Logistic regression
Linear Models for Classification
• Linear models for classification separate input vectors into classes using linear (hyperplane) decision boundaries.
• Example:
  • 2D input vector x
  • Two discrete classes C1 and C2
[Figure: scatter plot of the two classes in the (x1, x2) plane, separated by a linear decision boundary.]
An Example
• Positive examples are blank, negative are filled.
• Which line describes the decision boundary?
• Higher dimensions?
• Think of training examples as points in d-dimensional space; each dimension corresponds to one feature.
• A linear binary classifier defines a plane in the space which separates positive from negative examples.
Linear Decision Boundary
• A hyper-plane is a generalization of a straight line to more than 2 dimensions.
• A hyper-plane contains all the points in a d-dimensional space satisfying the equation:
  w1 x1 + w2 x2 + ... + wd xd + w0 = 0
• Each coefficient wi can be thought of as a weight on the corresponding feature.
• The vector containing all the weights, w = (w0, ..., wd), is the parameter vector or weight vector.
Normal Vector
• Geometrically, the weight vector w is a normal vector of the separating hyper-plane.
• A normal vector of a surface is any vector which is perpendicular to it.
Hyper-plane as a Classifier
• Let g(x) = w1 x1 + w2 x2 + ... + wd xd + w0
• Then assign x to the positive class if g(x) ≥ 0, and to the negative class otherwise.
Bias
• The slope of the hyper-plane is determined by w1...wd; its location (intercept) is determined by the bias w0.
• Include the bias in the weight vector and add a dummy component to the feature vector, setting this component to x0 = 1. Then:
  g(x) = w · x
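As a minimal sketch of this augmentation in Python/NumPy (the names and toy data here are illustrative, not from the slides):

import numpy as np

def augment(X):
    # Prepend a dummy x0 = 1 component to every feature vector,
    # so the bias w0 becomes just another entry of the weight vector.
    return np.hstack([np.ones((X.shape[0], 1)), X])

X = np.array([[2.0, -1.0], [0.5, 3.0]])  # two 2-D examples
w = np.array([0.1, 0.7, -0.4])           # (w0, w1, w2)
g = augment(X) @ w                       # g(x) = w . x for each example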
Separating Hyper-planes in 2 Dimensions: Two-Class Discriminant Function
• y(x) = wᵀx + w0
• y(x) ≥ 0 → x assigned to C1
• y(x) < 0 → x assigned to C2
• Thus y(x) = 0 defines the decision boundary.
[Figure: the regions R1 (y > 0) and R2 (y < 0) in the (x1, x2) plane; w is normal to the decision boundary, the signed distance of a point x from the boundary is y(x)/‖w‖, and the distance of the boundary from the origin is −w0/‖w‖.]
Learning
• The goal of the learning process is to come up with a "good" weight vector w.
• The learning process will use examples to guide the search for a "good" w.
• Different notions of "goodness" exist, which yield different learning algorithms.
• We will describe some of these algorithms in the following slides.
Perceptron
Perceptron Training
• How do we find a set of weights that separate our classes?
• Perceptron: a simple mistake-driven online algorithm.
  1. Start with a zero weight vector and process each training example in turn.
  2. If the current weight vector classifies the current example incorrectly, move the weight vector in the right direction.
  3. If the weights stop changing, stop.
• If the examples are linearly separable, this algorithm is guaranteed to converge to a solution vector.
Fixed Increment Online Perceptron Algorithm
• Binary classification, with classes +1 and −1. Decision function: y′ = sign(w · x)

Perceptron(x1:N, y1:N, I):
1: w ← 0
2: for i = 1...I do
3:   for n = 1...N do
4:     if y(n)(w · x(n)) ≤ 0 then
5:       w ← w + y(n) x(n)
6: return w
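The same algorithm as a runnable Python/NumPy sketch (the helper name and the toy data are illustrative):

import numpy as np

def perceptron(X, y, epochs=10):
    # X: (N, d) feature matrix, with a dummy x0 = 1 column for the bias.
    # y: (N,) labels in {+1, -1}.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n) <= 0:   # mistake (or exactly on the boundary)
                w += y_n * x_n         # move w in the right direction
    return w

# Toy linearly separable data:
X = np.array([[1, 2.0, 2.0], [1, 1.0, 3.0], [1, -1.0, -1.0], [1, -2.0, -1.5]])
y = np.array([1, 1, -1, -1])
print(np.sign(X @ perceptron(X, y)))   # reproduces y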
Or More Explicitly

1: w ← 0
2: for i = 1...I do
3:   for n = 1...N do
4:     if y(n) = sign(w · x(n)) then
5:       pass
6:     else if y(n) = +1 ∧ sign(w · x(n)) = −1 then
7:       w ← w + x(n)
8:     else if y(n) = −1 ∧ sign(w · x(n)) = +1 then
9:       w ← w − x(n)
10: return w

Tracing an example for a NAND function is found in:
http://en.wikipedia.org/wiki/Perceptron
Weight Averaging
• Although the algorithm is guaranteed to converge, the solution is not unique!
• It is sensitive to the order in which examples are processed.
• Separating the training sample does not guarantee good accuracy on unseen data.
• Empirically, weight averaging gives better generalization performance.
  • A method of avoiding overfitting.
  • As the final weight vector, use the mean of all the weight vector values at each step of the algorithm.
• (cf. regularization in a following session)
Efficient Averaged Perceptron Algorithm

Perceptron(x1:N, y1:N, I):
1: w ← 0; wa ← 0
2: b ← 0; ba ← 0
3: c ← 1
4: for i = 1...I do
5:   for n = 1...N do
6:     if y(n)(w · x(n) + b) ≤ 0 then
7:       w ← w + y(n) x(n); b ← b + y(n)
8:       wa ← wa + c y(n) x(n); ba ← ba + c y(n)
9:     c ← c + 1
10: return (w − wa/c, b − ba/c)
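A direct Python transcription of this pseudocode (a sketch; the function name is mine):

import numpy as np

def averaged_perceptron(X, y, epochs=10):
    # Accumulate c-weighted updates in (wa, ba) so the average over all
    # steps can be recovered at the end without storing every w.
    d = X.shape[1]
    w, wa = np.zeros(d), np.zeros(d)
    b, ba, c = 0.0, 0.0, 1.0
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n + b) <= 0:
                w += y_n * x_n
                b += y_n
                wa += c * y_n * x_n
                ba += c * y_n
            c += 1
    return w - wa / c, b - ba / c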
Naïve Bayes
Probabilistic Model
• Instead of thinking in terms of multidimensional space...
• Classification can be approached as a probability estimation problem.
• We will try to find a probability distribution which
  • describes our training data well, and
  • allows us to make accurate predictions.
• We'll look at Naive Bayes as the simplest example of a probabilistic classifier.
Representation of Examples
• We are trying to classify documents. Let's represent a document as the sequence of terms (words) it contains: t = (t1...tn).
• For (binary) classification we want to find the most probable class:
  ŷ = argmax_{y ∈ {−1,+1}} P(Y = y | t)
• But documents are close to unique: we cannot reliably condition Y | t.
• Bayes' rule to the rescue.
Bayes' Rule
• Bayes' rule determines how joint and conditional probabilities are related:
  P(Y = y | t) = P(t | Y = y) P(Y = y) / P(t)
Prior and Likelihood
• With Bayes' rule we can invert the direction of conditioning:
  ŷ = argmax_{y} P(t | Y = y) P(Y = y)
• This decomposes the task into estimating the prior P(Y) (easy) and the likelihood P(t | Y = y).
Conditional Independence
• How do we estimate P(t | Y = y)?
• Naively assume that the occurrence of each word in the document is independent of the others when conditioned on the class:
  P(t | Y = y) = ∏_{i=1}^{n} P(ti | Y = y)
Naive Bayes
• Putting it all together:
  ŷ = argmax_{y} P(Y = y) ∏_{i=1}^{n} P(ti | Y = y)
Decision Function
• For binary classification, assign class +1 iff:
  g(t) = [ P(Y = +1) ∏_i P(ti | Y = +1) ] / [ P(Y = −1) ∏_i P(ti | Y = −1) ] ≥ 1
Documents in Vector Notation
• Let's represent documents as vocabulary-size-dimensional count vectors.
• Dimension i indicates how many times the ith vocabulary item appears in document x.
Naive Bayes in Vector Notation
• The counts appear as exponents:
  g(x) = [ P(Y = +1) / P(Y = −1) ] ∏_{i=1}^{V} [ P(i | Y = +1) / P(i | Y = −1) ]^{xi}
• If we take the logarithm of the threshold (ln 1 = 0) and of g, we get the same decision function.
Linear Classifier
• Remember the linear classifier? Taking logs turns Naive Bayes into one:
  ln g(x) = ln [ P(Y = +1) / P(Y = −1) ] + Σ_i xi ln [ P(i | Y = +1) / P(i | Y = −1) ] = w · x + w0
• The log prior ratio corresponds to the bias term.
• The log likelihood ratios correspond to the feature weights.
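A sketch of this correspondence in Python/NumPy, assuming a multinomial model with add-alpha smoothing (the function name and the smoothing choice are mine; the slides only mention "ad-hoc smoothing"):

import numpy as np

def nb_linear_weights(X, y, alpha=1.0):
    # X: (N, V) word-count matrix; y: (N,) labels in {+1, -1}.
    pos, neg = X[y == 1], X[y == -1]
    V = X.shape[1]
    p_pos = (pos.sum(axis=0) + alpha) / (pos.sum() + alpha * V)
    p_neg = (neg.sum(axis=0) + alpha) / (neg.sum() + alpha * V)
    w = np.log(p_pos) - np.log(p_neg)                # log likelihood ratios
    w0 = np.log((y == 1).mean() / (y == -1).mean())  # log prior ratio
    return w, w0

# Decide with the linear rule: sign(X @ w + w0).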
What Is the Difference?
• Training criterion and procedure.
• Perceptron:
  • zero-one loss function;
  • error-driven algorithm.
• Naive Bayes:
  • Maximum Likelihood criterion;
  • find the parameters which maximize the log likelihood;
  • the parameters reduce to relative counts, plus ad-hoc smoothing;
  • alternatives exist (e.g. maximum a posteriori).
Comparison
Logistic Regression
Probabilistic Conditional Model
• Let's try to come up with a probabilistic model which has some of the advantages of the perceptron.
• Model P(y|x) directly, not via P(x|y) and Bayes' rule as in Naïve Bayes.
• This avoids the issue of dependencies between the features of x.
• We'll take linear regression as a starting point.
  • The goal is to adapt regression to model the class-conditional probability.
Linear Regression
• Training data: observations paired with outcomes (y ∈ ℝ).
• Observations have features (predictors, typically also real numbers).
• The model is a regression line y = ax + b which best fits the observations:
  • a is the slope;
  • b is the intercept;
  • this model has two parameters (or weights) and one feature, x.
• Example:
  • x = number of vague adjectives in property descriptions;
  • y = amount the house sold for over the asking price.
Multiple Linear Regression
• More generally:
  y = w0 + Σ_{j=1}^{d} wj xj
  where y is the outcome, w0 the intercept, x1..xd the feature vector and w1..wd the weight vector.
• Get rid of the bias by adding a dummy feature x0 = 1, as before: g(x) = w · x
• Linear regression uses g(x) directly.
• A linear classifier uses sign(g(x)).
Learning Linear Regression
• Minimize the sum of squared errors over the N training examples:
  E(w) = Σ_{n=1}^{N} (y(n) − w · x(n))²
• Closed-form formula for choosing the best weights w:
  w = (XᵀX)⁻¹ Xᵀ y
  where the matrix X contains the training example features, and y is the vector of outcomes.
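Checking the closed form in NumPy (illustrative data; in practice a least-squares solver is preferred to an explicit inverse):

import numpy as np

X = np.array([[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0]])  # x0 = 1 bias column
y = np.array([0.1, 0.9, 2.1, 2.9])

w_normal = np.linalg.solve(X.T @ X, X.T @ y)    # w = (X^T X)^{-1} X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None) # least-squares solver
print(w_normal, w_lstsq)                        # the two solutions agree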
Logistic Regression
• In logistic regression we use the linear model to assign probabilities to class labels.
• For binary classification, predict P(Y = 1|x). But the predictions of the linear regression model are in ℝ, whereas P(Y = 1|x) ∈ [0, 1].
• Instead, predict the logit function of the probability:
  ln [ P(Y = 1|x) / (1 − P(Y = 1|x)) ] = w · x
• Solving for P(Y = 1|x) we obtain:
  P(Y = 1|x) = 1 / (1 + exp(−w · x))
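The two formulas in Python (a direct transcription; "sigmoid" is the standard name for the inverse logit):

import numpy as np

def sigmoid(z):
    # Maps a real-valued score w . x to a probability in [0, 1].
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    return sigmoid(w @ x)   # P(Y = 1 | x)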
Logistic Regression – Classification
• Example x belongs to class 1 if:
  P(Y = 1|x) ≥ 0.5, which holds exactly when w · x ≥ 0
• The equation w · x = 0 defines a hyper-plane, with the points above it belonging to class 1.
Multinomial Logistic Regression
• Logistic regression generalized to more than two classes:
  P(Y = y | x, W) = exp(w_y · x) / Σ_{y′} exp(w_{y′} · x)
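This normalization is the softmax; a small sketch (the stabilizing shift by the maximum is a standard numerical precaution, not from the slides):

import numpy as np

def softmax_proba(W, x):
    # W: (K, d) weight matrix, one row w_y per class; x: (d,) features.
    z = W @ x
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()       # P(Y = y | x, W) for every class y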
Learning Parameters
• Conditional likelihood estimation: choose the weights which make the probability of the observed values y(n) the highest, given the observations x(n).
• For a training set with N examples:
  Ŵ = argmax_W ∏_{n=1}^{N} P(y(n) | x(n), W)
Error Function
• Equivalently, we seek the value of the parameters which minimizes the error function:
  E(W) = − Σ_{n=1}^{N} ln P(y(n) | x(n), W)
A Problem in Convex Optimization
• L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno method)
• Gradient descent
• Conjugate gradient
• Iterative scaling algorithms
Gradient Descent Learning Rule
• Weight update rule:
  wj ← wj + η ( t(i) − s[f(i)] ) s[f(i)] xj(i)
• Can be rewritten as:
  wj ← wj + η · error · c · xj(i)
The Basic Idea
• A gradient is the slope of a function; that is, a set of partial derivatives, one for each dimension (parameter).
• By following the gradient of a convex function downhill we can descend to the bottom (minimum).
What Is the Gradient?
• The partial derivatives of E(w0, w1, w2), e.g. ∂E(w0, w1, w2)/∂w1.
• The gradient is defined as the vector of partial derivatives:
  D(w) = [ ∂E(w)/∂w0, ∂E(w)/∂w1, ∂E(w)/∂w2 ]
  i.e. the gradient of E at w, a vector of derivatives (defined here in 3 dimensions).
• E(w) and D(w) can be evaluated at any particular point w.
• The components of the gradient D(w) tell us how fast E(w) is changing in each direction.
• Interpreted as a vector, D(w) is the direction of steepest increase, so −D(w) is the direction of steepest decrease.
Gradient Descent Rule in Multiple Dimensions
• Gradient Descent Rule:
  w_new = w_old − η D(w)
  where D(w) is the gradient and η is the learning rate (small, positive).
• Notes:
  1. This moves us downhill in the direction −D(w) (the steepest downhill direction).
  2. How far we go is determined by the value of η.
  3. The perceptron learning rule is a special case of this general method.
Illustration of Gradient Descent
[Figures: the error surface E(w) over the weights (w0, w1); the direction of steepest descent is the direction of the negative gradient, and each step moves from the original point in weight space to a new point further downhill.]
Comments on the Gradient Descent Algorithm
• Equivalent to a hill-climbing heuristic search.
• Works on any objective function E(w)
  • as long as we can evaluate the gradient D(w);
  • this can be very useful for minimizing complex functions E.
• Local minima:
  • E can have multiple local minima (note: for the perceptron, E(w) has only a single global minimum, so this is not a problem);
  • gradient descent goes to the closest local minimum;
  • solution: random restarts from multiple places in weight space.
General Gradient Descent Algorithm
• Define an objective function E(w) (the function to be minimized).
• We want to find the vector of values w that minimizes E(w).
• Algorithm:
  • pick an initial set of weights w, e.g. randomly;
  • evaluate D(w) at w (note: this can be done numerically or in closed form);
  • update all the weights: w_new = w_old − η D(w);
  • check whether D(w) is approximately 0:
    • if so, we have converged to a "flat minimum";
    • if not, we move again in weight space.
• For perceptron learning, D(w) is ( t(i) − s[f(i)] ) s[f(i)] xj(i).
Minimization of Mean Squared Error, E(w)
[Figure: E(w) plotted against w1, marking the minimum of E(w) and the derivative dE(w)/dw1 at the current point.]
Moving downhill means moving in the direction of the negative derivative:
• If dE(w)/dw1 > 0, the rule w1 ← w1 − η dE(w)/dw1 decreases w1, decreasing E(w).
• If dE(w)/dw1 < 0, the same rule increases w1, again decreasing E(w).
Gradient Descent Example
• Find argmin_θ f(θ) where f(θ) = θ².
• Initial value: θ(1) = −1.
• Gradient function: ∇f(θ) = 2θ.
• Update: θ(n+1) = θ(n) − η ∇f(θ(n)).
• The learning rate η (= 0.2) controls the speed of the descent.
• After the first iteration: θ(2) = −1 − 0.2 · (−2) = −0.6.

Five Iterations of Gradient Descent
[Figure: f(θ) = θ² with five iterates stepping toward the minimum at θ = 0.]
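The five iterations can be traced with a few lines of Python (a direct transcription of the update rule above):

def gradient_descent(theta=-1.0, eta=0.2, iterations=5):
    # Minimize f(theta) = theta**2 by repeated gradient steps.
    for n in range(iterations):
        grad = 2 * theta                # gradient of f at theta
        theta = theta - eta * grad      # update rule
        print(f"theta({n + 2}) = {theta}")
    return theta

gradient_descent()   # -0.6, -0.36, -0.216, -0.1296, -0.07776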
Stochastic Gradient Descent (SGD)
• We could compute the gradient of the error for the full dataset before each update. Instead:
  • compute the gradient of the error for a single example;
  • update the weights;
  • move on to the next example.
• On average, we'll move in the right direction.
• An efficient, online algorithm.
• However, it is stochastic: it uses random samples, which helps accommodate lots of local maxima/minima.
Error Gradient
• The gradient of the error function is the set of partial derivatives of the error function with respect to the parameters w_yi.
Single Training Example
• The error contributed by a single training example is its negative log-probability:
  E(n)(W) = − ln P(y(n) | x(n), W)

Update
• Stochastic gradient update step:
  W ← W − η ∇E(n)(W)
Update: Explicit
• For the correct class (y = y(n)):
  w_y ← w_y + η ( 1 − P(Y = y | x(n), W) ) x(n)
  where 1 − P(Y = y | x(n), W) is the residual.
• For all other classes (y ≠ y(n)):
  w_y ← w_y − η P(Y = y | x(n), W) x(n)
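Both cases collapse into one line of code if we write the residual as (indicator − probability); a sketch for the multinomial model (the function name is mine):

import numpy as np

def sgd_step(W, x, y_true, eta=0.1):
    # W: (K, d) float weights, one row per class; x: (d,) features;
    # y_true: index of the correct class.
    z = W @ x
    p = np.exp(z - z.max())
    p = p / p.sum()                        # P(Y = y | x, W) for every class
    for y in range(W.shape[0]):
        residual = (1.0 if y == y_true else 0.0) - p[y]
        W[y] += eta * residual * x         # matches both explicit updates above
    return W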
Logistic Regression SGD vs Perceptron
• Very similar update!
• The perceptron is simply an instantiation of SGD for a particular error function.
• The perceptron criterion: zero error for a correctly classified example; −y(n) w · x(n) for a misclassified example.
Comparison
Exercise
• Do Examples 2.2.1:2 and 2.3.1, not on randomly generated data as shown, but on your project's dataset.