* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Lecture7 linear File - Dr. Manal Helal Moodle Site
Linear least squares (mathematics) wikipedia , lookup
System of linear equations wikipedia , lookup
Euclidean vector wikipedia , lookup
Four-vector wikipedia , lookup
Laplace–Runge–Lenz vector wikipedia , lookup
Covariance and contravariance of vectors wikipedia , lookup
Vector space wikipedia , lookup
Matrix calculus wikipedia , lookup
Coefficient of determination wikipedia , lookup
Linear Classifiers Pattern Recognition and Image Analysis Dr. Manal Helal – Fall 2014 Lecture 7 2 Linear Models Linear models Perceptron Naïve Bayes Logistic regression 3 Linear Models for Classification Linear Models Probability & Bayesian Inference 6 ! Linear models for classification separate input vectors into Linear models for classification separate input vectors into classes using linear (hyperplane) decision boundaries. classes using linear (hyperplane) decision boundaries. ! Example: Example: 2D Input vector x 2D Input vector x Two discrete classes C1 and C2 Two discrete classes C1 and C2 4 2 0 x2 −2 −4 −6 −8 −4 −2 0 2 4 x1 CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition 6 8 4 An Example Positive examples are blank, negative are filled Which line describe the decision boundary? Higher dimensions?! Think of training examples as points in d-dimensional space. Each dimension corresponds to one feature. A linear binary classifier defines a plane in the space which separates positive from negative examples. 5 Linear Decision Boundary A hyper-plane is a generalization of a straight line to > 2 dimensions A hyper-plane contains all the points in a d dimensional space satisfying the following equation: w1x1 +w2x2,...,+wdxd +w0 =0 Each coefficient wi can be thought of as a weight on the corresponding feature The vector containing all the weights w = (w0, . . . , wd) is the parameter vector or weight vector 6 Normal Vector Geometrically, the weight vector w is a normal vector of the separating hyper-plane A normal vector of a surface is any vector which is perpendicular to it 7 Hyper-plane as a classifier Let g(x)=w1x1 +w2x2,...,+wdxd +w0 Then 8 Bias The slope of the hyper-plane is determined by w1...wd. The location (intercept) is determined by bias w0 Include bias in the weight vector and add a dummy component to the feature vector Set this component to x0 = 1 Then g(x) = w . x 9 Separating hyper-planes in 2 dimensions o Class Discriminant Function t 10 Probability & Bayesian Inference x2 y> 0 y= 0 y< 0 R1 R2 +y(x) w 0 = wt x +w0 y(x) ≥ 0 → x assigned to C1 < 0 → x assigned to C2 0 ®y(x)assigned to C x w y(x) ⇤w ⇤ 1 0® assigned to C2 x⇥ x1 ) = 0 defines the decision boundary Thus y(x) = 0 defines the decision boundary − w0 ⇤w ⇤ 11 Learning The goal of the learning process is to come up with a “good” weight vector w The learning process will use examples to guide the search of a “good” w Different notions of “goodness” exist, which yield different learning algorithms We will describe some of these algorithms in the following slides Perceptron 12 13 Perceptron Training How do we find a set of weights that separate our classes Perceptron: A simple mistake-driven online algorithm 1. Start with a zero weight vector and process each training example in turn. 2. If the current weight vector classifies the current example incorrectly, move the weight vector in the right direction. 3. If weights stop changing, stop If examples are linearly separable, then this algorithm is guaranteed to converge to the solution vector 14 Fixed increment online perceptron algorithm Binary classification, with classes +1 and −1 Decision function y′ = sign(w · x) Perceptron(x1:N , y1:N , I): 1:w←0 2: for i=1...I do 3: for n = 1...N do 4: 5: 6: return w if y(n)(w · x(n)) ≤ 0 then w ← (w · x(n)) ≤ 0 then 15 Or more explicitly 1: w←0 2: for i=1...I do 3: 4: 5: 6: 7: 8: 9: 10: return w Tracing an example for a NAND function is found in: http://en.wikipedia.org/wiki/Perceptron for n = 1...N do if y(n) = sign(w · x(n)) then pass elseif y(n)=+1∧sign(w·x(n))=−1 then w ← w + x(n) elseif y(n)=−1∧sign(w·x(n))=+1 then w ← w − x(n) 16 Weight averaging Although the algorithm is guaranteed to converge, the solution is not unique! Sensitive to the order in which examples are processed Separating the training sample does not equal good accuracy on unseen data Empirically, better generalization performance with weight averaging A method of avoiding overfitting As final weight vector, use the mean of all the weight vector values for each step of the algorithm (cf. regularization in a following session) Efficient averaged perceptron algorithm Perceptron(x1:N , y1:N , I): 1: w←0; wa←0 2: b←0;ba←0 3: c←1 4: for i=1...I do 5: 6: for n = 1...N do if y(n)(w · x(n) + b) ≤ 0 then 7: w←w+y(n)x(n) ;b←b+y(n) 8: wa ← wa + cy(n)x(n) ; ba ← ba + cy(n) 9: c←c+1 10: return (w − wa/c, b − ba/c) 17 Naïve Bayes 18 19 Probabilistic Model Instead of thinking in terms of multidimensional space... Classification can be approached as a probability estimation problem We will try to find a probability distribution which Describes well our training data Allows us to make accurate predictions We’ll look at Naive Bayes as a simplest example of a probabilistic classifier 20 Representation of Examples We are trying to classify documents. Let’s represent a document as a sequence of terms (words) it contains t = (t1...tn) For (binary) classification we want to find the most probable class: ŷ= argmax P(Y =y|t) But y∈{−1,+1} documents are close to unique: we cannot reliably condition Y |t Bayes’ rule to the rescue 21 Bayes Rule Bayes rule determines how joint and conditional probabilities are related 22 Prior and likelihood With Bayes’ rule we can invert the direction of conditioning Decomposed the task into estimating the prior P (Y ) (easy) and the likelihood P (t|Y = y) 23 Conditional Independence How to estimate P (t|Y = y)? Naively assume the occurrence of each word in the document is independent of the others, when conditioned on the class 24 Naive Bayes Putting it all together 25 Decision Function For binary classification: 26 Documents in Vector Notation Let’s represent documents as vocabulary-size-dimensional binary vectors Dimension i indicates how many times the ith vocabulary item appears in document x 27 Naive Bayes in Vector Notation Counts appear as exponents: If we take the logarithm of the threshold (ln 1 = 0) and g, we’ll get the same decision function 28 Linear Classifier Remember the linear classifier? Log prior ratio corresponds to the bias term Log likelihood ratios correspond to feature weights 29 What is the difference Training criterion and procedure Perceptron Zero-one loss function Error-driven algorithm 30 Naive Bayes Maximum Find Likelihood criterion parameters which maximize the log likelihood Parameters reduce to relative counts + Ad-hoc smoothing Alternatives (e.g. maximum a posteriori) 31 Comparison Logistic Regression 32 33 Probabilistic Conditional Model Let’s try to come up with a probabilistic model, which has some of the advantages of perceptron Model P(y|x) directly, and not via P(x|y) and Bayes rule as in Naïve Bayes Avoid issue of dependencies between features of x We’ll take linear regression as a starting point The goal is to adapt regression to model class-conditional probability 34 Linear Regression Training data: observations paired with outcomes (n ∈ R) Observations have features (predictors, typically also real numbers) The model is a regression line y = ax + b which best fits the observations a is the slope b is the intercept This model has two parameters (or weights) One feature = x Example: x = number of vague adjectives in property descriptions y = amount house sold over asking price 35 36 Multiple Linear Regression More generally y where = outcome w0 = intercept x1..xd = features vector and w1..wd weight vector Get rid of bias: Linear regression: uses g(x) directly Linear classifier: uses sign(g(x)) 37 Learning Linear Regression Minimize sum squared error over N training examples Closed-form formula for choosing the best weights w: where the matrix X contains training example features, and y is the vector of outcomes. 38 Logistic Regression In logistic regression we use the linear model to assign probabilities to class labels For binary classification, predict P (Y = 1|x). But predictions of linear regression model are ∈ R, whereas P(Y =1|x)∈[0,1] Instead predict logit function of the probability: Solving for P (Y = 1|x) we obtain: 39 40 Logistic Regression – Classification Example x belongs to class 1 if: Equation w · x = 0 defines a hyper-plane with points above belonging to class 1 41 Multinomial Logistic Regression Logistic regression generalized to more than two classes 42 Learning Parameters Conditional likelihood estimation: choose the weights which make the probability of the observed values y be the highest, given the observations xi For the training set with N examples: 43 Error Function Equivalently, we seek the value of the parameters which minimize the error function: 44 A problem in convex optimization L-BFGS (Limited-memory Broyden-Fletcher-GoldfarbShanno method) gradient descent conjugate gradient iterative scaling algorithms Gradient Descent Learning Rule Weight update rule: wj = wj + h ( t(i) – s[f(i)] ) s[f(i)] xj(i) can rewrite as: wj = wj + h * error * c * xj(i) The Basic idea A gradient is a slope of a function That is, a set of partial derivatives, one for each dimension (parameter) By following the gradient of a convex function we can descend to the bottom (minimum) What is the gradient? Partial derivatives of E(w0, w1,w2): e.g., d E(w0, w1,w2) /d w1 Gradient is defined as the vector of partial derivatives: Dw = [ d E(w) /d w0, d E(w) /d w1, d E(w) /d w2 ] = gradient of w = vector of derivatives (defined here on 3 dimensions) 1. E(w) and Dw can be evaluated at any particular point w 2. The components of the gradient Dw tell us how fast E(w) is changing in each direction 3. When interpreted as a vector, Dw is the direction of steepest increase => - Dw is the direction of steepest decrease Gradient Descent Rule in Multiple Dimensions Gradient Descent Rule: w new = w old - h D (w) where D (w) is the gradient and h is the learning rate (small, positive) Notes: 1. This moves us downhill in direction D (w) (steepest downhill direction) 2. How far we go is determined by the value of h 3. The perceptron learning rule is a special case of this general method Illustration of Gradient Descent E(w) w1 w0 Illustration of Gradient Descent E(w) w1 w0 Illustration of Gradient Descent E(w) w1 Direction of steepest descent = direction of negative gradient w0 Illustration of Gradient Descent E(w) w1 Original point in weight space New point in w0 weight space Comments on the Gradient Descent Algorithm Equivalent to hill-climbing heuristic search Works on any objective function E(w) as long as we can evaluate the gradient D (w) this can be very useful for minimizing complex functions E Local minima can have multiple local minima (note: for perceptron, E(w) only has a single global minimum, so this is not a problem) gradient descent goes to the closest local minimum: solution: random restarts from multiple places in weight space (note: no local minima for perceptron learning) General Gradient Descent Algorithm Define an objective function E(w) (function to be minimized) We want to find the vector of values w that minimize E(w) Algorithm: pick an initial set of weights w, e.g. randomly evaluate D (w) at w update all the weights note: this can be done numerically or in closed form w new = w old - h D (w) check if D (w) is approximately 0 if so, we have converged to a “flat minimum” if not, we move again in weight space For perceptron learning, D (w) is ( t(i) – s[f(i)] ) s[f(i)] xj(i) Minimization of Mean Squared Error, E(w) E(w) w1 Minimum of function E(w) Minimization of Mean Squared Error, E(w) E(w) w1 d E(w)/ dw1 w1 Moving Downhill: Move in direction of negative derivative E(w) Decreasing E(w) w1 d E(w)/ dw1 w1 d E(w)/dw1 > 0 w1 <= w1 - h d E(w)/dw1 i.e., the rule decreases w1 Moving Downhill: Move in direction of negative derivative E(w) Decreasing E(w) w1 d E(w)/ dw1 w1 d E(w)/dw1 < 0 w1 <= w1 - h d E(w)/dw1 i.e., the rule increases w1 58 Gradient Descent Example Find argminθ f(θ) where f(θ) = θ2 Initial value of θ1 = −1 Gradient function: ∇f(θ) = 2θ Update: θ(n+1) = θ(n) − η∇f (θ(n)) The learning rate η (= 0.2) controls the speed of the descent After first iteration: θ(2) = −1 − 0.2(2) = −0.6 59 Five Iterations of Gradient Descent 60 Stochastic Gradient Descent (SGD) We could compute the gradient of error for the full dataset before each update Instead Compute the gradient of the error for a single example update the weight Move on to the next example On average, we’ll move in the right direction Efficient, online algorithm However, stochastic, uses random samples to accommodate for lots of local maxima/minima 61 Error gradient The gradient of the error function is the set of partial derivatives of the error function with respect to the parameters Wyi 62 Single training example 63 Update Stochastic gradient update step 64 Update: Explicit For the correct class (y = y(n)) where 1 − P (Y = y|x(n), W) is the residual For all other classes (y≠y(n)) 65 66 Logistics Regression SGD vs Perceptron Very similar update! Perceptron is simply an instantiation of SGD for a particular error function The perceptron criterion: for a correctly classified example zero error; for a misclassified example −y(n)w · x(n) 67 Comparison 68 Exercise Do Example 2.2.1:2, and 2.3.1, not using a randomly generated data as shown, but on your project’s dataset.