Learning From Data
Lecture 8
Linear Classification and Regression
Linear Classification
Linear Regression
M. Magdon-Ismail
CSCI 4100/6100
recap: Approximation Versus Generalization
VC Analysis
Eout ≤ Ein + Ω(dvc)
1. Did you fit your data well enough (Ein)?
2. Are you confident your Ein will generalize to Eout?
Bias-Variance Analysis
Eout = bias + var
1. How well can you fit your data (bias)?
2. How close to that best fit can you get (var)?
[Figure: Error vs. model complexity (VC dimension dvc), showing the in-sample and out-of-sample errors and the best complexity d∗vc.]
The VC Insurance Co.

The VC bound is like a warranty. If you look at your data before choosing H, your warranty is void.
[Figures: fitting sin(x) with the average hypothesis ḡ(x), for two models.
H0: bias = 0.50, var = 0.25, Eout = 0.75 ← the winner
H1: bias = 0.21, var = 1.69, Eout = 1.90]
recap: Decomposing The Learning Curve
VC Analysis: Eout = in-sample error (Ein) + generalization error.
−→ Pick H that can generalize and has a good chance to fit the data.

Bias-Variance Analysis: Eout = bias + variance.
−→ Pick (H, A) to approximate f and not behave wildly after seeing the data.

[Figures: the two decompositions of the learning curve, Expected Error vs. number of data points N.]
Three Learning Problems
Credit Analysis:

  Approve or Deny         −→  Classification         (y = ±1)
  Amount of Credit        −→  Regression             (y ∈ R)
  Probability of Default  −→  Logistic Regression    (y ∈ [0, 1])
• Linear models are perhaps the most fundamental models.
• The linear model is the first model to try.
The Linear Signal
s = w^t x

  linear in x: gives the line/hyperplane separator
  linear in w: makes the algorithms work

x is the augmented vector: x ∈ {1} × R^d
The Linear Signal
The same signal s = w^t x serves all three problems:

  sign(w^t x) ∈ {−1, +1}    (classification)
  w^t x ∈ R                 (regression)
  θ(w^t x) ∈ [0, 1]         (logistic regression, y = θ(s))
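A minimal sketch of the three uses (assuming NumPy, and taking θ to be the logistic function e^s/(1 + e^s); the helper names are ours, not the lecture's):

    import numpy as np

    def signal(w, x):
        # the linear signal s = w^t x; x is the augmented vector (1, x1, ..., xd)
        return np.dot(w, x)

    def classify(w, x):
        return np.sign(signal(w, x))       # output in {-1, +1}

    def regress(w, x):
        return signal(w, x)                # output in R

    def logistic(w, x):
        s = signal(w, x)
        return 1.0 / (1.0 + np.exp(-s))    # theta(s), output in [0, 1]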
Linear Classification
Hlin = { h(x) = sign(w^t x) }
1. Ein ≈ Eout because dvc = d + 1:

   Eout(h) ≤ Ein(h) + O( √( (d/N) log N ) ).
2. If the data is linearly separable, PLA will find a separator =⇒ Ein = 0.
   w(t + 1) = w(t) + x∗y∗    ((x∗, y∗) is a misclassified data point)
Ein = 0 =⇒ Eout ≈ 0
(f is well approximated by a linear fit).
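A sketch of the update loop (assuming NumPy, with X the N × (d + 1) matrix of augmented inputs and y the ±1 labels; the function name is ours):

    import numpy as np

    def pla(X, y, max_iters=1000):
        # perceptron learning algorithm on linearly separable data
        w = np.zeros(X.shape[1])
        for _ in range(max_iters):
            misclassified = np.where(np.sign(X @ w) != y)[0]
            if len(misclassified) == 0:
                break                     # found a separator: Ein = 0
            n = misclassified[0]          # any misclassified point will do
            w = w + y[n] * X[n]           # w(t+1) = w(t) + y* x*
        return w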
What if the data is not separable (Ein = 0 is not possible)? −→ pocket algorithm
How to ensure Ein ≈ 0 is possible? −→ select good features
Non-Separable Data
The Pocket Algorithm
Minimizing Ein is a hard combinatorial problem.
The Pocket Algorithm
– Run PLA
– At each step keep the best Ein (and w) so far.
(It's not rocket science, but it works.)
(Other approaches: linear regression, logistic regression, linear programming . . . )
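A minimal pocket sketch under the same assumptions as the PLA sketch above (augmented X, labels y ∈ {−1, +1}):

    import numpy as np

    def pocket(X, y, max_iters=1000):
        # run PLA updates, but keep the best w (lowest Ein) seen so far
        w = np.zeros(X.shape[1])
        best_w, best_Ein = w.copy(), 1.0
        for _ in range(max_iters):
            misclassified = np.where(np.sign(X @ w) != y)[0]
            Ein = len(misclassified) / len(y)
            if Ein < best_Ein:
                best_w, best_Ein = w.copy(), Ein   # new pocket weights
            if Ein == 0.0:
                break
            n = misclassified[0]
            w = w + y[n] * X[n]                    # ordinary PLA step
        return best_w, best_Ein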
Digits Data
Each digit is a 16 × 16 image.
Digits Data
Each digit is a 16 × 16 image.
[-1 -1 -1 -1 -1 -1 -1 -0.63 0.86 -0.17 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.99 0.3 1 0.31 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.41 1 0.99 -0.57 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.68 0.83 1
0.56 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.94 0.54 1 0.78 -0.72 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0.1 1 0.92 -0.44 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.26 0.95 1 -0.16 -1 -1 -1 -0.99 -0.71 -0.83 -1
-1 -1 -1 -1 -0.8 0.91 1 0.3 -0.96 -1 -1 -0.55 0.49 1 0.88 0.09 -1 -1 -1 -1 0.28 1 0.88 -0.8 -1 -0.9 0.14 0.97 1 1 1 0.99 -0.74 -1 -1 -0.95 0.84 1 0.32 -1 -1 0.35 1 0.65 -0.10 -0.18 1 0.98
-0.72 -1 -1 -0.63 1 1 0.07 -0.92 0.11 0.96 0.30 -0.88 -1 -0.07 1 0.64 -0.99 -1 -1 -0.67 1 1 0.75 0.34 1 0.70 -0.94 -1 -1 0.54 1 0.02 -1 -1 -1 -0.90 0.79 1 1 1 1 0.53 0.18 0.81 0.83 0.97
0.86 -0.63 -1 -1 -1 -1 -0.45 0.82 1 1 1 1 1 1 1 1 0.13 -1 -1 -1 -1 -1 -1 -0.48 0.81 1 1 1 1 1 1 0.21 -0.94 -1 -1 -1 -1 -1 -1 -1 -0.97 -0.42 0.30 0.82 1 0.48 -0.47 -0.99 -1 -1 -1 -1]
x = (1, x1, · · · , x256)     ← input
w = (w0, w1, · · · , w256)    ← linear model
dvc = 257
Intensity and Symmetry Features
feature: an important property of the input that you think is useful for classification.
(dictionary.com: a prominent or conspicuous part or characteristic)
x = (1, x1, x2)     ← input
w = (w0, w1, w2)    ← linear model
dvc = 3
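A sketch of the two features, assuming each digit arrives as a 16 × 16 NumPy array of grayscale values in [−1, 1]; these particular definitions (average intensity, flip-difference symmetry) are common choices and are not pinned down by the slide:

    import numpy as np

    def intensity(img):
        # average pixel value; "1"s use less ink than most digits
        return img.mean()

    def symmetry(img):
        # negative mean absolute difference between the image and its
        # left-right mirror; symmetric digits score near 0
        return -np.abs(img - np.fliplr(img)).mean()

    def to_input(img):
        # the augmented input x = (1, x1, x2)
        return np.array([1.0, intensity(img), symmetry(img)])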
PLA on Digits Data
[Figure: PLA on the digits data. Error (log scale, 1% to 50%) vs. iteration number t, from 0 to 1000, showing Ein and Eout.]
Pocket on Digits Data
[Figures: PLA vs. Pocket on the digits data. Error (log scale, 1% to 50%) vs. iteration number t, from 0 to 1000, showing Ein and Eout for each algorithm.]
Linear Regression
  age             32 years
  gender          male
  salary          40,000
  debt            26,000
  years in job    1 year
  years at home   3 years
  ...
Classification: Approve/Deny
Regression: Credit Line (dollar amount)
regression ≡ y ∈ R

h(x) = Σ_{i=0}^{d} wi xi = w^t x
Least Squares Linear Regression

y = f(x) + ε    ←− noisy target P(y|x)

h(x) = w^t x

in-sample error:      Ein(h) = (1/N) Σ_{n=1}^{N} (h(xn) − yn)²

out-of-sample error:  Eout(h) = E_x[(h(x) − y)²]

[Figures: the least squares fit in one dimension (a line through the (x, y) data) and in two dimensions (a plane through the (x1, x2, y) data).]
Using Matrices for Linear Regression
X = [ —x1— ; —x2— ; · · · ; —xN— ]    ← data matrix, N × (d + 1)

y = (y1, y2, · · · , yN)^t    ← target vector

ŷ = (ŷ1, ŷ2, · · · , ŷN)^t = (w^t x1, w^t x2, · · · , w^t xN)^t = Xw    ← in-sample predictions

Ein(w) = (1/N) Σ_{n=1}^{N} (ŷn − yn)²
       = (1/N) ||ŷ − y||₂²
       = (1/N) ||Xw − y||₂²
       = (1/N) (w^t X^t X w − 2 w^t X^t y + y^t y)
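A quick numeric check of the expanded matrix form (random data; NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 100, 5
    X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])   # N x (d+1), first column 1s
    y = rng.standard_normal(N)
    w = rng.standard_normal(d + 1)

    Ein_residual = np.mean((X @ w - y) ** 2)                        # (1/N) ||Xw - y||^2
    Ein_expanded = (w @ (X.T @ X) @ w - 2 * w @ (X.T @ y) + y @ y) / N
    assert np.isclose(Ein_residual, Ein_expanded)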
Linear Regression Solution
Ein(w) = (1/N) (w^t X^t X w − 2 w^t X^t y + y^t y)

Vector Calculus: To minimize Ein(w), set ∇w Ein(w) = 0. Using

∇w(w^t A w) = (A + A^t) w,    ∇w(w^t b) = b,

with A = X^t X and b = X^t y:

∇w Ein(w) = (2/N) (X^t X w − X^t y).

Setting ∇w Ein(w) = 0:

X^t X w = X^t y    ←− normal equations

wlin = (X^t X)^{−1} X^t y    ←− when X^t X is invertible
Linear Regression Algorithm
Linear Regression Algorithm:
1. Construct the matrix X and the vector y from the data set
   (x1, y1), · · · , (xN , yN ), where each x includes the x0 = 1 coordinate:

   X = [ —x1— ; —x2— ; · · · ; —xN— ]  (data matrix),    y = (y1, y2, · · · , yN)^t  (target vector).
2. Compute the pseudo-inverse X† of the matrix X. If X^t X is invertible,

   X† = (X^t X)^{−1} X^t.
3. Return wlin = X†y.
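In NumPy the whole algorithm is a few lines (a sketch; np.linalg.pinv computes X† and also covers the case where X^t X is not invertible):

    import numpy as np

    def linear_regression(X, y):
        # X: N x (d+1) data matrix with first column all 1s; y: length-N target vector
        X_dagger = np.linalg.pinv(X)    # pseudo-inverse; equals (X^t X)^{-1} X^t when invertible
        return X_dagger @ y             # w_lin = X† y

Predictions on the (augmented) inputs are then X @ w_lin.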
Generalization
The linear regression algorithm gets the smallest possible Ein in one step.
Generalization is also good.
There are other bounds, for example:

E[Eout(h)] = E[Ein(h)] + O(d/N).
One can obtain a regression version of dvc.

[Figure: learning curves for linear regression, Expected Error vs. number of data points N; Ein and Eout converge to the noise level σ², with the gap governed by d + 1.]
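A small experiment that traces this learning curve (a sketch under an assumed noisy linear target y = w∗^t x + noise of variance σ²; the setup is ours, not the slide's):

    import numpy as np

    rng = np.random.default_rng(0)
    d, sigma, trials = 5, 0.5, 2000

    for N in [10, 20, 40, 80, 160]:
        Ein_avg, Eout_avg = 0.0, 0.0
        for _ in range(trials):
            w_star = rng.standard_normal(d + 1)
            X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
            y = X @ w_star + sigma * rng.standard_normal(N)
            w = np.linalg.pinv(X) @ y                    # w_lin
            Ein_avg += np.mean((X @ w - y) ** 2) / trials
            # a fresh test set estimates Eout
            X2 = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
            y2 = X2 @ w_star + sigma * rng.standard_normal(N)
            Eout_avg += np.mean((X2 @ w - y2) ** 2) / trials
        print(N, round(Ein_avg, 3), round(Eout_avg, 3))  # both approach sigma^2 = 0.25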
Linear Regression for Classification
Linear regression can learn any real-valued target function.
For example yn = ±1.
(±1 are real values!)
Use linear regression to get w with w^t xn ≈ yn = ±1.
Then sign(w^t xn) will likely agree with yn = ±1.
Example. Classifying "1" versus "not 1" (multiclass → 2 class).

[Figure: the resulting decision boundary in the (Average Intensity, Symmetry) feature plane.]

These can be good initial weights for classification.
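A sketch of regression-for-classification under the earlier assumptions (augmented X, labels y ∈ {−1, +1}); the returned w can also seed pocket/PLA:

    import numpy as np

    def regression_for_classification(X, y):
        # treat the +/-1 labels as real-valued targets and fit by least squares
        w = np.linalg.pinv(X) @ y
        Ein = np.mean(np.sign(X @ w) != y)   # classification error of sign(w^t x)
        return w, Ein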