Learning From Data
Lecture 8
Linear Classification and Regression
Linear Classification
Linear Regression
M. Magdon-Ismail
CSCI 4100/6100
recap: Approximation Versus Generalization
VC Analysis
Eout ≤ Ein + Ω(dvc)
1. Did you fit your data well enough (Ein)?
2. Are you confident your Ein will generalize to Eout?
Bias-Variance Analysis
Eout = bias + var
1. How well can you fit your data (bias)?
2. How close to that best fit can you get (var)?
[Figure: Error versus model complexity (VC dimension, dvc), showing the in-sample error, the out-of-sample error, and the best complexity d∗vc.]
The VC Insurance Co.
The VC bound is like a warranty.
If you look at your data before choosing H, your warranty is void.
[Figures: fitting sin(x) with H0 and H1; ḡ(x) is the average hypothesis.
H0: bias = 0.50, var = 0.25, Eout = 0.75 (the winner).
H1: bias = 0.21, var = 1.69, Eout = 1.90.]
recap: Decomposing The Learning Curve
VC Analysis
[Learning curve: expected error versus number of data points N; the Eout curve lies above the Ein curve, with the gap labelled generalization error and the region under Ein labelled in-sample error.]
Pick H that can generalize and has a good chance to fit the data.

Bias-Variance Analysis
[Learning curve: expected error versus number of data points N; the Eout curve lies above the Ein curve, with the regions labelled variance and bias.]
Pick (H, A) to approximate f and not behave wildly after seeing the data.
Three Learning Problems
Credit Analysis:
  Approve or Deny        →  Classification,        y = ±1
  Amount of Credit       →  Regression,            y ∈ R
  Probability of Default →  Logistic Regression,   y ∈ [0, 1]
• Linear models are perhaps the most fundamental models.
• The linear model is the first model to try.
The Linear Signal
linear in x: gives the line/hyperplane separator
        ↓
    s = w^t x
        ↑
linear in w: makes the algorithms work

x is the augmented vector: x ∈ {1} × R^d
The Linear Signal
s = w^t x

→ sign(w^t x) ∈ {−1, +1}    (linear classification)
→ w^t x ∈ R                 (linear regression)
→ θ(w^t x) ∈ [0, 1]         (logistic regression, y = θ(s))
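To make the shared signal concrete, here is a minimal Python sketch (my own illustration, not from the lecture) of the three models built on s = w^t x; taking θ to be the logistic function is an assumption on my part.

```python
import numpy as np

def signal(w, x):
    """s = w^t x, with x an augmented input (x0 = 1)."""
    return np.dot(w, x)

def classify(w, x):
    """Linear classification: sign(s) in {-1, +1}."""
    return 1.0 if signal(w, x) >= 0 else -1.0

def regress(w, x):
    """Linear regression: the signal itself, a real number."""
    return signal(w, x)

def predict_probability(w, x):
    """Logistic regression: theta(s) in [0, 1] (theta assumed logistic here)."""
    return 1.0 / (1.0 + np.exp(-signal(w, x)))
```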
Linear Classification
Hlin = { h(x) = sign(w^t x) }

1. Ein ≈ Eout because dvc = d + 1,

   Eout(h) ≤ Ein(h) + O( √( (d/N) log N ) ).
2. If the data is linearly separable, PLA will find a separator =⇒ Ein = 0.
   w(t + 1) = w(t) + x∗y∗,   where (x∗, y∗) is a misclassified data point.

   Ein = 0 =⇒ Eout ≈ 0   (f is well approximated by a linear fit).
What if the data is not separable (Ein = 0 is not possible)?  →  pocket algorithm
How to ensure Ein ≈ 0 is possible?  →  select good features
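The PLA update above is easy to implement. Below is a minimal Python sketch of PLA on augmented inputs with labels in {−1, +1}; the function name, array shapes, and iteration cap are illustrative choices, not part of the lecture.

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm.

    X : (N, d+1) array of augmented inputs (first column is x0 = 1).
    y : (N,) array of labels in {-1, +1}.
    Returns w with Ein = 0 if a separator is found within max_iters updates.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        preds = np.where(X @ w >= 0, 1, -1)     # sign(w^t x)
        wrong = np.flatnonzero(preds != y)      # indices of misclassified points
        if wrong.size == 0:                     # separable: Ein = 0
            break
        i = wrong[0]
        w = w + y[i] * X[i]                     # w(t+1) = w(t) + y* x*
    return w
```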
Non-Separable Data
The Pocket Algorithm
Minimizing Ein is a hard combinatorial problem.
The Pocket Algorithm
– Run PLA
– At each step keep the best Ein (and w) so far.
(It's not rocket science, but it works.)
(Other approaches: linear regression, logistic regression, linear programming . . . )
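As a concrete illustration, here is a hedged Python sketch of the pocket variant: run PLA and, after each update, keep the weights with the best Ein seen so far. Names and the iteration cap are illustrative, not from the lecture.

```python
import numpy as np

def pocket(X, y, max_iters=1000):
    """Pocket algorithm: PLA updates, keeping the best-Ein weights seen so far."""
    def e_in(w):
        # fraction of misclassified points under sign(w^t x)
        return np.mean(np.where(X @ w >= 0, 1, -1) != y)

    w = np.zeros(X.shape[1])
    w_pocket, best = w.copy(), e_in(w)
    for _ in range(max_iters):
        wrong = np.flatnonzero(np.where(X @ w >= 0, 1, -1) != y)
        if wrong.size == 0:                     # separable: nothing left to fix
            return w
        i = wrong[0]
        w = w + y[i] * X[i]                     # one PLA update
        if e_in(w) < best:                      # pocket step: keep the best w
            best, w_pocket = e_in(w), w.copy()
    return w_pocket
```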
Digits Data
Each digit is a 16 × 16 image.
Digits Data
Each digit is a 16 × 16 image.
[-1 -1 -1 -1 -1 -1 -1 -0.63 0.86 -0.17 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.99 0.3 1 0.31 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.41 1 0.99 -0.57 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.68 0.83 1
0.56 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.94 0.54 1 0.78 -0.72 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0.1 1 0.92 -0.44 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.26 0.95 1 -0.16 -1 -1 -1 -0.99 -0.71 -0.83 -1
-1 -1 -1 -1 -0.8 0.91 1 0.3 -0.96 -1 -1 -0.55 0.49 1 0.88 0.09 -1 -1 -1 -1 0.28 1 0.88 -0.8 -1 -0.9 0.14 0.97 1 1 1 0.99 -0.74 -1 -1 -0.95 0.84 1 0.32 -1 -1 0.35 1 0.65 -0.10 -0.18 1 0.98
-0.72 -1 -1 -0.63 1 1 0.07 -0.92 0.11 0.96 0.30 -0.88 -1 -0.07 1 0.64 -0.99 -1 -1 -0.67 1 1 0.75 0.34 1 0.70 -0.94 -1 -1 0.54 1 0.02 -1 -1 -1 -0.90 0.79 1 1 1 1 0.53 0.18 0.81 0.83 0.97
0.86 -0.63 -1 -1 -1 -1 -0.45 0.82 1 1 1 1 1 1 1 1 0.13 -1 -1 -1 -1 -1 -1 -0.48 0.81 1 1 1 1 1 1 0.21 -0.94 -1 -1 -1 -1 -1 -1 -1 -0.97 -0.42 0.30 0.82 1 0.48 -0.47 -0.99 -1 -1 -1 -1]
x = (1, x1, · · · , x256)    ← input
w = (w0, w1, · · · , w256)   ← linear model

dvc = 257
Intensity and Symmetry Features
feature: an important property of the input that you think is useful for classification.
(dictionary.com: a prominent or conspicuous part or characteristic)
x = (1, x1, x2)    ← input
w = (w0, w1, w2)   ← linear model

dvc = 3
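The lecture does not spell out formulas for the two features, so the Python sketch below uses plausible, illustrative definitions: average pixel intensity, and (negated) asymmetry under a left-right flip. Treat both as assumptions.

```python
import numpy as np

def intensity_symmetry(img):
    """Map a 16x16 grayscale digit (pixels in [-1, 1]) to (1, intensity, symmetry).

    Illustrative feature definitions, not taken from the lecture.
    """
    img = np.asarray(img, dtype=float).reshape(16, 16)
    intensity = img.mean()                           # average pixel intensity
    symmetry = -np.abs(img - np.fliplr(img)).mean()  # asymmetry under a left-right flip, negated
    return np.array([1.0, intensity, symmetry])      # augmented 3-d input
```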
PLA on Digits Data
[Figure: PLA on the digits data. Error (log scale, 1% to 50%) versus iteration number t from 0 to 1000, showing both Ein and Eout.]
Pocket on Digits Data
[Figures: PLA (left) versus Pocket (right) on the digits data. Error (log scale, 1% to 50%) versus iteration number t from 0 to 1000, showing Ein and Eout for each.]
Linear Regression
age:            32 years
gender:         male
salary:         40,000
debt:           26,000
years in job:   1 year
years at home:  3 years
...
Classification: Approve/Deny
Regression: Credit Line (dollar amount)
regression ≡ y ∈ R

h(x) = Σ_{i=0}^{d} w_i x_i = w^t x
Least Squares Linear Regression

[Figures: data and the linear fit h(x) = w^t x in one dimension (x) and two dimensions (x1, x2); y = f(x) + ε, a noisy target P(y|x).]

in-sample error:       Ein(h) = (1/N) Σ_{n=1}^{N} (h(xn) − yn)²

out-of-sample error:   Eout(h) = E_x[(h(x) − y)²]
Using Matrices for Linear Regression


X = [—x1—; —x2—; · · · ; —xN—]          ← data matrix, N × (d + 1)

y = [y1; y2; · · · ; yN]                ← target vector

ŷ = [ŷ1; ŷ2; · · · ; ŷN] = [w^t x1; w^t x2; · · · ; w^t xN] = Xw     ← in-sample predictions

Ein(w) = (1/N) Σ_{n=1}^{N} (ŷn − yn)²
       = (1/N) ||ŷ − y||₂²
       = (1/N) ||Xw − y||₂²
       = (1/N) (w^t X^t X w − 2 w^t X^t y + y^t y)
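A quick numerical sanity check of this algebra, on made-up data chosen purely for illustration: the averaged squared residuals, the norm form, and the expanded quadratic form all give the same Ein.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])   # N x (d+1), x0 = 1
y = rng.normal(size=50)
w = rng.normal(size=4)

N = len(y)
e1 = np.mean((X @ w - y) ** 2)                                  # (1/N) sum (yhat_n - y_n)^2
e2 = np.linalg.norm(X @ w - y) ** 2 / N                         # (1/N) ||Xw - y||^2
e3 = (w @ X.T @ X @ w - 2 * w @ X.T @ y + y @ y) / N            # expanded quadratic form
assert np.allclose([e1, e2], e3)                                # all three agree
```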
Linear Regression Solution
Ein(w) = (1/N) (w^t X^t X w − 2 w^t X^t y + y^t y)

Vector Calculus: To minimize Ein(w), set ∇w Ein(w) = 0.

∇w(w^t A w) = (A + A^t) w,     ∇w(w^t b) = b.

With A = X^t X and b = X^t y:

∇w Ein(w) = (2/N) (X^t X w − X^t y)

Setting ∇w Ein(w) = 0:

X^t X w = X^t y              ←− normal equations
wlin = (X^t X)^{-1} X^t y    ←− when X^t X is invertible
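A small numerical check (made-up data, for illustration only): the wlin obtained from the normal equations matches numpy's least-squares solver and satisfies X^t X w = X^t y.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = rng.normal(size=50)

w_lin = np.linalg.solve(X.T @ X, X.T @ y)          # (X^t X)^{-1} X^t y, X^t X invertible here
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]     # numpy's least-squares solution
assert np.allclose(w_lin, w_lstsq)
assert np.allclose(X.T @ X @ w_lin, X.T @ y)       # the normal equations hold
```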
Linear Regression Algorithm
Linear Regression Algorithm:
1. Construct the matrix X and the vector y from the data set
(x1, y1), · · · , (xN , yN ), where each x includes the x0 = 1 coordinate,




X = [—x1—; —x2—; · · · ; —xN—]   (data matrix),        y = [y1; y2; · · · ; yN]   (target vector).
2. Compute the pseudo-inverse X† of the matrix X. If X^t X is invertible,
   X† = (X^t X)^{-1} X^t.
3. Return wlin = X† y.
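A minimal Python sketch of these three steps (the toy data at the bottom is an assumption for demonstration, not the lecture's credit data):

```python
import numpy as np

def linear_regression(X, y):
    """One-step least squares: w_lin = X† y.

    X : (N, d+1) data matrix whose first column is all ones (x0 = 1).
    y : (N,) target vector.
    np.linalg.pinv computes the pseudo-inverse, covering the case
    where X^t X is not invertible as well.
    """
    return np.linalg.pinv(X) @ y

# Illustrative usage on made-up data.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=100)
w_lin = linear_regression(X, y)
```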
Generalization
The linear regression algorithm gets the smallest possible Ein in one step.
Generalization is also good.
There are other bounds, for example:
d
E[Eout(h)] = E[Ein(h)] + O
N
One can obtain a regression version of dvc.

[Figure: expected error versus number of data points N; the learning curves for Eout and Ein approach the noise level σ², with a gap governed by d + 1.]
Linear Regression for Classification
Linear regression can learn any real-valued target function.
For example yn = ±1.
(±1 are real values!)
Use linear regression to get w with w^t xn ≈ yn = ±1.
Then sign(w^t xn) will likely agree with yn = ±1.
These can be good initial weights for classification.

Example. Classifying "1" versus "not 1"   (multiclass → 2 class).
[Figure: digits data in the (Average Intensity, Symmetry) plane with the linear regression decision boundary.]
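A hedged Python sketch of this idea (synthetic ±1 data standing in for the digits features; names are illustrative): fit by least squares on the ±1 targets, classify with the sign, and use the resulting w as a starting point for PLA or the pocket algorithm.

```python
import numpy as np

# Synthetic +/-1 data standing in for the (1, intensity, symmetry) inputs.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = np.where(X @ np.array([0.2, 1.0, -1.0]) + 0.3 * rng.normal(size=200) >= 0, 1, -1)

w = np.linalg.pinv(X) @ y               # linear regression on +/-1 targets
y_hat = np.where(X @ w >= 0, 1, -1)     # classify with sign(w^t x)
e_in = np.mean(y_hat != y)              # usually small; w is a good PLA/pocket start
```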