Learning From Data
Lecture 8
Linear Classification and Regression
Linear Classification
Linear Regression
M. Magdon-Ismail
CSCI 4100/6100
recap: Approximation Versus Generalization
VC Analysis
Eout ≤ Ein + Ω(dvc)
1. Did you fit your data well enough (Ein)?
2. Are you confident your Ein will generalize to Eout?
Bias-Variance Analysis
Eout = bias + var
1. How well can you fit your data (bias)?
2. How close to that best fit can you get (var)?
[Figure: Error vs. model complexity (VC dimension dvc), showing the in-sample and out-of-sample errors and the best complexity d∗vc.]
The VC Insurance Co.

The VC bound is like a warranty. If you look at your data before choosing H, your warranty is void.
[Figures: fitting sin(x) with the average hypothesis ḡ(x), for two models.
H0: bias = 0.50, var = 0.25, Eout = 0.75 ← the winner
H1: bias = 0.21, var = 1.69, Eout = 1.90]
recap: Decomposing The Learning Curve
VC Analysis: Eout = in-sample error (Ein) + generalization error.
−→ Pick H that can generalize and has a good chance to fit the data.

Bias-Variance Analysis: Eout = bias + variance.
−→ Pick (H, A) to approximate f and not behave wildly after seeing the data.

[Figures: the two decompositions of the learning curve, Expected Error vs. number of data points N.]
Three Learning Problems
Credit Analysis:

  Approve or Deny         −→  Classification         (y = ±1)
  Amount of Credit        −→  Regression             (y ∈ R)
  Probability of Default  −→  Logistic Regression    (y ∈ [0, 1])
• Linear models are perhaps the most fundamental models.
• The linear model is the first model to try.
The Linear Signal
s = w^t x

  linear in x: gives the line/hyperplane separator
  linear in w: makes the algorithms work

x is the augmented vector: x ∈ {1} × R^d
The Linear Signal
The same signal s = w^t x serves all three problems:

  sign(w^t x) ∈ {−1, +1}    (classification)
  w^t x ∈ R                 (regression)
  θ(w^t x) ∈ [0, 1]         (logistic regression, y = θ(s))
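A minimal sketch of the three uses (assuming NumPy, and taking θ to be the logistic function e^s/(1 + e^s); the helper names are ours, not the lecture's):

    import numpy as np

    def signal(w, x):
        # the linear signal s = w^t x; x is the augmented vector (1, x1, ..., xd)
        return np.dot(w, x)

    def classify(w, x):
        return np.sign(signal(w, x))       # output in {-1, +1}

    def regress(w, x):
        return signal(w, x)                # output in R

    def logistic(w, x):
        s = signal(w, x)
        return 1.0 / (1.0 + np.exp(-s))    # theta(s), output in [0, 1]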
Linear Classification
Hlin = { h(x) = sign(w^t x) }
1. Ein ≈ Eout because dvc = d + 1:

   Eout(h) ≤ Ein(h) + O( √( (d/N) log N ) ).
2. If the data is linearly separable, PLA will find a separator =⇒ Ein = 0.
   w(t + 1) = w(t) + x∗y∗    ((x∗, y∗) is a misclassified data point)
Ein = 0 =⇒ Eout ≈ 0
(f is well approximated by a linear fit).
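A sketch of the update loop (assuming NumPy, with X the N × (d + 1) matrix of augmented inputs and y the ±1 labels; the function name is ours):

    import numpy as np

    def pla(X, y, max_iters=1000):
        # perceptron learning algorithm on linearly separable data
        w = np.zeros(X.shape[1])
        for _ in range(max_iters):
            misclassified = np.where(np.sign(X @ w) != y)[0]
            if len(misclassified) == 0:
                break                     # found a separator: Ein = 0
            n = misclassified[0]          # any misclassified point will do
            w = w + y[n] * X[n]           # w(t+1) = w(t) + y* x*
        return w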
What if the data is not separable (Ein = 0 is not possible)? −→ pocket algorithm
How to ensure Ein ≈ 0 is possible? −→ select good features
Non-Separable Data
The Pocket Algorithm
Minimizing Ein is a hard combinatorial problem.
The Pocket Algorithm
– Run PLA
– At each step keep the best Ein (and w) so far.
(It's not rocket science, but it works.)
(Other approaches: linear regression, logistic regression, linear programming . . . )
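A minimal pocket sketch under the same assumptions as the PLA sketch above (augmented X, labels y ∈ {−1, +1}):

    import numpy as np

    def pocket(X, y, max_iters=1000):
        # run PLA updates, but keep the best w (lowest Ein) seen so far
        w = np.zeros(X.shape[1])
        best_w, best_Ein = w.copy(), 1.0
        for _ in range(max_iters):
            misclassified = np.where(np.sign(X @ w) != y)[0]
            Ein = len(misclassified) / len(y)
            if Ein < best_Ein:
                best_w, best_Ein = w.copy(), Ein   # new pocket weights
            if Ein == 0.0:
                break
            n = misclassified[0]
            w = w + y[n] * X[n]                    # ordinary PLA step
        return best_w, best_Ein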
Digits Data
Each digit is a 16 × 16 image.
Digits Data
Each digit is a 16 × 16 image.
[-1 -1 -1 -1 -1 -1 -1 -0.63 0.86 -0.17 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.99 0.3 1 0.31 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.41 1 0.99 -0.57 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.68 0.83 1
0.56 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.94 0.54 1 0.78 -0.72 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0.1 1 0.92 -0.44 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.26 0.95 1 -0.16 -1 -1 -1 -0.99 -0.71 -0.83 -1
-1 -1 -1 -1 -0.8 0.91 1 0.3 -0.96 -1 -1 -0.55 0.49 1 0.88 0.09 -1 -1 -1 -1 0.28 1 0.88 -0.8 -1 -0.9 0.14 0.97 1 1 1 0.99 -0.74 -1 -1 -0.95 0.84 1 0.32 -1 -1 0.35 1 0.65 -0.10 -0.18 1 0.98
-0.72 -1 -1 -0.63 1 1 0.07 -0.92 0.11 0.96 0.30 -0.88 -1 -0.07 1 0.64 -0.99 -1 -1 -0.67 1 1 0.75 0.34 1 0.70 -0.94 -1 -1 0.54 1 0.02 -1 -1 -1 -0.90 0.79 1 1 1 1 0.53 0.18 0.81 0.83 0.97
0.86 -0.63 -1 -1 -1 -1 -0.45 0.82 1 1 1 1 1 1 1 1 0.13 -1 -1 -1 -1 -1 -1 -0.48 0.81 1 1 1 1 1 1 0.21 -0.94 -1 -1 -1 -1 -1 -1 -1 -0.97 -0.42 0.30 0.82 1 0.48 -0.47 -0.99 -1 -1 -1 -1]
x = (1, x1, · · · , x256)     ← input
w = (w0, w1, · · · , w256)    ← linear model
dvc = 257
Intensity and Symmetry Features
feature: an important property of the input that you think is useful for classification.
(dictionary.com: a prominent or conspicuous part or characteristic)
x = (1, x1, x2)     ← input
w = (w0, w1, w2)    ← linear model
dvc = 3
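A sketch of the two features, assuming each digit arrives as a 16 × 16 NumPy array of grayscale values in [−1, 1]; these particular definitions (average intensity, flip-difference symmetry) are common choices and are not pinned down by the slide:

    import numpy as np

    def intensity(img):
        # average pixel value; "1"s use less ink than most digits
        return img.mean()

    def symmetry(img):
        # negative mean absolute difference between the image and its
        # left-right mirror; symmetric digits score near 0
        return -np.abs(img - np.fliplr(img)).mean()

    def to_input(img):
        # the augmented input x = (1, x1, x2)
        return np.array([1.0, intensity(img), symmetry(img)])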
PLA on Digits Data
[Figure: PLA on the digits data. Error (log scale, 1% to 50%) vs. iteration number t, from 0 to 1000, showing Ein and Eout.]
Pocket on Digits Data
[Figures: PLA vs. Pocket on the digits data. Error (log scale, 1% to 50%) vs. iteration number t, from 0 to 1000, showing Ein and Eout for each algorithm.]
Linear Regression
  age             32 years
  gender          male
  salary          40,000
  debt            26,000
  years in job    1 year
  years at home   3 years
  ...
Classification: Approve/Deny
Regression: Credit Line (dollar amount)
regression ≡ y ∈ R

h(x) = Σ_{i=0}^{d} wi xi = w^t x
Least Squares Linear Regression

y = f(x) + ε    ←− noisy target P(y|x)

h(x) = w^t x

in-sample error:      Ein(h) = (1/N) Σ_{n=1}^{N} (h(xn) − yn)²

out-of-sample error:  Eout(h) = E_x[(h(x) − y)²]

[Figures: the least squares fit in one dimension (a line through the (x, y) data) and in two dimensions (a plane through the (x1, x2, y) data).]
Using Matrices for Linear Regression
X = [ —x1— ; —x2— ; · · · ; —xN— ]    ← data matrix, N × (d + 1)

y = (y1, y2, · · · , yN)^t    ← target vector

ŷ = (ŷ1, ŷ2, · · · , ŷN)^t = (w^t x1, w^t x2, · · · , w^t xN)^t = Xw    ← in-sample predictions

Ein(w) = (1/N) Σ_{n=1}^{N} (ŷn − yn)²
       = (1/N) ||ŷ − y||₂²
       = (1/N) ||Xw − y||₂²
       = (1/N) (w^t X^t X w − 2 w^t X^t y + y^t y)
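A quick numeric check of the expanded matrix form (random data; NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 100, 5
    X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])   # N x (d+1), first column 1s
    y = rng.standard_normal(N)
    w = rng.standard_normal(d + 1)

    Ein_residual = np.mean((X @ w - y) ** 2)                        # (1/N) ||Xw - y||^2
    Ein_expanded = (w @ (X.T @ X) @ w - 2 * w @ (X.T @ y) + y @ y) / N
    assert np.isclose(Ein_residual, Ein_expanded)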
Linear Regression Solution
Ein(w) = (1/N) (w^t X^t X w − 2 w^t X^t y + y^t y)

Vector Calculus: To minimize Ein(w), set ∇w Ein(w) = 0. Using

∇w(w^t A w) = (A + A^t) w,    ∇w(w^t b) = b,

with A = X^t X and b = X^t y:

∇w Ein(w) = (2/N) (X^t X w − X^t y).

Setting ∇w Ein(w) = 0:

X^t X w = X^t y    ←− normal equations

wlin = (X^t X)^{−1} X^t y    ←− when X^t X is invertible
Linear Regression Algorithm
Linear Regression Algorithm:
1. Construct the matrix X and the vector y from the data set
   (x1, y1), · · · , (xN , yN ), where each x includes the x0 = 1 coordinate:

   X = [ —x1— ; —x2— ; · · · ; —xN— ]  (data matrix),    y = (y1, y2, · · · , yN)^t  (target vector).
2. Compute the pseudo-inverse X† of the matrix X. If X^t X is invertible,

   X† = (X^t X)^{−1} X^t.
3. Return wlin = X†y.
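In NumPy the whole algorithm is a few lines (a sketch; np.linalg.pinv computes X† and also covers the case where X^t X is not invertible):

    import numpy as np

    def linear_regression(X, y):
        # X: N x (d+1) data matrix with first column all 1s; y: length-N target vector
        X_dagger = np.linalg.pinv(X)    # pseudo-inverse; equals (X^t X)^{-1} X^t when invertible
        return X_dagger @ y             # w_lin = X† y

Predictions on the (augmented) inputs are then X @ w_lin.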
Generalization
The linear regression algorithm gets the smallest possible Ein in one step.
Generalization is also good.
There are other bounds, for example:

E[Eout(h)] = E[Ein(h)] + O(d/N).
One can obtain a regression version of dvc.

[Figure: learning curves for linear regression, Expected Error vs. number of data points N; Ein and Eout converge to the noise level σ², with the gap governed by d + 1.]
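A small experiment that traces this learning curve (a sketch under an assumed noisy linear target y = w∗^t x + noise of variance σ²; the setup is ours, not the slide's):

    import numpy as np

    rng = np.random.default_rng(0)
    d, sigma, trials = 5, 0.5, 2000

    for N in [10, 20, 40, 80, 160]:
        Ein_avg, Eout_avg = 0.0, 0.0
        for _ in range(trials):
            w_star = rng.standard_normal(d + 1)
            X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
            y = X @ w_star + sigma * rng.standard_normal(N)
            w = np.linalg.pinv(X) @ y                    # w_lin
            Ein_avg += np.mean((X @ w - y) ** 2) / trials
            # a fresh test set estimates Eout
            X2 = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
            y2 = X2 @ w_star + sigma * rng.standard_normal(N)
            Eout_avg += np.mean((X2 @ w - y2) ** 2) / trials
        print(N, round(Ein_avg, 3), round(Eout_avg, 3))  # both approach sigma^2 = 0.25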
Linear Regression for Classification
Linear regression can learn any real-valued target function.
For example yn = ±1.
(±1 are real values!)
Use linear regression to get w with w^t xn ≈ yn = ±1.
Then sign(w^t xn) will likely agree with yn = ±1.
Example. Classifying "1" versus "not 1" (multiclass → 2 class).

[Figure: the resulting decision boundary in the (Average Intensity, Symmetry) feature plane.]

These can be good initial weights for classification.
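A sketch of regression-for-classification under the earlier assumptions (augmented X, labels y ∈ {−1, +1}); the returned w can also seed pocket/PLA:

    import numpy as np

    def regression_for_classification(X, y):
        # treat the +/-1 labels as real-valued targets and fit by least squares
        w = np.linalg.pinv(X) @ y
        Ein = np.mean(np.sign(X @ w) != y)   # classification error of sign(w^t x)
        return w, Ein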