The Elements of Statistical Learning
Thomas Lengauer, Christian Merkwirth
using the book by Hastie, Tibshirani, Friedman
1
The Elements of Statistical Learning
- Prerequisites:
  - Vordiplom (intermediate diploma) in mathematics or computer science, or equivalent
  - Linear algebra
  - Basic knowledge of statistics
- Time:
  - Lecture: Wed 11-13, HS024, Building 46 (MPI)
  - Tutorial: Fri 14-16, Rm. 15, Building 45 (CS Dep.), biweekly, starting Oct. 31
- Credits:
  - Übungsschein (course certificate), based on:
    - At least 50% of the points in the homework
    - Final exam, probably oral
- Good for:
  - Bioinformatics
  - CS Theory or Applications
2
1. Introduction
3
Applications of Statistical Learning
- Medical: Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack.
  Data: demographic, diet, and clinical measurements.
- Business/Economics: Predict the price of a stock 6 months from now.
  Data: company performance, economic data.
- Vision: Identify hand-written ZIP codes.
  Data: images of hand-written digits.
- Medical: Estimate the amount of glucose in the blood of a diabetic.
  Data: infrared absorption spectrum of a blood sample.
- Medical: Identify risk factors for prostate cancer.
  Data: clinical and demographic variables.
4
Types of Data
- Two basically different types of data:
  - Quantitative (numerical), e.g. a stock price
  - Categorical (discrete, often binary), e.g. cancer/no cancer
- Data are predicted
  - on the basis of a set of features (e.g. diet or clinical measurements),
  - from a set of (observed) training data on these features,
  - for a set of objects (e.g. people).
- Inputs of these problems are also called predictors or independent variables.
- Outputs are also called responses or dependent variables.
- The prediction model is called a learner or estimator (Schätzer).
  - Supervised learning: outcomes are observed for the training features and are learned from.
  - Unsupervised learning: no outcome values are available.
5
Example 1: Email Spam
- Data: 4601 email messages, each labeled email (+) or spam (-).
- Features: the relative frequencies of the 57 most commonly occurring words and punctuation marks in the message.
- Prediction goal: label future messages email (+) or spam (-).
- Supervised learning problem on categorical data: a classification problem.

Average relative frequency of selected words in the two classes (the words with the largest difference between spam and email are shown):

    word      spam   email
    george    0.00   1.27
    you       2.26   1.27
    your      1.38   0.44
    hp        0.02   0.90
    free      0.52   0.07
    hpl       0.01   0.43
    !         0.51   0.11
    our       0.51   0.18
    re        0.13   0.42
    edu       0.01   0.29
    remove    0.28   0.01
6
Example 1: Email Spam
- Examples of rules for prediction (a small sketch follows below):
  - If (%george < 0.6) and (%you > 1.5) then spam else email
  - If (0.2·%you − 0.3·%george) > 0 then spam else email
- Tolerance to errors is asymmetric:
  - Tolerant of letting some spam through (false positives, with email coded as +)
  - No tolerance for throwing out genuine email (false negatives)

(Table of word frequencies as on the previous slide.)
7
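The second rule above is just a linear score compared with a threshold. A minimal illustrative sketch in Python (not the course's code; the crude word-percentage extraction and the example messages are assumptions made here for illustration):

    def word_percentages(message, vocabulary):
        """Relative frequency (in %) of each vocabulary word in the message."""
        words = [w.strip(".,!?") for w in message.lower().split()]
        return [100.0 * words.count(w) / max(len(words), 1) for w in vocabulary]

    def classify(message):
        """Second rule from the slide: 0.2*%you - 0.3*%george > 0 -> spam, else email."""
        pct_you, pct_george = word_percentages(message, ["you", "george"])
        return "spam" if 0.2 * pct_you - 0.3 * pct_george > 0 else "email"

    print(classify("You can claim your free prize if you reply now"))   # spam
    print(classify("George, are you coming to the meeting tomorrow?"))  # email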
Example 2: Prostate Cancer
- Data (by Stamey et al. 1989). Given:
  - lcavol: log cancer volume
  - lweight: log prostate weight
  - age
  - lbph: log amount of benign prostatic hyperplasia
  - svi: seminal vesicle invasion
  - lcp: log capsular penetration
  - gleason: Gleason score
  - pgg45: percent of Gleason scores 4 or 5
- Predict: the PSA (prostate specific antigen) level.
- Supervised learning problem on quantitative data: a regression problem.
8
Example 2: Prostate Cancer
- The figure shows pairwise scatter plots of the input data, i.e., the data projected onto pairs of variables.
- The first row plots the output variable against each input variable.
- The variables svi and gleason are categorical.
9
Example 3: Recognition of Handwritten Digits
- Data: images of single digits, 16x16 8-bit gray-scale, normalized for size and orientation.
- Task: classify newly written digits.
- Non-binary (10-class) classification problem.
- Low tolerance for misclassifications.
10
Example 4: DNA Expression Microarrays
- Data:
  - Color intensities signifying the abundance levels of mRNA for a number of genes (6830) in several (64) different cell states (samples).
  - Red: over-expressed gene; green: under-expressed gene; black: normally expressed gene (according to some predefined background).
- Predict:
  - Which genes show similar expression over the samples?
  - Which samples show similar expression over the genes? (unsupervised learning problems)
  - Which genes are highly over- or under-expressed in certain cancers? (supervised learning problem)

(Figure: heat map of the expression matrix, genes x samples.)
11
genes
2. Overview of Supervised Learning
12
2.2 Notation
- Inputs
  - X; X_j denotes the j-th component of the vector X
  - p = #inputs, N = #observations
  - The matrix of all observed inputs is written in bold: X
  - Vectors are written in bold, x_i, if they have N components and thus summarize all observations on variable X_i
  - Vectors are assumed to be column vectors
  - Discrete inputs are often described by characteristic (dummy) variables
- Outputs
  - quantitative: Y
  - qualitative: G (for group)
- Observed values are written in lower case
  - The i-th observed value of X is x_i, which can be a scalar or a vector
- Main question of this lecture: given the value of an input vector X, make a good prediction Ŷ of the output Y
  - The prediction should be of the same kind as the desired output (categorical vs. quantitative)
  - Exception: binary outputs can be approximated by values in [0,1], which can be interpreted as probabilities
  - This generalizes to K-level outputs
13
2.3.1 Simple Approach 1: Least-Squares
- Given inputs X = (X_1, X_2, ..., X_p)
- Predict the output Y via the linear model

  \hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j

  where \hat\beta_0 is the bias (intercept).
- Including the constant variable 1 in X, this becomes

  \hat{Y} = X^T \hat\beta

  - Here Y is a scalar (if Y is a K-vector, then \beta is a p x K matrix of coefficients).
- In the (p+1)-dimensional input-output space, (X, \hat{Y}) represents a hyperplane.
  - If the constant is included in X, then the hyperplane goes through the origin.
- f(X) = X^T \beta is a linear function; its gradient f'(X) = \beta is a vector that points in the steepest uphill direction.
14
2.3.1 Simple Approach 1: Least-Squares
(Repeats the geometric interpretation of the linear fit from the previous slide: f(X) = X^T \beta is a linear function and its gradient \beta points in the steepest uphill direction.)
15
2.3.1 Simple Approach 1: Least-Squares
- Training procedure: the method of least squares, with N = #observations.
- Minimize the residual sum of squares

  RSS(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2

  or, equivalently, in matrix form

  RSS(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)

- This quadratic function always has a global minimum, but it may not be unique.
- Differentiating w.r.t. \beta yields the normal equations

  \mathbf{X}^T (\mathbf{y} - \mathbf{X}\beta) = 0

- If \mathbf{X}^T\mathbf{X} is nonsingular, the unique solution is

  \hat\beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

- The fitted value at an input x is ŷ(x) = x^T \hat\beta.
- The entire fitted surface is characterized by \hat\beta (a short computational sketch follows below).
16
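A minimal NumPy sketch of the least-squares fit via the normal equations (illustrative only; the simulated data and variable names are assumptions, and np.linalg.solve is used instead of forming the inverse explicitly):

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data: N observations, p inputs, plus a constant column for the intercept.
    N, p = 100, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
    beta_true = np.array([0.5, 1.0, -2.0, 0.0])
    y = X @ beta_true + rng.normal(scale=0.1, size=N)

    # Normal equations: X^T X beta = X^T y  (assumes X^T X is nonsingular).
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Fitted value at a new input x (with the leading 1 for the constant).
    x_new = np.array([1.0, 0.2, -0.3, 1.5])
    y_hat = x_new @ beta_hat
    print(beta_hat, y_hat)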
2.3.1 Simple Approach 1: Least-Squares
- Example: data on two inputs, X1 and X2.
- The output variable takes the values GREEN (coded 0) and RED (coded 1); 100 points per class.
- The decision boundary of the linear fit is defined by x^T \hat\beta = 0.5: classify RED where x^T \hat\beta > 0.5 and GREEN where x^T \hat\beta < 0.5.
- Easy to compute, but many misclassifications if the problem is not linear.

(Figure: the two classes in the (X1, X2) plane with the linear decision boundary.)
17
2.3.2 Simple Approach 2: Nearest Neighbors
15-nearest-neighbor averaging
- Uses the observations in the training set closest to the given input:

  \hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i

- N_k(x) is the set of the k points in the training sample closest to x.
- The prediction averages the outcomes of the k closest training points.
- Decision boundary: Ŷ = 0.5.
- Fewer misclassifications than the linear fit.

(Figure: the same data with the 15-nearest-neighbor decision boundary in the (X1, X2) plane.)
18
2.3.2 Simple Approach 2: Nearest Neighbors
1-nearest-neighbor averaging
- The same estimate

  \hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i

  with k = 1, i.e. each point is assigned the outcome of its closest training point.
- No misclassifications on the training data: overtraining (overfitting).

(Figure: the 1-nearest-neighbor decision boundary, which follows the training points exactly. A small sketch of k-NN averaging follows below.)
19
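A minimal sketch of k-nearest-neighbor averaging (illustrative; the Euclidean metric and the 0.5 cutoff follow the slides, the toy data and names are assumptions):

    import numpy as np

    def knn_predict(x, X_train, y_train, k=15):
        """Average the outcomes of the k training points closest to x (Euclidean distance)."""
        dist = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dist)[:k]
        return y_train[nearest].mean()

    # Toy data: two inputs, outputs coded 0 (GREEN) and 1 (RED).
    rng = np.random.default_rng(1)
    X_train = rng.normal(size=(200, 2))
    y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(float)

    x0 = np.array([0.3, 0.4])
    print(knn_predict(x0, X_train, y_train, k=15))          # averaged outcome in [0, 1]
    print("RED" if knn_predict(x0, X_train, y_train) > 0.5 else "GREEN")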
2.3.3 Comparison of the Two Approaches
- Least squares
  - p parameters (p = #features)
  - Low variance (robust)
  - High bias (rests on strong assumptions)
  - Good for Scenario 1: the training data in each class are generated from a two-dimensional Gaussian; the two Gaussians are independent and have different means.
- k-nearest neighbors
  - Apparently one parameter, k; in fact effectively N/k parameters (N = #observations)
  - High variance (not robust)
  - Low bias (rests only on weak assumptions)
  - Good for Scenario 2: the training data in each class come from a mixture of 10 low-variance Gaussians, with means themselves distributed as a Gaussian (first choose the Gaussian, then draw the point from it).
24
2.3.3 Origin of the Data
- The simulated data are a mixture of the two scenarios:
  - Step 1: Generate 10 means m_k from the bivariate Gaussian distribution N((1,0)^T, I) and label this class GREEN.
  - Step 2: Similarly, generate 10 means from the bivariate Gaussian distribution N((0,1)^T, I) and label this class RED.
  - Step 3: For each class, generate 100 observations as follows: for each observation, pick an m_k at random with probability 1/10, then generate a point according to N(m_k, I/5).
- This is similar to Scenario 2.
- The figure shows the result of 10,000 classifications from this model (a generator sketch follows below).
25
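A minimal sketch of the data-generating process described above (illustrative; the function and variable names are mine):

    import numpy as np

    rng = np.random.default_rng(2)

    def make_class(center, n_means=10, n_obs=100):
        """Step 1/2: draw n_means centers from N(center, I); Step 3: for each observation
        pick a center uniformly at random and draw the point from N(m_k, I/5)."""
        means = rng.multivariate_normal(center, np.eye(2), size=n_means)
        picks = rng.integers(0, n_means, size=n_obs)
        return means[picks] + rng.multivariate_normal(np.zeros(2), np.eye(2) / 5, size=n_obs)

    green = make_class(np.array([1.0, 0.0]))   # class GREEN, coded 0
    red = make_class(np.array([0.0, 1.0]))     # class RED, coded 1
    X = np.vstack([green, red])
    y = np.concatenate([np.zeros(len(green)), np.ones(len(red))])
    print(X.shape, y.mean())                   # (200, 2), 0.5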
2.3.3 Variants of These Simple Methods
- Kernel methods: use weights that decrease smoothly to zero with distance from the target point, rather than the 0/1 cutoff used in nearest-neighbor methods.
- In high-dimensional spaces, the distance kernels can be modified to emphasize some variables more than others.
- Local regression fits linear models (by least squares) locally, rather than fitting constants locally.
- Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
- Projection pursuit and neural network models are sums of nonlinearly transformed linear models.
26
2.4 Statistical Decision Theory
- Random input vector X \in \mathbb{R}^p; random output variable Y \in \mathbb{R}; joint distribution Pr(X, Y).
- We are looking for a function f(x) for predicting Y given the values of the input X.
- The loss function L(Y, f(X)) penalizes prediction errors; squared error loss:

  L(Y, f(X)) = (Y - f(X))^2

- Expected prediction error (EPE):

  EPE(f) = E(Y - f(X))^2 = \int (y - f(x))^2 \Pr(dx, dy)

- Since Pr(X, Y) = Pr(Y|X) Pr(X), EPE can also be written as

  EPE(f) = E_X E_{Y|X}([Y - f(X)]^2 \mid X)

- Thus it suffices to minimize EPE pointwise:

  f(x) = \arg\min_c E_{Y|X}([Y - c]^2 \mid X = x)

- The solution is the regression function: f(x) = E(Y | X = x).
27
2.4 Statistical Decision Theory
- Nearest neighbor methods try to implement this recipe directly:

  \hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))

- Several approximations are involved:
  - Since there are typically no duplicate observations at x, the expectation at a point is replaced by an expectation over a neighborhood.
  - The expectation is approximated by averaging over observations.
  - With increasing k and number of observations, the average gets (provably) more stable.
- But often we do not have large samples:
  - By making assumptions (e.g. linearity) we can greatly reduce the number of required observations.
  - With increasing dimension the neighborhood grows exponentially, so the rate of convergence to the true estimator (with increasing k) decreases.
- Regression function: f(x) = E(Y | X = x)
28
2.4 Statistical Decision Theory
- Linear regression is a model-based approach: it assumes that the regression function is approximately linear,

  f(x) \approx x^T \beta

- Plugging this into the EPE and differentiating w.r.t. \beta, we can solve for \beta:

  EPE(f) = E(Y - X^T\beta)^2, \qquad \beta = [E(X X^T)]^{-1} E(X Y)

- Again, linear regression replaces the theoretical expectation by averaging over the observed data:

  RSS(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2, \qquad \hat\beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

- Summary:
  - Least squares assumes that f(x) is well approximated by a globally linear function.
  - Nearest neighbors assumes that f(x) is well approximated by a locally constant function.
- Regression function: f(x) = E(Y | X = x)
29
2.4 Statistical Decision Theory
- Additional methods in this book are often model-based but more flexible than the linear model.
- Example: additive models

  f(X) = \sum_{j=1}^{p} f_j(X_j)

  where each f_j is an arbitrary function.
- What happens if we use another loss function, e.g. absolute error?

  L_1(Y, f(X)) = |Y - f(X)|

- In this case the optimal prediction is the conditional median,

  \hat{f}(x) = \mathrm{median}(Y \mid X = x)

  - More robust than the conditional mean.
  - But the L_1 criterion is not differentiable.
  - Squared error loss remains the most popular choice.
30
2.4 Statistical Decision Theory
- Procedure for a categorical output variable G with values from the set 𝒢, where K = card(𝒢).
- The loss function is a K x K matrix L with zeros on the diagonal; L(k, ℓ) is the price paid for misclassifying an element of class G_k as belonging to class G_ℓ.
- Frequently the 0-1 loss function is used: L(k, ℓ) = 1 - δ_kℓ.
- Expected prediction error (EPE):

  EPE = E[L(G, \hat{G}(X))]

  with the expectation taken w.r.t. the joint distribution Pr(G, X).
- Conditioning yields

  EPE = E_X \sum_{k=1}^{K} L[G_k, \hat{G}(X)] \Pr(G_k \mid X)

- Again, pointwise minimization suffices:

  \hat{G}(x) = \arg\min_{g \in \mathcal{G}} \sum_{k=1}^{K} L(G_k, g) \Pr(G_k \mid X = x)

- With 0-1 loss this is the Bayes classifier, or simply

  \hat{G}(x) = G_k \text{ if } \Pr(G_k \mid X = x) = \max_{g \in \mathcal{G}} \Pr(g \mid X = x)

  (a small sketch follows below).
31
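A minimal sketch of the pointwise minimization above, given class posteriors and a loss matrix (illustrative only; the posteriors and loss values are invented for the example):

    import numpy as np

    def bayes_classify(posterior, loss):
        """posterior[k] = Pr(G_k | X = x); loss[k, g] = price of predicting g when the truth is G_k.
        Returns the class index minimizing the expected loss."""
        expected_loss = posterior @ loss          # expected loss of each candidate prediction g
        return int(np.argmin(expected_loss))

    posterior = np.array([0.2, 0.5, 0.3])         # three classes
    zero_one = 1.0 - np.eye(3)                    # 0-1 loss: L(k, g) = 1 - delta_kg
    print(bayes_classify(posterior, zero_one))    # 1, the most probable class

    # An asymmetric loss can change the decision:
    asym = np.array([[0.0, 5.0, 5.0],             # missing class 0 is made very costly
                     [1.0, 0.0, 1.0],
                     [1.0, 1.0, 0.0]])
    print(bayes_classify(posterior, asym))        # 0, despite its lower posterior probability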
2.4 Statistical Decision Theory
(Repeats the EPE derivation and the Bayes classifier from the previous slide, now illustrated with a figure of the Bayes-optimal decision boundary.)
32
2.5 Local Methods in High Dimensions
- Curse of dimensionality: local neighborhoods become increasingly global as the number of dimensions increases.
- Example: points uniformly distributed in a p-dimensional unit hypercube.
- A hypercubical neighborhood in p dimensions that captures a fraction r of the data has edge length

  e_p(r) = r^{1/p}

  - e_{10}(0.01) ≈ 0.63, e_{10}(0.1) ≈ 0.80
- So to capture 1% of the data in 10 dimensions we must cover 63% of the range of each input variable (see the small computation below).
33
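A one-line check of the edge-length formula (illustrative):

    # Edge length of a hypercube neighborhood capturing a fraction r of uniform data in p dimensions.
    def edge_length(r, p):
        return r ** (1.0 / p)

    print(edge_length(0.01, 10))   # ~0.631: 1% of the data needs 63% of each axis
    print(edge_length(0.10, 10))   # ~0.794
    print(edge_length(0.01, 1))    # 0.01 in one dimension, as expected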
2.5 Local Methods in High Dimensions
- The alternative, reducing r, reduces the number of observations in the neighborhood and thus the stability of the estimate.
- To capture 1% of the data in 10 dimensions we must cover 63% of the range of each input variable.
34
2.5 Local Methods in High Dimensions
- In high dimensions, all sample points are close to the edge of the sample.
- Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin.
- The median distance from the origin to the closest data point is (homework)

  d(p, N) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p}

  - d(10, 500) ≈ 0.52: in the median, even the closest point is more than halfway to the boundary.
- Moreover, the sampling density is proportional to N^{1/p}: if N_1 = 100 is a dense sample for one input, then N_{10} = 100^{10} is required for an equally dense sample with 10 inputs.
- (A quick numerical check follows below.)
35
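A short numerical check of the median-distance formula (illustrative; the rejection sampler for the unit ball and the small dimensions used in the simulation are assumptions of convenience):

    import numpy as np

    def median_closest_distance(p, N):
        """Closed form from the slide: median over samples of min_i ||x_i||."""
        return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

    def simulate(p, N, reps=2000, rng=np.random.default_rng(3)):
        """Draw N uniform points in the unit p-ball (rejection sampling from the cube),
        record the distance of the closest point to the origin, return the median."""
        closest = []
        for _ in range(reps):
            pts = np.empty((0, p))
            while len(pts) < N:
                cand = rng.uniform(-1.0, 1.0, size=(4 * N, p))
                cand = cand[np.linalg.norm(cand, axis=1) <= 1.0]
                pts = np.vstack([pts, cand])
            closest.append(np.linalg.norm(pts[:N], axis=1).min())
        return np.median(closest)

    print(median_closest_distance(10, 500))   # ~0.52
    print(simulate(3, 100))                   # close to median_closest_distance(3, 100) ~ 0.19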
2.5 Local Methods in High Dimensions
- Another example: a training set T of points x_i generated uniformly in [-1, 1]^p.
- The functional relationship between X and Y is

  Y = f(X) = e^{-8\|X\|^2}

  with no measurement error.
- We examine the error of a 1-nearest-neighbor rule in estimating f(0).

(Figure: f in one dimension with 10 training points (red), the function (green), and the 1-NN prediction at x = 0 (blue), which is the value of f at the closest training point.)
36
2.5 Local Methods in High Dimensions
- Since the problem is deterministic, the prediction error at x_0 = 0 is the mean squared error for estimating f(0):

  MSE(x_0) = E_T[f(x_0) - \hat{y}_0]^2
           = E_T[\hat{y}_0 - E_T(\hat{y}_0)]^2 + [E_T(\hat{y}_0) - f(x_0)]^2
           = \mathrm{Var}_T(\hat{y}_0) + \mathrm{Bias}_T^2(\hat{y}_0)

  (the cross term telescopes away; see the side calculation on the next slide).
- In one dimension with 10 training points, the nearest neighbor is typically close to x_0 = 0, so the estimate has only a small downward bias (a simulation sketch follows below).
37
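A minimal simulation of this setup (illustrative; the number of repetitions and the chosen dimensions are my assumptions): it estimates the bias and variance of the 1-NN estimate of f(0) over random training sets, matching the qualitative behavior discussed on the next slides.

    import numpy as np

    rng = np.random.default_rng(4)

    def f(X):
        return np.exp(-8.0 * np.sum(X ** 2, axis=1))

    def one_nn_at_origin(p, n_train=10, reps=1000):
        """Bias and variance (over training sets T) of the 1-NN estimate of f(0) = 1."""
        estimates = np.empty(reps)
        for r in range(reps):
            X = rng.uniform(-1.0, 1.0, size=(n_train, p))
            closest = np.argmin(np.linalg.norm(X, axis=1))    # nearest training point to 0
            estimates[r] = f(X[closest:closest + 1])[0]       # no noise: y_i = f(x_i)
        bias = estimates.mean() - 1.0                         # f(0) = 1
        return bias, estimates.var()

    for p in (1, 2, 5, 10):
        bias, var = one_nn_at_origin(p)
        print(p, round(bias, 3), round(var, 3))   # the downward bias grows toward -1 as p increases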
Side Calculation: Bias-Variance Decomposition
MSE(x_0) = E_T[(f(x_0) - \hat{y}_0)^2]
         = E_T[(\hat{y}_0 - E_T(\hat{y}_0) + E_T(\hat{y}_0) - f(x_0))^2]
         = E_T[\hat{y}_0 - E_T(\hat{y}_0)]^2 + [E_T(\hat{y}_0) - f(x_0)]^2
           + 2\, E_T[(\hat{y}_0 - E_T(\hat{y}_0))(E_T(\hat{y}_0) - f(x_0))]
         = \mathrm{Var}_T(\hat{y}_0) + \mathrm{Bias}_T^2(\hat{y}_0) + 2\,\mathrm{Bias}_T(\hat{y}_0)\, E_T[\hat{y}_0 - E_T(\hat{y}_0)]

The bias E_T(\hat{y}_0) - f(x_0) is a constant w.r.t. T and E_T[\hat{y}_0 - E_T(\hat{y}_0)] = 0, so the cross term vanishes.
38
2.5 Local Methods in High Dimensions
- Another example: comparing the 1-dimensional and 2-dimensional cases, the distance to the nearest neighbor grows, so the bias term in

  MSE(x_0) = E_T[\hat{y}_0 - E_T(\hat{y}_0)]^2 + [E_T(\hat{y}_0) - f(x_0)]^2 = \mathrm{Var}_T(\hat{y}_0) + \mathrm{Bias}_T^2(\hat{y}_0)

  increases.

(Figure: the 2-d bias is visibly larger than the 1-d bias.)
39
2.5 Local Methods in High Dimensions
- The case of N = 1000 training points, as a function of dimension:
  - The average bias increases, since the distance to the nearest neighbor increases.
  - The variance does not increase, since the function is symmetric around 0.
40
2.5 Local Methods in High Dimensions
- Yet another example:

  Y = f(X) = \tfrac{1}{2}(X_1 + 1)^3

- Here the variance increases, since the function is not symmetric around 0.
- The bias increases only moderately, since the function is monotonic.
41
2.5 Local Methods in High Dimensions
- Assume now a linear relationship with measurement error:

  Y = X^T \beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)

- We fit the model by least squares. For an arbitrary test point x_0,

  \hat{y}_0 = x_0^T \hat\beta = x_0^T \beta + \sum_{i=1}^{N} \ell_i(x_0)\, \varepsilon_i

  where \ell_i(x_0) is the i-th element of \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} x_0.
- Expected prediction error at x_0:

  EPE(x_0) = E_{y_0|x_0} E_T (y_0 - \hat{y}_0)^2
           = \mathrm{Var}(y_0|x_0) + E_T[\hat{y}_0 - E_T(\hat{y}_0)]^2 + [E_T(\hat{y}_0) - x_0^T\beta]^2
           = \mathrm{Var}(y_0|x_0) + \mathrm{Var}_T(\hat{y}_0) + \mathrm{Bias}^2(\hat{y}_0)
           = \sigma^2 + E_T[x_0^T (\mathbf{X}^T\mathbf{X})^{-1} x_0]\,\sigma^2 + 0^2

- There is an additional variance \sigma^2 because the output is nondeterministic; the estimation variance depends on x_0; there is no bias.
- If N is large, averaging over x_0 gives

  E_{x_0} EPE(x_0) \approx \sigma^2 (p/N) + \sigma^2

- The variance term is negligible for large N or small \sigma: the curse of dimensionality is controlled, since the expected error grows only linearly in p.
42
2.5 Local Methods in High Dimensions
- More generally:

  Y = f(X) + \varepsilon, \qquad X \text{ uniform}, \quad \varepsilon \sim N(0, 1)

- Sample size: N = 500.
- Linear case, f(x) = x_1:
  - EPE(least squares) is slightly above 1 (the irreducible error); there is no bias.
  - EPE(1-NN) is always above 2 and grows slowly with dimension, as the nearest training point strays from the target point.

(Figure: EPE ratio 1-NN / least squares as a function of dimension, for f(x) = x_1 and f(x) = (x_1 + 1)^3 / 2.)
43
2.5 Local Methods in High Dimensions
- Same setting, cubic case, f(x) = (x_1 + 1)^3 / 2:
  - EPE(least squares) is now biased, so the EPE ratio 1-NN / least squares is smaller than in the linear case.

(Figure: the same EPE ratio plot as on the previous slide.)
44
2.6 Statistical Models
- NN methods are the direct implementation of f(x) = E(Y | X = x).
- But they can fail in two ways:
  - With high dimensions, the nearest neighbors need not be close to the target point.
  - If special structure exists in the problem, this can be used to reduce variance and bias.
45
2.6.1 Additive Error Model
- Assume an additive error model:

  Y = f(X) + \varepsilon, \qquad E(\varepsilon) = 0, \quad \varepsilon \text{ independent of } X

- Then Pr(Y | X) depends on X only through the conditional mean f(x).
- This model is a good approximation in many cases.
- In many other cases, f(x) is deterministic and the error enters through uncertainty in the input; this can often be mapped onto uncertainty in the output with a deterministic input.
46
2.6.2 Supervised Learning
- In supervised learning, the learning algorithm modifies its input/output relationship in response to the observed errors

  y_i - \hat{f}(x_i)

- This can be a continuous process.
47
2.6.3 Function Approximation
- Data: pairs (x_i, y_i), viewed as points in (p+1)-dimensional space:

  f : \mathbb{R}^p \to \mathbb{R}, \qquad y_i = f(x_i) + \varepsilon_i

  (more general input spaces are possible).
- Goal: a good approximation of f(x) in some region of input space, given the training set T.
- Many models have a set of parameters \theta; e.g. for the linear model f(x) = x^T \beta, \theta = \beta.
- Linear basis expansions have the more general form

  f_\theta(x) = \sum_{k=1}^{K} h_k(x)\, \theta_k

- Examples:
  - Polynomial expansions: h_k(x) = x_1 x_2^2
  - Trigonometric expansions: h_k(x) = \cos(x_1)
  - Sigmoid expansion:

    h_k(x) = \frac{1}{1 + \exp(-x^T \beta_k)}
48
2.6.3 Function Approximation
- Approximate f by minimizing the residual sum of squares as a function of \theta:

  RSS(\theta) = \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2

(The linear basis expansions and examples of the previous slide apply here.)
49
2.6.3 Function Approximation
- Approximate f by minimizing the residual sum of squares

  RSS(\theta) = \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2

- Intuition:
  - f is a surface in (p+1)-dimensional space, of which we observe noisy realizations.
  - We want the fitted surface to be as close to the observed points as possible, with closeness measured by RSS.
- Methods (a closed-form sketch follows below):
  - Closed form: if the basis functions have no hidden parameters.
  - Iterative: otherwise.
50
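A minimal sketch of the closed-form case: a basis expansion that is linear in \theta (here a cubic polynomial in one input, an arbitrary illustrative choice) fitted by least squares:

    import numpy as np

    rng = np.random.default_rng(5)

    # Noisy observations of an unknown function of one input.
    x = rng.uniform(-1.0, 1.0, size=80)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=x.size)

    # Basis functions h_k(x): 1, x, x^2, x^3. The model f_theta(x) = sum_k h_k(x) theta_k
    # is linear in theta, so RSS(theta) is minimized in closed form by least squares.
    H = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
    theta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)

    # Fitted values on a small grid.
    grid = np.linspace(-1.0, 1.0, 5)
    H_grid = np.column_stack([np.ones_like(grid), grid, grid ** 2, grid ** 3])
    print(theta_hat)
    print(H_grid @ theta_hat)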
2.6.3 Function Approximation
- Alternative: approximate f by maximizing the likelihood.
- Assume an independently drawn random sample y_i, i = 1, ..., N, from a probability density Pr_\theta(y). The log-probability of observing the sample is

  L(\theta) = \sum_{i=1}^{N} \log \Pr\nolimits_\theta(y_i)

- Set \theta to maximize L(\theta).
51
2.6.3 Function Approximation
- Least squares with the additive error model

  Y = f_\theta(X) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)

  is equivalent to maximum likelihood with the likelihood function

  \Pr(Y \mid X, \theta) = N(f_\theta(X), \sigma^2)

- This is because in this case the log-likelihood function is

  L(\theta) = -\frac{N}{2}\log(2\pi) - N \log \sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{N} (y_i - f_\theta(x_i))^2

  whose only \theta-dependent term is proportional to -RSS(\theta) (a small numerical check follows below).
52
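A small numerical check that the Gaussian log-likelihood and -RSS/(2σ²) differ only by a θ-independent constant (illustrative; the linear model, σ and data are my assumptions):

    import numpy as np

    rng = np.random.default_rng(6)
    sigma = 0.5
    x = rng.uniform(-1.0, 1.0, size=50)
    y = 2.0 * x + 1.0 + rng.normal(scale=sigma, size=x.size)

    def log_likelihood(theta):
        """Gaussian log-likelihood for the linear model f_theta(x) = theta0 + theta1 * x."""
        resid = y - (theta[0] + theta[1] * x)
        return (-0.5 * x.size * np.log(2.0 * np.pi) - x.size * np.log(sigma)
                - 0.5 * np.sum(resid ** 2) / sigma ** 2)

    def rss(theta):
        resid = y - (theta[0] + theta[1] * x)
        return np.sum(resid ** 2)

    for theta in ([1.0, 2.0], [0.0, 0.0], [0.5, -1.0]):
        # The difference is the same constant for every theta.
        print(round(log_likelihood(theta) + rss(theta) / (2.0 * sigma ** 2), 6))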
2.6.3 Function Approximation
- For a qualitative output G, approximate the regression function Pr(G | X) by maximizing the likelihood.
- Model the conditional probability of each class given X:

  \Pr(G = G_k \mid X = x) = p_{k,\theta}(x), \qquad k = 1, \ldots, K

- Then the log-likelihood, also called the cross-entropy, is

  L(\theta) = \sum_{i=1}^{N} \log p_{g_i, \theta}(x_i)
53
2.7 Structured Regression Models
- Problem with regression: minimizing

  RSS(\theta) = \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2

  over arbitrary functions has infinitely many (interpolating) solutions.
- If we have repeated outcomes at each point, we can use them to decrease the variance by better estimating the average; otherwise we must restrict the set of candidate functions to "smooth" functions.
  - The choice of this set is the model choice, a major topic of this course.
- Restricting function spaces: choose a function space of low complexity.
  - Close to constant, linear, or low-order polynomial in small neighborhoods.
  - The VC dimension is a relevant complexity measure in this context.
  - The estimator then does averaging or local polynomial fitting.
  - The larger the neighborhood, the stronger the constraint.
  - The metric used is important, whether explicitly or implicitly defined.
  - All such methods run into problems in high dimensions, and therefore need metrics that allow neighborhoods to be large in at least some dimensions.
54
2.8 Classes of Restricted Estimators
- Roughness penalty: the RSS is penalized with a penalty term,

  PRSS(f; \lambda) = RSS(f) + \lambda J(f)

  where J(f) is large for ragged functions.
- E.g. the cubic smoothing spline is the solution of the least-squares problem

  PRSS(f; \lambda) = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \int [f''(x)]^2\, dx

  which penalizes a large second derivative.
- Bayesian methods: from the formula for joint probabilities,

  \Pr(X, Y) = \Pr(Y \mid X)\Pr(X) = \Pr(X \mid Y)\Pr(Y),

  we obtain Bayes' formula

  \Pr(Y \mid X) = \frac{\Pr(X \mid Y)\Pr(Y)}{\Pr(X)}

  with Pr(Y) the prior and Pr(Y | X) the posterior (probability) for Y.
55
2.8 Classes of Restricted Estimators
- Introducing penalty functions is a type of regularization:
  - It works against overfitting.
  - It implements beliefs about unseen parts of the problem.
- In Bayesian terms:
  - The penalty J is the log-prior (probability distribution).
  - PRSS is the log-posterior (probability distribution).

(The penalized criterion PRSS(f; \lambda) = RSS(f) + \lambda J(f) and the cubic-smoothing-spline example are as on the previous slide.)
56
2.8.2 Kernel Methods and Local Regression
- Kernel functions model the local neighborhoods used in NN methods and define the function space used for the approximation.
- Gaussian kernel:

  K_\lambda(x_0, x) = \frac{1}{\lambda} \exp\left(-\frac{\|x - x_0\|^2}{2\lambda}\right)

  - Assigns weights to points that die off exponentially with the squared distance from the point x_0.
  - \lambda controls the width (variance) of the neighborhood.
- Simplest kernel estimate: the Nadaraya-Watson weighted average

  \hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}

- The general local regression estimate of f(x_0) is f_{\hat\theta}(x_0), where \hat\theta minimizes

  RSS(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - f_\theta(x_i))^2
57
2.8.2 Kernel Methods and Local Regression
- In local regression, f_\theta is a simple function such as a low-order polynomial:
  - f_\theta(x) = \theta_0 gives the Nadaraya-Watson estimate.
  - f_\theta(x) = \theta_0 + \theta_1 x gives the local linear regression model.
- NN methods can be regarded as kernel methods with a special, data-dependent metric:

  K_k(x, x_0) = I\left(\|x - x_0\| \le \|x_{(k)} - x_0\|\right)

  where x_{(k)} is the training observation ranked k-th in distance from x_0 and I(b) = \delta_{b,\text{true}} is the indicator function. (A Nadaraya-Watson sketch follows below.)
58
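A minimal sketch of the Nadaraya-Watson estimate with the Gaussian kernel above (illustrative; the bandwidth, the toy data and the function names are assumptions):

    import numpy as np

    def gaussian_kernel(x0, x, lam=0.02):
        """K_lambda(x0, x) = (1/lambda) * exp(-||x - x0||^2 / (2*lambda))."""
        return np.exp(-np.sum((x - x0) ** 2, axis=-1) / (2.0 * lam)) / lam

    def nadaraya_watson(x0, X_train, y_train, lam=0.02):
        """Kernel-weighted average of the training outcomes."""
        w = gaussian_kernel(x0, X_train, lam)
        return np.sum(w * y_train) / np.sum(w)

    rng = np.random.default_rng(7)
    X_train = rng.uniform(-1.0, 1.0, size=(200, 1))
    y_train = np.sin(3.0 * X_train[:, 0]) + rng.normal(scale=0.2, size=200)

    print(nadaraya_watson(np.array([0.5]), X_train, y_train))   # smoothed estimate of f(0.5) = sin(1.5)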
2.8.3 Basis Functions and Dictionary Methods
- These methods include linear and polynomial expansions, and more. General form:

  f_\theta(x) = \sum_{m=1}^{M} \theta_m h_m(x)

  which is linear in \theta.
- Examples:
  - Splines: additional parameters are the points of attachment of the polynomial pieces (the knots).
  - Radial basis functions:

    f_\theta(x) = \sum_{m=1}^{M} K_{\lambda_m}(\mu_m, x)\, \theta_m

    with parameters the centroids \mu_m and the scales \lambda_m.
  - Neural networks:

    f_\theta(x) = \sum_{m=1}^{M} \beta_m\, \sigma(\alpha_m^T x + b_m), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}

    where \alpha_m, b_m are the neuron weights and \sigma(\cdot) is the neuron output. (A small sketch follows below.)
62
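A minimal sketch of the neural-network form of the dictionary expansion, with the inner neuron weights held fixed at random values so that only the outer coefficients \beta are fitted, which keeps the problem linear in \beta (the sizes, the random weights and the toy data are illustrative assumptions; in an actual neural network the inner weights would also be adapted):

    import numpy as np

    rng = np.random.default_rng(8)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Training data: one input, noisy observations of an unknown function.
    X = rng.uniform(-1.0, 1.0, size=(200, 1))
    y = np.sin(3.0 * X[:, 0]) + rng.normal(scale=0.2, size=200)

    # Dictionary of M sigmoid basis functions h_m(x) = sigmoid(alpha_m^T x + b_m)
    # with fixed random weights; only the outer coefficients beta are fitted.
    M = 20
    alpha = rng.normal(scale=3.0, size=(M, 1))
    b = rng.normal(scale=1.0, size=M)
    H = sigmoid(X @ alpha.T + b)              # N x M matrix of basis evaluations

    beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)

    def f_hat(x):
        return sigmoid(x @ alpha.T + b) @ beta_hat

    print(f_hat(np.array([[0.5]])))           # estimate of f(0.5); the true sin(1.5) is about 1.0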
2.9 Model Selection
- Most methods have smoothing and complexity parameters:
  - the coefficient of the penalty term,
  - the width of the kernel,
  - the number of basis functions.
- The setting of these parameters implements a tradeoff between bias and variance.
- Example: k-NN regression with the model Y = f(X) + \varepsilon, E(\varepsilon) = 0, Var(\varepsilon) = \sigma^2, assuming the values of the x_i are fixed in advance. The generalization error at x_0 is

  EPE_k(x_0) = E[(Y - \hat{f}_k(x_0))^2 \mid X = x_0]
             = \sigma^2 + [\mathrm{Bias}^2(\hat{f}_k(x_0)) + \mathrm{Var}_T(\hat{f}_k(x_0))]
             = \sigma^2 + \left[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\right]^2 + \frac{\sigma^2}{k}

  where \sigma^2 is the irreducible error and the remaining terms form the mean-square (estimation) error.
- Small k: low bias, high variance (overfit region); large k: high bias, low variance (underfit region). (A simulation sketch follows below.)
63
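A minimal simulation of this bias-variance tradeoff for k-NN at a single test point (illustrative; the function, \sigma and the fixed design are arbitrary choices consistent with the slide's fixed-x_i assumption):

    import numpy as np

    rng = np.random.default_rng(9)

    sigma = 0.5
    x_train = np.linspace(-2.0, 2.0, 101)        # fixed in advance, as on the slide

    def f(x):
        return np.sin(2.0 * x)

    x0 = 0.3
    f0 = f(x0)

    def knn_fit(y, k):
        """Average the k training outcomes whose x_i are closest to x0."""
        nearest = np.argsort(np.abs(x_train - x0))[:k]
        return y[nearest].mean()

    for k in (1, 5, 15, 51):
        est = np.array([knn_fit(f(x_train) + rng.normal(scale=sigma, size=x_train.size), k)
                        for _ in range(4000)])
        bias2 = (est.mean() - f0) ** 2
        var = est.var()
        # The variance tracks sigma^2 / k; the bias grows with k as the neighborhood widens.
        print(k, round(bias2, 4), round(var, 4), round(sigma ** 2 / k, 4))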