CZECH TECHNICAL UNIVERSITY IN PRAGUE
Faculty of Electrical Engineering
Department of Cybernetics

Linear regression
Petr Pošík
© 2015
Linear regression

Regression task is a supervised learning task, i.e.
■ a training (multi)set T = {(x(1), y(1)), …, (x(|T|), y(|T|))} is available, where
■ the labels y(i) are quantitative, often continuous (as opposed to classification tasks, where y(i) are nominal).
■ Its purpose is to model the relationship between independent variables (inputs) x = (x1, …, xD) and the dependent variable (output) y.

Linear regression is a particular regression model which assumes (and learns) a linear relationship between the inputs and the output:

ŷ = h(x) = w0 + w1·x1 + … + wD·xD = w0 + ⟨w, x⟩ = w0 + xwᵀ,

where
■ ŷ is the model prediction (an estimate of the true value y),
■ h(x) is the linear model (a hypothesis),
■ w0, …, wD are the coefficients of the linear function (w0 is the bias), organized in a row vector w,
■ ⟨w, x⟩ is the dot (scalar) product of the vectors w and x,
■ which can also be computed as the matrix product xwᵀ if w and x are row vectors.
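A quick numeric illustration of the prediction formula in Python; the weights and the input below are made-up values, not taken from the lecture:

import numpy as np

# Hypothetical weights and input for a D = 3 model; all values are made up.
w0 = 0.5                          # bias
w = np.array([1.0, -2.0, 0.3])    # (w1, ..., wD) as a row vector
x = np.array([2.0, 1.0, 4.0])     # (x1, ..., xD)

# y_hat = w0 + <w, x>, the dot-product form of the linear model
y_hat = w0 + np.dot(w, x)
print(y_hat)  # 0.5 + 2.0 - 2.0 + 1.2 = 1.7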
Notation remarks

Homogeneous coordinates: If we add “1” as the first element of x so that x = (1, x1, …, xD), then we can write the linear model in an even simpler form (without the explicit bias term):

ŷ = h(x) = w0·1 + w1·x1 + … + wD·xD = ⟨w, x⟩ = xwᵀ.

Matrix notation: If we organize the data into a matrix X and a vector y, such that

    ⎡ 1  x(1)   ⎤          ⎡ y(1)   ⎤
X = ⎢ ⋮   ⋮     ⎥  and  y = ⎢ ⋮     ⎥ ,
    ⎣ 1  x(|T|) ⎦          ⎣ y(|T|) ⎦

and similarly with ŷ, then we can write a batch computation of predictions for all data in X as

ŷ = Xwᵀ.
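A sketch of the batch computation in Python; the data matrix and the weights are made-up values:

import numpy as np

# Hypothetical data: |T| = 4 examples with D = 2 features (values made up).
X_raw = np.array([[1.0, 2.0],
                  [0.5, 1.5],
                  [3.0, 0.0],
                  [2.0, 2.0]])

# Homogeneous coordinates: prepend a column of ones so w0 needs no special case.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

w = np.array([0.5, 1.0, -2.0])  # (w0, w1, w2), also made up

# Batch prediction for all examples at once: y_hat = X w^T
y_hat = X @ w
print(y_hat)  # one prediction per row of X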
Two operation modes

Any ML model has 2 operation modes:
1. learning (training, fitting) and
2. application (testing, making predictions).

The model h can be viewed as a function of 2 variables: h(x, w).

Model application: If the model is given (w is fixed), we can manipulate x to make predictions:

ŷ = h(x, w) = h_w(x).

Model learning: If the data is given (T is fixed), we can manipulate the model parameters w to fit the model to the data:

w* = argmin_w J(w, T).

How to train the model?
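To make the two modes concrete, here is a minimal sketch of a model object with separate learning and application entry points; the class, its method names, and the use of least squares inside fit are illustrative assumptions, not something prescribed by the slides:

import numpy as np

class LinearModel:
    # Illustrative sketch only; the names here are assumptions.

    def fit(self, X, y):
        # Learning mode: the data (X, y) is fixed, the parameters w are
        # manipulated to minimize J(w, T); here via numpy's least squares.
        self.w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, X):
        # Application mode: w is fixed, X varies; y_hat = X w^T.
        return X @ self.w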
Simple (univariate) linear regression

Simple (univariate) regression deals with cases where x(i) = (x(i)), i.e. the examples are described by a single feature (they are 1-dimensional).

Fitting a line to data:
■ find parameters w0, w1 of a linear model ŷ = w0 + w1·x,
■ given a training (multi)set T = {(x(i), y(i))}, i = 1, …, |T|.

How the fit depends on the number of training examples:
■ Given a single example (1 equation, 2 parameters) ⇒ infinitely many linear functions can be fitted.
■ Given 2 examples (2 equations, 2 parameters) ⇒ exactly 1 linear function can be fitted (solved in the sketch below).
■ Given 3 or more examples (> 2 equations, 2 parameters) ⇒ in general, no line can be fitted without error ⇒ a line which minimizes the “size” of the error y − ŷ can be fitted:

w* = (w0*, w1*) = argmin_{w0,w1} J(w0, w1, T).
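The 2-example case can be checked directly: two equations w0 + w1·x(i) = y(i) determine the line exactly. A minimal sketch with two made-up points:

import numpy as np

# Two made-up training examples (x, y): with 2 equations and 2 parameters
# the line is determined exactly.
pts = [(1.0, 3.0), (4.0, 9.0)]

# Each example gives one equation w0 + w1 * x = y:
A = np.array([[1.0, x] for x, _ in pts])   # rows (1, x(i))
b = np.array([y for _, y in pts])
w0, w1 = np.linalg.solve(A, b)
print(w0, w1)  # 1.0, 2.0: the unique line through both points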
The least squares method

The least squares method (LSM) suggests choosing the parameters w that minimize the mean squared error

J(w) = (1/|T|) · Σ_{i=1}^{|T|} (y(i) − ŷ(i))² = (1/|T|) · Σ_{i=1}^{|T|} (y(i) − h_w(x(i)))².

[Figure: a line ŷ = w0 + w1·x fitted to training points (x(i), y(i)); vertical segments mark the errors |y(i) − ŷ(i)| between each point and its prediction (x(i), ŷ(i)); w0 is the intercept, w1 the slope.]

Explicit solution:

w1 = Σ_{i=1}^{|T|} (x(i) − x̄)(y(i) − ȳ) / Σ_{i=1}^{|T|} (x(i) − x̄)² = s_xy / s_x²,
w0 = ȳ − w1·x̄.
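The explicit solution is easy to verify numerically. A small sketch with made-up 1-D data (the cross-check against np.polyfit is an added convenience, not part of the slides):

import numpy as np

# Made-up 1-D training data lying roughly along y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.9, 5.1, 6.8, 9.1])

x_bar, y_bar = x.mean(), y.mean()

# Explicit least-squares solution from the slide:
# w1 = s_xy / s_x^2, w0 = y_bar - w1 * x_bar
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar
print(w0, w1)

# Cross-check against numpy's own least-squares fit:
print(np.polyfit(x, y, deg=1))  # returns (w1, w0)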
Universal fitting method: minimization of cost function J

The landscape of J in the space of w0 and w1:

[Figure: a surface plot and a contour plot of J(w0, w1) over w0 ∈ [0, 100] and w1 ∈ [0, 1].]

Gradually better linear models found by an optimization method (BFGS):

[Figure: four scatter plots of hp versus disp, each with a progressively better-fitting regression line.]
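A sketch of the same idea with a generic optimizer; the data below is made up to stand in for the disp/hp example in the plots, and scipy's BFGS implementation is used as the optimization method:

import numpy as np
from scipy.optimize import minimize

# Made-up 1-D data standing in for the disp (x) vs. hp (y) example.
x = np.array([100.0, 160.0, 250.0, 320.0, 400.0])
y = np.array([70.0, 110.0, 150.0, 180.0, 220.0])

def J(w):
    # Mean squared error of the line y_hat = w0 + w1 * x on the data.
    w0, w1 = w
    return np.mean((y - (w0 + w1 * x)) ** 2)

# BFGS iteratively improves (w0, w1), as in the plots above.
result = minimize(J, x0=np.zeros(2), method="BFGS")
print(result.x)  # fitted (w0, w1)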
Multivariate linear regression

Multivariate linear regression deals with cases where x(i) = (x1(i), …, xD(i)), i.e. the examples are described by more than 1 feature (they are D-dimensional).

Model fitting:
■ find parameters w = (w0, w1, …, wD) of a linear model ŷ = xwᵀ (in homogeneous coordinates),
■ given the training (multi)set T = {(x(i), y(i))}, i = 1, …, |T|.
■ The model is a hyperplane in the (D + 1)-dimensional space.

Fitting methods:
1. Numeric optimization of J(w, T):
■ Works as for simple regression, it only searches a space with more dimensions.
■ Sometimes one needs to tune some parameters of the optimization algorithm for it to work properly (the learning rate in gradient descent, etc.).
■ May be slow (many iterations needed), but works even for very large D.
2. Normal equation (see the sketch after this list):

w* = (XᵀX)⁻¹Xᵀy

■ A method to solve for the optimal w* analytically!
■ No need to choose optimization algorithm parameters.
■ No iterations.
■ Needs to compute (XᵀX)⁻¹, which is O(D³). Slow, or intractable, for large D.
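A minimal sketch of the normal equation with made-up data; solving the linear system (XᵀX)w = Xᵀy is used instead of forming the inverse explicitly, which is the numerically preferable route:

import numpy as np

# Made-up multivariate data: |T| = 5 examples, D = 2 features.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0],
                  [5.0, 2.5]])
y = np.array([4.1, 4.9, 7.2, 10.1, 11.0])

# Homogeneous coordinates: prepend the column of ones for the bias w0.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Normal equation: w* = (X^T X)^(-1) X^T y, computed as a linear solve.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # (w0*, w1*, w2*)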