ICS 178: Introduction to Machine Learning & Data Mining
Instructor: Max Welling
Lecture 4: Least-Squares Regression

What have we done so far?
• Unsupervised, non-parametric: density estimation (Parzen windowing).
• Unsupervised, parametric: future example: k-means.
• Supervised, non-parametric: classification (kNN).
• Supervised, parametric: regression (today: least squares); classification (future example: logistic regression).

Problem
[Figure: scatter plot of the number of manatee kills versus the number of boats.]

Goal
• Given data, find a linear relationship between them.
• In 1 dimension we have data {X_n, Y_n} (blue dots).
• We imagine vertical springs (red lines) between the data and a stiff rod (the line); imagine they can slide along the rod so they remain vertical.
• The springs have rest length 0, so they compete to pull the rod towards them.
• The relaxed solution is what we are after.

Cost Function
We measure the total squared length of all the springs:

    \mathrm{Error} = \sum_{n=1}^{N} d_n^2 = \sum_{n=1}^{N} \left( Y_n - aX_n - b \right)^2

We can now take derivatives with respect to a and b and set them to 0. After some algebra (on the whiteboard) we find

    a^* = \frac{\frac{1}{N}\sum_{n=1}^{N} Y_n X_n \;-\; \left(\frac{1}{N}\sum_{n=1}^{N} Y_n\right)\left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)}{\frac{1}{N}\sum_{n=1}^{N} X_n^2 \;-\; \left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)^2}

    b^* = \frac{1}{N}\sum_{n=1}^{N} Y_n \;-\; a^*\,\frac{1}{N}\sum_{n=1}^{N} X_n

In words: a^* is the sample covariance of X and Y divided by the sample variance of X, and b^* shifts the line so it passes through the mean of the data. (A numerical check of these formulas is sketched at the end of this section.)

More Variables
• More generally, we want to have D_x input variables and D_y output variables, so A becomes a D_y x D_x matrix and b a D_y-dimensional vector.
• The cost is now:

    \mathrm{Error} = \sum_{n=1}^{N} d_n^T d_n = \sum_{n=1}^{N} \| Y_n - AX_n - b \|^2

• Setting the derivatives with respect to A and b to zero gives

    A^* = \left( E[YX^T] - E[Y]\,E[X]^T \right) \mathrm{COV}[X]^{-1}

    b^* = \frac{1}{N}\sum_{n=1}^{N} Y_n - A^*\,\frac{1}{N}\sum_{n=1}^{N} X_n = E[Y] - A^* E[X]

  where the expectations are sample averages, e.g.

    E[X] = \frac{1}{N}\sum_{n=1}^{N} X_n, \qquad \mathrm{COV}[X] = \frac{1}{N}\sum_{n=1}^{N} X_n X_n^T - E[X]\,E[X]^T.

In Matlab
(A usage example on synthetic data is sketched at the end of this section.)

    function [A,b] = LSRegression(X,Y)
    % X: Dx x N inputs, Y: Dy x N outputs (one data point per column)
    [D,N]  = size(X);
    EX    = sum(X,2)/N;         % E[X]
    CovX  = X*X'/N - EX*EX';    % COV[X]
    EY    = sum(Y,2)/N;         % E[Y]
    CovXY = Y*X'/N - EY*EX';    % E[YX'] - E[Y]E[X]'
    A = CovXY * inv(CovX);      % A* = (E[YX'] - E[Y]E[X]') COV[X]^{-1}
    b = EY - A*EX;              % b* = E[Y] - A* E[X]

Statistical Interpretation
• We can think of the problem as one where we are trying to find the probability distribution P(Y|X).
• We can write Y_n = AX_n + b + d_n, where d_n is the residual error pointing vertically from the line to the data point.
• d is a random vector, and we may assume it has a Gaussian distribution:

    d_n = Y_n - AX_n - b \sim N(0, \sigma^2 I)
    \;\Rightarrow\; Y_n \sim N(AX_n + b, \sigma^2 I) \propto \exp\!\left[ -\frac{1}{2\sigma^2} \| Y_n - AX_n - b \|^2 \right]

• We can now maximize the probability of the data under the model by adapting the parameters A and b.
• Taking the negative log-probability, we get

    C = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \| Y_n - AX_n - b \|^2 + \frac{N D_y}{2}\log\sigma^2 + \mathrm{const.}

• Looks familiar? For fixed \sigma this is, up to scale and an additive constant, the least-squares error from before.
• We can also optimize for \sigma (it won't affect A and b); see the sketch at the end of this section.
• This is called "maximum likelihood learning".
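
As a quick numerical check of the 1-D formulas from the Cost Function slide, here is a minimal MATLAB sketch: it generates noisy data from a known line and recovers a and b with the sample-average expressions. The values a_true, b_true and the noise level are illustrative assumptions, not from the lecture.

    % Minimal sketch: check the 1-D least-squares formulas on synthetic data.
    % a_true, b_true and the noise level are illustrative assumptions.
    N = 1000;
    a_true = 2.5;  b_true = -1.0;
    X = 10*rand(1,N);                         % inputs
    Y = a_true*X + b_true + 0.5*randn(1,N);   % noisy targets
    EX  = sum(X)/N;      EY  = sum(Y)/N;      % sample means
    EXY = sum(X.*Y)/N;   EXX = sum(X.^2)/N;   % sample second moments
    a = (EXY - EY*EX) / (EXX - EX^2);         % a* = Cov(X,Y)/Var(X)
    b = EY - a*EX;                            % b* = E[Y] - a* E[X]
    fprintf('a = %.3f (true %.2f), b = %.3f (true %.2f)\n', a, a_true, b, b_true);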
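
A usage sketch for the LSRegression function above, assuming it is saved as LSRegression.m on the MATLAB path; the dimensions, sample size, and noise level are made-up test values. It fits A and b on synthetic data drawn from a known linear model and reports how well they are recovered.

    % Usage sketch for LSRegression (assumes LSRegression.m is on the path).
    % Dimensions and noise level are illustrative assumptions.
    Dx = 3;  Dy = 2;  N = 5000;
    A_true = randn(Dy,Dx);  b_true = randn(Dy,1);
    X = randn(Dx,N);
    Y = A_true*X + b_true + 0.1*randn(Dy,N);  % b_true broadcasts over columns
                                              % (implicit expansion, R2016b+)
    [A,b] = LSRegression(X,Y);
    fprintf('error in A: %.3g, error in b: %.3g\n', ...
            norm(A - A_true,'fro'), norm(b - b_true));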
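
The last slide notes that sigma can be optimized separately without affecting A and b. A sketch of that step, continuing from the usage example directly above and assuming the isotropic residual model d ~ N(0, sigma^2 I): setting the derivative of C with respect to sigma^2 to zero gives the mean squared residual per output dimension.

    % Continues the usage sketch above (reuses X, Y, A, b, N, Dy).
    R = Y - (A*X + b);                % residuals d_n, one per column
    sigma2 = sum(R(:).^2) / (N*Dy);   % dC/dsigma^2 = 0 => sigma^2 is the
                                      % mean squared residual per dimension
    fprintf('ML estimate sigma^2 = %.4f\n', sigma2);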