ICS 178: Introduction to Machine Learning & Data Mining
Instructor: Max Welling
Lecture 4: Least-Squares Regression

What have we done so far?
• Unsupervised, non-parametric: density estimation (Parzen windowing).
• Unsupervised, parametric: future example: k-means.
• Supervised, non-parametric: classification (kNN).
• Supervised, parametric: regression (today: least squares); classification (future example: logistic regression).

Problem
[Figure: scatter plot of the number of manatee kills versus the number of boats.]

Goal
• Given data, find a linear relationship between them.
• In 1 dimension we have data {X_n, Y_n} (blue dots).
• We imagine vertical springs (red lines) between the data and a stiff rod (the line); imagine they can slide along the rod so they remain vertical.
• The springs have rest length 0, so they compete to pull the rod towards them.
• The relaxed solution is what we are after.

Cost Function
We measure the total squared length of all the springs:

    \mathrm{Error} = \sum_{n=1}^{N} d_n^2 = \sum_{n=1}^{N} \left( Y_n - aX_n - b \right)^2

We can now take derivatives with respect to a and b and set them to 0. After some algebra (on the whiteboard) we find

    a^* = \frac{\frac{1}{N}\sum_{n=1}^{N} Y_n X_n \;-\; \left(\frac{1}{N}\sum_{n=1}^{N} Y_n\right)\left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)}{\frac{1}{N}\sum_{n=1}^{N} X_n^2 \;-\; \left(\frac{1}{N}\sum_{n=1}^{N} X_n\right)^2}

    b^* = \frac{1}{N}\sum_{n=1}^{N} Y_n \;-\; a^*\,\frac{1}{N}\sum_{n=1}^{N} X_n

In words: a^* is the sample covariance of X and Y divided by the sample variance of X, and b^* shifts the line so it passes through the mean of the data. (A numerical check of these formulas is sketched at the end of this section.)

More Variables
• More generally, we want to have D_x input variables and D_y output variables, so A becomes a D_y x D_x matrix and b a D_y-dimensional vector.
• The cost is now:

    \mathrm{Error} = \sum_{n=1}^{N} d_n^T d_n = \sum_{n=1}^{N} \| Y_n - AX_n - b \|^2

• Setting the derivatives with respect to A and b to zero gives

    A^* = \left( E[YX^T] - E[Y]\,E[X]^T \right) \mathrm{COV}[X]^{-1}

    b^* = \frac{1}{N}\sum_{n=1}^{N} Y_n - A^*\,\frac{1}{N}\sum_{n=1}^{N} X_n = E[Y] - A^* E[X]

  where the expectations are sample averages, e.g.

    E[X] = \frac{1}{N}\sum_{n=1}^{N} X_n, \qquad \mathrm{COV}[X] = \frac{1}{N}\sum_{n=1}^{N} X_n X_n^T - E[X]\,E[X]^T.

In Matlab
(A usage example on synthetic data is sketched at the end of this section.)

    function [A,b] = LSRegression(X,Y)
    % X: Dx x N inputs, Y: Dy x N outputs (one data point per column)
    [D,N]  = size(X);
    EX    = sum(X,2)/N;         % E[X]
    CovX  = X*X'/N - EX*EX';    % COV[X]
    EY    = sum(Y,2)/N;         % E[Y]
    CovXY = Y*X'/N - EY*EX';    % E[YX'] - E[Y]E[X]'
    A = CovXY * inv(CovX);      % A* = (E[YX'] - E[Y]E[X]') COV[X]^{-1}
    b = EY - A*EX;              % b* = E[Y] - A* E[X]

Statistical Interpretation
• We can think of the problem as one where we are trying to find the probability distribution P(Y|X).
• We can write Y_n = AX_n + b + d_n, where d_n is the residual error pointing vertically from the line to the data point.
• d is a random vector, and we may assume it has a Gaussian distribution:

    d_n = Y_n - AX_n - b \sim N(0, \sigma^2 I)
    \;\Rightarrow\; Y_n \sim N(AX_n + b, \sigma^2 I) \propto \exp\!\left[ -\frac{1}{2\sigma^2} \| Y_n - AX_n - b \|^2 \right]

• We can now maximize the probability of the data under the model by adapting the parameters A and b.
• Taking the negative log-probability, we get

    C = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \| Y_n - AX_n - b \|^2 + \frac{N D_y}{2}\log\sigma^2 + \mathrm{const.}

• Looks familiar? For fixed \sigma this is, up to scale and an additive constant, the least-squares error from before.
• We can also optimize for \sigma (it won't affect A and b); see the sketch at the end of this section.
• This is called "maximum likelihood learning".
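
As a quick numerical check of the 1-D formulas from the Cost Function slide, here is a minimal MATLAB sketch: it generates noisy data from a known line and recovers a and b with the sample-average expressions. The values a_true, b_true and the noise level are illustrative assumptions, not from the lecture.

    % Minimal sketch: check the 1-D least-squares formulas on synthetic data.
    % a_true, b_true and the noise level are illustrative assumptions.
    N = 1000;
    a_true = 2.5;  b_true = -1.0;
    X = 10*rand(1,N);                         % inputs
    Y = a_true*X + b_true + 0.5*randn(1,N);   % noisy targets
    EX  = sum(X)/N;      EY  = sum(Y)/N;      % sample means
    EXY = sum(X.*Y)/N;   EXX = sum(X.^2)/N;   % sample second moments
    a = (EXY - EY*EX) / (EXX - EX^2);         % a* = Cov(X,Y)/Var(X)
    b = EY - a*EX;                            % b* = E[Y] - a* E[X]
    fprintf('a = %.3f (true %.2f), b = %.3f (true %.2f)\n', a, a_true, b, b_true);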
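
A usage sketch for the LSRegression function above, assuming it is saved as LSRegression.m on the MATLAB path; the dimensions, sample size, and noise level are made-up test values. It fits A and b on synthetic data drawn from a known linear model and reports how well they are recovered.

    % Usage sketch for LSRegression (assumes LSRegression.m is on the path).
    % Dimensions and noise level are illustrative assumptions.
    Dx = 3;  Dy = 2;  N = 5000;
    A_true = randn(Dy,Dx);  b_true = randn(Dy,1);
    X = randn(Dx,N);
    Y = A_true*X + b_true + 0.1*randn(Dy,N);  % b_true broadcasts over columns
                                              % (implicit expansion, R2016b+)
    [A,b] = LSRegression(X,Y);
    fprintf('error in A: %.3g, error in b: %.3g\n', ...
            norm(A - A_true,'fro'), norm(b - b_true));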
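
The last slide notes that sigma can be optimized separately without affecting A and b. A sketch of that step, continuing from the usage example directly above and assuming the isotropic residual model d ~ N(0, sigma^2 I): setting the derivative of C with respect to sigma^2 to zero gives the mean squared residual per output dimension.

    % Continues the usage sketch above (reuses X, Y, A, b, N, Dy).
    R = Y - (A*X + b);                % residuals d_n, one per column
    sigma2 = sum(R(:).^2) / (N*Dy);   % dC/dsigma^2 = 0 => sigma^2 is the
                                      % mean squared residual per dimension
    fprintf('ML estimate sigma^2 = %.4f\n', sigma2);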