ICS 178: Introduction to Machine Learning & Data Mining
Instructor: Max Welling
Lecture 6: Logistic Regression

Logistic Regression
• This is also regression, but with targets Y ∈ {0,1}, i.e. it is classification!
• We will fit a regression function on P(Y=1|X).
linear regression: $Y_n = A X_n + b$
logistic regression: $P(Y_n = 1 \mid X_n) = f(A X_n + b)$, where $f(z) = \frac{1}{1 + \exp(-z)}$
In what follows, $f(X_n)$ is shorthand for $f(A X_n + b) = \frac{1}{1 + \exp[-(A X_n + b)]}$.
[Figure: diagrams of linear regression and logistic regression; labels: $f_i(x)$, $A_{ij}$, $x_j$]

Sigmoid function f(x)
[Figure: sigmoid curve fitted through data-points with Y=1 and data-points with Y=0; labels: $f_i(x)$, $A_{ij}$, $x_j$]
$P(Y_n = 1 \mid X_n) = f(A X_n + b)$

In 2 Dimensions
A, b determine 1) the orientation, 2) the thickness (margin), and 3) the offset of the decision surface.
[Figure: sigmoid f(x) in two dimensions]

Cost Function
• We want a different error measure that is better suited for 0/1 data.
• This can be derived from maximizing the probability of the data again.
$P(Y_n = 1 \mid X_n) = f(A X_n + b)$
$P(Y_n = 0 \mid X_n) = 1 - f(A X_n + b)$
$P(Y_n \mid X_n) = f(X_n)^{Y_n} \, (1 - f(X_n))^{1 - Y_n}$
$\mathrm{Error} = -\sum_{n=1}^{N} \left[ Y_n \log f(X_n) + (1 - Y_n) \log(1 - f(X_n)) \right]$

Learning A,b
• Again, we take the derivatives of the Error with respect to the parameters.
• This time, however, we can't solve for them analytically, so we use gradient descent with a small step size $\eta$:
$A \leftarrow A - \eta \, \frac{d\,\mathrm{Error}}{dA}$
$b \leftarrow b - \eta \, \frac{d\,\mathrm{Error}}{db}$

Gradients for Logistic Regression
• After the math (on the white-board) we find:
$\frac{\partial\,\mathrm{Error}}{\partial A} = -\sum_n \left[ Y_n (1 - f(X_n)) - (1 - Y_n) f(X_n) \right] X_n^T$
$\frac{\partial\,\mathrm{Error}}{\partial b} = -\sum_n \left[ Y_n (1 - f(X_n)) - (1 - Y_n) f(X_n) \right]$
Note: the first term in each equation (multiplied by Y) only sums over data with Y=1, while the second term (multiplied by (1-Y)) only sums over data with Y=0.
Follow the gradient until the change in A,b falls below a small threshold (e.g. 1E-6).

Classification
• Once we have found the optimal values for A,b we classify future data with:
$Y_{new} = \mathrm{round}(f(X_{new}))$
• Least squares and logistic regression are parametric methods, since all the information in the data is stored in the parameters A,b, i.e. after learning you can toss out the data.
• Also, the decision surface is always linear; its complexity does not grow with the amount of data.
• We have imposed our prior knowledge that the decision surface should be linear.

A Real Example (collaboration with S. Cole)
• Fingerprints are matched against a database.
• Each match is scored.
• Using logistic regression we try to predict whether a future match is real or false.
• Human fingerprint examiners claim 100% accuracy. Is this true?

Exercise
• You have laid your hands on a dataset where each data-case has a single attribute and a class label (0 or 1). You train a logistic regression classifier. A new data-case is presented. What do you do to decide in which class it falls? (Use an equation or pseudo-code.)
• How many parameters are there to tune for this problem? Explain what these parameters mean in terms of the function P(Y=1|X).
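
The learning and classification steps from this lecture can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the course's reference implementation: the function names (`sigmoid`, `predict_proba`, `fit_logistic`, `classify`), the step size `eta`, the iteration cap `max_iters`, and the two-attribute toy data are all hypothetical choices made for illustration. The gradient terms follow the expressions on the "Gradients for Logistic Regression" slide, and training stops when the change in A,b falls below a small threshold.

```python
import math


def sigmoid(z):
    """f(z) = 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(A, b, x):
    """P(Y=1 | X=x) = f(A x + b) for one data-case x (a list of attributes)."""
    return sigmoid(sum(a * xj for a, xj in zip(A, x)) + b)


def fit_logistic(X, Y, eta=0.1, tol=1e-6, max_iters=100_000):
    """Gradient descent on
    Error = -sum_n [Y_n log f(X_n) + (1 - Y_n) log(1 - f(X_n))].
    eta, tol and max_iters are assumed values, not taken from the lecture.
    """
    d = len(X[0])
    A, b = [0.0] * d, 0.0
    for _ in range(max_iters):
        grad_A, grad_b = [0.0] * d, 0.0
        for x, y in zip(X, Y):
            f = predict_proba(A, b, x)
            # Per-case gradient factor: -[Y (1 - f) - (1 - Y) f]
            r = -(y * (1.0 - f) - (1.0 - y) * f)
            for j in range(d):
                grad_A[j] += r * x[j]
            grad_b += r
        step_A = [eta * g for g in grad_A]
        step_b = eta * grad_b
        A = [a - s for a, s in zip(A, step_A)]
        b -= step_b
        # Stop once the change in A, b falls below the threshold.
        if max(abs(s) for s in step_A + [step_b]) < tol:
            break
    return A, b


def classify(A, b, x):
    """Y_new = round(f(X_new)): class 1 when P(Y=1 | x) exceeds 0.5."""
    return 1 if predict_proba(A, b, x) > 0.5 else 0


# Hypothetical toy data: two attributes per data-case, overlapping classes.
X = [[0.0, 0.0], [1.0, 0.5], [0.5, 1.0], [2.5, 2.5],
     [3.0, 3.0], [2.0, 2.5], [2.5, 2.0], [0.5, 0.5]]
Y = [0, 0, 0, 0, 1, 1, 1, 1]

A, b = fit_logistic(X, Y)
print("A =", A, "b =", b)
print("class of [3.5, 3.5]:", classify(A, b, [3.5, 3.5]))
print("class of [0.0, 0.0]:", classify(A, b, [0.0, 0.0]))
```

The toy classes overlap on purpose: when the data are perfectly linearly separable, the error can always be lowered by scaling A and b up, so the parameters keep growing and only the iteration cap or the threshold ends the run.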