Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Logistic Regression Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata September 1, 2014 Power (bhp) Recall: Linear Regression 200 180 160 140 120 100 80 60 40 20 0 0 500 1000 1500 2000 2500 Engine displacement (cc) Assume: the relation is linear Then for a given x (=1800), predict the value of y Both the dependent and the independent variables are continuous 2 Scenario: Heart disease – vs – Age Training set Age (numarical): independent variable Heart disease (Y) Yes Heart disease (Yes/No): dependent variable with two classes No 0 20 40 60 Age (X) 80 100 Task: Given a new person’s age, predict if (s)he has heart disease The task: calculate P(Y = Yes | X) 3 Scenario: Heart disease – vs – Age Training set Age (numarical): independent variable Heart disease (Y) Yes Heart disease (Yes/No): dependent variable with two classes No 0 20 40 60 Age (X) 80 100 Task: Given a new person’s age, predict if (s)he has heart disease Calculate P(Y = Yes | X) for different ranges of X A curve that estimates the probability P(Y = Yes | X) 4 The Logistic function Logistic function on t : takes values between 0 and 1 et 1 Logistic(t) = = t 1+ e 1+ e-t If t is a linear function of x L(t) t = b0 + b1x Logistic function becomes: t The logistic curve F(x) = 1 1+ e-( b0 +b1x) Probability of the dependent variable Y taking one value against another 5 The Likelihood function Let, a discrete random variable X has a probability distribution p(x; θ), that depends on a parameter θ In case of Bernoulli’s distribution p(x;q ) = q x (1- q )1-x Intuitively, likelihood is “how likely” is an outcome being estimated correctly by the parameter θ – For x = 1, p(x;θ) = θ – For x = 0, p(x;θ) = 1−θ Given a set of data points x1, x2 ,…, xn, the likelihood function is defined as: n l(q ) = Õ p(xi ;q ) i=1 6 About the Likelihood function n l(q ) = Õ p(xi ;q ) i=1 The actual value does not have any meaning, only the relative likelihood matters, as we want to estimate the parameter θ Constant factors do not matter Likelihood is not a probability density function The sum (or integral) does not add up to 1 In practice it is often easier to work with the log-likelihood Provides same relative comparison The expression becomes a sum æ n ö n L(q ) = ln (l(q )) = ln ç Õ p(xi ;q )÷ = å ln ( p(xi ;q )) è i=1 ø i=1 7 Example Experiment: a coin toss, not known to be unbiased Random variable X takes values 1 if head and 0 if tail Data: 100 outcomes, 75 heads, 25 tails L(q ) = 75´ ln(q )+ 25´ ln(1- q ) Relative likelihood: if θ1 > θ2, L(θ1) > L(θ2) 8 Maximum likelihood estimate Maximum likelihood estimation: Estimating the set of values for the parameters (for example, θ) which maximizes the likelihood function Estimate: én ù argmaxq [ L(q )] = argmaxq êå ln ( p(xi ;q ))ú ë i=1 û One method: Newton’s method – Start with some value of θ and iteratively improve – Converge when improvement is negligible May not always converge 9 Taylor’s theorem If f is a – Real-valued function – k times differentiable at a point a, for an integer k > 0 Then f has a polynomial approximation at a In other words, there exists a function hk, such that and lim x®a ( hk (x)) = 0 Polynomial approximation (k-th order Taylor’s polynomial) 10 Newton’s method Finding the global maximum w* of a function f of one variable Assumptions: 1. 2. The function f is smooth The derivative of f at w* is 0, second derivative is negative Start with a value w = w0 Near the maximum, approximate the function using a second order Taylor polynomial df 1 d2 f f (w) » f (w0 ) + (w - w0 ) + (w - w0 ) 2 dw w=w0 2 dw w=w0 1 » f (w0 ) + (w - w0 ) f '(w0 ) + (w - w0 ) f ''(w0 ) 2 Using the gradient descent approach iteratively estimate the maximum of f 11 Newton’s method 1 f (w) » f (w0 ) + (w - w0 ) f '(w0 ) + (w - w0 ) f ''(w0 ) 2 Take derivative w.r.t. w, and set it to zero at a point w1 1 f '(w1 ) » 0 = f '(w0 ) + f ''(w0 )´ 2(w1 - w0 ) 2 f '(w0 ) Þ w1 = w0 f ''(w0 ) Iteratively: wn+1 = wn - f '(wn ) f ''(wn ) Converges very fast, if at all Use the optim function in R 12 Logistic Regression: Estimating β0 and β1 Logistic function eb0 +b1x 1 F(x) = = b0 +b1x 1+ e 1+ e-( b0 +b1x) Log-likelihood function – Say we have n data points x1, x2 ,…, xn – Outcomes y1, y2 ,…, yn, each either 0 or 1 – Each yi = 1 with probabilities p and 0 with probability 1 − p n L(b ) = ln (l(b )) = å yi ln p(xi ) + (1- yi )ln(1- p(xi )) i=1 n = å yi ( b0 + b1 x ) - ln(1+ eb0 +b1x ) i=1 13 Visualization Fit some plot with parameters β0 and β1 Heart disease (Y) Yes 0.25 0.75 0.5 No 0 20 40 60 80 100 Age (X) 14 Visualization Fit some plot with parameters β0 and β1 Iteratively adjust curve and the probabilities of some point being classified as one class vs another Heart disease (Y) Yes 0.25 0.75 0.5 No 0 20 40 60 80 100 Age (X) For a single independent variable x the separation is a point x = a 15 Two independent variables 150 100 50 0.75 0.5 0.25 0 Income (thousand rupees) 200 Separation is a line where the probability becomes 0.5 30 40 50 60 70 80 Age (Years) 16 Wrapping up classification CLASSIFICATION 17 Binary and Multi-class classification Binary classification: – Target class has two values – Example: Heart disease Yes / No Multi-class classification – Target class can take more than two values – Example: text classification into several labels (topics) Many classifiers are simple to use for binary classification tasks How to apply them for multi-class problems? 18 Compound and Monolithic classifiers Compound models – By combining binary submodels – 1-vs-all: for each class c, determine if an observation belongs to c or some other class – 1-vs-last Monolithic models (a single classifier) – Examples: decision trees, k-NN 19