Download Association Rule Mining - Indian Statistical Institute

Logistic Regression
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
September 1, 2014

Recall: Linear Regression
[Figure: Power (bhp) vs. Engine displacement (cc)]
• Assume: the relation is linear
• Then for a given x (= 1800), predict the value of y
• Both the dependent and the independent variables are continuous

Scenario: Heart disease vs. Age
Training set
– Age (numerical): independent variable
– Heart disease (Yes/No): dependent variable with two classes
[Figure: Heart disease (Y: Yes/No) vs. Age (X)]
Task: Given a new person's age, predict if (s)he has heart disease
The task: calculate P(Y = Yes | X)
• Calculate P(Y = Yes | X) for different ranges of X
• A curve that estimates the probability P(Y = Yes | X)

The Logistic function
Logistic function on t: takes values between 0 and 1

$$\mathrm{Logistic}(t) = \frac{e^{t}}{1 + e^{t}} = \frac{1}{1 + e^{-t}}$$

If t is a linear function of x:

$$t = \beta_0 + \beta_1 x$$

the logistic function becomes:

$$F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

[Figure: the logistic curve, L(t) vs. t]
F(x) is the probability of the dependent variable Y taking one value against another.

The Likelihood function
• Let a discrete random variable X have a probability distribution p(x; θ) that depends on a parameter θ
• In the case of the Bernoulli distribution:

$$p(x; \theta) = \theta^{x} (1 - \theta)^{1 - x}$$

• Intuitively, the likelihood measures how likely an observed outcome is under the parameter θ
  – For x = 1, p(x; θ) = θ
  – For x = 0, p(x; θ) = 1 − θ
• Given a set of data points x1, x2, …, xn, the likelihood function is defined as:

$$l(\theta) = \prod_{i=1}^{n} p(x_i; \theta)$$

About the Likelihood function

$$l(\theta) = \prod_{i=1}^{n} p(x_i; \theta)$$

• The actual value does not have any meaning; only the relative likelihood matters, as we want to estimate the parameter θ
  – Constant factors do not matter
• The likelihood is not a probability density function of θ
  – The sum (or integral) over θ does not add up to 1
• In practice it is often easier to work with the log-likelihood
  – Provides the same relative comparison
  – The expression becomes a sum

$$L(\theta) = \ln\big(l(\theta)\big) = \ln\left(\prod_{i=1}^{n} p(x_i; \theta)\right) = \sum_{i=1}^{n} \ln\big(p(x_i; \theta)\big)$$

Example
• Experiment: a coin toss, not known to be unbiased
• Random variable X takes value 1 if head and 0 if tail
• Data: 100 outcomes, 75 heads, 25 tails

$$L(\theta) = 75 \ln(\theta) + 25 \ln(1 - \theta)$$

• Relative likelihood: θ1 fits the data better than θ2 if L(θ1) > L(θ2)

Maximum likelihood estimate
• Maximum likelihood estimation: estimating the set of values for the parameters (for example, θ) which maximizes the likelihood function
• Estimate:

$$\arg\max_{\theta} \big[L(\theta)\big] = \arg\max_{\theta} \left[\sum_{i=1}^{n} \ln\big(p(x_i; \theta)\big)\right]$$

• One method: Newton's method
  – Start with some value of θ and iteratively improve
  – Stop when the improvement is negligible
• May not always converge

Taylor's theorem
• If f is a
  – real-valued function
  – k times differentiable at a point a, for an integer k > 0
  then f has a polynomial approximation at a
• In other words, there exists a function h_k such that

$$f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots + \frac{f^{(k)}(a)}{k!}(x - a)^k + h_k(x)(x - a)^k$$

  and

$$\lim_{x \to a} h_k(x) = 0$$

• The polynomial part is the polynomial approximation (k-th order Taylor polynomial)

Newton's method
• Finding the global maximum w* of a function f of one variable
Assumptions:
1. The function f is smooth
2. The derivative of f at w* is 0, and the second derivative is negative
• Start with a value w = w0
• Near the maximum, approximate the function using a second-order Taylor polynomial:

$$f(w) \approx f(w_0) + (w - w_0) \left.\frac{df}{dw}\right|_{w = w_0} + \frac{1}{2}(w - w_0)^2 \left.\frac{d^2 f}{dw^2}\right|_{w = w_0}$$

$$\approx f(w_0) + (w - w_0) f'(w_0) + \frac{1}{2}(w - w_0)^2 f''(w_0)$$

• Using an iterative approach (similar in spirit to gradient descent), estimate the maximum of f

Newton's method (continued)

$$f(w) \approx f(w_0) + (w - w_0) f'(w_0) + \frac{1}{2}(w - w_0)^2 f''(w_0)$$

• Take the derivative w.r.t. w, and set it to zero at a point w1:

$$f'(w_1) \approx 0 = f'(w_0) + \frac{1}{2} f''(w_0) \times 2 (w_1 - w_0)$$

$$\Rightarrow \; w_1 = w_0 - \frac{f'(w_0)}{f''(w_0)}$$

Iteratively:

$$w_{n+1} = w_n - \frac{f'(w_n)}{f''(w_n)}$$

• Converges very fast, if at all
• Use the optim function in R

Logistic Regression: Estimating β0 and β1
• Logistic function

$$F(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

• Log-likelihood function
  – Say we have n data points x1, x2, …, xn
  – Outcomes y1, y2, …, yn, each either 0 or 1
  – Each yi = 1 with probability p(xi) and 0 with probability 1 − p(xi), where p(xi) = F(xi)

$$L(\beta) = \ln\big(l(\beta)\big) = \sum_{i=1}^{n} \Big[ y_i \ln p(x_i) + (1 - y_i) \ln\big(1 - p(x_i)\big) \Big]$$

$$= \sum_{i=1}^{n} \Big[ y_i (\beta_0 + \beta_1 x_i) - \ln\big(1 + e^{\beta_0 + \beta_1 x_i}\big) \Big]$$

Visualization
• Fit some curve with parameters β0 and β1
• Iteratively adjust the curve and the probabilities of some point being classified as one class vs. another
[Figure: Heart disease (Y: Yes/No) vs. Age (X), with probability levels 0.25, 0.5 and 0.75 marked]
For a single independent variable x, the separation is a point x = a.

Two independent variables
[Figure: Income (thousand rupees) vs. Age (years), with probability levels 0.25, 0.5 and 0.75 marked]
The separation is a line where the probability becomes 0.5.

Wrapping up classification

Binary and Multi-class classification
• Binary classification:
  – Target class has two values
  – Example: Heart disease Yes / No
• Multi-class classification
  – Target class can take more than two values
  – Example: text classification into several labels (topics)
• Many classifiers are simple to use for binary classification tasks
• How to apply them to multi-class problems?

Compound and Monolithic classifiers
• Compound models
  – Built by combining binary submodels
  – 1-vs-all: for each class c, determine if an observation belongs to c or some other class
  – 1-vs-last
• Monolithic models (a single classifier)
  – Examples: decision trees, k-NN
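
As a small worked illustration of the earlier slides, the coin-toss log-likelihood L(θ) = 75 ln(θ) + 25 ln(1 − θ) can be maximized with the one-variable Newton update w_{n+1} = w_n − f'(w_n) / f''(w_n). The sketch below is in Python rather than R, and the function names, the starting value 0.6, the tolerance and the iteration cap are all assumptions chosen for illustration; none of them come from the slides.

```python
import math


def log_likelihood(theta, heads=75, tails=25):
    """Coin-toss log-likelihood L(theta) = heads*ln(theta) + tails*ln(1 - theta)."""
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def d_log_likelihood(theta, heads=75, tails=25):
    """First derivative L'(theta)."""
    return heads / theta - tails / (1.0 - theta)


def d2_log_likelihood(theta, heads=75, tails=25):
    """Second derivative L''(theta); negative everywhere on (0, 1)."""
    return -heads / theta**2 - tails / (1.0 - theta) ** 2


def newton_maximize(f_prime, f_double_prime, w0, tol=1e-12, max_iter=50):
    """One-variable Newton iteration: w <- w - f'(w) / f''(w).

    Stops when the step becomes negligible or after max_iter steps
    (both thresholds are illustrative assumptions). Convergence is
    not guaranteed for an arbitrary starting point.
    """
    w = w0
    for _ in range(max_iter):
        step = f_prime(w) / f_double_prime(w)
        w = w - step
        if abs(step) < tol:
            break
    return w


# Hypothetical starting guess; the data are 75 heads and 25 tails.
theta_hat = newton_maximize(d_log_likelihood, d2_log_likelihood, w0=0.6)
print(theta_hat)
```

For this data the estimate should agree with the closed-form answer, the observed fraction of heads, which gives a quick sanity check on the iteration. The same update extends to the two parameters β0 and β1 of the logistic model by replacing the derivatives with a gradient vector and a Hessian matrix, which is one way to think about what a general-purpose optimizer does for this problem.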
