Logistic Regression
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
September 1, 2014
Recall: Linear Regression
[Figure: scatter plot of Power (bhp) against Engine displacement (cc)]
 Assume: the relation is linear
 Then for a given x (=1800), predict the value of y
 Both the dependent and the independent variables are
continuous
Scenario: Heart disease – vs – Age
Training set
– Age (numerical): independent variable
– Heart disease (Yes/No): dependent variable with two classes
[Figure: training points, Heart disease (Y: Yes/No) plotted against Age (X)]
Task: Given a new person’s age, predict if (s)he has heart disease
The task: calculate P(Y = Yes | X)
 Calculate P(Y = Yes | X) for different ranges of X
 A curve that estimates the probability P(Y = Yes | X)
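One possible way to estimate P(Y = Yes | X) for different ranges of X is to bucket the training ages and take the fraction of Yes cases in each bucket. A minimal Python sketch; the data values, the function name and the 20-year bucket width are made up for illustration and are not from the lecture:

```python
from collections import defaultdict

WIDTH = 20  # assumed bucket width in years


def probability_by_age_range(ages, has_disease, width=WIDTH):
    """Estimate P(Y = Yes | X in range) as the fraction of Yes cases per age range."""
    counts = defaultdict(lambda: [0, 0])  # range start -> [yes count, total count]
    for age, y in zip(ages, has_disease):
        start = (age // width) * width
        counts[start][0] += y
        counts[start][1] += 1
    return {start: yes / total for start, (yes, total) in sorted(counts.items())}


# Hypothetical training set: 1 = heart disease, 0 = no heart disease
ages = [22, 35, 38, 41, 47, 52, 58, 63, 67, 71, 76, 84]
has_disease = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]

estimates = probability_by_age_range(ages, has_disease)
for start, p in estimates.items():
    print(f"ages {start}-{start + WIDTH - 1}: P(Yes) ~ {p:.2f}")
```

The per-range fractions form a step-shaped estimate; a smooth curve through them is what the logistic function is meant to provide.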
The Logistic function
Logistic function on t: takes values between 0 and 1
$$\mathrm{Logistic}(t) = \frac{e^{t}}{1 + e^{t}} = \frac{1}{1 + e^{-t}}$$
If t is a linear function of x
$$t = \beta_0 + \beta_1 x$$
Logistic function becomes:
$$F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$
[Figure: the logistic curve, L(t) plotted against t]
Probability of the dependent variable Y taking one value against another
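The logistic function and the resulting curve F(x) can be sketched in a few lines of Python. The coefficient values b0 = -5.0 and b1 = 0.1 below are arbitrary placeholders chosen for illustration, not fitted values from the lecture:

```python
import math


def logistic(t):
    """Logistic(t) = e^t / (1 + e^t) = 1 / (1 + e^-t), written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def logistic_curve(x, b0, b1):
    """F(x) with t = b0 + b1 * x."""
    return logistic(b0 + b1 * x)


# Hypothetical coefficients, for illustration only
b0, b1 = -5.0, 0.1
for age in (20, 50, 80):
    print(age, round(logistic_curve(age, b0, b1), 3))
```

With these placeholder coefficients the curve crosses 0.5 where b0 + b1 * x = 0, that is, at x = 50.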
The Likelihood function
 Let a discrete random variable X have a probability distribution p(x; θ) that depends on a parameter θ
 In the case of the Bernoulli distribution
$$p(x;\theta) = \theta^{x}(1-\theta)^{1-x}$$
 Intuitively, the likelihood measures “how likely” the observed outcome is under the parameter value θ
– For x = 1, p(x;θ) = θ
– For x = 0, p(x;θ) = 1−θ
 Given a set of data points x1, x2, …, xn, the likelihood function is defined as:
$$l(\theta) = \prod_{i=1}^{n} p(x_i;\theta)$$
About the Likelihood function
$$l(\theta) = \prod_{i=1}^{n} p(x_i;\theta)$$
 The actual value does not have any meaning, only the relative likelihood matters, as we want to estimate the parameter θ
 Constant factors do not matter
 Likelihood is not a probability density function
 The sum (or integral) does not add up to 1
 In practice it is often easier to work with the log-likelihood
 Provides same relative comparison
 The expression becomes a sum
$$L(\theta) = \ln(l(\theta)) = \ln\left(\prod_{i=1}^{n} p(x_i;\theta)\right) = \sum_{i=1}^{n} \ln(p(x_i;\theta))$$
Example
 Experiment: a coin toss, not known to be unbiased
 Random variable X takes values 1 if head and 0 if tail
 Data: 100 outcomes, 75 heads, 25 tails
$$L(\theta) = 75 \times \ln(\theta) + 25 \times \ln(1-\theta)$$
 Relative likelihood: if L(θ1) > L(θ2), then θ1 explains the data better than θ2; for example, L(0.75) > L(0.5)
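The log-likelihood of the coin example can be evaluated directly to compare candidate values of θ. A small Python sketch, assuming a grid of candidates 0.01, 0.02, …, 0.99 (the grid is an illustrative choice, not part of the lecture):

```python
import math


def log_likelihood(theta, heads=75, tails=25):
    """L(theta) = heads * ln(theta) + tails * ln(1 - theta) for the coin example."""
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


# Compare a grid of candidate parameter values
candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=log_likelihood)

print("L(0.50) =", round(log_likelihood(0.50), 2))
print("L(0.75) =", round(log_likelihood(0.75), 2))
print("best candidate on the grid:", best)
```

The absolute values are negative and carry no meaning on their own; only the comparison between candidates matters.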
Maximum likelihood estimate
 Maximum likelihood estimation: Estimating the set of values for the parameters (for example, θ) which maximizes the likelihood function
 Estimate:
$$\arg\max_{\theta} [L(\theta)] = \arg\max_{\theta} \left[\sum_{i=1}^{n} \ln(p(x_i;\theta))\right]$$
 One method: Newton’s method
– Start with some value of θ and iteratively improve
– Converge when improvement is negligible
 May not always converge
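For the coin toss example (75 heads, 25 tails) the maximum can also be found in closed form, which gives a reference value for any iterative method. A short worked derivation, using only the log-likelihood of that example:

```latex
\begin{align*}
L(\theta) &= 75 \ln(\theta) + 25 \ln(1-\theta) \\
\frac{dL}{d\theta} &= \frac{75}{\theta} - \frac{25}{1-\theta} = 0 \\
75\,(1-\theta) &= 25\,\theta \\
\hat{\theta} &= \frac{75}{100} = 0.75 \\
\frac{d^2 L}{d\theta^2} &= -\frac{75}{\theta^2} - \frac{25}{(1-\theta)^2} < 0
\quad \text{(so the stationary point is a maximum)}
\end{align*}
```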
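Taylor’s theorem
 If f is a
– Real-valued function
– k times differentiable at a point a, for an integer k > 0
Then f has a polynomial approximation at a
 In other words, there exists a function hk, such that
$$f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \cdots + \frac{f^{(k)}(a)}{k!}(x-a)^k + h_k(x)(x-a)^k$$
and
$$\lim_{x \to a} h_k(x) = 0$$
Polynomial approximation (k-th order Taylor’s polynomial): all terms on the right-hand side except the last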
Newton’s method
 Finding the global maximum w* of a function f of one variable
Assumptions:
1. The function f is smooth
2. The derivative of f at w* is 0, second derivative is negative
 Start with a value w = w0
 Near the maximum, approximate the function using a second order Taylor polynomial
$$f(w) \approx f(w_0) + (w - w_0)\left.\frac{df}{dw}\right|_{w=w_0} + \frac{1}{2}(w - w_0)^2\left.\frac{d^2 f}{dw^2}\right|_{w=w_0}$$
$$f(w) \approx f(w_0) + (w - w_0) f'(w_0) + \frac{1}{2}(w - w_0)^2 f''(w_0)$$
 As in gradient descent, improve the estimate of the maximum of f iteratively
Newton’s method
$$f(w) \approx f(w_0) + (w - w_0) f'(w_0) + \frac{1}{2}(w - w_0)^2 f''(w_0)$$
 Take derivative w.r.t. w, and set it to zero at a point w1
$$f'(w_1) \approx 0 = f'(w_0) + \frac{1}{2} f''(w_0) \times 2(w_1 - w_0)$$
$$\Rightarrow w_1 = w_0 - \frac{f'(w_0)}{f''(w_0)}$$
Iteratively:
$$w_{n+1} = w_n - \frac{f'(w_n)}{f''(w_n)}$$
 Converges very fast, if at all
 Use the optim function in R
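The update rule can be tried on the coin toss log-likelihood L(θ) = 75 ln(θ) + 25 ln(1 − θ) from the earlier example. This is a minimal Python sketch rather than the R optim route mentioned in the slides; the function names, the starting value 0.3, the tolerance and the iteration cap are illustrative assumptions:

```python
def newton_maximize(f_prime, f_double_prime, w0, tol=1e-10, max_iter=50):
    """Iterate w_{n+1} = w_n - f'(w_n) / f''(w_n) until the step is negligible."""
    w = w0
    for _ in range(max_iter):
        step = f_prime(w) / f_double_prime(w)
        w = w - step
        if abs(step) < tol:
            return w
    raise RuntimeError("Newton's method did not converge")


def dL(theta):
    """First derivative of 75 ln(theta) + 25 ln(1 - theta)."""
    return 75.0 / theta - 25.0 / (1.0 - theta)


def d2L(theta):
    """Second derivative of 75 ln(theta) + 25 ln(1 - theta)."""
    return -75.0 / theta**2 - 25.0 / (1.0 - theta) ** 2


# Assumed starting value; a poor start can leave (0, 1) and fail to converge
theta_hat = newton_maximize(dL, d2L, w0=0.3)
print("estimate:", theta_hat)
```

From this starting point the iterates settle near 0.75 within a handful of steps, matching the fraction of heads in the data.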
Logistic Regression: Estimating β0 and β1
 Logistic function
$$F(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$
 Log-likelihood function
– Say we have n data points x1, x2, …, xn
– Outcomes y1, y2, …, yn, each either 0 or 1
– Each yi = 1 with probability p(xi) = F(xi) and 0 with probability 1 − p(xi)
$$L(\beta) = \ln(l(\beta)) = \sum_{i=1}^{n} \left[ y_i \ln p(x_i) + (1 - y_i)\ln(1 - p(x_i)) \right]$$
$$= \sum_{i=1}^{n} \left[ y_i(\beta_0 + \beta_1 x_i) - \ln(1 + e^{\beta_0 + \beta_1 x_i}) \right]$$
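One way to maximize this log-likelihood is Newton's method in the two parameters β0 and β1. The Python sketch below is an illustration under stated assumptions, not the lecture's implementation: the ten data points are invented, the step-halving safeguard is an added precaution, and all function names are hypothetical.

```python
import math


def sigmoid(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def log_likelihood(b0, b1, xs, ys):
    """Sum of y_i * (b0 + b1 x_i) - ln(1 + e^(b0 + b1 x_i))."""
    total = 0.0
    for x, y in zip(xs, ys):
        t = b0 + b1 * x
        total += y * t - (max(t, 0.0) + math.log1p(math.exp(-abs(t))))
    return total


def gradient(b0, b1, xs, ys):
    """Partial derivatives of the log-likelihood w.r.t. b0 and b1."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        r = y - sigmoid(b0 + b1 * x)
        g0 += r
        g1 += r * x
    return g0, g1


def fit_logistic(xs, ys, max_iter=50, tol=1e-10):
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0, g1 = gradient(b0, b1, xs, ys)
        # Entries of the negated Hessian: sum of p(1-p) * [[1, x], [x, x^2]]
        h00 = h01 = h11 = 0.0
        for x in xs:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve (negated Hessian) * d = gradient for the 2x2 case
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        # Safeguard (an addition): halve the step while it lowers the log-likelihood
        step = 1.0
        current = log_likelihood(b0, b1, xs, ys)
        while step > 1e-8 and log_likelihood(b0 + step * d0, b1 + step * d1, xs, ys) < current:
            step /= 2.0
        b0 += step * d0
        b1 += step * d1
        if abs(step * d0) + abs(step * d1) < tol:
            break
    return b0, b1


# Invented training set: age and heart disease (1 = Yes, 0 = No)
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
disease = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

beta0, beta1 = fit_logistic(ages, disease)
print("beta0 =", round(beta0, 3), "beta1 =", round(beta1, 4))
print("P(Yes | age 60) =", round(sigmoid(beta0 + beta1 * 60), 3))
```

The invented data deliberately overlaps (some younger Yes cases, some older No cases); with perfectly separated classes the maximum likelihood estimate does not exist and the iteration would not settle.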
Visualization
 Fit a curve with parameters β0 and β1
 Iteratively adjust the curve and the probabilities of some point being classified as one class vs another
[Figure: Heart disease (Y: Yes/No) against Age (X), with a fitted curve and probability levels 0.25, 0.5 and 0.75 marked]
For a single independent variable x the separation is a point x = a
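The separation point a can be written in terms of the fitted parameters. A short derivation, assuming the classes are separated where the estimated probability equals 0.5 and that β1 ≠ 0:

```latex
\begin{align*}
F(a) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 a)}} &= \frac{1}{2} \\
e^{-(\beta_0 + \beta_1 a)} &= 1 \\
\beta_0 + \beta_1 a &= 0 \\
a &= -\frac{\beta_0}{\beta_1}
\end{align*}
```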
Two independent variables
[Figure: Income (thousand rupees) against Age (Years), with probability levels 0.25, 0.5 and 0.75 marked]
Separation is a line where the probability becomes 0.5
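Why the separation is a line can be seen by extending the logistic function to two variables. The notation below is an assumption made for illustration: x1 stands for age, x2 for income, and β2 for the additional coefficient (with β2 ≠ 0); none of these symbols appear in the slides.

```latex
\begin{align*}
F(x_1, x_2) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}} &= \frac{1}{2} \\
\beta_0 + \beta_1 x_1 + \beta_2 x_2 &= 0 \\
x_2 &= -\frac{\beta_0 + \beta_1 x_1}{\beta_2}
\end{align*}
```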
Wrapping up classification
CLASSIFICATION
Binary and Multi-class classification
 Binary classification:
– Target class has two values
– Example: Heart disease Yes / No
 Multi-class classification
– Target class can take more than two values
– Example: text classification into several labels (topics)
 Many classifiers are simple to use for binary
classification tasks
 How to apply them for multi-class problems?
Compound and Monolithic classifiers
 Compound models
– By combining binary submodels
– 1-vs-all: for each class c, determine if an observation
belongs to c or some other class
– 1-vs-last
 Monolithic models (a single classifier)
– Examples: decision trees, k-NN
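The 1-vs-all idea can be sketched as a thin wrapper around any binary learner. In the Python sketch below, the wrapper functions, the centroid-based scorer used as a stand-in binary submodel, and the one-dimensional data are all hypothetical choices made for illustration; any binary classifier that returns a score could take the scorer's place.

```python
def train_one_vs_all(xs, labels, train_binary):
    """Train one binary submodel per class c: 'belongs to c' vs 'some other class'."""
    models = {}
    for c in sorted(set(labels)):
        binary_targets = [1 if label == c else 0 for label in labels]
        models[c] = train_binary(xs, binary_targets)
    return models


def predict_one_vs_all(models, x):
    """Pick the class whose binary submodel gives the highest score."""
    return max(models, key=lambda c: models[c](x))


def train_centroid_scorer(xs, ys):
    """Stand-in binary learner: score is higher when x is nearer the positive centroid."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    pos_mean = sum(pos) / len(pos)
    neg_mean = sum(neg) / len(neg)
    return lambda x: abs(x - neg_mean) - abs(x - pos_mean)


# Hypothetical one-dimensional data with three classes
xs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.2, 8.7]
labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

models = train_one_vs_all(xs, labels, train_centroid_scorer)
for x in (1.1, 5.1, 9.1):
    print(x, "->", predict_one_vs_all(models, x))
```

Taking the highest score is one common way to resolve cases where several submodels, or none, claim an observation.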