Lecture
Generalized Linear Regression Models
Antonia Rom
Chapter 4 - Modeling of Binary Data
Introduction
What is important in modeling?
Problems, Obstacles
4.1 Maximum Likelihood Estimation
What is ML estimation?
Single Binary Response
Grouped Data
Asymptotic Properties
Existence of ML-Estimates
Estimation conditioned on predictor values
VO, Wien, 2014-05-15
Introduction
A generalized linear model consists of:
- a probability distribution from the exponential family
- a linear predictor η = Xβ
- a link function g, whose inverse h = g^{-1} is the response function, such that E(Y) = h(η)
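Binary regression model
Linear predictor: $\eta_i = x_i^T \beta$
Probability: $\pi(x_i) = P(y_i = 1 \mid x_i)$
Response function: $\pi(x_i) = h(x_i^T \beta)$
h is a fully specified function.
This chapter uses the logit model. In this case h is the logistic distribution function, $h(\eta) = \exp(\eta) / (1 + \exp(\eta))$.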
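A minimal sketch of the logistic response function and its inverse, the logit link. The function names `response` and `link` are chosen here for illustration only; they do not come from the lecture.

```python
import math

def response(eta):
    """Logistic response function h: linear predictor -> probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def link(pi):
    """Logit link g = h^{-1}: probability -> linear predictor."""
    return math.log(pi / (1.0 - pi))

eta = 0.8                      # illustrative value of the linear predictor
pi = response(eta)             # probability implied by eta
print(pi, link(pi))            # link(pi) recovers eta up to rounding
```

Applying the link to the response returns the original linear predictor, which is what "the link is the inverse of the response function" means.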
Introduction
The link function g = h^{-1} is the inverse of the response function.
It determines the functional form of the response probabilities.
The linear predictor determines which variables are included and in what form they
determine the response. The unknown parameters β can be estimated by
maximum likelihood.
Maximum likelihood estimates are computed with an iterative algorithm.
-> Linear predictors can contain polynomial terms of continuous variables, dummy
variables and interaction effects.
Care should be taken when specifying the constituents of the model, such as
the linear predictor!
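As a hedged illustration of such a linear predictor, the sketch below builds one row of a design matrix with a polynomial term, a dummy variable and an interaction. The covariates `age` and `smoker` and all coefficient values are hypothetical.

```python
def design_row(age, smoker):
    """One design-matrix row: intercept, age, age^2, dummy, interaction."""
    return [1.0, age, age ** 2, float(smoker), age * smoker]

def linear_predictor(row, beta):
    """eta = x^T beta for a single observation."""
    return sum(x * b for x, b in zip(row, beta))

# Illustrative coefficients only, not estimates from any data set.
beta = [-3.0, 0.05, 0.0005, 0.7, 0.01]
eta = linear_predictor(design_row(40, 1), beta)
print(eta)
```

The model stays linear in β even though it is quadratic in age, which is why such terms fit into the same estimation framework.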
Introduction
• Discrepancy between data and model: does the fit of the model support the inferences
drawn from it?
• Relevance of variables and form of the linear predictor: which variables should be
included, and how?
• Explanatory power of the covariates
• Prognostic power of the model
• Choice of link function: which link function fits the data well and has a simple
interpretation?
These aspects are not independent:
- The model should give an appropriate approximation with a simple predictor; its specification
determines the goodness of fit.
- Specifying the linear predictor aims at finding an adequate form of the covariates and at reducing the variable set.
- Explanatory value aims at quantifying the effect of the covariates within the model.
The first topic is estimation: maximum likelihood estimation.
Maximum-Likelihood Estimation
The basic principle is to construct the likelihood of the unknown parameters for the
sample data: which parameter values (e.g. mean, variance) make the observed sample
most likely? The type of distribution has to be known!
The likelihood represents the joint probability or probability density of the
observed data, considered as a function of the unknown parameters.
What does this mean in practice?
Example:
Two MP3 players, exactly the same, shuffle mode only, no display!
One holds 5 songs
One holds 20 songs
Each MP3 player contains your favorite song.
Unfortunately you mixed them up. So you take one, turn it on, and your
favorite song is played.
If you had to bet, which one would you choose? The one with 5
songs!!! There the chance of hearing your favorite song first is 1/5, compared with 1/20 on the other player.
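The bet can be written as a tiny likelihood comparison. This sketch assumes that shuffle mode picks every song with equal probability; the dictionary keys are illustrative names.

```python
# Number of songs on each player (from the example).
songs = {"player_5": 5, "player_20": 20}

# Likelihood of each player given the observation "favorite song played first",
# assuming shuffle mode draws uniformly from the stored songs.
likelihood = {name: 1.0 / n for name, n in songs.items()}

# The maximum likelihood choice is the player with the largest likelihood.
best = max(likelihood, key=likelihood.get)
print(likelihood, best)
```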
Maximum-Likelihood Estimation
An event A has happened. One wants to draw inference on an underlying quantity
B (e.g. a particular parameter).
To do so, one looks at the conditional probability of A under every candidate
value b̂_i of B, assuming that b̂_i is the true value.
The value b̂_i for which P(A|b̂_i) is largest is the best estimate of
B.
The conditional probability P(A|b̂_i) refers to the given, already observed event A.
P(A|b̂_i) is also called L(b̂_i), the likelihood of b̂_i.
The ML estimator is the value for which the likelihood is a maximum.
-> hence the name maximum likelihood.
Maximum-Likelihood Estimation
If there are too many candidate values b̂_i, it is not possible to calculate every L(b̂_i). In this case a
likelihood function L(b) is set up and its maximum is found by solving L'(b̂) = 0.
Fact (1)
The likelihood L(x) is not the probability that x occurs, but the
conditional probability of the already observed event Z, given that x
holds:
L(x) = P(Z|x)
Fact (2)
The likelihoods of all candidate values do not sum to 1.
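Both facts can be checked on a small example. The sketch below assumes hypothetical data, 7 successes in 10 independent trials, and evaluates the binomial likelihood on a coarse grid of candidate values b; setting the derivative to zero would give the same maximizer, b = 7/10.

```python
import math

n, k = 10, 7  # hypothetical data: 7 successes in 10 trials

def likelihood(b):
    """Binomial likelihood of success probability b given k successes in n trials."""
    return math.comb(n, k) * b ** k * (1.0 - b) ** (n - k)

grid = [i / 10 for i in range(11)]          # candidate values 0.0, 0.1, ..., 1.0
values = [likelihood(b) for b in grid]

best = grid[values.index(max(values))]      # candidate with the largest likelihood
total = sum(values)                         # likelihoods summed over the candidates
print(best, total)
```

The sum over candidates has no reason to equal 1, because the likelihood is a function of the parameter, not a probability distribution over it.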
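MLE – Definition: Single binary response
For independent observations $(y_i, x_i)$, $i = 1, \dots, n$, with $\pi(x_i) = h(x_i^T \beta)$, the likelihood for the responses is given by
$L(\beta) = \prod_{i=1}^{n} \pi(x_i)^{y_i} (1 - \pi(x_i))^{1 - y_i}$
The maximum likelihood estimate β̂ is the value of β that
maximizes the likelihood.
Values of L can get very small, so the log-likelihood is used instead:
$l(\beta) = \sum_{i=1}^{n} \left[ y_i \log \pi(x_i) + (1 - y_i) \log(1 - \pi(x_i)) \right]$
The value β̂ is obtained by solving the system of equations
$s(\hat\beta) = 0$
The vector of derivatives $s(\beta) = \partial l(\beta) / \partial \beta$ is the so-called score function; for the logit model, $s(\beta) = \sum_{i=1}^{n} x_i (y_i - \pi(x_i))$.
The equations are solved iteratively.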
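One common way to solve the score equations iteratively is Newton-Raphson (for the logit model identical to Fisher scoring). The sketch below is an illustration under assumptions, not the lecture's own algorithm: it uses a logit model with an intercept and a single covariate, and a small hypothetical data set with overlapping responses.

```python
import math

def prob(b0, b1, x):
    """Logistic response: P(y = 1 | x) for the linear predictor b0 + b1 * x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def score(b0, b1, xs, ys):
    """Score vector of the logit model: sum_i x_i (y_i - pi(x_i)), x_i = (1, x)."""
    s0 = sum(y - prob(b0, b1, x) for x, y in zip(xs, ys))
    s1 = sum(x * (y - prob(b0, b1, x)) for x, y in zip(xs, ys))
    return s0, s1

def fit_logit(xs, ys, iterations=25):
    """Newton-Raphson: beta <- beta + F(beta)^{-1} s(beta), started at zero."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        s0, s1 = score(b0, b1, xs, ys)
        # Entries of the 2x2 information matrix F = sum_i w_i x_i x_i^T.
        f00 = f01 = f11 = 0.0
        for x in xs:
            w = prob(b0, b1, x) * (1.0 - prob(b0, b1, x))
            f00 += w
            f01 += w * x
            f11 += w * x * x
        det = f00 * f11 - f01 * f01
        # Explicit inverse of the 2x2 matrix applied to the score.
        b0 += (f11 * s0 - f01 * s1) / det
        b1 += (f00 * s1 - f01 * s0) / det
    return b0, b1

# Hypothetical data with overlap, so that a finite maximizer exists.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 1, 0, 1]
b0, b1 = fit_logit(xs, ys)
print(b0, b1, score(b0, b1, xs, ys))
```

At the returned values the score is numerically zero, which is the condition that defines the ML estimate.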
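MLE – Definition:
grouped data – binomially distributed responses
Several independent binary responses are observed at each covariate value: for $i = 1, \dots, N$ there are $n_i$ observations at $x_i$, of which $y_i$ are successes.
The success probability is assumed to depend on x only, so the mean is the same for all binary
observations at this value.
The model has the form:
$\pi(x_i) = h(x_i^T \beta)$
For the collection of binary variables the likelihood has the form
$L(\beta) = \prod_{i=1}^{N} \pi(x_i)^{y_i} (1 - \pi(x_i))^{n_i - y_i}$
The likelihood for the number of successes, $L_{bin}(\beta) = \prod_{i=1}^{N} \binom{n_i}{y_i} \pi(x_i)^{y_i} (1 - \pi(x_i))^{n_i - y_i}$, and the binary observation likelihood L(β)
differ only in the binomial factor $\binom{n_i}{y_i}$, which is irrelevant for maximization because it does not depend on
β. Therefore the log-likelihood is:
$l(\beta) = \sum_{i=1}^{N} \left[ y_i \log \pi(x_i) + (n_i - y_i) \log(1 - \pi(x_i)) \right]$
The score function of the logit model is:
$s(\beta) = \sum_{i=1}^{N} x_i (y_i - n_i \pi(x_i))$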
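That the binomial factor does not matter for maximization can be checked numerically. In this sketch the grouped data (covariate value, number of trials, number of successes) are hypothetical, and a logit model with intercept and one covariate is assumed.

```python
import math

# Hypothetical grouped data: (x_i, n_i, y_i) = covariate, trials, successes.
groups = [(-1.0, 10, 2), (0.0, 12, 5), (1.0, 8, 6)]

def prob(b0, b1, x):
    """Logistic response for the linear predictor b0 + b1 * x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def loglik_binary(b0, b1):
    """Log-likelihood of the individual binary observations."""
    return sum(y * math.log(prob(b0, b1, x))
               + (n - y) * math.log(1.0 - prob(b0, b1, x))
               for x, n, y in groups)

def loglik_binomial(b0, b1):
    """Log-likelihood of the success counts, including the binomial factor."""
    total = 0.0
    for x, n, y in groups:
        p = prob(b0, b1, x)
        total += math.log(math.comb(n, y) * p ** y * (1.0 - p) ** (n - y))
    return total

# The difference is the same constant for every beta.
diff_a = loglik_binomial(0.0, 0.0) - loglik_binary(0.0, 0.0)
diff_b = loglik_binomial(0.3, -1.2) - loglik_binary(0.3, -1.2)
print(diff_a, diff_b)
```

Because the two log-likelihoods differ by a constant, they are maximized by the same β.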
MLE – Definition: Asymptotic Properties
The MLE has some favorable properties.
Asymptotically, the ML estimator exists and is unique.
It is consistent and asymptotically normally distributed.
It is asymptotically efficient.
Consistency
Suppose the likelihood is a smooth, well-behaved function whose maximum is attained at a unique
point β̂.
As the two functions L_n (based on n observations) and L (its limit) get closer, the points at which each attains its maximum also get closer, which
means exactly that β̂ converges to the true parameter value.
MLE – Definition: Asymptotic Properties
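Asymptotic normality:
The estimator not only converges to the unknown parameter, it also converges fast enough, at rate 1/√n.
In ML theory the asymptotic variance of the estimator is determined by the information matrix, or
Fisher matrix, F(β): approximately, $\hat\beta \sim N(\beta, F(\hat\beta)^{-1})$.
For binary data (logit model)
$F(\beta) = \sum_{i=1}^{n} x_i x_i^T \pi(x_i) (1 - \pi(x_i))$
For grouped data (logit model)
$F(\beta) = \sum_{i=1}^{N} n_i x_i x_i^T \pi(x_i) (1 - \pi(x_i))$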
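To show how the Fisher information turns into a standard error, the sketch below uses the simplest possible case: a logit model with an intercept only, where the estimate has a closed form. The data (30 successes in 100 trials) are hypothetical, and the 1.96 multiplier assumes the normal approximation.

```python
import math

n, successes = 100, 30          # hypothetical sample
p_hat = successes / n           # ML estimate of the success probability

# Intercept-only logit model: beta_hat is the logit of the observed proportion.
beta_hat = math.log(p_hat / (1.0 - p_hat))

# Fisher information for the intercept: sum_i pi (1 - pi) = n * p_hat * (1 - p_hat).
fisher = n * p_hat * (1.0 - p_hat)

# Asymptotic standard error and an approximate 95% Wald interval.
se = 1.0 / math.sqrt(fisher)
ci = (beta_hat - 1.96 * se, beta_hat + 1.96 * se)
print(beta_hat, se, ci)
```

A larger sample increases the information and therefore shrinks the standard error proportionally to 1/√n.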
MLE – Definition:
Existence of Maximum Likelihood Estimates
For a finite sample size it may happen that ML estimates do not exist.
ML estimates do not exist when the data set shows complete separation.
ML estimates do not exist either when the data set shows quasi-complete separation.
ML estimates exist when the data set shows overlap.
If the covariates are linearly dependent, a maximum still exists under overlap, but the maximizing β is not unique.
In short: ML estimates exist if there is no hyperplane that separates the 0 and the 1
responses.
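Why separation prevents a finite estimate can be seen numerically. The sketch below assumes a logit model with a single slope and no intercept, and a hypothetical, completely separated data set: all responses are 0 for negative x and 1 for positive x.

```python
import math

# Hypothetical, completely separated data: y = 0 for x < 0 and y = 1 for x > 0.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def loglik(b):
    """Logit log-likelihood for slope b, written as -sum log(1 + exp(-z_i))."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = (2 * y - 1) * b * x          # signed linear predictor
        total -= math.log1p(math.exp(-z))
    return total

slopes = [0.5, 1.0, 2.0, 5.0, 10.0]
values = [loglik(b) for b in slopes]
print(values)
```

The log-likelihood keeps increasing towards 0 as the slope grows, so no finite value of the slope attains the maximum.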
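MLE – Definition:
Estimation conditioned on Predictor Values
Sometimes samples are drawn conditional on the response y.
In such stratified samples one observes x values given y=1 and x values given y=0.
A common example is the case-control study in biomedicine, where y=1 denotes cases and y=0 controls
(called choice-based sampling in econometrics).
Consider the simplest case of a binary predictor, with y ∈ {0,1} and x ∈ {0,1}:
$\operatorname{logit} P(y = 1 \mid x) = \beta_0 + x \beta$ and $\operatorname{logit} P(x = 1 \mid y) = \tilde\beta_0 + y \tilde\beta$
with
$e^{\beta} = \frac{P(y = 1 \mid x = 1) / P(y = 0 \mid x = 1)}{P(y = 1 \mid x = 0) / P(y = 0 \mid x = 0)}$
$e^{\beta}$ is the odds ratio, which captures the association between y and x.
The parameter of association is the same in both models, $\tilde\beta = \beta$, because the odds ratio is symmetric in x and y.
Therefore the model for x given y can be used to estimate the coefficient β of the original logit model.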
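The symmetry of the odds ratio can be checked on a 2x2 table. The cell counts below are hypothetical; keys are (x, y) pairs.

```python
import math

# Hypothetical 2x2 table of counts, indexed by (x, y).
counts = {(1, 1): 30, (1, 0): 20, (0, 1): 10, (0, 0): 40}

def odds_y_given_x(x):
    """Odds of y = 1 versus y = 0 within the stratum with covariate value x."""
    return counts[(x, 1)] / counts[(x, 0)]

def odds_x_given_y(y):
    """Odds of x = 1 versus x = 0 within the stratum with response value y."""
    return counts[(1, y)] / counts[(0, y)]

# Odds ratio from the model for y given x, and from the model for x given y.
or_prospective = odds_y_given_x(1) / odds_y_given_x(0)
or_retrospective = odds_x_given_y(1) / odds_x_given_y(0)

beta = math.log(or_prospective)   # log odds ratio = slope of the logit model
print(or_prospective, or_retrospective, beta)
```

Both directions give the same odds ratio, which is why a sample drawn conditional on y still carries the information about β.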
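MLE – Definition:
Estimation conditioned on Predictor Values
This approach can be motivated by the specific structure of the logit model.
We go back to chapter 2.2.2, Derivation of the binary logit model, which assumes that the predictors are normally
distributed.
Let $f(x \mid r)$ denote the density of x given y=r
and $p(r) = P(y = r)$ the marginal probability.
From Bayes' theorem it follows that
$P(y = 1 \mid x) = \frac{f(x \mid 1) p(1)}{f(x \mid 0) p(0) + f(x \mid 1) p(1)}$
Therefore
$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \log \frac{p(1)}{p(0)} + \log \frac{f(x \mid 1)}{f(x \mid 0)}$
holds.
This shows that a logit model holds if
$\log (f(x \mid 1) / f(x \mid 0))$ has a linear form in x;
the marginal probabilities enter only through $\log (p(1) / p(0))$, so only the intercept depends on them.
The important point is that the marginals determine only the intercept!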
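A hedged numerical check of this statement: the sketch assumes that x is normally distributed in both response classes with equal variance (means 0 and 2, standard deviation 1, all chosen for illustration) and compares the logit of the posterior probability for two different marginal probabilities p(1).

```python
import math

def normal_density(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def logit_posterior(x, p1, mean0=0.0, mean1=2.0, sd=1.0):
    """logit P(y = 1 | x) obtained from Bayes' theorem."""
    numerator = normal_density(x, mean1, sd) * p1
    denominator = normal_density(x, mean0, sd) * (1.0 - p1) + numerator
    posterior = numerator / denominator
    return math.log(posterior / (1.0 - posterior))

def intercept_and_slope(p1):
    """Intercept and slope of the (linear) logit as a function of x."""
    intercept = logit_posterior(0.0, p1)
    slope = logit_posterior(1.0, p1) - intercept
    return intercept, slope

print(intercept_and_slope(0.5))   # balanced marginal probabilities
print(intercept_and_slope(0.1))   # rare response: only the intercept moves
```

Changing p(1) shifts the intercept by the change in log(p(1)/p(0)) and leaves the slope untouched.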
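MLE – Definition:
Estimation conditioned on Predictor Values
The likelihood given y differs from the likelihood given the predictors.
By using
$f(x_i \mid y_i) = P(y_i \mid x_i) f(x_i) / p(y_i)$
one obtains for the log-likelihood conditional on y
$l = \sum_i \log f(x_i \mid y_i) = \sum_i \log P(y_i \mid x_i) + \sum_i \log f(x_i) - \sum_i \log p(y_i)$
- $\sum_i \log P(y_i \mid x_i)$: equivalent to the conditional log-likelihood given the predictors
- $\sum_i \log f(x_i)$: marginal distribution of x (can be maximized by the empirical distribution)
- $\sum_i \log p(y_i)$: marginal distribution of y (fixed by the sampling)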
Summary
General binary model:
link function and linear predictor
Care should be taken when specifying these constituents!
Maximum likelihood estimation
The basic principle is to construct the likelihood of the unknown parameters for the sample data!
ML estimation can cope with complicated linear predictors (interactions, dummy variables etc.)
iterative algorithm
Properties of the MLE
It is consistent and asymptotically normally distributed.
It is asymptotically efficient. (Fisher matrix)
ML estimates might not exist in finite samples. They exist when the data set shows overlap, that is,
when no hyperplane separates the 0 and the 1 responses.
Depending on the sampling design, ML estimation can also be carried out conditional on the response y.
Thank you for your attention!
Note the subtle difference:
we were interested in the probability function because it gives us the
probabilities of realizations for given parameters θ.
With the likelihood function we take the sample as given and
are interested in the unknown parameter θ that makes the
realization of the given sample 'most likely'!
MLE - Example
You can now calculate the probability of Bryant scoring the amounts he actually scored.
Basic principle of MLE!!!
to construct the likelihood of the unknown parameters for the sample data
Let f(ε) denote the density function of ε = Y − M, the deviation of the realized score Y from the prediction M. (Recall that the density function is like a probability
function, and that the density of a normal variable with mean zero is a bell curve with its maximum at ε=0.)
Given the prediction M and the density function, you can compute the probability of Bryant scoring
any particular point total Y. This is given by the formula f(Y-M) = f(ε).
- For example, if you believe that M=32, then the probability that Bryant scores 35 is
given by f(35-32) = f(3).
- If σ=6, for example, then evaluating the normal density gives f(3) ≈ 0.059
Assume that Bryant’s scoring in one game is independent of what he scored in the prior game.
- Recall that the probability of two independent events occurring is just the product of the
probability that each occurs.
- It follows that the probability, or likelihood, of Bryant scoring exactly 33, 22, 25, 40, and 30 points is
just the product of the probabilities of his getting each of these scores.
Given any prediction M, you can write the likelihood score as:
Likelihood score = L = f(33-M) · f(22-M) · f(25-M) · f(40-M) · f(30-M).
MLE - Example
You want to find the “maximum likelihood estimator” (MLE) of M!
- This is the value of M that maximizes L
- Intuitively, you know that the MLE of M would not be 15 or 50 or some number far from his typical scoring
output. It is almost impossible that a player who is predicted to score 15 points per game would actually
score 33, 22, 25, 40, and 30.
- In fact, if M = 15 and σ= 6, then
- L= f(33-15) · f(22-15) · f(25-15) · f(40-15) · f(30-15) = f(18) · f(7) · f(10) · f(25) · f(15) < .0000001
But 32 might be a good candidate to be the MLE. Someone predicted to score 32 points per game has a
reasonable chance of scoring 33, 22, 25, 40, and 30.
- In this case, with σ= 6, L= f(1) · f(-10) · f(-7) · f(8) · f(-2) ≈ 6.3 × 10^-8
- It turns out that the MLE of M is given by the mean of the realized values of Y.
That is, M = 30 and L ≈ 8.3 × 10^-8
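The search for the best M can be sketched as a grid search over whole-number predictions. The sketch assumes, as in the example, that f is the density of a normal distribution with mean zero and σ = 6; the candidate range 15 to 50 is chosen here for illustration.

```python
import math

scores = [33, 22, 25, 40, 30]   # realized point totals from the example
SIGMA = 6.0                     # standard deviation assumed in the example

def f(eps):
    """Density of a normal distribution with mean 0 and standard deviation SIGMA."""
    return math.exp(-eps ** 2 / (2.0 * SIGMA ** 2)) / (SIGMA * math.sqrt(2.0 * math.pi))

def likelihood(m):
    """Likelihood score L for prediction m: product of f(Y - m) over all games."""
    return math.prod(f(y - m) for y in scores)

candidates = range(15, 51)                  # illustrative grid of predictions
best = max(candidates, key=likelihood)      # prediction with the largest likelihood
mean_score = sum(scores) / len(scores)

print(best, mean_score, likelihood(best))
```

The grid search lands on the sample mean, in line with the statement that the MLE of M is the mean of the realized values.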