Log-linear and logistic models
• Generalised linear model
• ANOVA revisited
• Log-linear model: Poisson distribution
• Logistic model: Binomial distribution
• Deviances
• R commands for log-linear and logistic models
ANOVA revisited
Let us recall the purpose of ANOVA: we want to know the differences between the effects of different
parameters. There may be several sets of parameters. If the number of parameters is two,
the t-test is suitable for testing the difference between means. If there are more than two
parameters, we design an experiment according to one of the schemes (n-way crossed,
n-fold nested, or a mixture of them). When we have the results of the experiments, we fit
various linear models under different hypotheses and calculate likelihood ratio
(LR) tests. The LR test turns out to be related to the ratio of the sums of squares under different
hypotheses, and this ratio is related to the F-distribution (if the observations are
normally distributed). If the F-value is large enough, we say that the differences between means are
significant. If it is small, we say that the differences are not significant and we can remove
some parameters from our model.
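As a quick illustration of the linear-model F-test described above, here is a minimal sketch in R (the data are made up for the example):

```r
# Hypothetical data: three observations in each of two groups
y <- c(4, 5, 6, 7, 8, 9)
g <- factor(rep(c("a", "b"), each = 3))

# Fit the linear model and produce the ANOVA table
fit <- lm(y ~ g)
tab <- anova(fit)
tab  # shows the F-statistic and its p-value
```

If the F-value is large (small p-value), the difference between the group means is declared significant.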
One of the assumptions of the ANOVA model is that the observations are normally distributed. If
the number of observations is large enough, this assumption works very well.
There are cases when ANOVA based on the linear model is not adequate. Examples are:
• Outcomes are success or failure. In this case the binomial distribution is more adequate.
• The outcome is a number of occurrences. In this case the Poisson distribution is more adequate.
One more feature of the binomial and Poisson distributions is that they can be applied to
categorical variables (since these distributions are discrete).
Generalised linear model
Linear models are useful when the distribution of the observations is, or can be approximated by,
a normal distribution. Even when this is not the case, for a large number of observations the normal
distribution is a safe assumption. However, there are many cases when a different model
should be used. The generalised linear model is a way of generalising linear models to a wide
range of distributions. If the distribution of the observations is from the
generalised exponential family and some function of the mean value of this distribution is linear in the input
parameters, then a generalised linear model can be used. Recall the generalised exponential
family:

P(y) = exp((yA(θ) + B(θ))/S(φ) + D(y, φ))

The following distributions belong to the generalised exponential family (note that the parameters we are
considering are the mean values, and for simplicity we take S(φ) = 1):
normal: A(θ) = θ, D(y) = −(1/2)(y² + log(2π)), B(θ) = −θ²/2

binomial: A(μ) = log(μ/(1 − μ)), D(y) = log C(n, y), B(μ) = n log(1 − μ)

Poisson: A(μ) = log(μ), D(y) = −log(y!), B(μ) = −μ
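As a check, the Poisson probability mass function with mean μ can be rewritten in exactly this family form:

```latex
P(y) = \frac{e^{-\mu}\,\mu^{y}}{y!}
     = \exp\bigl(y\log\mu \;-\; \mu \;-\; \log y!\bigr)
```

so A(μ) = log(μ), B(μ) = −μ and D(y) = −log(y!), with S(φ) = 1.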
Other members of this family include the gamma, the exponential and many others.
If some function of the mean (θ, μ, μ in the above cases) is a linear function of the input parameters, then
the problem can be handled using a generalised linear model. Usually this function is taken to be A (the link function):

A(θ) = Xβ
Generalised linear model: cont
• Without loss of generality, the general exponential family can be written:

f(y | θ, φ) = exp((yθ − B(θ))/S(φ) + D(y, φ))

If we assume that the form of the distribution is the same for all
observations but the parameters θi differ (with φ the same for all
observations), then the generalised linear model maximises the
likelihood function (for independent observations):

l = log L(y1, …, yn | θ, φ) = (1/S(φ)) Σi=1..n (yi θi − B(θi)) + Σi=1..n D(yi, φ) → max

under the condition that:

θi = Σj=1..p xij βj,  i.e.  θ = Xβ
Poisson distribution: log-linear model
If the distribution of the observations is Poisson, then the log-linear model should be used.
Recall that the Poisson distribution is from the exponential family and the function A of
the mean value is the logarithm, so it can be handled using the generalised linear model.
When is the log-linear model appropriate: when the outcomes are frequencies (expressed as
integers) and the parameters are categorical.
When we have fitted a log-linear model, we can find the estimated mean using the exponential
function:

log(μ) = Σi ai xi,  μ = exp(Σi ai xi)
Example: relation between gray hair and age

                     Age
gray hair    under 40    over 40
yes              27          18
no               33          22

It is similar to the two-fold nested ANOVA model. We could analyse this type of data using
the log-linear model.
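A minimal sketch of fitting a log-linear (independence) model to this table in R; the variable names are chosen for the example:

```r
# Counts from the gray hair / age table above
counts <- c(27, 18, 33, 22)
hair   <- factor(c("yes", "yes", "no", "no"))
age    <- factor(c("under40", "over40", "under40", "over40"))

# Log-linear model: independence of gray hair and age
result <- glm(counts ~ hair + age, family = poisson)
summary(result)

# Residual deviance near 0 means the independence model fits well
deviance(result)
```

For this particular table the fitted values under independence (e.g. 45·60/100 = 27) coincide with the observed counts, so the residual deviance is essentially zero.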
Binomial distribution: logistic model
If the distribution of the result of the experiment is binomial, i.e. the outcome is 0 or 1 (success or
failure), then the logistic model can be used. Recall that the function A of the mean value has the
form:

A(μ) = log(μ/(1 − μ))

This function has a special name: the logit. It has several advantages: if logit(μ) has been estimated,
then we can find μ and it is between 0 and 1. If the probability of success is larger than that of failure,
then this function is positive, otherwise it is negative. Swapping success and
failure changes only the sign of this function. This model can be used when outcomes are
binary (0 and 1).
If logit(μ) is linear, then we can find μ:

logit(μ) = Σi ai xi,  μ = exp(Σi ai xi) / (1 + exp(Σi ai xi))

For the logistic model either grouped variables (fractions of successes) or individual items (each
individual has success (1) or failure (0)) can be used.
The ratio of the probability of success to the probability of failure, μ/(1 − μ), is also called the odds; the logit is therefore the log odds.
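A minimal sketch of a logistic fit in R on made-up individual 0/1 outcomes (the data are hypothetical):

```r
# Hypothetical binary outcomes with one numeric predictor
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 1, 0, 1, 1, 1)

# Logistic model: the logit of the success probability is linear in x
fit <- glm(y ~ x, family = binomial)
summary(fit)

# Fitted values are probabilities, so they lie strictly between 0 and 1
fitted(fit)
```

Because the fitted logit is inverted through exp(·)/(1 + exp(·)), the estimated probabilities are automatically constrained to (0, 1).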
Deviances
In the linear model we maximise the likelihood under the full model and under the hypothesis. Then the
ratio of the values of the likelihood function under the two hypotheses (null and alternative)
is related to the F-distribution. The interpretation is how much the variance would increase if
we removed part of the model (the null hypothesis).
In log-linear and logistic model analysis, the likelihood function is again maximised under the null
and alternative hypotheses. Then the logarithm of the ratio of the likelihood values under
these two hypotheses is asymptotically related to the chi-squared distribution:

χ² = −2(log L(H0) − log L(H1)) = −2 log (L(H0)/L(H1))
That is the reason why in log-linear and logistic regressions it is usual to talk about deviances
and chi-squared statistics instead of variances and F-statistics. Analysis based on log-linear
and logistic models (and generalised linear models in general) is usually called
analysis of deviances. The reason is that the chi-squared statistic is related to the deviation of the
fitted model from the observations.
Another test is based on Pearson's chi-squared statistic. These two tests behave similarly as the
number of observations increases.
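A small R sketch comparing the two statistics on a made-up 2×2 table (the counts are hypothetical):

```r
# Hypothetical 2x2 table of counts
counts <- c(30, 10, 20, 40)
rowf <- factor(c("a", "a", "b", "b"))
colf <- factor(c("x", "y", "x", "y"))

# Independence model; its residual deviance is the likelihood-ratio chi-squared
fit <- glm(counts ~ rowf + colf, family = poisson)
dev <- deviance(fit)

# Pearson's chi-squared, computed from the Pearson residuals of the same fit
pearson <- sum(residuals(fit, type = "pearson")^2)

c(deviance = dev, pearson = pearson)  # the two statistics are similar
</imports>
```

Here the deviance is about 17.26 and Pearson's statistic about 16.67; both are compared to the same chi-squared distribution and agree increasingly well as counts grow.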
R commands for log-linear model
The log-linear model can be analysed using the generalised linear model. Once the factors, the
data and the formula have been decided, we can use:

result <- glm(data ~ formula, family = poisson)

It will give us the fitted model. Then we can use:

anova(result, test = "Chisq")
plot(result)
summary(result)

Interpretation of the results is similar to the linear model ANOVA table. Degrees of
freedom are defined similarly. The only difference is that deviances are used instead of
sums of squares.
R commands for logistic regression
Similar to the log-linear model: decide what the data and the factors are and what formula
should be used. Then use the generalised linear model to fit:

result <- glm(data ~ formula, family = binomial)

then analyse using:

anova(result, test = "Chisq")
summary(result)
plot(result)
Exercises: generalised linear
a) Show that the gamma distribution is from the exponential family (r is
constant):

f(y | μ) = μ^r y^(r−1) e^(−μy) / Γ(r)

b) Find the moment generating function for the natural exponential family:

f(y | θ, φ) = exp((yθ − B(θ))/S(φ) + D(y, φ))

Hint: use the fact that the density of the distribution should be normalised to 1:

∫ f(y | θ, φ) dy = 1 = exp(−B(θ)/S(φ)) ∫ exp(yθ/S(φ) + D(y, φ)) dy

Then use the definition of the moment generating function. Find the first and
the second moments.