Log-linear and logistic models
• Generalised linear model
• ANOVA revisited
• Log-linear model: Poisson distribution
• Logistic model: Binomial distribution
• Deviances
• R commands for log-linear and logistic models
ANOVA revisited
Let us recall the purpose of ANOVA: we want to know the differences between the effects of different
parameters. There may be several sets of parameters. If the number of parameters is two,
the t-test is suitable for testing the difference between means. If there are more than two
parameters, we design an experiment according to one of the schemes (n-way crossed,
n-fold nested, or a mixture of them). When we have the results of the experiments, we fit
various linear models under different hypotheses and calculate likelihood ratio
(LR) tests. The LR test turns out to be related to the ratio of the sums of squares under different
hypotheses, and this ratio is related to the F-distribution (if the observations are
normally distributed). If the F-value is large enough, we say that the differences between means are
significant. If it is small, we say that the differences are not significant and we can remove
some parameters from our model.
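As a quick illustration of the linear-model F-test described above, here is a minimal sketch in R (the data are made up for the example):

```r
# Hypothetical data: three observations in each of two groups
y <- c(4, 5, 6, 7, 8, 9)
g <- factor(rep(c("a", "b"), each = 3))

# Fit the linear model and produce the ANOVA table
fit <- lm(y ~ g)
tab <- anova(fit)
tab  # shows the F-statistic and its p-value
```

If the F-value is large (small p-value), the difference between the group means is declared significant.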
One of the assumptions of the ANOVA model is that the observations are normally distributed. If
the number of observations is large enough, this assumption works very well.
There are cases when ANOVA based on the linear model is not adequate. Examples are:
• Outcomes are success or failure. In this case the binomial distribution is more adequate.
• The outcome is a number of occurrences. In this case the Poisson distribution is more adequate.
One more feature of the binomial and Poisson distributions is that they can be applied to
categorical variables (since these distributions are discrete).
Generalised linear model
Linear models are useful when the distribution of the observations is, or can be approximated by,
a normal distribution. Even when this is not the case, for a large number of observations the normal
distribution is a safe assumption. However, there are many cases when a different model
should be used. The generalised linear model is a way of generalising linear models to a wide
range of distributions. If the distribution of the observations is from the
generalised exponential family and some function of the mean value of this distribution is linear in the input
parameters, then a generalised linear model can be used. Recall the generalised exponential
family:

P(y) = exp((yA(θ) + B(θ))/S(φ) + D(y, φ))

The following distributions belong to the generalised exponential family (note that the parameters we are
considering are the mean values, and for simplicity we take S(φ) = 1):
normal: A(θ) = θ, D(y) = −(1/2)(y² + log(2π)), B(θ) = −θ²/2

binomial: A(μ) = log(μ/(1 − μ)), D(y) = log C(n, y), B(μ) = n log(1 − μ)

Poisson: A(μ) = log(μ), D(y) = −log(y!), B(μ) = −μ
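As a check, the Poisson probability mass function with mean μ can be rewritten in exactly this family form:

```latex
P(y) = \frac{e^{-\mu}\,\mu^{y}}{y!}
     = \exp\bigl(y\log\mu \;-\; \mu \;-\; \log y!\bigr)
```

so A(μ) = log(μ), B(μ) = −μ and D(y) = −log(y!), with S(φ) = 1.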
Other members of this family include the gamma, the exponential and many others.
If some function of the mean (θ, μ, μ in the above cases) is a linear function of the input parameters, then
the problem can be handled using a generalised linear model. Usually this function is taken to be A (the link function):

A(θ) = Xβ
Generalised linear model: cont
• Without loss of generality, the general exponential family can be written:

f(y | θ, φ) = exp((yθ − B(θ))/S(φ) + D(y, φ))

If we assume that the form of the distribution is the same for all
observations but the parameters θi differ (with φ the same for all
observations), then the generalised linear model maximises the
likelihood function (for independent observations):

l = log L(y1, …, yn | θ, φ) = (1/S(φ)) Σi=1..n (yi θi − B(θi)) + Σi=1..n D(yi, φ) → max

under the condition that:

θi = Σj=1..p xij βj,  i.e.  θ = Xβ
Poisson distribution: log-linear model
If the distribution of the observations is Poisson, then the log-linear model should be used.
Recall that the Poisson distribution is from the exponential family and the function A of
the mean value is the logarithm, so it can be handled using the generalised linear model.
When is the log-linear model appropriate: when the outcomes are frequencies (expressed as
integers) and the parameters are categorical.
When we have fitted a log-linear model, we can find the estimated mean using the exponential
function:

log(μ) = Σi ai xi,  μ = exp(Σi ai xi)
Example: relation between gray hair and age

                     Age
gray hair    under 40    over 40
yes              27          18
no               33          22

It is similar to the two-fold nested ANOVA model. We could analyse this type of data using
the log-linear model.
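A minimal sketch of fitting a log-linear (independence) model to this table in R; the variable names are chosen for the example:

```r
# Counts from the gray hair / age table above
counts <- c(27, 18, 33, 22)
hair   <- factor(c("yes", "yes", "no", "no"))
age    <- factor(c("under40", "over40", "under40", "over40"))

# Log-linear model: independence of gray hair and age
result <- glm(counts ~ hair + age, family = poisson)
summary(result)

# Residual deviance near 0 means the independence model fits well
deviance(result)
```

For this particular table the fitted values under independence (e.g. 45·60/100 = 27) coincide with the observed counts, so the residual deviance is essentially zero.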
Binomial distribution: logistic model
If the distribution of the result of the experiment is binomial, i.e. the outcome is 0 or 1 (success or
failure), then the logistic model can be used. Recall that the function A of the mean value has the
form:

A(μ) = log(μ/(1 − μ))

This function has a special name: the logit. It has several advantages: if logit(μ) has been estimated,
then we can find μ and it is between 0 and 1. If the probability of success is larger than that of failure,
then this function is positive, otherwise it is negative. Swapping success and
failure changes only the sign of this function. This model can be used when outcomes are
binary (0 and 1).
If logit(μ) is linear, then we can find μ:

logit(μ) = Σi ai xi,  μ = exp(Σi ai xi) / (1 + exp(Σi ai xi))

For the logistic model either grouped variables (fractions of successes) or individual items (each
individual has success (1) or failure (0)) can be used.
The ratio of the probability of success to the probability of failure, μ/(1 − μ), is also called the odds; the logit is therefore the log odds.
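A minimal sketch of a logistic fit in R on made-up individual 0/1 outcomes (the data are hypothetical):

```r
# Hypothetical binary outcomes with one numeric predictor
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 1, 0, 1, 1, 1)

# Logistic model: the logit of the success probability is linear in x
fit <- glm(y ~ x, family = binomial)
summary(fit)

# Fitted values are probabilities, so they lie strictly between 0 and 1
fitted(fit)
```

Because the fitted logit is inverted through exp(·)/(1 + exp(·)), the estimated probabilities are automatically constrained to (0, 1).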
Deviances
In the linear model we maximise the likelihood under the full model and under the hypothesis. Then the
ratio of the values of the likelihood function under the two hypotheses (null and alternative)
is related to the F-distribution. The interpretation is how much the variance would increase if
we removed part of the model (the null hypothesis).
In log-linear and logistic model analysis, the likelihood function is again maximised under the null
and alternative hypotheses. Then the logarithm of the ratio of the likelihood values under
these two hypotheses is asymptotically related to the chi-squared distribution:

χ² = −2(log L(H0) − log L(H1)) = −2 log (L(H0)/L(H1))
That is the reason why in log-linear and logistic regressions it is usual to talk about deviances
and chi-squared statistics instead of variances and F-statistics. Analysis based on log-linear
and logistic models (and generalised linear models in general) is usually called
analysis of deviances. The reason is that the chi-squared statistic is related to the deviation of the
fitted model from the observations.
Another test is based on Pearson's chi-squared statistic. These two tests behave similarly as the
number of observations increases.
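A small R sketch comparing the two statistics on a made-up 2×2 table (the counts are hypothetical):

```r
# Hypothetical 2x2 table of counts
counts <- c(30, 10, 20, 40)
rowf <- factor(c("a", "a", "b", "b"))
colf <- factor(c("x", "y", "x", "y"))

# Independence model; its residual deviance is the likelihood-ratio chi-squared
fit <- glm(counts ~ rowf + colf, family = poisson)
dev <- deviance(fit)

# Pearson's chi-squared, computed from the Pearson residuals of the same fit
pearson <- sum(residuals(fit, type = "pearson")^2)

c(deviance = dev, pearson = pearson)  # the two statistics are similar
</imports>
```

Here the deviance is about 17.26 and Pearson's statistic about 16.67; both are compared to the same chi-squared distribution and agree increasingly well as counts grow.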
R commands for log-linear model
The log-linear model can be analysed using the generalised linear model. Once the factors, the
data and the formula have been decided, we can use:

result <- glm(data ~ formula, family = poisson)

It will give us the fitted model. Then we can use:

anova(result, test = "Chisq")
plot(result)
summary(result)

Interpretation of the results is similar to the linear model ANOVA table. Degrees of
freedom are defined similarly. The only difference is that deviances are used instead of
sums of squares.
R commands for logistic regression
Similar to the log-linear model: decide what the data and the factors are and what formula
should be used. Then use the generalised linear model to fit:

result <- glm(data ~ formula, family = binomial)

then analyse using:

anova(result, test = "Chisq")
summary(result)
plot(result)
Exercises: generalised linear
a) Show that the gamma distribution is from the exponential family (r is
constant):

f(y | μ) = μ^r y^(r−1) e^(−μy) / Γ(r)

b) Find the moment generating function for the natural exponential family:

f(y | θ, φ) = exp((yθ − B(θ))/S(φ) + D(y, φ))

Hint: use the fact that the density of the distribution should be normalised to 1:

∫ f(y | θ, φ) dy = 1 = exp(−B(θ)/S(φ)) ∫ exp(yθ/S(φ) + D(y, φ)) dy

Then use the definition of the moment generating function. Find the first and
the second moments.