Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R
Hong Tran, April 21, 2015
Laboratory for Interdisciplinary Statistical Analysis
LISA helps VT researchers benefit from the use of Statistics
Collaboration:
Visit our website to request personalized statistical advice and assistance with:
Designing Experiments • Analyzing Data • Interpreting Results
Grant Proposals • Software (R, SAS, JMP, Minitab...)
LISA statistical collaborators aim to explain concepts in ways useful for your research.
Great advice right now: Meet with LISA before collecting your data.
LISA also offers:
Educational Short Courses: Designed to help graduate students apply statistics in their research
Walk-In Consulting: Available Monday-Friday from 1-3 PM in the Old Security Building (OSB) for questions <30 mins. See our website for additional times and locations.
All services are FREE for VT researchers. We assist with research—not class projects or homework.
www.lisa.stat.vt.edu
Outline
1. What is CDA?
2. Contingency Table
3. Measures of Association
4. Test of Independence
5. What is GLM? When should we use it?
6. How to evaluate the GLM models?
7. Logistic Regression
8. Poisson Regression
What is CDA?
Dependent Variable (Y)   Independent Variables (X)   Model
Continuous (Normal)      Continuous                  Linear Regression
Continuous               Categorical                 ANOVA
Continuous               Mixed                       ANCOVA
Categorical              Categorical                 CDA
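Contingency Table
• I rows for the categories of X
• J columns for the categories of Y
• Each of the I×J cells corresponds to one possible outcome (one combination of an X category and a Y category) and holds the number of observations with that outcome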
Example 1 of Contingency Table
• The relationship between smoking and epidermoid/undifferentiated pulmonary carcinoma (cancer)
• Cohort study conducted
• 2×2 contingency table
• Does smoking increase the risk of having epidermoid/undifferentiated pulmonary carcinoma?
Generating Contingency Table in R
• Input the 2×2 table in R as a 2×2 matrix
• Convert the matrix to a table using the function as.table(), because some functions work better with table objects than with matrices
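The two steps above can be sketched as follows. The counts (18, 13, 46, 236; n = 313) are read off the joint proportions reported for Example 1 later in this transcript; the object names and dimension labels are illustrative assumptions, not taken from the original slides.

```r
# Counts for Example 1 (rows: smoking status, columns: disease status).
# byrow = TRUE fills the matrix one row at a time.
smoke_mat <- matrix(c(18,  13,
                      46, 236),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Smoking = c("Smoker", "Non-smoker"),
                                    Status  = c("Cases", "Control")))

# Convert the matrix to a table object.
smoke_tab <- as.table(smoke_mat)

# Show the counts with row/column totals, then the joint proportions.
print(addmargins(smoke_tab))
print(prop.table(smoke_tab))
```

addmargins() and prop.table() are base R helpers; they are used here only to display the totals and proportions.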
Measures of Association
• Continuous variables: Pearson Correlation Coefficient
• Ordinal variables: Pearson Correlation Coefficient
• Nominal variables: Phi Coefficient and Cramer's V
Pearson Correlation
Pearson Correlation Example 2
• mtcars data set in R
• Source: 1974 Motor Trend US magazine
  mpg: miles per gallon
  wt: weight
  drat: rear axle ratio
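A minimal sketch of Example 2, assuming the goal is the pairwise Pearson correlations among mpg, wt and drat. It uses only base R's cor(); the by-hand version is included to show what the coefficient computes, and the variable names other than the mtcars columns are illustrative.

```r
data(mtcars)

# Pairwise Pearson correlations among the three variables.
cor_mat <- cor(mtcars[, c("mpg", "wt", "drat")], method = "pearson")
print(round(cor_mat, 3))

# Correlation between mpg and wt, via cor() and by hand:
# r = sum((x - xbar)(y - ybar)) / sqrt(sum((x - xbar)^2) * sum((y - ybar)^2))
r_mpg_wt  <- cor(mtcars$mpg, mtcars$wt)
r_by_hand <- with(mtcars,
                  sum((mpg - mean(mpg)) * (wt - mean(wt))) /
                    sqrt(sum((mpg - mean(mpg))^2) * sum((wt - mean(wt))^2)))
print(c(cor = r_mpg_wt, by_hand = r_by_hand))
```

The negative sign of the mpg/wt correlation says that heavier cars in this data set tend to have lower fuel economy.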
Phi Coefficient
• Measures the association between two binary variables.
• Its value ranges from -1 to +1, where +1 indicates perfect positive association, -1 indicates perfect negative association, and 0 indicates no association.
• The square of the phi coefficient equals the chi-squared statistic for the 2×2 contingency table divided by the sample size: $\phi^2 = \chi^2 / n$.
Cramer's V
• Cramer's V measures the association between two nominal variables.
• It varies from 0 (no association) to 1 (complete association) and can reach 1 only when the two variables are equal to each other.
Measures of Association
Comments:
1. When both variables are binary, Cramer's V equals the absolute value of the Phi Coefficient.
2. In R, the function phi() in the psych package computes the Phi Coefficient.
3. In R, the function assocstats() in the vcd package computes Cramer's V.
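The sketch below computes both measures for the Example 1 table in base R, so that it runs without extra packages; the psych and vcd calls named in the comments above are shown only behind a check that the package is installed. The formula used for Cramer's V, sqrt(chi-squared / (n × min(I-1, J-1))), is the standard definition and is stated here as an assumption because the slides do not show it.

```r
smoke_tab <- as.table(matrix(c(18, 13, 46, 236), nrow = 2, byrow = TRUE,
                             dimnames = list(Smoking = c("Smoker", "Non-smoker"),
                                             Status  = c("Cases", "Control"))))
n <- sum(smoke_tab)

# Pearson chi-squared statistic without Yates' continuity correction.
chi_sq <- unname(chisq.test(smoke_tab, correct = FALSE)$statistic)

# Phi for a 2x2 table: (ad - bc) / sqrt(product of the four marginal totals).
phi_coef <- (smoke_tab[1, 1] * smoke_tab[2, 2] - smoke_tab[1, 2] * smoke_tab[2, 1]) /
  sqrt(prod(rowSums(smoke_tab)) * prod(colSums(smoke_tab)))

# Cramer's V from the chi-squared statistic.
cramers_v <- sqrt(chi_sq / (n * (min(dim(smoke_tab)) - 1)))

print(c(phi = phi_coef, cramers_v = cramers_v))

# Package versions, run only if the packages are available.
if (requireNamespace("psych", quietly = TRUE)) print(psych::phi(smoke_tab))
if (requireNamespace("vcd", quietly = TRUE)) print(vcd::assocstats(smoke_tab))
```

For this 2×2 table the two measures coincide in magnitude, which illustrates comment 1.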
Test of Independence
• Large sample size: Chi-square Test
• Small sample size: Fisher's Exact Test
Test of Independence (Chi-square Test)
Back to Example 1
              Cases     Control    Total
Smoker        18/313    13/313     31/313
Non-smoker    46/313    236/313    282/313
Total         64/313    249/313    1
Test of Independence (Fisher's Exact Test)
• When any of the expected counts fall below 5, the Chi-square test is not appropriate. Instead, we use Fisher's Exact Test.
• Example 3: The following data are from a Stanford University study of the effectiveness of the antidepressant Celexa in the treatment of compulsive shopping.

           Worse   Same   Better
Celexa     2       3      7
Placebo    2       8      2
Test of Independence
• Chi-Square Test: use the R function chisq.test()
• Fisher's Exact Test: use the R function fisher.test()
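A short sketch applying the two functions to the examples in this transcript: the chi-square test to the Example 1 counts (18, 13, 46, 236) and Fisher's exact test to the Example 3 table. Note that chisq.test() applies Yates' continuity correction to 2×2 tables by default; correct = FALSE is used here to get the plain Pearson statistic, which is a choice made for this illustration rather than something the slides specify. Object names are illustrative.

```r
# Example 1: smoking and pulmonary carcinoma.
smoke_tab <- as.table(matrix(c(18, 13, 46, 236), nrow = 2, byrow = TRUE,
                             dimnames = list(Smoking = c("Smoker", "Non-smoker"),
                                             Status  = c("Cases", "Control"))))
chi_ex1 <- chisq.test(smoke_tab, correct = FALSE)
print(chi_ex1$expected)   # all expected counts are at least 5
print(chi_ex1)

# Example 3: Celexa versus placebo.
celexa_tab <- as.table(matrix(c(2, 3, 7,
                                2, 8, 2),
                              nrow = 2, byrow = TRUE,
                              dimnames = list(Treatment = c("Celexa", "Placebo"),
                                              Outcome   = c("Worse", "Same", "Better"))))

# Expected counts under independence; some fall below 5, so the chi-square
# approximation is doubtful (R warns about this, hence suppressWarnings()).
expected_ex3 <- suppressWarnings(chisq.test(celexa_tab))$expected
print(expected_ex3)

# Fisher's exact test does not rely on the large-sample approximation.
fisher_ex3 <- fisher.test(celexa_tab)
print(fisher_ex3)
```

In both tests the null hypothesis is that the row and column variables are independent, so a small p-value is evidence of association.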
Generalized Linear Models
• Used when the response variable is not continuous or not normally distributed, for example:
  ─ Counts: 0, 1, 2, …
  ─ Binary: 0 and 1
Comparison
                            General Linear Model           Generalized Linear Model
Special cases               ANOVA, ANCOVA, MANOVA,         Linear regression,
                            MANCOVA, linear regression,    logistic regression,
                            mixed model                    Poisson regression
Function in R               lm                             glm
Typical estimation method   Least Squares                  Maximum Likelihood
Ordinary Linear Regression
• Ordinary Linear Regression (OLR) investigates and models the linear relationship between independent variables and a continuous dependent variable.
• The simplest regression is Simple Linear Regression, which models the linear relationship between a single independent variable and a single dependent variable.
• Simple Linear Regression Model: $y = \beta_0 + \beta_1 x + \varepsilon$
Assumptions in OLR
The assumptions are:
• The true relationship between x and y is linear.
• The errors are normally distributed with mean zero and unknown common variance $\sigma^2$.
• The errors are uncorrelated.
The possible approaches when the assumptions of a normally distributed dependent variable with constant variance are violated:
• Data transformations
• Weighted least squares
• Generalized linear model (GLM)
GLM Model
• The function $g$ is called the link function because it connects the mean $\mu$ of the dependent variable to the linear predictor: $g(\mu) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$
• The dependent variable's distribution must come from the Exponential Family of Distributions
• This family includes the Normal, Bernoulli, Binomial, Poisson, Gamma, etc.
3 Components
• Random: Identifies the dependent variable Y and its probability distribution
• Systematic: Independent variables in a linear predictor function
• Link function: Invertible function $g$ that links the mean of the dependent variable to the systematic component.
Response Distribution
Types of GLMs
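As an illustration of how a response distribution pairs with a link function, the sketch below lists the default link that R's built-in family objects attach to four members of the exponential family. This reflects R's defaults for glm(), not necessarily the table shown on the original slide; the object names are illustrative.

```r
# Family objects bundle a response distribution with a link function.
families <- list(Normal   = gaussian(),
                 Binomial = binomial(),
                 Poisson  = poisson(),
                 Gamma    = Gamma())

# Default link of each family.
default_links <- sapply(families, function(fam) fam$link)
print(default_links)

# Each family carries the link g and its inverse; for the binomial
# family these are the logit and inverse-logit functions.
logit     <- binomial()$linkfun
inv_logit <- binomial()$linkinv
print(inv_logit(logit(0.25)))   # the inverse undoes the link
```

A non-default link can be requested explicitly, for example binomial(link = "probit").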
GLM and OLR
• Ordinary linear regression is a special case of GLM
• In OLR, the 3 components for GLM are:
  ─ Random: the dependent variable is normally distributed with mean $\mu$ and variance $\sigma^2$
  ─ Systematic: Independent variables in a linear predictor function
  ─ Link function: Identity link $g(\mu) = \mu$
Therefore, the GLM for ordinary linear regression is $\mu = E(Y) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$, with $Y \sim N(\mu, \sigma^2)$.
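A quick check of this special case, assuming the mtcars variables from Example 2 (mpg as response, wt as predictor; the choice of variables is only for illustration). Fitting the same model with lm() and with glm() using the Gaussian family and identity link should give the same coefficients.

```r
data(mtcars)

# Ordinary linear regression by least squares.
fit_lm <- lm(mpg ~ wt, data = mtcars)

# The same model as a GLM: Normal response, identity link.
fit_glm <- glm(mpg ~ wt, family = gaussian(link = "identity"), data = mtcars)

print(coef(fit_lm))
print(coef(fit_glm))

# For the Gaussian family the deviance is the residual sum of squares.
print(c(deviance = deviance(fit_glm), rss = sum(resid(fit_lm)^2)))
```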
Model Evaluation: Deviance
• Deviance measures how closely the predicted values from the fitted model match the actual values from the raw data.
Definition:
Deviance = -2 [log-likelihood(proposed model) - log-likelihood(saturated model)]
• A saturated model is a model that fits the data perfectly, so its log-likelihood is the maximum attainable. It has as many parameters as observations and hence provides no simplification at all.
• Under the null hypothesis, the deviance has an asymptotic chi-squared distribution.
• The degrees of freedom are n-p, where n is the number of observations and p is the number of model parameters.
• The smaller the deviance, the better the model fits.
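The definition can be verified numerically. The sketch below fits a Poisson GLM to R's built-in warpbreaks data (a data set chosen here for illustration; it does not appear in the original slides) and computes the deviance from the two log-likelihoods. For a Poisson model the saturated model sets each fitted mean equal to the observed count.

```r
data(warpbreaks)

# Proposed model: number of breaks explained by wool type and tension.
fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

y <- warpbreaks$breaks

# Log-likelihood of the proposed model.
ll_model <- as.numeric(logLik(fit))

# Log-likelihood of the saturated model: fitted mean = observed count.
ll_saturated <- sum(dpois(y, lambda = y, log = TRUE))

# Deviance from the definition, compared with R's built-in value.
dev_by_hand <- -2 * (ll_model - ll_saturated)
print(c(by_hand = dev_by_hand, built_in = deviance(fit)))

# Degrees of freedom: n - p.
n <- length(y)
p <- length(coef(fit))
print(c(n_minus_p = n - p, df_residual = df.residual(fit)))
```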
Inference in GLM
• Goodness of Fit test
  ─ The null hypothesis is that the model is a good alternative to the saturated model.
  ─ Deviance is the Likelihood Ratio Statistic
• Likelihood Ratio test
  ─ Allows for the comparison of one model to another model by looking at the difference in deviance of the two models.
  ─ Null Hypothesis: the predictor variables in Model 1 that are not found in Model 2 are not significant to the model fit.
  ─ Alternative Hypothesis: the predictor variables in Model 1 that are not found in Model 2 are significant to the model fit.
  ─ Under the null hypothesis, the Likelihood Ratio Statistic (LRS) approximately follows a Chi-square distribution with degrees of freedom equal to the difference in the number of parameters.
  ─ Simpler (nested) models have larger deviance.
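Both tests can be sketched with base R. The example again assumes the built-in warpbreaks data and two nested Poisson models, neither of which comes from the original slides; the model and object names are illustrative.

```r
data(warpbreaks)

# Two nested Poisson models: the larger one adds the tension predictor.
fit_small <- glm(breaks ~ wool, family = poisson, data = warpbreaks)
fit_big   <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

# Goodness of fit: compare the residual deviance with a chi-squared
# distribution on n - p degrees of freedom. A small p-value suggests
# the model fits clearly worse than the saturated model.
gof_p <- pchisq(deviance(fit_big), df = df.residual(fit_big), lower.tail = FALSE)
print(gof_p)

# Likelihood ratio test: difference in deviance between the two models.
lrs     <- deviance(fit_small) - deviance(fit_big)
df_diff <- df.residual(fit_small) - df.residual(fit_big)
lrt_p   <- pchisq(lrs, df = df_diff, lower.tail = FALSE)
print(c(LRS = lrs, df = df_diff, p_value = lrt_p))

# anova() reports the same comparison.
anova_tab <- anova(fit_small, fit_big, test = "Chisq")
print(anova_tab)
```

A small likelihood ratio p-value indicates that the extra predictors in the larger model improve the fit.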
Model Comparison in GLM
• Two additional measures for model comparison are:
  ─ Akaike Information Criterion (AIC)
    • Penalizes a model for having many parameters
    • AIC = -2 logLikelihood + 2*p, where p is the number of parameters in the model
    • The smaller the AIC, the better the model
  ─ Bayesian Information Criterion (BIC)
    • BIC = -2 logLikelihood + ln(n)*p, where p is the number of parameters in the model and n is the number of observations
    • Usually a stronger penalty for additional parameters than AIC
    • The smaller the BIC, the better the model
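The two formulas can be checked against R's AIC() and BIC() functions. This sketch assumes a Poisson GLM on the built-in warpbreaks data (an illustrative choice, not from the slides), for which p is simply the number of regression coefficients.

```r
data(warpbreaks)
fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

log_lik <- as.numeric(logLik(fit))  # maximized log-likelihood
p <- length(coef(fit))              # number of parameters
n <- nobs(fit)                      # number of observations

# Apply the formulas directly.
aic_by_hand <- -2 * log_lik + 2 * p
bic_by_hand <- -2 * log_lik + log(n) * p

print(c(aic_by_hand = aic_by_hand, AIC = AIC(fit)))
print(c(bic_by_hand = bic_by_hand, BIC = BIC(fit)))
```

Because ln(n) exceeds 2 once n is 8 or more, BIC penalizes each extra parameter more heavily than AIC for all but very small samples.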
Summary
• Setup of GLM
• Inference in GLM
• Deviance and Likelihood Ratio Test
  ─ Test goodness of fit for the proposed GLM model
  ─ Test the significance of a predictor variable or set of predictor variables in the model
• Model Comparison in GLM
  ─ AIC
  ─ BIC
Logistic Regression
• Logistic regression is a regression technique for predicting the outcome of a binary dependent variable.
  Example: y = 1 (success), y = 0 (failure)
• Random Component: the dependent variable follows a Bernoulli distribution
  ─ Probability of Success: $p$
  ─ Probability of Failure: $1-p$
  ─ The probability of obtaining y=1 or y=0 is given by the Bernoulli Distribution: $P(Y = y) = p^y (1-p)^{1-y}$, for $y \in \{0, 1\}$
  ─ Mean(Y): $\mu = p$
Steps for Logistic Regression in R
1. Create a single vector of 0's and 1's for the response variable.
2. Use the function glm() with family=binomial to fit the model.
3. Test for goodness of fit and significance of predictors.
4. Interpretation
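The four steps can be sketched on the mtcars data from Example 2. The choice of response (am, transmission type coded 0/1) and predictor (wt) is an assumption made for illustration; the slides do not say which data the course used for logistic regression.

```r
data(mtcars)

# Step 1: the response is a vector of 0's and 1's (am: 0 = automatic, 1 = manual).
y <- mtcars$am

# Step 2: fit the model; the binomial family uses the logit link by default.
fit_logit <- glm(am ~ wt, family = binomial, data = mtcars)
print(summary(fit_logit))

# Step 3: likelihood ratio test of the predictor against the intercept-only model.
fit_null <- glm(am ~ 1, family = binomial, data = mtcars)
lrt <- anova(fit_null, fit_logit, test = "Chisq")
print(lrt)

# Step 4: interpretation. exp(coefficient) is the multiplicative change in the
# odds of y = 1 for a one-unit increase in the predictor.
odds_ratio <- exp(coef(fit_logit))
print(odds_ratio)

# Predicted probability of y = 1 for a hypothetical car with wt = 3.
pred <- predict(fit_logit, newdata = data.frame(wt = 3), type = "response")
print(pred)
```

For ungrouped 0/1 data the residual deviance is not a reliable goodness-of-fit statistic, which is why this sketch tests the predictor with a likelihood ratio test instead.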
Poisson Regression
• Poisson regression is a regression technique for predicting the outcome of a count dependent variable.
• The dependent variable measures the number of occurrences in a given time frame.
• Outcomes equal to 0, 1, 2, …
• Examples:
  Number of penalties during a football game.
  Number of customers who shop at a grocery store on a given day.
  Number of car accidents at an intersection during a period of time.
Steps for Poisson Regression in R
1. Input data where y is a column of counts.
2. Use the function glm() with family=poisson to fit the model.
3. Test for goodness of fit and significance of predictors.
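A minimal sketch of these steps, assuming R's built-in warpbreaks data (number of warp breaks per loom, by wool type and tension) as the count response. The data set and object names are illustrative and do not come from the original slides.

```r
# Step 1: data with a column of counts (breaks).
data(warpbreaks)
stopifnot(all(warpbreaks$breaks >= 0))

# Step 2: fit the model; the poisson family uses the log link by default.
fit_pois <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)
print(summary(fit_pois))

# Step 3a: goodness of fit from the residual deviance.
gof_p <- pchisq(deviance(fit_pois), df = df.residual(fit_pois), lower.tail = FALSE)
print(gof_p)

# Step 3b: likelihood ratio test for dropping each predictor in turn.
print(drop1(fit_pois, test = "Chisq"))

# With the log link, exp(coefficient) is the multiplicative change in the
# expected count relative to the reference level.
rate_ratio <- exp(coef(fit_pois))
print(rate_ratio)
```

If the goodness-of-fit p-value is very small, the Poisson assumption that the variance equals the mean may not hold for the data.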