GCRC Data Analysis with SPSS Workshop
Session 5
Follow Up on FEV data
Binary and Categorical Outcomes
2x2 tables
2xK tables
JxK tables
Logistic Regression
Definitions
Model selection
Assessing the Model Fit
Low Birth Weight data
Regression of LOG(FEV) on 4 Predictors

Full Dataset (N=654)
           b       SEb     p       STD b    VIF
Age       .023    .003    .000     .207     3.0
Ht        .043    .002    .000     .73      2.8
Smoke    -.046    .021    .028    -.041     1.2
Sex       .029    .012    .013     .044     1.1
rsq = .81

Subset on Age > 8 yrs (N=439)
           b       SEb     p       STD b    VIF
Age       .022    .004    .000     .196     1.6
Ht        .044    .002    .000     .707     1.6
Smoke    -.05     .021    .018    -.07      1.2
Sex       .011    .015    .446     .022     1.2
rsq = .67
Contingency Tables
X and Y are categorical response variables (with I and J categories)
The probability distribution {πij} is the joint distribution of X and Y
If X is not random, the joint distribution is not meaningful, but the
distribution of Y is conditional on X
Marginal Distribution: the row {πi.} and column {π.j} totals obtained by
summing the joint probabilities
Conditional Distribution: Given a subject is in row i of X, πj|i is the
probability of classification into column j of Y
Prospective studies: the totals {ni.} for X are usually fixed, and each row
of J counts is an independent multinomial sample on Y
Retrospective studies: the totals {n.j} for Y are usually fixed and each
column of I counts is an independent multinomial sample on X
Cross-sectional studies: the total sample size is fixed, and the IJ cell
counts are a multinomial sample
Contingency Tables (continued)
Joint (Conditional) and Marginal Probability
                Y = 1           Y = 2           Total
X = 1      π11 (π1|1)       π12 (π2|1)       π1. (1.0)
X = 2      π21 (π1|2)       π22 (π2|2)       π2. (1.0)
Total         π.1              π.2             1.0
In any case, if X and Y are independent, πij = πi.π.j

The maximum likelihood estimates of the πij are the cell proportions pij = nij/n

Under the assumption of independence, the expected cell counts are

$$\hat{m}_{ij} = n\,p_{i\cdot}\,p_{\cdot j} = \frac{n_{i\cdot}\,n_{\cdot j}}{n}$$

and the chi-square statistic

$$\chi^2 = \sum_{i,j} \frac{(n_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}},$$

with (I-1)(J-1) degrees of freedom, can be used to test the null
hypothesis of independence
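As an illustration not in the original notes, scipy.stats.chi2_contingency performs the same computation in Python, returning the Pearson statistic, its p-value, the degrees of freedom, and the expected counts m̂ij (this is also what the SPSS crosstabs procedure described later reports):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Any I x J table of observed counts works; this 2x2 example reuses
# the small table from the Fisher's exact test slide below.
observed = np.array([[3, 7],
                     [5, 10]])
stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(stat, p, dof)   # chi-square, p-value, (I-1)(J-1) = 1
print(expected)       # m-hat_ij = n_i. * n_.j / n
```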
For a single multinomial variable, the analogous statistic, constructed
similarly,

$$\chi^2 = \sum_i \frac{(n_i - m_i)^2}{m_i},$$

with (I-1) degrees of freedom, can be used to compare the observed cell
proportions to a distribution with fixed values {πi0} (also known as
goodness-of-fit). Here mi = nπi0
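A quick sketch of the same goodness-of-fit test in Python; the counts and the null proportions here are invented for illustration:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([18, 55, 27])      # hypothetical counts, n = 100
pi0 = np.array([0.25, 0.50, 0.25])     # fixed null proportions pi_i0
expected = observed.sum() * pi0        # m_i = n * pi_i0
stat, p = chisquare(observed, f_exp=expected)   # df = I - 1 = 2
```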
SPSS output (complete list in SPSS Notes)
Analyze > Descriptive Statistics > Crosstabs
select Statistics, Chi-square
• Likelihood Ratio is a goodness-of-fit statistic similar to Pearson's
chi-square. For large sample sizes, the two statistics are equivalent.
The advantage of the likelihood-ratio chi-square is that it can be
subdivided into interpretable parts that add up to the total.
For smaller sample sizes, this is the statistic to report; since it
approaches the Pearson as n increases, it can be reported in either
case.
• Fisher’s Exact Test is a test for independence in a 2×2 table. It is
most useful when the total sample size and the expected values are
small. The test holds the marginal totals fixed and computes the
hypergeometric probability that n11 is at least as large as the
observed value
http://www.swogstat.org/stat/public/fisher.htm
            Y
X        yes    no    total
yes        3     7     10
no         5    10     15
total      8    17     25

TABLE = [ 3, 7, 5, 10 ]
Left  : p-value = 0.6069
Right : p-value = 0.72639
2-Tail: p-value = 1
The output consists of three p-values:
Left: Use this when the alternative to independence is that there is
negative association between the variables.
That is, the observations tend to lie in lower left and upper right.
Right: Use this when the alternative to independence is that there is
positive association between the variables.
That is, the observations tend to lie in upper left and lower right.
2-Tail: Use this when there is no prior alternative.
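The same three p-values can be checked in Python; as a sketch, scipy's alternative="less" corresponds to the Left tail (negative association, small n11) and "greater" to the Right tail:

```python
from scipy.stats import fisher_exact

table = [[3, 7], [5, 10]]
_, p_left = fisher_exact(table, alternative="less")      # negative association
_, p_right = fisher_exact(table, alternative="greater")  # positive association
_, p_two = fisher_exact(table, alternative="two-sided")  # no prior alternative
```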
Multiple Logistic Regression
$$E(Y \mid x) = P(Y = 1 \mid x) = \pi(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}$$

The relationship between π(x) and X is S-shaped.

The logit (log-odds) transformation (link function)

$$g(x) = \ln\!\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

has many of the desirable properties of the linear regression model,
while relaxing some of the assumptions.
The maximum likelihood (ML) estimates of the model parameters are computed by iteration.
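As a sketch not in the original notes, here is such a model fitted by ML in Python on simulated data (all variable names are invented; SPSS's Binary Logistic procedure fits the same model):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
x1, x2 = rng.normal(size=n), rng.binomial(1, 0.4, n)
logit_true = -1 + 0.8 * x1 + 1.2 * x2
y = rng.binomial(1, 1 / (1 + np.exp(-logit_true)))   # simulated 0/1 outcome

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.Logit(y, X).fit(disp=0)   # ML estimates found by iteration
print(fit.params)                  # beta_0, beta_1, beta_2
```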
Assumptions for Logistic Regression
• The independent variables are linear in the logit. It is also
possible to add explicit interaction and power terms, as in OLS
regression.

• The dependent variable need not be normally distributed (it is
assumed to follow a distribution in the exponential family, such as
the normal, Poisson, binomial, or gamma).

• The dependent variable need not be homoscedastic for each
level of the independents; that is, there is no homogeneity-of-variance
assumption.

• Normally distributed error terms are not assumed.

• The independent variables may be binary, categorical, or
continuous.
Applications
Identify risk factors: test H0: βi = 0 for each candidate risk factor,
while controlling for confounders and other important determinants of
the event

Classification: predict the outcome for a new observation with a particular
constellation of risk factors (a form of discriminant analysis)
Design Variables (coding)
In SPSS, designate Categorical to get k-1 indicators for a k-level factor
          design variable
RACE        D1      D2
White        0       0
Black        1       0
Other        0       1
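A sketch of the same k-1 coding in Python; pandas produces the indicators much as SPSS's Categorical designation does (the data values here are invented):

```python
import pandas as pd

race = pd.Series(["White", "Black", "Other", "Black"], name="RACE")
race = race.astype(pd.CategoricalDtype(["White", "Black", "Other"]))
# drop_first=True drops the reference level (White), leaving k-1 = 2 indicators
dummies = pd.get_dummies(race, prefix="D", drop_first=True)
print(dummies)   # columns D_Black, D_Other correspond to D1, D2 above
```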
• Interpretation of the parameters
If p is the probability of an event and O is the odds for that event, then

$$O = \frac{p}{1-p} = \frac{\text{probability of event}}{\text{probability of no event}}$$

… and the link function in logistic regression gives the log-odds:

$$g(x) = \ln\!\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$
…and the odds ratio, OR, for X = 1 vs. X = 0 is built from the four cell probabilities:

$$\text{X=1:}\quad \pi(1) = \frac{e^{\beta_0+\beta_1}}{1+e^{\beta_0+\beta_1}}, \qquad 1-\pi(1) = \frac{1}{1+e^{\beta_0+\beta_1}}$$

$$\text{X=0:}\quad \pi(0) = \frac{e^{\beta_0}}{1+e^{\beta_0}}, \qquad 1-\pi(0) = \frac{1}{1+e^{\beta_0}}$$

$$OR = \frac{\pi(1)\,[1-\pi(0)]}{\pi(0)\,[1-\pi(1)]} = (\text{tedious algebra}) = e^{\beta_1}$$
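So the exponentiated slope is the odds ratio (SPSS reports it as Exp(B)). The "tedious algebra" can be checked numerically with hypothetical coefficients:

```python
import numpy as np

b0, b1 = -1.5, 0.7                               # hypothetical coefficients
pi1 = np.exp(b0 + b1) / (1 + np.exp(b0 + b1))    # pi(1)
pi0 = np.exp(b0) / (1 + np.exp(b0))              # pi(0)
OR = (pi1 * (1 - pi0)) / (pi0 * (1 - pi1))
print(OR, np.exp(b1))                            # both equal e^b1
```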
Definitions and Annotated SPSS output for Logistic Regression
http://www2.chass.ncsu.edu/garson/pa765/logistic.htm#assumpt
Virtually any sin that can be committed with least squares
regression can be committed with logistic regression. These
include stepwise procedures and arriving at a final model by
looking at the data. All of the warnings and recommendations
made for least squares regression apply to logistic regression as
well ...
Gerard Dallal
• Assessing the Model Fit

There are several R²-like measures; they are not goodness-of-fit
tests but rather attempts to measure strength of association.

Cox and Snell's R-square is an attempt to imitate the
interpretation of multiple R-square based on the likelihood, but
its maximum can be (and usually is) less than 1.0, making it
difficult to interpret. It is part of SPSS output.

Nagelkerke's R-square is a further modification of the Cox and
Snell coefficient: it divides Cox and Snell's R² by its maximum in
order to achieve a measure that ranges from 0 to 1. Nagelkerke's
R-square will therefore normally be higher than the Cox and
Snell measure. It is part of SPSS output and is the most reported
of the R-squared estimates. See Nagelkerke (1991).
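A sketch of how the two measures are computed from the null-model and fitted-model log-likelihoods (the formulas follow Nagelkerke 1991):

```python
import numpy as np

def pseudo_r2(ll_model, ll_null, n):
    """Cox-Snell and Nagelkerke R2 from log-likelihoods and sample size."""
    cox_snell = 1 - np.exp(2 * (ll_null - ll_model) / n)
    cs_max = 1 - np.exp(2 * ll_null / n)     # maximum attainable Cox-Snell R2
    return cox_snell, cox_snell / cs_max     # (Cox-Snell, Nagelkerke)
```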
Hosmer and Lemeshow's Goodness of Fit Test
tests the null hypothesis that the data were generated by the
fitted model
1. Divide subjects into deciles based on their predicted probabilities.
2. Compute a chi-square statistic from the observed and expected
   frequencies in the deciles.
3. Compute a probability (p) value from the chi-square distribution
   with 8 degrees of freedom to test the fit of the logistic model.

If the Hosmer and Lemeshow goodness-of-fit statistic has
p = .05 or less, we reject the null hypothesis that there is no
difference between the observed and model-predicted
values of the dependent. (This means the model predicts
values significantly different from the observed values.)
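A minimal sketch of one common form of the computation, assuming y is a 0/1 outcome array and p the fitted probabilities (SPSS's grouping rules differ in small details):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, groups=10):
    """HL statistic: group subjects into deciles of predicted probability,
    then compare observed and expected event counts within each decile."""
    order = np.argsort(p)
    y, p = np.asarray(y)[order], np.asarray(p)[order]
    stat = 0.0
    for idx in np.array_split(np.arange(len(p)), groups):
        n_g = len(idx)
        obs = y[idx].sum()    # observed events in this decile
        exp = p[idx].sum()    # expected events in this decile
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n_g))
    return stat, chi2.sf(stat, groups - 2)   # df = 10 - 2 = 8
```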
Observed vs. Predicted
This particular model performs better
when the event rate is low
[Figure: scatter plot of observed vs. expected event counts per decile; both axes run from 0 to 20.]
• Check for Linearity in the LOGIT

Box-Tidwell Transformation (Test): add to the logistic model
interaction terms which are the cross-product of each
independent times its natural logarithm [(X)ln(X)]. If these terms
are significant, then there is nonlinearity in the logit. This method
is not sensitive to small nonlinearities.
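A sketch of the Box-Tidwell check on simulated data (all names invented; a significant p-value on the added (X)ln(X) term signals nonlinearity in the logit):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
age = rng.uniform(15, 45, 200)    # positive continuous predictor
y = rng.binomial(1, 1 / (1 + np.exp(-(-4 + 0.1 * age))))

# Box-Tidwell: include X and X*ln(X) together in the model
X = sm.add_constant(np.column_stack([age, age * np.log(age)]))
fit = sm.Logit(y, X).fit(disp=0)
print(fit.pvalues[2])   # p-value for the AGE*ln(AGE) term
```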
Orthogonal polynomial contrasts, an option in SPSS, may be
used. This option treats each independent as a categorical
variable and computes logit (effect) coefficients for each
category, testing for linear, quadratic, cubic, or higher-order
effects. The logit should not change over the contrasts. This
method is not appropriate when the independent has a large
number of values, inflating the standard errors of the contrasts.
• Residual Plots

Plot Cook's distance against the predicted probabilities π̂j.
Several other plots suggested in Hosmer & Lemeshow (p. 177) involve
further manipulation of the statistics produced by SPSS.

• External Validation
a new sample
a hold-out sample

• Cross Validation (classification)
n-fold (leave 1 out)
V-fold (divide the data into V subsets)
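A sketch of V-fold and leave-one-out cross-validated classification accuracy in Python (the sklearn calls are real; the dataset is simulated for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3))
y = (X @ [1.0, -0.5, 0.8] + rng.normal(size=150) > 0).astype(int)

model = LogisticRegression(max_iter=1000)
acc_vfold = cross_val_score(model, X, y, cv=10).mean()           # V-fold, V=10
acc_loo = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()  # leave-one-out
```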
Pitfalls

1. Multiple comparisons (data-driven model / data dredging)
2. Overfitting
   - complex models fit to a small dataset: a good fit in THIS dataset
     that does not generalize; you're modeling the random error
   - rule of thumb: at least 10 events per independent variable
   - validation: new data to check predictive ability and calibration;
     a hold-out sample
   - look for sensitivity to a single observation (residuals)
3. Violating the assumptions: more serious in prediction models than
   in association models
4. There are many strategies: don't try them all
   - choose one based on the structure of the question
   - draw primary conclusions based on that one
   - examine robustness to other strategies
CASE STUDY

1. Develop a strategy for analyzing Hosmer & Lemeshow's Low Birth
   Weight data using LOW as the dependent variable.
2. Try ANCOVA for the same data with BWT (birth weight in grams)
   as the dependent variable.

LBW.SAV is on the S drive under GCRC data analysis
References

Hosmer, D. W. and Lemeshow, S. (2000) Applied Logistic Regression,
2nd ed., John Wiley & Sons, New York, NY

Harrell, F. E., Lee, K. L., Mark, D. B. (1996) "Multivariable Prognostic
Models: Issues in Developing Models, Evaluating Assumptions and
Adequacy, and Measuring and Reducing Errors", Statistics in
Medicine, 15, 361-387

Nagelkerke, N. J. D. (1991) "A Note on a General Definition of the
Coefficient of Determination", Biometrika, 78(3), 691-692.
Covers the two measures of R-square for logistic regression which
are found in SPSS output.

Agresti, A. (1990) Categorical Data Analysis, John Wiley & Sons, New
York, NY