Epidemiological analysis
PhD-course in epidemiology
Lau Caspar Thygesen
Associate professor, PhD
25th February 2014
Age standardization
• Incidence and prevalence are strongly age-dependent
– Risks rising (e.g. chronic diseases) or declining
(e.g. measles) with age
• Comparisons between populations and over
time may be very misleading
• A single age-independent index representing a
set of age-specific rates may be more
appropriate
Mortality in Denmark and Greenland,
men, 1975
Please interpret this table.
Direct standardization
IR(DK-standardized to Greenlandic age-distribution)
= 0.016*12.2+0.076*0.7+0.268*0.160+0.506*1.4+0.110*11.2+0.024*66.5
= 3.8
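The calculation above is a weighted sum of age-specific rates. A minimal Python sketch follows. It assumes that the first factor in each product is the Greenlandic age-group share and the second is the Danish age-specific rate (in the units of the table). The function name is illustrative only.

```python
# Age-group shares of the standard (Greenlandic) population, as used on the slide.
weights = [0.016, 0.076, 0.268, 0.506, 0.110, 0.024]
# Danish age-specific mortality rates for the same age groups (units as in the table).
rates = [12.2, 0.7, 0.160, 1.4, 11.2, 66.5]


def direct_standardized_rate(weights, rates):
    """Weighted sum of age-specific rates; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("standard population weights must sum to 1")
    return sum(w * r for w, r in zip(weights, rates))


print(round(direct_standardized_rate(weights, rates), 1))  # 3.8
```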
Indirect standardization
[Figure: Lung Cancer Denmark Women. x-axis: year, 1943-2009; series: rateCrude, segi, scand]
Example 2
• Incidence of multiple sclerosis
• Denmark
• 1950-2004
• European Standard Population
Example
• Trend study of lung cancer incidence among
women
• Denmark
• 1943-2010
[Figure: Lung Cancer Denmark Women. x-axis: year, 1943-2009; series: rateCrude]
Example indirect standardization
• 19,185 subjects (3,817 women) who attended
outpatient clinics for alcohol abusers
• Copenhagen
• 1952-1992
• Compare the incidence of heart disease with the
incidence rate in the greater Copenhagen area
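Indirect standardization applies the reference population's age-specific rates to the study cohort's person-years to get an expected number of cases, and then compares observed with expected. The Python sketch below uses entirely hypothetical numbers. The age groups, person-years, reference rates and case count are invented for illustration and do not come from the Copenhagen study.

```python
# Hypothetical illustration only: none of these numbers come from the study.
person_years = {"30-49": 10_000, "50-69": 8_000, "70+": 2_000}  # study cohort
reference_rate = {"30-49": 0.002, "50-69": 0.010, "70+": 0.030}  # cases per person-year
observed_cases = 240


def expected_cases(person_years, reference_rate):
    """Cases expected if the cohort had the reference population's rates."""
    return sum(person_years[age] * reference_rate[age] for age in person_years)


def standardized_ratio(observed, expected):
    """Standardized incidence (or mortality) ratio: observed / expected."""
    return observed / expected


expected = expected_cases(person_years, reference_rate)
print(f"expected = {expected:.0f}, ratio = {standardized_ratio(observed_cases, expected):.2f}")
```

A ratio above 1 means more cases were observed in the cohort than the reference rates would predict.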
Problems
• Direct standardisation can produce unreliable
estimates when the calculations are based on
small numbers
• Indirect standardisations from different
populations cannot be directly compared –
only compared to the standard
Compared to regression methods
• Regression based methods are available but are
rarely applied in practice
• When individual data are available (presence /
absence of disease, age and sex), a logistic
regression can be used to estimate the
standardized rate
• The main advantage is that it allows adjustment
by continuous variables in addition to categorical
variables
Missing data
• What does missing mean?
• The pattern of missingness (nomenclature)
– How and why is it missing?
• Methods for handling
Missing values
• Common in research
– Nonresponse
– Loss to follow-up
– Lack of overlap between linked data sets (not
so common)
What is item nonresponse?
• Unit Nonresponse vs. Item Nonresponse
ID   Q1 Q2 Q3
456  1 1 2
457  4 2
458  ? ?
459  3 2
ID   Q1 Q2 Q3
456  1 1 2 1
457  4 ? 1 ?
458  ? 2 1 1
459  3 2 ?
Unit Nonresponse Examples
• Person who is not at home
• Person who does not pick up the phone
• Person who hangs up on you
• Rat that dies before the study
• The country you could not get data on
• etc.
Item Nonresponse
• “I Don’t Know”
• Refusals to respond
• Questions left blank
• Failed measurement
• etc.
Best way to deal with Missing
Data is not to have any
Minimizing Unit Nonresponse
• Call back if not home
• Refusal conversion
• Don’t mess up
• Clear and understandable questionnaire
• Polite request
• Incentives
What kind of missing data should be
modeled?
• Model an item that is missing from your dataset if you
suspect that it has a true value
• “I don’t know” might simply mean “I don’t know”
– Don’t model it as if there was a true value
• Dead people (attrition)
Minimizing Item Nonresponse
• Well written questions
• Minimize misunderstandings
– cross-cultural example
– Standardized vs. non-standardized
• Minimize skip patterns
The pattern of missingness (nomenclature)
• Ignorable
– MCAR - Missing Completely at Random
– MAR - Missing at Random
• Non-ignorable
– NMAR - Not Missing at Random
Missing completely at random
Missing Completely at Random: if the data are missing
completely at random, then missing values cannot be
predicted any better using the other information in the data set
• Cause of missingness is a completely random process (like a coin
flip)
• Cause uncorrelated with variables of interest
– Example: parents move
• No bias if cause omitted
• In the unlikely event that the process is missing completely at
random, inferences based on complete cases are
unbiased, but inefficient because we have lost some cases
Missing not at random
Non-Ignorable / NMAR: the probability that a cell is missing
depends on the unobserved value of the missing item
For example, high-income people are more likely to refuse to
answer survey questions about income, and the other variables
in the data set cannot predict which respondents have high
income
If your missing data is non-ignorable, then inferences based on
complete cases will be biased and inefficient
Missing at random
• Missingness may be related to measured variables
• But no residual relationship with unmeasured variables
• No bias if you control for measured variables
• For example, if highly educated are more likely to participate
in a survey, then the process is missing at random as long as we
know the educational level of all persons
• If data is missing at random, then inferences based on
complete cases may be biased unless you control for the
measured variables, and will be inefficient
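The three mechanisms can be compared in a small simulation. The Python sketch below is an illustration under invented assumptions: the population, the income model and the response probabilities are all hypothetical. It compares the crude complete-case mean with a mean adjusted for the fully observed education variable.

```python
import random
from statistics import fmean

random.seed(2014)
N = 50_000

# Hypothetical population: education is fully observed and raises mean income.
edu = [random.random() < 0.5 for _ in range(N)]
income = [random.gauss(40 if e else 30, 5) for e in edu]
true_mean = fmean(income)
share_high = sum(edu) / N


def respondents(keep_prob):
    """Keep each (income, edu) pair with probability keep_prob(income, edu)."""
    return [(y, e) for y, e in zip(income, edu) if random.random() < keep_prob(y, e)]


def crude_mean(sample):
    """Complete-case mean, ignoring education."""
    return fmean(y for y, _ in sample)


def adjusted_mean(sample):
    """Education-specific means, weighted by the known education distribution."""
    high = fmean(y for y, e in sample if e)
    low = fmean(y for y, e in sample if not e)
    return share_high * high + (1 - share_high) * low


mcar = respondents(lambda y, e: 0.7)                     # coin flip
mar = respondents(lambda y, e: 0.9 if e else 0.5)        # depends on observed education
mnar = respondents(lambda y, e: 0.5 if y > 35 else 0.9)  # depends on income itself

for name, sample in [("MCAR", mcar), ("MAR", mar), ("NMAR", mnar)]:
    print(f"{name}: crude bias {crude_mean(sample) - true_mean:+.2f}, "
          f"adjusted bias {adjusted_mean(sample) - true_mean:+.2f}")
```

In this setup the crude mean is close to the truth only under MCAR. Adjusting for education removes the bias under MAR but not under NMAR.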
Classical Missing Data Treatments
• Whatever you do, you are doing something
– Case Deletion
• Listwise (complete case analysis)
• Pairwise (available case analysis)
– Indicator variable (dummy variable)
– Single Imputation
• (Unconditional) Mean Imputation
• Conditional Mean Imputation (expected value)
– Weighting
Listwise Deletion and Multi-Item
• Excludes the whole case
• Default in most software
• Works if mechanism is MCAR
and if pattern and sample size allow (need
to have enough complete cases)
• Can be biased
Pairwise Deletion
• An option for using all available information, e.g. when
computing correlation/covariance matrices
• Different calculations may be based on different
populations
• Very unpredictable bias
Indicator method
• For each variable with missing values, create a
missing-value indicator to accompany the
variable in all analysis
• Assumes MCAR
• Even if the stratum with missing values is just a random sample of
all subjects, the stratum will yield a
confounded estimate of the exposure effect
(Unconditional) Mean Imputation
• Technique
– Calculate mean over cases that have values for Y
– Impute this mean where Y is missing
– Ditto for X1, X2, etc.
• Problems
– ignores relationships among X and Y
• underestimates covariances
Mean imputation
• Standard errors too low
• CI difficult to calculate
Scatterplots are from Joe Schafer’s website
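The shrinkage caused by unconditional mean imputation can be seen in simulated data. This is a minimal Python sketch. The data-generating model, the 40% MCAR missingness and the helper `cov` are assumptions made for the illustration.

```python
import random
from statistics import fmean, variance

random.seed(1)
n = 2000
# Hypothetical data: Y depends on X; 40% of Y is then set missing completely at random.
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]
missing = [random.random() < 0.4 for _ in range(n)]

observed_y = [yi for yi, m in zip(y, missing) if not m]
y_bar = fmean(observed_y)
# Unconditional mean imputation: every missing Y gets the observed mean.
y_imputed = [y_bar if m else yi for yi, m in zip(y, missing)]


def cov(a, b):
    """Sample covariance of two equally long lists."""
    ma, mb = fmean(a), fmean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


print("variance of observed Y:   ", round(variance(observed_y), 2))
print("variance after imputation:", round(variance(y_imputed), 2))
print("cov(X, Y) full data:      ", round(cov(x, y), 2))
print("cov(X, Y) after imputation:", round(cov(x, y_imputed), 2))
```

Because every imputed value sits exactly at the mean, both the variance of Y and its covariance with X shrink roughly in proportion to the share of observed cases.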
Conditional mean imputation
• Technique & implicit models
– If Y is missing
• impute mean of cases
with similar values for X1, X2
– Y = b0 + X1 b1 + X2 b2
– Likewise, if X2 is missing
• impute mean of cases
with similar values for X1, Y
– X2 = g0 + X1 g1 + Y g2
– If both Y and X2 are missing
• impute means of cases
with similar values for X1
– Y = d0 + X1 d1
– X2 = f0 + X1 f1
• Problem
– Ignores random components (no error term e)
→ Underestimates variances, se’s
Imputation of Expected Value
• Good for creating expected values
• Bad for multivariate analysis
– Decreases standard errors
– Creates overconfident outcomes
– Increases probability of Type I error
Problem with single imputation
• Treats imputed values like observed values
– when they are actually less certain
• Ignores imputation variation
• Underestimates se’s!
Imputation variation
• Sampling variation
– If you take a different sample
• you get different parameter estimates
– Standard errors reflect this
– One way to estimate sampling variation
• measure variation across multiple samples
• called “bootstrapping”
• Imputation variation
– If you impute different values
• you get different parameter estimates
– Standard errors should reflect this, too
– One way to estimate imputation variation
• measure variation across multiple imputed data sets
• called “multiple imputation”
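The bootstrap idea mentioned under sampling variation can be sketched in a few lines of Python. The sample of values and the function name `bootstrap_se` are hypothetical. The point is only that the spread of an estimate across resamples approximates its standard error.

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(7)
# Hypothetical sample of 200 measurements.
sample = [random.gauss(25, 4) for _ in range(200)]


def bootstrap_se(data, stat=fmean, reps=1000):
    """Standard deviation of `stat` across resamples drawn with replacement."""
    n = len(data)
    estimates = [stat(random.choices(data, k=n)) for _ in range(reps)]
    return stdev(estimates)


se_boot = bootstrap_se(sample)
se_formula = stdev(sample) / sqrt(len(sample))  # usual formula for the mean
print(f"bootstrap se = {se_boot:.3f}, formula se = {se_formula:.3f}")
```

Multiple imputation applies the same logic to the imputation step: variation across imputed data sets is measured and added to the standard error.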
Multiple Imputation
• Models both expected value and uncertainty.
• Using the missing-data model you specify, it
simulates and imputes missing values multiple
times, creating M complete datasets
– (M=5 is usually OK. It is a good idea to simulate more)
• Analyze each dataset independently
• Combine the results to get unbiased estimates that reflect
both uncertainty and expectation
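The combining step is usually done with Rubin's rules: the pooled estimate is the average over the M analyses, and the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. The Python sketch below applies these rules to hypothetical estimates and standard errors. The numbers and the function name `pool_rubin` are invented for illustration.

```python
from math import sqrt
from statistics import fmean, variance


def pool_rubin(estimates, std_errors):
    """Pool M estimates and their standard errors with Rubin's rules."""
    m = len(estimates)
    pooled = fmean(estimates)
    within = fmean(se ** 2 for se in std_errors)  # average sampling variance
    between = variance(estimates)                 # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical coefficient and standard error from M = 5 imputed data sets.
estimates = [1.0, 1.2, 0.9, 1.1, 0.8]
std_errors = [0.2, 0.2, 0.2, 0.2, 0.2]

pooled, pooled_se = pool_rubin(estimates, std_errors)
print(f"pooled estimate = {pooled:.2f}, pooled se = {pooled_se:.3f}")
```

The pooled standard error is larger than any single-data-set standard error because it also carries the imputation variation.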
Example
Multiple Imputation Simple Procedure
1. Impute using PROC MI
3. Do analysis: PROC REG, LOGISTIC, etc.
using by _imputation_; in the procedure
4. Combine results using PROC MIANALYZE
PROC MI
• Typical syntax:
proc mi data=bmx out=impdat seed=33155;
var bmxbmi bmxht bmxwt bmxarmc bmxarml;
run;
• data=  1 copy of data with missing values
• out=  5 copies of data with imputed values
(will be different across copies)
• seed=  random seed, you can keep same to
reconstruct your results
• var  Variables with missing values you
need imputed, in model, and those
that may be helpful with imputation
PROC MI Sample Output
PROC MI Options
• nimpute=5: number of imputations, default=5;
0 gives missing patterns
• minimum=0 0 0 0
maximum=1 1 1 90: set min & max, sometimes
doesn’t converge as well
• round=1 1 1 0.01: round off option
Output dataset
Regression
• Fit your model as if data had no missing values,
using by _imputation_;
• proc reg data=impdat outest=parmcov covout;
model bmxbmi=bmxht bmxwt bmxarmc bmxarml;
by _imputation_;
run;
• You’ll get nimpute (usually 5) sets of output
• Estimates, covariances, errors will be combined
in MIANALYZE
• Need to generate parameter estimates and
covariance data set (varies by procedure)
Parameter Est. & Covariance Matrix
• proc logistic data=impdat descending;
model bmxbmi=bmxht bmxwt bmxarmc bmxarml
/covb;
by _imputation_;
ods output ParameterEstimates=parmsdat
CovB=covbdat;
run;
• proc mixed data=impdat;
model bmxbmi=bmxht bmxwt bmxarmc bmxarml
/solution covb;
by _imputation_;
ods output covparms=parmcov;
run;
• proc genmod data=impdat;
model bmxbmi=bmxht bmxwt bmxarmc bmxarml
/covb;
by _imputation_;
ods output ParameterEstimates=parmsdat
CovB=covbdat;
run;
PROC MIANALYZE
• Syntax depends on what procedure you used in previous step:
• proc mianalyze data=parmcov;
(or)
proc mianalyze parms=parmsdat covb=covbdat;
(or)
proc mianalyze parms=parmsdat xpxi=xpxidat;
(then type this:)
modeleffects intercept bmxht bmxwt bmxarmc bmxarml;
run;
• Note the “var” statement is now “modeleffects”
• Note that the dependent variable is omitted
PROC MIANALYZE Output
STATA
*preparing dataset for multiple imputation
mi query
mi set mlong
mi describe, detail
mi register imputed total
set seed 29390
mi impute mvn total = i.smoking i.isced4 i.samliv3 i.s57a_ i.alder4 ///
i.gender, add(20) force
mi describe, detail
*rounding the imputed binary values to the nearest integer
*replace bingedrinking = 0 if bingedrinking <0.5
*replace bingedrinking = 1 if bingedrinking >0.5
*replace change_new = round(change_new)
*examination of imputations: comparing main descriptive statistics
*from some imputations to those from the observed data
mi xeq 0 1 20: summarize total
mi estimate: xtmixed total i.gender group##month || username:, mle
mi estimate: mean total, over(sex group month)
Weighted regression
• Suppose that a national survey sampled 2000
subjects with 1000 men and 1000 women
• The numbers of respondents were 500 men and 750
women
• If there are large differences between men and
women, a simple average of the 1250 responses
will be a distorted representation of the
population mean
• By down-weighting women and up-weighting
men we could obtain a more accurate picture of the
population
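The weighting in this example can be worked through numerically. In the Python sketch below, the sample and response counts are those given above. The group means (10 for men, 6 for women) are hypothetical and chosen only to make the distortion visible.

```python
# Sampled and responding subjects, as in the example above.
sampled = {"men": 1000, "women": 1000}
responded = {"men": 500, "women": 750}
# Hypothetical mean outcome among respondents in each group.
group_mean = {"men": 10.0, "women": 6.0}

# Each respondent stands in for sampled / responded subjects of the same sex.
weight = {g: sampled[g] / responded[g] for g in sampled}

unweighted = (sum(responded[g] * group_mean[g] for g in sampled)
              / sum(responded.values()))
weighted = (sum(weight[g] * responded[g] * group_mean[g] for g in sampled)
            / sum(weight[g] * responded[g] for g in sampled))

print(f"weights: men {weight['men']:.2f}, women {weight['women']:.2f}")
print(f"unweighted mean {unweighted:.1f}, weighted mean {weighted:.1f}")
```

Men get weight 2 and women about 1.33, so the weighted mean gives both sexes the equal share they have in the sample of 2000. The unweighted mean is pulled towards the women's value.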
Values not missing at random (NMAR)
• Probability that values are missing
depends on the missing values themselves
• e.g., the probability that weight Y is missing
– is higher for the overweight (depends on Y)
– is higher for women (depends on X1)
• and sometimes X1 is missing, too.
• Methods available – not today!