Epidemiological analysis
PhD course in epidemiology
Lau Caspar Thygesen, Associate professor, PhD
25th February 2014

Age standardization
• Incidence and prevalence are strongly age-dependent
  – Risks rise (e.g. chronic diseases) or decline (e.g. measles) with age
• Comparisons between populations and over time may be very misleading
• A single age-independent index representing a set of age-specific rates may be more appropriate

Mortality in Denmark and Greenland, men, 1975
[Table: mortality in Denmark and Greenland, men, 1975; table contents not recovered]
Please interpret this table.

Direct standardization
IR(DK, standardized to Greenlandic age distribution)
  = 0.016*12.2 + 0.076*0.7 + 0.268*0.160 + 0.506*1.4 + 0.110*11.2 + 0.024*66.5
  = 3.8

Indirect standardization

Example
• Trend study of lung cancer incidence among women
• Denmark
• 1943-2010
[Figure: Lung Cancer Denmark Women; series rateCrude; x-axis years 1943-2009; y-axis 0-9]
[Figure: Lung Cancer Denmark Women; series rateCrude, segi, scand; x-axis years 1943-2009; y-axis 0-9]

Example 2
• Incidence of multiple sclerosis
• Denmark
• 1950-2004
• European Standard Population

Example indirect standardization
• 19,185 subjects (3,817 women) who attended outpatient clinics for alcohol abusers
• Copenhagen
• 1952-1992
• Compare incidence of heart disease with the incidence rate in the greater Copenhagen area

Problems
• Direct standardisation can produce unreliable estimates when the calculations are based on small numbers
• Indirectly standardised rates from different populations cannot be compared directly with each other, only with the standard

Compared to regression methods
• Regression-based methods are available but are rarely applied in practice
• When individual data are available (presence / absence of disease, age and sex), a logistic regression
  can be used to estimate the standardized rate
• The main advantage is that it allows adjustment by continuous variables in addition to categorical variables

Missing data
• What does missing mean
• The pattern of missingness (nomenclature)
  – How and why is it missing?
• Methods for handling

Missing values
• Common in research
  – Nonresponse
  – Loss to follow-up
  – Lack of overlap between linked data sets (not so common)

What is item nonresponse?
• Unit Nonresponse vs. Item Nonresponse
[Tables: two example response matrices with columns ID, Q1, Q2, Q3 for respondents 456-459, with "?" marking missing answers; the cell layout is not recoverable]

Unit Nonresponse: Examples
• Person who is not at home
• Person who does not pick up the phone
• Person who hangs up on you
• Rat that dies before the study
• The country you could not get data on
• etc.

Item Nonresponse
• "I Don't Know"
• Refusals to respond
• Questions left blank
• Failed measurement
• etc.

Best way to deal with Missing Data is not to have any

Minimizing Unit Nonresponse
• Call back if not home
• Refusal conversion
• Don't mess up
• Clear and understandable questionnaire
• Polite request
• Incentives

What kind of missing data should be modeled?
• If an item is missing from your dataset but you suspect that it has a true value
• "I don't know" might simply mean I don't know
  – Don't model it as if there was a true value
• Dead people (attrition)

Minimizing Item Nonresponse
• Well written questions
• Minimize misunderstandings
  – cross-cultural example
  – Standardized vs.
    non-standardized
• Minimize skip patterns

The pattern of missingness (nomenclature)
• Ignorable
  – MCAR - Missing Completely at Random
  – MAR - Missing at Random
• Non-ignorable
  – NMAR - Not Missing at Random

Missing completely at random
Missing Completely at Random: if the data are missing completely at random then missing values cannot be predicted any better
• Cause of missingness is a completely random process (like a coin flip)
• Cause uncorrelated with variables of interest
  – Example: parents move
• No bias if cause omitted
• In the unlikely event that the process is missing completely at random, inferences based on complete cases are unbiased, but inefficient because we have lost some cases

Missing not at random
Non-Ignorable / NMAR: the probability that a cell is missing depends on the unobserved value of the missing value itself
• For example, in responses to income questions, high-income people are more likely to refuse to answer survey questions about income, and other variables in the data set cannot predict which respondents have high income
• If your missing data are non-ignorable, then inferences based on complete cases will be biased and inefficient

Missing at random
• Missingness may be related to measured variables
• But no residual relationship with unmeasured variables
• No bias if you control for measured variables
• For example, if highly educated people are more likely to participate in a survey, then the process is missing at random as long as we know the educational level of all persons
• If data are missing at random, then inferences based on complete cases will generally be biased and inefficient unless the analysis controls for the measured variables that predict missingness

Classical Missing Data Treatments
• Whatever you do, you are doing something
  – Case Deletion
    • Listwise (complete case analysis)
    • Pairwise (available case analysis)
  – Indicator variable (dummy variable)
  – Single Imputation
    • (Unconditional) Mean Imputation
    • Conditional Mean Imputation (expected value)
  – Weighting

Listwise Deletion and Multi-Item
• Excludes the
  whole case
• Default in most software
• Works if mechanism is MCAR and if pattern and sample size allow (need to have enough complete cases)
• Can be biased

Pairwise Deletion
• An option for using all available information in correlation/covariance matrices
• Different calculations may be based on different populations
• Very unpredictable bias

Indicator method
• For each variable with missing values, create a missing-value indicator to accompany the variable in all analyses
• Assumes MCAR
• Even if the stratum of subjects with missing values is just a random sample of all subjects, the stratum will yield a confounded estimate of the exposure effect

(Unconditional) Mean Imputation
• Technique
  – Calculate mean over cases that have values for Y
  – Impute this mean where Y is missing
  – Ditto for X1, X2, etc.
• Problems
  – ignores relationships among X and Y
    • underestimates covariances

Mean imputation
• Standard errors too low
• CI difficult to calculate
(Scatterplots are from Joe Schafer's website)

Conditional mean imputation
• Technique & implicit models
  – If Y is missing
    • impute mean of cases with similar values for X1, X2
    • Y = b0 + X1 b1 + X2 b2
  – Likewise, if X2 is missing
    • impute mean of cases with similar values for X1, Y
    • X2 = g0 + X1 g1 + Y g2
  – If both Y and X2 are missing
    • impute means of cases with similar values for X1
    • Y = d0 + X1 d1
    • X2 = f0 + X1 f1
• Problem
  – Ignores random components (no e), so it underestimates variances and se's

Imputation of Expected Value
• Good for creating expected values
• Bad for multivariate analysis
  – Decreases standard errors
  – Creates overconfident outcomes
  – Increases probability of Type I error

Problem with single imputation
• Underestimates se's!
• Treats imputed values like observed values
  – when they are actually less certain
• Ignores imputation variation

Imputation variation
• Sampling variation
  – If you take a different sample
    • you get different parameter estimates
  – Standard errors reflect this
  – One way to estimate sampling variation
    • measure variation across multiple samples
    • called "bootstrapping"
• Imputation variation
  – If you impute different values
    • you get different parameter estimates
  – Standard errors should reflect this, too
  – One way to estimate imputation variation
    • measure variation across multiple imputed data sets
    • called "multiple imputation"

Multiple Imputation
• Models both expected value and uncertainty
• Using the missing data model you specify, it simulates and imputes missing values "multiple" times, creating M complete datasets
  – (M=5 is usually OK. It is a good idea to simulate more)
• Analyze each dataset independently
• Combine results to get unbiased estimates

Example PROC MI
• Typical syntax:
    proc mi data=bmx out=impdat seed=33155;
      var bmxbmi bmxht bmxwt bmxarmc bmxarml;
    run;
• data=  1 copy of data with missing values
• out=   5 copies of data with imputed values (will be different across copies)
• seed=  random seed; keep it the same to reproduce your results
• var    variables with missing values you need imputed, variables in the model, and those that may be helpful with imputation

Multiple Imputation: Simple Procedure
1. Impute using PROC MI
3. Do analysis: PROC REG, LOGISTIC, etc., using by _imputation_; in the procedure
4.
Combine results using PROC MIANALYZE

PROC MI Sample Output

PROC MI Options
• nimpute=5: number of imputations, default=5; 0 gives missing patterns
• minimum=0 0 0 0 maximum=1 1 1 90: set min & max; sometimes doesn't converge as well
• round=1 1 1 0.01: round-off option

Output dataset

Regression
• Fit your model as if data had no missing values, using by _imputation_;
    proc reg data=impdat outest=parmcov covout;
      model bmxbmi=bmxht bmxwt bmxarmc bmxarml;
      by _imputation_;
    run;
• You'll get nimpute (usually 5) sets of output
• Estimates, covariances, errors will be combined in MIANALYZE
• Need to generate parameter estimates and covariance data set (varies by procedure)

Parameter Est. & Covariance Matrix
    proc logistic data=impdat descending;
      model bmxbmi=bmxht bmxwt bmxarmc bmxarml /covb;
      by _imputation_;
      ods output ParameterEstimates=parmsdat CovB=covbdat;
    run;

    proc mixed data=impdat;
      model bmxbmi=bmxht bmxwt bmxarmc bmxarml /solution covb;
      by _imputation_;
      ods output covparms=parmcov;
    run;

    proc genmod data=impdat;
      model bmxbmi=bmxht bmxwt bmxarmc bmxarml /covb;
      by _imputation_;
      ods output ParameterEstimates=parmsdat CovB=covbdat;
    run;

PROC MIANALYZE
• Syntax depends on what procedure you used in previous step:
    proc mianalyze data=parmcov;
  (or)
    proc mianalyze parms=parmsdat covb=covbdat;
  (or)
    proc mianalyze parms=parmsdat xpxi=xpxidat;
  (then type this:)
      modeleffects intercept bmxht bmxwt bmxarmc bmxarml;
    run;
• Note the "var" statement is now "modeleffects"
• Note that the dependent variable is omitted

PROC MIANALYZE Output

STATA
    *preparing dataset for multiple imputation
    mi query
    mi set mlong
    mi describe, detail
    mi register imputed total
    set seed 29390
    mi impute mvn total = i.smoking i.isced4 i.samliv3 i.s57a_ i.alder4 i.gender, add(20) force
    mi describe, detail

    *rounding the imputed binary values to the nearest integer
    *replace bingedrinking = 0 if bingedrinking <0.5
    *replace bingedrinking = 1 if bingedrinking >0.5
    *replace change_new = round(change_new)

    *examination of imputations: comparing main descriptive statistics from some imputations to those from the observed data
    mi xeq 0 1 20: summarize total

    mi estimate: xtmixed total i.gender group##month || username:, mle
    mi estimate: mean total, over(sex group month)

Weighted regression
• Suppose that a national survey sampled 2000 subjects: 1000 men and 1000 women
• 500 men and 750 women responded
• If there are large differences between men and women, a simple average of the 1250 responses will be a distorted representation of the population mean
• By down-weighting women and up-weighting men we could obtain a more accurate picture of the population

Values not missing at random (NMAR)
• Probability that values are missing depends on the missing values themselves
• e.g., the probability that weight Y is missing
  – is higher for the overweight (depends on Y)
  – is higher for women (depends on X1)
• and sometimes X1 is missing, too.
• Methods available – not today!
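
The weighting idea from the survey example (1000 men and 1000 women sampled, 500 men and 750 women responding) can be sketched in a few lines of Python. This is only an illustration, not the course's own method: the respondent means in `resp_mean` are made-up values, the variable and function names are hypothetical, and the sketch assumes that nonresponse is random within each sex.

```python
# Sketch of nonresponse weighting for the survey example.
# Counts come from the slide; the stratum means are made-up illustration values.

sampled = {"men": 1000, "women": 1000}    # subjects sampled per stratum
responded = {"men": 500, "women": 750}    # respondents per stratum

# Hypothetical mean outcome among respondents in each stratum
resp_mean = {"men": 10.0, "women": 4.0}

# Weight per respondent = number sampled / number responding in the stratum
weights = {g: sampled[g] / responded[g] for g in sampled}


def unweighted_mean():
    """Simple average over all respondents, ignoring differential response."""
    total = sum(responded[g] * resp_mean[g] for g in sampled)
    return total / sum(responded.values())


def weighted_mean():
    """Average in which each respondent counts 'weight' times."""
    num = sum(weights[g] * responded[g] * resp_mean[g] for g in sampled)
    den = sum(weights[g] * responded[g] for g in sampled)
    return num / den


unweighted = unweighted_mean()
weighted = weighted_mean()
print(round(unweighted, 2), round(weighted, 2))
```

With these assumed means the unweighted average is 6.4, pulled toward the over-represented women, while the weighted average is 7.0, the value expected when men and women each make up half of the sample. Men receive weight 2 and women weight 4/3, which is the up-weighting and down-weighting described above in relative terms.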