CTSI BERD Research Methods Seminar Series
Statistical Analysis II
Mosuk Chow, Ph.D.
Senior Scientist and Professor
Statistics Department, University Park
November 8, 2016

Basic statistical concepts (from Stat I)
- Descriptive statistics (numeric/graphical)
- Population distribution vs. sampling distribution
- Standard deviation vs. standard error
- Estimation of population mean
- Confidence interval
- Hypothesis testing
- P-value

Outline for Stat II
- Estimate population proportion
- Paired design: 1-sample t
- Non-paired design: 2-sample t
- Pooled variance versus non-pooled variance

Estimation of population proportion (p)
Examples:
- Proportion of patients who became infected
- Proportion of patients who are cured
- Proportion of individuals positive on a blood test
- Proportion of adverse drug reactions
- Proportion of premature infants who survive

Sampling Distribution of Sample Proportion
- The sampling distribution of the sample proportion can be approximated by a normal distribution when the sample size is sufficiently large (central limit theorem).
- The standard error of a sample proportion $\hat{p}$ is estimated by

$$\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}$$

95% Confidence Interval for a Proportion

$$\hat{p} \pm 2\,\mathrm{SE}(\hat{p})$$

The rule of thumb for a good normal approximation is $n\hat{p} \ge 5$ and $n(1-\hat{p}) \ge 5$.

Example
- In a study of 200 patients, 90 patients experienced adverse drug reactions.
- The estimated proportion who experience an adverse drug reaction is $\hat{p} = 90/200 = 0.45$.
- The 95% confidence interval for the population proportion is

$$0.45 \pm 2\sqrt{\frac{0.45 \times 0.55}{200}} = (0.38,\ 0.52)$$

Paired design
- Self-pairing: measurements are taken at two distinct points in time from a single subject (e.g., before vs. after).
- Matched pairs (e.g., twins, eyes, subjects matched on important characteristics such as age and gender).

Why pairing?
- Control extraneous noise
- Control confounding factors that affect the comparison
- Make the comparison more precise

Example: Blood Pressure and Oral Contraceptive Use (n=10 women)

Paired samples (1st sample = before OC, 2nd sample = after OC)

Participant                  BP Before OC   BP After OC   After-Before
1                            126            132            6
2                            105            109            4
3                            104            102           -2
4                            115            117            2
…
Sample Mean:                 115.6          120.4
Sample Standard Deviation:   Sb=11.3        Sa=13.1

Example (cont.)
Scientific questions:
- What is the mean change in blood pressure after oral contraceptive (OC) use in a population of women who use OC? Estimate the mean change by a confidence interval approach.
- Is there any change in mean blood pressure after oral contraceptive use in a population of women who use OC? Hypothesis testing.

Inference on mean change
- Due to the design of the study, we can reduce the BP information on two samples (women's BP prior to OC use and the same subject's BP after OC use) into one piece of information: the differences in BP between the time points for the same subject.
- Perform the one-sample inference on the differences for the relevant research question.

Inference on mean change (cont.)
- The sample average of the differences: $\bar{x}_{diff}$
- Sample standard deviation of the differences: $S_d$
- 95% confidence interval for mean change in BP:

$$\bar{x}_{diff} \pm t_{n-1,\,0.975}\,\frac{S_d}{\sqrt{n}}$$

where $n$ is the number of pairs and $t_{n-1,\,0.975}$ is the critical value from the t distribution with df = $n-1$.
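The paired interval formula can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `paired_ci` is hypothetical, and the critical value is passed in by hand (2.26 for df = 9, as used in this seminar) rather than computed. The inputs are the summary values reported for the blood pressure example (mean difference 4.8, $S_d$ = 4.6, n = 10).

```python
import math

def paired_ci(mean_diff, sd_diff, n_pairs, t_crit):
    """Confidence interval for a mean paired difference.

    t_crit is the t critical value with n_pairs - 1 degrees of freedom,
    looked up from a table (an input here, not computed).
    """
    se = sd_diff / math.sqrt(n_pairs)   # standard error of the mean difference
    margin = t_crit * se
    return mean_diff - margin, mean_diff + margin

# Summary values from the blood pressure example (n = 10 pairs, df = 9)
low, high = paired_ci(mean_diff=4.8, sd_diff=4.6, n_pairs=10, t_crit=2.26)
print(f"95% CI for mean change in BP: ({low:.2f}, {high:.2f})")
```

Without intermediate rounding this gives roughly (1.51, 8.09), which agrees with the seminar's 1.52 to 8.08 up to rounding of the standard error to 1.45.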
The sample average of the differences is 4.8, which can also be obtained by $\bar{x}_{diff} = \bar{x}_{after} - \bar{x}_{before}$ (4.8 = 120.4 - 115.6).

The sample standard deviation of the differences is

$$s_d = \sqrt{\frac{\sum_{i=1}^{n} (x_{diff,i} - \bar{x}_{diff})^2}{n-1}} = 4.6$$

Example: Blood Pressure and Oral Contraceptive Use (n=10 women)

Participant     BP Before OC          BP After OC          After-Before
1               126                   132                   6
2               105                   109                   4
3               104                   102                  -2
4               115                   117                   2
…
Sample Mean:    115.6 (x̄before)       120.4 (x̄after)       4.8 (x̄diff)
SD:             Sb=11.3               Sa=13.1              Sd=4.6

95% CI of mean change in BP

$$4.8 \pm t_{9,\,0.975}\,\frac{S_d}{\sqrt{n}} = 4.8 \pm 2.26 \times \frac{4.6}{\sqrt{10}} = 4.8 \pm 2.26 \times 1.45$$

That is, 1.52 to 8.08.

Notes
- The number 0 is NOT in the confidence interval (1.52, 8.08).
- Because 0 is not in the interval, this suggests there is a non-zero change in BP over time.
- The BP change could be due to factors other than oral contraceptives.
- A control group of comparable women who were taking a placebo instead of oral contraceptives would strengthen this study.

Comparison of Two Independent Samples
A Low Carbohydrate as Compared with a Low Fat Diet in Severe Obesity [1]
- 132 severely obese participants were randomized to one of two diet groups.
- Participants were followed for a six-month period.
- At the end of the study period, participants on the low carbohydrate diet had lost more weight than those on the low fat diet.

[1] Samaha, F., et al. A Low-Carbohydrate as Compared with a Low-Fat Diet in Severe Obesity, New England Journal of Medicine 348;21

Comparison of Two Independent Samples

Diet Group                                    Low Fat   Low Carb
Number of Subjects                            68        64
Mean Weight Change (kg),
  post-diet less pre-diet                     -1.8      -5.7
Standard Deviation of Weight Change (kg)      3.9       8.6

Is weight loss associated with diet type?

Comparison of Two Independent Samples
- In statistical terms, is there a difference in the average weight loss for the participants on the low fat diet as compared to participants on the low carbohydrate diet?
- Although there are paired pre/post measurements on each participant, the comparison of interest is not paired.
- For each participant we compute a change in weight (after-diet weight minus before-diet weight).
- However, we are comparing the changes in weight between two independent diet groups.

Comparison of Two Independent Samples
- We have two samples, $\{x_{11}, x_{12}, x_{13}, \ldots, x_{1n_1}\}$ and $\{x_{21}, x_{22}, x_{23}, \ldots, x_{2n_2}\}$, drawn from populations with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, respectively.
- The two samples are independent; there is no pairing of observations.
- We would like to estimate the difference of the population means, $\mu_2 - \mu_1$.
- Using the confidence interval, we can decide whether the two means are different.

Comparison of Two Independent Samples
- We know our best estimate for the mean (of a single population) is the sample mean, $\bar{x}$.
- It would seem sensible to estimate $\mu_1$ with $\bar{x}_1$, $\mu_2$ with $\bar{x}_2$, and $\mu_2 - \mu_1$ with $\bar{x}_2 - \bar{x}_1$.

Sampling Distribution of the Difference in Sample Means
- Since we have fairly large samples (both greater than 30), we know the sampling distributions of the sample means in both groups are approximately normal.
- The difference of two independent quantities that are each (approximately) normally distributed is also (approximately) normally distributed.

Sampling Distribution of the Difference in Sample Means
- So, the good news is that the sampling distribution of the difference of two sample means, each based on large samples, approximates a normal distribution.
- This sampling distribution is centered at the true mean difference, $\mu_2 - \mu_1$.

Confidence Interval for ($\mu_2 - \mu_1$)
We can construct a confidence interval for $\mu_2 - \mu_1$ using the (pivotal) quantity

$$T = \frac{(\bar{X}_2 - \bar{X}_1) - (\mu_2 - \mu_1)}{\text{Standard Error}(\bar{X}_2 - \bar{X}_1)}$$

Two Independent (Unpaired) Samples
- The standard error of the difference for two independent samples is calculated differently than it was for paired designs.
- The formula for the standard error of the difference depends on the sample sizes in both groups and the standard deviations in both groups.
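Comparison of Two Independent Samples
The formula is

$$\sigma_{\bar{x}_2 - \bar{x}_1} = \sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}$$

If we follow the same reasoning we did for the one-sample case, we could substitute $s_1$ and $s_2$ for $\sigma_1$ and $\sigma_2$, respectively, to give an estimate of the standard error:

$$s_{\bar{x}_2 - \bar{x}_1} = \sqrt{s_1^2 / n_1 + s_2^2 / n_2}$$

Comparison of Two Independent Samples
The distribution of

$$T_s = \frac{(\bar{X}_2 - \bar{X}_1) - (\mu_2 - \mu_1)}{\sqrt{S_1^2 / n_1 + S_2^2 / n_2}}$$

can be approximated by the t distribution, where the degrees of freedom are calculated as

$$d = \frac{(s_1^2 / n_1 + s_2^2 / n_2)^2}{(s_1^2 / n_1)^2 / (n_1 - 1) + (s_2^2 / n_2)^2 / (n_2 - 1)}$$

You may see this referred to as Welch's or Satterthwaite's approximation.

Confidence Interval for ($\mu_2 - \mu_1$)
We can construct a confidence interval for $\mu_2 - \mu_1$ using the (pivotal) quantity

$$\frac{(\bar{X}_2 - \bar{X}_1) - (\mu_2 - \mu_1)}{\sqrt{S_1^2 / n_1 + S_2^2 / n_2}}$$

An approximate $(1-\alpha)100\%$ confidence interval is given by

$$\bar{X}_2 - \bar{X}_1 \pm t_{d,\,1-\alpha/2}\sqrt{S_1^2 / n_1 + S_2^2 / n_2}$$

Comparison of Two Independent Samples with equal variance ($\sigma_1^2 = \sigma_2^2 = \sigma^2$)
If $\sigma_1^2$ and $\sigma_2^2$ are unknown, but equal to a common value $\sigma^2$, we could "pool" our samples to obtain an estimate of $\sigma^2$ and use it to estimate the standard error of the difference in sample means,

$$\sigma_{\bar{x}_2 - \bar{x}_1} = \sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}$$

The previous estimate we were working with,

$$s_{\bar{x}_2 - \bar{x}_1} = \sqrt{s_1^2 / n_1 + s_2^2 / n_2}$$

is an unpooled estimate because we obtained estimates of $\sigma_1^2$ and $\sigma_2^2$ separately.

Comparison of Two Independent Samples (cont.)
A pooled estimate of $\sigma^2$ is

$$s_p^2 = \frac{\sum_{i=1}^{n_1} (x_{1i} - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_{2j} - \bar{x}_2)^2}{(n_1 - 1) + (n_2 - 1)} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

When $\sigma_1^2 = \sigma_2^2 = \sigma^2$, we have

$$\sigma_{\bar{x}_2 - \bar{x}_1} = \sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2} = \sqrt{\sigma^2 / n_1 + \sigma^2 / n_2}$$

Comparison of Two Independent Samples (cont.)
If we substitute the pooled estimator of $\sigma^2$ into

$$\frac{(\bar{X}_2 - \bar{X}_1) - (\mu_2 - \mu_1)}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}} = \frac{(\bar{X}_2 - \bar{X}_1) - (\mu_2 - \mu_1)}{\sqrt{\sigma^2 / n_1 + \sigma^2 / n_2}}$$

we have

$$T_P = \frac{(\bar{X}_2 - \bar{X}_1) - (\mu_2 - \mu_1)}{\sqrt{S_P^2 / n_1 + S_P^2 / n_2}} = \frac{(\bar{X}_2 - \bar{X}_1) - (\mu_2 - \mu_1)}{\sqrt{S_P^2 (1/n_1 + 1/n_2)}}$$

Comparison of Two Independent Samples (cont.)
$T_P$ follows a t distribution with $n_1 + n_2 - 2$ degrees of freedom.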
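As a rough numerical illustration of the unpooled and pooled formulas above, the following Python sketch evaluates both standard errors and the Welch-Satterthwaite degrees of freedom from summary statistics. The function names are hypothetical (they are not from the seminar), and the inputs are the diet study's summary values from the earlier table, taking low fat as group 1 and low carb as group 2.

```python
import math

def unpooled_se(s1, n1, s2, n2):
    """Standard error of the difference in means, variances estimated separately."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximate degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

def pooled_se(s1, n1, s2, n2):
    """Standard error of the difference in means using the pooled variance."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Diet study summary statistics (assumed labeling: 1 = low fat, 2 = low carb)
s1, n1 = 3.9, 68
s2, n2 = 8.6, 64

se_unpooled = unpooled_se(s1, n1, s2, n2)
df_welch = welch_df(s1, n1, s2, n2)
se_pooled = pooled_se(s1, n1, s2, n2)

print(f"unpooled SE = {se_unpooled:.3f}, Welch df = {df_welch:.1f}")
print(f"pooled SE   = {se_pooled:.3f}, pooled df = {n1 + n2 - 2}")
```

With these inputs the pooled standard error is about 1.15 kg and the unpooled one about 1.17 kg, while the Welch degrees of freedom come out near 87, well below the pooled value of 130.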
A $(1-\alpha)100\%$ confidence interval is given by

$$(\bar{X}_2 - \bar{X}_1) \pm t_{n_1+n_2-2,\,1-\alpha/2}\sqrt{S_P^2 (1/n_1 + 1/n_2)}$$

Choosing when to Pool
- One rule of thumb is to use the pooled variance as long as the ratio of the sample standard deviations (larger s / smaller s) is less than 2, but this cutoff is somewhat arbitrary.
- Usually the results are not that different.
- If you are unsure of which one to use, go with the separate variances, as that is more conservative.

Diet and Weight Loss Example
A 95% confidence interval is

$$(\bar{X}_2 - \bar{X}_1) \pm t_{n_1+n_2-2,\,1-\alpha/2}\sqrt{S_P^2 (1/n_1 + 1/n_2)}$$

$$= (-5.7 - (-1.8)) \pm t_{68+64-2,\,0.975} \times 1.15 = -3.9 \pm 1.98 \times 1.15 = -3.9 \pm 2.277 = (-6.2,\ -1.6)\ \text{kg}$$

Back to Blood Pressure and Oral Contraceptive Use (n=10 women)

Participant     BP Before OC          BP After OC          After-Before
1               126                   132                   6
2               105                   109                   4
3               104                   102                  -2
4               115                   117                   2
…
Sample Mean:    115.6 (x̄before)       120.4 (x̄after)       4.8 (x̄diff)
SD:             Sb=11.3               Sa=13.1              Sd=4.6

If we do not realize that we should use the paired t, and instead use the two-sample t procedure to obtain the CI, will the interval be wider or narrower?

Answer: compare the two intervals.

Paired t confidence interval:

$$(\bar{X}_2 - \bar{X}_1) \pm t_{n-1,\,1-\alpha/2}\sqrt{S_d^2 / n}$$

2-sample t confidence interval:

$$(\bar{X}_2 - \bar{X}_1) \pm t_{n_1+n_2-2,\,1-\alpha/2}\sqrt{S_P^2 (1/n_1 + 1/n_2)}$$

With these data the two-sample interval is wider, because $S_d = 4.6$ is much smaller than $S_b = 11.3$ and $S_a = 13.1$, which determine $S_P$.

- It is very important to know the design and use the appropriate statistical technique to analyze the data.
- If we have a control group for the OC example, then we will use the two-sample t to compare the mean change in blood pressure in the two groups.

THE END

To learn more statistics or to arrange a consultation, contact: http://ctsi.psu.edu/ctsiprograms/biostatisticsepidemiologyresearch-design/