Statistics for Medical Researchers

Hongshik Ahn
Professor, Department of Applied Math and Statistics, Stony Brook University
Biostatistician, Stony Brook GCRC

Contents
1. Experimental Design
2. Descriptive Statistics and Distributions
3. Comparison of Means
4. Comparison of Proportions
5. Power Analysis/Sample Size Calculation
6. Correlation and Regression

1. Experimental Design

Experiment
- Treatment: something that researchers administer to experimental units
- Factor: controlled independent variable whose levels are set by the experimenter

Experimental design
- Control
- Treatment
- Placebo effect
- Blind: single blind, double blind, triple blind

Randomization
- Completely randomized design
- Randomized block design: if there are specific differences among groups of subjects
- Permuted block randomization: used for small studies to maintain reasonably good balance among groups
- Stratified block randomization: matching

Completely randomized design
The computer-generated sequence: 4,8,3,2,7,2,6,6,3,4,2,1,6,2,0,...
- Two groups (criterion: even ~ A, odd ~ B): AABABAAABAABAAA...
- Three groups (criterion: {1,2,3} ~ A, {4,5,6} ~ B, {7,8,9} ~ C; ignore 0’s): BCAACABBABAABA...
- Two groups with a different randomization ratio, e.g., 2:3 (criterion: {0,1,2,3} ~ A, {4,5,6,7,8,9} ~ B): BBAABABBABAABAA...

Permuted block randomization
With a block size of 4 for two groups (A, B), there are 6 possible permutations, and they can be coded as:
1=AABB, 2=ABAB, 3=ABBA, 4=BAAB, 5=BABA, 6=BBAA
Each number in the random number sequence in turn selects the next block, determining the next four participant allocations (ignoring the numbers 0, 7, 8 and 9).
e.g., the sequence 67126814... will produce BBAA AABB ABAB BBAA AABB BAAB.
In practice, a block size of four is too small, since researchers may crack the code and risk selection bias. Mixing block sizes of 4 and 6 is better, with the size kept unknown to the investigator.
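The allocation rules on the two randomization slides can be traced in a few lines of code. The following is a minimal sketch, not part of the original slides: the function names `allocate_by_rule` and `permuted_blocks` and the dictionary names are illustrative assumptions, and the digit sequences are the ones quoted on the slides rather than freshly generated random numbers.

```python
# Block codes from the permuted block randomization slide (block size 4, groups A and B).
BLOCKS = {1: "AABB", 2: "ABAB", 3: "ABBA", 4: "BAAB", 5: "BABA", 6: "BBAA"}


def allocate_by_rule(digits, rule):
    """Completely randomized design: map each digit to a group.

    Digits that have no entry in `rule` are ignored (e.g., 0 in the three-group rule).
    """
    return "".join(rule[d] for d in digits if d in rule)


def permuted_blocks(digits, blocks=BLOCKS):
    """Permuted block randomization: each usable digit selects the next block.

    Digits without a block code (0, 7, 8, 9) are skipped.
    """
    return [blocks[d] for d in digits if d in blocks]


# Digit sequence quoted on the completely randomized design slide.
sequence = [4, 8, 3, 2, 7, 2, 6, 6, 3, 4, 2, 1, 6, 2, 0]

even_odd = {d: ("A" if d % 2 == 0 else "B") for d in range(10)}
three_groups = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B", 7: "C", 8: "C", 9: "C"}
two_to_three = {d: ("A" if d <= 3 else "B") for d in range(10)}

two_group_alloc = allocate_by_rule(sequence, even_odd)
three_group_alloc = allocate_by_rule(sequence, three_groups)
ratio_alloc = allocate_by_rule(sequence, two_to_three)

# Digit sequence quoted on the permuted block slide: 67126814
block_alloc = permuted_blocks([6, 7, 1, 2, 6, 8, 1, 4])

print(two_group_alloc)
print(three_group_alloc)
print(ratio_alloc)
print(" ".join(block_alloc))
```

In a real trial the digits would come from a random number generator and the block size would be concealed from the investigator, as the slide recommends.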
Methods of Sampling
- Random sampling
- Systematic sampling
- Convenience sampling
- Stratified sampling

Random sampling: selection so that each individual member has an equal chance of being selected
Systematic sampling: select some starting point and then select every k-th element in the population
Convenience sampling: use results that are easy to get
Stratified sampling: draw a sample from each stratum

2. Descriptive Statistics & Distributions

- Parameter: population quantity
- Statistic: summary of the sample
- Inference for parameters: use sample
Central tendency
- Mean (average)
- Median (middle value)
Variability
- Variance: measure of variation
- Standard deviation (sd): square root of variance
- Standard error (se): sd of the estimate
- Median, quartiles, min., max., range, boxplot
Proportion

Normal distribution
Standard normal distribution: mean 0, variance 1

Z-test for means
T-test for means if sd is unknown

3. Inference for Means

Two-sample t-test
- Two independent groups: control and treatment
- Continuous variables
- Assumption: populations are normally distributed
- Checking normality
  - Histogram
  - Normal probability plot (Q-Q plot): straight?
  - Shapiro-Wilk test, Kolmogorov-Smirnov test, Anderson-Darling test
- If the normality assumption is violated
  - T-test is not appropriate.
  - Possible transformation
  - Use a non-parametric alternative: Mann-Whitney U-test (Wilcoxon rank-sum test)

Example: a clinical trial on the effectiveness of drug A in preventing premature birth
- 30 pregnant women are randomly assigned to control and treatment groups of size 15 each
- Primary endpoint: weight of the babies at birth

            n     mean    sd
Treatment   15    7.08    0.90
Control     15    6.26    0.96
Hypothesis: the group means are different
- Null hypothesis (H0): $\mu_1 = \mu_2$
- Alternative hypothesis (H1): $\mu_1 \neq \mu_2$
- Significance level: $\alpha = 0.05$
- Assumption: equal variance
- Degrees of freedom (df): $n_1 + n_2 - 2$
- Calculate the T-value (test statistic):
$$T = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{(1/n_1) + (1/n_2)}}$$
- P-value: the probability, when H0 is true, of a test statistic at least as extreme as the one observed. The significance level $\alpha$ is the type I error rate (false positive rate).
- Reject H0 if p-value < $\alpha$
- Do not reject H0 if p-value > $\alpha$

Previous example: test at $\alpha = 0.05$
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{14(.90)^2 + 14(.96)^2}{15 + 15 - 2} = 0.866$$
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{(1/n_1) + (1/n_2)}} = \frac{7.08 - 6.26}{\sqrt{0.866\,[(1/15) + (1/15)]}} = 2.413$$
P-value: 0.026 < 0.05
Reject the null hypothesis that there is no drug effect.

Confidence interval (CI): an interval of values used to estimate the true value of a population parameter.
Confidence level: the probability $1 - \alpha$, that is, the proportion of times that the CI actually contains the population parameter, assuming that the estimation process is repeated a large number of times.
Common choices: 90% CI ($\alpha$ = 10%), 95% CI ($\alpha$ = 5%), 99% CI ($\alpha$ = 1%)

CI for a comparison of two means:
$$(\bar{x}_1 - \bar{x}_2) - E < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + E, \quad \text{where } E = t_{\alpha/2,\,n_1+n_2-2}\; s_p \sqrt{(1/n_1) + (1/n_2)}$$
A 95% CI for the previous example:
$$E = t_{.025,\,28}\; s_p \sqrt{(1/15) + (1/15)} = (2.048)\sqrt{.866\,[(1/15) + (1/15)]} = .70$$
$$(7.08 - 6.26) \pm .70 = (.12,\ 1.52)$$

SAS Programming for Two-Sample T-test (slide 21)
Data steps:
- Click ‘File’
- Click ‘Import Data’
- Select a data source
- Click ‘Browse’ and find the path of the data file
- Click ‘Next’
- Fill in the ‘Member’ field with the name of the SAS data set
- Click ‘Finish’
Procedure steps:
- Click ‘Solutions’
- Click ‘Analysis’
- Click ‘Analyst’
- Click ‘File’
- Click ‘Open By SAS Name’
- Select the SAS data set and click ‘OK’
- Click ‘Statistics’
- Click ‘Hypothesis Tests’
- Click ‘Two-Sample T-test for Means’
- Select the independent variable as ‘Group’ and the dependent variable as ‘Dependent’
- Choose the hypothesis of interest and click ‘OK’
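As a cross-check on the hand calculation for the drug A example, the pooled-variance t statistic and the 95% CI can be recomputed from the summary statistics alone. This is a minimal sketch under stated assumptions: the function name `pooled_t` is illustrative, and the critical value 2.048 is taken from the slides rather than computed, because the Python standard library has no t distribution (so no p-value is produced here).

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance two-sample t statistic from summary statistics.

    Returns (t, df, pooled variance, standard error of the mean difference).
    """
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / se  # under H0 the true difference is 0
    return t, df, sp2, se


# Treatment vs. control summary statistics from the example table.
t_stat, df, sp2, se = pooled_t(7.08, 0.90, 15, 6.26, 0.96, 15)

T_CRIT = 2.048  # t_{.025, 28} as quoted on the slides (assumed, not computed here)
margin = T_CRIT * se
diff = 7.08 - 6.26
ci = (diff - margin, diff + margin)

print(f"pooled variance = {sp2:.3f}, t = {t_stat:.3f}, df = {df}")
print(f"95% CI for the mean difference: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Because |t| exceeds the critical value 2.048, the sketch reaches the same decision as the slides: reject H0 at the 0.05 level.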
Click ‘File’ to import data and create the SAS data set.
Click ‘Solutions’ to create a project to run the statistical test.
Click ‘File’ to open the SAS data set.
Click ‘Statistics’ to select the statistical procedure.

Mann-Whitney U-Test (Wilcoxon Rank-Sum Test)
- Nonparametric alternative to the two-sample t-test
- The populations don’t need to be normal
- H0: The two samples come from populations with equal medians
- H1: The two samples come from populations with different medians

Mann-Whitney U-Test Procedure
- Temporarily combine the two samples into one big sample, then replace each sample value with its rank
- Find the sum of the ranks for either one of the two samples
- Calculate the value of the z test statistic

Mann-Whitney U-Test, Example
Numbers in parentheses are their ranks, beginning with a rank of 1 assigned to the lowest value of 17.7.
R1 and R2: sums of ranks

Hypothesis: the group medians are different
- H0: Men and women have the same median BMI
- H1: Men and women have different median BMIs
$$\mu_R = \frac{n_1(n_1 + n_2 + 1)}{2} = \frac{13(13 + 12 + 1)}{2} = 169$$
$$\sigma_R = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} = \sqrt{\frac{(13)(12)(13 + 12 + 1)}{12}} = 18.385$$
$$z = \frac{R - \mu_R}{\sigma_R} = \frac{187 - 169}{18.385} = 0.98$$
p-value = 0.33, thus we do not reject H0 at $\alpha$ = 0.05.
There is no significant difference in BMI between men and women.

SAS Programming for Mann-Whitney U-Test Procedure
Data steps: the same as for the two-sample t-test (slide 21).
Procedure steps:
- Click ‘Solutions’
- Click ‘Analysis’
- Click ‘Analyst’
- Click ‘File’
- Click ‘Open By SAS Name’
- Select the SAS data set and click ‘OK’
- Click ‘Statistics’
- Click ‘ANOVA’
- Click ‘Nonparametric One-Way ANOVA’
- Select the ‘Dependent’ and ‘Independent’ variables respectively and choose the test of interest
- Click ‘OK’

Click ‘File’ to open the SAS data set.
Click ‘Statistics’ to select the statistical procedure.
Select the dependent and independent variables.
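The normal approximation used in the BMI example can be reproduced from the rank sum and the two sample sizes. The sketch below is illustrative only: the function name `rank_sum_z` is an assumption, the inputs (R = 187, n1 = 13, n2 = 12) are the values quoted in the example, and the raw BMI data are not available in this text, so the ranking step itself is not shown.

```python
import math
from statistics import NormalDist


def rank_sum_z(rank_sum, n1, n2):
    """Normal approximation for the Wilcoxon rank-sum (Mann-Whitney) test.

    rank_sum is the sum of ranks of the sample of size n1.
    Returns (mu_R, sigma_R, z, two-sided p-value).
    """
    mu_r = n1 * (n1 + n2 + 1) / 2
    sigma_r = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum - mu_r) / sigma_r
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return mu_r, sigma_r, z, p_two_sided


mu_r, sigma_r, z, p_value = rank_sum_z(187, 13, 12)
print(f"mu_R = {mu_r:.0f}, sigma_R = {sigma_r:.3f}, z = {z:.2f}, p = {p_value:.2f}")
```

This version applies no continuity or tie correction, which matches the formula shown in the example; statistical packages may apply such corrections and report slightly different p-values.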
Paired t-test
- Mean difference of matched pairs
- Test for changes (e.g., before & after)
- The measures in each pair are correlated.
- Assumption: population is normally distributed
- Take the difference in each pair and perform a one-sample t-test.
- Check normality
- If the normality assumption is violated
  - T-test is not appropriate.
  - Use a non-parametric alternative: Wilcoxon signed rank test

Notation for paired t-test
- $d$ = individual difference between the two values of a single matched pair
- $\mu_d$ = mean value of the differences $d$ for the population of paired data
- $\bar{d}$ = mean value of the differences $d$ for the paired sample data
- $s_d$ = standard deviation of the differences $d$ for the paired sample data
- $n$ = number of pairs

Example: Systolic Blood Pressure

ID    Without OC’s    With OC’s    Difference
1     115             128          13
2     112             115          3
3     107             106          -1
4     119             128          9
5     115             122          7
6     138             145          7
7     126             132          6
8     105             109          4
9     104             102          -2
10    115             117          2

OC: oral contraceptive

Hypothesis: the group means are different
- H0: $\mu_d = 0$ vs. H1: $\mu_d \neq 0$
- Significance level: $\alpha = 0.05$
- Degrees of freedom (df): $n - 1 = 9$
- Test statistic:
$$t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}} = \frac{4.8}{4.57 / \sqrt{10}} = 3.32$$
- P-value: 0.009, thus reject H0 at $\alpha$ = 0.05
The data support the claim that oral contraceptives affect the systolic bp.

Confidence interval for matched pairs
100(1-$\alpha$)% CI:
$$\left( \bar{d} - t_{\alpha/2,\,n-1} \frac{s_d}{\sqrt{n}},\ \ \bar{d} + t_{\alpha/2,\,n-1} \frac{s_d}{\sqrt{n}} \right)$$
95% CI for the mean difference of the systolic bp:
$$\bar{d} \pm t_{0.025,\,9} \frac{s_d}{\sqrt{n}} = 4.8 \pm 2.26\,\frac{4.57}{\sqrt{10}} = 4.8 \pm 3.27 = (1.53,\ 8.07)$$

SAS Programming for Paired T-test
Data steps: the same as for the two-sample t-test (slide 21).
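Independently of the SAS steps, the paired t-test numbers for the systolic blood pressure example can be checked directly from the ten pairs. This is a minimal sketch: the variable names are illustrative, and the critical value 2.262 for t with 9 degrees of freedom is the standard table value (shown rounded to 2.26 on the slide), entered by hand because the Python standard library has no t distribution.

```python
import math
from statistics import mean, stdev

# Systolic blood pressure for the 10 subjects in the example table.
without_oc = [115, 112, 107, 119, 115, 138, 126, 105, 104, 115]
with_oc = [128, 115, 106, 128, 122, 145, 132, 109, 102, 117]

# Difference within each matched pair (with OC minus without OC).
diffs = [w - wo for wo, w in zip(without_oc, with_oc)]

n = len(diffs)
d_bar = mean(diffs)        # sample mean of the differences
s_d = stdev(diffs)         # sample sd of the differences (n - 1 denominator)
se = s_d / math.sqrt(n)    # standard error of the mean difference
t_stat = d_bar / se        # one-sample t statistic for H0: mu_d = 0

T_CRIT = 2.262  # t_{0.025, 9} from a t table (assumed, not computed here)
ci = (d_bar - T_CRIT * se, d_bar + T_CRIT * se)

print(f"d_bar = {d_bar:.1f}, s_d = {s_d:.2f}, t = {t_stat:.2f}, df = {n - 1}")
print(f"95% CI for the mean difference: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Working on the differences is exactly the one-sample t-test route mentioned in the bullet list for the paired t-test.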
Procedure steps:
- Click ‘Solutions’
- Click ‘Analysis’
- Click ‘Analyst’
- Click ‘File’
- Click ‘Open By SAS Name’
- Select the SAS data set and click ‘OK’
- Click ‘Statistics’
- Click ‘Hypothesis Tests’
- Click ‘Two-Sample Paired T-test for Means’
- Select the ‘Group1’ and ‘Group2’ variables respectively
- Click ‘OK’
(Note: You can also calculate the difference and use it as the dependent variable to run the one-sample t-test.)

Click ‘File’ to open the SAS data set.
Click ‘Statistics’ to select the statistical procedure.
Put the two group variables into ‘Group 1’ and ‘Group 2’.

Comparison of more than two means: ANOVA (Analysis of Variance)
- One-way ANOVA: one factor, e.g., control, drug 1, drug 2
- Two-way ANOVA: two factors, e.g., drugs, age groups
- Repeated measures: if there are repeated measures within subject, such as time points

Example: Pulmonary disease
- Endpoint: mid-expiratory flow (FEF) in L/s
- 6 groups: nonsmokers (NS), passive smokers (PS), noninhaling smokers (NI), light smokers (LS), moderate smokers (MS) and heavy smokers (HS)

Group name    Mean FEF    SD FEF    n
NS            3.78        0.79      200
PS            3.30        0.77      200
NI            3.32        0.86      50
LS            3.23        0.78      200
MS            2.73        0.81      200
HS            2.59        0.82      200

H0: group means are the same
H1: not all the group means are the same

           SS        df      MS        F statistic    P-value
Between    184.38    5       36.875    58.0           <0.001
Within     663.87    1044    0.636
Total      848.25

P-value < 0.001: there is a significant difference in the mean FEF among the groups.
- Comparison of specific groups: linear contrast
- Multiple comparison: Bonferroni adjustment ($\alpha/n$)

SAS Programming for One-Way ANOVA
Data steps: the same as for the two-sample t-test (slide 21).
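Separately from the SAS steps, the one-way ANOVA table for the pulmonary disease example can be rebuilt from the group means, standard deviations and sizes. The following is a sketch under stated assumptions: the dictionary `groups` and the variable names are illustrative, and the p-value is not computed because the Python standard library has no F distribution.

```python
# (mean FEF, sd FEF, n) for each group, copied from the example table.
groups = {
    "NS": (3.78, 0.79, 200),
    "PS": (3.30, 0.77, 200),
    "NI": (3.32, 0.86, 50),
    "LS": (3.23, 0.78, 200),
    "MS": (2.73, 0.81, 200),
    "HS": (2.59, 0.82, 200),
}

k = len(groups)
total_n = sum(n for _, _, n in groups.values())
grand_mean = sum(m * n for m, _, n in groups.values()) / total_n

# Between-group sum of squares: weighted squared distance of each group mean
# from the grand mean.
ss_between = sum(n * (m - grand_mean) ** 2 for m, _, n in groups.values())
# Within-group sum of squares: (n - 1) * variance, summed over groups.
ss_within = sum((n - 1) * sd ** 2 for _, sd, n in groups.values())

df_between = k - 1
df_within = total_n - k
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

print(f"Between: SS = {ss_between:.2f}, df = {df_between}, MS = {ms_between:.3f}")
print(f"Within:  SS = {ss_within:.2f}, df = {df_within}, MS = {ms_within:.3f}")
print(f"F = {f_stat:.1f}")
```

The sums of squares, degrees of freedom and F statistic agree with the ANOVA table in the example up to rounding.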
Procedure steps:
- Click ‘Solutions’
- Click ‘Analysis’
- Click ‘Analyst’
- Click ‘File’
- Click ‘Open By SAS Name’
- Select the SAS data set and click ‘OK’
- Click ‘Statistics’
- Click ‘ANOVA’
- Click ‘One-Way ANOVA’
- Select the ‘Independent’ and ‘Dependent’ variables respectively
- Click ‘OK’

Click ‘File’ to open the SAS data set.
Click ‘Solutions’ to select the statistical procedure.
Select the dependent and independent variables.

4. Inference for Proportions

Chi-square test
- Testing difference of two proportions
- n: sample size, p: success rate
- Requirement: $np \geq 5$ and $n(1 - p) \geq 5$
- H0: $p_1 = p_2$
- H1: $p_1 \neq p_2$ (for two-sided test)
- If the requirement is not satisfied, use Fisher’s exact test.

5. Power/Sample Size Calculation

- Decide significance level (e.g., 0.05)
- Decide desired power (e.g., 80%)
- One-sided or two-sided test
- Comparison of means: two-sample t-test
  - Need to know sample means in each group
  - Need to know sample sd’s in each group
  - Calculation: use software (Nquery, power, etc.)
- Comparison of proportions: chi-square test
  - Need to know sample proportions in each group
  - Continuity correction
  - Small sample size: Fisher’s exact test
  - Calculation: use software

6. Correlation and Regression

Correlation
- Pearson correlation for continuous variables
- Spearman correlation for ranked variables
- Chi-square test for categorical variables
Pearson correlation
- Correlation coefficient (r): $-1 \leq r \leq 1$
- Test for coefficient: t-test
- For the same value of the correlation coefficient, a larger sample gives a more significant result. Thus the magnitude of r alone does not show whether a correlation is significant.
- Judge the significance of the correlation by the p-value.

Regression
Objective
- Find out whether a significant linear relationship exists between the response and independent variables
- Use it to predict a future value
Notation
- X: independent (predictor) variable
- Y: dependent (response) variable
Multiple linear regression model
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$$
where $\varepsilon$ is the random error
Checking the model (assumptions)
- Normality: Q-Q plot, histogram, Shapiro-Wilk test
- Equal variance: the plot of predicted y vs. error shows a band shape
- Linear relationship: predicted y vs. each x

Example

Weight (x1) in lb    Age (x2)    Blood pressure (y)
152                  50          120
183                  20          141
171                  20          124
165                  30          126
158                  30          117
161                  50          129
149                  60          123
158                  50          125
170                  40          132
153                  55          123
164                  40          132
190                  40          155
185                  20          147

The regression equation is
$$\hat{y} = -65.1 + 1.08 x_1 + 0.425 x_2$$
The mean blood pressure increases by 1.08 if weight (x1) increases by one pound and age (x2) remains fixed. Similarly, a 1-year increase in age with the weight held fixed will increase the mean blood pressure by 0.425.

Predictor    Coefficient    se       T-ratio    P-value
Constant     -65.10         14.94    -4.36      0.001
x1           1.077          0.077    13.98      0.000
x2           0.425          0.073    5.82       0.000

s = 2.509, R2 = 95.8%
- Error sd is estimated as 2.509 with df = 13 - 3 = 10
- 95.8% of the variation in y can be explained by the regression.

SAS Programming for Linear Regression
Data steps: the same as for the two-sample t-test (slide 21).
Procedure steps:
- Click ‘Solutions’
- Click ‘Analysis’
- Click ‘Analyst’
- Click ‘File’
- Click ‘Open By SAS Name’
- Select the SAS data set and click ‘OK’
- Click ‘Statistics’
- Click ‘Regression’
- Click ‘Linear’
- Select the ‘Dependent’ (Response) variable and the ‘Explanatory’ (Predictor) variable respectively
- Click ‘OK’

Click ‘File’ to open the SAS data set.
Click ‘Solutions’ to select the statistical procedure.
Select the dependent and explanatory variables.

Other regression models
- Polynomial regression
- Transformation
- Logistic regression
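The fitted coefficients in the blood pressure example can be reproduced by ordinary least squares on the 13 observations. The sketch below is not the method used in the slides (which rely on SAS); it is a hand-rolled solution of the two-predictor normal equations using centered sums, and the function name `fit_two_predictors` is an illustrative assumption. Standard errors and p-values are not computed.

```python
import math

# Data from the example table: weight in lb (x1), age (x2), blood pressure (y).
weight = [152, 183, 171, 165, 158, 161, 149, 158, 170, 153, 164, 190, 185]
age = [50, 20, 20, 30, 30, 50, 60, 50, 40, 55, 40, 40, 20]
bp = [120, 141, 124, 126, 117, 129, 123, 125, 132, 123, 132, 155, 147]


def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2.

    Solves the 2x2 system for the slopes built from centered sums of squares
    and cross-products, then recovers the intercept from the means.
    Returns (b0, b1, b2, r_squared, error_sd).
    """
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    syy = sum((c - my) ** 2 for c in y)

    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2

    sse = syy - b1 * s1y - b2 * s2y       # residual sum of squares
    r_squared = 1 - sse / syy
    error_sd = math.sqrt(sse / (n - 3))   # df = n - 3 (intercept plus two slopes)
    return b0, b1, b2, r_squared, error_sd


b0, b1, b2, r2, s = fit_two_predictors(weight, age, bp)
print(f"y_hat = {b0:.2f} + {b1:.3f} x1 + {b2:.3f} x2")
print(f"s = {s:.3f}, R^2 = {100 * r2:.1f}%")
```

For more than two predictors the same idea extends to solving the full normal equations, which in practice is left to statistical software.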