http://cc.jlu.edu.cn/ms.html

Medical Statistics 9
Tao Yuchun
2014.3.25

Statistical Analysis of Enumeration Data
2. Statistical Inference for enumeration data

9.1 Sampling error of frequency

• Example: Suppose the death rate is 0.2 when rats are fed a kind of poison. What will happen when we do the experiment on n = 1, 2, 3 or 4 rat(s)?

Here d is the number of deaths among the n rats.

| n | d | Frequency distribution | Sample rate |
|---|---|------------------------|-------------|
| 1 | 0 | 0.8 | 0/1 = 0 |
| 1 | 1 | 0.2 | 1/1 = 1 |
| 2 | 0 | 0.8×0.8 = 0.64 | 0/2 = 0 |
| 2 | 1 | 0.8×0.2 + 0.2×0.8 = 0.32 | 1/2 = 0.5 |
| 2 | 2 | 0.2×0.2 = 0.04 | 2/2 = 1 |
| 3 | 0 | 0.8×0.8×0.8 = 0.512 | 0/3 = 0 |
| 3 | 1 | 3(0.8×0.8×0.2) = 0.384 | 1/3 = 0.33 |
| 3 | 2 | 3(0.8×0.2×0.2) = 0.096 | 2/3 = 0.67 |
| 3 | 3 | 0.2×0.2×0.2 = 0.008 | 3/3 = 1 |
| 4 | 0 | 0.8×0.8×0.8×0.8 = 0.4096 | 0/4 = 0 |
| 4 | 1 | 4(0.8×0.8×0.8×0.2) = 0.4096 | 1/4 = 0.25 |
| 4 | 2 | 6(0.8×0.8×0.2×0.2) = 0.1536 | 2/4 = 0.5 |
| 4 | 3 | 4(0.8×0.2×0.2×0.2) = 0.0256 | 3/4 = 0.75 |
| 4 | 4 | 0.2×0.2×0.2×0.2 = 0.0016 | 4/4 = 1 |

In general, suppose the population proportion is $\pi$ and the sample size is $n$.

• The frequency $P = X/n$ is a random variable, with standard error
$$\sigma_P = \sqrt{\frac{\pi(1-\pi)}{n}}$$
• When $\pi$ is unknown and $n$ is big enough, $\sigma_P$ is approximately equal to
$$S_p = \sqrt{\frac{p(1-p)}{n}}$$

Example 9-1 HBV surface antigen. 200 people were tested, 7 positive.
$$p = \frac{X}{n} = \frac{7}{200} = 3.5\%$$
$$S_p = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.035(1-0.035)}{200}} = 0.0130 = 1.30\%$$

• In theory: if the sample size $n$ is big enough and the observed frequency is $p$, then we have approximately
$$P \sim N\left(\pi, \frac{p(1-p)}{n}\right)$$

9.2 Confidence Interval of Probability

If the sample size $n$ is big enough and the observed frequency is $p$, then

95% confidence interval for $\pi$:
$$p \pm 1.96\sqrt{\frac{p(1-p)}{n}}$$
99% confidence interval for $\pi$:
$$p \pm 2.58\sqrt{\frac{p(1-p)}{n}}$$

Example 9-2 HBV surface antigen. 200 people were tested, 7 positive. Calculate the confidence interval for $\pi$.
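95% confidence interval for $\pi$:
$$p \pm 1.96\sqrt{\frac{p(1-p)}{n}} = 3.5\% \pm 1.96 \times 1.30\% = 0.95\% \sim 6.05\%$$
99% confidence interval for $\pi$:
$$p \pm 2.58\sqrt{\frac{p(1-p)}{n}} = 3.5\% \pm 2.58 \times 1.30\% = 0.15\% \sim 6.85\%$$

• Distinguish between $\mu$ and $\pi$ for sampling error and confidence interval

| | Mean $\mu$ | Proportion $\pi$ |
|---|---|---|
| Sample statistic | $\bar{X}$ | $p$ |
| Standard error | $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$ | $\sigma_p = \sqrt{\dfrac{\pi(1-\pi)}{n}}$ |
| Estimated standard error | $S_{\bar{X}} = \dfrac{S}{\sqrt{n}}$ | $S_p = \sqrt{\dfrac{p(1-p)}{n}}$ |
| Confidence interval | $\mu:\ \bar{X} \pm t_{\alpha/2} S_{\bar{X}}$ | $\pi:\ p \pm Z_{\alpha/2} S_p$ |

9.3 The hypothesis testing of proportion (Z test)

(1) Comparison of sample proportion and population proportion (one-sample Z test)

Example 9-3 Cerebral infarction

| Method | Cases | Cure rate |
|---|---|---|
| New method | 98 | 50% |
| Routine | | 30% |

• 50% is the sample proportion, $p = 50\%$.
• 30% is the population proportion, $\pi_0 = 30\%$.

• Hypotheses and $\alpha$:
$$H_0: \pi = \pi_0 = 0.3, \qquad H_1: \pi \neq \pi_0 = 0.3, \qquad \alpha = 0.05$$
• Statistic Z:
$$Z = \frac{p - \pi_0}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}} = \frac{0.5 - 0.3}{\sqrt{\dfrac{0.3(1-0.3)}{98}}} = 4.32$$
• Decision rule: if $|Z| \ge Z_\alpha$, reject $H_0$; otherwise, there is no reason to reject $H_0$ (accept $H_0$).
• $Z_\alpha$ is:
  Two sides: $Z_{0.05} = 1.96$, $Z_{0.01} = 2.58$
  One side: $Z_{0.05} = 1.65$, $Z_{0.01} = 2.33$

Since $|Z| = 4.32 > Z_{0.05} = 1.96$, reject $H_0$. The new method is better than the routine method.

(2) Comparison of two sample proportions (two-sample Z test)

Example 9-4 Carrier rate of Hepatitis B
In the city: 522 people were tested, 24 carriers, $p_1 = 4.60\%$ (population carrier rate: $\pi_1$);
in the countryside: 478 people were tested, 33 carriers, $p_2 = 6.90\%$ (population carrier rate: $\pi_2$).

$$H_0: \pi_1 = \pi_2, \qquad H_1: \pi_1 \neq \pi_2, \qquad \alpha = 0.05$$
$$p_c = \frac{X_1 + X_2}{n_1 + n_2} = \frac{24 + 33}{522 + 478} = 0.057$$
$$S_{p_1 - p_2} = \sqrt{p_c(1-p_c)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{0.057(1-0.057)\left(\frac{1}{522} + \frac{1}{478}\right)} = 0.0147$$

• Here $p_c$ is the pooled estimate of the two sample proportions, and $S_{p_1 - p_2}$ is the standard error of $p_1 - p_2$.
• Statistic Z:
$$Z = \frac{p_1 - p_2}{S_{p_1 - p_2}} = \frac{0.046 - 0.069}{0.0147} = -1.565$$
• Decision rule: if $|Z| \ge Z_\alpha$, reject $H_0$; otherwise, there is no reason to reject $H_0$ (accept $H_0$).

Since $|Z| = 1.565 < Z_{0.05} = 1.96$, $H_0$ is not rejected. We have no reason to say the population carrier rates of the city and the countryside are different ($\pi_1 = \pi_2$ is retained).

Summary
• The parameter estimation and hypothesis testing of proportion are based on the normal approximation (when the sample size is big enough).
• How big is enough? By experience, $n\pi > 5$ and $n(1-\pi) > 5$.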
For a sample: $np > 5$ and $n(1-p) > 5$.
• If the sample size is not big, the Z test can't be used, and there is no t-test for proportion (see a more detailed textbook).

9.4 Chi-square test

The Z test can only be used for comparing $\pi$ with a given $\pi_0$ (one sample) or comparing $\pi_1$ with $\pi_2$ (two samples). If we need to compare more than two samples, the chi-square test is widely used.

(1) Basic idea of the $\chi^2$ test
• Given a set of actual frequencies $A_1, A_2, A_3, \ldots$, test whether the data follow a certain theory.
• If the theory is true, then we will have a set of theoretical frequencies $T_1, T_2, T_3, \ldots$
• Compare $A_1, A_2, A_3, \ldots$ with $T_1, T_2, T_3, \ldots$ If they are quite different, then the theory might not be true; otherwise, the theory is acceptable.

(2) Chi-square test for 2×2 table

Example 9-5 Acute lower respiratory infection (theoretical frequencies in parentheses)

| Treatment | Effect | Non-effect | Total | Effect rate |
|---|---|---|---|---|
| Drug A | 68 (64.82) a | 6 (9.18) b | 74 (a+b) | 91.89% |
| Drug B | 52 (55.18) c | 11 (7.82) d | 63 (c+d) | 82.54% |
| Total | 120 (a+c) | 17 (b+d) | 137 | 87.59% |

$$H_0: \pi_1 = \pi_2, \qquad H_1: \pi_1 \neq \pi_2, \qquad \alpha = 0.05$$
• Here $\pi_1$ is the population effect rate for drug A, and $\pi_2$ is the population effect rate for drug B.

To calculate the theoretical frequencies: if $H_0$ is true, $\pi_1 = \pi_2 \approx 120/137$, so
$$T_{11} = 74 \times 120/137 = 64.82, \qquad T_{21} = 63 \times 120/137 = 55.18$$
$$T_{12} = 74 \times 17/137 = 9.18, \qquad T_{22} = 63 \times 17/137 = 7.82$$
In general,
$$T_{RC} = \frac{n_R\, n_C}{n}, \qquad n_R: \text{row total}, \quad n_C: \text{column total}$$

To compare A and T by a statistic $\chi^2$:
$$\chi^2 = \frac{(A_{11} - T_{11})^2}{T_{11}} + \frac{(A_{12} - T_{12})^2}{T_{12}} + \cdots = \sum \frac{(A - T)^2}{T}$$

• The chi-square test was invented by Karl Pearson (1857 - 1936).
• The chi-square test is also called Pearson's chi-square test.

If $H_0$ is true, $\chi^2$ follows a chi-square distribution with
$$\nu = (\text{row} - 1)(\text{column} - 1)$$
If the $\chi^2$ value is big enough, we doubt $H_0$ and then reject $H_0$!

• For Example 9-5:
$$\chi^2 = \frac{(68 - 64.82)^2}{64.82} + \frac{(52 - 55.18)^2}{55.18} + \frac{(6 - 9.18)^2}{9.18} + \frac{(11 - 7.82)^2}{7.82} = 2.734$$
$$\nu = (\text{row} - 1)(\text{column} - 1) = (2-1)(2-1) = 1$$
• $\chi^2_{\alpha(\nu)} = \chi^2_{0.05(1)} = 3.84$. Now $\chi^2 = 2.734 < 3.84$, so $P > 0.05$ and $H_0$ is not rejected.
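We have no reason to say the effects of the two treatments are different.

• Question: What is $\chi^2_{\alpha(\nu)}$? Why does $\chi^2 < \chi^2_{\alpha(\nu)}$ imply $P > 0.05$?

• The chi-square distribution is a distribution for a continuous variable.
• The chi-square distribution has one parameter, $\nu$ (degrees of freedom), which determines the shape of the $\chi^2$ curve.
• The area under the $\chi^2$ curve represents probability.

[Figure: the $\chi^2$ curves for different $\nu$ ($\nu$ = 3, 5, 10, 30)]

• The table for the $\chi^2$ distribution.
• The $\chi^2$ critical value is denoted $\chi^2_{\alpha(\nu)}$, where $\alpha$ is a probability and $\nu$ is the degrees of freedom.
• The area under the $\chi^2$ curve means [for $\chi^2_{0.05(1)}$]:

[Figure: area under the $\chi^2$ curve for $\chi^2_{0.05(1)}$]

• For a 2×2 table, there is a specific formula for the chi-square calculation:

| | | | Total |
|---|---|---|---|
| | a | b | a+b |
| | c | d | c+d |
| Total | a+c | b+d | n |

$$\chi^2 = \frac{(ad - bc)^2\, n}{(a+b)(c+d)(a+c)(b+d)}$$

• For Example 9-5:

| | Effect | Non-effect | Total |
|---|---|---|---|
| Drug A | 68 | 6 | 74 |
| Drug B | 52 | 11 | 63 |
| Total | 120 | 17 | 137 |

$$\chi^2 = \frac{(68 \times 11 - 6 \times 52)^2 \times 137}{74 \times 63 \times 120 \times 17} = 2.738$$
(The term-by-term calculation gives 2.734 because it uses theoretical frequencies rounded to two decimals.)

• The chi-square test requires a large sample.
• Pearson's chi-square test statistic follows the chi-square distribution only approximately.
• For a 2×2 table:
(1) If $n \ge 40$ and every $T_i \ge 5$, the $\chi^2$ test is applicable;
(2) If $n < 40$ or some $T_i < 1$, the $\chi^2$ test is not applicable; you should use Fisher's Exact Test;
(3) If $n \ge 40$ and at least one cell has $1 \le T_i < 5$ (as in Example 9-6, where two cells do), the $\chi^2$ test needs adjustment.

• The correction formula of the $\chi^2$ test for a 2×2 table:
$$\chi^2 = \sum \frac{(|A - T| - 0.5)^2}{T}$$
$$\chi^2 = \frac{\left(|ad - bc| - \dfrac{n}{2}\right)^2 n}{(a+b)(c+d)(a+c)(b+d)}$$

Example 9-6 Hematosepsis (theoretical frequencies in parentheses)

| Treatment | Effective | No effect | Total | Effective rate (%) |
|---|---|---|---|---|
| Drug A | 28 (26.09) | 2 (3.91) | 30 | 93.33 |
| Drug B | 12 (13.91) | 4 (2.09) | 16 | 75.00 |
| Total | 40 | 6 | 46 | 86.96 |

• Here $n = 46 > 40$, but $T_{12} = 30 \times 6/46 = 3.91 < 5$ and $T_{22} = 16 \times 6/46 = 2.09 < 5$.
• You should use the correction formula of the $\chi^2$ test for a 2×2 table:

$$H_0: \pi_1 = \pi_2, \qquad H_1: \pi_1 \neq \pi_2, \qquad \alpha = 0.05$$
$$\chi^2 = \frac{\left(|28 \times 4 - 2 \times 12| - \dfrac{46}{2}\right)^2 \times 46}{30 \times 16 \times 40 \times 6} = 1.687$$
$$\nu = (2-1)(2-1) = 1, \qquad \chi^2_{0.05(1)} = 3.84$$
Since $1.687 < 3.84$, $P > 0.05$ and $H_0$ is not rejected. We have no reason to say the effects of the two treatments are different.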
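The 2×2 formulas above can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the function names `expected_2x2`, `chi2_2x2` and `p_value_df1` are hypothetical, clamping the corrected difference at zero is an added safeguard, and the P value relies on the identity $P(\chi^2 > x) = \mathrm{erfc}(\sqrt{x/2})$, which holds only for $\nu = 1$.

```python
import math

def expected_2x2(a, b, c, d):
    """Theoretical frequencies T = (row total * column total) / n."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[r * col / n for col in cols] for r in rows]

def chi2_2x2(a, b, c, d, corrected=False):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]].

    corrected=True applies the continuity correction (|ad - bc| - n/2).
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if corrected:
        # Assumption: never let the corrected difference go below zero.
        diff = max(diff - n / 2, 0.0)
    return diff ** 2 * n / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_df1(chi2):
    """Upper-tail probability of a chi-square value, valid for nu = 1 only."""
    return math.erfc(math.sqrt(chi2 / 2))

# Example 9-5: n >= 40 and every T >= 5, so no correction is applied.
chi2_95 = chi2_2x2(68, 6, 52, 11)
print(round(chi2_95, 3), p_value_df1(chi2_95) > 0.05)

# Example 9-6: two theoretical frequencies fall below 5, so correct.
print(expected_2x2(28, 2, 12, 4))
chi2_96 = chi2_2x2(28, 2, 12, 4, corrected=True)
print(round(chi2_96, 3), p_value_df1(chi2_96) > 0.05)
```

Evaluating `p_value_df1(3.84)` returns a value very close to 0.05, which is one way to see why a statistic below the critical value 3.84 corresponds to $P > 0.05$.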
(3) Chi-square test for R×C table

Example 9-7 Leukaemia

Table 9-7 Blood types for the different leukaemia patients

| Disease | A | B | O | AB | Total |
|---|---|---|---|---|---|
| Acute leukaemia | 58 | 49 | 59 | 18 | 184 |
| Chronic leukaemia | 43 | 27 | 33 | 8 | 111 |
| Total | 101 | 76 | 92 | 26 | 295 |

H0: The distributions of blood types in the two populations are the same
H1: The distributions are not the same

The formula of the $\chi^2$ test statistic for an R×C table:
$$\chi^2 = n\left(\sum \frac{A^2}{n_R\, n_C} - 1\right), \qquad n_R: \text{row total}, \quad n_C: \text{column total}$$

• For Example 9-7:
$$\chi^2 = 295\left(\frac{58^2}{184 \times 101} + \frac{49^2}{184 \times 76} + \cdots + \frac{8^2}{111 \times 26} - 1\right) = 1.84$$
$\nu = (R-1)(C-1) = (2-1)(4-1) = 3$. From the table, $\chi^2_{0.05(3)} = 7.81$. Now $\chi^2 = 1.84 < 7.81$, so $P > 0.05$ and $H_0$ is not rejected. We have no reason to say the distributions of blood types in the two populations are different.

• Question: Why does $\chi^2 = 1.84 < \chi^2_{0.05(3)} = 7.81$ imply $P > 0.05$?
• The answer is in this figure!

[Figure: area under the $\chi^2$ curve for $\nu = 3$, with $\chi^2 = 1.84$ and $\chi^2_{0.05(3)} = 7.81$]

(4) Caution for the chi-square test

(1) Both the 2×2 table and the R×C table are called contingency tables. The 2×2 table is a special case of the R×C table.
(2) When R > 2, "H0 is rejected" only means there is a difference among some groups. It does not necessarily mean that all the groups are different.
(3) The $\chi^2$ test requires a large sample. By experience, the theoretical frequencies should be greater than 5 in more than 4/5 of the cells, and the theoretical frequency in any cell should be greater than 1. Otherwise, we cannot use the chi-square test directly.

• If the above requirements are violated, what should we do?
(1) Increase the sample size.
(2) Re-organize the categories: pool some categories, or cancel some categories.

You should know: the chi-square test is a very important method of statistical inference for enumeration data!
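The R×C statistic and the large-sample caution above can be sketched in Python as a check on Example 9-7. This is an illustrative sketch rather than part of the original slides: the names `chi2_rxc` and `large_sample_ok` are hypothetical, and reading "more than 4/5 of the cells" as "at least 4/5" is an assumption made here.

```python
def chi2_rxc(table):
    """Chi-square statistic n * (sum(A^2 / (nR * nC)) - 1) and its
    degrees of freedom for an R x C table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = sum(
        a * a / (row_totals[i] * col_totals[j])
        for i, row in enumerate(table)
        for j, a in enumerate(row)
    )
    chi2 = n * (total - 1)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

def large_sample_ok(table):
    """Rule of thumb from the caution list: every theoretical frequency
    above 1, and theoretical frequencies above 5 in (assumed) at least
    4/5 of the cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [r * c / n for r in row_totals for c in col_totals]
    above_five = sum(1 for t in expected if t > 5)
    return min(expected) > 1 and above_five >= 0.8 * len(expected)

# Table 9-7: blood types A, B, O, AB for acute and chronic leukaemia.
blood = [[58, 49, 59, 18],
         [43, 27, 33, 8]]
chi2, df = chi2_rxc(blood)
print(round(chi2, 2), df, large_sample_ok(blood))
```

The statistic is then compared with the tabulated critical value for the returned degrees of freedom, here $\chi^2_{0.05(3)} = 7.81$.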