Sampling and estimation 2
Tron Anders Moger
27.09.2006

Confidence intervals (rep.)
• Assume that $X_1, \ldots, X_n$ are a random sample from a normal distribution
• Recall that $\bar{X}$ has expected value $\mu$ and variance $\sigma^2/n$
• The interval $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$ is called a 95% confidence interval for $\mu$
• Means that, over repeated samples, the interval will contain the population mean 95% of the time
• Often interpreted as if we are 95% certain that the population mean lies in this interval

Hypothesis testing (rep.)
• Have a data sample
• Would like to test if there is evidence that a parameter value calculated from the data is different from the value in a null hypothesis H0
• If so, H0 is rejected in favour of some alternative H1
• Have to construct a test statistic
• It must:
  – Have a higher probability for "extreme" values under H1 than under H0
  – Have a known distribution under H0 (when simple)

Two important quantities
• P-value = probability of the observed value or something more extreme, assuming the null hypothesis
• Significance level α = the threshold for the p-value below which we reject H0
• If the value of the test statistic is "too extreme", then H0 is rejected
• α = 0.05: We reject H0 only if a difference at least as large as the observed one would arise by chance less than 5% of the time when H0 is true, or, equivalently:
• We want a 95% probability of not rejecting H0 when it is true in reality

Note:
• There is an asymmetry between H0 and H1: In fact, if the data is inconclusive, we end up not rejecting H0.
• If H0 is true the probability to reject H0 is (say) 5%. That DOES NOT MEAN we are 95% certain that H0 is true!
• How much evidence we have for choosing H1 over H0 depends entirely on how much more probable rejection is if H1 is true.

Errors of types I and II
• The above can be seen as a decision rule for H0 or H1.
• For any such rule we can compute (if both H0 and H1 are simple hypotheses):

             Accept H0                   Reject H0
  H0 true    P(accept | H0) = 1 - α      P(reject | H0) = α
                                         TYPE I error, significance level
  H1 true    P(accept | H1) = β          P(reject | H1) = 1 - β
             TYPE II error               Power

Sample size computations
• For a sample from a normal population with known variance, the size of the confidence interval for the mean (at a given confidence level) depends only on the sample size.
• So we can compute the necessary sample size to match a required accuracy
• Note: If the variance is unknown, it must somehow be estimated beforehand to do the computation
• Works also for population proportion estimation, giving an inequality for the required sample size

Power computations
• If you reject H0, you know very little about the evidence for H1 versus H0 unless you study the power of the test.
• The power is the probability of rejecting H0 given that a hypothesis in H1 is true, i.e. 1 minus the probability of accepting H0 under that hypothesis (1-β).
• Thus it is a function of the possible hypotheses in H1.
• We would like our tests to have as high power as possible.

Example 1: Normal distribution with unknown variance
• Assume $X_1, X_2, \ldots, X_n \sim N(\mu, \sigma^2)$
• Then $\frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$
• Thus $P\left(\bar{X} - t_{n-1,\alpha/2}\,\frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{n-1,\alpha/2}\,\frac{s}{\sqrt{n}}\right) = 1 - \alpha$
• So a confidence interval for $\mu$, with confidence level $1-\alpha$, is given by
  $\left(\bar{X} - t_{n-1,\alpha/2}\,\frac{s}{\sqrt{n}},\ \bar{X} + t_{n-1,\alpha/2}\,\frac{s}{\sqrt{n}}\right)$

Example 1 (Hypothesis testing)
• Hypotheses: $H_0: \mu = \mu_0$, $H_1: \mu \neq \mu_0$
• Test statistic $\frac{\bar{X} - \mu_0}{s/\sqrt{n}} \sim t_{n-1}$ under H0
• Reject H0 if $\frac{\bar{X} - \mu_0}{s/\sqrt{n}} < -t_{n-1,\alpha/2}$ or if $\frac{\bar{X} - \mu_0}{s/\sqrt{n}} > t_{n-1,\alpha/2}$
• Alternatively, the p-value for the test can be computed (if $\bar{X} > \mu_0$) as the $\alpha$ such that $\frac{\bar{X} - \mu_0}{s/\sqrt{n}} = t_{n-1,\alpha/2}$

Example 1 (cont.)
• Hypotheses: $H_0: \mu \le \mu_0$, $H_1: \mu > \mu_0$
• Test statistic $\frac{\bar{X} - \mu_0}{s/\sqrt{n}} \sim t_{n-1}$ assuming $\mu = \mu_0$
• Reject H0 if $\frac{\bar{X} - \mu_0}{s/\sqrt{n}} > t_{n-1,\alpha}$
• Alternatively, the p-value for the test can be computed as the $\alpha$ such that $\frac{\bar{X} - \mu_0}{s/\sqrt{n}} = t_{n-1,\alpha}$

Energy intake in kJ

  SUBJECT   INTAKE
  1         5260
  2         5470
  3         5640
  4         6180
  5         6390
  6         6515
  7         6805
  8         7515
  9         7515
  10        8230
  11        8770

Recommended energy intake: 7725 kJ
Want to test if it applies to the 11 women
H0: $\mu$ (mean energy intake) = 7725
H1: $\mu \neq 7725$

From Explore in SPSS:
[SPSS Explore output: descriptive statistics for INTAKE; table not recoverable from the source]

Test result:
  $t = \frac{6753.6 - 7725}{1142.1/\sqrt{11}} = -2.821$
• This quantity is t-distributed with 10 degrees of freedom (number of subjects - 1)
• Choose significance level α = 0.05
• From table 8 p.870 in the book, $t_{11-1,\,0.05/2} = 2.228$
• If H0 is true, the interval (-2.228, 2.228) covers 95% of the distribution
• Reject H0 since the test statistic is outside the interval, or, equivalently, because $\bar{X} = 6753.6 < 7725 - 2.228 \cdot 1142.1/\sqrt{11} = 6958$
• Can't find exact p-value from the table
• Could have had α = 0.01 or 0.1, but 0.05 is most common

In SPSS: Analyze - Compare means - One-sample t test
Test variable: intake
Test value: 7725
[SPSS output: One-Sample Test table for intake; table not recoverable from the source]

Differences between means
• Assume $X_1, X_2, \ldots, X_{n_x} \sim N(\mu_x, \sigma_x^2)$ and $Y_1, Y_2, \ldots, Y_{n_y} \sim N(\mu_y, \sigma_y^2)$, all data independent
• We would like to study the difference $\mu_x - \mu_y$
• Three different cases:
  – Matched pairs
  – Unknown but equal population variances
  – Unknown and possibly different pop. variances

Matched pairs
• Common situation: Several measurements on each individual, or on closely related objects
• These measurements will not be independent (why?)
• Generally a problem in statistics, but simple if you only have two measurements
• The key is to work with the difference within each pair, instead of each mean separately

Example 2: Matched pairs
• In practice, the basis is that $\mu_x - \mu_y = 0$
• Set $D_i = X_i - Y_i$ and $\bar{D} = \frac{1}{n}\sum_{i=1}^{n} D_i$
• We get $\frac{\bar{D} - 0}{s_D/\sqrt{n}} \sim t_{n-1}$
• Where $s_D^2 = \frac{1}{n-1}\sum_{i=1}^{n} (D_i - \bar{D})^2$
• Confidence interval for $\mu_x - \mu_y$: $\bar{D} \pm t_{n-1,\alpha/2}\, s_D/\sqrt{n}$

Example 2 (Hypothesis testing)
• Hypotheses: $H_0: \mu_x = \mu_y$, $H_1: \mu_x \neq \mu_y$
• Test statistic: $\frac{\bar{D}}{s_D/\sqrt{n}} \sim t_{n-1}$ under H0 ("Matched pairs T test")
• Reject H0 if $\frac{\bar{D}}{s_D/\sqrt{n}} < -t_{n-1,\alpha/2}$ or if $\frac{\bar{D}}{s_D/\sqrt{n}} > t_{n-1,\alpha/2}$

Example: Energy intake kJ

  SUBJECT   PREMENST   POSTMENS
  1         5260.0     3910.0
  2         5470.0     4220.0
  3         5640.0     3885.0
  4         6180.0     5160.0
  5         6390.0     5645.0
  6         6515.0     4680.0
  7         6805.0     5265.0
  8         7515.0     5975.0
  9         7515.0     6790.0
  10        8230.0     6900.0
  11        8770.0     7335.0

Number of cases read: 11    Number of cases listed: 11

Want to test if energy intake is different before and after menstruation.
H0: $\mu_{premenst} = \mu_{postmenst}$
H1: $\mu_{premenst} \neq \mu_{postmenst}$

Confidence interval and p-values for paired t-tests in SPSS
• Analyze - Compare Means - Paired-Samples T Test.
• Click on the two variables you want to test, and move them to Paired variables

Paired Samples Test, Pair 1: PREMENST - POSTMENST

  Paired differences: Mean                        1320.45455
  Paired differences: Std. Deviation              366.74551
  Paired differences: Std. Error Mean             110.57793
  95% Confidence Interval of the Difference       Lower 1074.07156, Upper 1566.83753
  t                                               11.941
  df                                              10
  Sig. (2-tailed)                                 .000

• Conclusion: Reject H0 at the 5% significance level

Example 3: Unknown but equal population variances
• We get $\frac{\bar{X} - \bar{Y} - (\mu_x - \mu_y)}{\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}} \sim t_{n_x+n_y-2}$ where $s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}$
• Confidence interval for $\mu_x - \mu_y$: $\bar{X} - \bar{Y} \pm t_{n_x+n_y-2,\alpha/2}\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}$

Example 3 (Hypothesis testing)
• Hypotheses: $H_0: \mu_x = \mu_y$, $H_1: \mu_x \neq \mu_y$
• Test statistic: $\frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}} \sim t_{n_x+n_y-2}$ under H0 ("T test with equal variances")
• Reject H0 if $\frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}} < -t_{n_x+n_y-2,\alpha/2}$ or if $\frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}} > t_{n_x+n_y-2,\alpha/2}$

Assumptions
1. Independence: All observations are independent. Achieved by taking random samples of individuals; for paired t-test independence is achieved by using the difference between measurements
2. Normally distributed data (Check: histograms, tests for normal distribution, QQ plots)
3. Equal variance or standard deviations in the groups
• Assumptions can be checked in histograms, box plots etc. (or tests for normality)
• What if the variances are unequal?

Example 4: Unknown and possibly unequal population variances
• We get $\frac{\bar{X} - \bar{Y} - (\mu_x - \mu_y)}{\sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}}} \sim t_\nu$ where $\nu = \frac{\left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^2}{\frac{(s_x^2/n_x)^2}{n_x - 1} + \frac{(s_y^2/n_y)^2}{n_y - 1}}$
• Conf. interval for $\mu_x - \mu_y$: $\bar{X} - \bar{Y} \pm t_{\nu,\alpha/2}\sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}}$

Example 4 (Hypothesis testing)
• Hypotheses: $H_0: \mu_x = \mu_y$, $H_1: \mu_x \neq \mu_y$
• Test statistic: $\frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}}} \sim t_\nu$ under H0 ("T test with unequal variances")
• Reject H0 if $\frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}}} < -t_{\nu,\alpha/2}$ or if $\frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}}} > t_{\nu,\alpha/2}$

Example 5: The variance of a normal distribution
• Assume $X_1, X_2, \ldots, X_n \sim N(\mu, \sigma^2)$
• Then $\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$
• Thus $P\left(\chi^2_{n-1,1-\alpha/2} < \frac{(n-1)s^2}{\sigma^2} < \chi^2_{n-1,\alpha/2}\right) = 1 - \alpha$
• Confidence interval for $\sigma^2$: $\left(\frac{(n-1)s^2}{\chi^2_{n-1,\alpha/2}},\ \frac{(n-1)s^2}{\chi^2_{n-1,1-\alpha/2}}\right)$

Example 5: Comparing variances for normal distributions
• Assume $X_1, X_2, \ldots, X_{n_x} \sim N(\mu_x, \sigma_x^2)$ and $Y_1, Y_2, \ldots, Y_{n_y} \sim N(\mu_y, \sigma_y^2)$
• We get $\frac{s_x^2/\sigma_x^2}{s_y^2/\sigma_y^2} \sim F_{n_x-1,\,n_y-1}$
• $F_{n_x-1,\,n_y-1}$ is an F distribution with $n_x - 1$ and $n_y - 1$ degrees of freedom
• We can use this exactly as before to obtain a confidence interval for $\sigma_x^2/\sigma_y^2$ and for testing for example if $\sigma_x^2 = \sigma_y^2$
• Note: The assumption of normality is crucial!

Example: Energy expenditure in two groups, lean and obese.

  ID     GROUP   ENERGY
  1      0       6.13
  2      0       7.05
  ....   ....    ....
  12     0       10.15
  13     0       10.88
  14     1       8.79
  15     1       9.19
  ....   ....    ....
  21     1       11.85
  22     1       12.79

Number of cases read: 22

Want to test if there is any difference.
H0: $\mu_{lean} = \mu_{obese}$
H1: $\mu_{lean} \neq \mu_{obese}$

In SPSS:
• Analyze - Compare Means - Independent-Samples T Test
• Move Energy to "Test-variable"
• Move Group to "Grouping variable"
• Click "Define Groups" and write 0 and 1 for the two groups

Output:
[SPSS output: Group Statistics and Independent Samples Test tables for ENERGY by GROUP; tables not recoverable from the source]

• If the significance of Levene's test for equality of variances is above 0.05: Read first line (Equal variances assumed)
• Otherwise: Read second line (Equal variances not assumed)

Conclusion
• The observed mean for the lean was 8.1, and for the obese 10.3 (mean difference -2.2, 95% confidence interval for the difference (-3.4, -1.1))
• The difference between the groups was significant at the 5% level (since the CI does not include the value 0)
• The p-value was 0.001.
• H0 is rejected

Example 6: Population proportions
• Assume $X \sim Bin(n, P)$, so that $\hat{P} = \frac{X}{n}$ is a frequency.
• Then $\frac{\hat{P} - P}{\sqrt{P(1-P)/n}} \sim N(0,1)$ (approximately, for large n)
• Thus $\frac{\hat{P} - P}{\sqrt{\hat{P}(1-\hat{P})/n}} \sim N(0,1)$ (approximately, for large n)
• Thus $P\left(\hat{P} - Z_{\alpha/2}\sqrt{\frac{\hat{P}(1-\hat{P})}{n}} < P < \hat{P} + Z_{\alpha/2}\sqrt{\frac{\hat{P}(1-\hat{P})}{n}}\right) \approx 1 - \alpha$
• Confidence interval for P: $\left(\hat{P} - Z_{\alpha/2}\sqrt{\frac{\hat{P}(1-\hat{P})}{n}},\ \hat{P} + Z_{\alpha/2}\sqrt{\frac{\hat{P}(1-\hat{P})}{n}}\right)$

Example 6 (Hypothesis testing)
• Hypotheses: $H_0: P = P_0$, $H_1: P \neq P_0$
• Test statistic $\frac{\hat{P} - P_0}{\sqrt{P_0(1-P_0)/n}} \sim N(0,1)$ under H0, for large n
• Reject H0 if $\frac{\hat{P} - P_0}{\sqrt{P_0(1-P_0)/n}} < -Z_{\alpha/2}$ or if $\frac{\hat{P} - P_0}{\sqrt{P_0(1-P_0)/n}} > Z_{\alpha/2}$

Example 7: Differences between population proportions
• Assume $X_1 \sim Bin(n_1, P_1)$ and $X_2 \sim Bin(n_2, P_2)$, so that $\hat{P}_1 = \frac{X_1}{n_1}$ and $\hat{P}_2 = \frac{X_2}{n_2}$ are frequencies
• Then $\frac{\hat{P}_1 - \hat{P}_2 - (P_1 - P_2)}{\sqrt{\frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2}}} \sim N(0,1)$ (approximately)
• Confidence interval for $P_1 - P_2$ (approximately): $\hat{P}_1 - \hat{P}_2 \pm Z_{\alpha/2}\sqrt{\frac{\hat{P}_1(1-\hat{P}_1)}{n_1} + \frac{\hat{P}_2(1-\hat{P}_2)}{n_2}}$

Example 7 (Hypothesis testing)
• Hypotheses: $H_0: P_1 = P_2$, $H_1: P_1 \neq P_2$
• Test statistic $\frac{\hat{P}_1 - \hat{P}_2}{\sqrt{\frac{\hat{P}_0(1-\hat{P}_0)}{n_1} + \frac{\hat{P}_0(1-\hat{P}_0)}{n_2}}} \sim N(0,1)$ under H0, where $\hat{P}_0 = \frac{n_1\hat{P}_1 + n_2\hat{P}_2}{n_1 + n_2}$
• Reject H0 if $\frac{|\hat{P}_1 - \hat{P}_2|}{\sqrt{\frac{\hat{P}_0(1-\hat{P}_0)}{n_1} + \frac{\hat{P}_0(1-\hat{P}_0)}{n_2}}} > Z_{\alpha/2}$

Example: Spontaneous abortions among nurses
• Spontaneous abortions among nurses helping with operations and other nurses
• Want to test if there is a difference between the proportions of abortions in the two groups
• H0: $P_{op.nurses} = P_{others}$, H1: $P_{op.nurses} \neq P_{others}$

                      Op. nurses   Other nurses
  No. interviewed     67           92
  No. births          36           34
  No. abortions       10           3
  Percent abortions   27.8         8.8

Calculation:
• $\hat{P}_1 = 0.278$, $\hat{P}_2 = 0.088$, $n_1 = 36$, $n_2 = 34$
• $\hat{p} = \frac{\text{Total no. abortions}}{\text{Total no. pregnancies}} = \frac{10 + 3}{36 + 34} = 0.186$
• $z = \frac{0.278 - 0.088}{\sqrt{\left(\frac{1}{36} + \frac{1}{34}\right) 0.186\,(1 - 0.186)}} = 2.04$
• P-value 0.0414 = 4.1%, reject H0 at the 5% significance level (can't do this in SPSS)
• 95% confidence interval for $P_1 - P_2$: $(\hat{P}_1 - \hat{P}_2) \pm 1.96\sqrt{\frac{\hat{P}_1(1-\hat{P}_1)}{n_1} + \frac{\hat{P}_2(1-\hat{P}_2)}{n_2}} = (0.015, 0.364)$

Next week:
• Next lecture will be about modelling relationships between continuous variables
• Linear regression
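
Checking the worked examples numerically

The three worked examples in this lecture (one-sample t test on energy intake, matched pairs t test on pre/post intake, and the two-proportion test for the nurses data) can be re-checked outside SPSS. The following is a minimal sketch in Python using only the standard library. The function names are illustrative and not part of the lecture, and the critical values 2.228 (t with 10 degrees of freedom) and 1.96 are hard-coded from standard tables as an assumption rather than computed.

```python
import math
import statistics

# Data copied from the energy intake slides (kJ), 11 women.
PRE = [5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770]
POST = [3910, 4220, 3885, 5160, 5645, 4680, 5265, 5975, 6790, 6900, 7335]

# Assumption: critical values taken from standard tables, not computed here.
T_CRIT_10DF = 2.228  # t_{10, 0.025}
Z_CRIT = 1.96        # Z_{0.025}


def one_sample_t(sample, mu0):
    """Return the t statistic (mean - mu0) / (s / sqrt(n))."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # stdev uses the n-1 divisor
    return (statistics.mean(sample) - mu0) / se


def paired_t(x, y, t_crit):
    """Return (t statistic, confidence interval) for the mean pair difference."""
    d = [a - b for a, b in zip(x, y)]
    mean_d = statistics.mean(d)
    se = statistics.stdev(d) / math.sqrt(len(d))
    return mean_d / se, (mean_d - t_crit * se, mean_d + t_crit * se)


def two_proportion_z(x1, n1, x2, n2, z_crit=Z_CRIT):
    """Return (z, two-sided p-value, confidence interval for P1 - P2)."""
    p1, p2 = x1 / n1, x2 / n2
    p0 = (x1 + x2) / (n1 + n2)  # pooled estimate, used under H0 only
    z = (p1 - p2) / math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    # The interval uses the unpooled standard error.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return z, p_value, (p1 - p2 - z_crit * se, p1 - p2 + z_crit * se)


if __name__ == "__main__":
    print("one-sample t:", one_sample_t(PRE, 7725))
    print("paired t, CI:", paired_t(PRE, POST, T_CRIT_10DF))
    print("two-proportion z, p, CI:", two_proportion_z(10, 36, 3, 34))
```

Under these assumptions the script should reproduce the slide values up to rounding. The paired interval differs from the SPSS output in the second decimal because the critical value is rounded to three decimals.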
