STP 420 SUMMER 2002

STP 420 INTRODUCTION TO APPLIED STATISTICS
NOTES PART 2 – PROBABILITY AND INFERENCE

CHAPTER 7 INFERENCE FOR DISTRIBUTIONS

7.1 Inference for the Mean of a Population

The t distributions
- density curve
- symmetric about the mean 0
- area under the curve is 1
- shape similar to the standard normal curve
- has mean 0; the standard deviation varies with the sample size and decreases as the sample size increases
- as $n$ becomes large, the t curve approaches the $N(0, 1)$ curve
- more appropriate, since the standard deviation $\sigma$ of a population is rarely known

If we sample from a population whose standard deviation $\sigma$ is not known, then we have to estimate it using the sample standard deviation $s$. The z statistic is then no longer valid, and we use a more appropriate statistic (the t statistic).

Standard error ($SE_{\bar{x}}$) of a statistic: the standard deviation of the statistic when it is estimated from the data. For the sample mean,

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$

The t distributions
Suppose that an SRS of size $n$ is drawn from an $N(\mu, \sigma)$ population. Then the one-sample t statistic

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

has the t distribution with $n - 1$ degrees of freedom.

degrees of freedom: $n - 1$ for the one-sample t statistic, since only $n - 1$ of the $n$ deviations from the sample mean are free to vary (the deviations sum to 0)

The One-Sample t Confidence Interval
Suppose that an SRS of size $n$ is drawn from a population having unknown mean $\mu$. A level $C$ confidence interval for $\mu$ is

$$\bar{x} \pm t^* \frac{s}{\sqrt{n}}$$

where $t^*$ is the value for the $t(n - 1)$ density curve with area $C$ between $-t^*$ and $t^*$. This interval is exact when the population distribution is normal and is approximately correct for large $n$ in other cases (non-normal populations).

Margin of error: $t^* \, s / \sqrt{n}$ has the same structure as the margin of error based on the $N(0, 1)$ distribution, except that we replace $z^*$ with $t^*$ and $\sigma$ by $s$.

We can report the confidence interval itself, or we can report the center of the interval ($\bar{x}$) together with half the width of the interval as the margin of error.

The One-Sample t Test
Suppose that an SRS of size $n$ is drawn from a population having unknown mean $\mu$.
To test the hypothesis $H_0: \mu = \mu_0$ based on an SRS of size $n$, compute the one-sample t statistic

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

In terms of a random variable $T$ having the $t(n - 1)$ distribution, the P-value for a test of $H_0$ against
- $H_a: \mu > \mu_0$ is $P(T \ge t)$
- $H_a: \mu < \mu_0$ is $P(T \le t)$
- $H_a: \mu \ne \mu_0$ is $2P(T \ge |t|)$

These P-values are exact if the population distribution is normal and are approximately correct for large $n$ in other cases.

It is wrong to look at the data and then decide to do a one-tailed test instead of a two-tailed test. If you have no prior knowledge suggesting the direction of the difference, use a two-sided test.

Matched pairs t procedures
Subjects are matched in pairs. E.g., the differences between pretest scores and posttest scores for the same set of individuals form a single data set that can be analyzed with the one-sample procedures above.
1. A matched pairs analysis is needed when there are two measurements or observations on each individual and we want to examine the change from the first to the second. Before-and-after measurements are common.
2. For each individual, compute after minus before.
3. Analyze the differences using the one-sample confidence interval and significance-testing procedures.

Robustness of the t procedures
robust: procedures that are not strongly affected by non-normality

Robust Procedures
A statistical inference procedure is robust if the probability calculations required are insensitive to violations of the assumptions made.

Some practical guidelines
1. Sample size < 15: Use t procedures if the data are close to normal. Do not use them if the data are clearly non-normal or if outliers are present.
2. Sample size >= 15: Use t procedures except when outliers are present or the data are strongly skewed.
3. Large samples: t procedures can be used even for clearly skewed distributions ($n \ge 40$).

The Sign Test
When populations are non-normal, distribution-free procedures/tests are more straightforward. They have two drawbacks:
1.
Less powerful than tests designed for a specific distribution, such as the t test.
2. Often need to modify the hypothesis in order to use distribution-free tests.

Distribution-free tests
- stated in terms of the median rather than the mean
- good when the distribution is skewed

Sign test
- simplest distribution-free test
- based on counts and the binomial distribution

$H_0: p = 1/2$ versus $H_a: p > 1/2$
is equivalent to
$H_0$: population median $= 0$ versus $H_a$: population median $> 0$

$p$ is the probability of improvement from, say, a pretest to a posttest.
$p = 1/2$ implies no improvement.
$p > 1/2$ implies improvement.

The Sign Test for Matched Pairs
Ignore pairs with difference 0; the number of trials $n$ is the count of the remaining pairs. The test statistic is the count $X$ of pairs with a positive difference. P-values for $X$ are based on the binomial $B(n, 1/2)$ distribution.

Considering the pretest/posttest experiment, the sign test tests the hypothesis that the median of the differences between the pretest and posttest scores is 0. The sign test does not use the actual scores; it uses only a count of the improvements (differences, posttest minus pretest, that are greater than 0).

7.2 Comparing Two Means

Two-Sample Problems
1. The goal of inference is to compare the responses in two groups.
2. Each group is a sample from a distinct population.
3. The responses in each group are independent of those in the other group.

The two-sample z statistic ($\sigma$ is known)
Suppose that $\bar{x}_1$ is the mean of an SRS of size $n_1$ drawn from an $N(\mu_1, \sigma_1)$ population and that $\bar{x}_2$ is the mean of an independent SRS of size $n_2$ drawn from an $N(\mu_2, \sigma_2)$ population. Then the two-sample z statistic

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

has the standard normal $N(0, 1)$ sampling distribution.

The two-sample t significance test ($\sigma$ is unknown)
Suppose that an SRS of size $n_1$ is drawn from a normal population with unknown mean $\mu_1$ and that an independent SRS of size $n_2$ is drawn from another normal population with unknown mean $\mu_2$.
To test the hypothesis $H_0: \mu_1 = \mu_2$, compute the two-sample t statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

and use P-values or critical values for the $t(k)$ distribution, where the degrees of freedom $k$ are either approximated by software or are the smaller of $n_1 - 1$ and $n_2 - 1$.

The Two-Sample t Confidence Interval
Suppose that an SRS of size $n_1$ is drawn from a normal population with unknown mean $\mu_1$ and that an independent SRS of size $n_2$ is drawn from another normal population with unknown mean $\mu_2$. The confidence interval for $\mu_1 - \mu_2$ given by

$$(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

has confidence level at least $C$ no matter what the population standard deviations may be. Here $t^*$ is the value for the $t(k)$ density curve with area $C$ between $-t^*$ and $t^*$. The value of the degrees of freedom $k$ is either approximated by software or taken as the smaller of $n_1 - 1$ and $n_2 - 1$.

Robustness of the two-sample procedures
Two-sample procedures are more robust than the one-sample procedures. If the two populations have the same shape, small samples (about 5 in each group) are acceptable; if the populations have different shapes, larger samples are needed. Equal sample sizes are better to work with.

Inference for small samples
We need to be very careful. With very small samples there are not enough observations for boxplots or normal quantile plots to be useful.
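As a rough illustration of the two-sample t procedures in Section 7.2, the sketch below computes the two-sample t statistic, the conservative degrees of freedom (the smaller of $n_1 - 1$ and $n_2 - 1$), and the confidence interval for $\mu_1 - \mu_2$. It is not part of the original notes: the function names and the data are made up for illustration, and the critical value $t^*$ is assumed to be read from a t table (2.776 is the 95% value for $t(4)$) rather than computed.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(sample1, sample2):
    """Return (t, se, k) for the two-sample t statistic (hypothetical helper).

    se is sqrt(s1^2/n1 + s2^2/n2); k is the conservative degrees of
    freedom, the smaller of n1 - 1 and n2 - 1.
    """
    n1, n2 = len(sample1), len(sample2)
    # statistics.variance divides by n - 1, so it returns the sample variance s^2.
    se = sqrt(variance(sample1) / n1 + variance(sample2) / n2)
    t = (mean(sample1) - mean(sample2)) / se
    k = min(n1 - 1, n2 - 1)
    return t, se, k


def two_sample_ci(sample1, sample2, t_star):
    """Interval (x1bar - x2bar) +/- t* * se, with t* supplied from a t(k) table."""
    _, se, _ = two_sample_t(sample1, sample2)
    diff = mean(sample1) - mean(sample2)
    margin = t_star * se
    return diff - margin, diff + margin


# Made-up data, for illustration only.
group1 = [12.0, 14.0, 16.0, 18.0, 20.0]
group2 = [10.0, 11.0, 12.0, 13.0, 14.0]

t, se, k = two_sample_t(group1, group2)
# Assumption: t* = 2.776 is the tabled 95% critical value for t(4).
low, high = two_sample_ci(group1, group2, t_star=2.776)

print(f"t = {t:.3f}, SE = {se:.3f}, k = {k}")
print(f"95% CI for mu1 - mu2: ({low:.3f}, {high:.3f})")
```

With these made-up numbers the statistic is about 2.53 on $k = 4$ degrees of freedom, which is below the tabled value 2.776, so the 95% interval contains 0. Software that approximates the degrees of freedom would generally give a somewhat narrower interval than this conservative choice.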