Topic 6B Estimation and Hypothesis Testing

When a parameter (e.g., the average) is estimated using a sample of data, the estimated value will vary, depending on the particular sample chosen. Sampling variation, or more formally, the sampling distribution, of the estimated parameter gives us a frame of reference for how accurate the estimate is likely to be. If we repeatedly sample, and our estimated parameter does not change much, then we can be confident that the estimate from just one sample is likely to be accurate. On the other hand, if our estimated parameter changes markedly across different samples of data, then we cannot be confident that the estimate from just one sample is accurate.

Whenever we report an estimated value (e.g., the average of a sample of data), we must provide our degree of confidence about the accuracy of the estimate. Typically, reports do this by stating a confidence interval alongside the estimate.

However, we will typically not have the resources to sample repeatedly in order to obtain the sampling variation of our estimated value and report a confidence interval. Statistical theory can help us compute the confidence interval, so we don't need to resort to repeated sampling.

The sampling distributions of various statistics (e.g., mean, percentage, median, standard deviation, etc.) are different, and they require different statistical theory to derive. Below we will focus on the sampling distribution of the mean. That is, whenever we are computing averages, we can use a formula to compute the confidence interval based on just one sample of data.

Central Limit Theorem (CLT)

The sampling distribution of the mean of independently drawn observations will be approximately normally distributed, even if the distribution from which the sample is drawn is not normal.
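One common formal statement of the theorem is sketched below. It is the classical version for independent, identically distributed observations. The finite-variance condition and the symbols $X_i$, $\mu$, $\sigma$ and $\bar{X}_n$ are standard textbook notation, assumed here rather than taken from the text above.

```latex
\text{If } X_1, \dots, X_n \text{ are i.i.d. with mean } \mu
\text{ and finite variance } \sigma^2 > 0, \text{ then}
\qquad
\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \;\xrightarrow{d}\; N(0, 1)
\quad \text{as } n \to \infty,
\qquad \text{where } \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i .
```

In words: for large enough n, the sample mean behaves approximately like a normal variable centred on $\mu$ with standard deviation $\sigma/\sqrt{n}$.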
(Check internet sources/other references for descriptions of the central limit theorem.)

A simulation can be conducted to illustrate the CLT. The data set, StudentLiteracyScores.sav, from Topic 6, contains 27598 students' reading test scores. The following shows the histogram of the reading scores: (try to produce this in SPSS)

This histogram looks quite skewed (i.e., not normally distributed).

Compute the mean and standard deviation of the reading scores:

Mean:_______________
Standard deviation:_________________

If we sample from this population and compute the mean of the sample, we will not get exactly the population mean. There will be variation in the mean values across different samples.

Select one random sample of 100 students. You can do this in SPSS by selecting the menu Data > Select Cases > Random sample of cases > Exactly 100 cases from the first 27598 cases. Also copy the selected cases to a new data set, and give it a file name (see the option near the bottom of the Select Cases dialog box). Compute the mean of this sample. Repeat this a few times by drawing a few samples, and see the variation of the mean values across the different samples.

In real life, typically only one sample is drawn, so only one mean value will be reported, and the sample mean will be taken as representing the population mean. There will be some 'error', as the mean value from one sample is not likely to be exactly the same as the population mean. We need to report the 'confidence interval' of the estimated mean value, to acknowledge the degree of uncertainty we have regarding where the population mean may be.

Using just a few samples, we cannot establish the sampling distribution (i.e., variation) of the sample mean values across samples. Let's try the following more efficient procedure. In SPSS, generate a random number between 1 and 276 for each student in the data file StudentLiteracyScores.sav.
As there are around 27600 students, there should be approximately 100 students with the same random number. In the SPSS syntax window (select File > New > Syntax), type the following:

    compute SampleNumber=rnd(uniform(1)*276+0.5).
    sort cases by SampleNumber.
    execute.

(uniform(1) generates a random number from 0 to 1. The function rnd rounds a number to the nearest integer. Adding 0.5 shifts the range to 0.5 to 276.5, so that after rounding we have values between 1 and 276, and not from 0 to 276.)

Select and run this code. This SPSS code randomly assigns a sample number between 1 and 276 to each student. Essentially we have selected 276 samples, with approximately 100 students in each sample.

Compute the sample mean for each of the 276 samples: Analyze > Compare Means, select Reading score in the dependent list, and select SampleNumber in the independent list.

The output shows the sample mean for each sample. Scan through and note the variation of the sample mean values across the 276 samples. When we state the confidence interval for one sample, we should state the variation (i.e., the standard deviation) of these sample means. So put the 276 sample means back into SPSS, compute the mean and standard deviation of the sample means, and make a histogram. You should get a histogram like this one (replace my picture with yours):

Now this histogram looks normally distributed!

(Note that our sampling is not quite what the Central Limit Theorem stipulates. Under the CLT, the samples and sample elements should be independently drawn, whereas in our case the non-overlapping nature of the samples results in dependencies between samples. Nevertheless, it still illustrates that the sampling distribution of the sample mean is approximately normally distributed, even when the population distribution of the original observations is not normally distributed.)
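For readers without SPSS, the same procedure can be sketched in Python. This is a minimal illustration, not the worksheet's method: the population below is a hypothetical skewed stand-in (exponentially distributed values) because the real StudentLiteracyScores.sav data are not reproduced here, and the function name `sample_means_by_random_groups` is made up for this example.

```python
import random
import statistics


def sample_means_by_random_groups(scores, n_groups, seed=0):
    """Give each score a random group number 1..n_groups; return each group's mean."""
    rng = random.Random(seed)
    groups = {}
    for score in scores:
        # randint(1, n_groups) plays the role of rnd(uniform(1)*276+0.5) in the SPSS code
        groups.setdefault(rng.randint(1, n_groups), []).append(score)
    return [statistics.mean(values) for values in groups.values()]


# Hypothetical stand-in for the reading scores: 27598 skewed values (assumption).
rng = random.Random(1)
population = [rng.expovariate(1 / 6.0) for _ in range(27598)]

means = sample_means_by_random_groups(population, n_groups=276)

print("population mean:     ", round(statistics.mean(population), 2))
print("population SD:       ", round(statistics.pstdev(population), 2))
print("mean of sample means:", round(statistics.mean(means), 2))
print("SD of sample means:  ", round(statistics.stdev(means), 2))
```

With groups of roughly 100, the standard deviation of the sample means should come out at about one tenth of the population standard deviation, and a histogram of `means` should look far more symmetric than a histogram of `population`.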
Compute the mean and standard deviation of the 276 sample means:

(A) Mean of the 276 sample means:____________
(B) Standard deviation of the 276 sample means:____________

The standard deviation of the 276 sample means is called the standard error.

Given that the sampling distribution of the sample means is approximately normally distributed, we can compute a confidence interval based on the normal distribution. That is, for normal distributions, about 95% of the observations lie between mean ± 1.96 × standard deviation. In our case, about 95% of the sample mean values should lie between ____________ and _____________ (work out the two values using (A) and (B) above).

Formula for computing the standard error

In practice, we can use the result derived from statistical theory that the standard error of the mean is approximately $\sigma/\sqrt{n}$, where $\sigma$ is the population standard deviation and $n$ is the sample size. In our case, $n$ is around 100 and $\sigma$ is 5.8, so using this formula, the standard error should be about 0.58. How does this compare with what you obtained in (B) above?

In real life, we don't know the value of $\sigma$. But, scanning over the standard deviation of each sample of around 100 observations, you will find that the sample standard deviation of 100 observations is a good estimate of the population standard deviation.
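The standard error formula and the "mean ± 1.96 × standard deviation" rule can be checked with a few lines of Python. This is only a sketch: the function names are invented for the example, and the centre value 18.98 is the population mean quoted later in this topic, used here as a stand-in for your own value (A).

```python
import math


def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


def range_95(center, se, z=1.96):
    """Range center +/- z * se; z = 1.96 covers about 95% of a normal distribution."""
    return (center - z * se, center + z * se)


# sigma = 5.8 and n = 100 are the values used in the text above.
se = standard_error(5.8, 100)

# 18.98 is an assumed centre; substitute your own (A), and your own (B) for se.
low, high = range_95(18.98, se)

print(f"standard error: {se:.2f}")
print(f"about 95% of sample means should lie between {low:.2f} and {high:.2f}")
```

Replacing `se` with your empirical value (B) shows how closely the formula matches the simulation.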
In practice, how to compute the confidence interval of a sample mean

(1) draw a sample of size $n$
(2) compute the sample mean ($\bar{X}$)
(3) compute the sample standard deviation ($\hat{\sigma}$)
(4) compute the 95% confidence interval of the true mean using $\bar{X} \pm 1.96\,\hat{\sigma}/\sqrt{n}$

In one sentence, describe the meaning of a statement like the following:

The estimated mean of height is 174cm ± 29cm (95% confidence interval)

____________________________________________________________________
____________________________________________________________________
____________________________________________________________________

General process of making inference about a statistic

(1) Establish the sampling distribution of the statistic to assess the variability of the statistic. For example, if we are interested in the mean reading score of students in Victoria, we take a sample and compute the sample mean. Because this sample mean is not the population mean, the value of the sample mean is likely to vary if different samples are drawn. We need to find out how large the variation is. If the variation is large, then our estimate is probably not a very accurate representation of the population mean. If the variation is small, then our estimate is probably quite close to the population mean.

(2) We can repeatedly sample to establish the sampling distribution of the statistic of interest. But this is impractical, as it would be too costly.

Making inferences about the mean

We can use the central limit theorem to establish the sampling distribution of the sample mean without doing repeated sampling. The central limit theorem says that the mean of independently drawn observations will be approximately normally distributed, even if the distribution from which the sample is drawn is not normal.
Further, it can be shown that mean values computed from samples of size $n$ have a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$ (known as the standard error), where $\mu$ is the mean of the distribution we draw our samples from, and $\sigma$ is the standard deviation of the distribution we draw our samples from.

That is, if $\bar{X}$ denotes the sample mean, then

$\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

has a standard normal distribution with mean zero and standard deviation 1 (z-score). For a standard normal distribution, 95% of the observations lie within ±1.96. With a little re-arrangement of the equation, it can be shown that 95% of the time, or, we are 95% confident that,

$\bar{X} - 1.96\,\dfrac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\,\dfrac{\sigma}{\sqrt{n}}$

(In repeated sampling, about 95% of the intervals computed this way will contain the population mean.)

Hypothesis Testing

Hypothesis testing is about using data to make (statistical) conclusions about a hypothesis. For example, suppose I hypothesise that the population mean of students' reading scores is 17 out of 30:

$H_0: \mu = 17$

I draw a sample, say, of 100 students. The sample mean and standard deviation of my sample are 18.4 and 5.8 respectively. The 95% confidence interval for the mean is

$18.4 \pm 1.96 \times \dfrac{5.8}{\sqrt{100}} = (17.3,\ 19.5)$

The 95% confidence interval of (17.3, 19.5) means that, based on our sample, we are 95% confident that the true mean lies between 17.3 and 19.5: intervals constructed this way capture the true mean for about 95% of samples and miss it for about 5%. As the hypothesised mean value, 17, is outside this confidence interval, we reject the null hypothesis at the 95% confidence level. Sometimes this is also expressed as rejecting at the level of p=0.05. This means that, if the null hypothesis were true, there would be a 5% chance of incorrectly rejecting it.

More generally, we make inferences from our sample about the likelihood of population parameters, and we draw conclusions about the hypothesis based on our inference.

Sample size and hypothesis testing

Now, draw a sample of 10 from our reading score data.
Test the hypothesis that $H_0: \mu = 17$. What is the confidence interval in this case?

The 95% confidence interval of the mean is between ________________ and _____________.

What is your conclusion about the hypothesis? Reject or Accept?

Next, draw a sample of 20, and then 50, and then 200. Do you see any difference in whether you accept or reject the null hypothesis at p=0.05?

Sample of 20: Reject or Accept?
Sample of 50: Reject or Accept?
Sample of 200: Reject or Accept?

What if you use p=0.1 (90% confidence interval; for a normal distribution, 90% of the sample means lie within $\pm 1.64\,\sigma/\sqrt{n}$)? Would you change your decision of rejecting or accepting the hypothesis?

Make a table like the one below:

Sample size    Reject or Accept $H_0: \mu = 17$ at p=0.05    Reject or Accept $H_0: \mu = 17$ at p=0.1
10             ____________                                  ____________
50             ____________                                  ____________
100            ____________                                  ____________
200            ____________                                  ____________

Given that we know that the true population mean is 18.98 (which, in real life, we will not know), what do you think about your conclusions in the above table?

What if the hypothesis is $H_0: \mu = 18$? Could you reject this hypothesis? What sample size would you use to reject this hypothesis?

A cartoon in Darrell Huff's book How to Lie with Statistics depicted one person asking "I want to know the truth", and another person replying "it ain't statistics". What is your assessment of statistical hypothesis testing in relation to this cartoon? What DOES statistics tell you?

Some discussion points:

(1) For a population of people, height is normally distributed with a mean of 170 cm and a standard deviation of 12 cm. Dave has a height of 196 cm. Could Dave be from this population of people?

(2) In a region, the number of rainy days per year is approximately normally distributed, with a mean of 85 days and a standard deviation of 15 days (the distribution was established by collecting 200 years of data). One year, the number of rainy days was 120. Was this year an 'abnormal' year? If so, looking at the 200 years of data, what percentage of years do you think would be 'abnormal'?
But the 200 years of data have been used to establish the 'norm', so how can any particular year be 'abnormal'?
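One way to start on the discussion points above is to express each observation as a z-score, i.e., the number of standard deviations it lies from the mean, and to look at how rare such a value is under a normal distribution. The sketch below is an aid to discussion rather than an answer: the function names are made up for this example, and whether a value counts as 'abnormal' still depends on the cut-off you choose.

```python
import math


def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def two_sided_tail(z):
    """P(|Z| >= |z|) for a standard normal variable Z."""
    return math.erfc(abs(z) / math.sqrt(2))


# Values taken from discussion points (1) and (2).
for label, x, mean, sd in [("Dave's height", 196, 170, 12),
                           ("rainy days", 120, 85, 15)]:
    z = z_score(x, mean, sd)
    print(f"{label}: z = {z:.2f}, two-sided tail probability = {two_sided_tail(z):.3f}")
```

Note that `two_sided_tail(1.96)` is about 0.05, which is the same 5% that sits outside a 95% confidence interval.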