Topic 6B
Estimation and Hypothesis Testing
When a parameter (e.g., the average) is estimated using a sample of data, the
estimated value will vary, depending on the particular sample chosen. Sampling
variation, or more formally, sampling distribution, of the estimated parameter gives us
a frame of reference of how accurate the estimate is likely to be. If we repeatedly
sample, and our estimated parameter does not change much, then we are confident
that the estimate from just one sample is likely to be accurate. On the other hand, if
our estimated parameter changes quite markedly for different samples of data, then
we are not at all confident that the estimate from just one sample is likely to be
accurate.
Whenever we report an estimated value (e.g., the average of a sample of data), we
must provide our degree of confidence about the accuracy of the estimate. Typically,
reports state an estimate together with a confidence interval.
However, typically, we will not have the resources to repeatedly sample to obtain the
sampling variation of our estimated value to report a confidence interval. Statistical
theory can help us with the computation of the confidence interval so we don’t need
to resort to repeated sampling to establish confidence intervals.
The sampling distributions of various statistics (e.g., mean, percentage, median,
standard deviation, etc.) are different, and deriving each one requires different
statistical theory. Below we will focus on the sampling distribution of
the mean. That is, whenever we are computing averages, we can use a formula to
compute the confidence interval based on just one sample of data.
Central Limit Theorem (CLT)
The sampling distribution of the mean of independently drawn observations will be
approximately normally distributed when the sample is reasonably large, even if the
distribution from which the sample is drawn is not normal.
(Check internet sources/other references for descriptions of the central limit theorem)
A simulation can be conducted to illustrate CLT. The data set,
StudentLiteracyScores.sav, from Topic 6, contains 27598 students’ reading test
scores. The following shows the histogram of the reading scores: (try to produce this
in SPSS)
This histogram looks quite skewed (i.e., not normally distributed). Compute the mean
and standard deviation of the reading scores:
Mean:_______________
Standard deviation:_________________
If we sample from this population, and compute the mean of the sample, we will not
get exactly the population mean. The mean values will vary across
different samples.
Select one random sample of 100 students. You can do this in SPSS by selecting the
menu
Data → Select Cases → Random sample of cases → Exactly 100 cases from the first
27598 cases.
Also copy selected cases to a new data set, and give a file name (see option in the
Select Cases dialog box, near the bottom).
Compute the mean of this sample.
Repeat this a few times by drawing a few samples, and see the variation of the mean
values of different samples.
In real life, typically, only one sample is drawn, so only one mean value will be
reported, and the sample mean value will be inferred as representing the population
mean value. There will be some ‘error’ as the mean value from one sample is not
likely to be exactly the same as the population mean. We need to report the
‘confidence interval’ of the estimated mean value, to acknowledge the degree of
uncertainty we have regarding where the population mean may be.
Using just a few samples, we cannot establish the sampling distribution (i.e.,
variation) of the sample mean values across samples. Let’s try the following more
efficient procedure.
In SPSS, generate a random number between 1 and 276 for each student in the data
file StudentLiteracyScores.sav. As there are around 27600 students, there should be
approximately 100 students with the same random number. In the SPSS syntax window
(select File → New → Syntax window), type the following:
compute SampleNumber=rnd(uniform(1)*276+0.5).
sort cases by SampleNumber.
execute.
(uniform(1) generates a random number between 0 and 1, so uniform(1)*276 lies between
0 and 276. The function rnd rounds a number to the nearest integer. Adding 0.5 before
rounding shifts the range to 0.5 to 276.5, so the rounded values run from 1 to 276
rather than from 0 to 276, and each value is equally likely.)
Select and run this code.
This SPSS code randomly assigns a sample number between 1 and 276 to each
student. Essentially we have selected 276 samples, with approximately 100 students
in each sample.
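For readers who want to check the labelling logic outside SPSS, here is a minimal Python sketch of the same assignment. It is an illustration, not part of the SPSS workflow. It assumes that rnd rounds halves upward for positive numbers, which is mimicked here with math.floor(x + 0.5) because Python's built-in round treats exact halves differently. The seed and all variable names are arbitrary choices.

```python
import math
import random
from collections import Counter

N_STUDENTS = 27598   # number of students in the data file
N_SAMPLES = 276      # number of sample labels, as in the SPSS code

rng = random.Random(1)  # fixed seed so the run is repeatable (arbitrary choice)

def sample_number(u):
    """Map u in [0, 1) to a label 1..276, mirroring rnd(u*276 + 0.5)."""
    x = u * N_SAMPLES + 0.5      # shifted range: 0.5 up to just under 276.5
    return math.floor(x + 0.5)   # round half up (assumed behaviour of rnd)

labels = [sample_number(rng.random()) for _ in range(N_STUDENTS)]
sizes = Counter(labels)          # how many students received each label

print("smallest and largest label:", min(labels), max(labels))
print("smallest and largest sample size:", min(sizes.values()), max(sizes.values()))
print("average sample size:", round(N_STUDENTS / len(sizes), 1))
```

The sample sizes are only approximately 100 because each label is drawn at random; the sketch prints the actual range so the spread can be inspected.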
Compute the sample mean for each of the 276 samples:
Analyze → Compare Means → select Reading score in the dependent list →
select SampleNumber in the independent list.
The output shows the sample mean for each sample. Scan through and note the
variations of the sample mean values for the 276 samples. When we state the
confidence interval for one sample, we should state the variation (i.e., the standard
deviation) of these sample means.
So put the 276 sample means back into SPSS, and compute the mean and standard
deviation of the sample means, and make a histogram.
You should get a histogram like this one (replace my picture with yours):
Now this histogram looks normally distributed!
(Note that our sampling is not quite what the Central Limit Theorem stipulates. Under
the CLT, the samples and sample elements should be independently drawn, whereas
in our case, the non-overlapping nature of the samples results in dependencies
between samples. Nevertheless, it still illustrates that the sampling distribution of the
sample mean is approximately normally distributed, even when the population
distribution of the original observations is not normally distributed.)
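The same effect can be reproduced without the data file. The Python sketch below uses a synthetic, strongly skewed population (exponential) as a stand-in for the reading scores; that choice, the seed, and the helper name skewness are assumptions made for illustration only. It cuts the population into 276 non-overlapping samples of 100 and compares the skewness of the raw values with the skewness of the sample means.

```python
import random
import statistics

def skewness(values):
    """Mean cubed z-score; close to 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

rng = random.Random(2)  # arbitrary seed for repeatability

# Hypothetical skewed population standing in for the reading scores.
population = [rng.expovariate(1.0) for _ in range(27600)]

# Shuffle, then cut into 276 non-overlapping samples of 100 each.
rng.shuffle(population)
samples = [population[i:i + 100] for i in range(0, len(population), 100)]
sample_means = [statistics.fmean(s) for s in samples]

print("skewness of population:  ", round(skewness(population), 2))
print("skewness of sample means:", round(skewness(sample_means), 2))
```

A skewness near zero for the sample means, against a clearly positive value for the population, is the numerical counterpart of the two histograms.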
Compute the mean and standard deviation of the 276 sample means:
(A) Mean of the 276 sample means:____________
(B) Standard deviation of the 276 sample means:____________
The standard deviation of the 276 sample means is called the standard error.
Given that the sampling distribution of the sample means is approximately normally
distributed, we can compute a confidence interval based on normal distribution. That
is, for normal distributions, about 95% of the observations lie between
mean±1.96×standard deviation.
In our case, about 95% of the sample mean values should lie between
____________ and _____________
(work out the two values using (A) and (B) above)
Formula for computing the standard error
In practice, we can use the result derived from statistical theory that the standard error
of the mean is approximately
$$\frac{\sigma}{\sqrt{n}}$$
where $\sigma$ is the population standard deviation, and n is the sample size.
In our case, n is around 100 and $\sigma$ is 5.8, so using this formula, the standard error should
be about 0.58. How does this compare with what you obtained in (B) above?
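The comparison between the formula and the simulated value can also be sketched in Python. The population below is synthetic: normally distributed values with mean 19.0 and standard deviation 5.8 are assumed purely so that the numbers resemble the worksheet's; the real reading scores are not used.

```python
import math
import random
import statistics

rng = random.Random(3)  # arbitrary seed for repeatability

# Hypothetical population with standard deviation close to 5.8.
population = [rng.gauss(19.0, 5.8) for _ in range(27600)]
sigma = statistics.pstdev(population)

n = 100
samples = [population[i:i + n] for i in range(0, len(population), n)]
sample_means = [statistics.fmean(s) for s in samples]

empirical_se = statistics.stdev(sample_means)  # counterpart of (B)
formula_se = sigma / math.sqrt(n)              # sigma / sqrt(n)

print("standard deviation of the sample means:", round(empirical_se, 3))
print("sigma / sqrt(n):                        ", round(formula_se, 3))
```

The two printed values should be close but not identical, because the first is itself estimated from only 276 sample means.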
In real life, we don’t know the value of $\sigma$. But, scanning over the standard deviation of
each sample of around 100 observations, you will find that the sample standard
deviation of 100 observations is a good estimate of the population standard deviation.
In practice, how to compute the confidence interval of the sample mean
(1) draw a sample of size n
(2) compute the sample mean ($\bar{X}$)
(3) compute the sample standard deviation ($\hat{\sigma}$)
(4) compute the 95% confidence interval of the true mean using
$$\bar{X} \pm 1.96 \times \frac{\hat{\sigma}}{\sqrt{n}}$$
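The four steps above can be sketched as a small Python function. The function name mean_confidence_interval and the simulated sample are hypothetical; the sample is generated from assumed values (mean 18.4, standard deviation 5.8) only so that there is something to compute on.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Return (low, high) for the mean, using mean +/- z * sd / sqrt(n)."""
    n = len(sample)                      # step 1: sample of size n
    mean = statistics.fmean(sample)      # step 2: sample mean
    sd = statistics.stdev(sample)        # step 3: sample standard deviation
    half_width = z * sd / math.sqrt(n)   # step 4: z times the standard error
    return mean - half_width, mean + half_width

rng = random.Random(4)                   # arbitrary seed for repeatability
sample = [rng.gauss(18.4, 5.8) for _ in range(100)]  # made-up scores

low, high = mean_confidence_interval(sample)
print("sample mean:", round(statistics.fmean(sample), 2))
print("95% confidence interval:", round(low, 2), "to", round(high, 2))
```

Passing a different z (for example 1.64) gives an interval at a different confidence level.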
In one sentence, describe the meaning of a statement like the following:
The estimated mean of height is 174cm ± 29cm (95% confidence interval)
____________________________________________________________________
____________________________________________________________________
____________________________________________________________________
General process of making inference about a statistic
(1) Establish the sampling distribution of the statistic to assess the variability of the
statistic.
For example, if we are interested in the mean reading score of students in Victoria, we
take a sample and compute the sample mean. Because this sample mean is not the
population mean, there is likely variation in the value of the sample mean if different
samples are drawn. We need to find out how large the variation is. If the variation is
large, then our estimate is probably not very accurate to represent the population
mean. If the variation is small, then our estimate is probably quite close to the
population mean.
(2) We can repeatedly sample to establish the sampling distribution of the statistic of
interest. But this is impractical as it will be too costly.
Making inferences about Mean
We can use the central limit theorem to establish the sampling distribution of the
sample mean, without doing repeated sampling. The central limit theorem says that the
mean of independently drawn observations will be approximately normally
distributed, even if the distribution from which the sample is drawn is not normal.
Further, it can be shown that mean values computed from samples of size n have a
normal distribution with mean $\mu$ and standard deviation $\sigma / \sqrt{n}$ (known as the
standard error), where $\mu$ is the mean of the distribution we draw our samples from,
and $\sigma$ is the standard deviation of the distribution we draw our samples from.
That is, if $\bar{X}$ denotes the sample mean, then
$$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
has a standard normal distribution
with mean zero and standard deviation 1 (z-score). For a standard normal distribution,
95% of the observations lie within ±1.96. With a little re-arrangement of the equation,
it can be shown that 95% of the time, or, we are 95% confident that,
$$\bar{X} - 1.96 \times \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96 \times \frac{\sigma}{\sqrt{n}}$$
(In repeated sampling, about 95% of the intervals computed in this way will contain the population mean.)
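The re-arrangement mentioned above can be written out step by step. This is a sketch of the algebra, writing $\bar{X}$ for the sample mean and using $\mu$, $\sigma$ and n as defined in this section; the probability statement is approximate because it rests on the central limit theorem.

```latex
\begin{align*}
P\!\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) &\approx 0.95
  && \text{(standard normal)} \\
-1.96\,\frac{\sigma}{\sqrt{n}} \le \bar{X} - \mu &\le 1.96\,\frac{\sigma}{\sqrt{n}}
  && \text{(multiply through by } \sigma/\sqrt{n}\text{)} \\
-\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} \le -\mu &\le -\bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}
  && \text{(subtract } \bar{X}\text{)} \\
\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \mu &\le \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}
  && \text{(multiply by } -1\text{, which reverses the inequalities)}
\end{align*}
```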
Hypothesis Testing
Hypothesis testing is about using data to make (statistical) conclusions about a
hypothesis.
For example, suppose I hypothesise that the population mean of students’ reading
scores is 17 out of 30:
$$H_0: \mu = 17$$
I draw a sample, say, of 100 students. The sample mean and standard deviation of my
sample are 18.4 and 5.8 respectively. The 95% confidence interval for the mean is
$$18.4 \pm 1.96 \times \frac{5.8}{\sqrt{100}} = (17.3, 19.5)$$
The 95% confidence interval of (17.3, 19.5) means that, based on our sample, we are
95% confident that the true mean lies between 17.3 and 19.5: intervals computed in this
way contain the true mean for about 95% of samples and miss it for about 5%.
As the hypothesised mean value, 17, is outside this confidence interval, we conclude
that we will reject the null hypothesis at the 95% confidence level. Sometimes this is
also described as rejecting at the level of p=0.05. This means that, if the null hypothesis
were true, there would be only a 5% chance of incorrectly rejecting it.
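The worked example can be checked with a few lines of Python. The function names z_interval and reject_null are hypothetical helpers introduced for this sketch; the numbers (mean 18.4, standard deviation 5.8, n = 100, hypothesised mean 17) are the ones used in the text.

```python
import math

def z_interval(mean, sd, n, z=1.96):
    """Confidence interval for the mean: mean +/- z * sd / sqrt(n)."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

def reject_null(hypothesised_mean, mean, sd, n, z=1.96):
    """Reject when the hypothesised mean falls outside the interval."""
    low, high = z_interval(mean, sd, n, z)
    return not (low <= hypothesised_mean <= high)

low, high = z_interval(18.4, 5.8, 100)
print("95% confidence interval:", round(low, 1), "to", round(high, 1))
print("reject H0 (mean = 17)?", reject_null(17, 18.4, 5.8, 100))
```

The decision rule used here is simply "reject if the hypothesised value lies outside the interval", which is the rule the paragraph above describes.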
More generally, we make inferences from our sample about the likelihood of
population parameters, and we make conclusions about the hypothesis based on our
inference.
Sample size and hypothesis testing
Now, draw a sample of 10 from our reading score data. Test the hypothesis that
$$H_0: \mu = 17$$
What is the confidence interval in this case?
The 95% confidence interval of the mean is between ________________ and _____________.
What is your conclusion about the hypothesis?
Reject or Accept?
Next, draw a sample of 20, and then 50, and then 200. Does the sample size make a
difference to whether you accept or reject the null hypothesis at p=0.05?
Sample of 20: Reject or Accept?
Sample of 50: Reject or Accept?
Sample of 200: Reject or Accept?
What if you use p=0.1 (90% confidence interval; for a normal distribution, 90% of the
sample means lie between $\mu \pm 1.64 \times \sigma / \sqrt{n}$)? Would you change your decision of
rejecting or accepting the hypothesis?
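Make a table below:
Sample size    Reject or Accept $H_0: \mu = 17$    Reject or Accept $H_0: \mu = 17$
               at p=0.05                           at p=0.1
10             ________________                    ________________
50             ________________                    ________________
100            ________________                    ________________
200            ________________                    ________________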
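To see how sample size and the choice of p interact, the Python sketch below fills in a table of this shape under a simplifying assumption: every sample, whatever its size, is taken to have mean 18.4 and standard deviation 5.8 (the values from the worked example). Real samples will not behave this neatly, so your own table may differ.

```python
import math

SAMPLE_MEAN = 18.4   # assumed identical for every sample size (illustration only)
SAMPLE_SD = 5.8      # likewise assumed
NULL_MEAN = 17       # hypothesised mean under H0

def rejects(n, z):
    """True when the hypothesised mean lies outside mean +/- z * sd / sqrt(n)."""
    half_width = z * SAMPLE_SD / math.sqrt(n)
    return abs(SAMPLE_MEAN - NULL_MEAN) > half_width

decisions = {(n, z): rejects(n, z)
             for n in (10, 50, 100, 200)
             for z in (1.96, 1.64)}

for n in (10, 50, 100, 200):
    at_05 = "Reject" if decisions[(n, 1.96)] else "Accept"
    at_10 = "Reject" if decisions[(n, 1.64)] else "Accept"
    print(f"n={n:>3}   p=0.05: {at_05}   p=0.1: {at_10}")
```

Because the half-width shrinks with the square root of n, the same sample mean can lead to "accept" for a small sample and "reject" for a large one.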
Given that we know that the true population mean is 18.98 (which, in real life, we
will not know), what do you think about your conclusions in the above table?
What if the hypothesis is $H_0: \mu = 18$? Could you reject this hypothesis? What sample
size would you use to reject this hypothesis?
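One rough way to approach the sample-size question is to ask how large n must be before the 95% half-width, 1.96 times the standard deviation divided by the square root of n, becomes smaller than the gap between the hypothesised mean and the mean we expect to observe. The Python sketch below does this under two assumptions that will not hold exactly in practice: the sample mean equals the true mean of 18.98, and the sample standard deviation equals 5.8. The function name smallest_rejecting_n is hypothetical.

```python
import math

def smallest_rejecting_n(expected_mean, null_mean, sd, z=1.96):
    """Smallest n for which z * sd / sqrt(n) is below |expected - null|."""
    gap = abs(expected_mean - null_mean)
    # Rejection needs z * sd / sqrt(n) < gap, i.e. n > (z * sd / gap) ** 2.
    return math.floor((z * sd / gap) ** 2) + 1

print("H0: mean = 18 ->", smallest_rejecting_n(18.98, 18, 5.8))
print("H0: mean = 17 ->", smallest_rejecting_n(18.98, 17, 5.8))
```

Since an actual sample mean scatters around the true mean, a sample of this size would reject the hypothesis only some of the time; the figure is a guide, not a guarantee.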
A cartoon in Darrell Huff’s book “How to Lie with Statistics” depicted one person
saying “I want to know the truth”, and another person replying “it ain’t statistics”.
What is your assessment of statistical hypothesis testing in relation to this cartoon?
What DOES statistics tell you?
Some discussion points:
(1) For a population of people, the height distribution is normally distributed with a
mean of 170 cm and a standard deviation of 12 cm. Dave has a height of 196cm.
Could Dave be from this population of people?
(2) In a region, the number of rainy days per year is approximately normally
distributed, with a mean of 85 days and a standard deviation of 15 days (the
distribution was established by collecting 200 years of data). One year, the number of
rainy days was 120. Was this year an ‘abnormal’ year?
If so, and we look at the 200 years of data, what percentage of years do you think will
be ‘abnormal’? But the 200 years of data have been used to establish the ‘norm’, so
how can any particular year be ‘abnormal’?
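As a starting point for both discussion points, the Python sketch below converts each observation to a z-score and looks up how often a normal distribution produces values at least that far from the mean. Treating anything beyond 1.96 standard deviations as ‘abnormal’ is a convention assumed here, not something the questions prescribe, and the helper names are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

def two_sided_tail(z):
    """Share of a normal distribution lying at least |z| from the mean."""
    return 2 * (1 - standard_normal.cdf(abs(z)))

dave_z = z_score(196, 170, 12)   # height example
rain_z = z_score(120, 85, 15)    # rainy-days example

print("Dave: z =", round(dave_z, 2), " tail share =", round(two_sided_tail(dave_z), 3))
print("Rain: z =", round(rain_z, 2), " tail share =", round(two_sided_tail(rain_z), 3))
print("share beyond 1.96 in either direction:", round(two_sided_tail(1.96), 3))
```

The last line is worth keeping in mind for the final question: under the 1.96 convention, roughly one year in twenty is labelled ‘abnormal’ even when the climate has not changed at all.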