Objectives 3.3 Toward statistical inference
Population versus sample (CIS, Chapter 6)
Toward statistical inference
Sampling variability
Further reading:
http://onlinestatbook.com/2/estimation/characteristics.html (some of the
concepts introduced in this link are beyond this class)
Adapted from authors’ slides © 2012 W.H. Freeman and Company

The inconvenient truth
So far we have assumed the mean of a population is known.
In reality the population is unknown, so its mean is unknown.
Inference means estimating the unknown population mean based on a
very small sample from the population.
We illustrate what is meant by this in the following examples.
See also the recent journal article from Poultry Science.

Towards statistical inference
In a survey of 2000 randomly sampled college students, 62% of the
sample reported that they had encountered some type of harassment.
Parents are worried: What is the truth about the millions of students
who are currently at college?
Because the sample was taken at random it seems quite reasonable
to suppose this sample is representative of the population of college
students. This suggests that about 62% of all college students may
have encountered some type of harassment. 62% is in fact an
estimate of the total proportion who may have encountered
harassment. What is the exact proportion?
This is the start of statistical inference, where we draw conclusions
about the entire population based on a sample.
62% is not the exact value: it will vary from sample to sample, and
our objective in the next few lectures is to understand this variability.
This will help us to understand the reliability of the estimate.

Refresher: Definitions
Population: The entire group of individuals in which we are interested
but cannot assess or observe directly.
Examples: All college students, all calves etc.
Often the population is described by a mathematical model.
Sample: The part of the population we actually examine and for which we
do have data.
How well the sample represents the population depends on the sampling
method, as well as on the sample size.
A parameter is a number describing a characteristic of the population.
A statistic is a number describing a characteristic of a sample.

Example: M&M data
To illustrate what we mean by a population and sample, let us return
to the M&M example.
Let us suppose that the 170 M&M bags represent the population of
M&Ms (in reality we do not observe the population, so this is just
an example for illustration). The population mean for the number of
M&Ms is 13.54.
A random sample of size 5 is taken. There are 170^5 different random
samples that can be taken!
Note: Examples of random samples are given in homework 1.
On the next two slides we show how to sample from the distribution.
Top plot: The distribution for the number of M&Ms in a bag (over 170
bags).
Middle plot: One sample of size 5.
Lower plot: The average of that sample (sample mean).
Observe how the sample mean is different for the two samples.
[Plots: Sample 1 and Sample 2]

Sampling variability
As illustrated by the previous example, for every sample taken from a
population, we are likely to get a different set of individuals and
calculate a different value for our statistic (such as the sample mean).
This is called sampling variability.
This might suggest that the sample and the statistic contain no
information about the population. However...
The good news is that, if we imagine taking lots of random samples of
the same size from a given population, the variation from sample to
sample (the sampling distribution) will follow a predictable pattern.
All of statistical inference is based on this: to see how trustworthy a
statistic is, we ask what would happen if we kept repeating the sampling
many times.
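The thought experiment of repeating the sampling many times can be tried on a computer. The sketch below is only an illustration: the population is a made-up list of 170 counts (it is NOT the M&M data from class), and the names population, sample_mean and many_means are chosen just for this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population of 170 bag counts (NOT the course's M&M data).
population = [random.randint(5, 21) for _ in range(170)]
population_mean = statistics.mean(population)


def sample_mean(pop, n):
    """Mean of one random sample of size n, drawn with replacement."""
    return statistics.mean(random.choices(pop, k=n))


# Two samples of size 5 will generally give two different sample means.
first = sample_mean(population, 5)
second = sample_mean(population, 5)

# Repeating the sampling many times shows the pattern in the variation.
many_means = [sample_mean(population, 5) for _ in range(10_000)]

print("population mean:", round(population_mean, 2))
print("two sample means:", first, second)
print("average of the sample means:", round(statistics.mean(many_means), 2))
print("spread of the sample means:", round(statistics.stdev(many_means), 2))
```

Individual sample means jump around, but their average should sit close to the population mean, and their spread should be smaller than the spread of the population itself.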
We measure the quality of a statistic (such as the sample mean) with:
Accuracy (bias): Random samples provide accurate estimates of
a parameter because they are unbiased (or close to unbiased,
depending on the random sampling method).
This is done by sampling in a good way (i.e. randomly sampling over
the population of interest).
Using a well constructed statistic.
Typically we will assume an estimator is unbiased.
When reading an article, identify the population of interest and
potential biases which may arise.
Reliability (variability): A reliable estimation method is one that
would give similar results if the random sampling were repeated.
The less variable a statistic, the more reliable it is.
Random sampling enables us to measure the variability of a statistic.
We do this with the standard error, which is defined on the next slide.
Important: The larger the sample size, the less variable the
corresponding estimator will be.
To understand the above concepts look at the question at the end of this page:
http://onlinestatbook.com/2/estimation/characteristics.html

Measuring Variability
We have come across variability before. Recall in Chapter 3 we
used the standard deviation to measure the variability in the sample.
We recall that the sample standard deviation measures the typical
deviation of each observation from the sample mean:
$$ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2} $$
The same criterion is used to measure the variability in the sample mean
(and all other estimators). This is called the standard error.
More precisely, we measure the average spread of the estimator
around the true mean.
Looking back at the M&M example, it would appear that we have to
calculate 170^5 sample means!
This is impossible.
Remarkably we can find a very nice expression for the standard error
which requires very little effort!
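As a small check of the formula for the sample standard deviation, here is a sketch that computes it directly with the n - 1 divisor and compares it with Python's built-in statistics.stdev. The data values and the function name sample_sd are made up for illustration.

```python
import math
import statistics


def sample_sd(xs):
    """Sample standard deviation: square root of the sum of squared
    deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))


data = [12, 15, 9, 18, 14]  # hypothetical observations, for illustration only

s = sample_sd(data)
print(s)
print(statistics.stdev(data))  # the standard library uses the same n - 1 formula
```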

Population size does not matter
There are about 15 million students in higher education. In the
harassment survey about 2000 people were randomly surveyed.
This means that the sexual harassment survey interviewed about
one in every 7500 students. 62% is an estimate of the true population
proportion.
Question: Would the estimate of the proportion be better if the
population size were smaller? For example, 1.5 million students
rather than 15 million students.
Answer: No. Only the size of the sample, in this case n=2000, has
an influence on its reliability, not the size of the population.
Statistical inference is not based on how close the sample size is to the
population size (usually we assume that the population is infinite). It is
based on the idea that a simple random sample gives a representative
sample of the entire population.
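The claim that population size does not matter can be checked with a rough simulation. The sketch below assumes a made-up population in which exactly 62% of individuals would answer "yes", and draws simple random samples of 2000 from populations of 1.5 million and 15 million. The function name simulated_proportions and the number of repeats are choices made for this illustration only.

```python
import random
import statistics

random.seed(1)


def simulated_proportions(population_size, n=2000, true_p=0.62, repeats=300):
    """Draw `repeats` simple random samples (without replacement) of size n
    and return the sample proportion of "yes" answers from each one."""
    cutoff = int(true_p * population_size)  # individuals below cutoff say "yes"
    props = []
    for _ in range(repeats):
        sample = random.sample(range(population_size), n)
        props.append(sum(i < cutoff for i in sample) / n)
    return props


spread_small = statistics.stdev(simulated_proportions(1_500_000))
spread_large = statistics.stdev(simulated_proportions(15_000_000))

print("spread of estimates, 1.5 million students:", round(spread_small, 4))
print("spread of estimates, 15 million students: ", round(spread_large, 4))
```

The two spreads should come out about the same (roughly one percentage point), because the variability is driven by n = 2000 and not by the size of the population.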

Summary and what’s to come
The techniques of statistics allow us to draw inferences or conclusions
about a population using the data from a sample.
Your estimate of the population parameter is only as good as your
sampling design. → Work hard to eliminate biases (design your
experiment well).
Your sample statistic is only an estimate, and if you randomly sampled
again you would probably get a somewhat different result (more on this
next).
In the next section we will show:
The distribution of the estimates (for much of the course it will be the
sample mean) will, if the sample size is large enough, be approximately
normally distributed, even if the observations are not normal.
The standard error (reliability) has a simple formula!

Objectives 5.1
Sampling distribution of a sample mean (CIS, Chapter 8)
The mean and standard deviation of x̄
For normally distributed populations
The central limit theorem (CIS, Chapter 8 and p103)
Additional reading:
http://onlinestatbook.com/2/sampling_distributions/samp_dist_mean.html
Adapted from authors’ slides © 2012 W.H. Freeman and Company

Simulation tools used
To demonstrate the concepts here I will be using an applet in
StatCrunch called Sampling Distributions. It is highly recommended that you
try this out yourself.
Applets -> Sampling Distributions.
Select the distribution (from uniform etc) or choose the data table (your own
data). Press Compute.
Choose your sample size (this is how large a sample you use).
1000 times etc. has NOTHING to do with sample size. It is the number of
samples you draw (this part is the thought experiment). You should make this as
large as possible (I usually set it to 100,000).
Press the + sign next to Sampling means to get the QQplot of the distribution
of the sample mean. Do not press the + sign next to Samples, as this will give
you the QQplot of the sample.
Conceptually, what we will be doing is rather sophisticated and it will take
time to precisely understand the ideas behind inference.
This is NOT plug and chug.
Note that you can customize the (parent) distribution from which you sample
by simply left clicking over the parent distribution and moving the cursor
to give the distribution the shape you want.

M&M example
Look first at the distribution of the total number of M&Ms in a bag. We
will treat this as our ‘population’.
Just comparing the histogram with the normal curve we can see that
it is not normal. There are two reasons for this:
a) The mix of different types of M&Ms (milk chocolate, peanut
and peanut butter) will induce multimodality in the distribution.
b) The number of M&Ms is a numerical discrete random variable.
In the following examples we will be drawing M&M bags (numbers) from this
distribution. It is analogous to putting all 170 counts in a bag and drawing
them out (with replacement). We see that we are most likely to draw the
number 18 and least likely to draw 14 (within the range 5-21).

Distribution of average: sample 5
Let us now look at the distribution of the sample mean of all samples of
size 5. That is, we randomly sample 5 values from the population and
take the sample mean.

QQplot of average: sample 5
Let us now look at the QQplot of the sample mean of all samples of size
5 (corresponding to the histogram on the previous page).
Observations:
1. The histogram of the sample mean is more bell-shaped than the original
distribution. However, it is certainly not normal (the spikes we see are
due to taking the average of only 5 numbers, which is not continuous enough).
2. There is less spread in the distribution of the averages than in the
original histogram.
3. The QQplot shows a large deviation from normality in the tails.

Distribution of average: sample 10
Let us now look at the distribution of the sample mean of all samples of
size 10. That is, we randomly sample 10 values from the population and
take the sample mean.

QQplot of average: sample 10
Let us now look at the QQplot of the sample mean of all samples of size
10 (corresponding to the histogram on the previous page).
Observations:
1. The histogram of the sample mean is a lot more bell-shaped than the
original distribution. The spikes that were seen for sample size 5 have gone
(the bumps you see on the histogram are due to the bin width).
2. There is even less spread in the distribution of the averages than in the
original histogram.
3. The QQplot shows only a small deviation from normality in the top tail of
the distribution.

Distribution of average: sample 20
Let us now look at the distribution of the sample mean of all samples of
size 20. That is, we randomly sample 20 values from the population and
take the sample mean.

QQplot of average: sample 20
Let us now look at the QQplot of the sample mean of all samples of size
20 (corresponding to the histogram on the previous page).
Observations:
1. The histogram of the sample mean is pretty much normal.
2. There is even less spread in the distribution of the averages than in the
original histogram.
3. The QQplot shows only a very tiny deviation from normality in the tails of
the distribution.

Distribution of average: sample 40
Let us now look at the distribution of the sample mean of all samples of
size 40. That is, we randomly sample 40 values from the population and
take the sample mean.

QQplot of average: sample 40
Let us now look at the QQplot of the sample mean of all samples of size
40 (corresponding to the histogram on the previous page).
Observations:
1. The histogram of the sample mean is almost normal.
2. There is even less spread in the distribution of the averages than in the
original histogram.
3. The QQplot is very close to the x=y line.

Summary: Sampling distribution of M&Ms
Summary of averages of M&Ms

Sample size   mean    standard error        comment
original      13.54   4.64   = 4.64/√1      Not normal
5             13.54   2.07   = 4.64/√5      More unimodal
10            13.54   1.466  = 4.64/√10     Getting normal
20            13.54   1.037  = 4.64/√20     Mostly there
40            13.54   0.7357 = 4.64/√40     Pretty much normal

This example illustrates three major insights:
1. The distributions of the sample means are centered about the true
mean 13.54. This tells us that the sample mean is not biased.
2. We see that the spread in the sample means decreases as the
sample size used to evaluate them increases. The spread/reliability/
variability is measured using the standard error, which has the formula
σ/√n (in this case σ=4.64 and n=5,10,20 or 40).
3. The distribution of the sample mean becomes more normal (look at
the QQplots) as the sample size grows.
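The first two insights can be checked with a simulation similar to what the applet does. The sketch below uses a made-up, two-humped discrete population (it is NOT the M&M data, so the numbers will differ from the table), draws many samples of each size, and compares the spread of the sample means with σ/√n. The names population and results are chosen just for this illustration.

```python
import math
import random
import statistics

random.seed(2)

# Hypothetical discrete, non-normal population (NOT the course's M&M data).
population = [random.choice([5, 6, 7, 8, 17, 18, 19, 20, 21]) for _ in range(170)]
mu = statistics.mean(population)
sigma = statistics.pstdev(population)  # population standard deviation

results = {}
for n in (5, 10, 20, 40):
    # 5000 samples of size n, drawn with replacement; keep each sample mean.
    means = [sum(random.choices(population, k=n)) / n for _ in range(5_000)]
    results[n] = (
        statistics.mean(means),   # centre of the sample means
        statistics.stdev(means),  # simulated standard error
        sigma / math.sqrt(n),     # standard error from the formula
    )

print("population mean:", round(mu, 3), " population sd:", round(sigma, 3))
for n, (centre, simulated_se, formula_se) in results.items():
    print(n, round(centre, 3), round(simulated_se, 3), round(formula_se, 3))
```

For every sample size the centre should stay near the population mean, while the simulated standard error should shrink in line with σ/√n.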

Properties: Sample mean for normally distributed data
When a variable in a population is normally distributed, the sampling
distribution of x̄ for all possible samples of size n is also normally
distributed.
If the population is Normal(µ, σ) then the sample mean’s
distribution is Normal(µ, σ/√n).
Note that the sample average has less variability than any
individual observation.
[Figure: population distribution and sampling distribution]

Properties: Sample mean of non-normal distributed data
Central Limit Theorem: When randomly sampling from any population
with mean µ and standard deviation σ, if n is large enough then the
sampling distribution of x̄ is approximately normal: x̄ ~ N(µ, σ/√n).
[Figure: a population with a strongly skewed distribution, and the sampling
distributions of x̄ for n = 2, n = 10 and n = 25 observations]

Calculation Practice
In 2010 the combined SAT scores had mean 1016 and standard deviation 212. They
also had an approximately normal distribution.
Population distribution is Normal(µ = 1016; σ = 212).
In Chapter 4, we used the normal distribution to show that the probability of a
randomly selected student scoring 1100 or higher is 34.5%.
Now, suppose 50 students are randomly selected and their SAT scores averaged.
What is the probability that the average is greater than 1100?
Sampling distribution of the sample average when n = 50 is
Normal(µ = 1016; σ/√n = 212/√50 = 29.98).
Using these values, the z-score for 1100 is
$$ z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{1100 - 1016}{212/\sqrt{50}} = \frac{84}{29.98} = 2.80. $$
In Table A, the area to the right of 2.80 is 0.0025. So there is only a 0.25%
chance that the average of 50 randomly sampled students is more than 1100. In
this example we do not use the CLT because the original data is assumed normal.
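The SAT calculation can be reproduced without Table A using NormalDist from Python's standard statistics module. This is just a sketch of the arithmetic in the example; the variable names are chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 1016, 212, 50

one_student = NormalDist(mu, sigma)              # a single SAT score
average_of_50 = NormalDist(mu, sigma / sqrt(n))  # the mean of 50 scores

z = (1100 - mu) / (sigma / sqrt(n))
p_single = 1 - one_student.cdf(1100)
p_average = 1 - average_of_50.cdf(1100)

print("z-score for the average:", round(z, 2))
print("P(one student scores above 1100):", round(p_single, 3))
print("P(average of 50 is above 1100):  ", round(p_average, 4))
```

The only difference between the two probabilities is the standard deviation used: σ for one student, σ/√n for the average.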

Calculation Practice
Hypokalemia is diagnosed when blood potassium levels are below 3.5mEq/dl.
Let’s assume that we know a patient whose measured potassium levels vary
daily according to the Normal(µ = 3.8, σ = 0.2) distribution.
If only one measurement is made, what is the probability that this patient will be
misdiagnosed with Hypokalemia?
$$ z = \frac{x - \mu}{\sigma} = \frac{3.5 - 3.8}{0.2} = -1.5, \qquad P(z < -1.5) = 0.0668 \approx 7\%. $$
Instead, if measurements are taken on 4 separate days, what is the
probability of a misdiagnosis (in this case the sample mean based on 4
measurements is below 3.5)?
$$ z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{3.5 - 3.8}{0.2/\sqrt{4}} = -3, \qquad P(z < -3) = 0.0013 \approx 0.1\%. $$
Note: If the problem is about the sample mean, make sure to standardize (get z) using
the standard error for the sample mean.
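To see how the chance of a misdiagnosis falls as more measurements are averaged, the two calculations above can be wrapped in one small function. This is a sketch only; the function name prob_mean_below is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def prob_mean_below(threshold, mu, sigma, n):
    """P(sample mean of n measurements < threshold) when each measurement
    is Normal(mu, sigma); the mean then has standard error sigma / sqrt(n)."""
    return NormalDist(mu, sigma / sqrt(n)).cdf(threshold)


p_one = prob_mean_below(3.5, mu=3.8, sigma=0.2, n=1)
p_four = prob_mean_below(3.5, mu=3.8, sigma=0.2, n=4)

print("one measurement: ", round(p_one, 4))
print("mean of 4 days:  ", round(p_four, 4))
```

Setting n = 1 gives back the single-measurement case, so the same function covers both parts of the question.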

Calculation Practice: using the CLT
In Chapter 4 we discussed ACT scores. We argued that because the
grades were numerical discrete over a small range, the grade
distribution could not be normally distributed. This means we cannot
use the normal distribution to calculate probabilities for one randomly
selected person. BUT if the sample size is large enough we can use
the normal distribution to calculate probabilities for averages. We
recall the mean ACT score is 20 with standard deviation 5.
Question: 50 students are randomly selected. Calculate the probability
their average (sample mean) score will be greater than 18.
Answer: The sample mean has the same mean as the
original distribution, which we know is 20. The standard error of the sample
mean is s.e. = 5/√50 = 0.707. We use this to make the z-transform
$$ z = \frac{18 - 20}{0.707} = -2.828 $$
Looking up the z-tables using a computer we see that the probability
is about 99.8%. This means there is a very large chance that the sample
mean is greater than 18.
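To get a feel for how well the normal approximation works for averages of discrete scores, one can compare it with a brute-force simulation. The sketch below uses a made-up score distribution (whole numbers from 10 to 30, all equally likely); this is NOT the real ACT distribution, so its standard deviation is not 5 and the answer differs from the one above.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, pstdev

random.seed(3)

# Hypothetical discrete scores (integers 10..30); NOT the real ACT distribution.
scores = list(range(10, 31))
mu, sigma = mean(scores), pstdev(scores)
n, cutoff = 50, 18

# Normal (CLT) approximation for P(sample mean > cutoff).
approx = 1 - NormalDist(mu, sigma / sqrt(n)).cdf(cutoff)

# Brute-force check: simulate many samples of 50 scores and count how often
# the sample mean exceeds the cutoff.
repeats = 20_000
hits = sum(sum(random.choices(scores, k=n)) / n > cutoff for _ in range(repeats))
simulated = hits / repeats

print("normal approximation:", round(approx, 3))
print("simulated proportion:", round(simulated, 3))
```

Even though no single score is normally distributed, the two numbers should agree closely for averages of 50 scores.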

Calculation Practice
Let us return to the weights of calves at 0.5 weeks. The distribution
is shown below.
[Figure: distribution of calf weights at 0.5 weeks]
Looking at the plot, it seems that a normal density (with mean 90.11 and
standard deviation 7.7) is a rough approximation of the underlying distribution
of calf weights (see also the QQplot given at the end of Chapter 4).
Question (a): Using the normal density, calculate the approximate
probability that a calf weighs more than 100 pounds.
Answer: Make a z-transform=(100-90.11)/7.7 =1.28. Looking this up in the
z-tables we have 90%. Therefore the approximate probability that a calf
weighs more than 100 pounds is 10%.
Question (b): Let us suppose that the sample mean of 10 calves is taken.
Using the normal approximation of the sample mean, what is the probability
that the sample mean will be greater than 100 pounds?
Answer: The mean of the sample mean is the same as the mean weight of
calves, which is 90.11. The standard deviation of the sample mean is
7.7/√10 = 2.4. By making the z-transform we have
z=(100-90.11)/2.4 = 4.12. Looking 4.12 up in the z-tables, we see that
it is in the far upper tail, thus the probability is close to 0%.
The sizes of the probabilities calculated in (a) and (b) are compared in the
plots above.
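Both calf probabilities can be computed directly, as a check on the table look-ups. This is a sketch of the arithmetic only. Note that the code does not round the standard error to 2.4, so its z-score for part (b) is about 4.06 rather than 4.12; either way the probability is far out in the upper tail.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 90.11, 7.7

# (a) one calf: use the population standard deviation
z_a = (100 - mu) / sigma
p_one_calf = 1 - NormalDist(mu, sigma).cdf(100)

# (b) mean of 10 calves: use the standard error sigma / sqrt(10)
z_b = (100 - mu) / (sigma / sqrt(10))
p_mean_of_10 = 1 - NormalDist(mu, sigma / sqrt(10)).cdf(100)

print("(a) z =", round(z_a, 2), " probability =", round(p_one_calf, 3))
print("(b) z =", round(z_b, 2), " probability =", p_mean_of_10)
```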
Of the two probabilities calculated above, which is likely to be closest
to the true probability?
Both probabilities were calculated using the normal distribution. But this is
only an approximation of the true distribution of calf weights and of the
sample mean of calf weights. From the histogram two pages back, it appears
that the density for the underlying weights of calves is only very
approximately normal. Thus it is unlikely that the probability calculated for
the weight of one calf is that accurate.
On the other hand, the Central Limit Theorem tells us that the distribution of
the sample mean gets closer to normal as the sample size grows. The
second probability we calculated was based on the average weight of 10
calves. The distribution of the average is likely to be more normal than the
distribution of individual calf weights. Thus the second probability, based
on the average, is likely to be more accurate (closer to the true probability).

Calculation practice
A farmer wants to use a vehicle to carry 30 calves that are 0.5 weeks old. The
vehicle he plans to use can carry a maximum load of 2760 pounds.
He knows that the mean weight of a calf is 90.11 pounds and the
standard deviation is 7.7. What is the chance the vehicle can carry
the calves?
We need to turn the total weight into the sample mean. We observe that
requiring the total weight of 30 calves to be less than 2760 pounds is
the same as requiring the sample mean weight of 30 calves to be less than
2760/30 = 92:
$$ \sum_{i=1}^{30} X_i < 2760 \;\Rightarrow\; \bar{X} = \frac{1}{30}\sum_{i=1}^{30} X_i < \frac{2760}{30} $$
Therefore, we have turned the problem from totals into averages and
can apply the CLT to calculate the probability using the normal distribution.

Calculation practice (cont)
We know from the central limit theorem that the sample mean is close to
normally distributed. Thus the distribution of the sample mean is
approximately normal with mean 90.11 and standard deviation 7.7/√30 = 1.4.
We know that for the vehicle to carry the calves, the sample mean has to
be less than 92 pounds.
Calculate the z-transform z=(92-90.11)/1.4 = 1.35 and look up the z-tables
to get 91.1%.
Conclusion: There is a 91.1% chance the vehicle can carry the 30 calves
(each 0.5 weeks old). In mathematical symbols:
$$ P\left(\sum_{i=1}^{30} X_i < 2760\right) = P\left(\bar{X} = \frac{1}{30}\sum_{i=1}^{30} X_i < \frac{2760}{30}\right) \approx P(Z < 1.35) = 0.911 $$
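The step from totals to averages can be checked numerically. The sketch below computes the probability in two ways: through the sample mean, as in the slides, and directly through the total, using the fact that if the mean is approximately Normal(µ, σ/√n) then the total is approximately Normal(nµ, σ√n). The two routes should give the same answer. Variable names are chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n, max_load = 90.11, 7.7, 30, 2760

# Route 1: the sample mean must be below max_load / n = 92 pounds.
p_via_mean = NormalDist(mu, sigma / sqrt(n)).cdf(max_load / n)

# Route 2: the total weight must be below max_load.
p_via_total = NormalDist(n * mu, sigma * sqrt(n)).cdf(max_load)

print("via the sample mean:", round(p_via_mean, 3))
print("via the total:      ", round(p_via_total, 3))
```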

How large is a large enough sample size?
It depends on the population distribution. More observations are
required if the population distribution has a large standard deviation or
if it is far from normal in distribution.
A sample size of 25 is generally enough to obtain a normal sampling
distribution from a population with some skewness or even mild outliers.
A sample size of 40 will typically be good enough to overcome some
skewness and outliers.
More importantly, n should be large enough to make the standard error
sufficiently small. Then we can get meaningful and precise inferences.
We can check this by using the Sampling Distributions applet.
In many cases, even n = 40 is not large enough to
give results reliable enough when there is a lot at
stake. This is why clinical trials, political polls and
marketing surveys typically observe hundreds or even
thousands of individuals.

The effect of skewness on the CLT
Below we look at the sample mean taken from data with a large right skew.

The corresponding QQplot of the sample mean
Observations:
1. We see that the standard error is 0.756 ≈ 4.7/√40, which is as it should be.
2. However, the QQplot deviates far from normality in the tails. The
distribution of the sample mean still has a slight right skew (look back at the
QQplots in Chapter 4). This demonstrates that when data is highly skewed, we
need a much larger sample size for the CLT to kick in.
3. Calculations based on normality of the average will not be completely
correct.

Effect of binary data on the CLT
Binary data arises in several situations, wherever there are two possible
outcomes: for example, Male or Female, Like or Dislike.
In this example, we have encoded one outcome with zero and the other
with 1 (it does not really matter which way). We see that the proportion in
the one category is about 20%; this is what is meant by the mean. This data is
discrete and clearly skewed.

The corresponding QQplot of the sample mean
Observations:
1. We see that the standard error is 0.0571 ≈ 0.405/√50, which is as it should be.
2. However, the QQplot deviates far from normality in the tails. The lines
across demonstrate that the average over 50 still takes discrete values
(though not integers). We also see a U shape that shows that the sample mean
is still skewed.
3. Calculations based on normality of the average will not be completely
correct.

Example: Income distribution
Let’s consider the very large database of individual incomes from the Bureau of
Labor Statistics as our population. Income is strongly right skewed.
We take 1000 SRSs of 25 incomes, calculate the sample mean for
each, and make a histogram of these 1000 means.
We also take 1000 SRSs of 100 incomes, calculate the sample mean
for each, and make a histogram of these 1000 means.
Which histogram corresponds to samples of size 100? Which to
samples of size 25?

So many standard deviations!
In statistics we talk about different kinds of standard deviations, and it
can be hard to keep track of them:
s is the standard deviation of a set (sample) of data. It is a statistic
we can compute once we have the data.
σ is the standard deviation of a population (which is much too big to
observe completely). It is a parameter; usually, we will never know
its true value.
σ/√n is the standard deviation of the values of x̄ from all possible
random samples of size n. It refers to the sample mean, not to data.
It is also called the standard error of x̄.
s/√n is our estimate of σ/√n, since we do not know the value of σ.
From a survey of students taking statistics, n = 459 responded to the
question “How many Facebook friends do you have?” The sample mean
was x̄ = 566.9 and the sample standard deviation was s = 589.5. The
standard error for the sample mean is s/√n = 589.5/√459 = 27.52.
x̄ is an estimate for µ = mean of the population of all students required to
take the class and s is an estimate for the population standard deviation σ.
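The estimated standard error in the Facebook example is a one-line calculation. The sketch below simply repeats the arithmetic with the numbers quoted above; the variable names are chosen for illustration.

```python
from math import sqrt

n = 459      # number of students who responded
s = 589.5    # sample standard deviation of the number of friends

# Estimated standard error of the sample mean: s / sqrt(n)
standard_error = s / sqrt(n)

print(round(standard_error, 2))
```

Notice how different the two numbers are: s describes how much individual students differ from each other, while s/√n describes how much the sample mean would vary from one sample of 459 students to another.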

Summary
x̄ is always unbiased for µ, even if the population’s distribution is
very different from a normal distribution.
The standard deviation of x̄, σ/√n, measures the variability due to
random sampling.
If the population is approximately normal or if the sample size n is
large, we can use the normal distribution to compute probabilities for
x̄. We just have to remember to use σ/√n, not σ, in the
denominator when calculating z.
This means we can say something about how close x̄ is likely to be
to µ. Generally it is quite likely (95% chance) that it will be within 2
standard errors of µ.
Not all variables are normally distributed and large samples are not
always attainable. In such circumstances, a statistician should be
consulted for proper methods of statistical inference and calculation.
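The "within 2 standard errors" rule of thumb in the summary can be checked against the standard normal distribution. This sketch assumes the sample mean is exactly normal, which is only approximately true in practice.

```python
from statistics import NormalDist

standard_normal = NormalDist(0, 1)

# Probability that a normal quantity falls within 2 standard errors of its mean.
p_within_2 = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(round(p_within_2, 4))
```

The result is a little above 95%, which is why "2 standard errors" is a convenient rule of thumb.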

Accompanying problems associated with this Chapter
Quiz 5
Quiz 6
Homework 2, Q6.
Homework 3.