Download F13_Lect7_chpt1314stvers

Chapter 13
Sampling distributions
Objectives (PSLS Chapter 13 & 14)
Sampling distributions

Parameter versus statistic (Awards 27-31)

The law of large numbers (Law of Large Number Award, 28)

Sampling distributions (Sampling Distribution Award, 28)

Sampling distribution of the sample mean x̅

The central limit theorem (Central Limits Theorem Award, 29)
(Samp. Distribution Award)
Parameter versus statistic


Population: the entire group
of individuals in which we are
interested but usually can’t
assess directly.
A parameter is a number
summarizing the population.
Parameters are usually
unknown.

Sample: the part of the
population we actually examine
and for which we do have data.

A statistic is a number
summarizing a sample. We
often use a statistic to estimate
an unknown population
parameter.
The law of large numbers
Law of large numbers: As the number of randomly drawn observations
(n) in a sample increases,
the mean of the sample (x̅) gets
closer and closer to the population
mean μ (quantitative variable).
the sample proportion (p̂) gets
closer and closer to the population
proportion p (categorical variable).
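
The law of large numbers can be illustrated with a small simulation. The sketch below is not part of the lecture: the population (Normal, mean 86, standard deviation 8) and the sample sizes are illustrative assumptions, and the seed is fixed only so the run is repeatable.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

MU, SIGMA = 86.0, 8.0  # assumed (illustrative) population mean and sd

def sample_mean(n):
    """Mean of n independent random draws from the N(MU, SIGMA) population."""
    return sum(random.gauss(MU, SIGMA) for _ in range(n)) / n

# As n grows, the sample mean tends to land closer to MU.
for n in (10, 100, 1_000, 100_000):
    xbar = sample_mean(n)
    print(f"n = {n:>7,}   sample mean = {xbar:.3f}   |error| = {abs(xbar - MU):.3f}")
```

Any single run can wobble at small n; the tendency toward μ only becomes reliable as n gets large.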
Sampling distributions
Different random samples taken from the same population will give
different statistics. But there is a predictable pattern in the long run.
→ A statistic computed from a random sample is a random variable.
The sampling distribution of a statistic is the
probability distribution of that statistic for samples
of a given size n taken from a given population.
Note: When sampling randomly from a given population:

The law of large numbers describes what to expect if we took
samples of increasing size n.

A sampling distribution describes what would happen if we took
all possible random samples of a fixed size n.
Both are conceptual ideas with many important practical applications.
We rely on their well known mathematical properties, but we don’t
build actual sampling distributions when analyzing data.
Sampling distribution of x̅ (the sample mean)

The mean of the sampling distribution of x̅ is μ.
There is no tendency for a sample average to fall systematically
above or below μ, even if the population distribution is skewed.
→ x̅ is an unbiased estimate of the population mean μ,
assuming the samples are randomly chosen.

The standard deviation of the sampling distribution of x̅
is σ/√n, where σ is the population standard deviation.
The standard deviation of the sampling distribution measures how
much the sample statistic x̅ varies from sample to sample.
→ Averages are less variable than individual observations.
For Normally distributed populations
When a variable in a population is Normally distributed, the sampling
distribution of the sample mean x̅ is also Normally distributed.
Population: N(μ, σ)
↓
Sampling distribution of sample means: N(μ, σ/√n)
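
Although we do not build sampling distributions when analyzing real data, a simulation can check the two properties stated above. This is a minimal sketch under assumed values (Normal population with μ = 50 and σ = 10, samples of size 16, 20,000 repetitions); none of these numbers come from the lecture.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

MU, SIGMA = 50.0, 10.0   # assumed (illustrative) population parameters
N = 16                   # fixed sample size
REPS = 20_000            # number of simulated random samples

def draw_sample_mean(n):
    """Mean of one random sample of size n from N(MU, SIGMA)."""
    return statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n))

# Approximate the sampling distribution of the mean by repeated sampling.
means = [draw_sample_mean(N) for _ in range(REPS)]

center = statistics.fmean(means)
spread = statistics.stdev(means)
print(f"mean of sample means: {center:.2f}  (theory: {MU})")
print(f"sd of sample means:   {spread:.2f}  (theory: {SIGMA / math.sqrt(N)})")
```

The simulated center should sit near μ and the simulated spread near σ/√n = 2.5, noticeably smaller than the population standard deviation of 10.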
Deer mice (Peromyscus maniculatus) have a body length (excluding the tail)
known to vary Normally, with a mean body length µ = 86 mm, and standard
deviation σ = 8 mm.
For random samples of 20 deer mice, the distribution of the sample mean body
length is approximately,
A) Normal, mean 86, standard deviation 8 mm.
B) Normal, mean 86, standard deviation 20 mm.
C) Normal, mean 86, standard deviation 1.8 mm.
D) Normal, mean 86, standard deviation 3.9 mm.
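
The arithmetic behind this question is a single application of σ/√n. A short check, using only the numbers given in the question:

```python
import math

sigma = 8.0   # population standard deviation in mm, from the question
n = 20        # sample size, from the question

# Standard deviation of the sampling distribution of the sample mean
sd_of_mean = sigma / math.sqrt(n)
print(f"sd of the sampling distribution = {sd_of_mean:.2f} mm")  # 1.79 mm
```

The result rounds to 1.8 mm, which corresponds to option C.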
Standardizing a Normal sampling distribution (z)
When the sampling distribution is Normal, we can standardize the value
of a sample mean x̅ to obtain a z-score. This z-score can then be used
to find areas under the sampling distribution from Table B.
z = (x̅ − μ) / (σ/√n)
x̅ follows N(µ, σ/√n); z follows N(0, 1).
Here, we work with the sampling distribution,
and σ/√n is its standard deviation (indicative of spread).
Remember that σ is the standard deviation of the original population.
Hypokalemia is diagnosed when blood potassium levels are low, below
3.5 mEq/dl. Let’s assume that we know a patient whose measured
potassium levels vary daily, independently, according to N(μ = 3.8, σ = 0.2).
If only one measurement is made, what's the probability that this patient
will be diagnosed hypokalemic? Would this be a misdiagnosis?
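
One way to work this out is to standardize the cutoff and read the area below it. The sketch uses Python's standard-library `NormalDist` in place of Table B; with a single measurement, n = 1 and σ/√n is just σ.

```python
from statistics import NormalDist

mu, sigma = 3.8, 0.2   # patient's potassium distribution, from the text
threshold = 3.5        # diagnostic cutoff, from the text

# With one measurement (n = 1) the sampling distribution is N(mu, sigma).
z = (threshold - mu) / sigma
p_low = NormalDist(mu, sigma).cdf(threshold)   # area below the cutoff
print(f"z = {z:.2f}, P(X < {threshold}) = {p_low:.4f}")
```

This gives z = −1.5 and a probability of about 0.067. Because the patient's mean level (3.8) is above the cutoff, a diagnosis based on such a low reading would be a misdiagnosis under these assumptions.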
The central limit theorem
Central limit theorem: When randomly sampling from any population
with mean μ and standard deviation σ, if n is large enough, the
sampling distribution of x̅ is approximately Normal: N(μ, σ/√n).

The larger the sample size n, the better the approximation of Normality.

This is very useful in inference: Many statistical tests assume Normality for
the sampling distribution. The central limit theorem tells us that, if the
sample size is large enough, we can safely make this assumption even if
the raw data appear non-Normal.
How large a sample size?
It depends on the population distribution. More observations are
required if the population distribution is far from Normal.

A sample size of 25 or more is generally enough to obtain a Normal
sampling distribution from a skewed population, even with mild outliers in
the sample.

A sample size of 40 or more will typically be good enough to overcome an
extremely skewed population and mild (but not extreme) outliers in the
sample.
In many cases, n = 25 isn’t a huge sample. Thus,
even for strange population distributions we can
assume a Normal sampling distribution of the
sample mean, and work with it to solve problems!
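
The effect of sample size can be explored by simulation. The sketch below assumes an exponential population (strongly right-skewed) purely for illustration; the population, the repetition count, and the use of skewness as a rough measure of non-Normality are choices made here, not part of the lecture.

```python
import random
import statistics

random.seed(3)  # fixed seed so the run is repeatable

def skewness(values):
    """Average cubed z-score; 0 for a perfectly symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

def sample_means(n, reps=20_000):
    """Means of `reps` random samples of size n from an exponential population."""
    return [statistics.fmean(random.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

# Skewness of the simulated sampling distribution for several sample sizes
skew_by_n = {n: skewness(sample_means(n)) for n in (1, 2, 10, 25)}
for n, sk in skew_by_n.items():
    print(f"n = {n:>2}: skewness of sample means = {sk:.2f}")
```

The skewness shrinks toward 0 as n grows, which is the central limit theorem at work; for a population this skewed, some asymmetry is still visible at n = 25.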
[Figure: population with strongly skewed distribution, followed by the sampling distribution of x̅ for n = 2, n = 10, and n = 25 observations]
How do we know if the population is Normal or not?

Sometimes we are told that a variable has an approximately Normal
distribution (e.g. large studies on human height or bone density).

Most of the time, we just don’t know. All we have is sample data.
→ We can summarize the data with a histogram and describe its shape and
estimate the likely magnitude of error. We can run simulations to quantify it.
→ If the sample is random, the shape of the histogram should be similar to the
shape of the population distribution.
→ The central limit theorem can help guess whether the sampling distribution
should look roughly Normal or not.
(a) Angle of big toe deformations in 38 patients:
• Symmetrical, one small outlier
• Population likely close to Normal
• Sampling distribution ~ Normal
[Histogram: frequency (number of subjects) versus HAV angle]
(b) Histogram of number of fruit per day for 74 adolescent girls:
• Skewed, no outlier
• Population likely skewed
• Sampling distribution ~ Normal given large sample size
Sample of 28 acorns: Atlantic acorn sizes (in cm³)
[Histogram: frequency versus acorn size]
Describe the distribution of the sample.
What can you assume about the population distribution?
What would be the shape of the sampling distribution:

For samples of size 1?

For samples of size 5?

For samples of size 15?

For samples of size 50?
Objectives (PSLS Chapter 14)
Estimation

Uncertainty and confidence (Margin of Error/CIs Award, 30)

Confidence intervals (Margin of Error/CIs Award, 30)
Uncertainty and confidence
If you picked different samples from a population, you would probably
get different sample means (x̅) and virtually none of them would
actually equal the true population mean, μ.
Use of sampling distributions
If the population is N(μ, σ), the sampling distribution is N(μ, σ/√n).
If not, the sampling distribution is ~N(μ, σ/√n) if n is large enough.
[Figure: distribution of the population (individual subjects, standard deviation σ) and the narrower sampling distribution of sample means x̅ (n subjects, standard deviation σ/√n), both centered at μ]
→ We can take just one random sample of size n, and rely on the
known properties of sampling distributions to estimate the
sampling distribution.
When we take a random sample, we
can compute the sample mean x̅ and an
interval of size plus-or-minus 2σ/√n
about the mean.
Based on the ~68-95-99.7% rule, we
can expect that:
~95% of all intervals computed with this
method capture the parameter μ.
[Figure: intervals computed from many random samples. Blue dot: mean value of a given random sample. Red arrow: interval of size plus or minus 1.96*σ/√n.]
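
The "~95% of all intervals" claim can be checked by simulation. This is a sketch under assumed values (Normal population with μ = 100 and σ = 15, samples of size 25, 10,000 repetitions), none of which come from the lecture; it uses the multiplier 1.96 shown in the figure.

```python
import math
import random
import statistics

random.seed(4)  # fixed seed so the run is repeatable

MU, SIGMA = 100.0, 15.0   # assumed (illustrative) population parameters
N, REPS = 25, 10_000      # sample size and number of simulated samples
half_width = 1.96 * SIGMA / math.sqrt(N)

hits = 0
for _ in range(REPS):
    xbar = statistics.fmean(random.gauss(MU, SIGMA) for _ in range(N))
    # Does the interval xbar +/- half_width capture the true mean?
    if xbar - half_width <= MU <= xbar + half_width:
        hits += 1

coverage = hits / REPS
print(f"{coverage:.1%} of the intervals captured mu")
```

The printed fraction should be close to 95%. Each individual interval either contains μ or it does not; the 95% describes how often the method succeeds over many samples.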
Confidence intervals
A confidence interval is a range of values with an associated
probability, or confidence level, C. This probability quantifies the
chance that the interval contains the unknown population parameter.
μ falls within the interval with probability (confidence level) C.
The margin of error, m
A confidence interval (“CI”) can be expressed as:

a center ± a margin of error: μ within x̅ ± m

an interval: μ within (x̅ − m) to (x̅ + m)
The confidence level C
(in %) represents an area
of corresponding size C
under the sampling
distribution.
[Figure: sampling distribution with area C spanning a margin of error m on either side of the center]
The weight of single eggs varies Normally with standard deviation 5 g.
Think of a carton of 12 eggs as an SRS of size 12.

What is the distribution of the sample means x̅?


You buy one carton of 12 eggs. The average egg weight is x̅ = 64.2g. What
can you infer about the mean µ of this population with roughly 95% confidence?
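
A worked version of this exercise, as a sketch: it uses the numbers given above and takes z* from Python's standard-library `NormalDist` rather than from a table. The variable names are chosen here for illustration.

```python
import math
from statistics import NormalDist

sigma, n, xbar = 5.0, 12, 64.2   # values from the exercise (grams)

se = sigma / math.sqrt(n)              # sd of the sampling distribution of the mean
z_star = NormalDist().inv_cdf(0.975)   # about 1.96 for 95% confidence
m = z_star * se                        # margin of error
low, high = xbar - m, xbar + m
print(f"sd of sample mean = {se:.2f} g; 95% CI: {low:.1f} g to {high:.1f} g")
```

The sample means follow N(µ, 5/√12), roughly N(µ, 1.44 g), and the interval comes out at about 61.4 g to 67.0 g.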
CI for a Normal population mean (σ known)
When taking a random sample from a Normal population with known
standard deviation σ, a level C confidence interval for µ is:
x̅ ± z* σ/√n
or
x̅ ± m
σ/√n is the standard deviation of the sampling distribution.
C is the area under the N(0,1) curve between −z* and z*.
[Figure: N(0,1) curve with area C between −z* and z*; example shown: 80% confidence level C]
How do we find z* values?
We can use a table of z and t values
(Table C). For a given confidence
level C, the appropriate z* value is
listed in the same column.
(…)
For 95% confidence level,
z* = 1.96 (almost 2)
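
As an alternative to the table lookup, z* can be computed from the inverse Normal cumulative distribution. A minimal sketch using the standard library (the helper name `z_star` is chosen here for illustration):

```python
from statistics import NormalDist

def z_star(confidence):
    """z* such that the area between -z* and z* under N(0,1) equals `confidence`."""
    # The central area C leaves (1 - C)/2 in each tail,
    # so z* is the (1 + C)/2 quantile of N(0,1).
    return NormalDist().inv_cdf((1 + confidence) / 2)

for c in (0.80, 0.90, 0.95, 0.99):
    print(f"C = {c:.0%}: z* = {z_star(c):.3f}")
```

This reproduces the familiar values 1.282, 1.645, 1.960, and 2.576.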
Link between confidence level and margin of error
The confidence level C determines the value of z* (in Table C).
The margin of error also depends on z*.
m = z* σ/√n
Higher confidence C implies a larger
margin of error m (less precision, more
accuracy).

A lower confidence level C produces a
smaller margin of error m (more
precision, less accuracy).
→ win/lose situation
[Figure: area C under the curve between −z* and z*, with margin of error m on either side of the center]

Density of bacteria in solution
Measurements from the equipment follow a Normal distribution with standard
deviation σ = 1 million bacteria/ml of fluid.
3 measurements: 24, 29, and 31 million bacteria/ml.
Mean: x̅ = 28 million bacteria/ml. Find the 99% and 90% CI.
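
One way to carry out this calculation, sketched with the standard library. The data and σ come from the exercise; the helper name `confidence_interval` is chosen here for illustration.

```python
import math
from statistics import NormalDist, fmean

measurements = [24, 29, 31]   # million bacteria/ml, from the exercise
sigma = 1.0                   # known sd of the measurement equipment

xbar = fmean(measurements)
se = sigma / math.sqrt(len(measurements))   # sd of the sampling distribution

def confidence_interval(confidence):
    """Level-C interval xbar +/- z* sigma/sqrt(n) for a known sigma."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    m = z * se   # margin of error
    return xbar - m, xbar + m

for c in (0.99, 0.90):
    low, high = confidence_interval(c)
    print(f"{c:.0%} CI: {low:.2f} to {high:.2f} million bacteria/ml")
```

The 99% interval (about 26.51 to 29.49) is wider than the 90% interval (about 27.05 to 28.95), which illustrates the trade-off between confidence level and margin of error.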