Objectives (BPS chapter 11)
Sampling distributions

Parameter versus statistic

The law of large numbers

What is a sampling distribution?

The sampling distribution of x̄

The central limit theorem

Statistical process control
Reminder: Parameter versus statistic

Population: the entire group of individuals in which we are interested but can’t usually assess directly.
A parameter is a number describing a characteristic of the population. Parameters are usually unknown.

Sample: the part of the population we actually examine and for which we do have data.
A statistic is a number describing a characteristic of a sample. We often use a statistic to estimate an unknown population parameter.

[Figure: Population and Sample]
The law of large numbers
Law of large numbers: As the number of randomly drawn observations (n) in a sample increases,

the mean of the sample (x̄) gets closer and closer to the population mean µ (quantitative variable).

the sample proportion (p̂) gets closer and closer to the population proportion p (categorical variable).
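The law of large numbers can be watched in a small simulation. The sketch below is only an illustration under assumed conditions: the population is a hypothetical fair six-sided die (population mean 3.5), and the seed and the helper name running_means are arbitrary choices, not part of the original slides.

```python
import random

def running_means(draws):
    """Return the sample mean after each successive observation."""
    means, total = [], 0.0
    for i, value in enumerate(draws, start=1):
        total += value
        means.append(total / i)
    return means

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility

# Hypothetical population: rolls of a fair die, so the population mean is 3.5.
rolls = [rng.randint(1, 6) for _ in range(100_000)]
means = running_means(rolls)

# The sample mean settles near 3.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, round(means[n - 1], 3))
```

With small n the printed mean can be far from 3.5; with n = 100,000 it should be very close.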
What is a sampling distribution?
The sampling distribution of a statistic is the distribution of all
possible values taken by the statistic when all possible samples of a
fixed size n are taken from the population. It is a theoretical idea—we
do not actually build it.
The sampling distribution of a statistic is the probability distribution of
that statistic.
Note: When sampling randomly from a given population,

the law of large numbers describes what happens when the sample size n
is gradually increased.

The sampling distribution describes what happens when we take all
possible random samples of a fixed size n.
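For a tiny population, the sampling distribution can actually be built by listing every possible sample. The sketch below assumes a made-up population of five values and samples of size n = 2 drawn without replacement; both choices are hypothetical and serve only to make the definition concrete.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

population = [2, 4, 6, 8, 10]  # hypothetical tiny population
n = 2                          # fixed sample size

# Every possible sample of size n, and the statistic (the mean) for each one.
sample_means = [mean(sample) for sample in combinations(population, n)]

# The sampling distribution: each possible value with its probability.
distribution = Counter(sample_means)
total = len(sample_means)
for value in sorted(distribution):
    print(value, distribution[value] / total)

# The center of the sampling distribution matches the population mean.
print(mean(sample_means), mean(population))
```

Real populations are far too large for this enumeration, which is why the sampling distribution is usually treated as a theoretical idea.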
Sampling distribution of x̄ (the sample mean)
We take many random samples of a given size n from a population with mean µ and standard deviation σ.
Some sample means will be above the population mean µ and some will be below, making up the sampling distribution.
[Figure: sampling distribution of x̄, shown as a histogram of some sample averages]
For any population with mean µ and standard deviation σ:
The mean, or center, of the sampling distribution of x̄ is equal to the population mean µ.

The standard deviation of the sampling distribution is σ/√n, where n is the sample size.

[Figure: sampling distribution of x̄, centered at µ with standard deviation σ/√n]

Mean of a sampling distribution of x̄:
There is no tendency for a sample mean to fall systematically above or below µ, even if the distribution of the raw data is skewed. Thus, x̄ is an unbiased estimate of the population mean µ: it will be “correct on average” in many samples.

Standard deviation of a sampling distribution of x̄:
The standard deviation of the sampling distribution measures how much the sample statistic x̄ varies from sample to sample. It is smaller than the standard deviation of the population by a factor of √n. Averages are less variable than individual observations.
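Both facts can be checked by simulation. The sketch below assumes a right-skewed population (an exponential distribution with rate 1, so µ = 1 and σ = 1); the sample size, number of repetitions, and seed are arbitrary illustration values.

```python
import math
import random
from statistics import mean, pstdev

rng = random.Random(1)  # arbitrary seed for reproducibility

mu, sigma = 1.0, 1.0    # exponential(rate=1): mean 1, standard deviation 1
n = 16                  # size of each sample
reps = 20_000           # number of simulated samples

# One sample mean per simulated sample.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(reps)
]

center = mean(sample_means)    # should be close to mu
spread = pstdev(sample_means)  # should be close to sigma / sqrt(n)

print(round(center, 3), mu)
print(round(spread, 3), sigma / math.sqrt(n))
```

Even though the population is skewed, the simulated center should land near µ and the simulated spread near σ/√n = 0.25.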
For normally distributed populations
When a variable in a population is normally distributed, the sampling distribution of x̄ for all possible samples of size n is also normally distributed.
If the population is N(µ, σ), then the sample means distribution is N(µ, σ/√n).
[Figure: distributions of the population and of the sample means]
IQ scores: population vs. sample
In a large population of adults, the mean IQ is 112 with standard deviation 20.
Suppose 200 adults are randomly selected for a market research campaign.

The distribution of the sample mean IQ is
A) exactly normal, mean 112, standard deviation 20.
B) approximately normal, mean 112, standard deviation 20.
C) approximately normal, mean 112, standard deviation 1.414.
D) approximately normal, mean 112, standard deviation 0.1.
Answer: C) approximately normal, mean 112, standard deviation 1.414.
Population distribution: N(µ = 112; σ = 20)
Sampling distribution for n = 200 is N(µ = 112; σ/√n = 1.414)
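The arithmetic behind the answer is a one-line calculation. A minimal check, using the values from the example (the variable names are just illustrative):

```python
import math

sigma = 20  # population standard deviation of IQ
n = 200     # number of adults sampled

# Standard deviation of the sampling distribution of the sample mean.
standard_error = sigma / math.sqrt(n)
print(round(standard_error, 3))  # 1.414
```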
Application
Hypokalemia is diagnosed when blood potassium levels are low, below 3.5 mEq/dl. Let’s assume that we know a patient whose measured potassium levels vary daily according to a normal distribution N(µ = 3.8, σ = 0.2).
If only one measurement is made, what's the probability that this patient will be misdiagnosed hypokalemic?
z = (x − µ)/σ = (3.5 − 3.8)/0.2
z = −1.5, P(z < −1.5) = 0.0668 ≈ 7%
If instead measurements are taken on four separate days and averaged, what is the probability of such a misdiagnosis?
z = (x̄ − µ)/(σ/√n) = (3.5 − 3.8)/(0.2/√4)
z = −3, P(z < −3) = 0.0013 ≈ 0.1%
Note:
Make sure to standardize (z) using the standard deviation for the sampling distribution.
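The two probabilities can be reproduced with Python's standard library. This is a sketch under the example's assumptions (normally distributed, independent daily measurements); the helper name prob_below is hypothetical.

```python
import math
from statistics import NormalDist

mu, sigma, cutoff = 3.8, 0.2, 3.5  # values from the example

def prob_below(cutoff, mu, sigma, n=1):
    """P(mean of n independent N(mu, sigma) measurements < cutoff)."""
    # The mean of n measurements has standard deviation sigma / sqrt(n).
    return NormalDist(mu, sigma / math.sqrt(n)).cdf(cutoff)

p_one = prob_below(cutoff, mu, sigma)        # single measurement
p_four = prob_below(cutoff, mu, sigma, n=4)  # average of four measurements

print(f"one measurement:   {p_one:.4f}")
print(f"four measurements: {p_four:.4f}")
```

The printed values should agree with the 0.0668 and 0.0013 obtained from the z-scores above.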
Practical note


Large samples are not always attainable.

Sometimes the cost, difficulty, or preciousness of what is studied drastically limits any possible sample size.

Blood samples/biopsies: no more than a handful of repetitions are acceptable. Often we even make do with just one.

Opinion polls have a limited sample size due to time and cost of
operation. During election times, though, sample sizes are increased
for better accuracy.
Not all variables are normally distributed.

Income, for example, is typically strongly skewed.

Is x̄ still a good estimator of µ then?
The central limit theorem
Central Limit Theorem: When randomly sampling from any population with mean µ and standard deviation σ, when n is large enough, the sampling distribution of x̄ is approximately normal: N(µ, σ/√n).
[Figure: a population with a strongly skewed distribution, and the sampling distributions of x̄ for n = 2, n = 10, and n = 25 observations]
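The progression toward normality for n = 2, 10, and 25 can be imitated numerically. The sketch below assumes a strongly right-skewed population (exponential with rate 1, an arbitrary stand-in for the one in the figure) and uses the skewness of the simulated sample means as a rough indicator of how far from normal the sampling distribution still is; a normal distribution has skewness 0.

```python
import random

def skewness(values):
    """Moment-based skewness: third central moment over sd cubed."""
    count = len(values)
    center = sum(values) / count
    m2 = sum((v - center) ** 2 for v in values) / count
    m3 = sum((v - center) ** 3 for v in values) / count
    return m3 / m2 ** 1.5

rng = random.Random(2)  # arbitrary seed for reproducibility
reps = 10_000           # simulated samples per sample size

skew_by_n = {}
for n in (2, 10, 25):
    # Sample means of n draws from the skewed (exponential) population.
    means = [
        sum(rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(reps)
    ]
    skew_by_n[n] = skewness(means)
    print(n, round(skew_by_n[n], 2))
```

The skewness should shrink as n increases, which is the numerical counterpart of the histograms becoming more symmetric and bell-shaped.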
Income distribution
Let’s consider the very large database of individual incomes from the Bureau of
Labor Statistics as our population. It is strongly right-skewed.

We take 1000 SRSs of 100 incomes, calculate the sample mean for
each, and make a histogram of these 1000 means.

We also take 1000 SRSs of 25 incomes, calculate the sample mean for
each, and make a histogram of these 1000 means.
Which histogram corresponds to the samples of size 100? 25?
How large a sample size?
It depends on the population distribution. More observations are
required if the population distribution is far from normal.

A sample size of 25 is generally enough to obtain an approximately normal sampling distribution despite strong skewness or even mild outliers in the population.

A sample size of 40 will typically be good enough to overcome extreme skewness and outliers.
In many cases, n = 25 isn’t a huge sample. Thus, even for strange population distributions we can assume an approximately normal sampling distribution of the mean, and work with it to solve problems.
Statistical process control
Industrial processes tend to have normally distributed variability, in part
as a consequence of the central limit theorem applying to the sum of
many small influential factors. Random samples taken over time can
thus be used to easily verify that a given process is not getting out of
“control.”
What is statistical control?
A variable that continues to be described by the same distribution when
observed over time is said to be in statistical control, or simply in
control.
Process-monitoring
What are the required conditions?
We measure a quantitative variable x that has a normal distribution.
The process has been operating in control for a long period, so that we
know the process mean µ and the process standard deviation σ that
describe the distribution of x as long as the process remains in control.
An x̄ control chart displays the average of samples of size n taken at regular intervals from such a process. It is a way to monitor the process and alert us when it has been disturbed so that it is now out of control. This is a signal to find and correct the cause of the disturbance.
x̄ control charts
For a process with known mean µ and standard deviation σ, we calculate the mean x̄ of samples of constant size n taken at regular intervals.

Plot x̄ (vertical axis) against time (horizontal axis).

Draw a horizontal center line at µ.

Draw two horizontal control limits at µ ± 3σ/√n (UCL and LCL).

A machine tool cuts circular pieces. A sample of four pieces is
taken hourly, giving these average measurements (in 0.0001
inches from the specified diameter).
Because measurements are made from the specified diameter,
we have a given target µ = 0 for the process mean. The process
standard deviation σ = 0.31. What is going on?
[Figure: x̄ control chart for the sample means]
For the x̄ chart, the center line is 0 and the control limits are ±3σ/√4 = ±0.465.
Sample    x̄
1         −0.14
2         0.09
3         0.17
4         0.08
5         −0.17
6         0.36
7         0.30
8         0.19
9         0.48
10        0.29
11        0.48
12        0.55
13        0.50
14        0.37
15        0.69
16        0.47
17        0.56
18        0.78
19        0.75
20        0.49
21        0.79
The process mean has drifted. Maybe the cutting blade is getting dull, or a
screw got a bit loose.
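The control limits and the out-of-control samples can be checked directly from the data. The sketch below uses the 21 sample means, µ = 0, σ = 0.31, and n = 4 from the example; the variable names are illustrative, and flagging only points beyond the 3σ/√n limits is a simplification of how charts are read in practice.

```python
import math

mu, sigma, n = 0.0, 0.31, 4  # target mean, process sd, sample size

# Hourly sample means, in 0.0001 inches from the specified diameter.
xbars = [-0.14, 0.09, 0.17, 0.08, -0.17, 0.36, 0.30, 0.19, 0.48, 0.29, 0.48,
         0.55, 0.50, 0.37, 0.69, 0.47, 0.56, 0.78, 0.75, 0.49, 0.79]

# Control limits at mu +/- 3 * sigma / sqrt(n).
margin = 3 * sigma / math.sqrt(n)
ucl, lcl = mu + margin, mu - margin

# Sample numbers (1-based) whose mean falls outside the control limits.
out_of_control = [
    i for i, x in enumerate(xbars, start=1) if x > ucl or x < lcl
]

print(round(lcl, 3), round(ucl, 3))
print(out_of_control)
```

Sample 9 is the first to cross the upper control limit, and from sample 15 on every point is above it, which is consistent with an upward drift in the process mean.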