Sampling
Marc H. Mehlman
[email protected]
University of New Haven
Marc Mehlman
Marc Mehlman (University of New Haven)
Sampling
1 / 20
Table of Contents
1. Sampling Distributions
2. Central Limit Theorem
3. Binomial Distribution
Sampling Distributions
Parameters and Statistics
As we begin to use sample data to draw conclusions about a wider
population, we must be clear about whether a number describes a
sample or a population.
A parameter is a number that describes some characteristic of the
population. In statistical practice, the value of a parameter is not
known because we cannot examine the entire population.
A statistic is a number that describes some characteristic of a
sample. The value of a statistic can be computed directly from the
sample data. We often use a statistic to estimate an unknown
parameter.
Remember s and p: statistics come from samples and
parameters come from populations.
We write µ (the Greek letter mu) for the population mean and σ for the
population standard deviation. We write x̄ (x-bar) for the sample mean and s
for the sample standard deviation.
Statistical Estimation
The process of statistical inference involves using information from a
sample to draw conclusions about a wider population.
Different random samples yield different statistics. We need to be able to
describe the sampling distribution of possible statistic values in order to
perform statistical inference.
We can think of a statistic as a random variable because it takes numerical
values that describe the outcomes of the random sampling process.
[Diagram: collect data from a representative sample, then make an inference about the population.]
Sampling Variability
Different random samples yield different statistics. This basic fact is
called sampling variability: the value of a statistic varies in repeated
random sampling.
To make sense of sampling variability, we ask, “What would happen if we
took many samples?”
[Diagram: one population yielding many different samples.]
Sampling Distributions
The law of large numbers assures us that if we measure enough
subjects, the statistic x-bar will eventually get very close to the unknown
parameter µ.
If we took every one of the possible samples of a certain size, calculated
the sample mean for each, and graphed all of those values, we’d have a
sampling distribution.
The population distribution of a variable is the distribution of
values of the variable among all individuals in the population.
The sampling distribution of a statistic is the distribution of
values taken by the statistic in all possible samples of the same
size from the same population.
Mean and Standard Deviation of a
Sample Mean
Mean of a sampling distribution of a sample mean
There is no tendency for a sample mean to fall systematically above or
below µ, even if the distribution of the raw data is skewed. Thus, the
mean of the sampling distribution is an unbiased estimate of the
population mean µ.
Standard deviation of a sampling distribution of a sample mean
The standard deviation of the sampling distribution measures how much
the sample statistic varies from sample to sample. It is smaller than the
standard deviation of the population by a factor of √n.
 Averages are less variable than individual observations.
The Sampling Distribution of a
Sample Mean
When we choose many SRSs from a population, the sampling distribution
of the sample mean is centered at the population mean µ and is less
spread out than the population distribution. Here are the facts.
The Sampling Distribution of Sample Means
Suppose that x̄ is the mean of an SRS of size n drawn from a large population
with mean µ and standard deviation σ. Then:
The mean of the sampling distribution of x̄ is µ_x̄ = µ.
The standard deviation of the sampling distribution of x̄ is σ_x̄ = σ/√n.
Note: These facts about the mean and standard deviation of x̄ are true
no matter what shape the population distribution has.
If individual observations have the N(µ, σ) distribution, then the sample mean
of an SRS of size n has the N(µ, σ/√n) distribution regardless of the sample
size n.
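A quick simulation sketch can confirm both facts empirically (Python is used here for illustration; the population parameters µ = 10, σ = 4 and sample size n = 25 are arbitrary choices, not from the slides):

```python
import random
import statistics

# Illustrative (hypothetical) Normal population: mu = 10, sigma = 4.
mu, sigma, n = 10, 4, 25
random.seed(1)

# Draw many SRSs of size n and record each sample mean x-bar.
xbars = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(20000)]

# The mean of the x-bars should land near mu, and their standard
# deviation near sigma / sqrt(n) = 4 / 5 = 0.8.
print(statistics.fmean(xbars))
print(statistics.stdev(xbars))
```

The simulated mean of the x̄ values comes out near µ and their standard deviation near σ/√n, as the facts above predict.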
Central Limit Theorem
“I know of scarcely anything so apt to impress the imagination as the wonderful
form of cosmic order expressed by the ‘law of frequency of error’ [the normal
distribution]. The law would have been personified by the Greeks and deified, if
they had known of it. It reigns with serenity and in complete self-effacement amidst
the wildest confusion. The huger the mob, and the greater the anarchy, the more
perfect is its sway. It is the supreme law of Unreason.” – Francis Galton
On the previous slide, the sampling distribution of X̄ is depicted as:
1. having mean µ, i.e., being unbiased,
2. having standard deviation σ/√n,
3. being normally distributed.
The first two depictions are always true, regardless of sample size or population distribution.
The Central Limit Theorem (below) says the third depiction is approximately true, regardless of
population distribution, for large sample sizes n.
As Francis Galton said, the averaged effects of random acts from a large mob form a familiar
pattern.
Theorem (Central Limit Theorem, CLT)
Consider a random sample of size n from a population with mean µ and standard deviation σ.
For large n, the sampling distribution of X̄ is approximately N(µ, σ/√n).
Example
Based on service records from the past year, the time (in hours) that
a technician requires to complete preventative maintenance on an air
conditioner follows a distribution that is strongly right-skewed, with
most likely outcomes close to 0. The mean time is µ = 1
hour and the standard deviation is σ = 1.
Your company will service an SRS of 70 air conditioners. You have budgeted 1.1
hours per unit. Will this be enough?
The central limit theorem states that the sampling distribution of the mean time spent
working on the 70 units is approximately N(1, 0.12), because n = 70 ≥ 30:
µ_x̄ = µ = 1
σ_x̄ = σ/√n = 1/√70 ≈ 0.12
z = (1.1 − 1)/0.12 ≈ 0.83
P(x̄ > 1.1) = P(Z > 0.83) = 1 − 0.7967 = 0.2033
If you budget 1.1 hours per unit, there is about a 20%
chance the technicians will not complete the
work within the budgeted time.
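The slide rounds σ/√70 to 0.12 before computing z; redoing the calculation without rounding (sketched here in Python, though the later slides use R for similar checks) gives a slightly smaller probability, still about 20%:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 1, 1, 70
se = sigma / sqrt(n)                 # standard error: about 0.1195, not 0.12
p = 1 - NormalDist(mu, se).cdf(1.1)  # P(x-bar > 1.1)
print(se, p)                         # probability comes out near 0.20
```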
A Few More Facts
Any linear combination of independent Normal
random variables is also Normally distributed.
More generally, the central limit theorem notes
that the distribution of a sum or average of
many small random quantities is close to
Normal.
Finally, the central limit theorem also applies
to discrete random variables.
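A small simulation sketch of the first fact (Python, with arbitrary illustrative parameters): if X ∼ N(2, 3) and Y ∼ N(5, 4) are independent, then X + Y should be Normal with mean 2 + 5 = 7 and standard deviation √(9 + 16) = 5, since variances of independent variables add.

```python
import random
import statistics

random.seed(2)
# X ~ N(2, 3) and Y ~ N(5, 4), independent; then X + Y ~ N(7, 5).
sums = [random.gauss(2, 3) + random.gauss(5, 4) for _ in range(50000)]
print(statistics.fmean(sums))  # near 7
print(statistics.stdev(sums))  # near sqrt(9 + 16) = 5
```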
Binomial Distribution
Definition (Bernoulli Distribution, X ∼ BIN(1, p))
Model: X = # of heads after one toss of a coin whose probability of heads is p.
Definition (Binomial Distribution, X ∼ BIN(n, p))
Model: X = # of heads after n tosses of a coin whose probability of heads on
each toss is p.
Theorem
If X ∼ BIN(n, p) and j is an integer with 0 ≤ j ≤ n, then
P(X = j) = (n choose j) p^j (1 − p)^(n−j).
Furthermore,
µ_X = np, σ_X² = np(1 − p) and σ_X = √(np(1 − p)).
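These formulas can be checked numerically with a short script (Python's math.comb is used here; the values n = 10, p = 0.4 are arbitrary illustration choices):

```python
from math import comb, sqrt

def binom_pmf(j, n, p):
    """P(X = j) for X ~ BIN(n, p)."""
    return comb(n, j) * p**j * (1 - p)**(n - j)

n, p = 10, 0.4                      # illustrative values
pmf = [binom_pmf(j, n, p) for j in range(n + 1)]
mean = sum(j * q for j, q in enumerate(pmf))
var = sum((j - mean)**2 * q for j, q in enumerate(pmf))

print(sum(pmf))    # probabilities sum to 1
print(mean)        # matches n*p = 4
print(sqrt(var))   # matches sqrt(n*p*(1-p)) = sqrt(2.4)
```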
Let Y1, Y2, · · · , Yn be a random sample from BIN(1, p). Then:
1. X := Y1 + Y2 + · · · + Yn ∼ BIN(n, p).
2. p̂ := Ȳ = (# of heads)/(# of tosses) is an unbiased estimator of p.
3. For large n, the distribution of p̂ = Ȳ is approximately
N(p, √(p(1 − p)/n)) by the Central Limit Theorem.
Since X = nȲ, one has:
Theorem (Normal Approximation for Binomial Distribution)
For large n, X ∼ BIN(n, p) is approximately distributed as
N(np, √(np(1 − p))).
For how large an n is the above approximation good?
Convention
When np ≥ 10 and n(1 − p) ≥ 10.
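A simulation sketch of facts 2 and 3 about p̂ (Python, with illustrative values n = 400 and p = 0.3, which satisfy the convention since np = 120 and n(1 − p) = 280):

```python
import random
import statistics
from math import sqrt

random.seed(3)
n, p = 400, 0.3
# Each replication: p-hat, the proportion of heads in n Bernoulli(p) tosses.
phats = [sum(random.random() < p for _ in range(n)) / n
         for _ in range(5000)]

print(statistics.fmean(phats))  # near p = 0.3, i.e., p-hat is unbiased
print(statistics.stdev(phats))  # near sqrt(p*(1-p)/n), about 0.0229
```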
When dealing with discrete random variables such as the binomial, a “continuity
correction” can greatly improve accuracy.
For instance, consider the following example:
Example (Exact)
Joe always runs red lights. The probability of an accident for each red light run is 0.3. Of
the last 100 red lights run, what is the probability that there were 25 or fewer accidents?
Solution: Let X ∼ BIN(100, 0.3) be the number of accidents. The exact answer is
P(X ≤ 25) = Σ_{j=0}^{25} P(X = j) = Σ_{j=0}^{25} (100 choose j) (0.3)^j (0.7)^(100−j) = 0.1631
(obtained with Mathematica). Or using R,
> pbinom(25,100,0.3)
[1] 0.1631301
The exact answer can't easily be obtained without a computer.
Example (Normal approximation without continuity correction)
Joe always runs red lights. The probability of an accident for each red light run is 0.3. Of the
last 100 red lights run, what is the probability, approximately, that there were 25 or fewer
accidents?
Solution: Let X ∼ BIN(100, 0.3). Since 100(0.3) ≥ 10 and 100(1 − 0.3) ≥ 10, X has
approximately the same distribution as Y ∼ N(30, √(100(0.3)(1 − 0.3))) = N(30, 4.582576).
Thus
P[X ≤ 25] ≈ P[Y ≤ 25]
= P[(Y − 30)/4.582576 ≤ (25 − 30)/4.582576]
= P[Z ≤ −1.091089]
= 0.1379, using the Table.
Instead of using a table, one can get more accuracy using R for the normal approximation
without continuity correction:
> pnorm(25,30,sqrt(100*0.3*(1-0.3)))
[1] 0.1376168
The approximation is unsatisfactory.
Continuity Correction
Let X ∼ BIN(n, p) and let j, k be integers with 0 ≤ j ≤ k ≤ n. Then
it is common practice to use the following approximation when np ≥ 10
and n(1 − p) ≥ 10:
P[j ≤ X ≤ k] ≈ P[j − 0.5 ≤ Y ≤ k + 0.5],
where Y ∼ N(np, √(np(1 − p))).
Example (Normal approximation with continuity correction)
Joe always runs red lights. The probability of an accident for each red light run is 0.3. Of the
last 100 red lights run, what is the probability, approximately, that there were 25 or fewer
accidents?
Solution: Since 100(0.3) ≥ 10 and 100(0.7) ≥ 10, the above convention says, letting
Y ∼ N(30, √(100(0.3)(1 − 0.3))) = N(30, 4.582576),
P(X ≤ 25) ≈ P(Y ≤ 25.5)
= P[(Y − 30)/4.582576 ≤ (25.5 − 30)/4.582576]
= P(Z ≤ −0.9819805)
≈ 0.1635, using the Table.
Instead of using a table, one can get more accuracy using R for the normal approximation with
continuity correction:
> pnorm(25.5,30,sqrt(100*0.3*(1-0.3)))
[1] 0.1630547
This approximation is much better than the normal approximation without continuity
correction.