Download SAMPLING AND SAMPLING SAMPLING AND SAMPLING

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
SAMPLES AND POPULATIONS
y It is usually not practical to study an entire population
SAMPLING AND SAMPLING
DISTRIBUTION
¾ in a random sample each member of the population has an
equal chance of being chosen
¾ a representative sample might have the same proportion of
men and women (and several other characteristics) as does
the population.
¾ a convenience sample (haphazard sample) is a non
non-random
random
subset of the population chosen because they happen to be
available (e.g., all members of this class would be a
convenience sample of AAU students).
students)
y convenience samples may be biased because each member of the
population does not have an equal chance of being chosen.
2
SAMPLES AND POPULATIONS
y Population
SAMPLES AND POPULATIONS
y Sample
¾ Relatively
R l ti l smallll
y Sample
¾ All Americans
¾ A subset of
¾ All pets in Europe
¾ A subset of pets
¾ All car accidents in
¾ Today
Today’s
s car accidents
Prague
¾ All depressed people
¾ The depressed people in
number of cases
(observations) that
are studied
t di d to
t make
k
inferences about a
larger group from
which
hi h th
they were
drawn
y Population
p
¾ The larger group (all
cases) from which a
sample is drawn
Prague.
Sample
Population
4
3
WHY SAMPLE THE POPULATION?
z
z
z
z
The physical impossibility of checking all items in the
population.
l ti
The cost of studying all the items in a population is
often prohibited.
The adequacy of sample results.
To contact the whole population would often be timeconsuming.
DEFINITION: STATISTICS AND PARAMETERS
y a parameter is a characteristic of a population
¾ e.g., the average blood pressure of all Czechs.
y a statistic is a characteristic of a sample (function of a r.v.)
¾ e.g.,
e g the average blood pressure of a sample of Czechs.
Czechs
y We use statistics of samples to estimate parameters of
populations.
Statistic o estimates o Parameter
X o estimates o P
P “mew”
P
mew
s o estimates o V
5
6
s2 o estimates o V 2
r o estimates o U
V “sigma”
U “rho”
COMMON SENSE -
COMMON SENSE: SAMPLES AND POPULATIONS
y A random sample should represent the population
y Randomness in sampling is good
well so sample statistics from a random sample
well,
should provide reasonable estimates of population
parameters
y All sample
l statistics
t ti ti h
have some error iin estimating
ti ti
population parameters
y If repeated
p
samples
p
are taken from a p
population
p
and
the same statistic (e.g. mean) is calculated from each
sample, the statistics will vary, that is, they will have a
distribution
y A larger sample provides more information than a
smaller sample so a statistic from a large sample
should have less error than a statistic from a small
sample
y Statistics (= functions of r.v.) have error
y Statistics have distributions
y Larger
L
sample
l size
i ((n)) iis b
better
tt - less
l
error
y The probability distribution of a statistic is called its
sampling distribution
7
8
SAMPLING DISTRIBUTIONS
SAMPLING DISTRIBUTIONS
OF THE MEAN
A sampling distribution is a distribution of all of the possible
values of a statistic for a given size sample selected from a
population
Sampling
Distributions
Sampling
Distributions
Sampling
Distributions
of the
Mean
ea
Sampling
Distributions
of the
Mean
Sampling
Distributions
of the
Proportion
opo t o
9
10
DEVELOPING A SAMPLING DISTRIBUTION
SAMPLING DISTRIBUTION OF MEANS
y A sampling distribution is a frequency distribution
(equivalently, a probability distribution) of a sample
(equivalently
statistic.
y The distribution of sample means is usually referred
t as the
to
th sampling
li di
distributions
t ib ti
off means or, the
th
sampling distribution of the mean
y Definition
Assume there is a population, size N=4. Random variable, X,
is age of individuals,
individuals taking values X: 18
18, 20
20, 22
22, 24 (years)
Summary Measures for the Population Distribution:
¾ A sampling distribution of means is a probability
¦X
P(x)
i
N
.3
18 20 22 24
4
distribution. It is the relative frequency distribution of
means obtained from an unlimited series of sampling
p g
experiments, each consisting of a sample of size n
randomly selected from the population
11
Sampling
Distributions
of the
Proportion
12
¦ (X )
i
N
21
.2
.1
x
0
2
2.236
18
20
22
24
A
B
C
D
Uniform Distribution
(continued)
(continued)
DEVELOPING A SAMPLING DISTRIBUTION
DEVELOPING A SAMPLING DISTRIBUTION
NOW CONSIDER ALL POSSIBLE SAMPLES OF SIZE N=2
1st
Obs
2nd Observation
18
20
22
24
SAMPLING DISTRIBUTION OF ALL SAMPLE MEANS
16 Sample
Means
18 18,18 18,20 18,22 18,24
Sample Means
Distribution
16 Sample Means
20 20,18 20,20 20,22 20,24
1st 2nd Observation
Obs 18 20 22 24
1st 2nd Observation
Obs 18
8 20
0 22 24
22 22,18 22,20 22,22 22,24
18 18 19 20 21
18 18 19 20 21
24 24,18
24 18 24,20
24 20 24,22
24 22 24,24
24 24
20 19 20 21 22
20 19 20 21 22
22 20 21 22 23
22 20 21 22 23
16 possible samples
(sampling with
replacement)
24 21 22 23 24
13
_
P(X)
.3
.2
.1
0
24 21 22 23 24
18 19
14
20 21 22 23
24
_
X
(no longer uniform)
(continued)
DEVELOPING A SAMPLING DISTRIBUTION
COMPARING THE POPULATION WITH ITS
SAMPLING DISTRIBUTION
SUMMARY MEASURES OF THIS SAMPLING DISTRIBUTION:
X
¦X
Population
N=4
18 19 21 24
16
i
N
¦(X i
X
X
21
N
1.58
15
P(X)
3
.3
.2
.2
.1
.1
0
CHARACTERISTICS OF THE
SAMPLING DISTRIBUTION OF MEANS
1.58
18
20
22
24
A
B
C
D
X
0
18 19
20 21 22 23
24
¾ The shape of the distribution of means will be
approximately normal if either
y The sample
p size 30 or g
greater,, of
y The underlying distribution of the population of
individuals is normal
¾ This means that the sampling distribution of means will
be normal even if the distribution being sampled is not,
provided sample size is large enough
¾ This
Thi is
i referred
f
d to
t as the
th CENTRAL LIMIT THEOREM
y Variance (of the sampling distribution of means)
2
2
X
V
n
y Standard deviation of the sampling distribution of
means ((usuallyy called the standard error of the mean,
or SEM)
17
X
y Shape of the sampling distribution of means
P
VX
21
CHARACTERISTICS OF THE
SAMPLING DISTRIBUTION OF MEANS
y Mean (of the sampling distribution of means)
V
X
2.236
P(X)
.3
3
16
PX
_
)2
(18 - 21)2 (19 - 21)2 (24 - 21)2
16
21
Sample Means Distribution
n=2
V
n
y Sampling distributions of means tend toward a normal shape as
the sample size increases, regardless of the shape of the
population distribution from which the samples have been
randomly selected.
18
_
X
SAMPLING DISTRIBUTION OF MEANS
IF THE POPULATION IS NORMAL
y If a population is normal with mean and standard
deviation , the sampling distribution of X is also
normally distributed with
X
and
n
X
(This assumes that sampling is with replacement or
sampling is without replacement from an infinite population)
19
20
Z-VALUE FOR SAMPLING DISTRIBUTION
OF THE MEAN
STANDARD ERROR OF THE MEAN
y Z-value for the sampling distribution of X :
y Different samples of the same size from the same
population will yield different sample means
y A measure of the variability in the mean from sample
to sample is given by the Standard Error of the Mean:
X
n
where:
y Note that the standard error of the mean decreases
as the sample size increases
21
(X X )
Z
X
( X )
n
X = sample
p mean
= population mean
= population standard deviation
n = sample size
22
EXAMPLE
SAMPLING DISTRIBUTION PROPERTIES
y Suppose a population has mean = 8 and standard
x
(i e
(i.e.
x
deviation = 3. Suppose a random sample of size n =
36 is selected.
y What is the probability that the sample mean is between
7 8 and 8
7.8
8.2?
2?
Normal Population
Distribution
is unbiased )
Solution:
x
y Even if the population is not normally distributed
distributed, the
Normal Sampling
Distribution
(has the same mean)
central limit theorem can be used (n > 30)
y … so the sampling distribution of
x
is approximately
normal
x
23
y … with mean
x
24
= 8
y …and standard deviation
x
n
3
36
0.5
EXAMPLE
EXAMPLE: SAMPLE MEAN
If n
Solution (continued):
P(7.8
( 8 X 8
8.2))
Population
Distribution
???
??
?
?
?
?
?
?
?
25
Sampling
Distribution
Sample
X
8
§
V
V2
V·
P ¨ P d N (P , ) d P ¸
4
20
4¹
©
§
20
20 ·
P ¨¨ d N (0,1) d
¸¸ ) (1.12) ) (1.12) 0.7372
4
4
©
¹
0 3830
0.3830
Probability density function
74%
Standard Normal
Distribution
X lies within V / 4 of P .
V
V·
§
P¨P d X d P ¸
4
4¹
©
§
·
X -
¨ 7.8 - 8
8.2 - 8 ¸
P¨
¸
3
¸
¨3
36
n
36 ¹
©
P( 0 5 Z 0.5)
P(-0.5
0 5)
20, compute the prob. that Pˆ
of Pˆ
.1915
1915
+.1915
X when n
20
Standardize
7.8
X
8
8.2
x
-0.5
z
0
0.5
P V / 4
Z
P
P V / 4
26
USING THE SAMPLING DISTRIBUTION OF MEANS
TO DETERMINE PROBABILITIES
USING THE SAMPLING DISTRIBUTION OF MEANS
TO DETERMINE PROBABILITIES
y Assume IQ is normally distributed, with P = 100 and
y Assume IQ is normally distributed, with P= 100 and
V = 16
V= 16
¾ what is the probability of obtaining a sample mean of
¾ what is the probability of obtaining a sample mean that
102 or higher
g
if:
y the sample size is 4?
differs from the p
population
p
mean byy 6 points
p
or more
(i.e., ±6 points) if:
y the sample size is 16?
y the sample size is 4?
y the sample size is 9?
¾ what is the probability of obtaining a sample mean of 98
or less if:
y How would your answer change if IQ were not
y the sample size is 4?
normally distributed?
y the
th sample
l size
i iis 16?
27
28
DISTRIBUTION OF SAMPLE VARIANCE
DISTRIBUTION OF SAMPLE MEAN
If X 1 , , X n normally distributed with mean P and
If X 1 , , X n are observations from a population with a mean P
variance V 2 , then the sample variance S 2 has the distribution
and a variance V 2 , then the Central Limit Theorem indicates
S2 ~ V 2
the sample mean has the approximate distrbution
Pˆ
X ~ N (P ,
V2
n
V
( X P)
S
n
, but since V is usually unknown,
n
wemaysafelyreplace V ,whennislarge,
by the observed value (the sample standard deviation) s,
29
s.e.( X )
s
n
(n 1)
)
The standard error of the sample mean is defined as
s.e.( X )
F n21
30
(X P)
V
n
˜
1
N (0, 1)
~
~ tn 1
§S·
F n21
¨ ¸
©V ¹
(n 1)
POPULATION PROPORTIONS, P
SAMPLING DISTRIBUTIONS OF THE
PROPORTION
p = the proportion of the population having
some characteristic
Sampling
Distributions
y Sample proportion ( ps ) provides an estimate
ps
Sampling
Distributions
of the
Mean
Sampling
Distributions
of the
Proportion
31
X
n
of p:
number of items in the sample having the characteristic of interest
sample size
y 0 ps 1
y ps has
h a bi
binomial
i l di
distribution
t ib ti
(assuming sampling with replacement from a finite
population or without replacement from an infinite
population)
32
SAMPLING DISTRIBUTION: SAMPLE PROPORTION
P-HAT
X
has
n
p (1 p ) ·
§
the approximate distribution pˆ ~ N ¨ p,
¸
n
©
¹
y The situation in this section is that we are interested
p ppropotion
p
pˆ
If X ~ B(n, p ), then the sample
E ( pˆ )
i th
in
the proportion
ti off the
th population
l ti th
thatt has
h a certain
t i
characteristic.
y This proportion is the population parameter of
interest, denoted by symbol p.
y We estimate this parameter with the statistic p-hat –
the number in the sample with the characteristic
divided by the sample size n.
p,
Var ( X )
np (1 p ) Ÿ Var ( pˆ ) Var (
X
)
n
p(1 p )
n
pˆ
33
X /n
34
SAMPLING DISTRIBUTION OF P-HAT
EXAMPLES OF P-HAT COMPUTATION
y Flip coin n=10 times, keep track of number of heads,
y How does p-hat behave? To study the behavior,
X and compute the p
X,
p-hat
hat = X/10.
X/10
i
imagine
i ttaking
ki many random
d
samples
l off size
i n, and
d
computing a p-hat for each of the samples.
y Then flip coin n=30 times, then compute p-hat.
y ?? What would be the shape and behavior properties
of the two displays??
y Because the shape of the distribution is normal, we
can standardize the variable p-hat to a Z standard
normal distribution. Use Z-transform:
y PROPERTIES
y When sample sizes are fairly large, the shape of the
p-hat distribution will be normal.
y The mean of the distribution is the value of the
population parameter p
p.
y The standard deviation of this distribution is the
Z
q
root of p(
p(1-p)/n.
p)
square
35
36
pˆ p
p (1 p )
n
pˆ E ( pˆ )
Var ( pˆ )
STANDARD ERROR OF THE SAMPLE
POPULATION PROPORTIONS
PROPORTION
• Categorical variable (e.g., gender)
• % population having a characteristic
Th standard
The
t d d error off the
th sample
l mean is
i defined
d fi d as
• If two outcomes, binomial distribution
¾ Possess or don’t possess characteristic
p(1 p)
, but since p is usually unknown,
n
wereplace p by the observed value pˆ x / nto have
pˆ(1 p̂ˆ ) 1 x(n x)
s.e.( pˆ )
n
n
n
s.e.( pˆ )
• Sample proportion (ps)
X
n
Ps
37
number of successes
sample size
38
STANDARDIZING SAMPLING DISTRIBUTION
OF PROPORTION
SAMPLING DISTRIBUTION OF P
y Approximated by a
normal distribution if:
¾
P( ps)
np t 5
Z #
Sampling Distribution
.3
.2
.1
0
and
n(1
(1 p)) t 5
ps - P p
Vp
=
Sampling
Distribution
0
.2
.4
4
.6
6
8
1
V=1
where
ps
p
and
ps
p(1 p)
n
(where p = population proportion)
39
y
ps
p(1 p)
n
Convert to
P(.40 d ps d .45)
standard
normal:
41
.4(1 .4)
200
.03464
ps
Z
P = 0
EXAMPLE: SAMPLING DISTRIBUTION OF
PROPORTION
9 np t 5
n(1 p ) t 5
if p = .4 and n = 200, what is P(.40 ps .45) ?
Find p s :
Pp
40
EXAMPLE : If the true proportion of voters who
support Proposition A is p = .4, what is the
probability that a sample of size 200 yields a sample
proportion between .40 and .45?
p (1 p )
n
Standardized
Normal Distribution
Vp
ps
ps - p
Z#
Sampling Distribution
ps - p
p (1 p )
n
.43 - .40
.40u(1.40)
200
= .87
Standardized
Normal Distribution
-
Vp = .0346
=
V=1
0.3078
40 .40
40
.43
43 .40
40 ·
§ .40
dZd
P¨
¸
.03464
.03464
©
¹
P(0 d Z d 0.87)
0 87)
42
Pp = .40
.43
ps
P =0
.87
Z
Z-VALUE FOR PROPORTIONS, CORRECTION
EXAMPLE
FOR SAMPLING WITHOUT REPLACEMENT
(continued)
Always standardize ps to a Z value with the formula:
Z
ps p
ps
if p = .4
4 and n = 200,
200 what is P(.40
P( 40 ps .45)
45) ?
ps p
p(1 p)
n
Use standard normal table:
P(0 Z 1.44) = .4251
y If sampling is without replacement and n is greater than
Standardized
Normal Distribution
Sampling Distribution
5% of the population size, then p must use the finite
population correction factor:
.4251
Standardize
ps
43
p(1 p)
p(
n
Nn
N 1
.40
0
ps
Z
EXAMPLE – USE OF SAMPLE PROPORTION IN
ESTIMATION
y Old drug cures 80 percent of the time. New drug for
y Suppose that you want to know how many animals
heartworm
h
t
disease
di
iin d
dogs cures 90 percentt off
n=1000 dogs, so p-hat=.9.
y What is the probability of observing a sample
proportion greater than .9 if p=.8 and n=1000.
y That is, P(p-hat > .9) = P(Z > .9 - .8/ .0126)
=P(Z> 7.905) = approx zero.
y
y
y
y
/b
/bugs/insects/tigers
/i
t /ti
are in
i a specific
ifi
region/storage/house/… How could you use sampling
get a g
good estimate?
to g
ANSWER
Mark; release; resample
Catch say 50 animals. Mark them. Release them
back in the region/storage/house/ again. Now, catch
randomly 50 animals again.
again What percentage are the
originals that were captured?
How could this be used in other estimations? On what
does it rely?
46
ANSWER
IF THE POPULATION IS NOT NORMAL
y Mark; release; resample
y We can apply the Central Limit Theorem:
y Catch 50 tigers. Put a band around their neck.
¾ Even if the population is not normal,
Release them in the jungle again. Now, catch 50
tigers again. What percentage are the originals that
were captured?
y How could this be used in other estimations? On
what does it rely?
?
¾ …sample means from the population will be approximately
normal as long as the sample size is large enough.
Properties of the sampling distribution:
and
x
47
1.44
44
EXAMPLE – MOTIVATION FOR TESTING
45
.45
48
x
n
CENTRAL LIMIT THEOREM
As the
sample size
gets large
enough…
n
IF THE POPULATION IS NOT NORMAL
Population Distribution
the sampling
distribution
becomes almost
normal
regardless of
shape of
population
Sampling distribution
properties:
C t lT
Central
Tendency
d
x
Variation
x
x
49
n
x
Sampling Distribution
(becomes normal as n increases)
Larger
sample
size
Smaller
sample size
((Sampling
p g with
replacement)
x
50
HOW LARGE IS LARGE ENOUGH?
CENTRAL LIMIT THEOREM
y Even if data are not normally
y distributed,, as long
g as
you take “large enough” samples, the sample
averages will at least be approximately normally
distributed.
distributed
y Mean of sample averages is still P
y Standard error of sample averages is still V/sqrt(n).
y In general, “large enough” means more than 30
measurements.
y For most distributions, n > 30 will give a sampling
distribution that is nearly normal
y For fairly symmetric distributions, n > 15
y For normal population distributions
distributions, the sampling
distribution of the mean is always normally
distributed
51
52
CENTRAL LIMIT THEOREM
CENTRAL LIMIT THEOREM
y For a population with a mean P and a variance V2, the
y Even if data are not normally
y distributed,, as long
g as
sampling
li di
distribution
t ib ti off th
the means off allll possible
ibl
samples of size n generated from the population will
pp
y normally
y distributed - with the mean
be approximately
of the sampling distribution equal to P and the
variance equal to V2/n - assuming that the sample
size is sufficiently large
large.
you take “large enough” samples, the sample
averages will at least be approximately normally
distributed.
distributed
y Mean of sample averages is still
P
y Standard error of sample averages is still
V/sqrt(n).
/sqrt(n)
y In general, “large enough” means more than 30
measurements.
53
54
x
SUMMARY:
CENTRAL LIMIT THEOREM ((CLT))
• The sampling distribution of
– hhas mean P Y P Y
– and standard error V Y
Y
VY / N
• As the sample size N gets larger,
– the standard error gets smaller
– and the sampling distribution gets closer to “normal.”
• So
– larger samples give
• closer
• more predictable
55 – approximations to the population mean
SUMMARY
y Law of Large Samples
¾ If we take simple random samples
y from a well-defined population
¾ we expect
p
y that the sample means
y is “usually” “close” to the population mean
y Central Limit Theorem
¾ If by “close”
y we mean “within 2 (1.96) standard errors”
¾ then by “usually”
usually
y we mean “in 95% of all samples”
¾ For other definitions of “close” and “usually,”
y see the “z
z (standard normal)…table
normal) table” in your course binder
57
SUMMARY
y If we take a simple
p random sample
p
¾ from a well-defined population
y we expect
¾ that the sample mean
¾ is “probably” “close” to the population mean
y By “close”
close we mean “within
within ~2
2 standard errors”
errors
• Today, we’ll learn that “probably” means in 95% of all samples
56