Download sample

Document related concepts

History of statistics wikipedia , lookup

Taylor's law wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Student's t-test wikipedia , lookup

Statistical inference wikipedia , lookup

Resampling (statistics) wikipedia , lookup

Gibbs sampling wikipedia , lookup

Sampling (statistics) wikipedia , lookup

Transcript
Statistics
Sampling and Sampling
Distribution
1/101
STATISTICS in PRACTICE


MeadWestvaco Corporation’s
products include textbook
paper, magazine paper, and
office products.
MeadWestvaco’s internal consulting
group uses sampling to provide information
that enables the company to obtain
significant productivity benefits and remain
competitive.
2/101
STATISTICS in PRACTICE


Managers need reliable and accurate
information about the timberlands and
forests to evaluate the company’s ability to
meet its future raw material needs.
Data collected from sample plots
throughout the forests are the basis for
learning about the population of trees
owned by the company.
3/101
Contents







The Electronics Associates Sampling Problem
Simple Random Sampling
Point Estimation
Introduction to Sampling Distributions
Sampling Distribution of p
Properties of Point Estimators
Other Sampling Methods
4/101
Statistical Inference



The purpose of statistical inference is to obtain
information about a population from
information contained in a sample.
A population is the set of all the elements of
interest.
A sample is a subset of the population.
5/101
Statistical Inference



The sample results provide only estimates of
the values of the population characteristics.
With proper sampling methods, the sample
results can provide “good” estimates of the
population characteristics.
A parameter is a numerical characteristic of a
population.
6/101
The Electronics Associates
Sampling Problem


Often the cost of collecting information
from a sample is substantially less than
from a population,
Especially when personal interviews must
be conducted to collect the information.
7/101
Simple Random Sampling:
Finite Population

Finite populations are often defined by lists
such as:
 Organization membership roster
 Credit card account numbers
 Inventory product numbers
8/101
Simple Random Sampling:
Finite Population
 A simple
random sample of size n from a
finite population of size N is a sample
selected such that each possible sample of
size n has the same probability of being
selected.
9/101
Simple Random Sampling:
Finite Population
 Replacing each sampled element before
selecting subsequent elements is called
sampling with replacement.
 Sampling without replacement is the
procedure used most often.
10/101
Simple Random Sampling

Random Numbers: the numbers in the table are
random, these four-digit numbers are equally
likely.
11/101
Simple Random Sampling:
Infinite Population

Infinite populations are often defined by an
ongoing process whereby the elements of the
population consist of items generated as
though the process would operate
indefinitely.
12/101
Simple Random Sampling:
Infinite Population

A simple random sample from an infinite
population is a sample selected such that the
following conditions are satisfied.
 Each element selected comes from the same
population.
 Each element is selected independently.
13/101
Simple Random Sampling:
Infinite Population

In the case of infinite populations, it is
impossible to obtain a list of all elements
in the population.

The random number selection procedure
cannot be used for infinite populations.
14/101
Point Estimation

In point estimation we use the data from the
sample to compute a value of a sample statistic
that serves as an estimate of a population
parameter.

We refer to x as the point estimator of the
population mean .
15/101
Point Estimation

s is the point estimator of the population standard
deviation .

p
is the point estimator of the population
proportion p.
16/101
Point Estimation

Example: to estimate the population mean,
the population standard deviation and
population proportion.
17/101
Sampling Error
 When the expected value of a point estimator
is equal to the population parameter, the point
estimator is said to be unbiased.
 The absolute value of the difference between an
unbiased point estimate and the corresponding
population parameter is called the
sampling error.
18/101
Sampling Error
 Sampling error is the result of using a subset
of the population (the sample), and not the
entire population.
 Statistical methods can be used to make
probability statements about the size of the
sampling error.
19/101
Sampling Error

The sampling errors are:
|x   | for
|s   |
sample mean
for sample standard deviation
| p  p| for
sample proportion
20/101
Example: St. Andrew’s
St. Andrew’s College
receives 900 applications
annually from
prospective students.
The application form
contains a variety of information
including the individual’s scholastic aptitude
test (SAT) score and whether or not the
individual desires on-campus housing.
21/101
Example: St. Andrew’s
The director of admissions
would like to know the
following information:
 the average SAT score
for the 900 applicants,
and
 the proportion of applicants that want to
live on campus.
22/101
Example: St. Andrew’s
We will now look at three
alternatives for obtaining
The desired information.
 Conducting a census of
the entire 900 applicants
 Selecting a sample of 30
applicants, using a random number table

Selecting a sample of 30 applicants, using Excel
23/101
Conducting a Census

If the relevant data for the entire 900
applicants were in the college’s database, the
population parameters of interest could be
calculated using the formulas presented in
Chapter 3.

We will assume for the moment that
conducting a census is practical in this
example.
24/101
Conducting a Census

Population Mean SAT Score

Population Standard Deviation for SAT
Score

Population Proportion Wanting On-Campus
Housing
25/101
Simple Random Sampling


Now suppose that the necessary data on the
current year’s applicants were not yet entered
in the college’s database.
Furthermore, the Director of Admissions must
obtain estimates of the population parameters of
interest for a meeting taking place in a few hours.
26/101
Simple Random Sampling


Now suppose that the necessary data on the
current year’s applicants were not yet
entered in the college’s database.
Furthermore, the Director of Admissions must
obtain estimates of the population parameters
of interest for a meeting taking place in a few
hours.
27/101
Simple Random Sampling

The applicants were numbered, from 1 to 900,
as their applications arrived.
28/101
Simple Random Sampling:
Using a Random Number Table

Taking a Sample of 30 Applicants


Because the finite population has 900
elements, we will need 3-digit random
numbers to randomly select applicants
numbered from 1 to 900.
We will use the last three digits of the 5-digit
random numbers in the third column of the
textbook’s random number table , and
continue into the fourth column as needed.29/101
Simple Random Sampling:
Using a Random Number Table

Taking a Sample of 30 Applicants


The numbers we draw will be the numbers of
the applicants we will sample unless the
random number is greater than 900 or
the random number has already been used.
We will continue to draw random numbers
until we have selected 30 applicants for our
sample.
30/101
Simple Random Sampling:
Using a Random Number Table

(We will go through all of column 3 and
part of column 4 of the random number table,
encountering in the process five numbers
greater than 900 and one duplicate, 835.)
31/101
Simple Random Sampling:
Using a Random Number Table

Use of Random Numbers for Sampling
3-Digit
Applicant
Random Number Included in Sample
744
No. 744
436
No. 436
865
No. 865
790
No. 790
835
No. 835
902
Number exceeds 900
190
No. 190
836
No. 836
. . . and so on
32/101
Simple Random Sampling:
Using a Random Number Table

Sample Data
Random
No. Number
1
744
2
436
3
865
4
790
5
835
.
.
.
.
SAT
Score
Applicant
Conrad Harris
1025
Enrique Romero
950
Fabian Avante
1090
Lucila Cruz
1120
Chan Chiang
930
.
.
.
.
30
Emily Morse
498
1010
Live OnCampus
Yes
Yes
No
Yes
No
.
.
No
33/101
Simple Random Sampling:
Using a Computer

Taking a Sample of 30 Applicants
• Computers can be used to generate random
numbers for selecting random samples.
• For example, Excel’s function
= RANDBETWEEN(1,900)
can be used to generate random numbers
between 1 and 900.
• Then we choose the 30 applicants
corresponding to the 30 smallest random
34/101
numbers as our sample.

Point Estimation
wx

x as Point Estimator of 
w
i i
i


s as Point Estimator of 
–p as Point Estimator of p
35/101
Point Estimation
Note: Different random numbers would
have identified a different sample which
would have resulted in different point
estimates.
36/101
Summary of Point Estimates
Obtained from a Simple Random
Sample
Population
Parameter
Parameter
Value
 = Population mean
990
x = Sample mean
997
 = Population std.
80
s = Sample std.
deviation for
SAT score
75.2
p = Population proportion wanting
campus housing
.72
SAT score
Point
Estimator
Point
Estimate
SAT score
deviation for
SAT score
p
= Sample proportion wanting
campus housing
.68
37/101
Sampling Distribution

Example: Relative Frequency Histogram of Sample
Mean Values from 500 Simple Random Samples of
30 each.
38/101
Sampling Distribution

Example: Relative Frequency Histogram of Sample
Proportion Values from 500 Simple Random
Samples of 30 each.
39/101
Sampling Distribution of x 

Process of Statistical
Inference
Population
with mean
=?
The value of x is used to
make inferences about
the value of .
A simple random sample
of n elements is selected
from the population.
The sample data
provide a value for
the sample mean x.



Sampling Distribution of x 

w
x

i
i
The sampling distribution of x is the probability
wi

distribution of all possible values of the sample
wi xi

mean x .
wi
wi xi


Expected Value of x 
w
E( x ) = 
i
where:
 = the population mean
41/101
Sampling Distribution of x

Standard Deviation of x
Finite Population
 N n
x  ( )
n N 1
Infinite Population
x 

n
• A finite population is treated as being
infinite if n/N < .05.
42/101
Sampling Distribution of x
( N  n) /( N  1)
•
is the finite
correction factor.
•  x is referred to as the standard
error of the mean.
43/101
Form of the Sampling Distribution
wi xi

of x 

w

If we use a large (n > 30) simple random sample,
i
the central limit theorem enables us to conclude
that the sampling distribution of x can be
approximated by a normal distribution.

When the simple random sample is small (n < 30),
the sampling distribution of x can be considered
normal only if we assume the population has a
normal distribution.
44/101
Central Limit Theorem

Illustration of The Central Limit Theorem
45/101
Relationship Between the Sample Size and
the Sampling Distribution of Sample Mean

A Comparison of The Sampling
Distributions of Sample Mean for Simple
Random Samples of n = 30 and n = 100.
46/101
w

x  for
w
i
Sampling Distribution of
SAT Scores
Sampling
Distribution
of x
E( x )  990
x 

80

 14.6
n
30
x
47/71
wx

Sampling Distribution of x for
w

SAT Scores
i
i
i
What is the probability that a simple
random sample of 30 applicants will provide
an estimate of the population mean SAT score
that is within +/-10 of the actual population mean ?
In other words, what is the probability that x 
will be between 980 and 1000?
48/101
wx

Sampling Distribution of x for w

i
i
SAT Scores
Step 1: Calculate the z-value at the upper
endpoint of the interval.
z = (1000 - 990)/14.6= .68
Step 2: Find the area under the curve to the
left of the upper endpoint.
P(z < .68) = .7517
49/101
i
wx

Sampling Distribution of x for
w

SAT Scores
i
i
Cumulative Probabilities for
the Standard Normal Distribution
z
.00
.01
.02
.03
.04
.05
.06
.07
.08
.09
.
.
.
.
.
.
.
.
.
.
.
.5
.6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224
.6
.7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549
.7
.7580 .7611 .7642 .7673 .7704 .7734 .7764 .7794 .7823 .7852
.8
.7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133
.9
.8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389
.
.
.
.
.
.
.
.
.
.
. 50/101
wx

Sampling Distribution of x for
w
SAT Scores
i i
i
Sampling
Distribution
of x
 x  14.6
Area = .7517
x
990 1000
51/101
w

Sampling Distribution of x for

w

SAT Scores
Step 3: Calculate the z-value at the lower
endpoint of the interval.
z = (980 - 990)/14.6= - .68
Step 4: Find the area under the curve to the
left of the lower endpoint.
P(z < -.68) = P(z > .68)
= 1 - P(z < .68)
= 1 - . 7517
= .2483
52/101
wx

Sampling Distribution of x for
w
SAT Scores
i
i
Sampling
Distribution
of x
 x  14.6
Area = .2483
x
980 990
53/101
i
w

Sampling Distribution of x for


SAT Scores
Step 5: Calculate the area under the curve
between the lower and upper endpoints
of the interval.
P(-.68 < z < .68) = P(z < .68) - P(z < -.68)
= .7517 - .2483
= .5034
The probability that the sample mean SAT
score will be between 980 and 1000 is:
P(980 <x < 1000) = .5034
54/101
wx

Sampling Distribution of x for w

i i
i
SAT Scores
Sampling
Distribution
of x
 x  14.6
Area = .5034
980 990 1000
x
55/101
Relationship Between the Sample Size

x

and the Sampling Distribution of

 Suppose we select a simple random sample
of 100 applicants instead of the 30 originally
considered.

wx

E(x )= m regardless of the sample size.
wx

w

in our example, E(x ) remains at 990.
w
i i
i
i i
i
56/101
Relationship Between the Sample Size
and the Sampling Distribution of x 
 Whenever the sample size is increased, the
standard error of the mean σ x is decreased.
With the increase in the sample size to n = 100,
the standard error of the mean is decreased to:
57/101
Relationship Between the Sample Size
and the Sampling Distribution of x 
With n = 100,
x  8
With n = 30,
 x  14.6
E( x )  990
x
58/101

Relationship Between the Sample Size
wi x

and the Sampling Distribution of x 
w
i

Recall that when n = 30,
wi xi

P(980 < x < 1000) = .5034.
w
i

We follow the same steps to solve for
wi xi

P(980 < x < 1000)
w

i
when n = 100 as we
showed earlier when n = 30.
59/101
Relationship Between the Sample Size
w

and the Sampling Distribution of x 
w

wx

Now, with n = 100, P(980 < x<1000) = .7888.
w
Because the sampling distribution with n = 100
w

has a smaller standard error, the values of xhave

w
less variability and tend to be closer to the
wx

population mean than the values of x with
n = 30.

w
i i
i

i i
i
60/101
Relationship Between the Sample Size
w

and the Sampling Distribution of x 

Sampling
Distribution
of x
x  8
Area = .7888
x
980 990 1000
61/101
Sampling Distribution

Example: Relative Frequency Histogram of
Sample Proportion Values from 500 Simple
Random Samples of 30 each.
62/101
Sampling Distribution of p

Making Inferences about a Population
Proportion
Population
with proportion
p=?
The value of p is used
to make inferences
about the value of p.
A simple random sample
of n elements is selected
from the population.
The sample data
provide a value for the
sample proportion p.
63/101
Sampling Distribution of
p
The sampling distribution of p is the probability
distribution of all possible values of the sample
proportion p .
Expected Value of p
E ( p)  p
where:
p = the population proportion
64/101
Sampling Distribution of
p
Standard Deviation of p
Finite Population
p 
p
p (1  p ) N  n
n
N 1
Infinite Population
p 
p (1  p )
n
is referred to as the standard error
of the proportion.
Form of the Sampling
Distribution of p

The sampling distribution of p can be
approximated by a normal distribution whenever
the sample size is large.
The sample size is considered large
whenever these conditions are satisfied:
np > 5 and n(1 – p) > 5
Form of the Sampling
Distribution of p


For values of p near .50, sample sizes as
small as 10 permit a normal approximation.
With very small (approaching 0) or very
large (approaching 1) values of p, much
larger samples are needed.
67/101
Sampling Distribution of p

Example: St. Andrew’s College
Recall that 72% of the prospective students
applying to St. Andrew’s College desire
on-campus housing.
68/101
Sampling Distribution of p

Example: St. Andrew’s College
What is the probability that a simple random
sample of 30 applicants will provide an estimate
of the population proportion of applicant
desiring on-campus housing that is within plus or
minus .05 of the actual population proportion?
69/101
Sampling Distribution of
p
For our example, with n = 30 and p = .72,
the normal distribution is an acceptable
approximation because:
np = 30(.72) = 21.6 > 5
and
n(1 - p) = 30(.28) = 8.4 > 5
Sampling Distribution of
Sampling
Distribution
of
p
Sampling Distribution of
p
Step 1: Calculate the z-value at the upper
endpoint of the interval.
z = (.77 - .72) /.082 = .61
Step 2: Find the area under the curve to the
left of the upper endpoint.
P(z < .61) = .7291
72/101
Sampling Distribution of
p
Cumulative Probabilities for
the Standard Normal Distribution
z
.00
.01
.02
.03
.04
.05
.06
.07
.08
.09
.
.
.
.
.
.
.
.
.
.
.
.5
.6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224
.6
.7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549
.7
.7580 .7611 .7642 .7673 .7704 .7734 .7764 .7794 .7823 .7852
.8
.7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133
.9
.8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389
.
.
.
.
.
.
.
.
.
.
.
73/101
Sampling Distribution of
p
 p  .082
Sampling
Distribution
of p
Area = .7291
p
.72
.77
74/101
Sampling Distribution of
p
Step 3: Calculate the z-value at the lower
endpoint of the interval.
z = (.67 - .72) /.082 = - .61
Step 4: Find the area under the curve to the
left of the lower endpoint.
P(z < -.61) = P(z > .61)
= 1 - P(z < .61)
= 1 - . 7291
= .2709
75/101
Sampling Distribution of
p
Sampling
Distribution
of
Area = .2709
.67
.72
76/101
Sampling Distribution of
p
Step 5: Calculate the area under the curve between
the lower and upper endpoints of the interval.
P(-.61 < z < .61) = P(z < .61) - P(z < -.61)
= .7291 - .2709
= .4582
The probability that the sample proportion of
applicants wanting on-campus housing will be
within +/-.05 of the actual population proportion :
P(.67 <
< .77) = .4582
77/101
Sampling Distribution of
Sampling
Distribution
of p
p
 p  .082
Area = .4582
p
.67
.72
.77
78/101
Point Estimators


Notations:
^
θ = the population parameter of interest.
For example, population mean, population
standard deviation, population proportion,
and so on.
79/101
Point Estimators




Notations:
^
θ = the sample statistic or point estimator
of θ .
Represents the corresponding sample
statistic such as the sample mean, sample
standard deviation, and sample proportion.
The notation θ is the Greek letter theta.
^
the notation θ is pronounced “theta-hat.”
80/101
Properties of Point Estimators

Before using a sample statistic as a point
estimator, statisticians check to see whether
the sample statistic has the following
properties associated with good point
estimators.

Unbiased

Efficiency

Consistency
81/101
Properties of Point Estimators

Unbised
If the expected value of the sample
statistic is equal to the population parameter
being estimated, the sample statistic is said
to be an unbiased estimator of the
population parameter.
82/101
Properties of Point Estimators

Unbised
The sample statistic ^θ is unbiased estimator
of the population parameter θ if E(θ)= ^θ
where
^
E(θ)=the
expected value of the sample
statistic θ
83/101
Properties of Point Estimators

Examples of Unbiased and Biased Point
Estimators
84/101
Properties of Point Estimators

Efficiency
Given the choice of two unbiased
estimators of the same population parameter,
we would prefer to use the point estimator
with the smaller standard deviation, since it
tends to provide estimates closer to the
population parameter.
The point estimator with the smaller
standard deviation is said to have greater
85/101
relative efficiency than the other.
Properties of Point Estimators

Example: Sampling Distributions of Two
Unbiased Point Estimators.
86/101
Properties of Point Estimators

Consistency
A point estimator is consistent if the
values of the point estimator tend to become
closer to the population parameter as the
sample size becomes larger.
87/101
Other Sampling Methods
Stratified Random Sampling(分層隨機抽樣)
 Cluster Sampling(部落抽樣)
 Systematic Sampling(系統抽樣)
 Convenience Sampling(便利抽樣)
 Judgment Sampling(判斷抽樣)

88/101
Stratified Random Sampling

The population is first divided into groups of
elements called strata.

Each element in the population belongs to one
and only one stratum.
Best results are obtained when the elements
within each stratum are as much alike as possible
(i.e. a homogeneous group).

89/101
Stratified Random Sampling

Diagram for Stratified Random Sampling
90/101
Stratified Random Sampling


A simple random sample is taken from each
stratum.
Formulas are available for combining the
stratum sample results into one population
parameter estimate.
91/101
Stratified Random Sampling

Advantage: If strata are homogeneous, this
method is as “precise” as simple random
sampling but with a smaller total sample size.

Example: The basis for forming the strata
might be department, location, age, industry type,
and so on.
92/101
Cluster Sampling
The population is first divided into separate
groups of elements called clusters.
 Ideally, each cluster is a representative smallscale version of the population (i.e.
heterogeneous group).
 A simple random sample of the clusters is then
taken.
 All elements within each sampled (chosen)
cluster form the sample.

93/101
Cluster Sampling

Diagram for Cluster Sampling
94/101
Cluster Sampling

Example: A primary application is area
sampling, where clusters are city blocks or
other well-defined areas.
Advantage: The close proximity of elements
can be cost effective (i.e. many sample
observations can be obtained in a short time).
 Disadvantage: This method generally requires a
larger total sample size than simple or stratified
random sampling.

95/101
Systematic Sampling



If a sample size of n is desired from a population
containing N elements, we might sample one
element for every n/N elements in the population.
We randomly select one of the first n/N elements
from the population list.
We then select every n/Nth element that follows in
the population list.
96/101
Systematic Sampling

This method has the properties of a simple
random sample, especially if the list of the
population elements is a random ordering.
Advantage: The sample usually will be easier to
identify than it would be if simple random
sampling were used.
Example: Selecting every 100th listing in a
telephone book after the first randomly selected
97/101
listing
Convenience Sampling
It is a nonprobability sampling technique.
Items are included in the sample without known
probabilities of being selected.
The sample is identified primarily by convenience.
Example: A professor conducting research might
use student volunteers to constitute a sample.
98/101
Convenience Sampling

Advantage: Sample selection and data collection
are relatively easy.

Disadvantage: It is impossible to determine how
representative of the population the sample is.
99/101
Judgment Sampling

The person most knowledgeable on the subject of
the study selects elements of the population that
he or she feels are most representative of the
population.

It is a nonprobability sampling technique.

Example: A reporter might sample three or four
senators, judging them as reflecting the general
opinion of the senate.
100/101
Judgment Sampling

Advantage: It is a relatively easy way of
selecting a sample.

Disadvantage: The quality of the sample results
depends on the judgment of the person selecting
the sample.
101/101