STT200
Ch. 18
Chapter 18.
Sampling Distribution and Estimators
For a large number of trials, the Binomial distribution of the number of successes x in n trials can be approximated by a Normal distribution with THE SAME mean and standard deviation. Instead of
x ~ Binomial(n, p), which has mean $\mu = np$ and standard deviation $\sigma = \sqrt{npq}$,
use
$x \approx N(np, \sqrt{npq})$
The Normal model can be used instead of the Binomial if all the original Binomial conditions hold (a finite number of Bernoulli trials) and
• Success/failure condition: A Binomial model is approximately Normal if we expect at least 10 successes and 10 failures: np ≥ 10 and nq ≥ 10.
Example: The proportion of people with blood type O-negative is 6%. The Tennessee Red Cross collected blood from 32,000 donors. What is the probability that they have at least 1850 units of O-negative blood?
Solution: (Check conditions: random sample, less than 10% of the population selected, Bernoulli trials with two outcomes per trial and constant probability of success.) The Normal approximation can be used because np = 1920 and nq = 30,080 are both at least 10.
p = 0.06
$\mu = np = 32{,}000 \times 0.06 = 1920$
$\sigma = \sqrt{npq} = \sqrt{32{,}000 \times 0.06 \times 0.94} = 42.48$
X is approximately N(1920, 42.48)
$P(X \ge 1850)$ = Normalcdf(1850, 32000, 1920, 42.48) $\approx$ 0.9503
The exact Binomial model gives 1 - Binomcdf(32000, 0.06, 1849) = 0.952
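The comparison between the Normal approximation and the exact Binomial answer can be reproduced outside the calculator. The following is a minimal sketch using only the Python standard library; the variable and function names are illustrative choices, not part of the original notes, and the Binomial probabilities are computed through logarithms only to avoid numerical overflow.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

n, p = 32_000, 0.06
q = 1 - p

mu = n * p                 # mean of the number of O-negative donors
sigma = sqrt(n * p * q)    # standard deviation of that count

# Normal approximation: P(1850 <= X <= 32000),
# the same quantity as Normalcdf(1850, 32000, mu, sigma)
approx = NormalDist(mu, sigma)
p_normal = approx.cdf(n) - approx.cdf(1850)


def binom_pmf(k: int) -> float:
    """P(X = k) for X ~ Binomial(n, p), computed on the log scale."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(q))


# Exact Binomial: 1 - P(X <= 1849), like 1 - Binomcdf(32000, 0.06, 1849)
p_binom = 1 - sum(binom_pmf(k) for k in range(1850))

print(f"mean = {mu:.0f}, sd = {sigma:.2f}")
print(f"normal approximation: {p_normal:.4f}")
print(f"exact binomial:       {p_binom:.4f}")
```

The two probabilities should agree to about two decimal places, which is what makes the Normal model a convenient substitute here.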
**************************************
Inferential statistics: Using information taken from a sample we’ll learn
about the whole population
The terms (some old, some new):
A parameter – a numerical feature of a population: the proportion, mean,
median, range, variance, standard deviation, etc.
A statistic – any numerical measure calculated from sample data: the proportion,
mean, median, range, variance, standard deviation, etc.
An estimator: the statistic used to calculate the estimate: the sample mean,
sample variance, sample proportion etc.
A point estimate: a single number calculated from a sample that can be used
as an educated guess for an unknown population parameter.
Usually it is impossible to examine the whole population. We take a sample
from the population and our inference about an unknown parameter of the
population distribution is based on an appropriate statistic computed from the
sample (an estimator).
For example, to estimate the mean μ across the population (a parameter), we
may use as the estimator the sample mean x-bar (which is a statistic).
Or we can estimate the proportion p across the population (a parameter), using
as the estimator the sample proportion p̂ (which is a statistic).
We don’t expect any statistic computed from any sample to be exactly equal
to true population parameter. But we will use some statistics to estimate the
value of parameter.
Statistical inference: a method that converts the information from random
samples into reliable estimates of the population parameters.
Notation:
              Parameter    Estimator
Proportion    p            p̂
Mean          μ            x̄
Variance      σ²           s²
Sampling distribution:
The probability distribution of sample statistics is called a sampling
distribution.
What is the difference between the distribution of the proportion of heads in a two-coin toss and in a 20-coin toss?
For two coins the distribution of the number of successes x is Binomial:
p = 0.5, n = 2, so x has mean $\mu = np = 1$ and
standard deviation $\sigma = \sqrt{npq} = \sqrt{2(0.5)(0.5)} = 0.707$
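As a check on the two-coin numbers, the whole distribution of x can be listed and its mean and standard deviation computed directly from the definition. This is a small sketch in Python; the names `pmf`, `mean` and `sd` are illustrative only.

```python
from math import comb, sqrt

n, p = 2, 0.5
q = 1 - p

# Probability of each possible number of heads x = 0, 1, 2
pmf = {x: comb(n, x) * p**x * q ** (n - x) for x in range(n + 1)}

# Mean and standard deviation from the definition, not from the shortcut formulas
mean = sum(x * prob for x, prob in pmf.items())
sd = sqrt(sum((x - mean) ** 2 * prob for x, prob in pmf.items()))

print(pmf)
print(f"mean = {mean}, sd = {sd:.3f}")
print(f"shortcut: np = {n * p}, sqrt(npq) = {sqrt(n * p * q):.3f}")
```

Both routes give the same values, which is why the shortcut formulas np and sqrt(npq) can be used for any Binomial count.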
Example: Experiment: tossing a coin 20 times and observing proportion of
heads.
How are proportions of heads in twenty-coin-tosses distributed?
We’ll toss 20 coins (or one coin 20 times). The size of our sample is n=20, and
x=observed number of heads. The sample proportion of heads is p-hat=x/n.
Find your p-hat: …………..
Next we’ll collect the results obtained by each person. We expect …….% of heads.
Find the shape and statistics of the distribution of all p-hats collected so far:
(No coins? No problem: simulate on the calculator.
Go to MATH, PRB, 5:randInt and type 0,1,5 ENTER.
The calculator will produce five randomly selected zero-one digits at a time.
Click ENTER again to get the list of ten numbers altogether. Count 1 as
heads; repeat 3 times to obtain the total of 20 tosses.)
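If no calculator is at hand either, the same zero-one simulation can be sketched in Python. The function `random.randint(0, 1)` plays the role of randInt here; the seed value is an arbitrary choice made only so that the run is repeatable.

```python
import random

rng = random.Random(200)  # arbitrary fixed seed, only for repeatability

# Twenty simulated tosses: 1 counts as heads, 0 as tails
tosses = [rng.randint(0, 1) for _ in range(20)]

x = sum(tosses)           # observed number of heads
p_hat = x / len(tosses)   # sample proportion of heads

print(tosses)
print(f"x = {x}, p-hat = {p_hat}")
```

Changing or removing the seed gives a different sample, just as a different person tossing coins would get a different p-hat.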
Toss a coin 20 times. Write down the proportion of heads.
Case #   Tally of heads in 20 tosses   Proportion of heads (p-hat)
1        12                            p̂1 = 12/20 = 0.6
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Average of the twenty proportions:
Standard deviation of the twenty proportions:
[Dot plot of the collected sample proportions (p-hats); the axis runs from 0 to 1 (0% to 100%) in steps of 0.1.]
(Our results are collected in this dot plot.)
Discuss the shape, mean and standard deviation of this distribution of the sample
proportions.
To find the distribution of the proportions, divide x, its mean, and its standard
deviation by n.
Under the conditions $np \ge 10$ and $nq \ge 10$:
Sample number of heads: $x \sim N(np, \sqrt{npq})$
Sample proportion of heads: $\hat{p} = \dfrac{x}{n} \sim N\left(\dfrac{np}{n}, \dfrac{\sqrt{npq}}{n}\right) = N\left(p, \sqrt{\dfrac{pq}{n}}\right)$
For 20 coin tosses: $\hat{p} \sim N\left(0.5, \sqrt{\dfrac{0.5 \times 0.5}{20}}\right) = N(0.5, 0.112)$
The mean of the proportion of heads for the 20-coin toss is the same as for the
two-coin toss, but the spread of the distribution is much smaller!
Because each sample was selected at random, the sample proportion p̂ is a
random variable. With many repetitions of twenty tosses, and with more coins in
each sample, the distribution of the sample proportions resembles the Normal
model more and more closely.
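The effect of sample size on the spread of the sample proportions can also be seen by simulation. The sketch below is only an illustration: the helper `simulate_p_hats`, the number of repetitions (10,000) and the seed are assumptions made for this example, not part of the class experiment.

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulate_p_hats(n_tosses, n_samples, p=0.5, seed=18):
    """Return n_samples simulated sample proportions, each from n_tosses tosses."""
    rng = random.Random(seed)  # fixed seed, chosen arbitrarily
    return [
        sum(rng.random() < p for _ in range(n_tosses)) / n_tosses
        for _ in range(n_samples)
    ]


p_hats_2 = simulate_p_hats(2, 10_000)    # two-coin tosses
p_hats_20 = simulate_p_hats(20, 10_000)  # twenty-coin tosses

for label, n, p_hats in [("2 coins", 2, p_hats_2), ("20 coins", 20, p_hats_20)]:
    print(
        f"{label}: mean = {mean(p_hats):.3f}, sd = {stdev(p_hats):.3f}, "
        f"theory sd = {sqrt(0.5 * 0.5 / n):.3f}"
    )
```

Both simulated means should sit near 0.5, while the standard deviation for 20 coins should be close to 0.112 and clearly smaller than the one for 2 coins.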
While we can always find the statistics for any selected sample, in general we
don’t know the parameters for the entire population.
Goal of Inferential Statistics: Use sampling distribution of a statistic to
estimate the value of a population parameter with a known degree of
certainty.
When using a sample statistic to estimate a population parameter, some
statistics are good in the sense that they target the population parameter and
are therefore likely to give good results. Such statistics are called unbiased
estimators. For instance, the sample mean is a good estimator of the
population mean.
The sample proportion is a good (“unbiased”) estimator of the
population proportion.
Generalizing:
Sampling Distribution for the Proportions
Notation:
p = population proportion, q = 1-p,
𝑝̂ = sample proportion and 𝑞̂ = 1 − 𝑝̂
Central Limit Theorem for Proportions:
For large n ($np \ge 10$ and $nq \ge 10$) the sampling distribution of $\hat{p}$ is
approximately normal:
$\hat{p} \sim N\left(p, \sqrt{\dfrac{pq}{n}}\right)$
that is, $\hat{p}$ is approximately normal with mean $\mu_{\hat{p}} = p$
and standard deviation $\sigma_{\hat{p}} = \sqrt{\dfrac{pq}{n}}$
Before you use this theorem to solve a problem always check the
conditions:
1. Randomization Condition: The sample should be a simple random
sample of the population.
2. 10% Condition: If sampling has not been made with replacement,
then the sample size, n, must be no larger than 10% of the population.
3. Success/Failure Condition: The sample size has to be big enough so
that both np and nq are at least 10.
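The two numerical conditions can be checked mechanically; the randomization condition cannot, since it depends on how the sample was collected. A hedged sketch follows, where the function `clt_proportion_conditions` and its `population_size` argument are hypothetical names introduced only for this illustration.

```python
def clt_proportion_conditions(n, p, population_size=None):
    """Check the 10% and success/failure conditions for a sample proportion.

    Randomization is not checked here: it is a property of the sampling design.
    """
    q = 1 - p
    checks = {
        # Expected successes and failures must both be at least 10
        "success_failure": n * p >= 10 and n * q >= 10,
        # Unknown population size: the 10% condition cannot be verified
        "ten_percent": None,
    }
    if population_size is not None:
        checks["ten_percent"] = n <= 0.10 * population_size
    return checks


# 20 coin tosses: np = nq = 10, so the success/failure condition just holds
print(clt_proportion_conditions(20, 0.5))
# A sample of 50 with p = 0.10 from a population of 400: np = 5 is too small,
# and 50 is more than 10% of 400
print(clt_proportion_conditions(50, 0.10, population_size=400))
```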
Example: About 13% of the population is left-handed. A large auditorium has
15 "leftie seats". In a class of 90 students, what is the (approximate)
probability that there will be not enough seats for the left-handed students?
The question translates into
What is the probability that in a population with p=0.13, the sample proportion
p̂ computed from a sample of n=90 students will be more than .167?
Solution: Given: n=90, x=15, sample proportion p̂ = 15/90 = 16.7%,
population proportion p=13%.
1. Check the assumptions:
The sample is as random as any other group of students.
Both np = 90*.13 = 11.7 > 10, and nq = 90*.87 = 78.3 > 10.
Thus p̂ is approximately normally distributed
2. Use CLT: the distribution of the p-hats is
$N\left(p, \sqrt{\dfrac{pq}{n}}\right) = N\left(0.13, \sqrt{\dfrac{0.13 \times 0.87}{90}}\right) = N(0.13, 0.035)$
Answer: P( p̂ >.167) = normalcdf(0.167, 10^99, 0.13, 0.035) = 0.145
There is a 14.5% chance that there won’t be enough seats for the lefties in that
class.
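The same answer can be reproduced with Python's `statistics.NormalDist`, used here as a stand-in for the calculator's normalcdf. The sketch shows the calculation twice: once with the rounded inputs from the solution, and once with unrounded values, which is an addition for comparison only.

```python
from math import sqrt
from statistics import NormalDist

n, p = 90, 0.13
seats = 15

sd = sqrt(p * (1 - p) / n)   # standard deviation of p-hat
cutoff = seats / n           # 15/90, the proportion of "leftie seats"

# Rounded inputs, as in normalcdf(0.167, 10^99, 0.13, 0.035)
p_rounded = 1 - NormalDist(0.13, 0.035).cdf(0.167)

# Same calculation without rounding the cutoff or the standard deviation
p_unrounded = 1 - NormalDist(p, sd).cdf(cutoff)

print(f"sd of p-hat = {sd:.4f}")
print(f"P(p-hat > 0.167), rounded inputs:   {p_rounded:.3f}")
print(f"P(p-hat > 15/90), unrounded inputs: {p_unrounded:.3f}")
```

The unrounded version comes out slightly higher (about 0.15), a reminder that rounding the standard deviation early shifts the third decimal place.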
Example
According to the Centers for Disease Control and Prevention, 18.8% of school-aged
children, aged 6-11 years, were overweight in 2004. (In 2011 it reached 35%!)
(a) Check the conditions. Can the sampling distribution model be used?
(b) In a random sample of 90 school-aged children, aged 6-11 years, what is the
probability that at least 19% are overweight?
(c) Suppose a random sample of 90 school-aged children aged 6-11 years results in 24
overweight children. What might you conclude?
Solution
a) Check the conditions and find the mean and standard deviation of sampling
distribution
Can we state that the distribution of all sample proportions of overweight children is
approximately normal with mean = ……… and standard deviation = …………..?
Check: yes. The model is N(0.188, 0.0412), that is, N(18.8%, 4.12%)
b) Use Normalcdf to answer the question. (How? Use Normalcdf(“from”, “to”, population
proportion, standard deviation). For proportions the lowest possible value
is 0 and the greatest is 1, so there is no need to use positive or negative 10^99,
although it is not an error to use them.)
Normalcdf(.19, 1, 0.188, 0.0412) = 0.481
c) What is the probability of observing at least 24 overweight kids in a random
group of 90 kids? (Hint: p-hat = x/n = 24/90 and up.)
Use Normalcdf (or the table) to answer the question. Answer: 0.0281 or 2.81%
A comment: If the population proportion of overweight kids is truly 0.188, then our
observed proportion is unusual: a proportion this high or higher has less than
a 3% chance of occurring.
Less than 3% is an unlikely event! We expect only about 3 in every 100 such
samples to have 24 or more overweight kids among the 90 selected.
We can conclude that the proportion of overweight or obese children is
now likely more than 18.8%.
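Parts (a) to (c) can be checked with a few lines of Python, again using `statistics.NormalDist` in place of Normalcdf. This is a sketch under the same model as the solution; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, p = 90, 0.188

# (a) Mean and standard deviation of the sampling distribution of p-hat
sd = sqrt(p * (1 - p) / n)
model = NormalDist(p, sd)

# (b) P(p-hat >= 0.19), like Normalcdf(.19, 1, 0.188, sd)
p_at_least_19pct = model.cdf(1) - model.cdf(0.19)

# (c) P(p-hat >= 24/90), like Normalcdf(24/90, 1, 0.188, sd)
p_at_least_24 = model.cdf(1) - model.cdf(24 / n)

print(f"model: N({p}, {sd:.4f})")
print(f"(b) P(p-hat >= 0.19)  = {p_at_least_19pct:.3f}")
print(f"(c) P(p-hat >= 24/90) = {p_at_least_24:.4f}")
```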
End of Ch. 18. Almost.
Class Exercises:
Central Limit Theorem for the means will be covered in Ch. 23, but the
topic matches the subject here…
The CLT requires essentially the same assumptions we saw for sample proportions:
• Randomization Condition: The data values must be sampled randomly.
• 10% Condition: When the sample is drawn without replacement, the sample size,
n, should be no more than 10% of the population.
• Large Enough Sample Condition: The CLT doesn’t tell us how large a sample
we need. The more skewed the population is, the larger the sample needed.
For normally distributed data, samples can be small.
Conclusions:
1. The distribution of sample means (x-bars) will, as the sample size increases,
approach a normal distribution. If the original distribution was normal, then the
sample means have exactly, not approximately, a normal distribution.
2. The average (mean) of all sample means is equal to the population mean:
$\mu_{\bar{x}} = \mu$
3. The standard deviation of all sample means is $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
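Conclusions 2 and 3 can be illustrated with a short simulation. The sketch below assumes a right-skewed population (an exponential distribution with mean 1 and standard deviation 1), a sample size of 30 and 5,000 repetitions; all three are arbitrary choices made for this example and are not taken from the notes.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(23)   # arbitrary fixed seed, only for repeatability
n = 30                    # size of each sample
n_samples = 5_000         # number of samples drawn

# Assumed population: exponential with rate 1, so mu = 1 and sigma = 1
mu, sigma = 1.0, 1.0

# One sample mean per simulated sample
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(n_samples)
]

print(f"mean of sample means = {mean(sample_means):.3f} (population mean {mu})")
print(
    f"sd of sample means   = {stdev(sample_means):.3f} "
    f"(sigma/sqrt(n) = {sigma / sqrt(n):.3f})"
)
```

Even though individual values are strongly skewed, the sample means cluster around the population mean with a spread close to sigma/sqrt(n).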
Playing with sampling distributions:
http://media.pearsoncmg.com/aw/aw_mml_shared_1/statistics/West_Applets/sampledist.html
Another source to simulate sampling and get better understanding of
CLT: http://www.amstat.org/publications/jse/v6n3/applets/CLT.html
************************************************************************
NOTE: The population standard deviation is usually not known. In this case
we need to estimate it. Estimates of standard deviation of the estimators are
called standard errors
Estimator    Standard Deviation                    Standard Error (approx.)
x̄            $SD(\bar{x}) = \sigma/\sqrt{n}$        $SE(\bar{x}) = s/\sqrt{n}$
p̂            $SD(\hat{p}) = \sqrt{pq/n}$            $SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n}$
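The standard errors use only quantities available from the sample. A minimal sketch of both formulas follows; the function names and the three pulse readings are made up for illustration.

```python
from math import sqrt
from statistics import stdev


def se_mean(sample):
    """Standard error of the sample mean: s / sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))


def se_proportion(successes, n):
    """Standard error of the sample proportion: sqrt(p-hat * q-hat / n)."""
    p_hat = successes / n
    return sqrt(p_hat * (1 - p_hat) / n)


# Hypothetical pulse readings, used only to show the call
print(f"SE(x-bar) = {se_mean([70, 74, 78]):.3f}")
# 24 successes in a sample of 90
print(f"SE(p-hat) = {se_proportion(24, 90):.4f}")
```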
Example:
Using CLT to solve a problem
The distribution of women’s pulse rates is skewed right with μ = 74 bpm, σ =
13 bpm.
If 38 women are randomly selected, find the probability that their average
pulse rate is less than 72.
Solution:
Check conditions. Random? Independent? Less than 10% of all women? Is the
distribution of pulses normal, or do we have a big enough sample? (Yes: the sample
was random and small enough relative to the population, and even though the
distribution of pulses is skewed right, the sample size is large enough to use the
CLT and find the characteristics of the sampling distribution.)
The average pulse rates for groups of 38 women are approximately normally
distributed with mean 74 bpm and standard deviation $13/\sqrt{38}$, almost 2.1 bpm:
$N\left(74, \dfrac{13}{\sqrt{38}}\right) = N(74, 2.1)$
$P(\bar{x} < 72)$ = Normalcdf(0, 72, 74, 2.1) $\approx$ 0.17
Answer: There is about a 17% chance of finding a random group of 38 women with
an average pulse rate less than 72 bpm.
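The computation can be repeated in Python as a check. The sketch uses the standard deviation 13/√38 that appears in the calculation above; `statistics.NormalDist` stands in for Normalcdf, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 74, 13   # population mean and standard deviation of pulse rates (bpm)
n = 38               # group size used in the computation above

se = sigma / sqrt(n)                      # standard deviation of the sample mean
p_below_72 = NormalDist(mu, se).cdf(72)   # P(x-bar < 72)

print(f"sd of x-bar = {se:.2f} bpm")
print(f"P(x-bar < 72) = {p_below_72:.2f}")
```

Using the full lower tail instead of starting at 0 makes no practical difference here, because 0 lies dozens of standard deviations below the mean.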
Note: While the mean and standard deviation of a single woman’s pulse rate
are known (74, 13), we cannot use this information to find P(x<72) for one
woman unless we know more about the distribution of individual pulse rates.
Hints:
• When working with an individual value taken from a normally
distributed population, one at a time, use the model N(μ, σ)
• When working with a sample of size n>1 (when the conditions for using the
CLT are met), use the model $N(\mu, \sigma/\sqrt{n})$
• If the original data are not normally distributed, the size of your
sample must be large!
In all word problems concerning use of Central Limit Theorem we’ll use
a calculator to eliminate or limit hand computations.
With TI-83/84:
If n=1:
P(a<x<b)=normalcdf(a, b, population mean, population standard deviation)
If n>1:
P(a<x-bar<b)=normalcdf(a, b, population mean, population standard
deviation/sqrt(n))
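For readers without a TI-83/84, the two calculator patterns can be imitated in Python. The functions `normalcdf` and `normalcdf_mean` below are hypothetical helpers written for this sketch on top of the standard library's `statistics.NormalDist`; they are not the calculator's own implementation.

```python
from math import sqrt
from statistics import NormalDist


def normalcdf(lower, upper, mean, sd):
    """P(lower < X < upper) for X ~ N(mean, sd): the n = 1 case."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)


def normalcdf_mean(lower, upper, mean, sd, n):
    """P(lower < x-bar < upper) for the mean of n values: the n > 1 case."""
    return normalcdf(lower, upper, mean, sd / sqrt(n))


# Calls with inputs taken from examples in this chapter
print(f"{normalcdf(1850, 32000, 1920, 42.48):.4f}")
print(f"{normalcdf(0.167, 1e99, 0.13, 0.035):.3f}")
print(f"{normalcdf_mean(0, 72, 74, 13, 38):.2f}")
```

Keeping the division by sqrt(n) inside a separate helper mirrors the hint above: the only difference between the two cases is the standard deviation passed to the Normal model.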