Download X - Alan Neustadtl @ The University of MD

Single Sample Means
SOCY601—Alan Neustadtl
The Central Limit Theorem
• If we have a population measured by a variable X
with a mean µ and a standard deviation σ, and if all
possible random samples of size n are drawn from
this population, then regardless of the shape of the
population distribution, as n becomes large:
  ◦ the distribution of sample means will be approximately
  normally distributed with a mean equal to the population
  mean, and
  ◦ a standard error equal to the population standard deviation
  divided by the square root of n.
\[ \bar{\bar{X}} = \mu_X \quad \text{and} \quad \sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}} \]
Important Points
• We repeatedly take random samples and calculate their means.
Then we treat these means as a variable and create a frequency
distribution. This distribution represents the mean of every
sample that could possibly be selected. It is a sampling
distribution.
• The distribution of sample means will be approximately normally
distributed (particularly if n is large), regardless of the shape
of the population.
• The mean of the sampling distribution is equal to the mean of
the population.
• As sample size increases, the standard error (read: standard
deviation) of the sampling distribution will decrease.
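These points can be checked with a small simulation. The sketch below is illustrative only: the exponential population (mean 1, standard deviation 1), the sample size of 100, the 2,000 repetitions, and the function name are all assumptions chosen for the demonstration, not part of the lecture.

```python
import random
import statistics

def simulate_sample_means(n, num_samples, seed=0):
    """Draw num_samples random samples of size n from a skewed
    (exponential) population and return the list of sample means."""
    rng = random.Random(seed)
    # expovariate(1.0) has population mean 1 and standard deviation 1.
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(num_samples)
    ]

means = simulate_sample_means(n=100, num_samples=2000)
mean_of_means = statistics.fmean(means)   # expected to be close to mu = 1
se_observed = statistics.stdev(means)     # expected to be close to 1 / sqrt(100) = 0.1
print(round(mean_of_means, 3), round(se_observed, 3))
```

Even though the population is strongly skewed, the mean of the simulated sample means should sit near the population mean and their standard deviation near σ divided by the square root of n.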
Three Distributions
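Three Different Types of Distributions

Population
  Central tendency: $\mu_X = \frac{\sum X}{N}$
  Dispersion: $\sigma_X = \sqrt{\frac{\sum (X - \mu_X)^2}{N}}$

Sample
  Central tendency: $\bar{X} = \frac{\sum X}{n}$
  Dispersion: $s_X = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

Sampling Distribution
  Central tendency: $\bar{\bar{X}} = \frac{\sum \bar{X}}{\text{number of samples}} = \mu_X$
  Dispersion: $\sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}}$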
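The difference between the population and sample formulas is the divisor, N versus n − 1. A minimal sketch using Python's standard `statistics` module; the five scores and the sample size of 4 are made-up values for illustration only.

```python
import math
import statistics

scores = [85, 100, 115, 90, 110]   # hypothetical data, for illustration only

mu = statistics.fmean(scores)       # central tendency: sum(X) / N
sigma = statistics.pstdev(scores)   # population formula: divides by N
s = statistics.stdev(scores)        # sample formula: divides by n - 1

# Standard error of the mean for samples of a hypothetical size n = 4,
# treating the five scores as the whole population.
n = 4
standard_error = sigma / math.sqrt(n)

print(mu, round(sigma, 2), round(s, 2), round(standard_error, 2))
```

The sample formula always gives the larger value on the same data, because it divides the same sum of squares by a smaller number.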
What Does This Mean?
• Suppose that we have a population with a mean equal
to 100 (µ = 100) and a standard deviation equal to 15
(σ = 15). Assuming that we take a simple random
sample of 400 cases (n = 400) from this population, we
can immediately calculate the standard error of the
sampling distribution using the following formula:
\[ \sigma_{\bar{X}} = \frac{15}{\sqrt{400}} = 0.750 \]
\[ CL_{95} = \mu_X \pm (1.96)(0.750) \]
\[ CL_{95} = 98.53 < \bar{X} < 101.47 \]
That is, about 95% of sample means from samples of this
size fall between 98.53 and 101.47.
The Effect of Sample Size
• If the sample size were increased to 1,600, the
standard error would be smaller and the
confidence interval narrower. For example,
the standard error would be equal to:
\[ \sigma_{\bar{X}} = \frac{15}{\sqrt{1{,}600}} = 0.375 \]
\[ CL_{95} = \mu_X \pm (1.96)(0.375) \]
\[ CL_{95} = 99.27 < \bar{X} < 100.74 \]
The Effect of Confidence Level
• If the sample size is held constant at 1,600, but
we use a higher confidence level, 99% for
example, we see an increase in the
range of possible sample means:
\[ \sigma_{\bar{X}} = \frac{15}{\sqrt{1{,}600}} = 0.375 \]
\[ CL_{99} = \mu_X \pm (2.58)(0.375) \]
\[ CL_{99} = 99.03 < \bar{X} < 100.97 \]
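The three calculations above differ only in n and z, so they can be sketched with two small helpers. The function names are illustrative; the values of µ, σ, n, and z are the ones used in these slides.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def interval(mu, sigma, n, z):
    """Range mu +/- z * standard error."""
    margin = z * standard_error(sigma, n)
    return (mu - margin, mu + margin)

# (n, z) pairs from the three slides: 95% at n=400, 95% at n=1,600, 99% at n=1,600
for n, z in [(400, 1.96), (1600, 1.96), (1600, 2.58)]:
    low, high = interval(100, 15, n, z)
    print(n, z, standard_error(15, n), low, high)
```

Quadrupling the sample size halves the standard error, while moving from z = 1.96 to z = 2.58 widens the range at a fixed sample size.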
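Confidence Intervals
\[ \bar{X} \pm (z)(\sigma_{\bar{X}}) = \bar{X} \pm (z)\left(\frac{\sigma}{\sqrt{n}}\right) \]
[Figure: sampling distribution of $\bar{X}$ with standard error $\sigma_{\bar{X}}$.
90% of samples fall between $\mu - 1.645\sigma_{\bar{X}}$ and $\mu + 1.645\sigma_{\bar{X}}$,
95% between $\mu - 1.96\sigma_{\bar{X}}$ and $\mu + 1.96\sigma_{\bar{X}}$, and
99% between $\mu - 2.58\sigma_{\bar{X}}$ and $\mu + 2.58\sigma_{\bar{X}}$.]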
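Intervals & Level of Confidence
[Figure: sampling distribution of the mean, centered on $\mu_{\bar{X}} = \mu$ with
standard error $\sigma_{\bar{X}}$. The central area is 1 − α and each tail is α/2.]
Intervals extend from $\bar{X} - Z\sigma_{\bar{X}}$ to $\bar{X} + Z\sigma_{\bar{X}}$.
100(1 − α)% of intervals contain µ. 100α% do not.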
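The claim that 100(1 − α)% of intervals contain µ can be illustrated by simulation. This is a sketch under assumed settings: a normal population with µ = 100 and σ = 15, samples of size 25, and 2,000 repetitions are arbitrary choices, and the function name is hypothetical.

```python
import math
import random
import statistics

def coverage(mu, sigma, n, z, trials, seed=1):
    """Fraction of intervals xbar +/- z * sigma / sqrt(n) that contain mu."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one random sample of size n and compute its mean.
        xbar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - z * se <= mu <= xbar + z * se:
            hits += 1
    return hits / trials

rate = coverage(mu=100, sigma=15, n=25, z=1.96, trials=2000)
print(rate)   # expected to be near 0.95
```

Each sample yields a different interval; it is the procedure, repeated over many samples, that captures µ about 95% of the time.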
Confidence Intervals
Important Points
All else being equal:
• As sample size increases, the standard error decreases. As the standard error
decreases, the confidence interval narrows.
Conversely, small sample sizes are associated with larger standard errors that in
turn are associated with wider confidence intervals.
• Moving from a lower to a higher confidence level (e.g. 0.95 to 0.99), the confidence
interval increases in size; it is more inclusive.
Conversely, lower confidence levels (e.g. 0.95 versus 0.99) are associated with
narrower confidence intervals; they are more exclusive.
• The smaller the population standard deviation (σ), the smaller the standard error
and, in turn, the confidence interval.
Conversely, the larger the population standard deviation, the larger the standard
error and confidence interval.
Sample Point Estimates and
Confidence Intervals
Symbolically, a point estimate of a mean is given as $\bar{X}$.
We can place a confidence interval around this value. For
example, using a 95% confidence interval (α = 0.05) we define
boundaries approximately two standard errors below and
above the point estimate:
\[ \bar{X} \pm (1.96)\,\sigma_{\bar{X}} \]
Sample Point Estimates and
Confidence Intervals
Similarly, we can construct a 68%
confidence interval:
\[ \bar{X} \pm \sigma_{\bar{X}} \]
Or a 99% confidence interval:
\[ \bar{X} \pm (2.58)\,\sigma_{\bar{X}} \]
Sample Point Estimates and
Confidence Intervals
In general, confidence intervals can be constructed for any
desired level of confidence, 1 − α, using this formula:
\[ \bar{X} \pm \left( z_{\alpha/2} \right) \sigma_{\bar{X}} \]
Summary of Assumptions
We assume that:
1. the sample for estimating $\mu_X$ is drawn randomly;
2. the sample size n is equal to or greater than 50;
3. $\sigma_X$ is known.
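Under these assumptions, the general formula $\bar{X} \pm z_{\alpha/2}\,\sigma_{\bar{X}}$ only needs the critical value for the chosen confidence level. A minimal sketch that looks it up with `statistics.NormalDist` from the Python standard library (3.8+) instead of a printed table; the helper names are illustrative, not from the lecture.

```python
import math
from statistics import NormalDist

def z_critical(confidence):
    """Two-tailed critical value z_(alpha/2) for a confidence level such as 0.95."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def z_interval(xbar, sigma, n, confidence):
    """Confidence interval xbar +/- z_(alpha/2) * sigma / sqrt(n), sigma known."""
    margin = z_critical(confidence) * sigma / math.sqrt(n)
    return (xbar - margin, xbar + margin)

for level in (0.68, 0.90, 0.95, 0.99):
    print(level, round(z_critical(level), 3))

low, high = z_interval(xbar=100, sigma=15, n=400, confidence=0.95)
print(round(low, 2), round(high, 2))
```

The computed critical values are close to the tabled 1.645, 1.96, and 2.58 used on the earlier slides, and the 68% level gives a value close to 1.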
Confidence Intervals when the
Standard Error is Unknown
Typically, we will not know the population parameters. We may
be in a position to make assumptions about the mean, but
rarely about the standard deviation. We can usually make an
estimate of the standard error using the following formula:
\[ \hat{\sigma}_{\bar{X}} = \frac{s_X}{\sqrt{n - 1}} \]
Confidence Intervals when the
Standard Error is Unknown
When we use this formula, we have to use the t-distribution, not
the z-distribution. In general, they are similar. For example,
the general formula for confidence intervals becomes:
\[ \bar{X} \pm \left( t_{\alpha/2} \right) \hat{\sigma}_{\bar{X}} = \bar{X} \pm \left( t_{\alpha/2} \right) \left( \frac{s_X}{\sqrt{n - 1}} \right) \]
z- and t-Distributions
• Similarities to z:
  ◦ The t-distribution is bell shaped and has a mean of zero.
  ◦ With large sample sizes (n ≥ 150) the t- and z-distributions converge.
• Differences from z:
  ◦ There are many t-distributions; their shape varies with the degrees of
  freedom (n − 1).
  ◦ The use of the t-distribution to test hypotheses assumes that the sample
  was drawn from a normally distributed population. The use of t is
  generally robust against the violation of this assumption.
  ◦ A t-distribution for a given sample size has a larger variance than the
  z-distribution. Therefore, its critical values are larger and the resulting
  confidence intervals are wider.
Student’s t Distribution
[Figure: the standard normal (Z) curve compared with t-distributions for
df = 13 and df = 5. All three are bell-shaped and symmetric about 0; the
t curves have 'fatter' tails.]
An Example Using t to Construct
Confidence Intervals
• Research in the 1970s indicated that city size had
increased since World War I, but that this trend had
reversed by 1970.
• Using data measuring the percentage change in city
populations in 63 American cities, we find that the
mean of the difference is −1.26 with a standard
deviation of 6.32. That is, the point estimate
indicates that between 1960 and 1970 there was a
decrease in average city size of 1.26%.
An Example Using t to Construct
Confidence Intervals
Using an alpha level of 0.05 with 62 degrees of freedom (n − 1), the tabled
value of t is approximately equal to 2.00. It is approximate because 62 df is
not in the table; we use the entry for 60 df instead. The 95% confidence
interval, then, is equal to:
\[ CL_{95} = \bar{X} \pm (t_{0.025})(\hat{\sigma}_{\bar{X}}) \]
\[ = -1.26 \pm (2.00)\left( \frac{6.32}{\sqrt{63 - 1}} \right) \]
\[ = -1.26 \pm (2.00)(0.8026) \]
\[ = -1.26 \pm 1.605 \]
or: $-2.87 < \mu_X < 0.35$
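The same calculation can be sketched in code, evaluating the estimated standard error as $s_X / \sqrt{n - 1}$ exactly as the formula is written. The function name is illustrative, and the critical value is passed in as the tabled 2.00 because the Python standard library has no t-distribution quantile function.

```python
import math

def t_interval(xbar, s, n, t_crit):
    """Interval xbar +/- t_crit * s / sqrt(n - 1), with t_crit taken from a t table."""
    se_hat = s / math.sqrt(n - 1)   # estimated standard error, as defined in these notes
    margin = t_crit * se_hat
    return (xbar - margin, xbar + margin), se_hat

(low, high), se_hat = t_interval(xbar=-1.26, s=6.32, n=63, t_crit=2.00)
print(round(se_hat, 4), round(low, 2), round(high, 2))
```

Because the interval includes zero, these data alone do not rule out a population mean change of zero at the 95% level.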
z- and t-Tests
Besides placing confidence intervals around point
estimates of the mean, we can also calculate standard
z-tests and t-tests:
\[ z = \frac{\bar{X} - \mu_X}{\sigma_{\bar{X}}} \]
\[ t = \frac{\bar{X} - \mu_X}{\hat{\sigma}_{\bar{X}}} \]
An Example of Hypothesis Testing
Using Point Estimates
If the difference is not equal to zero, do we reject the
null hypothesis? To answer that question we need to
know what chance or random error can do—what
kind of differences is chance likely to produce?
The central limit theorem provides a distribution based
on chance. This allows us to see how chance operates
on means.
An Example of Hypothesis Testing
Using Point Estimates
We know that the mean score on an intelligence test in the general population
is 100 with a standard deviation of 15. The mean based on a sample of size
100 from a program for accelerated students is 108. Clearly, there is a
difference between the population and sample means. What could produce
this difference?
1. The program is successful or
2. random error, sampling error, or chance
The real question we need to answer is how likely is it that chance produced
this difference. Typically, we choose to assume #2 and call it the null
hypothesis (H0). In other words, it is not likely that the difference between
the sample mean and the population mean is equal exactly to zero; there
will generally be some difference. The null hypothesis is the assumption
that this difference is due to random error.
Hypothesis Testing
• What could produce differences between observed and expected values?
  ◦ There actually is a difference, or
  ◦ random error, sampling error, or chance.
• There are five basic steps in hypothesis testing:
  1. Assume the null hypothesis of no difference.
  2. We have to have an idea about the range of outcomes if the null hypothesis is
  true. We obtain this from an appropriate sampling distribution.
  3. We have to decide or set a criterion for enough evidence to be convinced that
  the null hypothesis is false. This is a significance level called alpha or α.
  4. We have to go to the real world and collect data. That is, we compute some
  sample statistic.
  5. We compare the statistic from step 4 with the criterion from step 3 and reject
  or fail to reject the null hypothesis. If the value we calculate falls in the
  critical region or exceeds the critical value associated with α, we must reject
  the null hypothesis; otherwise we fail to reject it.
Null and Alternative Hypotheses
First we posit the null hypothesis:
\[ H_0 : \mu_X = 0 \]
Next, we choose one of three different alternative hypotheses,
depending on a priori expectations:
2-tailed: $H_1 : \mu_X \neq 0$
1-tailed: $H_1 : \mu_X > 0$ or $H_1 : \mu_X < 0$
Hypothesis Testing
1. Assume the null hypothesis of no
difference.
   H0: no IQ difference between
   population and sample
   H1: there is a statistically significant
   difference in IQ between the
   population and the sample
2. We have to have an idea about the
range of outcomes if the null
hypothesis is true. We obtain this
from an appropriate sampling
distribution.
   In this problem, we have a large sample
   size and we know the population
   standard deviation. We can safely
   use the z-distribution to answer this
   question.
3. We have to decide or set a criterion
for enough evidence to be
convinced that the null hypothesis
is false. This is a significance
level called alpha or α.
   It is reasonable to assume that students
   in an accelerated program should
   have higher average I.Q. scores.
   Therefore, we choose to use a
   one-tailed test. Furthermore, since
   implementing a program like this
   universally would be expensive, we
   wish to minimize the probability of a
   Type I error. So, we select α = 0.01.
   In this case, z-critical is equal to
   2.327.
Hypothesis Testing
4. We have to go to the real world and
collect data. That is, we compute
some sample statistic.
\[ z = \frac{\bar{X} - \mu_X}{\sigma_{\bar{X}}} = \frac{108 - 100}{15 / \sqrt{100}} \cong 5.33 \]
5. We compare the statistic from step 4
with the criterion from step 3 and reject or
fail to reject the null hypothesis. If
the value we calculate falls in the
critical region or exceeds the
critical value associated with α, we
must reject the null hypothesis;
otherwise we fail to reject it.
   The calculated z of 5.33 exceeds the
   z-critical of 2.327. We reject the null
   hypothesis in favor of the alternative
   hypothesis, knowing that the
   probability of making a Type I error
   when the null hypothesis is true is 1%.
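The test can be sketched in a few lines. The helper name is illustrative; the critical value is computed with `statistics.NormalDist` rather than read from a table, which gives about 2.326, close to the tabled value quoted in these slides.

```python
import math
from statistics import NormalDist

def z_statistic(xbar, mu, sigma, n):
    """z = (xbar - mu) / (sigma / sqrt(n)), for a known population sigma."""
    return (xbar - mu) / (sigma / math.sqrt(n))

alpha = 0.01
z = z_statistic(xbar=108, mu=100, sigma=15, n=100)
z_crit = NormalDist().inv_cdf(1 - alpha)   # one-tailed critical value
reject_null = z > z_crit
print(round(z, 2), round(z_crit, 3), reject_null)
```

For a one-tailed test the whole of α sits in one tail, which is why the lookup uses 1 − α rather than 1 − α/2.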
Determining How Big A Sample You Need
You know that sample size affects the amount of error in
parameter estimates: ceteris paribus, larger samples have less
error. This is bound up in the following formula:
\[ \text{error} = \left( t_{\alpha/2} \right) (\hat{\sigma}_{\bar{X}}) = \left( t_{\alpha/2} \right) \left( \frac{s_X}{\sqrt{n}} \right) \]
Determining How Big A Sample You Need
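So, knowing α and either knowing the population standard
deviation or making an estimate of it, you can solve this
formula for n, the sample size. Consider the following:
\[ n = \left( \frac{\left( t_{\alpha/2} \right) (\hat{\sigma}_X)}{\text{error}} \right)^2 \]
where $\hat{\sigma}_X$ is the population standard deviation or its estimate, $s_X$.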
An Example
We know that the population mean and standard
deviation for the Stanford-Binet intelligence
test are 100 and 15, respectively. How large a
sample do we need to produce an
estimate of the mean within three points of the
parameter? Since we know the actual
parameters, we can use z.
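An Example
\[ n = \left( \frac{(1.96)(15)}{3} \right)^2 = 96.04 \cong 97 \]
What if we wanted to reduce the margin of error to one point?
How big a sample size do we need to draw?
\[ n = \left( \frac{(1.96)(15)}{1} \right)^2 = 864.36 \cong 865 \]
In both cases we round up to the next whole case.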
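A minimal sketch of the sample-size formula, assuming a known σ so that z can be used in place of t. The function name is illustrative, and rounding up with `math.ceil` is a choice made here so that the requested margin of error is not exceeded.

```python
import math

def required_sample_size(z, sigma, error):
    """Smallest whole n satisfying z * sigma / sqrt(n) <= error."""
    return math.ceil((z * sigma / error) ** 2)

for margin in (3, 1):
    print(margin, required_sample_size(z=1.96, sigma=15, error=margin))
```

Because the error term is squared in the denominator, cutting the margin of error to one third requires roughly nine times as many cases.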
Tests Involving Proportions
\[ \hat{p} \pm \left( z_{\alpha/2} \right) (\sigma_{\hat{p}}) = \hat{p} \pm \left( z_{\alpha/2} \right) \sqrt{\frac{(p)(1 - p)}{n}} \approx \hat{p} \pm \left( z_{\alpha/2} \right) \sqrt{\frac{(\hat{p})(1 - \hat{p})}{n}} \]
where:
\[ \hat{p} = \frac{\sum X}{n} \]
Note: When n is large, $\hat{p}$ can approximate the value of p in the formula for $\sigma_{\hat{p}}$.
An Example
In a sample of 1,000 American citizens, 637 respond that they
trust the president. Using a 95% confidence interval, estimate the
proportion of the population that trusts the president.
\[ CL_{95} = \hat{p} \pm \left( z_{\alpha/2} \right) \sqrt{\frac{(\hat{p})(1 - \hat{p})}{n}} \]
\[ = 0.637 \pm (1.96) \sqrt{\frac{(0.637)(0.363)}{1{,}000}} \]
\[ = 0.637 \pm 0.030 \]
\[ 0.607 < p < 0.667 \]
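The survey example can be sketched as follows, using the sample proportion in place of p in the standard error as the note on large samples allows. The function name is illustrative.

```python
import math

def proportion_interval(successes, n, z):
    """Confidence interval for a proportion, with p-hat used in the standard error."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * se
    return p_hat, margin, (p_hat - margin, p_hat + margin)

p_hat, margin, (low, high) = proportion_interval(successes=637, n=1000, z=1.96)
print(p_hat, round(margin, 3), round(low, 3), round(high, 3))
```

The margin of error comes out at about three percentage points, which is typical for a survey of 1,000 respondents.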
Tests Involving Proportions
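\[ z = \frac{p_s - p_u}{\sqrt{\dfrac{(p_u)(1 - p_u)}{n}}} \]
where $p_s$ is the sample proportion and $p_u$ is the hypothesized population proportion.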
An Example
In a sample of 40 students taking an examination, 70% earned a
score of 80% or greater. The professor claims success if 80%
meet or exceed the goal of mastering 80% of the examination
material. Evaluate this examination using a 99% confidence
interval.
\[ z = \frac{0.70 - 0.80}{\sqrt{\dfrac{(0.80)(0.20)}{40}}} \cong -1.58 \]
z-critical is equal to 2.575. Since the absolute value of z (1.58) is smaller,
we fail to reject the null hypothesis.
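A minimal sketch of this test, with the standard error based on the hypothesized proportion as in the formula for z. The function name is illustrative; 2.575 is the two-tailed critical value quoted on the slide.

```python
import math

def proportion_z(p_s, p_u, n):
    """z statistic for a sample proportion p_s against a hypothesized proportion p_u."""
    return (p_s - p_u) / math.sqrt(p_u * (1 - p_u) / n)

z = proportion_z(p_s=0.70, p_u=0.80, n=40)
z_critical = 2.575                         # two-tailed, alpha = 0.01
fail_to_reject = abs(z) < z_critical
print(round(z, 2), fail_to_reject)
```

With only 40 students, a ten-point shortfall is well within what chance alone could produce at this significance level.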
An Example
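Using a confidence interval, with the standard error based on the
hypothesized proportion of 0.80, we get:
\[ CL_{99} = \hat{p} \pm \left( z_{\alpha/2} \right) \sqrt{\frac{(p)(1 - p)}{n}} \]
\[ = 0.70 \pm (2.575) \sqrt{\frac{(0.8)(0.2)}{40}} \]
\[ = 0.70 \pm 0.16 \]
\[ 0.54 < p < 0.86 \]
The interval contains 0.80, which agrees with failing to reject the null hypothesis.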