STP 420 SUMMER 2005
STP 420
INTRODUCTION TO APPLIED STATISTICS
NOTES
PART 2 – PROBABILITY AND INFERENCE
CHAPTER 7
INFERENCE FOR DISTRIBUTIONS
7.1
Inference for the Mean of a Population
The t distributions – density curve
- symmetric about the mean 0
- area under the curve is 1
- shape similar to the standard normal curve
- has mean 0; its standard deviation is slightly larger than 1 and decreases toward 1 as the sample size increases
- as n becomes large the t curve approaches the N(0, 1) curve
- more appropriate since the σ of a population is rarely known.
If we sample from a population whose standard deviation is not known, we have to estimate it using the sample standard deviation s. The z statistic is then no longer valid and we use a more appropriate statistic, the t statistic.
Standard error ( SEx̄ ) of a statistic – the standard deviation of the statistic estimated from the data:
SEx̄ = s/√n
The t distributions
Suppose that an SRS of size n is drawn from an N(μ, σ) population. Then the one-sample t statistic
t = (x̄ – μ) / (s/√n)
has the t distribution with n – 1 degrees of freedom.
degrees of freedom – here n – 1, since the deviations from x̄ sum to 0, so only n – 1 of them are free to vary
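As a quick illustration, here is a minimal sketch (assuming NumPy is available, with made-up data values) that computes the one-sample t statistic and its degrees of freedom directly from the definitions above.

# Minimal sketch with hypothetical data: t = (xbar - mu0) / (s / sqrt(n))
import numpy as np

x = np.array([12.1, 9.8, 11.4, 10.3, 12.7, 11.9, 10.8])  # made-up sample
mu0 = 10.0                       # hypothesized population mean

n = len(x)
xbar = x.mean()
s = x.std(ddof=1)                # sample standard deviation (divides by n - 1)
se = s / np.sqrt(n)              # standard error of x-bar
t_stat = (xbar - mu0) / se
df = n - 1
print(f"xbar = {xbar:.3f}, SE = {se:.3f}, t = {t_stat:.3f}, df = {df}")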
The One-Sample t Confidence Interval
Suppose that an SRS of size n is drawn from a population having unknown mean μ. A level C confidence interval for μ is
x̄ ± t* s/√n
where t* is the value for the t(n – 1) density curve with area C between –t* and t*.
This interval is exact when the population distribution is normal and is approximately
correct for large n in other cases (non-normal populations).
Margin of error – t* s/√n; this is similar in structure to what we used with the N(0, 1) distribution, except that we replace z* with t* and σ by s.
We can report the confidence interval, or we can report the center of the interval together with half the width of the interval as the margin of error.
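A minimal sketch of this interval, assuming NumPy and SciPy are available and using made-up data, might look like the following; t* comes from the t(n – 1) distribution.

# Minimal sketch with hypothetical data: level C confidence interval xbar +/- t* s/sqrt(n)
import numpy as np
from scipy import stats

x = np.array([12.1, 9.8, 11.4, 10.3, 12.7, 11.9, 10.8])  # made-up sample
C = 0.95

n = len(x)
xbar = x.mean()
se = x.std(ddof=1) / np.sqrt(n)
t_star = stats.t.ppf((1 + C) / 2, df=n - 1)   # area C between -t* and t*
margin = t_star * se
print(f"{C:.0%} CI for mu: {xbar - margin:.3f} to {xbar + margin:.3f} (margin of error {margin:.3f})")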
The One-Sample t Test
Suppose that an SRS of size n is drawn from a population having unknown mean μ. To test the hypothesis H0 : μ = μ0 based on an SRS of size n, compute the one-sample t statistic
t = (x̄ – μ0) / (s/√n)
In terms of a random variable T having the t(n – 1) distribution, the P-value for a test of H0 against
Ha : μ > μ0   is   P(T ≥ t)
Ha : μ < μ0   is   P(T ≤ t)
Ha : μ ≠ μ0   is   2P(T ≥ |t|)
These P-values are exact if the population distribution is normal and are approximately
correct for large n in other cases.
It is wrong to look at the data and then decide whether to do a one-tailed test instead of a two-tailed test. If you have no prior knowledge suggesting the direction of the effect, use a two-sided test.
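As a rough sketch (made-up data; SciPy's alternative argument requires SciPy 1.6 or later), the one-sample t test could be run as follows, with the alternative chosen before looking at the data.

# Minimal sketch with hypothetical data: one-sample t test of H0: mu = mu0
import numpy as np
from scipy import stats

x = np.array([12.1, 9.8, 11.4, 10.3, 12.7, 11.9, 10.8])  # made-up sample
mu0 = 10.0

two_sided = stats.ttest_1samp(x, popmean=mu0)                         # Ha: mu != mu0
one_sided = stats.ttest_1samp(x, popmean=mu0, alternative="greater")  # Ha: mu > mu0
print(f"two-sided: t = {two_sided.statistic:.3f}, P = {two_sided.pvalue:.4f}")
print(f"one-sided: t = {one_sided.statistic:.3f}, P = {one_sided.pvalue:.4f}")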
Matched pairs t procedures
Subjects are matched in pairs.
E.g. The differences between pretest and posttest scores for the same set of individuals form a data set that can be tested using the same one-sample procedures as before.
1. A matched pairs analysis is needed when there are two measurements or observations on each individual and we want to examine the change from the first to the second. Before-and-after measurements are a common example.
2. For each individual, compute after minus before.
3. Analyze the differences using the one-sample confidence interval and significance-testing procedures (a sketch follows this list).
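Here is a minimal sketch of that analysis, assuming NumPy and SciPy are available and using made-up before/after scores; stats.ttest_rel gives the same result as a one-sample t test on the differences.

# Minimal sketch with hypothetical data: matched pairs t analysis
import numpy as np
from scipy import stats

before = np.array([72, 65, 80, 58, 77, 69, 74, 61])   # made-up pretest scores
after  = np.array([78, 66, 85, 60, 80, 75, 73, 68])   # made-up posttest scores

diff = after - before                       # step 2: after minus before
res = stats.ttest_1samp(diff, popmean=0)    # step 3: one-sample t test on the differences
print(f"mean difference = {diff.mean():.2f}, t = {res.statistic:.3f}, P = {res.pvalue:.4f}")
print(stats.ttest_rel(after, before))       # equivalent paired test on the two columns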
Robustness of the t procedures
robust – procedures that are not strongly affected by non-normality
Robust Procedures
- A statistical inference procedure is robust if the probability calculations required
are insensitive to violations of the assumptions made.
Some practical guidelines
1. Sample size < 15: use t procedures if the data are close to normal. Do not use them if the data are clearly non-normal or outliers are present.
2. Sample size ≥ 15: use t procedures except when outliers or strong skewness are present.
3. Large samples (n ≥ 40): t procedures can be used even for clearly skewed distributions.
The Sign Test
When populations are nonnormal, distribution-free procedures/tests, which do not assume any particular form for the population distribution, are more straightforward to justify.
They have two drawbacks:
1. They are less powerful than tests designed for a specific distribution, such as the t test.
2. We often need to restate the hypotheses to use distribution-free tests.
Distribution-free tests – stated in terms of median rather than the mean.
- good when distribution is skewed
Sign test – simplest distribution-free test
- based on counts and the binomial distribution
H0 : p = ½   (equivalently, H0 : population median = 0)
Ha : p > ½   (equivalently, Ha : population median > 0)
p is the probability of improvement from, say, a pretest to a posttest
p = ½ implies no improvement
p > ½ implies improvement
The Sign Test for Matched Pairs
Ignore pairs with difference 0; the number of trials n is the count of the remaining pairs.
The test statistic is the count X of pairs with a positive difference; P-values for X are based on the binomial B(n, ½) distribution.
Considering the pretest/posttest experiment, the sign test tests the hypothesis that the
median of the differences between the pretest and posttest scores is 0.
The sign test does not use the actual scores; it uses only a count of the improvements (differences between posttest and pretest greater than 0).
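A minimal sketch of the sign test for matched pairs (made-up data; scipy.stats.binomtest requires SciPy 1.7 or later, older versions use binom_test):

# Minimal sketch with hypothetical data: sign test for matched pairs via B(n, 1/2)
import numpy as np
from scipy import stats

before = np.array([72, 65, 80, 58, 77, 69, 74, 61])   # made-up pretest scores
after  = np.array([78, 66, 85, 58, 80, 75, 73, 68])   # made-up posttest scores

diff = after - before
diff = diff[diff != 0]               # ignore pairs with difference 0
n = len(diff)                        # number of remaining pairs (trials)
x_pos = int((diff > 0).sum())        # X = count of positive differences

res = stats.binomtest(x_pos, n=n, p=0.5, alternative="greater")  # Ha: median > 0
print(f"X = {x_pos} of n = {n} pairs improved, P = {res.pvalue:.4f}")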
7.2
Comparing Two Means
Two-Sample Problems
1. The goal of inference is to compare the responses in the two groups.
2. Each group is a sample from a distinct population.
3. Responses in each group are independent of those in the other group.
The two-sample z statistic (σ1 and σ2 are known)
Suppose that x̄1 is the mean of an SRS of size n1 drawn from an N(μ1, σ1) population and that x̄2 is the mean of an independent SRS of size n2 drawn from an N(μ2, σ2) population. Then the two-sample z statistic
z = [ (x̄1 – x̄2) – (μ1 – μ2) ] / √( σ1²/n1 + σ2²/n2 )
has the standard normal N(0, 1) sampling distribution.
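For illustration only (made-up summary values, assuming NumPy and SciPy), the z statistic can be computed directly from the formula above and referred to the N(0, 1) curve:

# Minimal sketch with hypothetical summary values: two-sample z for H0: mu1 = mu2
import numpy as np
from scipy import stats

xbar1, sigma1, n1 = 105.0, 15.0, 40    # made-up group 1 summaries (sigma known)
xbar2, sigma2, n2 = 100.0, 14.0, 45    # made-up group 2 summaries (sigma known)

z = (xbar1 - xbar2) / np.sqrt(sigma1**2 / n1 + sigma2**2 / n2)   # mu1 - mu2 = 0 under H0
p_two_sided = 2 * stats.norm.sf(abs(z))                          # area beyond |z| in both tails
print(f"z = {z:.3f}, two-sided P = {p_two_sided:.4f}")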
The two-sample t significance test (σ1 and σ2 are unknown)
Suppose that an SRS of size n1 is drawn from a normal population with unknown mean μ1 and that an independent SRS of size n2 is drawn from another normal population with unknown mean μ2. To test the hypothesis H0 : μ1 = μ2, compute the two-sample t statistic
t = (x̄1 – x̄2) / √( s1²/n1 + s2²/n2 )
and use P-values or critical values for the t(k) distribution, where the degrees of freedom
k are either approximated by software or are the smaller of n1 – 1 and n2 – 1.
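A minimal sketch of this test (made-up data, assuming NumPy and SciPy): equal_var=False tells SciPy not to pool the variances, and the degrees of freedom k are then approximated by software (the Welch approximation) rather than taken as the smaller of n1 – 1 and n2 – 1.

# Minimal sketch with hypothetical data: two-sample t test of H0: mu1 = mu2
import numpy as np
from scipy import stats

group1 = np.array([23.1, 25.4, 21.8, 26.0, 24.2, 22.7, 25.1])        # made-up data
group2 = np.array([20.5, 22.0, 19.8, 23.1, 21.4, 20.9, 22.6, 21.1])  # made-up data

res = stats.ttest_ind(group1, group2, equal_var=False)   # unpooled (unequal variances)
print(f"t = {res.statistic:.3f}, P = {res.pvalue:.4f}")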
The Two-Sample t Confidence Interval
Suppose that an SRS of size n1 is drawn from a normal population with unknown mean μ1 and that an independent SRS of size n2 is drawn from another normal population with unknown mean μ2. The confidence interval for μ1 – μ2 given by
(x̄1 – x̄2) ± t* √( s1²/n1 + s2²/n2 )
has confidence level at least C no matter what the population standard deviations may be. Here t* is the value for the t(k) density curve with area C between –t* and t*. The value of the degrees of freedom k is either approximated by software or we use the smaller of n1 – 1 and n2 – 1.
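A minimal sketch of this interval using the conservative degrees of freedom (made-up data, assuming NumPy and SciPy):

# Minimal sketch with hypothetical data: two-sample t confidence interval for mu1 - mu2
import numpy as np
from scipy import stats

group1 = np.array([23.1, 25.4, 21.8, 26.0, 24.2, 22.7, 25.1])        # made-up data
group2 = np.array([20.5, 22.0, 19.8, 23.1, 21.4, 20.9, 22.6, 21.1])  # made-up data
C = 0.95

n1, n2 = len(group1), len(group2)
diff = group1.mean() - group2.mean()
se = np.sqrt(group1.var(ddof=1) / n1 + group2.var(ddof=1) / n2)
k = min(n1 - 1, n2 - 1)                      # conservative degrees of freedom
t_star = stats.t.ppf((1 + C) / 2, df=k)
print(f"{C:.0%} CI for mu1 - mu2: {diff - t_star*se:.3f} to {diff + t_star*se:.3f}")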
Robustness of the two-sample procedures
Two-sample procedures are more robust than the one-sample procedures. If the two populations have the same shape, small samples (about 5 per group) are okay; if the populations have different shapes, larger samples are needed. Equal sample sizes are better to work with.
Inference for small samples
We need to be very careful, since there are not enough observations for boxplots or normal quantile plots to reliably check the shape of the distribution.