```
Chapter 16
Analysis of Variance
1
9.1 Parameters, Statistics,
and Statistical Inference
A statistic is a numerical value computed from a
sample. Its value may differ for different samples.
e.g. sample mean $\bar{x}$, sample standard deviation $s$,
and sample proportion $\hat{p}$.
A parameter is a numerical value associated with
a population. Considered fixed and unchanging.
e.g. population mean $\mu$, population standard
deviation $\sigma$, and population proportion $p$.
2
ANOVA
Analysis of variance: tool for analyzing
how the mean value of a quantitative
response variable is affected by one or more
categorical explanatory factors.
If one categorical variable: one-way ANOVA
If two categorical variables: two-way ANOVA
3
16.1 Comparing Means
with an ANOVA F-Test
H0: $\mu_1 = \mu_2 = \dots = \mu_k$
Ha: The population means are not all equal.
F-statistic:
$F = \frac{\text{Variation among sample means}}{\text{Natural variation within groups}}$
4
$F = \frac{\text{Variation among sample means}}{\text{Natural variation within groups}}$
Variation among sample means is 0 if all
k sample means are equal and gets larger
the more spread out they are.
If large enough → evidence that at least one
population mean is different from others
→ reject null hypothesis.
p-value found using an F-distribution (more later)
5
Example 16.1 Seat Location and GPA
Q: Do best students sit in the front of a classroom?
Data on seat location and GPA for n = 384 students;
88 sit in front, 218 in middle, 78 in back
Students sitting in the front
generally have slightly
higher GPAs than others.
6
Example 16.1 Seat Location and GPA
H0: $\mu_1 = \mu_2 = \mu_3$
Ha: The three population means are not all equal.
The F-statistic is 6.69 and the p-value is about 0.001.
p-value so small → reject H0 and conclude there
are differences among the population means.
7
Example 16.1 Seat Location and GPA
95% Confidence Intervals for 3 population means:
Interval for “front” does not overlap with the
other two intervals → significant difference
between mean GPA for front-row sitters and
mean GPA for other students
8
Notation for Summary Statistics
k = number of groups
$\bar{x}_i$, $s_i$, and $n_i$ are the mean, standard deviation,
and sample size for the ith sample group
N = total sample size = n1 + n2 + … + nk
Example 16.2 Seat Location and GPA
Three seat locations → k = 3
n1 = 88, n2 = 218, n3 = 78; N = 88+218+78 = 384
$\bar{x}_1 = 3.2029$, $\bar{x}_2 = 2.9853$, $\bar{x}_3 = 2.9194$
$s_1 = 0.5491$, $s_2 = 0.5577$, $s_3 = 0.5105$
9
Assumptions for the F-Test
• Samples are independent random samples.
• Distribution of response variable is a normal curve
within each population.
• Different populations may have different means.
• All populations have the same standard deviation, $\sigma$.
How k = 3 populations might look …
10
Conditions for Using the F-Test
• F-statistic can be used if data are not extremely
skewed, there are no extreme outliers, and group
standard deviations are not markedly different.
• Tests based on F-statistic are valid for data with
skewness or outliers if sample sizes are large.
• A rough criterion for standard deviations is that
the largest of the sample standard deviations
should not be more than twice as large as the
smallest of the sample standard deviations.
11
Example 16.3 Seat Location and GPA
• The boxplot showed two outliers in the group
of students who typically sit in the middle of
a classroom, but there are 218 students in that
group so these outliers don’t have much
influence on the results.
• The standard deviations for the three groups
are nearly the same.
• Data do not appear to be skewed.
Necessary conditions for F-test seem satisfied.
12
The Family of F-Distributions
• Skewed distributions with minimum value of 0.
• Specific F-distribution indicated by two parameters
called degrees of freedom: numerator degrees of
freedom and denominator degrees of freedom.
• In one-way ANOVA,
numerator df = k – 1,
and
denominator df = N – k
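A quick check of these formulas, using the seat-location data from
Example 16.1 (k = 3 groups, N = 384 students):
numerator df = 3 – 1 = 2, denominator df = 384 – 3 = 381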
13
Determining the p-Value
Statistical Software reports the p-value in output.
Table A.4 provides critical values for 1% and 5%
significance levels.
• If the F-statistic is greater than the 5% critical value,
the p-value < 0.05.
• If the F-statistic is greater than the 1% critical value,
the p-value < 0.01.
• If the F-statistic is between the 1% and 5% critical
values, the p-value is between 0.01 and 0.05.
14
16.2 Details of One-Way
Analysis of Variance
Fundamental concept: the variation among the data
values in the overall sample can be separated into:
(1) differences between group means
(2) natural variation among observations within a group
Total variation =
Variation between groups + Variation within groups
ANOVA Table displays this information.
15
Measuring Variation
Between Groups
Sum of squares for groups = SS Groups
$\text{SS Groups} = \sum_{\text{groups}} n_i (\bar{x}_i - \bar{x})^2$
Numerator of F-statistic = mean square for groups
$\text{MS Groups} = \frac{\text{SS Groups}}{k - 1}$
16
Measuring Variation
within Groups
Sum of squared errors = SS Error
$\text{SS Error} = \sum_{\text{groups}} (n_i - 1) s_i^2$
Denominator of F-statistic = mean square error
$\text{MSE} = \frac{\text{SS Error}}{N - k}$
Pooled standard deviation:
$s_p = \sqrt{\text{MSE}}$
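A rough worked sketch of these formulas, using the seat-location
summary statistics from Example 16.2 (hand arithmetic with rounding,
not output from the original slides):
$\text{SS Error} = 87(0.5491)^2 + 217(0.5577)^2 + 77(0.5105)^2 \approx 26.23 + 67.49 + 20.07 \approx 113.79$
$\text{MSE} \approx \frac{113.79}{384 - 3} \approx 0.2987$
$s_p = \sqrt{\text{MSE}} \approx 0.55$
The pooled value sits between the smallest and largest of the three
sample standard deviations, as expected.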
17
Measuring Total Variation
Total sum of squares = SS Total = SSTO
$\text{SS Total} = \sum_{\text{values}} (x_{ij} - \bar{x})^2$
SS Total = SS Groups + SS Error
18
General Format of a
One-Way ANOVA Table
19
Example 16.7 Analysis of Variation
among Weight Losses
$\bar{x}_1 = 7$
$\bar{x}_2 = 9$
$\bar{x}_3 = 15$
Program 3 appears to have the highest weight loss overall.
20
Example 16.8 Analysis of Variation
among Weight Losses
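$\bar{x}_1 = 7$, $\bar{x}_2 = 9$, $\bar{x}_3 = 15$ and $\bar{x} = 10$
$n_1 = 4$, $n_2 = 3$, $n_3 = 3$ and $N = 10$
$\text{SS Groups} = \sum_{\text{groups}} n_i (\bar{x}_i - \bar{x})^2$
$= 4(7 - 10)^2 + 3(9 - 10)^2 + 3(15 - 10)^2 = 114$
$\text{MS Groups} = \frac{\text{SS Groups}}{k - 1} = \frac{114}{3 - 1} = 57$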
21
Example 16.8 Analysis of Variation
among Weight Losses
$\bar{x}_1 = 7$, $\bar{x}_2 = 9$, $\bar{x}_3 = 15$ and $\bar{x} = 10$
$n_1 = 4$, $n_2 = 3$, $n_3 = 3$ and $N = 10$
$\text{SS Total} = \sum_{\text{values}} (x_{ij} - \bar{x})^2$
$= (7 - 10)^2 + (9 - 10)^2 + (5 - 10)^2 + (7 - 10)^2$
$+ (9 - 10)^2 + (11 - 10)^2 + (7 - 10)^2$
$+ (15 - 10)^2 + (12 - 10)^2 + (18 - 10)^2$
$= 148$
22
Example 16.8 Analysis of Variation
among Weight Losses
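$\bar{x}_1 = 7$, $\bar{x}_2 = 9$, $\bar{x}_3 = 15$ and $\bar{x} = 10$
$n_1 = 4$, $n_2 = 3$, $n_3 = 3$ and $N = 10$
$\text{SS Error} = \text{SS Total} - \text{SS Groups} = 148 - 114 = 34$
$\text{MSE} = \frac{\text{SS Error}}{N - k} = \frac{34}{10 - 3} = 4.857$
$F = \frac{\text{MS Groups}}{\text{MSE}} = \frac{57}{4.857} = 11.74$ with 2 and 7 df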
23
Example 16.8 Analysis of Variation
among Weight Losses
“Factor” used instead of Groups as the groups (weight-loss
programs) form an explanatory factor for the response.
Note: Pooled StDev is $s_p = \sqrt{\text{MSE}} = \sqrt{4.86} = 2.204$
24
Example 16.9 Top Speeds of Supercars
Data: top speeds for six runs on
each of five supercars. Kitchens (1998, p. 783)
25
Example 16.9
Top Speeds
26
Example 16.9
Top Speeds
• F = 25.15 and p-value is 0.000 → reject null hypothesis
that population mean speeds are same for all five cars.
• Conditions are satisfied. Data not skewed and no
extreme outliers. Largest sample std dev (5.02 Viper) not
more than twice as large as smallest std dev (2.92 Acura).
• MS Error = 14.5 is an estimate of variance of top speed for
hypothetical distribution of all possible runs with one car.
Estimated standard deviation for each car is 3.81.
• Based on sample means and CIs: Porsche and Ferrari
seem to be significantly faster than other cars.
27
Computation of 95% Confidence
Intervals for the Population Means
In one-way analysis of variance, a confidence
interval for a population mean $\mu_i$ is
$\bar{x}_i \pm t^* \frac{s_p}{\sqrt{n_i}}$
where $s_p = \sqrt{\text{MSE}}$ and
$t^*$ is such that the confidence level is the probability
between $-t^*$ and $t^*$ in a t-distribution with df = N – k.
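A worked sketch of this interval for Program 1 in the weight-loss
data of Example 16.8 ($\bar{x}_1 = 7$, $n_1 = 4$, $s_p = 2.204$, df = 10 – 3 = 7).
The multiplier $t^* \approx 2.365$ is an assumption taken from a standard
t-table for 95% confidence with 7 df; it does not appear in the slides.
$\bar{x}_1 \pm t^* \frac{s_p}{\sqrt{n_1}} = 7 \pm 2.365 \times \frac{2.204}{\sqrt{4}} \approx 7 \pm 2.61$
This gives an approximate 95% interval of 4.39 to 9.61 for the
population mean weight loss under Program 1.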
28
16.3 Other Methods
When data are skewed or extreme outliers are present,
it is better to analyze the median instead of the mean.
H0: Population medians are equal.
Ha: Population medians are not all equal.
Two such tests are:
1. Kruskal-Wallis Test
2. Mood’s Median Test
Also called nonparametric tests.
29
Example 16.12 Drinks and Seat Location
Data: Seat location and
number of alcoholic drinks per week
Data appear skewed, sample
standard deviations differ.
Students sitting in the back report drinking more.