0. Why Conduct an Experiment?
• Infer causation
• Ensure repeatability
Also,
• Determine the relationships between
variables
• Estimate significance of each variable
1
1. Components of Experimentation
• Formulate research hypotheses
– Derivations from a theory
– Deductions from empirical experience
– Speculation
Research hypotheses are the questions we
hope to answer by conducting the
experiment
2
Components of experimentation
• Define variables and design
– Are the independent variables capable of testing
the hypotheses?
– Are the independent variables confounded?
• For example, assume buffer size and cycle time are
independent variables
– Condition 1: buffer size = 10, cycle time = 1 min
– Condition 2: buffer size = 15, cycle time = 1 min
– Condition 3: buffer size = 15, cycle time = 2 min
3
Components of experimentation
– No meaningful inference can be drawn by conducting an experiment that includes conditions 1 and 3, because the variables are confounded, that is, varied simultaneously under the control of the researcher.
4
Components of experimentation
Design of experiments is a technique for
examining and maximizing the information
gained from an experiment.
5
Components of experimentation
• Conduct experiment
– Collect data
– Extract information
• Analyze results
– Test hypotheses
• Report outcomes
6
Components of experimentation
An experiment is usually conducted to test a theory. If the outcome of the experiment is negative, the experiment may be inadequate even though the theory is valid.
7
Definitions
• Factor - an input variable
• Level - a possible value of a factor
• Treatment - a combination of factors, all at a
specified level, as in a simulation run
• Parameter - a measure calculated from all
observations in a population
• Statistic - a measure calculated from all
observations in a sample
8
2. Hypothesis Testing
In analyzing data from an experiment, we are interested not only in describing the performance of the subjects assigned to the treatment conditions; we also want to make inferences about the behavior of the source population from which our sample of subjects was drawn.
9
Hypothesis testing
• We start by making an assumption about the
value of a population parameter.
• We can test this assumption in two ways:
– census
• foolproof, time consuming
– random sample
• not foolproof, faster than census
10
Hypothesis testing
• In hypothesis testing, we make an
assumption (hypothesis) about the value of
a population parameter and test it by
examining evidence from a random sample
taken from the population.
• Since we are not testing the entire
population, we must be aware of the risk
associated with our decision.
11
Hypothesis testing
• We start by formulating two competing
hypotheses.
• We test the hypothesis that is the opposite of
the inference we want to make.
• The hypothesis we test is called the null
hypothesis (H0); the inference we want to
make is called the alternative hypothesis
(H1).
12
Example 1
Yosemite recently acquired the Acme Disintegrating
Pistol. However, after repeated attempts with the
pistol, he has been unsuccessful at destroying Bugs.
Yosemite suspects that the pistol is not delivering its
rated output of 10 megatons/shot. He has decided to
keep the pistol only if the output is over 10
megatons/shot. He takes a random sample of 100
shots and records the outputs. What null and
alternative hypotheses should Yosemite use to make
the decision?
13
Example 1 - One Sided Alternative
• Let μ denote the mean output/shot.
H0: μ ≤ 10
H1: μ > 10
• Practically
H0: μ = 10
H1: μ > 10
14
Example 2 - Two Sided Alternative
Suppose Yosemite bought a used Pistol and he
suspects that the output is not 10
megatons/shot. What should be the null and
alternative hypotheses?
H0: =10
H1: 10
15
Hypothesis Testing-Two Populations
• With one population, we are interested in
making an inference about a parameter of a
population.
• With two populations, we are interested in testing hypotheses about the parameters of the two populations.
• We want to compare the difference between
the parameters, not their actual values.
16
Example 3
Han Solo has been disappointed with the performance of his
X-Wing fighter lately. He usually finds himself trailing the
other fighters on Death Star missions. He suspects the
quality of the fuel he is getting from the neighborhood fuel
portal on his home planet of Tatooine. He decides to try the
fuel portal located on the nearby planet of Betelgeuse.
After each fill, Han checks the fighter’s logs for the time it
takes to jump to hyperspace and compares it with the logs
from the Tatooine fuel. The jump takes an average of 17.01
trilons on Tatooine fuel and 16.9 trilons on Betelgeuse fuel.
Can Han attribute this difference to fuel?
17
Example 3
• Let 1 denote the time taken to jump to
hyperspace on Tatooine fuel and 2 denote
the time taken to jump to hyperspace on
Betelgeuse fuel.
H0: 1 – 2  0
H1: 1 – 2 < 0
18
Hypothesis Testing-Two Populations
• Practically
H0: μ1 − μ2 = 0
H1: μ1 − μ2 > 0
or
H0: μ1 = μ2
H1: μ1 > μ2
19
Hypothesis testing
• We formulate hypotheses to assert that the
treatments (independent variables) will
produce an effect. We would not perform an
experiment otherwise.
• We formulate two mutually exclusive
hypotheses that cover all possible parameter
values.
20
Hypothesis testing
• The statistical hypothesis we test is called
the null hypothesis (H0). It specifies values
of a parameter, often the mean.
• If the values obtained from the treatment groups are very different from those specified by the null hypothesis, we reject H0 in favor of the alternative hypothesis (H1).
21
Hypothesis Testing-Multiple Populations
The null hypothesis usually assigns the same
value to the treatment means:
H0: 1= 2= 3= …
H1: not all s are equal
22
Hypothesis Testing-Multiple Populations
1 = 2 = 3
1  2  3
23
Hypothesis Testing-Multiple Populations
• The null hypothesis is an exact statement –
the treatment means are equal.
• The alternative hypothesis is an inexact
statement – any two treatment means may
be unequal. Nothing is said about the actual
differences between the means because we
would not need to experiment in that case.
24
Hypothesis Testing-Multiple Populations
• A decision to reject H0 suggests significant
differences in the treatment means.
• If the treatment means are reasonably close to the ones specified in H0, we do not reject H0.
• We usually cannot accept H0; we question
the experiment instead.
25
2.1 Experimental Error
• We can attribute a portion of the difference
among the treatment means to experimental
error.
• This error can result from:
– sampling
– error in entering input data
– error in recording output data
– inadequate run length
26
Experimental error
• Under the null hypothesis, we have two sources of experimental error – differences within treatment groups and differences between treatment means.
• Under the alternative hypothesis, we have
genuine differences among treatment
means. However, a false null hypothesis
does not preclude experimental error.
27
Experimental error
• A false null hypothesis implies that
treatment effects are also contributing
toward the differences in means.
28
2.2 Evaluating H0
• If we form a ratio of the two experimental errors under H0, we have:
(differences between groups) / (differences within groups)
• This can also be thought of as contrasting two experimental errors:
(experimental error) / (experimental error)
29
Evaluating H0
• Under H1, there is an additional component in the numerator:
(treatment effects + experimental error) / (experimental error)
30
3. ANOVA and the F ratio
• To evaluate the null hypothesis, it is
necessary to transform the between- and
within-group differences into variances.
• The statistical analysis involving the
comparison of variances is called the
analysis of variance.
31
ANOVA and the F ratio
variance = (sum of squared deviations from the mean) / (degrees of freedom) = SS / df = MS

• Degrees of freedom is approximately the number of observations with independent information; that is, variance is roughly an average of the squared deviations.
32
3.1 The F Ratio
F = MS between / MS within = MS treatment / MS error
• Under H0, we expect the F ratio to be
approximately 1.
• Under H1, we expect the F ratio to be greater than
1.
33
Typical data for 1-way ANOVA
Treatment level   Observations
1                 y11, y12, …, y1n
2                 y21, y22, …, y2n
⋮                 ⋮
l                 yl1, yl2, …, yln
34
ANOVA table
Source of variation   Sum of squares   df      Mean square   F
Treatment             SSt              l − 1   MSt           MSt/MSe
Error                 SSe              N − l   MSe
Total                 SST              N − 1

l - treatment levels
N - total number of observations
35
Computational formulas
yi. = Σj yij  (sum over j = 1,…,n),    ȳi. = yi./n,  i = 1,…,l

y.. = Σi Σj yij,    ȳ.. = y../N

SST = Σi Σj yij² − y..²/N

SSt = Σi (yi.²/n) − y..²/N
36
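As a check on the computational formulas above, the sums of squares and the F ratio can be scripted directly. A minimal Python sketch (Python, NumPy, and SciPy are not part of the original slides, and the data values are made up for illustration):

```python
import numpy as np
from scipy import stats

# One treatment level per row; hypothetical data, l = 3 levels, n = 4 observations each
y = np.array([[12.0, 14.0, 11.0, 13.0],
              [15.0, 17.0, 16.0, 14.0],
              [10.0,  9.0, 11.0, 12.0]])
l, n = y.shape
N = l * n

y_i = y.sum(axis=1)                         # yi. : treatment totals
y_dd = y.sum()                              # y.. : grand total

SST = (y ** 2).sum() - y_dd ** 2 / N        # total sum of squares
SSt = (y_i ** 2 / n).sum() - y_dd ** 2 / N  # treatment sum of squares
SSe = SST - SSt                             # error sum of squares

MSt = SSt / (l - 1)
MSe = SSe / (N - l)
F = MSt / MSe
p = stats.f.sf(F, l - 1, N - l)             # right-tail probability of Fobs
print(F, p)

# SciPy's built-in one-way ANOVA should give the same F and p
print(stats.f_oneway(*y))
```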
3.2 Evaluating the F ratio
• Assume we have a population of scores and
we draw at random 3 sets of 15 scores each.
• Assume the null hypothesis is true, that is, each treatment group is drawn from the same population (μ1 = μ2 = μ3).
• Assume we conduct a very large number of such experiments and compute the value of F for each case.
37
3.3 Sampling Distribution of F
• If we group the Fs according to size, we can
graph them by the frequency of occurrence.
• A frequency distribution of a statistic such
as F is called the sampling distribution of
the statistic.
38
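The thought experiment described on these two slides can be sketched as a simulation; the group count and sizes follow the slides, while the population itself (normal with arbitrary mean and spread) and the number of replications are our own choices for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
l, n, reps = 3, 15, 10_000               # 3 groups of 15 scores, many replications

fs = np.empty(reps)
for r in range(reps):
    # H0 is true: every group is drawn from the same population
    groups = rng.normal(loc=50, scale=10, size=(l, n))
    fs[r] = stats.f_oneway(*groups).statistic

# The empirical 95th percentile of the simulated Fs should be close to the tabled value
print(np.quantile(fs, 0.95), stats.f.ppf(0.95, l - 1, l * (n - 1)))
```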
[Figure: sampling distribution of F under H0]
39
Sampling distribution of F
• The graph demonstrates that the F
distribution is the sampling distribution of F
when infinitely many experiments are
conducted.
• This distribution can be determined for any
experiment, that is, any number of groups
and any number of subjects in the groups.
40
Sampling distribution of F
• The F distribution allows us to make
statements concerning how common or rare
an observed F value is. For example, only 5% of the time would we expect an Fobs ≥ 3.23.
• This is the probability that an Fobs ≥ 3.23 will occur on the basis of chance factors alone.
41
Sampling distribution of F
• We have considered the sampling
distribution of F under H0. However, we
conduct experiments expecting to find
treatment effects.
• If H0 is false, we expect that F > 1. The
sampling distribution of F under H1 is
called F'.
42
Sampling distribution of F
• We cannot plot the distribution of F' as we can with F, because the distribution of F' depends on the magnitude of the treatment effects as well as on the degrees of freedom.
43
3.4 Testing the Null Hypothesis
H0: all means are equal
H1: not all means are equal
Alternatively,
H0: there are no treatment effects
H1: there are some treatment effects
44
Testing the null hypothesis
• When we conduct an experiment, we need
to decide if the observed F is from the F
distribution or the F' distribution.
• Since we test the null hypothesis, we focus
on the F distribution.
• Theoretically, it is possible to obtain any
value of F under H0.
45
Testing the null hypothesis
• Thus, we cannot be certain that an observed
F is from the F or the F' distribution, that is,
we do not know if the sample means are
different due to chance.
• We can adopt this attitude and render the experiment useless, or we can accept the risk of sometimes rejecting the null hypothesis when it is true.
46
Testing the null hypothesis
• We select an arbitrary dividing line for any
F distribution where values of F falling
above the line are unlikely and ones falling
below the line are likely.
• If the observed F falls above the line, we
can conclude that it is incompatible with the
null hypothesis (reject H0).
47
Testing the null hypothesis
• If the observed F falls below the line, we
can conclude that it is compatible with the
null hypothesis (retain H0).
• The line conventionally divides the F
distribution so that 5% of the area under the
curve (cumulative probability) is the region
of incompatibility. This probability is called
the significance level.
48
Testing the null hypothesis
• We can choose any significance level, as
long as it’s done before the experiment.
• The formal rule is stated as:
Reject H0 when Fobs ≥ Fα(dfnum, dfdenom); otherwise retain H0
49
Testing the null hypothesis
Most software reports the probability of occurrence of Fobs. This relieves us from consulting the F tables (but not from specifying α before the test). The formal rule becomes:
If p ≤ α, reject H0; otherwise retain H0
50
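Both forms of the decision rule are easy to script. A hedged sketch using SciPy's F distribution (the observed F and the degrees of freedom below are placeholders, not values from the slides):

```python
from scipy import stats

alpha = 0.05                          # chosen before the test
F_obs, df_num, df_den = 3.4, 2, 42    # hypothetical observed values

# Critical-value form of the rule
F_crit = stats.f.ppf(1 - alpha, df_num, df_den)
reject_by_critical = F_obs >= F_crit

# p-value form of the rule
p = stats.f.sf(F_obs, df_num, df_den)
reject_by_p = p <= alpha

print(F_crit, p, reject_by_critical, reject_by_p)   # the two rules always agree
```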
3.5 Errors in Hypothesis Testing
• There are two states of reality (H0
true/false) and two decisions we may make
(reject/retain H0).
• Out of the four combinations, only two lead
to correct decisions. The other two lead to
errors.
51
Errors in hypothesis testing
Decision      Reality: H0 true     Reality: H0 false
Reject H0     Type I error         Correct decision
Retain H0     Correct decision     Type II error
52
Errors in hypothesis testing
[Figure: F distributions under H0 and under H1 with the rejection boundary; the marked regions are Type I error (α), Type II error (β), and power (1−β), with "Retain H0" to the left of the boundary and "Reject H0" to the right]
53
Errors in hypothesis testing
• Type I and type II errors are related inversely, that is, decreasing the α level increases type II error.
• The power of a statistical test is the
probability of rejecting the null hypothesis
when it is false. Thus, power is the
probability of making a correct decision.
54
Errors in hypothesis testing
• If  denotes the probability of making a
type II error, then power = 1. Thus, a
smaller  indicates more power.
• Power is an index of the sensitivity of an
experiment. A well designed experiment
should have high power so that the results
are repeatable.
55
4. Effect Size and Power
• The power of an experiment depends on
– α level
– sample size
– effect size
• While the F ratio is a measure of statistical
significance, effect size is a measure of
practical significance.
56
Effect size and power
• Effect size indicates whether the treatments
have a practical effect on the response
variables.
• Unlike the F test, effect size is not biased by
sample size.
• The F ratio will usually indicate
significance with a large sample size even
with small treatment effects.
57
Example
A researcher compares four religious groups
on their attitude toward education. Ten
items are used to assess attitude. There are
800 usable responses. The Protestants are
split into two groups for analysis purposes.
58
Example
      Prot1   Catholic   Jewish   Prot2
n     238     182        130      250
x̄     32      33         34       31
s     7.1     7.6        7.8      7.5

ANOVA indicates Fobs = 5.61, which is significant at the .001 level.
59
Effect size and power
• Thus, while the F ratio indicates a statistical association, its size does not reflect the degree of this association; that is, a large F is not necessarily better than a small one.
• The effect size provides the degree of the
statistical association.
60
4.1 Omega Squared (ω²)
ω² is a measure of effect size. It is the proportion of the population variance accounted for by the treatment. For single-factor experiments,

ω² = (l − 1)(F − 1) / [(l − 1)(F − 1) + l·n]

l = number of factor levels
n = number of observations per level
61
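A small helper implementing the formula above (the function name is ours, and the check values come from the worked example later in the deck):

```python
def omega_squared(F, l, n):
    """Effect size for a single-factor design: l levels, n observations per level."""
    num = (l - 1) * (F - 1)
    return num / (num + l * n)

# Reproduces the later worked example: F = 3.2, l = 3, n = 5 -> about .227
print(round(omega_squared(3.2, 3, 5), 3))
```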
4.2 Sample Size and Power
• We have seen that the power of an experiment depends on the α level, the effect size, and the sample size.
• The α level is conventionally fixed at .05 and the effect size is usually assumed to be large.
• This leaves the researcher with the sample
size to control power.
62
Sample size as a function of Power, ω², and α
• Four factor levels
• Sample sizes are per factor level

α = .05
                            Power
ω²     .1    .2    .3    .4    .5    .6    .7    .8    .9
.01    21    53    83   113   144   179   219   271   354
.06     5    10    14    19    24    30    36    44    57
.15     3     5     6     8    10    12    14    17    22

α = .01
ω²     .1    .2    .3    .4    .5    .6    .7    .8    .9
.01    70   116   156   194   232   274   323   385   478
.06    13    20    26    32    38    45    53    62    77
.15     6     8    11    13    15    18    20    24    29
63
4.3 Estimating Power
“Power reflects the degree to which we can
detect the treatment differences we expect
and the chances that others will be able to
duplicate our findings when they attempt to
repeat our experiments”
64
Estimating power
• If the power of an experiment is .5, it
indicates that there is only a 50% chance of
the result being duplicated.
• A well designed experiment should have a
power of at least .8.
65
Estimating power
• We can estimate the power of an experiment from Pearson-Hartley power curves by calculating ω² and another statistic (φ).
• Suppose for an experiment, Fobs = 3.2, l = 3, and n = 5. Since F.05(2,12) = 3.89, Fobs is not significant.
66
Estimating power
ω² = (l − 1)(F − 1) / [(l − 1)(F − 1) + l·n] = (3 − 1)(3.2 − 1) / [(3 − 1)(3.2 − 1) + 3(5)] = .227

φ² = [ω² / (1 − ω²)] · n = [.227 / (1 − .227)] · 5 = 1.469

φ = 1.21
67
Estimating power
• The power curves for dfnum = l − 1 = 2 and dfden = l(n − 1) = 12 indicate that power is approximately .36, which is too low.
• We can use the same equation to estimate the sample size needed to detect an effect of .227 and reject H0 at α = .05 with power = .8.
68
Estimating power
φ² = [.227 / (1 − .227)] · n = .294n, or φ = .542√n

If we try n = 12, φ = 1.88. Since dfden = 33, we can use the power curve for dfden = 30. Locating φ = 1.88 on this curve gives a power ≈ .8.
69
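The φ arithmetic used with the Pearson-Hartley charts can be scripted; reading the power off the chart itself still has to be done by eye, so this sketch (function name ours, Python assumed) only reproduces the calculations above:

```python
import math

def phi(omega_sq, n):
    """Pearson-Hartley noncentrality parameter for n observations per factor level."""
    return math.sqrt(omega_sq / (1 - omega_sq) * n)

omega_sq = 0.227
print(round(phi(omega_sq, 5), 2))    # 1.21 -> power about .36 for df 2, 12
print(round(phi(omega_sq, 12), 2))   # 1.88 -> power about .8  for df 2, 30
```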
5. Assumptions in ANOVA
• Suppose we have l levels of a factor that we
wish to compare. In the single factor case,
different levels of the factor are also called
treatments.
• The linear model underlying ANOVA states:
70
Assumptions
yij = μ + τj + εij,   i = 1,…,n;  j = 1,…,l

yij = observation i under treatment level j
μ = overall mean
τj = μj − μ = jth treatment effect
εij = yij − μj = experimental error
71
Assumptions
• Independence - The observations are
independent within and between treatment
groups.
• Normality - The observations in the
treatment groups are distributed normally.
• Homoscedasticity - The variances of the
treatment groups are equal.
72
Definitions
• Nominal α - The α level set by the experimenter. It is the percent of time one is rejecting H0 falsely when all assumptions are met.
• Actual α - The percent of time one is rejecting H0 falsely if one or more assumptions are violated.
73
5.1 Independence
The independence assumption is the most important one. A violation of this assumption can increase the actual α to 10 times the nominal α. For example, if nominal α = .05 then actual α = .5, which indicates a 50% probability of making a Type I error.
74
Independence
• Correlation in time series data can be
checked by the Durbin-Watson statistic (d).
The statistic, however, checks for first-order
autocorrelation only.
• d < 2 for positive correlation
d ≈ 2 for no correlation
d > 2 for negative correlation
75
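A minimal sketch of the statistic on a series of residuals (the residuals here are simulated and the function name is ours; Python and NumPy are assumptions, not part of the slides):

```python
import numpy as np

def durbin_watson(residuals):
    """d = sum of squared successive differences / sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(1)
e = rng.normal(size=200)        # roughly independent residuals
print(durbin_watson(e))         # should be near 2 (no first-order autocorrelation)
```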
Independence
• To control autocorrelation, we can
– decrease the level of significance
– use non-overlapping random number streams
– use batch means
76
5.2 Normality
• Normality can be checked by
– plotting the data in each treatment group
– normal probability plot
– Anderson-Darling test
– Shapiro-Wilk test
• If the distribution is skewed then we can
decrease the significance level.
77
Normality
The F-test is robust against non-normality when the sample sizes are equal, that is, actual α ≈ nominal α.
78
5.3 Homoscedasticity
• There are a number of tests for assessing
homoscedasticity. Among the more popular
are the Brown-Forsythe and the Levene
tests.
• If the data is heteroscedastic, we can
– decrease the significance level
– use a variance stabilizing transform
79
Homoscedasticity
The F-test is robust against heterogeneous variances when the sample sizes are equal, that is, actual α ≈ nominal α.
80
6. Cumulative Type I Error
• Assume we perform a 3-way ANOVA (A×B×C) and conduct all 7 tests (A, B, C, A×B, A×C, B×C, A×B×C) at α = .05.
• The probability of type I error increases with the number of tests, that is, overall α is no longer .05 for the set of tests.
81
Cumulative type I error
• Overall α for a set of tests is the probability of at least one false rejection when H0 is true.
• The Bonferroni Inequality provides an upper bound for overall α. For a 3-way ANOVA:
overall α ≤ .05 + .05 + … + .05 = .35
82
6.1 Bonferroni Inequality
• In general, if we are testing n hypotheses at levels α1,…,αn then overall α ≤ α1 + … + αn.
• If all hypotheses are tested at the same level α′ then overall α ≤ nα′.
• If the tests are independent then overall α = 1 − (1 − α′)ⁿ.
83
6.2 Bonferroni Correction
The Bonferroni Correction divides the desired overall α equally among the n tests:

α′ = (overall α) / n

Thus, each test can be conducted at the α′ significance level.
84
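A sketch of the bound and the correction for the 7 tests of the 3-way ANOVA above (plain Python arithmetic, not from the slides):

```python
n_tests = 7              # A, B, C, AxB, AxC, BxC, AxBxC
alpha_per_test = 0.05

# Bonferroni upper bound on the overall alpha, and the exact value if the tests are independent
bound = n_tests * alpha_per_test                     # 0.35
independent = 1 - (1 - alpha_per_test) ** n_tests    # about 0.30

# Bonferroni correction: per-test level that keeps the overall alpha near 0.05
alpha_prime = 0.05 / n_tests                         # about 0.0071
print(bound, round(independent, 3), round(alpha_prime, 4))
```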
6.3 Disadvantages of α-Correction
• Loss of power for detecting true differences when they exist
– Impediment in discovering new findings.
• Undue importance to overall α
– The overall α calculation assumes H0 is true. This is not the case in most experiments. Thus, the calculation overestimates the probability of committing a type I error.
85
Disadvantages of α-correction
• The (misleading) definition of overall α:
– For a set of tests it is the probability of one or more false rejections when H0 is true.
– The overall α error is produced mostly by experiments in which only one type I error has occurred.
– The instances of two or more type I errors are fairly rare and decrease with the number of errors.
86
7. Factorial Design
• The reason for conducting experiments is to
identify factors contributing to a
phenomenon.
• This can be done by focusing on a single
factor while keeping other factors constant
or by focusing on multiple factors
simultaneously.
87
Factorial design
• The issue in the latter case is whether a
particular factor studied simultaneously
with another factor will show the same
effect as it would when studied in a single
factor design, that is, do the factors interact?
• A factorial experiment is used to
simultaneously examine the effect of two or
more factors.
88
Example 1
A factorial experiment with two factors, A and B, with 2 levels each:

          B
          b1    b2
A   a1    10    20
    a2    30    40
89
Definitions
Simple effect - the simple effect of a factor is the result of the component single-factor experiment. In the Example 1 table, the rows are the simple effects of B and the columns are the simple effects of A.
Main effect - the main effect of a factor is the difference in the averages of the component single-factor experiments. The main effect of A is 20, and the main effect of B is 10.
90
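The simple and main effects for Example 1 can be checked numerically. The 2×2 array below mirrors the reconstructed table (rows a1, a2; columns b1, b2); Python and NumPy are our assumptions, not part of the slides:

```python
import numpy as np

# Rows: a1, a2; columns: b1, b2 (Example 1)
cells = np.array([[10.0, 20.0],
                  [30.0, 40.0]])

simple_A = cells[1, :] - cells[0, :]   # effect of A at b1 and at b2 -> [20, 20]
simple_B = cells[:, 1] - cells[:, 0]   # effect of B at a1 and at a2 -> [10, 10]

main_A = cells[1, :].mean() - cells[0, :].mean()   # 20
main_B = cells[:, 1].mean() - cells[:, 0].mean()   # 10
print(simple_A, simple_B, main_A, main_B)
```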
Example 2
A factorial experiment with two factors, A and B, with 2 levels each:

          B
          b1    b2
A   a1    10    20
    a2    30     5
91
7.1 Interaction
[Figure: interaction plots. Panel 1 — No Interaction (Example 1): the lines for b1 and b2 across a1 and a2 are parallel. Panel 2 — Interaction (Example 2): the lines for b1 and b2 are non-parallel.]
• In example 1, the
effect of A does not
depend on the levels
of B (parallel curves).
• In example 2, the
effect of A is not the
same for the levels of
B (non-parallel
curves).
92
Interaction
• An interaction is present when the effects of
one factor change at different levels of the
second factor.
• In most experiments, interactions are the
primary interest in the study. It is not
particularly revealing to report on the
significance of main effects.
93
Interaction
• The presence of an interaction often
requires ingenuity in explaining the
relationships in the data.
• It also requires that main effects not be
reported in isolation as they are meaningless
without the interaction information.
94
7.2 Advantages of Factorial Experiments
• More efficient than single-factor
experiments
• Necessary if interactions are present
• Allow the effects of a factor to be estimated
at several levels of other factors, yielding
conclusions that are valid over a range of
experimental conditions
95
7.2.1 Efficiency
A factorial experiment with two factors at 2 levels each:

          B
          b1      b2
A   a1    a1b1    a1b2
    a2    a2b1    a2b2
96
Efficiency
• Effects of changing A:
a2b1 − a1b1,  a2b2 − a1b2
• Effects of changing B:
a1b2 − a1b1,  a2b2 − a2b1
Thus, we have two estimates of both effects.
97
Efficiency
The equivalent single-factor experiment:

          B
          b1      b2
A   a1    a1b1    a1b2
    a2    a2b1    —
98
Efficiency
• Effect of changing A:
a2b1 − a1b1
• Effect of changing B:
a1b2 − a1b1
We have one estimate of each effect. We need
3 more observations to get two estimates
each–a total of 6 observations.
99
Efficiency
• The relative efficiency of the factorial
design to the single-factor experiment is
6/4=1.5.
• In general, with n factors each at 2 levels (a 2ⁿ design), the relative efficiency is (n + 1)/2.
100
Exercise
The following are 8 different outcomes of the
same 2-factor experiment.
• Calculate the main effects
• Plot the data to check for interactions
• Can interactions occur in the absence of
main effects?
101
Exercise
         Outcome 1    Outcome 2    Outcome 3    Outcome 4
         a1    a2     a1    a2     a1    a2     a1    a2
b1        5     5      4     6      7     7      6     8
b2        5     5      4     6      3     3      2     4

         Outcome 5    Outcome 6    Outcome 7    Outcome 8
         a1    a2     a1    a2     a1    a2     a1    a2
b1        6     4      5     5      8     6      7     7
b2        4     6      3     7      2     4      1     5
102
Typical data for 2-way ANOVA
                            Factor B
              1                     2                    …    b
Factor A  1   y111, y112, …, y11n   y121, y122, …, y12n   …    y1b1, y1b2, …, y1bn
          2   y211, y212, …, y21n   y221, y222, …, y22n   …    y2b1, y2b2, …, y2bn
          ⋮
          a   ya11, ya12, …, ya1n   ya21, ya22, …, ya2n   …    yab1, yab2, …, yabn
103
7.3 Linear Model
yijk = μ + αi + βj + (αβ)ij + εijk
where
μ = overall mean
αi = average treatment effect at level ai
βj = average treatment effect at level bj
(αβ)ij = interaction effect at cell aibj
εijk = experimental error
104
Linear model
A main effect:      H0: All αi = 0;       H1: Not all αi = 0
B main effect:      H0: All βj = 0;       H1: Not all βj = 0
A×B interaction:    H0: All (αβ)ij = 0;   H1: Not all (αβ)ij = 0
105
7.4 Types of Factors
Fixed - The levels are selected specifically
and inferences are confined to these levels.
Random - The levels are selected arbitrarily
from a population of levels and inferences
extend to the sampled populations of the
levels (always Qualitative).
106
Types of factors
Quantitative - Numeric and ordered (always
Fixed)
Qualitative - Categorical, non-numeric, or
numeric but unordered
107
8. Statistical Inference
• Statistical inference deals with inferring the
characteristics of a population by examining
a sample.
• Every sample will have an associated
sampling error because a sample is a subset
of the population.
• Sampling error decreases as the sample size
is increased.
108
Statistical inference
• We can estimate a population parameter, such as the mean (μ), with a sample statistic, such as the sample mean (x̄), or we can make an inference about the interval in which this mean falls.
109
8.1 Sampling Distribution of the Mean
Suppose we have a large rabbit population and we select all possible samples of 100 rabbits and calculate the mean of each sample. These sample means form a distribution called the sampling distribution of the mean (x̄).
110
8.2 Central Limit Theorem
If x has a distribution with mean μ and standard deviation σ, then the sampling distribution of the mean will approach the normal distribution as the size of the sample is increased. Its mean will be μ and its standard deviation σ/√n.
111
8.3 Standard Error of the Mean
• The term σx̄ = σ/√n is called the standard error of the mean because it measures the sampling error.
• μ ± 1.96 σx̄ covers approximately the middle 95% of the total possible sample means.
112
8.4 Interval Estimation of the Mean
• In practice, we don’t have the resources to
select all possible samples.
• We usually have one sample from which to draw conclusions about the population.
113
Interval estimation of the mean
• We use the standard deviation of the sample (s) as an estimate for σ and substitute it in the formula for the standard error. Thus, sx̄ = s/√n.
• It is assumed that the sample size n ≥ 30.
114
Example
A sample of 100 rabbits is selected at random
from the Vogon forest and the mean weight
is computed to be 10 lb, with a standard
deviation of 1 lb.
sx̄ = s/√n = 1/√100 = .1
x̄ ± 1.96 sx̄ = 10 ± .196
Thus, a 95% CI around the mean weight is (9.804, 10.196).
115
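The example reduces to a few lines of arithmetic; a minimal Python sketch (the values come from the rabbit example above):

```python
import math

n, xbar, s = 100, 10.0, 1.0
se = s / math.sqrt(n)            # standard error of the mean = 0.1
half_width = 1.96 * se           # 0.196
ci = (xbar - half_width, xbar + half_width)
print(ci)                        # (9.804, 10.196)
```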
Interval estimation of the mean
[Figure: confidence intervals from repeated samples (x̄1, x̄2, x̄3, x̄4) plotted around μ; 95% of the intervals contain μ]
116
9. Linear Regression
• In many problems there are two or more
variables that are related and it may be
useful to quantify this relationship.
• Regression analysis is a statistical technique
for modeling the relationship between two
variables for prediction or optimization.
117
Linear regression
• In general, there is a single dependent or
response variable y related to n independent
or regressor variables x1, x2,…,xn, under the
control of the experimenter.
• The relationship between these variables is
described by a mathematical model called a
regression equation of the form
y = f(x1, x2,…,xn), where f is unknown.
118
Linear regression
• In simple regression there is only one
independent variable; in multiple regression
there are many independent variables.
• “Linear” implies that the relationship between the dependent and the independent variables is linear. Since this is a restrictive condition, polynomial regression can be used to model non-linear relationships.
119
9.1 Least Squares
• Assume that we have n pairs of
observations, (x1,y1),…,(xn,yn) and the
relationship between y and x is a straight
line.
• Therefore, each observation can be described by the model yi = β0 + β1xi + εi, where εi is a random error distributed normally with mean zero and variance σ².
120
Least squares
• The i are also assumed to be independent.
• The i capture the influence of omitted
variables, measurement errors, and random
factors on y.
•  is called the random error term because it
disturbs what would be otherwise an exact
relationship between x and y.
121
Least squares
• The assumptions imply that:
– yi ~ N(β0 + β1xi, σ)
– yi and yj are independent
• The method that is used to estimate β0 and β1 from the observations (xi, yi) is called least squares.
122
Reality
[Figure: the true relationship f(x), with a distribution of y values at each of x1, x2, x3, x4]
123
Assumptions
[Figure: under the model assumptions, the y distributions at x1, x2, x3, x4 are normal with equal variance, centered on f(x)]
124
Least squares
Minimize Σei², where ei = yi − ŷi

[Figure: an observation yi, its fitted value ŷi, and the residual ei about the fitted line]

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
   SSt     =     SSr     +     SSe
125
Least squares
Assume we estimate the model
y = β0 + β1x + ε  by  ŷ = β̂0 + β̂1x

β̂1 = Sxy / Sxx,    β̂0 = ȳ − β̂1x̄

where
Sxx = Σx² − (Σx)²/n
Sxy = Σxy − (Σx)(Σy)/n
126
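A sketch of these computational formulas on a small made-up sample (the data, and the cross-check against NumPy's polynomial fit, are ours):

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sxx = np.sum(x ** 2) - np.sum(x) ** 2 / n
Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n

b1 = Sxy / Sxx                  # slope estimate
b0 = y.mean() - b1 * x.mean()   # intercept estimate
print(b0, b1)

# np.polyfit fits the same least-squares line
print(np.polyfit(x, y, 1))      # [slope, intercept]
```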
9.2 Linear Regression - Problem
• Given a probabilistic relationship, we
cannot estimate the exact value of a
dependent variable solely from the value of
an independent variable.
• We also require values of ε, which are unobservable.
127
Linear Regression - Problem
• At any given setting of the independent
variable, there is a sub-population of values
of the dependent variable—we don’t know
the actual value.
• A compromise is to determine the average
value of the dependent variable for a given
value of the independent variable.
128
Linear Regression - Problem
• This average is called the conditional mean of y (μy|x).
• Forecasting on the basis of the conditional
mean is more accurate than on the basis of
the unconditional mean.
129
9.3 Hypothesis Testing
• We can formulate hypotheses to test the
significance of regression.
H0: 1=0
H1: 10
• 1 represents the expected change in y for a
unit change in x.
130
Hypothesis testing
• Failing to reject H0 is equivalent to
concluding that the relationship between x
and y does not have significant slope.
• This may imply that either x is of little value
in explaining the variation in y or that the
relationship between x and y is not linear.
• Alternatively, if H0 is rejected then x is of
some value in explaining the variation in y.
131
Hypothesis testing
[Figure: fitted regression lines — a near-zero slope (H0 not rejected) and a clearly non-zero slope (H0 rejected)]
132
ANOVA for testing significance of regression
Source of variation   Sum of squares   df      Mean square   F
Regression            SSr              1       MSr           MSr/MSe
Error                 SSe              n − 2   MSe
Total                 SSt              n − 1

SSr = β̂1Sxy,   SSe = SSt − SSr,   SSt = Σy² − (Σy)²/n

Reject H0 if Fobs > Fα,1,n−2
133
9.4 Interval Estimation
• We can construct a confidence interval for the average value of y for a given x. This is also called a CI about the regression line.
• A 100(1−α)% CI about the regression line at x is given by

ŷ ± tα/2,n−2 √[ MSe (1/n + (x − x̄)²/Sxx) ]
134
Interval estimation
• A 100(1−α)% CI about μy|x will enclose μy|x 100(1−α)% of the time if the experiment is conducted 100 times. For example, a 90% CI about μy|x will enclose μy|x 9 times out of 10.
135
9.5 Prediction
• We can construct a prediction interval for the actual value of y for a given value of x.
• A 100(1−α)% PI about the regression line at x is given by

ŷ ± tα/2,n−2 √[ MSe (1 + 1/n + (x − x̄)²/Sxx) ]
136
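Both intervals can be sketched at a given x, continuing the small made-up sample used in the least-squares sketch (the point x0 and the 95% level are illustrative choices):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)               # slope, intercept
resid = y - (b0 + b1 * x)
MSe = np.sum(resid ** 2) / (n - 2)
Sxx = np.sum((x - x.mean()) ** 2)

x0 = 3.5                                   # point at which to estimate/predict
y0 = b0 + b1 * x0
t = stats.t.ppf(0.975, n - 2)              # 95% two-sided

ci_half = t * np.sqrt(MSe * (1 / n + (x0 - x.mean()) ** 2 / Sxx))
pi_half = t * np.sqrt(MSe * (1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx))
print((y0 - ci_half, y0 + ci_half))        # CI for the average value of y at x0
print((y0 - pi_half, y0 + pi_half))        # PI for an actual value of y at x0
```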
9.6 Interval Estimation and Prediction
[Figure: confidence (CI) and prediction (PI) bands about the fitted regression line; the PI band lies outside the CI band]
137
Interval estimation and prediction
• A CI is for estimating the average value of y; a PI is for predicting the actual value of y.
• A CI is constructed about parameters.
• A PI is constructed about variables.
• A PI is always wider than the corresponding CI because it concerns a quantity that varies, unlike the average value of y, which is constant.
138
Interval estimation and prediction
• The confidence bands widen at the
boundaries of the regression line indicating
that we should not extrapolate the average
value of y.
139
9.7 Assumptions in Linear Regression
• We require the following assumptions when
fitting a regression model:
– εi ~ NID(0, σ)
– the relation between x and y is linear
140
Assumptions
• These assumptions can be checked by
analyzing the residuals (error terms).
– the normality assumption can be checked by
plotting the residuals on normal probability
paper
– the independence and constant variance
assumptions can be checked by plotting the
residuals against the predicted values
141
Assumptions
[Figure: residual-versus-predicted plot patterns — normal, heteroscedastic, error in calculation, curvilinear]
142
9.8 Transformations
• In some situations, there is a need to
transform the dependent or the independent
variable to linearize the relationship.
• The transformation depends on the
curvature of the scatterplot.
143
Selecting a transformation
[Figure: four scatterplot curvatures with candidate linearizing transformations —
(1) x², x³, log y, 1/y;
(2) log x, 1/x, y², y³;
(3) log x, 1/x, log y, 1/y;
(4) x², x³, y², y³]
144
Transformations
• If we replace x by log x, the regression model is:
y = β0 + β1z + ε, where z = log x
• If we replace y by log y, the regression model is:
log y = β0 + β1x + ε
145
9.9 Correlation
• Correlation analysis allows us to measure
the strength of the relationship between the
two variables.
• There are two correlation measures:
– coefficient of correlation (r)
– coefficient of determination (r2)
146
9.9.1 Coefficient of Correlation
r = Sxy / √(Sxx Syy)

• −1 ≤ r ≤ 1
• r > 0 indicates a positive linear relationship
• r < 0 indicates a negative linear relationship
• r = 0 indicates no linear relationship
147
Coefficient of correlation
[Figure: scatterplots illustrating r > 0, r < 0, and two r = 0 cases]
148
9.9.2 Coefficient of Determination
r² = SSr / SSt = explained variation / total variation

r² accounts for the proportion of the variation in the y values that is explained by the x variable.
149
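A sketch computing both correlation measures from the same sums of squares, reusing the earlier made-up sample and cross-checking against NumPy's correlation function:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sxx = np.sum(x ** 2) - np.sum(x) ** 2 / n
Syy = np.sum(y ** 2) - np.sum(y) ** 2 / n
Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n

r = Sxy / np.sqrt(Sxx * Syy)    # coefficient of correlation
r2 = r ** 2                     # coefficient of determination
print(r, r2, np.corrcoef(x, y)[0, 1])   # np.corrcoef should agree with r
```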
9.10 Common Errors and Limitations
• Estimates from a regression equation should
not be made beyond the range of the
original observations.
• Correlation analysis does not indicate a
cause-and-effect relationship – r2 indicates
the proportion of explained variation if there
is a causal relationship. It is not necessarily
the percentage variation in y caused by x.
150
Common errors and limitations
• The correlation coefficient should not be
interpreted as a percentage.
• It is possible to omit the intercept (β0) from the model so that y = β1x + ε. This is a strong assumption and it implies that y = 0 when x = 0. A model with the intercept usually gives a better fit.
151
10. Multiple Regression
In some situations, simple linear regression is not adequate for describing the relationship between the dependent and the independent variable:
– the relationship may involve many independent
variables (not “simple”)
– the relationship may not be a straight line (not
“linear”)
152
Multiple regression
• However, most of the linear regression
concepts apply to multiple regression.
• In multiple regression the dependent
variable relates to a set of independent
variables:
y = β0 + β1x1 + β2x2 + … + βnxn + ε
153
Multiple regression
• If interactions are present in the model then:
y = β0 + β1x1 + β2x2 + β3x1x2 + ε
• The (β1,…,βn) are called the partial regression coefficients or partial slopes. βi (i > 0) represents the expected change in y for a unit change in xi holding all other xs constant.
154
10.1 Issues Common to LR and MR
• Least squares
• Hypothesis testing
• Interval estimation
• Assumptions
• Multiple correlation (r)
155
10.2 Multicollinearity
• To maximize r in MR, we are interested in
finding predictors that are correlated
significantly with the dependent variable
and uncorrelated with each other.
• This allows each predictor to explain
different components of the variance on the
dependent variable.
156
Multicollinearity
• In many cases, the predictor variables will
be correlated with each other to some
degree. Thus, we should choose variables
which are correlated minimally with each
other.
• Multicollinearity occurs when the predictors
are correlated with each other.
157
10.2.1 Correlation Matrix
Which of the following will have the smallest
and the largest multiple correlation?
      x1   x2   x3
y     .2   .1   .3
x1         .5   .4
x2              .6

      x1   x2   x3
y     .6   .5   .7
x1         .2   .3
x2              .2

      x1   x2   x3
y     .6   .7   .7
x1         .7   .6
x2              .8
158
Multicollinearity
• Multicollinearity:
– limits r severely because the predictors are
going after much of the same variance on y.
– may undermine the importance of a given
predictor because the effects of the predictors
are confounded due to correlations among them
– increases the variance of the regression
coefficients
159
Multicollinearity
• Multicollinearity can be diagnosed by
examining the:
– correlation matrix
– variance inflation factors
• Multicollinearity can be combated by
– combining correlated predictors
– dropping predictors
– adding data
160
10.2.2 Variance Inflation Factor
• The VIF for a predictor i indicates if there
is a strong linear association between it and
the remaining predictors.
• A VIF above 10 is cause for concern.
161
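A minimal VIF sketch: regress each predictor on the others and convert the resulting R². The predictor matrix here is simulated, with the third column deliberately built to be nearly collinear with the first two; the helper function is ours, not a library routine:

```python
import numpy as np

def vif(X):
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing column i on the rest."""
    X = np.asarray(X, dtype=float)
    out = []
    for i in range(X.shape[1]):
        yi = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(len(yi)), others])      # add an intercept column
        coef, *_ = np.linalg.lstsq(A, yi, rcond=None)
        resid = yi - A @ coef
        r2 = 1 - resid @ resid / np.sum((yi - yi.mean()) ** 2)
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(2)
x0, x1 = rng.normal(size=100), rng.normal(size=100)
x2 = x0 + x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x0 and x1
print(vif(np.column_stack([x0, x1, x2])))        # the VIFs should be well above 10
```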
10.3 Model Selection
There are many methods available for
selecting a good set of predictors. Most are
sequential, that is, they examine the
contribution of a predictor toward the
variance on y while holding the effects of
other predictors constant.
162
10.3.1 Forward Selection
• Enter first the variable with the largest simple correlation with y.
• If this predictor is significant then consider
the variable with the largest semipartial
correlation with y.
• Repeat until a variable does not make a
significant contribution to the prediction.
163
10.3.2 Backward Selection
• Compute a model with all variables and
calculate the partial F for every variable as
if it were the last variable to enter the
model.
• Compare the smallest partial F to a given
significance value and remove the
corresponding variable if necessary.
• Repeat until all variables are significant.
164
10.3.3 Stepwise Selection
• Similar to Forward Selection except that the
significance of each variable is assessed at
every stage.
• In Forward Selection a variable stays in the
equation upon entering. This is not the case
in Stepwise Selection.
165
10.4 Under/Overfitting
• It is important to not underfit (important
variables left out) or overfit (include
variables that make marginal or no
contribution) a model.
• Mallows' Cp is a criterion that aids in selecting a model with the correct number of predictors.
• For a correctly specified model, Cp ≈ p.
166
10.5 Model Validation
• It is important to determine how well the
equation will predict on a given data
sample. The following are three forms of
model validation:
– data splitting
– adjusted R2
– PRESS statistic
167
Model validation
• Data splitting - Split the data in half. Derive
the model from this half and validate it on
the other half.
• Adjusted R2 - Compute an R2 adjusted for
the number of variables in the model.
168
Model validation
• PRESS statistic - The prediction error for observation i is computed from the equation derived from the remaining n−1 data points. Thus this statistic performs n validations, each with sample size n−1.
169
10.6 Common Errors and Limitations
• The magnitude of a partial regression
coefficient does not indicate the importance
of the corresponding variable.
• r can be brought close to 1 by continuously
adding variables with marginal contribution.
• There should be at least 15 observations per
variable.
170
References
Hines WW and DC Montgomery, Probability and Statistics in Engineering and
Management, 2nd ed, John Wiley & Sons, NY, 1980
Keppel G, Design and Analysis: A Researcher’s Handbook, 3rd ed, Prentice Hall,
NJ, 1991
Montgomery DC, Design and Analysis of Experiments, 3rd ed, John Wiley &
Sons, NY, 1991
Ott L, An Introduction to Statistical Methods and Data Analysis, 2nd ed, Duxbury
Press, MA, 1984
Sanders D, Statistics: A Fresh Approach, 4th ed, McGraw-Hill, NY, 1990
Stevens J, Applied Multivariate Statistics for the Social Sciences, 3rd ed,
Lawrence Erlbaum Associates, NJ, 1996
Vaidyanathan R and G Vaidyanathan, College Business Statistics with Canadian
Applications, 2nd ed, McGraw-Hill Ryerson, ON, 1992
171