Chapter 16
Understanding Relationships - Numerical Data Part 2
Created by Kathy Fritz
The Simple Linear Regression Model
You might convert x = temperature in degrees centigrade to y = temperature in degrees Fahrenheit using
$y = \frac{9}{5}x + 32$
Suppose you want to convert 20°C into Fahrenheit.
20°C = 68°F
[Figure: line graph of temperature in Fahrenheit versus temperature in centigrade]
This is a deterministic relationship. The value of the independent variable (centigrade temperature) is all that is needed to determine the value of the dependent variable (Fahrenheit temperature).
Now suppose we were to investigate the relationship between y = the first-year college grade point average and x = high school grade point average.
Is the first-year college grade point average determined solely by the high school grade point average? Explain.
The first-year college grade point average and the high school grade point average do NOT have a deterministic relationship.
A description of the relationship between two variables that are not deterministically related can be given by a probabilistic model.
The equation for a probabilistic model is:
y = deterministic function of x + random deviation
  = f(x) + e
where e is an "error" variable.
The simple linear regression model assumes that there is a line with y-intercept $\alpha$ and slope $\beta$, called the population regression line.
When a value of the independent variable x is fixed and an observation on the dependent variable y is made,
$y = \alpha + \beta x + e$
Without the random deviation e in the equation, all observed (x, y) points would fall exactly on the population regression line.
[Figure: population regression line (slope $\beta$) with observed points at x1 and x2 and their deviations e1 and e2 from the line]
Basic Assumptions of the Simple Linear Regression Model
Before you actually observe a value of y for any particular value of x, you are uncertain about the value of e (random deviation from the regression line). It could be positive, negative, or even 0.
The linear regression model makes some assumptions about the distribution of e at any particular x value in the population.
1. The distribution of e at any particular value of x is normal.
2. The distribution of e at any particular x value has mean value 0. That is, $\mu_e = 0$. The values of e can be negative or positive, and the model assumes that at any particular x value they balance out on average.
3. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by $\sigma_e$.
4. The random deviations e1, e2, . . ., en associated with different observations are independent of one another.
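The four assumptions can be made concrete with a small simulation. The following is a minimal sketch, not part of the original slides: the population values `ALPHA`, `BETA`, and `SIGMA_E` and the three x values are hypothetical, chosen only for illustration.

```python
import random
import statistics

def simulate_observations(alpha, beta, sigma_e, x_values, seed=0):
    """Draw one y value at each x from y = alpha + beta*x + e."""
    rng = random.Random(seed)
    observations = []
    for x in x_values:
        # Assumptions 1-3: e is normal with mean 0 and the same standard
        # deviation at every x. Assumption 4: each draw is independent.
        e = rng.gauss(0.0, sigma_e)
        observations.append((x, alpha + beta * x + e, e))
    return observations

# Hypothetical population values (assumptions for illustration only).
ALPHA, BETA, SIGMA_E = 10.0, 2.0, 3.0
x_values = [1, 2, 3] * 2000
sample = simulate_observations(ALPHA, BETA, SIGMA_E, x_values)

# At each fixed x, the y values should center near alpha + beta*x
# and have a standard deviation near sigma_e.
for x_star in (1, 2, 3):
    ys = [y for x, y, _ in sample if x == x_star]
    print(x_star, round(statistics.mean(ys), 2), round(statistics.stdev(ys), 2))
```

With many simulated observations at each x, the printed means should sit close to the line and the printed standard deviations should all be close to the same value.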
The population regression line passes through the means of the y values. Thus the slope $\beta$ is the mean or expected change in y associated with a 1-unit increase in x.
The mean of y values at a fixed value x* is $\alpha + \beta x^*$.
$\sigma_e$ is the same for any particular value of x. The standard deviation of y for any fixed x value x* is also $\sigma_e$.
Just as there is variability in the values of e at any particular value of x, there is also variability in the y values.
[Figure: population regression line with the distributions of y centered at $\alpha + \beta x_1$, $\alpha + \beta x_2$, and $\alpha + \beta x_3$ above x1, x2, and x3]
Another look at $\sigma_e$
The smaller $\sigma_e$, the closer the points are to the regression line.
The larger $\sigma_e$, the farther the points are from the regression line.
The estimates of the slope and the y intercept of the population regression line are the slope and y intercept, respectively, of the least squares line, $\hat{y} = a + bx$.
$b = \text{estimate of } \beta = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$
$a = \text{estimate of } \alpha = \bar{y} - b\bar{x}$
The values of a and b are usually obtained using statistical software or a graphing calculator.
Let x* denote a specified value of the independent variable x. Then a + bx* has two different interpretations:
1. It is a point estimate of the mean y value when x = x*.
2. It is a point prediction of an individual y value to be observed when x = x*.
Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females.
Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother's age for babies born to young mothers.
The following data are on x = maternal age (in years) and y = birth weight of baby (in grams).
x    15    17    18    15    16    19    17    16    18    19
y  2289  3393  3271  2648  2897  3327  2970  2535  3138  3573
Sketch a scatterplot of these data.
[Figure: scatterplot of Baby's Weight (g) versus Mother's Age (yrs)]
The scatterplot shows a linear pattern and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.
Birth Weight Continued . . .
The following data are on x = maternal age (in years) and y = birth weight of baby (in grams).
x    15    17    18    15    16    19    17    16    18    19
y  2289  3393  3271  2648  2897  3327  2970  2535  3138  3573
$\hat{y} = -1163.45 + 245.15x$
The mean weight of babies increases by approximately 245.15 grams for each increase of 1 year in the mother's age.
What is the point estimate for the mean weight of babies born to 18-year-old mothers?
$\hat{y} = -1163.45 + 245.15(18) = 3249.25$ grams
This is the point estimate for the mean weight of all babies born to 18-year-old mothers. This is also the prediction of the weight of a single baby born to a mother 18 years of age.
Beware of the danger of extrapolation. That is, be careful when trying to make an estimate or prediction for any x value much outside the range of the observed x values in the data.
[Figure: scatterplot of Baby's Weight (g) versus Mother's Age (yrs) with the least squares line]
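The least squares formulas for a and b can be checked directly against the birth weight data. This is a minimal sketch; the function name `least_squares_line` and the variable names are illustrative choices, not something from the slides.

```python
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]                          # x, years
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]   # y, grams

def least_squares_line(x, y):
    """Return (a, b), the intercept and slope of the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # estimate of the population slope
    a = y_bar - b * x_bar    # estimate of the population intercept
    return a, b

a, b = least_squares_line(ages, weights)
estimate_at_18 = a + b * 18  # point estimate of mean weight and point prediction

print(f"y-hat = {a:.2f} + {b:.2f} x")
print(f"estimate at x = 18: {estimate_at_18:.2f} grams")
```

The same number, a + b(18), serves both as the estimate of the mean weight for all 18-year-old mothers and as the prediction for a single baby.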
The statistic for estimating the variance $\sigma_e^2$ is
$s_e^2 = \frac{SSResid}{n - 2}$
where
$SSResid = \sum (y - \hat{y})^2$
The subscript "e" is a reminder that you are estimating the variance of the "errors" or residuals.
The estimate of $\sigma_e$ is the estimated standard deviation
$s_e = \sqrt{s_e^2}$
The value of $s_e$, the estimated standard deviation about the population regression line, is interpreted as the typical amount by which an observation deviates from the population regression line.
Note that the degrees of freedom associated with estimating $\sigma_e^2$ or $\sigma_e$ in simple linear regression is
df = n - 2
Recall, the coefficient of determination, $r^2$, is the proportion of variability in y that can be explained by the approximate linear relationship between x and y.
How do we know if the estimated regression equation will be a useful model for predicting y values from x?
The residual plot and the values of $s_e$ and $r^2$ can be used to determine the estimated regression equation's usefulness.
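As a rough illustration of how $s_e$ and $r^2$ are computed, the sketch below reuses the birth weight data from earlier in the chapter (the slides do not report these two values for that example, so the output is a computed illustration rather than a quoted result). The function name `fit_summary` is hypothetical.

```python
import math

ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]                          # x, years
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]   # y, grams

def fit_summary(x, y):
    """Return (s_e, r_squared) for the least squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a = y_bar - b * x_bar
    # SSResid: squared deviations of y from the fitted line.
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares: squared deviations of y from its mean.
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    s_e = math.sqrt(ss_resid / (n - 2))   # df = n - 2
    r_squared = 1 - ss_resid / ss_total
    return s_e, r_squared

s_e, r_squared = fit_summary(ages, weights)
print(f"s_e = {s_e:.1f} grams, r^2 = {r_squared:.3f}")
```

A small $s_e$ relative to the y values and an $r^2$ near 1 both point toward a useful line, but neither replaces a look at the residual plot.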
Wildlife biologists monitor the ecological health of the Rocky Mountain elk. Direct measurement of elk weights is difficult and expensive because of the equipment, manpower, and time required.
Biologists found that they could reliably estimate the
weight of an elk by measuring the chest girth and then
using linear regression to estimate the weight. They
measured the chest girth and weight of 19 Rocky Mountain
elk.
There appears to be a
strong positive linear
relationship between the
chest girth and weight
of elk.
Elk Weight Problem Continued . . .
Partial Minitab regression output is shown below.
The regression equation is
Weight = -136 + 2.81 Girth
Predictor     Coef  SE Coef      T      P
Constant   -135.51    35.75  -3.79  0.001
Girth       2.8063   0.2686  10.45  0.000
S = 23.6626   R-Sq = 86.5%   R-Sq(adj) = 85.7%
Weight = -136 + 2.81 Girth is the estimated regression equation.
Approximately 86.5% of the observed variation in elk weight can be attributed to the linear relationship between weight and chest girth.
The magnitude of a typical deviation from the least squares line is about 23.6626 kg, which is relatively small in comparison to the y values (shown in the scatterplot).
Inferences Concerning the Slope of the Population Regression Line
Since $\beta$ is almost always unknown, it must be estimated from independently selected observations. The slope b of the least squares line gives a point estimate for $\beta$.
Properties of the Sampling Distribution of b
When the four basic assumptions of the simple linear regression model are satisfied, the following statements are true:
1. The mean value of b is $\beta$. That is, $\mu_b = \beta$, so the sampling distribution of b is centered at the value of $\beta$.
2. The standard deviation of the statistic b is
$\sigma_b = \frac{\sigma_e}{\sqrt{\sum (x - \bar{x})^2}}$
3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).
Since $\sigma_b$ is usually unknown, the estimated standard deviation of the statistic b is
$s_b = \frac{s_e}{\sqrt{\sum (x - \bar{x})^2}}$
When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable
$t = \frac{b - \beta}{s_b}$
is the t distribution with df = (n - 2).
Confidence Interval for $\beta$
When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for $\beta$, the slope of the population regression line, has the form
$b \pm (t \text{ critical value})\, s_b$
where the t critical value is based on df = n - 2.
The dedicated work of conservationists for over 100
years has brought the bison in Yellowstone National Park
from near extinction to a herd of over 3000 animals. It
is important to monitor and manage the size of the bison
population.
Researchers have studied a number of environmental
factors to better understand the relationship between
bison reproduction and the environment. One factor
thought to influence reproduction is stress due to
accumulated snow, which makes foraging more difficult
for the pregnant bison.
Data from 1981-1997 on y = spring calf ratio (SCR) and
x = previous fall snow-water equivalent (SWE) are shown
on page 750. The researchers were interested in
estimating the mean change in spring calf ratio associated
with each additional cm in snow-water equivalent.
Bison Population Problem Continued . . .
Step 1 (Estimate):
The value of $\beta$, the mean change in spring calf ratio for each additional 1 cm of snow-water equivalent, will be estimated.
Step 2 (Method):
Because the answers to the four key questions are estimation, sample data, two numerical variables, and one sample, a confidence interval for $\beta$, the slope of the population regression line, will be considered. A 95% confidence level will be used.
Bison Population Problem Continued . . .
Step 3 (Check):
• You will need to assume that these 17 years are representative of yearly circumstances at Yellowstone and that each year's reproduction and snowfall are independent of previous years.
• A scatterplot of the data looks linear and the spread does not seem different for different values of x.
• Because the boxplot of the residuals is approximately symmetrical and there are no outliers, it is reasonable to think that the distribution of e is approximately normal.
Bison Population Problem Continued . . .
Step 4 (Calculate):
JMP regression output is shown here:
Linear Fit
SCR = 0.2606561 - 0.0136639*SWE
Summary of Fit
RSquare                  0.257644
RSquare Adj              0.208153
Root Mean Square Error   0.033513
Mean of Response         0.209412
Observations             17
Parameter Estimates
Term        Estimate   Std Error  t Ratio  Prob>|t|
Intercept   0.2606561  0.023885    10.91   <.0001*
SWE        -0.013664   0.005989    -2.28   0.0375*
In the SWE row, the Estimate is the slope b and the Std Error is $s_b$.
df = 17 - 2 = 15
The t critical value for a 95% confidence level and df = 15 is 2.13.
$b \pm (t \text{ critical value})\, s_b$
= -0.0137 ± (2.13)(0.005989)
= (-0.0265, -0.0009)
Bison Population Problem Continued . . .
Step 5 (Communicate Results):
Confidence Interval:
You can be 95% confident that the true average
change in spring calf ratio associated with an
increase of 1 cm in the snow-water equivalent is
between -0.0265 and -0.0009.
Confidence level:
The method used to construct this interval estimate
is successful in capturing the actual value of the slope
of the population regression line about 95% of the
time.
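The interval arithmetic in the bison example can be reproduced in a few lines. This is a minimal sketch using the rounded slope, the standard error, and the t critical value quoted in the example; the function name `slope_confidence_interval` is an illustrative choice.

```python
def slope_confidence_interval(b, s_b, t_critical):
    """Return (lower, upper) for b +/- (t critical value) * s_b."""
    margin = t_critical * s_b
    return b - margin, b + margin

# Values taken from the bison example: rounded slope, Std Error of the
# SWE coefficient, and the t critical value for 95% confidence with df = 15.
lower, upper = slope_confidence_interval(b=-0.0137, s_b=0.005989, t_critical=2.13)
print(f"({lower:.4f}, {upper:.4f})")
```

Because both endpoints are negative, zero is not a plausible value for the population slope at this confidence level.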
Summary of Hypothesis Tests Concerning $\beta$
Appropriate when the four basic assumptions of the simple regression model are reasonable:
1. The distribution of e at any particular x value has a mean of 0 ($\mu_e = 0$).
2. The standard deviation of e is $\sigma_e$, which does not depend on x.
3. The distribution of e at any particular x value is normal.
4. The random deviations e1, e2, ..., en associated with different observations are independent of one another.
Summary of Hypothesis Tests Concerning $\beta$ Continued . . .
When these conditions are met, the following test statistic can be used:
$t = \frac{b - \beta_0}{s_b}$
where $\beta_0$ is the hypothesized value from the null hypothesis.
Form of the null hypothesis: $H_0: \beta = \beta_0$
When the assumptions of the simple linear model are reasonable and the null hypothesis is true, the t test statistic has a t distribution with df = n - 2.
Summary of Hypothesis Tests Concerning $\beta$ Continued . . .
Associated P-Value:
When the alternative hypothesis is . . .    The P-value is . . .
$H_a: \beta > \beta_0$       area to right of t under the appropriate t curve
$H_a: \beta < \beta_0$       area to left of t under the appropriate t curve
$H_a: \beta \ne \beta_0$     2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative
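The three P-value rules in the table can be sketched in code. The version below is only an illustration under stated assumptions: it approximates areas under the t curve by numerically integrating the t density with Simpson's rule (statistical software uses more refined methods), it is meant for moderate df since `math.gamma` overflows for very large arguments, and the function names and the labels "greater", "less", and "two-sided" are hypothetical.

```python
import math

def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def area_to_right(t, df, steps=2000):
    """Approximate area to the right of t (Simpson's rule on [0, |t|])."""
    width = abs(t)
    if width == 0:
        return 0.5
    h = width / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(width, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    right_of_abs = 0.5 - total * h / 3   # area beyond |t| in one tail
    return right_of_abs if t > 0 else 1 - right_of_abs

def p_value(t, df, alternative):
    right = area_to_right(t, df)
    left = 1 - right
    if alternative == "greater":      # Ha: beta > beta_0
        return right
    if alternative == "less":         # Ha: beta < beta_0
        return left
    if alternative == "two-sided":    # Ha: beta != beta_0
        return 2 * min(right, left)
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")

# t ratio and df for the SWE slope in the bison example.
print(round(p_value(-2.28, 15, "two-sided"), 4))
```

For a negative t the smaller tail is the left one, so `2 * min(right, left)` reproduces the "2(area to the left of t)" rule, and for a positive t it reproduces "2(area to the right of t)".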
Inference for a population slope generally focuses on two questions:
(1) What are plausible values for the population slope? This question can be addressed by calculating a confidence interval.
(2) Is the population slope different from zero? This question can be answered by using a hypothesis testing procedure with a null hypothesis $H_0: \beta = 0$.
When the null hypothesis $H_0: \beta = 0$ is true, the population regression line is a horizontal line.
$y = \alpha + \beta x + e = \alpha + 0x + e = \alpha + e$
If $\beta$ is in fact equal to 0, knowledge of x will be of no use; it will have no "utility" for predicting y.
This test of $H_0: \beta = 0$ versus $H_a: \beta \ne 0$ is called the model utility test for simple linear regression.
The Model Utility Test for Simple Linear Regression
The model utility test for simple linear regression is the test of
$H_0: \beta = 0$ versus $H_a: \beta \ne 0$
The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y.
The test statistic is the t ratio:
$t = \frac{b - 0}{s_b} = \frac{b}{s_b}$
If H0 is rejected, you can conclude that the simple linear regression model is useful for predicting y.
When you hear a song on your car radio, you probably remember the title of the song, the artist, and even when the song was released. An investigator wants to study this phenomenon. He compiled a list of songs from Rolling Stone, Billboard, and Blender lists of songs plus some recent songs familiar to college students.
Twenty-three college students were then exposed to 56 clips of songs. Most of these students had had musical training, and they listened to popular music for an average of 21.7 hours per week.
After hearing three short clips from a song (only 400 ms in duration), the students were asked in what year each of the songs was released.
The accompanying data show the actual release year and the average of the release years given by the students. Is there a relationship between the judged and actual release year for these songs?
Let's perform a model utility test to answer this question.
Song Recognition Problem Continued . . .
Step 1 (Hypotheses):
$H_0: \beta = 0$
$H_a: \beta \ne 0$
where $\beta$ is the slope of the population regression line relating the judged release year to the actual release year
Step 2 (Method):
Because the answers to the four key questions are hypothesis testing, sample data, two numerical variables in a regression setting, and one sample, a hypothesis test for the slope of a population regression line will be considered. A significance level of 0.05 will be used.
Song Recognition Problem Continued . . .
Step 3 (Check):
For this example you can assume that the assumptions are
reasonable and proceed with the model utility test. (We
will see how to check if the four assumptions of the
simple linear regression model are reasonable in the next
section.)
Song Recognition Problem Continued . . .
Step 4 (Calculate):
JMP regression output is shown here:
Linear Fit
Judged Release = 1095.1525 + 0.449281*Actual Release
Summary of Fit
RSquare                  0.771
RSquare Adj              0.766759
Root Mean Square Error   3.59844
Mean of Response         1986.013
Observations             56
Parameter Estimates
Term            Estimate   Std Error  t Ratio  Prob>|t|
Intercept       1095.1525  66.07159    16.58   <.0001*
Actual Release  0.449281   0.033321    13.48   <.0001*
In the Actual Release row, the Estimate is the slope b and the Std Error is $s_b$.
$t = \frac{b - 0}{s_b} = \frac{0.449 - 0}{0.0333} = 13.48$
P-value = 2P(t > 13.48) ≈ 0
Song Recognition Problem Continued . . .
Step 5 (Communicate Results):
Because the P-value is less than the selected significance
level, the null hypothesis is rejected.
Decision: Reject H0
Conclusion:
The sample data provide convincing
evidence that there is a useful linear
relationship between the actual release
year and the judged release year.
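The test statistic in the song example can be recomputed from the JMP output. This is a minimal sketch; the critical value 2.005 is an approximate t table value for df = 54 at the 0.05 level (two-sided) and is an assumption added here, since the slides reach the decision through the P-value instead.

```python
# Values from the JMP Parameter Estimates table for the slope.
b = 0.449281      # estimated slope
s_b = 0.033321    # estimated standard deviation of b
n = 56            # number of song clips

t = (b - 0) / s_b   # model utility test statistic
df = n - 2

# Assumption: approximate two-sided 0.05 critical value for df = 54.
T_CRITICAL_APPROX = 2.005
reject_h0 = abs(t) > T_CRITICAL_APPROX

print(f"t = {t:.2f}, df = {df}, reject H0: {reject_h0}")
```

Comparing |t| with a critical value leads to the same decision as comparing the P-value with the significance level; here t is far beyond any conventional cutoff.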
Checking Model Adequacy
The simple linear regression model is
$y = \alpha + \beta x + e$
where e represents the random deviation of a y value from the population regression line $\alpha + \beta x$.
The methods, the confidence interval for the slope and the model utility test, require that some assumptions about the random deviations in the simple linear regression model be met in order for inference to be valid.
These assumptions include:
1. At any particular x value, the distribution of e is normal.
2. At any particular x value, the standard deviation of e is $\sigma_e$, which is constant over all values of x (that is, $\sigma_e$ does not depend on x).
Residual Analysis
If the deviations e1, e2, . . . , en from the population line were available, they could be examined for any inconsistencies with model assumptions.
However, these deviations are
$e_1 = y_1 - (\alpha + \beta x_1)$
⋮
$e_n = y_n - (\alpha + \beta x_n)$
These values of e can ONLY be calculated if $\alpha$ and $\beta$ are known, which is almost never the case.
Instead, diagnostic checks MUST be based on the residuals
$y_1 - \hat{y}_1 = y_1 - (a + bx_1)$
⋮
$y_n - \hat{y}_n = y_n - (a + bx_n)$
which are the deviations from the estimated regression line.
Any observation that gives a large positive or negative residual should be examined carefully for any unusual circumstances, such as a recording error or nonstandard experimental condition.
Residual Analysis
Identifying residuals with unusually large magnitudes is made easier by inspecting standardized residuals.
$\text{standardized residual} = \frac{\text{residual}}{\text{estimated standard deviation of residual}}$
Recall, $\mu_e = 0$. So, the numerator is really residual - 0.
Because residuals at different x values have different standard deviations (depending on the value of x for that observation), computing the standardized residuals can be tedious. Most statistical software will perform this calculation.
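To show why the denominator depends on x, here is a minimal sketch of one common way to standardize residuals in simple linear regression. The slides do not give the formula, so the expression used for the estimated standard deviation of a residual, $s_e\sqrt{1 - 1/n - (x_i - \bar{x})^2 / \sum (x - \bar{x})^2}$, is an assumption added here; a particular software package may define its standardized residuals differently. The data are the birth weight data from earlier in the chapter.

```python
import math

ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]                          # x, years
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]   # y, grams

def standardized_residuals(x, y):
    """Return residual / (estimated standard deviation of residual) per point."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a = y_bar - b * x_bar
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    result = []
    for xi, r in zip(x, residuals):
        # Assumed formula: points far from x_bar have residuals with
        # smaller standard deviations.
        leverage = 1 / n + (xi - x_bar) ** 2 / s_xx
        result.append(r / (s_e * math.sqrt(1 - leverage)))
    return result

z = standardized_residuals(ages, weights)
for xi, zi in zip(ages, z):
    print(xi, round(zi, 2))
```

Values well outside the band from -2 to 2 would flag observations worth examining more closely.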
Revisiting the Elk
Example 16.3 introduced data on
x = chest girth (in cm) and y = weight (in kg)
for a sample of 19 Rocky Mountain elk.
Inspection of the scatterplot suggests the data are consistent with the assumptions of the simple linear regression model.
Revisiting the Elk Continued . . .
Let's examine the residuals more closely. The data, residuals, and the standardized residuals (computed using Minitab) are given on page 761.
The largest residual = 38.1397 and the associated standardized residual = 1.81294.
The smallest residual = -38.2661 and the associated standardized residual = -1.92313.
Neither one of these standardized residuals is surprisingly large.
The boxplots of the residuals and standardized residuals are approximately symmetric with no outliers, so the assumption of normally distributed errors seems reasonable.
Notice that the boxplots of the residuals and the standardized residuals are nearly identical.
Revisiting the Elk Continued . . .
Another way to assess whether the error values are normally distributed is to look at normal probability plots of the residuals or the standardized residuals. (Only one plot is needed.)
The standardized plot is recommended, but it is acceptable to use the unstandardized residual plot if you do not have access to a computer package.
The pattern in the normal probability plots is reasonably straight, confirming that the assumption of normality of the error distribution is reasonable.
A Look at Residual Plots
This is a desirable plot in that it exhibits no pattern and has no point that lies far away from the other points.
This plot exhibits a curved pattern which indicates that the fitted model should be changed to incorporate the curvature.
In this plot, the standard deviation of the residuals increases as the x-values increase. While a straight-line model might still be appropriate, the best-fit line should be found using weighted least-squares. Consult your local statistician!
Both of these plots contain points far away from the others. These points can have substantial effects on estimates of a and b as well as other quantities.
Newborns and infants have a small trachea, and there is little margin for error when inserting tracheal tubes. Using X-rays of a large number of children ages 2 months to 14 years, researchers examined the relationships between appropriate trachea tube insertion depth and other variables such as height, weight, and age.
Below are a scatterplot and a standardized residual plot constructed using data on the insertion depth and height of children (both measured in cm).
Residual plots like the one shown here are desirable. There are no unusually large residuals since no point lies much outside the horizontal band between -2 and 2. There is no point far to the left or right of the others, and there is no pattern of curvature or difference in the variability of the residuals for different height values to indicate that the model assumptions are not reasonable.
Newborns and Infants Problem Continued . . .
But consider what happens when the relationship between insertion depth and weight is examined.
While some curvature is evident in the original scatterplot, it is even more clearly visible in the standardized residual plot.
A careful inspection of these plots suggests that along with curvature, the residuals may be more variable at larger weights.
The linear regression model is not appropriate.