Chapter 11
Using Relationships to Make
Predictions
Objectives
SWBAT:
1) Use a model to make predictions
2) Model linear relationships with a least-squares
regression line
3) Make predictions using a least-squares regression
line
4) Calculate least-squares regression lines
5) Calculate the standard deviation of the residuals
Using a Model to Make Predictions
• Is it possible to use Statistics to predict how many
wins a baseball team will have?
• This is a concept portrayed in the movie
“Moneyball.”
• Moneyball
• Bill James is an American baseball writer, historian,
and statistician who has been writing about
baseball since 1977.
• His approach to baseball is called sabermetrics and
it was mainstreamed with the Oakland Athletics, as
seen in “Moneyball.”
• James looked to investigate the relationship
between runs scored, runs allowed, and
winning percentage.
• He proposed the Pythagorean expectation of
winning percentage:
predicted winning percentage = (runs scored)^2 / [(runs scored)^2 + (runs allowed)^2]
• Let’s take a look at this formula in action by
examining the 2008 World Champion
Philadelphia Phillies. In the 2008 season, they
scored 799 runs and allowed 680 runs.
In a 162-game season, the
predicted number of wins for
the Phillies would be:
162 × [799^2 / (799^2 + 680^2)] ≈ 162 × 0.580 = 93.96
• The Phillies were predicted to win 93.96 games.
However, they only won 92 games, approximately
2 games below the prediction.
• The difference between the actual number of
wins and predicted number of wins is called a
residual.
• The residual for the 2008 Phillies is: 92 - 93.96 = -1.96
• The negative indicates that the Phillies won 1.96
fewer games than expected, based on their runs
scored and runs allowed.
• Here are the residuals
for all 2008 MLB teams
(pg 400-401).
• Remember that a
negative residual means
a team PERFORMED
below their prediction,
and a positive means a
team PERFORMED
above their prediction.
• To help remember the
order of subtraction,
think of the acronym AP:
actual - predicted
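• The Phillies calculation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name pythagorean_wins is made up for this example, and rounding the winning percentage to three decimals is an assumption chosen because it reproduces the 93.96 quoted in the slides (without rounding, the prediction comes out slightly lower, about 93.95).

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2):
    """Predicted wins from the Pythagorean expectation (hypothetical helper)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    # Assumption: round the winning percentage to 3 decimals, as the slides seem to do.
    win_pct = round(rs / (rs + ra), 3)
    return games * win_pct


predicted = pythagorean_wins(799, 680)   # 2008 Phillies: 799 scored, 680 allowed
residual = 92 - predicted                # AP: actual - predicted
print(round(predicted, 2), round(residual, 2))  # 93.96 -1.96
```

The exponent is left as a parameter so that other exponents can be tried with the same function.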
Choosing the Best Model
• Why do you think Bill James chose to use the
exponent “2” in his Pythagorean model? Why
not 3, or 1, or even 17?
• In other words, how can we pick the “best”
exponent to help us make the best predictions
(the exponent that produces the smallest
residuals)?
• Let’s examine the same formula, but replace the
exponent of 2 with a 1. Then we’ll calculate the
prediction for the Phillies:
162 × [799 / (799 + 680)] ≈ 162 × 0.540 = 87.48 wins
• Remember they actually won 92 games, so the new
residual is 92 - 87.48 = 4.52.
• The other model gave a residual of -1.96.
• The closer a residual is to 0, the
more accurate a prediction is.
The residual of 4.52 was more
than double the residual of -1.96.
This indicates that, for the
Phillies, the exponent of 2 was
more appropriate than the
exponent of 1. Did this hold true
for all teams?
• Here are the residuals for all 30
teams when using an exponent
of 1.
• At first glance, these residuals
seem larger than the initial ones.
• To make a comparison between the model with
an exponent of 2 and the model with an exponent
of 1, let’s square the residuals for each model and
then sum them up (squaring them will guarantee
that they are positive). The sum is called the sum
of squared residuals (SSR).
• For the exponent of 2:
• For the exponent of 1:
• Because the SSR is smaller when using an exponent of 2, that
makes it a better model than using an exponent of 1.
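• As a rough sketch of how a sum of squared residuals is computed, the Python below squares and sums actual - predicted values. Only the Phillies' numbers appear in these slides, so the example uses that one team; the real comparison would pass the actual and predicted wins for all 30 teams. The function name is an illustrative choice.

```python
def sum_squared_residuals(actual, predicted):
    """Sum of (actual - predicted)^2 over paired observations."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


# Only the Phillies are quoted in the slides; a full comparison needs all 30 teams.
actual_wins = [92]
ssr_exp2 = sum_squared_residuals(actual_wins, [93.96])  # prediction with exponent 2
ssr_exp1 = sum_squared_residuals(actual_wins, [87.48])  # prediction with exponent 1
print(round(ssr_exp2, 2), round(ssr_exp1, 2))  # 3.84 20.43
```

Squaring keeps every term positive, so large misses in either direction count against a model.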
The Concept of Least Squares
• In general, when statisticians compare models
to see which one makes the best predictions,
they compare the sum of squared residuals for
each model and choose the model with the
smallest sum.
• This is known as the “least-squares” method.
• The scatterplot to the right
shows the SSR using the
Pythagorean formula with
different exponents 1.0 to 3.0.
• As you can see, the best exponent to use is a 2,
since it results in the smallest SSR (the most
accurate predictions).
• When there is a linear association between two
numerical variables, we can use a linear model to
make predictions.
• Let’s see how exciting that is…
Least-Squares Regression Lines
• A least-squares regression line is used to model a
linear relationship between an explanatory
variable x and a response variable y.
• In a previous Algebra class you’ve likely talked
about “lines of best fit.” The least-squares
regression line best models an association
because it uses the smallest sum of squared
residuals.
• Let’s examine the
relationship between home
runs and runs scored for
teams in the 2008 MLB
season.
• One would think that teams
that are good at hitting
home runs are also good at
scoring runs.
• There is a moderately strong,
positive, linear relationship
between the number of home
runs a team hit and the number of
runs they scored.
• The correlation is r=0.62.
• We can use a least-squares
regression line to model this data.
• The question becomes what is the
equation for this line?
• Least-squares regression lines are expressed in
the following form: ŷ = a + bx, where ŷ is the predicted
value of y, a is the y-intercept, and b is the slope.
• Once we have the least-squares regression line, we can
use it to make predictions for specific teams.
• The 2008 Texas Rangers hit 194 home runs. Let’s use that
to predict how many runs they would score.
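• Using the least-squares regression line for these data,
predicted runs = 550 + 1.25(home runs), we predict
550 + 1.25(194) = 792.5, or approximately 793 runs.
• The Rangers actually scored 901 runs, so the residual is
901 - 792.5 = 108.5. This means the Rangers scored 108.5 more
runs than predicted, based on the number of
home runs they hit.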
• The San Diego Padres hit 154 home runs. Find
their predicted number of runs scored: 550 + 1.25(154) = 742.5 runs.
• They actually scored 637 runs. Calculate the value of
their residual, and then interpret what that means: 637 - 742.5 = -105.5.
Since the residual is negative, we know the Padres scored 105.5 fewer
runs than predicted, based on the number of home runs they hit.
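• The Rangers and Padres calculations can be checked with a short Python sketch. Two values here are inferred rather than quoted: the slope of 1.25 is backed out from the intercept of 550 and the Padres' residual of -105.5, and the Rangers' actual total of 901 runs is backed out from their residual of 108.5. Treat both as assumptions; the function name is made up for this example.

```python
INTERCEPT = 550   # constant given in the slides
SLOPE = 1.25      # assumption: inferred from the residuals quoted in the slides


def predict_runs(home_runs):
    """Predicted runs scored from home runs: y-hat = a + b*x."""
    return INTERCEPT + SLOPE * home_runs


rangers_residual = 901 - predict_runs(194)  # 901 actual runs is inferred, see lead-in
padres_residual = 637 - predict_runs(154)   # 637 actual runs is given in the slides
print(predict_runs(194), rangers_residual)  # 792.5 108.5
print(predict_runs(154), padres_residual)   # 742.5 -105.5
```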
• So what exactly do the slope and y-intercept
mean in a least-squares regression equation?
• Remember from algebra
that slope is rise over run.
• The interpretation of the
slope is the predicted
change in the y variable
when the value of the x
variable is increased by 1.
• The y-intercept (a) is the constant (where the least-squares regression line crosses the y axis).
• The y-intercept gives us the predicted value of y when
x is 0.
• Sticking with our least-squares regression line using
home runs to predict runs scored, the constant value
is 550. This means that a team with 0 home runs
would be predicted to score about 550 runs.
• Caution: No teams in 2008 hit 0 home runs, or were
anywhere near that mark. We really don’t know
anything about the relationship between home runs
and runs scored for teams with this little power.
• Extrapolation is trying to make predictions
outside of the range of the data we have.
• This can occur when we try to make a
prediction for a value much smaller than the
other values (for example 0 with the home run
and runs scored example) or much larger than
the other values (for example 4000 with the
home run and runs scored example).
• It is risky to make predictions using
extrapolation since the association between
the variables may not be the same for
extremely small or extremely large values of
the explanatory variable.
• Extrapolation also occurs when people try to
project how well an athlete or team will do in the
future based on previous PERFORMANCES.
• This is especially true early in a game or season.
• For example, suppose a running back gains 200
yards in his first game of the season. At that pace,
he would be predicted to run for 3200 yards over
a 16-game season. This would likely be inaccurate,
especially considering the single-season rushing
record is 2105 yards (Eric Dickerson).
Calculating the Least-Squares Regression Line
• There are formulas to manually calculate the least-squares
regression line. However, the TI-84 makes this calculation much
easier, so we can thank our friends at Texas Instruments for
helping us out.
• Let’s look at how we can calculate the least-squares regression line
using our 84. Trig seems to be a fan of this idea.
• To the right is 2008 MLB
data showing every
team, the number of
hits they had, and the
number of runs they
had. Let’s calculate the
least-squares regression
equation relating hits
(x) to runs (y). (pg 413)
• Like with many other calculator functions, the first thing
we need to do is enter this data into lists. Enter hits in
L1 and runs into L2.
• Now press STAT, go to CALC,
and go to option 8:
LinReg(a+bx).
• Make sure your Xlist is L1 and
your Ylist is L2. Then press
calculate.
• To view a graph of the scatterplot, press STAT PLOT
(2nd Y=), choose plot 1, turn it on, choose the first graph
type (scatterplot), and make sure Xlist is L1 and Ylist is L2.
• To graph the least-squares regression line with the scatterplot,
press Y= and enter the equation. Then press zoom 9.
• The same Correlation and Regression applet used in
the last chapter can also calculate the least-squares
regression line for us.
• www.tinyurl.com/SRISapplets
• Observed slope is b and observed intercept is a.
• Another fantastic applet is the one found here:
www.tinyurl.com/SPAapplets
• Click on Two Quantitative Variables, and then enter the
data for the explanatory and response variable.
• After pressing “begin analysis,” you can utilize other
features such as calculating correlation and calculating
the least-squares regression line.
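• For readers without a calculator or applet handy, the same kind of fit can be sketched in Python using the usual least-squares formulas: slope b = sum of (x - x̄)(y - ȳ) divided by sum of (x - x̄)^2, and intercept a = ȳ - b·x̄. The data below are made-up toy numbers that lie exactly on y = 2 + 3x, not the MLB hits and runs table, and the function name is an illustrative choice.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x-bar, y-bar)
    return a, b


# Toy data (not MLB data): points lying exactly on y = 2 + 3x
a, b = least_squares_line([1, 2, 3, 4], [5, 8, 11, 14])
print(a, b)  # 2.0 3.0
```

Entering the real hits in place of xs and runs in place of ys should give the same a and b that LinReg(a+bx) reports.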
The Standard Deviation of the Residuals
• So far we’ve looked at least-squares regression
lines to model the relationship between home
runs and runs scored, and the relationship
between hits and runs scored. Which of the two
explanatory variables, home runs or hits, does a
better job of predicting the number of runs
scored?
• In order to find out, we have to calculate the
standard deviation of the residuals (s).
• The standard deviation of the residuals (s) estimates
the typical distance between the actual values of the
response variable and their corresponding predicted
values.
• In this instance, the standard deviation of the
residuals measures the typical distance between a
team’s actual number of runs scored and their
predicted number of runs scored.
• In other words, it estimates about how far away we
should expect the points on the scatterplot to be from
the graph of the least-squares regression line.
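Let’s calculate s for our
hits and runs example (pg 416), using the formula
s = sqrt(sum of squared residuals / (n - 2)),
where n is the number of observations.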
This means that when using hits to predict runs, our predicted values
will typically be about 51.6 runs away from the actual values.
• Calculating the standard deviation of the residuals
can also be done on the TI-84.
1) Just as before, enter hits in L1, runs in L2, then hit
STAT, CALC, 8: LinReg(a+bx).
2) To display the residuals in L3, go into STAT, EDIT,
and move the cursor to the heading L3. Now
press LIST (2nd, STAT), scroll down to option 7:
RESID, and press ENTER twice. (Note: this only
works if you calculate the regression line FIRST).
3) To find the sum of the squared residuals, press STAT,
CALC, 1: 1-Var Stats, and enter L3 for the list. The value
reported as Σx² is the sum of the squared residuals.
4) Finish by substituting into the formula for s.
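• The four calculator steps above can be mirrored in Python as a rough sketch: fit the line, compute the residuals, sum their squares, and substitute into s = sqrt(SSR / (n - 2)). The toy data are made up for illustration (they are not the MLB hits and runs data), and residual_sd is a hypothetical helper name.

```python
import math


def residual_sd(xs, ys):
    """Standard deviation of the residuals for the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Step 1: least-squares slope and intercept
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    # Step 2: residuals = actual - predicted
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    # Step 3: sum of squared residuals
    ssr = sum(r ** 2 for r in residuals)
    # Step 4: substitute into the formula for s
    return math.sqrt(ssr / (n - 2))


# Toy data (made up): the fitted line is y-hat = -0.5 + 2.2x and SSR = 1.8
s = residual_sd([1, 2, 3, 4], [2, 4, 5, 9])
print(round(s, 3))  # 0.949
```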
Note: For the first applet, enter the information the same as before,
and look for the option “observed s.” This gives you the standard
deviation of the residuals. For the second applet, calculate the
least-squares regression line and s is listed.
• We still need to decide which variable (hits or home
runs) is a better predictor of runs scored.
• Let’s compare the standard deviation of the residuals
for each relationship.
• The explanatory variable that provides the smaller
standard deviation of the residuals will be the better
predictor (since the predictions will be closer to the
actual values).
• As previously calculated, the standard deviation of the
residuals using hits was 51.6 runs.
• The standard deviation of the residuals using home
runs is approximately 55.0 runs.
• Since the model using hits had a smaller s than the
model using home runs, we can state that using hits is
better than using home runs to predict runs.
The table to the right
shows a variety of
offensive variables
that can be used to
predict runs for the
2008 MLB season.
• The variables with the smallest standard deviations of
the residuals are on-base percentage (OBP) and
slugging average (SLG).
• Note: These two statistics measure the two
fundamental components of scoring runs: getting
on base and moving runners around the bases.
• Recently, a hybrid of on-base percentage and
slugging average has been proposed as a way to
evaluate a player’s overall ABILITY to help his
team score runs.
• The hybrid is called OPS and is short for on-base
percentage plus slugging average:
OPS = OBP + SLG
To the right is a scatterplot of OPS and
runs scored for all 30 teams in 2008. As
you can see, there is much less
variability from the line in this
scatterplot. The sum of the squared
residuals is only 14,872, resulting in a
standard deviation of the residuals of
just s=23.0 runs.
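• As a quick check of the OPS figures, the sketch below substitutes the reported sum of squared residuals (14,872) and n = 30 teams into s = sqrt(SSR / (n - 2)). The n - 2 divisor is the standard one for a regression line and is assumed here because it reproduces the s = 23.0 stated on the slide.

```python
import math

ssr = 14_872   # sum of squared residuals reported for OPS vs. runs
n = 30         # number of MLB teams in 2008

s = math.sqrt(ssr / (n - 2))
print(round(s, 1))  # 23.0
```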
Influential Points
• In the previous chapter, we examined the
effect that a single observation can have on
the value of correlation.
• The lesson is the same in this chapter. A single
observation can have a big effect on the
equation of the least-squares regression line,
as well as on related measures such as the
standard deviation of the residuals.
To the right is a scatterplot
showing shooting percentage
and number of wins for the
30 NBA teams in the 2008-2009 regular season.
• The Phoenix Suns made 50.4% of their shots but won only 46
games, which seems to be out of the pattern of the rest of
the teams.
• You can see two least-squares regression lines on the scatterplot,
one that includes the Suns and one that does not.
• We can see that the Suns’ PERFORMANCE had a strong influence
on the slope of the equation. Removing the Suns causes the slope
to go from 6.8 to 10.8. It also causes the standard deviation of the
residuals to go from s=11 wins to s=9 wins, and correlation to go
from r=0.65 to r=0.79.
• Occasionally, some points that seem influential
aren’t.
• Below is a scatterplot showing the relationship
between shooting percentage and average points
scored per game for the 30 NBA teams in the 2008-2009 season.
• The Warriors scored much more than they were
expected to, and seem out of the pattern of the rest
of the teams.
However, as you can see, when
the Warriors were removed
from the data, the equation of
the least-squares regression line
barely changed. The slope went
from 1.66 to 1.67 and the y-intercept went from 23.8 to
22.8.
• The change was a little bit more noticeable on the standard deviation of
the residuals (from s=3.66 points to s=3.30 points) and the correlation
(from r=0.53 to r=0.58).
• To identify observations that are potentially influential, look for points on
the scatterplot that do not follow the pattern of the rest of the data.
• Points that have especially small or large values of the explanatory
variable can be particularly influential on the equation of the least-squares
regression line, the standard deviation of the residuals, and the
correlation, whereas points that have values of the explanatory variable
near the average tend not to influence the equation of the least-squares
regression line as much.
Regression to the Mean
• In the beginning of the year, you all tried out for
the varsity coin flipping team. Now it’s time to try
out for the AAU coin flipping team.
• Whoever wants to try out, come grab a coin, flip it
ten times, and record how many heads you get.
Student
# of Heads
• Let’s have our best PERFORMING student(s)
and worst PERFORMING student(s) stand up.
• Who do you think has a better chance of
improving their PERFORMANCE if we did 10
more flips?
• WHY?????
• At the beginning of every sports season, there are
athletes who PERFORM much better than their
ABILITY.
• Oftentimes sportscasters get carried away talking
about possible records that can be broken.
• However, how likely is it that an athlete keeps
PERFORMING that much better than their ABILITY
for a continued period of time?
• Thinking about it from the other side, sometimes
athletes start off in a slump, PERFORMING worse
than their ABILITY. How likely is it that they keep
up their poor PERFORMANCE?
• Mr. Doback has a class of 26 students, and runs
the same coin flipping experiment that we just
did, except he has the students flip their coins 50
times, and he ran the experiment two days in a
row.
• On the first day, 15 students flipped 25 or more
heads, and 11 students flipped fewer than 25
heads.
• On the second day, of the 15 students who flipped
25 or more heads on day 1, only 2 of the students
improved their heads total, meaning 13 students’
totals went down. However, of the 11 students
who flipped fewer than 25 heads, 8 of them
increased their number of heads.
• Here is a scatterplot
showing the number of
heads on day 1 and the
number of heads on day
2 for the 26 students.
• The vertical line at 25 represents what the expected
number of heads would be. Any point to the left of the
line indicates a student that got less than 25 heads on
day 1, and any point to the right indicates a student that
got more than 25 heads on day 1.
• The line y=x is a reference. If a point is above the line,
then a student did better on day 2. If a point is below
the line, then a student did worse on day 2.
The graph shows that 87% of
students who did well on the
first day did worse on the second
day (ended up below y=x) and
73% of students who did
worse on the first day did
better on the second (ended
up above y=x).
• When measuring the same variable in two different
time periods, the tendency for better PERFORMANCES
to follow poor PERFORMANCES and for worse
PERFORMANCES to follow good PERFORMANCES is
called regression to the mean.
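• Mr. Doback's experiment can be imitated with a small simulation. The sketch below is an assumption-laden illustration, not his class data: it simulates 10,000 hypothetical students with fair coins, a fixed random seed, and 50 flips on each of two days, then reports what fraction of above-average day 1 students did worse on day 2 and what fraction of below-average day 1 students did better.

```python
import random


def simulate(num_students=10_000, flips=50, seed=1):
    """Simulate two days of coin flipping and measure regression to the mean."""
    rng = random.Random(seed)

    def heads():
        # Number of heads in one day's flips of a fair coin
        return sum(rng.random() < 0.5 for _ in range(flips))

    pairs = [(heads(), heads()) for _ in range(num_students)]  # (day 1, day 2)
    high = [(d1, d2) for d1, d2 in pairs if d1 > flips / 2]    # above expected on day 1
    low = [(d1, d2) for d1, d2 in pairs if d1 < flips / 2]     # below expected on day 1
    high_worse = sum(d2 < d1 for d1, d2 in high) / len(high)
    low_better = sum(d2 > d1 for d1, d2 in low) / len(low)
    return high_worse, low_better


high_worse, low_better = simulate()
print(f"above average on day 1, worse on day 2: {high_worse:.0%}")
print(f"below average on day 1, better on day 2: {low_better:.0%}")
```

Because every student has the same ABILITY (a fair coin), any day 1 difference is RANDOM CHANCE, which is why most extreme day 1 results move back toward 25 on day 2.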
Let’s look at
another example
using batting
averages of MLB
players.
• Each dot represents a player and his batting average
PERFORMANCES in 2008 and 2009.
• The vertical lines at 0.260 and 0.300 are to identify what
is commonly accepted as bad and good hitting
PERFORMANCES.
• Players above the line y=x are players that improved
from 2008 to 2009, players below the line are players
that regressed from 2008 to 2009, and players on the
line are players that kept the same average.
18 of the 24 players
that hit over 0.300 in
2008 had worse
PERFORMANCES in
2009.
• 10 of the 15 players who hit worse than 0.260 in 2008 had
better PERFORMANCES in 2009.
• The PERFORMANCES of both groups of players regressed to the
mean. In other words, players who were great in 2008
tended to be closer to the average of all players in 2009,
and players who were not so great in 2008 also tended to
be closer to the average of all players in 2009.
• Let’s look at the same
scatterplot with the
least-squares
regression line.
• The least-squares regression line predicts that players
that had bad PERFORMANCES in 2008 will improve, but
will still be below average. For example, if a player hit
0.220 in 2008, they will be predicted to hit
0.133+0.52(0.220) which is 0.247. This is still below
average, but is better than 0.220.
• Likewise, it predicts
players who had good
PERFORMANCES in
2008 will do worse in
2009, but still be
above average.
• For example, if a player hit 0.350 in 2008, they will
be predicted to hit 0.133+0.52(0.350) which is
0.315. This is still above average, but is not as
good as 0.350.
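• Both batting average predictions can be reproduced with the line from the slides, predicted 2009 average = 0.133 + 0.52(2009's predictor, the 2008 average). The break-even value computed at the end is not in the slides; it is simply where this line crosses y = x, added here to show which side of that point a prediction falls on. The function name is illustrative.

```python
def predict_2009_avg(avg_2008):
    """Predicted 2009 batting average from the 2008 average (line from the slides)."""
    return 0.133 + 0.52 * avg_2008


print(round(predict_2009_avg(0.220), 3))  # 0.247 (improves, still below average)
print(round(predict_2009_avg(0.350), 3))  # 0.315 (declines, still above average)

# Derived here, not in the slides: the average at which predicted 2009 equals 2008,
# found by solving x = 0.133 + 0.52x.
break_even = 0.133 / (1 - 0.52)
print(round(break_even, 3))  # 0.277
```

Players below the break-even value are predicted to improve and players above it are predicted to decline, which is regression to the mean expressed through the slope of 0.52.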
• In general, when the association is strong in a
scatterplot, there will be less regression to the
mean.
• Likewise, when the association is weak in a
scatterplot, there will be more regression to
the mean.
The example below shows
the strikeout rates for
pitchers in 2008 and 2009.
There is a very strong
association (r=0.76), so
there is very little
regression to the mean.
The example below shows
the winning percentage for
pitchers in 2008 and 2009.
There is a weak association
(r=0.31), so there is much
more regression to the
mean.
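• One way to see why a stronger association means less regression to the mean: when both variables are measured in standard deviations from their means, the least-squares regression line becomes predicted z_y = r × z_x. The sketch below applies that standard fact to the two correlations quoted above for a hypothetical pitcher who was 2 standard deviations above average in 2008; the pitcher and the function name are illustrative assumptions.

```python
def predicted_z(r, z_x):
    """Predicted standardized response: z_y-hat = r * z_x."""
    return r * z_x


z_2008 = 2.0  # hypothetical pitcher, 2 standard deviations above the 2008 mean

print(round(predicted_z(0.76, z_2008), 2))  # 1.52 (strikeout rate: stays far above average)
print(round(predicted_z(0.31, z_2008), 2))  # 0.62 (winning percentage: falls most of the way back)
```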
• There are many sports examples of regression to the
mean. One of the most famous is the sophomore
slump, in which rookies who had great seasons then
go on to have worse PERFORMANCES the next year.
• A study from 2004 found that of the 112 Rookie of the
Year award winners in MLB (Henry Rowengartner
wasn’t one of them), 63.4% PERFORMED worse in
their second year, while 33% had better
PERFORMANCES. About 3.6% stayed the same.
• The explanation is that these sophomores regress to
their ABILITY after benefitting from RANDOM CHANCE
the previous season.
• Another example is the Sports Illustrated or Madden
jinxes.
• Regression to the mean can be used to make
decisions.
• Think about stocks. People always say to “buy low
and sell high.”
• The same reasoning can be used in sports. A
team or fantasy owner can consider trading a
player who is PERFORMING above his ABILITY
while the trade value is high. Furthermore, a
team should consider acquiring players who are
PERFORMING below their ABILITY because they
will tend to be bargains.