Download Linear Regression Power Point

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Linear least squares (mathematics) wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Degrees of freedom (statistics) wikipedia , lookup

Regression toward the mean wikipedia , lookup

Time series wikipedia , lookup

Transcript
Objective: Understanding and using linear regression
Answer the following questions:
(c) If one house is larger in size than another, do you think it affects the
price?
(d) Take a guess as to how much (on average) each additional 1000
square foot would increase the price?
(e) Would you guess that distance explains about 50% of the variability
in airfares, about 65% of this variability, or about 85% of this
variability?
What is the line that best “fits”
or models this data?
In other words, what constitutes a “good” line through a scatterplot?
• We can model the relationship with a line
and give its equation. No line can go
through all the points, but we it can still be
a useful “model.” The best line might not
even go through any of the points.
• We want to find the line that comes closer
to all the points than any other line.
Residual
• Definition: the difference between the
observed value and its associated
predicted value
• The residual tells us how far off the
model’s prediction is at that point.
• We always subtract the predicted value
from the observed one.
y  yˆ
y  yˆ
• A negative residual means:
• the model made an overestimate
• A positive residual means:
• the model made an underestimate.
• When we draw a line through the scatterplot, some
residuals are positive and some are negative. If we add
up all the residuals, what happens?
• We faced the same issue when we calculated a standard
deviation to measure spread. How do we deal with it?
• We square the residuals! Since squaring them will make
them all positive, we can now sum them.
• Squaring also emphasizes the larger residuals. When
we add up all the squared residuals, that sum indicates
how well the line we drew fits the data. Do we want a
small or large sum?
LINE OF BEST FIT
• Is the line for which the sum of squared
residuals is the smallest.
• Our line has the property that the variation
of data from the model is the smallest it
can be for any straight line model for the
data.
• We say that this line “minimizes the sum of
squared residuals” – the best fit line
becomes the “least squares” line.
Correlation & the Line
• What we know about correlation can lead us to
the equation of the linear model… Let’s look at
scatterplots of standardized variables again:
• What line would you choose to model the
relationship of standardized variables?
(x, y)
• The line must go through
• So in z-scores, the line must go through the point:
• The equation of a line that passes through the origin can
be written as y = mx
zˆ y  mz x
• We need to again change it z-scores
• m = slope so we can say moving over one unit in the zscores corresponds to moving up m units in the
predicted z-scores of y.
• There are many different slopes that pass through the
origin. Which one fits our data best? In other words,
which slope minimizes the sum of the squared
residuals?
• It turns out that the best choice for m is the correlation
coefficient itself, r!
• So now, we can write:
zˆ y  rz x
zˆ y  rz x
• What does it tell us?
• In moving one standard deviation away
from the mean in x, we can expect to
move about r standard deviations away
from the mean in y.
• In general, moving any number of
standard deviations in x moves r times that
number of standard deviations in y.
• A scatterplot of house prices (in thousands of
dollars) vs. house size (in thousands of square
feet) shows a relationship that is straight, with
only moderate scatter, and no outliers. The
correlation is 0.85.
• If a house is one SD above the mean in size
(making it about 2170 sq ft), how many SD
above the mean would you predict its sale price
to be?
• What would you predict about the sale price of a
house that’s 2 SDs below average in size?
The regression line in real units
• If we want to find real values, we don’t
always want to convert to z-scores, find
the correlation, use the formula for looking
at standard deviation changes, and then
convert back to the original units…why not
write an equation for the line for our data:
• In Algebra, you learned that an equation
for a line was:
y= mx + b
• Statisticians use slightly different notation:
ŷ  b0  b1 x
b1 
rs y
sx
b0  y  b1 x
yˆ  9.564  122.74 x
• y – price (in thousands of dollars)
• x – house size (in thousands of sq. feet)
• What does the slope mean?
• What are the units?
• How much can the homeowner expect the value
of his house to increase if he builds an additional
2000 sq feet?
• How much would you expect to pay for a house
of 3000 sq ft?
Calculating a Regression Equation
step-by-step
• Estimate the costs per person associated with traffic
delays
• Annual Cost per person
• Mean = 298.96
• SD = 180. 830
• Peak Period freeway speed
• Mean = 54.34 mph
• SD = 4.494 mph
• r = -0.90
• Find the equation of the regression line and write a
sentence interpreting your equation.
Summary of Residuals
• A common theme in statistical modeling is to
think of each data point as being composed of 2
parts – the part that is explained by the model
(often called the fit) and the “leftover” part, (often
called the residual). In the context of least
squares regression, the fitted value for an
observation is simply the y value that the
regression line would predict for the x-value of
that observation. The residual is the difference
between the actual y value and the fitted y hat.
Residual = actual – fitted.
Data = Model + Residual
• Or Residual = Data – Model
• In symbols: e  y  yˆ
• We can do a “residual plot” in the hopes of
finding “nothing.”
•
•
The residual plot shown offers a good example of what a problem-free plot should
look like. There are no odd fan or curved trends in the plot, the average of the
residuals is zero, and the points are equally represented about the x-axis. This
residual represents the difference between the observed response variable Y and
the value predicted by the regression line.
Accounting for Variation
• The variation in residuals is the key to
assessing how well the model fits. All
regression models fall between the two
extremes of zero correlation and perfect
correlation. Can we gauge where are
model falls?
• Compare a regression model with
correlation 0.5 and –0.5 in terms of
strength of linearity.
• Since they only have different directions,
we can square the correlation coefficient
to get r2.
• R square d gives us the fraction of the
data’s variation accounted for by the
model, and 1-r squared is the fraction of
the original variation left in the residuals.
• What does r2 = 0 mean?
• What does r2 = 69% mean?
price hat = 9.564 + 122.72 size
• Back to our regression of house price
• The value is reported as 71.4%/
• What does this R2 value mean about the
relationship of price and size?
• Is the correlation positive or negative? How do
you know?
• If we measured the size in thousands of square
meters instead of thousands of square feet,
would the r2 value change? What about the
slope?