Linear Regression and Correlation
February 11, 2009
The Big Ideas
- To understand a set of data, start with a graph or graphs.
- If the data concern the relationship between two quantitative
  variables measured on the same individuals, use a scatterplot.
  If the variables have an explanatory-response relationship, be
  sure to put the explanatory variable on the horizontal (x) axis
  of the plot.
The Big Ideas
- When you look at a statistical graph, look for the overall
  pattern and for striking deviations from that pattern.
- When you look at a scatterplot, describe the overall pattern
  by its form, direction, and strength. Outliers are an important
  kind of deviation from the pattern.
- If the overall pattern is roughly linear (straight-line), the
  correlation r measures its direction and strength.
Thanks to David S. Moore and W.H. Freeman and Company for
this summary.
Linear Equations
y = a + bx
- These describe straight lines.
- The constant a is the y-intercept.
- The constant b is the slope.
- The variable x is the independent or explanatory variable.
- The variable y is the dependent or response variable.
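As a small illustration of the roles of a and b, the sketch below evaluates a line at a few points. The function name `line` and the values a = 2, b = 3 are made up for illustration and do not come from the lecture.

```python
def line(a, b, x):
    """Return y = a + b*x for intercept a and slope b."""
    return a + b * x


# Hypothetical line with intercept 2 and slope 3 (illustrative values only).
a, b = 2.0, 3.0

# At x = 0 the line passes through the y-intercept a.
print(line(a, b, 0))

# Increasing x by one unit changes y by the slope b.
print(line(a, b, 5) - line(a, b, 4))
```

The first value printed is the intercept and the second is the slope, which is one way to read the two constants off any straight line.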
Example 1
- African peoples often eat “bushmeat”, the meat of wild
  animals.
- The explanatory variable is fish supply per person, in kilograms.
- The response is the percent change in the total “biomass”
  (weight in tons).
Thanks to David S. Moore and W.H. Freeman and Company for
this example.
Analyzing Relationships
- Form: There is a general linear (straight-line) pattern.
- Direction: The plot has a clear lower-left to upper-right
  pattern, so that more positive (or less negative) changes in
  wildlife go with higher fish supply. This is a positive association
  between the variables.
- Strength: The strength of a relationship is how closely the
  points follow the overall pattern. A “perfectly strong” linear
  relationship means that all points fall exactly on a straight
  line. The relationship here is moderately strong.
Line of Best Fit
- We can make predictions about the response variable based
  on the explanatory variable by finding points on a line that
  best fits the data.
- We use software to find this line of best fit.
- Our responsibility is to check to see that the line is sensible
  and then interpret the value of the predictions.
Line of Best Fit
- The regression equation is the equation of the line of best fit:
  y = a + bx
- A y value on the line is a predicted value of the response
  variable for a given explanatory value.
- We use the notation ŷ for predicted values and y for observed
  values. Thus
  $\hat{y}_0 = a + b x_0$
  is the value of the response variable that is predicted by the
  regression equation for the explanatory value $x_0$.
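The slides leave the fitting to software. As a rough sketch of what such software might compute, the code below uses the standard least-squares formulas b = Sxy / Sxx and a = ȳ − b x̄, which are not stated in the slides. The helper name `fit_line` and the toy data are invented for illustration and are not the lecture's examples.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


# Invented toy data, not taken from the lecture.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)

# Predicted response for a chosen explanatory value x0.
x0 = 6
y_hat0 = a + b * x0
print(f"regression equation: y-hat = {a:.2f} + {b:.2f} x")
print(f"predicted response at x0 = {x0}: {y_hat0:.2f}")
```

Note that x0 = 6 lies just outside the range of the toy x values, so a prediction like this should be checked for sensibility, as the previous slide advises.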
Errors
- The error or residual measures the distance between the
  predicted response to a given explanatory value and the actual
  response for that explanatory value, i.e., the vertical distance
  between the line and the data point.
- We use the notation $\epsilon$ for the error. So, for a data point
  $(x_i, y_i)$ we have
  $\epsilon_i = |y_i - \hat{y}_i|$
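To make the definition concrete, the sketch below computes the error for each point of a small data set. Both the toy data and the line ŷ = 2.2 + 0.6x are assumptions chosen for illustration, not values from the lecture.

```python
# Invented toy data and a hypothetical fitted line y-hat = 2.2 + 0.6 x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

# Predicted response for each observed explanatory value.
predicted = [a + b * x for x in xs]

# Error for each point: vertical distance between observed and predicted y.
errors = [abs(y - y_hat) for y, y_hat in zip(ys, predicted)]

for x, y, y_hat, e in zip(xs, ys, predicted, errors):
    print(f"x={x}  observed={y}  predicted={y_hat:.1f}  error={e:.1f}")
```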
Example 2
Life expectancy and the percent of people who can read.
Sum of the Squared Errors
- SSE stands for the Sum of the Squared Errors.
- In symbols we have
  $$\mathrm{SSE} = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
- The line of best fit is the line that minimizes the SSE.
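One way to see what "minimizes the SSE" means is to compute the SSE for many candidate lines and keep the smallest. The brute-force grid search below is only an illustration of the idea, not how statistical software finds the line. The toy data, the grid ranges, and the function name `sse` are all assumptions made for this sketch.

```python
def sse(a, b, xs, ys):
    """Sum of squared errors of the line y = a + b*x on the data."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Invented toy data, not taken from the lecture.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Candidate lines: intercepts 0.0 to 4.0 and slopes 0.0 to 1.5, in steps of 0.1.
candidates = [(i / 10, j / 10) for i in range(41) for j in range(16)]

# Keep the candidate whose SSE is smallest.
best_a, best_b = min(candidates, key=lambda ab: sse(ab[0], ab[1], xs, ys))

print("best intercept:", best_a)
print("best slope:", best_b)
print("SSE of best line:", round(sse(best_a, best_b, xs, ys), 2))
```

For this toy data the search settles on a = 2.2 and b = 0.6, and every other line on the grid has a larger SSE.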
The Correlation Coefficient r
- The correlation coefficient r measures how strong the linear
  relationship is between the variables.
- We always have r bounded by ±1:
  −1 ≤ r ≤ 1
- The bigger the absolute value of r, the stronger the
  relationship.
  - r = 1 means there is a perfect positive linear relationship
    between the variables.
  - r = 0 means there is no linear relationship between the
    variables.
  - r = −1 means there is a perfect negative linear relationship
    between the variables.
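The slides do not give a formula for r. The sketch below assumes the standard definition r = Sxy / sqrt(Sxx · Syy) and applies it to a few invented data sets. The function name `correlation` and all the data are illustrative only.

```python
import math


def correlation(xs, ys):
    """Return the correlation coefficient r of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]

# Invented toy data: a scattered but clearly increasing pattern.
print(round(correlation(xs, [2, 4, 5, 4, 5]), 3))

# Points exactly on an increasing line, then on a decreasing line.
print(correlation(xs, [2, 4, 6, 8, 10]))
print(correlation(xs, [10, 8, 6, 4, 2]))

# Points exactly on the curve y = x**2, symmetric about x = 0.
print(correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```

The last data set follows a curve exactly yet has r = 0, which shows that r describes only straight-line association.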
Textbook
- See page 484 of the text for pictures of positive, negative,
  and zero correlation.
- See page 491 of the text for strength guidelines.
Example: Exercise vs. Screen Time
- What kind of correlation do we expect from the data in our
  class?
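Outliers
- If an error is more than 1.9s, then we consider the
  corresponding data point to be a potential outlier.
- The value of the standard deviation s is
  $$s = \sqrt{\frac{\mathrm{SSE}}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} \epsilon_i^2}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}}$$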
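The 1.9s rule can be sketched in code as follows. The fit uses the standard least-squares formulas, which the slides do not state. The helper names `fit_line` and `potential_outliers` and the toy data (points on the line y = x, with one value deliberately moved) are assumptions for illustration.

```python
import math


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def potential_outliers(xs, ys, cutoff=1.9):
    """Return the data points whose error exceeds cutoff * s."""
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)
    s = math.sqrt(sse / (len(xs) - 2))   # standard deviation of the errors
    return [(x, y) for x, y, e in zip(xs, ys, residuals) if abs(e) > cutoff * s]


# Invented toy data: y = x, except the point at x = 5 was moved up to 15.
xs = list(range(1, 11))
ys = [1, 2, 3, 4, 15, 6, 7, 8, 9, 10]

print(potential_outliers(xs, ys))
```

Only the moved point (5, 15) is flagged here. A flagged point is a candidate for a closer look rather than something to delete automatically.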