Today's Agenda
r2, the coefficient of determination
The bivariate normal assumption
Diagnostic plots: Residuals and Cook's Distance
R output (moved to week 3)
Syllabus note: We are ahead of schedule in regression, so we're
taking the time to add more examples and details, like Cook's
distance and residuals.
r2, the coefficient of determination
r2 is simply the Pearson correlation coefficient r, but squared.
So why all the fuss about it?
When x and y are correlated, we say that some of the
variation in y is explained by x.
The proportion explained is r2 . It is called the coefficient of
determination because it represents how well a value of y
can be determined by x.
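As a rough sketch in R (the software this course uses), with entirely made-up numbers, r2 is just the square of what cor() returns; the variable names and values below are for illustration only.

  set.seed(1)
  x <- rnorm(50)                        # made-up predictor values
  y <- 2 + 0.8 * x + rnorm(50)          # made-up response with some noise
  r <- cor(x, y)                        # Pearson correlation coefficient
  r^2                                   # coefficient of determination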
Abstract case 1:
If there were a perfect correlation between x and y,
then the relationship between them could be described
perfectly by a line (r = -1 or +1).
Once you have the regression equation, knowing x allows
you to determine what y is, without any error.
In these cases, r2 is 1, meaning that 100% of the variance in y
is explained by x.
Abstract case 2:
If there were NO correlation between x and y, such that r = 0,
then there is no linear relationship between x and y.
Knowing x and using the regression equation of that (lack of)
relationship would tell you literally nothing about y.
In these cases, r2 is 0, so none of the variance in y is
explained by x.
Medical example:
On page 4 of 8 of this paper, Pak J Physiol 2010;6(1):
http://www.pps.org.pk/PJP/6-1/Talay.pdf
... there are several scatterplots describing the correlation
between resting heart rate (RHR) and several other possibly
related variables.
Consider the first scatterplot, called Figure 1A. In this figure,
a regression of body-mass index (BMI, y) as a function of
resting heart rate (RHR, x) is shown.
Scatterplot of Heart Rate (x) and Body-Mass Index (y)
Here, the sample correlation is r = 0.305, and there is strong
evidence that the population correlation is positive because
p < 0.01.
r2 = 0.305² = 0.0930,
so 9.3% of the variation in BMI can be explained by RHR.
Also, 9.3% of the variation in RHR can be explained by BMI.
Why?
Correlation works in both directions.
In Figure 1B, the sample shows that some variation in Waist-to-Hip Ratio (WHR) is explained by (and explains) RHR.
0.230² = 0.0529, or 5.3% of the variation.
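A quick arithmetic check in R, using only the correlations reported in the paper:

  0.305^2    # about 0.093, so ~9.3% of the variation in BMI explained by RHR
  0.230^2    # 0.0529, so ~5.3% of the variation in WHR explained by RHR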
If Body-Mass Index explains 9.3% of the variation of RHR,
and
Waist-to-Hip Ratio explains 5.3% of the variation,
could they together explain 9.3 + 5.3 = 14.6% ?
Sadly, no.
Since BMI and WHR are measuring very similar things, there
is going to be a lot of overlap in the variation that they
explain.
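A rough illustration of this overlap in R, using simulated data (the names bmi, whr, and rhr are stand-ins for the paper's variables, not its actual data):

  set.seed(2)
  n   <- 200
  bmi <- rnorm(n)                         # stand-in for standardized BMI
  whr <- 0.8 * bmi + rnorm(n, sd = 0.6)   # WHR made to overlap heavily with BMI
  rhr <- 0.3 * bmi + rnorm(n)             # RHR related to BMI (and so to WHR)

  summary(lm(rhr ~ bmi))$r.squared        # r-squared from BMI alone
  summary(lm(rhr ~ whr))$r.squared        # r-squared from WHR alone
  summary(lm(rhr ~ bmi + whr))$r.squared  # less than the sum of the two above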
But what is this 'variation'? Let's dig deeper!
Recall that the regression equation without the error term,
α + βx , is called the least squares line.
The 'squares' being referred to are the squared errors.
Mathematically, it is the line through the data that produces
the smallest sum of squared errors (SSE), which is
SSE = Σ ε² = Σ [ y - (α + βx) ]², summed over all observations,
where epsilon ε is the error term that we ignored earlier.
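A minimal sketch in R of computing SSE from a fitted least squares line, with made-up data:

  set.seed(3)
  x   <- rnorm(40)
  y   <- 1 + 2 * x + rnorm(40)          # made-up data
  fit <- lm(y ~ x)                      # the least squares line
  sum(resid(fit)^2)                     # SSE: the sum of the squared errors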
The sum of squares error SSE is the amount of variation that
is left unexplained by the model.
We used squared errors because...
- Otherwise negative and positive errors would cancel.
- This way, the regression equation will favour creating many
small errors instead of one big one.*
- In calculus, the derivative of x² is easy to find.
* Also why Pearson correlation is sensitive to extreme values.
The error term is in any model we use, even the null model,
which is a fancy term for not regressing at all:
y = α + βx + ε (the simple regression model), or y = α + ε (the null model).
In the null model, every value of y is predicted to be the
average of all observed y values. So α is the sample mean of
y, y-bar.
The total squared difference from the mean of y is called the
sum of squares total, or SST:
SST = Σ ( y - ȳ )², summed over all observations.
SST is the total squared length of all the vertical red lines.
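A small sketch in R, with made-up y values, showing that the null model's intercept is y-bar and that its sum of squared errors is exactly SST:

  set.seed(4)
  y    <- rnorm(30, mean = 10)          # made-up response values
  null <- lm(y ~ 1)                     # null model: intercept only, no x
  coef(null)                            # the fitted intercept equals mean(y)
  sum((y - mean(y))^2)                  # SST, the sum of squares total
  sum(resid(null)^2)                    # the null model's SSE: the same number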
If we fit a regression line, (most of the) errors become
smaller.
Most importantly, the squared errors get smaller. The
coefficient of determination, r2, measures how much
smaller the squared errors get.
Here, the correlation is very strong (r is large), and there are
barely any errors at all.
So SSError would be much smaller than SSTotal,
and r2 is also large.
The relationship between r2, SSE, and SST is:
r2 = (SST - SSE) / SST = 1 - SSE / SST
SST is the total amount of variation in Y.
SSE is the amount of variation in Y left unexplained by X.
When r2 is zero, SSE is the same as SST.
When r2 is one, SSE disappears completely.
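A sketch in R, with made-up data, checking that 1 - SSE/SST matches both the squared correlation and the R-squared that lm() reports:

  set.seed(5)
  x   <- rnorm(60)
  y   <- 3 - 1.5 * x + rnorm(60, sd = 2)
  fit <- lm(y ~ x)
  sse <- sum(resid(fit)^2)              # variation left unexplained
  sst <- sum((y - mean(y))^2)           # total variation in y
  1 - sse / sst                         # r-squared from the sums of squares
  cor(x, y)^2                           # squared Pearson correlation
  summary(fit)$r.squared                # lm()'s reported value: all three agree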
So we now have two different interpretations of r-squared.
1. The square of the correlation coefficient.
2. The proportion of Sum of Squares Total (SST) that is
removed from the error term.
Interpretation #1 is specific to correlation.
Interpretation #2 works for simple regression, but also for
ANOVA, multiple regression, and general linear models!
R-squared is truly the go-anywhere animal.
Bivariate Normality (and some diagnostics)
Regression produces a line that minimizes the sum of squared
errors, so a small number of extreme values (outliers) can
have a strong effect on a model.
Consider this Pearson r:
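A small simulated example of this sensitivity in R (the added points are hypothetical, not from any real data set):

  set.seed(6)
  x <- rnorm(50)
  y <- x + rnorm(50, sd = 0.5)          # a fairly strong positive relationship
  cor(x, y)                             # Pearson r for the clean data

  x_out <- c(x, -2.0, -2.2, -2.4)       # add three extreme points at low x
  y_out <- c(y, 6.0, 7.0, 6.5)          # with very large y values
  cor(x_out, y_out)                     # r drops sharply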
More specifically, regression is sensitive to violations of the
assumption of bivariate normality.
The regression model assumes:
1. The distributions of the x and y variables are normal.
If you were to take a histogram of all the x values, that
histogram should resemble a normal curve.
The regression model also assumes:
2. The distribution of y, conditional on x, is normal.
If you were to take a histogram of all the error terms, that
histogram should ALSO resemble a normal curve.
Any observations whose errors are too large to fit under that
curve are potentially influential outliers.
In this diagram, the red line is the regression on all 54 points.
The blue line is the regression without the 4 red points.
These points are near the lower end of x, and have very
large error terms associated with them, so they 'pull' the left
end of the regression line down.
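A rough sketch of this comparison in R, using simulated data in place of the real 54 points (the four 'outliers' are planted deliberately at the low end of x):

  set.seed(7)
  dat   <- data.frame(x = rnorm(54))
  dat$y <- 2 + dat$x + rnorm(54, sd = 0.5)
  low   <- order(dat$x)[1:4]                     # the 4 points with the smallest x
  dat$y[low] <- dat$y[low] - 5                   # give them very large errors

  fit_all     <- lm(y ~ x, data = dat)           # fit on all points
  fit_trimmed <- lm(y ~ x, data = dat[-low, ])   # refit without the 4 suspect points
  coef(fit_all)                                  # left end pulled down by the outliers
  coef(fit_trimmed)                              # intercept and slope both shift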
Another word for these errors is residuals, literally the
residue, or the portion left over from the model. Here is a
scatterplot of the residuals against x, a.k.a. a residual plot.
The outliers are clearly visible in the residual plot and in
the histogram below: their residuals are twice as large as
those of any other observation.
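A minimal sketch in R of these two diagnostics, again with made-up data:

  set.seed(8)
  x   <- rnorm(54)
  y   <- 2 + x + rnorm(54, sd = 0.5)
  fit <- lm(y ~ x)
  res <- resid(fit)
  plot(x, res, ylab = "residual")       # residual plot: residuals over x
  abline(h = 0, lty = 2)                # reference line at zero error
  hist(res)                             # histogram of the residuals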
One way to measure how much an outlier is affecting the
model is to remove that one point and see how much the
model changes.
We can see a big difference between the blue and red lines
above, but that comparison came from removing 4 points
manually.
Another, more systematic (and therefore quick, easy, and
often more reliable) method is to remove one observation at
a time and see how much the model changes.
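A sketch in R of this one-at-a-time idea done by hand, with made-up data (tracking only the change in slope):

  set.seed(9)
  x <- rnorm(54)
  y <- 2 + x + rnorm(54, sd = 0.5)
  full_slope <- coef(lm(y ~ x))[2]      # slope using all observations

  slope_change <- sapply(seq_along(y), function(i) {
    coef(lm(y[-i] ~ x[-i]))[2] - full_slope   # slope shift when point i is left out
  })
  which.max(abs(slope_change))          # the single most influential observation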
Cook's distance is a regression deletion diagnostic. It works
by comparing a model fit to every observation against a model
fit with only the observation in question removed (deleted).
The higher Cook's distance is for a value, the more that
particular value is influencing the model.
If there are one or two values that have undue
leverage on the model, Cook's distance will find them. This is
true even if the residual plot fails to show them (which can
happen if the observation is 'pulling' hard enough).
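A sketch in R, with made-up data; the 4/n cut-off at the end is a common rule of thumb, not something from these notes:

  set.seed(10)
  x <- rnorm(54)
  y <- 2 + x + rnorm(54, sd = 0.5)
  y[1:4] <- y[1:4] - 5                  # plant four influential points
  fit <- lm(y ~ x)

  cd <- cooks.distance(fit)             # one distance per observation
  plot(cd, type = "h")                  # spike plot: tall spikes = influential points
  which(cd > 4 / length(y))             # a common rough cut-off for follow-up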
This is Cook's Distance for all 54 data points.
Note that although all 4 problem points have high Cook's
distance compared to the rest, two of them are not obvious
problems. Cook's distance has a hard time identifying influential
observations when there are several.
Dealing with outliers is like selecting an acceptable Type I
error. There are conventions and guidelines in place, but it is
a case-by-case judgement call.
One question to ask, when considering things other than your
model, is “does this observation belong in my data set?”
If the outlier is the result of a typo, it's not the same as the
rest of your sample and it should go.
If other information about that observation is nonsense,
such as joke answers in a survey, then that's also justification
to remove that outlier observation.
If it just happens to be an extreme value, but otherwise
everything seems fine with it, then it is best to keep it.
Don't rush to finish your model. Look for outliers first.
Next Tuesday:
- Diagnostics and Regression in R.
- Correlation vs Causality
Read: Rubin on Causality, only Sections 1-3 for next Tuesday.
Sources:
xkcd.com/605, "My Hobby: Extrapolating"
Sand Crab photo by Regiane Cardillo, Brasil
http://www.pps.org.pk/PJP/6-1/Talay.pdf , Pak J Physiol 2010;6(1)
Mandarin Duck and Parrot on Tortoise photo: author unknown