Stat 1510: Statistical Thinking and Concepts
REGRESSION
Agenda

• Regression Lines
• Least-Squares Regression Line
• Facts about Regression
• Residuals
• Influential Observations
• Cautions about Correlation and Regression
• Correlation Does Not Imply Causation
Objectives

• Quantify the linear relationship between an explanatory variable (x) and a response variable (y).
• Use a regression line to predict values of y for given values of x.
• Calculate and interpret residuals.
• Describe cautions about correlation and regression.
Regression Line

A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We can use a regression line to predict the value of y for a given value of x.

Example: Predict the number of new adult birds that join the colony based on the percent of adult birds that return to the colony from the previous year. If 60% of adults return, how many new birds are predicted?
Regression Line

Regression equation: y-hat = a + bx

• x is the value of the explanatory variable.
• "y-hat" is the predicted value of the response variable for a given value of x.
• b is the slope: the amount by which the predicted y changes for each one-unit increase in x.
• a is the intercept: the predicted value of y when x = 0. If the range of observed x values does not include zero, the intercept has no practical interpretation.
• We cannot use the fitted model to predict y for any value of x outside the range of x values used to fit the model.
Least-Squares Regression Line

Since we are trying to predict y, we want the regression line to be as close as possible to the data points in the vertical (y) direction.

Least-Squares Regression Line (LSRL): the line that minimizes the sum of the squares of the vertical distances of the data points from the line.
Equation of the LSRL

Regression equation: y-hat = a + bx, with slope

b = r (sy / sx)

and intercept

a = y-bar − b · x-bar,

where sx and sy are the standard deviations of the two variables and r is their correlation.
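These two formulas translate directly into R. A minimal sketch (the function name lsrl is my own; x and y stand in for any pair of numeric vectors):

  # Least-squares slope and intercept from r, the standard deviations, and the means
  lsrl <- function(x, y) {
    b <- cor(x, y) * sd(y) / sd(x)   # b = r * (sy / sx)
    a <- mean(y) - b * mean(x)       # a = y-bar - b * x-bar
    c(intercept = a, slope = b)
  }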
Prediction via Regression Line

For the returning birds example, the LSRL is

y-hat = 31.9343 − 0.3040x

where y-hat is the predicted number of new birds for colonies with x percent of adults returning.

Suppose we know that an individual colony has 60% returning. What would we predict the number of new birds to be for just that colony?

For colonies with 60% returning, we predict the average number of new birds to be:

31.9343 − (0.3040)(60) = 13.69 birds
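The same prediction in R, plugging the quoted coefficients into the fitted line:

  # Predicted new birds for a colony with 60% of adults returning
  a <- 31.9343
  b <- -0.3040
  y_hat <- a + b * 60
  y_hat   # 13.69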
Facts about Least-Squares Regression

• The distinction between explanatory and response variables is essential.
• The slope b and correlation r always have the same sign.
• The LSRL always passes through (x-bar, y-bar).
• The square of the correlation, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x. r² is called the coefficient of determination.
Regression Calculation Case Study

Per Capita Gross Domestic Product and Average Life Expectancy for Countries in Western Europe
Regression Calculation - Case Study

Country           Per Capita GDP (x)   Life Expectancy (y)
Austria                 21.4                 77.48
Belgium                 23.2                 77.53
Finland                 20.0                 77.32
France                  22.7                 78.63
Germany                 20.8                 77.17
Ireland                 18.6                 76.39
Italy                   21.5                 78.51
Netherlands             22.0                 78.15
Switzerland             23.8                 78.99
United Kingdom          21.2                 77.37
Regression Calculation - Case Study

Linear regression equation:

x-bar = 21.52    sx = 1.532
y-bar = 77.754   sy = 0.795
r = 0.809

b = r (sy / sx) = (0.809)(0.795 / 1.532) = 0.420
a = y-bar − b · x-bar = 77.754 − (0.420)(21.52) = 68.716

y-hat = 68.716 + 0.420x
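The hand calculation can be checked in R with the data from the table above:

  # Western Europe case study: slope and intercept from summary statistics
  gdp <- c(21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2)
  le  <- c(77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37)
  r <- cor(gdp, le)              # about 0.809
  b <- r * sd(le) / sd(gdp)      # about 0.420
  a <- mean(le) - b * mean(gdp)  # about 68.716
  c(intercept = a, slope = b)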
Simplified Formula

To avoid rounding error, compute b directly from the raw sums rather than from rounded values of r, sx, and sy:

b = (Σxy − n · x-bar · y-bar) / (Σx² − n · x-bar²)
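A quick check of this computational form in R, reusing the gdp and le vectors from the case-study sketch above:

  # Slope from raw sums: avoids rounding r, sx, and sy first
  n <- length(gdp)
  b_direct <- (sum(gdp * le) - n * mean(gdp) * mean(le)) /
              (sum(gdp^2) - n * mean(gdp)^2)
  b_direct   # about 0.420, matching the hand calculation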
R – Output (using lm function)

lm(formula = LE ~ GDP)

Residuals:
     Min       1Q   Median       3Q      Max
-0.92964 -0.24309  0.02843  0.25987  0.76440

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  68.7151     2.3238  29.571 1.85e-09 ***
GDP           0.4200     0.1077   3.899  0.00455 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4951 on 8 degrees of freedom
Multiple R-squared: 0.6552,  Adjusted R-squared: 0.6121
F-statistic: 15.2 on 1 and 8 DF, p-value: 0.004554

The fitted model is LE-hat = 68.7151 + 0.4200 · GDP.
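This output can be reproduced with a few lines of R (variable names LE and GDP chosen to match the formula shown above):

  # Fit the simple linear regression and print the full summary
  GDP <- c(21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2)
  LE  <- c(77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37)
  fit <- lm(LE ~ GDP)
  summary(fit)   # intercept 68.7151, slope 0.4200, R-squared 0.6552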
Coefficient of Determination (R²)

• Measures the usefulness of the regression for prediction.
• R² (or r², the square of the correlation) measures the fraction of the variation in the values of the response variable (y) that is explained by the regression line.
  – r = 1: R² = 1: the regression line explains all (100%) of the variation in y.
  – r = 0.7: R² = 0.49: the regression line explains almost half (49%) of the variation in y.
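This identity is easy to verify for the case study in R, reusing GDP, LE, and fit from the sketch above:

  # R-squared equals the squared correlation in simple linear regression
  cor(GDP, LE)^2          # about 0.6552
  summary(fit)$r.squared  # same value, taken from the fitted model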
Residuals

A residual is the difference between an observed value of the response variable and the value predicted by the regression line:

residual = y − y-hat

A residual plot is a scatterplot of the regression residuals against the explanatory variable.
  – Used to assess the fit of a regression line.
  – Look for a "random" scatter around zero.

Example data: Gesell Adaptive Score and Age at First Word.
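A residual plot for the case-study fit takes only a few lines of R (again reusing GDP, LE, and fit from above):

  # Residuals: observed y minus predicted y-hat
  res <- resid(fit)          # same as LE - fitted(fit)
  plot(GDP, res, ylab = "Residual")
  abline(h = 0, lty = 2)     # look for random scatter around this line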
Outliers and Influential Points

• An outlier is an observation that lies far away from the other observations.
  – Outliers in the y direction have large residuals.
  – Outliers in the x direction are often influential for the least-squares regression line, meaning that removing such points would markedly change the equation of the line.
Outliers and Influential Points

Gesell Adaptive Score and Age at First Word: from all of the data, r² = 41%; after removing child 18, r² = 11%.
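The Gesell measurements are not reproduced here, but the effect is easy to demonstrate in R on hypothetical data: one point far out in the x direction can change the fitted line markedly.

  # Hypothetical data: ten points near a line, plus one x-outlier off the trend
  set.seed(1)
  x <- c(1:10, 25)
  y <- c(2 * (1:10) + rnorm(10), 10)
  coef(lm(y ~ x))              # fit including the influential point
  coef(lm(y[-11] ~ x[-11]))    # fit without it: the slope changes markedly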
Cautions about Correlation and Regression

• Both describe linear relationships.
• Both are affected by outliers.
• Always plot the data before interpreting.
• Beware of extrapolation.
  – Predicting outside the range of x.
• Beware of lurking variables.
  – These have an important effect on the relationship among the variables in a study but are not included in the study.
• Correlation does not imply causation!
Caution: Beware of Extrapolation

• Sarah's height was plotted against her age.
• Can you predict her height at age 42 months?
• Can you predict her height at age 30 years (360 months)?
Caution: Beware of Extrapolation

• Regression line: y-hat = 71.95 + 0.383x (height in cm, age x in months).
• Height at age 42 months: y-hat = 88 cm.
• Height at age 30 years (360 months): y-hat = 209.8 cm.
  – She is predicted to be 6'10.5" tall at age 30!
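The two predictions in R, using the coefficients quoted above; only the first is trustworthy, since 360 months lies far outside the ages used to fit the line:

  # Sarah's fitted growth line: height (cm) as a function of age (months)
  height <- function(age_months) 71.95 + 0.383 * age_months
  height(42)    # about 88 cm: a reasonable prediction inside the data range
  height(360)   # about 209.8 cm: extrapolation gives an absurd answer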
Caution: Beware of Lurking Variables

Example: Meditation and Aging (Noetic Sciences Review, Summer 1993, p. 28)

• Explanatory variable: observed meditation practice (yes/no).
• Response variable: level of an age-related enzyme.
  – General concern for one's well-being may also affect the response (and the decision to try meditation).
Correlation Does Not Imply Causation

• Even very strong correlations may not correspond to a real causal relationship (changes in x actually causing changes in y).
  – The correlation may be explained by a lurking variable.

Example: Social Relationships and Health
House, J., Landis, K., and Umberson, D., "Social Relationships and Health," Science, Vol. 241 (1988), pp. 540-545.

• Does lack of social relationships cause people to become ill? (There was a strong correlation.)
• Or are unhealthy people less likely to establish and maintain social relationships? (Reversed relationship.)
• Or is there some other factor that predisposes people both to have lower social activity and to become ill?
Evidence of Causation

• A properly conducted experiment may establish causation.
• Other considerations when we cannot do an experiment:
  – The association is strong.
  – The association is consistent.
    • The connection happens in repeated trials.
    • The connection happens under varying conditions.
  – Higher doses are associated with stronger responses.
  – The alleged cause precedes the effect in time.
  – The alleged cause is plausible (a reasonable explanation).
Other Regression Models

• Multiple Linear Regression Models
  – More than one independent variable.
• Logistic Regression
  – The response of interest (y) belongs to one of two categories: good/bad, 0/1, etc.
• Poisson Regression
  – The response of interest is a count (0, 1, 2, 3, etc.).
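All three models are available in base R. A minimal sketch on a hypothetical simulated data frame (every variable here is made up for illustration):

  # Simulated data: two predictors and three kinds of response
  set.seed(1)
  dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
  dat$y_cont  <- 1 + 2 * dat$x1 - dat$x2 + rnorm(50)   # continuous response
  dat$y_bin   <- rbinom(50, 1, plogis(dat$x1))         # two-category (0/1) response
  dat$y_count <- rpois(50, exp(0.5 + 0.3 * dat$x1))    # count response

  lm(y_cont ~ x1 + x2, data = dat)                 # multiple linear regression
  glm(y_bin ~ x1, family = binomial, data = dat)   # logistic regression
  glm(y_count ~ x1, family = poisson, data = dat)  # Poisson regression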