Chapter 4
Regression
5/23/2017
Regression
• Like correlation, regression addresses
linear relationships between quantitative
variables X & Y
• Objective of correlation → quantify direction and strength of linear association
• Objective of regression → derive best fitting line that describes the association
• We are especially interested in the slope
of the line
Same illustrative data as Ch 3
Country        Per Capita GDP, $1K (X)   Life Expectancy, yrs (Y)
Austria        21.4                      77.48
Belgium        23.2                      77.53
Finland        20.0                      77.32
France         22.7                      78.63
Germany        20.8                      77.17
Ireland        18.6                      76.39
Italy          21.5                      78.51
Netherlands    22.0                      78.15
Switzerland    23.8                      78.99
UK             21.2                      77.37
Enter data into calculator
Algebraic equation for a line
• y = a + b∙X
where
• b ≡ slope ≡
change in Y per
unit X
• a ≡ intercept ≡ value of Y when X = 0
Statistical Equation for a Line
ŷ = a + b∙X
where:
ŷ ≡ predicted average of Y at a given level
of X
a ≡ intercept
b ≡ slope
a and b are called regression coefficients
How do we find the equation for the best
fitting line through the scatter cloud?
Ans: We use the “least squares method”
[Figure: scatterplot of life expectancy (yrs) against per capita GDP]
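As a brief aside on what "least squares" means here: the method picks the intercept a and slope b that make the sum of squared vertical distances between the observed Y values and the line as small as possible. The index i and the count n below are notation introduced for illustration; they do not appear on the slides.

```latex
\min_{a,\,b} \; \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2
\qquad \text{where} \qquad \hat{y}_i = a + b\,x_i
```

Each term is the squared residual for one data point, so large misses are penalized more heavily than small ones.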
These formulas derive the coefficients for the least squares regression line
Slope: $b = r \cdot \dfrac{s_y}{s_x}$
Intercept: $a = \bar{y} - b\,\bar{x}$
Illustrative Example (GDP & Life Expectancy)
Statistics for illustrative data (calculated with TI-30XIIS)
$\bar{x} = 21.52$
$\bar{y} = 77.754$
$s_x = 1.532$
$s_y = 0.795$
$r = 0.809$
Calculation of regression coefficients by hand:
$b = r \cdot \dfrac{s_y}{s_x} = (0.809)\left(\dfrac{0.795}{1.532}\right) = 0.420$
$a = \bar{y} - b\,\bar{x} = 77.754 - (0.420)(21.52) = 68.716$
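The hand calculation can be cross-checked in software. The following is a minimal Python sketch, assuming the ten GDP and life expectancy values are paired with the countries in the order listed in the data table; the function name `least_squares` is chosen here for illustration and is not part of the slides.

```python
from statistics import mean, stdev

# Per capita GDP (X, $1K) and life expectancy (Y, years), in table order
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

def least_squares(x, y):
    """Return (a, b, r) using b = r * sy / sx and a = ybar - b * xbar."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)  # sample standard deviations (divide by n - 1)
    # Pearson correlation from the sum of cross-products of deviations
    r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)
    b = r * sy / sx          # slope
    a = ybar - b * xbar      # intercept
    return a, b, r

a, b, r = least_squares(gdp, life)
print(f"r = {r:.3f}, slope b = {b:.3f}, intercept a = {a:.3f}")
```

Because the code carries full precision rather than rounding b to 0.420 first, the intercept may differ from the hand result in the third decimal place.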
“Least Squares” Regression
Coefficients via TI-30XIIS
STAT > 2-VAR > DATA > STATVAR
BEWARE!
The TI-30XIIS mislabels the
slope & intercept. The slope is
mislabeled as a and the
intercept is mislabeled as b.
It should be the other way
around!
Interpretation of Slope
(GDP & Life Expectancy)
ŷ = 68.7 + 0.42∙X
• Each ↑$1K in GDP is associated with a 0.42 year increase in life expectancy
b = increase in Y per unit X = 0.42 years per 1 unit of X
Interpretation of Intercept
• Mathematically = the predicted value of Y when X = 0
• In the real world = has no interpretation unless a value of X = 0 is plausible
Regression Line for Prediction
• The regression line will
always go through (x-bar,
y-bar) which in this case is
(21.5, 77.8)
• To draw the regression
line, connect any two
points on the line
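Why the line always passes through (x-bar, y-bar) follows from the intercept formula $a = \bar{y} - b\,\bar{x}$. Substituting $X = \bar{x}$ into the regression equation gives:

```latex
\hat{y}\,\big|_{X=\bar{x}} \;=\; a + b\,\bar{x}
\;=\; (\bar{y} - b\,\bar{x}) + b\,\bar{x}
\;=\; \bar{y}
```

So (x-bar, y-bar) is always a convenient first point when drawing the line by hand.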
• Example: What is the predicted life expectancy of a country with a GDP of 20?
• ŷ at X = 20: ŷ = 68.7 + (0.42)X
= 68.7 + (0.42)(20)
= 77.1 (77.12 with unrounded coefficients)
[Figure: Case Study (Life Expectancy); life expectancy (yrs) vs. per capita GDP]
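The prediction step is a direct substitution into the regression equation. A minimal sketch in Python, using the rounded coefficients from the slides; the function and constant names are hypothetical, chosen for illustration.

```python
# Rounded coefficients from the slides: y-hat = 68.7 + 0.42 * X
A = 68.7   # intercept (years)
B = 0.42   # slope (years per $1K of per capita GDP)

def predict_life_expectancy(gdp_thousands, a=A, b=B):
    """Predicted average life expectancy (years) at a given per capita GDP ($1K)."""
    return a + b * gdp_thousands

print(round(predict_life_expectancy(20), 2))
```

With these rounded coefficients the result is 77.1; carrying more decimal places in a and b gives about 77.12.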
Coefficient of Determination r²
Interpretation: proportion of the variability in Y mathematically explained by X
Our example → r = 0.809
r² = 0.809² ≈ 0.65 (0.66 if r is carried unrounded as 0.8094)
Interpretation: 66% of the variability in Y (life
expectancy) mathematically explained* by X (GDP)
* mathematically explained ≠ causally explained
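To see where "proportion of variability explained" comes from, r² can also be computed as one minus the ratio of leftover (residual) variation to total variation in Y. The sketch below assumes the illustrative data are paired in table order; the name `r_squared` is hypothetical. The slope is computed as the sum of cross-products over the sum of squared X deviations, which is algebraically the same as r·(s_y / s_x).

```python
from statistics import mean

# Per capita GDP (X, $1K) and life expectancy (Y, years), in table order
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

def r_squared(x, y):
    """Share of the variation in y accounted for by the least squares line."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx            # least squares slope
    a = ybar - b * xbar      # least squares intercept
    ss_total = sum((yi - ybar) ** 2 for yi in y)                      # variation around y-bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # variation around the line
    return 1 - ss_resid / ss_total

print(f"r^2 = {r_squared(gdp, life):.3f}")
```

The value comes out near 0.655, so roughly two thirds of the variation in life expectancy is accounted for by the line.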
Cautions about linear
regression
1. Applies to linear relationships only
2. Strongly influenced by outliers, especially
when outlier is in the X direction
3. Do not extrapolate!
4. Association ≠ causation
(Beware of lurking variables.)
Outliers / Influential Points
• Outliers in the X
direction have strong
influence (tip the line)
• Example (right)
– Child 18 = outlier in X direction
– Changes the slope substantially
[Figure: regression lines fitted with and without the outlier]
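The effect can be illustrated with a small Python sketch. The data here are hypothetical, invented purely for illustration; they are not the data behind the slide's figure. The slope is computed as the sum of cross-products over the sum of squared X deviations, which is equivalent to r·(s_y / s_x).

```python
def slope(x, y):
    """Least squares slope: sum of cross-products / sum of squared X deviations."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical data (not from the slides): Y rises by about 2 per unit X
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

b_base = slope(x, y)
b_x_outlier = slope(x + [15], y + [6.0])   # one point far out in the X direction
b_y_outlier = slope(x + [3], y + [12.0])   # unusual Y value, but X at the mean

print(f"no outlier:           b = {b_base:.2f}")
print(f"outlier in X:         b = {b_x_outlier:.2f}")
print(f"outlier in Y only:    b = {b_y_outlier:.2f}")
```

In this made-up example a single point far out in X drags the slope from about 1.97 down to about 0.15, while a point with an unusual Y value at the mean of X leaves the slope unchanged.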
Do Not Extrapolate!
• Example (right): Sarah’s height from age 3 to 5
• Least squares regression line: ŷ = 2.32 + 0.159(X)
• Predict height at age 30
• ŷ = 2.32 + 0.159(X)
= 2.32 + 0.159(30)
= 7.09’ (ridiculous)
• → Do NOT extrapolate beyond the range of X
[Figure: height (feet) vs. age (years)]
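One way to enforce this rule in software is to have the prediction function refuse inputs outside the observed range of X. A minimal sketch, using the fitted line for Sarah's height from the slide; the function name, the `allow_extrapolation` flag, and the range constant are hypothetical names chosen for illustration.

```python
# Fitted line from the slide: height (ft) = 2.32 + 0.159 * age (years)
A, B = 2.32, 0.159
AGE_RANGE = (3, 5)   # ages actually observed in the data

def predict_height(age, allow_extrapolation=False):
    """Predict height in feet; refuse ages outside the observed range by default."""
    lo, hi = AGE_RANGE
    if not (lo <= age <= hi) and not allow_extrapolation:
        raise ValueError(f"age {age} is outside the observed range {lo}-{hi}")
    return A + B * age

print(round(predict_height(4), 2))   # inside the observed range

try:
    predict_height(30)               # outside the range: rejected
except ValueError as err:
    print("refused:", err)
```

The guard does not make the model better; it only stops the line from being used where there are no data to support it.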
Association ≠ Causation
• “Association” not the
same as “causation”
• Lurking variable ≡ an
extraneous factor (Z)
that is associated with
both X and Y
• Lurking variables
can confound an
association
Example of Confounding
by a Lurking Variable
• Explanatory variable X
≡ number of prior children
• Response variable Y ≡
the risk of Down’s
syndrome
• Lurking variable Z ≡
advanced age of mother
• X is associated with Y,
but does not cause Y in
this example
• Z does cause Y
[Diagram labels: Number of children; Mental retardation; Older mother]
Criteria used to establish causality
with examples about smoking (X) and lung cancer (Y)
• Strength of association
– X & Y strongly correlated
• Consistency of findings
– Many studies have shown X & Y correlated
• Dose-response relationship
– The more you smoke, the more you increase risk
• Temporality (time relation)
– Lung cancer occurs after 10 – 20 years of smoking
• Biological plausibility
– Chemicals in cigarette smoke are mutagenic