Download Regression - Margo J. Anderson

Correlation and Regression
Basic Concepts
An Example
• We can hypothesize that the value of a house
increases as its size increases.
• Said differently, size and house value “covary” or
“co-relate.”
• Further, we can hypothesize that the relationship
is a simple linear one, e.g., that as size
increases, house value increases in a similar
linear fashion.
• Hence we can use the simple linear equation,
y = a + bx, to describe the relationship.
We Ask Two Questions…
• Is there a relationship and how strong is
it?
• What is the relationship?
• We answer the first with a new statistic, a
“correlation” coefficient.
• We answer the second with a linear
regression model.
Terms
• Independent and dependent variables
• Scatterplots
• Correlation, correlation coefficient, r
• Regression, regression coefficient, b
• Regression, regression constant, a
• Ordinary Least Squares (OLS) equation: y = a + bx + e
Issues
• Defining relationships
– Nature of the relationship: for the moment,
linear
– Strength of the relationship (using r)
– Direction of the relationship (using r and b)
– Calculation of the relationship: y = a + bx + e
Some useful websites
• http://davidmlane.com/hyperstat/A60659.html
• http://digitalfirst.bfwpub.com/stats_applet/stats_applet_5_correg.html
• http://mste.illinois.edu/activity/regression/
Illustration
• Case A. x= 2.5, y=2
• Case B. x=8, y = 7
[Figure: Cases A and B joined by a Linear Trend line]
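With only two cases, the line y = a + bx passes through both points exactly, so a and b can be found directly. The sketch below works this out for Cases A and B; the helper name `predict` and the query value x = 5 are illustrative choices, not part of the slides.

```python
# Cases A and B from the illustration slide.
x_a, y_a = 2.5, 2.0
x_b, y_b = 8.0, 7.0

# Slope b: change in y per unit change in x.
b = (y_b - y_a) / (x_b - x_a)
# Constant a: the value of y where the line crosses x = 0.
a = y_a - b * x_a


def predict(x):
    """Return the y value on the line y = a + bx (hypothetical helper)."""
    return a + b * x


print(f"b = {b:.3f}, a = {a:.3f}")
print(f"y on the line at x = 5: {predict(5):.3f}")
```

With more than two cases the points generally do not fall on one line, which is why a fitting rule such as least squares is needed.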
What if there are lots of data
points?
[Figure: scatterplot of PROPVALU (vertical axis, 0 to 12000) against SIZE (horizontal axis, 0 to 5000)]
If there are more data points?
How do we summarize the
relationships in the data?
Solution: Least Squares
Regression, The Best Linear Fit
[Figure: points A, B, and C; vertical axis: Dependent Variable; horizontal axis: Independent Variable]
Some Theory
• Knowing nothing else, the best estimate
of a variable is its mean.
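One way to read "best estimate" is in the least squares sense: the mean is the single guess with the smallest sum of squared deviations. The sketch below checks this numerically. The y values 2, 7, 7 are taken from the data table in the slides, and the list of alternative guesses is an arbitrary illustration.

```python
# y values assumed from the slides' data table.
y = [2.0, 7.0, 7.0]


def sum_squared_deviations(values, guess):
    """Sum of squared differences between each value and a single guess."""
    return sum((v - guess) ** 2 for v in values)


mean_y = sum(y) / len(y)

# Compare the mean against a few nearby guesses (arbitrary offsets).
candidates = [mean_y - 1.0, mean_y - 0.5, mean_y, mean_y + 0.5, mean_y + 1.0]
for guess in candidates:
    ssd = sum_squared_deviations(y, guess)
    print(f"guess = {guess:.3f}  sum of squared deviations = {ssd:.3f}")

best = min(candidates, key=lambda g: sum_squared_deviations(y, g))
print(f"smallest at guess = {best:.3f} (mean y = {mean_y:.3f})")
```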
[Figure: points A, B, and C with Linear Trend and Mean Y lines; vertical axis: Dependent Variable; horizontal axis: Independent Variable]
The Regression Model does
better…
• Deviation from mean y = yi – ymean
[Figure: points A, B, and C with Linear Trend and Mean Y lines; vertical axis: Dependent Variable; horizontal axis: Independent Variable]
A Regression equation…
• Measures the nature of the relationship
between x and y using a linear model
• Measures the direction of the relationship
• An accompanying statistic, for the time
being r, measures the strength of the
relationship.
Understanding the Improvement,
measuring the deviations from the
mean
[Figure: Linear Trend and Mean y lines; vertical axis: Dependent Variable; horizontal axis: Independent Variable]
More Terms
• Yi – the value of a particular case
• Y mean – mean value of y
• Y hat – the predicted value of y, written as y with a ^ above it: ŷ
• (Yi – Ymean) = total deviation from mean Y
• (Yhat – Ymean) = explained deviation of Yi from
Y mean
• (Yi – Yhat) = unexplained deviation of Yi from Y
mean
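The three deviations fit together: for every case, the total deviation equals the explained deviation plus the unexplained deviation. The sketch below checks this on the three data points from the slides. Pairing the X and Y columns of the data table row by row is an assumption, and the line is fitted with the usual least squares slope.

```python
# Data assumed from the slides' data table, paired row by row.
x = [2.5, 4.0, 8.0]
y = [2.0, 7.0, 7.0]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares slope and constant for y = a + bx.
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

rows = []
for xi, yi in zip(x, y):
    y_hat = a + b * xi
    total = yi - mean_y          # (Yi - Ymean)
    explained = y_hat - mean_y   # (Yhat - Ymean)
    unexplained = yi - y_hat     # (Yi - Yhat)
    rows.append((total, explained, unexplained))
    print(f"x = {xi}: total = {total:.3f}, "
          f"explained = {explained:.3f}, unexplained = {unexplained:.3f}")
```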
Bivariate Regression
• Relationships are modeled using the
equation, y = a + bx + e
• Translation: The values of an interval
level dependent variable, y, can be
“predicted” or “modeled” by adding a
constant, a, to the product of a slope
coefficient, b, times the values of the
independent variable, x, and an error term,
e.
Estimating the Equation,
y = a + bx + e
• The regression equation is calculated by
finding the equation that minimizes the
sum of the squared deviations between
the data points, the y’s, and the predicted
y’s, also called y hat.
y = a + bx + e
ŷ = y hat, or predicted y
ȳ = mean y
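To make the "minimizes the sum of the squared deviations" criterion concrete, the sketch below compares the sum of squared deviations at the least squares line with the same sum for a few nearby lines. The data are assumed from the slides' data table (X and Y paired row by row), and the shifts applied to a and b are arbitrary.

```python
# Data assumed from the slides' data table, paired row by row.
x = [2.5, 4.0, 8.0]
y = [2.0, 7.0, 7.0]
n = len(x)


def sse(a, b):
    """Sum of squared deviations between each y and its predicted value a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares estimates of the slope and the constant.
b_ols = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
a_ols = mean_y - b_ols * mean_x

best = sse(a_ols, b_ols)
print(f"least squares line: a = {a_ols:.3f}, b = {b_ols:.3f}, SSE = {best:.3f}")

# Arbitrary small changes to a and b; each should give a larger SSE.
shifts = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1), (0.5, -0.1)]
for da, db in shifts:
    print(f"a {da:+.1f}, b {db:+.1f}: SSE = {sse(a_ols + da, b_ols + db):.3f}")
```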
Correlation Coefficient: r
• A measure of the strength of a linear
relationship between two interval
variables, x and y
• Ranges from –1 to +1
• The higher the absolute value of r (i.e., the closer
to –1 or +1), the stronger the relationship
between x and y
Correlation Coefficient calculation
• r = Covariance of x and y divided by the
product of the standard deviation of x and
the standard deviation of y
• Covariance is the sum of the products of
the deviations of x and y from their means, divided by N.
Equations...
r = correlation coefficient
$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} $$
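The calculation of r can be sketched step by step: covariance first, then the two standard deviations, then their ratio. The data are assumed from the slides' data table (X and Y paired row by row), and every quantity is divided by N, as in the covariance definition given earlier.

```python
import math

# Data assumed from the slides' data table, paired row by row.
x = [2.5, 4.0, 8.0]
y = [2.0, 7.0, 7.0]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Covariance: sum of the products of the deviations, divided by N.
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

# Standard deviations, also dividing by N.
sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)

r = cov_xy / (sd_x * sd_y)
print(f"covariance = {cov_xy:.3f}, sd_x = {sd_x:.3f}, sd_y = {sd_y:.3f}")
print(f"r = {r:.3f}")
```

Dividing by N or by N - 1 throughout gives the same r, because the divisor cancels between the numerator and the denominator.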
Calculating a and b
$$ a = \frac{\sum Y - b \sum X}{N} = \bar{Y} - b\bar{X} $$
$$ b = \frac{\sum XY - N\bar{X}\bar{Y}}{\sum X^2 - N\bar{X}^2} $$
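The computational formulas for a and b can be applied directly to raw sums. The sketch below does so for the three data points from the slides; pairing the X and Y columns of the data table row by row is an assumption.

```python
# Data assumed from the slides' data table, paired row by row.
x = [2.5, 4.0, 8.0]
y = [2.0, 7.0, 7.0]
n = len(x)

sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
mean_x = sum_x / n
mean_y = sum_y / n

# b = (sum of XY - N * mean X * mean Y) / (sum of X squared - N * mean X squared)
b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
# a = (sum of Y - b * sum of X) / N, which equals mean Y - b * mean X
a = (sum_y - b * sum_x) / n

print(f"b = {b:.3f}")
print(f"a = {a:.3f}")
```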
$$ R^2 = r^2 = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2} $$
[Figures: four plots of points A, B, and C with Linear Trend and Mean Y lines; vertical axis: Dependent Variable; horizontal axis: Independent Variable]
Data table:
X     Y
2.5   2
4     7
8     7
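As a check on the identity R² = r², the sketch below computes R² as the explained sum of squares divided by the total sum of squares, and compares it with the square of r computed separately. The data are the three points from the slides' data table; pairing X and Y row by row is an assumption.

```python
import math

# Data assumed from the slides' data table, paired row by row.
x = [2.5, 4.0, 8.0]
y = [2.0, 7.0, 7.0]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

sum_dev_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sum_dev_x2 = sum((xi - mean_x) ** 2 for xi in x)
sum_dev_y2 = sum((yi - mean_y) ** 2 for yi in y)

# Least squares line and predicted values.
b = sum_dev_xy / sum_dev_x2
a = mean_y - b * mean_x
y_hat = [a + b * xi for xi in x]

# R squared: explained variation over total variation.
explained_ss = sum((yh - mean_y) ** 2 for yh in y_hat)
r_squared = explained_ss / sum_dev_y2

# Correlation coefficient computed independently of the fitted line.
r = sum_dev_xy / math.sqrt(sum_dev_x2 * sum_dev_y2)

print(f"R squared = {r_squared:.3f}")
print(f"r squared = {r ** 2:.3f}")
```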