Correlation and Regression
Basic Concepts

An Example
• We can hypothesize that the value of a house increases as its size increases.
• Said differently, size and house value “covary” or “co-relate.”
• Further, we can hypothesize that the relationship is a simple linear one, e.g., that as size increases, house value increases in a similar linear fashion.
• Hence we can use the simple linear equation, y = a + bx, to describe the relationship.

We Ask Two Questions…
• Is there a relationship and how strong is it?
• What is the relationship?
• We answer the first with a new statistic, a “correlation” coefficient.
• We answer the second with a linear regression model.

Terms
• Independent and Dependent variables
• Scatterplots
• Correlation, correlation coefficient, r
• Regression, regression coefficient, b
• Regression, regression constant, a
• Ordinary Least Squares (OLS) equation: y = a + bx + e

Issues
• Defining relationships
– Nature of the relationship: for the moment, linear
– Strength of the relationship (using r)
– Direction of the relationship (using r and b)
– Calculation of the relationship: y = a + bx + e

Some useful websites
• http://davidmlane.com/hyperstat/A60659.html
• http://digitalfirst.bfwpub.com/stats_applet/stats_applet_5_correg.html
• http://mste.illinois.edu/activity/regression/

Illustration
• Case A. x = 2.5, y = 2
• Case B. x = 8, y = 7
[Figure legend: Linear Trend]

What if there are lots of data points?
[Figure: scatterplot of PROPVALU (axis 0 to 12000) against SIZE (axis 0 to 5000)]

If there are more data points? How do we summarize the relationships in the data?

Solution: Least Squares Regression, The Best Linear Fit
[Figure: points A, B, and C; Dependent Variable against Independent Variable]

Some Theory
• Knowing nothing else, the best estimate of a variable is its mean.
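One way to read "best estimate" here is in the least-squares sense: among all single-number guesses, the mean gives the smallest sum of squared deviations. The short sketch below checks this numerically. The y values and the grid of candidate guesses are hypothetical, chosen only for illustration.

```python
# Hypothetical y values, used only to illustrate the idea.
ys = [2.0, 7.0, 7.0]


def sum_sq_dev(values, guess):
    """Sum of squared deviations of the values from a single guess."""
    return sum((v - guess) ** 2 for v in values)


mean_y = sum(ys) / len(ys)

# Try a grid of alternative single-number guesses: 0.0, 0.1, ..., 10.0.
candidates = [g / 10 for g in range(0, 101)]
best = min(candidates, key=lambda g: sum_sq_dev(ys, g))

print(f"mean of y          = {mean_y:.3f}")
print(f"best grid guess    = {best:.1f}")
print(f"squared dev (mean) = {sum_sq_dev(ys, mean_y):.3f}")
print(f"squared dev (best) = {sum_sq_dev(ys, best):.3f}")
```

The best guess on the grid is the grid point closest to the mean, and no grid point has a smaller sum of squared deviations than the mean itself.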
[Figure: points A, B, and C; Dependent Variable against Independent Variable; legend: Linear Trend, Mean Y]

The Regression Model does better…
• Deviation from mean y = yi – ymean
[Figure: points A, B, and C; Dependent Variable against Independent Variable; legend: Linear Trend, Mean Y]

A Regression equation…
• Measures the nature of the relationship between x and y using a linear model
• Measures the direction of the relationship
• An accompanying statistic, for the time being r, measures the strength of the relationship.

Understanding the Improvement, measuring the deviations from the mean
[Figure: Dependent Variable against Independent Variable; legend: Linear Trend, Mean y]

More Terms
• Yi – the value of y for a particular case
• Y mean – mean value of y
• Y hat – y with a ^ above it, so ŷ
• (Yi – Ymean) = total deviation of Yi from mean Y
• (Yhat – Ymean) = explained deviation of Yi from Y mean
• (Yi – Yhat) = unexplained deviation of Yi from Y mean

Bivariate Regression
• Relationships are modeled using the equation, y = a + bx + e
• Translation: The values of an interval-level dependent variable, y, can be “predicted” or “modeled” by adding a constant, a, to the product of a slope coefficient, b, and the values of the independent variable, x, plus an error term, e.

Estimating the Equation, y = a + bx + e
• The regression equation is calculated by finding the values of a and b that minimize the sum of the squared deviations between the data points, the y’s, and the predicted y’s, also called y hat.
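As a rough illustration of the deviation terms and of the least-squares idea, the sketch below fits a line to three points and splits each case's total deviation into its explained and unexplained parts. The three (x, y) pairs are an assumption made for this example, and the closed-form expressions for a and b are the standard least-squares solution, not necessarily the form shown on the slides.

```python
# Illustrative data: the pairing of these x and y values is an assumption.
xs = [2.5, 4.0, 8.0]
ys = [2.0, 7.0, 7.0]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Standard closed-form least-squares slope (b) and constant (a).
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
b = sxy / sxx
a = mean_y - b * mean_x


def sse(a_, b_):
    """Sum of squared deviations between observed y and predicted y (y hat)."""
    return sum((y - (a_ + b_ * x)) ** 2 for x, y in zip(xs, ys))


# Split each case's deviation from mean y into explained and unexplained parts.
rows = []
for x, y in zip(xs, ys):
    y_hat = a + b * x
    total = y - mean_y            # (Yi - Ymean)
    explained = y_hat - mean_y    # (Yhat - Ymean)
    unexplained = y - y_hat       # (Yi - Yhat)
    rows.append((total, explained, unexplained))
    print(f"x={x:4.1f}  total={total:6.3f}  "
          f"explained={explained:6.3f}  unexplained={unexplained:6.3f}")

# Squared deviations around the mean versus around the fitted line.
sse_mean = sum((y - mean_y) ** 2 for y in ys)
print(f"around mean: {sse_mean:.3f}   around line: {sse(a, b):.3f}")
```

For every case the total deviation equals the explained plus the unexplained deviation, and nudging a or b away from the fitted values (for example `sse(a + 0.1, b)`) gives a larger sum of squared deviations than `sse(a, b)`.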
$y = a + bx + e$
$\hat{y}$ = y hat or predicted y
$\bar{y}$ = mean y

Correlation Coefficient: r
• A measure of the strength of a linear relationship between two interval variables, x and y
• Ranges from –1 to +1
• The larger the absolute value of r (i.e., the closer to –1 or +1), the stronger the relationship between x and y

Correlation Coefficient calculation
• r = Covariance of x and y divided by the product of the standard deviation of x and the standard deviation of y
• Covariance is the sum of the products of the deviations of the cases from their means, divided by N.

Equations...
r = correlation coefficient
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$

Calculating a and b
$$a = \frac{\sum Y - b \sum X}{N} = \bar{Y} - b\bar{X}$$
$$b = \frac{\sum XY - N\bar{X}\bar{Y}}{\sum X^2 - N\bar{X}^2}$$
$$R^2 = r^2 = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$$

X      Y
2.5    2
4      7
8      7

[Figures: points A, B, and C; Dependent Variable against Independent Variable; legend: Linear Trend, Mean Y]
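The formulas for r, b, a, and R² can be tried out on a small data set. The sketch below is a minimal illustration, assuming that the X values (2.5, 4, 8) and Y values (2, 7, 7) from the slides pair up in order. The variable names are chosen for this example only.

```python
import math

# Assumed pairing of the X and Y values listed on the slides.
xs = [2.5, 4.0, 8.0]
ys = [2.0, 7.0, 7.0]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# r = covariance of x and y / (sd of x * sd of y), all using N as divisor.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
r = cov_xy / (sd_x * sd_y)

# b = (sum XY - N * mean X * mean Y) / (sum X^2 - N * mean X^2)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)

# a = mean Y - b * mean X
a = mean_y - b * mean_x

# R^2 = explained variation / total variation
y_hats = [a + b * x for x in xs]
r_squared = (sum((yh - mean_y) ** 2 for yh in y_hats)
             / sum((y - mean_y) ** 2 for y in ys))

print(f"r = {r:.3f}, b = {b:.3f}, a = {a:.3f}, R^2 = {r_squared:.3f}")
```

Computing R² from the predicted values and comparing it with the square of r is a quick consistency check: for a bivariate regression the two agree up to floating-point rounding.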