COLLIN COLLEGE
Chapter 4: Describing the relation between two variables
Math 1342.S06
Prof. Ntchobo

Section 4.1 – Scatter Diagrams and Correlation

Correlation – There is a correlation between two variables when one of them is related to the other in some way.

The response variable is the variable whose value can be explained by the value of the predictor or explanatory variable.

Example: Create a scatterplot of the following data:

x   3   4   3   1   5   2   5
y   3   2   4   5   2   5   1

[Figure: Various Types of Relations in a Scatter Diagram]

The linear correlation coefficient (r) measures the strength and direction of the linear relationship between the two variables. It is also called the Pearson product moment correlation coefficient.

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}}$$

(ALWAYS ROUND TO 3 DECIMAL PLACES)

Properties of r:
1. –1 ≤ r ≤ 1
2. The closer r is to +1, the stronger is the evidence of positive association.
3. The closer r is to –1, the stronger is the evidence of negative association.
4. If r is close to 0, then little or no evidence exists of a linear relationship between the two variables.
5. The value of r is unitless. The units of x and y play no role in the interpretation of r.
6. r is not resistant, i.e., outliers or points that do not follow the pattern will affect the value of r.

To “turn on” r on your calculator go to 2nd, 0 (catalog) and scroll down to “DiagnosticOn” then hit enter twice. It should say “done” on the screen.

To find r, enter the data in L1 and L2, then go to Stat, Calc, LinReg (#4).

Example: Find r for the following set of data. Describe the association, if any.

x    y
11   13
8    11
4    10
1    4
7    11

How can we use the value of r to determine if the correlation between the two variables is strong enough to conclude that there is a linear relationship between them (i.e., significant linear correlation)? Use Table II in appendix A. If the absolute value of r exceeds the value in Table II, then a linear relationship exists between the two variables.
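For readers who want to check the calculator output by hand, here is a minimal Python sketch of the sum-based formula for r, applied to the scatterplot data from the first example in this section. The function name pearson_r is made up for illustration and is not part of the course materials. The Table II comparison is not included, because the table itself is not reproduced in these notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Linear correlation coefficient r from the sum-based formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Data from the first scatterplot example in Section 4.1
x = [3, 4, 3, 1, 5, 2, 5]
y = [3, 2, 4, 5, 2, 5, 1]

r = pearson_r(x, y)
print(round(r, 3))  # rounded to 3 decimal places, as the notes require
```

Here the sums are n = 7, Σx = 23, Σy = 22, Σxy = 59, Σx² = 89, Σy² = 84, so r = –93 / √(94 · 104) ≈ –0.941, a strong negative association.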
Example: Assume that 20 pairs of data result in a value of r = 0.565. Is there a linear relationship between x and y?

Example: The following measurements represent the chest size of a bear and its weight.

x Chest (in)    y Weight (lbs)
26              90
45              344
54              416
49              348
41              262
49              360
44              332
19              34

a) Draw a scatter diagram of the data.
b) Calculate r.
c) Determine if there is a linear relationship between a bear’s weight and its chest size.
d) Do the results change if you convert inches to feet?

Difference Between Correlation and Causation

According to data obtained from the Statistical Abstract of the United States, the correlation between the percentage of the female population with a bachelor’s degree and the percentage of births to unmarried mothers since 1990 is 0.940. Does this mean that a higher percentage of females with bachelor’s degrees causes a higher percentage of births to unmarried mothers? Certainly not! The correlation exists only because both percentages have been increasing since 1990. It is this relation that causes the high correlation.

In general, time series data (data collected over time) will have high correlations because each variable is moving in a specific direction over time (both going up or down over time; one increasing, while the other is decreasing over time).

When data are observational, we cannot claim a causal relation exists between two variables. We can only claim causality when the data are collected through a designed experiment.

Another way that two variables can be related even though there is not a causal relation is through a lurking variable. A lurking variable is related to both the explanatory and response variable. For example, ice cream sales and crime rates have a very high correlation. Does this mean that local governments should shut down all ice cream shops? No! The lurking variable is temperature. As air temperatures rise, both ice cream sales and crime rates rise.
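Returning to part (d) of the bear example: property 5 of r says that r is unitless. The sketch below checks this numerically by computing r once with chest size in inches and once in feet. It is only an illustration; the helper name pearson_r and the deviation-based form of the formula are choices made here, not something prescribed by the notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Bear data from the example: chest size (in) and weight (lbs)
chest_in = [26, 45, 54, 49, 41, 49, 44, 19]
weight_lb = [90, 344, 416, 348, 262, 360, 332, 34]

# Same measurements with chest size converted to feet
chest_ft = [c / 12 for c in chest_in]

r_inches = pearson_r(chest_in, weight_lb)
r_feet = pearson_r(chest_ft, weight_lb)

print(round(r_inches, 3), round(r_feet, 3))  # the two values agree
```

Rescaling x by a positive constant scales the numerator and the denominator by the same factor, so r does not change.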
Section 4.2 – Least-Squares Regression

Once the linear correlation coefficient has indicated that a linear relationship exists between two variables, our next step is to find a linear equation that describes the relationship between the two variables. This equation is the least-squares regression line. You can find it with your calculator: Stat, Calc, LinReg (#4 or #8).

Using Regression Lines to Make Predictions:
When predicting a y-value based on some value of x,
* If there is no linear correlation between the variables, the best prediction for y is ȳ, the mean of the observed y-values.
* If there is linear correlation between the variables, the best predicted y-value is found by substituting the x value into the regression equation.
* When using the regression equation to predict, stay within the scope of the sample data.

Example:
a) Use your calculator to find the least-squares regression line for the following set of data:

Car Weight (lbs)                 3175   3450   3225   3985   2440   2500   2290
Fuel Consumption (mi per gal)    27     29     27     24     37     34     37

b) Predict the fuel consumption of a car that weighs 3000 lbs.

Example:
a) Use your calculator to find the least-squares regression line for the following set of data:

x    y
2    5
3    9
3    10
5    12
8    25

b) Calculate r.
c) Is there a linear relationship between the variables?
d) What is the best predicted value of y when x = 4?

Example: Suppose ŷ = 0.25 + 3.2x, n = 8, r = 0.613, and ȳ = 10.
a) Is there a linear relationship between the variables?
b) What is the best predicted value of y when x = 6?

A residual is the difference between an actual observed y-value (y) and the predicted y-value (ŷ) found using the regression equation: residual = y – ŷ.

EX: If the predicted value at x = 3 is ŷ = 5.2 and the observed value is y = 4.5, the residual is 4.5 – 5.2 = –0.7.

Positive residuals indicate that a data point is ABOVE average (above the regression line). Negative residuals indicate that a data point is BELOW average (below the regression line).
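The notes give only the calculator route to the regression line. As a cross-check, here is a minimal Python sketch that uses the standard least-squares formulas (slope = Sxy / Sxx, intercept = ȳ – slope · x̄). These formulas are an assumption brought in here, not quoted from the notes. The function name fit_line is hypothetical, and the data are the five (x, y) pairs from the second example above.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y-hat = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Data from the second example in Section 4.2
x = [2, 3, 3, 5, 8]
y = [5, 9, 10, 12, 25]

slope, intercept = fit_line(x, y)

# Predicted y at x = 4 (inside the observed x-range of 2 to 8)
y_hat_at_4 = intercept + slope * 4

# Residual = observed y minus predicted y, for each data point
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(round(slope, 3), round(intercept, 3))
print(round(y_hat_at_4, 3))
print([round(e, 3) for e in residuals])
```

The prediction at x = 4 is the best predicted value only if the Table II check has first shown a significant linear correlation; otherwise the best prediction is ȳ. The residuals of a least-squares line always sum to zero (up to rounding), which is a quick sanity check on the fit.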
Example:
a) Use your calculator to find the least-squares regression line for the following set of data:

Club-head speed (mph)    Distance (yd)
100                      257
102                      264
103                      274
101                      266
105                      277
100                      263
99                       258
105                      275

b) Interpret the slope and y-intercept.
c) Predict the distance that a golf ball will travel if the club-head speed is 103 mph.
d) Suppose that a golf ball is hit with a club-head speed of 103 mph and travels 274 yards. Is this distance above or below average among all golf balls hit with a club-head speed of 103 mph?

CAUTION: If the least-squares regression line is used to make predictions based on values of the explanatory variable that are much larger or much smaller than the observed values, we say the researcher is working outside the scope of the model. Never use a least-squares regression line to make predictions outside the scope of the model because we can’t be sure the linear relation continues to exist.
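The caution above can be built into a prediction routine. The following Python sketch fits the golf data with the standard least-squares formulas and refuses to predict outside the observed range of club-head speeds. The function names fit_line and predict_in_scope are hypothetical. Treating anything outside the observed minimum and maximum as out of scope is a simplifying assumption made here, stricter than the "much larger or much smaller" wording of the notes.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_in_scope(x_new, xs, slope, intercept):
    """Predict y-hat, but only for x values inside the observed x-range."""
    if not (min(xs) <= x_new <= max(xs)):
        raise ValueError(
            f"x = {x_new} is outside the scope of the model "
            f"({min(xs)} to {max(xs)})"
        )
    return intercept + slope * x_new

# Golf data from the example: club-head speed (mph) and distance (yd)
speed = [100, 102, 103, 101, 105, 100, 99, 105]
distance = [257, 264, 274, 266, 277, 263, 258, 275]

slope, intercept = fit_line(speed, distance)

# Part (c): predicted distance at 103 mph
predicted = predict_in_scope(103, speed, slope, intercept)

# Part (d): residual for an observed 274 yd at 103 mph
residual = 274 - predicted

print(round(slope, 4), round(intercept, 4))
print(round(predicted, 1), round(residual, 1))
```

A positive residual means the 274-yard shot is above average for balls hit at 103 mph. Calling predict_in_scope(120, speed, slope, intercept) raises a ValueError, because 120 mph lies outside the observed speeds of 99 to 105 mph.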