Chapter 4 Describing the Relation between Two Variables

4.1 Scatter Diagrams and Correlation

The RESPONSE VARIABLE is the variable whose value can be explained by the value of the EXPLANATORY or PREDICTOR VARIABLE. That is, 'y' depends on 'x'.

A SCATTER DIAGRAM is a graph that shows the relationship between two quantitative variables measured on the same individual. Each individual in the data set is represented by a point in the scatter diagram.

I. Scatter plot

(x) # hours of sleep    6    8    10    2
(y) performance         3    5    4     1

II. Linear Correlation coefficient (r)

The linear correlation coefficient, or Pearson product moment correlation coefficient, is a measure of the strength and direction of the linear relation between two quantitative variables. The Greek letter ρ (rho) represents the population correlation coefficient, and r represents the sample correlation coefficient. We present only the formula for the sample correlation coefficient.

Sample Linear Correlation Coefficient

$$ r = \frac{\sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)}{n - 1} $$

where
$\bar{x}$ is the sample mean of the explanatory variable
$s_x$ is the sample standard deviation of the explanatory variable
$\bar{y}$ is the sample mean of the response variable
$s_y$ is the sample standard deviation of the response variable
$n$ is the number of individuals in the sample

Properties of the Linear Correlation Coefficient

1. The linear correlation coefficient is always between –1 and 1, inclusive. That is, –1 ≤ r ≤ 1.
2. If r = +1, then a perfect positive linear relation exists between the two variables.
3. If r = –1, then a perfect negative linear relation exists between the two variables.
4. The closer r is to +1, the stronger is the evidence of positive association between the two variables.
5. The closer r is to –1, the stronger is the evidence of negative association between the two variables.
6. If r is close to 0, then little or no evidence exists of a linear relation between the two variables. So r close to 0 does not imply no relation, just no linear relation.
7. The linear correlation coefficient is a unitless measure of association. So the unit of measure for x and y plays no role in the interpretation of r.
8. The correlation coefficient is not resistant. Therefore, an observation that does not follow the overall pattern of the data could affect the value of the linear correlation coefficient.

EXAMPLE Determining the Linear Correlation Coefficient

Determine the linear correlation coefficient of the drilling data.

STEP 1: First make a scatterplot (using StatCrunch) to see whether there is a linear relationship. Put the data in columns 1 and 2.
STEP 2: STAT - SUMMARY STATISTICS - CORRELATION - SELECT COLUMNS - click the 1st variable, then click the 2nd variable while holding the CTRL key - COMPUTE

r = 0.7773

Compare this value to the scatter diagram.

Testing for a Linear Relation

• Step 1 Determine the absolute value of the correlation coefficient.
• Step 2 Find the critical value in Table II from Appendix A for the given sample size.
• Step 3 If the absolute value of the correlation coefficient is greater than the critical value, we say a linear relation exists between the two variables. Otherwise, no linear relation exists.

EXAMPLE Does a Linear Relation Exist?

r = 0.7773. There is a strong positive correlation between the depth at which drilling begins and the time to drill 5 feet.
Using the table with n = 12, the critical value (CV) is 0.576.
Since 0.7773 > 0.576, a linear relation exists.

4.2 Least-Squares Regression

EXAMPLE Finding an Equation that Describes Linearly Related Data

The following data show the number of doctor visits in a year (x) with the corresponding days sick from work (y) for six patients at a clinic.

(a) Graph the equation on the scatter diagram.
(b) Find a linear equation that relates x (the explanatory variable) and y (the response variable) by selecting two points on the line of best fit and finding the equation of the line containing the points.
Use (5, 3) and (1, 6).

Slope: $m = \frac{6 - 3}{1 - 5} = \frac{3}{-4} = -0.75$

Equation:
$y - y_1 = m(x - x_1)$
$y - 6 = -0.75(x - 1)$
$y - 6 = -0.75x + 0.75$
$y = -0.75x + 6.75$

or Days sick = -0.75(doctor visits) + 6.75, that is, S = -0.75V + 6.75

(c) Use the equation to predict the number of sick days if you visit the doctor 3 times a year.

S = -0.75V + 6.75
S = -0.75(3) + 6.75 = -2.25 + 6.75 = 4.5 days

If you visit the doctor three times a year, you can expect to miss 4.5 days of work on average.

(d) Does going to the doctor cause you to miss fewer days?

Not necessarily: association ≠ causation. Going to the doctor may cause you to miss fewer days, but the association alone does not establish that.

(e) Use the equation to predict the number of sick days if you visit the doctor 10 times a year.

You cannot predict beyond the scope of the data! This is called extrapolation.

The difference between the observed value of y and the predicted value of y is the error, or residual. Using the line from the last example and the predicted value at x = 3:

residual = observed y – predicted y = 5.2 – 4.5 = 0.7 days (the line underpredicted)

Least-Squares Regression Criterion

If there is a positive or negative linear correlation between x and y, find the line that best fits the data.

The least-squares regression line is the line that minimizes the sum of the squared errors (or residuals). That is, it minimizes the sum of the squared vertical distances between the observed values of y and those predicted by the line, $\hat{y}$ ("y-hat"). We represent this as "minimize Σ residuals²".

The Least-Squares Regression Line

The equation of the least-squares regression line is given by

$$ \hat{y} = b_1 x + b_0 $$

where

$$ b_1 = r \cdot \frac{s_y}{s_x} $$

is the slope of the least-squares regression line and

$$ b_0 = \bar{y} - b_1 \bar{x} $$

is the y-intercept of the least-squares regression line.

Note: $\bar{x}$ is the sample mean and $s_x$ is the sample standard deviation of the explanatory variable x; $\bar{y}$ is the sample mean and $s_y$ is the sample standard deviation of the response variable y.
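The formulas for r, b1, and b0 can be checked by hand on a small data set. The following is a minimal sketch in Python, not part of the StatCrunch workflow used in these notes. It applies the formulas to the hours-of-sleep and performance values from Section 4.1. The function names `correlation` and `least_squares_line` are illustrative choices, and only the standard library is assumed.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample linear correlation coefficient r, computed from z-scores."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    ]
    return sum(products) / (n - 1)


def least_squares_line(xs, ys):
    """Return (b1, b0): slope and y-intercept of the least-squares line."""
    r = correlation(xs, ys)
    b1 = r * stdev(ys) / stdev(xs)      # b1 = r * s_y / s_x
    b0 = mean(ys) - b1 * mean(xs)       # b0 = y-bar - b1 * x-bar
    return b1, b0


# Data from the scatter plot example in Section 4.1
hours_of_sleep = [6, 8, 10, 2]
performance = [3, 5, 4, 1]

r = correlation(hours_of_sleep, performance)
b1, b0 = least_squares_line(hours_of_sleep, performance)
print(f"r = {r:.3f}, b1 = {b1:.3f}, b0 = {b0:.3f}")
```

For these four points the sketch gives r ≈ 0.886, b1 ≈ 0.443, and b0 ≈ 0.371. With only four observations this is an arithmetic illustration and not evidence of a linear relation; the critical-value test still applies.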
EXAMPLE Finding the Least-Squares Regression Line

Using the drilling data and computer technology:

(a) Draw the least-squares regression line on the scatter diagram of the data to verify a linear relationship.

From before, r = 0.7773.

(b) Find the least-squares regression line.

1. Input the data into StatCrunch.
2. STAT - REGRESSION - SIMPLE LINEAR - SELECT VARIABLES - COMPUTE
3. Equation: $\hat{y} = 5.5273 + 0.0116x$

(c) Predict the drilling time if drilling starts at 130 feet.

$\hat{y} = 5.5273 + 0.0116(130) \approx 7.035$ minutes

(d) Is the observed drilling time at 130 feet above or below average?

Observed = 6.93 minutes, which is below the predicted average of 7.035 minutes.

(e) Interpretation of Slope: 0.0116 minutes per 1 foot

For each additional foot of depth at which drilling begins, the time to drill 5 feet increases by 0.0116 minutes on average.

Interpretation of the y-Intercept: 5.5273 ≈ 5.5 minutes (occurs when x = 0)

When drilling begins at 0 feet (the surface), the time to drill 5 feet is about 5.5 minutes.

Caution: If the least-squares regression line is used to make predictions based on values of the explanatory variable that are much larger or much smaller than the observed values, we say the researcher is working outside the scope of the model. Never use a least-squares regression line to make predictions outside the scope of the model, because we can't be sure the linear relation continues to exist.

Predictions When There is No Linear Relation

Do not make predictions from the regression line! When the correlation coefficient indicates no linear relation between the explanatory and response variables, and the scatter diagram indicates no relation at all between the variables, we use the mean value of the response variable as the predicted value, so that $\hat{y} = \bar{y}$.

Summary

1. Use StatCrunch to plot a scatter plot.
2. Use StatCrunch to calculate r.
3. Determine whether there is a positive or negative linear correlation between X and Y.
4. If there is a linear correlation between X and Y, use StatCrunch to find the least-squares regression line. Otherwise, do not find the least-squares regression line; stop.
5. Given a value of X, if there is a linear correlation between X and Y, use the least-squares regression line to find the best predicted Y.
6. Given a value of X, if there is no linear correlation between X and Y, use StatCrunch to find $\bar{y}$; the best predicted Y is $\bar{y}$ for any X.
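The decision in steps 3 to 6 of the summary can be written as a short rule. This is a sketch only, in Python, not StatCrunch. The function names are hypothetical, and the critical value must still be looked up in Table II for the sample size. The drilling values (r = 0.7773, critical value 0.576, line ŷ = 5.5273 + 0.0116x) come from the examples above. The mean drilling time is not given in these notes, so `PLACEHOLDER_Y_MEAN` and the weak correlation 0.10 are made-up values used only to exercise the second branch.

```python
def has_linear_relation(r, critical_value):
    """A linear relation exists when |r| exceeds the table's critical value."""
    return abs(r) > critical_value


def best_prediction(x, r, critical_value, slope, intercept, y_mean):
    """Best predicted y for a given x, following the summary steps.

    Uses the regression line when a linear relation exists; otherwise
    falls back to the mean of the response variable. This sketch does
    not check whether x lies within the scope of the model.
    """
    if has_linear_relation(r, critical_value):
        return intercept + slope * x
    return y_mean


# Hypothetical placeholder: the notes do not report the mean drilling time.
PLACEHOLDER_Y_MEAN = 7.0

# Drilling example: r = 0.7773 > 0.576, so the regression line is used.
with_relation = best_prediction(
    130, r=0.7773, critical_value=0.576,
    slope=0.0116, intercept=5.5273, y_mean=PLACEHOLDER_Y_MEAN,
)

# Made-up weak correlation (0.10 < 0.576): the mean is returned instead.
without_relation = best_prediction(
    130, r=0.10, critical_value=0.576,
    slope=0.0116, intercept=5.5273, y_mean=PLACEHOLDER_Y_MEAN,
)

print(round(with_relation, 3), without_relation)
```

In the first call the rule reproduces the prediction of about 7.035 minutes from part (c). In the second call it ignores the line and returns the placeholder mean.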