Correlation and Regression

Outline
Introduction
10-1 Scatter Plots
10-2 Correlation
10-3 Correlation Coefficient
10-4 Regression

Note: This PowerPoint is only a summary and your main source should be the book.

Correlation and regression are inferential statistics that involve determining whether a relationship exists between two or more numerical (quantitative) variables.

Examples:
Is the number of hours a student studies related to the student's score on a particular exam?
Is caffeine related to heart damage?
Is there a relationship between a person's age and his or her blood pressure?

Introduction

Correlation is a statistical method used to determine whether a linear relationship between variables exists.
Regression is a statistical method used to describe the nature of the relationship between variables, that is, positive or negative, linear or nonlinear.

There are two types of relationships: simple and multiple.
In a simple relationship, there are two variables:
o an independent variable (predictor variable)
o a dependent variable (response variable)
In a multiple relationship, two or more independent variables are used to predict one dependent variable.

Example:
1- Is there a relationship between a person's age and his or her blood pressure?
The type of relationship:
The independent variable(s):
The dependent variable:
------------------------------------------------------------
2- Is there a relationship between a student's final score in math and factors such as the number of hours the student studies, the number of absences, and the IQ score?
The type of relationship:
The independent variable(s):
The dependent variable:

A simple relationship can also be positive or negative.
A positive relationship exists when both variables increase or decrease at the same time.
In a negative relationship, as one variable increases, the other variable decreases, and vice versa.
Example of a positive relationship: a person's height and ideal weight.
Example of a negative relationship: the strength of people over 60 years of age.

Scatter Plots

A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable x and the dependent variable y.

Notation:
X: explanatory (independent, predictor) variable
Y: response (dependent, outcome) variable

Example 10-1: Construct a scatter plot for the data shown for car rental companies in the United States for a recent year.
[Data table and scatter plot: the number of cars is the independent variable and the income is the dependent variable.]
There is a positive relationship.

Example 10-2: Construct a scatter plot for the data obtained in a study on the number of absences and the final grades of seven randomly selected students from a statistics class.

Student   Number of absences x   Final grade y
A          6                     82
B          2                     86
C         15                     43
D          9                     74
E         12                     58
F          5                     90
G          8                     78

Solution:
Step 1: Draw and label the x and y axes.
Step 2: Plot each point on the graph.
[Scatter plot: final grade (y axis) against number of absences (x axis).]
There is a negative relationship.

Example 10-3: Construct a scatter plot for the data obtained in a study on the number of hours that nine people exercise each week and the amount of milk (in ounces) each person consumes per week.

Person   Hours x   Amount y
A         3        48
B         0         8
C         2        32
D         5        64
E         8        10
F         5        32
G        10        56
H         2        72
I         1        48

Solution:
Step 1: Draw and label the x and y axes.
Step 2: Plot each point on the graph.
[Scatter plot: amount of milk (y axis) against hours of exercise (x axis).]
There is no specific type of relationship.

[Figure panels: positive linear relationship; negative linear relationship.]

Do the data sets have a positive, a negative, or no relationship?
A. The relationship between exercise and weight: negative relationship
C. The size of a person and the number of fingers he has: no relationship
D. The relationship between the number of hours of studying and the final score: positive relationship

Correlation

The correlation coefficient is a numerical measure used to determine whether two or more variables are related and to determine the strength of the relationship between or among the variables.
The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables.
The symbol for the sample correlation coefficient is r. The symbol for the population correlation coefficient is ρ.
The range of the correlation coefficient is from −1 to +1:
$$ -1 \le r \le 1 $$
If there is a strong positive linear relationship between the variables, the value of r will be close to +1.
If there is a strong negative linear relationship between the variables, the value of r will be close to −1.

Correlation Coefficient

Pearson (Ch. 10)
- Denoted by (r)
- Used only when the two variables are quantitative.
Spearman Rank (Ch. 13)
- Denoted by (r_s)
- Used when the two variables are quantitative or qualitative (data that can be ranked).

There are several types of correlation coefficients. The one explained in this section is called the Pearson product moment correlation coefficient (PPMC).
The formula for the correlation coefficient is
$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$
where n is the number of data pairs.
Rounding Rule: Round to three decimal places.

EX 1: Compute the value of the Pearson product moment correlation coefficient for the data below:

X   2    4   1   2
Y   8   10   3   6

Example 10-4: Compute the correlation coefficient for the data in Example 10-1.
Company   Cars x (ten thousands)   Income y (billions of dollars)   xy        x²        y²
A         63.0                     7.0                              441.00    3969.00   49.00
B         29.0                     3.9                              113.10     841.00   15.21
C         20.8                     2.1                               43.68     432.64    4.41
D         19.1                     2.8                               53.48     364.81    7.84
E         13.4                     1.4                               18.76     179.56    1.96
F          8.5                     1.5                               12.75      72.25    2.25

Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx² = 5859.26, Σy² = 80.67

Solution:
$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = \frac{6(682.77) - (153.8)(18.7)}{\sqrt{\left[6(5859.26) - (153.8)^2\right]\left[6(80.67) - (18.7)^2\right]}} = 0.982 $$
r = 0.982 (strong positive relationship)

Example 10-5: Compute the correlation coefficient for the data in Example 10-2.

Student   Number of absences x   Final grade y   xy    x²    y²
A          6                     82              492    36   6724
B          2                     86              172     4   7396
C         15                     43              645   225   1849
D          9                     74              666    81   5476
E         12                     58              696   144   3364
F          5                     90              450    25   8100
G          8                     78              624    64   6084

Σx = 57, Σy = 511, Σxy = 3745, Σx² = 579, Σy² = 38,993

Solution:
$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = \frac{7(3745) - (57)(511)}{\sqrt{\left[7(579) - (57)^2\right]\left[7(38{,}993) - (511)^2\right]}} = -0.944 $$
r = −0.944 (strong negative relationship)

Rank Correlation Coefficient

Another type of correlation coefficient, called the Spearman rank correlation coefficient, can be used when the data are ranked.
The formula for the correlation coefficient is
$$ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$
where
d = difference in ranks,
n = number of data pairs.

If both sets of data have the same ranks, r_s will be +1.
If the sets of data are ranked in exactly the opposite way, r_s will be −1.
If there is no relationship between the rankings, r_s will be near 0.

Example 13-7 (p. 698): Two students were asked to rate eight different textbooks for a specific course on an ascending scale from 0 to 20 points. Compute the correlation coefficient for the following textbook ratings:
Textbook   Student 1   Student 2   Rank (X1)   Rank (X2)   d = X1 − X2   d²
A           4           4          7           8           −1            1
B          10           6          4           7           −3            9
C          18          20          2           1            1            1
D          20          14          1           3           −2            4
E          12          16          3           2            1            1
F           2           8          8           5            3            9
G           5          11          6           4            2            4
H           9           7          5           6           −1            1
Total                                                       0           30

$$ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(30)}{8(8^2 - 1)} = 1 - \frac{180}{504} = 0.643 $$
r_s = 0.643 (strong positive relationship)

Regression

If the value of the correlation coefficient is significant, the next step is to determine the equation of the regression line, which is the data's line of best fit.
Best fit means that the sum of the squares of the vertical distances from each point to the line is at a minimum.
$$ y = a + bx $$
$$ a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n\sum x^2 - (\sum x)^2} \qquad b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} $$
where
a = the y intercept,
b = the slope of the line.

Example 10-9: Find the equation of the regression line for the data in Example 10-4, and graph the line on the scatter plot.
Σx = 153.8, Σy = 18.7, Σy² = 80.67, Σxy = 682.77, Σx² = 5859.26, n = 6
$$ a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n\sum x^2 - (\sum x)^2} = \frac{(18.7)(5859.26) - (153.8)(682.77)}{6(5859.26) - (153.8)^2} = 0.396 $$
$$ b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{6(682.77) - (153.8)(18.7)}{6(5859.26) - (153.8)^2} = 0.106 $$
$$ y = 0.396 + 0.106x $$
Find two points to sketch the graph of the regression line. Use any x values between 10 and 60. For example, let x equal 15 and 40. Substitute in the equation and find the corresponding y values:
$$ y = 0.396 + 0.106(15) = 1.986 \qquad y = 0.396 + 0.106(40) = 4.636 $$
Plot (15, 1.986) and (40, 4.636), and sketch the resulting line.

Example 10-10: Find the equation of the regression line for the data in Example 10-5, and graph the line on the scatter plot.
Σx = 57, Σy = 511, Σxy = 3745, Σx² = 579, n = 7
$$ a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n\sum x^2 - (\sum x)^2} = \frac{(511)(579) - (57)(3745)}{7(579) - (57)^2} = 102.493 $$
$$ b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{7(3745) - (57)(511)}{7(579) - (57)^2} = -3.622 $$
$$ y = 102.493 - 3.622x $$

*Remark: The sign of the correlation coefficient and the sign of the slope of the regression line will always be the same.
r (positive) ↔ b (positive)
r (negative) ↔ b (negative)
Car Rental Companies: r = 0.982, b = 0.106
Absences and Final Grade: r = −0.944, b = −3.622
The regression line will always pass through the point (x̄, ȳ).
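The formulas for r, r_s, a, and b used in the worked examples can be checked with a short Python sketch. This is only an illustration under stated assumptions: the function names (`sums`, `pearson_r`, `spearman_rs`, `regression_line`) are made up for this sketch and do not come from the book, and the Spearman helper assumes there are no tied values and gives rank 1 to the largest value, as in Example 13-7.

```python
from math import sqrt


def sums(xs, ys):
    """Return n, Σx, Σy, Σxy, Σx², Σy² for paired data."""
    n = len(xs)
    return (
        n,
        sum(xs),
        sum(ys),
        sum(x * y for x, y in zip(xs, ys)),
        sum(x * x for x in xs),
        sum(y * y for y in ys),
    )


def pearson_r(xs, ys):
    """Pearson product moment correlation coefficient (computational formula)."""
    n, sx, sy, sxy, sxx, syy = sums(xs, ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))


def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n, sx, sy, sxy, sxx, _ = sums(xs, ys)
    denom = n * sxx - sx**2
    a = (sy * sxx - sx * sxy) / denom
    b = (n * sxy - sx * sy) / denom
    return a, b


def spearman_rs(xs, ys):
    """Spearman rank correlation; assumes no ties, rank 1 = largest value."""
    def ranks(values):
        ordered = sorted(values, reverse=True)
        return [ordered.index(v) + 1 for v in values]

    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n**2 - 1))


# Data copied from the examples in these slides.
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]      # ten thousands of cars
income = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]         # billions of dollars
absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]
student1 = [4, 10, 18, 20, 12, 2, 5, 9]
student2 = [4, 6, 20, 14, 16, 8, 11, 7]

print(round(pearson_r(cars, income), 3))         # 0.982
print(round(pearson_r(absences, grades), 3))     # -0.944
print(round(spearman_rs(student1, student2), 3)) # 0.643

a, b = regression_line(cars, income)
print(round(a, 3), round(b, 3))                  # 0.396 0.106
```

In both data sets the sign of `b` matches the sign of `pearson_r`, and evaluating `a + b * mean(x)` reproduces the mean of y, which is the "passes through (x̄, ȳ)" remark in numeric form.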
*Remark: The magnitude of the change in one variable when the other variable changes exactly 1 unit is called a marginal change. The value of the slope b of the regression line equation represents the marginal change.
For example, Car Rental Companies: b = 0.106, which means that for each increase of 10,000 cars, the value of y changes by 0.106 unit on average (the annual income increases by $106 million).
For example, Absences and Final Grade: b = −3.622, which means that for each additional absence, the value of y changes by −3.622 units on average (the final grade decreases by 3.622 points).

Example 10-11: Use the equation of the regression line to predict the income of a car rental agency that has 200,000 automobiles.
x = 20 corresponds to 200,000 automobiles.
$$ y = 0.396 + 0.106x = 0.396 + 0.106(20) = 2.516 $$
Hence, when a rental agency has 200,000 automobiles, its revenue will be approximately $2.516 billion.
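The prediction in Example 10-11 and the marginal-change remark can be sketched in a few lines of Python. The helper name `predict` is hypothetical, and the coefficients are the rounded values a = 0.396 and b = 0.106 from Example 10-9, so results agree with the slides only to three decimal places.

```python
def predict(a, b, x):
    """Predicted y on the regression line y = a + b*x."""
    return a + b * x


A, B = 0.396, 0.106  # rounded intercept and slope from Example 10-9

# x is measured in ten thousands of cars, so 200,000 cars is x = 20.
revenue = predict(A, B, 20)

# Marginal change: the difference in predicted y when x grows by exactly 1 unit.
marginal = predict(A, B, 21) - predict(A, B, 20)

print(round(revenue, 3))   # 2.516 (billions of dollars)
print(round(marginal, 3))  # 0.106, equal to the slope b
```

Because the line is linear, the marginal change is the same at every x, which is why the slope alone summarizes it.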