CORRELATION AND REGRESSION
Walter Fleming

CORRELATION

The correlation coefficient is a number between -1 and +1 that measures the strength of the relationship between two sets of data. The closer the correlation coefficient is to +1 or -1, the stronger the relationship and the easier it is to predict one item by using the other. For example, there is a strong relationship between the amount of daily sunshine and the sales of ice-cream, so the correlation coefficient is close to 1.

Positive and Negative correlation coefficient

If the two sets of data are related in such a way that as one increases so does the other, there is a positive correlation between them and the correlation coefficient will be positive. The daily sunshine and sales of ice-cream have a positive correlation. If they are related so that as one increases the other decreases, they have a negative correlation. For example, the amount of daily sunshine and the sales of rainwear would have a negative correlation.

Strong Positive Correlation     r = 0.8
Strong Negative Correlation     r = -0.8
No Correlation                  r = 0
Perfect Positive Correlation    r = +1
Perfect Negative Correlation    r = -1

Perfect correlation is usually found only in science. In most other situations the correlation coefficient is a decimal. It is calculated using the formula:

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}} $$

RANK CORRELATION

This is another method of finding the correlation coefficient. With this method both sets of data must be ranked, i.e. numbered in either ascending or descending order. Both sets must be ranked in the same way, i.e. either both ascending or both descending. Then the difference between the ranks is found for each item by subtracting one rank from the other. This gives us d for the formula:

$$ r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$

Example

A group of 7 candidates is ranked in a written exam and in a practical exam.
The following table gives the results:

Candidate            A  B  C  D  E  F  G
Place in Written     3  5  1  4  2  7  6
Place in Practical   4  5  3  2  1  6  7

Taking R1 as the place in the written exam and R2 as the place in the practical exam:

Candidate   R1   R2   d   d²
A           3    4    1   1
B           5    5    0   0
C           1    3    2   4
D           4    2    2   4
E           2    1    1   1
F           7    6    1   1
G           6    7    1   1
                     Σd² = 12

With n = 7:

$$ r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 12}{7(7^2 - 1)} = 1 - \frac{72}{336} = 1 - 0.21 = 0.79 $$

This indicates that there is a fairly strong correlation between the written exam results and the practical results.

Regression

Regression analysis examines what the relationship between two sets of data is. It relates one set to the other by means of an equation, the regression equation. This is the equation of a straight line. Regression assumes that the data points lie close to a straight line, known as the line of best fit or the regression line, so the stronger the correlation, the more reliable this equation is for forecasting.

The equation is Y = a + bX, where

$$ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \qquad \text{and} \qquad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} $$

To forecast, substitute the given X value into the equation.

The Coefficient of Determination

This is found using the formula (r² × 100)%. It gives the percentage of the variation in the dependent variable that is explained by the variation in the independent variable.

Example: The number of daily hours of sunshine and the sales of ice-cream for a particular week are given in the following table:

Hours of sunshine            4   5   3   6   5   9   9
Ice-cream sales (000 kg)    10  11  10  12  10  15  16

Find the regression equation and use it to forecast the expected sales on a day on which 7 hours of sunshine are expected.

Procedure for calculating the Product Moment Correlation Coefficient

1. Arrange the two sets of data in 2 columns: Col. 1 is X, Col. 2 is Y. If the data are labelled X and Y, follow that. If they are not, X is the data over which you have control (i.e. the independent variable); for example, price is X since you can control it directly, and sales is Y.
If in doubt, take the top line as X and the second line as Y.
2. Form three more columns: Col. 3 for XY, the product of each pair, Col. 4 for X squared, and Col. 5 for Y squared.
3. Calculate the entries in each column.
4. Add up each column. This gives the Σs in the formula. n is the number of rows of data.
5. Insert the results into the formula and calculate. Be careful with the calculations; don't try to do too much at one time.

Example

To find Pearson’s product moment correlation coefficient for the following data showing the cost of maintaining 10 machines of different ages (in months):

Machine   1    2    3    4    5    6    7    8    9    10
Age       5    10   15   20   30   30   30   50   50   60
Cost      190  240  250  300  310  335  300  300  350  395

Put the data in vertical columns, identifying them as X (age) and Y (cost). Then find the values of the columns XY, X² and Y².

X      Y      XY      X²      Y²
5      190    950     25      36100
10     240    2400    100     57600
15     250    3750    225     62500
20     300    6000    400     90000
30     310    9300    900     96100
30     335    10050   900     112225
30     300    9000    900     90000
50     300    15000   2500    90000
50     350    17500   2500    122500
60     395    23700   3600    156025
300    2970   97650   12050   913050
ΣX     ΣY     ΣXY     ΣX²     ΣY²

Put these values into the formula:

$$ r = \frac{10 \times 97650 - 300 \times 2970}{\sqrt{\left(10 \times 12050 - 300^2\right)\left(10 \times 913050 - 2970^2\right)}} = \frac{85500}{\sqrt{30500 \times 309600}} = \frac{85500}{\sqrt{9442800000}} = 0.88 $$

A coefficient of 0.88 tells us that the link between age and cost is strong. It is also a positive correlation, which indicates that as age increases, cost tends to increase too.

The coefficient of determination = r² × 100 = 0.88² × 100 = 77.44%

This indicates that 77.44% of the differences that occur in the cost are associated with differences in age. The remaining 22.56% of the differences are due to other factors.
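The column-sum procedure above can be checked with a short script. This is a minimal sketch in Python, not part of the original handout; the function name `pearson_r` and the variable names are illustrative assumptions. It builds the same five sums (ΣX, ΣY, ΣXY, ΣX², ΣY²) for the machine data and substitutes them into the formula.

```python
from math import sqrt

# Machine data from the worked example: age in months (X) and maintenance cost (Y).
age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]


def pearson_r(xs, ys):
    """Product moment correlation coefficient from the five column sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(age, cost)
print(round(r, 2))            # 0.88
print(round(r * r * 100, 1))  # 77.4
```

The script squares the unrounded value of r, so it reports about 77.4% for the coefficient of determination; the 77.44% in the text comes from squaring the rounded value 0.88.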
A rough guide to interpreting the correlation coefficient strength:

1.0 – 0.9    Very strong
0.9 – 0.8    Strong
0.8 – 0.7    Fairly strong
under 0.7    Weak to none

To find the least squares equation of the regression line (line of best fit)

The equation is of the form Y = a + bX. Using the formulae for a and b, we get

$$ b = \frac{10 \times 97650 - 300 \times 2970}{10 \times 12050 - 300^2} = \frac{85500}{30500} = 2.8 $$

$$ a = \frac{2970}{10} - 2.8 \times \frac{300}{10} = 297 - 84 = 213 $$

So the regression equation is Y = 213 + 2.8X.

This can be used for forecasting. To forecast the cost of maintaining a machine that is 23 months old, substitute 23 for X and find Y, the cost:

Y = 213 + 2.8 × 23 = 277.4

To draw the regression line on the scatter graph:

Find two points on the line by putting in two values for X and finding the corresponding Y. It is easier to draw if you take values for X that are near the lower end and near the higher end of the data. In this example you could take X = 5, which gives Y = 227, and X = 50, which gives Y = 353. Now plot the two points (5, 227) and (50, 353) on the scatter diagram, then join them with a straight line. This is the regression line for the data.
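The least squares calculation above can be reproduced in a few lines. This is a minimal sketch in Python, not part of the original handout; the names `least_squares_line` and `forecast` are illustrative assumptions. It uses the same formulae for a and b on the machine data, then forecasts the cost for a 23-month-old machine.

```python
# Machine data from the worked example: age in months (X) and maintenance cost (Y).
age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]


def least_squares_line(xs, ys):
    """Return (a, b) for the regression line Y = a + bX."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = least_squares_line(age, cost)


def forecast(x):
    """Predicted cost for a machine that is x months old."""
    return a + b * x


print(f"Y = {a:.1f} + {b:.2f}X")  # Y = 212.9 + 2.80X
print(round(forecast(23), 1))     # 277.4

# Two points near the ends of the data, for drawing the line on the scatter graph.
line_points = [(x, forecast(x)) for x in (5, 50)]
```

Because the script keeps b unrounded (about 2.803 rather than 2.8), its intercept is about 212.9 rather than 213. The forecast for 23 months still rounds to 277.4, and the two plotting points land within 0.1 of (5, 227) and (50, 353).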