Correlation and Regression
CORRELATION AND REGRESSION
CORRELATION
The correlation coefficient is a number between -1 and +1 that measures the strength of
the relationship between two sets of data. The closer the correlation coefficient is to +1 or
-1, the stronger the relationship and the easier it is to predict one item by using the other.
For example, there is a strong relationship between the amount of daily sunshine and the sales
of ice-cream, so the correlation coefficient is close to +1.
Positive and Negative correlation coefficient
If the two sets of data are related in such a way that as one increases so does the other,
there is a positive correlation between them and the correlation coefficient will be positive.
The daily sunshine and sales of ice-cream have a positive correlation. If they are related so
that as one increases the other decreases, then they have a negative correlation and the
coefficient will be negative. For example, the amount of daily sunshine and the sales of
rainwear would have a negative correlation.
Strong positive correlation: r = 0.8
Strong negative correlation: r = -0.8
No correlation: r = 0
Perfect positive correlation: r = +1
Perfect negative correlation: r = -1
Perfect correlation is usually found only in science. In most other situations the correlation
coefficient is a decimal.
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$
Walter Fleming
Page 1 of 6
RANK CORRELATION
This is another method of finding the correlation coefficient.
With this method both sets of data must be ranked, i.e. numbered in either ascending or
descending order. Both sets must be ranked in the same way, i.e. either both ascending or
both descending. Then the difference between the ranks of each pair is found by subtracting
one from the other. This gives us d for the formula, where n is the number of pairs.
$$ r = 1 - \frac{6\sum d^2}{n\left(n^2 - 1\right)} $$
Example
A group of 7 candidates are ranked in a written exam and in a practical exam. The following
table gives the results:
Candidate             A   B   C   D   E   F   G
Place in Written      3   5   1   4   2   7   6
Place in Practical    4   5   3   2   1   6   7
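Candidate   R1   R2    d   d²
A            3    4    1    1
B            5    5    0    0
C            1    3    2    4
D            4    2    2    4
E            2    1    1    1
F            7    6    1    1
G            6    7    1    1
Total (Σd²)                12
Here d is the size of the difference between the two ranks. With Σd² = 12 and n = 7:
$$ r = 1 - \frac{6\sum d^2}{n\left(n^2 - 1\right)} = 1 - \frac{6 \times 12}{7\left(7^2 - 1\right)} = 1 - \frac{72}{336} = 1 - 0.21 = 0.79 $$
This indicates that there is a fairly strong positive correlation between the written exam
results and the practical results.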
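The ranking calculation can be checked with a few lines of Python. This is only a sketch: the function name `rank_correlation` is made up for illustration, and it assumes there are no tied ranks, which is the case the formula above is meant for.

```python
# Ranks for candidates A to G, taken from the example table.
written = [3, 5, 1, 4, 2, 7, 6]
practical = [4, 5, 3, 2, 1, 6, 7]


def rank_correlation(r1, r2):
    """Rank correlation r = 1 - 6*sum(d^2) / (n*(n^2 - 1)).

    Assumes both lists are rankings of the same items with no ties.
    """
    if len(r1) != len(r2):
        raise ValueError("both rankings must have the same length")
    n = len(r1)
    if n < 2:
        raise ValueError("at least two ranked items are needed")
    # d is the difference between the two ranks of each candidate.
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


sum_d2 = sum((a - b) ** 2 for a, b in zip(written, practical))
r = rank_correlation(written, practical)
print(f"sum of d^2 = {sum_d2}, r = {r:.2f}")
```

Squaring d removes its sign, so it does not matter which rank is subtracted from which, as long as the same order is used for every candidate.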
Regression
Regression analysis examines what the relationship between two sets of data is. It relates one
set to the other by means of an equation, the regression equation. This is the equation of a
straight line. Regression assumes that the two sets of data lie approximately along a straight
line, known as the line of best fit or the regression line, so the stronger the correlation,
the more reliable this equation is for forecasting.
The equation is
Y = a + bX
where
$$ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} $$
and
$$ a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} $$
To forecast, substitute the given X value into the equation.
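As a rough illustration of the formulas for a and b, the sketch below computes both from the column sums and then makes a forecast. The function name `regression_line` and the four data points are hypothetical; the points were chosen to lie exactly on Y = 1 + 2X so that the result is easy to check by eye.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all X values are identical, so no line can be fitted")
    b = (n * sum_xy - sum_x * sum_y) / denominator
    a = sum_y / n - b * sum_x / n
    return a, b


# Hypothetical data lying exactly on Y = 1 + 2X (not from the handout).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = regression_line(xs, ys)

# Forecast by substituting an X value into Y = a + bX.
forecast = a + b * 5
print(f"Y = {a:.1f} + {b:.1f}X, forecast at X = 5: {forecast:.1f}")
```

With real data the points will not sit exactly on the line, so the forecast is only as reliable as the correlation is strong.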
The Coefficient of Determination
This is found using the formula (r² × 100)%. It gives the percentage of the variation in the
dependent variable that is explained by the variation in the independent variable.
Example:
The number of daily hours of sunshine and the sales of ice-cream for a particular week are
given in the following table:
Hours of sunshine            4    5    3    6    5    9    9
Ice-cream sales (000 kg)    10   11   10   12   10   15   16
Find the regression equation and use it to forecast the expected sales on a day on which 7
hours of sunshine are expected.
Procedure for calculating the Product moment Correlation Coefficient.
1. Arrange the two sets of data in 2 columns: Col. 1 is X, Col. 2 is Y.
If the data are labelled X and Y, follow that. If they are not, X is the data over which
you have control (i.e. the independent variable). For example, price is X since you can
directly control it, and Sales is Y. If in doubt, take the top line as X and the second
line as Y.
2. Form three more columns: Col. 3 for XY, the product of each pair, Col. 4 for X squared,
and Col. 5 for Y squared.
3. Calculate the entries in each column.
4. Add up each column. This gives the Σs in the formula. n is the number of rows of data.
5. Insert the results into the formula and calculate. Be careful with the calculations; don't
try to do too much at one time.
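The five steps can be followed almost line by line in code. The sketch below is one possible way to do it in Python; the function name `pearson_r` and the price and sales figures are invented purely for illustration and do not come from the handout.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product moment correlation coefficient, following the column procedure."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 rows")
    # Steps 2 and 3: form and fill the XY, X squared and Y squared columns.
    xy = [x * y for x, y in zip(xs, ys)]
    x2 = [x * x for x in xs]
    y2 = [y * y for y in ys]
    # Step 4: add up each column; n is the number of rows.
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy, sum_x2, sum_y2 = sum(xy), sum(x2), sum(y2)
    # Step 5: insert the sums into the formula, one part at a time.
    numerator = n * sum_xy - sum_x * sum_y
    under_root = (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    if under_root == 0:
        raise ValueError("one column is constant, so r is undefined")
    return numerator / sqrt(under_root)


# Step 1: X is the variable under our control (price), Y is the outcome (sales).
# These numbers are hypothetical.
price = [1, 2, 3, 4, 5]
sales = [10, 8, 7, 5, 4]
r = pearson_r(price, sales)
print(f"r = {r:.2f}")
```

Because sales fall as price rises in this made-up data, the coefficient comes out negative.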
Example
To find Pearson’s product moment correlation coefficient for the following data showing the
cost of maintaining 10 machines of different ages (in months):
Machine           1     2     3     4     5     6     7     8     9    10
Age (months)      5    10    15    20    30    30    30    50    50    60
Cost            190   240   250   300   310   335   300   300   350   395
Put the data in vertical columns, identifying them as X and Y. Then find the values of the
columns XY, X² and Y².
            X       Y       XY      X²       Y²
            5     190      950      25    36100
           10     240     2400     100    57600
           15     250     3750     225    62500
           20     300     6000     400    90000
           30     310     9300     900    96100
           30     335    10050     900   112225
           30     300     9000     900    90000
           50     300    15000    2500    90000
           50     350    17500    2500   122500
           60     395    23700    3600   156025
Total     300    2970    97650   12050   913050
           ΣX      ΣY      ΣXY     ΣX²      ΣY²
Put these values into the formula:
$$ r = \frac{10 \times 97650 - 300 \times 2970}{\sqrt{\left[10 \times 12050 - 300^2\right]\left[10 \times 913050 - 2970^2\right]}} = \frac{85500}{\sqrt{[30500][309600]}} = \frac{85500}{\sqrt{9442800000}} = 0.88 $$
A coefficient of 0.88 tells us that the link between age and cost is strong. Also, it is a
positive correlation, which indicates that as age increases, cost tends to increase as well.
The coefficient of determination = r² × 100 = 0.88² × 100 = 77.44%
This indicates that 77.44% of the differences that occur in the cost are associated with
differences in age. The remaining 22.56% of the differences are due to other factors.
A rough guide to interpreting the correlation coefficient strength:
1.0 – 0.9     Very strong
0.9 – 0.8     Strong
0.8 – 0.7     Fairly strong
under 0.7     Weak to none
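The rough guide and the coefficient of determination can be wrapped in two small helper functions. This is a sketch only: the function names are invented, the guide is applied to the size of r so that negative coefficients are graded the same way, and a value that falls exactly on a boundary (0.9, 0.8 or 0.7) is placed in the stronger band. The last two choices are assumptions, since the guide itself does not settle them.

```python
def describe_strength(r):
    """Label a correlation coefficient using the rough guide."""
    size = abs(r)  # assumption: the sign gives direction, the size gives strength
    if size > 1:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    if size >= 0.9:
        return "very strong"
    if size >= 0.8:
        return "strong"
    if size >= 0.7:
        return "fairly strong"
    return "weak to none"


def coefficient_of_determination(r):
    """Return r squared as a percentage."""
    return r ** 2 * 100


# The machine example: r = 0.88.
r = 0.88
label = describe_strength(r)
percent = coefficient_of_determination(r)
print(f"r = {r}: {label}, coefficient of determination = {percent:.2f}%")
```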
To find the least squares equation of the regression line (line of best fit)
The equation is of the form
Y = a + bX
Using the formulae for a and b, we get
$$ b = \frac{10 \times 97650 - 300 \times 2970}{10 \times 12050 - 300^2} = \frac{85500}{30500} = 2.8 $$
$$ a = \frac{2970}{10} - \frac{2.8 \times 300}{10} = 297 - 84 = 213 $$
So the regression equation is Y = 213 + 2.8X
This can be used for forecasting. To forecast the cost of maintaining a machine that is 23
months old, substitute in 23 for X and find Y, the cost:
Y = 213 + 2.8 × 23 = 277.4
To draw the regression line on the scatter graph:
Find two points on the line by putting in two values for X and finding the Y. It is easier to
draw if you take values for X that are near the lower end and near the higher end of the data.
In this example you could take
X = 5, which gives a value of Y = 227
And X = 50, which gives a value of Y = 353
Now plot the two points (5,227) and (50,353) on the scatter diagram then join them with a
straight line. This is the regression line for the data.
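The forecast and the two plotting points can be reproduced with a short Python sketch. It uses the rounded equation Y = 213 + 2.8X from the example; the names `A`, `B` and `forecast_cost` are chosen here for illustration only.

```python
# Rounded regression equation from the example: Y = 213 + 2.8X.
A = 213   # intercept a
B = 2.8   # slope b, rounded to one decimal place as in the example


def forecast_cost(age_months):
    """Forecast the maintenance cost for a machine of the given age."""
    return A + B * age_months


# Forecast for a machine that is 23 months old.
cost_23 = forecast_cost(23)
print(f"forecast cost at 23 months: {cost_23:.1f}")

# Two points near the ends of the data, used to draw the regression line.
points = [(x, round(forecast_cost(x), 1)) for x in (5, 50)]
print("plot and join:", points)
```

Picking X values near both ends of the data keeps the two points far apart, which makes the line easier to draw accurately.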