Download The correlation coefficient - Ing-Stat

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Principal component analysis wikipedia , lookup

Coefficient of determination wikipedia , lookup

Transcript
Something about the correlation coefficient
The main expression for variation of a random variable X is the variance, often designated 2 or
V(X). The variance of X is defined as ”the expected squared deviation from the mean”:
V ( X )  E( X   )2
If we have a linear combination of two uncorrelated variables, say X and Y, their variances add:
V ( X  Y )  V ( X )  V (Y )
V ( X  Y )  V ( X )  V (Y )
or
However, if the variables are correlated we have to add a term to the expressions above:
V ( X  Y )  V ( X )  V (Y )  2  Cov( X , Y )
or
V ( X  Y )  V ( X )  V (Y )  2  Cov( X , Y )
The extra term is called the covariance between the two variables and is defined in the following
way:
Cov ( X , Y )  E  X   X   Y   Y   E ( XY )  E ( X )  E (Y )
It is clear that the covariance increases the total variation if the linear combination is the sum of
the two variables and decreases the total variation if the linear combination is the difference
between the two variables. We investigate this using a simple simulation.
Simulation. The following lines simulate the result above. NB that the formulas above are valid
for any random variable but for simplicity we use the normal distribution for the simulation:
Random 1000 c1;
Normal 100 3.
# Stores 1000 values in column c1
# from a distribution N(100, 3).
Random 1000 c2;
Normal 100 4.
# Stores 1000 values in column c2
# from a distribution N(100, 4).
let c3 = c1 + c2
describe c3
# Adds column c1 and c2. Result in c3.
# Describes c3.
The printout shows that the variance is approximately 25 (the standard deviation squared) which is
according to the formula above.
If we simulate the difference between the two variables we get the following:
Random 1000 c1;
Normal 100 3.
# Stores 1000 values in column c1
# from a distribution N(100, 3).
Random 1000 c2;
Normal 100 4.
# Stores 1000 values in column c2
# from a distribution N(100, 4).
let c3 = c1 - c2
describe c3
# Takes the difference between c1 and c2.
# Describes c3.
Again the print out shows that the variance is approximately 25 (the standard deviation squared)
according to the formula above.
© Ing-Stat – statistics for the industry www.ing-stat.nu
Rev C . 2009-12-03 . 1(5)
If we sort the two columns from we get the following result:
sort c1 c3
sort c2 c4
let c5 = c3 + c4
# Sorts column c1, the result in c3.
# Sorts column c2, the result in c4.
# Sums c3 and c4.
If we sort the second column (c2) in decreasing (descending) order, we get the following result:
sort c2 c6;
descending c2.
let c7 = c3 + c6
describe c5 c7
# Sorts c2 into c6, descending order.
# Sums c3 and c6. Sorted in different order.
# Describes c5 and c7.
The print-out shows that the variance of c5 and c7 are completely different. C5 contains the result
where we have rowwise added small values to small values, intermediate values to intermediate
values and large values to large values. This of course creates a large variation.
C7 contains the result where we rowwise have added small values to large values, intermediate
values to intermediate values and large values to small values. This is done by sorting the data in
opposite directions and this of course creates a smaller variation.
The correlation coefficient. The (numerical) size of the covariance between two variables de-
pends on the data. If we e.g. express the data in millimeters we will get one value of the covariance
but if we express the data in meters we will get another value. Thus it is difficult to use the covariance to understand the degree of relationship between two variables.
In order to avoid this difficulty the correlation coefficient has been defined. This coefficient will
obtain values in the interval –1 to +1. (Use the macro %CorrTest in order to get more details.) The
correlation coefficient (XY) is defined as follows:
 XY 
Cov( X , Y )
 X  Y

Cov ( X , Y )   XY   X   Y
We see that the formulas for the variance of a linear combination also can be written as follows:
V ( X  Y )  V ( X )  V (Y )  2   XY   X   Y
To simulate a correlation coefficient. In any book in regression analysis we can find the
following expression where Y is the variable that is modelled and R is the residual i.e. that part of
the variation that is not explained by the model. See the literature for details:
2
 XY

V (Y )  V ( R )
V ( R ) (NB that a smaller V(R) gives a
2
1
which can be expressed as  XY
higher coefficient which means that
V (Y )
V (Y ) data is closer to the model)
This shows that the correlation coefficient is bounded by the interval [–1, 1].
Suppose that we want to simulate data giving a certain correlation coefficient between the two
variables Y and X. Set Y  X  Z i.e. V (Y )  V ( X )  V ( Z ) :
2
 XY

V (Y )  V ( Z ) V ( X )  V ( Z )  V ( Z )
V (X )


V (Y )
V ( X )  V (Z )
V ( X )  V (Z )
After a little more mathematics we get the following relationship between the variance of Z and X:
 1

V ( Z )   2  1  V ( X )
  XY

© Ing-Stat – statistics for the industry www.ing-stat.nu
Rev C . 2009-12-03 . 2(5)
After simulating X and Z and creating the sum Y, we calculate the correlation coefficient which
will become close to the derivation above.
let k1 = 0.70
let k2 = 25
let k3 = (1/(k1**2) - 1)*k2
# Wanted corr coefficient.
# Variance of X.
# Variance of Z.
let k4 = sqrt(k2)
let k5 = sqrt(k3)
let k6 = 1000
# Standard deviation of X.
# Standard deviation of Z.
# Number of X- and Z-values.
random
normal
random
normal
# Simulates 1000 X-values in column c1.
k6 c1;
100 k4.
k6 c2;
100 k4.
# Simulates 1000 Z-values in column c2.
let c3 = c1 + c2
# Creates the Y-results.
corr c3 c1
# The corr coeff between Y and X.
Four simulations gave the following values of the correlation coefficient: 0.705, 0.714, 0.688,
0.699.
Tips. Run the %CorrTest macro on the two variables c3 and c1. If the number of observations is
smaller (or much smaller), it will be more difficult for the macro to find the correlation.
Decrease the number of values (k6) and see if the macro still finds the correlation.
Copy the simulation lines above into the ’Command Line Editor’ of Minitab for easier executing.
Below is a copy of a result of %CorrTest:
Testing a calculated correlation coefficient
Example 1: r = 1.00
Example 2: r = -1.00
Example 3: r = 0.76
Example 4: r = -0.60
Example 5: r = -0.19
120
-0.082
0.082
110
Average
Stand dev
100
0. 0
0.2
0.4
0.6
0.8
If the calculated "r" (green triangle) is
outside the dashed red lines, then it
C1
99.750
C3
199.704
5.057
7.124
Max
115.538
221.150
Min
82.434
174.765
n
1000
Corr coeff
0.716
90
is considered significant (1% risk).
80
170
180
190
200
210
220
The correlation coefficient expresses the degree of linear relationship between two variables. However, a significant result
does not necessarily mean that one variable causes the other. Maybe both variables depend on some third variable.
(c) Ing-Stat, www.ing-stat.nu (Corrtest.mac).
See also "A collection of..." and %BivNorm, %Npdfcdf, %NormArea, %Nhist, %MultDist
© Ing-Stat – statistics for the industry www.ing-stat.nu
2005-03-28 14:12:37
Rev C . 2009-12-03 . 3(5)
A slightly larger model. Suppose that we study times in a process with the following four steps:
A
B
C
D
Let us designate the total time T. Thus we have the following model:
T  A BC  D
This gives us the following general expression for the variance of T:
V (T )  V ( A)  V ( B)  V (C )  V ( D)  2  Cov( A, B)  2  Cov( A, C )  2  Cov( A, D)
2  Cov( B, C )  2  Cov( B, D)
2  Cov(C , D)
Let us suppose that the variance for A, B, C and D is 25, 36, 16 and 49 respectively. Let us also
pretend that there is a positive correlation between A and C, 0.6, and a negative correlation
between B and D, –0.5. We rewrite the expression above with this information:
V (T )  V ( A)  V ( B)  V (C )  V ( D)  2  Cov( A, C )  2  Cov( B, D)
 V ( A)  V ( B)  V (C )  V ( D)  2   AC   A   C  2   BD   B   D
 25  36  16  49  2  0.6  5  4  2  0.5  6  7  108
The total variance of T is thus 108 and below we will simulate this situation.
Simulation. In reality the process presents us its result with or without correlation. However, in
order to illustrate this, we need to create the data by using some theory from the regression analysis. We make it somewhat easier by giving all four columns the same expected value, say, 100.
erase c1-c100
# Erases the first 100 columns.
let
let
let
let
let
let
let
let
#
#
#
#
#
#
#
#
k1
k2
k3
k4
k5
k6
k7
k8
=
=
=
=
=
=
=
=
2000
100
5
6
4
7
0.6
-0.5
Number of points to be created.
‘Mu’.
Stand ‘A’.
Stand ‘B’.
Stand ‘C’.
Stand ‘D’.
r(A, C)
r(B, D)
name c1 'A' c2 'B' c3 'C' c4 'D' c5 'T'
Random k1 c1;
Normal k2 k3.
# Stores k1 values in column c1 (A)
# from a distribution N(100, 5).
Random k1 c2;
Normal k2 k4.
# Stores k1 values in column c2 (B)
# from a distribution N(100, 6).
let k11 = sqrt(k5**2 * (1 - k7**2)) # Variance for Z (part of ‘C’).
random k1 c3;
norm 0 k11.
# The Z-variable.
let c3 = k2 + (k5/k3*k7)*(c1 - k2) + c3 # Creating the ‘C’-variable.
let k11 = sqrt(k6**2 * (1 - k8**2)) # Variance for Z (part of ‘D’).
random k1 c4;
norm 0 k11.
# The Z-variable.
let c4 = k2 + (k6/k4*k8)*(c2 - k2) + c4 # Creating the ‘D’-variable.
© Ing-Stat – statistics for the industry www.ing-stat.nu
Rev C . 2009-12-03 . 4(5)
# Investigating the result:
# ------------------------corr c1-c4
# Calculates the correlation between
# columns c1-c4.
layout;
text 0.05 0.30 "* There is a positive correlation between 'A' and 'C'.";
tsize 0.7;
tcolor 4;
text 0.05 0.27 "* There is a negative correlation between 'B' and 'D'.";
tsize 0.7;
tcolor 4;
text 0.05 0.24 "* There is no correlation otherwise.";
tsize 0.7;
tcolor 4.
matrix c1-c4;
# Matrix plot of c1-c4
ur;
graph;
etype 1;
esize 1;
type 1;
color 46;
title 'Time from all subactivities plotted in a matrix plot';
tsize 1.1;
data 0.00 0.85 0.00 0.80;
symb;
size 0.45;
color 4.
endlayout
let c5 = c1 + c2 + c3 + c4
# Gives T.
descr c1-c5
cova c1-c5
corr c1-c5
#
#
#
#
Describes the five columns.
Covariance matrix.
Gives the correlation structure.
See HELP COVA, HELP CORR
Another example (I). Suppose that we make printed circuit boards from large sheets of copper
plated glass fibre. There is hardly any thickness variation within the large sheet although there is
subtantial thickness variation between sheets.
In order to make a multi layer board we cut a sheet into panels and join the panels in a press. In
doing so we get a very high correlation between the thickness of the different layers as the covariance increases the thickness variation between the finished boards. One way of counter act this
phenomena is to mix the panels from different sheets before making the layers. This will give a
smaller variation.
Another example (II). Suppose that we position components on a electronic circuit. The com-
ponents are rather small and are before mounting still in the original frame from the production of
the components. This means that many of the variables in such a frame are highly correlated and
on a single circuit there are many components placed that tend to be high. On another circuit there
might be many components placed that tend to be low. In this way the covariance increase the total
variance. ■
© Ing-Stat – statistics for the industry www.ing-stat.nu
Rev C . 2009-12-03 . 5(5)