Download Covariance - KSU Faculty Member websites

Covariance
Prototypes of Bivariate Data Matrices
Continuous-Continuous Data Matrices
A bivariate data matrix with a continuous-continuous arrangement of data vectors
can be obtained, e.g., by asking a group of five subjects whether they like poetry
(variable X) and whether they like Gothic novels (variable Y), using a five-point rating
scale. We might be interested in whether the answers to these two test items are related.
The answer to this question can be provided by the coefficient of correlation.
Binary-Continuous Data Matrices
A bivariate data matrix with a binary-continuous arrangement of data vectors
can be obtained, e.g., when one is interested in whether there is a relationship between
the gender of the subjects (coded as 0 or 1) and the enjoyment of poetry. The answer to
this question can be provided by the point biserial coefficient of correlation.
Continuous-Binary Data Matrices
A bivariate data matrix with a continuous-binary arrangement of data vectors
is typical of discriminant analysis, which is used when one is interested in whether some
issue divides subjects into different groups.
Note that the first variable is the responses to a question (continuous data) and the second variable indicates
group membership (e.g., gender) coded as 0 and 1.
Binary-Binary Data Matrices
A bivariate binary data matrix
could have been obtained, e.g., from two groups of subjects answering a question
whether they liked poetry, or not, with alternatives provided in the 'yes - no' item
response format. A correlation method of choice for answering problems formulated in
this fashion is the phi coefficient of correlation.
Note that the first variable indicates group membership (e.g., gender) coded as 0 and 1. The second variable
indicates 'yes-no' response coded as 0 and 1.
Variance and Covariance
As an introduction to correlational issues, let us begin with the notion of covariation.
Within elementary statistics, the term covariance is typically encountered in
connection with the formulae pertaining to the variances of sums and differences.
The notion of covariation is an extension of the concept of variation to the case of two
variables. To introduce the analytical rendering of the covariance coefficient, imagine that at
one meeting of the Midwest Club of Poets a question arose whether people who like to
read Gothic novels also like to read poetry. To resolve this, the club leader administered a
short questionnaire to the members, asking them to rate their liking of Gothic novels
and poetry.
I LIKE TO READ GOTHIC NOVELS
I like poetry
Responses from the subjects were recorded into the data matrix shown below,
together with its scatterplot.

Subject    X    Y
   1       3    4
   2       1    4
   3       3    8
   4       9    8

[Figure: scatterplot of Y against X]
The question to be answered is whether liking Gothic novels and liking poetry are
related. Will the members who like to read Gothic novels also like to read poetry?
Covariance Defined
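To quantify the degree to which the above variables, X and Y, vary together (covary),
consider that for a single variable, variance is computed as
$$s_X^2 = \frac{\sum x^2}{n},$$
where x = X - M_X is the deviation of a score from its mean. This can also be imagined as
$$s_X^2 = \frac{\sum x \cdot x}{n}.$$
Substitution of a deviation score y for the second deviation score x results in a formula
that defines covariance as
$$s_{XY} = \frac{\sum xy}{n}.$$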
The computation of the coefficient of covariance for the example is shown in the
following table. The question to be answered is whether liking Gothic novels
and liking poetry are related.

Subject    X    Y     x     y    xy
   1       3    4    -1    -2     2
   2       1    4    -3    -2     6
   3       3    8    -1     2    -2
   4       9    8     5     2    10
Sum       16   24     0     0    16
Mean       4    6     0     0     4
To compute covariance, means are computed for both variables. Next, both variables are
transformed into deviation scores by subtracting their mean from scores in their
respective distributions. The product of the deviation scores (xy) is then computed for
each subject. These products are summed and averaged, resulting in the coefficient of
covariance, for the example equal to 4.00.
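The steps just described can be traced in a few lines of Python. This is only a minimal sketch: it assumes the scores X = [3 1 3 9] and Y = [4 4 8 8] that the text works with, and the variable names are chosen here for illustration.

```python
# Scores for the four subjects in the worked example.
X = [3, 1, 3, 9]
Y = [4, 4, 8, 8]
n = len(X)

# Step 1: means of both variables.
mean_x = sum(X) / n  # 4.0
mean_y = sum(Y) / n  # 6.0

# Step 2: deviation scores (each score minus the mean of its own distribution).
x = [xi - mean_x for xi in X]  # [-1.0, -3.0, -1.0, 5.0]
y = [yi - mean_y for yi in Y]  # [-2.0, -2.0, 2.0, 2.0]

# Step 3: product of the deviation scores for each subject.
xy = [a * b for a, b in zip(x, y)]  # [2.0, 6.0, -2.0, 10.0]

# Step 4: sum the products and average them (division by n, as in the text).
cov_xy = sum(xy) / n
print(cov_xy)  # 4.0
```

Note that the sum is divided by n, not n - 1, to match the definition used in this chapter.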
Covariance in Obtained Scores
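Covariance can also be computed directly from the obtained scores. By substituting into
the formula
$$s_{XY} = \frac{\sum xy}{n}$$
the formulae for transformation of X and Y into deviation scores, i.e.,
$$x = X - M_X$$
and
$$y = Y - M_Y,$$
the covariance can be written as
$$s_{XY} = \frac{\sum (X - M_X)(Y - M_Y)}{n}.$$
Expanding the above expression gives
$$s_{XY} = \frac{\sum XY - M_Y \sum X - M_X \sum Y + \sum M_X M_Y}{n}.$$
Replace the summation notation for the means with M as shown below:
$$\frac{\sum X}{n} = M_X$$
and
$$\frac{\sum Y}{n} = M_Y.$$
Recall that the sum of a constant equals n times the constant. Thus,
$$s_{XY} = \frac{\sum XY}{n} - M_X M_Y - M_X M_Y + \frac{n M_X M_Y}{n}.$$
The above expression can be simplified as
$$s_{XY} = \frac{\sum XY}{n} - 2 M_X M_Y + M_X M_Y.$$
Thus, the formula for computing the coefficient of covariance directly from the obtained
scores can be written as
$$s_{XY} = \frac{\sum XY}{n} - M_X M_Y$$
or, concisely, as
$$s_{XY} = M_{XY} - M_X M_Y.$$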
As an illustration, the computational algorithm is shown as

Subject    X    Y    XY
   1       3    4    12
   2       1    4     4
   3       3    8    24
   4       9    8    72
Sum       16   24   112
Mean       4    6    28

The sum of the products of variables X and Y equals 112 and its mean (M_XY) equals 28.
Subtracting the product of the separate means (28 - (4)(6)) yields the coefficient of
covariance equal to 4.00. This value is equal to the value of the covariance coefficient
computed previously from the formula expressed in deviation scores.
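The agreement between the two formulas can be checked numerically. The following is a small sketch under the same assumptions as the worked example (scores X = [3 1 3 9] and Y = [4 4 8 8]); the variable names are illustrative only.

```python
X = [3, 1, 3, 9]
Y = [4, 4, 8, 8]
n = len(X)

mean_x = sum(X) / n  # 4.0
mean_y = sum(Y) / n  # 6.0

# Obtained-score formula: mean of the products minus the product of the means.
sum_xy = sum(a * b for a, b in zip(X, Y))  # 112
mean_xy = sum_xy / n                       # 28.0
cov_obtained = mean_xy - mean_x * mean_y   # 28 - (4)(6) = 4.0

# Deviation-score formula, for comparison.
cov_deviation = sum((a - mean_x) * (b - mean_y) for a, b in zip(X, Y)) / n

print(cov_obtained, cov_deviation)  # 4.0 4.0
```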
Lack of Upper and Lower Limits
The coefficient of covariance has no upper or lower limits. As will be seen later, this
indeterminacy is its main disadvantage as compared with the coefficient of correlation.
Change in the Units of Measurement
The lack of upper and lower limits is due to the sensitivity of covariance to the units of
measurement, which, on the other hand, can be an advantage in some instances.
To illustrate this sensitivity, imagine that the above example does not pertain to responses
to two rating scales, but (rather unrealistically) represents the measurements of the length
of each subject's toes in centimeters (X = [3 1 3 9]) and fingers in millimeters (Y
= [40 40 80 80]). There are 10 millimeters to a centimeter, so the values of
variable Y are multiplied by 10. The product of variables X and Y is computed for this new example as
XY = [120 40 240 720] and the coefficient of covariance is computed as 1120/4 - (4)(60),
which equals 40.
Consider that we did not change the measurements themselves, but only the units of the
second measurement. The coefficient of covariance reflects this change in the unit of
measurement. After reading the next chapter, try to re-compute the relationship between
the above measurements by using the coefficient of correlation. If you do this you will
observe that the coefficient of correlation remains invariant with respect to change of the
measurement unit.
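The effect of the change of unit on covariance can be reproduced with a short sketch. The helper `covariance` and the variable names below are assumptions introduced for this illustration; the numbers are those of the toes-and-fingers example.

```python
def covariance(a, b):
    """Covariance as the mean of the products minus the product of the means (division by n)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    mean_ab = sum(ai * bi for ai, bi in zip(a, b)) / n
    return mean_ab - mean_a * mean_b

toes_cm = [3, 1, 3, 9]
fingers_cm = [4, 4, 8, 8]
fingers_mm = [10 * v for v in fingers_cm]  # same lengths, expressed in millimeters

cov_cm = covariance(toes_cm, fingers_cm)  # 4.0
cov_mm = covariance(toes_cm, fingers_mm)  # 40.0: ten times larger

print(cov_cm, cov_mm)
```

Only the unit of the second variable changed, yet the covariance grew by the same factor of 10.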
Variance-Covariance Matrix
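Variance and covariance are often reported jointly as variance-covariance matrices. The
variance-covariance matrix has variances in its diagonal and covariances in its off-diagonal elements. Thus the variance-covariance matrix V can be defined as
$$V = \begin{bmatrix} s_X^2 & s_{XY} \\ s_{XY} & s_Y^2 \end{bmatrix}.$$
For our example, with variances of 9 (X) and 4 (Y) and a covariance of 4,
the variance-covariance matrix can be constructed as
$$V = \begin{bmatrix} 9 & 4 \\ 4 & 4 \end{bmatrix}.$$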
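One way to assemble such a matrix programmatically is sketched below. It relies on the fact that the covariance of a variable with itself is its variance; the helper `covariance` and the dictionary layout are assumptions made for this illustration, and the scores are those of the running example.

```python
def covariance(a, b):
    """Mean product of deviation scores (division by n, as in the text)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / n

variables = {"X": [3, 1, 3, 9], "Y": [4, 4, 8, 8]}
names = list(variables)

# Diagonal cells hold variances (covariance of a variable with itself),
# off-diagonal cells hold covariances, so the matrix is symmetric.
V = [[covariance(variables[r], variables[c]) for c in names] for r in names]

for row in V:
    print(row)
# [9.0, 4.0]
# [4.0, 4.0]
```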
Variance of a Sum and a Difference
The most common operation on data matrices is the summation of two or more variables.
An example is summing the responses for each question on a test to obtain the total test
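score. To compute the variance of a sum of variables X + Y,
$$s_{X+Y}^2 = \frac{\sum (x + y)^2}{n},$$
the above binomial is expanded, as
$$s_{X+Y}^2 = \frac{\sum (x^2 + 2xy + y^2)}{n} = \frac{\sum x^2}{n} + \frac{\sum y^2}{n} + 2\,\frac{\sum xy}{n}.$$
The variance of a sum of two variables is defined as the sum of the variances of the
variables being summed, plus two times their covariance:
$$s_{X+Y}^2 = s_X^2 + s_Y^2 + 2 s_{XY}.$$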
Consider the example presented below. Using the above formula for the variance of the
sum and recalling that the covariance between X and Y for the current example equals
4.00, the variance of the sum of the X and Y variables is computed as 9 + 4 + 2(4) = 21.
This result can be computationally verified by subtracting the mean (10) from the
variable X + Y to form a vector of deviation scores x + y = [-3 -5 +1 +7]. Summing the
squared values of this vector of deviation scores and dividing this sum by n (the number
of cases), 84/4, confirms that the variance of the X + Y variable indeed equals 21.
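The same verification can be sketched in Python, computing the variance of X + Y both directly and from the formula. The helpers `variance` and `covariance` are written here for illustration and divide by n, as the text does.

```python
def variance(v):
    """Mean squared deviation from the mean (division by n)."""
    n = len(v)
    m = sum(v) / n
    return sum((vi - m) ** 2 for vi in v) / n

def covariance(a, b):
    """Mean product of deviation scores (division by n)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n

X = [3, 1, 3, 9]
Y = [4, 4, 8, 8]

total = [a + b for a, b in zip(X, Y)]  # [7, 5, 11, 17], mean 10

var_direct = variance(total)                                      # 84 / 4 = 21.0
var_formula = variance(X) + variance(Y) + 2 * covariance(X, Y)    # 9 + 4 + 2(4) = 21.0

print(var_direct, var_formula)  # 21.0 21.0
```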
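To compute the variance of the difference of variables X and Y,
$$s_{X-Y}^2 = \frac{\sum (x - y)^2}{n},$$
the above binomial can be expanded to the expression
$$s_{X-Y}^2 = \frac{\sum (x^2 - 2xy + y^2)}{n} = \frac{\sum x^2}{n} + \frac{\sum y^2}{n} - 2\,\frac{\sum xy}{n}.$$
The variance of a difference between two variables is defined as the sum of the variances
of their constituent variables minus twice their covariance, i.e.,
$$s_{X-Y}^2 = s_X^2 + s_Y^2 - 2 s_{XY}.$$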
Consider another example that is presented in Table 9.5. Using the above formula for the
variance of a difference and recalling that the covariance between X and Y for the current
example equals 4, the variance of the difference of the X and Y variables is computed as
9 + 4 - 2(4) = 5.
This result can be computationally verified by subtracting the mean (-2) from the variable
X - Y to form a vector of deviation scores x - y = [1 -1 -3 +3]. Summing the squared
values of this vector of deviation scores and dividing this sum by n, 20/4, confirms
that the variance of the X - Y variable indeed equals 5.
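The check for the difference follows the same pattern. Again this is only a sketch: the helper functions are illustrative and divide by n, and the scores are those of the running example.

```python
def variance(v):
    """Mean squared deviation from the mean (division by n)."""
    n = len(v)
    m = sum(v) / n
    return sum((vi - m) ** 2 for vi in v) / n

def covariance(a, b):
    """Mean product of deviation scores (division by n)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n

X = [3, 1, 3, 9]
Y = [4, 4, 8, 8]

diff = [a - b for a, b in zip(X, Y)]  # [-1, -3, -5, 1], mean -2

var_direct = variance(diff)                                       # 20 / 4 = 5.0
var_formula = variance(X) + variance(Y) - 2 * covariance(X, Y)    # 9 + 4 - 2(4) = 5.0

print(var_direct, var_formula)  # 5.0 5.0
```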
Variance of Weighted Variables
Add/Subtract
A related topic is what happens to the variance of a variable if we add a constant to all of its
values.
While the mean increases by the value of the added constant, the variance remains
unchanged. The same is true if a constant is subtracted from a variable:
the mean decreases, but the variance remains constant.
Multiply/Divide
What happens to the variance of a variable if we multiply all of its values by a constant?
While the mean is multiplied by the constant, the variance is multiplied by the square of
the constant.
The following table illustrates what happens if a variable is divided by a constant.
The mean is divided by the constant, and the variance is divided by the square of the
constant.
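The effects of adding a constant and of multiplying or dividing by a constant can be illustrated numerically. In the sketch below the constant c = 2 and the reuse of the scores X = [3 1 3 9] are assumptions chosen for illustration; they are not taken from the original tables.

```python
def mean(v):
    return sum(v) / len(v)

def variance(v):
    """Mean squared deviation from the mean (division by n)."""
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

X = [3, 1, 3, 9]  # mean 4, variance 9
c = 2             # hypothetical constant used only for this illustration

shifted = [xi + c for xi in X]   # add a constant
scaled = [xi * c for xi in X]    # multiply by a constant
divided = [xi / c for xi in X]   # divide by a constant

print(mean(shifted), variance(shifted))  # 6.0 9.0   (mean + c, variance unchanged)
print(mean(scaled), variance(scaled))    # 8.0 36.0  (mean * c, variance * c**2)
print(mean(divided), variance(divided))  # 2.0 2.25  (mean / c, variance / c**2)
```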
Summary
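Covariance, defined as
$$s_{XY} = \frac{\sum xy}{n},$$
is often used in the course of algebraic derivations of relationships such as the
variance of a sum and the variance of a difference.
Variance of a Sum
$$s_{X+Y}^2 = s_X^2 + s_Y^2 + 2 s_{XY}$$
Variance of a Difference
$$s_{X-Y}^2 = s_X^2 + s_Y^2 - 2 s_{XY}$$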
Covariance plays an important role in determining variance of weighted or composite
variables.