Probability Theory and Random Variables:
Mean, Variance, Covariance, and Correlation
Dan Saunders
Introduction
Suppose X is a random variable. What does that mean? The simplest notion is that we do not
know the value that X will take on with certainty. However, this does not imply that we are clueless
about X. The short answer is that uncertainty, as modeled by probability theory, is an exhaustive
characterization of all of the possible outcomes, and that random variables make that uncertainty
amenable to mathematical analysis.
Probability Theory
Probability theory requires three main ingredients. The first is the sample space, Ω, which is the set
of possible outcomes. Second is the event space, F, which is the set of all subsets of Ω. Essentially,
F represents the set of all possible events to which a probability can be assigned. Finally, we need
the probability measure, P , which assigns those probabilities to every event. There is a lot more
that could be said about these three mathematical objects, but it will be easier to demonstrate
with an example. Consider the familiar case of a fair coin toss
Ω = {H, T }
F = {∅, {H}, {T}, {H, T}}
P(∅) = 0, P({H}) = P({T}) = 1/2, P({H, T}) = 1
As can be seen from the example, these three objects {Ω, F, P } are a complete description of the
uncertain nature of the coin toss. For clarity, let’s consider one more example: tossing a die. In this
case, the set of outcomes is naturally represented by the numbers Ω = {1, 2, 3, 4, 5, 6}. Each event in F is a subset of Ω. For example, we may ask for the probability that we roll at most three, in which case we are choosing the event {1, 2, 3}, which is a subset of Ω. Finally, we would need to assign a probability to each event, such as P({1}) = 1/6 or P({1, 2, 3}) = 1/2. All of this
logical construction occurs before any mention of random variables. So what is a random variable?
Well, it’s actually quite a misnomer, because a random variable is more than just a variable.
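Before turning to random variables, here is a minimal Python sketch of the die example above; the names Omega, P_outcome, and P(event) are illustrative choices of this write-up, not part of the original notes.

```python
# A fair die: the sample space and a probability measure on it.
Omega = {1, 2, 3, 4, 5, 6}

# Each individual outcome gets probability 1/6.
P_outcome = {w: 1 / 6 for w in Omega}

def P(event):
    """Probability of an event (any subset of Omega): sum its outcome probabilities."""
    return sum(P_outcome[w] for w in event)

print(P({1}))        # about 0.1667
print(P({1, 2, 3}))  # 0.5, the probability of rolling at most three
print(P(Omega))      # 1.0
```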
Random Variables
A random variable is actually a function. Specifically, it’s a function which assigns a real number
to every element in the sample space, in accordance with its respective probability measure
X : Ω → ℝ
Again, the fair coin toss can help the discussion. Consider the random variable X, which maps the
outcome “heads” to the number 1 and the outcome “tails” to the number 0
X = 1 with probability 1/2
    0 with probability 1/2
In fact, this type of random variable should seem familiar. Recall the definition of a Bernoulli
random variable, which takes the form
X = 1 with probability p
    0 with probability 1 − p
A Bernoulli random variable is a function that can be used to represent any uncertain situation with exactly two outcomes. In fact, the use of 0 and 1 is not special. We could use any two real numbers to achieve the same goal.
Why do we need random variables? Simply put, we cannot mathematically analyze the uncertainty described by {Ω, F, P }. We must first map the set of possible outcomes to numbers, before
we can define the mean or variance. After all, what’s the average of heads and tails? On the other
hand, we can say what the average of 0 and 1 is. If they are equally likely, then the average is
1 · 1/2 + 0 · 1/2 = 1/2.
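As a rough sketch (not part of the original notes), the coin-toss mapping can be written as a Python function from outcomes to numbers, after which the weighted average is a routine computation; the names P and X follow the text's notation, but the code itself is an assumption of this write-up.

```python
# Outcomes and their probabilities for a fair coin.
P = {"H": 0.5, "T": 0.5}

def X(outcome):
    """A random variable: map 'heads' to 1 and 'tails' to 0."""
    return 1 if outcome == "H" else 0

# Once outcomes are mapped to numbers, the probability-weighted average is well defined.
average = sum(X(w) * p for w, p in P.items())
print(average)  # 0.5
```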
Expected Value
More generally, we define the expected value of any discrete random variable X as
E(X) = Σ_{ω ∈ Ω} X(ω) P(ω)
It is important to keep in mind that this is probability theory, and E(X) represents the theoretically
true mean, sometimes called the population mean and denoted by µ. This is entirely separate from
statistics, where we do not know the mean or the distribution. Notice that, for any Bernoulli
random variable we have
E(X) = 1 · p + 0 · (1 − p) = p
In the case where the distribution is unknown, we collect n independent and identically distributed data points, in order to estimate the expected value using the sample mean estimator
x̄ = (1/n) Σ_{i=1}^{n} x_i
This is also referred to as the arithmetic mean. This estimator is a function of random variables. Therefore, it is also a random variable, whose properties will be of great interest when making statistical inferences.
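To illustrate the sample mean as an estimator, the following sketch simulates n independent Bernoulli(p) draws and compares x̄ with the true mean p; the particular values p = 0.3 and n = 10 000 are assumptions made for this example, not taken from the text.

```python
import random

random.seed(0)                     # for a reproducible illustration
p, n = 0.3, 10_000                 # assumed true parameter and sample size

# n independent, identically distributed Bernoulli(p) observations.
xs = [1 if random.random() < p else 0 for _ in range(n)]

# The sample mean estimator: x_bar = (1/n) * sum of the x_i.
x_bar = sum(xs) / n
print(x_bar)                       # close to p for large n
```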
Variance and Standard Deviation
While the mean is a measure of the central tendency, the variance is a measure of the dispersion.
Intuitively, we are measuring the average squared distance from the mean.
Var(X) = E[(X − E(X))²]
Why squared? Minimizing the variance of an estimator increases precision. However, in order to
use calculus, we must have a smooth function without kinks. Therefore, minimization will be easier
with squared terms than with absolute values, although the absolute value function would have a more intuitive interpretation.
The variance is often denoted σ². An alternative definition exists, which may be more useful in solving problems. First, note that the mean could be written as E(X¹). The exponent of 1 emphasizes why the mean is sometimes called “the first moment”. In general, the nth moment of a random variable is E(Xⁿ). As it turns out, the variance is closely related to the second moment by the following relation
Var(X) = E(X²) − [E(X)]²
It is quite common for people to refer to the second moment when talking about the variance. In
particular, if E(X) = 0, then the second moment is exactly equal to the variance.
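As a quick numerical illustration of the two formulas, the sketch below computes the variance of a fair die both from the definition and from the second-moment shortcut; the choice of the die distribution here is an assumption made for the example.

```python
# Fair die: compute Var(X) two ways and confirm they agree.
values = [1, 2, 3, 4, 5, 6]
prob = 1 / 6

EX = sum(x * prob for x in values)        # first moment, E(X) = 3.5
EX2 = sum(x ** 2 * prob for x in values)  # second moment, E(X^2)

var_definition = sum((x - EX) ** 2 * prob for x in values)  # E[(X - E(X))^2]
var_shortcut = EX2 - EX ** 2                                # E(X^2) - E(X)^2

print(var_definition, var_shortcut)  # both about 2.9167
```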
The standard deviation, defined as the square root of the variance and denoted σ, has a simple,
important interpretation. Recall the distance formula from Algebra
d = √((x₁ − x₂)² + (y₁ − y₂)²)
So distance is naturally measured as the square root of the sum of squared differences. The variance,
by definition, is a probability-weighted sum of squared differences from the mean, and the standard deviation is the square
root of that sum. Therefore, the standard deviation can arguably be said to measure the average
distance from the mean.
σ = [ Σ_{ω ∈ Ω} (X(ω) − µ)² · P(ω) ]^{1/2}
The standard deviation also shares the same units as the underlying random variable, unlike
the variance. For example, if X is measured in meters, then so are µ and σ, whereas σ 2 is measured
in meters-squared.
In the instance where we don’t know the underlying distribution, as with the mean, we must
use a sample analog by collecting n independent observations. Specifically, for the variance we have
the following estimator
s² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)²
We divide by n − 1 because we lose one degree of freedom by using the estimate of the mean, x̄,
in the estimation of the variance. As a matter of convention, the estimated standard deviation of an estimator, such as s/√n for the sample mean, is referred to as its “standard error”.
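Here is a minimal sketch of the s² estimator with the n − 1 divisor, run on simulated data; the normal distribution and its parameters are assumptions of this illustration. Python's statistics.variance uses the same divisor, so it serves as a cross-check.

```python
import random
import statistics

random.seed(0)
data = [random.gauss(0, 2) for _ in range(1_000)]   # assumed example data, sigma = 2

n = len(data)
x_bar = sum(data) / n
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)  # divide by n - 1, not n

# statistics.variance also divides by n - 1, so the values agree up to rounding.
print(s2, statistics.variance(data))
print(s2 ** 0.5)  # the estimated standard deviation s, roughly 2
```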
Covariance and Correlation
The covariance is a generalization of the variance. This is obvious from the definition
Cov(X, Y) = E[(X − E(X)) · (Y − E(Y))]
Notice that Cov(X, X) = Var(X). All we can hope to interpret about the covariance is its sign (positive or negative). If, for example, the covariance is positive, then we can say the following: “On average, if X is above its mean, then Y is also above its mean, and vice versa.” As with the
simplification for the variance, we have the following formula
Cov(X, Y) = E[XY] − E[X] · E[Y]
Unfortunately, the covariance is quite sensitive to the units of measurement. For example, suppose
X and Y were both measured in meters. Now suppose we measure X in centimeters, i.e., we create
a new random variable Z = 100X. Then Cov(Z, Y ) = 100 · Cov(X, Y ). So the covariance can be
scaled up by any arbitrarily large number, without any change in the underlying relationship. To
solve this problem, we calculate the correlation, sometimes called the coefficient of correlation and
denoted ρ. The definition is as follows
ρ_{X,Y} = Cov(X, Y) / (σ_X · σ_Y)
As it turns out, −1 ≤ ρ ≤ 1 for any two random variables, and it is invariant to scale. Values of ρ close to 1 or −1 imply a strong (positive or negative) linear relationship, while values close to 0 indicate a weak linear relationship.
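The scale sensitivity of the covariance and the scale invariance of the correlation can be checked numerically. The sketch below uses simulated data and the meters-to-centimeters factor of 100 from the example above; the helper functions mean, cov, and corr are defined here for illustration and are not part of the notes.

```python
import random

random.seed(0)
n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]     # y is positively related to x

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def corr(u, v):
    return cov(u, v) / (cov(u, u) ** 0.5 * cov(v, v) ** 0.5)

z = [100 * xi for xi in x]                    # change of units: Z = 100X
print(cov(x, y), cov(z, y))                   # the covariance scales by 100
print(corr(x, y), corr(z, y))                 # the correlation is unchanged
```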
Properties
For any two random variables X, Y and any two constants a, b we have
1. The expectation operator is linear
E[aX + bY] = aE[X] + bE[Y]
2. The covariance, defined as an expected value, has the following property
Cov(aX, bY) = ab · Cov(X, Y)
This further implies that the variance is a non-linear operator
Var(aX) = Cov(aX, aX) = a² Var(X)
3. Finally, we use all of these properties together to find the variance of a sum of random
variables as
Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab · Cov(X, Y)
The properties listed above generalize for any number of random variables, and they give us the
tools to calculate the mean and variance of a collection of random variables.
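As a closing illustration (an assumption of this write-up, not part of the original notes), the variance-of-a-sum formula can be checked numerically on simulated data; the constants a and b and the joint distribution of X and Y are arbitrary choices.

```python
import random

random.seed(0)
a, b, n = 2.0, -3.0, 100_000
x = [random.gauss(1, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]   # y is correlated with x

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((s - mu) * (t - mv) for s, t in zip(u, v)) / (len(u) - 1)

combo = [a * xi + b * yi for xi, yi in zip(x, y)]
lhs = cov(combo, combo)                                           # Var(aX + bY)
rhs = a**2 * cov(x, x) + b**2 * cov(y, y) + 2 * a * b * cov(x, y)
print(lhs, rhs)  # equal up to floating-point rounding
```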