Economics 140A
Random Variables
Probability
In everyday usage, probability expresses a degree of belief about an event or statement with a number between 0 and 1. This definition of probability is subjective in nature because different individuals assign different probabilities to an event. We work with an objective definition of probability. Our objective definition of the probability that an event occurs is given by the limit of the empirical frequency of the event as the number of replications of the experiment, from which the event can occur, increases indefinitely. The assignment of probability does not differ across individuals.
Probability is a subject that can be studied independently of statistics; it
forms the foundation of statistics. For example, what is the probability that a
head comes up twice in a row if we toss an unbiased coin? The answer, .25, is
calculated without need of statistical inference.
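To illustrate the frequency interpretation, the following short sketch (in Python; not part of the original notes, with an arbitrary seed and replication count) simulates repeated tosses of two fair coins and reports the empirical frequency of two heads, which settles near .25 as the number of replications grows.

import random

random.seed(0)                      # arbitrary seed for reproducibility
replications = 100_000              # number of times the experiment is run
two_heads = 0
for _ in range(replications):
    tosses = [random.choice("HT") for _ in range(2)]   # toss a fair coin twice
    if tosses == ["H", "H"]:
        two_heads += 1
print(two_heads / replications)     # empirical frequency, close to 0.25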
Axioms of Probability
Definitions of a few commonly used terms follow. These terms inevitably remain vague until they are illustrated.
Sample space. The set of all the possible outcomes of an experiment.
Event. A subset of the sample space.
Simple Event. An event that cannot be a union of other events.
Composite Event. An event that is not a simple event.
Example 1.
Experiment. Tossing a coin twice.
Sample space: $\{HH, HT, TH, TT\}$.
The event that a tail occurs at least once: $HT \cup TH \cup TT$.
Example 2.
Experiment. Reading the temperature (F) at UCSB at noon on November 1.
Sample Space. Real Interval (0,100).
Events of interest are intervals contained in the sample space.
A probability is a nonnegative number we assign to every event. The axioms of
probability are the rules we agree to follow when we assign probabilities. (Often,
Venn diagrams are used to determine relations among the probabilities assigned
to different sets.)
Axioms of Probability
(1) $P(A) \geq 0$ for any event $A$.
(2) $P(S) = 1$, where $S$ is the sample space.
(3) If $\{A_i\}$, $i = 1, 2, \ldots$, are mutually exclusive (that is, $A_i \cap A_j = \emptyset$ for all $i \neq j$), then $P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots$.
The first two rules are consistent with everyday use of the word probability. The third rule is consistent with the frequency interpretation of probability, for relative frequency follows the same rule. If, at the roll of a die, $A$ is the event that the die shows 1 pip and $B$ is the event that the die shows 2 pips, the relative frequency of $A \cup B$ is the sum of the relative frequencies of $A$ and $B$. We want probability to follow the same rule.
If the sample space is discrete, as in example 1, it is possible to assign probability to every event (that is, every possible subset of the sample space) in a way that is consistent with the probability axioms. If the sample space is continuous, however, as in example 2, it is not possible to do so. In such a case we restrict attention to a smaller class of events to which we can assign probabilities in a manner consistent with the axioms. For example, the class of all intervals contained in (0,100) and their unions satisfies the condition.
Conditional Probability
The concept of conditional probability is intuitively easy to understand. For example, it makes sense to talk about the conditional probability that two pips show in the roll of a die given that an even number of pips is showing. In the frequency interpretation, this conditional probability can be regarded as the limit of the ratio of the number of times two pips show to the number of times an even number of pips shows. In general, we consider the "conditional probability of $A$ given $B$," denoted by $P(A|B)$, for any pair of events $A$ and $B$ in a sample space, provided $P(B) > 0$, and establish axioms that govern the calculation of conditional probabilities.
Axioms of Conditional Probability
(In the following it is assumed that $P(B) > 0$.)
(1) $P(A|B) \geq 0$ for any event $A$.
(2) $P(A|B) = 1$ for any event $B \subset A$.
(3) If $\{A_i \cap B\}$, $i = 1, 2, \ldots$, are mutually exclusive, then $P(A_1 \cup A_2 \cup \cdots \mid B) = P(A_1|B) + P(A_2|B) + \cdots$.
(4) If $H \subset B$ and $G \subset B$ and $P(G) \neq 0$, then $\frac{P(H|B)}{P(G|B)} = \frac{P(H)}{P(G)}$.
Axioms (1)-(3) are analogous to the corresponding axioms of probability. They imply that we may treat conditional probability just like probability by regarding $B$ as the sample space. Axiom (4) is justified by observing that because whenever $H$ or $G$ occurs $B$ occurs, the relative frequency of $H$ versus $G$ remains the same before and after $B$ is known to have occurred.
As explained in the background notes, axiom (1) is redundant; that is, axiom (1) follows from the other three axioms. Axioms (2)-(4) can also be written as a single, more complicated axiom
$$P(A|B) = \frac{P(A \cap B)}{P(B)} \quad \text{for any pair of events $A$ and $B$ such that } P(B) > 0. \qquad (0.1)$$
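As an illustration of equation (0.1), the sketch below (hypothetical, not from the notes) computes the conditional probability that a fair die shows two pips given that an even number of pips is showing: $P(A \cap B)/P(B) = (1/6)/(1/2) = 1/3$.

from fractions import Fraction

outcomes = range(1, 7)                       # sample space {1, ..., 6}
p = {s: Fraction(1, 6) for s in outcomes}    # equal probabilities

A = {2}                                      # event: two pips show
B = {2, 4, 6}                                # event: an even number of pips shows

P_B = sum(p[s] for s in B)
P_A_and_B = sum(p[s] for s in A & B)
print(P_A_and_B / P_B)                       # 1/3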
Statistical Independence
We first define the concept of statistical (stochastic) independence for a pair of events. Henceforth it will be referred to simply as independence.
Definition (Pairwise). Events $A$ and $B$ are said to be independent if $P(A) = P(A|B)$.
The term independence has a clear intuitive meaning. It means that the probability of occurrence of $A$ is not affected by whether or not $B$ has occurred. Because of (0.1), the above equality is equivalent to $P(B) = P(B|A)$ (provided $P(A) > 0$) or to $P(A)P(B) = P(A \cap B)$.
In many cases two events are not independent. For example, the probability that the Chicago Bulls win a game is not independent of whether Michael Jordan plays.
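The product condition $P(A)P(B) = P(A \cap B)$ is easy to check directly. A minimal sketch (my own example, assuming two independent tosses of a fair coin) verifies that "heads on the first toss" and "heads on the second toss" are independent, while "heads on the first toss" and "at least one tail" are not.

from fractions import Fraction

sample_space = ["HH", "HT", "TH", "TT"]
p = {s: Fraction(1, 4) for s in sample_space}     # fair, independent tosses

def prob(event):
    return sum(p[s] for s in event)

A = {s for s in sample_space if s[0] == "H"}      # heads on the first toss
B = {s for s in sample_space if s[1] == "H"}      # heads on the second toss
C = {s for s in sample_space if "T" in s}         # at least one tail

print(prob(A) * prob(B) == prob(A & B))           # True: independent
print(prob(A) * prob(C) == prob(A & C))           # False: not independent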
Statistics
If we know all the features of a random mechanism that generates data, then
we use probability theory to construct probabilities. In most cases (and in all cases
of empirical interest in the real world) we do not know all features of the random
mechanism that has generated the data we are studying. Statistics is the science
of observing data and making inferences about the characteristics of a random
mechanism that has generated the data. The literal translation of econometrics
is measurement in economics, which means measuring the characteristics of a
random mechanism that has generated the data under study. As such, we can
think of econometrics as drawing on results from statistics.
Random Variables
To make mathematical analysis tractable, the statistician assigns numbers to objects (for example 1 to heads and 0 to tails in our coin tossing example). A random mechanism whose outcomes are real numbers is called a random variable.
Definition. A random variable is a variable that takes values according to a certain probability distribution.
A discrete random variable takes a countable (finite or countably infinite) number of real numbers with preassigned probabilities. A continuous random variable takes a continuum of values in the real line according to the rule determined by a density function. A third type of random variable is formed as a mixture of discrete and continuous random variables. The term probability distribution captures a broader concept that refers to either a set of discrete probabilities or a density function.
Example. The roll of a fair die is an example of a univariate discrete random
variable.
Sample space: $\{1, 2, 3, 4, 5, 6\}$, each outcome with probability $\frac{1}{6}$.
$Z_1$ = the number of pips showing.
$Z_2 = 1$ if the number of pips is odd, $Z_2 = 0$ if it is even.
Each outcome in the sample space maps to a value of each random variable. The random variable $Z_1$ can hardly be distinguished from the sample space. Note that the probability distribution of $Z_2$ can be derived from the sample space: $P(Z_2 = 1) = \frac{1}{2}$ and $P(Z_2 = 0) = \frac{1}{2}$.
The probability distribution of a discrete random variable is completely characterized by the equation
$$P(Z = z_i) = p_i, \quad i = 1, 2, \ldots, n. \qquad (0.2)$$
For a discrete random variable the probability of a specific value can be nonzero: $p_i \geq 0$. Because one of the possible outcomes must always occur, $\sum_{i=1}^{n} p_i = 1$; $n$ may be infinite in some cases.
As we have not yet been precise about a density function, we need to formalize our earlier definition of a continuous random variable.
Definition. If there is a nonnegative function $f(y)$ defined over the whole line such that
$$P(y_1 \leq Y \leq y_2) = \int_{y_1}^{y_2} f(y)\,dy,$$
for any $y_1$, $y_2$ satisfying $y_1 \leq y_2$, then $Y$ is a continuous random variable and $f(y)$ is the density function for $Y$. One of the possible outcomes must always occur (probability axiom (2)), so $\int_{-\infty}^{\infty} f(y)\,dy = 1$. The definition of an integral implies that the probability that a continuous random variable takes any single value is zero, so it does not matter whether $<$ or $\leq$ is used within the probability bracket. In most practical applications, $f(y)$ is continuous except for possibly a finite number of discontinuities. For such a function the Riemann integral exists, and therefore $f(y)$ is a density function.
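As a numerical check of this definition, the sketch below (my own illustration; the standard Gaussian density and the interval are arbitrary choices, and scipy is assumed to be available) integrates a density over an interval and over the whole line.

import numpy as np
from scipy import integrate

def f(y):
    # standard Gaussian density
    return np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)

# P(y1 <= Y <= y2) as the integral of the density over [y1, y2]
p, _ = integrate.quad(f, -1.0, 1.0)
print(p)                                   # about 0.683

total, _ = integrate.quad(f, -np.inf, np.inf)
print(total)                               # 1.0, consistent with axiom (2)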
The cumulative distribution function, $F$, gives the probability that the random variable takes a value less than or equal to a number $b$. The cumulative distribution function is obtained by accumulating all the relevant probabilities.
For a continuous random variable
$$P(Y \leq b) = F(b) = \int_{-\infty}^{b} f(y)\,dy.$$
For a discrete random variable
$$P(Z \leq b) = G(b) = \sum_{i:\, z_i \leq b} P(Z = z_i).$$
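A minimal sketch of the discrete case (my own example, using the fair die): $G(b)$ is built by accumulating $P(Z = z_i)$ over the values $z_i \leq b$.

from fractions import Fraction

pmf = {z: Fraction(1, 6) for z in range(1, 7)}      # fair die

def G(b):
    # cumulative distribution function of the discrete random variable Z
    return sum(p for z, p in pmf.items() if z <= b)

print(G(3))      # 1/2
print(G(6))      # 1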
Location
The location of a random variable describes a typical value of the random variable. There are three common measures of location. The first is the mean, or expected value. We use the notation $EY$ to denote the expected value of the random variable $Y$, where $E$ is the expectation operator. (An operator denotes a specific mathematical procedure.) For a continuous random variable
$$EY = \int_{-\infty}^{\infty} y\, f(y)\,dy = \mu_Y.$$
For a discrete random variable
$$EZ = \sum_{i=1}^{N} z_i\, P(Z = z_i) = \mu_Z.$$
We see that the mathematical procedure is to take the sum of each possible value
multiplied by the probability of the value.
Example. Let $Z$ be the number of pips resulting from one roll of a fair die. The mean of $Z$ is 3.5. Note that 3.5 pips are never observed, so the mean is not necessarily a value the random variable can actually take.
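A quick check of this example (hypothetical code, not part of the notes) computes $EZ = \sum_i z_i\, P(Z = z_i)$ for the fair die.

from fractions import Fraction

pmf = {z: Fraction(1, 6) for z in range(1, 7)}      # fair die
EZ = sum(z * p for z, p in pmf.items())             # expected value
print(EZ)                                            # 7/2, i.e. 3.5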
Two important features of the expectations operator are apparent. First, the expectations operator is linear, that is, $E(Y_1 + Y_2) = EY_1 + EY_2$. To verify for the discrete case (taking $Z_1$ and $Z_2$ to be independent for simplicity, so that the joint probability factors; the result holds in general),
$$E(Z_1 + Z_2) = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} (z_{1i} + z_{2j})\, P(Z_1 = z_{1i})\, P(Z_2 = z_{2j})$$
$$= \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} z_{1i}\, P(Z_1 = z_{1i})\, P(Z_2 = z_{2j}) + \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} z_{2j}\, P(Z_1 = z_{1i})\, P(Z_2 = z_{2j})$$
$$= \sum_{i=1}^{n_1} z_{1i}\, P(Z_1 = z_{1i}) + \sum_{j=1}^{n_2} z_{2j}\, P(Z_2 = z_{2j}),$$
where the third line follows from the fact that $\sum_{j=1}^{n_2} P(Z_2 = z_{2j}) = \sum_{i=1}^{n_1} P(Z_1 = z_{1i}) = 1$. To verify for the continuous case, simply replace summation signs with integral signs.
The second feature is that the expectation of a deterministic variable is simply the deterministic variable. For example,
$$E(5) = \int 5\, f_Y(y)\,dy = 5 \int f_Y(y)\,dy = 5.$$
Because $EY$ is simply a number, like 5, we have
$$E(EY) = \int (EY)\, f_Y(y)\,dy = EY \int f_Y(y)\,dy = EY.$$
The mean is the balance point of the data in that the cumulative distance
from the mean of those observations above the mean is equal to the cumulative
distance from the mean of those observations below the mean. Because the mean
is the balance point, it is extremely sensitive to outliers. (Diagram, with a teeter-totter, the following two examples: $\{10, 12, 17\}$, mean equals 13, and $\{30, 35, 38, 57\}$, mean equals 40; as the last value shifts out, the entire sample is forced to the left of the balance point, the mean.)
A more robust measure of location, one that is generally not as sensitive to
outliers, is the median. The median is the value (not necessarily unique) that half
of the observations equal or fall below. Finally, the mode is an alternative way
to represent typical values. The mode is the value (not necessarily unique) with
the highest probability (for discrete random variables) or the highest point of the
density function (for continuous random variables).
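To illustrate the relative robustness of these location measures, a small sketch (my own numbers, using Python's statistics module) shows how an outlier moves the mean much more than the median.

import statistics

data = [10, 12, 17]
with_outlier = [10, 12, 170]          # last value shifted far out

print(statistics.mean(data), statistics.median(data))                    # 13, 12
print(statistics.mean(with_outlier), statistics.median(with_outlier))    # 64, 12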
Variance
While location describes a typical value of the random variable, we need some measure of dispersion to describe how likely we are to get a typical value. The variance of a random variable is a common measure of dispersion. To ensure that positive and negative values do not cancel out, we study the squared difference between possible values of $Y$ and the expected value of $Y$.
Definition: The variance of $Y$ is the average squared distance between $Y$ and $EY$:
$$Var(Y) = E(Y - EY)^2 = E[Y^2 - 2Y\,EY + (EY)^2] = E(Y^2) - 2(EY)^2 + (EY)^2 = E(Y^2) - (EY)^2,$$
where the second equality follows from the fact that expectation is a linear operator and EY is not random.
Clearly the variance is nonnegative. Often the square root of the variance, termed the standard deviation, is reported rather than the variance. To accommodate the fact that the variance is nonnegative, the standard deviation is defined to be the positive square root of the variance.
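Continuing the die example, the sketch below (hypothetical) computes $Var(Z)$ both as $E(Z - EZ)^2$ and as $E(Z^2) - (EZ)^2$ to confirm that the two expressions agree.

from fractions import Fraction

pmf = {z: Fraction(1, 6) for z in range(1, 7)}          # fair die
EZ = sum(z * p for z, p in pmf.items())
EZ2 = sum(z**2 * p for z, p in pmf.items())

var_direct = sum((z - EZ)**2 * p for z, p in pmf.items())
var_shortcut = EZ2 - EZ**2
print(var_direct, var_shortcut)                          # both 35/12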
Interpretation of the standard deviation is not as straightforward as interpretation of the mean. To help in interpretation we use Chebyshev's inequality, which states that in any set of data at least $1 - \frac{1}{k^2}$ of the data are within $k$ standard deviations of the mean. For $k = 2$ we find that at least $\frac{3}{4}$ of the data lie within 2 standard deviations of the mean. (The bound is not always sharp: For a Gaussian random variable roughly 95 percent of the data lie within 2 standard deviations of the mean and nearly $\frac{2}{3}$ of the data lie within one standard deviation.)
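A quick numerical check of Chebyshev's bound (my own sketch, using a simulated Gaussian sample and numpy; the seed and sample size are arbitrary) compares the empirical share of observations within $k$ standard deviations of the mean with the bound $1 - 1/k^2$.

import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
y = rng.normal(loc=0.0, scale=1.0, size=100_000)

for k in (1, 2, 3):
    share = np.mean(np.abs(y - y.mean()) <= k * y.std())
    bound = 1 - 1 / k**2
    print(k, round(share, 3), ">=", bound)     # e.g. k=2: about 0.954 >= 0.75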
As with the mean, the variance is sensitive to outliers. A more robust measure
of dispersion is the interquartile range, which is defined as the range encompassed
by the middle half of the data.
Note that the final expression for the variance involved the term $E(Y^2)$. From the above definition for the expected value of $Y$ we infer that
$$E(Y^2) = \int_{-\infty}^{\infty} y^2 f(y)\,dy.$$
The expressions for $EY$ and $E(Y^2)$ are examples of moments. For any positive value of $k$ (note $k$ does not need to be an integer)
$$E(Y^k) = \int_{-\infty}^{\infty} y^k f(y)\,dy.$$
We have seen that the moments for $k = 1, 2$ are of interest to us. Note the distinction between the second moment and the variance. The second moment equals the variance only if the first moment is zero.
Standardization
Standardization is the process by which a random variable with mean $\mu$ and variance $\sigma^2$ is transformed to a random variable with mean zero and variance one. Because random variables are typically standardized to allow one to use statistical tables, in our analysis we often standardize random variables. Suppose
$$Y \sim N(\mu, \sigma^2),$$
where the notation indicates that $Y$ is distributed as a Gaussian random variable with $EY = \mu$ and the variance of $Y$ equal to $\sigma^2$.
The first step in standardization is to ensure that the transformed random variable has mean 0. To do so, we subtract $\mu$:
$$E(Y - \mu) = EY - E\mu = EY - \mu = EY - EY = 0.$$
The (transformed) random variable $Y - \mu$ has mean 0. The second step in standardization is to ensure that the transformed mean zero random variable has variance 1. To do so, we divide through by the standard deviation, $\sigma$:
$$Var\left(\frac{Y-\mu}{\sigma}\right) = E\left(\frac{Y-\mu}{\sigma} - E\frac{Y-\mu}{\sigma}\right)^2 = E\left(\frac{Y-\mu}{\sigma} - 0\right)^2 = \frac{1}{\sigma^2}\,E(Y-\mu)^2 = 1,$$
where $E\frac{Y-\mu}{\sigma} = 0$. The transformed random variable $\frac{Y-\mu}{\sigma}$ has the distribution
$$\frac{Y-\mu}{\sigma} \sim N(0, 1).$$
The second step in the standardization is a special case of the rule
$$Var(cY) = c^2\,Var(Y),$$
for any constant $c$.
Standardized variables are useful for comparisons within groups. Let $Y^s = \frac{Y - \mu}{\sigma}$. The standardized variable $Y^s$ measures how many standard deviations $Y$ is above or below its mean. If $Y$ equals its mean, then $Y^s$ is equal to 0. If $Y$ is 2 standard deviations above its mean, then $Y^s$ is equal to 2. If $Y$ is one standard deviation below its mean, then $Y^s$ is equal to $-1$.
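The sketch below (my own illustration, with made-up values of $\mu = 100$ and $\sigma = 15$) standardizes a variable and reads off how many standard deviations each value lies from the mean.

# Standardize Y with (assumed) mean 100 and standard deviation 15.
mu, sigma = 100.0, 15.0

def standardize(y):
    return (y - mu) / sigma

for y in (100.0, 130.0, 85.0):
    print(y, standardize(y))     # 0.0, 2.0, -1.0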
Multivariate Distributions
What if we have more than one random variable? The joint distribution of several random variables is a multivariate distribution. Consider two continuous random variables $Y$ and $W$. The multivariate density function $f_{Y,W}(y, w)$ is defined over $\mathbb{R}^2$. The individual, or marginal, density functions are $f_Y(y)$ and $f_W(w)$, and the conditional density functions are $f_{Y|W}(y|w)$ and $f_{W|Y}(w|y)$. Recall
$$f_{Y,W}(y, w) = f_{Y|W}(y|w)\, f_W(w).$$
Conditional moments are constructed as
$$E(Y \mid W = w) = \int y\, f_{Y|W}(y|w)\,dy.$$
An important relation between conditional moments and unconditional moments is the
Law of Iterated Expectations
$$EY = E_W\, E(Y|W).$$
Proof. From the definition of the joint density function
$$EY = \int\!\!\int y\, f_{Y,W}(y, w)\,dw\,dy = \int\!\!\int y\, f_{Y|W}(y|w)\, f_W(w)\,dw\,dy = \int\left[\int y\, f_{Y|W}(y|w)\,dy\right] f_W(w)\,dw = E_W\, E(Y|W).$$
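A simulation sketch (my own construction, using numpy; the conditional model $E(Y|W = w) = 2 + 3w$ and the seed are arbitrary assumptions) illustrates the law of iterated expectations: draw $W$, draw $Y$ given $W$, and compare the overall mean of $Y$ with the mean of the conditional means.

import numpy as np

rng = np.random.default_rng(1)                 # arbitrary seed
n = 200_000

w = rng.choice([0.0, 1.0], size=n)             # W is 0 or 1 with equal probability
y = rng.normal(loc=2.0 + 3.0 * w, scale=1.0)   # assumed model: E(Y | W = w) = 2 + 3w

lhs = y.mean()                                  # estimate of EY
rhs = np.mean(2.0 + 3.0 * w)                    # estimate of E_W E(Y|W)
print(lhs, rhs)                                 # both close to 3.5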
Independence
We have already defined the concept of independence for a pair of events. We now extend the concept of independence to a pair of random variables. If $Y$ and $W$ are unrelated, in the sense that the probability of $Y$ taking on a certain value is not related to the probability of $W$ taking on a certain value, then $Y$ and $W$ are independent. If $Y$ and $W$ are independent, then
$$E[YW] = E[Y]\,E[W],$$
and
$$f_{Y|W}(y|w) = f_Y(y) \;\Rightarrow\; f_{Y,W}(y, w) = f_Y(y)\, f_W(w).$$
Covariance
The covariance is the expected product of deviations from means,
$$Cov(Y, W) = E[(Y - EY)(W - EW)].$$
The covariance captures only the linear relationship. For example, if $Y = W^2$ and $W$ takes the values $\{-3, -2, -1, 0, 1, 2, 3\}$, then $Y$ takes the values $\{9, 4, 1, 0, 1, 4, 9\}$. If $W$ takes each value with equal probability, then $EW = 0$ and $EY = 4$. The covariance between $W$ and $Y$ is
$$\frac{1}{7} \sum_{i=1}^{7} (w_i - 0)(y_i - 4) = 0.$$
Thus there is no linear relation between $W$ and $Y$, but there is clearly a nonlinear relation. When $W$ is independent of $Y$ there is no relation between $W$ and $Y$:
Relation:
i) If $W$ is independent of $Y$, then there is no relation between $W$ and $Y$: no relation $\Rightarrow$ no linear relation $\Rightarrow$ $Cov(Y, W) = 0$.
ii) If the covariance between $W$ and $Y$ is zero, then there is no linear relation: no linear relation $\not\Rightarrow$ no relation.
Note that covariance is not scale invariant (the value of the estimated covariance depends on the units in which the random variables are measured). The correlation coefficient provides a scale invariant measure of the linear relation between $W$ and $Y$:
$$\rho_{YW} = \frac{Cov(Y, W)}{\sigma_Y\, \sigma_W},$$
where $-1 \leq \rho_{YW} \leq 1$.
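The sketch below (hypothetical) reproduces the $Y = W^2$ example numerically: the covariance between $W$ and $Y$ is zero even though the two variables are perfectly, but nonlinearly, related.

from fractions import Fraction

w_values = [-3, -2, -1, 0, 1, 2, 3]
p = Fraction(1, 7)                                     # equal probabilities

EW = sum(w * p for w in w_values)                      # 0
EY = sum(w**2 * p for w in w_values)                   # 4
cov = sum((w - EW) * (w**2 - EY) * p for w in w_values)
print(EW, EY, cov)                                     # 0, 4, 0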
Variance of a Sum
The variance of a sum of two random variables is
$$Var(Y + W) = E[(Y + W) - E(Y + W)]^2 = E[(Y - EY) + (W - EW)]^2$$
$$= E[Y - EY]^2 + E[W - EW]^2 + 2E[(W - EW)(Y - EY)] = Var(Y) + Var(W) + 2Cov(Y, W).$$
Similarly,
$$Var(Y - Z) = Var(Y) + Var(Z) - 2Cov(Y, Z).$$
When $Cov(Y, W) = 0$,
$$Var(Y + W) = Var(Y) + Var(W).$$
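A simulation check (my own sketch, with an arbitrarily chosen pair of correlated Gaussian variables) compares the sample variance of $Y + W$ with $Var(Y) + Var(W) + 2Cov(Y, W)$.

import numpy as np

rng = np.random.default_rng(2)                     # arbitrary seed
n = 100_000
w = rng.normal(size=n)
y = 0.5 * w + rng.normal(size=n)                   # Y correlated with W by construction

lhs = np.var(y + w)
rhs = np.var(y) + np.var(w) + 2 * np.cov(y, w, ddof=0)[0, 1]
print(lhs, rhs)                                    # essentially identical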
Central Limit Theorems
A remarkably large class of random variables can be well approximated by a
single distribution, the Gaussian distribution. What types of random variables?
Why, the types we typically encounter in applied work, such as the average height of students, the average income gain from college, or the average effect of interest rates on investment. The key word is average.
Theorem. If $Z$ is a standardized sum of $n$ independent identically distributed random variables with a finite, nonzero standard deviation, then the probability distribution of $Z$ approaches the Gaussian distribution as $n$ increases.
The beauty of the theorem is that the conditions do not need to be taken
literally. For many applications, n need be only 20 or 30 for the approximation to work reasonably well. Also, the restriction that the data be independent
identically distributed random variables is not needed, although the sample size
at which the approximation works reasonably well increases as the heterogeneity
and dependence increase.
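A short simulation (my own sketch, using a heavily skewed exponential distribution chosen arbitrarily) shows how standardized sample means move toward the Gaussian shape as $n$ grows, as the theorem predicts.

import numpy as np

rng = np.random.default_rng(3)                       # arbitrary seed

def standardized_means(n, draws=50_000):
    # standardized averages of n i.i.d. exponential(1) random variables
    x = rng.exponential(scale=1.0, size=(draws, n))
    means = x.mean(axis=1)
    return (means - 1.0) / (1.0 / np.sqrt(n))        # exponential(1): mean 1, sd 1

for n in (2, 30):
    z = standardized_means(n)
    # share within one standard deviation; approaches about 0.68 (the Gaussian value) as n grows
    print(n, np.mean(np.abs(z) <= 1).round(3))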