STP 421 - Core Concepts
1. Probability Spaces: Probability spaces are used to model processes with random outcomes
and have three components: a sample space Ω which is the set of all possible outcomes, a
collection F of subsets of Ω that are called events, and a function P : F → [0, 1] which assigns
a probability P(E) to every event E ∈ F. P is said to be a probability distribution or
probability measure on F.
2. Interpretations: The probability of an event E can be interpreted in two very different ways.
On the one hand, the frequentist interpretation regards P(E) as the limiting frequency with
which the event E occurs in an infinite series of independent, identically-distributed trials. In
contrast, Bayesian interpretations regard P(E) as a measure of the plausibility of a conjecture
E, either as a matter of subjective belief or in terms of the strength of the evidence in favor of
E. Whereas frequentist probabilities can only be assigned to events that can occur in repeated
trials, Bayesian probabilities can be assigned to events that are unique.
3. The Laws of Probability: Every probability distribution satisfies the following four identities:
1. P(∅) = 0
2. P(Ω) = 1
3. Sum rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
4. Product rule: P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B)
It follows from the first, second and third law that P(Ac ) = 1 − P(A) for any event A where
Ac = Ω \ A is the complement of A in the sample space.
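These identities can be checked directly on a small finite sample space. Below is a minimal Python sketch using a fair six-sided die; the events A and B are illustrative choices, not part of the notes:

```python
from fractions import Fraction

# Sample space: a fair six-sided die; every outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability of an event (a subset of omega) under the uniform measure."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}          # "roll is even"
B = {4, 5, 6}          # "roll is at least 4"

assert P(set()) == 0                          # law 1: P(empty set) = 0
assert P(omega) == 1                          # law 2: P(omega) = 1
assert P(A | B) == P(A) + P(B) - P(A & B)     # sum rule
assert P(omega - A) == 1 - P(A)               # complement rule, derived above
```

Exact rational arithmetic with `Fraction` avoids floating-point round-off, so each law holds with equality rather than approximately.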
4. Countable additivity: If E1 , E2 , · · · is a countable collection of mutually exclusive events,
then the probability of their union is equal to the sum of their probabilities:

P( ⋃_{i≥1} E_i ) = Σ_{i≥1} P(E_i).

Be aware that additivity does not generally extend to uncountable unions of disjoint sets.
5. Conditional Probabilities: The conditional probability of A given B is denoted P(A|B)
and is defined by the formula:
P(A|B) = P(A ∩ B) / P(B)
provided that P(B) > 0. P(A|B) is undefined if P(B) = 0. P(A|B) can be interpreted as the
probability that A is true given that we know or assume that B is true.
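The definition can be exercised by direct enumeration on a finite sample space; a short Python sketch (the two-dice events are an illustrative choice):

```python
from fractions import Fraction
from itertools import product

# Sample space: two independent rolls of a fair die, all 36 outcomes equally likely.
omega = list(product(range(1, 7), repeat=2))

def P(pred):
    """Probability that a predicate holds under the uniform measure on omega."""
    return Fraction(sum(1 for w in omega if pred(w)), len(omega))

def A(w): return w[0] + w[1] == 8     # the sum of the rolls is 8
def B(w): return w[0] % 2 == 0        # the first roll is even

# P(A|B) = P(A ∩ B) / P(B), defined because P(B) > 0.
p_A_given_B = P(lambda w: A(w) and B(w)) / P(B)
print(p_A_given_B)    # 1/6
```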
6. Independence: Two events A and B are said to be independent if P(A ∩ B) = P(A)P(B).
In this case, it can be shown that P(A|B) = P(A) and P(B|A) = P(B) if P(A) > 0 and P(B) > 0.
In other words, if two events are independent, then knowing that one of them has occurred does
not change the likelihood that the other has occurred.
Similarly, a collection of n events E1 , · · · , En is said to be independent if any finite subcollection,
say Ei1 , · · · , Eim satisfies the condition

P(Ei1 ∩ · · · ∩ Eim) = P(Ei1) · · · P(Eim).

For example, three events A, B and C are independent if and only if the following four identities
hold:
P(A ∩ B ∩ C) = P(A)P(B)P(C)
P(A ∩ B) = P(A)P(B)
P(A ∩ C) = P(A)P(C)
P(B ∩ C) = P(B)P(C).
Independence does not follow from pairwise independence, e.g., it is possible to find three events
A, B and C such that each pair of events is independent, but the three events taken together
are not independent.
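The standard counterexample uses two independent fair coin flips; the events A, B, C below are the usual textbook choices, not from the notes:

```python
from fractions import Fraction
from itertools import product

# Sample space: two independent fair coin flips.
omega = list(product("HT", repeat=2))

def P(event):
    return Fraction(sum(1 for w in omega if w in event), len(omega))

A = {w for w in omega if w[0] == "H"}       # first flip is heads
B = {w for w in omega if w[1] == "H"}       # second flip is heads
C = {w for w in omega if w[0] == w[1]}      # the two flips agree

# Each pair of events is independent ...
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)
# ... but the triple is not: P(A ∩ B ∩ C) = 1/4, while P(A)P(B)P(C) = 1/8.
assert P(A & B & C) != P(A) * P(B) * P(C)
```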
7. The Law of Total Probability: Suppose that E and F are events and that P(F ) > 0 and
P(F c ) > 0. Then
P(E) = P(F )P(E|F ) + P(F c )P(E|F c ).
In other words, the probability of E is equal to the weighted average of the conditional probabilities of E given F and of E given F c . More generally, if E, F1 , · · · , Fn are events such that
F1 , · · · , Fn are mutually exclusive and E ⊂ F1 ∪ · · · ∪ Fn , then
P(E) = Σ_{i=1}^{n} P(Fi)P(E|Fi),
which we can also interpret as a weighted average of the conditional probabilities of E given
each of the events F1 , · · · , Fn . This result is important because we can sometimes use it to
calculate the probability of an event E by conditioning on additional information that makes
the probability calculations simpler.
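A brief numerical sketch of the two-event case, with hypothetical factory/defect numbers invented purely for illustration:

```python
from fractions import Fraction

# Hypothetical numbers: a part comes from factory F with probability 0.6,
# and the defect rates differ by factory.
P_F    = Fraction(6, 10)          # P(F)
P_E_F  = Fraction(1, 100)         # P(defective | F)
P_E_Fc = Fraction(5, 100)         # P(defective | F^c)

# Law of total probability: P(E) = P(F)P(E|F) + P(F^c)P(E|F^c).
P_E = P_F * P_E_F + (1 - P_F) * P_E_Fc
print(P_E)    # 13/500, a weighted average of the two conditional probabilities
```

Note that P(E) lands between the two conditional probabilities, as a weighted average must.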
8. Bayes’ formula: Suppose that E and F are events such that P(E) > 0 and P(F ) > 0. Then
P(F|E) = P(F) P(E|F) / P(E),
which shows how the two conditional probabilities P(F |E) and P(E|F ) are related to each
other. The probability P(E) appearing in the denominator on the right-hand side can often
be evaluated with the help of the law of total probability. This is arguably one of the most
important identities in all of mathematics because of the central role that it plays in Bayesian
statistical inference.
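As an illustration, the following sketch applies Bayes' formula with hypothetical diagnostic-test numbers (the prior and the error rates are invented for the example), using the law of total probability for the denominator:

```python
from fractions import Fraction

# Hypothetical diagnostic-test numbers, chosen only to illustrate the formula.
P_F    = Fraction(1, 100)     # prior: P(disease)
P_E_F  = Fraction(95, 100)    # P(positive | disease)
P_E_Fc = Fraction(2, 100)     # P(positive | no disease)

# Denominator via the law of total probability.
P_E = P_F * P_E_F + (1 - P_F) * P_E_Fc

# Bayes' formula: P(F|E) = P(F) P(E|F) / P(E).
P_F_E = P_F * P_E_F / P_E
print(P_F_E)    # 95/293, roughly 0.32
```

Even with an accurate test, the posterior probability of disease stays modest because the prior is small; this is the kind of reversal of conditional probabilities that Bayes' formula makes precise.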
9. Random variables: Let (Ω, F, P) be a probability space and let E be a set, e.g., E could
be the real line or the set of 2 × 2 matrices or the collection of English words with 5 letters.
An E-valued random variable is simply a function X : Ω → E which assigns a value X(ω) in E
to each outcome ω in the sample space. Often we think of X as a measurement or observation
that depends on the state of a random process described by the probability space. In particular,
because the outcome ω is random, so is the value X(ω) of the random variable.
The distribution of an E-valued random variable X is a probability distribution PX defined
on the set E by
PX (A) = P(X ∈ A) = P(X −1 (A)),
where A is a subset of E. In words, PX (A) is the probability that the value assumed by X
belongs to the set A. Often we will work solely with random variables and their distributions
without specifying the underlying probability space.
10. Cumulative distribution function: If X is a real-valued random variable, then the
cumulative distribution function of X is the function FX : R → [0, 1] defined by
FX (x) = P(X ≤ x).
It can be shown that the cumulative distribution function of a random variable uniquely determines its distribution and that this function is non-decreasing with jump discontinuities at
those values x where P(X = x) > 0. Such values are called atoms.
11. Discrete random variables: A random variable X is said to be discrete if it takes values
in a countable set E = {x1 , x2 , · · · }. In this case, the probability mass function of X is the
function pX : E → [0, 1] defined by
pX (x) = P(X = x).
The distribution of a discrete random variable is uniquely determined by its probability mass
function. Indeed, if A is a subset of E, then
P(X ∈ A) = Σ_{x∈A} pX(x).
In particular,
1 = P(X ∈ E) = Σ_{x∈E} pX(x).
12. Continuous random variables: A real-valued random variable X is said to be continuous
if there is a function fX : R → [0, ∞] such that

P(a ≤ X ≤ b) = ∫_a^b fX(x) dx

for all −∞ ≤ a ≤ b ≤ ∞. The function fX is said to be the probability density function
of X and it uniquely determines the distribution of X. In particular, by taking a = −∞ and
b = ∞, we have

1 = P(−∞ < X < ∞) = ∫_{−∞}^{∞} fX(x) dx.

Furthermore, by taking a = x = b, we have

P(X = x) = ∫_x^x fX(t) dt = 0

for every x ∈ R. Thus continuous random variables do not have atoms.
The density and the cumulative distribution function of a continuous random variable are related
as follows:

FX(x) = P(−∞ < X ≤ x) = ∫_{−∞}^{x} fX(t) dt,

while

fX(x) = FX′(x),

i.e., the cumulative distribution function is an anti-derivative of the density, and the density is
the derivative of the cumulative distribution function.
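These relations can be checked numerically. A sketch using the Exponential(1) density as a concrete example; the integration range, step size, and difference step are choices of the sketch, not part of the notes:

```python
import math

# A concrete continuous example: the Exponential(1) density and its CDF.
def f(x): return math.exp(-x)
def F(x): return 1 - math.exp(-x)

# Riemann sum: the density integrates to (approximately) 1 over [0, 50].
dx = 1e-4
total = sum(f(k * dx) * dx for k in range(int(50 / dx)))

# The density is the derivative of the CDF: compare a finite-difference
# derivative of F at x = 1 with f(1).
h = 1e-6
deriv = (F(1 + h) - F(1 - h)) / (2 * h)
print(abs(total - 1) < 1e-3, abs(deriv - f(1)) < 1e-6)    # True True
```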
13. Expectations: If X is a real-valued random variable, then the expected value of X is
the quantity

E[X] = Σ_i xi pX(xi)               if X is discrete with probability mass function pX,
E[X] = ∫_{−∞}^{∞} x fX(x) dx       if X is continuous with density fX,

provided that the sum or the integral exists. The expected value is also called the expectation
or the mean. It is a weighted average of the values that the random variable can assume with
weights equal to the probabilities with which those values occur.
Expectations have several important properties. First, they are linear in the sense that
E[X1 + · · · + Xn] = Σ_{i=1}^{n} E[Xi],
i.e., the expected value of a sum of random variables is equal to the sum of the expected values
of each random variable. Secondly, given a function φ : R → R, the expected value of φ(X) can
be calculated using one of the following two formulas

E[φ(X)] = Σ_i φ(xi) pX(xi)              if X is discrete with p.m.f. pX,
E[φ(X)] = ∫_{−∞}^{∞} φ(x) fX(x) dx      if X is continuous with density fX.

This result is sometimes known as the law of the unconscious statistician. If φ is non-linear,
then in general E[φ(X)] ≠ φ(E[X]). However, if φ(x) = ax + c is a linear function, then the
previous result can be used to show that
E[aX + c] = aE[X] + c.
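Both points can be checked exactly on a fair die; a small sketch with φ(x) = x², an illustrative choice of non-linear function:

```python
from fractions import Fraction

# p.m.f. of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

E_X = sum(p * x for x, p in pmf.items())            # E[X] = 7/2
def phi(x): return x * x

E_phi = sum(p * phi(x) for x, p in pmf.items())     # E[X^2] via the LOTUS
print(E_phi, phi(E_X))    # 91/6 49/4: E[phi(X)] != phi(E[X]) for non-linear phi

# For a linear map phi(x) = a x + c, equality does hold.
a, c = 3, 2
assert sum(p * (a * x + c) for x, p in pmf.items()) == a * E_X + c
```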
14. The Law of Large Numbers: Suppose that X1 , X2 , X3 , · · · is a sequence of independent,
identically-distributed real-valued random variables with finite mean µ = E[X1 ] and let Sn =
X1 + · · · + Xn be the sum of the first n variables. Dividing through by n, we obtain the sample
mean Sn /n, which is just the average of the first n values in this sequence. The Strong Law
of Large Numbers asserts that the sequence of sample means is certain to converge to the true
mean, i.e.,

P( lim_{n→∞} Sn/n = µ ) = 1.

In other words, by collecting a sufficiently large number of independent data points, we are
guaranteed that the sample mean will approach the true mean arbitrarily closely. This is one
of the reasons that independent experimental trials are conducted when trying to estimate an
unknown quantity.
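The convergence can be seen in simulation. A sketch drawing i.i.d. standard uniform variables, which have true mean 1/2; the seed and sample sizes are choices of the sketch:

```python
import random

random.seed(0)

mu = 0.5                      # true mean of a standard uniform random variable
running_sum = 0.0
means = {}
for i in range(1, 100_001):
    running_sum += random.random()
    if i in (100, 10_000, 100_000):
        means[i] = running_sum / i

# The sample mean S_n/n settles down near mu as n grows.
print(means)
```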
15. Moments: The n’th moment of a real-valued random variable X is the expected value of
X^n:

µn = E[X^n] = Σ_i xi^n pX(xi)               if X is discrete with p.m.f. pX,
µn = E[X^n] = ∫_{−∞}^{∞} x^n fX(x) dx       if X is continuous with density fX.

These are sometimes referred to as non-central moments. In contrast, the n’th central
moment of X is the expected value of the quantity (X − µ)^n, where µ = E[X] is the expected
value of X:

µ′n = E[(X − µ)^n] = Σ_i (xi − µ)^n pX(xi)               if X is discrete with p.m.f. pX,
µ′n = E[(X − µ)^n] = ∫_{−∞}^{∞} (x − µ)^n fX(x) dx       if X is continuous with density fX.

The central moments tell us something about the dispersion of the values of a random variable
around its mean. The most important central moment is the second, which is also called the
variance. The third central moment is sometimes referred to as the skewness and tells us
whether the distribution of X is symmetric around the mean or skewed to the right or the
left. The fourth central moment is related to the kurtosis and tells us how rapidly the tails
of the distribution decay to 0. Notice that the first central moment is always equal to 0 since
µ′1 = E[X − µ] = E[X] − µ = 0.
16. Variance: The second central moment of a random variable X is more commonly known
as the variance:
Var(X) = E[(X − µ)2 ].
In words, the variance is equal to the mean squared distance between the values assumed by X
and its expected value. If the variance is close to 0, then X is typically close to its mean. In
particular, Var(X) = 0 if and only if X is effectively a constant, i.e., if and only if P(X = µ) = 1.
Notice, however, that the variance of any random variable is non-negative. By expanding the
quadratic expression inside the expectation, it can be shown that the variance of X is also equal
to the following expression:
Var(X) = E[X 2 ] − µ2 ,
i.e., the variance is equal to the difference between the second non-central moment and the
square of the mean. It is often easier to calculate variances with this second formula than with
the definition. Furthermore, if a and b are constants, then
Var(aX + b) = a²Var(X),
i.e., adding a constant to a random variable does not change its variance, but multiplying the
variable by a constant a rescales the variance by a factor of a2 . The square root of the variance
of a random variable X is known as the standard deviation of X; it has the advantage that
it has the same units as X.
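Both variance identities can be verified exactly for a fair die; a minimal sketch (the constants a and b are illustrative choices):

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}           # fair six-sided die

mu   = sum(p * x for x, p in pmf.items())
var  = sum(p * (x - mu) ** 2 for x, p in pmf.items())    # definition of Var(X)
e_x2 = sum(p * x * x for x, p in pmf.items())

assert var == e_x2 - mu ** 2            # shortcut: Var(X) = E[X^2] - mu^2

# Var(aX + b) = a^2 Var(X): the shift b drops out, the scale a enters squared.
a, b = 3, 10
mu_y  = sum(p * (a * x + b) for x, p in pmf.items())
var_y = sum(p * (a * x + b - mu_y) ** 2 for x, p in pmf.items())
assert var_y == a * a * var
print(var)    # 35/12
```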
17. Independence of random variables: Two random variables X and Y are said to be
independent if the identity
P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B)
holds for all subsets A and B where X and Y take values. In words, X and Y are independent
if the value assumed by X does not influence the value assumed by Y and vice versa. More
generally, the random variables X1 , X2 , · · · , Xn are independent if the identity

P(X1 ∈ A1, X2 ∈ A2, · · · , Xn ∈ An) = ∏_{i=1}^{n} P(Xi ∈ Ai)

holds for all subsets A1 , · · · , An of the ranges of these random variables. If the random variables
are both independent and have the same distribution, then we say that they are independent
and identically-distributed, which is customarily abbreviated i.i.d.
18. Variances of sums of random variables: If X1 , · · · , Xn are independent real-valued
random variables, then the variance of their sum is equal to the sum of their variances:

Var(X1 + · · · + Xn) = Σ_{i=1}^{n} Var(Xi).

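This can be checked exactly for two independent fair dice; a short sketch:

```python
from fractions import Fraction
from itertools import product

# Two independent fair dice; verify Var(X + Y) = Var(X) + Var(Y) exactly.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(outcomes))

def var(values):
    mean = sum(p * v for v in values)
    return sum(p * (v - mean) ** 2 for v in values)

var_x   = var([x for x, _ in outcomes])
var_y   = var([y for _, y in outcomes])
var_sum = var([x + y for x, y in outcomes])
assert var_sum == var_x + var_y
print(var_sum)    # 35/6
```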
19. Some important discrete distributions: A random variable X is said to have the
Bernoulli distribution with parameter p ∈ [0, 1] if X takes values in the set {0, 1} with probabilities p = P(X = 1) and 1 − p = P(X = 0). In this case, E[X] = p and Var(X) = p(1 − p).
Bernoulli random variables are often used to represent experiments that have just two possible
outcomes, e.g., success or failure, heads or tails, etc. Also, if A is an event in a probability space
(Ω, F, P), then the indicator variable of A is the random variable 1A which is defined to be
equal to 1 if A occurs and 0 otherwise. It follows that 1A is a Bernoulli random variable with
parameter p = P(A).
A random variable X is said to have the Binomial distribution with parameters n ≥ 1 and
p ∈ [0, 1] if X takes values in the set {0, 1, · · · , n} with probabilities

pX(k) = P(X = k) = (n choose k) p^k (1 − p)^{n−k}.

Here, (n choose k) = n!/(k!(n − k)!) is a binomial coefficient which counts the number of ways of choosing k
objects from a set containing n objects. Binomial random variables arise in the following way.
Suppose that we perform a series of n independent trials, each of which can result in a success
with probability p. Then the total number of successes in all n trials is Binomially distributed
with parameters n and p. Equivalently, if X1 , · · · , Xn are independent Bernoulli random variables with parameter p, then X = X1 + · · · + Xn is Binomially distributed with parameters n
and p. This can be used to show that E[X] = np and Var(X) = np(1 − p).
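These facts can be verified exactly from the p.m.f.; a sketch with the illustrative choice n = 10, p = 3/10:

```python
from fractions import Fraction
from math import comb

n, p = 10, Fraction(3, 10)

# Binomial p.m.f.: P(X = k) = (n choose k) p^k (1 - p)^(n - k).
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
assert sum(pmf.values()) == 1                  # the probabilities sum to 1

mean = sum(q * k for k, q in pmf.items())
var  = sum(q * (k - mean) ** 2 for k, q in pmf.items())
assert mean == n * p                           # E[X] = np
assert var == n * p * (1 - p)                  # Var(X) = np(1 - p)
print(mean, var)    # 3 21/10
```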
A random variable X is said to have the Poisson distribution with parameter λ > 0 if X
takes values in the non-negative integers {0, 1, 2, · · · } with probabilities

pX(k) = P(X = k) = e^{−λ} λ^k / k!.

It can be shown that the mean and the variance of X are both equal to λ. The Poisson
distribution arises as a limiting case of the Binomial distribution when the number of trials n is
large and the success probability per trial p is small. Specifically, if Xn is a Binomial random
variable with parameters n and pn = λ/n, then according to the Law of Rare Events,

lim_{n→∞} P(Xn = k) = e^{−λ} λ^k / k!

for each integer k ≥ 0. This helps explain why count data, such as the number of typing errors
per page of a book or the number of car accidents at an intersection per day, can often be
modeled using a Poisson distribution.
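The law of rare events can be seen numerically; a sketch with the illustrative choices λ = 2 and k = 3:

```python
import math
from math import comb

lam, k = 2.0, 3

# Poisson probability P(X = k) = e^{-lam} lam^k / k!.
poisson = math.exp(-lam) * lam**k / math.factorial(k)

def binom_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# With p_n = lam/n, the Binomial(n, p_n) probability approaches the Poisson one.
approx = {n: binom_pmf(n, lam / n, k) for n in (10, 100, 10_000)}
print(abs(approx[10_000] - poisson) < 1e-3)    # True
```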
20. Some important continuous distributions: A random variable X is said to be uniformly
distributed on the compact interval [a, b] if X takes values in this interval with density

fX(x) = 1/(b − a).

By convention, the density is defined to be equal to 0 outside of this interval. In the special case
where a = 0 and b = 1, we say that X is a standard uniform random variable. In general,
a random variable X is uniformly distributed on a region if every point within that region is
equally likely to occur as a value of X. For X uniform on [a, b] we have E[X] = (a + b)/2 and
Var(X) = (b − a)²/12.
A random variable X is said to be exponentially distributed with rate parameter λ > 0 if
X takes values in the non-negative real numbers [0, ∞) with density
fX (x) = λe−λx .
In this case we have E[X] = λ−1 and Var(X) = λ−2 . Exponentially distributed random variables
are often used to model random waiting times such as the time until a mechanical part fails
or the time between successive telephone calls. This application is most appropriate when the
event in question occurs at a constant rate. This is because the exponential distribution is the
unique continuous distribution which satisfies the following memorylessness property: for all
t, s > 0,
P(X > t + s|X > t) = P(X > s).
For example, if we think of X as a survival time or a lifespan, then memorylessness means that
the death or failure rate does not change with age, i.e., the conditional probability of surviving
for an additional s units of time given that one has already survived to age t is the same as
surviving to age s.
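Memorylessness follows from the survival function P(X > x) = e^{−λx}, and can be checked directly; a sketch with the illustrative choices λ = 0.5, t = 2, s = 3:

```python
import math

lam = 0.5
def S(x): return math.exp(-lam * x)    # survival function P(X > x) of Exponential(lam)

t, s = 2.0, 3.0
# Memorylessness: P(X > t+s | X > t) = S(t+s)/S(t) should equal P(X > s) = S(s).
lhs = S(t + s) / S(t)
rhs = S(s)
print(abs(lhs - rhs) < 1e-12)    # True
```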
A random variable X is said to be normally distributed with mean µ and variance σ² > 0 if
X takes values in the real numbers with density

fX(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)).

This is the classic bell curve familiar from statistics. As the description suggests, E[X] = µ and
Var(X) = σ². In the special case where µ = 0 and σ² = 1, X is said to have the standard
normal distribution. The importance of the normal distribution, which is also known as the
Gaussian distribution, is connected with its appearance in the Central Limit Theorem, described
below. One useful property of the normal distribution is that it is maintained under linear
transformations. Specifically, if X is normally distributed with mean µ and variance σ², and a
and b are constants with a ≠ 0, then Y = aX + b is a normally distributed random variable
with mean aµ + b and variance a²σ². In particular, if we set Z = (X − µ)/σ, then Z is a
standard normal random variable.
21. Central Limit Theorem: Suppose that X1 , X2 , X3 , · · · is a sequence of independent,
identically-distributed random variables with finite mean µ = E[X1] and finite positive variance
σ² = Var(X1). If Sn = X1 + · · · + Xn is the sum of the first n variables, then according to the
law of large numbers we know that the sample means Sn/n converge to µ. The Central Limit
Theorem characterizes the deviations from this limit for large n. Specifically, for each n ≥ 1, let
Zn be the normalized difference

Zn = (√n/σ)(Sn/n − µ)

and let Z be a standard normal random variable. Then, according to the Central Limit Theorem,

lim_{n→∞} P(Zn ≤ t) = P(Z ≤ t) = ∫_{−∞}^{t} (1/√(2π)) e^{−x²/2} dx

for every t ∈ (−∞, ∞). In other words, when n is large, the distribution of the normalized difference Zn is approximately that of a standard normal random variable. This holds no matter
what the distribution of the Xi ’s is (e.g., whether it is discrete, continuous, bounded, etc.) so
long as the mean and the variance are finite. In a sense, by summing over a large number of
independent random variables, the properties of the distribution of the individual variables are
‘averaged away’ and we are left with the normal distribution.
It also follows that when n is large, the distribution of the partial sum Sn is approximately
normal with mean nµ and variance nσ 2 . In particular, if X is a binomial random variable with
parameters n and p, where n is large and p is neither too close to 0 nor too close to 1, then X
is approximately normally distributed with mean np and variance np(1 − p). This result, which
is known as the de Moivre-Laplace theorem, is a special case of the central limit theorem
that follows from the fact that X has the same distribution as a sum of n independent Bernoulli
random variables, each having parameter p.
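The de Moivre-Laplace approximation can be seen in a short simulation; the parameters, seed, and number of trials below are choices of the sketch:

```python
import math
import random

random.seed(1)

# de Moivre-Laplace: a standardized Binomial(n, p) count is approximately
# a standard normal random variable when n is large.
n, p = 500, 0.3
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

def standardized_count():
    x = sum(1 for _ in range(n) if random.random() < p)
    return (x - mu) / sigma

trials = 5_000
t = 1.0
frac = sum(1 for _ in range(trials) if standardized_count() <= t) / trials

phi_t = 0.5 * (1 + math.erf(t / math.sqrt(2)))    # P(Z <= 1), about 0.8413
print(round(frac, 3), round(phi_t, 3))
```

The empirical fraction of standardized counts below t should land close to the standard normal probability Φ(t), up to Monte Carlo noise and the discreteness of the binomial.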