This document contains the complete probability review material. You
should understand this material and ask me if you have any questions.
Introduction to Basic Probability Theory
I. Sample Space and Events
A. Definition: probabilistic experiment
An experiment satisfying the following two properties
1. All possible outcomes of the experiment are known a priori
2. However, the specific outcome of an experiment cannot be
predicted prior to running the experiment
B. Definition: sample space, written Ω
The set of all possible outcomes.
C. Examples: sample space
1. Flip a coin: Ω = {H, T }
2. Reading nucleotides: Ω = {A, C, G, T }
3. SARS TRSes as outcomes of experiments to create working
TRSes: Ω = {CUAAACGAACUU, AUAAACGAACUU, . . .}
4. ∗ Time until a nucleotide at a given position in the genome
mutates: Ω = [0, ∞)
D. Definition: discrete sample space
The sample space has finite or countably many elements.
E. Definition: continuous sample space (starred example above)
The sample space has uncountably many elements. The sample
space involves a continuum of values.
F. Definition: event, A
Any subset A of Ω.
G. Examples: events
1. A = {H} is the event that the coin lands heads up.
2. B = {G} is the event that the next nucleotide read is guanine.
B = {A, G} is the event that the next nucleotide read is a
purine.
3. Let A be the event that the SARS TRS has 3 purines. If we
knew all possible working TRSes, we could then enumerate all
those with 3 purines to define the set corresponding to event
A.
4. Let B be the event that no mutation happens for t time. Then
B = [t, ∞).
H. Definition: union
The union of events A and B, written A ∪ B, is the collection of
outcomes that are in either A or B.
I. Definition: intersection
The intersection of events A and B, written A∩B, is the collection
of outcomes that are in both A and B.
J. Examples: intersection
For the nucleotide experiment, if A = {A, G} and B = {G, T},
then A ∩ B = {G}.
K. Remarks: union and intersection
• If An are events, then the union ∪_{n=1}^∞ An is the event
consisting of all outcomes that are in An for at least one n.
• If An are events, then the intersection ∩_{n=1}^∞ An is the
event consisting of all outcomes that are in An for every n.
L. Definition: mutually exclusive (aka disjoint)
Two events A and B are mutually exclusive if AB (shorthand
for the intersection A ∩ B) is the empty set, denoted by ∅.
M. Definition: exhaustive
The events Ai are exhaustive if ∪_{i=1}^n Ai = Ω.
N. Definition: complement, Ac
The complement of event A, denoted Ac , is the event consisting
of all outcomes in Ω that are not in A.
II. Probabilities Defined on Events
A. Definition: probability, P
Probability, denoted P , is a function that takes events in a sample
space and maps them to the real number line and satisfies the
following 3 properties.
1. 0 ≤ P (A) ≤ 1 for all events A.
2. P (Ω) = 1
3. For any sequence of mutually exclusive events A1, A2, . . .

P(∪_{n=1}^∞ An) = Σ_{n=1}^∞ P(An)
B. Note: requirement 3 is called the addition law.
C. Examples: probability
1. Fair coin toss: P({T}) = P({H}) = 1/2.
2. All nucleotides are equally prevalent: P({G}) = 1/4.
3. P({A, G}) = 1/2.
D. Properties:
1. P (Ac ) = 1 − P (A) proven by noting that 1 = P (Ω) = P (A ∪
Ac ) = P (A) + P (Ac ).
2. P (A ∪ B) = P (A) + P (B) − P (AB) is the addition law for
events that are NOT mutually exclusive.
E. Claim: Law of Total Probability
P(A) = Σ_{i=1}^n P(ABi),

where Bi are mutually exclusive and exhaustive events.
III. Conditional Probability
A. Definition: conditional probability
P(A | B) = P(AB) / P(B)
P (A | B) reads “the probability of A given that B has occurred.”
B. Intuitive interpretation. We are talking about the outcome of
a single experiment. If B has occurred, then for A to also occur,
the outcome of the experiment must have been in AB. On the
other hand, because we know B occurred, the sample space is no
longer Ω, but rather B, for this experiment. But P(B) ≠ 1, so to
normalize the conditional probability so that the sample space for
this experiment has probability 1, we introduce a factor 1/P(B).
C. Example: conditional probability
              Exposure
Disease    Present   Absent   Total
Present       75       325      400
Absent        25       575      600
Total        100       900     1000
What is the probability of disease conditional on exposure?
What is the probability of disease conditional on no exposure?
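Both questions can be answered by reading counts off the table; a
minimal Python sketch, with the counts hard-coded from the table above:

```python
# Counts from the 2x2 table above.
diseased_and_exposed = 75     # disease present, exposure present
diseased_not_exposed = 325    # disease present, exposure absent
exposed_total = 100           # exposure-present column total
unexposed_total = 900         # exposure-absent column total

# P(disease | exposure) reduces to a ratio of counts.
print(diseased_and_exposed / exposed_total)      # 0.75
print(diseased_not_exposed / unexposed_total)    # ~0.361
```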
D. Properties/Results
1. Claim: Law of Total Probability (version 2)
P(A) = Σ_{i=1}^n P(A | Bi)P(Bi)
2. Getting comfortable with conditional manipulations. . .
P (AB) = P (A|B)P (B), P (ABC) = P (AB|C)P (C), P (ABC) =
P (A|BC)P (BC) = P (A|BC)P (B|C)P (C).
3. Multiplication rule
P (A1 A2 · · · An ) = P (A1 | A2 · · · An )P (A2 | A3 · · · An ) · · · P (An−1 | An )P (An )
4. Claim: Bayes’ Rule
For mutually exclusive and exhaustive events Bi, i = 1, 2, . . . , n,
and C one of the Bi,

P(C | A) = P(A | C)P(C) / Σ_{i=1}^n P(A | Bi)P(Bi)
Example: false positive
P(t|D) = 0.05 is the false negative rate and P(T|d) = 0.05 is the
false positive rate, determined by experiment. (Here D and d denote
disease present and absent; T and t denote a positive and negative
test result.)
P(D) = 0.01 is the incidence of disease as estimated by studying
the population and observing actual rates.
What is the probability that you actually have the disease if
you test positive? Some studies indicate only about 10% of
doctors could correctly answer this question!

P(D|T) = P(T|D)P(D) / [P(T|D)P(D) + P(T|d)P(d)]
       = (0.95 × 0.01) / (0.95 × 0.01 + 0.05 × 0.99)
       ≈ 0.161

(A computational version of this example appears at the end of
this section.)
5. Conditional probability is itself a probability (it satisfies the
three axioms). That is why modeling works: we condition on the
rest of the world. If the experimental conditions are true, then my
prediction is X. It is your job to make sure the “experimental”
(model) conditions are good enough that the conditional probability
is not far from the actual probability in real life.
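Returning to the false-positive example above, the same computation
in a few lines of Python:

```python
# Rates from the false-positive example above.
p_pos_given_disease = 0.95   # P(T|D): 1 - false negative rate
p_pos_given_healthy = 0.05   # P(T|d): false positive rate
p_disease = 0.01             # P(D): incidence

# Bayes' rule: P(D|T) = P(T|D)P(D) / [P(T|D)P(D) + P(T|d)P(d)]
num = p_pos_given_disease * p_disease
den = num + p_pos_given_healthy * (1 - p_disease)
print(num / den)   # ~0.161
```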
IV. Independent Events
A. Definition: independent
Two events A and B are independent if P (AB) = P (A)P (B).
B. Remarks: independence
1. The events A1 , A2 , . . . , An are independent if for any subset
Ai1 , Ai2 , . . . , Air
P (Ai1 Ai2 · · · Air ) = P (Ai1 )P (Ai2 ) · · · P (Air )
2. Note: Pairwise independent events need not be (mutually)
independent; see the sketch after these remarks.
3. Equivalent definition P (A|B) = P (A) by applying conditional
probability to definition of independence.
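To make remark 2 concrete, here is the classic counterexample as a
Python enumeration (my illustration, not from the notes): two fair
coin flips give three events that are pairwise independent but not
jointly independent.

```python
from itertools import product

# Sample space of two fair coin flips, each outcome with probability 1/4.
omega = list(product("HT", repeat=2))

A = {w for w in omega if w[0] == "H"}    # first flip heads
B = {w for w in omega if w[1] == "H"}    # second flip heads
C = {w for w in omega if w[0] == w[1]}   # the two flips agree

def P(event):
    return len(event) / len(omega)

# Every pair satisfies P(XY) = P(X)P(Y)...
print(P(A & B) == P(A) * P(B))   # True
print(P(A & C) == P(A) * P(C))   # True
print(P(B & C) == P(B) * P(C))   # True
# ...but the triple does not: P(ABC) = 1/4 while P(A)P(B)P(C) = 1/8.
print(P(A & B & C), P(A) * P(B) * P(C))
```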
V. Comprehensive example: Monty Hall Dilemma
Let A1, A2, and A3
be the events that the car is behind door 1, 2, or 3, respectively.
Let B be the event that Monty (M) shows a goat behind door 2.
Let C be the event that the contestant (C) chooses door 3.
We need to compute P (A1 | BC).
P(A1 | BC) = P(A1BC) / P(BC)
By Bayes’ rule,

P(A1 | BC) = P(BC | A1)P(A1) / [P(BC | A1)P(A1) + P(BC | A2)P(A2) + P(BC | A3)P(A3)].

P(BC | A1) = P(B | CA1)P(C | A1)
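A Monte Carlo sketch of the dilemma (an illustration assuming Monty
knows where the car is, never opens the contestant's door or the car's
door, and picks uniformly when he has a choice). Under these
assumptions the estimate of P(A1 | BC) comes out near 2/3, so the
contestant should switch to door 1.

```python
import random

def trial():
    """One round: the car is hidden uniformly at random; the contestant
    holds door 3; Monty opens a goat door among doors 1 and 2.
    Returns (door Monty opens, door hiding the car)."""
    car = random.randint(1, 3)
    # Monty may only open a door that is neither door 3 nor the car's door.
    options = [d for d in (1, 2) if d != car]
    return random.choice(options), car

# Estimate P(A1 | BC): among rounds where Monty shows door 2,
# how often is the car behind door 1?
rounds = [trial() for _ in range(200_000)]
shown2 = [car for shown, car in rounds if shown == 2]
print(sum(car == 1 for car in shown2) / len(shown2))  # ~2/3: switching wins
```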
VI. Random Variables
A. Intuition: Often you care little about the experimental outcome
and more about a function of the experimental outcome.
Example: You play a board game where you advance by throwing
two dice. You care only about the sum of the two numbers on the
dice, not the actual values showing.
B. Definition: random variable
real-valued functions defined on a sample space, i.e., they map
outcomes (elements of Ω) to real numbers
1. Definition: discrete random variable
random variable taking on countably many possible values
2. Definition: continuous random variable
random variable taking on continuum (uncountable) number
of possible values
C. Examples: Since random variables depend on the outcome of an
experiment, their values are random and discussion of probabilities
is appropriate.
1. Let X be the sum of two fair dice. Then, we can compute
probabilities (verified by enumeration in the sketch after these
examples), e.g.

P(X = 1) = 0
P(X = 2) = P({(1, 1)}) = 1/36
P(X = 3) = P({(1, 2), (2, 1)}) = 2/36
P(X = 4) = P({(1, 3), (2, 2), (3, 1)}) = 3/36
...
2. Let N be the number of products rolling off a production line
until the first faulty one appears. Suppose the probability that a
new product is faulty is p. Represent the sequence of products
as {G, G, G, F, . . .}, where G represents good and F represents
faulty. Assume that products are produced independently.

P(N = 1) = P({F}) = p
P(N = 2) = P({G, F}) = (1 − p)p
P(N = 3) = P({G, G, F}) = (1 − p)²p
...
Note that

P(∪_{n=1}^∞ {N = n}) = Σ_{n=1}^∞ P(N = n)
                     = p Σ_{n=1}^∞ (1 − p)^{n−1}
                     = p · 1/(1 − (1 − p))
                     = 1.
3. You are involved with a company that produces batteries.
You would like to know the probability that a battery lasts
at least 2 years (so you can issue a guarantee for example).
Battery life is a random outcome. Let
I = { 1   battery lasts more than two years
      0   otherwise
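A brief Python check of examples 1 and 2 above (my sketch; the
geometric sum is truncated at a large n):

```python
from itertools import product
from collections import Counter

# Example 1: enumerate the 36 equally likely outcomes of two fair dice.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
for s in range(2, 13):
    print(s, counts[s], "/ 36")   # 2 -> 1/36, 3 -> 2/36, 4 -> 3/36, ...

# Example 2: the geometric probabilities P(N = n) = (1 - p)^(n-1) p
# sum to (essentially) 1 when truncated at a large n.
p = 0.1
print(sum((1 - p) ** (n - 1) * p for n in range(1, 1000)))  # ~1.0
```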
D. cumulative distribution function (or distribution function)
1. Definition:
F (b) = P (X ≤ b)
for any real number b.
2. Properties:
a. F(b) is a nondecreasing function of b.
This property follows because the event A = {X ≤ b} is
contained in B = {X ≤ a} whenever b < a.
b. limb→∞ F (b) = F (∞) = 1
c. limb→−∞ F (b) = F (−∞) = 0
d. P (a < X ≤ b) = F (b) − F (a) for all a < b
E. probability function
1. Definition: A discrete random variable X has a probability
mass function (pmf)
p(a) = P [X = a],
defined for the countable number of real numbers a that X
can assume.
2. Definition: probability density function (pdf) for continuous
random variables
A continuous r.v. has a pdf if there exists a function f (x)
such that
P(X ∈ B) = ∫_B f(x) dx

for any set of real numbers B.
In particular, applying the above to the set B = (−∞, a] and
using the fundamental theorem of calculus shows that

f(x) = dF(x)/dx.
3. Though P (X = x) = 0 for all continuous r.v. X and real
values x, one can interpret f (x) as the relative probability
that X falls near x. To see this, note
P (x < X ≤ x + dx) = F (x + dx) − F (x) = dF (x) = f (x)dx
So, as dx approaches 0, the probability that X falls in a
small region around x is given by f (x)dx, so f (x) gives the
relative probability for X to be around x.
F. Common discrete random variables
1. Bernoulli
X = { 1   with probability p
      0   with probability 1 − p
2. Binomial Let X represent the number of successes (1’s) in n
Bernoulli trials, then
p(i) = (n choose i) p^i (1 − p)^{n−i}

for i = 0, 1, . . . , n.
Note, the definition of “n choose i”:

(n choose i) = n! / ((n − i)! i!).
3. Geometric Let X be the number of Bernoulli trials required
to get the first success, then
p(n) = (1 − p)^{n−1} p,
for n = 1, 2, . . ..
4. Poisson
a. The random variable X taking on values 0, 1, 2, . . . is
Poisson if its pmf is

p(i) = e^{−λ} λ^i / i!,

for i = 0, 1, 2, . . ..
b. Property
The Poisson r.v. X may be used to approximate the binomial
r.v. Y when n is large and p is small. In other words, as
n → ∞ and p → 0 with np = λ, we have

p_binomial(i) ≈ p_Poisson(i).
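A numerical check of this approximation, assuming scipy is available
(parameters are an illustrative choice):

```python
from scipy.stats import binom, poisson

n, p = 1000, 0.003   # large n, small p
lam = n * p          # matched mean: lambda = np = 3

for i in range(8):
    print(i, binom.pmf(i, n, p), poisson.pmf(i, lam))
# The two pmf columns agree to several decimal places.
```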
G. Common continuous random variables
1. Uniform
The random variable X is uniformly distributed over the
interval (0, 1) if the pdf is given by

f(x) = { 1   0 < x < 1
         0   otherwise
Generally, if the random variable Y is uniformly distributed
over the interval (a, b), then its pdf is

f(x) = { 1/(b − a)   a < x < b
         0           otherwise
9
2. Exponential
Random variable X is an exponential r.v. if its pdf is given
by
f(x) = { λe^{−λx}   x ≥ 0
         0          otherwise,
for some λ > 0.
3. Gamma
Random variable X is gamma r.v. if its pdf is given by
f(x) = { λe^{−λx}(λx)^{α−1} / Γ(α)   x ≥ 0
         0                           otherwise,
for some λ > 0 and α > 0.
Note, the gamma function is defined as
Γ(α) = ∫_0^∞ e^{−x} x^{α−1} dx,
and Γ(n) = (n − 1)! for positive integer n.
VII. Expectation of a random variable
A. Definition:
For a discrete random variable X with pmf p(x), the expected
value of X is defined as
E[X] = Σ_{x: p(x)>0} x p(x).
For a continuous random variable X with pdf f (x), the expected
value of X is defined as
E[X] = ∫_{−∞}^∞ x f(x) dx.
B. Proposition: Let g(x) be any real-valued function.
1. If X is a discrete random variable with pmf p(x), then
E[g(X)] = Σ_{x: p(x)>0} g(x) p(x).

2. If X is a continuous r.v. with pdf f(x), then

E[g(X)] = ∫_{−∞}^∞ g(x) f(x) dx.
C. Related concepts
1. The nth moment of a random variable X is defined as

E[X^n] = Σ_{x: p(x)>0} x^n p(x)     X discrete
E[X^n] = ∫_{−∞}^∞ x^n f(x) dx       X continuous.
2. The variance of a random variable X is defined as
Var[X] = E[(X − E[X])2 ].
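A tiny worked instance of these definitions, the roll of one fair die,
using exact fractions (my illustration):

```python
from fractions import Fraction

# pmf of one fair die: p(x) = 1/6 for x = 1..6.
p = {x: Fraction(1, 6) for x in range(1, 7)}

E_X = sum(x * pr for x, pr in p.items())        # first moment: 7/2
E_X2 = sum(x**2 * pr for x, pr in p.items())    # second moment: 91/6
Var_X = E_X2 - E_X**2                           # E[X^2] - (E[X])^2 = 35/12
print(E_X, E_X2, Var_X)
```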
VIII. Joint distribution
A. Definition: The joint cumulative probability distribution of r.v.’s
X and Y is defined as
F (a, b) = P (X ≤ a, Y ≤ b),
for −∞ < a, b < ∞.
B. Definition: Relative to the joint distribution, the cdf of X is the
marginal cdf
FX (a) = P (X ≤ a) = P (X ≤ a, Y < ∞) = F (a, ∞).
C. The joint probability mass function of discrete r.v.’s X and Y is
p(x, y) = P (X = x, Y = y)
and the marginal probability mass functions are
pX(x) = Σ_{y: p(x,y)>0} p(x, y)
pY(y) = Σ_{x: p(x,y)>0} p(x, y).
D. The joint probability density function, if it exists, of continuous
r.v.’s X and Y is the function f (x, y) such that
P(X ∈ A, Y ∈ B) = ∫_B ∫_A f(x, y) dx dy.
E. The marginal probability density functions of continuous r.v. X
and Y can be recovered from the joint pdf as
fX(x) = ∫_{−∞}^∞ f(x, y) dy
fY(y) = ∫_{−∞}^∞ f(x, y) dx.
F. Joint Distribution Example
p(1, 1) = 0.3 p(2, 1) = 0.1
p(1, 2) = 0.1 p(2, 2) = 0.5
Expectation of g(x, y) = xy:
E[XY ] = 1 × 0.3 + 2 × 0.1 + 2 × 0.1 + 4 × 0.5 = 2.7
Marginal probability mass functions:
pX(1) = p(1, 1) + p(1, 2) = 0.4
pX(2) = p(2, 1) + p(2, 2) = 0.6
pY(1) = p(1, 1) + p(2, 1) = 0.4
pY(2) = p(1, 2) + p(2, 2) = 0.6
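The same bookkeeping in a few lines of Python, with the joint pmf
stored as a dictionary:

```python
# Joint pmf from the example above: p[(x, y)] = P(X = x, Y = y).
p = {(1, 1): 0.3, (2, 1): 0.1, (1, 2): 0.1, (2, 2): 0.5}

E_XY = sum(x * y * pr for (x, y), pr in p.items())
pX = {x: sum(pr for (a, _), pr in p.items() if a == x) for x in (1, 2)}
pY = {y: sum(pr for (_, b), pr in p.items() if b == y) for y in (1, 2)}

print(E_XY)     # 2.7
print(pX, pY)   # pX ≈ {1: 0.4, 2: 0.6}, pY ≈ {1: 0.4, 2: 0.6}
```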
G. The expectation of a function g(x, y) of two variables is
E[g(X, Y)] = Σ_x Σ_y g(x, y) p(x, y)                   X, Y discrete
E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) f(x, y) dx dy   X, Y continuous.
H. Covariance
1. Definition: The covariance of two r.v. X and Y is
Cov(X, Y ) = E[(X−E[X])(Y −E[Y ])] = E[XY ]−E[X]E[Y ].
2. Properties:
Cov(X, X) = Var(X)
Cov(X, Y) = Cov(Y, X)
Cov(cX, Y) = c Cov(X, Y)
Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z)
Cov(Σ_{i=1}^n Xi, Σ_{j=1}^m Yj) = Σ_{i=1}^n Σ_{j=1}^m Cov(Xi, Yj)
Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi) + 2 Σ_{i=1}^n Σ_{j<i} Cov(Xi, Xj)
IX. Independent random variables
A. Definition: R.v.’s X and Y are independent if for all a, b,
P (X ≤ a, Y ≤ b) = P (X ≤ a)P (Y ≤ b).
In other words, F (a, b) = FX (a)FY (b).
B. Properties:
1. When X and Y are discrete, the condition for independence
is
p(x, y) = pX (x)pY (y).
2. When X and Y are continuous and the joint pdf exists, then
the condition is
f (x, y) = fX (x)fY (y).
3. Proposition: If X and Y are independent, then for any functions h and g,
E[g(X)h(Y )] = E[g(X)]E[h(Y )].
4. Claim: The covariance of two independent r.v.’s X and Y is
0.
So, in particular, we have
Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi).
C. Example: Variance of binomial r.v.
Write X = X1 + · · · + Xn, where the Xi are independent
Bernoulli(p) r.v.'s. Then

Var(X) = Var(X1 + · · · + Xn) = Σ_{i=1}^n Var(Xi),

but

Var(Xi) = E[Xi²] − (E[Xi])² = p − p²,

so

Var(X) = np(1 − p).
X. Conditional distributions
A. Definition: The conditional probability mass function for discrete
r.v.’s X given Y = y is
pX|Y(x|y) = P(X = x | Y = y)
          = P(X = x, Y = y) / P(Y = y)
          = p(x, y) / pY(y),

for all y such that P(Y = y) > 0.
B. Example: Using the joint pmf (VIII.F) given above, compute pX|Y(x|y).

pX|Y(1|1) = p(1, 1)/pY(1) = 0.3/0.4 = 0.75
pX|Y(1|2) = p(1, 2)/pY(2) = 0.1/0.6 ≈ 0.167
pX|Y(2|1) = p(2, 1)/pY(1) = 0.1/0.4 = 0.25
pX|Y(2|2) = p(2, 2)/pY(2) = 0.5/0.6 ≈ 0.833
C. Example: Suppose Y ∼ Bin(n1, p) and X ∼ Bin(n2, p) are independent,
and let Z = X + Y and q = 1 − p. Calculate pX|Z(x|z).
pX|Z(x|z) = pX,Z(x, z) / pZ(z)
          = pX,Y(x, z − x) / pZ(z)
          = pX(x) pY(z − x) / pZ(z)      (by independence)
          = [(n2 choose x) p^x q^{n2−x}] [(n1 choose z−x) p^{z−x} q^{n1−z+x}] / [(n1+n2 choose z) p^z q^{n1+n2−z}]
          = (n2 choose x)(n1 choose z−x) / (n1+n2 choose z),

using Z = X + Y ∼ Bin(n1 + n2, p).
The last formula is the pmf of the hypergeometric distribution,
the distribution canonically associated with the following
experiment: draw z balls from an urn containing n2 black balls
and n1 red balls. The hypergeometric is the distribution of the
random number of black balls you selected.
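A numerical sanity check of this identity, assuming scipy is available.
Note scipy's parameterization: hypergeom(M, n, N) draws N balls from a
population of M of which n are tagged; here the n2 trials belonging to
X play the role of the tagged balls. The test values are arbitrary.

```python
from scipy.stats import binom, hypergeom

n1, n2, p, z = 7, 5, 0.3, 4   # arbitrary small test values

for x in range(z + 1):
    direct = (binom.pmf(x, n2, p) * binom.pmf(z - x, n1, p)
              / binom.pmf(z, n1 + n2, p))
    hyper = hypergeom.pmf(x, n1 + n2, n2, z)
    print(x, direct, hyper)   # the two columns match; p cancels out
```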
D. Definition: The conditional probability density function for
continuous r.v.’s X given Y = y is

fX|Y(x|y) = f(x, y) / fY(y).
E. Example:

f(x, y) = { 6xy(2 − x − y)   0 < x < 1, 0 < y < 1
            0                otherwise.

fX|Y(x|y) = f(x, y) / fY(y)
          = 6xy(2 − x − y) / ∫_0^1 6xy(2 − x − y) dx
          = 6x(2 − x − y) / (4 − 3y).
XI. Conditional expectation
A. Definition: The conditional expectation of discrete r.v. X given
Y = y is
E[X|Y = y] = Σ_x x P(X = x | Y = y) = Σ_x x pX|Y(x|y).
B. Example: Using the same joint pmf given above, compute E[X|Y =
1].
E[X|Y = 1] = 1×pX|Y (1|1)+2×pX|Y (2|1) = 1×0.75+2×0.25 = 1.25.
C. Definition: The conditional expectation of continuous r.v.’s X
given Y = y is
E[X|Y = y] = ∫_{−∞}^∞ x fX|Y(x|y) dx.
D. Example: Using the same joint pdf as in the previous section,
find E[X|Y = y].
E[X|Y = y] = ∫_0^1 x fX|Y(x|y) dx = ∫_0^1 [6x²(2 − x − y) / (4 − 3y)] dx = (5 − 4y) / (8 − 6y).
E. Note: We can think of E[X|Y ], where a value for the r.v. Y is
not specified, as a random function of the r.v. Y .
F. Proposition:
E[X] = E[E[X|Y ]]
Proof:
E[E[X|Y]] = ∫_{−∞}^∞ E[X|Y = y] fY(y) dy
          = ∫_{−∞}^∞ [∫_{−∞}^∞ x fX|Y(x|y) dx] fY(y) dy
          = ∫_{−∞}^∞ ∫_{−∞}^∞ x [f(x, y)/fY(y)] fY(y) dx dy
          = ∫_{−∞}^∞ ∫_{−∞}^∞ x f(x, y) dx dy
          = ∫_{−∞}^∞ x [∫_{−∞}^∞ f(x, y) dy] dx
          = ∫_{−∞}^∞ x fX(x) dx
          = E[X].
G. Example: Suppose that the expected number of accidents per
week is 4 and suppose that the number of workers injured per
accident has mean 2. Also assume that the number of people
injured is independently determined for each accident. What is
the expected number of injuries per week?
Let N be the number of accidents in a week. Let Xi be the
number of people injured in accident i. Let X = Σ_{i=1}^N Xi be the
total number of injuries in a week.

E[X] = E[E[X|N]] = E[E[Σ_{i=1}^N Xi | N]],

but once you condition on N, E[Σ_{i=1}^N Xi | N] is just the
expectation of a sum of a fixed number N of random variables, so

E[Σ_{i=1}^N Xi | N] = Σ_{i=1}^N E[Xi] = 2N.

Now, remember N is actually not known, so continuing,

E[X] = E[2N] = 2E[N] = 2 × 4 = 8 injuries per week.
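A quick simulation agreeing with this answer. The notes fix only the
means, so the particular distributions below (N ∼ Poisson(4), injuries
per accident ∼ Poisson(2)) are an illustrative assumption, and numpy
is assumed available:

```python
import numpy as np

rng = np.random.default_rng(0)
weeks = 100_000

# N ~ Poisson(4) accidents per week; each accident injures a
# Poisson(2) number of workers (illustrative distributional choice).
N = rng.poisson(4, size=weeks)
totals = [rng.poisson(2, size=n).sum() for n in N]
print(np.mean(totals))   # ~8, matching E[X] = 2 E[N]
```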
H. Example: Suppose you are in a room in a cave. There are three
exits. The first takes you on a path that takes 3 hours but eventually deposits you back in the same room. The second takes you
on a path that takes 5 hours but also deposits you back in the
room. The third takes you to the exit, which is 2 hours away.
On average, how long do you expect to stay in the cave if you
randomly select an exit each time you enter the room in the cave?
It is helpful to condition on the random variable Y indicating
which exit you choose on your first attempt to exit the room. The
conditional expectations are then easier to compute:
E[X|Y = 1] = 3 + E[X]
E[X|Y = 2] = 5 + E[X]
E[X|Y = 3] = 2,
where E[X] is added to the first two expectations because
once you return to the room you start the process over again.
Now, applying the property, we can compute

E[X] = E[E[X|Y]] = E[X|Y = 1]P(Y = 1) + E[X|Y = 2]P(Y = 2) + E[X|Y = 3]P(Y = 3)
     = (1/3)(3 + E[X] + 5 + E[X] + 2).

Solving this equation for the unknown reveals you expect to spend
E[X] = 10 hours wandering around the cave before seeing light.
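A short simulation confirming E[X] = 10 (a minimal sketch):

```python
import random

def time_to_escape():
    """One sojourn in the cave: pick doors uniformly until door 3 is chosen."""
    t = 0
    while True:
        door = random.randint(1, 3)
        if door == 1:
            t += 3        # back in the room after 3 hours
        elif door == 2:
            t += 5        # back in the room after 5 hours
        else:
            return t + 2  # exit reached after 2 more hours

trials = 100_000
print(sum(time_to_escape() for _ in range(trials)) / trials)  # ~10
```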
I. Proposition: There is an equivalent result for variances.
Var(X) = E[Var(X|Y )] + Var(E[X|Y ]).
J. Result: We can use the above result to find the variance of a
compound r.v. X = Σ_{i=1}^N Xi. Suppose the Xi are independent
of each other and of N, with E[Xi] = µ and Var(Xi) = σ². Then

Var(X|N) = Var(Σ_{i=1}^N Xi | N) = Σ_{i=1}^N Var(Xi) = Nσ²

and E[X|N] = Nµ, so

Var(X) = E[Nσ²] + Var(Nµ) = σ²E[N] + µ²Var(N),

and you obtain the variance of the compound random variable X
by knowing the mean and variance of Xi and N.
XII. Comprehensive Example: The Best Prize Problem
You are presented n prizes in random sequence. When presented with
a prize, you are told its rank relative to all the prizes you have seen
(or you can observe this, for example if the prizes are money). If you
accept the prize, the game is over. If you reject the prize, you are
presented with the next prize. How can you best improve your chance
of getting the best prize of all n?
Let’s attempt a strategy where you reject the first k prizes. Thereafter,
you accept the first prize that is better than the first k. We will find the
k that maximizes your chance of getting the best prize and compute
your probability of getting the best prize for that k.
Let Pk (best) be the probability that the best prize is selected using the
above strategy. It is easiest to compute the relevant probabilities by
conditioning on the position of the best prize X.
Pk(best) = Σ_{i=1}^n Pk(best | X = i)P(X = i) = (1/n) Σ_{i=1}^n Pk(best | X = i),
since the best prize is equally likely to be in any one of the n positions.
Now let’s compute the conditional probabilities:

Pk(best | X = i) = { 0,                                                   i ≤ k
                     P(best of first i − 1 is among first k) = k/(i − 1), i > k.
Therefore,
Pk(best) = (1/n) Σ_{i=k+1}^n k/(i − 1)
         ≈ (k/n) ∫_k^{n−1} (1/x) dx
         = (k/n) log((n − 1)/k)
         ≈ (k/n) log(n/k).
Now, let g(x) = (x/n) log(n/x). We can use this function to find
the k (now x) that maximizes Pk(best). Setting the derivative to 0
and solving yields

x = n/e.
Selecting a k close to this value will maximize the probability, and
the maximum probability is

P_{n/e}(best) = g(n/e) = ((n/e)/n) log(n/(n/e)) = (1/e) log e = 1/e ≈ 0.37.

Thus, using this strategy you can achieve nearly a 40% chance of getting
the best prize out of all n prizes.
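A simulation sketch of the strategy (my illustration, with n = 100 and
k = round(n/e) = 37); the win frequency should come out near 1/e ≈ 0.37:

```python
import random
from math import exp

def wins(n, k):
    """One play of the strategy: reject the first k prizes, then accept the
    first prize better than all of them. True if the best prize is taken."""
    prizes = random.sample(range(n), n)   # random order; n - 1 is the best
    threshold = max(prizes[:k])
    for value in prizes[k:]:
        if value > threshold:
            return value == n - 1         # accepted a prize: is it the best?
    return False                          # nothing accepted: we lose

n = 100
k = round(n / exp(1))                     # k = n/e, about 37
plays = 50_000
print(sum(wins(n, k) for _ in range(plays)) / plays)   # ~0.37
```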