This document contains the complete probability review material. You should understand this material and ask me if you have any questions.

Introduction to Basic Probability Theory

I. Sample Space and Events

A. Definition: probabilistic experiment
An experiment satisfying the following two properties:
1. All possible outcomes of the experiment are known a priori.
2. However, the specific outcome of an experiment cannot be predicted prior to running the experiment.

B. Definition: sample space, written Ω
The set of all possible outcomes.

C. Examples: sample space
1. Flip a coin: Ω = {H, T}
2. Reading nucleotides: Ω = {A, C, G, T}
3. SARS TRSes as outcomes of experiments to create working TRSes: Ω = {CUAAACGAACUU, AUAAACGAACUU, . . .}
4. * Time until a nucleotide at a given position in the genome mutates: Ω = [0, ∞)

D. Definition: discrete sample space
The sample space has finitely or countably many elements.

E. Definition: continuous sample space (starred example above)
The sample space has uncountably many elements; it involves a continuum of values.

F. Definition: event, A
Any subset A of Ω.

G. Examples: events
1. A = {H} is the event that the coin lands heads up.
2. B = {G} is the event that the next nucleotide read is guanine. B = {A, G} is the event that the next nucleotide read is a purine.
3. Let A be the event that the SARS TRS has 3 purines. If we knew all possible working TRSes, we could then enumerate all those with 3 purines to define the set corresponding to event A.
4. Let B be the event that no mutation happens for t time. Then B = [t, ∞).

H. Definition: union
The union of events A and B, written A ∪ B, is the collection of outcomes that are in either A or B.

I. Definition: intersection
The intersection of events A and B, written A ∩ B, is the collection of outcomes that are in both A and B.

J. Example: intersection
If B = {A, G} (the next nucleotide is a purine) and C = {G, C}, then B ∩ C = {G}.
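The event operations above can be tried directly with Python's built-in set type; the nucleotide events are from the notes, while the second event C is a hypothetical choice for illustration.

```python
# Sample space and events as Python sets (nucleotide example)
omega = {"A", "C", "G", "T"}
purine = {"A", "G"}        # event: next nucleotide is a purine
c = {"G", "C"}             # a second, hypothetical event

union = purine | c         # outcomes in either event
intersection = purine & c  # outcomes in both events

print(sorted(union))         # ['A', 'C', 'G']
print(sorted(intersection))  # ['G']
print(purine <= omega)       # True: every event is a subset of the sample space
```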
K. Remarks: union and intersection
• If An are events, then ∪_{n=1}^∞ An is the event consisting of all outcomes that are in An for at least one n.
• If An are events, then ∩_{n=1}^∞ An is the event consisting of all outcomes that are in An for every n.

L. Definition: mutually exclusive (aka disjoint)
Two events A and B are mutually exclusive if AB is the empty set, denoted ∅.

M. Definition: exhaustive
The events Ai are exhaustive if ∪_{i=1}^n Ai = Ω.

N. Definition: complement, A^c
The complement of event A, denoted A^c, is the event consisting of all outcomes in Ω that are not in A.

II. Probabilities Defined on Events

A. Definition: probability, P
Probability, denoted P, is a function that maps events in a sample space to the real number line and satisfies the following three properties:
1. 0 ≤ P(A) ≤ 1 for all events A.
2. P(Ω) = 1.
3. For any sequence of mutually exclusive events A1, A2, . . .
P(∪_{n=1}^∞ An) = Σ_{n=1}^∞ P(An).

B. Note: requirement 3 is called the addition law.

C. Examples: probability
1. Fair coin toss: P({T}) = P({H}) = 1/2.
2. All nucleotides are equally prevalent: P({G}) = 1/4.
3. P({A, G}) = 1/2.

D. Properties:
1. P(A^c) = 1 − P(A), proven by noting that 1 = P(Ω) = P(A ∪ A^c) = P(A) + P(A^c).
2. P(A ∪ B) = P(A) + P(B) − P(AB) is the addition law for events that are NOT mutually exclusive.

E. Claim: Law of Total Probability
P(A) = Σ_{i=1}^n P(ABi),
where the Bi are mutually exclusive and exhaustive events.

III. Conditional Probability

A. Definition: conditional probability
P(A | B) = P(AB) / P(B).
P(A | B) reads "the probability of A given that B has occurred."

B. Intuitive interpretation
We are talking about the outcome of a single experiment. If B has occurred, then for A to also occur, the outcome of the experiment must have been in AB. On the other hand, because we know B occurred, the sample space is no longer Ω, but rather B, for this experiment. But P(B) ≠ 1, so to normalize the conditional probability so that the sample space for this experiment has probability 1, we introduce a factor 1/P(B).
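The definition P(A | B) = P(AB)/P(B) can be checked by brute-force counting over equally likely outcomes; the two dice events below are illustrative choices, not from the notes.

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls

def prob(event):
    # P(event) by counting favorable outcomes among equally likely ones
    return Fraction(sum(event(o) for o in outcomes), len(outcomes))

A = lambda o: o[0] + o[1] == 8   # event: the sum is 8
B = lambda o: o[0] == 3          # event: the first die shows 3

p_A_given_B = prob(lambda o: A(o) and B(o)) / prob(B)
print(p_A_given_B)  # 1/6: only (3, 5) works among the six rolls with first die 3
```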
C. Example: conditional probability

              Exposure
Disease     Present   Absent   Total
Present        75       325      400
Absent         25       575      600
Total         100       900     1000

What is the probability of disease conditional on exposure? P(disease | exposed) = 75/100 = 0.75. What is the probability of disease conditional on no exposure? P(disease | unexposed) = 325/900 ≈ 0.361.

D. Properties/Results
1. Claim: Law of Total Probability (version 2)
P(A) = Σ_{i=1}^n P(A | Bi) P(Bi).
2. Getting comfortable with conditional manipulations. . .
P(AB) = P(A|B) P(B),
P(ABC) = P(AB|C) P(C),
P(ABC) = P(A|BC) P(BC) = P(A|BC) P(B|C) P(C).
3. Multiplication rule
P(A1 A2 · · · An) = P(A1 | A2 · · · An) P(A2 | A3 · · · An) · · · P(A(n−1) | An) P(An).
4. Claim: Bayes' Rule
For mutually exclusive and exhaustive events Bi, i = 1, 2, . . . , n,
P(C | A) = P(A | C) P(C) / Σ_{i=1}^n P(A | Bi) P(Bi),
where C is one of the Bi.
Example: false positives and negatives. P(t|D) = 0.05 is the false negative rate and P(T|d) = 0.05 is the false positive rate, as determined by experiment (T = positive test, t = negative test, D = diseased, d = not diseased). P(D) = 0.01 is the incidence of the disease, as estimated by studying the population and observing actual rates. What is the probability that you actually have the disease if you test positive? Some studies indicate only about 10% of doctors could correctly answer this question!
P(D|T) = P(T|D) P(D) / [P(T|D) P(D) + P(T|d) P(d)]
       = (0.95 × 0.01) / (0.95 × 0.01 + 0.05 × 0.99)
       ≈ 0.161.
5. CP is a probability (it satisfies the three axioms). That's why modeling works: we condition on the rest of the world. If the experimental conditions are true, then my prediction is X. It is your job to make sure the "experimental" (model) conditions are good enough that the conditional probability is not far from the actual probability in real life.

IV. Independent Events

A. Definition: independent
Two events A and B are independent if P(AB) = P(A) P(B).
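The definition can be checked by counting. The dice events below are illustrative choices: the first pair satisfies P(AB) = P(A)P(B), while the second pair does not.

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))  # two fair dice

def prob(event):
    return Fraction(sum(event(o) for o in outcomes), len(outcomes))

A = lambda o: o[0] % 2 == 0            # first die is even
B = lambda o: (o[0] + o[1]) % 2 == 0   # sum is even
C = lambda o: o[0] + o[1] >= 10        # sum is at least 10

# A and B are independent: P(AB) = 1/4 = (1/2)(1/2)
print(prob(lambda o: A(o) and B(o)) == prob(A) * prob(B))  # True
# A and C are not: P(AC) = 1/9, but P(A) P(C) = 1/12
print(prob(lambda o: A(o) and C(o)) == prob(A) * prob(C))  # False
```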
B. Remarks: independence
1. The events A1, A2, . . . , An are independent if for any subset Ai1, Ai2, . . . , Air,
P(Ai1 Ai2 · · · Air) = P(Ai1) P(Ai2) · · · P(Air).
2. Note: pairwise independent events need not be independent.
3. Equivalent definition: P(A|B) = P(A), obtained by applying the definition of conditional probability to the definition of independence.

V. Comprehensive example: Monty Hall Dilemma
Let A1, A2, and A3 be the events that the car is behind door 1, 2, or 3, respectively. Let B be the event that Monty (M) shows a goat behind door 2. Let C be the event that the contestant (C) chose door 3. We need to compute P(A1 | BC).
P(A1 | BC) = P(A1 BC) / P(BC).
By Bayes' rule,
P(A1 | BC) = P(BC | A1) P(A1) / [P(BC | A1) P(A1) + P(BC | A2) P(A2) + P(BC | A3) P(A3)],
where each term factors as, e.g.,
P(BC | A1) = P(B | CA1) P(C | A1).

VI. Random Variables

A. Intuition: Often you care little about the experimental outcome itself and more about a function of the experimental outcome.
Example: You play a board game where you advance by throwing two dice. You care only about the sum of the two numbers on the dice, not the actual values showing.

B. Definition: random variable
Real-valued functions defined on a sample space, i.e., they map outcomes to real numbers.
1. Definition: discrete random variable — a random variable taking on countably many possible values.
2. Definition: continuous random variable — a random variable taking on a continuum (uncountably many) of possible values.

C. Examples: Since random variables depend on the outcome of an experiment, their values are random and discussion of probabilities is appropriate.
1. Let X be the sum of two fair dice. Then we can compute probabilities, e.g.,
P(X = 1) = 0
P(X = 2) = P({(1, 1)}) = 1/36
P(X = 3) = P({(1, 2), (2, 1)}) = 2/36
P(X = 4) = P({(1, 3), (2, 2), (3, 1)}) = 3/36
. . .
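The dice-sum probabilities above can be tabulated exactly (fractions are shown in lowest terms, so 2/36 prints as 1/18):

```python
from fractions import Fraction
from itertools import product

# Exact pmf of X = sum of two fair dice
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    pmf[d1 + d2] = pmf.get(d1 + d2, Fraction(0)) + Fraction(1, 36)

for s in (2, 3, 4, 7):
    print(s, pmf[s])      # 2 1/36, 3 1/18, 4 1/12, 7 1/6

print(sum(pmf.values()))  # 1, as the axioms require
```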
2. Let N be the number of products rolling off a production line until a faulty one appears. Suppose the probability that a new product is faulty is p. Represent the sequence of products as {G, G, G, F, . . .}, where G represents good and F represents faulty. Assume that products are produced independently.
P(N = 1) = P({F}) = p
P(N = 2) = P({G, F}) = (1 − p) p
P(N = 3) = P({G, G, F}) = (1 − p)² p
. . .
Note that
P(∪_{n=1}^∞ {N = n}) = Σ_{n=1}^∞ P(N = n) = p Σ_{n=1}^∞ (1 − p)^(n−1) = p · 1/(1 − (1 − p)) = 1.

3. You are involved with a company that produces batteries. You would like to know the probability that a battery lasts at least 2 years (so you can issue a guarantee, for example). Battery life is a random outcome. Let
I = 1 if the battery lasts more than two years, and I = 0 otherwise.

D. Cumulative distribution function (or distribution function)
1. Definition: F(b) = P(X ≤ b) for any real number b.
2. Properties:
a. F(b) is a nondecreasing function of b. This property follows because the event A = {X ≤ b} is contained in B = {X ≤ a} whenever b < a.
b. lim_{b→∞} F(b) = F(∞) = 1
c. lim_{b→−∞} F(b) = F(−∞) = 0
d. P(a < X ≤ b) = F(b) − F(a) for all a < b

E. Probability functions
1. Definition: A discrete random variable X has a probability mass function (pmf) p(a) = P(X = a), defined for the countably many real numbers a that X can assume.
2. Definition: probability density function (pdf) for continuous random variables
A continuous r.v. has a pdf if there exists a function f(x) such that
P(X ∈ B) = ∫_B f(x) dx
for any set of real numbers B. In particular, applying the above to the set B = (−∞, a] and using the fundamental theorem of calculus shows that
f(x) = dF(x)/dx.
3. Though P(X = x) = 0 for all continuous r.v. X and real values x, one can interpret f(x) as the relative probability that X falls near x. To see this, note
P(x < X ≤ x + dx) = F(x + dx) − F(x) = dF(x) = f(x) dx.
So, as dx approaches 0, the probability that X falls in a small region around x is given by f(x) dx; thus f(x) gives the relative probability for X to be around x.
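The cdf properties can be illustrated with the faulty-product example above; the value p = 0.1 is an arbitrary choice, and the infinite sum is truncated where the tail is negligible.

```python
# N = index of the first faulty product, P(N = n) = (1 - p)**(n - 1) * p
p = 0.1
pmf = {n: (1 - p) ** (n - 1) * p for n in range(1, 2001)}

total = sum(pmf.values())
print(total)  # ≈ 1.0 (the tail beyond n = 2000 is negligible)

def F(b):
    """cdf F(b) = P(N <= b)."""
    return sum(q for n, q in pmf.items() if n <= b)

print(F(3))          # ≈ 1 - 0.9**3 = 0.271
print(F(10) - F(5))  # P(5 < N <= 10) = F(10) - F(5)
print(F(3) >= F(1))  # True: F is nondecreasing
```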
F. Common discrete random variables
1. Bernoulli
X = 1 with probability p, and X = 0 with probability 1 − p.
2. Binomial
Let X represent the number of successes (1's) in n Bernoulli trials; then
p(i) = C(n, i) p^i (1 − p)^(n−i)
for i = 0, 1, . . . , n. Note the definition of "n choose i":
C(n, i) = n! / [(n − i)! i!].
3. Geometric
Let X be the number of Bernoulli trials required to get the first success; then
p(n) = (1 − p)^(n−1) p, for n = 1, 2, . . . .
4. Poisson
a. The random variable X taking on values 0, 1, 2, . . . is Poisson if its pmf is
p(i) = e^(−λ) λ^i / i!, for i = 0, 1, 2, . . . .
b. Property: The Poisson r.v. X may be used to approximate the binomial r.v. Y when n is large and p is small. In other words, as n → ∞ and p → 0 with np = λ, we have
p_binomial(i) ≈ p_poisson(i).

G. Common continuous random variables
1. Uniform
The random variable X is uniformly distributed over the interval (0, 1) if its pdf is
f(x) = 1 for 0 < x < 1, and 0 otherwise.
Generally, if the random variable Y is uniformly distributed over the interval (a, b), its pdf is
f(x) = 1/(b − a) for a < x < b, and 0 otherwise.
2. Exponential
Random variable X is an exponential r.v. if its pdf is
f(x) = λ e^(−λx) for x ≥ 0, and 0 otherwise,
for some λ > 0.
3. Gamma
Random variable X is a gamma r.v. if its pdf is
f(x) = λ e^(−λx) (λx)^(α−1) / Γ(α) for x ≥ 0, and 0 otherwise,
for some λ > 0 and α > 0. Note, the gamma function is defined as
Γ(α) = ∫_0^∞ e^(−x) x^(α−1) dx,
and Γ(n) = (n − 1)! for positive integer n.

VII. Expectation of a random variable

A. Definition: For a discrete random variable X with pmf p(x), the expected value of X is
E[X] = Σ_{x: p(x)>0} x p(x).
For a continuous random variable X with pdf f(x), the expected value of X is
E[X] = ∫_{−∞}^∞ x f(x) dx.

B. Proposition: Let g(x) be any real-valued function.
1. If X is a discrete random variable with pmf p(x), then
E[g(X)] = Σ_{x: p(x)>0} g(x) p(x).
2. If X is a continuous r.v. with pdf f(x), then
E[g(X)] = ∫_{−∞}^∞ g(x) f(x) dx.
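Both formulas can be checked exactly on the dice-sum pmf from the earlier example, here with g(x) = x²:

```python
from fractions import Fraction
from itertools import product

# pmf of X = sum of two fair dice
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    pmf[d1 + d2] = pmf.get(d1 + d2, Fraction(0)) + Fraction(1, 36)

E_X = sum(x * px for x, px in pmf.items())        # E[X]
E_g = sum(x * x * px for x, px in pmf.items())    # E[g(X)] with g(x) = x**2

print(E_X)  # 7
print(E_g)  # 329/6
```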
C. Related concepts
1. The nth moment of a random variable X is defined as
E[X^n] = Σ_{x: p(x)>0} x^n p(x) for X discrete,
E[X^n] = ∫_{−∞}^∞ x^n f(x) dx for X continuous.
2. The variance of a random variable X is defined as
Var[X] = E[(X − E[X])²].

VIII. Joint distributions

A. Definition: The joint cumulative probability distribution of r.v.'s X and Y is
F(a, b) = P(X ≤ a, Y ≤ b), for −∞ < a, b < ∞.

B. Definition: Relative to the joint distribution, the cdf of X is the marginal cdf
FX(a) = P(X ≤ a) = P(X ≤ a, Y < ∞) = F(a, ∞).

C. The joint probability mass function of discrete r.v.'s X and Y is
p(x, y) = P(X = x, Y = y)
and the marginal probability mass functions are
pX(x) = Σ_{y: p(x,y)>0} p(x, y),
pY(y) = Σ_{x: p(x,y)>0} p(x, y).

D. The joint probability density function, if it exists, of continuous r.v.'s X and Y is the function f(x, y) such that
P(X ∈ A, Y ∈ B) = ∫_B ∫_A f(x, y) dx dy.

E. The marginal probability density functions of continuous r.v.'s X and Y can be recovered from the joint pdf as
fX(x) = ∫_{−∞}^∞ f(x, y) dy,
fY(y) = ∫_{−∞}^∞ f(x, y) dx.

F. Joint distribution example
p(1, 1) = 0.3   p(1, 2) = 0.1
p(2, 1) = 0.1   p(2, 2) = 0.5
Expectation of g(x, y) = xy:
E[XY] = 1 × 0.3 + 2 × 0.1 + 2 × 0.1 + 4 × 0.5 = 2.7.
Marginal probability mass functions:
pX(1) = p(1, 1) + p(1, 2) = 0.4
pX(2) = p(2, 1) + p(2, 2) = 0.6
pY(1) = p(1, 1) + p(2, 1) = 0.4
pY(2) = p(1, 2) + p(2, 2) = 0.6

G. The expectation of a function g(x, y) of two variables is
E[g(X, Y)] = Σ_{x: pX(x)>0} Σ_{y: pY(y)>0} g(x, y) p(x, y) for X, Y discrete,
E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) f(x, y) dx dy for X, Y continuous.

H. Covariance
1. Definition: The covariance of two r.v.'s X and Y is
Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y].
2. Properties:
Cov(X, X) = Var(X)
Cov(X, Y) = Cov(Y, X)
Cov(cX, Y) = c Cov(X, Y)
Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z)
Cov(Σ_{i=1}^n Xi, Σ_{j=1}^m Yj) = Σ_{i=1}^n Σ_{j=1}^m Cov(Xi, Yj)
Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi) + 2 Σ_{i=1}^n Σ_{j<i} Cov(Xi, Xj)
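The joint-pmf example above (VIII.F) can be verified numerically, including the covariance just defined:

```python
# Joint pmf from the example: p(x, y) for x, y in {1, 2}
p = {(1, 1): 0.3, (1, 2): 0.1, (2, 1): 0.1, (2, 2): 0.5}

E_XY = sum(x * y * v for (x, y), v in p.items())
pX = {x: sum(v for (a, b), v in p.items() if a == x) for x in (1, 2)}
pY = {y: sum(v for (a, b), v in p.items() if b == y) for y in (1, 2)}
E_X = sum(x * v for x, v in pX.items())
E_Y = sum(y * v for y, v in pY.items())
cov = E_XY - E_X * E_Y

print(round(E_XY, 2))  # 2.7
print(round(cov, 2))   # 2.7 - 1.6 * 1.6 = 0.14
```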
IX. Independent random variables

A. Definition: R.v.'s X and Y are independent if for all a, b,
P(X ≤ a, Y ≤ b) = P(X ≤ a) P(Y ≤ b).
In other words, F(a, b) = FX(a) FY(b).

B. Properties:
1. When X and Y are discrete, the condition for independence is
p(x, y) = pX(x) pY(y).
2. When X and Y are continuous and the joint pdf exists, the condition is
f(x, y) = fX(x) fY(y).
3. Proposition: If X and Y are independent, then for any functions h and g,
E[g(X) h(Y)] = E[g(X)] E[h(Y)].
4. Claim: The covariance of two independent r.v.'s X and Y is 0. So, in particular, we have
Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi).

C. Example: Variance of a binomial r.v.
Var(X) = Var(X1 + · · · + Xn) = Σ_{i=1}^n Var(Xi),
but
Var(Xi) = E[Xi²] − (E[Xi])² = p − p²,
so Var(X) = np(1 − p).

X. Conditional distributions

A. Definition: The conditional probability mass function of discrete r.v. X given Y = y is
pX|Y(x|y) = P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y) = p(x, y) / pY(y),
for all y such that P(Y = y) > 0.

B. Example: Using the joint pmf (VIII.F) given above, compute pX|Y(x|y).
pX|Y(1|1) = p(1, 1)/pY(1) = 0.3/0.4 = 0.75
pX|Y(1|2) = p(1, 2)/pY(2) = 0.1/0.6 ≈ 0.167
pX|Y(2|1) = p(2, 1)/pY(1) = 0.1/0.4 = 0.25
pX|Y(2|2) = p(2, 2)/pY(2) = 0.5/0.6 ≈ 0.833

C. Example: Suppose Y ~ Bin(n1, p) and X ~ Bin(n2, p) are independent, and let Z = X + Y and q = 1 − p. Calculate pX|Z(x|z).
pX|Z(x|z) = pX,Z(x, z) / pZ(z)
          = pX,Y(x, z − x) / pZ(z)
          = pX(x) pY(z − x) / pZ(z)
          = [C(n2, x) p^x q^(n2−x)] [C(n1, z − x) p^(z−x) q^(n1−z+x)] / [C(n1 + n2, z) p^z q^(n1+n2−z)]
          = C(n2, x) C(n1, z − x) / C(n1 + n2, z).
The last formula is the pmf of the hypergeometric distribution, the distribution canonically associated with the following experiment. Suppose you draw z balls from an urn containing n2 black balls and n1 red balls. The hypergeometric is the distribution of the random number counting the black balls you selected.

D. Definition: The conditional probability density function of continuous r.v. X given Y = y is
fX|Y(x|y) = f(x, y) / fY(y).

E. Example:
f(x, y) = 6xy(2 − x − y) for 0 < x < 1, 0 < y < 1, and 0 otherwise.
fX|Y(x|y) = f(x, y)/fY(y) = 6xy(2 − x − y) / ∫_0^1 6xy(2 − x − y) dx = 6x(2 − x − y) / (4 − 3y).
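The striking fact that pX|Z(x|z) does not depend on p can be checked numerically; the values of n1, n2, z, and x below are arbitrary choices.

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n1, n2, z, x = 6, 4, 5, 2
hyper = math.comb(n2, x) * math.comb(n1, z - x) / math.comb(n1 + n2, z)

conds = []
for p in (0.2, 0.5, 0.9):
    # P(X = x | Z = z) = pX(x) pY(z - x) / pZ(z), with Z ~ Bin(n1 + n2, p)
    cond = binom_pmf(x, n2, p) * binom_pmf(z - x, n1, p) / binom_pmf(z, n1 + n2, p)
    conds.append(cond)

print([round(c, 6) for c in conds])  # the same value for every p
print(round(hyper, 6))               # C(4,2) C(6,3) / C(10,5) = 120/252
```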
XI. Conditional expectation

A. Definition: The conditional expectation of discrete r.v. X given Y = y is
E[X | Y = y] = Σ_x x P(X = x | Y = y) = Σ_x x pX|Y(x|y).

B. Example: Using the same joint pmf given above, compute E[X | Y = 1].
E[X | Y = 1] = 1 × pX|Y(1|1) + 2 × pX|Y(2|1) = 1 × 0.75 + 2 × 0.25 = 1.25.

C. Definition: The conditional expectation of continuous r.v. X given Y = y is
E[X | Y = y] = ∫_{−∞}^∞ x fX|Y(x|y) dx.

D. Example: Using the same joint pdf as in the previous section, find E[X | Y = y].
E[X | Y = y] = ∫_0^1 x fX|Y(x|y) dx = ∫_0^1 [6x²(2 − x − y)/(4 − 3y)] dx = (5 − 4y)/(8 − 6y).

E. Note: We can think of E[X|Y], where a value for the r.v. Y is not specified, as a random function of the r.v. Y.

F. Proposition: E[X] = E[E[X|Y]]
Proof:
E[E[X|Y]] = ∫_{−∞}^∞ E[X | Y = y] fY(y) dy
          = ∫_{−∞}^∞ ∫_{−∞}^∞ x fX|Y(x|y) dx fY(y) dy
          = ∫_{−∞}^∞ ∫_{−∞}^∞ x [f(x, y)/fY(y)] fY(y) dx dy
          = ∫_{−∞}^∞ ∫_{−∞}^∞ x f(x, y) dx dy
          = ∫_{−∞}^∞ x ∫_{−∞}^∞ f(x, y) dy dx
          = ∫_{−∞}^∞ x fX(x) dx
          = E[X].

G. Example: Suppose that the expected number of accidents per week is 4, and suppose that the number of workers injured per accident has mean 2. Also assume that the number of people injured is independently determined for each accident. What is the expected number of injuries per week?
Let N be the number of accidents in a week. Let Xi be the number of people injured in accident i. Let X = Σ_{i=1}^N Xi be the total number of injuries in a week.
E[X] = E[E[X|N]] = E[E[Σ_{i=1}^N Xi | N]],
but once you condition on N, E[Σ_{i=1}^N Xi | N] is just the expectation of a sum of a fixed number N of random variables, so
E[Σ_{i=1}^N Xi | N] = Σ_{i=1}^N E[Xi] = 2N.
Now, remember N is actually not known, so continuing,
E[X] = E[2N] = 2 E[N] = 2 × 4 = 8 injuries per week.
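The accident example can be simulated. The notes fix only the means, so the concrete distributions below are hypothetical choices made to match them: accidents N uniform on {2, . . . , 6} (mean 4) and injuries per accident uniform on {1, 2, 3} (mean 2), independent across accidents.

```python
import random

rng = random.Random(1)

def injuries_one_week(rng):
    # Hypothetical distributions matching the stated means (see lead-in)
    n_accidents = rng.randint(2, 6)                     # N, mean 4
    return sum(rng.randint(1, 3) for _ in range(n_accidents))  # each Xi has mean 2

weeks = 200_000
avg = sum(injuries_one_week(rng) for _ in range(weeks)) / weeks
print(round(avg, 1))  # ≈ 8.0, matching E[X] = 2 E[N] = 8
```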
H. Example: Suppose you are in a room in a cave. There are three exits. The first takes you on a path that takes 3 hours but eventually deposits you back in the same room. The second takes you on a path that takes 5 hours but also deposits you back in the room. The third takes you to the exit, which is 2 hours away. On average, how long do you expect to stay in the cave if you randomly select an exit each time you enter the room?
It is helpful to condition on the random variable Y indicating which exit you choose on your first attempt to leave the room. The conditional expectations are easy to compute:
E[X | Y = 1] = 3 + E[X]
E[X | Y = 2] = 5 + E[X]
E[X | Y = 3] = 2,
where E[X] is added to the first two expectations because once you return to the room you start the process over again. Now, applying the proposition, we can compute
E[X] = E[E[X|Y]]
     = E[X|Y = 1] P(Y = 1) + E[X|Y = 2] P(Y = 2) + E[X|Y = 3] P(Y = 3)
     = (1/3)(3 + E[X] + 5 + E[X] + 2).
Solving this equation for the unknown reveals that you expect to spend E[X] = 10 hours wandering around the cave before seeing light.

I. Proposition: There is an analogous result for variances:
Var(X) = E[Var(X|Y)] + Var(E[X|Y]).

J. Result: We can use the above result to find the variance of a compound r.v. X = Σ_{i=1}^N Xi. Suppose that E[Xi] = μ and Var(Xi) = σ². Then
Var(X|N) = Var(Σ_{i=1}^N Xi | N) = Σ_{i=1}^N Var(Xi) = N σ²,
and E[X|N] = Nμ, so
Var(X) = E[N σ²] + Var(N μ) = σ² E[N] + μ² Var(N),
and you obtain the variance of the compound random variable X by knowing the mean and variance of Xi and N.

XII. Comprehensive Example: The Best Prize Problem
You are presented n prizes in random sequence. When presented with a prize, you are told its rank relative to all the prizes you have seen (or you can observe this, for example if the prizes are money). If you accept the prize, the game is over. If you reject the prize, you are presented with the next prize. How can you best improve your chance of getting the best prize of all n?
Let's attempt a strategy where you reject the first k prizes. Thereafter, you accept the first prize that is better than all of the first k. We will find the k that maximizes your chance of getting the best prize and compute your probability of getting the best prize for that k.
Let Pk(best) be the probability that the best prize is selected using the above strategy. It is easiest to compute the relevant probabilities by conditioning on the position X of the best prize.
Pk(best) = Σ_{i=1}^n Pk(best | X = i) P(X = i) = (1/n) Σ_{i=1}^n Pk(best | X = i),
since the best prize is equally likely to be in any one of the n positions. Now let's compute the conditional probabilities:
Pk(best | X = i) = 0 for i ≤ k,
Pk(best | X = i) = P(best of first i − 1 is among first k) = k/(i − 1) for i > k,
because you accept the prize at position i exactly when no prize in positions k + 1, . . . , i − 1 beats the best of the first k, i.e., when the best of the first i − 1 prizes sits among the first k. Therefore,
Pk(best) = (1/n) Σ_{i=k+1}^n k/(i − 1)
         ≈ (k/n) ∫_k^{n−1} (1/x) dx
         = (k/n) log((n − 1)/k)
         ≈ (k/n) log(n/k).
Now let g(x) = (x/n) log(n/x). We can use this function to find the k (now x) that maximizes Pk(best). Setting the derivative to 0 and solving yields
x = n/e.
Selecting a k close to this value maximizes the probability, and the maximum probability is
P_{n/e}(best) ≈ g(n/e) = (1/e) log(e) = 1/e ≈ 0.37.
Thus, using this strategy you can achieve nearly a 40% chance of getting the best prize out of all n prizes.
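The strategy and the 1/e limit can be checked by simulation; n = 100 and the trial count are arbitrary choices.

```python
import math
import random

def wins_best(n, k, rng):
    prizes = list(range(n))          # larger value = better prize
    rng.shuffle(prizes)              # random presentation order
    threshold = max(prizes[:k])      # best among the k rejected prizes
    for prize in prizes[k:]:
        if prize > threshold:        # first prize beating all of the first k
            return prize == n - 1    # did we accept the best overall?
    return False                     # best was among the first k; never accepted

rng = random.Random(3)
n, trials = 100, 50_000
k = round(n / math.e)                # the optimal cutoff, about n/e
rate = sum(wins_best(n, k, rng) for _ in range(trials)) / trials
print(k, round(rate, 2))  # 37, and a win rate of roughly 1/e ≈ 0.37
```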