EE 278 Lecture Notes #2, Winter 2010–2011
Probability Spaces
R.M. Gray, EE278: Introduction to Statistical Signal Processing, January 11, 2011

Topics: probability spaces, the fair spinner, one coin flip, multiple coin flips, a Bernoulli random process, pdfs and pmfs.

Probability assigns a measure — like length, area, volume, weight, or mass — to events = sets in some space. Computing probabilities usually involves sums (discrete probability) or integrals (continuous probability). These notes review and elaborate basic probability, with simple examples of random variables, vectors, and processes.

The basic construct is a probability space or experiment (Ω, F, P), which consists of three items:

1. Sample space Ω = an abstract space of elements called (sample) points. Intuition: it contains all distinguishable elementary outcomes, the finest-grain results of an experiment. E.g., {H, T}, {0, 1}, [0, 1), R^k.

2. Event space F = a collection of subsets of Ω called events, to which probabilities are assigned. E.g., all subsets of Ω.

3. Probability measure P = an assignment of real numbers to events, consistent with a set of axioms.

Consider each component in order.

Event space F (sigma-field)

A collection of subsets of Ω such that

a) if F ∈ F, then also F^c ∈ F;
b) if F_i ∈ F, i = 1, 2, ..., then also ∪_i F_i ∈ F.

Intuition: this algebraic structure — a), b) + set theory — means countable set-theoretic operations (union, intersection, complementation, difference) on events produce new events. In particular Ω ∈ F and ∅ ∈ F. Note that F ⊂ Ω, but F ∈ F (set inclusion vs. element inclusion).

Event spaces are not an issue in the elementary case where Ω is discrete: use F = all subsets of Ω (the power set of Ω). Event spaces are an issue in continuous spaces, where the power set is too large for a useful theory. If Ω = R, use the Borel field B(R), the smallest event space containing the intervals.

Probability measure P

An assignment of a number P(F) to every event F ∈ F in a way that satisfies Kolmogorov's axioms of probability:

1. P(F) ≥ 0 for all F ∈ F
2. P(Ω) = 1
3. If F_i ∈ F, i = 1, 2, ..., are disjoint or mutually exclusive (F_i ∩ F_j = ∅ if i ≠ j), then P(∪_i F_i) = Σ_i P(F_i).

The axioms are enough to get a useful calculus of probability plus useful mathematical models of random processes with predictable long-term behavior (laws of large numbers or ergodic theorems, the central limit theorem).

We introduce several fundamental ideas in the context of two simple examples: a fair spinning pointer (or wheel) and a single coin flip. Then we consider many coin flips.

Example: Spinning pointer

Spin a fair pointer in a circle (a wheel marked 0.0, 0.25, 0.5, 0.75 around its rim). When the pointer stops it can point to any number in the unit interval, so

Sample space = Ω = [0, 1) ≜ {r : 0 ≤ r < 1}
Event space = F = the smallest event space containing all of the intervals, called B([0, 1)), the Borel field of [0, 1).

We must describe (Ω, F, P).
Probability measure: For the fair spinner, the probability that the outcome is a point in F ∈ B([0, 1)) is

P(F) = ∫_F f(x) dx, where f(x) = 1, x ∈ [0, 1),

is a uniform probability density function (pdf). E.g., if F = [a, b] = {r : a ≤ r ≤ b} with 0 ≤ a ≤ b < 1, then P(F) = b − a.

The probability of the pointer landing in an interval of length b − a is b − a, the fraction of the sample space corresponding to the event — intuitive!

In general, f(x), x ∈ Ω, is a pdf if

1. f(x) ≥ 0, all x,
2. ∫_Ω f(x) dx = 1.

A pdf ⇒ a probability measure by integration:

P(F) = ∫_F f(x) dx.

Can also write P(F) = ∫ 1_F(x) f(x) dx, where the indicator function of F is given by 1_F(r) ≜ 1 if r ∈ F, 0 otherwise.

The same result is obtained if we instead set Ω = R, F = B(R), and use the pdf f(r) = 1 if r ∈ [0, 1), 0 if r ∉ [0, 1).

Comments:
• The integrals in practice are Riemann; in theory they are Lebesgue (better limiting properties). In most cases the two integrals are the same. Important in research, not so much in practice, but good to know the language when reading the literature.
• The axioms of probability are properties of integration in disguise. See the next slide.
• Event space details ⇒ the integrals make sense.

PDFs and the axioms of probability

Suppose f is a pdf (nonnegative, ∫_Ω f(x) dx = 1) and P(F) = ∫_F f(x) dx. Then:

• Probabilities are nonnegative, since integrating a nonnegative argument yields a nonnegative result. (Axiom 1)
• The probability of the entire sample space is 1, since integrating 1 over the unit interval yields 1. (Axiom 2)
• The probability of a finite union of disjoint regions is the sum of the probabilities of the individual events, since integration is linear:

P(F ∪ G) = ∫ 1_{F∪G}(r) f(r) dr = ∫ (1_F(r) + 1_G(r)) f(r) dr = ∫ 1_F(r) f(r) dr + ∫ 1_G(r) f(r) dr = P(F) + P(G).

(Part of Axiom 3, from linearity of integration.) For Axiom 3 we need to show this for a countable number of disjoint events. Is the above enough? Unfortunately, no for the Riemann integral. It is true if we use the Lebesgue integral. (We do not pursue the details.)
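A quick numerical sanity check of the fair spinner model (a sketch added here, not part of the original slides): draw many pseudo-random points in [0, 1) and compare the empirical frequency of an interval [a, b) with its length b − a. The interval endpoints below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    spins = rng.random(100_000)        # simulated spinner outcomes in [0, 1)

    a, b = 0.25, 0.6                   # arbitrary interval [a, b) inside [0, 1)
    empirical = np.mean((spins >= a) & (spins < b))
    print(empirical, b - a)            # empirical frequency should be close to b - a = 0.35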
Example Probability Spaces: Single coin flip

Develop in two ways:

Model I: Direct description — the simplest way, fine if all you care about is one coin flip.
Model II: As a random variable (or signal processing) defined on the fair spinner. Will define lots of other random variables on the same space.

Coin Flip Model I: Direct description

Sample space Ω = {0, 1} — using numbers instead of {head, tail} will prove convenient.
Event space F = {{0}, {1}, Ω = {0, 1}, ∅}, all subsets of Ω (the power set of Ω).
Probability measure P: define in terms of a sum of a probability mass function, analogous to the integral of a pdf for the spinner.

Given a discrete sample space (= countable number of elements) Ω, a probability mass function (pmf) p(ω) is a nonnegative function such that Σ_{ω∈Ω} p(ω) = 1.

Given a pmf p, define P by P({ω}) = p(ω) plus Axiom 3:

P(F) = Σ_{x∈F} P({x}) = Σ_{x∈F} p(x) = Σ_x 1_F(x) p(x).

Again: the theory of sums ⇒ the axioms of probability are satisfied — obvious if Ω is finite.

For the fair coin, set p(1) = p(0) = 1/2. For a biased coin, it is common to use p(1) = p, p(0) = 1 − p for a parameter p ∈ (0, 1).

Notes:
• Probabilities are defined on sets — P(F); pdfs and pmfs are defined on points!!
• In the discrete case, one-point (singleton) sets can have nonzero probability, e.g., P({0}) = p(0) = 1/2. A pmf gives the probability of something. A pdf is not the probability of anything; e.g., in our uniform spinner case P({1/2}) = ∫_{{1/2}} 1 dx = 0. If P is determined by a pdf, the probability of an individual point (e.g., 1/π) is 0. Must integrate a pdf to get a probability.

Coin Flip Model II: Random variable on spinner

A simple example of signal processing — the fair spinner produces a signal or input r, and a quantizer operates on the signal to produce a value q(r). Think of A/D conversion, or a threshold decision rule based on a noisy observation.

Suppose (Ω, F, P) describes the uniform spinner: Ω = [0, 1), P described by the pdf f(r) = 1 for r ∈ [0, 1). Define a measurement q made on the outcome of the spin by the simple quantizer

q(r) = 1_{[0.5,1)}(r) = 1 if r ∈ [0.5, 1), 0 if r ∈ [0, 0.5).   (1)

This is a simple example of a random variable = a real-valued function defined on the sample space, q(r), r ∈ Ω.

Original probability space + function ⇒ a new probability space with binary sample space Ω_q = {0, 1} and event space F_q = the power set of {0, 1}. The new space inherits a probability measure, say P_q, from the old one.

The output space is discrete ⇒ only a pmf is needed to characterize the probability measure:

p_q(0) = P({r : q(r) = 0}) = ∫_0^{0.5} 1 dr = 1/2
p_q(1) = P({r : q(r) = 1}) = ∫_{0.5}^1 1 dr = 1/2

Often written informally as Pr(q = 0) or P{q = 0}, shorthand for "the probability that the random variable q takes on a value of 0."

Can use the pmf to find any probability involving q: P_q(F) = Σ_{x∈F} p_q(x).

The probability measure P_q describing the output of the random variable q is called the distribution of q. In general the inverse image formula

P_q(F) = P({ω : q(ω) ∈ F}) = P(q^{−1}(F))

relates probability in the new space to probability in the old space; q^{−1}(F) = the inverse image of the set F under the mapping q. Basic example of a derived distribution: (Ω, F, P) + q ⇒ (Ω_q, F_q, P_q).

Notes:
• Multiple ways to arrive at the same model — equivalent models: directly in terms of P, or indirectly in terms of a probability space + random variable. Here we derived a new probability measure from the old one plus the function q.
• This basic idea will be seen often: probability space + function (random variable) ⇒ new probability space with probability measure = the distribution of the random variable, given by the inverse image formula. Will see many tools and tricks for doing the actual calculus.
• Using Model II, we can define more random variables on a common experiment, e.g., two coin flips or even an infinite number. Model I is only good for a single coin flip.
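A sketch of Model II in code (only the quantizer defined above is assumed): apply q to simulated spins and estimate the induced pmf p_q, which should come out close to (1/2, 1/2), matching the derived distribution.

    import numpy as np

    rng = np.random.default_rng(1)
    r = rng.random(100_000)                  # outcomes of the fair spinner, Omega = [0, 1)

    q = (r >= 0.5).astype(int)               # quantizer q(r) = 1_[0.5,1)(r)
    p_q = np.bincount(q, minlength=2) / q.size
    print(p_q)                               # estimate of (p_q(0), p_q(1)), roughly (0.5, 0.5)

    # inverse image idea: Pr(q = 1) = P({r : q(r) = 1}) = P([0.5, 1)) = 0.5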
Two coin flips

A single experiment ⇒ values of two random variables X and Y, or a single 2D random vector (X, Y).

Let (Ω, F, P) be the fair spinner experiment. Rename the quantizer q as X (it is common to use upper case for random variables). Define another random variable Y on the same space:

X(ω) = 0 if ω ∈ [0, 0.5), 1 if ω ∈ [0.5, 1.0)
Y(ω) = 0 if ω ∈ [0, 0.25) ∪ [0.5, 0.75), 1 if ω ∈ [0.25, 0.5) ∪ [0.75, 1.0)

(Figure: over [0, 1), X(r) is 0 on the left half and 1 on the right half, while Y(r) alternates 0, 1, 0, 1 over the quarters.)

It is easy to compute the pmfs for the individual random variables X and Y (the marginal pmfs):

p_X(k) = p_Y(k) = 1/2; k = 0, 1.

Equivalent random variables — same pmf.

Now we also have the joint pmf of the two rvs together; the inverse image formula ⇒

p_XY(x, y) = Pr(X = x, Y = y) = P({ω : (X, Y)(ω) = (x, y)}).

Do the math: p_XY(x, y) = 1/4; x = 0, 1; y = 0, 1. E.g.,

p_XY(0, 1) = P({ω : ω ∈ [0, 0.5)} ∩ {ω : ω ∈ [0.25, 0.5) ∪ [0.75, 1.0)}) = P({ω : ω ∈ [0.25, 0.5)}) = 1/4.

Notes:
• For this example p_XY(x, y) = p_X(x) p_Y(y), a product pmf. If two discrete random variables satisfy this, they are said to be independent (will discuss more later).
• Here we separately derived the joint pmf p_XY and the marginal pmfs p_X, p_Y. Alternatively, the marginals could be computed from the joint using total probability:

p_X(x) = P({ω : X(ω) = x}) = Σ_{y∈Ω_Y} P({ω : X(ω) = x, Y(ω) = y}) = Σ_{y∈Ω_Y} p_XY(x, y)
p_Y(y) = Σ_{x∈Ω_X} p_XY(x, y)

The joint and marginal pmfs are consistent — the marginals can be obtained either from the original P or from the joint pmf p_XY.

Can define any number of random variables on a common probability space. An infinite collection of random variables such as {X_n; n = 0, 1, 2, ...} defined on a common probability space is called a random process.

Extend the example: a Bernoulli random process

A Bernoulli Random Process: Fair Coin Flips

Again let Ω = [0, 1) and P be determined by the uniform pdf. Every number u ∈ [0, 1) has a binary representation

u = Σ_{n=0}^∞ b_n(u) 2^{−n−1},

the binary analog of the decimal representation — unique if we choose the representation {b_n} not having a finite number of 0s, e.g., choose 1/2 → 100000···, not 01111···.

E.g., if u = 3/4, then b_0(u) = b_1(u) = 1 and b_n(u) = 0 for all n ≥ 2.

X_n(u) = b_n(u) defines a discrete-time random process X_n; n = 0, 1, 2, ... — one experiment ⇒ an infinite number of rvs! In the earlier 2D example, X = X_0 and Y = X_1.

Computations similar to the 2D example (the inverse image formula) ⇒

• marginal pmfs p_{X_n}(x) = 1/2, x = 0, 1 (all equivalent, fair coin flips)
• for any k = 1, 2, ..., the joint pmf describing the random vector X^k = (X_0, ..., X_{k−1}) is

p_{X^k}(x^k) = Pr(X^k = x^k) = 2^{−k}; x^k ∈ {0, 1}^k.

Hence

p_{X^k}(x^k) = Π_{n=0}^{k−1} p_{X_n}(x_n); x^k ∈ {0, 1}^k,

a product pmf.
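A small sketch of the binary-expansion construction (the helper function and sample size are illustrative choices, not from the slides): compute the first bits b_n(u) of a point u, check the u = 3/4 example, and estimate the joint pmf of (X_0, X_1), which should be about 1/4 for each pair.

    import numpy as np

    def bits(u, k):
        """First k binary-expansion bits b_0(u), ..., b_{k-1}(u) of u in [0, 1)."""
        out = []
        for _ in range(k):
            u *= 2
            b = int(u)          # next bit of the expansion
            out.append(b)
            u -= b
        return out

    print(bits(0.75, 4))        # 3/4 -> [1, 1, 0, 0], matching b_0 = b_1 = 1

    rng = np.random.default_rng(2)
    samples = [tuple(bits(u, 2)) for u in rng.random(100_000)]
    for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(pair, samples.count(pair) / len(samples))   # each close to 1/4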
A collection of discrete random variables is said to be (mutually) independent if the joint pmf = the product of the marginal pmfs. A random vector is said to be independent identically distributed, or iid (or i.i.d. or IID), if its component random variables are independent and identically distributed. A random process {X_n} is iid if any finite collection of random variables in the collection is iid.

An iid process is also called a Bernoulli process (sometimes the name is reserved for binary processes). The classic example is an infinite sequence of fair coin flips.

This ends the extended example of the uniform spinner and the Bernoulli process; back to the general material. Elaborate on Ω, F, P.

Probability spaces: sample space Ω

Common examples:

• {0, 1} — coin flip, value in a data stream at time t
• [0, 1) — fair spin, analog sensor output
• Z_k = {0, 1, ..., k − 1} — die roll, ASCII character, card drawn from a deck
• [0, ∞) — time to arrival of the first packet, bus, customer
• Z_+ ≜ {0, 1, 2, ...} — number of packets/buses/customers arriving in [0, T)
• R = (−∞, ∞) — voltage at a sensor (without a known bound)
• {0, 1}^k = all binary k-tuples (a product space) — flip one coin k times, flip k coins once, sample k successive values in a data stream
• [0, 1)^k = the k-dimensional unit cube — measurements from k identical sensors at different locations
• R^k = k-dimensional Euclidean space — voltages of an array of k sensors
• etc. — e.g., Z_n^k, sequence spaces such as all binary sequences, waveform spaces such as all continuous, differentiable waveforms

Probability spaces: event space F

Given Ω, the smallest event space is {∅, Ω}. The biggest event space is the power set (too big to be useful for continuous spaces). A useful event space in R and R^k is the Borel field = the smallest event space containing all of the intervals (k = 1) and rectangles (k > 1). (Not needed for HW or exams, but useful to have an idea of when you encounter it in books and papers.)

A measurable space (Ω, F) = a sample space + an event space of subsets of the sample space.

Probability spaces: probability measure

Key point: A set function P defined on an event space F of subsets of a sample space Ω is a probability measure if and only if (iff) it satisfies the axioms of probability. When dealing with finite sample spaces, we only need Axiom 3 to hold for finite collections of disjoint events.

A trivial example

The simplest possible example is useless except for providing a trivial example: Ω is any abstract space, F = {Ω, ∅}, and P is defined by P(Ω) = 1, P(∅) = 0. The axioms of a probability measure are satisfied.

Simple example: biased coin

The simplest nontrivial example: Ω = {0, 1}.
Event space F = {{0}, {1}, Ω = {0, 1}, ∅} — the axioms of an event space are satisfied. P is defined by

P(F) = 1 − p if F = {0}; p if F = {1}; 0 if F = ∅; 1 if F = Ω,

where p ∈ (0, 1) is a fixed parameter (p = 0, 1 is a variation on the trivial probability space). The axioms can be verified in a straightforward manner.

In general we cannot do it this way and list the probabilities of every event — too many events. Instead we provide a formula for computing probabilities of events as integrals (of a pdf) or sums (of a pmf). Will see many common examples; most have names (uniform, binomial, geometric, Poisson, Gaussian, Bernoulli, exponential, Laplacian, ...).

Before more examples, derive several fundamental properties of probability — several elementary and one advanced.

Elementary properties of probability

Given (Ω, F, P):

(a) For all events F, P(F^c) = 1 − P(F).
(b) For all events F, P(F) ≤ 1.
(c) Let ∅ be the null or empty set; then P(∅) = 0.
(d) Total probability: If events {F_i; i = 1, 2, ...} form a partition of Ω, i.e., if F_i ∩ F_k = ∅ when i ≠ k and ∪_i F_i = Ω, then for any event G, P(G) = Σ_i P(G ∩ F_i).
(e) If G ⊂ F for events G, F, then P(F − G) = P(F) − P(G). (F − G ≜ F ∩ G^c, also written F\G.)

Proof

(a) F ∪ F^c = Ω ⇒ P(F ∪ F^c) = 1 (Axiom 2). F ∩ F^c = ∅ ⇒ 1 = P(F ∪ F^c) = P(F) + P(F^c) (Axiom 3).
(b) P(F) = 1 − P(F^c) ≤ 1 (Axiom 1 and (a) above).
(c) By Axiom 2 and (a) above, P(Ω^c) = P(∅) = 1 − P(Ω) = 0.
Note: The empty set ∅ has probability 0, but P(F) = 0 does not mean F = ∅. E.g., the uniform spinner with F = {1/n : n = 1, 2, 3, ...}.
(d) Using set theory and Axiom 3 (the sets G ∩ F_i are disjoint):
P(G) = P(G ∩ Ω) = P(G ∩ (∪_i F_i)) = P(∪_i (G ∩ F_i)) = Σ_i P(G ∩ F_i).
(e) F − G = F ∩ G^c and G are disjoint, so Axiom 3 ⇒ P((F − G) ∪ G) = P(F − G) + P(G). Since G ⊂ F, G = F ∩ G ⇒ (F − G) ∪ G = (F ∩ G^c) ∪ (F ∩ G) = F ∩ (G^c ∪ G) = F. Thus P(F) = P(F − G) + P(G).
Note that (e) ⇒ if G ⊂ F, then P(G) ≤ P(F).

An advanced property of probability: Continuity

A sequence of sets F_n, n = 0, 1, 2, ..., is increasing if F_{n−1} ⊂ F_n for all n (also called nested), and decreasing if F_n ⊂ F_{n−1}. (Figure: (a) increasing sets F_1 ⊂ F_2 ⊂ F_3 ⊂ F_4 inside Ω, (b) decreasing sets F_4 ⊂ F_3 ⊂ F_2 ⊂ F_1.)

E.g., increasing: F_n = [0, n), F_n = [1, 2 − 1/n), F_n = (−n, a). Decreasing: F_n = [1, 3 + 1/n), F_n = (1 − 1/n, 1 + 1/n).

Natural definition of the limit of increasing sets: lim_{n→∞} F_n ≜ ∪_{n=1}^∞ F_n.
Natural definition of the limit of decreasing sets: lim_{n→∞} F_n ≜ ∩_{n=1}^∞ F_n.

E.g., lim_{n→∞} [0, n) = [0, ∞); lim_{n→∞} [1, 2 − 1/n) = [1, 2); lim_{n→∞} (−n, a) = (−∞, a); lim_{n→∞} [1, 3 + 1/n) = [1, 3]; lim_{n→∞} (1 − 1/n, 1 + 1/n) = {1}.

There is no natural definition of a limit of an arbitrary sequence of sets, only for increasing or decreasing sequences.

Continuity of probability: If F_n is an increasing or decreasing sequence of events, then

P(lim_{n→∞} F_n) = lim_{n→∞} P(F_n).
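A tiny numerical illustration (a sketch, not from the slides) of continuity for the uniform spinner: the decreasing events G_n = (1/2 − 1/n, 1/2 + 1/n) shrink to the singleton {1/2}, and their probabilities 2/n shrink to P({1/2}) = 0, consistent with a pdf assigning zero probability to points.

    # Uniform spinner: P((c, d)) = d - c for 0 <= c < d <= 1.
    for n in [4, 10, 100, 1000, 10_000]:
        c, d = 0.5 - 1.0 / n, 0.5 + 1.0 / n    # decreasing events G_n around 1/2
        print(n, d - c)                        # P(G_n) = 2/n, tending to P({1/2}) = 0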
Prove this for increasing sets {F_n; n = 0, 1, 2, ...}. Recall the set-theoretic difference A − B = A ∩ B^c = the points in A that are not in B. Define G_0 = F_0 and G_n = F_n − F_{n−1} for n = 1, 2, .... The G_n are disjoint, ∪_{k=0}^n G_k = F_n, and ∪_{k=0}^∞ F_k = ∪_{k=0}^∞ G_k. Then

P(lim_{n→∞} F_n) = P(∪_{k=0}^∞ F_k) = P(∪_{k=0}^∞ G_k) = Σ_{k=0}^∞ P(G_k)   (Axiom 3)
= lim_{n→∞} Σ_{k=0}^n P(G_k)   (definition of an infinite sum).

G_n = F_n − F_{n−1} and F_{n−1} ⊂ F_n ⇒ P(G_n) = P(F_n) − P(F_{n−1}), so

Σ_{k=0}^n P(G_k) = P(F_0) + Σ_{k=1}^n (P(F_k) − P(F_{k−1})) = P(F_n),

a "telescoping sum" in which all terms cancel but the last one:

P(F_n) = P(F_n) − P(F_{n−1}) + P(F_{n−1}) − P(F_{n−2}) + P(F_{n−2}) − P(F_{n−3}) + ··· + P(F_1) − P(F_0) + P(F_0).

⇒ If F_n is a sequence of increasing events, then P(lim_{n→∞} F_n) = lim_{n→∞} P(F_n). E.g., P((−∞, a]) = lim_{n→∞} P((−n, a]).

A similar proof works for decreasing events. E.g., P({a}) = lim_{n→∞} P((a − 1/n, a + 1/n)). If P is described by a pdf, then the probability of a point is 0.

Can show: Axioms 1, 2, and 3 for finite collections of disjoint events + continuity of probability ⇔ Axioms 1–3. Kolmogorov's countable additivity axiom ensures good limiting behavior.

Back to more concrete issues.

Discrete probability spaces

Recall the basic construction: Ω = a discrete (countable) set, F = the power set of Ω. Given a probability measure P on (Ω, F), ⇒ pmf

p(ω) ≜ P({ω}); ω ∈ Ω.

Conversely, given a pmf p (nonnegative, sums to 1), ⇒ P via P(F) = Σ_{ω∈F} p(ω). Calculus (properties of sums) implies the axioms of probability are satisfied.

Common examples of pmfs

Binary (Bernoulli) pmf. Ω = {0, 1}; p(0) = 1 − p, p(1) = p, where p is a parameter in (0, 1).

Uniform pmf. Ω = Z_n = {0, 1, ..., n − 1} and p(k) = 1/n; k ∈ Z_n.

Binomial pmf. Ω = Z_{n+1} = {0, 1, ..., n} and

p(k) = C(n, k) p^k (1 − p)^{n−k}; k ∈ Z_{n+1},

where C(n, k) = n!/(k!(n − k)!) is the binomial coefficient (read as "n choose k").

Geometric pmf. Ω = {1, 2, 3, ...} and p(k) = (1 − p)^{k−1} p; k = 1, 2, ..., where p ∈ (0, 1) is a parameter.

Poisson pmf. Ω = Z_+ = {0, 1, 2, ...} and p(k) = λ^k e^{−λ}/k!, where λ is a parameter in (0, ∞). (Keep in mind that 0! ≜ 1.)

These are all obviously nonnegative, hence to verify they are pmfs we need only show Σ_{ω∈Ω} p(ω) = 1. This is obvious for the Bernoulli and uniform. For the binomial, use the binomial theorem; for the geometric, the geometric progression formula; for the Poisson, the Taylor series expansion for e^x (details later; try it on your own). These are exercises in the calculus of sums.

Before continuing to continuous examples (pdfs), consider more general weighted sums called expectations.

Discrete expectations

Given a pmf p on a discrete sample space Ω, suppose that g is a real-valued function defined on Ω: g : Ω → R. Recall: a real-valued function defined on a probability space is called a random variable.¹ Define the expectation of g (with respect to p) by

E(g) = Σ_{ω∈Ω} g(ω) p(ω).

(Treated in depth later; useful to introduce the basic idea early.)

¹ There is a required technical condition we will see later, but it is automatic for discrete probability spaces with the power set as event space.
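A small sketch of the expectation formula E(g) = Σ g(ω)p(ω) in code, using the binomial pmf as the example pmf (the parameter values and the event F are arbitrary illustrative choices). It also checks that the pmf sums to 1, that the mean equals np (derived below in the notes), and that P(F) = E(1_F).

    from math import comb

    n, p = 10, 0.3
    pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}   # binomial pmf

    def expectation(g, pmf):
        """E(g) = sum over omega of g(omega) * p(omega)."""
        return sum(g(w) * pw for w, pw in pmf.items())

    print(sum(pmf.values()))                          # ~ 1.0, so a valid pmf
    print(expectation(lambda w: w, pmf), n * p)       # mean, should equal n*p = 3.0
    F = {k for k in range(n + 1) if k >= 5}
    print(expectation(lambda w: 1.0 if w in F else 0.0, pmf))   # P(F) = E(1_F)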
Example: If g(ω) = 1_F(ω), then P(F) = E(1_F), so probability can be viewed as a special case of expectation.

Some other important examples: Suppose that Ω ⊂ R, e.g., R, [0, 1), Z, or Z_n. Fix a pmf p. If g(ω) = ω, we get the mean or first moment

m = Σ_ω ω p(ω).

kth moment: m^(k) = Σ_ω ω^k p(ω); e.g., m = m^(1). The second moment is often of interest: m^(2) = Σ_ω ω² p(ω).

Centralized moments: Σ_ω (ω − m)^k p(ω). Most important is the variance

σ² = Σ_ω (ω − m)² p(ω).

Note:

σ² = Σ (ω − m)² p(ω) = Σ (ω² − 2ωm + m²) p(ω) = Σ ω² p(ω) − 2m Σ ω p(ω) + m² Σ p(ω) = m^(2) − 2m² + m² = m^(2) − m²,

i.e., variance = second moment − mean².

Will see a similar definition for the continuous case with a pdf, with integrals instead of sums. For now moments can be viewed simply as attributes of pmfs and pdfs. For many pmfs and pdfs, knowing certain moments completely describes the pmf/pdf.

Computational examples

Discrete uniform pmf on Z_n

P(F) = (1/n) Σ_ω 1_F(ω) = #(F)/n, where #(F) = the number of elements in F.

mean: m = (1/n) Σ_{k=0}^{n−1} k = (n − 1)/2
second moment: m^(2) = (1/n) Σ_{k=0}^{n−1} k² = (n − 1)(2n − 1)/6

Binomial pmf

To show it is a valid pmf, use the binomial theorem:

(a + b)^n = Σ_{k=0}^n C(n, k) a^k b^{n−k}.

Set a = p, b = 1 − p:

Σ_{k=0}^n p(k) = Σ_{k=0}^n C(n, k) p^k (1 − p)^{n−k} = (p + 1 − p)^n = 1.

Finding the mean is messy, but good practice; later we find shortcuts.

m = Σ_{k=0}^n k C(n, k) p^k (1 − p)^{n−k}.

Trick: this resembles the terms in the binomial theorem; massage it into that form:

m = Σ_{k=1}^n n!/((n − k)!(k − 1)!) p^k (1 − p)^{n−k}.

Change variables l = k − 1:

m = Σ_{l=0}^{n−1} n!/((n − l − 1)! l!) p^{l+1} (1 − p)^{n−l−1}
  = np Σ_{l=0}^{n−1} (n − 1)!/((n − 1 − l)! l!) p^l (1 − p)^{n−1−l}
  = np (p + 1 − p)^{n−1} = np.

Postpone the second moment until we have a better method.

Geometric pmf

Use the geometric progression: if |a| < 1, Σ_{k=0}^∞ a^k = 1/(1 − a). Set a = 1 − p and get

Σ_{k=1}^∞ p(1 − p)^{k−1} = p/(1 − (1 − p)) = 1

⇒ the geometric pmf indeed sums to 1.

mean: m = Σ_{k=1}^∞ k p(k) = Σ_{k=1}^∞ k p (1 − p)^{k−1}. How to evaluate? Can look it up, or use a trick: differentiate the geometric progression formula to obtain

d/da Σ_{k=0}^∞ a^k = Σ_{k=0}^∞ k a^{k−1} = d/da [1/(1 − a)] = 1/(1 − a)².

With a = 1 − p this gives m = 1/p for the geometric pmf.

A similar idea works for the second moment:

m^(2) = Σ_{k=1}^∞ k² p (1 − p)^{k−1} = p (2/p³ − 1/p²) = (2 − p)/p²,

hence

σ² = m^(2) − m² = (1 − p)/p².

Example of probability calculus using the geometric pmf: Find the probabilities of the events F = {k : k ≥ 10} and G = {k : k is odd}. Note that F = {10, 11, 12, ...} and G = {1, 3, 5, 7, ...}.

Solutions:

P(F) = Σ_{k∈F} p(k) = Σ_{k=10}^∞ p(1 − p)^{k−1} = [p/(1 − p)] Σ_{k=10}^∞ (1 − p)^k = p(1 − p)^9 Σ_{k=0}^∞ (1 − p)^k = (1 − p)^9.

P(G) = Σ_{k∈G} p(k) = Σ_{k=1,3,5,...} p(1 − p)^{k−1} = p Σ_{k=0,2,4,...} (1 − p)^k = p Σ_{k=0}^∞ [(1 − p)²]^k = p/(1 − (1 − p)²) = 1/(2 − p).
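A quick numerical check (a sketch; the parameter p = 0.3 and the truncation point are arbitrary) of the geometric-pmf results above: the truncated sums should reproduce the mean 1/p, the variance (1 − p)/p², P(F) = (1 − p)^9, and P(G) = 1/(2 − p).

    p = 0.3
    ks = range(1, 2000)                               # truncated support; the tail is negligible here
    pmf = [(1 - p) ** (k - 1) * p for k in ks]        # geometric pmf values

    mean = sum(k * q for k, q in zip(ks, pmf))
    second = sum(k * k * q for k, q in zip(ks, pmf))
    print(mean, 1 / p)                                # ~ 1/p
    print(second - mean**2, (1 - p) / p**2)           # variance ~ (1-p)/p^2

    print(sum(q for k, q in zip(ks, pmf) if k >= 10), (1 - p) ** 9)    # P(F)
    print(sum(q for k, q in zip(ks, pmf) if k % 2 == 1), 1 / (2 - p))  # P(G)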
Poisson pmf

Show that it sums to 1:

Σ_{k=0}^∞ p(k) = Σ_{k=0}^∞ λ^k e^{−λ}/k! = e^{−λ} Σ_{k=0}^∞ λ^k/k! = e^{−λ} e^{λ} = 1,

using the Taylor series for e^λ.

mean (change of variables l = k − 1):

m = Σ_{k=0}^∞ k p(k) = Σ_{k=1}^∞ k λ^k e^{−λ}/k! = e^{−λ} Σ_{k=1}^∞ λ^k/(k − 1)! = λ e^{−λ} Σ_{l=0}^∞ λ^l/l! = λ.

second moment (change of variables l = k − 2):

m^(2) = Σ_{k=0}^∞ k² λ^k e^{−λ}/k! = Σ_{k=2}^∞ k(k − 1) λ^k e^{−λ}/k! + m = λ² Σ_{k=2}^∞ λ^{k−2} e^{−λ}/(k − 2)! + m = λ² + m = λ² + λ,

so σ² = λ.

Moments crop up in many signal processing applications (and statistical analyses of many kinds), so it is useful to have a table of moment formulas for the important distributions handy, and an appreciation of the underlying calculus.

Multidimensional pmfs

Vector spaces with discrete coordinates, like Z_n^k, are also discrete. The same ideas of describing probabilities by pmfs work.

Suppose A is a discrete space. Then A^k = {all vectors x = (x_0, ..., x_{k−1}) with x_i ∈ A, i = 0, 1, ..., k − 1} is also discrete. If we have a pmf p on A^k, i.e., p(x) ≥ 0 and Σ_{x∈A^k} p(x) = 1, then P(F) = Σ_{x∈F} p(x) is a probability measure.

A common example of a pmf on vectors is a product pmf.

Product pmf. Suppose {p_i; i = 0, 1, ..., k − 1} are one-dimensional pmfs on A (for each i, p_i(x) ≥ 0 for all x ∈ A and Σ_{x∈A} p_i(x) = 1, i.e., each is nonnegative and sums to 1). Define the product k-dimensional pmf p on A^k by

p(x) = p(x_0, x_1, ..., x_{k−1}) = Π_{i=0}^{k−1} p_i(x_i).

It is easily seen to be a pmf (nonnegative, sums to 1).

Example: product of Bernoulli pmfs: p_i(x) = p^x (1 − p)^{1−x}; x = 0, 1, all i ⇒ the product pmf

p(x_0, x_1, ..., x_{k−1}) = Π_{i=0}^{k−1} p^{x_i} (1 − p)^{1−x_i} = p^{w(x_0,...,x_{k−1})} (1 − p)^{k−w(x_0,...,x_{k−1})},

where w(x_0, x_1, ..., x_{k−1}) = the number of ones in the binary k-tuple x_0, x_1, ..., x_{k−1}, the Hamming weight of the vector.

Will see that when the probability of a bunch of things factors into a bunch of probabilities in this way, things simplify significantly.

Continuous probability spaces

Recall the basic construction using a pdf on the real line R: (Ω, F) = (R, B(R)). f is a pdf on R if

f(r) ≥ 0, all r ∈ Ω, and ∫_Ω f(r) dr = 1.

Define the set function P by

P(F) = ∫_F f(r) dr = ∫ 1_F(r) f(r) dr, F ∈ B(R).

Unlike a pmf, a pdf is not a probability of something — it is a density of probability. To get a probability, one must integrate the pdf over a set.

Common approximation: Suppose Ω = R and f is a pdf. Consider the event F = [x, x + Δx), Δx small. The mean value theorem of calculus ⇒

P([x, x + Δx)) = ∫_x^{x+Δx} f(α) dα ≈ f(x) Δx.
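A small sketch of the two facts just stated, checked numerically. The quadratic pdf f(x) = 2x on [0, 1) is an arbitrary illustrative choice (it is not one of the examples in the notes): a crude Riemann sum shows it integrates to 1, and the exact interval probability is compared with the mean-value approximation f(x)Δx.

    import numpy as np

    f = lambda x: 2.0 * x                        # illustrative pdf on [0, 1): nonnegative, area 1

    xs = np.linspace(0.0, 1.0, 100_001)
    riemann = np.sum(f(xs[:-1]) * np.diff(xs))   # crude Riemann sum of the pdf over [0, 1)
    print(riemann)                               # ~ 1.0, so f is a valid pdf

    x, dx = 0.4, 1e-3
    exact = (x + dx) ** 2 - x ** 2               # P([x, x+dx)) = integral of 2t dt from x to x+dx
    print(exact, f(x) * dx)                      # the approximation f(x)*dx is accurate to O(dx^2)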
Theoretical issue: Does P(F) = ∫_F f(r) dr = ∫ 1_F(r) f(r) dr, F ∈ B(R), in fact define a probability measure? That is, does the set function P satisfy Axioms 1–3? Answer: yes, with proper care to use the correct event space and notion of integration. As usual, this is rarely a problem in practice; the details are needed in research.

Usually continuous probability just substitutes integration for summation, but pmfs and pdfs are inherently different.

Common examples of pdfs

(The pdfs are assumed to be 0 outside the specified domain and are given in terms of real-valued parameters b, a, λ > 0, m, and σ > 0.)

Uniform pdf. Given b > a, f(r) = 1/(b − a) for r ∈ [a, b].
Exponential pdf. f(r) = λ e^{−λr}; r ≥ 0.
Doubly exponential (or Laplacian) pdf. f(r) = (λ/2) e^{−λ|r|}; r ∈ R.
Gaussian (or Normal) pdf. f(r) = (2πσ²)^{−1/2} exp(−(r − m)²/2σ²); r ∈ R. Commonly denoted by N(m, σ²).

Continuous expectation

Analogous to discrete expectation. Have Ω, a pdf f, and a function g : Ω → R. Define the expectation of g (with respect to f) by

E(g) = ∫ g(r) f(r) dr.

As in the discrete case, P(F) = E(1_F).

Important examples if Ω ⊂ R:

mean or first moment: m = ∫ r f(r) dr
kth moment: m^(k) = ∫ r^k f(r) dr
second moment: m^(2) = ∫ r² f(r) dr
centralized moments, including the variance: ∫ (r − m)^k f(r) dr, σ² = ∫ (r − m)² f(r) dr

As in the discrete case, σ² = m^(2) − m². When complex-valued functions are allowed, often the kth absolute moment is used: m_a^(k) = ∫ |r|^k f(r) dr.

Computational examples

Continuous uniform pdf on [a, b): consider a = 0 and b = 1.

mean: m = ∫_0^1 r dr = r²/2 |_0^1 = 1/2
second moment: m^(2) = ∫_0^1 r² dr = r³/3 |_0^1 = 1/3
variance: σ² = 1/3 − (1/2)² = 1/12

Exponential pdf. The validation of the pdf (integrates to 1) and the mean, second moment, and variance of the exponential pdf can be found from integral tables, by the integral analogs of the tricks for the geometric pmf, or from integration by parts:

∫_0^∞ λ e^{−λr} dr = 1
m = ∫_0^∞ r λ e^{−λr} dr = 1/λ
m^(2) = ∫_0^∞ r² λ e^{−λr} dr = 2/λ²
σ² = 2/λ² − 1/λ² = 1/λ²

Laplacian pdf = a mixture of an exponential and its reverse; left as an exercise.

Gaussian pdf. Moment computation is more trouble than it is worth; we will find an easier method later. For now just state:

∫_{−∞}^∞ (2πσ²)^{−1/2} e^{−(x−m)²/2σ²} dx = 1
∫_{−∞}^∞ x (2πσ²)^{−1/2} e^{−(x−m)²/2σ²} dx = m
∫_{−∞}^∞ (x − m)² (2πσ²)^{−1/2} e^{−(x−m)²/2σ²} dx = σ²

i.e., mean = m and variance = σ², as the notation suggests.

Computing probabilities is sometimes easy, sometimes not. With a uniform pdf on [a, b], for a ≤ c < d ≤ b,

P([c, d]) = (d − c)/(b − a).

For the exponential pdf and [c, d], 0 ≤ c < d,

P([c, d]) = ∫_c^d λ e^{−λx} dx = e^{−λc} − e^{−λd}.

There is no such nice closed form for the Gaussian, but it is well tabulated in terms of

Φ(α) = ∫_{−∞}^α (1/√(2π)) e^{−u²/2} du
Q(α) = ∫_α^∞ (1/√(2π)) e^{−u²/2} du = 1 − Φ(α)
erf(α) = (2/√π) ∫_0^α e^{−u²} du

Note: Φ(α) = P((−∞, α]) = P({x : x ≤ α}) for N(0, 1). The Q function is the complementary function Q(α) = 1 − Φ(α), common in communications systems analysis; in terms of erf,

Q(α) = (1/2)[1 − erf(α/√2)] = 1 − Φ(α).

For example, find P((−∞, α]) for N(m, σ²). Change variables u = (x − m)/σ ⇒

P({x : x ≤ α}) = ∫_{−∞}^α (2πσ²)^{−1/2} e^{−(x−m)²/2σ²} dx = ∫_{−∞}^{(α−m)/σ} (1/√(2π)) e^{−u²/2} du = Φ((α − m)/σ) = 1 − Q((α − m)/σ).

Change variables to find probabilities with tables (or numerically from these functions):

P((a, b]) = P((−∞, b]) − P((−∞, a]) = Φ((b − m)/σ) − Φ((a − m)/σ).

Symmetry of the Gaussian density ⇒ 1 − Φ(a) = Φ(−a).
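A sketch of these Gaussian formulas in code: Φ is built from the standard-library erf, the interval probability P((a, b]) is computed by the change-of-variables formula, and a Monte Carlo estimate is used as a cross-check. The parameters m, σ, a, b are arbitrary illustrative values.

    import math, random

    def Phi(alpha):
        """Standard normal CDF: Phi(alpha) = (1 + erf(alpha / sqrt(2))) / 2."""
        return 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))

    m, sigma = 1.0, 2.0                  # arbitrary N(m, sigma^2) parameters
    a, b = 0.0, 3.0
    exact = Phi((b - m) / sigma) - Phi((a - m) / sigma)   # P((a, b]) via the table functions

    random.seed(0)
    samples = [random.gauss(m, sigma) for _ in range(200_000)]
    mc = sum(a < x <= b for x in samples) / len(samples)  # Monte Carlo estimate of the same event
    print(exact, mc)                     # the two values should roughly agree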
Multidimensional pdf example

Ω = R². Event space: the two-dimensional Borel field B(R)² (to give it a name; the main point is that there is a standard, useful definition which you can learn about in advanced probability). The probability measure is described by a 2D pdf:

f(x, y) = λμ e^{−λx−μy}; x ∈ [0, ∞), y ∈ [0, ∞); 0 otherwise.

Note: analogous to a product pmf, this is a product pdf.

Interpretation: the sample points (x, y) = the first arrival times of two distinct types of particle (type A and type B) at a sensor (or packets at a router, or buses at a bus stop) after time 0.

Let F = the event that a particle of type A arrives at the sensor before one of type B. What is the probability of the event F = {(x, y) : x < y}?

As with 1D, probability = the integral of the pdf over the event:

P(F) = ∫∫_{(x,y)∈F} f(x, y) dx dy = ∫∫_{x≥0, y≥0, x<y} λμ e^{−λx−μy} dx dy.

Now it is just calculus:

P(F) = λμ ∫_0^∞ dy e^{−μy} ∫_0^y dx e^{−λx}
     = λμ ∫_0^∞ dy e^{−μy} (1 − e^{−λy})/λ
     = μ [∫_0^∞ dy e^{−μy} − ∫_0^∞ dy e^{−(μ+λ)y}]
     = 1 − μ/(μ + λ) = λ/(μ + λ).
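A Monte Carlo cross-check of this result (a sketch; the rates λ = 2 and μ = 3 are arbitrary). Because the pdf has product form, each coordinate can be drawn separately from its exponential factor; numpy's exponential sampler is parameterized by the scale 1/rate.

    import numpy as np

    lam, mu = 2.0, 3.0                                    # arrival rates for type A and type B
    rng = np.random.default_rng(3)

    x = rng.exponential(scale=1.0 / lam, size=500_000)    # type A arrival times
    y = rng.exponential(scale=1.0 / mu, size=500_000)     # type B arrival times

    print(np.mean(x < y), lam / (lam + mu))               # empirical P(F) vs lambda/(lambda+mu) = 0.4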
Mass functions as densities

One can use continuous ideas in discrete problems with Dirac deltas, but it is usually clumsy and adds complication. Described briefly:

The Dirac delta is defined implicitly by its behavior inside an integral: if {g(r); r ∈ R} is continuous at a ∈ R, then

∫ g(r) δ(r − a) dr = g(a)

(no ordinary function does this; Dirac deltas are generalized functions).

Given a pmf p defined on Ω ⊂ R, we can define a pdf f by

f(r) = Σ_ω p(ω) δ(r − ω).

Then f(r) ≥ 0 and

∫ f(r) dr = ∫ [Σ_ω p(ω) δ(r − ω)] dr = Σ_ω p(ω) ∫ δ(r − ω) dr = Σ_ω p(ω) = 1,

and

∫ 1_F(r) f(r) dr = ∫ 1_F(r) [Σ_ω p(ω) δ(r − ω)] dr = Σ_ω p(ω) ∫ 1_F(r) δ(r − ω) dr = Σ_ω p(ω) 1_F(ω) = P(F).

But the sum of a pmf is simpler; deltas usually make things harder.

Multidimensional pdfs: General Case

Multidimensional integrals allow the construction of probabilities on R^k. Given the measurable space (R^k, B(R)^k), a real-valued function f on R^k is a pdf if

f(x) ≥ 0, all x = (x_0, x_1, ..., x_{k−1}) ∈ R^k, and ∫_{R^k} f(x) dx = 1.

Define a set function P by

P(F) = ∫_F f(x) dx, all F ∈ B(R)^k,

where the vector integral is shorthand for the k-dimensional integral

P(F) = ∫_{(x_0,x_1,...,x_{k−1})∈F} f(x_0, x_1, ..., x_{k−1}) dx_0 dx_1 ··· dx_{k−1}.

As with multidimensional pmfs, a pdf is not itself the probability of anything. As in the 1D case, subject to appropriate assumptions (event space, integral), this does define a probability measure, i.e., it satisfies the axioms.

Two common and very important examples of k-dimensional pdfs:

Product pdf. A product pdf has the form

f(x) = f(x_0, x_1, ..., x_{k−1}) = Π_{i=0}^{k−1} f_i(x_i),

where the f_i, i = 0, 1, ..., k − 1, are one-dimensional pdfs on the real line. The most important special case: all f_i(r) are the same (identically distributed). So all pdfs on R ⇒ product pdfs on R^k.

Multidimensional Gaussian pdf. A multidimensional pdf is Gaussian if it has the form

f(x) = (2π)^{−k/2} (det Λ)^{−1/2} e^{−(x−m)^t Λ^{−1} (x−m)/2}; x ∈ R^k,

where det Λ is the determinant of the matrix Λ, m = (m_0, m_1, ..., m_{k−1})^t is a column vector, and Λ is a k by k square matrix with entries {λ_{i,j}; i = 0, 1, ..., k − 1; j = 0, 1, ..., k − 1}. Assume:

1. Λ is symmetric (Λ = Λ^t or, equivalently, λ_{i,j} = λ_{j,i}, all i, j).
2. Λ is positive definite; i.e., for any nonzero vector y ∈ R^k,
   y^t Λ y = Σ_{i=0}^{k−1} Σ_{j=0}^{k−1} y_i λ_{i,j} y_j > 0.

Notes:
• Λ positive definite ⇒ det Λ > 0 and Λ^{−1} exists.
• If Λ is a diagonal matrix, this becomes a product pdf.
• There is a more general definition of a Gaussian random vector that only requires Λ to be nonnegative definite.
• The Gaussian is important because it crops up often as a good approximation (central limit theorem) and is sometimes a "worst case."
• It is hard to show that it integrates to 1.

Mixtures

An example of constructing new probability measures from old ones. Suppose P_i, i = 1, 2, ..., is a collection of probability measures on a common measurable space (Ω, F), and a_i ≥ 0, i = 1, 2, ..., with Σ_i a_i = 1. Then the mixture

P(F) = Σ_{i=1}^∞ a_i P_i(F), abbreviated P = Σ_{i=1}^∞ a_i P_i,

is also a probability measure on (Ω, F).

Mixtures are useful for constructing probability measures mixing continuous and discrete parts: Ω = R, f a pdf, p a pmf, λ ∈ (0, 1). Then

P(F) = λ Σ_{x∈F} p(x) + (1 − λ) ∫_F f(x) dx.

This is (almost) the most general model. Expectations in such cases are found in the natural way: given a function g,

E(g) = λ Σ_{x∈Ω} g(x) p(x) + (1 − λ) ∫ g(x) f(x) dx.

This works for scalar and vector sample spaces.

E.g., experiment: first spin a fair wheel. If the pointer lands in [0, λ), then roll a die described by p. If the pointer lands in [λ, 1), then choose ω using a Gaussian pdf.

Another use of mixtures: first choose parameters at random (e.g., the bias of a coin), then use the probability measure described by the parameter (e.g., Bernoulli(p)). Will see examples.
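A sketch of the two-stage experiment just described, with illustrative choices: λ = 0.3, a fair die for the discrete part, and a standard Gaussian N(0, 1) for the continuous part. The simulated sample mean is compared against the mixture-expectation formula with g(x) = x, i.e., λ·3.5 + (1 − λ)·0.

    import numpy as np

    rng = np.random.default_rng(4)
    lam = 0.3                                    # wheel threshold: [0, lam) -> roll a die, [lam, 1) -> Gaussian
    n = 200_000

    spins = rng.random(n)                        # fair wheel
    die = rng.integers(1, 7, size=n)             # fair die outcomes 1..6
    gauss = rng.standard_normal(n)               # N(0, 1) outcomes
    omega = np.where(spins < lam, die, gauss)    # outcome of the two-stage experiment

    # mixture formula with g(x) = x: E(g) = lam * (die mean) + (1 - lam) * (Gaussian mean)
    print(omega.mean(), lam * 3.5 + (1 - lam) * 0.0)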
Event Independence

Given (Ω, F, P), two events F and G are independent if

P(F ∩ G) = P(F)P(G).

A collection of events {F_i; i = 0, 1, ..., k − 1} is independent or mutually independent if for any distinct subcollection {F_{l_i}; i = 0, 1, ..., m − 1}, l_m < k,

P(∩_{i=0}^{m−1} F_{l_i}) = Π_{i=0}^{m−1} P(F_{l_i}).

Note: the requirement is on all subcollections! It is not enough to just say P(∩_{i=0}^{k−1} F_i) = Π_{i=0}^{k−1} P(F_i).

Example of what can go wrong: Suppose that P(F) = P(G) = P(H) = 1/3 and that there is zero probability on the overlap F ∩ G except where it also overlaps H, i.e., P(F ∩ G ∩ H^c) = 0, with

P(F ∩ G ∩ H) = 1/27 = P(F)P(G)P(H)
P(F ∩ G) = P(G ∩ H) = P(F ∩ H) = 1/27 ≠ P(F)P(G) = 1/9.

So P(F ∩ G ∩ H) = P(F)P(G)P(H) for the three events F, G, and H, yet it is not true that P(F ∩ G) = P(F)P(G). So here it is not true that the three events are mutually independent!

Probabilistic independence has an intuitive interpretation in terms of the next topic — elementary conditional probability. But the definition does not require conditional probability.

Elementary conditional probability

Intuitively, independence of two events means that the occurrence of one event should not affect the probability of occurrence of the other. E.g., the outcome of a roll of one die does not change the probability of the next roll (or of rolling another die).

We need a definition of the probability of one event conditioned on another. Motivation: given (Ω, F, P), suppose we know that event G has occurred. What is a reasonable definition for the probability that F will occur given (conditioned on) G, P(F | G)?

What properties should it have? For a fixed G, P(F | G) should be defined for all events F, and it should be a probability measure on (Ω, F). Clearly P(G^c | G) = 0 and P(G | G) = 1. Since P(· | G) must be a probability measure,

P(F | G) = P(F ∩ (G ∪ G^c) | G) = P(F ∩ G | G) + P(F ∩ G^c | G) = P(F ∩ G | G),

since P(F ∩ G^c | G) = 0. So

P(F | G) = P(F ∩ G | G).   (*1)

Next, there is no reason to suspect that the relative probabilities within G should change because of the knowledge that G occurred. E.g., if F ⊂ G is twice as probable as an event H ⊂ G with respect to P, then the same should be true with respect to P(· | G). For arbitrary events F and H, F ∩ G, H ∩ G ⊂ G, ⇒

P(F ∩ G | G)/P(H ∩ G | G) = P(F ∩ G)/P(H ∩ G).

Set H = Ω; together with (*1) ⇒

P(F | G) = P(F ∩ G | G) = P(F ∩ G)/P(G).   (*2)

This motivates using (*2) as the definition of conditional probability. It only works if P(G) > 0 and is called elementary conditional probability. (Will later see nonelementary conditional probability; it is more complicated.) It is easy to prove that since P is a probability measure, so is P(· | G).

Independence redux

Suppose that F and G are independent events and that P(G) > 0. Then

P(F | G) = P(F ∩ G)/P(G) = P(F);

the occurrence of G does not affect the probability of F: the a posteriori probability P(F | G) = the a priori probability P(F). This is not used as the definition of independence since it is less general.

Conditional probability provides a means of constructing new probability spaces from old ones.

Example: Given a discrete (Ω, F, P) described by a pmf p, and an event A with P(A) > 0, define the pmf p_A:

p_A(ω) = p(ω)/P(A) = P({ω} | A) if ω ∈ A, 0 if ω ∉ A,

⇒ (Ω, F, P_A), where P_A(F) = Σ_{ω∈F} p_A(ω) = P(F | A). p_A is a conditional pmf.

E.g., p = the geometric pmf and A = {ω : ω ≥ K} = {K, K + 1, ...}:

p_A(k) = (1 − p)^{k−1} p / Σ_{l=K}^∞ (1 − p)^{l−1} p = (1 − p)^{k−1} p / (1 − p)^{K−1} = (1 − p)^{k−K} p; k = K, K + 1, K + 2, ...,

a geometric pmf shifted to begin at k = K.
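A numerical check of this conditioning example (a sketch; p = 0.3, K = 5, and the truncation point are arbitrary): the conditional pmf p_A(k) = p(k)/P(A) should match the shifted geometric (1 − p)^{k−K} p for k ≥ K.

    p, K = 0.3, 5
    geom = lambda k: (1 - p) ** (k - 1) * p             # geometric pmf on {1, 2, ...}

    PA = sum(geom(k) for k in range(K, 3000))           # P(A), A = {K, K+1, ...}; truncated sum
    print(PA, (1 - p) ** (K - 1))                       # closed form for P(A)

    for k in range(K, K + 4):
        print(k, geom(k) / PA, (1 - p) ** (k - K) * p)  # conditional pmf vs shifted geometric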
Define pmf pA: ⇒ (Ω, F , PA), where January 11, 2011 R.M. Gray E.g., p = geometric pmf and A = {ω : ω ≥ K} = {K, K + 1, . . .}. Conditional probability provides a means of constructing new probability spaces from old ones p(ω)/P(A) = P({ω}|A) , pA(ω) = 0 EE278: Introduction to Statistical Signal Processing, winter 2010–2011 = (1 − p)k−K p; k = K + 1, K + 2, . . . , ω∈A a geometric pmf that begins at k = K + 1 ω�A pA(ω) = P(F|A). ω∈F pA is a conditional pmf EE278: Introduction to Statistical Signal Processing, winter 2010–2011 January 11, 2011 R.M. Gray 95 EE278: Introduction to Statistical Signal Processing, winter 2010–2011 January 11, 2011 R.M. Gray 96 Example: Continuous (Ω, F , P), pdf f P(A) > 0 Define conditional fA: f (ω)/P(A) fA(ω) = 0 ω∈A ω∈A Interpretation: Exponential models bus arrival times. pdf for next arrival given you have waited an hour is the same as when you got to the bus stop. and conditional probability PA(F) = � fA(ω) dω = P(F|A). ω∈F E.g., f = exponential pdf, A = {r : r ≥ c} fA(x) = � ∞ λe−λx c λe−λy dy = λe−λx e−λc = λe−λ(x−c); x ≥ c, Like geometric, conditioned on x ≥ c, has same form, just shifted. EE278: Introduction to Statistical Signal Processing, winter 2010–2011 January 11, 2011 R.M. Gray 97 EE278: Introduction to Statistical Signal Processing, winter 2010–2011 January 11, 2011 R.M. Gray 98 Bayes Rule P(Fi ∩ G) P(G | Fi)P(Fi) = � P(G) j P(G ∩ F j) P(G | Fi)P(Fi) = � j P(G | F j)P(F j) P(Fi | G) = Total probability + conditional probability Recall total probability: If {Fi; i = 1, 2, . . .} is a partition of, then for any event G � P(G) = i Example of Bayes’ rule: Binary communication channel P(G ∩ Fi). Noisy binary communication channel: 0 or 1 is sent and 0 or 1 is received. Assume that 0 is sent with probability 0.2 (⇒ 1 is sent with probability 0.8) Suppose know P(Fi), all i — a priori probabilities of collection of events Suppose know P(G | Fi) The channel is noisy. If a 0 is sent, a 0 is received with probability 0.9, and if a 1 is sent, a 1 is received with probability 0.975 Find a posteriori probabilities P(Fi | G). Given you observe G, how does that change the probabilities of Fi? EE278: Introduction to Statistical Signal Processing, winter 2010–2011 January 11, 2011 R.M. Gray 99 EE278: Introduction to Statistical Signal Processing, winter 2010–2011 January 11, 2011 R.M. Gray 100 Can represent this channel model by a probability transition diagram Define A = {0 is sent} = {(0, 1), (0, 0)}, P({0} | {0}) = 0.9 P({0}) = 0.2 0 • ✘✘ 0 ✘✘✘ • ✘✘✘ ✘✘✘ ✘ ✘✘ P({1} | {0}) = 0.1 ✘ ✘ ✘✘ ✘✘ ✘✘✘ ✘✘ P({0} | {1}) = 0.025 ✘✘✘ ✘ ✘ ✘ ✘ ✘ ✘✘ 1 • • 0 P({1}) = 0.8 P({1} | {1}) = 0.975 B = {0 is received} = {(0, 0), (1, 0)} Probability measure is defined via the P(A), P(B|A), and P(Bc|Ac) provided on the probability transition diagram of the channel Problem: Given that 0 is received, find the probability that 0 was sent To find P(A|B), use Bayes rule P(A|B) = Sample space Ω = {(0, 0), (0, 1), (1, 0), (1, 1)}, points: (bit sent, bit received) ⇒ P(B|A)P(A) , P(A)P(B|A) + P(Ac)P(B|Ac) 0.9 × 0.2 = 0.9, 0.2 much higher than the a priori probability of A (= 0.2) EE278: Introduction to Statistical Signal Processing, winter 2010–2011 January 11, 2011 R.M. Gray 101 End of basic probability: Next: random variables, vectors, and processes. EE278: Introduction to Statistical Signal Processing, winter 2010–2011 January 11, 2011 R.M. Gray 103 P(A|B) = EE278: Introduction to Statistical Signal Processing, winter 2010–2011 January 11, 2011 R.M. Gray 102