Lecture 2: Probability and what it has to do with data analysis

Abstraction: the random variable
A random variable, x, has no set value until you "realize" it; its properties are described by a probability distribution, P. One way to think about it: imagine a pot containing an infinite number of x's, each occurring with probability p(x). Drawing one x from the pot "realizes" x.

Describing P
If x can take on only discrete values, say (1, 2, 3, 4, or 5), then a table works:

  x   1     2     3     4     5
  P   10%   30%   40%   15%   5%

(e.g., a 15% probability that x = 4). The probabilities should sum to 100%.

Sometimes you see probabilities written as fractions instead of percentages:

  x   1      2      3      4      5
  P   0.10   0.30   0.40   0.15   0.05

(a 0.15 probability that x = 4). The probabilities should sum to 1. And sometimes you see probabilities plotted as a histogram of P(x) against x.

If x can take on any value, then use a smooth function (or "distribution"), p(x), instead of a table. The probability that x is between x1 and x2 is proportional to the area under p(x) between those limits:

  P(x1 < x < x2) = ∫ from x1 to x2 of p(x) dx

The probability that x is between −∞ and +∞ is 100%, so the total area is 1:

  ∫ from −∞ to +∞ of p(x) dx = 1

One reason why all this is relevant
Any measurement of data that contains noise is treated as a random variable, d. The distribution p(d) embodies both the 'true value' of the datum being measured and the measurement noise. All quantities derived from a random variable are themselves random variables, so the algebra of random variables allows you to understand how measurement noise affects inferences made from the data.

Basic description of distributions
We want two basic numbers:
1) something that describes what x's commonly occur
2) something that describes the variability of the x's

1) Something that describes what x's commonly occur, that is, where the distribution is centered.

Mode: the x at which the distribution has its peak, x_mode, the most likely value of x. For example, the most popular car in the US is the Honda CR-V, but the next car you see on the highway will probably not be a
Honda CR-V. (Where's a CR-V?)

But modes can be deceptive. Consider 100 realizations of x, binned:

  bin  0-1  1-2  2-3  3-4  4-5  5-6  6-7  7-8  8-9  9-10
  N    3    18   11   8    11   14   8    7    11   9

Sure, the 1-2 bin has the most counts (x_mode falls there), but most of the measurements are bigger than 2!

Median: there is a 50% chance that x is smaller than x_median and a 50% chance that x is bigger. There is no special reason the median needs to coincide with the peak.

Expected value, or 'mean': the value you would get if you took the mean of lots of realizations of x. Let's examine a discrete distribution, for simplicity. Hypothetical table of 140 realizations of x:

  x   1    2    3    Total
  N   20   80   40   140

  mean = [20×1 + 80×2 + 40×3] / 140
       = (20/140)×1 + (80/140)×2 + (40/140)×3
       = P(1)×1 + P(2)×2 + P(3)×3
       = Σi P(xi) xi

By analogy, for a smooth distribution the expected (or mean) value of x is

  E(x) = ∫ from −∞ to +∞ of x p(x) dx

2) Something that describes the variability of the x's, that is, the width of the distribution.

Here's a perfectly sensible way to define the width of a distribution: W50, the width that contains the central 50% of the probability (with 25% beyond it on either side). It's not used much, though.

Here's another way: multiply p(x) by the parabola [x − E(x)]² and integrate. The idea is that if the distribution is narrow, then most of the probability lines up with the low spot of the parabola; but if it is wide, then some of the probability lines up with the high parts of the parabola. The total area is the variance:

  σ² = ∫ from −∞ to +∞ of [x − E(x)]² p(x) dx

Its square root, σ, is a measure of width, though we don't immediately know its relationship to area.

The Gaussian or normal distribution:

  p(x) = (1 / (√(2π) σ)) exp{ −(x − x̄)² / (2σ²) }

where x̄ is the expected value and σ² is the variance. Memorize me!
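The table-of-realizations computation above can be checked numerically. A minimal sketch, assuming numpy; the array names are mine, not from the lecture:

```python
import numpy as np

# Hypothetical table of 140 realizations: value x and count N
x = np.array([1.0, 2.0, 3.0])
N = np.array([20, 80, 40])

# Empirical probabilities P(x) = N / total
P = N / N.sum()

# Mean two ways: directly from the counts, and as sum_i P(x_i) x_i
mean_direct = (N * x).sum() / N.sum()
mean_prob = (P * x).sum()

# Variance as sum_i P(x_i) [x_i - E(x)]^2, the discrete analogue
# of the integral definition above
var = (P * (x - mean_prob) ** 2).sum()

print(mean_direct, mean_prob, var)  # both means equal 300/140 ≈ 2.143
```

Both routes to the mean agree, which is the point of the algebraic rearrangement in the slide.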
Examples of normal distributions: e.g., one with x̄ = 1 and σ = 1; another with x̄ = 3 and σ = 0.5.

Properties of the normal distribution: Expectation = Median = Mode = x̄, and 95% of the probability lies within 2σ of the expected value.

Again, why all this is relevant
Inference depends on data. You use a measurement, d, to deduce the value of some underlying parameter of interest, m; e.g., use measurements of travel time, d, to deduce the seismic velocity, m, of the earth. The model parameter, m, depends on the measurement, d, so m is a function of d, m(d). So if the data, d, is a random variable, then so is the model parameter, m. All inferences made from uncertain data are themselves uncertain. Model parameters are described by a distribution, p(m).

Functions of a random variable
Any function of a random variable is itself a random variable.

Special case: a linear relationship and a normal distribution. If p(d) is normal with mean d̄ and variance σd², and the relationship is linear, m = a d + b, then p(m) is normal with mean a d̄ + b and variance a² σd².

Multivariate distributions
Example: Liberty Island is inhabited by both pigeons and seagulls. 40% of the birds are pigeons and 60% of the birds are gulls. 50% of the pigeons are white and 50% are tan; 100% of the gulls are white.

Two variables: species, s, takes two values, pigeon (p) and gull (g); color, c, takes two values, white (w) and tan (t). Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, and 0 are tan gulls.

What is the probability that a bird (a random bird, that is) has species s and color c?

  P(s,c)   c = w   c = t
  s = p    20%     20%
  s = g    60%     0%

Note: the sum of all boxes is 100%. This is called the Joint Probability and is written P(s,c).

Two continuous variables, say x1 and x2, have a joint probability distribution, written p(x1, x2), with

  ∫∫ p(x1, x2) dx1 dx2 = 1

You would contour a joint probability distribution.

What is the probability that a bird has color c?
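The bird example's joint table can be held as a small array, and "realizing" a random bird is then a single draw from it. A minimal sketch, assuming numpy; the variable names are mine:

```python
import numpy as np

# Joint probability P(s, c): rows are species (pigeon, gull),
# columns are color (white, tan)
P_sc = np.array([[0.20, 0.20],
                 [0.60, 0.00]])

# A joint distribution must sum to 1 over all boxes
print(P_sc.sum())  # 1.0

# Realizing a random bird means drawing one (species, color) box
# with these probabilities
rng = np.random.default_rng(0)
box = rng.choice(4, p=P_sc.ravel())
species, color = divmod(box, 2)  # row and column of the drawn box
```

The tan-gull box has probability zero, so that combination is never drawn.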
Of 100 birds, 20 are white pigeons, 20 are tan pigeons, 60 are white gulls, and 0 are tan gulls. Start with P(s,c) and sum the columns:

  P(s,c)   c = w   c = t
  s = p    20%     20%
  s = g    60%     0%

to get P(c):

  P(c)     c = w   c = t
           80%     20%

What is the probability that a bird has species s? Start with P(s,c) and sum the rows to get P(s):

  P(s)     s = p   s = g
           40%     60%

These operations make sense with distributions, too:

  p(x1) = ∫ p(x1, x2) dx2    (distribution of x1, irrespective of x2)
  p(x2) = ∫ p(x1, x2) dx1    (distribution of x2, irrespective of x1)

Given that a bird is species s, what is the probability that it has color c?

  P(c|s)   c = w   c = t
  s = p    50%     50%
  s = g    100%    0%

Note: all rows sum to 100%. This is called the Conditional Probability of c given s and is written P(c|s).

Similarly, given that a bird is color c, what is the probability that it has species s?

  P(s|c)   c = w   c = t
  s = p    25%     100%
  s = g    75%     0%

So 25% of white birds are pigeons. Note: all columns sum to 100%. This is called the Conditional Probability of s given c and is written P(s|c).

Beware!
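The marginal and conditional tables above can all be reproduced from the joint table with a few lines of numpy. A minimal sketch; the variable names are mine:

```python
import numpy as np

# Joint P(s, c): rows species (pigeon, gull), columns color (white, tan)
P_sc = np.array([[0.20, 0.20],
                 [0.60, 0.00]])

# Marginals: sum over the variable you don't care about
P_c = P_sc.sum(axis=0)  # P(c) = [0.80, 0.20]
P_s = P_sc.sum(axis=1)  # P(s) = [0.40, 0.60]

# Conditionals: divide the joint by the appropriate marginal
P_c_given_s = P_sc / P_s[:, None]  # each row sums to 1
P_s_given_c = P_sc / P_c[None, :]  # each column sums to 1

print(P_s_given_c[0, 0])  # P(pigeon | white) = 0.25
```

Dividing by the row marginal normalizes rows, and dividing by the column marginal normalizes columns, matching the "rows sum to 100%" and "columns sum to 100%" observations above.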
Compare the two:

  P(c|s)   c = w   c = t        P(s|c)   c = w   c = t
  s = p    50%     50%          s = p    25%     100%
  s = g    100%    0%           s = g    75%     0%

A lot of errors occur from confusing the two. For example: the probability that, if you have pancreatic cancer, you will die from it is about 90%; the probability that, if you die, you will have died of pancreatic cancer is about 1.4%. (Actor Patrick Swayze was a pancreatic cancer victim.)

Note that P(s,c) = P(s|c) P(c): e.g., 25% of the 80% of birds that are white gives the 20% that are white pigeons. And P(s,c) = P(c|s) P(s): e.g., 50% of the 40% of birds that are pigeons gives the 20% that are white pigeons.

Since P(s,c) = P(s|c) P(c) = P(c|s) P(s), then

  P(s) = Σc P(s,c) = Σc P(s|c) P(c)
  P(c) = Σs P(s,c) = Σs P(c|s) P(s)

Continuous versions:

  p(s) = ∫ p(s,c) dc = ∫ p(s|c) p(c) dc
  p(c) = ∫ p(s,c) ds = ∫ p(c|s) p(s) ds

Also, since P(s,c) = P(s|c) P(c) = P(c|s) P(s), then

  P(s|c) = P(c|s) P(s) / P(c)
  P(c|s) = P(s|c) P(c) / P(s)

which is called Bayes' Theorem.

In this example, bird color is the observable, the "data", d; bird species is the "model parameter", m. P(c|s), "color given species", or P(d|m), is making a prediction based on the model: given a pigeon, what is the probability that it's white? P(s|c), "species given color", or P(m|d), is making an inference from the data: given a white bird, what is the probability that it's a pigeon?

Bayes' Theorem with data d and model m:

  P(m|d) = P(d|m) P(m) / P(d) = P(d|m) P(m) / Σi P(d|mi) P(mi)

Bayesian Inference: interpret P(m) as our knowledge of m before measuring d. Then P(m|d) is our updated state of knowledge after measuring d.

Example of Bayesian inference. Scenario: the body of a man is brought to the morgue. The coroner wants to know, "did the man die of pancreatic cancer?". Thus there is one model parameter, m, which takes one of two values: Y (he died of pancreatic cancer) and N (he didn't). Before examining the body, the best estimate of P(m) that can be made is P(Y) = 0.014 and P(N) = 0.986, the rate of death by pancreatic cancer in the general population.
Now the coroner performs a test for pancreatic cancer, giving one datum, d, which is either positive (+) or negative (−). But the test is not perfect. It has a non-zero rate of both false positives (didn't have cancer but tested +) and false negatives (did have cancer but tested −), as quantified by the conditional distribution P(d|m):

  P(d|m)   m = Y    m = N
  d = +    0.995    0.005
  d = -    0.005    0.995

Then

  P(Y|+) = P(+|Y) P(Y) / [P(+|Y) P(Y) + P(+|N) P(N)]
         = 0.995×0.014 / [0.995×0.014 + 0.005×0.986]
         = 0.74, or 74%

A 74% chance that the person died of pancreatic cancer is not all that conclusive!

Why Bayes' Theorem is important
It provides a framework for relating making a prediction from the model, P(d|m), to making an inference from the data, P(m|d). Bayes' Theorem also implies that the joint distribution of data and model parameters, p(d,m), is the fundamental quantity: if you know p(d,m), you know everything there is to know.

Expectation, variance, and covariance of a multivariate distribution

The expectation is computed by first reducing the distribution to one dimension: form the marginal p(x1) and take its expectation to get x̄1; form the marginal p(x2) and take its expectation to get x̄2.

The variance is also computed by first reducing the distribution to one dimension: take the variance of p(x1) to get σ1², and take the variance of p(x2) to get σ2².

Note that in some distributions, if x1 is bigger than x̄1, then x2 tends to be bigger than x̄2, and if x1 is smaller than x̄1, then x2 tends to be smaller than x̄2. This is a positive correlation. Conversely, in other distributions, if x1 is bigger than x̄1, then x2 tends to be smaller than x̄2, and if x1 is smaller than x̄1, then x2 tends to be bigger than x̄2. This is a negative correlation. (Here x̄1 and x̄2 denote the expected values.) This correlation can be quantified by multiplying the distribution by a four-quadrant function, one that is positive where the deviations from the means have the same sign and negative where they have opposite signs, and then integrating.
The function (x1 − x̄1)(x2 − x̄2) works fine:

  C = ∫∫ (x1 − x̄1)(x2 − x̄2) p(x1, x2) dx1 dx2

This is called the "covariance".

Note that the matrix C with elements

  Cij = ∫∫ (xi − x̄i)(xj − x̄j) p(x1, x2) dx1 dx2

has diagonal elements σi², the variance of xi, and off-diagonal elements cov(xi, xj), the covariance of xi and xj. For three variables:

      [ σ1²          cov(x1,x2)   cov(x1,x3) ]
  C = [ cov(x1,x2)   σ2²          cov(x2,x3) ]
      [ cov(x1,x3)   cov(x2,x3)   σ3²        ]

The "vector of means" of a multivariate distribution, x̄, and the "covariance matrix" of a multivariate distribution, Cx, summarize a lot (but not everything) about that distribution.

Functions of a set of random variables
Collect N random variables in a vector, x. Special case: a linear function, y = M x. Then the expectation of y is

  ȳ = M x̄    (Memorize!)

and the covariance of y is

  Cy = M Cx Mᵀ    (Memorize!)

Note that these rules hold regardless of the distribution of x. If y is linearly related to x, y = M x, then

  ȳ = M x̄          (rule for means)
  Cy = M Cx Mᵀ     (rule for propagating error)

Memorize!
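The two rules can be exercised directly. A minimal sketch, assuming numpy; the particular M, means, and covariance here are illustrative choices of mine, not from the lecture:

```python
import numpy as np

# Illustrative example: two uncorrelated variables with variances
# 1 and 4, pushed through a linear map M
x_mean = np.array([1.0, 2.0])
Cx = np.array([[1.0, 0.0],
               [0.0, 4.0]])
M = np.array([[1.0, 1.0],
              [1.0, -1.0]])

# Rule for means: y_mean = M x_mean
y_mean = M @ x_mean

# Rule for propagating error: Cy = M Cx M^T
Cy = M @ Cx @ M.T

print(y_mean)  # [ 3. -1.]
print(Cy)      # [[ 5. -3.]
               #  [-3.  5.]]
```

Note that although x1 and x2 are uncorrelated (Cx is diagonal), the outputs y1 and y2 are correlated (Cy has non-zero off-diagonal elements): a linear map can introduce covariance.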