Review of probability
Nuno Vasconcelos, UCSD

Probability
• probability is the language for dealing with processes that are non-deterministic
• examples:
  – if I flip a coin 100 times, how many heads can I expect to see?
  – what is the weather going to be like tomorrow?
  – are my stocks going to be up or down?
  – am I in front of a classroom, or is this just a picture of it?

Sample space
• the most important concept is that of a sample space
• our process defines a set of events – these are the outcomes or states of the process
• example:
  – we roll a pair of dice
  – call the value on the up face at the nth toss xn
  – note that possible events such as "odd number on the second throw", "two sixes", or "x1 = 2 and x2 = 6" can all be expressed as combinations of the sample space events
  [figure: 6×6 grid of the outcomes (x1, x2), with x1 and x2 each ranging from 1 to 6]

Sample space
• the sample space is the list of possible events that satisfies the following properties:
  – finest grain: all possible distinguishable events are listed separately
  – mutually exclusive: if one event happens the others do not (if x1 = 5 it cannot be anything else)
  – collectively exhaustive: any possible outcome can be expressed as a union of sample space events
• the mutually exclusive property simplifies the calculation of the probability of complex events
• collectively exhaustive means that there is no possible outcome to which we cannot assign a probability

Probability measure
• probability of an event: a number expressing the chance that the event will be the outcome of the process
• a probability measure satisfies three axioms:
  – P(A) ≥ 0 for any event A
  – P(universal event) = 1
  – if A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
• e.g.
  – P(x1 ≥ 0) = 1
  – P(x1 even ∪ x1 odd) = P(x1 even) + P(x1 odd)

Probability measure
• the last axiom, combined with the mutually exclusive property of the sample space, allows us to easily assign probabilities to all possible events
• back to our dice example:
  – suppose that the probability of any pair (x1, x2) is 1/36
  – we can then compute the probabilities of all "union" events
  – P(x2 odd) = 18 × 1/36 = 1/2
  – P(U) = 36 × 1/36 = 1
  – P(two sixes) = 1/36
  – P(x1 = 2 and x2 = 6) = 1/36

Probability measure
• note that there are many ways to define the universal event U
  – e.g. A = {x2 odd}, B = {x2 even}, U = A ∪ B
  – on the other hand, U = (1,1) ∪ (1,2) ∪ (1,3) ∪ … ∪ (6,6)
• the facts that the sample space is finest grain, mutually exclusive, and collectively exhaustive, together with the measure axioms, make the whole procedure consistent

Random variables
• a random variable X is a function that assigns a real value to each sample space event
• we have already seen one such function: PX(x1, x2) = 1/36 for all (x1, x2)
• notation:
  – specify both the random variable and the value that it takes in your probability statements
  – we do this by writing the random variable as a subscript and the value as the argument:
    PX(x1, x2) = 1/36 means Prob[X = (x1, x2)] = 1/36
  – without this, probability statements can be hopelessly confusing

Random variables
• two types of random variables: discrete and continuous
  – this really refers to the type of values the RV can take
• if it can take only one of a finite set of possibilities, we call it discrete
  – this is the dice example we saw: there are only 36 possibilities (see the quick sketch below)
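As a quick check (not part of the original slides), the dice probabilities above can be verified by brute-force enumeration of the sample space; a minimal Python sketch:

```python
from itertools import product

# sample space: all 36 equally likely (x1, x2) pairs from rolling two dice
outcomes = list(product(range(1, 7), repeat=2))
p = 1 / len(outcomes)  # probability of each elementary event, 1/36

def prob(event):
    """Probability of an event, given as a predicate over (x1, x2)."""
    return sum(p for (x1, x2) in outcomes if event(x1, x2))

print(prob(lambda x1, x2: x2 % 2 == 1))          # P(x2 odd)          = 0.5
print(prob(lambda x1, x2: x1 == 6 and x2 == 6))  # P(two sixes)       = 1/36
print(prob(lambda x1, x2: x1 == 2 and x2 == 6))  # P(x1 = 2, x2 = 6)  = 1/36
print(prob(lambda x1, x2: True))                 # P(U)               = 1
```

Because the sample space events are mutually exclusive and each has probability 1/36, the probability of any compound event is just the count of matching outcomes times 1/36.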
Random variables
• if a random variable can take values in a real interval, we say that it is continuous
• e.g. consider the sample space of weather temperature
  – we know that it could be any number between -50 and 150 degrees
  – random variable T ∈ [-50, 150]
  – note that the extremes do not have to be very precise; we can just say that P(T < -45°) = 0
• most probability notions apply equally well to discrete and continuous random variables

Discrete RV
• for a discrete RV the probability assignments are given by a probability mass function (PMF)
  – this can be thought of as a normalized histogram
  – it satisfies the following properties:
    0 ≤ PX(a) ≤ 1, ∀a
    ∑a PX(a) = 1
• example: the random variable X ∈ {1, 2, 3, …, 20}, where X = i if the grade of student z in the class is between 5i and 5(i+1)
  – from the histogram we can read off, e.g., PX(14) = α

Continuous RV
• for a continuous RV the probability assignments are given by a probability density function (PDF)
  – this is just a continuous function
  – it satisfies the following properties:
    0 ≤ PX(a), ∀a
    ∫ PX(a) da = 1
• example: the Gaussian random variable of mean µ and variance σ²
    PX(a) = (1 / (√(2π) σ)) exp{ −(a − µ)² / (2σ²) }

Discrete vs continuous RVs
• in general the two cases are handled the same way, up to replacing summations by integrals
• note that PDF means "density of probability"
  – this is probability per unit of the variable
  – the probability of any particular value is always zero (unless there is a discontinuity there)
  – we can only talk about
    Pr(a ≤ X ≤ b) = ∫[a,b] PX(t) dt
  – note also that PDFs are not upper bounded
  – e.g. the Gaussian tends to a Dirac delta as the variance goes to zero

Multiple random variables
• frequently we have problems with multiple random variables
  – e.g. at the doctor's office, you are mostly a collection of random variables:
    x1: temperature, x2: blood pressure, x3: weight, x4: cough, …
• we can summarize this as
  – a vector X = (x1, …, xn) of n random variables
  – PX(x1, …, xn) is the joint probability distribution

Marginalization
• P(cold)?
• an important notion for multiple random variables is marginalization
  – e.g. having a cold does not depend on blood pressure and weight; all that matters are fever and cough
  – that is, we need to know PX1,X4(a, b)
• we marginalize with respect to a subset of variables (in this case X1 and X4)
  – this is done by summing (or integrating) the others out (a short numerical sketch follows below)
    PX1,X4(x1, x4) = ∑x2 ∑x3 PX1,X2,X3,X4(x1, x2, x3, x4)
    PX1,X4(x1, x4) = ∫∫ PX1,X2,X3,X4(x1, x2, x3, x4) dx2 dx3

Conditional probability
• PY|X(sick | cough)?
• another very important notion:
  – so far, the doctor has PX1,X4(fever, cough)
  – this still does not allow a diagnosis
  – for this we need a new variable Y with two states, Y ∈ {sick, not sick}
  – the doctor measures fever and cough levels; these are no longer unknowns, or even random quantities
  – the question of interest is "what is the probability that the patient is sick given the measured values of fever and cough?"
• this is exactly the definition of conditional probability
  – what is the probability that Y takes a given value, given the observations for X?
    PY|X1,X4(sick | 98, high)

Conditional probability
• note the very important difference between conditional and joint probability
• joint probability is a hypothetical question with respect to all variables
  – what is the probability that you will be sick and cough a lot?
    PY,X(sick, cough)?
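Before looking at how conditional probabilities are computed, here is a minimal sketch of the marginalization step above (not part of the original slides); the joint PMF over the four variables is made up and arbitrary, and summing out the x2 and x3 axes leaves PX1,X4:

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up joint PMF P_{X1,X2,X3,X4} on a small discrete grid
# (axis order: x1, x2, x3, x4); values are arbitrary, normalized to sum to 1
joint = rng.random((4, 3, 3, 2))
joint /= joint.sum()

# marginalize: sum x2 and x3 out to obtain P_{X1,X4}
p_x1_x4 = joint.sum(axis=(1, 2))

print(p_x1_x4.shape)   # (4, 2)
print(p_x1_x4.sum())   # 1.0 -- still a valid probability distribution

# marginalizing further gives single-variable marginals, e.g. P_{X1}
p_x1 = p_x1_x4.sum(axis=1)
print(p_x1)
```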
Conditional probability
• conditional probability means that you know the values of some variables
  – what is the probability that you are sick given that you cough a lot?
    PY|X(sick | cough)?
  – "given" is the key word here
  – conditional probability is very important because it allows us to structure our thinking
  – it shows up again and again in the design of intelligent systems

Conditional probability
• fortunately it is easy to compute: we simply normalize the joint by the probability of what we know
    PY|X1(sick | 98) = PY,X1(sick, 98) / PX1(98)
• note that this makes sense, since
    PY|X1(sick | 98) + PY|X1(not sick | 98) = 1
  – and, by the marginalization equation,
    PY,X1(sick, 98) + PY,X1(not sick, 98) = PX1(98)
• the definition of conditional probability just makes these two statements coherent
  – it simply says that, given what we know, we still have a valid probability measure
  – the universal event {sick} ∪ {not sick} still has probability 1 after the observation

The chain rule of probability
• an important consequence of the definition of conditional probability
  – note that, from this definition,
    PY,X1(y, x1) = PY|X1(y | x1) PX1(x1)
  – more generally, it has the form
    PX1,X2,…,Xn(x1, x2, …, xn) = PX1|X2,…,Xn(x1 | x2, …, xn) × PX2|X3,…,Xn(x2 | x3, …, xn) × … × PXn−1|Xn(xn−1 | xn) × PXn(xn)
• combined with marginalization, it allows us to turn hard probability questions into simple ones

The chain rule of probability
• e.g. what is the probability that you will be sick and have 104° of fever?
    PY,X1(sick, 104) = PY|X1(sick | 104) PX1(104)
  – this breaks down a hard question (probability of sick and 104°) into two easier questions
  – Prob(sick | 104): everyone knows that this is close to one. You have a cold! PY|X1(sick | 104) = 1!

The chain rule of probability
• e.g. what is the probability that you will be sick and have 104° of fever?
    PY,X1(sick, 104) = PY|X1(sick | 104) PX1(104)
  – Prob(104): still hard, but easier than P(sick, 104), since we now have only one random variable (temperature)
  – it does not depend on sickness; it is just the question "what is the probability that someone will have a 104° fever?"
  – gather a number of people, measure their temperatures, and make a histogram that everyone can use after that

The chain rule of probability
• in fact, the chain rule is so handy that most of the time we use it to compute probabilities
  – e.g.
    PY(sick) = ∫ PY,X1(sick, t) dt              (marginalization)
             = ∫ PY|X1(sick | t) PX1(t) dt
  – in this way we can get away with knowing only
    PX1(t), which we may know because it was needed for some other problem, and
    PY|X1(sick | t), which we can ask a doctor about or approximate with a rule of thumb:
      PY|X1(sick | t) ≈ 1 if t > 102, 0.5 if 98 < t < 102, 0 if t < 98
  (a short numerical sketch of this computation follows below)

Independence
• another fundamental concept for multiple variables
  – two variables are independent if the joint is the product of the marginals
    PX1,X2(a, b) = PX1(a) PX2(b)
  – note: this implies that
    PX1|X2(a | b) = PX1,X2(a, b) / PX2(b) = PX1(a)
  – which captures the intuitive notion: "if X1 is independent of X2, knowing X2 does not change the probability of X1"
  – e.g. knowing that it is sunny does not change the probability that it will rain in three months
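Tying the chain rule and marginalization slides together, here is a minimal sketch of the PY(sick) computation above (not from the original slides); the temperature histogram is made up for illustration, and the conditional is the rule of thumb from the slide:

```python
import numpy as np

# discretized temperature values and a made-up (hypothetical) histogram P_{X1}(t)
temps = np.arange(96, 106)                                    # 96, 97, ..., 105 degrees
counts = np.array([5, 20, 30, 20, 10, 6, 4, 2, 2, 1], dtype=float)
p_t = counts / counts.sum()                                   # normalized histogram -> PMF

# rule-of-thumb conditional from the slides: P_{Y|X1}(sick | t)
def p_sick_given_t(t):
    if t > 102:
        return 1.0
    if t > 98:
        return 0.5
    return 0.0

# chain rule + marginalization: P_Y(sick) = sum_t P_{Y|X1}(sick | t) P_{X1}(t)
p_sick = sum(p_sick_given_t(t) * p for t, p in zip(temps, p_t))
print(p_sick)
```

The point of the decomposition is visible in the code: the histogram p_t can be reused for any question about temperature, and only the cheap conditional has to be supplied for this particular diagnosis.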
Independence
• extremely useful in the design of intelligent systems
  – frequently, knowing X makes Y independent of Z
  – e.g. consider the shivering symptom: if you have a fever you sometimes shiver, and shivering is a symptom of having a cold, but once you measure the temperature the two become independent
    PY,X1,S(sick, 98, shiver) = PY|X1,S(sick | 98, shiver) × PS|X1(shiver | 98) × PX1(98)
                              = PY|X1(sick | 98) × PS|X1(shiver | 98) × PX1(98)
• this simplifies considerably the estimation of the probabilities

Independence
• useful property: if you add two independent random variables, their probability distributions convolve
  – i.e. if z = x + y and x, y are independent, then
    PZ(z) = PX(z) * PY(z)
    where * is the convolution operator
  – for discrete random variables
    PZ(z) = ∑k PX(k) PY(z − k)
  – for continuous random variables
    PZ(z) = ∫ PX(t) PY(z − t) dt

Moments
• important properties of random variables that summarize the distribution
• important moments:
  – mean: µ = E[x]
  – variance: σ² = E[(x − µ)²]
  – various distributions are completely specified by a small number of moments
• mean:     µ = ∑k PX(k) k             (discrete)     µ = ∫ PX(k) k dk             (continuous)
• variance: σ² = ∑k PX(k) (k − µ)²     (discrete)     σ² = ∫ PX(k) (k − µ)² dk     (continuous)

Mean
• µ = E[x] is the center of mass of the distribution
  – discrete: µ = ∑k PX(k) k
  – continuous: µ = ∫ PX(k) k dk
• it is a linear quantity
  – if Z = X + Y, then E[Z] = E[X] + E[Y]
  – this does not require any special relation between X and Y; it always holds
• other moments are means of powers of X
  – the nth order moment is E[Xⁿ]
  – the nth central moment is E[(X − µ)ⁿ]

Variance
• σ² = E[(x − µ)²] measures the dispersion around the mean
  – it is the second central moment
  – discrete: σ² = ∑k PX(k) (k − µ)²
  – continuous: σ² = ∫ PX(k) (k − µ)² dk
• in general, it is not linear
  – if Z = X + Y, then Var[Z] = Var[X] + Var[Y] only holds if X and Y are independent
• it is related to the 2nd order moment by
    σ² = E[(x − µ)²] = E[x² − 2xµ + µ²] = E[x²] − 2E[x]µ + µ² = E[x²] − µ²
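To close, a minimal sketch (not from the original slides) that ties the last few ideas together for the sum of two fair dice: the PMF of the sum is the convolution of the two individual PMFs, the means always add, and, because the dice are independent, the variances add as well:

```python
import numpy as np

# PMF of a single fair die: values 1..6, each with probability 1/6
values = np.arange(1, 7)
p_die = np.full(6, 1 / 6)

# sum of two independent dice: PMF is the convolution of the two PMFs
p_sum = np.convolve(p_die, p_die)   # probabilities for sums 2..12
sums = np.arange(2, 13)

mean_die = np.sum(values * p_die)                     # 3.5
var_die = np.sum((values - mean_die) ** 2 * p_die)    # ~2.9167

mean_sum = np.sum(sums * p_sum)                       # 7.0    = 3.5 + 3.5
var_sum = np.sum((sums - mean_sum) ** 2 * p_sum)      # ~5.833 = 2 * var_die

print(p_sum)                 # 1/36, 2/36, ..., 6/36, ..., 2/36, 1/36
print(mean_sum, var_sum)
```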