Lecture 3. Conditional probability. Discrete and continuous random variables. Binomial, uniform and normal distributions. Mathematical expectation and probability density.

Discrete random variables

Let's return to the first example considered in Lecture 1: you toss a coin. There are two outcomes, 'heads' and 'tails', and each occurs with probability 0.5. The observed side of the coin is a random variable: it takes the values 'tails' and 'heads', each with probability 0.5.

Def. A random variable is a characteristic, measurement, or count that changes randomly according to some set of probabilities; it is denoted X, Y, Z, and so on. A list of all possible values of a random variable, together with their probabilities, is called a probability distribution. In short, for the coin we can write X = {0, 1} (say 0 for tails and 1 for heads).

Random variables can be either discrete or continuous. Discrete data can take only certain values (such as 1, 2, 3, 4, 5), while continuous data can take any value within a range (such as a person's height).

Probability of sets

Example. Suppose a discrete random variable T is the number of correct answers a student gets on a test of 5 questions, i.e. an integer in the set {0, 1, 2, 3, 4, 5}. We might be interested in the probability that the student gets an even number of questions correct, or fewer than 2, or more than 3, or between 3 and 4, and so on. Whenever a complex outcome can be written as the union of disjoint (non-overlapping) outcomes, the probability of the complex outcome is the sum of the probabilities of the disjoint outcomes.

The addition rule for disjoint unions is really a special case of the general rule for the probability that the outcome of an experiment falls in a set that is the union of two other sets. Using the 5-question test example, define the event E = {T : 1 ≤ T ≤ 3}, read as "all values of the outcome T such that 1 ≤ T ≤ 3"; of course E = {1, 2, 3}. Now define F = {T : 2 ≤ T ≤ 4}, so F = {2, 3, 4}. The union of these sets, written E ∪ F, is the set of outcomes {1, 2, 3, 4}. To find P(E ∪ F) we could try adding P(E) + P(F), but we would then double-count the elementary events common to the two sets, namely {2} and {3}; the correct solution is to add first and then subtract for the double counting. We define the intersection of two sets as the elements they have in common, written E ∩ F = {2, 3}. The rule for the probability of the union of two sets is then

P(E ∪ F) = P(E) + P(F) − P(E ∩ F).

For our example (with P(1) = 0.26, P(2) = 0.14, P(3) = 0.21, P(4) = 0.24 and P(5) = 0.05), P(E ∪ F) = 0.61 + 0.59 − 0.35 = 0.85, which matches the direct calculation P({1, 2, 3, 4}) = 0.26 + 0.14 + 0.21 + 0.24.

Conditional probability

Another important concept is conditional probability.

Example. We might want to calculate the probability that a random student gets an odd number of questions correct while ignoring those students who score over 4 points. This is usually described as finding the probability of an odd number given T ≤ 4. The notation is P(T is odd | T ≤ 4), where the vertical bar is pronounced "given". For this example we are excluding the 5% of students who score a perfect 5 on the test. Our new sample space must be "renormalized" so that its probabilities add up to 100%. We can do this by replacing each probability by the old probability divided by the probability of the reduced sample space, which in this case is 1 − 0.05 = 0.95.
Because the old probabilities of the elementary outcomes in the new set of interest, {0, 1, 2, 3, 4}, add up to 0.95, if we divide each by 0.95 (making it slightly bigger) we get a new set of 5 (instead of 6) probabilities that add up to 1.00. We can then use these new probabilities to find that the probability of interest is 0.26/0.95 + 0.21/0.95 ≈ 0.495. Or we can use a new probability rule:

P(E | F) = P(E ∩ F) / P(F).

In our current example we have

P(T ∈ {1, 3, 5} | T ≤ 4) = P(T ∈ {1, 3, 5} and T ≤ 4) / P(T ≤ 4) = P(T ∈ {1, 3}) / (1 − P(T = 5)) = (0.26 + 0.21) / 0.95 ≈ 0.495.

Example. In a population of a particular species of snail, individuals exhibit different forms. It is known that 45% have a pink background coloring, while 55% have a yellow background coloring. In addition, 30% of individuals are striped, and 20% of the population are pink and striped.
1. Is the presence or absence of striping independent of background color?
2. Given that a snail is pink, what is the probability that it will have stripes?

Define the events A and B, that a snail has a pink (respectively yellow) background coloring, and S, the event that it has stripes. Then we are told P(A) = 0.45, P(B) = 0.55, P(S) = 0.3, and P(A ∩ S) = 0.2. For part (1), note that 0.2 = P(A ∩ S) ≠ 0.135 = P(A) · P(S), so the events A and S are not independent. For part (2),

P(S | A) = P(A ∩ S) / P(A) = 0.2 / 0.45 ≈ 0.44.

Thus, knowledge that a snail has a pink background coloring increases the probability that it is striped. (That P(S | A) ≠ P(S) also establishes that background coloring and the presence of stripes are not independent.)

Example. In a medical setting we might want to calculate the probability that a person has a disease D given that they have a specific symptom S, i.e. we want to calculate P(D | S). This is a hard probability to assign, as we would need to take a random sample of people from the population with the symptom. A probability that is much easier to assign is P(S | D), the probability that a person with the disease has the symptom; this can be estimated much more easily because medical records are kept for people with serious diseases. The power of Bayes' rule is its ability to take P(S | D) and produce P(D | S). We have already seen a version of Bayes' rule:

P(E | F) = P(E ∩ F) / P(F).

Using the multiplication law, P(E ∩ F) = P(F | E) · P(E), we can rewrite this as

P(E | F) = P(F | E) · P(E) / P(F).

In our example, suppose P(S | D) = 0.12, P(D) = 0.01 and P(S) = 0.03. Then P(D | S) = 0.12 · 0.01 / 0.03 = 0.04.
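As a quick numerical check of these rules, here is a minimal Python sketch (an illustration, not part of the lecture). It assumes the test-score distribution used in the examples above, P(1) = 0.26, P(2) = 0.14, P(3) = 0.21, P(4) = 0.24, P(5) = 0.05, with P(0) = 0.10 filled in so that the probabilities sum to 1; the helper name `prob` is introduced here for convenience.

```python
# Test-score distribution assumed from the worked examples above.
dist = {0: 0.10, 1: 0.26, 2: 0.14, 3: 0.21, 4: 0.24, 5: 0.05}

def prob(event):
    """Probability of a set of outcomes under the distribution."""
    return sum(dist[t] for t in event)

E = {1, 2, 3}
F = {2, 3, 4}

# Addition rule: P(E U F) = P(E) + P(F) - P(E n F)
p_union = prob(E) + prob(F) - prob(E & F)
print(p_union)                      # 0.85, the same as prob(E | F) (set union) computed directly

# Conditional probability: P(T odd | T <= 4) = P(T in {1, 3}) / P(T <= 4)
odd = {1, 3, 5}
le4 = {0, 1, 2, 3, 4}
print(prob(odd & le4) / prob(le4))  # ~0.495

# Bayes' rule for the disease example: P(D|S) = P(S|D) * P(D) / P(S)
p_S_given_D, p_D, p_S = 0.12, 0.01, 0.03
print(p_S_given_D * p_D / p_S)      # 0.04
```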
Discrete distributions

Binomial distribution

Let's return to the example considered at the beginning of the lecture: you flip a fair coin 10 times and count the number of heads X. How can we describe the rule that defines the probabilities of the possible values of X? This is the classical example of a random variable with a binomial probability distribution.

Def. A random variable has a binomial distribution if all of the following conditions are met:
1. There is a fixed number of trials (n).
2. Each trial has two possible outcomes: success or failure.
3. The probability of success (p) is the same for each trial.
4. The trials are independent, meaning the outcome of one trial does not influence that of any other.

Let X equal the total number of successes in n trials; if all of the above conditions are met, X has a binomial distribution with probability of success p, and X takes values from 0 to n. Let's check our example:
1. Is there a fixed number of trials? You are flipping the coin 10 times, which is a fixed number. Condition 1 is met, and n = 10.
2. Does each trial have only two possible outcomes, success or failure? The outcome of each flip is either heads or tails, and you are counting the number of heads, so flipping a head represents success and flipping a tail a failure. Condition 2 is met.
3. Is the probability of success the same for each trial? Because the coin is fair, the probability of success (getting a head) is p = ½ for each trial; the probability of failure (getting a tail) is 1 − ½ = ½ on each trial. Condition 3 is met.
4. Are the trials independent? We assume the coin is flipped the same way each time, so the outcome of one flip does not affect the outcome of subsequent flips. Condition 4 is met.

Because the coin-flipping example meets the four conditions, the random variable X, which counts the number of heads in 10 trials, has a binomial distribution with n = 10 and p = ½.

After you identify that X has a binomial distribution (the four conditions are met), you will likely want to find probabilities for X. The probability that the number of successes X equals k is

P(X = k) = C(n, k) · p^k · (1 − p)^(n − k),

where n is the fixed number of trials, k is the specified number of successes, n − k is the number of failures, p is the probability of success on any given trial, and 1 − p is the probability of failure on any given trial. The number of ways to arrange k successes among n trials is called "n choose k", with notation C(n, k). For example, C(3, 2), "3 choose 2", is the number of ways to get 2 successes in 3 trials. In general,

C(n, k) = n! / (k! (n − k)!).

The notation n! stands for n factorial, the number of ways to arrange n items. To calculate n!, you multiply n(n − 1)(n − 2) ... (1). For example, 3! = 3 · 2 · 1 = 6, 2! = 2 · 1 = 2, and 1! = 1. By convention, 0! = 1. To calculate "3 choose 2":

C(3, 2) = 3! / (2! (3 − 2)!) = (3 · 2 · 1) / (2 · 1 · 1) = 3.

Example. Suppose you cross three traffic lights on your way to work, and the probability of each of them being red is 0.3 (assume the lights are independent). Let X be the number of red lights you encounter; we want the probability distribution of X. We know p = probability of a red light = 0.3, 1 − p = probability of a non-red light = 1 − 0.3 = 0.7, and the number of non-red lights is 3 − X. Using the formula, the probabilities for X = 0, 1, 2 and 3 red lights are:

P(X = 0) = C(3, 0) · 0.3^0 · 0.7^3 = 1 · 1 · 0.343 = 0.343
P(X = 1) = C(3, 1) · 0.3^1 · 0.7^2 = 3 · 0.3 · 0.49 = 0.441
P(X = 2) = C(3, 2) · 0.3^2 · 0.7^1 = 3 · 0.09 · 0.7 = 0.189
P(X = 3) = C(3, 3) · 0.3^3 · 0.7^0 = 1 · 0.027 · 1 = 0.027

Def. The mean of a random variable is the long-run average of its possible values over the entire population of individuals (or trials). It is found by taking the weighted average of the X-values, weighted by their probabilities, and is denoted m. For a binomial random variable the mean is m = n · p.

Example. Suppose you flip a fair coin 100 times and let X be the number of heads; this is a binomial random variable with n = 100 and p = 0.5. Its mean is n · p = 100 · 0.5 = 50.
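To check the binomial formula and the traffic-light example numerically, here is a small Python sketch (illustrative only; the function name `binom_pmf` is introduced here). Python's `math.comb` computes C(n, k) directly.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k) for a binomial random variable."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Traffic-light example: n = 3 independent lights, each red with probability p = 0.3.
for k in range(4):
    print(k, round(binom_pmf(k, 3, 0.3), 3))   # 0.343, 0.441, 0.189, 0.027

# Mean of the 100-flip example: m = n * p.
n, p = 100, 0.5
print(n * p)                                   # 50.0
```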
Def. The variance of a random variable X is the weighted average of the squared deviations (distances) of its values from the mean; it is denoted σ². The variance of the binomial distribution is

σ² = n · p · (1 − p),

and the standard deviation of X is the square root of the variance, in this case σ = √(n · p · (1 − p)).

Example. Suppose you flip a fair coin 100 times and let X be the number of heads. The variance of X is n · p · (1 − p) = 100 · 0.5 · (1 − 0.5) = 25, and the standard deviation is the square root, which is 5.

The mean and variance of a binomial have intuitive meaning. The value p is the probability of a success, but it also represents the proportion of successes you can expect in n trials; therefore the total number of successes you can expect, that is, the mean of X, equals n · p. The only variability in the outcome of each trial is between success (with probability p) and failure (with probability 1 − p); over n trials, it makes sense that the variability of the number of successes is measured by n · p · (1 − p).

How can we calculate the probability that the total number of heads will be no greater than some value? Suppose, for instance, you would like the probability that the number of observed heads is at most 7. For this purpose we use the concept of the cumulative distribution function.

Def. The cumulative distribution function is F_X(x) = P(X ≤ x), the probability that X takes a value less than or equal to x. In the case of the binomial distribution,

F_X(x) = Σ_{k=0}^{x} P(X = k) = Σ_{k=0}^{x} C(n, k) · p^k · (1 − p)^(n − k).

Bernoulli distribution

Example. You flip a coin once (n = 1), and a 'head' counts as success. How do we describe the random variable X?

Def. The Bernoulli distribution is the particular case of the binomial distribution with n = 1. If X has a Bernoulli distribution, its mean is m = p and its variance is σ² = p(1 − p).

Uniform distribution

Example. So far we have considered tossing a coin, where a single throw has only two outcomes. Now suppose you roll a die. There are 6 outcomes: 1, 2, 3, 4, 5, 6, the score shown on the upturned face. Let the random variable X be this score. The probabilities of all the outcomes (values of X) are equal. This is an instance of a random variable with a uniform distribution.

Def. A random variable X has a uniform distribution on the interval [a, b] if every value of X between a and b is equally likely. For our example we can take a = 0 and b = 6, and the probability of each result is 1/6.

The probability that a discrete uniform random variable X equals k is

P(X = k) = 1 / (b − a) for a ≤ k ≤ b, and 0 for k < a or k > b.

So the probability of any value between a and b is p = 1 / (b − a). The mean of a uniform random variable is m = (a + b) / 2, and its variance is σ² = (b − a)² / 12. The cumulative distribution function is

F_X(x) = 0 for x < a, (x − a) / (b − a) for a ≤ x ≤ b, and 1 for x > b.

But what if the random variable X is not discrete? The same picture applies: the distribution is a rectangle of constant height p over the interval [a, b], and because the total of all probabilities must be 1, the area of the rectangle must equal 1:

p × (b − a) = 1, so p = 1 / (b − a).

So the formulas for the mean, variance and cumulative distribution function keep the same form in the continuous case; the constant 1 / (b − a) is then the height of the rectangle, as in the example worked below.
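Before the continuous example, here is a short sketch of the cumulative distribution function for the coin example (again illustrative rather than part of the lecture; `binom_cdf` is a helper name introduced here). It answers the question posed above: the probability of at most 7 heads in 10 flips of a fair coin.

```python
from math import comb

def binom_cdf(x, n, p):
    """F_X(x) = P(X <= x) = sum over k = 0..x of C(n, k) * p^k * (1 - p)^(n - k)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

# Probability of at most 7 heads in 10 flips of a fair coin.
print(round(binom_cdf(7, 10, 0.5), 4))   # ~0.9453

# Bernoulli distribution is the n = 1 case: mean p, variance p * (1 - p).
p = 0.5
print(p, p * (1 - p))                    # 0.5 0.25
```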
Example. Old Faithful erupts every 91 minutes. You arrive at a random time and wait for 20 minutes: what is the probability you will see it erupt? This is actually easy to calculate directly; 20 minutes out of 91 minutes is

p = 20/91 ≈ 0.22 (to 2 decimals).

But let's use the uniform distribution for practice. To find the probability of an eruption between your arrival time a and a + 20, find the corresponding area under the density:

Area = (1/91) × (a + 20 − a) = (1/91) × 20 = 20/91 ≈ 0.22 (to 2 decimals).

So there is a 0.22 probability that you will see Old Faithful erupt. If you waited the full 91 minutes you would be certain (p = 1) to see it erupt. But remember this is a random event: it might erupt the moment you arrive, or at any time in the 91 minutes.

The idea of an area brings us to the calculation of probabilities for any random variable whose distribution is not uniform. As is known, the integral ∫_a^b f(x) dx is the area below the curve f(x) on the interval [a, b]. If

∫_a^b f(x) dx = P(a < X < b) and ∫_{−∞}^{∞} f(x) dx = 1,

then f(x) is called the probability density. For the uniform distribution,

f(x) = 1 / (b − a) for a ≤ x ≤ b, and 0 for x < a or x > b.

Normal distribution

We say that X has a normal distribution if its values fall along a smooth (continuous) curve with a bell-shaped, symmetric pattern, meaning it looks the same on each side when cut down the middle; the total area under the curve is 1. Each normal distribution has its own mean m and its own standard deviation σ. One very special member of the normal distribution family is the standard normal distribution, or Z-distribution, which has a mean of zero and a standard deviation of 1.

Because probabilities for a normal distribution are nearly impossible to calculate by hand, we use tables to find them. All the basic results you need to find probabilities for any normal distribution can be boiled down into one table based on the standard normal (Z) distribution, called the Z-table. You need only one formula to transform a value of your normal variable X into a value of the standard normal Z:

Z = (X − m) / σ.

You take your x-value, subtract the mean, and divide by the standard deviation; this gives you the corresponding z-value.

Example. If X is normal with mean 16 and standard deviation 4, the value 20 on the X-distribution transforms into (20 − 16) / 4 = 1. So the value 20 on the X-distribution corresponds to the value 1 on the Z-distribution. Now use the Z-table to find probabilities for Z, which are equal to the corresponding probabilities for X.

To use the Z-table to find probabilities, do the following:
1. Go to the row that represents the leading digit of your z-value and the first digit after the decimal point.
2. Go to the column that represents the second digit after the decimal point of your z-value.
3. Intersect the row and column. That number is P(Z < z).

Example. Suppose you want to look at P(Z < 2.13). Using the Z-table, find the row for 2.1 and the column for 0.03; put 2.1 and 0.03 together as one three-digit number to get 2.13. Intersecting that row and column gives 0.9834, so P(Z < 2.13) = 0.9834.
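A Z-table lookup can be checked numerically. The sketch below is an illustration (not from the lecture): it builds the standard normal cumulative distribution function from `math.erf` and reproduces the z-value for the mean-16, standard-deviation-4 example and the table value P(Z < 2.13) ≈ 0.9834.

```python
from math import erf, sqrt

def z_score(x, m, sigma):
    """Standardize: Z = (X - m) / sigma."""
    return (x - m) / sigma

def phi(z):
    """Standard normal CDF, P(Z < z), expressed through the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(z_score(20, 16, 4))     # 1.0, the mean-16, standard-deviation-4 example
print(round(phi(2.13), 4))    # 0.9834, matching the Z-table lookup
```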
Example. Suppose we know that the birth weight of babies is normally distributed with mean 3500 g and standard deviation 500 g. What is the probability that a baby is born weighing less than 3100 g? That is, X ~ N(3500, 500²) and we want P(X < 3100). We can calculate the probability through the process of standardization; drawing a rough diagram of the process can help you avoid any confusion about which probability (area) you are trying to calculate.

P(X < 3100) = P((X − 3500)/500 < (3100 − 3500)/500) = P(Z < −0.8) = 1 − P(Z < 0.8) = 0.2119,

where Z ~ N(0, 1).
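The same numerical approach verifies the birth-weight calculation; as before, `phi` is a helper introduced for illustration (the standard normal CDF built from `math.erf`), not part of the lecture.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, P(Z < z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Birth-weight example: X ~ N(3500, 500^2), want P(X < 3100).
m, sigma = 3500, 500
z = (3100 - m) / sigma       # standardize: z = -0.8
print(round(phi(z), 4))      # ~0.2119, matching the table-based answer
```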