School of Computer Science and Engineering, The Hebrew University of Jerusalem.

The midterm:
◦ Date: Sunday, 1.5.11
◦ Format: half an hour during the lesson; level: easy to intermediate; a choice of 5 out of 6 multiple-choice questions; one (short) proof (no choice)
◦ Material: everything up to and including AVL trees (i.e., up to and including week 8)
◦ All the relevant solutions will be uploaded to the course site (right after the Passover break)
◦ Weight: 20%, counted only if it helps your final grade ('magen'). It will be fine!

Randomness is a good model for uncertainty. Uncertainty has a negative connotation, but as we will see later on, we can use it to our advantage. Probability is the formalization of randomness, and allows us to treat it mathematically. In computer science, we use probability to:
◦ Define parts of problems which contain 'uncertainty'
◦ Analyze probabilistic running time
◦ Achieve average-case performance on worst-case input
◦ We also computationally generate randomness!
But first – what is probability?

Consider a toss of a die. Though we know what the possible outcomes of the toss can be, we don't know which it will be. Nonetheless, we are not completely clueless. For instance, we know that if we toss a die 100 times, there is a 'good chance' the die will show '1' at some point. Since tossing a die is a process which is random in nature, we will use it as a test case for formalizing randomness as probability.
◦ Note: we will only discuss discrete probability for now.

Consider a toss of a die. We have mentioned that the toss has 6 possible (basic) outcomes:
◦ After the toss, the die will show 1.
◦ After the toss, the die will show 2.
◦ …
◦ After the toss, the die will show 6.
As a first step in formalization, we shall collect these outcomes into a set:
Ω = {1, 2, 3, 4, 5, 6}
Ω is called the sample space.

Nonetheless, there are other, indirect outcomes, such as:
◦ After the toss the die showed an even number
◦ After the toss the die showed a number larger than 3
◦ After the toss the die showed either 1 or 2
Notice that these are actually subsets of Ω: {2, 4, 6}, {4, 5, 6}, {1, 2}. Each such subset is called an event. Therefore, we will consider all possible subsets:
2^Ω = {A : A ⊆ Ω}
This collection is called the sigma-algebra.
◦ The events {1}, {2}, …, {6} are called elementary events.

Although we do not know the outcome of the toss, we know that all numbers have the same 'chance' of showing up. Let us attempt to quantify this 'chance' notion as probability (we will write Pr for short). Since we are interested in the chance of things happening:
◦ Define the probability that something will happen as 1.
◦ Define the probability that nothing will happen as 0.
◦ In our case, let's divide the probability equally between all 6 basic outcomes. That is: ∀ i = 1…6, Pr(roll = i) = ⅙.

But what about events like parity, {2, 4, 6}? It makes sense that Pr(roll = even) = ½, since Pr(roll = even) = Pr(roll = 2 or 4 or 6), and {2, 4, 6} is exactly half of the outcomes. This gives us the intuition to require:
Pr(roll = 2 or 4 or 6) = Pr(roll = 2) + Pr(roll = 4) + Pr(roll = 6)
Formally, we will define a probability function P: 2^Ω → ℝ such that:
1. For every event A ∈ 2^Ω, P(A) ≥ 0
2. P(Ω) = 1
3. For all disjoint events A, B ∈ 2^Ω, P(A ∪ B) = P(A) + P(B)
In general, for a finite sample space whose elementary events are equally likely, we have: ∀ A ∈ 2^Ω, P(A) = |A| / |Ω|.

Tossing 3 fair coins:
◦ What is the sample space? Ω = {H, T}^3
◦ What is the sigma-algebra? 2^Ω
◦ What is the probability function? ∀ ω ∈ Ω, P(ω) = ½·½·½ = 1/8
◦ Q: Why is it enough to define the probability only on the elementary events?
◦ A: Because every event is a disjoint union of elementary events.
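To make these definitions concrete, here is a minimal Python sketch (ours, not part of the original slides): Ω is a set, an event is any subset of it, and, for equally likely outcomes, P(A) = |A|/|Ω|.

    from fractions import Fraction

    # Sample space for one toss of a die.
    omega = {1, 2, 3, 4, 5, 6}

    def P(event):
        # Probability of an event (any subset of omega),
        # assuming all elementary outcomes are equally likely.
        assert event <= omega, "an event must be a subset of the sample space"
        return Fraction(len(event), len(omega))

    print(P({2, 4, 6}))   # the 'even' event: 1/2
    print(P(omega))       # axiom 2: P(Omega) = 1
    # Axiom 3 on two disjoint events:
    print(P({1, 2} | {4, 6}) == P({1, 2}) + P({4, 6}))   # True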
We toss a fair coin. If it lands on heads, we toss a fair die. If it lands on tails, we draw a ball from a bag that has 3 red balls and 2 yellow balls.
◦ What is the sample space? Ω = ({H} × {1, 2, …, 6}) ∪ ({T} × {R, Y})
◦ What is the sigma-algebra? 2^Ω
◦ What is the probability function?
∀ i = 1…6, P(H, i) = ½·⅙ = 1/12
P(T, R) = ½·3/5 = 3/10
P(T, Y) = ½·2/5 = 2/10
◦ Do all the probabilities sum up to 1?

A drunkard tries to walk along a straight line of 5 meters. With each step, he advances ½ meter, but deviates 10 cm left or right of the line with equal chance.
◦ What is the sample space? Ω = {L, R} or Ω = {L, R}^10? We get to choose! Notice that there is no uncertainty about the forward motion, so 5 meters means exactly 10 steps.
◦ What is the sigma-algebra? 2^Ω
◦ What is the probability function? P(L) = P(R) = ½, or ∀ ω ∈ {L, R}^10, P(ω) = 2^-10?
(*) Try it yourselves! http://www.maniacworld.com/walk_home_drunk.htm

To summarize: the sample space Ω is the set of all elementary events. The sigma-algebra, which is (usually) all the subsets of Ω, contains all of the events. The probability P(⋅) is a special function that tells us what the chances are of events happening. The sample space, sigma-algebra and probability function together define a probability space.

Let's go back to tossing a die. Say we play a game where for each toss we get as points the outcome of the toss, and we want to get as many points as possible. What if, instead of numbers, the faces of the die carried six drawings? (The original slide showed six picture symbols; we will label the faces d1, …, d6.)
◦ What is our sample space now?
◦ Has the probability function changed?
◦ How can we compute our score in the game?

Let's give the formulation another try:
◦ Our sample space has as elementary events all the faces of the die, regardless of what is drawn on them. Our sample space would be: Ω = {faces}
◦ For each face, we will give a value. Formally, we will create a value function X: Ω → ℝ. This value function is called a random variable.
◦ Our original probability P will induce a new 'probability' on the values of X: P_X : 2^X(Ω) → ℝ
◦ This is not really a probability function, since it is not defined over the sample space. We will call P_X the distribution of the random variable X.
(*) Note that we can actually treat X(Ω) as a new sample space, which will make P_X a true probability function. Distributions are actually a special case of probability functions, ones defined on subsets of ℝ.

For example, say we define:
◦ X(d1) = X(d2) = X(d3) = 1
◦ X(d4) = 2.5
◦ X(d5) = X(d6) = 10
Now, we get the distribution:
◦ P_X(X=1) = |{d1, d2, d3}| / |Ω| = ½
◦ P_X(X=2.5) = |{d4}| / |Ω| = ⅙
◦ P_X(X=10) = |{d5, d6}| / |Ω| = ⅓
◦ ∀ x ∉ {1, 2.5, 10}, P_X(X=x) = 0

There are several interesting families of distributions that are good to know. For instance, consider our original die-toss example (where Ω = {1,…,6}). If we define ∀ ω ∈ Ω, X(ω) = ω, then since all ω-s have the same probability, each value of X has the same probability (in this case, ⅙). We call this a uniform distribution U(⋅), since the probability is partitioned uniformly between all events. There are many other families of distributions. We might see some later on in the course.

A fair coin is tossed. You get 1 for heads and 0 for tails.
◦ What is the sample space? The probability function? Ω = {H, T}, P(H) = P(T) = ½
◦ How would you define the random variable? X(H) = 1, X(T) = 0
◦ What is the distribution? How would it look? P_X(X=1) = P_X(X=0) = ½
[Figure: bar chart of P_X, with bars of height ½ at x = 0 and x = 1]

2 dice are tossed. We want their sum.
◦ What is the sample space? The probability function? Ω = {1,…,6}^2, P(i, j) = 1/36
◦ How would you define the random variable? X(i, j) = i + j
◦ What is the distribution? How would it look?
[Figure: bar chart of P_X over the sums 2–12, rising from 1/36 at 2 to 6/36 at 7 and falling back to 1/36 at 12]
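A sketch of the same computation by brute-force enumeration (again ours, not from the slides): list the sample space, apply the random variable to every outcome, and accumulate the probability mass per value.

    from collections import Counter
    from fractions import Fraction
    from itertools import product

    # Sample space: all ordered pairs (i, j), each with probability 1/36.
    omega = list(product(range(1, 7), repeat=2))

    def X(outcome):
        # The random variable: the sum of the two dice.
        i, j = outcome
        return i + j

    # The distribution P_X: collect the probability mass per value of X.
    counts = Counter(X(w) for w in omega)
    P_X = {x: Fraction(c, len(omega)) for x, c in sorted(counts.items())}

    for x, p in P_X.items():
        print(f"P_X(X={x}) = {p}")   # e.g. P_X(X=7) = 1/6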
3 fair coins are tossed. We get 1 if all 3 coins are heads, and 0 otherwise.
◦ What is the sample space? The probability function? Ω = {H, T}^3, P(HHH) = P(HHT) = … = P(TTT) = 1/8
◦ How would you define the random variable? X(HHH) = 1, and X(ω) = 0 for every other ω ∈ Ω
◦ What is the distribution? How would it look? P_X(X=1) = 1/8, P_X(X=0) = 7/8
[Figure: bar chart of P_X, with bars 7/8 at x = 0 and 1/8 at x = 1]
Q: What Boolean function is represented here? A: AND

You are a serious lottery addict. We want to know how many losing cards you will fill in before finally winning.
◦ What is the sample space? The probability function?
For a single card: Ω = {winning card, losing card}, P(winning) = 10^-7 = p
For the whole process: Ω' = ⋃_{n=1…∞} {winning card, losing card}^n
◦ How would you define the random variable? X = number of tries until success
◦ What is the distribution? How would it look? Use common sense!
P_X(X=k) = (1-p)^(k-1)·p
This is called a geometric distribution.
[Figure: bar chart of P_X with p = 0.4, decaying geometrically over k = 1, 2, …, 10]

A drunkard tries to walk along an (infinite) straight line. With each step, he advances ½ meter, but deviates 10 cm left or right of the line with equal chance.
◦ What is the sample space? The probability function? Ω = {L, R}, P(L) = P(R) = ½
◦ How would you define the random variable? Y(R) = 1, Y(L) = -1, and define X_n = the position on the right-left axis = ∑_{i=1..n} Y_i
◦ What is the distribution?
P_X(X_n = k) = P(∑_{i=1..n} Y_i = k) = C(n, (n+k)/2) / 2^n
The 2^n counts the different orderings of L and R; the binomial coefficient C(n, (n+k)/2) chooses which steps are +1 so that there are exactly k more +1s than -1s.

For each web page, we want to know how many web pages have links to it.
◦ What is the sample space? The probability function? Ω = {web pages}, with P the uniform distribution U(|web pages|)
◦ How would you define the random variable? X(web page) = number of incoming links
◦ What is the distribution? How would it look? Use common sense! There are a zillion incoming links to Google; a few sites like Yahoo! and cnn.com with a lot of links; a few more sites with fewer links; and so on, until there are a zillion sites with one or no incoming links, like private homepages.
One possible explanation for this is 'rich get richer' – the more popular a web page is, the more probable it is that other web pages will link to it. Research shows that if new links are distributed proportionally to the already existing incoming-links distribution, then the distribution is preserved. A known distribution that preserves this quality is the power-law distribution, namely:
P_X(X = x) ∝ x^(-α)
This is a good example of a micro-scale random process (a web site adding new links) that creates a macro-scale distribution (the distribution of the number of incoming links of all web pages).
[Figure: bar chart of P_X with α = 2, decaying polynomially over x = 1, …, 11]

Let's go back to the die-tossing game. Say we toss the die 100 times and get points accordingly. What is the average number of points per toss? This is easy! Just sum:
(1/100) ∑_{tosses} X(outcome of toss) = (1 + 1 + ⋯ + 2 + 2 + ⋯ + 6 + 6 + 6) / 100
But will this be the average of every 100 tosses?
◦ No, since there is always uncertainty in dice.
◦ Almost, since… well… let's formalize first.
Let's rearrange the sum in the average:
(1/100) ∑_{tosses} X(outcome) = (1/100) ∑_{x∈outcomes} x·#(x) = ∑_{i=1}^{6} i · (#(i)/100)
If we think in probabilistic terms, each term #(i)/100 'should' be close to P(X=i).
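The following simulation (our illustration, not from the slides) makes this tangible: toss a die 100 times, compare each frequency #(i)/100 with P(X=i) = ⅙, and compute the average exactly as in the rearranged sum.

    import random
    from collections import Counter

    random.seed(0)   # fixed seed so the illustration is reproducible

    n = 100
    tosses = [random.randint(1, 6) for _ in range(n)]
    counts = Counter(tosses)

    # Empirical frequency #(i)/n next to the probability P(X=i) = 1/6.
    for i in range(1, 7):
        print(f"#({i})/{n} = {counts[i] / n:.2f}   P(X={i}) = {1/6:.2f}")

    # The average as the frequency-weighted sum from the rearrangement above.
    average = sum(i * counts[i] / n for i in range(1, 7))
    print("average points per toss:", average)   # close to 3.5, but not exact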
We shall define the expectation of the random variable X to be:
E[X] = ∑_{x∈Im(X)} x · P_X(X=x)
In our case:
E[X] = ∑_{i=1}^{6} i · P_X(X=i) = 1·⅙ + ⋯ + 6·⅙ = 3.5
Intuitively, expectation can be thought of as the average of an infinite number of samples. It can also be thought of as 'what to expect' from undrawn samples. Statistical theory has theorems on the relation between the (empirical, i.e. finite) average and the (possibly infinite) expectation. The notion of expectation is very handy. We will use it plenty.

We flip a coin with probability p for heads. We get 1 for heads and 0 for tails. What is the expectation?
◦ 𝔼[X] = 1·P_X(X=1) + 0·P_X(X=0) = 1·p + 0·(1-p) = p

We toss 2 dice. Define X to be the sum. Then:
◦ 𝔼[X] = 2·P_X(X=2) + 3·P_X(X=3) + … + 12·P_X(X=12) = 2·1/36 + 3·1/18 + 4·1/12 + … + 12·1/36 = 7

We toss a fair coin. If it lands on heads, we toss a fair die. If it lands on tails, we toss 2 dice. Define X to be the sum of the dice we eventually roll. What is the expectation?
◦ There are several ways to model this. We will show one.
◦ We have already calculated the expectation of both one die (3.5) and two dice (7). Since there is probability ½ for each:
E[X] = ½·E[one die] + ½·E[two dice] = 5.25
Q: Is it a coincidence that the expectation of 2 dice is double that of one die?
A: No. This is true because of the 'linearity of expectation', which we might learn later on.

We toss n coins, each having P(heads) = p. Define X = the number of heads (each coin contributes 1 for heads and 0 for tails). What is the probability of X = k for any k?
P_X(X = k) = C(n, k) · p^k · (1-p)^(n-k)
As n → ∞ and p → 0 with λ = pn held fixed, this tends to (λ^k / k!) · e^(-λ).
What is the expectation of X (in the limit)?
E[X] = ∑_{k=0}^{∞} k·P_X(k) = ∑_{k=0}^{∞} k·(λ^k/k!)·e^(-λ) = e^(-λ) ∑_{k=1}^{∞} λ^k/(k-1)! = λ·e^(-λ) ∑_{k=0}^{∞} λ^k/k! = λ·e^(-λ)·e^λ = λ
This is called a Poisson distribution.

Statistics and probability are closely related, but:
◦ Statistics can be measured, and can help us analyze things that have happened. Examples of statistics: average, median, minimum.
◦ Probability is theoretical and cannot be measured, but can help us 'guess' (or predict) better by describing the population from which the sample came. This is necessary since populations can't always be measured (for instance, infinite populations).
We can use statistics to fit a probabilistic model to the measured reality.

For example, consider the statistic 'average'.
◦ (Since it is a simple calculation over measurements, the average is indeed a statistic.)
An average describes some property of a measured sample (= reality), but does not tell us anything (directly) about the whole world. Statistical theory shows us that the average of a large sample is very close to the expectation of the population from which the sample was taken.
◦ This is called 'the law of large numbers'.
The expectation describes a property of the distribution, which is a theoretical object. Knowing the expectation of the distribution can help us 'guess' better.

Example:
◦ Say we take a sample of 100 people and measure their height. We now have 100 numbers.
◦ Next, we calculate the average of these 100 numbers. Assume the average was 1.65 m.
◦ By the law of large numbers, we know that the expectation of the population of heights (i.e., the heights of all the people in the world) is close to 1.65 m.
◦ Given a new person, our best estimate of his height would now be 1.65 m.
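As a quick check of the expectations computed above (3.5, 7 and 5.25), here is a small sketch that evaluates E[X] = ∑ x·P_X(X=x) directly from a distribution stored as a dictionary; the helper name `expectation` is our own, not something from the course.

    from fractions import Fraction
    from itertools import product

    def expectation(dist):
        # E[X] = the sum over the support of x * P_X(X = x).
        return sum(x * p for x, p in dist.items())

    # One die: uniform over 1..6.
    one_die = {i: Fraction(1, 6) for i in range(1, 7)}

    # Two dice: distribution of the sum, by enumeration.
    two_dice = {}
    for i, j in product(range(1, 7), repeat=2):
        two_dice[i + j] = two_dice.get(i + j, 0) + Fraction(1, 36)

    print(expectation(one_die))    # 7/2  (= 3.5)
    print(expectation(two_dice))   # 7
    # The coin chooses one die or two dice with probability 1/2 each:
    print(Fraction(1, 2) * expectation(one_die)
          + Fraction(1, 2) * expectation(two_dice))   # 21/4 (= 5.25)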
Same example, the other way around:
◦ Say we want to estimate the height of people.
◦ First, we model the probability space:
Each person is an elementary event.
The probability of randomly drawing each person is uniform.
We define h: {people} → ℝ+ to be the height of a person. This induces a probability space over h, with a distribution over heights.
◦ Next, we assume the distribution of heights is a normal distribution.
◦ In order to estimate the expectation of the distribution, we sample 100 people, measure their heights, and calculate the average. Assume the average was 1.65 m.
◦ The law of large numbers tells us this average is close to the expectation of the population.
◦ Given a new person, our best estimate of his height would now be 1.65 m.
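As a closing sketch (an illustration with made-up parameters, not data from the slides), we can simulate this estimation procedure: pretend the true population of heights is normal with expectation 1.65 m, draw samples of 'people', and watch the sample average land near the true expectation, as the law of large numbers promises.

    import random

    random.seed(1)   # fixed seed so the illustration is reproducible

    # Hypothetical 'true' population of heights: normal with
    # expectation 1.65 m and a made-up standard deviation of 0.08 m.
    TRUE_MEAN, TRUE_STD = 1.65, 0.08

    def sample_average(n):
        # Average height over a random sample of n people.
        heights = [random.gauss(TRUE_MEAN, TRUE_STD) for _ in range(n)]
        return sum(heights) / n

    # The larger the sample, the closer its average tends to be
    # to the expectation of the population (law of large numbers).
    for n in (10, 100, 10_000):
        print(f"n = {n:>6}: average = {sample_average(n):.3f} m")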