Revision Course in Probability and Statistics

Frederic Udina

August 21, 2009

Contents

1 Basics ........................................... 1
  1.1 Basic prob. ................................... 4
  1.2 Conditioning .................................. 7
  1.3 Independence ................................. 10
2 Discrete models ................................. 11
  2.1 Expectation .................................. 13
3 Continuous RV ................................... 17
  3.1 Basics ....................................... 17
  3.2 Continuous models ............................ 21
4 Sampling ........................................ 22
5 Inference ....................................... 27
  5.1 Estimation ................................... 27
  5.2 Confidence Int. .............................. 29
  5.3 Hypothesis Tests ............................. 32
6 Joint distr. .................................... 38
  6.1 Conditional, marginal, joint distr. .......... 38
  6.2 Covariation .................................. 39
  6.3 Continuous Case .............................. 40
  6.4 Multivar. normal ............................. 42
  6.5 Regression ................................... 44
References

Version: August 21, 2009
Professor and TA: Frederic Udina [email protected]; Tomaz Cajner [email protected]
Web page: http://pascal.upf.edu/revStat2009

Books. Some books that include the content of this course (but obviously they include much more):

• J. Pitman, Probability, Springer Verlag.
• Grinstead and Snell, Introduction to Probability, American Mathematical Society. This is an open source book, freely available at http://www.dartmouth.edu/%7Echance/teaching_aids/books_articles/probability_book/book.html
• Newbold, Paul, Statistics for Business & Economics, 4th ed., Englewood Cliffs, Prentice-Hall, 1995.
• Stark, Philip B., SticiGui: Statistics Tools for Internet and Classroom Instruction with a Graphical User Interface. A good web-based book: http://stat-www.berkeley.edu/~stark/SticiGui/index.htm
• Chapter 4 of Johnson, R. and Wichern, D., Applied Multivariate Statistical Analysis, 4th ed., Upper Saddle River, Prentice Hall, 1998.

At the course's web page some spreadsheets and web tools may be found as a help to practice and understand some of the concepts and techniques. Some examples and ideas in these notes are borrowed from Michael Greenacre. Answers to many exercises were provided by Basak Gunes, TA for the 2008 course.

1 Basics of probability

La primitiva
You pay 1 euro to play the state lottery: you select six numbers from 01 to 49, and you are also randomly assigned one number from 0 to 9, called the reintegro, for example

  07 13 23 29 41 47   reintegro 7

On the day of the lottery draw, six numbers are drawn, as well as an additional number (the número complementario) and the reintegro.
If you have the reintegro you get your euro back; if you match 3 or more numbers, you get the prizes shown:

  any 3 numbers          10
  any 4 numbers          60
  any 5 numbers        2000
  any 5 and compl.    50000
  all 6 numbers     1000000

We'll learn how to answer these in the first part of the course:

• What is the probability you get your money back?
• What is the prob. of each of the prizes?
• What is your expected winning from this game?

Answer: For the reintegro you have 1 number out of 10, so the probability is 1/10. There are W = C(49,6) = 13983816 possible sets of 6 numbers drawn to win.
All 6 numbers: 1/W, approx. 1 in 14 million.
Any 5 numbers: 6 × 43 out of W, 1 in 54201.
Any 5 + complementary: 6 × 43 out of 10 × W, 1 in 542008.
Any 4 numbers: we miss 2 numbers, C(6,2) possibilities, and in their place there can be C(43,2) possible pairs. Out of W, you win in C(6,2) × C(43,2) cases. Approx. prob. 1 in 1032.
Any 3 numbers: similarly, approx. 1 in 57.

On one side, we pay 1 and get it back with prob. 1/10: −1 + 1 · 1/10 = −.9. Expected prizes are: 10 · 1/57 + 60 · 1/1032 + 2000 · 1/542001 + 50000 · 1/542008 + 1000000 · 1/13983816. All in all, we expect to win −.9 + .4010 = −.4990, that is, to lose 50 cents. See page 14 on expected value. See the complete truth at http://onlae.terra.es/indexplp.htm

Random walks
You toss a coin. If it's heads you walk up a step; if it's tails, down a step. You do it six times.

• How many paths are possible? 2^6 = 64
• Does each path have exactly the same chance? Yes
• How many paths lead to −2? We should have 2 more tails than heads: 4 tails and 2 heads. How many paths have just 2 heads? We should select 2 numbers from 1 to 6 to place the heads: C(6,2) = 15 possibilities.
• What's the probability of ending at −2? 15/64

Beware of probability: intuition is often wrong
• In La primitiva you are offered a share in two tickets. One has circled the numbers 3, 8, 12, 21, 35, 46. The other one has 44, 45, 46, 47, 48, 49 circled. Which one do you prefer? Why?
• We consider families with two children in Barcelona. We are told that the one next door has at least one boy. What's the probability of both children being boys? 1/2? 2/3? 1/3?

Some quotes
Diaconis: "We are hard-wired to overreact to coincidences. It goes back to primitive man. You look in the bush, it looks like stripes, you'd better get out of there before you determine the odds that you're looking at a tiger. The cost of being flattened by the tiger is high. Right now, people are noticing any kind of odd behavior and being nervous about it." See http://www.amstat.org/publications/jse/v10n3/peterson.html

The Monty Hall Problem
In September of 1991 a reader of Marilyn vos Savant's Sunday Parade column wrote in and asked the following question: "Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the other doors, opens another door, say No. 3, which has a goat. He then says to you, 'Do you want to pick door No. 2?' Is it to your advantage to take the switch?"

What would you recommend? To switch or not to switch? Ask Google about "Monty Hall Problem". (Google: Results 1 - 20 of about 2,570,000 for Monty Hall Problem.)

Main message: Probability is often counterintuitive!
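As a quick empirical check of how counterintuitive this is, here is a small simulation sketch in Python (ours, not part of the original notes); it estimates the winning probability of each strategy:

    import random

    def monty_hall(switch, trials=100_000):
        """Estimate the probability of winning the car under a given strategy."""
        wins = 0
        for _ in range(trials):
            car = random.randrange(3)    # door hiding the car
            pick = random.randrange(3)   # contestant's first pick
            # the host opens a goat door that is neither the pick nor the car
            opened = next(d for d in range(3) if d != pick and d != car)
            if switch:
                # switch to the only remaining closed door
                pick = next(d for d in range(3) if d != pick and d != opened)
            wins += (pick == car)
        return wins / trials

    print("stay:  ", monty_hall(switch=False))  # about 1/3
    print("switch:", monty_hall(switch=True))   # about 2/3

Switching wins exactly when the first pick was wrong, which happens with probability 2/3.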
Counting
Let |B| denote the number of elements of a finite set B. Counting elements of sets is useful to compute probabilities.

Correspondence rule: If the elements of B can be put in one-to-one correspondence with the elements of another set C, then |B| = |C|.

Addition rule: If B can be split into disjoint sets B1, B2, ..., Bn, then |B| = |B1| + |B2| + ··· + |Bn|.

Multiplication rule: Suppose that k successive choices are to be made, with exactly nj choices available at each stage j ≤ k, no matter what choices have been made at previous stages. Then the total number of successive choices which can be made is the product n1 n2 ··· nk. See, e.g., Pitman p. 508.

Sequences, Orderings
By direct application of the previous rules we have:

Number of sequences: A sequence of length k of elements of a set S is any ordered k-tuple (s1, s2, ..., sk) with sj ∈ S for each j. The number of sequences of length k from a set S of n elements is n^k.

Number of orderings: An ordering or permutation of k elements is a sequence of length k with no duplications. The number of orderings of k out of n elements is n(n − 1)(n − 2) ··· (n − k + 1). Note that this is a product of k factors, and that it can be written as n!/(n − k)! using the factorial notation.

Choosing k out of n, number of subsets: The number of subsets of size k from a set with n elements is

  C(n,k) = n(n − 1)(n − 2) ··· (n − k + 1) / k! = n! / ((n − k)! k!)

This is the number of possible choices of k out of n objects. The main properties of the combinatorial numbers are:

  C(n,k) = C(n, n−k);   C(n−1, k) + C(n−1, k−1) = C(n, k)

The total number of subsets of a set of n elements is 2^n.

Simple exercises (List #1)
• Four people are to be arranged in a row to have their picture taken. In how many ways can this be done?
• An automobile manufacturer has four colors available for automobile exteriors and three for interiors. How many different color combinations can he produce?
• In a digital computer, a bit is one of the integers 0, 1, and a word is any string of 32 bits. How many different words are possible?
• There are three different routes connecting city A to city B. How many ways can a round trip be made from A to B and back? How many ways if it is desired to take a different route on the way back?
• In arranging people around a circular table, we take into account their seats relative to each other, not the actual position of any one person. Show that n people can be arranged around a circular table in (n − 1)! ways.

More exercises (List #2)
A poker hand is a set of 5 cards randomly chosen from a deck of 52 cards. Find the probability of a
• royal flush (ten, jack, queen, king, ace in a single suit).
• straight flush (five in a sequence in a single suit, but not a royal flush).
• four of a kind (four cards of the same face value).
• full house (one pair and one triple, each of the same face value).
• flush (five cards in a single suit but not a straight or royal flush).
• straight (five cards in a sequence, not all the same suit).
(Note that in straights, an ace counts high or low.)

1.1 Basic concepts and notation

Sample space, results, events
The sample space Ω = {ω1, ω2, ...} is the set of all possible outcomes.

[Figure: the sample space Ω with outcomes (sample points) ω1, ω2, ω3, an event A ⊂ Ω with P(A) ∈ [0, 1], and a random variable X assigning a real number X(w) to each outcome.]

Events are subsets of outcomes. The probability law P assigns a number in [0, 1] to each event. A real random variable assigns a number to each outcome.

Equally likely outcomes
In a finite sample space Ω, we say that outcomes are equally likely when P(ω) = 1/|Ω| for all ω ∈ Ω. Then, for any A ⊂ Ω,

  P(A) = |A| / |Ω|

Examples: dice, coins, lotteries, simple random samples in Statistics, ...

Note 1: This is not the case for Monty Hall's remaining two doors.
Note 2: If Ω is the set of outcomes when throwing two identical dice at once, outcomes are not equally likely (outcome {1, 2} is more likely than outcome {1, 1}).
Note 3: In a non-finite sample space, outcomes cannot be equally likely.
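The counting rules above map directly to code. A short illustrative sketch (ours, not from the notes); Python's math.comb computes the binomial coefficient, and the last lines answer the first item of List #2 via P(A) = |A|/|Ω|:

    from math import comb, factorial

    n, k = 49, 6
    print(n ** k)                             # sequences of length k: n^k
    print(factorial(n) // factorial(n - k))   # orderings: n!/(n-k)!
    print(comb(n, k))                         # subsets: 13983816, the W of La primitiva

    # equally likely outcomes: P(A) = |A| / |Omega|
    omega = comb(52, 5)     # all 5-card poker hands
    royal = 4               # one royal flush per suit
    print(royal / omega)    # about 1 in 650000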
Language of sets vs. events

  Event language                   Set language              Set notation
  sample space                     universal set             Ω
  results, outcomes                elements of Ω             a, b, c, ..., x, y, z
  events                           subsets of Ω              A, B, C, ...
  impossible event                 empty set                 ∅
  not A, opposite of A             complement of A           Ac, Ā, ¬A
  either A or B or both            union of A and B          A ∪ B
  both A and B                     intersection of A and B   AB, A ∩ B
  A but not B                      difference, A minus B     A − B, A \ B
  A and B are mutually exclusive   A and B disjoint          A ∩ B = ∅
  if A then B                      A is a subset of B        A ⊂ B

Play the game at http://pascal.upf.edu/revStat2009/divs/boole

Probability Rules
A probability distribution is a function over the subsets of the sample space Ω that satisfies:
• Non-negativity: for any B ⊂ Ω, P(B) ≥ 0.
• Addition: if B1, B2, ..., Bn is a partition of B, then P(B) = P(B1) + P(B2) + ··· + P(Bn).
• Total one: P(Ω) = 1.

We say that B1, ..., Bn is a partition of an event B ⊂ Ω if B = B1 ∪ ··· ∪ Bn and the events B1, ..., Bn are mutually disjoint (that is, Bi ∩ Bj = ∅ for i ≠ j).

From these rules, we may derive:
Complement rule: P(Ā) = 1 − P(A).
Difference rule: If A implies B, that is, if A ⊂ B, then P(A) ≤ P(B) and also P(B − A) = P(B) − P(A). In general, P(B − A) = P(B and not A) = P(B ∩ Ac) = P(B) − P(A ∩ B).
Inclusion-Exclusion rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Interpretation
Frequency interpretation: probability as the relative frequency in the long run.
Axiomatic probability: don't care what probability means; define a set of axioms and work with them to derive interesting properties. But how to decide what's interesting?
Subjective probability: probability as a degree of certainty.
See Pitman p. 11, p. 16.

Sampling in Quality Control
A quality control inspector checks batches of 100 items (e.g., light bulbs) by looking at a random sample of 10, and accepts the whole batch if none in the sample are defective. What is the probability of accepting a batch that actually has 8 defective items?

To define our sample space, we should answer: What are the outcomes of the experiment? Ω should be the set of all samples of 10 items from the batch of 100. What outcomes are of interest? Let A be the event "the sample has no defective item"; A ⊂ Ω is our event. Then we count: |Ω| = C(100,10); |A| = C(92,10), and finally

  P(A) = |A|/|Ω| = (92 × 91 × ··· × 83) / (100 × 99 × ··· × 91) = 0.4166

Sampling in Quality Control (II)
Now we ask the question: How many samples of 10 items have exactly k defective items? Let Ak be the event containing such samples. Then A = A0, and Ω = A0 ∪ A1 ∪ ··· ∪ A10: the sample space is partitioned into 11 events. To form a sample with k defective items, we first choose 10 − k non-defective items and then k defective ones. By the multiplication rule for counting,

  |Ak| = C(92, 10 − k) × C(8, k)

The probabilities are P(Ak) = |Ak|/|Ω| and they satisfy Σ_{k=0}^{10} P(Ak) = 1. (This is the hypergeometric distribution; we don't study it further.)

Exercises (List #3)
1. We roll two identical dice at once. Describe the sample space. Are its outcomes equally likely?
2. We roll a die twice and we write down the first and the second face shown. Describe the sample space. Are its outcomes equally likely?
3. We roll a die twice and we write down the sum of the numbers shown. Describe the sample space. Are its outcomes equally likely?
(Play the game at http://pascal.upf.edu/revStat2009/divs/puranen/dice)
4. Show graphically that Ā ∩ B̄ is the complement of A ∪ B.
5. Show by brief reasoning that A − B = A ∩ B̄.
6. Try to state a formula for Ā ∪ B̄, similar to (4).
7. I have five cards from a poker deck. Let A be the event "I have some card of each suit" and B be "not all my cards are of the same suit". Which one is true: A ⊂ B or B ⊂ A?
9. In a box there are 50 balls, 8 of them red and the rest white. You take 5 balls at once. What's the probability of getting all 5 balls red? What's the probability of getting no red balls at all?

1.2 Conditional Probability

Beware of conditional probability
1. Remember the example about two-children families.
2. What is more likely: that a thief is a banker, or that a banker is a thief?
3. We have two coins. One is a regular coin; the other has two heads on it. The regular coin is fair: 50% for each side. You choose one coin at random and toss it. It shows heads. What is the probability of the coin being the fake one?

Conditional probability
The probability of A given B is defined as

  P(A | B) = P(A ∩ B) / P(B)

Warning!
P(A ∩ B): joint probability (space Ω)
P(A | B): conditional probability (space B)
P(B | A): conditional probability (space A)
All three are about the same set of outcomes, but the reference space is different.

From the definition follows the Multiplication Rule:

  P(A ∩ B) = P(A | B) P(B)

This can be generalized:

  P(A ∩ B ∩ C) = P(A ∩ B) P(C | A ∩ B) = P(A) P(B | A) P(C | A ∩ B)

and also to n events (we use the product notation here):

  P(A1 A2 A3 ... An) = P(A1) P(A2 | A1) P(A3 | A1 A2) ... P(An | A1 A2 ... An−1)

A simple example
A business school has 500 students, 300 men and 200 women. The results of the final examination are as follows:

             Fail (F)   Pass (A)   Total
  Men (M)        30        270       300
  Women (W)      10        190       200
  Total          40        460       500

1. What is the probability that the student has passed? P(A) = 460/500
2. What is the probability that the student has passed or is a man? P(A ∪ M) = (460 + 300 − 270)/500
3. What is the probability that the student has passed and is a man? P(A ∩ M) = 270/500
4. What is the probability that the student has passed, given that he is a man? P(A|M) = 270/300
5. What is the probability that the student is a man, given that he has passed? P(M|A) = 270/460
6. Is passing/failing independent of sex? No!: P(A|M) ≠ P(A) ≠ P(A|W). See page 10 on independence.

Another example
A drunk driver should take the 3rd exit on the highway to go home, but at each exit he decides at random whether to take it or to continue on the highway, with a probability 1/3 of exiting. What's the probability that he takes the right exit?

Let Ei be the event "take exit i". What is Ēi? Is it "continue on the highway at exit i"? No! We may use the multiplication rule twice:

  P(E3) = P(E3 ∩ Ē2 ∩ Ē1) = P(E3 | Ē2 ∩ Ē1) P(Ē2 ∩ Ē1) = 1/3 · P(Ē2 | Ē1) P(Ē1) = 1/3 · 2/3 · 2/3 = 4/27

Exercises (List #4)
1. Are the following statements true for any events A, B, C?
• P(A ∪ B | C) = P(A | C) + P(B | C) − P(A ∩ B | C)
• P(A ∩ B | C) = P(A | (B ∩ C))
• P(Ā | B) = 1 − P(A | B)
• P(A | B̄) = 1 − P(A | B)
2. Mary and Pete are playing poker. Five cards have been dealt to Mary and five more to Pete. Mary has four kings (and no ace) and is wondering whether Pete could have four aces to win.
(a) What is the probability that Pete has four aces?
Suddenly, through a mirror, Mary sees that Pete already has at least two aces.
(b) What is now the probability that Pete has four aces?
Now we change the wording of the last question: Mary sees that the first and the second cards in Pete's hand are aces.
(c) What is now the probability that Pete has four aces?
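Returning to the drunk-driver example above, the multiplication-rule answer P(E3) = 4/27 is easy to confirm by simulation. A minimal sketch (ours, not from the notes):

    import random

    def takes_right_exit(p=1/3):
        """One trip: at each of the three exits the driver exits with prob. p."""
        for exit_number in (1, 2, 3):
            if random.random() < p:
                return exit_number == 3   # he exited; right exit only if the 3rd
        return False                      # drove past all three exits

    trials = 200_000
    freq = sum(takes_right_exit() for _ in range(trials)) / trials
    print(freq, "vs exact", 4/27)         # both around 0.148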
Total Probability Theorem
Theorem 1. For a partition B1, B2, ..., Bn of Ω, and for any event A,

  P(A) = Σ_{j=1}^{n} P(A | Bj) P(Bj)

This is like "weighted averaging" of probabilities: if 60% of a group are girls, and 30% of girls wear jeans while 2% of boys do, the overall probability of wearing jeans is

  P(J) = P(J|G)P(G) + P(J|B)P(B) = 0.30 · 0.60 + 0.02 · 0.40

Note: We say that subsets B1, B2, ..., Bn of Ω are a partition of Ω if B1 ∪ ··· ∪ Bn = Ω and Bi ∩ Bj = ∅ for any i ≠ j.

Bayes Theorem
Theorem 2. For a partition B1, B2, ..., Bn of Ω, and for any event A,

  P(Bi | A) = P(A | Bi) P(Bi) / P(A) = P(A | Bi) P(Bi) / Σ_{j=1}^{n} P(A | Bj) P(Bj)

Bayes's formula updates prior probabilities P(Bi) to posterior probabilities P(Bi|A) using the likelihoods P(A|Bi). This is called Bayesian learning in many contexts: when some additional information is given, we should change our behaviour.

Bayes Theorem Example 1
Box A contains 4 red and 6 white balls. Box B contains 6 red and 4 white balls. We draw a ball from A and replace it into box A. If it was red, we draw again from A; otherwise we draw a ball from B. What's the probability that the second ball is red?

By the TPT, P(R2) = 4/10 · 4/10 + 6/10 · 6/10 = 13/25.

Given that the second ball is red, what's the probability that it came from box B?

  P(B2|R2) = P(R2|B2)P(B2) / (P(R2|A2)P(A2) + P(R2|B2)P(B2))
           = (6/10 · 6/10) / (4/10 · 4/10 + 6/10 · 6/10) = 9/13 = 0.69231

A cancer test
Experience has shown that, in 99 percent of the cases in which cancer is present, the test is positive; and in 95 percent of the cases in which it is not present, it is negative. Assume that cancer is present in 0.1% of the population. If the test turns out to be positive, what probability should the doctor assign to the event that cancer is present? An alternative form of this question is to ask for the relative frequencies of false positives and cancers.

We are given that prior(cancer) = .001 and prior(not cancer) = .999. We know also that P(+|c) = .99, P(−|c) = .01, P(+|c̄) = .05, and P(−|c̄) = .95. We can compute using Bayes rule that the probability of cancer given a positive test has only increased from .001 to .019. While this is nearly a twenty-fold increase, the probability that the patient has the cancer is still small. Stated in another way, among the positive results, 98.1 percent are false positives and 1.9 percent are cancers. When a group of second-year medical students was asked this question, over half of the students incorrectly guessed the probability to be greater than .5.
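Computations like the cancer test always follow the same pattern (the Total Probability Theorem in the denominator, one likelihood times its prior in the numerator), so they are natural to code once. A sketch (ours, with hypothetical function and variable names):

    def posteriors(priors, likelihoods):
        """Bayes rule over a partition: P(Bi | A) from P(Bi) and P(A | Bi)."""
        joint = [p * l for p, l in zip(priors, likelihoods)]
        total = sum(joint)               # P(A), by the Total Probability Theorem
        return [j / total for j in joint]

    # the cancer test: partition {cancer, no cancer}, data A = "test positive"
    print(posteriors([0.001, 0.999], [0.99, 0.05]))   # [0.0194..., 0.9805...]
    # the two-box example: partition {second draw from A, from B}, data = "red"
    print(posteriors([0.4, 0.6], [0.4, 0.6]))         # [4/13, 9/13]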
1.3 Independence

Independence
Two events A and B are independent if any of the following equivalent conditions holds:
(a) P(A | B) = P(A)
(b) P(B | A) = P(B)
(c) P(A ∩ B) = P(A) P(B)

Examples. In two tosses of a coin, "first is head" and "second is head" are independent; this is easy to check and intuitively clear. Are "first is head" and "both tosses are the same" independent? Not so intuitively obvious. Let's check it! P(T1 = H) = 1/2, P(T1 = T2) = 1/2, P(T1 = H and T1 = T2) = 1/4. As 1/2 · 1/2 = 1/4, yes, they are independent. Alternate check: P(T1 = H) = 1/2 = P(T1 = H | T1 = T2).

Mutual Independence (Three events)
Three events A, B and C are mutually independent if they are pairwise independent and

  P(A ∩ B ∩ C) = P(A) P(B) P(C)

Examples. Consider tossing a coin twice, C1, C2. Let A be the event "C1 shows head", let B be "C2 shows head" and C "both coins show the same face". It's clear that A and B are independent, and it's easy to check that C is independent of A and of B, BUT knowing A and B completely determines C, so there is no mutual independence. Check that P(A ∩ B ∩ C) ≠ P(A) P(B) P(C). Answer: 1/4 ≠ 1/8.

Mutual Independence (Many events)
Events A1, A2, ..., An are mutually independent if for any choice B1, ..., Bk, with 2 ≤ k ≤ n and each Bi equal to some Aj or Āj,

  P(B1 ∩ B2 ∩ ··· ∩ Bk) = P(B1) P(B2) ··· P(Bk)

Important remark: Uncorrelated events or variables need not be independent! See page 39 on correlation.

Exercises (List #5)
1. A coin is tossed three times. Consider the following events. A: heads on the first toss. B: tails on the second. C: heads on the third toss. D: all three outcomes the same (HHH or TTT). E: exactly one head turns up.
(a) Which of the following pairs of these events are independent? (1) A, B (2) A, D (3) A, E (4) D, E
(b) Which of the following triples of these events are independent? (1) A, B, C (2) A, B, D (3) C, D, E
2. Solve the question on two coins (one fake, one fair) posed at the beginning of this section. Write carefully the events of interest, and apply the theorems stated in the previous pages.
3. A doctor assumes that a patient has one of three diseases d1, d2, or d3. Before any test, he assumes an equal probability for each disease. He carries out a test that will be positive with probability .8 if the patient has d1, .6 if he has disease d2, and .4 if he has disease d3. Given that the outcome of the test was positive, what probabilities should the doctor now assign to the three possible diseases?

2 Discrete Distributions

Discrete Random Variables
Definition: a discrete random variable is an integer-valued function on the outcomes of a sample space. A d.r.v. is characterized by its probability (mass) function, given by a list of values and their probabilities, that is,

  f(x) = P(X = x)

A probability function is best visualized by its bar graph.

[Figure: bar graph of the Binomial(N = 8, p = 0.5) probability function.]

The probability function f(x) of a d.r.v. should satisfy:
1. f(x) ≥ 0 for all x
2. Σ_x f(x) = 1

Idioms
Equalities or inequalities on random variables define events, so we can consider P(X = 2), P(X ≤ 4), P(X ≥ Y). For example, X = 2 defines the event of all outcomes whose value by X is 2.

Example: Let X be the first roll of a die, Y the second roll, and Z the sum of both rolls. P(X > 3) = 1/2, P(X ≠ Y) = 30/36 = 5/6, and P(Z = 7 | X < 3) = P(Z = 7 ∩ X < 3)/P(X < 3) = (2/36)/(2/6) = 1/6.

We say that random variables X and Y are independent if P(X = x and Y = y) = P(X = x) P(Y = y) for any pair of values x, y.

Uniform, Bernoulli models
The uniform distribution on {1, ..., n} has values 1 to n with equal probability 1/n for each one. Example: a die is uniform on {1, ..., 6}.

A variable X has a Bernoulli distribution with probability of success p if it has values 0, 1 with probabilities 1 − p, p. We say that X = 1 is a success and X = 0 a failure. Example (statistical random sampling): we select a student at random from a group with p = 0.3 of Turkish students. We consider X = 1 if the selected student is Turkish, X = 0 otherwise.
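A one-line simulation (a sketch of ours, with p known only so that we can simulate; in practice p is what we would want to estimate) shows this sampling model at work:

    import random

    p, n = 0.3, 10_000    # proportion of Turkish students, number of draws
    sample = [1 if random.random() < p else 0 for _ in range(n)]
    print(sum(sample) / n)    # empirical frequency of success, close to p = 0.3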
Binomial Distribution
In repeated identical and independent Bernoulli trials with probability p of success and probability q = 1 − p of failure,

  P(k successes in n trials) = C(n,k) p^k q^(n−k)

We say that the number of successes has a Binomial(n, p) distribution. The expected number (mean) of successes in a Binomial(n, p) is µ = np. Examples are the number of sixes when throwing n dice, the number of Italians in a sample of 10 PhD students, etc.

Play the game with DensDemo: fix n, e.g. to 15; what values of p make the distribution symmetric? What values of p give more spread?

Geometric Distribution
In repeated identical and independent Bernoulli trials with probability p of success and probability q = 1 − p of failure, let T be the number of trials until the first success; then T has a geometric distribution with parameter p:

  P(T = t) = p q^(t−1),  t = 1, 2, 3, ...

Example: the number of die rolls until the first six appears. Example: the drunk driver.

To check that Σ_{t=1}^{∞} P(T = t) = 1, use the sum of a geometric series: Σ_{n=0}^{∞} a^n = 1/(1 − a) for a < 1. To prove this last equality, show that aS = S − 1, where S is the sum of the series.

Geometric Distribution: example
You play roulette until you win (probability of success in one trial p = 1/37). What are the outcomes of the experiment? How many outcomes can there be? If there are infinitely many outcomes, does this mean that the probability of any of them is zero? As in the previous slide,

  P(first success in the k-th trial) = (36/37)^(k−1) · (1/37),  k = 1, 2, ...

The sum of these infinitely many probabilities is 1.

Poisson Distribution
A d.r.v. N has a Poisson distribution with parameter µ if it has values n = 0, 1, 2, ... and

  P(N = n) = e^(−µ) µ^n / n!

Examples:
1. You sit by the side of the road and count the number of cars that pass you in five minutes.
2. You count how many people enter a bank in an hour.
3. Or how many (different) people access a web site in a minute.
4. A radiation detector counts how many radio-active particles hit every second.
5. The number of goals per football match in a period of m minutes.

Properties:
1. The mean and variance of a Poisson(µ) distribution are both equal to µ.
2. The Poisson distribution is obtained as a limit of Binomial distributions with parameters n and p = µ/n, as n → ∞.
3. If N1, ..., Nk are independent Poisson random variables with parameters µ1, ..., µk, then N1 + ··· + Nk is a Poisson r.v. with parameter µ1 + ··· + µk.

The Poisson distribution appears whenever we count appearances of a rare event in a fixed-length period of time, assuming that there is independence between appearances.

Cumulative Probability Function
It is defined as P(X ≤ x) for any x ∈ (−∞, +∞). For a discrete r.v. the CDF is a step function. For the Binomial(n = 7, p = 0.4):

  x    P(X = x)   P(X ≤ x)
  0     0.0280     0.0280
  1     0.1306     0.1586
  2     0.2613     0.4199
  3     0.2903     0.7102
  4     0.1935     0.9037
  5     0.0774     0.9812
  6     0.0172     0.9984
  7     0.0016     1.0000

See DensDemo for a better plot.

Median, Quartiles
The median is the smallest value x such that P(X ≤ x) ≥ 1/2. The first quartile is the smallest value x such that P(X ≤ x) ≥ 1/4. The third quartile is the smallest value x such that P(X ≤ x) ≥ 3/4. The p-th percentile is the smallest value x such that P(X ≤ x) ≥ p/100.

Example: the median is the second quartile; it is the 50th percentile (and the 5th decile).

Note: The median is more informative than the mean in a non-symmetrical distribution (e.g. an income distribution).
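The table and the quantile definitions above can be reproduced exactly; a sketch (ours, not from the notes) for the Binomial(7, 0.4) case:

    from math import comb

    n, p = 7, 0.4
    pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

    def cdf(x):                      # P(X <= x), a step function
        return sum(pmf[:x + 1])

    def quantile(q):                 # smallest x with P(X <= x) >= q
        return next(x for x in range(n + 1) if cdf(x) >= q)

    print([round(f, 4) for f in pmf])   # 0.028, 0.1306, ..., 0.0016
    print(quantile(0.25), quantile(0.5), quantile(0.75))   # quartiles: 2, 3, 4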
Exercises (List #6)
1. A die is rolled until the first time T that a six turns up.
(a) What is the probability distribution of T?
(b) Find P(T > 3).
(c) Find P(T > 6 | T > 3).
(d) Show that P(T > a + k | T > a) = P(T > k) (we say that this distribution has no memory).
2. Assume that, during each second, a phone switchboard receives one call with probability .01 and no calls with probability .99. Use the Poisson approximation to estimate the probability that the operator will miss at most one call if she takes a 5-minute coffee break. (Hint: use an appropriate Poisson distribution for the number of calls that are received in a period of 5 minutes.)
3. An airline finds that 4 percent of the passengers that make reservations on a particular flight will not show up. Consequently, their policy is to sell 100 reserved seats on a plane that has only 98 seats. Find the probability that every person who shows up for the flight will find a seat available. (Note that we need the independence assumption here, and then we may apply a binomial distribution; use Excel for the calculation.)

2.1 Expectation and Variance

Expectation of a Random Variable
The expectation or expected value of a (discrete) random variable is the sum of its values weighted by their probabilities:

  E(X) = Σ_x x P(X = x)

Example: You pay 1 and throw a die. You get 5 if it shows a six. Do you want to play this game? The expected value of your winnings is

  E(W) = (−1) · 5/6 + 5 · 1/6 = 0

This is a fair game.

Expectation of a function of X
For a real function g and a random variable X, we define

  E(g(X)) = Σ_x g(x) P(X = x)

Properties of expected value
1. For a constant c, E(c) = c, E(cX) = cE(X), and E(c + X) = c + E(X).
2. For any r.v. X, Y: E(X + Y) = E(X) + E(Y) (see next slide).
3. If X, Y are independent, E(XY) = E(X)E(Y). In general this equality is not true.
4. If the probability function f(x) = P(X = x) is symmetrical around a, then E(X) = a. Remember: f(x) is symmetrical around a when f(a + x) = f(a − x) for all x.
5. Warning: in general, E(f(X)) ≠ f(E(X)).

Sum of two d.r.v.
Given two d.r.v. X, Y, a new d.r.v. Z = X + Y can be considered. The value of Z on each outcome is the sum of the values of X and Y. Then

  P(Z = z) = Σ_{x+y=z} P(X = x and Y = y)

See an example in http://pascal.upf.edu/revStat2009/divs/puranen/dice/dicesim1.html — if N1, N2 are two independent dice, the distinct values of their sum appear in the top-to-bottom diagonals of the table.

Expectation exercises (List #7)
1. In some country, couples keep having children until they have a daughter. What is the expected value of the number of children per family?
2. You make a bet and toss a coin. If it shows heads, you get double your bet; if it shows tails, you lose your money. What is the expected value of your winnings?
3. In the previous game, now you play this way: start with 1 euro; if you lose, repeat the game doubling your bet each round until you win. What is now the expected value of your winnings?
4. The previous game is known as the St. Petersburg Paradox. This is another version: suppose that we flip a coin until a head first appears, and if the number of tosses equals n, then we are paid 2^n euros. What is the expected value of the payment? How much would you pay to play this game?
5. A die is rolled twice. Let X denote the sum of the two numbers that turn up, and Y the difference of the numbers (specifically, the number on the first roll minus the number on the second). Show that E(XY) = E(X)E(Y). Are X and Y independent?
6. A royal family has children until it has a boy or until it has three children, whichever comes first. Assume that each child is a boy with probability 1/2. Find the expected number of boys in this royal family and the expected number of girls.
7. Let A be an urn with three balls inside, labelled by the numbers 0, 1, 2. Let X be the number of the first ball drawn from A, and Y the number of the second ball drawn from A. We consider two cases: (1) the first ball is put back into the urn before drawing the second one, and (2) the first ball is not put back into the urn before drawing the second one.
(a) Describe the probability distributions of X and Y in both cases (1) and (2). Compute their expectation and variance. Are they independent random variables?
(b) Describe the random variable Z = X + Y in both cases. Compute its expectation and variance.
(c) Check that in both cases E(Z) = E(X) + E(Y), but that V(Z) = V(X) + V(Y) only holds in case (1), when there is independence of X and Y.

Conditional Expectation
If F is any event and X is a random variable, then the conditional expectation of X given F is defined by

  E(X | F) = Σ_x x P(X = x | F)

Theorem 3. If F1, F2, ..., Fr is a partition of the sample space Ω, and X is a random variable on Ω, then

  E(X) = Σ_j E(X | Fj) P(Fj)

and, conditioning on any event F,

  E(X | F) = Σ_j E(X | Fj ∩ F) P(Fj | F)

Example. In a course there are Italian, German, Dutch, ... students. The average score of all students in an examination can be computed as a weighted average by conditioning on nationality.

Interpreting the Expected Value
• As the average in many repeats of the experiment. See the Law of Large Numbers, page 19.
• As the number closest to the possible (weighted) values: solve min_y Σ_x (x − y)² P(X = x).
• As the center of mass (barycenter) of the bar diagram: prove that 0 = Σ_x (x − E(X)) P(X = x). See the Excel sheet DiscreteVariable.xls.
• The expected value plays an important role in decision theory. Under uncertainty, and facing several options, a rational agent should take the decision that gives him or her maximum expected value. See, for example, Newbold chap. 19.

Variance
For a r.v. X, the variance of X is defined by

  V(X) = E[(X − E(X))²] = Σ_x (x − E(X))² P(X = x)

This is the (weighted) average of squared deviations (from the mean). The standard deviation of X is SD(X) = √V(X).

Properties
1. For a constant c, V(c) = 0, V(cX) = c²V(X), and V(c + X) = V(X). Note that V(−X) = V(X).
2. For independent r.v. X and Y, V(X + Y) = V(X) + V(Y).
3. V(X) = E(X²) − E(X)².

Sums and averages of i.i.d. sequences
Let X1, X2, ..., Xn be i.i.d. random variables with E(Xi) = µ and V(Xi) = σ² for all i, and define

  Sn = Σ_i Xi,   An = Sn / n.

Then

  E(Sn) = nµ,   V(Sn) = nσ²,    SD(Sn) = σ√n
  E(An) = µ,    V(An) = σ²/n,   SD(An) = σ/√n

It follows that, if we set Sn* = (Sn − nµ)/(σ√n), then E(Sn*) = 0 and V(Sn*) = 1. Sn* is called the standardized version of Sn.

Warning
The same terms (average and variance) are used in different contexts:
• In a probabilistic model, for a random variable.
• When summarizing real data: one considers the data set as a population to be summarized by the average and variance (or std. dev.).
• When using a sample to estimate the average and variance of a population (the sample mean and the sample variance, see page 24).
These are related but different concepts!
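The formulas for Sn and An above are easy to check empirically. A simulation sketch (ours, not from the notes), using die rolls, for which µ = 3.5 and σ² = 35/12:

    import random
    from statistics import mean, pvariance

    n, reps = 30, 20_000
    averages = [mean(random.randint(1, 6) for _ in range(n)) for _ in range(reps)]

    print(mean(averages))       # close to mu = 3.5            (E(An) = mu)
    print(pvariance(averages))  # close to (35/12)/30 = 0.097  (V(An) = sigma^2/n)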
Summary of discrete distr.

  Distribution     Values           P(X = x)                   Mean      Variance
  Uniform[a, b]    a, a+1, ..., b   1/(b − a + 1)              (a+b)/2   ((b−a+1)² − 1)/12
  Bernoulli(p)     0, 1             p^x (1−p)^(1−x)            p         p(1 − p)
  Binomial(n, p)   0, 1, ..., n     C(n,x) p^x (1−p)^(n−x)     np        np(1 − p)
  Geometric(p)     1, 2, ...        p q^(x−1)                  1/p       q/p²
  Poisson(µ)       0, 1, ...        e^(−µ) µ^x / x!            µ         µ

Exercises (List #8)
1. Let X be a random variable with E(X) = 100 and V(X) = 15. Find (a) E(X²) (b) E(3X + 10) (c) E(−X) (d) V(−X) (e) SD(−X).
2. A coin is tossed three times. Let X be the number of heads that turn up. Find V(X) and SD(X).
3. A die is loaded so that the probability of a face coming up is proportional to the number on that face. The die is rolled with outcome X. Find V(X) and SD(X).
4. Show that for a Bernoulli (and also for a binomial) distribution, the variance is maximum for p = 1/2.

3 Continuous Probability

A continuous roulette
Consider a roulette with slots numbered from 1 to 12. The probability of any slot is P(X = n) = 1/12; the cumulative probability is P(X ≤ n) = n/12.

Now consider a continuous spinner on a 12-hour clock. For any x ∈ [0, 12), what is P(X = x)??? For any x ∈ [0, 12), P(X ≤ x) = x/12. See this in an Excel simulation: compare =RANDBETWEEN(1,12) with =12*RAND().

3.1 Basic concepts

Probability Density
In the previous example, P(x − a ≤ X ≤ x + a) = (2a)/12 = interval width × 1/12. We say that the probability density at x is f(x) = 1/12. So we may say f(x) dx = P(X ∈ dx).

f(x) is a density function for a continuous random variable X if
1. f(x) ≥ 0 for all x ∈ R.
2. ∫_{−∞}^{+∞} f(x) dx = 1
3. For any interval [a, b], P(a < X ≤ b) = ∫_a^b f(x) dx

The probability of a single value (discrete case) is replaced now by the density of probability f(x) dx around x, and sums are replaced by integrals.

CDF
The cumulative distribution function of X is defined by F(x) = P(X ≤ x). (Note: identical to the discrete case.)

Properties:
1. For any interval [a, b], P(a < X ≤ b) = F(b) − F(a).
2. F is nondecreasing.
3. lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.
4. If f(x) is a density function for X, then F(x) = ∫_{−∞}^x f(t) dt.

Example: The (continuous) uniform distribution on [0,1] has density f(x) = 1 for x ∈ [0, 1] and f(x) = 0 otherwise. Its CDF is

  F(x) = 0 if x ≤ 0;   F(x) = x if 0 < x ≤ 1;   F(x) = 1 if x > 1.

Graphical example
[Figure: a probability density function and its cumulative distribution function. Probabilities are areas under the density function and values of the CDF.]

Exercises (List #9)
1. Given f(x) = kx on [0,1] (and f(x) = 0 otherwise), find k such that f(x) is a density function. Find the corresponding c.d.f. F(x). Compute the mean and the median of this distribution.
2. Let X be a r.v. with c.d.f. F(x). Find the c.d.f. of aX + b, first for a > 0, then for a < 0.
3. Given F(x) = x³ for x ∈ [0, 1], extend it to a c.d.f. Find a density function for it. Show that if X1, X2, X3 are independent r.v. with uniform distribution on [0,1], F(x) is the c.d.f. of the random variable X = max(X1, X2, X3) (see the sketch after this list).
4. Let X be a r.v. with p.d.f.

  f(x) = 0 if x ≤ −1;   1 + x if −1 < x ≤ 0;   1 − x if 0 < x ≤ 1;   0 if 1 < x.

(a) Plot f(x).
(b) Compute P[−1 ≤ X ≤ x] for any x < 0. Answer: x + x²/2 + 1/2.
(c) Compute P[−1 ≤ X ≤ x] for any x ≥ 0. Answer: 1/2 + x − x²/2.
(d) Describe and plot the c.d.f. of X.
(e) Compute the 25th, 50th and 75th percentiles of the distribution.
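For exercise 3 of the list above, a simulation sketch (ours, not from the notes) compares the empirical CDF of max(X1, X2, X3) with x³ at one point:

    import random

    trials, x = 100_000, 0.8
    hits = sum(max(random.random() for _ in range(3)) <= x
               for _ in range(trials))
    print(hits / trials, "vs", x ** 3)   # both about 0.512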
Mean and variance of cont. r.v.
For a continuous random variable X with density function f, we define (provided the integrals converge):

Expectation or expected value of X:

  E(X) = ∫_{−∞}^{+∞} x f(x) dx

If g is a real function,

  E(g(X)) = ∫_{−∞}^{+∞} g(x) f(x) dx

Variance of X:

  V(X) = E[(X − E(X))²] = E(X²) − E(X)²

(Compare to the discrete case: we are replacing sums by integrals.) The properties are the same as in the discrete case (page 14).

Exercise (List #10)
1. Let X be a random variable with range [−1, 1] and fX its density function. Find µ(X) and σ²(X) if, for |x| > 1, fX(x) = 0, and for |x| < 1,
(a) fX(x) = (3/4)(1 − x²)
(b) fX(x) = (x + 1)/2
(c) fX(x) = (3/8)(x + 1)²

Law of large numbers
Theorem 4. Let X1, X2, ..., Xn, ... be a sequence of i.i.d. random variables with finite expected value µ = E(Xj) and finite variance σ² = V(Xj). Let Sn = X1 + ··· + Xn. Then for any ε > 0,

  lim_{n→∞} P(|Sn/n − µ| < ε) = 1

So, just knowing the expected value of X, and by repeating the experiment, however random or unknown X might be, we may be sure that the average will be very close to the expected value. Note that if the Xi are Bernoulli variables, the average Sn/n is just the proportion of successes in n repeats, and µ is the probability of success. So the theorem says that the empirical frequency in many repeats converges to the probability of success.

Normal or Gaussian Distribution
The standard normal or Gaussian distribution has range (−∞, +∞) and density function

  φ(z) = (1/√(2π)) e^(−z²/2)

It has E(Z) = 0, V(Z) = 1. Its CDF is denoted by Φ(z) (it has no exact closed-form expression).

We say that X has a Normal(µ, σ²) distribution when X* = (X − µ)/σ has a standard normal distribution. So the density function of X is

  φ_{µ,σ}(x) = (1/(σ√(2π))) exp(−(1/2)((x − µ)/σ)²)

and E(X) = µ, V(X) = σ². The graph of φ_{µ,σ}(x) is symmetrical, with a maximum at µ and inflection points at µ ± σ.

  P(µ − σ < X ≤ µ + σ) = 0.68,   P(µ − 2σ < X ≤ µ + 2σ) = 0.955,   P(µ − 3σ < X ≤ µ + 3σ) = 0.997

LC of normal variables
For independent normal r.v. Xi with expected values µi and variances σi², i = 1, 2, ..., n, any linear combination L = Σ_i ai Xi is also a normal random variable, with expected value Σ_i ai µi and variance Σ_i ai² σi².

Example: Given X ∼ Normal(0, 1) and Y ∼ Normal(1, 1), X and Y independent, find P(X < Y).
Answer: P(X < Y) = P(Y − X > 0). Since W = Y − X is Normal(1, 2), P(W > 0) = P(Z > −1/√2), and this can be found in normal tables or Excel.

Visualize the central limit theorem
From a population with µ = 0.5 and σ² = 0.08 we draw 2000 samples of size N and plot the histogram of the 2000 sample means obtained. We report the mean (m) and the variance (v) of the 2000 sample means.

[Figure: four histograms of the 2000 sample means, for n = 8 (m = 0.5009, v = 0.0099), n = 16 (m = 0.4969, v = 0.0053), n = 32 (m = 0.4999, v = 0.0022) and n = 64 (m = 0.4994, v = 0.0011).]

The central limit theorem
A sum of many independent random variables has a normal distribution. More precisely:

Theorem 5. If X1, ..., Xn, ... is a sequence of i.i.d. random variables all with mean µ and variance σ², let Sn be the sum of the first n variables, and Sn* = (Sn − nµ)/(σ√n). Then Sn* has an approximate standard normal distribution, in the sense that for any constants a < b,

  lim_{n→∞} P(a < Sn* ≤ b) = Φ(b) − Φ(a)

The theorem is valid for discrete random variables, and also for continuous random variables. There are versions of the theorem for sequences of non-identically distributed independent variables, but we skip this.
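The histogram experiment above, and Theorem 5 itself, can be reproduced in a few lines. A sketch (ours, not from the notes), with Uniform[0,1] summands so that µ = 0.5 and σ² = 1/12 ≈ 0.08, close to the population in the figure:

    import random
    from statistics import NormalDist

    def standardized_sum(n, mu=0.5, sigma2=1/12):
        s = sum(random.random() for _ in range(n))   # Sn with Uniform[0,1] terms
        return (s - n * mu) / (n * sigma2) ** 0.5    # Sn* = (Sn - n*mu)/(sigma*sqrt(n))

    zs = [standardized_sum(32) for _ in range(20_000)]
    a, b = -1.0, 1.0
    empirical = sum(a < z <= b for z in zs) / len(zs)
    print(empirical, "vs", NormalDist().cdf(b) - NormalDist().cdf(a))  # both about 0.683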
Useful approximations
Particular cases of the CLT that might be useful when doing statistical sampling:

  Binomial(n, p) ≈ Normal(µ = np, σ² = npq) if npq > 5.
  Poisson(λ) ≈ Normal(µ = λ, σ² = λ) if λ > 5.

Using the continuity correction, the approximations are better: a discrete value a is matched to an interval of values (a − 0.5, a + 0.5) in the continuous approximation.

Exercises (List #11)
1. Let S = S200 be the number of heads that turn up in 200 tosses of a fair coin. Using the CLT, give estimates of (a) P(S < 90) (b) P(S > 105) (c) P(S = 100). (Note: you should use the continuity correction, for sure in (c)!)
2. Once upon a time, there were two railway trains competing for the passenger traffic of 1000 people leaving from Chicago at the same hour and going to Los Angeles. Assume that passengers are equally likely to choose each train. How many seats must a train have to assure a probability of .99 or better of having a seat for each passenger?
3. (Pitman p. 364) For σ = 1, 2, 3 suppose Xσ has a Normal(0, σ²) distribution, and these random variables are independent. Find (a) P(X1 + X2 + X3 < 4) (b) P(4X1 − 10 < X2 + X3).

3.2 Continuous Models

t and χ²
Some important distributions derived from the normal. Given Z, Z1, Z2, ..., Zn independent standard normal variables:

  Chi-square (n dof):        χ²_n = Z1² + Z2² + ··· + Zn²
  t (n dof):                 Tn = Z / √(χ²_n / n)
  Snedecor's F (m, n dof):   F_{m,n} = (χ²_m / m) / (χ²_n / n)

The parameters n and m are called degrees of freedom (dof) because they are the number of free variables in the construction.

Moments of χ²_n, t_n, F_{m,n}:

  Distribution      Mean        Variance
  Chi-square(n)     n           2n
  Student's t(n)    0           n/(n − 2)
  F(m, n)           n/(n − 2)   2n²(m + n − 2) / (m(n − 4)(n − 2)²)

We'll make good use of these distributions in statistical inference. More on them then (see e.g. page 31).

Lognormal distribution
We say that X has a lognormal(µ, σ²) distribution when Y = log X has a Normal(µ, σ²) distribution. As log is a monotone increasing transformation, the percentiles of X are the exp of the percentiles of a normal distribution. The mean of X is exp(µ + σ²/2) and the variance of X is exp(2µ + σ²)(exp(σ²) − 1). By the CLT, a product of many independent random variables has a lognormal distribution.

Centered Moments of a R.V.
For a r.v. X, the r-th centered moment is defined as E[(X − µ)^r] (provided it is finite). The 2nd centered moment is the variance; other useful moments are the skewness E(X − µ)³ (which gives a measure of symmetry) and the kurtosis E(X − µ)⁴. Sometimes a standardized version of them is used:

  Skewness coefficient: E(X − µ)³ / σ³
  Excess kurtosis: E(X − µ)⁴ / σ⁴ − 3

For calculus and proofs, non-centered moments are more useful. The r-th moment is defined as E(X^r), if it exists and is finite. Important property: knowledge of all the moments completely determines a distribution.

Moment generating function
The moment generating function of a r.v. X is the function ψ defined by ψ(t) = E(e^{tX}). The derivatives of the m.g.f. at t = 0 are the moments: ψ^(r)(0) = E(X^r).

Examples:
1. For the standard normal Z, ψ(t) = e^{t²/2}.
2. For the Poisson distribution, ψ(t) = e^{µ(e^t − 1)}.

An important result (not easy to prove) is that if two r.v. have m.g.f.s that are identical in some interval [−δ, δ], then they have the same distribution.

Exercises (List #12)
1. Let X be a r.v. with density function f(x) = e^{−x} for x > 0 and f(x) = 0 otherwise. Show that its m.g.f. is ψ(t) = 1/(1 − t) for t < 1.
Show that the mean and the variance of X are both 1, using its first and second moments.
2. Suppose that X is a r.v. with E(X) = 1, E(X²) = 2, and E(X³) = 5. Compute the third centered moment of X.

4 Sampling Distributions

Example: Toss of a coin
(1) You toss an unbiased coin 100 times. You're interested in the number of heads you get.
1. What is the random variable, its distribution, mean and variance?
2. What is the probability you get 50 heads?
3. What is the probability you get 60 heads or more?
4. What is the probability you get 40 heads or less?
5. What is the probability you get between 41 and 59 heads (inclusive)?
(2) Repeat the above question with a biased coin that has a 60% probability of turning up heads.
(3) Repeat the above question replacing the coin by "choose a citizen of Barcelona at random", and replacing "turning up heads" by "he or she doesn't like La Sagrada Familia", and assuming that you know that 70% of them do not like it.

Answers for example 3
1. What is the random variable, its distribution, mean and variance? A single citizen is modelled by X ∼ Bernoulli(p = 0.7), so E(X) = p = 0.7 and V(X) = p(1 − p) = 0.21. The number of successes in the sample of 100 citizens will be B ∼ Binomial(p = 0.7, n = 100); it has E(B) = np = 70 and V(B) = np(1 − p) = 21. The proportion of successes in the sample will be F = B/100, with E(F) = E(B)/100 = 0.7 and V(F) = V(B)/100² = 0.0021. Because F is the (rescaled) sum of many independent r.v., it can be approximated by a normal distribution with µ = 0.7 and σ = √0.0021 = 0.04583.
2. What is the probability you get 70 successes? Using the binomial formula (or Excel) we obtain P(B = 70) = 0.08678. Using the normal approximation (with continuity correction), we would compute (use Excel again) P(0.695 < F ≤ 0.705) = P((0.695 − 0.7)/0.04583 < Z ≤ (0.705 − 0.7)/0.04583) = 0.5434 − 0.4566 = 0.08688.
3. What is the probability you get 75 successes or more? Using the binomial formula (or Excel) we obtain P(B ≥ 75) = 1 − 0.8369 = 0.1631. Using the normal approximation (no continuity correction now), we would compute (use Excel again) P(F ≥ 0.75) = 1 − P(Z ≤ (0.75 − 0.7)/0.04583) = 1 − 0.8624 = 0.1376.
4. What is the probability you get 60 successes or less? Using the binomial formula (or Excel) we obtain P(B ≤ 60) = 0.0210. Using the normal approximation we would compute (use Excel again) P(F ≤ 0.6) = P(Z ≤ (0.6 − 0.7)/0.04583) = 0.01456 (quite a bad approximation: we are in the tail...).
5. What is the probability you get between 61 and 74 successes (inclusive)? Binomial: P(61 ≤ B ≤ 74) = 0.8369 − 0.0210 = 0.8159. Normal approximation: P(0.61 ≤ F ≤ 0.74) = P((0.61 − 0.7)/0.04583 < Z ≤ (0.74 − 0.7)/0.04583) = 0.8086 − 0.01456 = 0.7941.

Sampling population IQ
Suppose that the IQ of a population of students follows a normal distribution, with a mean of 110 and a standard deviation of 12.
1. What percentage of students have an IQ greater than or equal to 125?
2. What percentage of students have an IQ between 109 and 111?
Now suppose we take a random sample of 100 students from this population and calculate the sample mean. What is the probability that the sample mean lies between 109 and 111? (We will consider this after the next few slides.)

Small population
Suppose we have a population of 7 people from our vendor force, and we are interested in years of experience. See the spreadsheet smallpop.xls. Take your own sample, and see where your observed sample lies in this distribution; it is more likely to lie in the higher probability part of the distribution.
When we sample from a small population, independence is violated. For example, if we draw balls from an urn with 3 red balls and 5 black balls, "first ball is red" is not independent of "second ball is red". This does not affect the sample mean, but we do not consider small populations here. See, e.g., Newbold p. 230, 235.

Sampling from a (large) population
A Simple Random Sampling procedure is one in which every possible sample of n objects is equally likely to be chosen. Any object has the same probability of being chosen, with independence of any other object being chosen or not.

We represent a simple random sample as a collection of i.i.d. copies of a random variable: X1, X2, ..., Xn, with Xi ∼ X. X is the population, the model for each individual. Think of n tosses of a coin. Think of n citizens chosen at random in Barcelona.

Note: i.i.d. means Independent and Identically Distributed. "X, Y are i.i.d." doesn't mean X = Y!!! (Think of two dice: they are independent and have identical distributions, but they are not equal.)

The sample mean
From a random sample X1, X2, ..., Xn, with Xi ∼ X, E(X) = µ, V(X) = σ², we define the sample mean as the random variable

  X̄ = (1/n) Σ_{i=1}^n Xi

It has E(X̄) = µ and V(X̄) = σ²/n. Its SD(X̄) = σ/√n is called the standard error. If the population X is normal (or n is large enough), so is X̄.

Suppose that the IQ of a population of students follows a normal distribution, with a mean of 110 and a standard deviation of 12. Now suppose we take a random sample of 100 students from this population and calculate the sample mean. What is the probability that the sample mean lies between 109 and 111? Since the Xi are normal, so is any LC of them, so X̄ is normal: X̄ ∼ N(110, 12²/100); it has σ = 1.2.

Why E(X̄) = µ?

  E(X̄) = E((1/n) Σ_{i=1}^n Xi)
        = (1/n) E(Σ_{i=1}^n Xi)    (for a constant c, E(cX) = cE(X))
        = (1/n) Σ_{i=1}^n E(Xi)    (since E(X + Y) = E(X) + E(Y))
        = (1/n) Σ_{i=1}^n µ        (since each Xi has mean µ)
        = (1/n) nµ = µ

Note that this is valid for any population X, given that we have a random sample. Independence is not required, so it also works with small populations.

Why V(X̄) = σ²/n?

  V(X̄) = V((1/n) Σ_{i=1}^n Xi)
        = (1/n²) V(Σ_{i=1}^n Xi)   (since V(cX) = c²V(X))
        = (1/n²) Σ_{i=1}^n V(Xi)   (since if X, Y are independent, V(X + Y) = V(X) + V(Y))
        = (1/n²) Σ_{i=1}^n σ²      (since all of them have variance σ²)
        = (1/n²) nσ² = σ²/n

So this is valid for any population X, given that we have a random sample. Independence is required, so we need a large population to sample from.

Exercises (List #13)
1. In a car factory a certain item has a population mean width of 50 mm and standard deviation of 2 mm. What is the probability that the mean of a sample of 100 items inspected at random will lie between 49.5 and 50.5 mm?
2. A random sample of 64 business managers is obtained. If 50% of all business managers are older than 45 years, what is the probability that less than 40% of the managers in the sample are older than 45 years?
3. Is the production process on target? (Behaviour of the mean.) A pharmaceutical company produces "Cipro" tablets with a mean of 25 mg active ingredient and a standard deviation of 1 mg. Suppose that everything is going exactly according to these specifications, and suppose that a quality control inspector takes a sample of 30 tablets, measures the amount of active ingredient in each and calculates the average.
(a) What is the probability that this average lies outside the limits (24.9, 25.1)? Or outside the limits (24.8, 25.2)?
Or, in general, outside the limits (25 − m, 25 + m)?
(b) At what value of m does the probability become small, for example 0.05 (1 in 20, or 5%), or very small, for example 0.001 (1 in a 1000, or 0.1%)?
(c) For what sample size will the probability of the average lying outside the interval (24.9, 25.1) be small (0.05) or very small (0.001)?

The sample variance
Taking size-n samples from a population X (with E(X) = µ, V(X) = σ²) gives us a sample mean for each sample. The sample mean X̄ will be useful to estimate the population mean. What about the variance?

Given a random sample X1, X2, ..., Xn, the sample variance is

  S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²

and it satisfies E(S²) = V(X). Its square root, s, is called the sample standard deviation. Why this n − 1? See below.

Distribution of the sample variance
Assuming normality of the population X, it can be shown that

  (n − 1)S² / σ² ∼ χ²_{n−1}

and from this we may derive that V(S²) = 2σ⁴/(n − 1).

Why do we divide by n − 1?
Let's try to use σ̃² = (1/n) Σ (Xi − X̄)² to estimate σ²:

  E[(1/n) Σ_i (Xi − X̄)²]
    = (1/n) E[Σ_i (Xi − µ + µ − X̄)²]
    = (1/n) E[Σ_i ((Xi − µ)² + (X̄ − µ)² − 2(Xi − µ)(X̄ − µ))]
    = (1/n) Σ_i E((Xi − µ)²) + E((X̄ − µ)²) − (2/n) E[(X̄ − µ) Σ_i (Xi − µ)]
      (the last term is −2 E((X̄ − µ)²), since Σ_i (Xi − µ) = n(X̄ − µ))
    = σ² + E((X̄ − µ)²) − 2E((X̄ − µ)²)
    = σ² − E((X̄ − µ)²) = σ² − σ²/n = ((n − 1)/n) σ²

It fails!!! E(σ̃²) ≠ σ². Dividing by n − 1 instead of n exactly compensates for the factor (n − 1)/n.

Exercise (List #14)
Is the production process on target? (Behaviour of the variance.) Suppose the same situation as ex. 3, List #13; namely a pharmaceutical company producing "Cipro" tablets with a mean of 25 mg active ingredient and a standard deviation of 1 mg. Suppose again that everything is going exactly according to these specifications and that a quality control inspector takes a sample of 30 tablets, measures the amount of active ingredient in each and calculates the variance. We assume furthermore that the distribution of the quantity of active ingredient in each tablet follows a normal distribution.
(a) What is the probability that the variance is greater than 1 mg²? Greater than 1.1? Greater than 1 × v in general?
(b) At what value of v does the probability become small, for example 0.05 (5%), or very small, for example 0.001 (0.1%)?
(c) For what sample size will the probability be small (0.05) that the variance is greater than 1.2? And for what sample size will it be very small (0.001)?

5 Statistical inference

Role division
Probability builds models for populations. It also studies properties of samples from populations. Once we have data, data analysis selects or checks a suitable model. Then we use our data as a sample, and Statistical Inference tries to guess the model parameters from the sample.

Basic concepts
Population: modeled by a single random variable. Example: the height of people aged 25 is modeled by X ∼ N(µ, σ²).
Parameter: a numerical feature of the population that is of interest to us. Example: we want to know, or at least to have an idea about, the mean height; µ is the parameter of interest.
Sample: we plan to take a sample of size n. Until we actually take the sample, each individual in the sample is random, i.i.d. as X. Example: our sample is X1, X2, ..., Xn.
Statistic: some function of the sample, so it is a random variable. Example: X̄ = (1/n) Σ Xi. Or it may be an interval.

It is always important to know the distribution of the statistic. We'll have only one sample, but we need to compare it to the distribution (all possible samples).
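Comparing one observed statistic with its distribution over all possible samples is exactly what simulation makes visible. A sketch (ours, not from the notes), which also revisits the sample variance discussed above: S² (divisor n − 1) is unbiased, while σ̃² (divisor n) is not:

    import random
    from statistics import mean, variance, pvariance

    mu, sigma, n, reps = 25, 1, 30, 20_000
    s2, s2_tilde = [], []
    for _ in range(reps):
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        s2.append(variance(sample))         # S^2, divides by n - 1
        s2_tilde.append(pvariance(sample))  # sigma-tilde^2, divides by n
    print(mean(s2))         # close to sigma^2 = 1
    print(mean(s2_tilde))   # close to (n-1)/n = 0.967, biased downward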
Exercise (List #15)
Identify the population, parameter, and statistic when they appear in each of the following statements.
(a) Based on a 5% sample of census data, the average annual income of Spaniards in the 21-35 year age group is 12.232.
(b) A survey of 1000 people from minority groups living in Barcelona revealed that 185 were out of work.
(c) In the second cycle of the "Llicenciatures ECO & ADE" at the UPF, three of the four student delegates are from the "ADE" business degree.

5.1 Estimation

Point Estimation, an example
Suppose that the true percentage of first-language Catalan speakers in Barcelona is 41% (here we are assuming we know the truth; soon we will move away from this assumption!). A researcher interviews a random sample of n = 1000 residents of Barcelona in order to estimate this percentage.
What is the probability that the estimated percentage is between 40% and 42%, i.e. that the estimate is correct to 1%?
What should n (the sample size) be so that the probability is 0.95 that the estimate is correct to 1%?

Point Est., an example (cont.)
In this example:
Population: X ∼ Bern(p = 0.41).
Parameter: p is the parameter of our interest.
Sample: X1, ..., X1000, with the Xi i.i.d. as X. Sample size n = 1000.
Statistic: the sample proportion p̂ of Catalan speakers.
Statistic's distribution: by the CLT, p̂ ∼ N(p, p(1 − p)/n).
We say that p̂ is an estimator of p, a point estimator.

Point estimation: basic concepts
We are interested in a parameter θ. Our tool is an estimator θ̂, computed from a sample (it has a different value for each sample). An estimator θ̂ of a parameter θ may have bias, defined as

  bias(θ̂) = E(θ̂) − θ

Examples: Since E(X̄) = µ, X̄ is an unbiased estimator of µ. p̂ is an unbiased estimator of p. S² is an unbiased estimator of σ² (using the notation of the previous slides). If we were using σ̃² = (1/n) Σ (Xi − X̄)² instead of S², the bias would be

  E(σ̃²) − σ² = ((n − 1)/n) σ² − σ² = −σ²/n

So σ̃² is a biased estimator of σ².

Point estimation: MSE
A good measure of the quality of an estimator is the mean squared error

  MSE(θ̂) = E[(θ̂ − θ)²]

There is an important decomposition:

  MSE(θ̂) = [bias(θ̂)]² + V(θ̂)

For an unbiased estimator, the MSE equals its variance. Proof of the decomposition:

  MSE(θ̂) = E[(θ̂ − E(θ̂) + E(θ̂) − θ)²]
          = E[(θ̂ − E(θ̂))² + (E(θ̂) − θ)² + 2(θ̂ − E(θ̂))(E(θ̂) − θ)]
          = V(θ̂) + [bias(θ̂)]² + 2(E(θ̂) − θ) E[θ̂ − E(θ̂)]
          = V(θ̂) + [bias(θ̂)]²

An example: the mean and the median
Suppose X ∼ N(µ, σ²). To estimate µ we have two estimators: X̄ and M, the sample median. It can be shown that
• E(M) = µ, so the median is an unbiased estimator.
• MSE(M) = V(M) = πσ²/(2n).
So we see that V(M) = (π/2) V(X̄) > V(X̄); we say that X̄ is more efficient than M.

[Figure: sampling distributions of M and X̄ around µ.]

5.2 Confidence intervals

Confidence interval for a mean
To estimate the mean of a population X ∼ N(µ, σ²) we'll take a size-n sample with sample mean X̄. (We assume for now that σ² is known.) We know that

  (X̄ − µ)/(σ/√n) ∼ N(0, 1)

so for a fixed constant 0 < α < 1,

  P(−z_{α/2} ≤ (X̄ − µ)/(σ/√n) ≤ z_{α/2}) = 1 − α
  P(X̄ − z_{α/2} σ/√n ≤ µ ≤ X̄ + z_{α/2} σ/√n) = P(µ ∈ X̄ ∓ z_{α/2} σ/√n) = 1 − α

where z_a denotes the value of Z such that P(Z > z_a) = a.

Confidence interval for a mean - 2
The 100(1 − α)% confidence interval for µ is

  X̄ ± z_{α/2} σ/√n

It is the sample mean ± a margin of error. The margin of error is z_{α/2} × σ/√n (half the interval width), that is: the critical point × the standard error of the estimate. 100(1 − α)% is called the confidence level or probability content of the interval.

  1 − α   α      α/2     z_{α/2}
  0.90    0.10   0.05    1.645
  0.95    0.05   0.025   1.960
  0.99    0.01   0.005   2.576
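The critical points in this table and the interval itself can be computed directly; a sketch (ours, using Python's statistics.NormalDist from the standard library) that also reproduces the laboratory example that follows:

    from statistics import NormalDist, mean

    def z_interval(xbar, sigma, n, conf=0.99):
        alpha = 1 - conf
        z = NormalDist().inv_cdf(1 - alpha / 2)   # critical point z_{alpha/2}
        margin = z * sigma / n ** 0.5             # critical point x standard error
        return xbar - margin, xbar + margin

    for conf in (0.90, 0.95, 0.99):               # reproduce the table above
        print(conf, round(NormalDist().inv_cdf(1 - (1 - conf) / 2), 3))

    data = [0.8403, 0.8363, 0.8447]               # the laboratory example below
    print(z_interval(mean(data), sigma=0.0068, n=len(data)))  # about (.8303, .8505)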
C.I. for µ: Example

A laboratory analyzes a pharmaceutical product with a procedure that is known to measure with no bias but with σ = 0.0068 g per liter. They made n = 3 measurements: .8403, .8363, .8447, so x̄ = .8404.

Standard error of the estimate: σ/√n = 0.0068/√3 = .0039
For a 99% confidence interval we compute:
Critical point: z0.005 = 2.576
Margin of error: 2.576 · .0039 = .0101
99% confidence interval for µ: .8404 ± .0101 = (.8303, .8505)

Excel will give us the margin of error using: =CONFIDENCE(0.01, 0.0068, 3)

Note that (leaving the rest unchanged):

Increasing 100(1 − α):  interval width increases
Increasing n:           interval width decreases
If σ is bigger:         interval is wider

Confidence interval for a mean - 3

Important notes:
1. The confidence interval is random: each sample gives a different interval (see the Excel spreadsheet simu-intervals.xls for a simulation).
2. Some intervals miss the mean; actually 100α% of them do!
3. We'll NEVER say "the probability of the mean being in the interval (.8303, .8505) is 99%". That's nonsense: the mean is a fixed, unknown quantity. The interval IS random.
4. We say: the mean is between .8303 and .8505, with a confidence of 99%.
5. For large n (larger than 30), if we don't know σ, it can be replaced in the formulas by the sample standard deviation s (computed from the sample). See page 31.
6. The most important point: all this ONLY works if we have a truly random sample.

Confidence interval for a proportion

We want to estimate a population proportion p using the sample proportion p̂ from a (large) size n sample.

The standard error is now √(p(1 − p)/n). It can be approximated by se = √(p̂(1 − p̂)/n).

The confidence interval is p̂ ± zα/2 se.

Example: A random sample of 344 students was asked about the quality of the university cantina. 83 students answered "good". Find a 90% interval for the proportion of all university students that would say that the cantina is "good".

Point estimate: p̂ = 83/344 = .241
For α = 0.1, the critical point is z0.05 = 1.645
Standard error: √(.241 · .759/344) = .02306
Margin of error: .0379 = 3.79%
90% confidence interval: (.2031, .2789) = (20.31%, 27.89%)

For other confidence levels:
80% confidence interval: (.211, .271)
90% confidence interval: (.203, .279)
95% confidence interval: (.196, .286)
99% confidence interval: (.182, .300)

Sample size for desired margin of error

To have a margin of error of at most m, we want zα/2 σ/√n ≤ m, and thus the sample size should be

n ≥ (zα/2 σ / m)²

To use this, we need to know, or have some idea about, how big σ can be.

Example: Now we know that the proportion of cantina-satisfied students is around 25%, so we may take σ² = p(1 − p) ≈ .25 · .75. We want to make a new poll to know the proportion with a margin of error no bigger than 1%. Using the same confidence level, 90%, the required sample size is

n ≥ 1.645² · .25 · .75 / 0.01² ≈ 5074

so the sample size should be at least 5074 (a code sketch of this calculation follows the summary below).

Confidence interval summary

We are trying to estimate θ using estimator θ̂.
interval is point estimate ∓ margin of error.
margin of error is critical point × standard error; it is half the width of the interval.
standard error of the estimate is the standard deviation of the estimator.
confidence level 1 − α is the proportion of samples that give us a good interval. So we admit that 100α% of the intervals will be wrong.
critical point leaves probability α in the two tails together (α/2 in each).
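A quick sketch of the sample-size calculation above, added here as an illustration; the function name is a hypothetical helper, not from the notes.

```python
import math
from scipy.stats import norm

def sample_size(p_guess, margin, conf=0.90):
    """Smallest n so that the CI for a proportion has the desired margin."""
    z = norm.ppf(1 - (1 - conf) / 2)    # critical point z_{alpha/2}
    sigma2 = p_guess * (1 - p_guess)    # guess for the variance p(1 - p)
    return math.ceil(z**2 * sigma2 / margin**2)

print(sample_size(0.25, 0.01))  # about 5074 at the 90% level
```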
Exercises (List #16)

1. During some municipal elections in which only two parties, CiU and PSC, participate, a voter survey is carried out on 1000 voters selected at random from the voter population. A total of 615 indicate that they will vote for CiU.
(a) Construct a 95% confidence interval for the percentage of votes that will be obtained by CiU in the election.
(b) Construct a confidence interval for the difference in percentage of votes between CiU and PSC.
(c) Can CiU be sure of a victory?

2. A closely contested presidential election in 1976 pitted Jimmy Carter against Gerald Ford. A poll taken immediately before the 1976 election showed that 51% of the sample intended to vote for Carter. The polling organization announced that they were 95% confident that the sample result was within ±2 points of the true percent of all voters who favored Carter.
(a) Explain in plain language to someone who knows no statistics what "95% confident" means in this announcement.
(b) The poll showed Carter leading. Yet the polling organization said the election was too close to call. Explain why.
(c) On hearing of the poll, a nervous politician asked, "What is the probability that over half the voters prefer Carter?" A statistician replied that this question not only can't be answered from the poll results, it doesn't even make sense to talk about this probability. Explain why.

Small samples, σ unknown

Let X be a normal population. We know that (X̄ − µ)/√(σ²/n) has a N(0, 1) distribution. In most cases we don't know σ², so we use the sample to estimate it through S². Then:

For size n random samples from X ∼ N(µ, σ²), the statistic (X̄ − µ)/√(S²/n) has a Student's t distribution with n − 1 degrees of freedom.

The t density has heavier tails than Z: critical values will be higher, and intervals will be wider.

Small samples, t-based confidence intervals

For random samples of small size n from a population X ∼ N(µ, σ²), with σ estimated from the sample, the 100(1 − α)% confidence interval is

X̄ ± tα/2 S/√n

Some critical values for α = 0.05, confidence 95% (the upper tail contains probability α/2, both tails together contain α):

d.f.   t0.025
5      2.57
15     2.13
30     2.04
100    1.98
∞      1.96

As n grows, t approaches Z.

Exercises (List #17)

1. Of a random sample of 134 auditors employed by major auditing firms, 82 said that, on receiving new audit business, they always inquired of the predecessor auditor the reason for the change of auditors. Find a 95% confidence interval for the population proportion. Without doing more calculations, state whether a 90% confidence interval would be wider or narrower than the one found before.

2. A random sample of 595 marketing experts was asked to assess on a scale from one (completely unethical) to seven (completely ethical) the practice of packaging a store brand to closely resemble a national brand. The sample mean response was 3.38 and the sample standard deviation was 1.80. Find a 90% confidence interval for the population mean (see the sketch after this list).

3. Of a random sample of 151 marketing executives in consumer goods manufacturing, 76.0% said that brand identification held by incumbents was an important or extremely important barrier to entering a new market. Based on this information a statistician computed, for the population proportion holding this view, the confidence interval .720 < p < .800. Find the probability content of this interval.
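A minimal sketch of the t-based interval, using the summary numbers from exercise 2 above (an added illustration, not part of the original notes):

```python
import math
from scipy.stats import t

n, xbar, s = 595, 3.38, 1.80             # sample size, mean, standard deviation
alpha = 0.10                             # 90% confidence

crit = t.ppf(1 - alpha / 2, df=n - 1)    # t critical point (close to z for large n)
moe = crit * s / math.sqrt(n)            # margin of error

print(xbar - moe, xbar + moe)            # about (3.26, 3.50)
```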
5.3 Hypothesis testing

Testing: Can you believe it?

1. Suppose that I toss a coin 100 times and obtain 95 heads. Would you believe that the coin is fair? Would you bet that it's biased? What if I had obtained 65 tails?

2. Suppose that someone tells you that 80% of people from Barcelona speak Catalan. Then you pick 100 numbers at random from the white pages directory and you find only 25 people speaking Catalan. Would you trust your source? Why, or why not?

3. Suppose you are auditing a firm that claims that the average amount of its export operations is $1000. You select 50 operations at random and find a sample mean of $800. Can you reject the claim?

In all these cases we are checking or questioning a null hypothesis. Usually there is also an alternative hypothesis. We want to decide which hypothesis is more likely to be true.

Concepts of hypothesis testing

Null hypothesis The starting point, default scenario, previous knowledge, no-action situation.
Alternative hyp. The one we would like to find or discard; the target of our experiment.
p-value The probability, assuming H0 is true, of obtaining the observed result or one that is even more extreme (more in favor of H1). Small p-values make us suspect that H0 may not be true, and we should decide to reject it.
Significance level α is the threshold of the p-value which we regard as unusual, often taken to be 0.05. If p < α then we reject H0; otherwise we don't have enough evidence to reject it.
One- or two-sided test Depending on the statement of the problem, we set up the hypotheses and see whether the alternative hypothesis points in one direction only or in both directions of the values of the statistic.

False positive, false negative

Testing at the medical laboratory: a blood sample arrives at a medical laboratory for testing for the HIV virus. After the test a decision is made: either the blood sample has the virus or it doesn't. But the truth can be different, and in reality the virus may be present or not. Thus there are four different possibilities, depending on the decision and on the truth.

• List the possibilities in the form of a 2×2 table where the rows are the two true alternatives and the columns are the two alternatives according to the laboratory.
• Which possibility is the "false positive" and which the "false negative"?

Repeat the above exercise for a quality control situation where the mean of a process is being controlled.

False positive: "Type I error". False negative: "Type II error". Note that "positive" here means rejecting the null hypothesis that no virus is present.

Null hypothesis and p-value: example

At the jury: in a murder case, the jury is trying to decide whether a man is to be declared innocent or guilty. The null hypothesis is innocence: unless proven otherwise, he is innocent. We may think that there is a truth (the man is really innocent, or he is really guilty), but nobody knows the real truth...

A fact is shown to be true in the trial: e.g., the gun was found under the man's bed. Assuming innocence, this fact has very small probability of occurring (this is the p-value!). So the jury will decide "not innocent": they will reject the null because the p-value is so small.

What are the "Type I error" and "Type II error" in this situation? Make another 2×2 table for this situation.

Remark: the p-value is the probability, under the null hypothesis, of finding something like the observed fact. If it is very small, the null is under suspicion and we had better reject it.

Error types in hypothesis testing

Truth \ Decision    Do not reject H0            Reject H0
H0 true             OK                          Type I error (prob. α)
H0 false            Type II error (prob. β)     OK

Important note: the truth is never known in statistics (as in real life?).
Methodology

Sequential steps of a hypothesis test (this is important for understanding the concepts):

1. Define the population model and the parameter of interest.
2. Set up the hypotheses: null and alternative.
3. Choose a statistic as estimator and study its distribution under the null hypothesis.
4. Fix a value for α. Decide a sample size. Identify acceptance and rejection regions. Write down the decision rule.
5. Obtain the sample and compute the observed value of the statistic.
6. Compute the p-value.
7. Make a decision: reject or do not reject H0.

Note that steps 1 to 4 are done before even seeing data. In a significance test, steps 4 and 7 are skipped: we don't state any decision, we simply state the p-value (hopefully very small).

Testing a mean: Example

In the quality control lab of a pharmaceutical company, they need to be sure that the drug they are producing contains no more than 12 mg per pill of a certain component C. They plan to take a random sample of n = 10 pills and check the contents of component C. Then they will compute the mean and consult the statistician. She knows by experience that the quantity of C in a pill is normally distributed. To make things a bit easier, we assume that this quantity has a known standard deviation σ = 4.

Let's follow the steps outlined above.

1. Population model: X ∼ N(µ, σ² = 16). The parameter of interest is µ.
2. H0: µ ≤ 12; Ha: µ > 12.
3. Statistic: the sample mean X̄. Under H0, X̄ ∼ N(12, 16/n).
4. Let α = 0.01. The sample size will be n = 10. Values of X̄ that are too high go against H0 in favour of Ha. The critical point for Z is z0.01 = 2.326, so the rejection region is X̄ > 12 + 2.326 · 4/√10 = 14.94 and the acceptance region is X̄ ≤ 14.94. The decision rule is: if X̄ > 14.94 reject H0; otherwise, don't reject it.
5. Once the sample is obtained by random sampling, let x̄ = 10.5 be the sample mean.
6. The p-value is the upper tail beyond 10.5, i.e. the upper tail of (10.5 − 12)/√(16/10) = −1.19 in the Z distribution: p = P(Z > −1.19) = 0.88. Maybe we stop here and report the p-value.
7. Since the p-value is higher than α, we don't reject H0. Another way to reach the same conclusion: applying the decision rule, 10.5 < 14.94, so we don't reject H0.

Power of a test: Example (cont.)

[See spreadsheet test-power.xls] In the previous situation, suppose that the quality control inspector operates according to the decision rule deduced above: if X̄10 > 14.94, reject H0. Now suppose that, unbeknown to him, the quantity of component C has gone up to 13 mg per pill.

• What's the probability of detecting this change? Answer: 1 − β = P[X̄10 > 14.94 | Ha true, with µ = 13] = 0.0623, so β = 0.9377.
• Repeat the calculation for true quantities of C of 14, 15, 16, 17 mg per pill, and plot the results in a curve. Answers: resp. 0.2281, 0.5181, 0.7984, 0.9481.

This is called the power curve of the test and shows how effective the test is in detecting the alternative hypothesis.

Power of a test

Remember:
Type I error Rejecting a true null hypothesis. Probability α, predefined.
Type II error Accepting (not rejecting) a false null hypothesis. Probability β, which depends on the true alternative, which is unknown.

The power of the test is 1 − β, the probability of detecting a true alternative hypothesis. Usually we compute a power curve of 1 − β as a function of the true value of the parameter.
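A minimal sketch of these power calculations (an added illustration; the numbers mirror the pill example above):

```python
import numpy as np
from scipy.stats import norm

sigma, n, mu0, alpha = 4.0, 10, 12.0, 0.01
se = sigma / np.sqrt(n)
cutoff = mu0 + norm.ppf(1 - alpha) * se      # rejection region: xbar > 14.94

def power(mu_true):
    """P(reject H0) when the true mean is mu_true."""
    return norm.sf(cutoff, loc=mu_true, scale=se)

for mu in (13, 14, 15, 16, 17):
    print(mu, round(power(mu), 4))           # 0.0623, 0.2281, 0.5181, ...
```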
Power of a test: Example 2

A marketing company runs an electoral poll to predict the outcome of a presidential election. They plan to take a random sample of n = 400 potential voters and ask them whether they are going to vote for candidate B or candidate J. They want to use α = 0.05 as significance level. Given the current political analysis, they take as null hypothesis a tie: H0: p = 0.5, where p is the proportion of voters for B. The alternative hypothesis is Ha: p ≠ 0.5.

From this setting they compute the decision rule: if 0.451 < p̂ < 0.549, don't reject H0.

Using this rule, what is the probability of detecting a change in voters' intention to p = 0.55? [Answer: the probability of not detecting the change is β = 0.4839.]

After seeing that, they decide to increase the sample size to n = 1000.
• Will the decision rule change?
• What will be the new β and the new power of the test?
• Compute the power curve for this case.

Goodness of fit

Example 1: I throw a die 600 times and obtain 115 1's, 88 2's, 97 3's, 112 4's, 82 5's and 106 6's. Is this unusual for a fair die?

Example 2: In the 2000 national elections in Spain the distribution of votes was as follows: PP: 41.3%, PSOE: 32.1%, IU: 12.7%, CiU: 10.5%, other: 3.4%. A survey published six months after the election randomly selected 1200 Spanish voters and showed that 407 would vote for PP, 371 for PSOE, 156 for IU, 122 for CiU, 42 for other parties, and 102 voters were still undecided.
1. Comment on the voting pattern observed in this sample. What do you do with the undecided?
2. Test whether there has been a change in the voting pattern compared to the previous election.

Goodness of fit

We have some discrete probability distribution with K possible values V1, ..., VK (usually categorical) and probabilities P(Vi) = pi. We have a sample of size N, where Oi cases fall in category Vi. Our null hypothesis is that the sample comes from the stated model; the alternative hypothesis is that some of the categories are over- or under-represented in the sample.

Under H0, we would expect to find Ei = N pi cases in each category, and it can be shown that

Σᵢ₌₁ᴷ (Oi − Ei)²/Ei ∼ χ²ₖ₋₁

(Note: this approximation is acceptable provided the expected values are not smaller than 5.)

Goodness of fit, examples

Example 1. Null hyp.: our die is fair. There are 5 d.o.f.; in Excel the p-value is =DISTR.CHI(8.82, 5).

Face        1      2      3      4      5      6      Total
Observed    115    88     97     112    82     106    600
Expected    100    100    100    100    100    100    600
Discr.      2.25   1.44   0.09   1.44   3.24   0.36   8.82

p-value = 0.11646. Under the null, in 11.6% of experiments like this one we would find a discrepancy equal to or bigger than the one we observed: we don't have enough evidence to reject the null.

Example 2. Leaving out the 102 undecided, N = 1098. There are 4 d.o.f.; the p-value is =DISTR.CHI(8.68, 4).

Party    %      Observed   Expected   Discr.
PP       41.3   407        453.47     4.76
PSOE     32.1   371        352.46     0.98
IU       12.7   156        139.45     1.97
CiU      10.5   122        115.29     0.39
other    3.4    42         37.33      0.58
Total    100    1098                  8.68

p-value = 0.06968.
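These chi-square computations are one line in most software. As an added illustration, a sketch for the die example:

```python
from scipy.stats import chisquare

observed = [115, 88, 97, 112, 82, 106]      # 600 throws of the die
expected = [100] * 6                        # fair die: 100 expected per face

stat, pvalue = chisquare(observed, f_exp=expected)
print(stat, pvalue)                         # about 8.82 and p = 0.116
```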
Test of independence

A sample is drawn from a population. Each member of the population is uniquely cross-classified according to a pair of attributes. Is there association or dependence in the population between the possession of those attributes?

Example: Stockbrokers make dozens of decisions every week about buying and selling shares. Their decisions can be classified into three categories: clearly correct (++), borderline (+/−) and clearly wrong (−−). Four brokers working at the Madrid stock exchange analyze their decisions and come up with the following result:

Broker     ++    +/−   −−
José       29    15    4
Ignacio    21    10    5
Miguel     20    21    9
Hugo       25    24    7

• Comment on the margins of the table and the performances of the stockbrokers.
• Assuming no differences between the brokers, calculate the expected frequencies.
• Perform the hypothesis test of no difference between the brokers (worked out below, and sketched in code after the exercise list).

Testing independence

The first attribute has r categories: r rows in the contingency table. The second attribute has c categories: c columns. Let Oij be the number of cases in row i and column j. Let Oi+ denote the sum of row i, O+j the sum of column j, and O++ the grand total.

Under the null hypothesis of independence between rows and columns:
• we would obtain, in cell i, j, the expected value Eij = Oi+ O+j / O++
• the discrepancy

χ²obs = Σᵢ₌₁ʳ Σⱼ₌₁ᶜ (Oij − Eij)²/Eij

has a χ² distribution with (r − 1)(c − 1) degrees of freedom.

So we should reject the null hypothesis when χ²obs > χ²₍ᵣ₋₁₎₍c₋₁₎,α. The p-value can be computed, e.g., using Excel with =DISTR.CHI(chi2value, dof). Note: again, expected values should be at least 5.

Testing independence, example

Observed (with margins):

Broker     ++    +/−   −−    Total
José       29    15    4     48
Ignacio    21    10    5     36
Miguel     20    21    9     50
Hugo       25    24    7     56
Total      95    70    25    190

Expected under H0:

Broker     ++      +/−     −−
José       24.00   17.68   6.32
Ignacio    18.00   13.26   4.74
Miguel     25.00   18.42   6.58
Hugo       28.00   20.63   7.37

Discrepancies (Oij − Eij)²/Eij:

Broker     ++      +/−     −−
José       1.042   0.407   0.849
Ignacio    0.500   0.803   0.015
Miguel     1.000   0.361   0.891
Hugo       0.321   0.550   0.018

χ²obs = 6.757, d.o.f. = (4 − 1)(3 − 1) = 6, p-value = 0.3439 (=DISTR.CHI(6.757, 6)).

Exercises (List #18)

1. Applicants to the graduate program of the UPF in 1996 consisted of 65 males and 48 females.
(a) Using a chi-square test, test whether the number of male applications is significantly different from the number of female applications.
(b) Test the same hypothesis with the test of a proportion, using the normal approximation to the binomial distribution.
(c) Verify that the value of the test statistic in (a) is the square of that in (b), so that the two tests are equivalent. (N.B. This again illustrates the fact that the square of a standard normal variable is a chi-squared variable with 1 degree of freedom.)

2. A marketing researcher wants to assess consumer preferences among four colors for a washing machine. The next table shows the observed frequencies in a recent survey of 198 sales.

Color   metal   brown   white   blue   Total
Freq.   61      55      41      41     198

Test the null hypothesis that all four colors are equally popular, with a significance level of α = 0.05.

3. (omitted)

4. The next table shows the number of goals per match scored by some football team.

Goals per match    0    1    2    3    4    5    6    7
Matches            14   18   29   18   10   7    3    1

(a) Compute the mean of this distribution and the frequencies that a Poisson distribution with this mean would give.
(b) Test the hypothesis that the number of goals in a match follows a Poisson distribution. (N.B. In a goodness-of-fit chi-square test, the number of degrees of freedom equals the number of categories minus one. If we use the same observed sample to estimate any parameter of the distribution to be fitted, the number of degrees of freedom is reduced further by the number of parameters estimated from the sample.)

5. In a six-month period, four Californian cities had the following frequencies of three selected crimes:

City            Murder   Rape   Arson
Berkeley        7        36     28
Oakland         44       192    208
San Francisco   63       294    209
San José        13       214    514

Test the hypothesis that the cities have the same incidence of crime.
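The broker table can be checked directly with scipy's contingency-table test; a sketch added for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[29, 15, 4],
                  [21, 10, 5],
                  [20, 21, 9],
                  [25, 24, 7]])    # rows: brokers; columns: ++, +/-, --

stat, pvalue, dof, expected = chi2_contingency(table)
print(stat, dof, pvalue)          # about 6.76, 6 d.o.f., p = 0.344
```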
6 Joint distributions

6.1 Joint, marginal and conditional distributions

Joint distributions: Example in the discrete case

We will completely develop a simple example in the discrete case, since it will be useful for understanding the continuous case later.

We roll a four-sided die; let X be the number shown on top. Then we toss a coin as many times as X indicates; let Y be the number of heads. X and Y are random variables on a common sample space: we say that they are jointly distributed.

It's easy to derive the conditional distributions of Y given X. If X was, say, 2, then Y is 0, 1, 2 with probabilities 1/4, 1/2, 1/4. We write P(Y = 0 | X = 2) = 1/4, etc. Note that Y | X = x is a different r.v. for each value x, with its distribution on the values of Y. So Y | X is a collection of conditional distributions, one for each value of X.

The joint distribution can be obtained from the conditionals:

P(x, y) = P(X = x and Y = y) = P(Y = y | X = x) P(X = x).

X \ Y       0       1       2      3      4      Marginal of X
1           1/8     1/8     0      0      0      1/4
2           1/16    1/8     1/16   0      0      1/4
3           1/32    3/32    3/32   1/32   0      1/4
4           1/64    1/16    3/32   1/16   1/64   1/4
Marg. of Y  15/64   13/32   1/4    3/32   1/64   1

The marginal distribution of X is uniform; the marginal distribution of Y is obtained by summing each column.

Now we can compute the conditional distributions of X given Y; for example, P(X = 3 | Y = 3) = P(3, 3)/P(Y = 3) = (1/32)/(3/32) = 1/3.

Remember that we say X and Y are independent if, for all values x, y, P(x, y) = P(X = x) P(Y = y). In this example, X and Y are not independent.

Definitions

For discrete r.v. X, Y on a common sample space, we consider:

Joint distribution P(x, y) = P(X = x ∩ Y = y). It should satisfy Σₓ Σᵧ P(x, y) = 1.
Marginal distributions The distributions of each variable by itself: P(X = x) = Σᵧ P(x, y) and P(Y = y) = Σₓ P(x, y).
Conditional distributions (As many as there are values of X and Y.) X | Y = y is a r.v. with distribution P(X = x | Y = y) = P(x, y)/P(Y = y).

We may compute expectations of any univariate distribution:

Marginals E(X), E(Y), as usual.
Conditional expectations E(X | Y = y) = Σₓ x P(X = x | Y = y); E(Y | X = x) = Σᵧ y P(Y = y | X = x).

The law of iterated expectations (a version of Theorem 3):

Theorem 6. E(Y) = E[E(Y | X)] = Σₓ E(Y | X = x) P(X = x)

Exercise (List #19)

1. Check the law of iterated expectations for the example given at the beginning of this section.
2. Let D1 and D2 be two dice and let M = min(D1, D2) be the minimum of them. Compute the conditional distributions of M | D1 for each value of D1. Derive from them the joint distribution of D1 and M. Check the law of iterated expectations in this case.

6.2 Covariance and correlation

Covariance

For a function g of X, Y, the expectation is computed by

E(g(X, Y)) = Σₓ Σᵧ g(x, y) P(x, y)

Of particular importance is the covariance of X and Y, defined by

Cov(X, Y) = E[(X − µX)(Y − µY)]

where µX = E(X) and µY = E(Y). Note that a big positive covariance means that most of the products are positive, so above-average values of X tend to go with above-average values of Y.

Properties
1. For constant c, Cov(c, X) = 0.
2. Cov(X, Y) = Cov(Y, X)
3. V(X) = Cov(X, X)
4. Cov(X, Y) = E(XY) − E(X)E(Y)
5. V(X + Y) = V(X) + V(Y) + 2 Cov(X, Y)
6. If X and Y are independent, Cov(X, Y) = 0. Note that the converse is not true! Example: let X have equally likely values −1, 0, 1. Then Cov(X, X²) = 0 but X and X² are not independent!

Correlation

For non-constant X, Y, we define the correlation of X and Y as

Corr(X, Y) = Cov(X, Y) / (SD(X) SD(Y))

Properties
1. If X*, Y* are the standardized versions of X, Y, then Corr(X, Y) = Corr(X*, Y*) = Cov(X*, Y*) = E(X*Y*)
2. −1 ≤ Corr(X, Y) ≤ 1
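A minimal sketch, added here as an illustration, that rebuilds the die-and-coin joint table from above and checks Theorem 6 and the covariance numerically:

```python
import numpy as np
from math import comb

# Joint table P(X=x, Y=y): four-sided die X, then Y heads in X fair coin tosses.
P = np.zeros((4, 5))
for x in range(1, 5):
    for y in range(0, x + 1):
        P[x - 1, y] = 0.25 * comb(x, y) / 2**x   # P(X=x) * P(Y=y | X=x)

xs, ys = np.arange(1, 5), np.arange(0, 5)
EY = (P.sum(axis=0) * ys).sum()              # E(Y) from the marginal of Y
EY_iter = sum(0.25 * (x / 2) for x in xs)    # E[E(Y|X)], since E(Y | X=x) = x/2
print(EY, EY_iter)                           # both equal 1.25

EX = (P.sum(axis=1) * xs).sum()
EXY = sum(P[x - 1, y] * x * y for x in xs for y in ys)
print(EXY - EX * EY)                         # Cov(X, Y) = 0.625, positive
```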
6.3 Continuous case

Continuous joint distribution

Let X, Y be continuous r.v. on a common sample space, with real values. A joint density function of X and Y is a function f(x, y) defined on R² satisfying:

1. f(x, y) ≥ 0 for all x, y
2. ∫∫_{R²} f = ∫₋∞^{+∞} ∫₋∞^{+∞} f(x, y) dx dy = 1
3. For B ⊂ R², P((X, Y) ∈ B) = ∫∫_B f(x, y) dx dy

Note that now probabilities are volumes over the set of values. We may write

P(X ∈ dx and Y ∈ dy) = f(x, y) dx dy

Compare this to the discrete case: probabilities get replaced by densities, sums by integrals.

[Figure: a continuous joint density]

Continuous marginals, conditionals

Let X, Y be continuous r.v. with joint density f(x, y). The marginal density functions of X and Y are

fX(x) = ∫₋∞^{+∞} f(x, y) dy     fY(y) = ∫₋∞^{+∞} f(x, y) dx

The conditional density of X given Y = y is defined by

fX(x | Y = y) = f(x, y)/fY(y)    (think of y as a constant here)

and similarly, fY(y | X = x) = f(x, y)/fX(x).

Expectations and the law of iterated expectations work the same way as in the discrete case.

[Figure: marginal and conditional densities]

Exercises (List #20)

1. Compute the covariance and correlation for X and Y in the previous example (four-sided die and coin tossing). Before doing any calculation, by simply inspecting the joint distribution table shown there, could you state whether the covariance will be positive or negative?
2. Let (X, Y) have uniform distribution on the four points (−1, 0), (1, 0), (0, −1), (0, 1). Show that X and Y are uncorrelated but not independent.
3. Let X have uniform distribution on {−1, 0, 1} and let Y = X². Are X and Y uncorrelated? Are X and Y independent? Explain carefully.
4. Repeat the previous exercise for continuous X and Y.
5. Remember exercise (7)-5. Check that the X and Y described there are uncorrelated but not independent.

6.4 The multivariate normal

The multivariate normal distr. (1)

Reminder of the univariate case (see page 19): we say that X ∼ N(µ, σ²) when the density function is

f(x) = (1/(σ√(2π))) exp(−((x − µ)/σ)²/2)

Note that we may see (x − µ)/σ as the distance between x and µ measured in standard units.

For a random vector X = (X1, ..., Xp)′, let µ = (µ1, ..., µp) be the vector of means and let Σ = (σij) be the matrix of variances and covariances:

σij = Cov(Xi, Xj),   σii = V(Xi),   SD(Xi) = √σii

Note that here we don't use σ² for the variances. Suppose that Σ is positive definite; then its inverse Σ⁻¹ defines a distance in Rᵖ:

dist(x, y)² = (x − y)′ Σ⁻¹ (x − y)

(similar to (x − µ)(1/σ²)(x − µ) in the 1-D case before). We say that the random vector X has multivariate normal distribution Np(µ, Σ) if its density function is

f(x) = (2π)^(−p/2) |Σ|^(−1/2) exp(−(1/2)(x − µ)′ Σ⁻¹ (x − µ))

where |Σ| denotes the determinant of Σ.

[Figure: a bivariate normal density surface and its elliptical contours]

Properties of multiv. normal (1)

If X ∼ Np(µ, Σ) then (property 1 is checked numerically in the sketch below):

1. Any linear combination a′X = a1X1 + a2X2 + ... + apXp has normal distribution N(a′µ, a′Σa).
2. In particular, the marginals Xi are normal N(µi, σii).
3. If we combine several linear combinations of X into a random vector AX, where A is a [q × p] matrix, then the q-vector AX also has a normal distribution, Nq(Aµ, AΣA′).
4. As a consequence, any subset of components of X also has a normal distribution.

Properties of multiv. normal (2)

Note that in the following lines we are using block vectors and matrices, that is, vectors formed by concatenating vectors, or matrices formed by joining conforming matrices. In all cases, boldface letters are for vectors or matrices.

For univariate normals, zero correlation implies independence; for multivariate normals too. Let X = (X1; X2) be a random vector formed by two subvectors X1, X2. Then its vector of means can be written µ = (µ1; µ2), and the covariance matrix can be split into blocks as

Σ = [ Σ11  Σ12 ]
    [ Σ21  Σ22 ]

Then X1 and X2 are independent if and only if Σ12 = 0.

Also, if X1 and X2 are independent and distributed as Nq1(µ1, Σ11) and Nq2(µ2, Σ22) respectively, then the stacked vector X = (X1; X2) is normal, distributed as

Nq1+q2( (µ1; µ2),  [ Σ11  0   ]
                   [ 0    Σ22 ] )

Example: X = (X1, X2, X3) ∼ N3(µ, Σ) with

Σ = [ 4  1  0 ]
    [ 1  3  0 ]
    [ 0  0  2 ]

Then (X1, X2) and X3 are independent normals.
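A short numerical sketch of property 1 above (an added illustration; the mean vector, covariance matrix, and coefficient vector are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 2.0])
Sigma = np.array([[4.0, 1.0],
                  [1.0, 3.0]])
a = np.array([1.0, -1.0])          # an arbitrary linear combination a'X

X = rng.multivariate_normal(mu, Sigma, size=200_000)
lin = X @ a                        # a'X for each simulated draw

print(lin.mean(), lin.var())       # close to a'mu = -2 and a'Sigma a = 5
```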
Conditional distributions

In the bivariate case X = (X1, X2), any conditional distribution X1 | X2 = x2 is univariate normal:

N( µ1 + (σ12/σ22)(x2 − µ2),  σ11 − σ12²/σ22 )

(Note that the variance does not depend on the particular conditioning value x2.)

In general, if

X = (X1; X2) ∼ Np( (µ1; µ2),  [ Σ11  Σ12 ]
                              [ Σ21  Σ22 ] )

then X1 | X2 = x2 is normal with mean µ1 + Σ12 Σ22⁻¹ (x2 − µ2) and covariance matrix Σ11 − Σ12 Σ22⁻¹ Σ21. (Note again that the covariances do not depend on the particular conditioning value x2.)

6.5 Simple Regression

Simple regression: Example

A company launches a new product (a home cinema DVD system) in eight different regions of a country, with different prices in each region:

Price   340   360   380   360   300   380   270   300
Sales   42    38    35    40    44    38    45    42

1. Make a scatterplot of these data.
2. What is the covariance between these two variables, and their correlation?
3. Estimate the regression line.
4. What percentage of the variance of sales is explained by price?
5. Test the hypothesis that the "true" slope of the "true" regression line is 0 (test at the 5% level).
6. Find a 95% confidence interval for the slope coefficient.
7. Interpret the slope coefficient and its confidence interval.

We will be able to answer this example after a review of simple regression.

Simple regression

The population model is E(Y | X = x) = α + βx, or Y = α + βx + e, where e is a r.v. (called the errors or residuals).

Sample: (X1, Y1), ..., (Xn, Yn). Observed sample: (xi, yi) with yi = α + βxi + ei. (Note that the x's may be random or fixed.)

[Figure: the regression line y = α + βx, with intercept α, slope β, and a data point (xi, yi) at vertical distance ei from the line]

Simple regression: estimation

We need to estimate y = α + βx: two parameters, α and β.

Least squares estimation: the estimators α̂ and β̂ should solve

min over (a, b) of f(a, b) = Σᵢ₌₁ⁿ (yi − a − bxi)² = Σᵢ₌₁ⁿ ei²

(Exercise: show that the mean ȳ = (1/n) Σ yi solves min over c of Σᵢ₌₁ⁿ (yi − c)².)

Simple regression: estimation - 2

Solution: f is a convex function (a sum of convex functions), so the global minimum is at the a and b such that ∂f/∂a = 0 and ∂f/∂b = 0 (two equations for two unknowns).

• ∂f/∂a = Σᵢ 2(yi − a − bxi)(−1) = 0  ⟹  Σ(yi − a − bxi) = 0  ⟹  Σyi = na + b Σxi  ⟹  ȳ = a + bx̄
  (the regression line passes through (x̄, ȳ)).
• ∂f/∂b = Σᵢ 2(yi − a − bxi)(−xi) = 0  ⟹  Σxiyi = a Σxi + b Σxi²  ⟹  (1/n) Σxiyi = ax̄ + b (1/n) Σxi²

By substituting a = ȳ − bx̄ into the second equation, we obtain b.

Least squares estimates for α and β

β̂ = Σᵢ₌₁ⁿ (xi − x̄)(yi − ȳ) / Σᵢ₌₁ⁿ (xi − x̄)²,    α̂ = ȳ − β̂x̄

We have β̂ = Cov(X, Y)/V(X) = r · SD(Y)/SD(X). (If X, Y are standardized, β̂ = r.) A numerical sketch of these formulas follows the assumptions list below.

The regression model

Assumptions for the model yi = α + βxi + ei:

1. The xi are fixed; or, if the xi's are realizations of a r.v. Xi, then Xi and ei are independent.
2. E(ei) = 0
3. E(ei²) = σe², the same for all i.
4. E(ei ej) = 0 for all i ≠ j.
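A minimal least-squares sketch, added as an illustration, applying the formulas above to the price/sales data as tabulated here (the worked example below uses its own rounded figures):

```python
import numpy as np

price = np.array([340, 360, 380, 360, 300, 380, 270, 300], dtype=float)
sales = np.array([42, 38, 35, 40, 44, 38, 45, 42], dtype=float)

xbar, ybar = price.mean(), sales.mean()
beta_hat = np.sum((price - xbar) * (sales - ybar)) / np.sum((price - xbar) ** 2)
alpha_hat = ybar - beta_hat * xbar     # the line passes through (xbar, ybar)

print(beta_hat, alpha_hat)             # slope is negative: sales fall as price rises
```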
Sample distribution for β̂

• Se² = (1/(n − 2)) Σᵢ₌₁ⁿ (yi − α̂ − β̂xi)² = (1/(n − 2)) Σᵢ₌₁ⁿ êi² is an unbiased estimator of σe².
• E(β̂) = β
• V(β̂) = σe² / Σ(xi − x̄)² (so we can decrease the variance of the estimator by increasing the dispersion of the xi's, or by increasing n).
• Sβ̂² = Se² / Σ(xi − x̄)² is an unbiased estimator of V(β̂).

If the errors are normally distributed (or n is large enough), β̂ has a normal distribution, and

(β̂ − β)/Sβ̂ ∼ tₙ₋₂

From this we can compute confidence intervals and hypothesis tests.

Analysis of Variance

[Figure: observed point (xi, yi), fitted value ŷi on the estimated line y = α̂ + β̂x, and the residual ei = yi − ŷi]

We have yi = α̂ + β̂xi + ei = ŷi + ei, so (yi − ȳ) = (ŷi − ȳ) + ei, and

Σ(yi − ȳ)²  =  Σ(ŷi − ȳ)²  +  Σei²
SST (total) =  SSR (regression) + SSE (errors)

• For a perfect linear relation, SSE = 0 and SST = SSR.
• If there is no linear relation at all (β = 0), SSR = 0 and SST = SSE.

To measure the intensity of the linear relation, we define the coefficient of determination

R² = SSR/SST

• 0 ≤ R² ≤ 1
• R² × 100% is the percentage of the total variance explained by the regression, by the model, or by x.

In the particular case of simple regression, R² = r², and

r² = 1 − SSE/SST,   SSE = SST(1 − r²),   se² = SSE/(n − 2)

Assuming H0: β = 0 (no linear relation):

SSR / (SSE/(n − 2)) ∼ F₁,ₙ₋₂

(This is the Snedecor F distribution; see page 21.)

Our example worked out

A company launches a new product (a home cinema DVD system) in eight different regions of a country, with different prices in each region:

Price   340   360   380   360   300   380   270   300
Sales   42    38    35    40    44    38    45    42

Our model is Qi = α + βpi + εi, where pi is a non-random price and εi is a random variable called the "errors" or, more properly, the residuals. We are stating that the sales quantity Qi in region i depends linearly on the price, plus some other random variables not specified by the model, added together in the residuals. Since there are a lot of such other variables, we may assume that εi has a normal distribution when needed. Another way to state our model is E(Q | p) = α + βp, where the sales quantity Q is assumed to have a normal distribution for any given p.

See the Excel spreadsheet regression.xls at the web site of the course.

1. Make a scatterplot of these data. Excel gives us a rather poor scatterplot. [Figure: scatterplot of sales vs price; a big dot marks the averages of the data series, p̄ = 332.5, q̄ = 40.5.]

2. What is the covariance between these two variables, and their correlation? Using the formulas above, or Excel, we obtain Cov(qi, pi) = −730.00, r = −0.8101. Prices and sales have a fairly strong negative linear correlation: on average, sales decrease when price increases.

3. Estimate the regression line. Using the formulas above, or Excel, we obtain β̂ = −0.0719, α̂ = 64.41, so our estimate of the true regression line (always unknown) is Qi = 64.41 − 0.0719 pi + εi.

4. What percentage of the variance of sales is explained by price? R² can be computed as r² = 0.6563 or as R² = SSR/SST = 52.50/80.00 = 0.6563. We interpret this as: 65.63% of the variability of sales is due to the variability of prices, or the linear model explains 65.63% of the sales.

5. Test the hypothesis that the "true" slope of the "true" regression line is 0 (test at the 5% level). Assuming normality and homoscedasticity of the residuals (needed here!), under H0 (there is no linear dependence of sales on prices) we have β = 0 and β̂/Sβ̂ ∼ t₈₋₂.
Since we have observed a value of −0.0719/0.0212 ≈ −3.38 for this statistic, it is clear that the p-value is very small and we may safely reject the null hypothesis. The actual p-value can be computed using a statistical table found in any statistics book, or any software (e.g. Excel): p-value = 0.01486.

6. Find a 95% confidence interval for the slope coefficient. Assuming again normality of the residuals (needed here!), we know that (β̂ − β)/Sβ̂ ∼ tₙ₋₂, so we can construct the confidence interval using the 95% critical points of a t₆ distribution, which are ∓2.45 (stat tables, Excel, ...). The interval is then −0.0719 ∓ 2.45 · 0.0212 = (−0.1239, −0.0199). (These computations are sketched in code at the end of this example.)

7. Interpret the slope coefficient and its confidence interval. According to our estimate, an increase of one unit in price gives a decrease of about 0.07 units in sales. We don't have great precision in our estimate, but we can be reasonably sure that the real slope is between −0.12 and −0.02 (with a confidence of 95%). Since the whole 95% interval lies in the negative reals, we may be confident that this dependence is real, that the slope is not null.
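A sketch of the slope test and interval from the reported estimates (an added illustration; it takes β̂ and Sβ̂ as given rather than recomputing them from the data):

```python
from scipy.stats import t

beta_hat, se_beta, n = -0.0719, 0.0212, 8
tstat = beta_hat / se_beta                       # about -3.39

pvalue = 2 * t.sf(abs(tstat), df=n - 2)          # two-sided p-value, about 0.015
crit = t.ppf(0.975, df=n - 2)                    # 95% critical point, about 2.45

print(pvalue)
print(beta_hat - crit * se_beta, beta_hat + crit * se_beta)  # about (-0.124, -0.020)
```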