The Idea of Independence - Part II in Probability and Statistics
Sampling
Henry Mesa

Use your keyboard’s arrow keys to move the slides forward (▬►) or backward (◄▬). If you want to stop the slide show, use the Esc key on your keyboard.

As you view the slides, have paper and pencil handy. Take notes, and when asked to guess at a result, do so before going on. If something does not make sense, go back through the slides using the backward (◄▬) key on your keyboard. If the slides still do not make sense to you, write down your question and ask your instructor.

In the last series we looked at independence, and at how the conditional notation P(A | B) communicates whether we have independence: P(A | B) = P(A) only if the two events are independent.

But the examples we used last time were of the type that I like to view as “starting from a whole, and then going into a subgroup of the whole.”

For example, a deck of “regular” playing cards contains 52 cards: 13 clubs, 13 diamonds, 13 hearts, and 13 spades. These four groups are called suits. Each suit contains one card called the ace, so there are four aces in a deck of 52 cards and P(ace) = 4/52.

Suppose that I remove all the cards except the club cards. What is the chance of getting an ace now? P(ace | club) = ?

P(ace | club) = 1/13 and P(ace) = 4/52 = 1/13. We have independence between the events.

Notice how we went from a large group to a smaller group within the larger group. The notation P(ace | club) makes it clear to the reader that we have changed the sample space, by using the symbols “ | club” in the function notation. We want the probability of an ace given that our new sample space contains only club cards. The original sample space started with 52 cards; the subspace is a subset of the original sample space.

Also notice that the question is about a sample of size one; that is, one card is chosen at random.
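If you like to check things by counting, here is a minimal sketch of the card example. It builds the 52-card deck, then compares P(ace) on the whole deck with P(ace | club) on the club-only subspace. The names `deck`, `prob`, and `is_ace` are my own illustrative choices, and the sketch assumes every card is equally likely to be drawn.

```python
from fractions import Fraction
from itertools import product

RANKS = ["ace", "2", "3", "4", "5", "6", "7", "8", "9", "10",
         "jack", "queen", "king"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]

# The original sample space: all 52 (rank, suit) pairs.
deck = list(product(RANKS, SUITS))


def prob(event, space):
    """Fraction of the equally likely outcomes in `space` that satisfy `event`."""
    favorable = sum(1 for card in space if event(card))
    return Fraction(favorable, len(space))


def is_ace(card):
    return card[0] == "ace"


# Conditioning on "club" shrinks the sample space to the 13 club cards.
clubs_only = [card for card in deck if card[1] == "clubs"]

p_ace = prob(is_ace, deck)
p_ace_given_club = prob(is_ace, clubs_only)

print("P(ace)        =", p_ace)
print("P(ace | club) =", p_ace_given_club)
print("independent?  ", p_ace == p_ace_given_club)
```

Using `Fraction` keeps the comparison exact, so the equality test does not depend on floating-point rounding.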
A very important distinction: sampling (more than one item) involves the exact opposite effect. We start with an original sample space and use it to build a bigger one.

Here is the analogy that I like to present to you. The elements in chemistry are supposed to be the simplest structures, which can be used to create more complicated structures. Thus hydrogen, H, is an element, and so is oxygen, O. But put them together in the right order and combination and we get water, H2O, a compound. We can do the same with sample spaces and events. I like to think of sampling as the equivalent of creating compounds in chemistry. As we study probability, that is a good way to think about the more complicated structures: as compounds in chemistry.

What is the chance of getting an ace on a single draw from a deck of cards? P(ace) = 4/52; this is like an element in chemistry, a simple event.

What is the chance of getting exactly three aces on a draw of five cards from a deck of 52? P(ace AND ace AND ace AND not ace AND not ace) = ? This is like a compound in chemistry: we used the simple event of getting an ace to create a more complicated structure.

If you look closely, this involved sampling from the deck multiple times. In this example my sample space changed from one with 52 items in it, {10 of clubs, ace of spades, 3 of hearts, …}, to a more complicated structure with millions of items in it. Just one “thing” consists of five connected smaller events, such as (ace of hearts, 10 of clubs, 5 of spades, 7 of hearts, 3 of clubs). Those five smaller events are just one item when I draw five cards from a deck of 52.

To reiterate the analogy, this is like saying H is hydrogen and O is oxygen, but to get one water molecule you need two hydrogen atoms and one oxygen atom connected in a precise manner: H2O. You need three atoms to get one water molecule.
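To get a feel for how large the “compound” sample space is, here is a rough sketch that counts five-card hands. It assumes the order of the five cards does not matter (my assumption; the slides do not say which way to count), and it uses a standard counting argument for “exactly three aces” that the slides have not introduced, so treat it as a preview and not as the method used here.

```python
from fractions import Fraction
from math import comb

# Number of distinct five-card hands when order is ignored (assumption).
hands = comb(52, 5)

# Hands with exactly three aces: choose 3 of the 4 aces
# and 2 of the 48 cards that are not aces.
three_ace_hands = comb(4, 3) * comb(48, 2)

p_three_aces = Fraction(three_ace_hands, hands)

print("five-card hands:        ", hands)  # 2598960
print("hands with three aces:  ", three_ace_hands)
print("P(exactly three aces) = ", p_three_aces, "≈", float(p_three_aces))
```

One simple event, drawing a single card, has 52 outcomes; the compound event of drawing five has about 2.6 million.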
So, in order to understand independence from the sampling perspective, we need to see how the notation P(A | B) is used when we sample more than one item. After all, to show independence I need to show that P(A | B) = P(A), and that will not change.

Also keep in mind that when I ask about independence in a problem where I am sampling more than one item, I am asking about a change in probability (chance) during the sampling procedure. The key word here is “during.” It asks, “Has the probability of getting a particular event changed because it is now, for example, the fifth item to be chosen?”

So how do we interpret the symbol P(A | B) during sampling? It turns out to be easier than you think. When one samples, we think of the sampling as taking place in a sequence: sample the first item, then the second, then the third, and so on until you get the last item in your sample. This does not have to be the case; the sampling can occur simultaneously. Whichever viewpoint you choose, it does not change the final outcomes of this discussion.

So here is how the notation P(A | B) works. Suppose that I am going to choose three cards from a deck of 52 cards without replacement. I could ask, what is the probability of choosing an ace? As stated earlier, P(ace) = 4/52.

Now I actually sample a card and it turns out to be the 10 of hearts. I keep the card and sample again. What is the probability of an ace now? Here is how I would write the question: P(ace | 10 hearts) = ? I am using the “given that” symbol “ | ” to indicate that my first card is the 10 of hearts.

How do I know that the notation refers to sampling as described here and not to the single-draw scenarios depicted earlier? By the context of the scenario. The notation alone will not indicate which scenario is which, but from the situation as written in words you can tell what the notation is intended to mean.

So, P(ace | 10 hearts) = 4/51.
Notice the denominator is 51, since I have only 51 cards left, and 4 aces are left. Do I have independence during the sampling? No, since on the first draw P(ace) = 4/52, but on the second draw P(ace | 10 hearts) = 4/51. Thus P(ace) ≠ P(ace | 10 hearts), and I do not have independence during the sampling.

Suppose that on the second draw the card is the 5 of clubs. What is the probability of an ace on the third attempt? See if you can write down the question using function notation.

Did you get it? P(ace | 10 hearts and 5 clubs) = ? Now what is the answer? P(ace | 10 hearts and 5 clubs) = 4/50.

Do you see the pattern? P(item you want | what has occurred already during sampling). With this notation, a question about independence during sampling is easy to settle by inspection.

I have ten marbles in a bag, of which four are red and six are green. I will sample twice from the bag, and I will not put the marble back after each draw. Do I have independence during the sampling procedure? Can you answer the question? I bet you can.

Remember, what you want to answer is: P(green | red) = P(green)? Or P(red | red) = P(red)? Or P(green | green) = P(green)? Now try it before moving on.

Well, let's pick one. P(green | green) = P(green)? P(green) = 6/10 and P(green | green) = 5/9, which means that we do not have independence, since 6/10 ≠ 5/9.

You may start to ask, how do we get independence during sampling then? Do we ever have independence during sampling?

Consider the following sampling scenario. We sample a marble from the bag, note the color, put it back in the bag, shake the bag, and sample again. Do we have independence now during sampling? There are still 10 marbles in the bag, six green and four red. Well? Give it a try first!

Remember, what you want to answer is: P(green | red) = P(green)? Or P(red | red) = P(red)? Or P(green | green) = P(green)? Now give it a try.

P(red) = 4/10 and P(red | green) = 4/10. Yes, we have independence during the sampling procedure.
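The two marble scenarios can be put side by side in a few lines. This is only a sketch under the assumptions stated in the slides (ten marbles, four red and six green, each remaining marble equally likely); the function name `p_second_green` and its parameters are hypothetical names I chose for illustration.

```python
from fractions import Fraction


def p_second_green(first, replace, red=4, green=6):
    """P(second marble is green | color of the first marble drawn).

    `replace` says whether the first marble goes back in the bag.
    """
    if not replace:
        # Without replacement the first marble is gone for the second draw.
        if first == "green":
            green -= 1
        else:
            red -= 1
    return Fraction(green, red + green)


p_green = Fraction(6, 10)  # probability of green on the first draw

for replace in (False, True):
    label = "with replacement" if replace else "without replacement"
    for first in ("green", "red"):
        p = p_second_green(first, replace)
        print(f"{label}: P(green | {first}) = {p}, equals P(green)? {p == p_green}")
```

Without replacement the conditional probability moves to 5/9 or 6/9 depending on the first marble; with replacement it stays at 6/10 no matter what was drawn first.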
The probability of choosing a red marble, now that we put the marbles back after each pick, remains the same from sample to sample. We have independence during the sampling procedure.

The sampling procedure as outlined is called sampling with replacement, versus the first scenario, which is sampling without replacement. By putting the marbles back I have created the scenario of sampling with replacement. If I keep the marbles after each pick, then this is called sampling without replacement.

So far the sample space did not come into play as promised. Oh, but it did; I just did not emphasize it yet. So let's look at some important issues here. Why should you care? So that you can recognize what to do when I am not around. Right now I am leading you, but later, when you are alone, you will have some tools to fall back on.

In the marble scenario the population I am sampling from (the sample space) is finite: 10 marbles. Keep in mind that the sample space of my sampling procedure is much larger than 10 items (not important right now, but keep this in the back of your mind for future use).

Consider flipping a fair coin. I flip a coin three times. Do I have independence during sampling? See if you can answer the question. Remember to use conditional notation to answer it: you know P(A | B) = P(A) when you have independence. Well?

P(Heads) = 0.5 and P(Heads | Tails) = 0.5. Do I have independence?
Yes, since the probability of the coin coming up heads does not change from sample to sample. Notice that there is no “replacement” here. Or is there? We could argue for a while, but let us not get sidetracked. We could think of it as the coin, by the nature of the problem, automatically resetting itself. Some people like to think of it as a bag with infinitely many coins in it, which will be a helpful analogy in about two or three more slides. Other people say that “the coin has no memory.” All these statements are trying to come to grips with the problem at hand.

Let me give you some scenarios, and see if you can detect whether or not independence exists during sampling.

In a class of forty children, 15 are girls. I will sample five children at random and note down their gender. Do I have independence during sampling? See if you can answer the question. To reiterate, use the notation P(A | B) = P(A) to show that I do or do not have independence. Well?

P(girl) = 15/40, but P(girl | girl) = 14/39, so I do not have independence. Notice that to answer the question, I assumed that I would not pick the same child twice. Assuming that, there is no independence during sampling.

But what if the sampling were being done to give out prizes randomly, and you allowed a child to win more than one prize if their name was chosen more than once? One possible outcome would have one child winning all the prizes. Do I have independence now? Try to answer the question before moving on.

In that case I do have independence during sampling: P(girl) = 15/40 and P(girl | girl) = 15/40, and so on, so P(girl | girl) = P(girl).

Now you are going to toss a fair die five times and record the value that appears on each toss. Do you have independence during the throws? Give it a try. P(a six) = ? P(a six | a four) = ? I hope you can see that you do have independence during the sampling procedure.
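The coin and die questions can also be checked by listing outcomes. The sketch below assumes every sequence of faces is equally likely, which is the usual model for a fair coin or die and is really the “no memory” idea written down as an assumption. The helper `conditional` is a hypothetical name of my own.

```python
from fractions import Fraction
from itertools import product


def conditional(event, given, space):
    """P(event | given), counting equally likely outcomes in `space`."""
    kept = [outcome for outcome in space if given(outcome)]
    hits = [outcome for outcome in kept if event(outcome)]
    return Fraction(len(hits), len(kept))


def always(outcome):
    """A condition that keeps the whole sample space."""
    return True


# Two throws of a die: 36 ordered pairs (first throw, second throw).
throws = list(product(range(1, 7), repeat=2))
p_six = conditional(lambda t: t[1] == 6, always, throws)
p_six_given_four = conditional(lambda t: t[1] == 6, lambda t: t[0] == 4, throws)

# Three flips of a coin: 8 ordered triples.
flips = list(product("HT", repeat=3))
p_heads = conditional(lambda f: f[1] == "H", always, flips)
p_heads_given_tails = conditional(lambda f: f[1] == "H", lambda f: f[0] == "T", flips)

print("P(six on 2nd throw)             =", p_six)
print("P(six on 2nd | four on 1st)     =", p_six_given_four)
print("P(heads on 2nd flip)            =", p_heads)
print("P(heads on 2nd | tails on 1st)  =", p_heads_given_tails)
```

In both cases the conditional probability matches the unconditional one, which is what independence during sampling means.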
P(a six) = P(a six | a four) = 1/6.

You are going to sample 10 students at random from a university that has 15,000 registered students to participate in a study. You cannot select the same student twice; that is, all ten students are unique. Do we have independence during sampling? Give it a try.

P(Shane) = 1/15000. P(Shane | Marla) = 1/14999. You can see that the two values are not the same, thus I do not have independence, but…

Warning! You must be very open-minded in order to grasp what is about to happen next! Your ability to accept this or not will determine how well you will do with future statistics concepts. I am trying to let you in on a little secret that is not well explained in most statistics books; the author expects that you will eventually catch on subconsciously. I am referring to what occurs when you are given a probability question and you wonder, “How do I know this is the calculation I should be doing?”

Alright, you do not have independence, but are the two numbers that different from each other?

0.00006666666666 versus 0.00006667111141

Not much of a difference. So we don’t have independence, but for the practical purposes of calculating probabilities we will assume we do. What? That’s correct. We recognize that the two numbers, being almost identical, can be interchanged for calculating purposes, recognizing of course that you will be off from the true value, but maybe not by much once you recognize what to look for.

This is an important viewpoint as you study Statistics. Often, there is a correct way to do something.
But that process might be difficult. Thus, statisticians look for ways to see if another process would give results that are nearly as good. Often those other processes are what is actually calculated in practice. This may seem strange to you, but in reality most people employ a similar way of thinking that leads to certain decisions: “that is good enough for…”

An aide rushes in: “Senator, you are leading the polls! You have 59% support.” A second aide rushes in: “Senator, I have a better estimate. You are leading the polls with 57.92% of the votes.” Notice that both numbers say essentially the same thing. The Senator is leading with more than half of the votes. One number is not any better than the other as far as deciding what to do next.

Let's run through some scenarios so you will start recognizing the situations.

In California, 68% of registered voters who voted rejected a tax increase measure. A newspaper reporter wants to interview 100 registered voters to see why they voted the way they did. So we will think of the people we contact as being either for the tax increase or against the tax increase. As we sample voters, do we have independence from voter to voter? Think about what this means for a minute before continuing.

So, as we sample each person, does P(for) change? That is, P(for | for) = P(for)? Or P(for | against) = P(for)? Think about it, and commit to a response before continuing.

I hope you said that we do not have independence, since the reporter is sampling without replacement; once the reporter interviews a person, that person will not be chosen again, essentially removing the person from the sample space. But, for practical purposes, the sampling procedure is nearly independent. Let's say 100,000 people voted.
Since 68% voted against the measure, 32,000 of them were for it. So P(for) = 32,000/100,000, but P(for | for) = 31,999/99,999, which is nearly identical. So while the sampling is not independent, for practical purposes it is, and we will act as if it is.

Out of a shipment of 500,000 potatoes, let us say 10% would not meet some criteria set by a factory, and thus those potatoes would need to be rejected. A quality inspector will choose two random bucketfuls of potatoes for inspection (about 50 potatoes) to decide whether to accept the shipment. As the inspector samples from the shipment, does the probability of finding a defective potato change from sample to sample (which, if it does, means we do not have independence)?

Again, what I am asking is: P(defective | defective) = P(defective)? Or P(defective | defective and defective) = P(defective)? Think about it before answering.

Again, I hope you said that we do not have independence, since P(defective) = 50,000/500,000 while P(defective | defective and defective) = 49,998/499,998. But again the same issue arises. We do not have independence, but for practical purposes (calculating probabilities) we do.

So, if I am sampling without replacement, then as long as the population I am sampling from is very large compared to my sample size, I may not have actual independence, but I am close enough to say yes for practical purposes (calculating probabilities).

So let's end the slides by showing you some of these coveted calculations you have been hearing so much about. In the first example I will show you the “sample space” that results from sampling, and do an actual calculation. With later examples we can dispense with the need to show the sample space; you will know it is there and how it can be created, even though, because of its sheer size, it will not be possible to actually show it for the most interesting scenarios.

So let's make sure we have this idea down by doing a simple example. A bag has 5 marbles, of which 2 are red and 3 are green. What is the chance a marble chosen at random from the bag is red? P(red) = ?
P(red) = 2/5.

But now let's ask what would happen if we sampled two marbles from the bag. What is the chance that two marbles chosen at random from the bag are both green? P(green AND green) = ?

Let's look at a visual of the sample space we have just created by asking the question, “What is the chance of getting two green marbles in a row?” Keep in mind that sample spaces are “created” by the question you ask.

Let me use a table of all the possible outcomes. While the marbles of a particular color are impossible to distinguish, they are physically different marbles, so I will use subscripts to denote which marble is which. Each column is the first marble drawn, each row is the second marble drawn, and each entry lists the two colors in the order drawn.

                        First marble
Second marble     R1    R2    G1    G2    G3
G1                RG    RG    --    GG    GG
G2                RG    RG    GG    --    GG
G3                RG    RG    GG    GG    --
R1                --    RR    GR    GR    GR
R2                RR    --    GR    GR    GR

A cell marked -- indicates that the marble has already been taken; once G1 is picked, for example, it cannot be picked again. Every marble is equally likely to be chosen, so by counting how many of the 20 outcomes lead to the desired result we get

P(G and G) = 6/20

What I need you to see at the moment is how my sample space just increased by sampling two items instead of one. I also want you to see that the answer to the question can be arrived at by counting how many outcomes meet the criteria.

Now I will use what we have learned so far to make connections to previous material. We could have answered the question using conditional probabilities. P(G and G) = ? The first marble needs to be green, and after that so does the second marble. I will use a tree diagram to illustrate all that could occur (the sample space).
My original sample space contained just two items, {R, G}. My new sample space, which comes from sampling more than one item from that original sample space, consists of four things: {RR, RG, GR, GG}. We can debate later about RG and GR being different.

Now I will add the probability of going from a particular node to a particular ending. The first part of the tree shows what occurs on the first draw: P(green) = 3/5. But on the second draw, if the first marble is green, we then have only four marbles to choose from, of which only two are green. The notation will be P(green | green) = 2/4.

Thus the probability of choosing two green marbles is P(green)P(green | green) = (3/5)(2/4) = 6/20, just like in the previous example.

In general we are saying that one way to calculate P(A and B) is by having the following information: P(A) and P(B | A), or P(B) and P(A | B). So then we can use the general formula

P(A and B) = P(A)P(B | A) or P(B)P(A | B)

Thus the probability of choosing two green marbles is

P(green AND green) = P(green)P(green | green) = (3/5)(2/4) = 6/20

So we used the tree diagram as a tool for calculating probabilities that describe sampling from a population more than once. Had we sampled three times instead of two, the corresponding tree would have one more level, and

P(green AND green AND green) = P(green)P(green | green)P(green | green AND green) = (3/5)(2/4)(1/3) = 6/60

So do we have independence? In the language of samples larger than one, what I am asking is this: is the probability of getting a particular outcome affected as I continue to sample? Using my example, does the probability of getting a green marble change as I continue to grab marbles from the bag? I think you would agree the answer is yes. On the first grab, the chance is P(green) = 3/5. But on the second round the probability has changed to P(green | green) = 2/4, which is not the same as 3/5. So this clearly shows I do not have independence as I sample from the bag.
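Here is a small sketch that checks the two routes to the same answer for the five-marble bag: counting ordered outcomes (the table approach) and multiplying conditional probabilities (the tree approach). The function names are hypothetical, and the code assumes sampling without replacement with every remaining marble equally likely.

```python
from fractions import Fraction
from itertools import permutations

# Five physically different marbles: two red, three green.
bag = ["R1", "R2", "G1", "G2", "G3"]


def p_all_green_by_counting(draws):
    """Count ordered outcomes without replacement in which every marble is green."""
    outcomes = list(permutations(bag, draws))
    favorable = [o for o in outcomes if all(m.startswith("G") for m in o)]
    return Fraction(len(favorable), len(outcomes))


def p_all_green_by_chain_rule(draws, green=3, total=5):
    """Multiply P(green), P(green | green), ... along one branch of the tree."""
    p = Fraction(1)
    for already_drawn in range(draws):
        p *= Fraction(green - already_drawn, total - already_drawn)
    return p


for draws in (2, 3):
    counted = p_all_green_by_counting(draws)
    chained = p_all_green_by_chain_rule(draws)
    print(f"{draws} draws: counting = {counted}, chain rule = {chained}")
```

`Fraction` reduces automatically, so 6/20 prints as 3/10 and 6/60 prints as 1/10; the two methods agree for both sample sizes.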
What would have to happen in order to have independence during the sampling procedure? To have independence I would need to put the marble back after each pick, shake the bag, and draw again, producing a new tree.

[Tree diagram: two draws with replacement]

Notice that P(green) = 3/5, and for the second pick P(green | green) = 3/5 again, since the marble is put back. We have independence during sampling.

Alright, what have we learned so far? The idea of independence is still the same as before, but now we ask about independence during the sampling procedure. Just like before, to show independence we need to show P(A | B) = P(A). Keep this in mind, as it will be critical for what is coming next.

A Consequence of Having Independence

In general,

P(A and B) = P(A)P(B | A),
P(A and B and C) = P(A)P(B | A)P(C | B and A),
P(A and B and C and D) = P(A)P(B | A)P(C | B and A)P(D | A and B and C).

Let's say that events A and B are independent. Then P(A and B) = P(A)P(B | A) = P(A)P(B), since P(B | A) = P(B).

Let's say that events A, B, and C are independent. Then P(A and B and C) = P(A)P(B)P(C), and so on, since P(C | B and A) = P(C), which says that the probability of C does not change if we change the sample space to the scenario in which the events A and B have already occurred.

So now I will pose a few problems of the kind you will see in most textbooks, and show how you should view those problems.

In a population of 60,000 registered voters, 70% favor all-day kindergarten. What is the probability of choosing five registered voters at random for an interview and having all five favor all-day kindergarten?
I want to answer the question P(favor AND favor AND favor AND favor AND favor), which could be answered by knowing P(favor)P(favor | favor)P(favor | favor and favor) and so on. That calculation is not very pleasant to consider, but do I have independence?

To answer this I can ask: P(favor | favor) = P(favor)? No, since after I interview the first person I will not interview them again; this is sampling without replacement.

But for practical purposes do I have independence? Yes, since each interview takes only one person out of 60,000 voters, so P(favor) ≈ P(favor | favor). Thus

P(favor AND favor AND favor AND favor AND favor) ≈ (0.7)(0.7)(0.7)(0.7)(0.7) = 0.7^5 ≈ 0.168

One last example. I toss a die three times. The die is fair. What is the probability of getting three “ones” in a row? Do I have independence during sampling, or for practical purposes do I have independence during sampling?

Here I do have independence, since P(one | one) = P(one); that is, a fair die has no memory of past outcomes. You can think of it as sampling with replacement; in this case the die automatically reverts to its original state after each throw.
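Both closing calculations can be checked numerically. The sketch below compares the exact without-replacement product for the kindergarten example with the independence shortcut 0.7^5, and then computes the die result. It assumes exactly 42,000 of the 60,000 voters (70%) are in favor; the function name `p_all_favor_exact` is a hypothetical one chosen for illustration.

```python
from fractions import Fraction


def p_all_favor_exact(population, favor, draws):
    """Exact P(all sampled voters favor) when sampling without replacement."""
    p = Fraction(1)
    for already_chosen in range(draws):
        p *= Fraction(favor - already_chosen, population - already_chosen)
    return p


# Kindergarten example: 70% of 60,000 voters is 42,000 in favor (assumed exact).
exact = p_all_favor_exact(population=60_000, favor=42_000, draws=5)
approx = Fraction(7, 10) ** 5  # treat the five draws as independent

print("exact (no replacement)  :", float(exact))
print("approx (independence)   :", float(approx))
print("difference              :", float(approx - exact))

# Fair die thrown three times: the throws really are independent.
p_three_ones = Fraction(1, 6) ** 3
print("P(one AND one AND one)  :", p_three_ones)
```

The shortcut differs from the exact value only in the fifth decimal place, which is the sense in which the sampling is independent “for practical purposes.”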
P(one AND one AND one) = (1/6)^3 = 1/216

You can see that if we can justify having independence, or almost having independence, we can make some calculations much easier; it removes a huge headache from the calculation. As you continue studying statistics you will see that many statistical tests use the fact that we have near independence during the sampling process.

Having said this, there have historically been many situations in which the calculations assumed independence and that assumption was wrong, with disastrous results. A memorable case involved the Challenger Space Shuttle disaster.

View this slide show as often as you need. Keep in mind that I am not expecting you to get everything on the first pass through. View this slide set, do some reading, attempt some problems, think about what you did, and view the slides again. Repeat and repeat. Write down questions you may have and ask your instructor. Continue to work hard.

The End
