Lab 4 for Math 17: Probability and Simulation

1 The Law of Total Probability and More

The law of total probability gives us some additional options for finding probabilities of events. For an event $A$ and a set of events $B_i$, $i = 1, \dots, n$, which partition the entire sample space, you can find the probability of $A$ using the probabilities of the $B_i$ and the conditional probability of $A$ given each $B_i$. That is, the law of total probability states that:

$$P(A) = \sum_{i=1}^{n} P(B_i)\,P(A \mid B_i).$$

For an interesting application of the law of total probability, consider the following situation. As a researcher, you need to ask your subjects a very sensitive yes/no question, and you aren’t sure they are going to reply honestly. Warner’s Randomized Response Model was designed for this situation and is based on the law of total probability.

Before we implement the model, let’s decide on a not-so-sensitive question that you all want to learn the answer to. For example, the probability that students in the class have had an internship while at Amherst.

Our question of interest (yes/no) is:

As the researcher, you would give your subjects a coin along with the following directions:

Flip the coin twice, and record the flips just for yourself. If the first flip was heads, answer question 1. If the first flip was tails, answer question 2. On the slip of paper, record ONLY the yes/no answer to your question (not the flips).

Additionally, you would need the following questions set up:

Question 1 (Q1). (Question of Interest)
Question 2 (Q2). Was the second flip heads?

Thinking about the setup and the fact that we are interested in the probability of a yes reply to question 1, we can use the law of total probability as follows:

$$P(\text{Yes}) = P(\text{Yes} \mid Q1)\,P(Q1) + P(\text{Yes} \mid Q2)\,P(Q2).$$

Note that we can fill in values for most of these probabilities already:

$$P(\text{Yes}) = P(\text{Yes} \mid Q1)(.5) + (.5)(.5).$$

Compute $P(\text{Yes})$ based on the class responses and then compute $P(\text{Yes} \mid Q1)$.
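The last step amounts to solving the total-probability equation for P(Yes | Q1). A minimal sketch in R, assuming a fair coin: the function name `estimate_q1` and the input 0.40 are hypothetical, chosen only for illustration, and are not class data.

```r
# Sketch: solve P(Yes) = P(Yes|Q1) * P(Q1) + P(Yes|Q2) * P(Q2) for P(Yes|Q1).
# estimate_q1 is a hypothetical helper name, not part of the lab's online code.
estimate_q1 <- function(p_yes, p_q1 = 0.5, p_yes_q2 = 0.5) {
  # p_q1:     P(first flip is heads), so the subject answers Q1
  # p_yes_q2: P(second flip is heads), the chance of "yes" on Q2
  (p_yes - p_yes_q2 * (1 - p_q1)) / p_q1
}

# Hypothetical observed proportion of "yes" slips (not real class data)
estimate_q1(0.40)
```

With a small class the observed proportion of yes replies can fall below .25, in which case this estimate is negative. That reflects sampling variability rather than an error in the formula.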
2 Simulating Probabilities

Simulations can be powerful tools to help guide understanding of the world and, in this case, probability. You MUST read the directions for this activity. Questions to check your understanding are provided along the way. Code for this exercise is available online for you to copy/paste, since it is run directly in R and not Rcmdr.

To begin, consider randomly guessing on a true/false quiz of 10 questions. We might be interested in the probability of a student scoring 8 or higher by guessing. We can investigate this in many ways.

First, let’s try a method by hand. Assuming that we use a fair coin, we can flip a coin 10 times and count the number of heads (or tails, but you have to pick one) as the number of correct answers. Try it. What score did you get on the quiz when simulating it with coin flips?

If we really want to simulate the distribution of scores, we need more than just one quiz score, so let’s repeat. Generate 2 more quiz scores using coin flips. What scores did you get?

Three scores still aren’t enough to visualize the distribution of scores, let alone the probability of scoring 8 or higher. Let’s gather the scores from the entire class into a basic dotplot on the board. What does the distribution of scores look like? Does it seem like there is a high probability of scoring 8 or higher if you are randomly guessing?

We can simulate faster by computer than by hand. In this context, R can simulate 10 random numbers, each between 0 and 1, signifying the 10 random guesses. The possible answers to each question are T and F, and one of those is correct, so we can arbitrarily tell R to round each number to the nearest integer and let 0 signify an incorrect response and 1 a correct response for all ten questions. The sum of the rounded random numbers then represents the number of correct responses on a single quiz by guessing. The process can be repeated for thousands of quizzes.

Check your understanding:

1. What probability are we running the simulation in R to try to estimate?
2. Following these simulation instructions, at the point where we simulate 10 numbers, which of the following lists of numbers are NOT possible?
   a. 1, 8, 6, 7, 10, 3, 9, 4, 5, 2
   b. 0.68163862, 0.76066568, 0.92300422, 0.04519679, 0.99642235, 0.52319682, 0.34061928, 0.60487563, 0.63202120, 0.81534584
   c. 1, 1, 1, 0, 1, 1, 0, 1, 1, 1
3. Once the numbers are simulated and rounded, a correct response will be indicated by:
4. Determine the quiz score for the simulated list of numbers: 0.05307022, 0.62305132, 0.48539199, 0.01777247, 0.26683285, 0.91124043, 0.80340403, 0.26077497, 0.62321026, 0.43242987

Now let’s actually try the simulation. You will ONLY need to open R, not Rcmdr. These commands can be run in the R window.

1. For a single quiz result, try the following command: sum(round(runif(10))). Here sum computes the sum, round does the rounding, and runif(10) generates 10 random numbers between 0 and 1. What score did you get for your single quiz (hint: that is what is output to the screen)?
2. Run the command again. Now what score did you get?
3. Run a set of 100 quizzes by copying/pasting the following commands (code online):

       sumcorrect=NULL; freq=NULL
       for(i in 1:100) {sumcorrect[i]=sum(round(runif(10)))}
       for(i in 1:11) {freq[i]=length(which(sumcorrect==i-1))/length(sumcorrect)}
       freq

   The scores are saved in the variable sumcorrect and the relative frequency distribution is saved in freq. Create a histogram and boxplot for sumcorrect using the commands hist(sumcorrect) and boxplot(sumcorrect), which let you see the distribution of the number correct out of 10 when guessing. What do those graphs show you about this distribution?
4. The variable freq is the relative frequency distribution for the scores (i.e., the estimated probability distribution for X = number of correct responses in 10 T/F questions). Fill in the following table with the values of freq for 100, 1000, and then 10000 quizzes by running the appropriate number of quizzes and rerunning the freq computation (the only number you have to change is the 100 above, to 1000 and then to 10000) or by copy/paste.

NumQuiz/Score          0       1       2       3       4       5       6       7       8       9      10
100
1000
10000
Binomial(10,.5)   .00098   .0098   .0439   .1172   .2051   .2461   .2051   .1172   .0439   .0098  .00098

   Do you see much difference between the frequency distributions for the different numbers of quizzes?
5. How do your simulated distributions compare to the exact distribution (last row)? Why is a Binomial(10, .5) the exact distribution?
6. Find the probability a student scores 8 or higher by guessing, using the simulated distribution and the exact distribution. You should try the binomial calculation yourself (at least set it up) and confirm using the table.
7. Does it seem likely that someone who scores 8 out of 10 right was guessing?
8. What about someone who scored 7 out of 10?

3 Practice with Trees (Diagrams)

Many species of fish are susceptible to high levels of nitrite in their water, which can in turn cause Brown Blood disease, where the fish literally suffocate to death because their blood can no longer carry oxygen. During the capture stage of a capture-recapture sampling problem, you decide to take samples of fish blood to test for Brown Blood disease (assume you have developed a test for this). Assume that in the region where you are taking samples of fish, 5 percent of sampling locations (ponds, lakes, rivers, etc.) have high nitrite concentrations and the fish there are suffering from Brown Blood disease. The test you have developed will detect Brown Blood disease 90 percent of the time when the fish actually has it, but also indicates the fish has the disease 10 percent of the time when the fish does not have the disease. Use a tree diagram to assist you in determining the following probabilities:

1. The probability a randomly selected fish will test positive
2. The probability a randomly selected fish actually has the disease given that it tests positive
3. The probability a randomly selected fish has the disease and it was not detected

4 Python Eggs

The following table contains data on the number of eggs hatched and not hatched at 3 different temperature settings for python eggs, as recorded by scientists studying the environmental impact of temperature changes on python reproduction.

Temp/Hatch?   Cold   Neutral   Hot   Total
Yes             16        38    75
No              11        18    29
Total

1. Name an event related to this scenario and its complement.
2. What is the probability a randomly selected egg from this experiment hatched?
3. Given each temperature setting (do a calculation for cold, neutral, and hot), what is the probability a randomly selected egg from that level hatched?
4. The calculations in 2 and 3 allow you to check for independence (be sure you understand why). Are the events temperature and hatch status independent?
5. Use the probabilities from 2 and 3 as population probabilities to complete the following:
   a. What is the probability a randomly selected python who leaves her egg at a neutral temperature will have a hatched egg?
   b. Suppose two randomly selected pythons leave their eggs at cold temperatures. What is the probability that both eggs hatch?
   c. Suppose a randomly selected python leaves her egg at an unknown temperature. What is the probability that her egg doesn’t hatch?
   d. Suppose three randomly selected pythons leave their eggs at hot temperatures. What is the probability that all three eggs don’t hatch?
   e. In the same setting as d., what is the probability at least one of the three eggs hatches?

5 Random Variables

1. Provide an example of a discrete random variable and an example of a continuous random variable.
2. You have obtained a random sample of 10 rainwater samples and you know that the probability a sample from this region is determined to be acid rain is .3 (for this example, acid rain indicates a pH level lower than 5.3). Let X be the number of your rainwater samples that turn out to be acid rain.
   a. What distribution does X have? What is the expected value of X?
   b. What is the probability that exactly two rainwater samples turn out to be acid rain?
   c. What is the probability that at least one rainwater sample turns out to be acid rain?

6 Probability Question to Turn In

A study investigating the relationship between coronary heart disease (CHD) and anger levels (measured using the Spielberger Trait Anger Scale) resulted in the following table classifying the 8474 subjects based on heart disease status and anger level.

HD/Anger   Yes (CHD)   No (CHD)   Total
Low               53       3057    3110
Moderate         110       4621    4731
High              27        606     633
Total            190       8284    8474

a. What is the probability a randomly selected subject had coronary heart disease? Does this probability represent a sample proportion or population proportion?
b. What is the probability a randomly selected subject had high anger levels?
c. What is the probability a randomly selected subject had coronary heart disease if it was known the subject had high anger levels?
d. Can you conclude that heart disease status and anger levels are independent? Explain briefly.

For the last 2 parts, treat the probabilities as population probabilities (i.e., do not worry about the effect of sampling without replacement).

e. What is the probability that 2 randomly selected subjects will have no CHD given that each of them has a high anger level?
f. What is the probability that a random sample of 2 subjects will yield one subject with CHD and moderate anger level and another subject with no CHD and moderate anger level?
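One way to set up the table calculations in R is sketched below. The counts are copied from the heart disease table above; the object names (`chd`, `marginal_chd`, `cond_chd`) are hypothetical and not part of the lab's online code. It is meant as a check on hand calculations, not a replacement for them.

```r
# Counts from the heart disease table (rows: anger level, columns: CHD status).
chd <- as.table(matrix(c( 53, 3057,
                         110, 4621,
                          27,  606),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(anger  = c("Low", "Moderate", "High"),
                                       status = c("Yes", "No"))))

addmargins(chd)                            # table with row and column totals

marginal_chd <- colSums(chd) / sum(chd)    # P(CHD status), ignoring anger level
cond_chd <- prop.table(chd, margin = 1)    # each row: P(CHD status | anger level)

marginal_chd
cond_chd
```

If heart disease status and anger level were independent, each row of `cond_chd` would match `marginal_chd`, so comparing the two is one way to approach the independence question.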