Lab 4 for Math 17: Probability and Simulation
1 The Law of Total Probability and More
The law of total probability gives us some additional options to find probabilities of events. For an
event A and a set of events B_i, i = 1, ..., n, which partition the entire sample space, you can find
the probability of A using the probabilities of the B_i and the conditional probability of A given each
B_i. That is, the law of total probability states that:
$$P(A) = \sum_{i=1}^{n} P(B_i)\,P(A \mid B_i).$$
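As a quick numerical illustration of the formula, here is a minimal R sketch. The three-event partition and all of its probabilities are made-up values chosen only for illustration; they do not come from the lab.

```r
# Hypothetical partition B1, B2, B3 of the sample space (values are made up).
p_B <- c(0.2, 0.5, 0.3)          # P(B_i); these must sum to 1
p_A_given_B <- c(0.9, 0.4, 0.1)  # P(A | B_i)

# A partition has to cover the whole sample space.
stopifnot(isTRUE(all.equal(sum(p_B), 1)))

# Law of total probability: P(A) = sum over i of P(B_i) * P(A | B_i)
p_A <- sum(p_B * p_A_given_B)
p_A
```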
For an interesting application of the law of total probability, consider the following situation. As
a researcher, you need to ask a very sensitive question of your subjects (yes/no) and you aren’t
sure they are going to reply honestly. Warner’s Randomized Response Model was designed for this
situation based on the law of total probability. Before we implement the model, let’s decide on a
not-so-sensitive question that you all want to learn the answer to. For example, we could estimate the
probability that a student in the class has had an internship while at Amherst.
Our question of interest (yes/no) is:
As the researcher, you would give the following directions to your subjects as well as a coin.
Flip the coin twice, and record the flips just for yourself. If the first flip was heads, answer question
1. If the first flip was tails, answer question 2. On the slip of paper, record ONLY the yes/no
answer to your question (not the flips).
Additionally, you would need the following questions set up:
Question 1 (Q1). (Question of Interest)
Question 2 (Q2). Was the second flip heads?
Thinking about the setup and the fact that we are interested in the probability of a yes reply to
question one, we could use the law of total probability as follows:
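$$P(\text{Yes}) = P(\text{Yes} \mid \text{Q1})\,P(\text{Q1}) + P(\text{Yes} \mid \text{Q2})\,P(\text{Q2}).$$
Note that we can fill in values for most of these probabilities already: with a fair coin,
P(Q1) = P(Q2) = .5 and P(Yes | Q2) = .5.
$$P(\text{Yes}) = P(\text{Yes} \mid \text{Q1})(.5) + (.5)(.5)$$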
Compute P(Yes) based on the class responses and then compute P(Yes | Q1).
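The rearrangement needed for the last step can be sketched in R as follows. The class tally (n_yes, n_total) is a hypothetical placeholder to be replaced with the real counts from the slips of paper, and the variable names are illustrative only.

```r
# Hypothetical class tally (replace with the real counts from the slips).
n_yes <- 14
n_total <- 30

p_yes <- n_yes / n_total   # estimate of P(Yes) from the class responses

# Known from the fair coin: P(Q1) = P(Q2) = 0.5 and P(Yes | Q2) = 0.5
p_q1 <- 0.5
p_q2 <- 0.5
p_yes_given_q2 <- 0.5

# Rearranged law of total probability:
# P(Yes | Q1) = (P(Yes) - P(Yes | Q2) * P(Q2)) / P(Q1)
p_yes_given_q1 <- (p_yes - p_yes_given_q2 * p_q2) / p_q1
p_yes_given_q1
```

With a small class, the estimate can land below 0 or above 1, because the observed proportion of yes replies varies around its true value.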
2 Simulating Probabilities
Simulations can be powerful tools to help guide understanding of the world and in this case, probability. You MUST read the directions for this activity. Questions to check your understanding are
provided along the way. Code for this exercise is available online for you to copy/paste since it is
run directly in R and not Rcmdr.
To begin, consider randomly guessing on a true/false quiz of 10 questions. We might be interested
in the probability of students scoring 8 or higher by guessing. We can investigate this in many
ways. First, let’s try a method by hand.
Assuming that we use a fair coin, we can flip a coin 10 times and count the number of heads (or
tails, but you have to pick one) as the number of correct answers. Try it. What score did you get on
the quiz simulating it with a coin flip?
If we really want to simulate the distribution of scores, we’d need more than just one quiz score,
so let’s repeat. Generate 2 more quiz scores using coin flips. What scores did you get?
Three scores still aren’t enough to visualize the distribution of scores, let alone the probability of
scoring 8 or higher. Let’s gather the scores from the entire class into a basic dotplot on the board.
What does the distribution of scores look like? Does it seem like there is a high probability of
scoring 8 or higher if you are randomly guessing?
We can simulate faster by computer than by hand. In this context, R can simulate 10 random
numbers each between 0 and 1, signifying the 10 random guesses. All possible answers to the
questions are Ts and Fs, and one of those is correct for each, so we can arbitrarily tell R to round
to the nearest integer, and let 0 signify an incorrect response and 1 a correct response for all ten
questions. The sum of the rounded random numbers then would represent the number of correct
responses on a single quiz by guessing. The process can be repeated for thousands of quizzes.
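The three steps just described (generate, round, sum) can be sketched one at a time in R. The seed value and the variable names are arbitrary choices made here so that a run can be repeated; they are not part of the lab's code.

```r
set.seed(17)               # arbitrary seed, only so the run is repeatable

guesses <- runif(10)       # 10 random numbers between 0 and 1
answers <- round(guesses)  # rounded: 0 = incorrect, 1 = correct
score <- sum(answers)      # number of correct responses on this one quiz

guesses
answers
score
```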
Check your understanding:
1. What probability are we running the simulation in R to try to estimate?
2. Following these simulation instructions, at the point where we simulate 10 numbers, which of
the following lists of numbers are NOT possible?
a. 1, 8, 6, 7, 10, 3, 9, 4, 5, 2
b. 0.68163862, 0.76066568, 0.92300422, 0.04519679, 0.99642235, 0.52319682, 0.34061928, 0.60487563,
0.63202120, 0.81534584
c. 1, 1, 1, 0, 1, 1, 0, 1, 1, 1
3. Once the numbers are simulated and rounded, a correct response will be indicated by:
4. Determine the quiz score for the simulated list of numbers: 0.05307022, 0.62305132, 0.48539199,
0.01777247, 0.26683285, 0.91124043, 0.80340403, 0.26077497, 0.62321026, 0.43242987
Now let’s actually try the simulation. You will ONLY need to open R, not Rcmdr. These
commands can be run in the R window.
1. For a single quiz result, try the following command: sum(round(runif(10))). sum computes the
sum, round does the rounding, and runif(10) generates 10 random numbers between 0 and 1. What
score did you get for your single quiz (hint: that is what is output to the screen)?
2. Run the command again. Now what score did you get?
3. Run a set of 100 quizzes by copying/pasting the following commands (code online):
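# sumcorrect[i] is the score on quiz i; freq[k] is the relative frequency of score k-1
sumcorrect <- NULL; freq <- NULL
for (i in 1:100) { sumcorrect[i] <- sum(round(runif(10))) }
for (i in 1:11) { freq[i] <- length(which(sumcorrect == i - 1)) / length(sumcorrect) }
freq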
The scores are saved in the variable sumcorrect and the relative frequency distribution is saved in
freq. Create a histogram and boxplot for sumcorrect, using the commands hist(sumcorrect) and
boxplot(sumcorrect) allowing you to see the distribution of number correct out of 10 when guessing.
What do those graphs show you about this distribution?
4. The variable freq is the frequency distribution for the scores (i.e. the estimated probability
distribution for X = number of correct responses in 10 T/F questions). Fill in the following table
with the values of freq for 100, 1000, and then 10000 quizzes by running the appropriate number of
quizzes and rerunning the freq computation (the only number you have to change is the 100 above
to 1000 and then to 10000) or by copy/paste.
NumQuiz/Score    0       1      2      3      4      5      6      7      8      9      10
100
1000
10000
Binomial(10,.5)  .00098  .0098  .0439  .1172  .2051  .2461  .2051  .1172  .0439  .0098  .00098
Do you see much difference between the frequency distributions for the different numbers of quizzes?
5. How do your simulated distributions compare to what the exact distribution is (last row)? Why
is a Binomial (10, .5) the exact distribution?
6. Find the probability a student scores 8 or higher guessing using the simulated distribution and
the exact distribution. You should try the binomial calculation yourself (at least set it up) and
confirm using the table.
7. Does it seem likely that someone who scores 8 out of 10 right was guessing?
8. What about someone who scored 7 out of 10?
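One way to check the table and the tail probability from the steps above is a vectorized version of the simulation placed next to the exact binomial probabilities. This is only a sketch: simulate_freq is a hypothetical helper name introduced here, and the seed is arbitrary.

```r
# Relative frequencies of scores 0..10 over n_quizzes simulated quizzes.
# factor(..., levels = 0:10) keeps scores that never occur, with frequency 0.
simulate_freq <- function(n_quizzes) {
  scores <- replicate(n_quizzes, sum(round(runif(10))))
  as.vector(table(factor(scores, levels = 0:10))) / n_quizzes
}

set.seed(17)  # arbitrary seed for repeatability

comparison <- rbind(
  "100"      = simulate_freq(100),
  "1000"     = simulate_freq(1000),
  "10000"    = simulate_freq(10000),
  "Binomial" = dbinom(0:10, size = 10, prob = 0.5)  # exact distribution
)
colnames(comparison) <- 0:10
round(comparison, 4)

# P(score >= 8): simulated (10000 quizzes) versus exact
p_sim <- sum(comparison["10000", c("8", "9", "10")])
p_exact <- 1 - pbinom(7, size = 10, prob = 0.5)
p_sim
p_exact
```

The simulated rows should move closer to the Binomial row as the number of quizzes grows, which is the pattern the table is meant to reveal.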
3 Practice with Trees (Diagrams)
Many species of fish are susceptible to high levels of nitrite in their water, which can in turn cause
Brown Blood disease, where the fish literally suffocate to death because their blood can no longer
carry oxygen. During the capture stage of a capture-recapture sampling problem, you decide to
take samples of fish blood to test for Brown Blood disease (assume you have developed a test
for this). Assume that in the region where you are taking samples of fish, 5 percent of sampling
locations (ponds, lakes, rivers, etc.) have high nitrite concentrations and the fish are suffering from
Brown Blood disease. The test you have developed will detect Brown Blood disease 90 percent of
the time when the fish actually has it, but also indicates the fish has the disease 10 percent of the
time when the fish does not have the disease. Use a tree diagram to assist you in determining the
following probabilities:
1. The probability a randomly selected fish will test positive
2. The probability a randomly selected fish actually has the disease given that it tests positive
3. The probability a randomly selected fish has the disease and it was not detected
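After drawing the tree by hand, the branch products can be checked with a short R sketch. It assumes the 5 percent figure is used as the probability that a randomly selected fish has the disease, which is one reading of the setup; the variable names are illustrative.

```r
# Assumption: 5 percent is treated as P(a randomly selected fish has the disease).
p_disease <- 0.05            # P(D)
p_pos_given_disease <- 0.90  # P(+ | D): test detects the disease
p_pos_given_healthy <- 0.10  # P(+ | not D): false positive

# The four leaves of the tree: multiply along each branch.
p_d_pos <- p_disease * p_pos_given_disease
p_d_neg <- p_disease * (1 - p_pos_given_disease)
p_h_pos <- (1 - p_disease) * p_pos_given_healthy
p_h_neg <- (1 - p_disease) * (1 - p_pos_given_healthy)
stopifnot(isTRUE(all.equal(p_d_pos + p_d_neg + p_h_pos + p_h_neg, 1)))

p_pos <- p_d_pos + p_h_pos        # 1. law of total probability
p_d_given_pos <- p_d_pos / p_pos  # 2. conditional probability (Bayes' rule)
p_pos
p_d_given_pos
p_d_neg                           # 3. has the disease and was not detected
```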
4 Python Eggs
The following table contains data on number of eggs hatched and not for 3 different temperature
settings for python eggs, as recorded by scientists studying environmental impact of temperature
changes on python reproduction.
Temp/Hatch?    Cold    Neutral    Hot    Total
Yes            16      38         75
No             11      18         29
Total
1. Name an event related to this scenario and its complement.
2. What is the probability a randomly selected egg from this experiment hatched?
3. Given each temperature setting (do a calculation for cold, neutral, and hot), what is the
probability a randomly selected egg from that level hatched?
4. The calculations in 2 and 3 allow you to check for independence (be sure you understand
why). Are the events temperature and hatch status independent?
5. Use the probabilities from 2 and 3 as population probabilities to complete the following:
a. What is the probability a randomly selected python who leaves her egg at a neutral temperature
will have a hatched egg?
b. Suppose two randomly selected pythons leave their eggs at cold temperatures. What is the
probability that both eggs hatch?
c. Suppose a randomly selected python leaves her egg at an unknown temperature. What is
the probability that her egg doesn’t hatch?
d. Suppose three randomly selected pythons leave their eggs at hot temperatures. What is the
probability that all three eggs don’t hatch?
e. In the same setting as d., what is the probability at least one of the three eggs hatches?
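The table can also be entered in R to fill in the totals and to check hand calculations for questions 2 and 3. The following is a minimal sketch; the object names are illustrative, and only base R functions are used.

```r
# Counts from the table: rows are hatch status, columns are temperature.
eggs <- matrix(c(16, 38, 75,
                 11, 18, 29),
               nrow = 2, byrow = TRUE,
               dimnames = list(Hatch = c("Yes", "No"),
                               Temp = c("Cold", "Neutral", "Hot")))

addmargins(eggs)  # adds the Total row and Total column

p_hatch <- sum(eggs["Yes", ]) / sum(eggs)         # P(hatch)
p_hatch_by_temp <- eggs["Yes", ] / colSums(eggs)  # P(hatch | temperature)
p_hatch
p_hatch_by_temp
```

If hatching were independent of temperature, each conditional probability would equal the overall probability of hatching.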
5 Random Variables
1. Provide an example of a discrete random variable and an example of a continuous random
variable.
2. You have obtained a random sample of 10 rainwater samples and you know that the probability a sample from this region is determined to be acid rain is .3 (for this example acid rain
indicates pH level lower than 5.3). Let X be the number of your rainwater samples that turn out
to be acid rain.
a. What distribution does X have? What is the expected value of X?
b. What is the probability that exactly two rainwater samples turn out to be acid rain?
c. What is the probability that at least one rainwater sample turns out to be acid rain?
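Hand calculations for question 2 can be checked with R's built-in binomial functions. This sketch assumes the 10 samples are independent with the same probability of acid rain, so that a binomial model applies; the variable names are illustrative.

```r
n <- 10   # number of rainwater samples
p <- 0.3  # probability a sample is acid rain

expected_x <- n * p                                   # mean of a Binomial(n, p)
p_exactly_two <- dbinom(2, size = n, prob = p)        # P(X = 2)
p_at_least_one <- 1 - dbinom(0, size = n, prob = p)   # P(X >= 1) = 1 - P(X = 0)

expected_x
p_exactly_two
p_at_least_one
```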
6 Probability Question to Turn In
A study investigating the relationship between coronary heart disease and anger levels (measured
using the Spielberger Trait Anger Scale) resulted in the following table classifying the 8474 subjects
based on heart disease status and anger level.
HD/Anger    Yes (CHD)    No (CHD)    Total
Low         53           3057        3110
Moderate    110          4621        4731
High        27           606         633
Total       190          8284        8474
a. What is the probability a randomly selected subject had coronary heart disease? Does this
probability represent a sample proportion or population proportion?
b. What is the probability a randomly selected subject had high anger levels?
c. What is the probability a randomly selected subject had coronary heart disease if it was known
the subject had high anger levels?
d. Can you conclude that heart disease status and anger levels are independent? Explain briefly.
For the last 2 parts, treat the probabilities as population probabilities (i.e. do not worry about the
effect of sampling without replacement).
e. What is the probability that 2 randomly selected subjects will have no CHD given that each of
them has a high anger level?
f. What is the probability that a random sample of 2 subjects will yield one subject with CHD and
moderate anger level and another subject with no CHD and moderate anger level?