Statistical Programming Camp: An Introduction to R
Handout 6: Probability and Simulations
Fox Chapters 1, 3, 8
In this handout, we cover the following new materials:
• Using sample() to calculate probability through simulation
• Using pnorm() and pbinom() to evaluate cumulative distribution functions
• Using rnorm() and rbinom() to take random draws from the normal and binomial distributions
• Using dnorm() and dbinom() to evaluate probability density (distribution) functions (pdf) of the normal and binomial distributions
• Using sort() to sort vectors
• Using matrix() to generate a data matrix.
1 Calculating Probability through Simulation
Probability can be thought of as the “limit” of repeated identical experiments. Using a loop to repeat
an experiment, we may calculate an approximate probability of specified events.
• The function sample(X, Y, replace = TRUE, prob = P) randomly samples Y units from a vector X. Sampling may be done with or without replacement (replace = TRUE or replace = FALSE). The argument prob = P supplies a vector of probability weights for selecting the elements of X; the default is equal probability.
> Z <- seq(from = 2, to = 16, by = 2) # Create vector to draw samples from
> sample(Z, 7, replace = TRUE) # Randomly draw 7 samples from Z, with replacement
[1]  4  2 14  4 16  4  8
> sample(Z, 7, replace = FALSE) # Randomly draw 7 samples from Z, without replacement
[1]  2  8  6  4 14 16 10
> ## Randomly draw 7 samples from Z, with replacement and unequal probabilities,
> ## where prob = is a vector of probability weights
> sample(Z, 7, replace = TRUE, prob = c(1,1,1,1,1,1,7,7))
[1] 14 16  4 10 16  4 14
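As a quick illustration of the idea (our own example, not from the handout), we can use sample() to approximate the probability that two fair dice sum to at least 10 by drawing many pairs and computing the proportion that meet the condition; the exact answer is 6/36 ≈ 0.167.
> ## Sketch (our addition): approximate P(sum of two dice >= 10) by simulation
> die1 <- sample(1:6, 10000, replace = TRUE) # first die, 10,000 rolls
> die2 <- sample(1:6, 10000, replace = TRUE) # second die, 10,000 rolls
> mean(die1 + die2 >= 10) # proportion of rolls meeting the condition; about 0.167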
• We explore calculating probability through simulation with the birthday problem. In this problem, we determine how many students must be in a class for the probability that at least two students share a birthday to exceed 0.5. We answer this question by using a loop within a loop to repeat (i.e., simulate) an experiment many times.
> sims <- 5000 # Specify number of simulations
> bday <- 1:365 # Create sequence to represent all possible birthdays
> ## Create empty container for our answers, where NA indicates
> ## missing data that we will fill with data generated by the loop
> answer <- rep(NA, 25)
> ## Generate a simulation through a loop nested within a loop
> ## The inner loop (indexed by i) is the simulation
> ## The outer loop (indexed by k) uses the results of the simulation to generate
> ## the probability that at least two share a birthday
> for (k in 1:25) {
+   count <- 0 # Start counter of simulations that meet condition
+   for (i in 1:sims) {
+     class <- sample(bday, k, replace = TRUE) # sampling with replacement
+     if (length(unique(class)) < length(class)) {
+       count <- count + 1 # add one to counter if any kids share a birthday
+     }
+   }
+   ## Store the answers (counter of simulations that meet condition
+   ## divided by total number of simulations)
+   answer[k] <- count/sims
+ }
> answer
 [1] 0.0000 0.0032 0.0068 0.0178 0.0280 0.0438 0.0538 0.0732 0.0990 0.1248
[11] 0.1372 0.1598 0.2006 0.2152 0.2492 0.2866 0.3146 0.3564 0.3938 0.4182
[21] 0.4538 0.4710 0.4978 0.5382 0.5668
> ## plotting probability that was saved during the loop
> plot(1:25, answer, type = "b", col = "blue", lwd = 2, xlab = "Number of People",
+      ylab = "Probability at Least Two People Share a Birthday",
+      main = "Birthday Problem")
> abline(h = 0.5, col = "red", lty = 2)
[Figure: "Birthday Problem" — the simulated probability that at least two people share a birthday, plotted against the number of people (1 to 25), with a dashed red horizontal line at 0.5.]
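For reference (our addition, not part of the handout), the birthday probability also has a closed form: P(at least one shared birthday among k people) = 1 − (365 × 364 × ⋯ × (365 − k + 1))/365^k. The short check below computes it exactly; the exact probability first exceeds 0.5 at 23 people, and the simulated values above cross 0.5 one person later only because of Monte Carlo error.
> ## Exact check (our addition): P(shared birthday) = 1 - 365*364*...*(365-k+1)/365^k
> k <- 1:25
> exact <- 1 - sapply(k, function(x) prod((365 - x + 1):365) / 365^x)
> round(exact[22:24], 3) # about 0.476, 0.507, 0.538; first exceeds 0.5 at k = 23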
2 Distribution Functions
The (cumulative) distribution function of the random variable X, denoted by F(x), is equal to the probability of X taking a value less than or equal to x.
• The function pbinom(q, size, prob, lower.tail = TRUE) takes a vector of values, q, and reports the proportion of times we would expect to see q or fewer successes when we have sample size size and probability of success prob. The default lower.tail = TRUE specifies that the quantity of interest is the probability of observing q or fewer successes, given size and prob. Choosing lower.tail = FALSE produces the probability of values greater than q.
> x <- 0:10
> y <- pbinom(x, 10, .81, lower.tail = TRUE)
> y
 [1] 6.131066e-08 2.675081e-06 5.281820e-05 6.228663e-04 4.875725e-03
 [6] 2.663246e-02 1.039261e-01 2.922204e-01 5.932435e-01 8.784233e-01
[11] 1.000000e+00
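As a quick sanity check (our addition), the CDF is just the running sum of the probability mass function, so pbinom() should agree with the cumulative sum of dbinom() (introduced in Section 4) up to floating-point error.
> ## The CDF equals the running sum of the pmf (our addition)
> all.equal(pbinom(x, 10, .81), cumsum(dbinom(x, 10, .81))) # should be TRUE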
The function stepfun(X, Y) may be used to construct a step function for plotting, where X gives the x values and Y gives the y values. Note that the y vector must have exactly one more element than the x vector.
> binom.cdf <- stepfun(x[2:11], y)
> plot(binom.cdf, main = "Cumulative Distribution", ylab = "Cumulative Probability")
[Figure: "Cumulative Distribution" — the binomial CDF plotted as a step function, with x (0 to 10) on the horizontal axis and Cumulative Probability (0 to 1) on the vertical axis.]
• Similarly, the function pnorm(q, mean, sd, lower.tail = TRUE) takes a vector of values, q, and reports the proportion of all observations we would expect to see with values of q or lower, given that the distribution is normal with the specified mean and sd, as in the brief example below.
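For instance (our own example, not from the handout), the familiar fact that about 97.5% of a standard normal distribution lies below 1.96 can be checked directly:
> ## Example (our addition): standard normal probabilities around 1.96
> pnorm(1.96, mean = 0, sd = 1)                     # about 0.975
> pnorm(1.96, mean = 0, sd = 1, lower.tail = FALSE) # about 0.025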
• We return to the 2008 Presidential Election example. We again rely on the data file e08.RData, available on Blackboard.
• We begin by calculating the mean poll-predicted margin of victory for Obama within each state as well as the actual margin of victory. Assume that each poll has a sample size of n = 1000 and that the probability of winning a state is equal to Pr(X > 0), where X is normally distributed with mean equal to the mean margin of victory from the polls (converted from percentage points to probabilities) and standard deviation equal to 2√(p(1 − p)/(mn)), where m is the number of polls for that state and p = (x + 1)/2 with x being the mean margin of victory (again, in probabilities, not percentage points). We use this information to calculate Obama's probability of winning each state.
> ## load data
> load("e08.RData")
> ## strip out poll results
> e08.polls <- e08[e08$DaysToElection != 0, ]
> e08.results <- e08[e08$DaysToElection == 0, ]
> ## calculate the mean poll margin by state
> O.margin.polls <- tapply(e08.polls$Dem - e08.polls$GOP, e08.polls$State, mean)
> ## convert margin from percentage points to proportions
> means <- O.margin.polls * 0.01
> ## calculate p and sd using formulas from lecture
> p <- (0.01 * O.margin.polls + 1)/2
> poll.n <- table(e08.polls$State)
> sds <- 2*sqrt(p*(1-p)/(poll.n*1000))
> ## determine Obama's probability of winning by state
> prob.O.win <- pnorm(0, means, sds, lower.tail = FALSE)
> ## Let's look at Obama's probability of winning in Missouri
> prob.mo <- prob.O.win[names(prob.O.win) == "Missouri"]
> prob.mo
 Missouri
0.3886478
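Because prob.O.win keeps the state names produced by tapply(), it can also be indexed by name directly; this is a slightly shorter idiom than the logical comparison above (our addition, not from the handout).
> ## Shorter idiom (our addition): index the named vector by state name
> prob.O.win["Missouri"]
 Missouri
0.3886478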
3 Random Draws from Probability Distributions
• The function rbinom(n, size, prob) creates a vector of length n containing independent random draws from a binomial distribution with the given size and probability of success prob.
• In the following example, n is equal to 20, size is equal to 7, and prob is equal to 0.55. In other words, we take twenty independent random draws from a binomial distribution with a size of 7 and a probability of success of 0.55.
> rbinom(n = 20, size = 7, prob = 0.55)
 [1] 5 3 2 2 4 4 4 5 3 4 3 3 2 5 5 3 4 3 4 3
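As a rough check (our addition), the average of a large number of binomial draws should be close to the theoretical mean, size × prob = 7 × 0.55 = 3.85.
> ## Check (our addition): sample mean of many draws is near size * prob
> mean(rbinom(n = 100000, size = 7, prob = 0.55)) # close to 3.85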
• The function rnorm(n, mean, sd) creates a vector of length n containing independent random draws from a normal distribution with the specified mean and sd (standard deviation).
• In the following example, n is equal to 10, mean is equal to -1, and sd is equal to 0.2. In other words, we take 10 independent random draws from a normal distribution with a mean of -1 and a standard deviation of 0.2.
> ## Ten random draws from N(-1, 0.2)
> rnorm(n = 10, mean = -1, sd = 0.2)
 [1] -0.7467027 -0.8366969 -1.0084668 -1.3515575 -0.9807023 -0.8515630
 [7] -1.1431174 -0.9481980 -1.0078429 -1.2161229
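Because these functions return random draws, the numbers above will differ every time the code is run. If reproducible results are needed, set.seed() fixes the random number generator's starting point (our addition; the seed value 1234 is an arbitrary choice).
> ## Reproducibility (our addition): the same seed gives the same draws
> set.seed(1234)
> rnorm(n = 3, mean = -1, sd = 0.2)
> set.seed(1234)
> rnorm(n = 3, mean = -1, sd = 0.2) # identical to the three draws above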
4 Probability Density Functions
Recall from the lecture that the probability density function (pdf), denoted by f(x), gives the likelihood of the random variable taking the value x.
• The function dbinom(x, size, prob) takes a vector of values, x, and reports the probability of seeing exactly the number of successes denoted by each value of x when we have sample size size and probability of success prob.
• In the example below, our vector x contains elements that range from −1 to 4. For each element of x, R reports the probability of seeing exactly that number of successes when we have a sample size of 5 and a probability of success of 0.4.
> ## -1 never occurs, hence the zero probability
> x <- -1:4
> dbinom(x, 5, 0.4)
[1] 0.00000 0.07776 0.25920 0.34560 0.23040 0.07680
• The 2008 Minnesota Senate race was very close. The difference between Democratic candidate
Franken and Republican candidate Coleman was only 312 votes. Most elections, however, are
not close. Anthony Downs famously set forth the voter’s paradox: for a rational, self-interested
citizen, the costs of voting will exceed the benefits of voting – unless the citizen’s vote is decisive.
If we assume that each voter is equally likely to vote for either the Democrat or the Republican
and that each vote is independent, what is the probability of a given vote being decisive? We
use the example from lecture to illustrate the voter’s paradox:
> ## Number of voters is 400,000
> dbinom(x = 200000, size = 400000, prob = 0.5)
[1] 0.001261565
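The same answer can be obtained (our addition, not from the handout) from the normal approximation to the binomial: the probability of an exact tie is roughly the normal density at its mean, 1/√(2π n p(1 − p)), which for n = 400,000 and p = 0.5 is about 0.00126, essentially the value computed above.
> ## Normal approximation to the tie probability (our addition)
> 1 / sqrt(2 * pi * 400000 * 0.5 * 0.5) # about 0.00126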
• The function dnorm(x, mean, sd) takes a vector of values, x, and reports the value of the density function at each point in x for the normal distribution with the specified mean and sd.
• We again return to the 2008 election example and produce the distribution of Obama's margin in Missouri using random draws from a normal distribution with the mean and standard deviation for that state. The function sort() sorts a vector of elements in descending or ascending order (the default is ascending).
> mo.mean <- means[names(means) == "Missouri"]
> mo.sd <- sds[names(sds) == "Missouri"]
> draw.10 <- rnorm(10, mo.mean, mo.sd)
> draw.100 <- rnorm(100, mo.mean, mo.sd)
> draw.1000 <- rnorm(1000, mo.mean, mo.sd)
> ## generate density for 1000 draws
> draw.1000 <- sort(draw.1000) # sort data to aid graphing
> y.mo <- dnorm(draw.1000, mo.mean, mo.sd)
> plot(draw.1000, y.mo, type = "l", col = "blue", lwd = 3, ylim = c(0, 100),
+      main = "Obama Margin of Victory", ylab = "Density",
+      xlab = "Obama Margin in Missouri")
> lines(density(draw.10), col = "purple", lty = 2, lwd = 2)
> lines(density(draw.100), col = "darkgreen", lty = 4, lwd = 2)
> legend("topright", c("10 Draws", "100 Draws", "1000 Draws"),
+        lty = c(2, 4, 1), lwd = c(2, 2, 3), col = c("purple", "darkgreen", "blue"))
[Figure: "Obama Margin of Victory" — the normal density of Obama's margin in Missouri (blue, based on 1000 draws) together with kernel density estimates from 10 draws (purple, dashed) and 100 draws (dark green, dot-dashed); the x-axis is the Obama Margin in Missouri (roughly −0.04 to 0.04) and the y-axis is Density (0 to 100).]
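An alternative way to draw the theoretical density (our addition) is curve(), which evaluates dnorm() over a grid of x values directly instead of sorting random draws first; the plotting limits below are our own choice.
> ## Alternative (our addition): plot the theoretical density with curve()
> curve(dnorm(x, mo.mean, mo.sd), from = -0.05, to = 0.05,
+       col = "blue", lwd = 3, main = "Obama Margin of Victory",
+       ylab = "Density", xlab = "Obama Margin in Missouri")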
5 Law of Large Numbers
The Law of Large Numbers states that the sample mean approaches the population mean as
we increase the sample size. We demonstrate this important law by randomly drawing samples of
increasing sizes from the normal distribution with a specified mean and standard deviation. According
to the Law of Large Numbers, as we increase the sample size, we should see the sample mean and
standard deviation approach the specified mean and standard deviation of the distribution.
• We begin by illustrating the Law of Large Numbers with a simple numerical example. Using simulation, we begin with a sample size of one draw from the normal distribution with a mean of 1 and a standard deviation of 3. We increase the sample size in increments of one until we reach 1200 draws. For each sample size, we calculate and store the mean and standard deviation. Finally, we plot the results.
> n.draws <- 1200
> means <- rep(NA, n.draws) ## containers for the mean
> sds <- rep(NA, n.draws) ## and standard deviation
> ## Begin by generating n.draws from the normal distribution
> ## See how the mean and standard deviation change as
> ## we include more of the draws
> x <- rnorm(n = n.draws, mean = 1, sd = 3)
> ## simulation
> for (i in 1:n.draws){
+   means[i] <- mean(x[1:i]) ## mean of the first i draws
+   sds[i] <- sd(x[1:i]) ## standard deviation of the first i draws
+ }
> ## Plot the results
> plot(means, xlab = "Number of Draws", main = "Law of Large Numbers",
+      ylim = c(0, 4), col = "blue", ylab = "Mean and Standard Deviation")
> abline(h = 1, col = "red", lty = 2)
> points(sds, col = "orange")
> abline(h = 3, col = "red", lty = 2)
[Figure: "Law of Large Numbers" — the running sample mean (blue) and standard deviation (orange) plotted against the number of draws (0 to 1200), with dashed red horizontal lines at the true mean (1) and true standard deviation (3); both series settle near the true values as the number of draws grows.]
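The running means in this example can also be computed without a loop (our addition, not part of the handout): cumsum() gives the cumulative sums, and dividing by 1, 2, ..., n.draws yields the mean of the first i draws.
> ## Vectorized alternative to the loop (our addition)
> means.vec <- cumsum(x) / (1:n.draws)
> all.equal(means, means.vec) # should be TRUE (up to floating-point error)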
6 Central Limit Theorem
The Central Limit Theorem (CLT) states that the sample mean of independent, identically distributed random variables is approximately normally distributed if the sample size is large. This is true for virtually any distribution! We illustrate the Central Limit Theorem using draws from a skewed Binomial population distribution.
• We begin by using the following formula: Z ≡ (X̄ − E(X)) / √(Var(X)/n_s) ∼ N(0, 1), where n_s is the sample size. Also, recall that if X ∼ B(n_t, p), then E(X) = n_t p and Var(X) = n_t p(1 − p), where n_t is the number of trials for each Binomial draw. We illustrate the Central Limit Theorem by simulating 4000 times an experiment in which we take random draws from a Binomial distribution with a success rate of 0.05 and a size of three, for three sample sizes: 20, 100, and 800. We calculate and store the Z-score for the mean of each sample size. Note: The command matrix(data, nrow = X, ncol = Y) creates a matrix with X rows and Y columns and fills those spaces with a vector of data.
> p <- 0.05
> n <- 3
> sims <- 4000
> m <- c(20, 100, 800)
> E.of.X <- n*p
> V.of.X <- n*p*(1-p)
> ## Construct an empty matrix to store our simulation results
> Z <- matrix(NA, nrow = sims, ncol = length(m))
> for (i in 1:sims){
+   for (j in 1:length(m)){ # loop over m (our sample sizes)
+     samp <- rbinom(n = m[j], size = n, prob = p)
+     sample.mean <- mean(samp) # sample mean
+     Z[i,j] <- (sample.mean - E.of.X) / sqrt(V.of.X/m[j]) ## Z-score for mean
+   }
+ }
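Because each Z-score is standardized by its exact mean and standard deviation, every column of Z should have a sample mean near 0 and a sample standard deviation near 1, whatever the sample size; a quick check (our addition):
> ## Check (our addition): each column of Z should be roughly standardized
> round(apply(Z, 2, mean), 2) # values near 0
> round(apply(Z, 2, sd), 2)   # values near 1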
Finally, we create three stacked histograms of the Z-scores, one for each sample size, and add the density curve from the standard normal distribution to each histogram.
> ## Displaying the distribution of means
> par(mfrow = c(3, 1)) ## placing multiple plots in one graph
> for (j in 1:3){
+   hist(Z[,j], xlim = c(-5, 5), freq = FALSE, ylim = c(0, 0.5),
+        ylab = "Probability", xlab = "", main = paste("Sample Size =", m[j]))
+   x <- seq(-4, 4, by = 0.01)
+   y <- dnorm(x)
+   lines(x, y, col = "blue") ## Add N(0,1) density curve to histogram
+ }
[Figure: Three stacked histograms of the simulated Z-scores, titled "Sample Size = 20", "Sample Size = 100", and "Sample Size = 800", each with the N(0,1) density curve overlaid in blue; the x-axis runs from −4 to 4 and the y-axis ("Probability") from 0 to 0.5.]
7 Practice Questions
7.1 PhD Admissions
Getting into the Ph.D. program of the Politics Department is known to be very difficult, and there is
considerable uncertainty about the admission process. Every year, a secretary types each applicant’s
name on a separate card together with a matching envelope. Then, he drops the pile from the window
of his office on the second floor of Corwin Hall. Finally, he goes downstairs and places the cards
randomly in the envelopes. If the name on a card matches with a name on the envelope in which the
card is placed, the applicant will be admitted. What is the probability of nobody getting accepted?
Does this probability vary as you change the number of applicants from 50 to 500 (by 25)? Create
a graph that is similar to the one created above for the birthday problem. Hint: Use sample(...,
replace = FALSE) to represent the random process.
7.2 The 2008 Presidential Election
In the example above, we calculated the probability of Obama’s victory based on the normal distribution for each state, i.e., prob.O.win. Simulate the election 5,000 times where you draw the winner
for each state according to this probability and allocate electoral votes for the winner (Hint: use
rbinom()). Store the total number of electoral votes for each simulation and plot them as a histogram
after the simulation. Add a vertical dashed line indicating the actual election result. According to the
simulation result, what is the predicted probability of Obama’s victory (calculate this probability by
computing the proportion of simulated elections where Obama won the majority of electoral votes)?
Comment on the performance of this prediction.
7.3 Staten Island DA Conviction Rates
For the past five years, Staten Island District Attorney Donovan's office has obtained the highest conviction rate in serious crime cases in the New York criminal justice system. Of the 423 cases tried, 397, or 92.5%, ended in conviction. We will take this percentage as the 'true' ability of the SI DA's office to obtain convictions. Using simulation, we illustrate the Law of Large Numbers and the Central Limit Theorem.
1. To begin, use simulation to show that as the number of years increases from one to 100, the
sample mean approaches the population mean.
2. We now make this problem a bit more realistic by making the number of cases the office tries a random variable. Specifically, let the number of cases be normally distributed around 423 with a standard deviation of 5 cases. Use the round() function to ensure the resulting value is an integer. Simulate a 100-year trial history 1000 times. Then, calculate the mean of the conviction rates. How does the result compare to the true conviction rate?
3. We now use the results from the previous question to illustrate the Central Limit Theorem.
Plot the distribution of simulated conviction rates. Add a thick solid red line for the mean and
dashed red lines for the points one standard deviation above and below the mean. What does
the resulting histogram look like?
4. Now create a plot of two density lines. For the first line, use the density of a normal distribution with the mean and standard deviation of the simulated conviction rates. Specifically, using the seq() command, create a sequence from the minimum simulated conviction rate to the maximum, with 1000 values (specify this by using length.out = 1000 rather than by =). Next, create the density values using the dnorm() command. Create the second line using the simulated conviction rate data. What do the density lines show?