Statistical Programming Camp: An Introduction to R

Handout 6: Probability and Simulations

Fox Chapters 1, 3, 8

In this handout, we cover the following new materials:

- Using sample() to calculate probability through simulation
- Using pnorm() and pbinom() to evaluate cumulative distribution functions
- Using rnorm() and rbinom() to take random draws from the normal and binomial distributions
- Using dnorm() and dbinom() to evaluate probability density (distribution) functions (pdf) of the normal and binomial distributions
- Using sort() to sort vectors
- Using matrix() to generate a data matrix

1 Calculating Probability through Simulation

Probability can be thought of as the “limit” of repeated identical experiments. Using a loop to repeat an experiment, we may calculate an approximate probability of specified events.

The function sample(X, Y, replace = TRUE, prob = P) will randomly sample Y units from a vector X. Sampling may be done with or without replacement (replace = TRUE or replace = FALSE). The argument prob = P denotes a vector of probabilities for observing the elements of vector X. The default is equal probability.

> Z <- seq(from = 2, to = 16, by = 2) # Create vector to draw samples from
> sample(Z, 7, replace = TRUE) # Randomly draw 7 samples from Z, with replacement
[1]  4  2 14  4 16  4  8
> sample(Z, 7, replace = FALSE) # Randomly draw 7 samples from Z, without replacement
[1]  2  8  6  4 14 16 10
> ## Randomly draw 7 samples from Z, with replacement and unequal probabilities,
> ## where prob = is a vector of probability weights
> sample(Z, 7, replace = TRUE, prob = c(1, 1, 1, 1, 1, 1, 7, 7))
[1] 14 16  4 10 16  4 14

We explore calculating probability through simulation with the birthday problem. In this problem, we determine how many students must be in a class in order for the probability that at least two students have the same birthday to exceed 0.5. We answer this question by using a loop within a loop to repeat (i.e., simulate) an experiment several times.
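As a check on the simulation that follows (this sketch is our addition, not part of the handout code), the birthday problem also has a closed-form answer that the simulated probabilities should approach: the probability that all k birthdays differ is (365/365)(364/365)⋯((365 − k + 1)/365).

```r
## Exact birthday-problem probability for a class of size k:
## P(at least two share a birthday) = 1 - prod over the k distinct-birthday terms
exact.bday <- function(k) {
  1 - prod((365 - 0:(k - 1)) / 365)
}
exact.bday(22)  # just under 0.5
exact.bday(23)  # just over 0.5 -- 23 people suffice
```

Comparing these exact values with the simulated vector answer below shows how close 5000 simulations get to the truth.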
> sims <- 5000 # Specify number of simulations
> bday <- 1:365 # Create sequence to represent all possible birthdays
> ## Create empty container for our answers, where NA indicates
> ## missing data that we will fill with data generated by the loop
> answer <- rep(NA, 25)
> ## Generate a simulation through a loop nested within a loop
> ## The inner loop (indexed by i) is the simulation
> ## The outer loop (indexed by k) uses the results of the simulation to generate
> ## the probability that at least two share a birthday
> for (k in 1:25) {
+   count <- 0 # Start counter of simulations that meet condition
+   for (i in 1:sims) {
+     class <- sample(bday, k, replace = TRUE) # sampling with replacement
+     if (length(unique(class)) < length(class)) {
+       count <- count + 1 # add one to counter if any kids share the birthday
+     }
+   }
+   ## Store the answers (counter of simulations that meet condition
+   ## divided by total number of simulations)
+   answer[k] <- count/sims
+ }
> answer
 [1] 0.0000 0.0032 0.0068 0.0178 0.0280 0.0438 0.0538 0.0732 0.0990 0.1248
[11] 0.1372 0.1598 0.2006 0.2152 0.2492 0.2866 0.3146 0.3564 0.3938 0.4182
[21] 0.4538 0.4710 0.4978 0.5382 0.5668
> ## plotting probability that was saved during the loop
> plot(1:25, answer, type = "b", col = "blue", lwd = 2, xlab = "Number of People",
+      ylab = "Probability at Least Two People Share a Birthday",
+      main = "Birthday Problem")
> abline(h = 0.5, col = "red", lty = 2)

[Figure: "Birthday Problem" -- probability that at least two people share a birthday plotted against the number of people (1 to 25), with a dashed horizontal line at 0.5]

2 Distribution Functions

The (cumulative) distribution function of the random variable X, denoted by F(x), is equal to the probability of X taking a value less than or equal to x.
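To make the definition concrete (this illustration is our addition, not part of the handout), consider a fair six-sided die, where F(4) = Pr(X ≤ 4) = 4/6. Using sample() as introduced above, the CDF value can be approximated as the proportion of simulated rolls at or below 4; the seed is a hypothetical choice for reproducibility.

```r
## The CDF in action: for a fair die, F(4) = Pr(X <= 4) = 4/6
## Approximate it as the proportion of simulated rolls at or below 4
set.seed(99)  # hypothetical seed for reproducibility
rolls <- sample(1:6, 10000, replace = TRUE)
mean(rolls <= 4)  # close to 4/6, i.e., about 0.667
```

The functions pbinom() and pnorm() below compute such CDF values exactly, with no simulation required.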
The function pbinom(q, size, prob, lower.tail = TRUE) will take in a vector of values, q, and report the corresponding proportion of times we would expect to see q or fewer successes when we have sample size size and probability of success prob. The lower.tail default is TRUE, which specifies that the quantity of interest is the proportion of all observations having values of q or lower, given the size and prob. Choosing lower.tail = FALSE produces the probability of values greater than q.

> x <- 0:10
> y <- pbinom(x, 10, .81, lower.tail = TRUE)
> y
 [1] 6.131066e-08 2.675081e-06 5.281820e-05 6.228663e-04 4.875725e-03
 [6] 2.663246e-02 1.039261e-01 2.922204e-01 5.932435e-01 8.784233e-01
[11] 1.000000e+00

The function stepfun(X, Y) may be used to plot a step function, where X indicates the x values and Y indicates the y values. Note that the y vector must have exactly one more element than the x vector.

> binom.cdf <- stepfun(x[2:11], y)
> plot(binom.cdf, main = "Cumulative Distribution", ylab = "Cumulative Probability")

[Figure: "Cumulative Distribution" -- step function of cumulative probability against x, rising from 0 to 1]

Similarly, the function pnorm(q, mean, sd, lower.tail = TRUE) will take in a vector of values, q, and report the proportion of all observations we would expect to witness having values q or lower, given that the distribution is normal with the specified mean and sd.

We return to the 2008 Presidential Election example. We again rely on the data file e08.RData, available on Blackboard. We begin by calculating the mean poll-predicted margin of victory for Obama within each state as well as the actual margin of victory.
Assume that each poll has a sample size of n = 1000 and that the probability of winning a state is equal to Pr(X > 0), where X is normally distributed with mean equal to the mean margin of victory from the polls (converted from percentage points to probabilities) and standard deviation equal to 2√(p(1 − p)/(mn)), where m is the number of polls for that state and p = (x + 1)/2, with x being the mean margin of victory (again, in probabilities, not percentage points). We use this information to calculate Obama’s probability of winning for each state.

> ## load data
> load("e08.RData")
> ## strip out poll results
> e08.polls <- e08[e08$DaysToElection != 0, ]
> e08.results <- e08[e08$DaysToElection == 0, ]
> ## calculate the mean poll margin by state
> O.margin.polls <- tapply(e08.polls$Dem - e08.polls$GOP, e08.polls$State, mean)
> ## convert margin from percentage points to proportions
> means <- O.margin.polls * 0.01
> ## calculate p and sd using formulas from lecture
> p <- (0.01 * O.margin.polls + 1)/2
> poll.n <- table(e08.polls$State)
> sds <- 2 * sqrt(p * (1 - p) / (poll.n * 1000))
> ## determine Obama's probability of winning by state
> prob.O.win <- pnorm(0, means, sds, lower.tail = FALSE)
> ## Let's look at Obama's probability of winning in Missouri
> prob.mo <- prob.O.win[names(prob.O.win) == "Missouri"]
> prob.mo
 Missouri
0.3886478

3 Random Draws from Probability Distributions

The function rbinom(n, size, prob) will create a vector of length n containing independent, random draws from a binomial distribution with the given size and probability of success prob. In the following example, n is equal to 20, size is equal to 7, and prob is equal to 0.55. In other words, we take twenty independent, random draws from a binomial distribution with a size of 7 and a probability of success equal to 0.55.
> rbinom(n = 20, size = 7, prob = 0.55)
 [1] 5 3 2 2 4 4 4 5 3 4 3 3 2 5 5 3 4 3 4 3

The function rnorm(n, mean, sd) will create a vector of length n containing independent, random draws from a normal distribution with the specified mean and sd (standard deviation). In the following example, n is equal to 10, mean is equal to -1, and sd is equal to 0.2. In other words, we take ten independent, random draws from a normal distribution with a mean of -1 and a standard deviation of 0.2.

> ## Ten random draws from N(-1, 0.2)
> rnorm(n = 10, mean = -1, sd = 0.2)
 [1] -0.7467027 -0.8366969 -1.0084668 -1.3515575 -0.9807023 -0.8515630
 [7] -1.1431174 -0.9481980 -1.0078429 -1.2161229

4 Probability Density Functions

Recall from the lecture that the probability density function (pdf), denoted by f(x), is the function that equals the likelihood of taking a value x.

The function dbinom(x, size, prob) will take in a vector of values, x, and report the probability of seeing exactly the number of successes denoted by each value of x when we have sample size size and probability of success prob. In the example below, our vector x contains elements that range from −1 to 4. For each element of x, R reports the probability of seeing exactly that number of successes when we have a sample size of 5 and probability of success equal to 0.4.

> ## -1 never occurs, hence the zero probability
> x <- -1:4
> dbinom(x, 5, 0.4)
[1] 0.00000 0.07776 0.25920 0.34560 0.23040 0.07680

The 2008 Minnesota Senate race was very close. The difference between Democratic candidate Franken and Republican candidate Coleman was only 312 votes. Most elections, however, are not close. Anthony Downs famously set forth the voter’s paradox: for a rational, self-interested citizen, the costs of voting will exceed the benefits of voting – unless the citizen’s vote is decisive.
If we assume that each voter is equally likely to vote for either the Democrat or the Republican and that each vote is independent, what is the probability of a given vote being decisive? We use the example from lecture to illustrate the voter’s paradox:

> ## Number of voters is 400,000
> dbinom(x = 200000, size = 400000, prob = 0.5)
[1] 0.001261565

The function dnorm(x, mean, sd) will take in a vector of values, x, and report the value of the density function at point x for the normal distribution with the specified mean and sd. We again return to the 2008 election example and produce the distribution for Missouri using random draws from the normal distribution with the mean and standard deviation from that state. The function sort() will sort a vector of elements in descending or ascending (default is ascending) order.

> mo.mean <- means[names(means) == "Missouri"]
> mo.sd <- sds[names(sds) == "Missouri"]
> draw.10 <- rnorm(10, mo.mean, mo.sd)
> draw.100 <- rnorm(100, mo.mean, mo.sd)
> draw.1000 <- rnorm(1000, mo.mean, mo.sd)
> ## generate density for 1000 draws
> draw.1000 <- sort(draw.1000) # sort data to aid graphing
> y.mo <- dnorm(draw.1000, mo.mean, mo.sd)
> plot(draw.1000, y.mo, type = "l", col = "blue", lwd = 3, ylim = c(0, 100),
+      main = "Obama Margin of Victory", ylab = "Density",
+      xlab = "Obama Margin in Missouri")
> lines(density(draw.10), col = "purple", lty = 2, lwd = 2)
> lines(density(draw.100), col = "darkgreen", lty = 4, lwd = 2)
> legend("topright", c("10 Draws", "100 Draws", "1000 Draws"), lty = c(2, 4, 3),
+        lwd = c(2, 2, 3), col = c("purple", "darkgreen", "blue"))

[Figure: "Obama Margin of Victory" -- density of the Obama margin in Missouri for 10, 100, and 1000 draws, with the 1000-draw curve closely matching the exact normal density]

5 Law of Large Numbers

The Law of Large Numbers states that the sample mean approaches the population mean as we increase the sample size.
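As a quick first illustration (our addition, not part of the handout), the law can already be seen with simple coin flips: the running proportion of heads settles toward 0.5 as the number of flips grows. The seed is a hypothetical choice for reproducibility.

```r
## Law of Large Numbers with coin flips: the running proportion of heads
## approaches the population mean of 0.5 as the number of flips grows
set.seed(1)  # hypothetical seed for reproducibility
flips <- sample(c(0, 1), 10000, replace = TRUE)
running.mean <- cumsum(flips) / (1:10000)
running.mean[c(10, 100, 10000)]  # estimates stabilize near 0.5
```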
We demonstrate this important law by randomly drawing samples of increasing size from the normal distribution with a specified mean and standard deviation. According to the Law of Large Numbers, as we increase the sample size, we should see the sample mean and standard deviation approach the specified mean and standard deviation of the distribution.

We begin by illustrating the Law of Large Numbers with a simple numerical example. Using simulation, we begin with a sample size of one draw from the normal distribution with a mean of 1 and a standard deviation of 3. We increase the sample size in increments of one until we reach 1200 draws. For each sample size, we calculate and store the mean and standard deviation. Finally, we plot the results.

> n.draws <- 1200
> means <- rep(NA, n.draws) ## containers for the mean
> sds <- rep(NA, n.draws) ## and standard deviation
> ## Begin by generating n.draws from the normal distribution
> ## See how the mean and standard deviation change as
> ## we include more of the draws
> x <- rnorm(n = n.draws, mean = 1, sd = 3)
> ## simulation
> for (i in 1:n.draws){
+   means[i] <- mean(x[1:i]) ## mean of the first i draws
+   sds[i] <- sd(x[1:i]) ## standard deviation of the first i draws
+ }
> ## Plot the results
> plot(means, xlab = "Number of Draws", main = "Law of Large Numbers",
+      ylim = c(0, 4), col = "blue", ylab = "Mean and Standard Deviation")
> abline(h = 1, col = "red", lty = 2)
> points(sds, col = "orange")
> abline(h = 3, col = "red", lty = 2)

[Figure: "Law of Large Numbers" -- sample means (blue) and standard deviations (orange) plotted against the number of draws (1 to 1200), converging to the dashed red lines at 1 and 3]

6 Central Limit Theorem

The Central Limit Theorem (CLT) states that the sample mean of identically distributed independent random variables is approximately normally distributed if the sample size is large. This is true for virtually any distribution! We illustrate the Central Limit Theorem using draws from a skewed Binomial population distribution. We begin by using the following formula:

Z ≡ (X̄ − E(X)) / √(Var(X)/ns) ∼ N(0, 1),

where ns is the sample size. Also, recall that if X ∼ B(nt, p), then E(X) = nt·p and Var(X) = nt·p(1 − p), where nt is the number of trials (population size).
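These two formulas are easy to verify by simulation (this check is our addition, not part of the handout code). For the skewed binomial population used below, X ∼ B(3, 0.05), we have E(X) = 3 × 0.05 = 0.15 and Var(X) = 3 × 0.05 × 0.95 = 0.1425; the seed is a hypothetical choice for reproducibility.

```r
## Simulation check of E(X) = nt*p and Var(X) = nt*p*(1 - p)
## for X ~ Binomial(size = 3, prob = 0.05)
set.seed(42)  # hypothetical seed for reproducibility
draws <- rbinom(100000, size = 3, prob = 0.05)
mean(draws)  # close to 3 * 0.05 = 0.15
var(draws)   # close to 3 * 0.05 * 0.95 = 0.1425
```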
We illustrate the Central Limit Theorem by simulating 4000 times an experiment in which we take random draws from a Binomial distribution with a success rate of 0.05 and a size of three, for three sample sizes: 20, 100, and 800. We calculate and store the Z-score for the mean of each sample size.

Note: The command matrix(data, nrow = X, ncol = Y) will create a matrix with X rows and Y columns, and fill those spaces with a vector of data.

> p <- 0.05
> n <- 3
> sims <- 4000
> m <- c(20, 100, 800)
> E.of.X <- n*p
> V.of.X <- n*p*(1-p)
> ## Construct an empty matrix to store our simulation results
> Z <- matrix(NA, nrow = sims, ncol = length(m))
> for (i in 1:sims){
+   for (j in 1:length(m)){ # loop over m (our sample sizes)
+     samp <- rbinom(n = m[j], size = n, prob = p)
+     sample.mean <- mean(samp) # sample mean
+     Z[i,j] <- (sample.mean - E.of.X) / sqrt(V.of.X/m[j]) ## Z-score for mean
+   }
+ }

Finally, we create three stacked histograms of the Z-scores, one for each sample size, and add the density curve from the standard normal distribution to each histogram.

> ## Displaying the distribution of means
> par(mfrow = c(3, 1)) ## placing multiple plots in one graph
> for (j in 1:3){
+   hist(Z[,j], xlim = c(-5, 5), freq = FALSE, ylim = c(0, 0.5),
+        ylab = "Probability", xlab = "", main = paste("Sample Size =", m[j]))
+   x <- seq(-4, 4, by = 0.01)
+   y <- dnorm(x)
+   lines(x, y, col = "blue") ## Add N(0,1) density curve to histogram
+ }

[Figure: three stacked histograms of the Z-scores for sample sizes 20, 100, and 800, each overlaid with the N(0,1) density curve; the histograms match the normal curve more closely as the sample size increases]

7 Practice Questions

7.1 PhD Admissions

Getting into the Ph.D. program of the Politics Department is known to be very difficult, and there is considerable uncertainty about the admission process. Every year, a secretary types each applicant’s name on a separate card together with a matching envelope. Then, he drops the pile from the window of his office on the second floor of Corwin Hall.
Finally, he goes downstairs and places the cards randomly in the envelopes. If the name on a card matches the name on the envelope in which the card is placed, the applicant will be admitted. What is the probability of nobody getting accepted? Does this probability vary as you change the number of applicants from 50 to 500 (by 25)? Create a graph that is similar to the one created above for the birthday problem. Hint: Use sample(..., replace = FALSE) to represent the random process.

7.2 The 2008 Presidential Election

In the example above, we calculated the probability of Obama’s victory based on the normal distribution for each state, i.e., prob.O.win. Simulate the election 5,000 times, where you draw the winner for each state according to this probability and allocate electoral votes to the winner (Hint: use rbinom()). Store the total number of electoral votes for each simulation and plot them as a histogram after the simulation. Add a vertical dashed line indicating the actual election result. According to the simulation result, what is the predicted probability of Obama’s victory (calculate this probability by computing the proportion of simulated elections where Obama won the majority of electoral votes)? Comment on the performance of this prediction.

7.3 Staten Island DA Conviction Rates

For the past five years, Staten Island District Attorney Donovan’s office has obtained the highest conviction rate among serious crime cases in the New York criminal justice system. Of the 423 cases tried, 397, or 92.5%, ended in conviction. We will take this percentage as the ‘true’ ability of the SI DA’s office to obtain convictions. Using simulation, we illustrate the Law of Large Numbers and the Central Limit Theorem.

1. To begin, use simulation to show that as the number of years increases from one to 100, the sample mean approaches the population mean.

2. We now make this problem a bit more realistic by making the number of cases the office tries a random variable.
Specifically, let the number of cases be normally distributed around 423 with a standard deviation of 5 cases. Use the round() function to ensure the resulting value is an integer. Simulate a 100-year trial history 1000 times. Then, calculate the mean of the conviction rates. How does the result compare to the true conviction rate?

3. We now use the results from the previous question to illustrate the Central Limit Theorem. Plot the distribution of simulated conviction rates. Add a thick solid red line for the mean and dashed red lines for the points one standard deviation above and below the mean. What does the resulting histogram look like?

4. Now create a plot of two density lines. For the first line, use the density of a normal distribution with the mean of the simulated conviction rates and the standard deviation of the simulated conviction rates. Specifically, using the sequence command, create a distribution from the minimum simulated conviction rate to the maximum, with 1000 observations (specify this by using length.out = 1000 rather than by = ). Next, create the density function using the dnorm command. Create the second line using the simulated conviction rate data. What do the density lines show?
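As a starting point for Question 7.1 above (this sketch is ours; the repetition loops and the graph are left as the exercise), a single year of the envelope process can be represented as a random permutation, exactly as the hint suggests.

```r
## One trial of the envelope-stuffing process for Question 7.1,
## assuming 50 applicants (a hypothetical starting value)
n.applicants <- 50
cards <- 1:n.applicants                                    # card i belongs to applicant i
envelopes <- sample(cards, n.applicants, replace = FALSE)  # randomly stuffed envelopes
no.match <- sum(cards == envelopes) == 0                   # TRUE if nobody is admitted
```

Repeating this trial many times and averaging no.match estimates the probability that nobody is accepted; looping over the number of applicants then mirrors the structure of the birthday-problem simulation.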