Introduction to R
Arthur White
22nd October, 2014

1 Getting Started

The aim of this lab is to familiarise you with R, a statistical software platform. While R is used much more in later courses, you may find it useful for your forthcoming assignments. To open R, click on the start menu → All Programs → RStudio. Usually, you interact with R through a command-line interface: you type in a command and R responds.

Basic Arithmetic

> 2+2
[1] 4
> 3*4 - 7/8 + sqrt(16) + 2^5
[1] 47.125
> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> sum(1:10)
[1] 55
> x = c(1, -2, 1/7, sqrt(3))
> x
[1]  1.0000000 -2.0000000  0.1428571  1.7320508
> sum(x)
[1] 0.874908

R can be used as a calculator to perform simple tasks. Here +, -, *, / and ^ denote the addition, subtraction, multiplication, division and exponentiation operations respectively. The colon operator : creates a vector of consecutive numbers, in this case 1, ..., 10.

R has many built-in functions. Here the meaning of the sum function should be obvious. In particular, the command sum(1:10) calculates 1 + 2 + ... + 10. To better understand what a function does, type a question mark directly in front of its name, e.g., ?sqrt. This brings up a help file for the function.

We can assign values to a variable using the = operator. (The <- operator can also be used.) In the above example, we assign a vector of values to the variable x. The vector is then stored under that name for later use. The individual elements of the vector are joined together using the function c, which stands for concatenate, or, in simpler English, combine.

Other useful functions in R include factorial, choose and exp. For example:

> factorial(1:5)
[1]   1   2   6  24 120
> choose(7, 3)
[1] 35
> exp(-1.5)
[1] 0.2231302

Note that the choose function takes two arguments: choose(n, k) gives the number of ways of choosing k items from n.

We can use these functions to calculate probabilities. For example, an urn contains 5 balls, 3 of which are red and 2 of which are black. If 3 balls are selected, what is the probability that 2 will be red?
This may be calculated as follows:

> (choose(3,2)*choose(2,1))/choose(5, 3)
[1] 0.6

Exercise 1
Use the choose function to calculate the probability that 3 heads are obtained after a fair coin has been tossed 5 times. Recall that the binomial distribution is given by

$$P(k) = \binom{n}{k} p^k (1 - p)^{n-k}.$$

2 Probability Distributions

R has a number of built-in functions to calculate probabilities associated with many different probability distributions. These include functions for the hypergeometric, binomial and Poisson distributions. In particular, we will be using the functions dhyper, dbinom and dpois to calculate probabilities associated with these respective distributions.

Hypergeometric Distribution

Recall the question from class: An athlete conceals 2 performance enhancing drugs (PEDs) in a bottle containing 8 vitamin pills that are similar in appearance. If 3 tablets are selected at random, what is the probability that the cheating will be detected? The following code gives us the answer:

> dhyper(1, 2, 8, 3) + dhyper(2, 2, 8, 3)
[1] 0.5333333

Note that the dhyper function takes 4 arguments. These relate to 1) the number of successes in the sample, 2) the number of successes in the population, 3) the number of failures in the population, and 4) the size of the sample. Here a "success" is a PED, so the cheating is detected if the sample contains 1 or 2 successes. Another way to obtain this answer would be to enter

> sum(dhyper(1:2, 2, 8, 3))
[1] 0.5333333

Exercise 2
Use the dhyper function to check the answer you obtained for the question relating to the red and black balls in Section 1.

Binomial Distribution

Now recall the question from class: A random sample of 5 components is selected from a production line that produces, on average, 5% non-conforming components. What is the probability that the sample will contain 2 non-conforming components?

> dbinom(x=2, size=5, prob=0.05)
[1] 0.02143438

Can you explain what the three arguments for this function mean?
Note that we can also easily inspect the probabilities associated with each element of the sample space, and confirm that these values sum to 1:

> dbinom(x=0:5, size=5, prob=0.05)
[1] 0.7737809375 0.2036265625 0.0214343750 0.0011281250 0.0000296875
[6] 0.0000003125
> sum(dbinom(x=0:5, size=5, prob=0.05))
[1] 1

Exercise 3
Use the dbinom function to check your answer for Exercise 1 in Section 1.

Poisson Distribution

Now recall the question from class: A small car-hire company has 2 cars which it rents out by the day. Suppose that the number of demands for a car on each day follows a Poisson distribution with mean µ = 1.5. On what proportion of days is neither car required? On what proportion of days does the demand exceed the company's capacity?

> dpois(x=0, lambda=1.5)
[1] 0.2231302
> 1 - sum(dpois(x=0:2, lambda=1.5))
[1] 0.1911532

We can also use R to visualise data and, for example, compare observed data with the behaviour that would be expected under a suitable distribution. Here recall the data recording flying bomb hits on wartime London, with mean number of hits µ = 0.9323. This data is shown in Table 1. The following code compares the observed data to the number of hits expected if X ∼ Poisson(0.9323). Here 576 is the total number of observations, i.e. sum(actual).

Table 1: Flying bomb hits on wartime London.

Number of hits        0        1       2       3      4     ≥5
Expected number  226.74   211.39   98.54   30.62   7.14   1.57
Actual number       229      211      93      35      7      1

> actual = c(229, 211, 93, 35, 7, 1)
> expected = 576 * c(dpois(0:4, 0.9323), 1 - sum(dpois(0:4, 0.9323)))
> plot(0:5, actual, type="h", main="Flying Bomb Hits", xlab="Number of times hit",
+ ylab="Frequency")
> points(0:5, expected, pch=2, cex=2, col="red")

You should get a plot resembling that shown in Figure 1.

[Figure 1: plot titled "Flying Bomb Hits", with x-axis "Number of times hit" (0 to 5) and y-axis "Frequency" (0 to 200).]
Figure 1: Comparison of actual vs. expected flying bomb hits in wartime London. The black lines represent the actual data, the red triangles the expected number of hits.
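
The hypergeometric and Poisson answers in this section can be cross-checked in more than one way. The following is a minimal sketch rather than part of the original lab: it recomputes the PED probability directly from the counting formula with choose, and again via the complement of "no PED in the sample", and it recomputes the car-hire tail probability with ppois, the built-in Poisson cumulative distribution function. The variable names (p_dhyper, p_formula, p_compl, p_sum, p_tail) are illustrative only.

```r
# PED example: P(at least one PED among 3 tablets), computed three ways.
p_dhyper  <- sum(dhyper(1:2, 2, 8, 3))        # as in the lab
# Counting formula: choose k PEDs from 2 and 3 - k vitamins from 8,
# out of all ways of choosing 3 tablets from 10.
p_formula <- sum(choose(2, 1:2) * choose(8, 3 - 1:2)) / choose(10, 3)
p_compl   <- 1 - dhyper(0, 2, 8, 3)           # complement of "no PED selected"
print(c(p_dhyper, p_formula, p_compl))

# Car-hire example: P(demand > 2) computed two ways.
p_sum  <- 1 - sum(dpois(0:2, lambda = 1.5))           # as in the lab
p_tail <- ppois(2, lambda = 1.5, lower.tail = FALSE)  # upper tail, P(X > 2)
print(c(p_sum, p_tail))
```

All three values in the first line should agree with the 0.5333333 obtained earlier, and both values in the second line with 0.1911532. Using the complement or the cumulative function becomes more convenient than summing individual probabilities when the range of outcomes is large.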
3 Monty Hall Problem

We have barely scraped the surface of what R can be used for. For example, we can create our own functions and simulate random processes to develop our own tools for analysis. As a small example, consider the Monty Hall problem.

Briefly, the problem is named after the host of a game show, in which contestants were asked to select one of three doors. The main prize, a sports car, lay behind one door, whereas goats waited behind the other two. (Contestants usually preferred to win the sports car.) After the contestant had made their choice, the host would reveal a goat behind one of the doors which had not been chosen. The contestant was then allowed to swap their pick. Is it advantageous for the contestant to do so?

Below is a function which recreates the problem, simulating it n times for either strategy (i.e., swapping or not). One can then see how often the strategy is successful. Inspect the code and see if you can follow what is being done (don't worry if you can't follow every step).

> monty <- function(n, swap = TRUE){
+   result.store <- rep(NA, n)
+   for(i in 1:n){
+     # Shuffle the prizes; position j is what lies behind door j
+     samp1 <- sample(c("Goat", "Goat", "Car"))
+     # The contestant always starts with door 1
+     pick1 <- samp1[1]
+     if(swap){
+       # The host opens a goat door among doors 2 and 3;
+       # the contestant swaps to the other one
+       if(samp1[2] == "Goat"){
+         pick1 <- samp1[3]
+       } else{
+         pick1 <- samp1[2]
+       }
+     }
+     if(pick1 == "Car"){
+       result <- "Win"
+     } else{
+       result <- "Lose"
+     }
+     result.store[i] <- result
+   }
+   return(table(result.store))
+ }

The function can then be used as follows, e.g.,

> monty(100, swap=FALSE)
result.store
Lose  Win
  60   40

Does the answer roughly correspond to what you would expect?
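
With only 100 games the counts vary noticeably from run to run, so it can help to simulate many more games and compare the two strategies side by side. The following is a minimal vectorised sketch, not the function from the lab: the helper name monty_prop, the seed and the number of games are all assumptions chosen for illustration. It relies on the observation that, because the host always reveals a goat, swapping wins exactly when the first pick was wrong.

```r
# Hypothetical helper (not part of the lab): estimates the proportion of
# wins for both strategies from n simulated games.
monty_prop <- function(n) {
  car  <- sample(1:3, n, replace = TRUE)  # door hiding the car in each game
  pick <- sample(1:3, n, replace = TRUE)  # contestant's first choice
  stay_wins <- pick == car                # staying wins if the first pick was right
  swap_wins <- !stay_wins                 # swapping wins if the first pick was wrong
  c(stay = mean(stay_wins), swap = mean(swap_wins))
}

set.seed(2014)             # arbitrary seed so that the run is repeatable
est <- monty_prop(100000)  # number of games chosen arbitrarily
print(round(est, 3))
```

The two proportions can be compared with monty(n, swap=FALSE) and monty(n, swap=TRUE) for a large n; the estimates from both approaches should be close.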
