Introduction to R
Arthur White
22nd October, 2014
1 Getting Started
The aim of this lab is to familiarise you with R, a statistical software platform.
While R is used much more in later courses, you may find it useful for your
forthcoming assignments.
To open R, click on the Start menu → All Programs → RStudio. Usually,
you interact with R through a command-line interface: you type in a command
and R responds.
Basic Arithmetic
> 2+2
[1] 4
> 3*4 - 7/8 + sqrt(16) + 2^5
[1] 47.125
> 1:10
[1]  1  2  3  4  5  6  7  8  9 10
> sum(1:10)
[1] 55
> x = c(1, -2, 1/7, sqrt(3))
> x
[1]  1.0000000 -2.0000000  0.1428571  1.7320508
> sum(x)
[1] 0.874908
R can be used as a calculator to perform simple tasks. Here +, -, *, / and ^
denote the addition, subtraction, multiplication, division and exponentiation
operations respectively. The colon operator : creates a vector of consecutive
numbers, in this case 1, 2, ..., 10.
R has many built-in functions. Here the meaning of the sum function should
be obvious. In particular, the command sum(1:10) calculates
1 + 2 + ... + 10.
To better understand what a function does, type a question mark directly in
front of it, e.g., ?sqrt. This brings up a help file for the function.
We can assign values to a variable using the = operator. (The <- operator can
also be used.) In the above example, we assign a vector of values to the variable
x. This vector can then be stored for later use. The individual elements of the
vector are joined together using the function c, which stands for concatenate,
or, in simpler English, combine.
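As a small illustration of assignment and vectors, the sketch below shows that = and <- store the same value, and that arithmetic on a vector is applied to each element in turn. The variable names y, z and w are made up for this example.

```r
# Both assignment operators store a value under a name.
y = c(1, 2, 3)
z <- c(1, 2, 3)
identical(y, z)      # TRUE: the two vectors are the same

# Arithmetic on a vector is applied to each element in turn.
y * 2                # 2 4 6
y + z                # 2 4 6

# c() can also join an existing vector with further values.
w <- c(y, 10)
length(w)            # 4
sum(w)               # 16
```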
Other useful functions in R include factorial, choose and exp. For example
> factorial(1:5)
[1]   1   2   6  24 120
> choose(7, 3)
[1] 35
> exp(-1.5)
[1] 0.2231302
Note that the choose function takes two arguments.
We can use these functions to calculate probabilities. For example, an urn
contains 5 balls, 3 of which are red and 2 of which are black. If 3 balls are
selected, what is the probability that 2 will be red? This may be calculated as
follows:
> (choose(3,2)*choose(2,1))/choose(5, 3)
[1] 0.6
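To see where this expression comes from, it may help to split it into its counting steps: the number of ways of choosing 2 of the 3 red balls and 1 of the 2 black balls, divided by the number of ways of choosing any 3 of the 5 balls. The names favourable, total and all_cases are introduced here only for illustration.

```r
# Ways to pick 2 of the 3 red balls and 1 of the 2 black balls.
favourable <- choose(3, 2) * choose(2, 1)   # 3 * 2 = 6

# Ways to pick any 3 of the 5 balls.
total <- choose(5, 3)                       # 10

favourable / total                          # 0.6

# Sanity check: the probabilities of drawing 0, 1, 2 or 3 red balls sum to 1.
# choose(2, 3) is 0, so drawing 0 red balls (and 3 black) is impossible.
all_cases <- choose(3, 0:3) * choose(2, 3 - 0:3) / total
sum(all_cases)                              # 1
```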
Exercise 1
Use the choose function to calculate the probability that 3 heads are obtained
after a fair coin has been tossed 5 times. Recall that the binomial distribution
is given by
P(k) = \binom{n}{k} p^k (1 - p)^{n-k}.
2 Probability Distributions
R has a number of built-in functions to calculate probabilities associated with many different probability distributions. These include functions
for the hypergeometric, binomial and Poisson distributions. In particular, we
will be using the functions dhyper, dbinom and dpois to calculate probabilities
associated with these respective distributions.
Hypergeometric Distribution
Recall the question from class: An athlete conceals 2 performance-enhancing
drugs (PEDs) in a bottle containing 8 vitamin pills that are similar in appearance. If 3 tablets are selected at random, what is the probability that the
cheating will be detected?
The following code gives us the answer:
> dhyper(1, 2, 8, 3) + dhyper(2, 2, 8, 3)
[1] 0.5333333
Note that the dhyper function takes 4 arguments. These relate to 1) the number
of successes in the sample, 2) the number of successes in the population, 3) the
number of failures in the population, and 4) the size of the sample. Note that
another way to obtain this answer would be to enter
> sum(dhyper(1:2, 2, 8, 3))
[1] 0.5333333
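The same calculation can be written with the argument names of dhyper spelled out (x, m, n and k), which makes the role of each number easier to see. The sketch below also checks the answer in two other ways, assuming, as the code above does, that the cheating is detected whenever at least one PED is among the 3 tablets drawn. The names p_detect, p_detect_alt and p_detect_count are made up for this example.

```r
# x = PEDs in the sample, m = PEDs in the bottle,
# n = vitamin pills in the bottle, k = tablets drawn.
p_detect <- sum(dhyper(x = 1:2, m = 2, n = 8, k = 3))

# Alternative: detection fails only if no PED is drawn.
p_detect_alt <- 1 - dhyper(x = 0, m = 2, n = 8, k = 3)

# Direct counting with choose, for comparison:
# all 3 tablets must come from the 8 vitamin pills to avoid detection.
p_detect_count <- 1 - choose(8, 3) / choose(10, 3)

c(p_detect, p_detect_alt, p_detect_count)   # all three agree
```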
Exercise 2
Use the dhyper function to check the answer you obtained for the question relating to the red and black balls in Section 1.
Binomial Distribution
Now recall the question from class: A random sample of 5 components is selected
from a production line that produces, on average, 5% non-conforming components. What is the probability that the sample will contain 2 non-conforming
components?
> dbinom(x=2, size=5, prob=0.05)
[1] 0.02143438
Can you explain what the three arguments for this function mean? Note that
we can also easily inspect the probabilities associated with each element of the
sample space, and confirm that these values sum to 1:
> dbinom(x=0:5, size=5, prob=0.05)
[1] 0.7737809375 0.2036265625 0.0214343750 0.0011281250 0.0000296875
[6] 0.0000003125
> sum(dbinom(x=0:5, size=5, prob=0.05))
[1] 1
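As a check on what dbinom computes, the sketch below evaluates the binomial formula by hand for the non-conforming components example and compares it with the built-in function. The names n, p, k, by_formula and by_dbinom are introduced only for this illustration.

```r
n <- 5      # sample size
p <- 0.05   # probability that a component is non-conforming
k <- 2      # number of non-conforming components of interest

# Binomial formula written out with choose.
by_formula <- choose(n, k) * p^k * (1 - p)^(n - k)

# The same probability from the built-in function.
by_dbinom <- dbinom(x = k, size = n, prob = p)

all.equal(by_formula, by_dbinom)   # TRUE
```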
Exercise 3
Use the dbinom function to check your answer for Exercise 1 in Section 1.
Poisson Distribution
Now recall the question from class: A small car-hire company has 2 cars which it
rents out by the day. Suppose that the number of demands for a car on each day
follows a Poisson distribution with mean µ = 1.5. On what proportion
of days is neither car required? On what proportion of days does the demand
exceed the company’s capacity?
> dpois(x=0, lambda=1.5)
[1] 0.2231302
> 1 - sum(dpois(x=0:2, lambda=1.5))
[1] 0.1911532
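The second answer is an upper-tail probability, P(X > 2). As an alternative sketch, R's cumulative function ppois can return this directly when lower.tail = FALSE is set. The variable names below are made up for this example.

```r
lambda <- 1.5

# Probability that demand exceeds 2 cars, by summing individual probabilities.
p_exceed_sum <- 1 - sum(dpois(x = 0:2, lambda = lambda))

# The same upper-tail probability P(X > 2) from the cumulative function.
p_exceed_ppois <- ppois(q = 2, lambda = lambda, lower.tail = FALSE)

# Probability of no demand: for a Poisson variable this is exp(-lambda).
p_none <- exp(-lambda)

all.equal(p_exceed_sum, p_exceed_ppois)    # TRUE
all.equal(p_none, dpois(0, lambda))        # TRUE
```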
We can also use R to visualise data and, for example, compare observed data
to the behaviour that would be expected from a suitable distribution. Here recall the data recording flying bomb hits on wartime London, with mean number
of hits µ = 0.9323. These data are shown in Table 1. The following code compares
the observed data to the expected number of hits, were X ∼ Poisson(0.9323).
Table 1: Flying bomb hits on wartime London.
Number of hits        0       1      2      3     4    ≥5
Expected number  226.74  211.39  98.54  30.62  7.14  1.57
Actual number       229     211     93     35     7     1
> actual = c(229, 211, 93, 35, 7, 1)
> expected = 576 * c(dpois(0:4, 0.9323), 1 - sum(dpois(0:4, 0.9323)))
> plot(0:5, actual, type="h", main="Flying Bomb Hits",
+      xlab="Number of times hit", ylab="Frequency")
> points(0:5, expected, pch=2, cex=2, col="red")
You should get a plot resembling that shown in Figure 1.
[Figure 1 plot: "Flying Bomb Hits"; x-axis "Number of times hit" (0 to 5), y-axis "Frequency" (0 to 200).]
Figure 1: Comparison of actual vs. expected flying bomb hits in wartime London. The black lines represent the actual data, the red triangles the expected
number of hits.
3 Monty Hall Problem
We have barely scratched the surface of what R can be used for. For example,
we can create our own functions and simulate random processes to develop our
own tools for analysis.
As a small example, consider the Monty Hall problem. Briefly, the problem
is named after the host of a game show, in which contestants were asked to
select one of three doors. The main prize, a sports car, lay behind one door,
whereas goats waited behind the other two. (Contestants usually preferred to
win the sports car.) After the contestant had made their choice, the host would
reveal a goat behind one of the doors which had not been chosen. The contestant
was then allowed to swap their pick. Is it advantageous for the contestant to do so?
Below is a function which recreates the problem, simulating it n times for
either strategy (i.e., swapping or not). One can then see how often the strategy
is successful. Inspect the code and see if you can follow what is being done. (Don’t
worry if you can’t follow every step.)
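> monty <- function(n, swap = TRUE){
+   # Store the outcome ("Win" or "Lose") of each simulated game.
+   result.store <- rep(NA, n)
+   for(i in 1:n){
+     # Randomly arrange the prizes behind doors 1, 2 and 3.
+     samp1 <- sample(c("Goat", "Goat", "Car"))
+     # The contestant initially picks door 1.
+     pick1 <- samp1[1]
+     if(swap){
+       # The host opens a goat door among doors 2 and 3;
+       # the contestant swaps to the other one.
+       if(samp1[2] == "Goat"){
+         pick1 <- samp1[3]
+       } else{
+         pick1 <- samp1[2]
+       }
+     }
+     if(pick1 == "Car"){
+       result <- "Win"
+     } else{
+       result <- "Lose"
+     }
+     result.store[i] <- result
+   }
+   # Count the wins and losses over the n games.
+   return(table(result.store))
+ }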
The function can then be used as follows, e.g.,
> monty(100, swap=FALSE)
result.store
Lose  Win
  60   40
Does the answer roughly correspond to what you would expect?
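To compare the two strategies directly, one option is to run a larger number of games for each and look at the proportion of wins. The sketch below uses a compact, hypothetical helper monty_win_rate (not part of the lab's code) that follows the same logic as the function above but returns the win proportion. The seed value is arbitrary and only makes the run repeatable.

```r
# Hypothetical helper: proportion of wins over n simulated games.
monty_win_rate <- function(n, swap = TRUE) {
  wins <- 0
  for (i in 1:n) {
    # Random arrangement of prizes; the contestant first picks door 1.
    doors <- sample(c("Goat", "Goat", "Car"))
    if (swap) {
      # The host opens a goat door among doors 2 and 3;
      # the contestant takes the other one.
      pick <- if (doors[2] == "Goat") doors[3] else doors[2]
    } else {
      pick <- doors[1]
    }
    if (pick == "Car") wins <- wins + 1
  }
  wins / n
}

set.seed(1)  # arbitrary seed so the run can be repeated
stay_rate <- monty_win_rate(10000, swap = FALSE)
swap_rate <- monty_win_rate(10000, swap = TRUE)

# Compare with the theoretical values of 1/3 (stay) and 2/3 (swap).
c(stay = stay_rate, swap = swap_rate)
```

With this many games the simulated proportions should sit close to the theoretical values, although the exact figures depend on the seed.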