Download R Bootcamp Addendum Important Plots: Histogram, Boxplot, and

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
Transcript
R Bootcamp Addendum
September 24, 2007
Ryan Rosario
[email protected]
This document outlines a series of R functions that were not covered last year or this year in the R Bootcamp,
but that I feel are important. This should also help you get through most of your first assignment.
Important Plots: Histogram, Boxplot, and Scatterplot Matrix
In 201A, you will be asked to describe data numerically and graphically. Part of your visual analysis will
involve histograms, boxplots and scatterplot matrices.
I am using the temp dataset from the presentation if you wish to follow along, but note that copy and pasting
from this document may not work.
Histograms are easy to construct.
The following plots a histogram of the first row of temp (the temperature at location 1, throughout the
day).
> hist(temp[1,],xlab="Temperature",ylab="Frequency", main="Temperature at Location 1 over a Full Day")
30
20
0
10
Frequency
40
50
60
Temperature at Location 1 over a Full Day
−5
0
5
10
15
20
25
30
Temperature
A basic histogram can be constructed using hist(temp[1,]) but I recommend always including axis and
plot labels.
1
A histogram is an object of type hist. We can store the hist object and extract useful information from it.
By default, hist will plot the histogram on the screen whether or you actually want to see it visually. When
we are interested in accessing the attributes of the histogram (and not the visual representation itself), we
can turn plotting off using the parameter plot=FALSE.
> hist(temp[1,],plot=FALSE)
> attributes(my.hist)
$names
[1] "breaks"
"counts"
[7] "equidist"
"intensities" "density"
"mids"
"xname"
$class
[1] "histogram"
This can be very useful. Suppose we just want to get a vector of counts of how many observations fall into
a certain bin (interval) of the histogram. We can access this
> my.hist$counts
[1] 3 15 32 24 37 32 53 31
7 11
1
2
0
1
A boxplot is also simple to construct.
> boxplot(temp[1,],xlab="Temperature",ylab="Frequency", main="Temperature at Location 1 over a Full Day")
15
10
0
5
Frequency
20
25
Temperature at Location 1 over a Full Day
I only include a y axis here since it does not make much sense to include an x axis in this case.
We can also construct boxplots of a variable conditioned on a categorical variable (factor). Let’s look at
the temperatures for a full day, given we know the type of groundcover at that location.
There are two ways to do this. We can use traditional notation...
> boxplot(groundcover$type,temp) #x, y where y is temp
We can also use formula notation like is used for the lm function. That is, we write y ∼ x meaning y as a
function of x.
> boxplot(temp ∼ groundcover$type)
If you follow along, the plot below will not match because the actual code includes other parameters. I left
them out above, to point out the difference between the two calls.
To get a quick idea of the relationship among 3 or more variables, it is useful to construct a scatterplot
matrix using pairs. For this example, I load a different dataset (Xu, 2006). I pass the whole data frame to
pairs but a subset will also work.
2
0
10
20
30
Tempereature
40
50
Temperature by Ground Cover Type
plant
rock
sand
Groundcover Type
> sat <- read.table("http://www.stat.ucla.edu/~rosario/boot/sat.dat")
> pairs(sat)
●
●
●
●
●
● ●● ●
●●
●
●●
●●
●●●
●●
●●
●●
●
●●
●
●
●
●
●
●
●●
●
●●
● ● ●
●
●
●●
●●● ●
●
●●● ● ●
●
●
●●●
●●
● ●●
●
●
●
450
●●
●
●
● ●
●●●
●
●
●
●
●
● ● ●
●
●● ●●●
●●
●●
● ●
●
● ● ● ● ●
●●
● ●●
●●
●
● ●● ● ●
● ● ●●
● ●●
●
●● ● ●
●
●
● ● ●●
●
●
550
●●
●●
● ●
●
●
●
●
●
●●
● ●
●●●
● ●
●
●
●
●● ●●
●
●● ●
●
● ●
●
● ●●
● ●●●
●● ● ● ● ●
●●
●●
●● ●
●
●
●
●●●
● ●
●
● ● ●● ●
●
●●
● ● ● ● ●● ●●
●●
● ●
● ●
● ● ●
●● ● ●
●
●
●
● ● ●●
●
●
●
●
●
22
●
60
●
10
20
●●
●●
8
22
6
●
4
18
● ●
●
●
● ●●
●
● ●
●
●
● ● ●●
●●●●● ●
●
●●
●●● ●●
●●
●
●
●●
●●●
●
●●
●
● ●●
●
expend
●
14
●
●
●
●
●●●
●●
●
● ●● ●
●
●
●●
●●● ●
●●
● ●● ●
●
● ●
● ●● ●
●●
●●
● ● ●●
●
●
●
●●●
●
● ●
● ●●●
●
● ●●●●●
●
●● ●
●
●
●
●●
●
●
●
●●
●
● ●
●
●● ●
●●
●●
●● ● ●
●
●● ●
●
● ●
●●
●● ● ●
●
●●
● ●●●
●●
●
●●
6
●
●
●
●
● ●
● ●
●
● ● ●
● ●● ●
●●
●●
●● ●
●
● ●
● ●
●
● ●●●
●
●●
●●●●●●●●
●
●
●
●●
4
●●
●
● ● ● ●●
●
● ●●
●
● ●●
●
● ● ● ●
●
● ●
●
8
10
●●
●● ● ●
● ●● ●
●●
●
●●
●
●
● ●●
●
● ●
● ● ● ●●●
●
●
●
●
●● ●
●
● ●●● ●● ●
●
●
●
●
●
● ●
●
●
●●● ●
●
●● ●
● ●●●
●
●
●● ●●
●
● ● ●●
●●●
●●● ● ● ●
●●
● ●●● ●
●
●
●
●●
●
●● ●
●● ● ●
●
●
● ●●
●
●
● ●●
● ●●
●
●● ● ● ●
●● ●● ●
●●
●
●●● ●●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●●● ●●
●
●
●●
●
●● ●
●
●
●● ●
●● ●
● ● ●
●●
●
● ●●●●
●
●
●
●
●
●
●
● ●
●
●● ● ●
● ●● ●
● ●● ●
●
● ●●
●● ● ● ●
●●
● ●● ●
●
●●
●●
●
●● ●
●
● ● ●
●
●
●
● ●● ●
● ●
● ●● ●
●
● ●●●
●● ●
● ● ● ●●
●●
●● ● ●
●
● ●
●
●
●
●
●●
●
●● ●
●
●
●
●
● ●
●
● ●● ●
● ●● ●●
●
● ●●
●
● ●
●● ●
●●
●
●● ●
●● ● ●
takers
●
●
●
●●
●● ●
● ●
●
●
●●●●
●
●
●
●
●
●
●
●
●
●●●● ●
● ●
●
●● ● ●
●● ●●●●
●
●
●●●
●
● ●
● ●
●
● ●●●
●
●●
●● ●
●
●
●
●
●●●
●
●
●●
●
●
●●
● ●● ●● ●
●●
●●
●●
●
●
●
●●
●●
●
●
●●
●
●●
●●
●
●
●●
●●●
● ●
●
●●
● ●
●
●
● ●
●
● ●● ● ●
●
●
●●● ●●
●
●
● ●● ●
●
●
● ●●
●
●
● ●
● ●●
●● ●
●●●●●● ●
● ●
●
●●
●
●●
●
●
●●●
●●
●
●●
●● ●
●●
● ●●
●
●
● ●
●
●
●
● ●
● ●●
●
●●
● ●
●
●●
●
● ●●
●●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●● ● ● ● ●●●
●
●●
●
●
●
●
●●
●●
●
●
●●
●
●●
●
●●
● ●
● ●●
● ●
●
40
●
● ● ●
● ●
● ●
●●
●●
●
●
●●
●●●●
●
●
● ●●
● ●●
●
●● ●
● ●
●●
●
●●
●●
●
●
●●●●
●●
●
● ●● ●
●
●
●
● ●● ●●
●
●● ● ● ●●●
●
●
●●
30
●
●
●
●
●●● ●
● ● ●
●
●●
●●●
●
●●
●● ● ● ●
● ●
●
●
● ● ●●
●●
● ●
●●
●
●
●
●●
● ●● ●
●● ●
●
●●
●
●●
●●●
●●
●● ●
● ●
● ●
●
●
●●●●●
●● ●●
●●
● ● ●●
●●
●
●
●
●
●●
●●●●
●●●●●
●● ●
●●
●
●
●●
●●
● ●
●
●●
●
● ●
●●●
●●
●●●
●
●
●
●
●
●●●
●
● ●
●●
● ●
●●
●
50
●
●●
●
●
●
●
●
●●
●●
●
●
●●
400
3
●
●●
●
460
● ●●
●●●
●● ●
●●●
●
●●
●●
●
● ●
●
●
●
●
●
●●
●●
●
●●
●
●●
●
●
●●●
●●
math
●
●
●
●
●●
●
●●
●
●
●●
●
●●
●●
●●
●
●●
●
●
●● ●
● ●
● ●
●
●●
●●
●
●
●●●●
●●
● ●● ●●
● ●●
●● ●
●● ●
●●
●
●
●
●●● ●
●
●
● ●
●
●
verbal
●
●●
●
● ●
●
●
●●
●
●●●
●●
●
●
●
●
●
●
●
●●
●●
●●
● ●
●●●
●●●
●●
●
●
●
●●
●
●●
●
●
●●
●
●
●●
●
●●●●
●
●●
●●
●●
●●
●●
●
●
total
●
●
●
●
●●
●●
●
●
●
●
●●●
●
●●
●
●
●●
520
50
●
●
●
●
●
40
●
●
● ●
●
●
●
●
●● ●
●
●●
●
●
●●●●● ●
●
●●
salary
●●
●
● ●● ●
●
●
●● ● ●
●●● ●
●
●●
●●
●●● ●
●● ● ●
● ●
●●
●
● ●●● ●
●
●●
●●
●
●
30
●
●
●
●●
●
● ●● ●
● ●
●
● ● ●
●
●● ●● ●●●●
●●● ●
●
● ●
● ●● ● ●●
●
●●●●
●●
●●
●
●●
520
60
20
450
550
●
●●●
●●
●
●●
● ●
●●●●
●
●● ●
●●●
●●●● ●
●
●
●
●
●
●●
●● ●● ●
● ●● ● ●● ●
●
●●
●●
●
●●●
●●
●●
●
●● ●
●
●●●
● ● ●●
●
●
●
●
●
●●
●
● ●●●●
●● ●●
●
●● ●●
●
●
●
● ●
●●●
● ●●
● ● ●●
●● ●
●
●●
●●
●
● ● ● ●
●
●
●●
●●● ●
● ● ● ● ●●●
●●● ●
● ● ● ●
●
●● ●
●●● ●●
● ●●
●●
●
●●
●
460
●
● ●
●
●
●●
●
●● ●
●
●
●
●
●●
●●
●● ●
●●●●●●
●● ●●●
●●●
●
● ●
●● ●
● ●
●
●●
●
400
●
●
●
●
●
●● ●
●●●
●●●
● ●●
●
●
●
●●
●●
●●
1000
ratio
●
●
●● ●
●●
●●
●
●
● ● ●● ●● ●●● ● ●
●●
●●
●●● ●●●
●
●● ●
●
●● ●●
●
●
●
●
● ●
●
850
14
18
●●●
●●●● ●
● ●
●●
● ●
●● ● ●●
●
●●
●●
●●● ● ●
●
●●●
●
●
● ● ●● ●●
●●
850
950
1100
A Numerical Analysis of Data
We can also produce a numerical analysis of the data by using the summary function. summary will calculate
and display the mean, median, first and third quartile, minimum and maximum of a variable.
> summary(temp[,1])
Min. 1st Qu. Median
9.00
15.20
19.10
Mean 3rd Qu.
18.61
21.80
Max.
34.10
The functions that calculate the numbers above are min(temp[,1]), quantile(temp[,1],0.25), median(temp[,1]),
mean(temp[,1]), quantile(temp[,1],0.75) and max(temp[,1]) respectively. We can also get the variance and standard deviation using sd(temp[,1]) and var(temp[,1]) respectively. These functions are on
your reference card and their usage is trivial.
Working with Distributions
R makes it simple to work distributions. We can extract certain information from a distribution using
a combination of two pieces of information. To generate a random number from the standard normal
distribution, use rnorm(n) where n is the number of random numbers you wish to generate. By not including
a value for µ or σ, we rely on the parameter default values of µ = 0 and σ = 1. In the help file for rnorm,
default arguments are displayed as parameter = defaultValue. The help file shows rnorm(n, mean=0,
sd=1) as the correct form for entering in the rnorm function. mean=0,sd=1 means that if you do not specify
a mean or standard deviation, R will assume the mean is 0 and the standard deviation is 1.
To generate a random uniform number, we can use runif(n) with default parameters min = 0 and max =
1. We can also work with other distributions:
Table 1: List of Distributions
Distribution Name
R Syntax
denoting Distribution
norm
unif
binom
exp
gamma
pois
weibull
cauchy
beta
t
f
chisq
geom
hyper
logis
lnorm
nbinom
wilcox
normal
uniform
binomial
exponential
gamma
poisson
weibull
cauchy
beta
Student’s t
F (Fisher-Snedecor χ2 )
Pearson’s χ2
geometric
hypergeometric
logistic
lognormal
negative binomial
Wilcoxon’s statistics
4
Use the help functions to find the default values for the parameters.
Now that we have all of these distributions, what can we do with them? There are four things we can do
them.
Suppose we have a z-score of -1.645. We can find the p-value that corresponds to this z-score using
> pnorm(-1.645)
[1] 0.04998491
In other words, P (x ≤ 1.645) is about 0.05. Thus, by default, pnorm (and p any distribution) returns
P (X ≤ a) where X is some random variable and a is the value for which we want p. If instead we want
P (X > a), we must include in the function call lower.tail = FALSE. We can also specify a normal
distribution of some other mean and standard deviation by passing the proper arguments. See help for
more information.
Table 2: List of Things we can do with Distributions
How?
(I’m getting to this...)
d
p
q
r
What it Does
Evaluate the PDF at a given point.
Get the p-value associated with some value a.
Find critical value for a test (one-sided) of level α OR
find the standardized value at the pth percentile.
Generate n random numbers from the distribution.
We do these things in R by taking something in Table 2 (r, d, q, p) and concatenating it with something in
Table 1 (a distribution): rnorm, rbeta, df, qt, pcauchy.
So rnorm(100) generates 100 random values from the standard normal distribution and
rnorm(10,1000,100) generates 10 random values from N (1000, 100) a normal with mean 1000 and
standard deviation 100. dchisq(5,10) will evaluate the PDF for the χ2 with 10 degrees of freedom at the
value 5.
Sampling
We can draw a sample from a vector with or without replacement using the sample function.
Without any other parameters, sample returns a permutation of the vector.
#Randomize the letters of the alphabet.
> sample(letters)
[1] "d" "f" "k" "v" "c" "y" "e" "s" "p" "w" "q" "a" "g" "l" "z" "n" "u" "m" "o" "b"
[21] "h" "r" "t" "j" "x" "i"
Note that sample defaults to sampling WITHOUT replacement. We can sample with replacement by
passing the parameter replace=TRUE. Here is a random example: generating a random sequence of the
values 1 and -1 for a 3x3 Ising model (202 qual studying stuck in my mind).
> sample(c(1,-1),9,replace=TRUE)
[1] -1 1 1 -1 -1 1 1 1 1
We can draw a random number from 1 to 100...
> sample(1:100,1) # a sample of 1, so with or without replacement doesn’t matter
[1] 39
5
We can also draw a sample where each value has a certain probability of appearing. Suppose one reel of a
slot machine contains 4 bars, 3 lemons, 2 cherries, 1 bell (De Veaux, Intro Stats, p. 336). This slot machine
contains three reels each with the same distribution of symbols. We can simulate a game:
> reel <- c("BAR","LEMON","CHERRY","BELL")
> probability <- c(0.4,0.3,0.2,0.1)
> sample(reel,3,replace=TRUE,prob=probability)
[1] "LEMON" "BAR"
"LEMON"
Now you try it! Are you a winner?
A Note About R 2.5
Starting in R version 2.5, the GUI version will print a closing ) or ” whenever you type an opening ( or ”
which is kind of cool, but also can be annoying. So watch for it...
Writing R Code in a Separate File
By selecting FILE > New Document, you can create your own text file containing R code. This is typically
how you will use R. Write code in this separate file, and then copy/paste your commands into R. Make
sure you save often.
You can run the entire contents of your file by using the source command.
> source("myfile.R")
I highly recommend that you use an editor other than the built-in editor in R because if R crashes, you
lose your work. Use TextEdit (or Notepad on Windows) to store your R code.
Pound # and Semicolon ;
You may have already figured this out, but to make comments in your code (or to block out certain lines of
code when debugging, so they will not run), use the pound sign #. Any text that follows # will not be
executed. Unfortunately, R does not have a multiple-line commenting construct like C or Java. For
multiple-line comments, just start each line with #.
#This is a comment and R will not do anything with it.
makeCoffee()
#goToClass() My program doesn’t work. This line may be the problem so I comment it out.
# If my code runs, then goToClass() is the problem, otherwise it is somewhere else in my code.
goToBed()
We can put more than one statement on a line. I separate each statement with a semicolon ;.
> z <- 0;
x = y = z
Regression, again
It seems that a good number of the new students have had little or no experience with regression. During
the first two or three weeks of 201A, Hongquan will give a quick overview of regression so pay close
attention!
These notes relate to regression as it related to the example I gave in the Bootcamp.
6
We want to fit a linear model to our data of the form
y = β0 + β1 x + β2 x2
(1)
To do this, we use the function lm. Now, one point of confusion may be the question: how is this a linear
model? I did not elaborate on that point in class. The model (1) involves a second-order term x2 and the
fit is a parabola, however, (1) is linear in β, and this is what we are referring to, and why we can use lm
to fit the model. So, as long as the model is linear in β we can have higher order terms, and interactions
and still use lm.
Note: We could have also fitted the model this way
lm(y ~ x + I(x^2))
eliminating the matrix work and the stuff with cbind, but now we have this mysterious I() function
floating in there. In this case I(x2 ) forces R to compute x2 before evaluating the regression, and is
necessary when including higher order terms.
Recall that the fit object we created was of type lm and that I could extract useful information from this
object. Such information includes the fitted values (fitted) and the residuals (residuals). Thus, we can
construct a residuals vs. fitted plot manually.
I can access these values using the $ operator.
Let’s construct a residuals vs. fitted plot.
>
>
>
>
>
y <- temp[9,1:50]
x <- 1:50
X <- cbind(x,x^2)
fit <- lm(y~X)
plot(fit$fitted,fit$residuals,xlab="fitted values",ylab="residuals", main="Residuals vs. Fitted Plot")
We use this plot to verify the constant variance assumption (you’ll get to it, don’t worry about it here). I
can add a lowess smoothing curve over this plot to get a feel for the general trend in the plot. We hope
that this curve will be a straight line, which indicates constant variance. BUT, do not read too much
into this curve...it’s just a guideline.
We use the lines function to overlay a red lowess curve over the points. lowess in its basic form takes two
parameters, x and y, just like plot.
Residuals vs. Fitted Plot
2
● ●
●
●
●
● ●
●
●
●
0
●
●●
●
●
−2
●
● ● ●
●
●
●
●
●
●
●
●
● ●
●
●
● ●
●
−4
residuals
●●
●
●
●●
● ●
●
●
●
●
●
−6
●
●
●
20
25
30
35
40
fitted values
> plot(fit$fitted,fit$residuals,xlab="fitted values",ylab="residuals", main="Residuals vs. Fitted Plot")
> lines(lowess(fit$fitted,fit$residuals),col="red")
7
Finally, let’s construct a normal quantile plot to test the assumption of normally distributed residuals.
qqnorm constructs the actual plot and qqline overlays the line usually associated with it.
> qqnorm(fit$residuals)
> qqline(fit$residuals,lty=2,col="grey")
Normal Q−Q Plot
●
●
●● ●
−2
0
Sample Quantiles
2
●
●●●●
●●●
●●●
●●
●
●
●●●
●●●
●●
●
●
●
●●
●●
●●●
●●●●
●●
●
−4
●
●
−6
●
●
●
−2
−1
0
1
2
Theoretical Quantiles
Now let’s throw these plots together and show Hongquan and Nicole our amazing R skills!
Since this code is long, I put it into a separate file and then copy/pasted it into R. Recall that
par(mfrow=c(1,2)) will plot two-plots side by side row-wise.
par(mfrow=c(1,2))
plot(fit$fitted,fit$residuals,xlab="fitted values",ylab="residuals",
main="Residuals vs. Fitted Plot")
lines(lowess(fit$fitted,fit$residuals),col="red")
qqnorm(fit$residuals)
qqline(fit$residuals,lty=2,col="grey")
Residuals vs. Fitted Plot
●●
●
●
●
●
● ●●
●
−2
● ●
●
●
●
●●
●
●
●
●
●●
●
● ●
●
−4
residuals
●
●
0
●●
●
● ●
●
2
●
●
0
●
●
−2
●
●● ●
●
●●●●
●
●
●
●●
●●●
●
●●
●
●
●●
●●●
●●
●
●
●
●●
●●●
●
●●●
●
●
●
−4
●
● ●
●
●
Sample Quantiles
2
● ●
Normal Q−Q Plot
●
●
●
20
25
30
35
●
●
●
−6
−6
●
●
●
40
●
−2
fitted values
−1
0
1
2
Theoretical Quantiles
Due to the amount of material covered, this year loops, conditional statements, functions,
and object-oriented programming were not mentioned in the Bootcamp. Instead, they will
8
be covered in 202A.
9