Download R Bootcamp Addendum Important Plots: Histogram, Boxplot, and

R Bootcamp Addendum September 24, 2007 Ryan Rosario [email protected] This document outlines a series of R functions that were not covered last year or this year in the R Bootcamp, but that I feel are important. This should also help you get through most of your first assignment. Important Plots: Histogram, Boxplot, and Scatterplot Matrix In 201A, you will be asked to describe data numerically and graphically. Part of your visual analysis will involve histograms, boxplots and scatterplot matrices. I am using the temp dataset from the presentation if you wish to follow along, but note that copy and pasting from this document may not work. Histograms are easy to construct. The following plots a histogram of the first row of temp (the temperature at location 1, throughout the day). > hist(temp[1,],xlab="Temperature",ylab="Frequency", main="Temperature at Location 1 over a Full Day") 30 20 0 10 Frequency 40 50 60 Temperature at Location 1 over a Full Day −5 0 5 10 15 20 25 30 Temperature A basic histogram can be constructed using hist(temp[1,]) but I recommend always including axis and plot labels. 1 A histogram is an object of type hist. We can store the hist object and extract useful information from it. By default, hist will plot the histogram on the screen whether or you actually want to see it visually. When we are interested in accessing the attributes of the histogram (and not the visual representation itself), we can turn plotting off using the parameter plot=FALSE. > hist(temp[1,],plot=FALSE) > attributes(my.hist) $names [1] "breaks" "counts" [7] "equidist" "intensities" "density" "mids" "xname" $class [1] "histogram" This can be very useful. Suppose we just want to get a vector of counts of how many observations fall into a certain bin (interval) of the histogram. We can access this > my.hist$counts [1] 3 15 32 24 37 32 53 31 7 11 1 2 0 1 A boxplot is also simple to construct. > boxplot(temp[1,],xlab="Temperature",ylab="Frequency", main="Temperature at Location 1 over a Full Day") 15 10 0 5 Frequency 20 25 Temperature at Location 1 over a Full Day I only include a y axis here since it does not make much sense to include an x axis in this case. We can also construct boxplots of a variable conditioned on a categorical variable (factor). Let’s look at the temperatures for a full day, given we know the type of groundcover at that location. There are two ways to do this. We can use traditional notation... > boxplot(groundcover$type,temp) #x, y where y is temp We can also use formula notation like is used for the lm function. That is, we write y ∼ x meaning y as a function of x. > boxplot(temp ∼ groundcover$type) If you follow along, the plot below will not match because the actual code includes other parameters. I left them out above, to point out the difference between the two calls. To get a quick idea of the relationship among 3 or more variables, it is useful to construct a scatterplot matrix using pairs. For this example, I load a different dataset (Xu, 2006). I pass the whole data frame to pairs but a subset will also work. 2 0 10 20 30 Tempereature 40 50 Temperature by Ground Cover Type plant rock sand Groundcover Type > sat <- read.table("http://www.stat.ucla.edu/~rosario/boot/sat.dat") > pairs(sat) ● ● ● ● ● ● ●● ● ●● ● ●● ●● ●●● ●● ●● ●● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ●●● ● ● ●●● ● ● ● ● ●●● ●● ● ●● ● ● ● 450 ●● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ●● ●●● ●● ●● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ● ●● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ●● ● ● 550 ●● ●● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ● ● ● ●● ●● ● ●● ● ● ● ● ● ● ●● ● ●●● ●● ● ● ● ● ●● ●● ●● ● ● ● ● ●●● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● 22 ● 60 ● 10 20 ●● ●● 8 22 6 ● 4 18 ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●●●●● ● ● ●● ●●● ●● ●● ● ● ●● ●●● ● ●● ● ● ●● ● expend ● 14 ● ● ● ● ●●● ●● ● ● ●● ● ● ● ●● ●●● ● ●● ● ●● ● ● ● ● ● ●● ● ●● ●● ● ● ●● ● ● ● ●●● ● ● ● ● ●●● ● ● ●●●●● ● ●● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ●● ● ●● ●● ●● ● ● ● ●● ● ● ● ● ●● ●● ● ● ● ●● ● ●●● ●● ● ●● 6 ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ●● ● ● ● ● ● ● ● ● ●●● ● ●● ●●●●●●●● ● ● ● ●● 4 ●● ● ● ● ● ●● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● 8 10 ●● ●● ● ● ● ●● ● ●● ● ●● ● ● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ●● ● ● ● ●●● ●● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ● ● ●●● ● ● ●● ●● ● ● ● ●● ●●● ●●● ● ● ● ●● ● ●●● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ●● ● ● ● ●● ●● ● ●● ● ●●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●● ●● ● ● ●● ● ●● ● ● ● ●● ● ●● ● ● ● ● ●● ● ● ●●●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ●● ● ● ● ●● ●● ● ● ● ●● ● ●● ● ● ●● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ●●● ●● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ●● ● ● ●● ● ● ● ●● ● ●● ● ●● ● ●● ● ● takers ● ● ● ●● ●● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ●●●● ● ● ● ● ●● ● ● ●● ●●●● ● ● ●●● ● ● ● ● ● ● ● ●●● ● ●● ●● ● ● ● ● ● ●●● ● ● ●● ● ● ●● ● ●● ●● ● ●● ●● ●● ● ● ● ●● ●● ● ● ●● ● ●● ●● ● ● ●● ●●● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●●● ●● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ●● ● ●●●●●● ● ● ● ● ●● ● ●● ● ● ●●● ●● ● ●● ●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ● ●● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ●●● ● ●● ● ● ● ● ●● ●● ● ● ●● ● ●● ● ●● ● ● ● ●● ● ● ● 40 ● ● ● ● ● ● ● ● ●● ●● ● ● ●● ●●●● ● ● ● ●● ● ●● ● ●● ● ● ● ●● ● ●● ●● ● ● ●●●● ●● ● ● ●● ● ● ● ● ● ●● ●● ● ●● ● ● ●●● ● ● ●● 30 ● ● ● ● ●●● ● ● ● ● ● ●● ●●● ● ●● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ●● ● ● ● ●● ● ●● ● ●● ● ● ●● ● ●● ●●● ●● ●● ● ● ● ● ● ● ● ●●●●● ●● ●● ●● ● ● ●● ●● ● ● ● ● ●● ●●●● ●●●●● ●● ● ●● ● ● ●● ●● ● ● ● ●● ● ● ● ●●● ●● ●●● ● ● ● ● ● ●●● ● ● ● ●● ● ● ●● ● 50 ● ●● ● ● ● ● ● ●● ●● ● ● ●● 400 3 ● ●● ● 460 ● ●● ●●● ●● ● ●●● ● ●● ●● ● ● ● ● ● ● ● ● ●● ●● ● ●● ● ●● ● ● ●●● ●● math ● ● ● ● ●● ● ●● ● ● ●● ● ●● ●● ●● ● ●● ● ● ●● ● ● ● ● ● ● ●● ●● ● ● ●●●● ●● ● ●● ●● ● ●● ●● ● ●● ● ●● ● ● ● ●●● ● ● ● ● ● ● ● verbal ● ●● ● ● ● ● ● ●● ● ●●● ●● ● ● ● ● ● ● ● ●● ●● ●● ● ● ●●● ●●● ●● ● ● ● ●● ● ●● ● ● ●● ● ● ●● ● ●●●● ● ●● ●● ●● ●● ●● ● ● total ● ● ● ● ●● ●● ● ● ● ● ●●● ● ●● ● ● ●● 520 50 ● ● ● ● ● 40 ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●●●●● ● ● ●● salary ●● ● ● ●● ● ● ● ●● ● ● ●●● ● ● ●● ●● ●●● ● ●● ● ● ● ● ●● ● ● ●●● ● ● ●● ●● ● ● 30 ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ●● ●●●● ●●● ● ● ● ● ● ●● ● ●● ● ●●●● ●● ●● ● ●● 520 60 20 450 550 ● ●●● ●● ● ●● ● ● ●●●● ● ●● ● ●●● ●●●● ● ● ● ● ● ● ●● ●● ●● ● ● ●● ● ●● ● ● ●● ●● ● ●●● ●● ●● ● ●● ● ● ●●● ● ● ●● ● ● ● ● ● ●● ● ● ●●●● ●● ●● ● ●● ●● ● ● ● ● ● ●●● ● ●● ● ● ●● ●● ● ● ●● ●● ● ● ● ● ● ● ● ●● ●●● ● ● ● ● ● ●●● ●●● ● ● ● ● ● ● ●● ● ●●● ●● ● ●● ●● ● ●● ● 460 ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ●● ●● ● ●●●●●● ●● ●●● ●●● ● ● ● ●● ● ● ● ● ●● ● 400 ● ● ● ● ● ●● ● ●●● ●●● ● ●● ● ● ● ●● ●● ●● 1000 ratio ● ● ●● ● ●● ●● ● ● ● ● ●● ●● ●●● ● ● ●● ●● ●●● ●●● ● ●● ● ● ●● ●● ● ● ● ● ● ● ● 850 14 18 ●●● ●●●● ● ● ● ●● ● ● ●● ● ●● ● ●● ●● ●●● ● ● ● ●●● ● ● ● ● ●● ●● ●● 850 950 1100 A Numerical Analysis of Data We can also produce a numerical analysis of the data by using the summary function. summary will calculate and display the mean, median, first and third quartile, minimum and maximum of a variable. > summary(temp[,1]) Min. 1st Qu. Median 9.00 15.20 19.10 Mean 3rd Qu. 18.61 21.80 Max. 34.10 The functions that calculate the numbers above are min(temp[,1]), quantile(temp[,1],0.25), median(temp[,1]), mean(temp[,1]), quantile(temp[,1],0.75) and max(temp[,1]) respectively. We can also get the variance and standard deviation using sd(temp[,1]) and var(temp[,1]) respectively. These functions are on your reference card and their usage is trivial. Working with Distributions R makes it simple to work distributions. We can extract certain information from a distribution using a combination of two pieces of information. To generate a random number from the standard normal distribution, use rnorm(n) where n is the number of random numbers you wish to generate. By not including a value for µ or σ, we rely on the parameter default values of µ = 0 and σ = 1. In the help file for rnorm, default arguments are displayed as parameter = defaultValue. The help file shows rnorm(n, mean=0, sd=1) as the correct form for entering in the rnorm function. mean=0,sd=1 means that if you do not specify a mean or standard deviation, R will assume the mean is 0 and the standard deviation is 1. To generate a random uniform number, we can use runif(n) with default parameters min = 0 and max = 1. We can also work with other distributions: Table 1: List of Distributions Distribution Name R Syntax denoting Distribution norm unif binom exp gamma pois weibull cauchy beta t f chisq geom hyper logis lnorm nbinom wilcox normal uniform binomial exponential gamma poisson weibull cauchy beta Student’s t F (Fisher-Snedecor χ2 ) Pearson’s χ2 geometric hypergeometric logistic lognormal negative binomial Wilcoxon’s statistics 4 Use the help functions to find the default values for the parameters. Now that we have all of these distributions, what can we do with them? There are four things we can do them. Suppose we have a z-score of -1.645. We can find the p-value that corresponds to this z-score using > pnorm(-1.645) [1] 0.04998491 In other words, P (x ≤ 1.645) is about 0.05. Thus, by default, pnorm (and p any distribution) returns P (X ≤ a) where X is some random variable and a is the value for which we want p. If instead we want P (X > a), we must include in the function call lower.tail = FALSE. We can also specify a normal distribution of some other mean and standard deviation by passing the proper arguments. See help for more information. Table 2: List of Things we can do with Distributions How? (I’m getting to this...) d p q r What it Does Evaluate the PDF at a given point. Get the p-value associated with some value a. Find critical value for a test (one-sided) of level α OR find the standardized value at the pth percentile. Generate n random numbers from the distribution. We do these things in R by taking something in Table 2 (r, d, q, p) and concatenating it with something in Table 1 (a distribution): rnorm, rbeta, df, qt, pcauchy. So rnorm(100) generates 100 random values from the standard normal distribution and rnorm(10,1000,100) generates 10 random values from N (1000, 100) a normal with mean 1000 and standard deviation 100. dchisq(5,10) will evaluate the PDF for the χ2 with 10 degrees of freedom at the value 5. Sampling We can draw a sample from a vector with or without replacement using the sample function. Without any other parameters, sample returns a permutation of the vector. #Randomize the letters of the alphabet. > sample(letters) [1] "d" "f" "k" "v" "c" "y" "e" "s" "p" "w" "q" "a" "g" "l" "z" "n" "u" "m" "o" "b" [21] "h" "r" "t" "j" "x" "i" Note that sample defaults to sampling WITHOUT replacement. We can sample with replacement by passing the parameter replace=TRUE. Here is a random example: generating a random sequence of the values 1 and -1 for a 3x3 Ising model (202 qual studying stuck in my mind). > sample(c(1,-1),9,replace=TRUE) [1] -1 1 1 -1 -1 1 1 1 1 We can draw a random number from 1 to 100... > sample(1:100,1) # a sample of 1, so with or without replacement doesn’t matter [1] 39 5 We can also draw a sample where each value has a certain probability of appearing. Suppose one reel of a slot machine contains 4 bars, 3 lemons, 2 cherries, 1 bell (De Veaux, Intro Stats, p. 336). This slot machine contains three reels each with the same distribution of symbols. We can simulate a game: > reel <- c("BAR","LEMON","CHERRY","BELL") > probability <- c(0.4,0.3,0.2,0.1) > sample(reel,3,replace=TRUE,prob=probability) [1] "LEMON" "BAR" "LEMON" Now you try it! Are you a winner? A Note About R 2.5 Starting in R version 2.5, the GUI version will print a closing ) or ” whenever you type an opening ( or ” which is kind of cool, but also can be annoying. So watch for it... Writing R Code in a Separate File By selecting FILE > New Document, you can create your own text file containing R code. This is typically how you will use R. Write code in this separate file, and then copy/paste your commands into R. Make sure you save often. You can run the entire contents of your file by using the source command. > source("myfile.R") I highly recommend that you use an editor other than the built-in editor in R because if R crashes, you lose your work. Use TextEdit (or Notepad on Windows) to store your R code. Pound # and Semicolon ; You may have already figured this out, but to make comments in your code (or to block out certain lines of code when debugging, so they will not run), use the pound sign #. Any text that follows # will not be executed. Unfortunately, R does not have a multiple-line commenting construct like C or Java. For multiple-line comments, just start each line with #. #This is a comment and R will not do anything with it. makeCoffee() #goToClass() My program doesn’t work. This line may be the problem so I comment it out. # If my code runs, then goToClass() is the problem, otherwise it is somewhere else in my code. goToBed() We can put more than one statement on a line. I separate each statement with a semicolon ;. > z <- 0; x = y = z Regression, again It seems that a good number of the new students have had little or no experience with regression. During the first two or three weeks of 201A, Hongquan will give a quick overview of regression so pay close attention! These notes relate to regression as it related to the example I gave in the Bootcamp. 6 We want to fit a linear model to our data of the form y = β0 + β1 x + β2 x2 (1) To do this, we use the function lm. Now, one point of confusion may be the question: how is this a linear model? I did not elaborate on that point in class. The model (1) involves a second-order term x2 and the fit is a parabola, however, (1) is linear in β, and this is what we are referring to, and why we can use lm to fit the model. So, as long as the model is linear in β we can have higher order terms, and interactions and still use lm. Note: We could have also fitted the model this way lm(y ~ x + I(x^2)) eliminating the matrix work and the stuff with cbind, but now we have this mysterious I() function floating in there. In this case I(x2 ) forces R to compute x2 before evaluating the regression, and is necessary when including higher order terms. Recall that the fit object we created was of type lm and that I could extract useful information from this object. Such information includes the fitted values (fitted) and the residuals (residuals). Thus, we can construct a residuals vs. fitted plot manually. I can access these values using the $ operator. Let’s construct a residuals vs. fitted plot. > > > > > y <- temp[9,1:50] x <- 1:50 X <- cbind(x,x^2) fit <- lm(y~X) plot(fit$fitted,fit$residuals,xlab="fitted values",ylab="residuals", main="Residuals vs. Fitted Plot") We use this plot to verify the constant variance assumption (you’ll get to it, don’t worry about it here). I can add a lowess smoothing curve over this plot to get a feel for the general trend in the plot. We hope that this curve will be a straight line, which indicates constant variance. BUT, do not read too much into this curve...it’s just a guideline. We use the lines function to overlay a red lowess curve over the points. lowess in its basic form takes two parameters, x and y, just like plot. Residuals vs. Fitted Plot 2 ● ● ● ● ● ● ● ● ● ● 0 ● ●● ● ● −2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −4 residuals ●● ● ● ●● ● ● ● ● ● ● ● −6 ● ● ● 20 25 30 35 40 fitted values > plot(fit$fitted,fit$residuals,xlab="fitted values",ylab="residuals", main="Residuals vs. Fitted Plot") > lines(lowess(fit$fitted,fit$residuals),col="red") 7 Finally, let’s construct a normal quantile plot to test the assumption of normally distributed residuals. qqnorm constructs the actual plot and qqline overlays the line usually associated with it. > qqnorm(fit$residuals) > qqline(fit$residuals,lty=2,col="grey") Normal Q−Q Plot ● ● ●● ● −2 0 Sample Quantiles 2 ● ●●●● ●●● ●●● ●● ● ● ●●● ●●● ●● ● ● ● ●● ●● ●●● ●●●● ●● ● −4 ● ● −6 ● ● ● −2 −1 0 1 2 Theoretical Quantiles Now let’s throw these plots together and show Hongquan and Nicole our amazing R skills! Since this code is long, I put it into a separate file and then copy/pasted it into R. Recall that par(mfrow=c(1,2)) will plot two-plots side by side row-wise. par(mfrow=c(1,2)) plot(fit$fitted,fit$residuals,xlab="fitted values",ylab="residuals", main="Residuals vs. Fitted Plot") lines(lowess(fit$fitted,fit$residuals),col="red") qqnorm(fit$residuals) qqline(fit$residuals,lty=2,col="grey") Residuals vs. Fitted Plot ●● ● ● ● ● ● ●● ● −2 ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● −4 residuals ● ● 0 ●● ● ● ● ● 2 ● ● 0 ● ● −2 ● ●● ● ● ●●●● ● ● ● ●● ●●● ● ●● ● ● ●● ●●● ●● ● ● ● ●● ●●● ● ●●● ● ● ● −4 ● ● ● ● ● Sample Quantiles 2 ● ● Normal Q−Q Plot ● ● ● 20 25 30 35 ● ● ● −6 −6 ● ● ● 40 ● −2 fitted values −1 0 1 2 Theoretical Quantiles Due to the amount of material covered, this year loops, conditional statements, functions, and object-oriented programming were not mentioned in the Bootcamp. Instead, they will 8 be covered in 202A. 9

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download R Bootcamp Addendum Important Plots: Histogram, Boxplot, and