Statistics Workshop Day 1: Introduction to R
Tarik C. Gouhier* (Marine Science Center, Northeastern University)
May 11, 2013

* [email protected]

A brief introduction to RStudio

If you have not already done so, please download and install R (http://www.r-project.org). Although R comes with its own Integrated Development Environment (IDE), we will be using a new and better open-source IDE called RStudio (http://www.rstudio.org). The job of the IDE is to make your life easier. You will be spending a lot of time writing code and analyzing results, so you need to make sure that you have the best environment available. IDEs provide a single application to program, access help, download packages, and run your code.

Once you have installed and launched RStudio, you will be greeted with a 4-panel window:

[Figure: the RStudio window with its four panels: Editor (top left), Command history (top right), Console (bottom left), Help/Packages (bottom right)]

The different tabs in the top left panel show the content of your open .R files. The different tabs in the top right panel show the command history (i.e., the commands that you have run in the past) and your workspace (i.e., the variables and functions that you have created during the R session). The bottom left panel shows the console, where the output of your commands appears. The different tabs in the bottom right panel show the figures that you have plotted, a list of installed packages, help on specific functions, and the list of files in your current working directory. You can switch to each panel using keyboard shortcuts (e.g., on Macs, control-1: editor, control-2: console, control-3: help).

RStudio has a number of extremely useful features, most of which are easily discoverable. Here, I will focus on the killer feature: code completion via the TAB key in the console and editor panels.
Once you have typed the first few leading letters of a function or variable name, hit the TAB key to get a list of candidate functions and variables that start with those letters:

[Figure: TAB completion of a variable name]

You can then use the keyboard to navigate the list of candidate functions/variables and press RETURN to make a selection. For functions, autocompletion shows the package that the function belongs to in curly braces and a small snippet of documentation:

[Figure: TAB completion of a function name]

If you want additional details about the function, you can hit the F1 key to bring up the full documentation for the function in the help tab located in the bottom right panel of RStudio. Note that the F1 shortcut will only work on Macs if you select the "Use all F1, F2, etc. keys as standard function keys" option in the Keyboard section of the System Preferences.

Crucially, TAB completion also works for function arguments. Once the list of arguments appears, you can scroll up/down using the keyboard to select an argument and then press RETURN to have it appear in the console:

[Figure: TAB completion of function arguments]

Now that you are familiar with the basic functionality of the IDE, you are ready to tackle some exercises.

Lab exercises

These exercises are designed to simulate tasks that are commonly required to manipulate, analyze and present results for publication. For these exercises, you will be using real ecological (intertidal) and environmental datasets.

Task 1: Create a map of the study system

1. Download the intertidal dataset from the web using R:

   d <- read.csv("http://www.northeastern.edu/synchrony/stats/pisco.csv")

2. Download and load the maps and mapdata packages:

   install.packages(c("maps", "mapdata"))
   require(maps)
   require(mapdata)

3. Now that you have the data and the required packages, plot the location of the sites in the dataset on a map.
You should be able to get something that resembles the following:

# SOLUTION:
d <- read.csv("http://www.northeastern.edu/synchrony/stats/pisco.csv")
require(maps, quietly = TRUE)
require(mapdata, quietly = TRUE)
# Load site coordinates
sites <- unique(cbind(d$sitenum, d$latitude, d$longitude))
map("state", regions = c("washington", "oregon", "california", "nevada",
    "idaho", "montana"), xlim = c(-126, -115), fill = TRUE, col = "lightgray")
axis(1, pretty(seq(-126, -115), n = 4), cex = 1.5)
axis(2, pretty(seq(32.5, 48.5, 3), n = 6), cex = 1.5)
box()
# Add axis labels
mtext(1, text = expression(paste("Longitude (", degree, "W)")), line = 3)
mtext(2, text = expression(paste("Latitude (", degree, "N)")), line = 3)
# Label each state
text(-121, 43.7, "OR", cex = 1.5)
text(-120.65, 47.2, "WA", cex = 1.5)
text(-121, 38, "CA", cex = 1.5)
# Plot sites
points(sites[, 3], sites[, 2], pch = 21, bg = "red", col = "black",
    lwd = 2, cex = 1.2)

[Figure: map of WA, OR and CA with the study sites marked as red points; axes show Longitude (°W) and Latitude (°N)]

Task 2: Identify trends in the data

We can attempt to detect both spatial and temporal trends by plotting the data. Let's focus on the abundance (measured as percent cover) of the dominant mussel Mytilus californianus and a key environmental variable, namely Sea Surface Temperature (sst_mean).

1. Extract the relevant data from the dataset using function subset.

2. Plot a 2-panel figure showing mussel cover (in panel 1) and mean SST (in panel 2) as a function of year number. Use different lines to represent the time series at each site and use a gradient based on latitude to assign the color of each line (e.g., northern sites should be represented with warmer colors than southern sites). After some time, you should be able to generate the following figure.
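Step 1's call to subset() can be sketched on a toy data frame first (the column names and values here are hypothetical, not from the pisco dataset):

```r
# Toy sketch of subset(): keep rows matching a condition, select columns.
df <- data.frame(species = c(75, 75, 12), site = 1:3, cover = c(10, 20, 30))
subset(df, species == 75, select = c("site", "cover"))
# -> two rows (sites 1 and 2) with columns site and cover
```

The real extraction works the same way, just with more columns.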
Hint: you may want to reshape your dataset to plot the lines simultaneously and use function heat.colors to generate a color gradient.

# SOLUTION:
d <- read.csv("http://www.northeastern.edu/synchrony/stats/pisco.csv")
# Plot spatial and temporal trends in cover and SST
m <- subset(d, species == 75, select = c("sitenum", "latitude", "longitude",
    "yearnum", "cover", "upindex_mean", "chla_mean", "sst_mean"))
m.wide <- reshape(m, timevar = c("yearnum"), idvar = c("sitenum", "latitude",
    "longitude"), direction = "wide", drop = c("upindex_mean", "chla_mean",
    "sst_mean"))
s.wide <- reshape(m, timevar = c("yearnum"), idvar = c("sitenum", "latitude",
    "longitude"), direction = "wide", drop = c("upindex_mean", "chla_mean",
    "cover"))
# x = year
col <- heat.colors(length(unique(m$sitenum)))
par(mfrow = c(2, 1), mar = c(1, 4, 1, 1), oma = c(3, 0.5, 0, 0))
matplot(unique(d$yearnum), t(m.wide[, 4:NCOL(m.wide)]), t = "p", col = "black",
    lty = 1, pch = 21, bg = col, ylab = "Mussel cover (%)", xaxt = "n",
    xlab = "", main = "Temporal trends with latitude coded by color")
axis(1, at = axTicks(1), label = NA)
matplot(unique(d$yearnum), t(s.wide[, 4:NCOL(s.wide)]), t = "p", col = "black",
    lty = 1, pch = 21, bg = col, ylab = expression(paste("SST (", degree, ")")),
    xlab = "Year number", xpd = NA)

[Figure: 2-panel plot "Temporal trends with latitude coded by color" showing Mussel cover (%) and SST (°) against Year number (1-6)]

3. There appear to be no temporal patterns. Let's try to see if there are any detectable spatial patterns in the data. To do so, generate the same 2-panel figure as above by plotting mussel cover (panel 1) and SST (panel 2) as a function of latitude, using a gradient based on year number to assign a color to each line.

# SOLUTION:
m.wide <- m.wide[order(m.wide$latitude), ]
s.wide <- s.wide[order(s.wide$latitude), ]
col <- heat.colors(length(unique(m$yearnum)))
par(mfrow = c(2, 1), mar = c(1, 4, 1, 1), oma = c(3, 0.5, 0, 0))
matplot(m.wide$latitude, m.wide[, 4:NCOL(m.wide)], t = "p", col = "black",
    lty = 1, pch = 21, bg = col, ylab = "Mussel cover (%)", xlab = "",
    xaxt = "n", main = "Spatial trends with years coded by color")
locs <- seq(from = min(m.wide$latitude), to = max(m.wide$latitude), length = 5)
axis(1, at = locs, label = NA)
matplot(s.wide$latitude, s.wide[, 4:NCOL(s.wide)], t = "p", col = "black",
    lty = 1, pch = 21, bg = col, ylab = expression(paste("SST (", degree, ")")),
    xaxt = "n", xlab = expression(paste("Latitude (", degree, "N)")), xpd = NA)
axis(1, at = locs, label = format(locs, dig = 2))

[Figure: 2-panel plot "Spatial trends with years coded by color" showing Mussel cover (%) and SST (°) against Latitude (°N, roughly 33-48)]

There's a clear latitudinal gradient in SST but there appears to be no trend in mussel cover. An interesting mismatch?

Task 3: Use Monte Carlo randomizations to assess trends

We can formally test whether mussel cover varies latitudinally using Monte Carlo randomizations. We know that Cape Blanco (43°N) has been identified as an environmental breakpoint between the northern and southern parts of the California Current System. We can test whether this environmental breakpoint is reflected in latitudinal patterns of mussel populations by comparing the mean mussel cover north vs. south of Cape Blanco. We will use Monte Carlo randomization methods to determine whether the difference in mean mussel cover between the northern and southern regions is statistically significant. Monte Carlo methods all consist of five steps: (1) define the null hypothesis (e.g., there is no difference in mean mussel cover between the north and the south), (2) compute the test statistic in the observed dataset (e.g., the difference in mean mussel cover between north and south), (3) generate a distribution of randomized datasets by shuffling the observed data many times, (4) compute the test statistic for each randomized dataset, and (5) compare the observed statistic to those generated via randomization to determine the p-value.

1. To begin, add a column called 'region' of type factor to the dataset, with two levels: 'north' for sites that lie north of 43°N and 'south' for the rest.

2. Compute the difference in mean mussel cover between north and south (i.e., the observed test statistic).

3. To generate the n = 999 random datasets, you will have to use a for loop. For each iteration of the loop, randomly shuffle the values in the 'region' column and compute the test statistic in the resulting dataset.

4.
After the loop, determine the proportion of randomized datasets with a test statistic whose magnitude is greater than or equal to the one observed in the real dataset. This is your p-value. If it is smaller than some predetermined critical level (typically α = 0.05), you can conclude that the observed value is statistically significant (i.e., reject the null hypothesis and accept the alternate).

5. Plot the distribution of the test statistic obtained from the randomized datasets using the hist function.

6. Add a vertical line indicating the location of the test statistic obtained in the observed dataset. It should look something like the following:

# SOLUTION:
d <- read.csv("http://www.northeastern.edu/synchrony/stats/pisco.csv")
m <- subset(d, species == 75, select = c("sitenum", "latitude", "longitude",
    "yearnum", "cover"))
m$region <- as.factor(ifelse(m$latitude < 43, "south", "north"))
n <- 999
## North - South
obs.mean <- diff(aggregate(cover ~ region, data = m, FUN = mean,
    na.rm = T)$cover)
rands.mean <- numeric(n + 1) * NA
for (i in 1:n) {
    tmp <- sample(m$region)
    rands.mean[i] <- mean(subset(m$cover, tmp == "north")) -
        mean(subset(m$cover, tmp == "south"))
}
rands.mean[n + 1] <- obs.mean
alpha <- 0.05
mean.pval <- sum(abs(rands.mean) >= abs(obs.mean))/n
hist(rands.mean, xlab = "Distribution of test statistic from randomized datasets",
    main = paste("P-value: ", format(mean.pval, dig = 3), sep = ""))
box()
abline(v = obs.mean, col = "red", lwd = 2, lty = 2)

[Figure: histogram of the randomized test statistic (title "P-value: 0.004") with a dashed red vertical line marking the observed value]

Task 4: Identifying periodic trends in climate data

Marine and terrestrial ecosystems respond in strong but complex ways to climate forcing.
Resolving the relationship between climate and ecosystem dynamics is particularly difficult because the climate is characterized by non-stationary fluctuations. We will focus on identifying trends in three key indices: the Pacific Decadal Oscillation (PDO), the Multivariate El Niño Southern Oscillation Index (MEI) and the North Atlantic Oscillation (NAO).

1. Download all three indices from the web using R:

   pdo <- read.table("http://jisao.washington.edu/pdo/PDO.latest", header = T,
       skip = 29, nrows = 113)
   mei <- read.table("http://www.esrl.noaa.gov/psd/enso/mei/table.html",
       header = T, skip = 13, nrows = 62)
   nao <- read.table("ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/nao_index.tim",
       header = T, skip = 8)

2. Compute the annual mean and standard deviation of each climate index. Note that the last few entries in the year column of the PDO dataset have asterisks to indicate a change in the way the index was computed. Unfortunately, this pollutes the year column, which R codes as a factor. You will need to convert this column to a numeric in order to plot the data. Hint: you may need to convert the column to a character and then use function substr before converting it to a numeric.

3. Create a 2-panel figure and plot the mean (panel 1) and standard deviation (panel 2) of the climate indices in different colors.
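The conversion in step 2's hint can be sketched on a toy factor (hypothetical values mimicking the asterisk-polluted PDO year column):

```r
# A year column polluted by trailing asterisks is read by R as a factor.
year <- factor(c("2010", "2011", "2012**"))
# Convert to character first; as.numeric() applied directly to a factor
# would return the underlying level codes rather than the years.
as.numeric(substr(as.character(year), start = 1, stop = 4))
# -> 2010 2011 2012
```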
The final result should look something like this:

# SOLUTION:
nao <- read.table("ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/nao_index.tim",
    header = T, skip = 8)
mei <- read.table("http://www.esrl.noaa.gov/psd/enso/mei/table.html",
    header = T, skip = 13, nrows = 62)
pdo <- read.table("http://jisao.washington.edu/pdo/PDO.latest", header = T,
    skip = 29, nrows = 113)
avg.mei <- data.frame(year = mei$YEAR, mei = rowMeans(mei[, 2:NCOL(mei)]))
sd.mei <- data.frame(year = mei$YEAR, mei = apply(mei[, 2:NCOL(mei)], 1,
    FUN = sd))
pdo$YEAR <- as.numeric(substr(as.character(pdo$YEAR), start = 1, stop = 4))
avg.pdo <- data.frame(year = pdo$YEAR, pdo = rowMeans(pdo[, 2:NCOL(pdo)]))
sd.pdo <- data.frame(year = pdo$YEAR, pdo = apply(pdo[, 2:NCOL(pdo)], 1,
    FUN = sd))
avg.nao <- aggregate(INDEX ~ YEAR, data = nao, FUN = mean)
sd.nao <- aggregate(INDEX ~ YEAR, data = nao, FUN = sd)
par(mfrow = c(2, 1), mar = c(1, 4, 1, 1), oma = c(3, 0.5, 0, 0))
plot(avg.mei$year, avg.mei$mei, t = "l", xlab = "", ylab = "Mean",
    xlim = range(avg.pdo$year, avg.mei$year, avg.nao$YEAR),
    ylim = range(avg.pdo$pdo, avg.mei$mei, avg.nao$INDEX), xaxt = "n")
lines(avg.pdo$year, avg.pdo$pdo, t = "l", col = "red")
lines(avg.nao$YEAR, avg.nao$INDEX, t = "l", col = "blue")
abline(h = 0, lty = 2)
legend(x = "topleft", legend = c("MEI", "PDO", "NAO"), lty = 1,
    col = c("black", "red", "blue"))
axis(1, at = axTicks(1), label = NA)
plot(sd.mei$year, sd.mei$mei, t = "l", xlab = "Year",
    ylab = "Standard deviation",
    xlim = range(avg.pdo$year, avg.mei$year, avg.nao$YEAR), xpd = NA)
lines(sd.pdo$year, sd.pdo$pdo, t = "l", col = "red")
lines(sd.nao$YEAR, sd.nao$INDEX, t = "l", col = "blue")
abline(h = 0, lty = 2)

[Figure: 2-panel time series (roughly 1900-2000s) showing the annual mean (panel 1, with a MEI/PDO/NAO legend) and the annual standard deviation (panel 2) of the three climate indices]
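The solution above uses two different aggregation idioms: rowMeans() for the wide (one column per month) MEI and PDO tables, and aggregate() for the long-format NAO table. A toy comparison (hypothetical values) shows they produce the same yearly means:

```r
# Wide format: one row per year, one column per month
wide <- data.frame(YEAR = c(2000, 2001), JAN = c(1, 3), FEB = c(2, 5))
rowMeans(wide[, 2:NCOL(wide)])
# -> 1.5 4.0

# Long format: one row per (year, month) observation
long <- data.frame(YEAR = rep(c(2000, 2001), each = 2), INDEX = c(1, 2, 3, 5))
aggregate(INDEX ~ YEAR, data = long, FUN = mean)
# -> yearly means 1.5 and 4.0
```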