Statistics Workshop
Day 1: Introduction to R
Tarik C. Gouhier
Marine Science Center, Northeastern University
May 11, 2013
A brief introduction to RStudio
If you have not already done so, please download and install R (http://www.r-project.org).
Although R comes with its own Integrated Development Environment (IDE), we will be using
a new and better open-source IDE called RStudio (http://www.rstudio.org). The job of
the IDE is to make your life easier. You will be spending a lot of time writing code and
analyzing results, so you need to make sure that you have the best environment available.
IDEs provide a single application to program, access help functions, download packages, and
run your code. Once you have installed and launched RStudio, you will be greeted with a
4-panel window:
[Figure: the RStudio 4-panel window, with panels labeled Editor, Command history, Console and Help/Packages]
The different tabs in the top left panel show the content of your open .R files. The
different tabs in the top right panel show the command history (i.e., the commands that you
have run in the past) and your workspace (i.e., the variables and functions that you have
created during the R session). The bottom left panel shows the console where the output of
your commands appears. The different tabs in the bottom right panel show figures that you
have plotted, a list of installed packages, help on specific functions and the list of files in your
current working directory. You can switch to each panel using keyboard shortcuts (e.g., for
Macs control-1: editor, control-2: console, control-3: help).
RStudio has a number of extremely useful features, most of which are easily discoverable.
Here, I will focus on the killer feature: code completion via the TAB key in the console and
editor panels. Once you have typed the first few leading letters of a function or variable
name, hit the TAB key to get a list of candidate functions and variables that start with those
letters:
[Figure: TAB completion of variable name]
You can then use the keyboard to navigate the list of candidate functions/variables and
press RETURN to make a selection. For functions, autocompletion shows the package that the
function belongs to in curly braces and a small snippet of documentation:
[Figure: TAB completion of function name]
If you want additional details about the function, you can hit the F1 key to bring up
the full documentation for the function in the help tab located in the bottom right panel
of RStudio. Note that the F1 shortcut will only work on Macs if you select the “Use all
F1, F2, etc. keys as standard functions” option in the Keyboard section of the System
Preferences. Crucially, TAB completion also works for function arguments. Once the list of
arguments appears, you can scroll up/down using the keyboard to select an argument and
then press RETURN to have it appear in the console:
[Figure: TAB completion of function arguments]
Now that you are familiar with the basic functionality of the IDE, you are ready to tackle
some exercises.
Lab exercises
These exercises are designed to simulate tasks that are commonly required to manipulate, analyze and present results for publication. For these exercises, you will be using real ecological
(intertidal) and environmental datasets.
Task 1: Create a map of the study system
1. Download the intertidal dataset from the web using R:
d <- read.csv("http://www.northeastern.edu/synchrony/stats/pisco.csv")
2. Download and load the maps and mapdata packages:
install.packages(c("maps", "mapdata"))
require(maps)
require(mapdata)
3. Now that you have the data and the required packages, plot the location of the sites
in the dataset on a map. You should be able to get something that resembles the
following:
# SOLUTION:
d <- read.csv("http://www.northeastern.edu/synchrony/stats/pisco.csv")
require(maps, quietly = TRUE)
require(mapdata, quietly = TRUE)
# Load site coordinates
sites <- unique(cbind(d$sitenum, d$latitude, d$longitude))
map("state", regions = c("washington", "oregon", "california",
    "nevada", "idaho", "montana"), xlim = c(-126, -115),
    fill = TRUE, col = "lightgray")
axis(1, pretty(seq(-126, -115), n = 4), cex = 1.5)
axis(2, pretty(seq(32.5, 48.5, 3), n = 6), cex = 1.5)
box()
# Add axis labels
mtext(1, text = expression(paste("Longitude (", degree,
    "W)")), line = 3)
mtext(2, text = expression(paste("Latitude (", degree,
    "N)")), line = 3)
# Label each state
text(-121, 43.7, "OR", cex = 1.5)
text(-120.65, 47.2, "WA", cex = 1.5)
text(-121, 38, "CA", cex = 1.5)
# Plot sites
points(sites[, 3], sites[, 2], pch = 21, bg = "red",
    col = "black", lwd = 2, cex = 1.2)
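The `unique(cbind(...))` idiom in the solution collapses the repeated rows of the dataset into one row per site. A minimal sketch on a made-up data frame follows; the values and the name `toy` are hypothetical and are not taken from pisco.csv:

```r
# Toy data: hypothetical values, with each site repeated across two years
toy <- data.frame(
  sitenum   = c(1, 1, 2, 2, 3, 3),
  latitude  = c(34.4, 34.4, 38.3, 38.3, 44.8, 44.8),
  longitude = c(-120.0, -120.0, -123.1, -123.1, -124.1, -124.1),
  yearnum   = c(1, 2, 1, 2, 1, 2)
)

# cbind() builds a numeric matrix; unique() then drops duplicated rows
sites <- unique(cbind(toy$sitenum, toy$latitude, toy$longitude))

# The columns are unnamed, so they are addressed by position:
# 1 = site number, 2 = latitude, 3 = longitude
print(sites)
```

Because the resulting matrix has no column names, the call to `points` has to pick longitude and latitude by column index (`sites[, 3]` and `sites[, 2]`).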
[Figure: map of the study region with the states labeled WA, OR and CA and the sites plotted as points; x-axis: Longitude (°W), y-axis: Latitude (°N)]
Task 2: Identify trends in the data
We can attempt to detect both spatial and temporal trends by plotting the data. Let’s focus
on the abundance (measured as percent cover) of the dominant mussel Mytilus californianus
and a key environmental variable, namely Sea Surface Temperature (sst_mean).
1. Extract the relevant data from the dataset using function subset.
2. Plot a 2-panel figure showing mussel cover (in panel 1) and mean SST (in panel 2) as
a function of year number. Use different lines to represent the time series at each site
and use a gradient based on latitude to assign the color of each line (e.g., northern sites
should be represented with warmer colors than southern sites). After some time, you
should be able to generate the following figure.
Hint: you may want to reshape your dataset to plot the lines simultaneously and
use function heat.colors to generate a color gradient.
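The hint can be tried on a small example first. The sketch below uses a made-up long-format table (the name `toy` and all values are hypothetical) to show what `reshape` with `direction = "wide"` produces and how `heat.colors` returns one color per site:

```r
# Toy long-format data: 3 hypothetical sites observed in 2 years
toy <- data.frame(
  sitenum  = rep(1:3, each = 2),
  latitude = rep(c(34.4, 38.3, 44.8), each = 2),
  yearnum  = rep(1:2, times = 3),
  cover    = c(10, 12, 40, 35, 60, 66)
)

# One row per site, one 'cover.<year>' column per year
toy.wide <- reshape(toy, timevar = "yearnum",
                    idvar = c("sitenum", "latitude"),
                    direction = "wide")
print(toy.wide)

# One color per site, taken from the heat.colors palette
cols <- heat.colors(nrow(toy.wide))
print(cols)
```

In the wide table each site occupies a single row, which is the layout that lets all the time series be drawn in one plotting call.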
# SOLUTION:
d <- read.csv("http://www.northeastern.edu/synchrony/stats/pisco.csv")
# Plot spatial and temporal trends in cover and
# SST
m <- subset(d, species == 75, select = c("sitenum",
    "latitude", "longitude", "yearnum", "cover", "upindex_mean",
    "chla_mean", "sst_mean"))
m.wide <- reshape(m, timevar = c("yearnum"), idvar = c("sitenum",
    "latitude", "longitude"), direction = "wide", drop = c("upindex_mean",
    "chla_mean", "sst_mean"))
s.wide <- reshape(m, timevar = c("yearnum"), idvar = c("sitenum",
    "latitude", "longitude"), direction = "wide", drop = c("upindex_mean",
    "chla_mean", "cover"))
# x=year
col <- heat.colors(length(unique(m$sitenum)))
par(mfrow = c(2, 1), mar = c(1, 4, 1, 1), oma = c(3,
    0.5, 0, 0))
matplot(unique(d$yearnum), t(m.wide[, 4:NCOL(m.wide)]),
    t = "p", col = "black", lty = 1, pch = 21, bg = col,
    ylab = "Mussel cover (%)", xaxt = "n", xlab = "",
    main = "Temporal trends with latitude coded by color")
axis(1, at = axTicks(1), label = NA)
matplot(unique(d$yearnum), t(s.wide[, 4:NCOL(s.wide)]),
    t = "p", col = "black", lty = 1, pch = 21, bg = col,
    ylab = expression(paste("SST (", degree, ")")),
    xlab = "Year number", xpd = NA)
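`matplot` draws each column of its second argument as a separate series, which is why the solution transposes the wide table with `t()` so that sites become columns and years become rows. A small sketch on a hypothetical matrix illustrates the layout. The `pdf(NULL)` call is only there so that the snippet runs without a display; drop it (and `dev.off()`) in an interactive session:

```r
# Hypothetical wide table: 3 sites (rows) by 4 years (columns)
cover <- matrix(c(10, 12, 11, 13,
                  40, 35, 38, 42,
                  60, 66, 63, 61),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("site", 1:3), paste0("year", 1:4)))
years <- 1:4

# After t(), rows are years (matching 'years') and columns are sites
series <- t(cover)

pdf(NULL)  # null graphics device: nothing is written to disk
matplot(years, series, type = "l", lty = 1,
        col = heat.colors(ncol(series)),
        xlab = "Year number", ylab = "Mussel cover (%)")
invisible(dev.off())
```

This sketch uses `type = "l"` to draw lines, whereas the solution above uses points (`t = "p"`).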
[Figure: "Temporal trends with latitude coded by color"; panel 1: Mussel cover (%) vs. year number; panel 2: SST (°) vs. Year number]
3. There appear to be no temporal patterns. Let’s try to see if there are any detectable
spatial patterns in the data. To do so, generate the same 2-panel figure as above by
plotting mussel cover (panel 1) and SST (panel 2) as a function of latitude using a
gradient based on year number to assign a color to each line.
# SOLUTION:
m.wide <- m.wide[order(m.wide$latitude), ]
s.wide <- s.wide[order(s.wide$latitude), ]
col <- heat.colors(length(unique(m$yearnum)))
par(mfrow = c(2, 1), mar = c(1, 4, 1, 1), oma = c(3,
    0.5, 0, 0))
matplot(m.wide$latitude, m.wide[, 4:NCOL(m.wide)],
    t = "p", col = "black", lty = 1, pch = 21, bg = col,
    ylab = "Mussel cover (%)", xlab = "", xaxt = "n",
    main = "Spatial trends with years coded by color")
locs <- seq(from = min(m.wide$latitude), to = max(m.wide$latitude),
    length = 5)
axis(1, at = locs, label = NA)
matplot(s.wide$latitude, s.wide[, 4:NCOL(s.wide)],
    t = "p", col = "black", lty = 1, pch = 21, bg = col,
    ylab = expression(paste("SST (", degree, ")")),
    xaxt = "n", xlab = expression(paste("Latitude (",
        degree, "N)")), xpd = NA)
axis(1, at = locs, label = format(locs, dig = 2))
[Figure: "Spatial trends with years coded by color"; panel 1: Mussel cover (%) vs. latitude; panel 2: SST (°) vs. Latitude (°N)]
There’s a clear latitudinal gradient in SST but there appears to be no trend in mussel
cover. Interesting mismatch?
Task 3: Use Monte Carlo randomizations to assess trends
We can formally test whether mussel cover varies latitudinally using Monte Carlo randomizations. We know that Cape Blanco (43° N) has been identified as an environmental breakpoint
between the northern and southern parts of the California Current System. We can test
whether this environmental breakpoint is reflected in latitudinal patterns of mussel populations by comparing the mean mussel cover north vs. south of Cape Blanco. We will use
Monte Carlo randomization methods to determine whether the difference in mean mussel
cover between the northern and southern regions is statistically significant. Monte Carlo
methods all consist of five steps: (1) define the null hypothesis (e.g., there is no difference
in mean mussel cover between the north and the south), (2) compute the test statistic in
the observed dataset (e.g., difference in mean mussel cover between north and south), (3)
generate a distribution of randomized datasets by shuffling the observed data many times,
(4) compute the test statistic for each randomized dataset, and (5) compare the observed
statistic to those generated via randomization to determine the p-value.
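The five steps can be sketched on simulated data before applying them to the mussel dataset. Everything in the block below is an assumption made for illustration: the group sizes, means, seed and the helper name `stat` are hypothetical, and the simulated values do not represent the real data.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated data (hypothetical): two groups of 20 with different means
cover  <- c(rnorm(20, mean = 50, sd = 10), rnorm(20, mean = 35, sd = 10))
region <- factor(rep(c("north", "south"), each = 20))

# (1) Null hypothesis: no difference in mean cover between regions
# Test statistic: mean(north) - mean(south)
stat <- function(x, g) mean(x[g == "north"]) - mean(x[g == "south"])

# (2) Statistic in the observed data
obs <- stat(cover, region)

# (3) + (4) Shuffle the labels n times and recompute the statistic
n <- 999
rands <- numeric(n)
for (i in seq_len(n)) {
  rands[i] <- stat(cover, sample(region))
}

# (5) Two-sided p-value; the observed value counts as one of the
# n + 1 outcomes, so the p-value can never be exactly zero
pval <- (sum(abs(rands) >= abs(obs)) + 1) / (n + 1)
print(pval)
```

Shuffling the labels rather than the values keeps the group sizes fixed, so each randomized dataset differs from the observed one only in which observations are assigned to each region.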
1. To begin, add a column called ‘region’ of type factor to the dataset with two levels:
‘north’ for sites that lie north of 43° N and ‘south’ for the rest.
2. Compute the difference in mean mussel cover between north and south (i.e., the observed test statistic).
3. To generate the n=999 random datasets, you will have to use a for loop. For each
iteration of the loop, randomly shuffle the values in the ‘region’ column and compute
the test statistic in the resulting dataset.
4. After the loop, determine the proportion of randomized datasets with a test statistic
whose magnitude is greater than or equal to the one observed in the real dataset. This is
your p-value. If it is smaller than some predetermined critical level (typically α = 0.05),
you can conclude that the observed value is statistically significant (i.e., reject the null
hypothesis and accept the alternate).
5. Plot the distribution of the test statistic obtained from the randomized datasets using
the hist function.
6. Add a vertical line indicating the location of the test statistic obtained in the observed
dataset. It should look something like the following:
# SOLUTION:
d <- read.csv("http://www.northeastern.edu/synchrony/stats/pisco.csv")
m <- subset(d, species == 75, select = c("sitenum",
    "latitude", "longitude", "yearnum", "cover"))
m$region <- as.factor(ifelse(m$latitude < 43, "south",
    "north"))
n <- 999
## Observed test statistic: North - South
grp.mean <- aggregate(cover ~ region, data = m,
    FUN = mean, na.rm = TRUE)
obs.mean <- grp.mean$cover[grp.mean$region == "north"] -
    grp.mean$cover[grp.mean$region == "south"]
rands.mean <- numeric(n + 1) * NA
for (i in 1:n) {
    # Shuffle the region labels and recompute North - South
    tmp <- sample(m$region)
    rands.mean[i] <- mean(subset(m$cover, tmp == "north"), na.rm = TRUE) -
        mean(subset(m$cover, tmp == "south"), na.rm = TRUE)
}
# The observed statistic counts as one of the n + 1 outcomes
rands.mean[n + 1] <- obs.mean
alpha <- 0.05
mean.pval <- sum(abs(rands.mean) >= abs(obs.mean))/(n + 1)
hist(rands.mean, xlab = "Distribution of test statistic from randomized datasets",
    main = paste("P-value: ", format(mean.pval, dig = 3),
        sep = ""))
box()
abline(v = obs.mean, col = "red", lwd = 2, lty = 2)
[Figure: histogram titled "P-value: 0.004"; x-axis: Distribution of test statistic from randomized datasets, y-axis: Frequency]
Task 4: Identifying periodic trends in climate data
Marine and terrestrial ecosystems respond in strong but complex ways to climate forcing.
Resolving the relationship between climate and ecosystem dynamics is particularly difficult
because the climate is characterized by non-stationary fluctuations. We will focus on identifying trends in three key indices: the Pacific Decadal Oscillation (PDO), the Multivariate
El Niño Southern Oscillation Index (MEI) and the North Atlantic Oscillation (NAO).
1. Download all three indices from the web using R:
pdo <- read.table("http://jisao.washington.edu/pdo/PDO.latest",
header = T, skip = 29, nrows = 113)
mei <- read.table("http://www.esrl.noaa.gov/psd/enso/mei/table.html",
header = T, skip = 13, nrows = 62)
nao <- read.table("ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/nao_index.tim",
header = T, skip = 8)
2. Compute the annual mean and standard deviation of each climate index. Note that
the last few entries in the year column of the PDO dataset have asterisks to indicate
a change in the way the index was computed. Unfortunately, this pollutes the year
column, which R codes as a factor. You will need to convert this column to a numeric
in order to plot the data.
Hint: you may need to convert the column to a character and then use function
substr before converting it to a numeric.
3. Create a 2-panel figure and plot the mean (panel 1) and standard deviation (panel 2)
of the climate indices in different colors. The final result should look something like
this:
# SOLUTION:
nao <- read.table("ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/nao_index.tim",
    header = T, skip = 8)
mei <- read.table("http://www.esrl.noaa.gov/psd/enso/mei/table.html",
    header = T, skip = 13, nrows = 62)
pdo <- read.table("http://jisao.washington.edu/pdo/PDO.latest",
    header = T, skip = 29, nrows = 113)
avg.mei <- data.frame(year = mei$YEAR, mei = rowMeans(mei[,
    2:NCOL(mei)]))
sd.mei <- data.frame(year = mei$YEAR, mei = apply(mei[,
    2:NCOL(mei)], 1, FUN = sd))
# Strip the asterisks from the year column before converting it
pdo$YEAR <- as.numeric(substr(as.character(pdo$YEAR),
    start = 1, stop = 4))
avg.pdo <- data.frame(year = pdo$YEAR, pdo = rowMeans(pdo[,
    2:NCOL(pdo)]))
sd.pdo <- data.frame(year = pdo$YEAR, pdo = apply(pdo[,
    2:NCOL(pdo)], 1, FUN = sd))
# aggregate keeps the original column names (YEAR, INDEX)
avg.nao <- aggregate(INDEX ~ YEAR, data = nao, FUN = mean)
sd.nao <- aggregate(INDEX ~ YEAR, data = nao, FUN = sd)
par(mfrow = c(2, 1), mar = c(1, 4, 1, 1), oma = c(3,
    0.5, 0, 0))
plot(avg.mei$year, avg.mei$mei, t = "l", xlab = "",
    ylab = "Mean", xlim = range(avg.pdo$year, avg.mei$year,
        avg.nao$YEAR), ylim = range(avg.pdo$pdo, avg.mei$mei,
        avg.nao$INDEX), xaxt = "n")
lines(avg.pdo$year, avg.pdo$pdo, t = "l", col = "red")
lines(avg.nao$YEAR, avg.nao$INDEX, t = "l", col = "blue")
abline(h = 0, lty = 2)
legend(x = "topleft", legend = c("MEI", "PDO", "NAO"),
    lty = 1, col = c("black", "red", "blue"))
axis(1, at = axTicks(1), label = NA)
plot(sd.mei$year, sd.mei$mei, t = "l", xlab = "Year",
    ylab = "Standard deviation", xlim = range(avg.pdo$year,
        avg.mei$year, avg.nao$YEAR), xpd = NA)
lines(sd.pdo$year, sd.pdo$pdo, t = "l", col = "red")
lines(sd.nao$YEAR, sd.nao$INDEX, t = "l", col = "blue")
abline(h = 0, lty = 2)
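The year clean-up and the row-wise summaries used in the solution can be checked offline on a small table. The sketch below assumes a layout similar to the PDO file (a year column followed by monthly columns). The name `toy`, the column names and all values are made up for illustration.

```r
# Hypothetical table: a year column polluted with asterisks,
# followed by monthly values (only 3 months shown)
toy <- data.frame(
  YEAR = factor(c("2010", "2011*", "2012**")),
  JAN  = c(0.8, -0.9, -1.4),
  FEB  = c(0.8, -0.8, -0.9),
  MAR  = c(0.4, -0.7, -1.0)
)

# factor -> character -> first four characters -> numeric
toy$YEAR <- as.numeric(substr(as.character(toy$YEAR), start = 1, stop = 4))

# Row-wise mean and standard deviation across the monthly columns
avg <- data.frame(year = toy$YEAR, index = rowMeans(toy[, -1]))
sdv <- data.frame(year = toy$YEAR, index = apply(toy[, -1], 1, sd))
print(avg)
print(sdv)
```

Calling `as.numeric` directly on a factor returns the internal level codes rather than the years, which is why the conversion goes through `as.character` first.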
[Figure: 2-panel time series of the climate indices (legend: MEI, PDO, NAO); panel 1: Mean vs. year; panel 2: Standard deviation vs. Year]