The University of Edinburgh
Centre for Cognitive Ageing and Cognitive Epidemiology
Elements of the R Language
Mike Allerhand
This document has been produced for the CCACE short course: An Introduction to R Programming. No part of this document may be reproduced, in any form or by any means, without
permission in writing from the CCACE.
The CCACE is jointly funded by the University of Edinburgh and four of the United Kingdom’s
research councils, BBSRC, EPSRC, ESRC, MRC, under the Lifelong Health and Wellbeing initiative.
The document was generated using the R Sweave package and typeset using MiKTeX, (LaTeX
for Windows).
© 2015, Dr Mike Allerhand, CCACE Statistician.
1 The big idea: R = values and functions

• Values are objects that contain data, results, notes, whatever.
  R values generally contain multiple items.
• Functions are operators on values.
  Values are input, processed in some way, and values are output.
• Consider a sum: 2+5-3.
  The numbers are a value that is input to the function. Their sum is output.

> x = c(2,5,-3)   # Function c combines items. Save the value as x.
> x               # Display the value of x, a "numeric vector"
> # The [1] is a counter, not part of the value.
> # The hash symbol starts a "comment" which is ignored by R.
> sum(x)          # Run function sum with argument x.

• Separating value from function allows other functions to operate.

> length(x)       # N
> mean(x)         # Average
> sd(x)           # Standard deviation

2 How to find functions

• Functions are organised into libraries or packages, (same thing).
• A core set of libraries comes with the R "base" distribution.
  The most useful are loaded into memory when you start R.

> sessionInfo()              # What packages are loaded into memory?
> help(package="stats")      # Browse the "stats" package.
> help(package="graphics")   # Browse the "graphics" package.
> apropos("cor")             # Names containing "cor"
> apropos("^read")           # Names beginning "read"
> apropos("test$")           # Names ending "test"

• Other packages may be on your hard disk.
  Load these into memory to use their functions.

> library()                  # Packages on disk
> library(help="foreign")    # Functions in "foreign"
> library("foreign")         # Load "foreign" into memory

• Many other packages are available for download.
  The main repository for R packages is CRAN
  ("Comprehensive R Archive Network").
  By name: http://cran.r-project.org/web/packages/
  By task: http://cran.r-project.org/web/views/
• Function install.packages downloads (footnote 1) a package from CRAN to your hard disk.

> install.packages("lme4")   # Download "lme4" from CRAN

  Footnote 1: It is only necessary to download a package once, the first time you need it.
  Thereafter you can load it from your disk into memory whenever you need its functions.

3 The command line

• The > symbol is the "prompt".
• A command is held in a "line buffer" until you press the Enter key.
  You can edit it using arrow keys ← →, backspace, delete, and ctrl+u.
  You can recall and edit previous commands using: ↑ ↓
  The line is interpreted (footnote 2) when you press Enter.

  Footnote 2: This means the command is parsed and evaluated. Parsing means making sense
  of the syntax. If it makes sense by R's rules then it is evaluated. If it makes partial
  sense but is incomplete you'll see a "+" prompt, meaning "continue". If it makes no sense
  you'll see an error message. If you need to interrupt R to stop it trying to evaluate a
  command press the Esc key.

3.1 How to use a function

• Every function has a name.
  The brackets contain arguments, (input values).
  The function returns a value, (output).
  To run a function type its name followed by arguments in brackets (footnote 3).

> sqrt(4)         # Square root of the argument (4)

  Footnote 3: You must give the brackets even if they are empty. These indicate to R that
  this is a function rather than a value.

3.2 Arithmetic expressions

• Arithmetic expressions have conventional operators and syntax: + - * / ^

> 2 + 2
> 2 / 3           # 2 divided by 3
> 2^3             # 2 to the power 3
> 2 / (3 + 2)     # When in doubt use brackets
> 8^(1/3)
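The advice "when in doubt use brackets" can be illustrated with a short sketch of operator precedence. The particular expressions below are illustrative choices, not taken from the course notes.

```r
# ^ binds more tightly than unary minus, and * and / bind more tightly than + and -
a = -2^2          # evaluated as -(2^2)
b = (-2)^2        # brackets force the negation first
c1 = 1 + 2 * 3    # multiplication before addition
c2 = (1 + 2) * 3  # brackets force the addition first
d1 = 8^1/3        # evaluated as (8^1)/3, not a cube root
d2 = 8^(1/3)      # the cube root of 8
a; b; c1; c2; d1; d2
```

Comparing each pair shows that the brackets change which operation is carried out first.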
• A function can appear wherever its value can be used.
  It could be in an expression or a function argument.
  Think of the value substituted directly in place of the function.

> sqrt(9) - sqrt(25)   # Equivalent to: 3 - 5
> sqrt(abs(-2))        # abs returns an absolute (unsigned) value

• The value returned by a function is displayed by default.
  It is not saved unless you assign (footnote 4) it.

  Footnote 4: Assignment saves the value in memory while the R session is running, not to
  the computer's file system. The assignment operator is <- but you can usually use = instead.

3.3 Function help

• Every function has a page of help.

> help(round)     # Help page for "round"
> ?round          # ...same thing

• The Usage section is a synopsis of the arguments.
  It may show several arguments separated by commas. Two kinds:
  1. Values that provide input data to the function.
  2. Options that control the function's behaviour.
• Options are shown as: name = value.
  Omit the option and get its default value.
  Or pass name = value to override the default.

> round(3.14159)             # Use the default
> round(3.14159, digits=2)   # Override the default

• Options may take different kinds of values.
  The kind of value is indicated on the help page. For example:

> ?t.test         # Help page for "t.test"

  The Usage shows:
  y            NULL, indicating the argument is unused unless specified.
  conf.level   Numeric value. The default here is 0.95.
  paired       Logical value: TRUE or FALSE. This acts as a switch.
  alternative  Several possibilities denoted by strings. The first is the default.

• The Arguments section describes each argument in more detail.
  The Value section describes the object returned by the function.

3.4 Assignment

• Assignment saves a value and gives it a name so you can retrieve it.
  Names can contain letters, numbers, and '.', but they cannot begin with a number, and
  they are case-sensitive.

> x <- 2 + 2      # x is assigned the value 2+2
> x = 2 + 2       # Same thing (one less key-press)
> x               # Display the current value of x

• Assignment saves an independent copy of the value.

> y = x           # y becomes an independent copy of x
> y = 0           # y becomes 0, x is not changed by this
> y = x + 1       # y becomes x+1, x is unchanged
> x = x + 1       # x becomes x+1, its previous value is lost

• A function argument is an independent copy of a value.

> sqrt(x)         # x is not changed by this
> x = sqrt(x)     # x becomes its square root

Exercise 1. Evaluate x^2 + x - 6 at x = -3 and x = 2. (Change the value assigned to x and
re-run the expression using the arrow keys).

> x = -3
> x^2 + x - 6
> x = 2
> x^2 + x - 6

4 Values

• Elementary items of data are of three main types:
  numeric    Number. Eg. 0, -1, 3.14
  character  String of characters in quotes. Eg. "apple", "3.14"
  logical    TRUE or FALSE
• R values are objects that contain multiple items of data.
  Different classes (footnote 5) of object structure their items in different ways.
  The choice is a trade-off between flexibility and speed.

  Footnote 5: An object's class also determines how it is handled by generic functions
  such as: print, plot, and summary.
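The three elementary types can be checked at the prompt. The following is a minimal sketch that assumes the base function class, which the notes mention only in passing, is an acceptable way to inspect a value.

```r
# Inspect the type of some elementary items
class(3.14)        # a number
class("3.14")      # the quotes make this a character string
class(TRUE)        # a logical value

# A value usually holds several items of one type
x = c(0, -1, 3.14)
class(x)           # still numeric
length(x)          # three items
```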
The main classes of data object are:
  vector      a row or column of items of the same type.
  matrix      a 2-way layout of items of the same type.
  factor      a vector of categorical items.
  data.frame  a collection of columns, possibly of different types.
  list        a collection of any objects.

5 Making vectors

• Vector is the simplest R object.
  It is a row or column of cells (footnote 6) each containing one item of data.

  Footnote 6: A single (scalar) item is seen as a single-cell vector.

5.1 Combining vectors

> c(1,3.14,0,-3,7)                         # Numeric vector
> c("apple","3.14","apple","orange",".")   # Character vector

• A vector must contain items of the same data type.
  Attempts to mix data types are automatically "coerced" to the same type:
  logical → numeric → character

> c(1,2,3,".")    # Automatically coerced to character

Exercise 2. Given a character vector x made like this:

> x = c("one","two","three","four")

How could you insert "zero" at the start, and "five" at the end?

> x = c("zero", x, "five")

5.2 Making numeric vectors as sequences

• Operator : with numbers makes a sequence with integer steps

> 1:5
> 5.5:-5          # Integer steps

• Function seq makes a numeric sequence with fractional steps

> seq(0,1, by=0.1)      # Fractional steps
> seq(0,1, length=64)   # From 0 to 1 so length=64

Exercise 3. Generate a regular sequence from 1 down to 0 in steps of 0.1.

> seq(1,0, by=-0.1)

5.3 Making vectors by repeating elements

> x = c("low","med","high")
> rep(x, times=3)   # Repeat whole vector
> rep(x, each=3)    # Repeat cells (balanced)

Exercise 4. Make a character vector consisting of: 3 × "low", 5 × "med", and 2 × "high". (See
the times option in ?rep).

> rep(c("low","med","high"), times=c(3,5,2))   # Repeat cells (unbalanced)

5.4 Making vectors by random sampling

• sample draws a random sample from a vector

> sample(1:100)                                # Permute (shuffle) numbers 1:100
> sample(1:100, size=10)                       # Sample, N=10, numbers 1:100
> sample(c("A","B"), size=100, replace=TRUE)   # Sample with replacement

• rnorm draws a random sample from a normal distribution.
  runif draws a random sample from a uniform distribution (footnote 7).

> rnorm(50)                   # Sample N=50 of standard normal: N(0,1)
> rnorm(50, mean=10, sd=2)    # Mean and standard deviation
> runif(50, min=0, max=10)    # Uniform sample between 0 and 10

  Footnote 7: For other distributions see help(Distributions).

Exercise 5. Use function sample to draw a random sample, N=100, of "female" and "male"
from a population containing twice as many female as male. (See the prob option in ?sample).
Run command table(x), where x is your sample, to count the number of female and male.

> x = sample(c("male","female"), size=100, replace=TRUE, prob=c(1/3,2/3))
> table(x)
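Random samples differ on every run, which makes results hard to check. The sketch below assumes the base function set.seed, which is not covered in these notes, to make the draws repeatable. The seed value and the sample size of 1000 are arbitrary choices for illustration.

```r
# Fixing the seed makes the "random" draws repeatable
set.seed(1)
a = sample(1:100, size=10)
set.seed(1)
b = sample(1:100, size=10)
identical(a, b)        # the same seed gives the same sample

# Unequal sampling probabilities, as in Exercise 5
x = sample(c("male","female"), size=1000, replace=TRUE, prob=c(1/3,2/3))
prop.table(table(x))   # roughly one third male and two thirds female
```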
5.5 Descriptive functions

Some vector summary functions
  length           N (eg. sample size)
  sum              Sum
  mean, median     Mean, median
  sd, var          Standard deviation, variance
  min, max, range  Minimum, maximum, range

> x1 = c(1,3.14,0,-3,7)                         # Numeric vector
> x2 = c("apple","3.14","apple","orange",".")   # Character vector
> length(x1)
> length(x2)
> sum(x1)
> mean(x1)
> range(x1)

Exercise 6. Use function rnorm to make a random sample, N=100, of a standard normal
distribution, (mean = 0, standard deviation = 1). Use functions range, mean, and sd to
calculate the sample range, mean, and standard deviation. Run command hist(x), where x is
your sample, to plot the sample histogram. Repeat this plot with a new random sample where
N=10000.

> x = rnorm(100)
> hist(x)
> range(x)
> mean(x)
> sd(x)
> x = rnorm(10000)
> hist(x)

6 Making lists and data frames

• A list is a general-purpose collection of data objects.
  It is used to pass multiple items as function arguments and return values.
• A data.frame is a collection of columns with names.
  It is a general-purpose container for a dataset.
  The columns are usually seen as variables.

> x = c(0,1,3,4.5)
> y = c("apple","apple","orange","apple")
> list(x=x, y=y)
> data.frame(x=x, y=y)   # The columns must have the same length

6.1 Reading data frames

• Set your working directory (footnote 8) before importing or exporting data (footnote 9).

  Footnote 8: The R commands to read and write data will use whatever folder is currently
  set as the "working directory". The point of setting the working directory is that R will
  read and write to that folder by default.
  Footnote 9: The hsb2 data used here are a subset of the "High Schools and Beyond" data set
  originally used by Raudenbush and Bryk. The data were obtained from the Institute for
  Digital Research and Education (UCLA), downloaded from:
  http://www.ats.ucla.edu/stat/data/hsb2.txt.

Exercise 7. Use File > Change dir... to set the R working directory to point to your project
folder. Check this using commands getwd() and dir() to see what your current working
directory is and to list the files and folders within it.

> getwd()
> dir()

• Function read.table reads a plain text file (footnote 10) and returns it in a data frame
  (footnote 11). The first argument is the filename and its extension (footnote 12) in quotes.
  The two most useful options are:
  header         is the first line the column names?
    header=FALSE   No (the default).
    header=TRUE    Yes.
  sep            how are the columns separated?
    sep=""         Spaces (the default).
    sep=","        Comma.
    sep="\t"       Tab.

> hsb2 = read.table("hsb2.txt", header=TRUE, sep="")

  Footnote 10: R functions read.table and write.table work with plain text data for
  portability. See: help(read.table), and also help.start() and "R Data Import/Export".
  Functions for reading data in proprietary formats are provided in the foreign library.
  For example read.spss for SPSS data files and read.dta for STATA data files. When using
  read.spss you may get a warning message like: "Unrecognized record type 7, subtype 18
  encountered in system file". This can be ignored. It is to do with SPSS compatibility
  with its own previous versions, and does not mean the data has not been recognised.
  Footnote 11: You can have multiple data sets open at the same time, each in its own data
  frame.
  Footnote 12: The filename extension is the ".txt" (or similar) after the filename. This
  has to be given. Windows may hide filename extensions so you may need to take action to
  show them. Use the menu item: Tools > Folder Options... in any folder. (You may need to
  hit the Alt key to display a folder's menu items). On the View tab, de-select the option:
  Hide extensions for known file types. If you give a full pathname you must use either
  forward-slashes or double back-slashes. Instead of a filename you can use the function
  file.choose, or the string "clipboard" to read from Windows clipboard, or a URL to
  download data through the web.
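The header and sep options of read.table can be tried without the course data file. The sketch below writes a tiny comma-separated file to a temporary location and reads it back. The file contents and column names are made up for illustration, and tempfile, writeLines, and unlink are base functions not otherwise used in these notes.

```r
# Write a small comma-separated file to a temporary location
# (the column names and values are made up for illustration)
f = tempfile(fileext=".csv")
writeLines(c("id,score,fruit",
             "1,52,apple",
             "2,47,orange",
             "3,60,apple"), f)

# The first line holds the column names and the columns are separated by commas
d = read.table(f, header=TRUE, sep=",")
d
dim(d)      # 3 rows, 3 columns
names(d)    # the column names taken from the first line
unlink(f)   # remove the temporary file
```

With header=FALSE the first line would instead be read as a row of data, and the columns would receive default names.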
Some data frame summary functions
  dim, nrow, ncol     Dimensions (number of rows, columns)
  names               Column names (the variables)
  summary             Summary of each column
  rowSums, colSums    Sum of each row and column
  rowMeans, colMeans  Mean of each row and column
  cov, cor            Covariance and correlation matrices
  head, tail          First and last rows
  View                Look at the data

> dim(hsb2)         # Dimensions (number of rows, columns)
> names(hsb2)       # Names of the variables (columns)
> summary(hsb2)     # Summary of each column
> colMeans(hsb2)    # Mean of each column
> cor(hsb2)         # Correlation matrix
> head(hsb2, 10)    # The first 10 rows
> tail(hsb2, 10)    # The last 10 rows
> View(hsb2)        # A view of the data

6.2 Get and set data frame columns by name using $

• Data frame columns (footnote 13) have names and can be addressed: data.frame$name.

> names(hsb2)           # What are the column names?
> mean(hsb2$science)    # Mean of a particular column
> hsb2$my.col = 0       # Set a new column ("my.col")
> hsb2$my.col = NULL    # Drop a column

  Footnote 13: Many functions return multiple values within a "list-like" collection of
  named components which can be extracted using $, (see help(Extract)).

Using attach and detach

> attach(hsb2)
> mean(science)         # No need for $
> detach(hsb2)

Using with

> with(hsb2, mean(science))   # No need for $

7 Vectorized functions and arithmetic

• Vectorized functions apply an operation element-wise and return a vector the same length
  as the input.
• Arithmetic operators (footnote 14): + - * / ^ are vectorized.

> x = c(0,2,4,6,8,10)
> y = c(7,9,11,1,3,5)
> x^2       # Square each element
> x + y     # Add corresponding pairs of elements
> x + 2     # Add 2 to each element, (2 is recycled to length(x))

  Footnote 14: If the function takes two arguments but the vectors are different lengths
  the "recycling rule" is applied: the shorter vector is recycled to the length of the
  longer vector and the operation is carried out element-wise between corresponding pairs
  of elements. A warning is displayed if the longer length is not an exact multiple of the
  shorter.

Some vectorized numeric functions
  round             Round to given number of decimal places
  trunc             Truncate toward zero to a whole number
  abs               Absolute (unsigned) value
  sqrt              Square root
  exp               Exponential
  log, log10, log2  Log to base e, 10, and 2
  sin, cos, tan     Trigonometric functions
  asin, acos, atan  Inverse (arc) trigonometric functions

> hsb2$logmath = log(hsb2$math)               # Log variable
> hsb2$cmath = hsb2$math - mean(hsb2$math)    # Centre on mean (recycled)
> hsb2$diff = hsb2$write - hsb2$read          # Derive difference

Exercise 8. Make a numeric vector x = -2:2 and guess, before running it, what the results of
the following will be:

> x + 2
> 1/x

Exercise 9. Plot y = x^2 + x - 6 over the range of x in the interval (-3,2). To do that generate
a regular sequence x of 100 numbers in the interval (-3,2), and use x to obtain y by evaluating
y = x^2 + x - 6. Plot y on x by running command: plot(x,y).

> x = seq(-3,2, length=100)
> y = x^2 + x - 6
> plot(x,y)
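The recycling rule for vectors of different lengths can be seen in a short sketch. The shorter vectors used here are arbitrary choices for illustration.

```r
x = c(0,2,4,6,8,10)

# c(1,2) is recycled three times to match the six cells of x
a = x + c(1,2)
a

# c(1,0,-1) is recycled twice
b = x * c(1,0,-1)
b

# Length 6 is not a multiple of 4: R still recycles but displays a warning
d = x + c(1,2,3,4)
d
```

Recycling a single number, as in x + 2, is the most common case. Recycling longer vectors is easy to do by accident, so the warning in the last case is worth heeding.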
Exercise 10. Compare mean(hsb2$science) with mean(hsb2$science-10). What number
would you use, instead of 10, to make the mean as close to 0 as possible?

> mean(hsb2$science)
> mean(hsb2$science-10)
> mu = mean(hsb2$science)   # The mean
> mean(hsb2$science-mu)     # The mean of mean-centered science

8 Data manipulation

8.1 Indexing using numeric and character vectors

• Indexing means addressing particular cells by number or name.
  The index is a vector of numbers or names that address cells.
• The cells of a vector are numbered: 1,...,length.
  Cells of a data frame are numbered: 1,...,nrow and 1,...,ncol,
  and named with rownames and colnames.
• The syntax uses square brackets:
  Vectors are indexed using: x[i]
  Data frames are indexed using: x[i,j]
  where i is the row index and j is the column index.

> x = c(3,4,1,2,5,6)
> i = 1:3             # Index vector for the first three cells
> x[i]                # Extract the first three cells
> x[i] = 0            # Replace the first three cells (0 is recycled)
> x[i] = c(7,8,9)     # Replace the first three cells
> x[i] = x[i] + 1     # Update the first three cells

> i = nrow(hsb2)      # Index the last row
> j = c(3,2,1)        # Index three columns in reverse order
> hsb2[i,j]

• An empty index is shorthand for a complete index:
  x[i,]  Index rows (over all columns)
  x[,j]  Index columns (over all rows)

> hsb2[c(1,4),]               # Extract rows 1 and 4 (over all columns)
> hsb2[, c(1,4)]              # Extract columns 1 and 4 (over all rows)
> hsb2[, c("id","ses")]
> hsb2[1:3, c(1,4)]           # Extract first 3 rows, first and fourth columns
> hsb2[1:3, c("id","ses")]

• Special syntax for addressing single columns.

> hsb2[1]             # Extract the first column as a data frame
> hsb2[[1]]           # Extract the first column as a vector
> hsb2[["id"]]        # Same as: hsb2$id
> hsb2$id[2:3]        # Extract cells 2:3 of "id"

• Cells are addressed in the order of the index vector.

> i = c(3,5,1)
> x[i]                # Extract cells in index order
> x[i] = c(9,7,8)     # Replace cells in index order

• Positive index elements can be repeated

> y = c("red","blue","green","orange","magenta")
> y[c(1,1,1,2,2,3)]   # Index with replications

• A numeric index can be negative.
  A negative index addresses all cells except those indexed.

> x[-1]                # Extract all except the first cell
> x[-(1:3)]            # Extract all except the first three cells
> hsb2[-(2:3),]        # Extract all rows except 2 and 3
> hsb2[,-ncol(hsb2)]   # Extract all columns except the last

• An index can be derived to sort (footnote 15) a vector or data frame.

> y = c(3,1,2,1,4,0)
> order(y)                  # Derive the index for sorting
> x[order(y)]               # Sort x into the order of y
> hsb2[order(hsb2$id),]     # Sort the data frame rows by id

  Footnote 15: Function sort provides simple sorting. Function order sorts in two steps:
  first derive an index, then apply it to sort an object. The advantage is you can sort one
  object by the order of another.

• Index vectors can be derived by string matching (footnote 16) on names.

> # Match column names in the "hsb2" data that begin with "s"
> grep("^s", names(hsb2))               # Numeric
> grep("^s", names(hsb2), value=TRUE)   # Character

  Footnote 16: Function grep does string matching with regular expressions, a tiny language
  used to describe patterns of characters in strings. See: help(regex).

Exercise 11. Given a character vector x made like this:

> x = c("one","two","three","five","six","seven")
a) How could you insert "four" into it between "three" and "five"?
b) How could you delete the last element in the vector?

> x = c(x[1:3], "four", x[4:6])
> x = x[-length(x)]

Exercise 12. Calculate the mean of the first 10 math scores in hsb2.

> mean(hsb2$math[1:10])
> mean(hsb2[1:10, "math"])   # ...same thing

Exercise 13. Extract a random sample of 10 rows of hsb2 data.

> i = sample(nrow(hsb2), size=10)
> hsb2[i,]

Exercise 14. Divide the hsb2 data into two new data frames of equal size, one containing a
random sample of half the rows in hsb2, and the other containing the rest of the rows.

> i = sample(nrow(hsb2), size=nrow(hsb2)/2)
> hsb2[i,]
> hsb2[-i,]

8.2 Conditional indexing using logical vectors

• Conditional operators return vectors that are TRUE where a condition is met.
  Operators: ==, !=, >, >=, <, <=, are comparisons with particular values or ranges.
  Operator %in% provides comparison with a given set of values.
  Numbers (footnote 17) are compared numerically.
  Character strings are compared alphabetically and case-sensitively.

> x = c(0,2,4,6,8,10)
> y = c("apple","orange","orange","orange","apple","apple")
> x == 4              # Logical vector TRUE where x==4
> x > 4               # TRUE where x>4
> x %in% c(2,6,10)    # TRUE where x is 2, 6, or 10
> y == "apple"        # TRUE where y=="apple"
> y != "apple"        # TRUE where y!="apple"

  Footnote 17: Annoyance: a conditional expression like x<-1 is mis-interpreted as
  assignment. The solution is to include space around the operator: x < -1.

• Composite (footnote 18) conditions can be formed using & (AND) and | (OR)

> x > 0 & x < 8            # TRUE where x>0 AND x<8
> x < 4 | x > 6            # TRUE where x<4 OR x>6
> x > 4 & y == "orange"    # TRUE where x>4 AND y=="orange"

  Footnote 18: Logical vectors are combined in pairs element-wise. Under AND the result is
  TRUE only if both are TRUE. Under OR the result is TRUE if either (or both) are TRUE.

• In arithmetic expressions logical values are converted to numeric:
  TRUE → 1 and FALSE → 0
  sum counts TRUE
  mean calculates the proportion TRUE

> i1 = x > 4           # Assigning (for convenience)
> i2 = y == "orange"
> sum(i1)              # How many of x are > 4?
> sum(i2)              # How many y are "orange"?
> mean(i1)             # What proportion of x are > 4?
> mean(i2)             # What proportion of y are "orange"?
> sum(i1 & i2)         # How many "orange" have x > 4?

• Logical index vectors address cells where the index is TRUE (footnote 19).

> x[x > 5]             # Extract cells where x>5 is TRUE
> x[x > 1 & x < 5]     # Extract cells where x>1 AND x<5
> x[x < 1 | x > 5]     # Extract cells where x<1 OR x>5
> x[x > 5] = 0         # Replace cells (with 0) where x>5 is TRUE

> hsb2[hsb2$id == 1,]             # Extract data frame row for id 1
> hsb2[hsb2$id %in% c(2,4,6),]    # Extract rows for id 2,4,6

  Footnote 19: Numerical and character index vectors address individual cells by number and
  name. A logical vector is a pattern over the whole vector that indicates cells that meet
  a condition.

• Function which turns a logical index into a numeric index.

> which(x > 5)                     # Which cells meet the condition (x>5)?
> which(hsb2$id %in% c(2,4,6))     # Which rows are id 2,4,6?

• Function subset is a convenience function for logical indexing.

> subset(hsb2, id == 1)              # Row for id 1
> subset(hsb2, id %in% c(2,4,6))     # Rows for id 2,4,6

Exercise 15. R provides vectors named letters and LETTERS containing the lower-case and
upper-case alphabet. Use sample to draw a random sample N=1000 from LETTERS. How many
are the letter A? How many are vowels?
> x = sample(LETTERS, size=1000, replace=TRUE)
> sum(x == "A")
> vowels = c("A","E","I","O","U")
> sum(x %in% vowels)

Exercise 16. In the hsb2 data:
a) What is the average read score of people with low socio-economic status, (ses coded 1), and
people with high socio-economic status, (coded 3)?
b) How many people in the sample have above average read score?
c) How many female, (coded 1), have above average read score?
d) What is the average write score of the people whose read score is above average?
e) How many people have a read score greater than 2 standard deviations above its average?
Extract the id and gender of these people.

> # a
> mean(hsb2[hsb2$ses == 1, "read"])
> mean(hsb2[hsb2$ses == 3, "read"])
> # Or...
> mean(hsb2$read[hsb2$ses == 1])
> mean(hsb2$read[hsb2$ses == 3])
> # b
> m = mean(hsb2$read)
> sum(hsb2$read > m)
> # c
> sum(hsb2$read > m & hsb2$female == 1)
> # d
> mean(hsb2$write[hsb2$read > m])
> # e
> s = sd(hsb2$read)
> i = hsb2$read > m+2*s
> sum(i)
> hsb2[i, c("id","female")]

9 Missing values

• In R missing values are coded: NA (standing for "not available").
  NA is a special value (footnote 20) that can appear in any type of object.
  Data with some other missing value code (footnote 21) should be recoded as NA.

  Footnote 20: Other special values are NaN (not a number) and Inf (infinity).
  Footnote 21: Function read.table understands NA in data. By default it also interprets
  empty cells (space) as missing and recodes them as NA. Other codes such as -999 and "."
  need to be handled, either by specifying missing value codes to read.table using its
  na.strings option, or by recoding columns of the data frame after the data have been read.

• Arithmetic and conditional operations propagate NA.
  Descriptive functions have an option na.rm, (standing for "na remove").
  Function is.na tests for NA (footnote 22).

> x = c(NA,2,4,6,NA,10)
> mean(x)                 # NA propagated through arithmetic
> mean(x, na.rm=TRUE)     # Remove NA before calculating mean
> sum(x == NA)            # The wrong way to count NA
> sum(is.na(x))           # How many NA?
> sum(!is.na(x))          # How many NOT NA?

  Footnote 22: See also complete.cases and na.omit.

Exercise 17. Given a numeric vector x made like this:

> x = sample(c(1:10,-999), size=100, replace=TRUE)

a. Recode -999 as NA, (denoting a missing value).
b. Calculate the sample mean omitting NA.
c. Calculate the proportion of the sample that is missing.

> x[x == -999] = NA       # a.
> mean(x, na.rm=TRUE)     # b.
> mean(is.na(x))          # c.

10 Factors

• A factor is a kind of vector used to represent categorical variables and grouping indicators.
  It is a numeric vector with an associated vector of labels for the categories called levels.
  The order of the levels determines how the categories are coded: the first level is coded
  1, the second level is coded 2, and so on. The coding is always: 1, 2, ..., nlevels, so the
  factor can be used as an index. Factors provide:
  1. Economical storage of categorical variables that also enables indexing by category.
  2. Category labelling in tables and plots.
  3. Dummy variables and contrast coding for categorical variables in regression.
• When raw data has categorical variables in the form of numerical grouping indicators
  (footnote 23) you may want to convert these to factors using the labels given in the data
  codebook.

  Footnote 23: Numeric vectors that represent groups, such as 0 and 1 representing male and
  female. In versions of R before 4.0.0, character vectors are automatically converted to
  factors by read.table, (see the stringsAsFactors option). From R 4.0.0 the default is
  stringsAsFactors=FALSE.
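The coding of a factor described above can be inspected directly. This is a minimal sketch using a made-up grouping indicator rather than the course data. It relies on the base functions factor, levels, and as.integer.

```r
# A made-up numeric grouping indicator: 0 and 1 stand for two groups
g = c(0,1,1,0,1)
f = factor(g, levels=0:1, labels=c("male","female"))
f

levels(f)              # the labels, in coding order
codes = as.integer(f)  # internal codes: 1 for the first level, 2 for the second
codes

# Used as an index, a factor selects by its internal code
colours = c("blue","red")[f]
colours
table(f)               # counts labelled by category
```

Note that the internal codes are 1 and 2 even though the raw data were coded 0 and 1.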
The codebook for the hsb2 data looks like this:
  Variable  Description           Codes
  id        ID number
  female    Gender                0=male, 1=female
  race      Ethnicity             1=hispanic, 2=asian, 3=black, 4=white
  ses       Socioeconomic status  1=low, 2=medium, 3=high
  schtyp    School type           1=public, 2=private
  prog      High school program   1=general, 2=academic, 3=vocational
  read      Reading score
  write     Writing score
  math      Maths score
  science   Science score
  socst     Social studies score

• Function factor makes factor variables (footnote 24).

> hsb2$female = factor(hsb2$female, levels=0:1, labels=c("male","female"))
> hsb2$ses = factor(hsb2$ses, levels=1:3, labels=c("low","medium","high"))
> nlevels(hsb2$female)            # Number of levels
> levels(hsb2$female)             # Levels
> as.numeric(hsb2$female)         # Internal code numbers
> c("blue","red")[hsb2$female]    # Indexing by category

  Footnote 24: See also functions gl and cut.

Exercise 18. Make a random sample, N=100, of the numbers 1:3. Suppose this is a sample of
responses to a 3-level item in a questionnaire. Convert it to a factor, labelling the responses:
no (for 1), maybe (for 2), and yes (for 3).

> x = sample(1:3, size=100, replace=TRUE)
> factor(x, levels=1:3, labels=c("no","maybe","yes"))

• The : operator with factors derives a factor by crossing.

> g = with(hsb2, female:ses)   # Crossed factors
> nlevels(g)                   # 2 x 3 = 6
> levels(g)

• Function cut produces a factor (footnote 25) by cutting a continuous variable at given
  points along its range.
• Function quantile produces cut-points along the range of a given sample to delimit its
  quartiles. Other intervals may be defined by probabilities.

> quantile(hsb2$science)
> # include.lowest=TRUE keeps the minimum score in the first interval
> cut(hsb2$science, breaks=quantile(hsb2$science), include.lowest=TRUE)

  Footnote 25: The factor levels are labelled to indicate the intervals by default. In these
  labels a round bracket marks an open end and a square bracket a closed end. By default
  intervals are open on the left and closed on the right, such as (a,b], so a value exactly
  on the left boundary a is grouped with the preceding interval and a value exactly on the
  right boundary b is grouped with this one. A value equal to the lowest break therefore
  becomes NA unless include.lowest=TRUE is given.

11 Tables

11.1 Categorical variables

• table makes contingency tables and cross-tabs.
  prop.table converts counts to proportions.

> table(hsb2$female)                  # Counts
> x = table(hsb2$ses, hsb2$female)    # Cross-tab
> prop.table(x, margin=1) * 100       # Percentage of row sum
> prop.table(x, margin=2) * 100       # Percentage of column sum

11.2 Continuous variables

• sapply applies a function (footnote 26) to data frame columns.

> mean(hsb2$read, na.rm=TRUE)              # One column
> sapply(hsb2[7:11], mean, na.rm=TRUE)     # Several columns
> sapply(hsb2[7:11], sd, na.rm=TRUE)

  Footnote 26: The help pages for many functions such as sapply show a special option "...".
  This is used to pass one or more options through to functions called within the function.
  In sapply it is used to pass further options to whatever function is being applied. For
  example it is often used to pass na.rm=TRUE through to summary functions such as mean and
  sd to handle missing values.

• A custom function is defined like this:

  name = function(arguments) {
    R code for whatever the function does
    The value of the last line is returned
  }

  For example a function (footnote 27) to calculate the mean and standard deviation:

  Footnote 27: A custom function can be defined separately and assigned a name so it can be
  saved and re-used. Or it can be defined without a name directly where it is used, (called
  an "anonymous" function).
11.4
> mean.sd = function(x, ...) {
+
# Return a list containing the mean and sd of x
+
m = mean(x, ...)
+
s = sd(x, ...)
+
list(Mean=m, SD=s)
+ }
> mean.sd(hsb2$read, na.rm=TRUE)
> sapply(hsb2[7:11], mean.sd, na.rm=TRUE)
ˆ
sink diverts output to a text file.
> x = with(hsb2, tapply(science, list(female,ses), mean))
> x = round(x, 2)
# Round to 2dp
# One column
# Several columns
> sink("mytable.txt")
> x
> sink()
> # "Anonymous" function
> sapply(hsb2[7:11], function(x) list(Mean=mean(x), SD=sd(x)))
ˆ
Saving tables
ˆ
describe, (in package psych), provides a range of summary statistics.
# Turn on saving
# All output here is diverted
# Turn off saving
write.table and write.csv write tables to files in your current working directory.
> library(psych)
> describe(hsb2[7:11])
> write.table(x, "mytable.txt", quote=FALSE, sep="\t")
> write.csv(x, "mytable.csv") # Opens in Excel
11.3
12
ˆ
Grouped data
tapply applies a given function to a vector grouped by factors28 .
by applies a given function to a data frame with rows grouped by factors.
ˆ
> with(hsb2, tapply(science, female, mean)) # Cell means
> with(hsb2, tapply(science, list(ses,female), mean))
> with(hsb2, by(hsb2[7:11], list(ses,female), describe))
ˆ
plot is a generic function for scatter plots.
It calls different methods depending on its data arguments30 :
2. Two arguments: x,y that are coordinates32 .
3. One argument that is a formula 33 : y∼x.
The left side is the y-values, the right side the x-values.
Exercise 20. Calculate the mean of each sample quartile of science scores in the hsb2 data.
First use cut with quantile to derive a factor to group the science scores into quartiles. Then
use tapply to apply mean to the science scores grouped by the factor.
you group by two or more factors they must be passed as a list.
High-level functions
1. One argument that is a data frame31 or matrix.
The first column is the x-values, the second the y-values.
> with(warpbreaks, tapply(breaks,tension,mean))
> with(warpbreaks, tapply(breaks,list(wool,tension),mean))
28 If
“High-level” functions29 create new graphs with axes.
“Low-level” functions add further graphics, (points, lines, text, etc.),
to a graph already created by a high-level function.
12.1
Exercise 19. The warpbreaks data frame is typical of a factorial experimental design. There
is a numeric outcome variable, (breaks), and factors that group observations by experimental
conditions, (wool and tension).
a) Calculate the mean breaks at each of the three levels of tension, (averaged over the levels
of wool).
b) Calculate the mean breaks at each level of wool and tension.
> f = cut(hsb2$science, breaks=quantile(hsb2$science))
> tapply(hsb2$science, f, mean)
Graphics
> plot(hsb2[9:10])
> plot(hsb2$math, hsb2$science)
> plot(science~math, hsb2)
29 Functions for R “base graphics” are in the graphics library, loaded by default whenever you start
R. Other graphics libraries are: lattice and ggplot2.
30 See: help(xy.coords).
31 See help(plot.data.frame).
32 See help(plot.default).
33 See help(plot.formula).
11
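As a minimal sketch of the three ways of passing data to plot, the block below uses the built-in cars data frame instead of hsb2 (an assumption, so that it runs without the course files). The call pdf(NULL) opens a null graphics device so that nothing is drawn on screen; the names xy1, xy2 and same are illustrative only.

```r
# Three calling conventions of plot(), shown on the built-in cars data
pdf(NULL)                       # Null device: plots are created but not displayed
plot(cars)                      # 1. Data frame: first column is x, second is y
plot(cars$speed, cars$dist)     # 2. Two coordinate vectors: x, then y
plot(dist ~ speed, cars)        # 3. Formula y ~ x, with the data frame
invisible(dev.off())            # Close the null device

# xy.coords() shows that forms 1 and 2 resolve to the same coordinates
xy1 = xy.coords(cars)
xy2 = xy.coords(cars$speed, cars$dist)
same = isTRUE(all.equal(xy1$x, xy2$x)) && isTRUE(all.equal(xy1$y, xy2$y))
same
```

All three calls draw the same scatter plot; only the default axis labels differ.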
• hist is a histogram.
> hist(hsb2$science)

High level graphics functions
plot              Scatterplot
pairs             Scatterplot matrix
coplot            Conditioning plot
hist              Histogram
stem              Stem-and-leaf plot
boxplot           Box-and-whisker plot
qqnorm            Quantile-quantile plot
barplot           Bar plot
dotchart          Dot plot
interaction.plot  Profile plot of group means

Exercise 21. Make a qqnorm plot of the science scores.
> qqnorm(hsb2$science)
Exercise 22. Make a boxplot to show the science scores for each gender.
> boxplot(science ~ female, hsb2)
Exercise 23. Make a pairs plot of read, write, math, science, and socst in the hsb2 data. Use function cor to calculate the correlation matrix for these variables.
> pairs(hsb2[7:11])
> cor(hsb2[7:11])                     # All are strong positive correlations
12.2 Graphical options
Optional arguments passed to high and low-level graphics functions control the look, (labels, colours, sizes, styles). See: help(par).
• Main title and axis labels34.
> plot(science~math, hsb2,
+   main="High School scores",        # Main title
+   ylab="Science",                   # Label y-axis
+   xlab="Maths")                     # Label x-axis
• Axis ranges35.
> plot(science~math, hsb2, ylim=c(0,100), xlim=c(0,100))
• Plot type36
> x = seq(0, 2*pi, length=50)
> y = sin(x)
> plot(y~x, type="n")                 # No plot
> plot(y~x, type="p")                 # Points (default)
> plot(y~x, type="b")                 # Points and lines
> plot(y~x, type="l")                 # Lines
• Plot characters37
> plot(science~math, hsb2, pch=2)     # Character code
> plot(1:25, pch=1:25)                # The codes
> pch = c(4,20)
> plot(science~math, hsb2, pch=pch[female])
• Character expansion (size)
> plot(1:20, cex=1:20)                # Character expansion
• Line type38 and width
> x = seq(0, 2*pi, length=50)
> y = sin(x)
> plot(y~x, type="l", lty=2)          # Line type
> plot(y~x, type="l", lty=2, lwd=3)   # Line width
• Colour39
> colours()                           # Colour names
> plot(science~math, hsb2, col="blue")
34 See: help(title).
35 See: help(plot.default). By default the range is calculated so the data fills the plot.
36 See: help(plot).
37 See: help(points).
38 Line types set by number: lty=1 solid, 2 dashed, 3 dotted, 4 dotdash, 5 longdash, 6 twodash. See also: help(par).
39 See: help(colours) and help(rgb).
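Indexing a vector of plot characters or colours by a factor, as in pch=pch[female], works because a factor used as an index is treated as its integer level codes. The sketch below illustrates this with a small made-up factor standing in for hsb2$female (the values are hypothetical, not taken from the course data).

```r
# A made-up factor standing in for hsb2$female; levels are in the order male, female
female = factor(c("male","female","female","male"), levels=c("male","female"))
pch = c(4, 20)               # One plot character per factor level
col = c("blue", "red")       # One colour per factor level

codes   = as.integer(female) # Underlying level codes
pch.obs = pch[female]        # One plot character per observation
col.obs = col[female]        # One colour per observation
data.frame(female, codes, pch.obs, col.obs)
```

The first element of pch or col therefore applies to the first level of the factor, the second element to the second level, and so on.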
> col = c("blue","red")
> plot(science~math, hsb2, col=col[female])
> plot(science~math, hsb2, col=col[female], pch=20)
• Labels can take multiple options40 in a list.
> plot(science~math, hsb2, font=2, las=1,
+   ylab=list("Science", font=2, col="green4"),
+   xlab=list("Maths", font=2, col="green4"))
Exercise 24. Plot a histogram of read in the hsb2 data. Give it a nice colour, (maybe “lightblue”), and tidy up the labelling, (maybe main="" and xlab="Reading score").
> hist(hsb2$read, col="lightblue", main="", xlab="Reading score")
Exercise 25. Scatter plot science (y-axis) on math (x-axis) in the hsb2 data, indicating the prog (educational program) of each person by colour. Add points at the bivariate means of the three levels of prog. Add a legend to the plot.
> col = c("red","green4","blue")                    # Colours
> txt = c("General","Academic","Vocational")        # Legend text
> x = with(hsb2, tapply(math, prog, mean))          # x-values of means
> y = with(hsb2, tapply(science, prog, mean))       # y-values of means
> plot(science~math, hsb2, col=col[prog])           # Scatterplot
> points(x,y, pch=16, cex=3, col=col)               # Add 3 points
> legend("bottomright", legend=txt, pch=16, col=col)
12.3 Low-level functions
A high-level graphics function must first open a new plot.
Low-level functions add graphics to the open plot.

Low level graphics functions
abline    Draw a line (intercept and slope, horizontal or vertical)
points    Plot points at given coordinates
lines     Draw lines between given coordinates
text      Draw text at given coordinates
mtext     Draw text in the margins of a plot
axis      Add an axis
arrows    Draw arrows
segments  Draw line segments
rect      Draw rectangles
polygon   Draw polygons
box       Draw a box around the plot
grid      Add a rectangular grid
legend    Add a legend (a key)
title     Add labels

> plot(science~math, hsb2)                          # High-level plot
> # Low-level functions
> abline(h=mean(hsb2$science), v=mean(hsb2$math), col="grey")
> text(x=73,y=50, "average")
> points(science~math, subset(hsb2,ses==3), pch=20, col="red")
> legend("bottomright", legend="High SES", pch=20, col="red")
12.4 Multiple plot layout
• Multiple plots41 in one window
> par(mfrow=c(2,2))                                 # 2 x 2 layout
> hist(hsb2$read, main="", xlab="read")
> hist(hsb2$write, main="", xlab="write")
> hist(hsb2$math, main="", xlab="math")
> hist(hsb2$science, main="", xlab="science")
> # Using mapply to vectorize over the column names
> par(mfrow=c(2,2))
> mapply(hist, hsb2[7:10], main="", xlab=names(hsb2[7:10]))
40 Typeface is set by number: font=1 plain, 2 bold, 3 italic, 4 bold italic. See also: help(expression) and help(plotmath).
41 See also: layout and split.screen.
• Multiple windows42.
> windows()                    # Open a window
> pairs(hsb2[7:11])
> windows()                    # Open another window
> boxplot(hsb2[7:11])
12.5 Saving graphs
• Copy and paste:
  Right-click on a graph and choose Copy as metafile.
  Paste into Word or PowerPoint.
• Printing:
  Right-click on a graph and choose Print...
• Save as a PDF43:
> pdf(file="myplot.pdf")       # Open file
> plot(science~math, hsb2)     # Plot
> dev.off()                    # Flush output and close the file
13 Hypothesis tests

Some hypothesis tests
t.test        t test of means
wilcox.test   Wilcoxon (non-parametric) test
var.test      F-test of variance
cor.test      Correlation (Pearson, Spearman, or Kendall)
binom.test    Test of proportion in a two-valued sample
prop.test     Test of proportions in several two-valued samples
chisq.test    Chi-squared test for count data
fisher.test   Fisher’s exact test for count data
ks.test       Kolmogorov-Smirnov goodness-of-fit test
shapiro.test  Shapiro-Wilk normality test

• t.test performs one and two-sample t tests.
  Two ways to specify the data for a two-sample test are:
  1. Two arguments: x,y, that are two44 sample vectors.
> x = with(hsb2, science[female=="male"])
> y = with(hsb2, science[female=="female"])
> t.test(x,y)
  2. A single formula argument: y~x, where y contains both samples and x is a grouping indicator.
> t.test(science~female, data=hsb2)
• Objects returned by testing and modelling functions contain multiple values that can be extracted by name45 using $.
> fit = t.test(science~female, hsb2)
> names(fit)                   # What are the component names?
> fit$p.value                  # Get component "p.value" by name
Exercise 26. Suppose you are a referee who tosses the same coin at the start of every match. To test whether the coin is fair you carry out an experiment by tossing it 50 times. Suppose you get 18 heads. Use binom.test to test whether the coin is fair.
> # The research hypothesis is the alternative hypothesis.
> # The null hypothesis is set up counter to the research hypothesis,
> # because it is what you want to reject by significance.
> # The null hypothesis can contain equality, (either ==, or <=, or >=),
> # but not inequality, (!=). This allows the alternative hypothesis to
> # contain inequality, which is denoted "two-sided", (the default).
> # The research question here has to be: "is the coin unfair".
> # This allows a "two-sided" alternative hypothesis that the
> # probability of success, (heads), is not equal to the hypothesized
> # probability, (0.5, that heads and tails are equally likely).
> # It implies the null that the probability of heads is equal to 0.5.
> # The result is not significant, so we cannot reject the null that
> # the coin is actually fair.
> binom.test(18, 50, p=0.5)
Exercise 27. Carry out a t test that on average there is no difference between the write and read scores in the hsb2 data.
> # This is either a one-sample t test of the mean of the
> # difference scores, or a two-sample paired t test of the
> # difference between the means. (Same thing).
> # The alternative hypothesis is an inequality, either that the
> # mean difference is not 0, or that the difference in means is
> # not 0. This implies the null is that the mean difference is 0.
> # The result is non-significant, so we cannot reject the null
> # that the two sets of scores have equal means.
> t.test(hsb2$write-hsb2$read)
> t.test(hsb2$write, hsb2$read, paired=TRUE)
42 Under MacOS the command equivalent to windows() is X11(). This requires X client libraries and access to an X server such as is provided by XQuartz. See: https://support.apple.com/en-gb/HT201341.
43 See: help(Devices) for image files: jpeg, bmp, png, etc.
44 The default option y = NULL indicates that y is unused and thereby specifies a one-sample t-test of the mean of x.
45 The names are given in the Value section of the function’s help page.
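The equivalence of the one-sample test on difference scores and the paired two-sample test can be checked on any paired data. The sketch below assumes the built-in sleep data, (10 people each measured under two drugs, rows in the same person order within each group), rather than hsb2; the object names are illustrative.

```r
# Paired measurements from the built-in sleep data
x = sleep$extra[sleep$group == 1]
y = sleep$extra[sleep$group == 2]

fit.diff   = t.test(x - y)                # One-sample test of the difference scores
fit.paired = t.test(x, y, paired=TRUE)    # Paired two-sample test

names(fit.paired)                         # Component names of the returned object
c(fit.diff$statistic, fit.paired$statistic)   # Same t statistic
c(fit.diff$p.value,   fit.paired$p.value)     # Same p-value
```

Both objects are lists of the same class, so the same $ extraction works on either.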
Exercise 28. A researcher predicted that, on average, female students would score higher than male students in the social studies test, (socst). Carry out a t test to see if the data support this.
> # The research hypothesis was that the mean score for male students
> # would be lower than the mean score for female students.
> # According to help(t.test):
> # alternative="greater" is the alternative that x has larger mean than y.
> # From that we can assume, since this option can also take "less", that:
> # alternative="less" is the alternative that x has lower mean than y.
> # Here x refers to the first argument, (using t.test(x,y)), or to
> # the first level of a grouping factor x, (using t.test(y~x)).
> # This is male, (originally coded 0). Therefore specify
> # alternative="less" for the alternative that x (male) has a lower
> # mean than y (female). This implies the null that the mean score for
> # male was equal to or greater than for female.
> # The result was non-significant so the researcher could not reject
> # the null. The observed mean for male was actually lower than for
> # female in this sample, but not lower enough that we could believe
> # it to be consistently lower in 95% of samples.
> t.test(socst~female, hsb2, alternative="less")
Exercise 29. Use cor.test to test the correlation between science and socst in the hsb2 data:
a. Using Pearson’s correlation, (the default).
b. Using Spearman’s rank correlation46, (method="spearman").
> # The default alternative hypothesis is that the correlation is not
> # equal to 0. This implies the null that the correlation is 0.
> # The result is significant (p<.001) so we can reject the null and
> # take it that these scores are correlated.
> # Spearman's correlation is Pearson's correlation applied to the
> # ranks: cor.test(rank(hsb2$science), rank(hsb2$socst))
> cor.test(hsb2$science, hsb2$socst)                       # "a"
> cor.test(hsb2$science, hsb2$socst, method="spearman")    # "b"
14 Linear models
14.1 The model formula
• Function lm fits a linear regression model by ordinary least squares.
  Its first argument is a formula that specifies the model: y ~ model
  The left-hand side (y) is the dependent variable, (response or outcome).
  The right-hand side specifies the independent variables, (predictors), as terms separated by +.
• “+” is used to include terms in the model.
  “-” is used to exclude terms.
  “:” is used to form product terms (interactions).
  “*” is shorthand for main effects and interaction.
  “1” denotes the intercept, (0 or -1 excludes it).
  “I” is a function used to include arithmetic47 within formulas.

Formula            Model equation
y ~ 1              y = β0                                Intercept only
y ~ x              y = β0 + β1 x                         Simple regression
y ~ x+I(x^2)       y = β0 + β1 x + β2 x^2                Quadratic
y ~ x1+x2          y = β0 + β1 x1 + β2 x2                Multiple regression
y ~ x1+x2+x1:x2    y = β0 + β1 x1 + β2 x2 + β3 x1 x2     Main effects and interaction
y ~ x1*x2                                                (shorthand)

14.2 Simple and multiple regression
• Pass lm a formula to specify the model and a data frame containing the variables named in the formula.
> data(hills, package="MASS")
> fit0 = lm(time~1, hills)                        # Intercept only
> fit1 = lm(time~dist, hills)                     # Simple regression
> fit2 = lm(time~dist+climb, hills)               # Multiple regression
> fit3 = lm(time~dist+climb+dist:climb, hills)    # Interaction
> fit4 = lm(time~dist*climb, hills)               # (shorthand)
• coef gets the model coefficients.
> coef(fit0)
> coef(fit2)
> coef(fit4)
46 This method is recommended if the data are not bivariate normal. To test normality see the qqnorm plot, shapiro.test, and further functions in package MVN.
47 Operators “+” and “-” in formulas are used to include or exclude terms, not add or subtract variables. The “I” function allows you to escape that and do some arithmetic. Typical examples are centering variables such as: I(x-mean(x)), and raising variables to powers such as: I(x^2) and I(x^3).
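The formula operators described in section 14.1 can be checked directly. The following is a minimal sketch that assumes the built-in mtcars data, (with columns mpg, wt and hp), in place of the hills data, so that no extra package is needed; the object names are illustrative.

```r
# "*" is shorthand for main effects plus interaction
fit.long  = lm(mpg ~ wt + hp + wt:hp, mtcars)
fit.short = lm(mpg ~ wt * hp, mtcars)
same.fit  = all.equal(coef(fit.long), coef(fit.short))
same.fit

# Inside a formula "^" is formula syntax, not arithmetic, unless wrapped in I()
nm.plain = names(coef(lm(mpg ~ wt + wt^2, mtcars)))      # No squared term is fitted
nm.I     = names(coef(lm(mpg ~ wt + I(wt^2), mtcars)))   # Squared term is fitted
nm.plain
nm.I

# "-1" (or "0") removes the intercept
nm.noint = names(coef(lm(mpg ~ wt - 1, mtcars)))
nm.noint
```

Comparing nm.plain with nm.I shows why the quadratic model in the table is written with I(x^2).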
14.3 Testing the parameter estimates
• summary.lm tests the estimates and the model goodness of fit48.
  confint provides confidence intervals around the estimates.
  vcov provides the estimates variance-covariance matrix49.
  anova provides sums of squares and mean squares50.
> summary.lm(fit4)
> confint(fit4)                 # 95% CI for estimates
> vcov(fit4)                    # Variance-covariance matrix
> anova(fit4)                   # ANOVA table
• Extracting information from the fit summary51.
> names(summary.lm(fit4))       # What are the component names?
> summary.lm(fit4)$coef         # Estimates and tests
Exercise 30. The cars data frame has two variables that are measures of the speed and stopping distance of some old cars. Fit a linear regression model of dist (dependent variable) on speed (independent variable). Extract the estimated coefficients.
> fit1 = lm(dist ~ speed, cars)
> coef(fit1)                    # Intercept (dist where speed==0) and slope
Exercise 31. Derive a new variable for the cars data named cspeed that is speed centered on its mean. Fit the linear regression model of dist on cspeed. Extract the coefficients and compare them with the previous (uncentered) coefficients. Scatter plot dist (y-axis) on cspeed (x-axis), and use function abline to add the regression line to the plot, (see the coef option in help(abline)). Also add a horizontal line at the mean dist and a vertical line at 0, (see options h and v in help(abline)).
> cars$cspeed = cars$speed - mean(cars$speed)     # Derive variable
> fit2 = lm(dist ~ cspeed, cars)
> coef(fit2)
> plot(dist ~ cspeed, cars)
> abline(coef(fit2))
> abline(h=mean(cars$dist), v=0, col="grey")
> # Means intersect on the regression line.
> #   y = b0 + b1.x
> #   ybar = 1/n . sum(b0 + b1.x)
> #        = b0 + b1.xbar
> # Intercept is dist where cspeed=0, which is mean(speed).
> # So intercept is mean(dist).
Exercise 32. Fit a quadratic model of dist on cspeed and cspeed squared. Scatter plot dist on cspeed and add the regression line. Note that abline adds straight lines only and can’t be used to plot a quadratic curve. The method is to predict points along the curve and use function lines to plot a line through them. Use function predict.lm to get model predictions over the range between the min and max of cspeed.
> # The model is:
> #   y = b0 + b1.x + b2.x^2
> #   dy/dx = b1 + 2.b2.x
> # Coefficient b1 is the slope where x=0, (at its centered value).
> fit3 = lm(dist ~ cspeed + I(cspeed^2), cars)
> coef(fit3)
> plot(dist ~ cspeed, cars)
> x = min(cars$cspeed):max(cars$cspeed)
> y = predict.lm(fit3, data.frame(cspeed=x))
> lines(x, y)
14.4 Effect sizes
• Multiple R-squared52.
> summary.lm(fit2)$r.squared    # R-squared
• Standardized estimates53, (“beta weights”).
  Function scale takes a matrix or data frame and returns it as a matrix with standardized columns, (each column centered on its mean and scaled into standard deviation units).
> model = time ~ dist + climb                           # Specify the model formula
> fit2 = lm(model, data=hills)                          # Raw
> fit2s = lm(model, data=data.frame(scale(hills)))      # Standardized
> summary.lm(fit2)$coef                                 # Native units
> summary.lm(fit2s)$coef                                # SD units ("beta weights")
48 The F statistic is the ratio of explained to unexplained variance. These components can be derived from the ANOVA table, (function anova), by dividing sums of squares by degrees of freedom.
49 The diagonal contains the variances of the estimates, (their squared standard errors), and the off-diagonal elements are covariances between estimates. Function cov2cor can be used to derive the correlation matrix: cov2cor(vcov(fit4)).
50 anova calculates a sequential ANOVA table using “type-I” sums of squares. Results may depend upon term order. See help(anova.lm). For “type-II” and “type-III” sums of squares see function Anova in the car library.
51 See the “Value” section of: help(summary.lm).
52 R-squared measures the proportion of outcome variation explained by the model as a whole, equivalent to eta-squared in ANOVA. There is also an adjusted version that accounts for the number of terms, equivalent to omega-squared.
53 Beta weights measure the effect of individual model terms in standard deviation units, similar but not directly equivalent to partial eta-squared which aims to measure the proportion of outcome variation explained. See package MBESS for standardized mean differences, (Cohen’s d).
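The relationship between raw and standardized estimates, and between vcov and the standard errors, can be verified by hand. This sketch assumes the built-in mtcars data, (outcome mpg, predictors wt and hp), instead of hills; the object names are illustrative.

```r
vars = c("mpg", "wt", "hp")
fit.raw = lm(mpg ~ wt + hp, mtcars[vars])                       # Native units
fit.std = lm(mpg ~ wt + hp, data.frame(scale(mtcars[vars])))    # SD units

# A beta weight is the raw slope rescaled by sd(x)/sd(y)
b    = coef(fit.raw)[c("wt", "hp")]
beta = b * sapply(mtcars[c("wt", "hp")], sd) / sd(mtcars$mpg)
beta
coef(fit.std)[c("wt", "hp")]          # Matches beta

# Standard errors are the square roots of the diagonal of vcov
se = sqrt(diag(vcov(fit.raw)))
se
summary(fit.raw)$coefficients[, "Std. Error"]

# Standardizing changes the units of the estimates, not the fit
c(summary(fit.raw)$r.squared, summary(fit.std)$r.squared)
```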
14.5 Goodness of fit
• Diagnostic54 plots.
> par(mfrow=c(2,2))             # 4 plots
> plot(fit2)
• Residuals55.
> residuals(fit2)               # Residuals
> rstandard(fit2)               # Standardized residuals
> summary(fit2)$sigma^2         # Residual variance
Exercise 33. Fit a linear regression of science on math in the hsb2 data. Use function rstandard to extract the standardized residuals, and with these identify people whose measures are more than 3 standard deviations from the regression line. How many are there and what are their id? Re-fit the model excluding these people. Extract the multiple R-squared, (the proportion of outcome variance explained).
> fit1 = lm(science ~ math, hsb2)
> i = abs(rstandard(fit1)) > 3             # Identify outliers
> sum(i)                                   # Count outliers
> hsb2$id[i]                               # "id" of outliers
> fit2 = lm(science ~ math, hsb2[!i,])     # Re-fit excluding outliers
> summary(fit2)$r.squared                  # Extract R-squared
14.6 Missing values
• lm has an na.action option to specify a function56 to treat missing values.
  na.omit (the default) treats missing values by listwise deletion57.
  na.exclude propagates NA to subsequent functions such as residuals and predict.
> summary(airquality)
> fit1 = lm(Ozone~Solar.R+Wind, airquality)
> fit2 = lm(Ozone~Solar.R+Wind, airquality, na.action=na.exclude)
> length(predict(fit1))         # Missing values omitted
> length(predict(fit2))         # Padded with NA to the correct length
14.7 Testing blocks of terms
• Some shortcuts for model formula syntax:
  All variables in the data frame are included by “.”
  Variables are excluded by “-”
> fit1 = lm(Fertility~., swiss)
> fit2 = lm(Fertility~.-(Examination+Agriculture), swiss)
• Model comparison (F test of nested models) using anova:
> anova(fit1,fit2)              # Test a block of terms
14.8 ANOVA
• Factors are group indicators.
  The first level indicates the “reference” group.
• Factors can appear in a model formula. They are automatically converted to dummy numeric variables with values given by contrast coding58.
  contrasts gets and sets a factor’s contrast coding scheme.
  model.matrix shows the dummy variables and contrast coding.
> contrasts(warpbreaks$tension)                 # Default coding
> model.matrix(~tension, data=warpbreaks)       # Dummy variables
• aov fits an ANOVA model by ordinary least squares59.
  summary.aov displays the ANOVA table60.
  summary.lm tests the coefficients and overall fit.
• 1-way ANOVA:
> with(warpbreaks, tapply(breaks, tension, mean))   # Group means
> fit = aov(breaks ~ tension, warpbreaks)           # 1-way ANOVA
> summary.aov(fit)                                  # ANOVA table
> summary.lm(fit)                                   # Coefficients
• 2-way ANOVA:
> # Is the design balanced?
> with(warpbreaks, table(wool,tension))
> # 2-way patterns of group means
> with(warpbreaks, tapply(breaks,list(wool,tension),mean))
> with(warpbreaks, interaction.plot(tension,wool,breaks))
> fit1 = aov(breaks~tension+wool, data=warpbreaks)  # Main effects
> fit2 = aov(breaks~tension*wool, data=warpbreaks)  # Interaction
> summary.aov(fit1)
> summary.aov(fit2)
> summary.lm(fit2)
54 See: help(plot.lm), and also: help(influence.measures), and help(vif) in the car library.
55 The residual variance is defined, (see help(summary.lm)), as the sum of the squared residual deviations divided by n-p, where n is the sample size and p is the number of model terms including the intercept. This is given by: sum(residuals(fit)^2)/(n-p), since the mean of the residuals is 0.
56 See help(na.fail).
57 A whole row is omitted if a value is missing for any variable mentioned in the formula, dependent or independent.
58 The default contrast coding is 0,1 dummy coding, called “treatment contrasts” in R. The coefficients have a simple interpretation: the intercept is the mean of the reference group, and other coefficients are mean differences between a group and the reference group. Treatment contrasts are not orthogonal. Hence the message: “Estimated effects may be unbalanced”, (which can be ignored if the design is balanced). Orthogonal contrasts are available, (see help(contr.helmert)).
59 aov and lm are the same calculation with results displayed differently: lm shows model coefficients, aov shows sums-of-squares.
60 summary.aov calculates a “sequential” ANOVA table using “type-I” sums-of-squares. Terms are assessed in model order, except interaction terms are assessed after main effects. Results may depend upon term order if the design is not balanced, (if the count is not the same in each cell). See Anova in package car for “type-II” and “type-III” sums-of-squares.
Exercise 34. With the hsb2 data, use aov to carry out a 1-way ANOVA of the science scores grouped by prog (the high school program).
a) Use summary.aov to calculate the ANOVA table.
b) Use summary.lm to assess the model coefficients61.
> with(hsb2, tapply(science, prog, mean))
> fit = aov(science~prog, data=hsb2)
> summary.aov(fit)
> summary.lm(fit)
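The interpretation of coefficients under the default treatment contrasts can be checked against the group means. The sketch below uses the built-in warpbreaks data, in which tension has levels L, M and H with L as the reference level; the object names are illustrative.

```r
# Group means as a plain named vector
means = sapply(split(warpbreaks$breaks, warpbreaks$tension), mean)
means

fit.lm  = lm(breaks ~ tension, warpbreaks)
fit.aov = aov(breaks ~ tension, warpbreaks)

coef(fit.lm)                               # (Intercept), tensionM, tensionH
means[["L"]]                               # Intercept = mean of the reference group
means[c("M", "H")] - means[["L"]]          # Other coefficients = mean differences
same.coef = all.equal(coef(fit.lm), coef(fit.aov))   # aov and lm are the same fit
same.coef
```

Changing the reference level, (for example with relevel), changes which group mean the intercept reports and which differences the other coefficients report, but not the fit itself.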
14.9 ANCOVA
• Both numeric and factor variables can appear in a model formula.
  Factors become dummy numeric variables.
Exercise 35. Derive a new variable for the hsb2 data named cread that is read centered on its mean. Fit a regression of science on cread and female, including their interaction. Extract the table of estimated coefficients, standard errors, and p-values, rounded to 4 dp. Scatter plot science on cread and add two predicted regression lines: one for female and the other for male. Use function predict.lm to get model predictions over the range between the min and max of cread for female, and then again for male. Use function lines to add each line to the plot.
> # Intercept = expected science score for male with average read.
> # cread = slope of science on read relationship for male.
> # femalefemale = change in intercept from male to female.
> # cread:femalefemale = change in slope from male to female.
> # The interaction is non-significant but large enough to matter.
> # The size and sign of the "femalefemale" effect will change
> # depending upon how you centre "read".
> # Because "cread" is mean centered the "femalefemale" effect is
> # the "average treatment effect" of gender.
> hsb2$cread = hsb2$read - mean(hsb2$read)     # Derive variable
> fit = lm(science ~ cread * female, hsb2)
> round(summary.lm(fit)$coef, 4)
> x = min(hsb2$cread) : max(hsb2$cread)        # Range to predict over
> # Predictions with female variable coded 0:1
> y0 = predict.lm(fit, data.frame(cread=x, female=0))
> y1 = predict.lm(fit, data.frame(cread=x, female=1))
> # Or predictions with factor with levels "male" and "female"
> y0 = predict.lm(fit, data.frame(cread=x, female=factor("male")))
> y1 = predict.lm(fit, data.frame(cread=x, female=factor("female")))
> plot(science ~ cread, hsb2)
> lines(y0~x, lwd=2, col="blue")
> lines(y1~x, lwd=2, col="red")
> legend("bottomright", legend=c("male","female") , lwd=2, col=c("blue","red"))
14.10 Generalized linear models
• glm fits generalized linear regression by maximum likelihood.
  A single argument specifies the model as a formula.
  A single argument named family specifies the response distribution62 and link function.
> fit1 = lm(dist~speed, cars)
> fit2 = glm(dist~speed, cars, family=gaussian(link="identity"))
> summary.lm(fit1)
> summary.glm(fit2)
Exercise 36. Load the car library to access the dataset named Cowles. (See its help page: help(Cowles)).
a) Use glm with family=binomial(link="logit") to fit a logistic regression model of volunteer predicted by sex.
b) Use coef to extract the model coefficients, and exp to anti-log them for odds ratios.
c) Update the model to control for extraversion and neuroticism.
> data(Cowles,package="car")    # (From car version 3.0 the data are in package carData)
> # Unadjusted odds ratio
> fit1 = glm(volunteer ~ sex, data=Cowles, family=binomial(link="logit"))
> coef(fit1)                    # In log odds units
> exp(coef(fit1))               # In odds units
> # Intercept (0.8097) is odds that a female will volunteer (reference level)
> # sexmale (0.779) is odds multiplier from female to male, (less than 1, so odds
> # of a male volunteering are lower than female)
> # Control for extraversion and neuroticism
> fit2 = glm(volunteer ~ sex + extraversion + neuroticism, data=Cowles,
+   family=binomial(link="logit"))
> exp(coef(fit2))
> # extraversion (1.069) is odds multiplier for unit increase in extraversion.
> # Greater than 1 so more extravert means more likely to volunteer.
> # sexmale (0.790) still less than 1, but not as much lower as before.
> # Female still more likely to volunteer, but extraversion and neuroticism
> # explains some of the difference between female and male.
61 Comparisons with the reference group can be changed by making a factor with a different reference level. See function relevel. Post-hoc tests can be carried out. See functions pairwise.t.test and TukeyHSD.
62 The default response distribution is normal, (gaussian), and the default link is the identity (do nothing) function. Results should be the same as lm. See help(family) and help(make.link) for the range of response distributions and link functions provided.
For logit (logistic) regression use: family=binomial(link="logit").
For probit regression use: family=binomial(link="probit").
For Poisson regression use: family=poisson(link="log").
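Two of the points in section 14.10 can be checked with built-in data only: that a gaussian glm with identity link reproduces lm, and that exponentiated logistic coefficients are odds multipliers. The sketch below assumes the cars and mtcars data, (in mtcars, am is coded 0 = automatic, 1 = manual), rather than Cowles; the object names are illustrative.

```r
# 1. Gaussian family with identity link gives the same coefficients as lm
fit.lm  = lm(dist ~ speed, cars)
fit.glm = glm(dist ~ speed, data=cars, family=gaussian(link="identity"))
same.coef = all.equal(coef(fit.lm), coef(fit.glm))
same.coef

# 2. Logistic regression of transmission type on car weight
fit = glm(am ~ wt, data=mtcars, family=binomial(link="logit"))
coef(fit)                      # Log odds units
exp(coef(fit))                 # Odds units

# exp(slope) is the factor by which the odds change per unit increase in wt.
# Check by hand from predicted probabilities one unit of wt apart:
p    = predict(fit, data.frame(wt=c(3, 4)), type="response")
odds = p / (1 - p)
ratio = unname(odds[2] / odds[1])
ratio
```

The value of ratio agrees with exp(coef(fit))["wt"], which is the sense in which an exponentiated coefficient is an odds ratio.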
15 Linear mixed-effects models
The phf data63 are measures of physical fitness taken on six occasions from a panel of people aged between 40 and 80 years. The data include each person’s age at each occasion, their employment grade at baseline, (coded: 1=high, 2=intermediate, 3=low), and their gender, (coded: 0=male, 1=female).
> dat = read.table("phf.txt", header=TRUE)
Exercise 37. To check the data have been read correctly run commands: dim(dat), names(dat), and summary(dat). You should see 4423 rows of data and 15 column names. The summary shows that the time-varying variables, ("age" and "phf"), have increasing numbers of missing values as people drop out of the study. Suppose for simplicity we decide to restrict analysis to those who complete the study. Derive an index of people with complete records using function complete.cases, and use it to subset the data.
> i = complete.cases(dat)       # Get index
> sum(i)                        # 2308 cases are complete
> dat = dat[i,]                 # Subset the data
15.1 Reshaping
• Mixed effects models are for clustered64 data.
  Functions for these models require data in long format65.
  Wide format: each row is a complete record of a person’s repeated measures.
  Long format: repeated measures of each time-varying variable are stacked into one column. The data include factors to indicate which person and which time-point each measure belongs to.
• Function reshape converts between wide and long format data.
  direction  Reshape to "long" or "wide"
  varying    List of groups of variables to stack
  v.names    Name for the stacked variables
  idvar      Name for the factor to indicate persons
  timevar    Name for the factor to indicate time-points
> i1 = grep("phf", names(dat), value=TRUE)
> i2 = grep("age", names(dat), value=TRUE)
> dat = reshape(dat, direction="long",
+   varying=list(i1,i2),
+   v.names=c("phf","age"),
+   idvar="id", timevar="occ")
63 The phf data were originally provided by Jenny Head (University College London), and obtained from the Centre for Multilevel Modelling, (University of Bristol).
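To see what reshape does to the rows and columns, it helps to run it on a data frame small enough to read in full. The sketch below uses a made-up wide data frame for three people and two occasions; the column names phf1, phf2, age1, age2 and all values are hypothetical, chosen only to mirror the layout described for the phf data.

```r
# Wide format: one row per person, one column per variable per occasion
wide = data.frame(id   = 1:3,
                  sex  = c(0, 1, 1),
                  phf1 = c(60, 55, 70), phf2 = c(58, 50, 69),
                  age1 = c(45, 52, 61), age2 = c(50, 57, 66))

i1 = grep("phf", names(wide), value=TRUE)     # Columns holding phf
i2 = grep("age", names(wide), value=TRUE)     # Columns holding age
long = reshape(wide, direction="long",
               varying=list(i1, i2),          # Two groups of columns to stack
               v.names=c("phf", "age"),       # Names for the stacked columns
               idvar="id", timevar="occ")     # Person and occasion indicators
long = long[order(long$id, long$occ), ]       # Sort by person, then occasion
long
```

Each person now has one row per occasion, the time-invariant column sex is repeated on every row, and occ records which occasion each row came from.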
15.2 Growth curves
• Assume everyone’s growth has the same general form.
  Growth curve parameters may vary66 between people.
  Function lmList67 estimates each person’s growth curve parameters.
> library(lme4)
> library(lmerTest)
> # Straight line growth
> fit = lm(phf ~ age, dat)                     # Overall regression
> fits = lmList(phf ~ age | id, dat)           # Within-person regressions
> coef(fit)
> coef(fits)
> # Growth curves for two people
> tmp = subset(dat, id %in% c(2,7))            # Two people
> tmp = tmp[order(tmp$id),]                    # Sort by id
> fits = lmList(phf ~ age | id, tmp)           # Within-person regressions
> tmp$phf2 = predict(fits)                     # Predicted outcomes
> plot(phf ~ age, dat, col="lightblue")
> abline(coef(fit), lwd=2, col="cadetblue")
> by(tmp, tmp$id, function(tmp) {
+   points(phf ~ age, tmp, type="b")
+   lines(phf2 ~ age, tmp, lwd=2, col="red") })
64 For example students within schools, health outcomes within regions, and longitudinal data that are repeated measures within persons.
65 Long format data allow rows to be deleted at particular time-points without deleting the person’s repeated measures listwise. Data with missing values at some time-points still contribute information in a mixed effects model.
66 For example if everyone’s growth follows a straight line the parameters are the intercept and slope, but different people may have a different intercept and slope. Parameters that vary are called random effects. The averages they vary about are called the fixed effects.
67 This is a convenience function for running a series of lm regressions using a common model. The model is fitted independently to each person’s repeated measures using ordinary least squares, and a list of lm fits is returned. The formula is the same as for lm, except it also has a bar (“|”) separating the regression model from a variable that indicates the group of data for each regression.
15.3 Linear mixed-effects models using function lmer
• Linear mixed-effects models68 are fitted by function lmer in package lme4.
  Its first argument is a formula that specifies the model.
  Fixed effects are specified as terms in the same way as lm.
  These and/or other terms can be specified as random by entering additional terms within brackets, with a bar (“|”) separating the terms from their associated grouping factor69.
• A set of functions called methods70 are provided for extracting information from lmer objects.
1. Model of the mean71.
> fit0 = lmer(phf ~ 1 + (1|id), dat)
> summary(fit0)                                  # See: ?summary.merMod
> # Some methods
> fixef(fit0)                                    # Fixed effects
> coef(summary(fit0))                            # Fixed effects with SEs
> print(VarCorr(fit0), comp=c("Var","Std"))      # Variance components
> confint(fit0)                                  # CIs
> coef(fit0)                                     # Within-person effects
2. Growth model with random intercepts72.
> dat$age50 = dat$age - 50                       # Centre age on 50
> fit1 = lmer(phf ~ age50 + (1|id), dat)
> summary(fit1)
3. Random intercepts and slopes with covariance73.
> fit2 = lmer(phf ~ age50 + (age50|id), dat)
> summary(fit2)
4. Quadratic growth74.
   Function anova75 does model comparison by likelihood ratio test.
> dat$age50.2 = dat$age50^2                      # Squared centred age
> fit3 = lmer(phf ~ age50 + age50.2 + (age50|id), dat)
> summary(fit3)
> anova(fit2, fit3)
5. Time-invariant covariate76.
   How do the average growth parameters differ across gender77?
68 Also called multi-level models, random effects models, or random coefficients models. These
models are for a continuous response variable. See also function glmer for generalized linear
mixed-effects models.
69 A single bar (“|”) specifies an unstructured covariance matrix, in which the covariances are
free parameters to be estimated. A double bar (“||”) specifies a structure in which covariances
between random effects for the same grouping factor are fixed at 0. These are the only covariance
structures provided by lmer. See function lme in package nlme for a wider range of covariance structures.
70 A list of the methods is displayed by: methods(class="merMod"). See also: help(merMod) and
help(pvalues).
71 The “empty” or “null” model with a random intercept only. The fixed effect estimates the population
grand mean. The variance components divide the total variance into between-person intercept variance
and average within-person residual variance. The proportion of the total variance that is between-person
is the intra-class correlation: the share of variation due to individual differences. For example:
32.11 / (32.11 + 31.25) = 0.507.
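The intra-class correlation arithmetic in footnote 71 can be checked directly. This is a small sketch using the two variance components quoted in the footnote; the names var_between, var_within and icc are hypothetical.

```r
# Variance components quoted in footnote 71
var_between = 32.11   # between-person intercept variance
var_within  = 31.25   # average within-person residual variance

# Intra-class correlation: the between-person share of the total variance
icc = var_between / (var_between + var_within)
round(icc, 3)   # 0.507
```

With a fitted model, the two variances can be read from the vcov column of as.data.frame(VarCorr(fit0)) rather than typed in by hand.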
72 Specifying random intercepts only, and not random slopes, implies that the individual growth lines
are parallel: all subjects change in the same way over time, corresponding to “sphericity”. The intercept
represents the expected phf at age 50. The slope (age50) represents the rate of linear change with age:
the change in outcome per year. A negative slope indicates decline.
73 Positive intercept-slope correlation suggests a higher level at baseline is associated with a less steep
decline, (a more positive slope). This implies the fan-out pattern of increasing between-person variance
with age. Note: the “Correlation of Fixed Effects” reported by the summary method represents correlation
between the estimates of the fixed effects expected over multiple experiments. For example negative
correlation suggests estimates that would change in opposite directions. If a subsequent study found a
higher average baseline level, it would probably also find a more negative average slope.
74 The slope effect age50 represents the instantaneous slope at age 50. The quadratic effect age50.2
represents curvature: the rate of change of the slope. Negative curvature indicates a concave trajectory:
the rate of decline increases with age.
75 Is a quadratic growth curve a better fit than a straight line? The difference between the models’
fits is significant. Note: when comparing models with different fixed effects you should use ML, not REML.
For that reason anova will automatically re-fit the models with ML if necessary.
76 A time-invariant covariate, like gender, does not change over time. So it cannot explain
within-person residual variation. It explains between-person variation. Intercept and slope (age50)
variation are reduced by the gender variable.
77 The main effect of female represents the change in baseline level from male (coded 0) to female. The
interaction age50:female represents the change in instantaneous slope, and age50.2:female represents
the change in curvature.
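The likelihood ratio test mentioned in footnote 75 can be illustrated by hand. The sketch below is an assumption-laden stand-in: it uses ordinary regression (lm) on simulated data rather than lmer on the course data, and the names m_lin, m_quad, lrt_stat, lrt_df and p_value are hypothetical. It shows only the arithmetic of the test: twice the difference in ML log-likelihoods, referred to a chi-squared distribution.

```r
# Simulated outcome with genuine curvature (assumed data, not the course data)
set.seed(2)
age50 = seq(-5, 25, length.out = 100)
phf = 50 - 0.3 * age50 - 0.05 * age50^2 + rnorm(100, sd = 1)

# Nested models: straight line versus quadratic
m_lin  = lm(phf ~ age50)
m_quad = lm(phf ~ age50 + I(age50^2))

# Likelihood ratio statistic from the ML log-likelihoods
ll_lin  = logLik(m_lin)
ll_quad = logLik(m_quad)
lrt_stat = 2 * (as.numeric(ll_quad) - as.numeric(ll_lin))

# Degrees of freedom: difference in the number of estimated parameters
lrt_df = attr(ll_quad, "df") - attr(ll_lin, "df")

# Upper-tail chi-squared probability
p_value = pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)
c(statistic = lrt_stat, df = lrt_df, p = p_value)
```

A small p-value favours the quadratic model, which is the same reading as for the output of anova(fit2, fit3).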
> fit4 = lmer(phf ~ (age50 + age50.2) * female + (age50|id), dat)
> summary(fit4)
> # Predicted average growth curves by gender
> age50 = seq(45,75, length=100) - 50
> newdata0 = data.frame(age50=age50, age50.2=age50^2, female=0)
> newdata1 = data.frame(age50=age50, age50.2=age50^2, female=1)
> phf0 = predict(fit4, newdata0, re.form=NA)
> phf1 = predict(fit4, newdata1, re.form=NA)
> plot(phf0 ~ age50, ylim=c(40,55), type="l", lwd=2, lty=2)
> lines(phf1 ~ age50, lwd=2)
> legend("topright", legend=c("Female","Male"), lwd=2, lty=1:2)
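Setting re.form=NA asks predict for population-average predictions: the fixed effects only, with the random effects left out. The sketch below illustrates this under stated assumptions: it uses the sleepstudy data that ships with package lme4 instead of the course data, and the names fit_demo, new_demo, pred_pop and pred_manual are hypothetical.

```r
library(lme4)

# Random intercepts and slopes model on the sleepstudy data bundled with lme4
fit_demo = lmer(Reaction ~ Days + (Days | Subject), sleepstudy)

# New data for the average trajectory
new_demo = data.frame(Days = 0:9)

# Population-average predictions: random effects are ignored
pred_pop = predict(fit_demo, new_demo, re.form = NA)

# The same values built by hand: design matrix times the fixed effects
X = model.matrix(~ Days, new_demo)
pred_manual = as.vector(X %*% fixef(fit_demo))

# If the two agree, all.equal returns TRUE
all.equal(unname(pred_pop), pred_manual)
```

Leaving out re.form=NA would instead require the grouping variable in the new data, because the predictions would then include each person's random effects.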
Exercise 38. Quit R like this: q(). Click Yes to save your workspace image78. Files named
.RData and .Rhistory should appear79 in your project folder.
Restart R by double-clicking on the .RData file. This should restore your objects and resume
the R session at the point you left it. Run function getwd() to check that your project folder has
been restored as the working directory. Run functions ls() and history() to list the objects
that have been restored and the last few commands you ran before quitting.
> getwd()
> ls()
> history()
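The save-and-restore cycle in this exercise can also be tried without quitting R. This is a minimal sketch, assuming a temporary file in place of the .RData file in your project folder; the names demo_x and img are hypothetical.

```r
# A hypothetical object standing in for your workspace contents
demo_x = c(2, 5, -3)

# Save it to a temporary image file (a real session would use .RData in the project folder)
img = tempfile(fileext = ".RData")
save(demo_x, file = img)

# Simulate quitting: remove the object, then restore it from the image file
rm(demo_x)
load(img)
demo_x
```

Function save.image() does the same for every object in the workspace at once, which is what answering Yes on quitting does.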
78 The “workspace image” is two files: .RData and .Rhistory, containing all your current objects,
(variables and functions you defined), and recent command history. The point is to enable you to
keep different projects in different folders, so you can have multiple running sessions each with its own
workspace image.
79 Some systems may hide filenames that begin with a dot. On Windows you may need to take action
to show them. Use the menu item: Tools > Folder Options... in any folder. (You may need to hit
the Alt key to display a folder’s menu items). On the View tab ensure the option: Show hidden files
and folders is selected. On MacOS you may find the history file appears in /usr/<user> rather than
the working directory you have set.