The University of Edinburgh
Centre for Cognitive Ageing and Cognitive Epidemiology

Elements of the R Language
Mike Allerhand

This document has been produced for the CCACE short course: An Introduction to R Programming. No part of this document may be reproduced, in any form or by any means, without permission in writing from the CCACE. The CCACE is jointly funded by the University of Edinburgh and four of the United Kingdom’s research councils, BBSRC, EPSRC, ESRC, MRC, under the Lifelong Health and Wellbeing initiative. The document was generated using the R Sweave package and typeset using MiKTeX, (LaTeX for Windows).

© 2015, Dr Mike Allerhand, CCACE Statistician.

1 The big idea: R = values and functions

Values are objects that contain data, results, notes, whatever. R values generally contain multiple items.
Functions are operators on values. Values are input, processed in some way, and values are output.

Consider a sum: 2+5-3. The numbers are a value that is input to the function. Their sum is output.

> x = c(2,5,-3)    # Function c combines items. Save the value as x.
> x                # Display the value of x, a "numeric vector"
> # The [1] is a counter, not part of the value.
> # The hash symbol starts a "comment" which is ignored by R.
> sum(x)           # Run function sum with argument x.

Separating value from function allows other functions to operate.

> length(x)        # N
> mean(x)          # Average
> sd(x)            # Standard deviation

2 How to find functions

Functions are organised into libraries or packages, (same thing). A core set of libraries comes with the R “base” distribution. The most useful are loaded into memory when you start R.

> sessionInfo()               # What packages are loaded into memory?
> help(package="stats")       # Browse the "stats" package.
> help(package="graphics")    # Browse the "graphics" package.

> apropos("cor")              # Names containing "cor"
> apropos("^read")            # Names beginning "read"
> apropos("test$")            # Names ending "test"

Other packages may be on your hard disk. Load these into memory to use their functions.

> library()                   # Packages on disk
> library(help="foreign")     # Functions in "foreign"
> library("foreign")          # Load "foreign" into memory

Many other packages are available for download. The main repository for R packages is CRAN (“Comprehensive R Archive Network”).

By name: http://cran.r-project.org/web/packages/
By task: http://cran.r-project.org/web/views/

Function install.packages downloads (note 1) a package from CRAN to your hard disk.

> install.packages("lme4")    # Download "lme4" from CRAN

Note 1: It is only necessary to download a package once, the first time you need it. Thereafter you can load it from your disk into memory whenever you need its functions.

3 The command line

The > symbol is the “prompt”. A command is held in a “line buffer” until you press the Enter key. You can edit it using arrow keys ← →, backspace, delete, and ctrl+u. You can recall and edit previous commands using: ↑ ↓
The line is interpreted (note 2) when you press Enter.

Note 2: This means the command is parsed and evaluated. Parsing means making sense of the syntax. If it makes sense by R’s rules then it is evaluated. If it makes partial sense but is incomplete you’ll see a “+” prompt, meaning “continue”. If it makes no sense you’ll see an error message. If you need to interrupt R to stop it trying to evaluate a command press the Esc key.

3.1 How to use a function

Every function has a name. The brackets contain arguments, (input values). The function returns a value, (output). To run a function type its name followed by arguments in brackets (note 3).

> sqrt(4)                     # Square root of the argument (4)

Note 3: You must give the brackets even if they are empty. These indicate to R that this is a function rather than a value.

3.2 Arithmetic expressions

Arithmetic expressions have conventional operators and syntax: + - * / ^

> 2 + 2
> 2 / 3                       # 2 divided by 3
> 2^3                         # 2 to the power 3
> 2 / (3 + 2)                 # When in doubt use brackets
> 8^(1/3)

A function can appear wherever its value can be used. It could be in an expression or a function argument. Think of the value substituted directly in place of the function.

> sqrt(9) - sqrt(25)          # Equivalent to: 3 - 5
> sqrt(abs(-2))               # abs returns an absolute (unsigned) value

3.3 Function help

Every function has a page of help.

> help(round)                 # Help page for "round"
> ?round                      # ...same thing

The Usage section is a synopsis of the arguments. It may show several arguments separated by commas. Two kinds:
1. Values that provide input data to the function.
2. Options that control the function’s behaviour.

Options are shown as: name = value. Omit the option and get its default value. Or pass name = value to override the default.

> round(3.14159)              # Use the default
> round(3.14159, digits=2)    # Override the default

Options may take different kinds of values. The kind of value is indicated on the help page. For example:

> ?t.test

The Usage shows:
y             NULL, indicating the argument is unused unless specified.
conf.level    Numeric value. The default here is 0.95.
paired        Logical value: TRUE or FALSE. This acts as a switch.
alternative   Several possibilities denoted by strings. The first is the default.

The Arguments section describes each argument in more detail.

3.4 Assignment

The value returned by a function is displayed by default. It is not saved unless you assign (note 4) it.
Assignment saves a value and gives it a name so you can retrieve it. Names can contain letters, numbers, ‘.’ and ‘_’, but they cannot begin with a number or ‘_’, and they are case-sensitive.

> x <- 2 + 2                  # x is assigned the value 2+2
> x = 2 + 2                   # Same thing (one less key-press)
> x                           # Display the current value of x

Assignment saves an independent copy of the value.

> y = x                       # y becomes an independent copy of x
> y = 0                       # y becomes 0, x is not changed by this

> y = x + 1                   # y becomes x+1, x is unchanged
> x = x + 1                   # x becomes x+1, its previous value is lost
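The following is a minimal sketch of two points made above: an omitted option falls back to its default, and assignment stores an independent copy. It uses only base R. The names r0, r2, a and b are invented for this illustration and do not appear in the course text.

```r
# Options: omit 'digits' to get the default (0), or name it to override.
r0 = round(3.14159)              # uses the default, digits=0
r2 = round(3.14159, digits=2)    # overrides the default

# Assignment copies the value: changing b afterwards does not change a.
a = c(2, 5, -3)
b = a                            # b is an independent copy of a
b[1] = 100                       # modify the copy only

print(r0)
print(r2)
print(a)                         # a still holds 2 5 -3
print(b)
```

Re-running the last four lines after changing b again is a quick way to confirm that a never follows it.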
On the help page for t.test, the Value section describes the object returned by the function.

A function argument is an independent copy of a value.

> sqrt(x)                     # x is not changed by this
> x = sqrt(x)                 # x becomes its square root

Note 4: Assignment saves the value in memory while the R session is running, not to the computer’s file system. The assignment operator is <- but you can usually use = instead.

Exercise 1. Evaluate x^2 + x - 6 at x = -3 and x = 2. (Change the value assigned to x and re-run the expression using the arrow keys).

> x = -3
> x^2 + x - 6
> x = 2
> x^2 + x - 6

4 Values

Elementary items of data are of three main types:
numeric      Number. Eg. 0, -1, 3.14
character    String of characters in quotes. Eg. "apple", "3.14"
logical      TRUE or FALSE

R values are objects that contain multiple items of data. Different classes (note 5) of object structure their items in different ways. The choice is a trade-off between flexibility and speed.

The main classes of data object are:
vector       a row or column of items of the same type.
matrix       a 2-way layout of items of the same type.
factor       a vector of categorical items.
data.frame   a collection of columns, possibly of different types.
list         a collection of any objects.

Note 5: An object’s class also determines how it is handled by generic functions such as: print, plot, and summary.

5 Making vectors

Vector is the simplest R object. It is a row or column of cells (note 6) each containing one item of data.

Note 6: A single (scalar) item is seen as a single-cell vector.

5.1 Combining vectors

> c(1,3.14,0,-3,7)                          # Numeric vector
> c("apple","3.14","apple","orange",".")    # Character vector

A vector must contain items of the same data type. Mixed data types are automatically “coerced” to the same type: logical → numeric → character

> c(1,2,3,".")                              # Automatically coerced to character

Exercise 2. Given a character vector x made like this:

> x = c("one","two","three","four")

How could you insert "zero" at the start, and "five" at the end?

> x = c("zero", x, "five")

5.2 Making numeric vectors as sequences

Operator : with numbers makes a sequence with integer steps

> 1:5                         # Integer steps
> 5.5:-5

Function seq makes a numeric sequence with fractional steps

> seq(0,1, by=0.1)            # Fractional steps
> seq(0,1, length=64)         # From 0 to 1 so length=64

Exercise 3. Generate a regular sequence from 1 down to 0 in steps of 0.1.

> seq(1,0, by=-0.1)

5.3 Making vectors by repeating elements

> x = c("low","med","high")
> rep(x, times=3)             # Repeat whole vector
> rep(x, each=3)              # Repeat cells (balanced)

Exercise 4. Make a character vector consisting of: 3 × "low", 5 × "med", and 2 × "high". (See the times option in ?rep).

> rep(c("low","med","high"), times=c(3,5,2))    # Repeat cells (unbalanced)

5.4 Making vectors by random sampling

sample draws a random sample from a vector

> sample(1:100)                                 # Permute (shuffle) numbers 1:100
> sample(1:100, size=10)                        # Sample, N=10, numbers 1:100
> sample(c("A","B"), size=100, replace=TRUE)    # Sample with replacement

rnorm draws a random sample from a normal distribution.
runif draws a random sample from a uniform distribution (note 7).

> rnorm(50)                   # Sample N=50 of standard normal: N(0,1)
> rnorm(50, mean=10, sd=2)    # Mean and standard deviation
> runif(50, min=0, max=10)    # Uniform sample between 0 and 10

Note 7: For other distributions see help(Distributions).

Exercise 5. Use function sample to draw a random sample, N=100, of "female" and "male" from a population containing twice as many female as male. (See the prob option in ?sample). Run command table(x), where x is your sample, to count the number of female and male.

> x = sample(c("male","female"), size=100, replace=TRUE, prob=c(1/3,2/3))
> table(x)

5.5 Descriptive functions

Some vector summary functions
length             N (eg. sample size)
sum                Sum
mean, median       Mean, median
sd, var            Standard deviation, variance
min, max, range    Minimum, maximum, range

> x1 = c(1,3.14,0,-3,7)                          # Numeric vector
> x2 = c("apple","3.14","apple","orange",".")    # Character vector
> length(x1)
> length(x2)
> sum(x1)
> mean(x1)
> range(x1)
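As a hedged illustration of how the summary functions relate to each other, the sketch below draws a normal sample and checks sd against its textbook definition. The call to set.seed is an addition that does not appear in the course text; it only makes the random sample repeatable. The name s.manual is invented for this example.

```r
set.seed(1)                        # fix the random number stream (assumption: reproducibility is wanted)
x = rnorm(1000, mean=10, sd=2)     # sample N=1000 from a normal with mean 10, sd 2

n = length(x)                      # N
m = mean(x)                        # sample mean
s = sd(x)                          # sample standard deviation

# sd() uses the N-1 denominator: sqrt(sum((x - mean)^2) / (N - 1))
s.manual = sqrt(sum((x - m)^2) / (n - 1))

print(c(N=n, Mean=m, SD=s, SD.manual=s.manual))
```

The two standard deviations agree, and the sample mean and sd land close to, but not exactly on, the values passed to rnorm.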
Exercise 6. Use function rnorm to make a random sample, N=100, of a standard normal distribution, (mean = 0, standard deviation = 1). Use functions range, mean, and sd to calculate the sample range, mean, and standard deviation. Run command hist(x), where x is your sample, to plot the sample histogram. Repeat this plot with a new random sample where N=10000.

> x = rnorm(100)
> hist(x)
> range(x)
> mean(x)
> sd(x)
> x = rnorm(10000)
> hist(x)

6 Making lists and data frames

A list is a general-purpose collection of data objects. It is used to pass multiple items as function arguments and returns.
A data.frame is a collection of columns with names. It is a general-purpose container for a dataset. The columns are usually seen as variables.

> x = c(0,1,3,4.5)
> y = c("apple","apple","orange","apple")
> list(x=x, y=y)
> data.frame(x=x, y=y)        # The columns must have the same length

6.1 Reading data frames

Set your working directory (note 8) before importing or exporting data (note 9).

Exercise 7. Use File > Change dir... to set the R working directory to point to your project folder. Check this using commands getwd() and dir() to see what your current working directory is and to list the files and folders within it.

> getwd()
> dir()

Function read.table reads a plain text file (note 10) and returns it in a data frame (note 11). The first argument is the filename and its extension (note 12) in quotes. The two most useful options are:

header   is the first line the column names?
         header=FALSE   No (the default).
         header=TRUE    Yes.
sep      how are the columns separated?
         sep=""         White space: one or more spaces or tabs (the default).
         sep=","        Comma.
         sep="\t"       Tab.

> hsb2 = read.table("hsb2.txt", header=TRUE, sep="")

Note 8: The R commands to read and write data will use whatever folder is currently set as the “working directory”. The point of setting the working directory is that R will read and write to that folder by default.

Note 9: The hsb2 data used here are a subset of the “High Schools and Beyond” data set originally used by Raudenbush and Bryk. The data were obtained from the Institute for Digital Research and Education (UCLA), downloaded from: http://www.ats.ucla.edu/stat/data/hsb2.txt.

Note 10: R functions read.table and write.table work with plain text data for portability. See: help(read.table), and also help.start() and “R Data Import/Export”. Functions for reading data in proprietary formats are provided in the foreign library. For example read.spss for SPSS data files and read.dta for STATA data files. When using read.spss you may get a warning message like: “Unrecognized record type 7, subtype 18 encountered in system file”. This can be ignored. It is to do with SPSS compatibility with its own previous versions, and does not mean the data has not been recognised.

Note 11: You can have multiple data sets open at the same time, each in its own data frame.

Note 12: The filename extension is the “.txt” (or similar) after the filename. This has to be given. Windows may hide filename extensions so you may need to take action to show them. Use the menu item: Tools > Folder Options... in any folder. (You may need to hit the Alt key to display a folder’s menu items). On the View tab, de-select the option: Hide extensions for known file types. If you give a full pathname you must use either forward-slashes or double back-slashes. Instead of a filename you can use the function file.choose, or the string "clipboard" to read from Windows clipboard, or a URL to download data through the web.

Some data frame summary functions
dim, nrow, ncol       Dimensions (number of rows, columns)
names                 Column names (the variables)
summary               Summary of each column
rowSums, colSums      Sum of each row and column
rowMeans, colMeans    Mean of each row and column
cov, cor              Covariance and correlation matrices
head, tail            First and last rows
View                  Look at the data

> dim(hsb2)                   # Dimensions (number of rows, columns)
> names(hsb2)                 # Names of the variables (columns)
> summary(hsb2)               # Summary of each column
> colMeans(hsb2)              # Mean of each column
> cor(hsb2)                   # Correlation matrix

> head(hsb2, 10)              # The first 10 rows
> tail(hsb2, 10)              # The last 10 rows
> View(hsb2)                  # A view of the data

Data frame columns (note 13) have names and can be addressed: data.frame$name.

Get and set data frame columns by name using $

> names(hsb2)                 # What are the column names?
> mean(hsb2$science)          # Mean of a particular column

> hsb2$my.col = 0             # Set a new column ("my.col")
> hsb2$my.col = NULL          # Drop a column

Using attach and detach

> attach(hsb2)
> mean(science)               # No need for $
> detach(hsb2)

Using with

> with(hsb2, mean(science))   # No need for $

Note 13: Many functions return multiple values within a “list-like” collection of named components which can be extracted using $, (see help(Extract)).

7 Vectorized functions and arithmetic

Vectorized functions apply an operation element-wise and return a vector the same length as the input.

Arithmetic operators (note 14): + - * / ^ are vectorized.

> x = c(0,2,4,6,8,10)
> y = c(7,9,11,1,3,5)
> x^2                         # Square each element
> x + y                       # Add corresponding pairs of elements
> x + 2                       # Add 2 to each element, (2 is recycled to length(x))

Note 14: If the function takes two arguments but the vectors are different lengths the “recycling rule” is applied: the shorter vector is recycled to the length of the longer vector and the operation is carried out element-wise between corresponding pairs of elements. A warning is displayed if the longer length is not an exact multiple of the shorter.

Some vectorized numeric functions
round               Round to given number of decimal places
trunc               Truncate toward zero to a whole number
abs                 Absolute (unsigned) value
sqrt                Square root
exp                 Exponential
log, log10, log2    Log to base e, 10, and 2
sin, cos, tan       Trigonometric functions
asin, acos, atan    Inverse (arc) trigonometric functions

> hsb2$logmath = log(hsb2$math)                # Log variable
> hsb2$cmath = hsb2$math - mean(hsb2$math)     # Centre on mean (recycled)
> hsb2$diff = hsb2$write - hsb2$read           # Derive difference

Exercise 8. Make a numeric vector x = -2:2 and guess, before running it, what the results of the following will be:

> x + 2
> 1/x

Exercise 9. Plot y = x^2 + x - 6 over the range of x in the interval (-3,2). To do that generate a regular sequence x of 100 numbers in the interval (-3,2), and use x to obtain y by evaluating y = x^2 + x - 6. Plot y on x by running command: plot(x,y).

> x = seq(-3,2, length=100)
> y = x^2 + x - 6
> plot(x,y)

Exercise 10. Compare mean(hsb2$science) with mean(hsb2$science-10). What number would you use, instead of 10, to make the mean as close to 0 as possible?

> mean(hsb2$science)
> mean(hsb2$science-10)
> mu = mean(hsb2$science)     # The mean
> mean(hsb2$science-mu)       # The mean of mean-centered science

8 Data manipulation

8.1 Indexing using numeric and character vectors

Indexing means addressing particular cells by number or name. The index is a vector of numbers or names that address cells. The cells of a vector are numbered: 1,...,length.

Cells are addressed in the order of the index vector.

Positive index elements can be repeated

> y = c("red","blue","green","orange","magenta")
> y[c(1,1,1,2,2,3)]           # Index with replications
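To make the idea of an index vector of numbers or names concrete, here is a small sketch using a named vector. The data and the names score, by.number, by.name and repeated are made up for illustration and are not part of the course material.

```r
# A small named numeric vector (made-up data)
score = c(anna=12, ben=15, cara=9, dev=20)

by.number = score[c(3, 1)]            # cells 3 and 1, returned in index order
by.name   = score[c("dev", "ben")]    # the same idea, addressing cells by name
repeated  = score[c(1, 1, 2)]         # a positive index may address a cell more than once

print(by.number)
print(by.name)
print(repeated)
```

In each case the result follows the order of the index vector, not the order of the cells in score.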
Cells of a data frame are numbered: 1,...,nrow and 1,...,ncol, and named with rownames and colnames.

The syntax uses square brackets:
Vectors are indexed using: x[i]
Data frames are indexed using: x[i,j] where i is the row index and j is the column index.

> x = c(3,4,1,2,5,6)
> i = 1:3                     # Index vector for the first three cells
> x[i]                        # Extract the first three cells
> x[i] = 0                    # Replace the first three cells (0 is recycled)
> x[i] = c(7,8,9)             # Replace the first three cells
> x[i] = x[i] + 1             # Update the first three cells

> i = c(3,5,1)
> x[i]                        # Extract cells in index order
> x[i] = c(9,7,8)             # Replace cells in index order

> i = nrow(hsb2)              # Index the last row
> j = c(3,2,1)                # Index three columns in reverse order
> hsb2[i,j]

An empty index is shorthand for a complete index:
x[i,]    Index rows (over all columns)
x[,j]    Index columns (over all rows)

> hsb2[c(1,4),]               # Extract rows 1 and 4 (over all columns)
> hsb2[, c(1,4)]              # Extract columns 1 and 4 (over all rows)
> hsb2[, c("id","ses")]

> hsb2[1:3, c(1,4)]           # Extract first 3 rows, first and fourth columns
> hsb2[1:3, c("id","ses")]

A numeric index can be negative. A negative index addresses all cells except those indexed.

> x[-1]                       # Extract all except the first cell
> x[-(1:3)]                   # Extract all except the first three cells

> hsb2[-(2:3),]               # Extract all rows except 2 and 3
> hsb2[,-ncol(hsb2)]          # Extract all columns except the last

Special syntax for addressing single columns.

> hsb2[1]                     # Extract the first column as a data frame
> hsb2[[1]]                   # Extract the first column as a vector
> hsb2[["id"]]                # Same as: hsb2$id
> hsb2$id[2:3]                # Extract cells 2:3 of "id"

An index can be derived to sort (note 15) a vector or data frame.

> y = c(3,1,2,1,4,0)
> order(y)                    # Derive the index for sorting
> x[order(y)]                 # Sort x into the order of y

> hsb2[order(hsb2$id),]       # Sort the data frame rows by id

Index vectors can be derived by string matching (note 16) on names.

> # Match column names in the "hsb2" data that begin with "s"
> grep("^s", names(hsb2))                # Numeric
> grep("^s", names(hsb2), value=TRUE)    # Character

Note 15: Function sort provides simple sorting. Function order sorts in two steps: first derive an index, then apply it to sort an object. The advantage is you can sort one object by the order of another.

Note 16: Function grep does string matching with regular expressions, a tiny language used to describe patterns of characters in strings. See: help(regex).

Exercise 11. Given a character vector x made like this:

> x = c("one","two","three","five","six","seven")

a) How could you insert "four" into it between "three" and "five"?
b) How could you delete the last element in the vector?

> x = c(x[1:3], "four", x[4:6])
> x = x[-length(x)]

Exercise 12. Calculate the mean of the first 10 math scores in hsb2.

> mean(hsb2$math[1:10])
> mean(hsb2[1:10, "math"])    # ...same thing

Exercise 13. Extract a random sample of 10 rows of hsb2 data.

> i = sample(nrow(hsb2), size=10)
> hsb2[i,]

Exercise 14. Divide the hsb2 data into two new data frames of equal size, one containing a random sample of half the rows in hsb2, and the other containing the rest of the rows.

> i = sample(nrow(hsb2), size=nrow(hsb2)/2)
> hsb2[i,]
> hsb2[-i,]

8.2 Conditional indexing using logical vectors

Conditional operators return vectors that are TRUE where a condition is met.
Operators: ==, !=, >, >=, <, <=, are comparisons with particular values or ranges.
Operator %in% provides comparison with a given set of values.
Numbers (note 17) are compared numerically. Character strings are compared alphabetically and case-sensitively.

> x = c(0,2,4,6,8,10)
> y = c("apple","orange","orange","orange","apple","apple")

> x == 4                      # Logical vector TRUE where x==4
> x > 4                       # TRUE where x>4
> x %in% c(2,6,10)            # TRUE where x is 2, 6, or 10

> y == "apple"                # TRUE where y=="apple"
> y != "apple"                # TRUE where y!="apple"

Composite (note 18) conditions can be formed using & (AND) and | (OR)

> x > 0 & x < 8               # TRUE where x>0 AND x<8
> x < 4 | x > 6               # TRUE where x<4 OR x>6
> x > 4 & y == "orange"       # TRUE where x>4 AND y=="orange"

In arithmetic expressions logical values are converted to numeric: TRUE → 1 and FALSE → 0
sum counts TRUE
mean calculates the proportion TRUE

> i1 = x > 4                  # Assigning (for convenience)
> i2 = y == "orange"

> sum(i1)                     # How many of x are > 4?
> sum(i2)                     # How many y are "orange"?
> mean(i1)                    # What proportion of x are > 4?
> mean(i2)                    # What proportion of y are "orange"?
> sum(i1 & i2)                # How many "orange" have x > 4?

Logical index vectors address cells where the index is TRUE (note 19).

> x[x > 5]                    # Extract cells where x>5 is TRUE
> x[x > 1 & x < 5]            # Extract cells where x>1 AND x<5
> x[x < 1 | x > 5]            # Extract cells where x<1 OR x>5
> x[x > 5] = 0                # Replace cells (with 0) where x>5 is TRUE

> hsb2[hsb2$id == 1,]              # Extract data frame row for id 1
> hsb2[hsb2$id %in% c(2,4,6),]     # Extract rows for id 2,4,6

Function which turns a logical index into a numeric index.
Function subset is a convenience function for logical indexing.

> which(x > 5)                     # Which cells meet the condition (x>5)?
> which(hsb2$id %in% c(2,4,6))     # Which rows are id 2,4,6?

> subset(hsb2, id == 1)            # Row for id 1
> subset(hsb2, id %in% c(2,4,6))   # Rows for id 2,4,6

Exercise 15. R provides vectors named letters and LETTERS containing the lower-case and upper-case alphabet. Use sample to draw a random sample N=1000 from LETTERS. How many are the letter A? How many are vowels?

Note 17: Annoyance: a conditional expression like x<-1 is mis-interpreted as assignment. The solution is to include space around the operator: x < -1.

Note 18: Logical vectors are combined in pairs element-wise. Under AND the result is TRUE only if both are TRUE. Under OR the result is TRUE if either (or both) are TRUE.

Note 19: Numerical and character index vectors address individual cells by number and name. A logical vector is a pattern over the whole vector that indicates cells that meet a condition.
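The pitfall of writing a comparison with a negative number without spaces can be seen directly in a short sketch. The vectors and the names below, also.below and y are made up for this illustration.

```r
x = c(-3, 0, 2)
below = x < -1          # spaces around the operator: TRUE where x is less than -1
also.below = x<(-1)     # brackets also keep it a comparison

y = c(-3, 0, 2)
y<-1                    # no spaces: parsed as assignment, so y is overwritten by 1

print(below)
print(also.below)
print(y)
```

The last line shows that y no longer holds the original three numbers, which is why the spacing matters.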
> x = sample(LETTERS, size=1000, replace=TRUE)
> sum(x == "A")
> vowels = c("A","E","I","O","U")
> sum(x %in% vowels)

Exercise 16. In the hsb2 data:
a) What is the average read score of people with low socio-economic status, (ses coded 1), and people with high socio-economic status, (coded 3)?
b) How many people in the sample have above average read score?
c) How many female, (coded 1), have above average read score?
d) What is the average write score of the people whose read score is above average?
e) How many people have a read score greater than 2 standard deviations above its average? Extract the id and gender of these people.

> # a
> mean(hsb2[hsb2$ses == 1, "read"])
> mean(hsb2[hsb2$ses == 3, "read"])
> # Or...
> mean(hsb2$read[hsb2$ses == 1])
> mean(hsb2$read[hsb2$ses == 3])
> # b
> m = mean(hsb2$read)
> sum(hsb2$read > m)
> # c
> sum(hsb2$read > m & hsb2$female == 1)
> # d
> mean(hsb2$write[hsb2$read > m])
> # e
> s = sd(hsb2$read)
> i = hsb2$read > m+2*s
> sum(i)
> hsb2[i, c("id","female")]

9 Missing values

In R missing values are coded: NA (standing for “not available”). NA is a special value (note 20) that can appear in any type of object. Data with some other missing value code (note 21) should be recoded as NA.

Arithmetic and conditional operations propagate NA. Descriptive functions have an option na.rm, (standing for “na remove”). Function is.na tests for NA (note 22).

> x = c(NA,2,4,6,NA,10)
> mean(x)                     # NA propagated through arithmetic
> mean(x, na.rm=TRUE)         # Remove NA before calculating mean

> sum(x == NA)                # The wrong way to count NA
> sum(is.na(x))               # How many NA?
> sum(!is.na(x))              # How many NOT NA?

Note 20: Other special values are NaN (not a number) and Inf (infinity).

Note 21: Function read.table understands NA in data. By default it also interprets empty cells (space) as missing and recodes them as NA. Other codes such as -999 and “.” need to be handled, either by specifying missing value codes to read.table using its na.strings option, or by recoding columns of the data frame after the data have been read.

Note 22: See also complete.cases and na.omit.

Exercise 17. Given a numeric vector x made like this:

> x = sample(c(1:10,-999), size=100, replace=TRUE)

a. Recode -999 as NA, (denoting a missing value).
b. Calculate the sample mean omitting NA.
c. Calculate the proportion of the sample that is missing.

> x[x == -999] = NA           # a.
> mean(x, na.rm=TRUE)         # b.
> mean(is.na(x))              # c.

10 Factors

A factor is a kind of vector used to represent categorical variables and grouping indicators. It is a numeric vector with an associated vector of labels for the categories called levels. The order of the levels determines how the categories are coded: the first level is coded 1, the second level is coded 2, and so on. The coding is always: 1, 2, ..., nlevels, so the factor can be used as an index.

Factors provide:
1. Economical storage of categorical variables that also enables indexing by category.
2. Category labelling in tables and plots.
3. Dummy variables and contrast coding for categorical variables in regression.

When raw data has categorical variables in the form of numerical grouping indicators (note 23) you may want to convert these to factors using the labels given in the data codebook.

Note 23: Numeric vectors that represent groups, such as 0 and 1 representing male and female. In R versions before 4.0.0, character vectors are automatically converted to factors by read.table; from R 4.0.0 they are left as character unless stringsAsFactors=TRUE is given, (see the stringsAsFactors option).

The codebook for the hsb2 data looks like this:

id         ID number
female     Gender                   0=male, 1=female
race       Ethnicity                1=hispanic, 2=asian, 3=black, 4=white
ses        Socioeconomic status     1=low, 2=medium, 3=high
schtyp     School type              1=public, 2=private
prog       High school program      1=general, 2=academic, 3=vocational
read       Reading score
write      Writing score
math       Maths score
science    Science score
socst      Social studies score

Function factor makes factor variables (note 24).
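A small sketch of how levels, labels and internal codes fit together, using made-up 0/1 grouping codes rather than the hsb2 data. The names g, f, codes, colour and counts are invented for this illustration.

```r
g = c(1, 0, 0, 1, 1)                                    # made-up numeric grouping indicator
f = factor(g, levels=0:1, labels=c("male","female"))    # first level is coded 1, second is coded 2

codes  = as.numeric(f)             # internal code numbers: 2 1 1 2 2
colour = c("blue","red")[f]        # the factor used as an index, one colour per category
counts = table(f)                  # counts labelled by category

print(levels(f))
print(codes)
print(colour)
print(counts)
```

Because the codes always run from 1 to nlevels, a vector of per-category settings such as colours can be indexed directly by the factor.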
> hsb2$female = factor(hsb2$female, levels=0:1, labels=c("male","female"))
> hsb2$ses = factor(hsb2$ses, levels=1:3, labels=c("low","medium","high"))

> nlevels(hsb2$female)               # Number of levels
> levels(hsb2$female)                # Levels
> as.numeric(hsb2$female)            # Internal code numbers
> c("blue","red")[hsb2$female]       # Indexing by category

Note 24: See also functions gl and cut.

The : operator with factors derives a factor by crossing.

> g = with(hsb2, female:ses)         # Crossed factors
> nlevels(g)                         # 2 x 3 = 6
> levels(g)

Function cut produces a factor (note 25) by cutting a continuous variable at given points along its range.
Function quantile produces cut-points along the range of a given sample to delimit its quartiles. Other intervals may be defined by probabilities.

> quantile(hsb2$science)
> cut(hsb2$science, breaks=quantile(hsb2$science))

Note 25: The factor levels are labelled to indicate the intervals by default. In these labels, round brackets indicate which side of the interval is open. For example a left round bracket denotes an interval “open on the left”, meaning that data exactly on the left boundary of an interval is not grouped with that interval; it falls in the interval below. With breaks from quantile the minimum value lies on the lowest boundary, so it becomes NA unless include.lowest=TRUE is given.

Exercise 18. Make a random sample, N=100, of the numbers 1:3. Suppose this is a sample of responses to a 3-level item in a questionnaire. Convert it to a factor, labelling the responses: no (for 1), maybe (for 2), and yes (for 3).

> x = sample(1:3, size=100, replace=TRUE)
> factor(x, levels=1:3, labels=c("no","maybe","yes"))

11 Tables

11.1 Categorical variables

table makes contingency tables and cross-tabs.
prop.table converts counts to proportions.

> table(hsb2$female)                 # Counts
> x = table(hsb2$ses, hsb2$female)   # Cross-tab
> prop.table(x, margin=1) * 100      # Percentage of row sum
> prop.table(x, margin=2) * 100      # Percentage of column sum

11.2 Continuous variables

sapply applies a function (note 26) to data frame columns.

> mean(hsb2$read, na.rm=TRUE)             # One column
> sapply(hsb2[7:11], mean, na.rm=TRUE)    # Several columns
> sapply(hsb2[7:11], sd, na.rm=TRUE)

Note 26: The help pages for many functions such as sapply show a special option “...”. This is used to pass one or more options through to functions called within the function. In sapply it is used to pass further options to whatever function is being applied. For example it is often used to pass na.rm=TRUE through to summary functions such as mean and sd to handle missing values.

A custom function is defined like this:

name = function(arguments) {
    R code for whatever the function does
    The value of the last line is returned
}

For example a function (note 27) to calculate the mean and standard deviation:

> mean.sd = function(x, ...) {
+     # Return a list containing the mean and sd of x
+     m = mean(x, ...)
+     s = sd(x, ...)
+     list(Mean=m, SD=s)
+ }
> mean.sd(hsb2$read, na.rm=TRUE)             # One column
> sapply(hsb2[7:11], mean.sd, na.rm=TRUE)    # Several columns

> # "Anonymous" function
> sapply(hsb2[7:11], function(x) list(Mean=mean(x), SD=sd(x)))

Note 27: A custom function can be defined separately and assigned a name so it can be saved and re-used. Or it can be defined without a name directly where it is used, (called an “anonymous” function).

describe, (in package psych), provides a range of summary statistics.

> library(psych)
> describe(hsb2[7:11])

11.3 Grouped data

tapply applies a given function to a vector grouped by factors (note 28).
by applies a given function to a data frame with rows grouped by factors.

> with(hsb2, tapply(science, female, mean))
> with(hsb2, tapply(science, list(ses,female), mean))        # Cell means
> with(hsb2, by(hsb2[7:11], list(ses,female), describe))

Note 28: If you group by two or more factors they must be passed as a list.

Exercise 19. The warpbreaks data frame is typical of a factorial experimental design. There is a numeric outcome variable, (breaks), and factors that group observations by experimental conditions, (wool and tension).
a) Calculate the mean breaks at each of the three levels of tension, (averaged over the levels of wool).
b) Calculate the mean breaks at each level of wool and tension.

> with(warpbreaks, tapply(breaks,tension,mean))
> with(warpbreaks, tapply(breaks,list(wool,tension),mean))

Exercise 20. Calculate the mean of each sample quartile of science scores in the hsb2 data. First use cut with quantile to derive a factor to group the science scores into quartiles. Then use tapply to apply mean to the science scores grouped by the factor.

> # include.lowest=TRUE keeps the minimum score in the first quartile
> f = cut(hsb2$science, breaks=quantile(hsb2$science), include.lowest=TRUE)
> tapply(hsb2$science, f, mean)

11.4 Saving tables

> x = with(hsb2, tapply(science, list(female,ses), mean))
> x = round(x, 2)             # Round to 2dp

sink diverts output to a text file.

> sink("mytable.txt")         # Turn on saving
> x                           # All output here is diverted
> sink()                      # Turn off saving

write.table and write.csv write tables to files in your current working directory.

> write.table(x, "mytable.txt", quote=FALSE, sep="\t")
> write.csv(x, "mytable.csv")    # Opens in Excel

12 Graphics

“High-level” functions (note 29) create new graphs with axes. “Low-level” functions add further graphics, (points, lines, text, etc.), to a graph already created by a high-level function.

Note 29: Functions for R “base graphics” are in the graphics library, loaded by default whenever you start R. Other graphics libraries are: lattice and ggplot2.

12.1 High-level functions

plot is a generic function for scatter plots. It calls different methods depending on its data arguments (note 30):
1. One argument that is a data frame (note 31) or matrix. The first column is the x-values, the second the y-values.
2. Two arguments: x,y that are coordinates (note 32).
3. One argument that is a formula (note 33): y~x. The left side is the y-values, the right side the x-values.

> plot(hsb2[9:10])
> plot(hsb2$math, hsb2$science)
> plot(science~math, hsb2)

Note 30: See: help(xy.coords).
Note 31: See help(plot.data.frame).
Note 32: See help(plot.default).
Note 33: See help(plot.formula).

hist is a histogram.

> hist(hsb2$science)
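As a hedged check that the three calling forms describe the same points, the sketch below uses the built-in cars data frame (columns speed and dist) instead of hsb2, and xy.coords, the helper named on the help pages for plot's data arguments. The names a, b and same are invented here. The first line is an assumption about how the code is run: it opens a null graphics device only when R is not interactive, so nothing is written to disk.

```r
if (!interactive()) pdf(NULL)            # null device when run as a script (assumption)

a = xy.coords(cars)                      # 1. a data frame: first column x, second column y
b = xy.coords(cars$speed, cars$dist)     # 2. two coordinate vectors

# Both forms resolve to the same x and y coordinates
same = isTRUE(all.equal(a$x, b$x)) && isTRUE(all.equal(a$y, b$y))
print(same)

plot(cars)                               # data frame form
plot(cars$speed, cars$dist)              # x,y form
plot(dist ~ speed, cars)                 # formula form: y ~ x
```

The three plots differ only in their default axis labels, which are taken from however the data were named in the call.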
> plot(science~math, hsb2, ylim=c(0,100), xlim=c(0,100)) High level graphics functions plot Scatterplot pairs Scatterplot matrix coplot Conditioning plot hist Histogram stem Stem-and-leaf plot boxplot Box-and-whisker plot qqnorm Quantile-quantile plot barplot Bar plot dotchart Dot plot interaction.plot Profile plot of group means > > > > > > Plot type36 x = seq(0, 2*pi, length=50) y = sin(x) plot(y~x, type="n") # No plot plot(y~x, type="p") # Points (default) plot(y~x, type="b") # Points and lines plot(y~x, type="l") # Lines Plot characters37 > plot(science~math, hsb2, pch=2) > plot(1:25, pch=1:25) Exercise 21. Make a qqnorm plot of the science scores. # Character code # The codes > qqnorm(hsb2$science) > pch = c(4,20) > plot(science~math, hsb2, pch=pch[female]) Exercise 22. Make a boxplot to show the science scores for each gender. > boxplot(science ~ female, hsb2) Exercise 23. Make a pairs plot of read, write, math, science, and socst in the hsb2 data. Use function cor to calculate the correlation matrix for these variables. > pairs(hsb2[7:11]) > cor(hsb2[7:11]) 12.2 > > > > Optional arguments passed to high and low-level graphics functions control the look, (labels, colours, sizes, styles). See: help(par). > plot(science~math, hsb2, + main="High School scores", + ylab="Science", + xlab="Maths") Line type38 and width x = seq(0, 2*pi, length=50) y = sin(x) plot(y~x, type="l", lty=2) plot(y~x, type="l", lwd=2, lwd=3) # Character expansion # Line type # Line width Colour39 > colours() # Colour names > plot(science~math, hsb2, col="blue") Main title and axis labels34 . 34 See: > plot(1:20, cex=1:20) # All are strong positive correlations Graphical options Character expansion (size) 35 See: help(plot.default). By default the range is calculated so the data fills the plot. help(plot). 37 See: help(points). 38 Line types set by number: lty=1 solid, 2 dashed, 3 dotted, 4 dotdash, 5 longdash, 6 twodash. See also: help(par). 39 See: help(colours) and help(rgb). 
36 See: # Main title # Label y-axis # Label x-axis help(title). 12 Low level graphics functions abline Draw a line (intercept and slope, horizontal or vertical) points Plot points at given coordinates lines Draw lines between given coordinates text Draw text at given coordinates mtext Draw text in the margins of a plot axis Add an axis arrows Draw arrows segments Draw line segments rect Draw rectangles polygon Draw polygons box Draw a box around the plot grid Add a rectangular grid legend Add a legend (a key) title Add labels > col = c("blue","red") > plot(science~math, hsb2, col=col[female]) > plot(science~math, hsb2, col=col[female], pch=20) Labels can take multiple options40 in a list. > plot(science~math, hsb2, font=2, las=1, + ylab=list("Science", font=2, col="green4"), + xlab=list("Maths", font=2, col="green4")) Exercise 24. Plot a histogram of read in the hsb2 data. Give it a nice colour, (maybe “lightblue”), and tidy up the labelling, (maybe main="" and xlab="Reading score"). Exercise 25. Scatter plot science (y-axis) on math (x-axis) in the hsb2 data, indicating the prog (educational program) of each person by colour. Add points at the bivariate means of the three levels of prog. Add a legend to the plot. > hist(hsb2$read, col="lightblue", main="", xlab="Reading score") > > > > > > > 12.3 col = c("red","green4","blue") # Colours txt = c("General","Academic","Vocational") # Legend text x = with(hsb2, tapply(math, prog, mean)) # x-values of means y = with(hsb2, tapply(science, prog, mean)) # y-values of means plot(science~math, hsb2, col=col[prog]) # Scatterplot points(x,y, pch=16, cex=3, col=col) # Add 3 points legend("bottomright", legend=txt, pch=16, col=col) Low-level functions A high-level graphics function must first open a new plot. Low-level functions add graphics to the open plot. 
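The two-step pattern can also be tried on the built-in cars data, so that it runs without the hsb2 file. This is only a minimal sketch: the choice of cars and the names mx and my are illustrative assumptions, not part of the course material.

```r
# High-level call: opens a new plot with axes
plot(dist ~ speed, cars, main="Stopping distance",
     xlab="Speed", ylab="Distance")

# Low-level calls: each one adds to the plot that is already open
mx = mean(cars$speed)                           # x-value of the bivariate mean
my = mean(cars$dist)                            # y-value of the bivariate mean
abline(h=my, v=mx, col="grey")                  # Reference lines at the means
points(mx, my, pch=16, cex=2, col="red")        # Mark the bivariate mean
text(mx, my, "mean", pos=4)                     # Label to the right of the point
legend("topleft", legend="Bivariate mean", pch=16, col="red")
```

Running any of the low-level calls before a high-level call gives an error, because there is no open plot to add to.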
> plot(science~math, hsb2)                 # High-level plot
> # Low-level functions
> abline(h=mean(hsb2$science), v=mean(hsb2$math), col="grey")
> text(x=73,y=50, "average")
> points(science~math, subset(hsb2,ses==3), pch=20, col="red")
> legend("bottomright", legend="High SES", pch=20, col="red")

12.4 Multiple plot layout

Multiple plots41 in one window

> par(mfrow=c(2,2))                        # 2 x 2 layout
> hist(hsb2$read, main="", xlab="read")
> hist(hsb2$write, main="", xlab="write")
> hist(hsb2$math, main="", xlab="math")
> hist(hsb2$science, main="", xlab="science")

> # Using mapply to vectorize over the column names
> par(mfrow=c(2,2))
> mapply(hist, hsb2[7:10], main="", xlab=names(hsb2[7:10]))

Multiple windows42.

> windows()                                # Open a window
> pairs(hsb2[7:11])
> windows()                                # Open another window
> boxplot(hsb2[7:11])

40 Typeface is set by number: font=1 plain, 2 bold, 3 italic, 4 bold italic. See also: help(expression) and help(plotmath).
41 See also: layout and split.screen.

12.5 Saving graphs

Copy and paste: Right-click on a graph and choose Copy as metafile. Paste into Word or PowerPoint.
Printing: Right-click on a graph and choose Print...
Save as a PDF43:

> pdf(file="myplot.pdf")                   # Open file
> plot(science~math, hsb2)                 # Plot
> dev.off()                                # Flush output and close the file

13 Hypothesis tests

t.test performs one and two-sample t tests. Two ways to specify the data for a two-sample test are:

1. Two arguments: x,y, that are two44 sample vectors.

> x = with(hsb2, science[female=="male"])
> y = with(hsb2, science[female=="female"])
> t.test(x,y)

2. A single formula argument: y∼x, where y contains both samples and x is a grouping indicator.

> t.test(science~female, data=hsb2)

Objects returned by testing and modelling functions contain multiple values that can be extracted by name45 using $.

> fit = t.test(science~female, hsb2)
> names(fit)                               # What are the component names?
> fit$p.value                              # Get component "p.value" by name

Some hypothesis tests
t.test        t test of means
wilcox.test   Wilcoxon (non-parametric) test
var.test      F-test of variance
cor.test      Correlation (Pearson, Spearman, or Kendall)
binom.test    Test of proportion in a two-valued sample
prop.test     Test of proportions in several two-valued samples
chisq.test    Chi-squared test for count data
fisher.test   Fisher’s exact test for count data
ks.test       Kolmogorov-Smirnov goodness-of-fit test
shapiro.test  Shapiro-Wilk normality test

Exercise 26. Suppose you are a referee who tosses the same coin at the start of every match. To test whether the coin is fair you carry out an experiment by tossing it 50 times. Suppose you get 18 heads. Use binom.test to test whether the coin is fair.

> # The research hypothesis is the alternative hypothesis.
> # The null hypothesis is set up counter to the research hypothesis,
> # because it is what you want to reject by significance.
> # The null hypothesis can contain equality, (either ==, or <=, or >=),
> # but not inequality, (!=). This allows the alternative hypothesis to
> # contain inequality, which is denoted "two-sided", (the default).
> # The research question here has to be: "is the coin unfair".
> # This allows a "two-sided" alternative hypothesis that the
> # probability of success, (heads), is not equal to the hypothesized
> # probability, (0.5, that heads and tails are equally likely).
> # It implies the null that the probability of heads is equal to 0.5.
> # The result is not significant, so we cannot reject the null that
> # the coin is actually fair.
> binom.test(18, 50, p=0.5)

Exercise 27. Carry out a t test that on average there is no difference between the write and read scores in the hsb2 data.

42 Under MacOS the command equivalent to windows() is X11(). This requires X client libraries and access to an X server such as is provided by XQuartz. See: https://support.apple.com/engb/HT201341.
43 See: help(Devices) for image files: jpeg, bmp, png, etc.
44 The default option y = NULL indicates that y is unused and thereby specifies a one-sample t-test of the mean of x. 45 The names are given in the Value section of the function’s help page. 14 > > > > > > > > # # # # # # # # This is either a one-sample t test of the mean of the difference scores, or a two-sample paired t test of the difference between the means. (Same thing). The alternative hypothesis is an inequality, either that the mean difference is not 0, or that the difference in means is not 0. This implies the null is that the mean difference is 0. The result is non-significant, so we cannot reject the null that the two sets of scores have equal means. 14 > t.test(hsb2$write-hsb2$read) > t.test(hsb2$write, hsb2$read, paired=TRUE) Linear models 14.1 Exercise 28. A researcher predicted that, on average, female students would score higher than male students in the social studies test, (socst). Carry out a t test to see if the data support this. > > > > > > > > > > > > > > > > > # The research hypothesis was that the mean score for male students # would be lower than the mean score for female students. # According to help(t.test): # alternative="greater" is the alternative that x has larger mean than y. # From that we can assume, since this option can also take "less", that: # alternative="less" is the alternative that x has lower mean than y. # Here x refers to the first argument, (using t.test(x,y)), or to # the first level of a grouping factor x, (using t.test(y~x)). # This is male, (originally coded 0). Therefore specify # alternative="less" for the alternative that x (male) has a lower # mean than y (female). This implies the null that the mean score for # male was equal to or greater than for female. # The result was non-significant so the researcher could not reject # the null. The observed mean for male was actually lower than for # female in this sample, but not lower enough that we could believe # it to be consistently lower in 95% of samples. 
t.test(socst~female, hsb2, alternative="less") Exercise 29. Use cor.test to test the correlation between science and socst in the hsb2 data: a. Using Pearson’s correlation, (the default). b. Using Spearman’s rank correlation46 , (method="spearman"). > > > > > > > > method is recommended if the data are not bivariate normal. To test normality see the qqnorm plot, shapiro.test, and further functions in package MVN. Function lm fits a linear regression model by ordinary least squares. Its first argument is a formula that specifies the model: y ∼ model The left-hand side (y) is the dependent variable, (response or outcome). The right-hand side specifies the independent variables, (predictors), as terms separated by +. “+” is used to include terms in the model. “-” is used to exclude terms. “:” is used to form product terms (interactions). “*” is shorthand for main effects and interaction. “1” denotes the intercept, (0 or -1 excludes it). “I” is a function used to include arithmetic47 within formulas. Formula y ∼ 1 y ∼ x y ∼ x+I(x^2) y ∼ x1+x2 y ∼ x1+x2+x1:x2 y ∼ x1*x2 14.2 > > > > > > # The default alternative hypothesis is that the correlation is not # equal to 0. This implies the null that the correlation is 0. # The result is significant (p<.001) so we can reject the null and # take it that these scores are correlated. # Spearman's correlation is Pearson's correlation applied to the # ranks: cor.test(rank(hsb2$science), rank(hsb2$socst)) cor.test(hsb2$science, hsb2$socst) # "a" cor.test(hsb2$science, hsb2$socst, method="spearman") # "b" 46 This The model formula Model equation y = β0 y = β0 + β1 x y = β0 + β1 x + β2 x2 y = β0 + β1 x1 + β2 x2 y = β0 + β1 x1 + β2 x2 + β3 x1 x2 Intercept only Simple regression Quadratic Multiple regression Main effects and interaction (shorthand) Simple and multiple regression Pass lm a formula to specify the model and a data frame containing the variables named in the formula. 
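The formula operators can be checked on any data frame. The following is a minimal sketch using the built-in mtcars data, (an assumption made for illustration: mtcars and the object names fit.a, fit.b, fit.q and fit.n are not part of the course data).

```r
# Built-in mtcars data: mpg is the outcome, wt and hp are numeric predictors
fit.a = lm(mpg ~ wt + hp + wt:hp, mtcars)   # Main effects and interaction
fit.b = lm(mpg ~ wt * hp, mtcars)           # Shorthand for the same model
all.equal(coef(fit.a), coef(fit.b))         # The estimates agree

# I() protects arithmetic, so this fits a quadratic in wt
fit.q = lm(mpg ~ wt + I(wt^2), mtcars)
names(coef(fit.q))

# Without I() the ^ is a formula operator and wt^2 reduces to wt
fit.n = lm(mpg ~ wt + wt^2, mtcars)
names(coef(fit.n))
```

Comparing the coefficient names of fit.q and fit.n shows why I is needed: the second model silently drops the squared term.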
data(hills, package="MASS") fit0 = lm(time~1, hills) # Intercept only fit1 = lm(time~dist, hills) # Simple regression fit2 = lm(time~dist+climb, hills) # Multiple regression fit3 = lm(time~dist+climb+dist:climb, hills) # Interaction fit4 = lm(time~dist*climb, hills) # (shorthand) coef gets the model coefficients. > coef(fit0) > coef(fit2) > coef(fit4) 47 Operators “+” and “-” in formulas are used to include or exclude terms, not add or subtract variables. The “I” function allows you to escape that and do some arithmetic. Typical examples are centering variables such as: I(x-mean(x)), and raising variables to powers such as: I(x^2) and I(x^3). 15 14.3 > > > > Testing the parameter estimates summary.lm tests the estimates and the model goodness of fit48 . confint provides confidence intervals around the estimates. vcov provides the estimates variance-covariance matrix49 . anova provides sums of squares and mean squares50 . summary.lm(fit4) confint(fit4) vcov(fit4) anova(fit4) > > > > > > cars$cspeed = cars$speed - mean(cars$speed) # Derive variable fit2 = lm(dist ~ cspeed, cars) coef(fit2) plot(dist ~ cspeed, cars) abline(coef(fit2)) abline(h=mean(cars$dist), v=0, col="grey") Exercise 32. Fit a quadratic model of dist on cspeed and cspeed squared. Scatter plot dist on cspeed and add the regression line. Note that abline adds straight lines only and can’t be used to plot a quadratic curve. The method is to predict points along the curve and use function lines to plot a line through them. Use function predict.lm to get model predictions over the range between the min and max of cspeed. # 95% CI for estimates # Variance-covariance matrix # ANOVA table Extracting information from the fit summary51 . > > > > > > > > > > > names(summary.lm(fit4)) # What are the component names? > summary.lm(fit4)$coef # Estimates and tests Exercise 30. The cars data frame has two variables that are measures of the speed and stopping distance of some old cars. 
Fit a linear regression model of dist (dependent variable) on speed (independent variable). Extract the estimated coefficients. # The model is: # y = b0 + b1.x + b2.x^2 # dy/dx = b1 + 2.b2.x # Coefficient b1 is the slope where x=0, (at its centered value). fit3 = lm(dist ~ cspeed + I(cspeed^2), cars) coef(fit3) plot(dist ~ cspeed, cars) x = min(cars$cspeed):max(cars$cspeed) y = predict.lm(fit3, data.frame(cspeed=x)) lines(x, y) > fit1 = lm(dist ~ speed, cars) > coef(fit1) # Intercept (dist where speed==0) and slope Exercise 31. Derive a new variable for the cars data named cspeed that is speed centered on its mean. Fit the linear regression model of dist on cspeed. Extract the coefficients and compare them with the previous (uncentered) coefficients. Scatter plot dist (y-axis) on cspeed (x-axis), and use function abline to add the regression line to the plot, (see the coef option in help(abline)). Also add a horizontal line at the mean dist and a vertical line at 0, (see options h and v in help(abline)). > > > > > > 14.4 48 The F statistic is the ratio of explained to unexplained variance. These components can be derived from the ANOVA table, (function anova), by dividing sums of squares by degrees of freedom. 49 The diagonal contains the variances of the estimates, (their squared standard errors), and the off-diagonal elements are covariances between estimates. Function cov2cor can be used to derive the correlation matrix: cov2cor(vcov(fit4)). 50 anova calculates a sequential ANOVA table using “type-I” sums of squares. Results may depend upon term order. See help(anova.lm). For “type-II” and “type-III” sums of squares see function Anova in the car library. 51 See the “Value” section of: help(summary.lm). 16 Multiple R-squared52 . > summary.lm(fit2)$r.squared # Means intersect on the regression line. # y = b0 + b1.x # ybar = 1/n . sum(b0 + b1.x) # = b0 + b1.xbar # Intercept is dist where cspeed=0, which is mean(speed). # So intercept is mean(dist). 
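The derivation above can be checked numerically. This is a small sketch on a local copy of the built-in cars data; the names d, fit.raw and fit.cen are illustrative and chosen so that the course objects fit1 and fit2 are not overwritten.

```r
d = cars                                    # Local copy of the built-in data
d$cspeed = d$speed - mean(d$speed)          # Centre speed on its mean
fit.raw = lm(dist ~ speed, d)               # Uncentered model
fit.cen = lm(dist ~ cspeed, d)              # Centered model
coef(fit.raw)
coef(fit.cen)

# Centering leaves the slope unchanged
all.equal(unname(coef(fit.raw)["speed"]), unname(coef(fit.cen)["cspeed"]))

# The centered intercept equals the mean of the outcome
all.equal(unname(coef(fit.cen)["(Intercept)"]), mean(d$dist))
```

Both comparisons should return TRUE, which is the statement that the regression line passes through the point of means.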
Effect sizes > > > > > # R-squared Standardized estimates53 , (“beta weights”). Function scale takes a matrix or data frame and returns it as a matrix with standardized columns, (each column centered on its mean and scaled into standard deviation units). model = time ~ dist + climb # Specify the model formula fit2 = lm(model, data=hills) # Raw fit2s = lm(model, data=data.frame(scale(hills))) # Standardized summary.lm(fit2)$coef # Native units summary.lm(fit2s)$coef # SD units ("beta weights") 52 R-squared measures the proportion of outcome variation explained by the model as a whole, equivalent to eta-squared in ANOVA. There is also an adjusted version that accounts for the number of terms, equivalent to omega-squared. 53 Beta weights measure the effect of individual model terms in standard deviation units, similar but not directly equivalent to partial eta-squared which aims to measure the proportion of outcome variation explained. See package MBESS for standardized mean differences, (Cohen’s d). 14.5 Goodness of fit Diagnostic54 plots. > par(mfrow=c(2,2)) > plot(fit2) 14.7 # 4 plots > > > > > Residuals55 . > residuals(fit2) > rstandard(fit2) > summary(fit2)$sigma^2 # Residuals # Standardized residuals # Residual variance lm has an na.action option to specify a function56 to treat missing values. na.omit (the default) treats missing values by listwise deletion57 . na.exclude propagates NA to subsequent functions such as residuals and predict. summary(airquality) fit1 = lm(Ozone~Solar.R+Wind, airquality) fit2 = lm(Ozone~Solar.R+Wind, airquality, na.action=na.exclude) length(predict(fit1)) # Missing values omitted length(predict(fit2)) # Padded with NA to the correct length 14.8 Exercise 33. Fit a linear regression of science on math in the hsb2 data. Use function rstandard to extract the standardized residuals, and with these identify people whose measures are more than 3 standard deviations from the regression line. How many are there and what are their id? 
Re-fit the model excluding these people. Extract the multiple R-squared, (the proportion of outcome variance explained). > > > > > > fit1 = lm(science ~ math, hsb2) i = abs(rstandard(fit1)) > 3 sum(i) hsb2$id[i] fit2 = lm(science ~ math, hsb2[!i,]) summary(fit2)$r.squared # # # # # Identify outliers Count outliers "id" of outliers Re-fit excluding outliers Extract R-squared Factors are group indicators. The first level indicates the “reference” group. Factors can appear in a model formula. They are automatically converted to dummy numeric variables with values given by contrast coding58 . contrasts gets and sets a factor’s contrast coding scheme. model.matrix shows the dummy variables and contrast coding. > contrasts(warpbreaks$tension) # Default coding > model.matrix(~tension, data=warpbreaks) # Dummy variables aov fits an ANOVA model by ordinary least squares59 . summary.aov displays the ANOVA table60 . summary.lm tests the coefficients and overall fit. Testing blocks of terms Some shortcuts for model formula syntax: All variables in the data frame are included by “.” Variables are excluded by “-” 1-way ANOVA: > with(warpbreaks, tapply(breaks, tension, mean)) # Group means 56 See help(na.fail). whole row is omitted if a value is missing for any variable mentioned in the formula, dependent or independent. 58 The default contrast coding is 0,1 dummy coding, called “treatment contrasts” in R. The coefficients have a simple interpretation: the intercept is the mean of the reference group, and other coefficients are mean differences between a group and the reference group. Treatment contrasts are not orthogonal. Hence the message: “Estimated effects may be unbalanced”, (which can be ignored if the design is balanced). Orthogonal contrasts are available, (see help(contr.helmert)). 59 aov and lm are the same calculation with results displayed differently: lm shows model coefficients, aov shows sums-of-squares. 
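The statement that aov and lm are the same calculation, and the interpretation of treatment contrasts, can both be checked on the built-in warpbreaks data. A minimal sketch, (the object names m, fit.lm and fit.aov are illustrative only):

```r
m = with(warpbreaks, tapply(breaks, tension, mean))   # Group means: L, M, H
fit.lm  = lm(breaks ~ tension, warpbreaks)
fit.aov = aov(breaks ~ tension, warpbreaks)
coef(fit.lm)

# aov and lm give the same coefficient estimates
all.equal(unname(coef(fit.lm)), unname(coef(fit.aov)))

# Treatment contrasts: the intercept is the mean of the reference group (L),
# the other coefficients are differences from the reference group mean
expected = c(m["L"], m["M"] - m["L"], m["H"] - m["L"])
all.equal(unname(coef(fit.lm)), unname(expected))
```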
60 summary.aov calculates a “sequential” ANOVA table using “type-I” sums-of-squares. Terms are assessed in model order, except interaction terms are assessed after main effects. Results may depend upon term order if the design is not balanced, (if the count is not the same in each cell). See Anova in package car for “type-II” and “type-III” sums-of-squares. 57 A > fit1 = lm(Fertility~., swiss) > fit2 = lm(Fertility~.-(Examination+Agriculture), swiss) Model comparison (likelihood ratio) using anova: > anova(fit1,fit2) 54 See: # Test a block of terms help(plot.lm), and also: help(influence.measures), and help(vif) in the car library. residual variance is defined, (see help(summary.lm)), as the sum of the squared residual deviations divided by n-p, where n is the sample size and p is the number of model terms including the intercept. This is given by: sum(residuals(fit)^2)/(n-p), since the mean of the residuals is 0. 55 The ANOVA 14.6 Missing values 17 > fit = aov(breaks ~ tension, warpbreaks) # 1-way ANOVA > summary.aov(fit) # ANOVA table > summary.lm(fit) # Coefficients 2-way ANOVA: > # Is the design balanced? > with(warpbreaks, table(wool,tension)) > # 2-way patterns of group means > with(warpbreaks, tapply(breaks,list(wool,tension),mean)) > with(warpbreaks, interaction.plot(tension,wool,breaks)) > > > > > fit1 = aov(breaks~tension+wool, data=warpbreaks) # Main effects fit2 = aov(breaks~tension*wool, data=warpbreaks) # Interaction summary.aov(fit1) summary.aov(fit2) summary.lm(fit2) Exercise 34. With the hsb2 data, use aov to carry out a 1-way ANOVA of the science scores grouped by prog (the high school program). a) Use summary.aov to calculate the ANOVA table. b) Use summary.lm to assess the model coefficients61 . > > > > with(hsb2, tapply(science, prog, mean)) fit = aov(science~prog, data=hsb2) summary.aov(fit) summary.lm(fit) > > > > > > > > > > > > > > > > > > > > > > > # Intercept = expected science score for male with average read. 
# cread = slope of science on read relationship for male. # femalefemale = change in intercept from male to female. # cread:femalefemale = change in slope from male to female. # The interaction is non-significant but large enough to matter. # The size and sign of the "femalefemale" effect will change # depending upon how you centre "read". # Because "cread" is mean centered the "femalefemale" effect is # the "average treatment effect" of gender. hsb2$cread = hsb2$read - mean(hsb2$read) # Derive variable fit = lm(science ~ cread * female, hsb2) round(summary.lm(fit)$coef, 4) x = min(hsb2$cread) : max(hsb2$cread) # Range to predict over # Predictions with female variable coded 0:1 y0 = predict.lm(fit, data.frame(cread=x, female=0)) y1 = predict.lm(fit, data.frame(cread=x, female=1)) # Or predictions with factor with levels "male" and "female" y0 = predict.lm(fit, data.frame(cread=x, female=factor("male"))) y1 = predict.lm(fit, data.frame(cread=x, female=factor("female"))) plot(science ~ cread, hsb2) lines(y0~x, lwd=2, col="blue") lines(y1~x, lwd=2, col="red") legend("bottomright", legend=c("male","female") , lwd=2, col=c("blue","red")) 14.10 14.9 ANCOVA Both numeric and factor variables can appear in a model formula. Factors become dummy numeric variables. Exercise 35. Derive a new variable for the hsb2 data named cread that is read centered on its mean. Fit a regression of science on cread and female, including their interaction. Extract the table of estimated coefficients, standard errors, and p-values, rounded to 4 dp. Scatter plot science on cread and add two predicted regression lines: one for female and the other for male. Use function predict.lm to get model predictions over the range between the min and max of cread for female, and then again for male. Use function lines to add each line to the plot. > > > > Generalized linear models glm fits generalized linear regression by maximum likelihood. A single argument specifies the model as a formula. 
A single argument named family specifies the response distribution62 and link function. fit1 = lm(dist~speed, cars) fit2 = glm(dist~speed, cars, family=gaussian(link="identity")) summary.lm(fit1) summary.glm(fit2) Exercise 36. Load the car library to access the dataset named Cowles. (See its help page: help(Cowles)). 61 Comparisons with the reference group can be changed by making a factor with a different reference level. See function relevel. Post-hoc tests can be carried out. See functions pairwise.t.test and TukeyHSD. 18 62 The default response distribution is normal, (gaussian), and the default link is the identity (do nothing) function. Results should be the same as lm. See help(family) and help(make.link) for the range of response distributions and link functions provided. For logit (logistic) regression use: family=binomial(link="logit"). For probit regression use: family=binomial(link="probit"). For Poisson regression use: family=poisson(link="log"). a) Use glm with family=binomial(link="logit") to fit a logistic regression model of volunteer predicted by sex. b) Use coef to extract the model coefficients, and exp to anti-log them for odds ratios. c) Update the model to control for extraversion and neuroticism. > > > > > > > > data(Cowles,package="car") # Unadjusted odds ratio fit1 = glm(volunteer ~ sex, data=Cowles, family=binomial(link="logit")) coef(fit1) # In log odds units exp(coef(fit1)) # In odds units # Intercept (0.8097) is odds that a female will volunteer (reference level) # sexmale (0.779) is odds multiplier from female to male, (less than 1, so odds # of a male volunteering are lower than female) > > + > > > > > > # Control for extraversion and neuroticism fit2 = glm(volunteer ~ sex + extraversion + neuroticism, data=Cowles, family=binomial(link="logit")) exp(coef(fit2)) # extraversion (1.069) is odds multiplier for unit increase in extraversion. # Greater than 1 so more extravert means more likely to volunteer. 
# sexmale (0.790) still less than 1, but not as much lower as before.
# Female still more likely to volunteer, but extraversion and neuroticism
# explain some of the difference between female and male.

15 Linear mixed-effects models

The phf data63 are measures of physical fitness taken on six occasions from a panel of people aged between 40 and 80 years. The data include each person’s age at each occasion, their employment grade at baseline, (coded: 1=high, 2=intermediate, 3=low), and their gender, (coded: 0=male, 1=female).

> dat = read.table("phf.txt", header=TRUE)

Exercise 37. To check the data have been read correctly run commands: dim(dat), names(dat), and summary(dat). You should see 4423 rows of data and 15 column names. The summary shows that the time-varying variables, ("age" and "phf"), have increasing numbers of missing values as people drop out of the study. Suppose for simplicity we decide to restrict analysis to those who complete the study. Derive an index of people with complete records using function complete.cases, and use it to subset the data.

> i = complete.cases(dat)     # Get index
> sum(i)                      # 2308 cases are complete
> dat = dat[i,]               # Subset the data

15.1 Reshaping

Mixed effects models are for clustered64 data. Functions for these models require data in long format65.
Wide format: each row is a complete record of a person’s repeated measures.
Long format: repeated measures of each time-varying variable are stacked into one column. The data include factors to indicate which person and which time-point each measure belongs to.

Function reshape converts between wide and long format data.

direction  Reshape to "long" or "wide"
varying    List of groups of variables to stack
v.names    Name for the stacked variables
idvar      Name for the factor to indicate persons
timevar    Name for the factor to indicate time-points

> i1 = grep("phf", names(dat), value=TRUE)
> i2 = grep("age", names(dat), value=TRUE)
> dat = reshape(dat, direction="long",
+   varying=list(i1,i2),
+   v.names=c("phf","age"),
+   idvar="id", timevar="occ")

15.2 Growth curves

Assume everyone’s growth has the same general form. Growth curve parameters may vary66 between people. Function lmList67 estimates each person’s growth curve parameters.

63 The phf data were originally provided by Jenny Head (University College London), and obtained from the Centre for Multilevel Modelling, (University of Bristol).
64 For example students within schools, health outcomes within regions, and longitudinal data that are repeated measures within persons.
65 Long format data allow rows to be deleted at particular time-points without deleting the person’s repeated measures listwise. Data with missing values at some time-points still contribute information in a mixed effects model.
66 For example if everyone’s growth follows a straight line the parameters are the intercept and slope, but different people may have a different intercept and slope. Parameters that vary are called random effects. The averages they vary about are called the fixed effects.
67 This is a convenience function for running a series of lm regressions using a common model. The model is fitted independently to each person’s repeated measures using ordinary least squares, and a list of lm fits is returned. The formula is the same as for lm, except it also has a bar (“|”) separating the regression model from a variable that indicates the group of data for each regression.
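The idea of a series of within-group regressions can be sketched in base R without lmList. The example below is an illustration under stated assumptions: it uses the built-in ChickWeight data, (repeated weights of chicks over time), in place of the phf data, and split with lapply in place of lmList. It is not how lmList is implemented.

```r
cw = as.data.frame(ChickWeight)             # Built-in repeated measures data
groups = split(cw, cw$Chick)                # One data frame per chick
fits = lapply(groups, function(d) lm(weight ~ Time, d))   # One lm fit per chick
est = t(sapply(fits, coef))                 # One row of coefficients per chick
head(est)
colMeans(est)                               # Average intercept and slope
apply(est, 2, sd)                           # Spread of intercepts and slopes
```

Each row of est holds one chick's intercept and slope. Their averages and spread are rough, descriptive counterparts of the fixed effects and random effects described in the footnotes.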
19 > library(lme4) > library(lmerTest) > > > > > # Straight line growth fit = lm(phf ~ age, dat) fits = lmList(phf ~ age | id, dat) coef(fit) coef(fits) > > > > > > > > + + # Growth curves for two people tmp = subset(dat, id %in% c(2,7)) # Two people tmp = tmp[order(tmp$id),] # Sort by id fits = lmList(phf ~ age | id, tmp) # Within-person regressions tmp$phf2 = predict(fits) # Predicted outcomes plot(phf ~ age, dat, col="lightblue") abline(coef(fit), lwd=2, col="cadetblue") by(tmp, tmp$id, function(tmp) { points(phf ~ age, tmp, type="b") lines(phf2 ~ age, tmp, lwd=2, col="red") }) 15.3 > > > > > > # Overall regression # Within-person regressions # Some methods fixef(fit0) coef(summary(fit0)) print(VarCorr(fit0), confint(fit0) coef(fit0) # Fixed effects # Fixed effects with SEs comp=c("Var","Std")) # Variance components # CIs # Within-person effects 2. Growth model with random intercepts72 . > dat$age50 = dat$age - 50 # Centre age on 50 > fit1 = lmer(phf ~ age50 + (1|id), dat) > summary(fit1) 3. Random intercepts and slopes with covariance73 . > fit2 = lmer(phf ~ age50 + (age50|id), dat) > summary(fit2) 4. Quadratic growth74 . Function anova75 does model comparison by likelihood ratio test. > dat$age50.2 = dat$age50^2 Linear mixed-effects models using function lmer Linear mixed-effects models68 are fitted by function lmer in package lme4. Its first argument is a formula that specifies the model. Fixed effects are specified as terms in the same way as lm. These and/or other terms can be specified as random by entering additional terms within brackets, with a bar (“|”) separating the terms from their associated grouping factor69 . # Squared centred age > fit3 = lmer(phf ~ age50 + age50.2 + (age50|id), dat) > summary(fit3) > anova(fit2, fit3) 5. Time-invariant covariate76 . How do the average growth parameters differ across gender77 ? A set of functions called methods 70 are provided for extracting information from lmer objects. 1. Model of the mean71 . 
> fit0 = lmer(phf ~ 1 + (1|id), dat) > summary(fit0) # See: ?summary.merMod 68 Also called multi-level models, or random effects models, or random coefficients models. These models are for a continuous response variable. See also function glmer for generalized linear mixedeffects models. 69 A single bar (“|”) is used to specify unstructured covariances which are free and estimated. A double bar (“||”) is used to specify a structure in which covariances between random effects for the same grouping factor are fixed at 0. This is the only covariance structure provided by this function. See function lme in package nlme for a wider range of covariance structures. 70 A list of the methods is displayed by: methods(class="merMod"). See also: help(merMod) and help(pvalues). 71 The “empty” or “null” model with a random intercept only. The fixed effect estimates the population grand mean. The variance components divide the total variance into between-person intercept variance and average residual variance within-person. The proportion of the total that is between-person is the variation due to individual difference, (intra-class correlation). For example: 32.11 / (32.11+31.25) = 0.507. 20 72 Specifying random intercepts only, and not random slopes, implies the slopes are parallel. All subjects change in the same way over time, corresponding to “sphericity”. The intercept represents the expected phf at age 50. The slope (age50) represents the rate of linear change with age, the outcome change per year. A negative slope indicates decline. 73 Positive intercept-slope correlation suggests a higher level at baseline is associated with a less steep decline, (a more positive slope). This implies the fan-out pattern of increasing between-person variance with age. Note: the “Correlation of Fixed Effects” reported by the summary method represents correlation between the estimates of the fixed effects expected over multiple experiments. 
For example, a negative correlation suggests estimates that would change in opposite directions. If a subsequent study found a higher average baseline level, it would probably also find a more negative average slope.
[74] The slope effect age50 represents the instantaneous slope at age 50. The quadratic effect age50.2 represents curvature: the rate of change of the slope. Negative curvature indicates a concave trajectory: the rate of decline increases with age.
[75] Is a quadratic growth curve a better fit than a straight line? The difference between the fits of the two models is significant. Note: when comparing models with different fixed effects you should use ML, not REML. For that reason anova will automatically re-fit models if necessary.
[76] A time-invariant covariate, like gender, does not change over time. So it cannot explain within-person residual variation. It explains between-person variation. Intercept and slope (age50) variation are reduced by the gender variable.
[77] The main effect of female represents the change in baseline level from male (coded 0) to female. The interaction age50:female represents the change in instantaneous slope, and age50.2:female represents the change in curvature.

> # Time-invariant covariate: gender (item 5)
> fit4 = lmer(phf ~ (age50 + age50.2) * female + (age50|id), dat)
> summary(fit4)

> # Predicted average growth curves by gender
> age50 = seq(45,75, length.out=100) - 50
> newdata0 = data.frame(age50=age50, age50.2=age50^2, female=0)
> newdata1 = data.frame(age50=age50, age50.2=age50^2, female=1)
> phf0 = predict(fit4, newdata0, re.form=NA)
> phf1 = predict(fit4, newdata1, re.form=NA)
> plot(phf0 ~ age50, ylim=c(40,55), type="l", lwd=2, lty=2)
> lines(phf1 ~ age50, lwd=2)
> legend("topright", legend=c("Female","Male"), lwd=2, lty=1:2)

Exercise 38. Quit R like this: q(). Click Yes to save your workspace image[78]. Files named .RData and .Rhistory should appear[79] in your project folder. Restart R by double-clicking on the .RData file.
This should restore your objects and resume the R session at the point you left it. Run function getwd() to check that your project folder has been restored as the working directory. Run functions ls() and history() to list the objects that have been restored and the last few commands you ran before quitting.

> getwd()
> ls()
> history()

[78] The "workspace image" is two files: .RData and .Rhistory, containing all your current objects, (variables and functions you defined), and recent command history. The point is to enable you to keep different projects in different folders, so you can have multiple running sessions each with its own workspace image.
[79] Some systems may hide filenames that begin with a dot. On Windows you may need to take action to show them. Use the menu item: Tools > Folder Options... in any folder. (You may need to hit the Alt key to display a folder's menu items). On the View tab ensure the option: Show hidden files and folders is selected. On MacOS you may find the history file appears in your home folder (/Users/<user>) rather than the working directory you have set.
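The saving and restoring of objects described in the exercise above can also be tried by hand, without quitting R. The following is only a minimal sketch under stated assumptions: the objects x and f are made up for illustration, and the image is written to a temporary folder rather than your project folder. It uses function save to write named objects to an image file, and function load to read them back into a separate environment so that nothing in your workspace is overwritten.

```r
# A minimal sketch, not the course's own code. The objects x and f are made up.
x = c(2,5,-3)                              # Two objects to keep
f = function(v) sum(v)

img = file.path(tempdir(), ".RData")       # Image file in a temporary folder (assumption)
save(x, f, file=img)                       # Save the named objects to the image file

restored = new.env()                       # A separate place to restore into
load(img, envir=restored)                  # Read the objects back from the image file
ls(restored)                               # Names of the restored objects
identical(restored$x, x)                   # Same value as before?
```

Function save.image() is a short-cut that saves every object in the workspace in the same way. The command history is kept separately, so this sketch says nothing about the .Rhistory file.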