Statistics with R — a practical guide for beginners
Uppsala University, Fall 2014

Table of Contents

Introduction
  Who should use this script?
  How to use this script
1 Getting started with statistics
  1.1 Basic statistical concepts
  1.2 Descriptive statistics
  1.3 Exercises: Getting started with statistics
2 Getting started with R
  2.1 What is R?
  2.2 Downloading and installing R
  2.3 How to work with R
  2.4 Handling data
  2.5 Dealing with missing values
  2.6 Understanding help() functions
  2.7 Exercises
  2.8 Web resources and books on R
3 Basic statistics with R
  3.1 Types of data
  3.2 Exploring data with tables
  3.3 Exploring data graphically
  3.4 Descriptive statistics
  3.5 Comparing two groups of measurements
  3.6 Using t-tests with R
  3.7 Non-parametric alternatives
  3.8 Correlation analysis
  3.9 Cross-tabulation and the χ2 test
  3.10 Summary
  3.11 Exercises
4 Linear models
  4.1 Overview - in this section you will
  4.2 Classes of linear models
  4.3 Workflow for linear models
  4.4 Defining the model
  4.5 Analyzing and interpreting the model
  4.6 Worked examples
  4.7 Summary
  4.8 Exercises
5 Basic graphs with R
  5.1 Bar-plots
  5.2 Grouped scatter plot with regression lines
  5.3 Exercises
6 Logistic regression
  6.1 Goals
  6.2 How to do it
  6.3 Summary
  6.4 Exercises
7 R programming structures
  7.1 Flow control
  7.2 Write your own function
  7.3 Summary
8 Appendix: Code references
Introduction

Welcome to our statistics with R script!

Who should use this script?

This introduction is directed at advanced BSc and beginning MSc students in Biology at Uppsala University and may of course also be helpful for anyone interested. We have kept this script concise and applied in order to allow people to conduct their own statistical analyses in R after a short time. Note that this “quick-start” guide does not replace a full-fledged course. However, we hope that successfully using R for statistical analyses with the help of this script will generate interest in learning more about statistics and R!

You may find this script helpful if you are:
an incoming Master student at Uppsala University with little or no previous education in either statistics or R
a student who wishes to freshen up knowledge in statistics and/or R for courses, project work or research (Master or Doctoral thesis)
a prospective student who wants to check out the level and content of statistics with R at Uppsala University
anyone interested in a quick-start guide to beginner-level statistics with R

How to use this script

We wrote this script for flexible use, such that you can direct your attention to the parts that you want to focus on, given your background and current interest. You will have most use of this enhanced .pdf file if you read it electronically using a pdf reader that provides a content sidebar and allows hyperlinks as well as attachments. Recent versions of the free program Adobe Reader for Macintosh and for PC have these functions (do not use Preview). The script contains internal links and links to webpages. You can navigate between sections and subsections using the bookmarks pane (Figure 0-0-1). Some solutions and data files are provided as attachments to the .pdf document that are accessible through the attachment pane (press the paperclip symbol). Note that Adobe Reader also allows you to add notes, highlight text and set bookmarks on your own.

Figure 0-0-1 Screenshot of this script in Acrobat Reader.
Use bookmarks to navigate between sections and subsections. The attachment pane (paperclip symbol) will display a list of the included files (datasets and exercise solutions). You can also insert your own marks, such as the yellow “1996”, as well as comments using Adobe Reader's tools. Please browse the overview and summary sections that are present in all chapters in order to find out where you need to start reading. All sections have exercises with solutions such that you can practice and test your knowledge. We hope that you will find our script useful and fun!

August 2014, Sophie Karrenberg, Andres Cortes, Elodie Chapurlat, Xiaodong Liu, Matthew Tye

1 Getting started with statistics

1.1 Basic statistical concepts

Goals:
learn about why and how statistics are used in Biology
be introduced to basic statistical concepts such as distributions and probabilities
become familiar with the normal distribution
get an idea about how statistical tests work

Why do we need (so much) statistics in biology?

Organisms that biologists study are influenced by a multitude of factors including the genetic makeup of the organisms, their developmental stage and the environmental conditions at various scales ranging from microscopic to world-wide. This brings about the need to detect the most important factors and to isolate certain factors experimentally for further analysis. In biology, it is also usually impossible to work on all the units of the group or species that you are interested in. Instead, biologists often have to work on a subset of units taken at random and make inferences from this subset. The whole group of units is called a “population”, while the subset is a “sample” (Figure 1-1 Population and sample). The problem is that all these units are often different, even though they belong to the same population. By chance, your random sample may not be very representative of the population. Thus, even two samples taken from two similar populations may differ greatly, just by chance. It is also possible that two samples taken from two very different populations may be very similar, misleading you to conclude that the two populations are similar. Also, the natural variation among units within your samples may obscure the effect of an experimental treatment. Thus, working with samples means that we have to deal with all these uncertainties in some way. If there really is a difference between the samples, you need to know what differences you can expect by chance, and how to deal with the variation within samples. Statistics help you to make these decisions. In other words, statistical tests are methods to use samples to make inferences about the populations.

Figure 1-1 Population and sample

Biological questions such as which genes affect certain traits or how climate change affects the biosphere can only be solved using statistical analyses on massive datasets. But even comparatively small questions, for example to what extent men are taller than women, are in need of statistical treatment. Thus, as soon as you formulate a study question, you should start thinking about statistics. Statistical analyses have a central place in biological studies and in many other sciences (Figure 1-2 The role of statistical analyses in the biological sciences):

Figure 1-2 The role of statistical analyses in the biological sciences

Hypothesis testing

Many statistical tests evaluate a strict null hypothesis H0 against an alternative hypothesis HA.
For example:

H0: mean dispersal distance does NOT differ between male and female butterflies
HA: mean dispersal distance differs between male and female butterflies

Testing these hypotheses involves test statistics, distributions and probabilities. For this reason, the next parts of this lesson first review the concepts of distributions and probability, after which we come back to statistical testing.

Distributions

A distribution refers to how often different values occur in a set of data. In the graph below you see a common representation of a distribution: a histogram (Figure 1-3 Histogram of normally distributed data). In histograms, the horizontal x-axis represents the values occurring in the data, separated into groups (columns), and the y-axis shows how many values fall into each group. The vertical y-axis in histograms can also be given as the percentage of the data that the values represent.

Figure 1-3 Histogram of normally distributed data (x-axis: values; y-axis: frequency)

Probability and probability density functions

Probability refers to how likely an event is. For example, when you throw a coin it is equally likely that it lands on heads or tails. The probability of a coin landing on heads is 0.5. Nonetheless, for a single throw of the coin you cannot predict where it will land! If, however, you throw the coin very many times, you expect it to land on heads about half of the times, corresponding to the probability of 0.5. The coin example concerned an outcome with two categories, heads and tails. For continuous (measurement) values, probability density functions can be derived (Figure 1-4 Probability density for a standard normal distribution). Note the similarity in shape to the histogram above (Figure 1-3 Histogram of normally distributed data). For each value on the x-axis, the value of the probability density function displayed on the y-axis is the expected probability of that value occurring. The value -1 is thus expected to occur with a probability of 0.242, or in about 24 of 100 cases. Probability density functions of test statistics are used for the evaluation of statistical tests.

Figure 1-4 Probability density for a standard normal distribution (x-axis: standard deviations; y-axis: density)

The normal distribution

The normal distribution, also referred to as the bell curve, was described by Gauss and others in the early 19th century, providing the basis for many statistical analyses in use today. The parameters of the normal distribution are the mean, corresponding to the center of the distribution, and the standard deviation, referring to the spread of the distribution. Below you see normal distributions with different means and standard deviations (Figure 1-5):

Figure 1-5 Probability density for normal distributions with various means and standard deviations (sd)

Mean and standard deviation of the normal distribution are intricately linked to how common values are. In fact, the probability of obtaining values in a certain range corresponds to the area under the curve in this range. The entire area under the curve sums to 1. Values within one standard deviation to either side of the mean represent 34.1% of the data on each side (pink), 13.6% of the values occur between one and two standard deviations from the mean on either side (yellow), 2.1% of the values occur between two and three standard deviations from the mean (green) and 0.1% of the data occur beyond three standard deviations from the mean on either side (white, Figure 1-6).

Figure 1-6 Standard normal distribution with the percentage of values occurring in a certain range indicated
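You can check these areas yourself in R (a small sketch going beyond the original text; pnorm() gives the cumulative probability up to a value and dnorm() gives the density):

pnorm(1) - pnorm(0)   # about 0.341: within one standard deviation above the mean
pnorm(2) - pnorm(1)   # about 0.136: between one and two standard deviations
pnorm(3) - pnorm(2)   # about 0.021: between two and three standard deviations
1 - pnorm(3)          # about 0.001: beyond three standard deviations
dnorm(-1)             # about 0.242: the density at -1 (compare Figure 1-4)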
How statistical tests work

Let us now come back to statistical hypothesis testing and our example:

H0: mean dispersal distance does NOT differ between male and female butterflies
HA: mean dispersal distance differs between male and female butterflies

Note that by simply saying that dispersal distance differs, we imply that female dispersal distance could be either higher or lower than male dispersal distance. Because one group can differ from the other in either direction, this is called a two-sided hypothesis and is followed by a two-tailed test. We could also phrase the alternative hypothesis in either of the following ways:

HA: female butterfly dispersal distance is longer than that of male butterflies.
HA: female butterfly dispersal distance is shorter than that of male butterflies.

Here we hypothesize that the female butterflies differ from the male butterflies in a specific direction. This is referred to as a one-sided hypothesis and is followed by a one-tailed test.

To evaluate either one- or two-sided hypotheses, statistical tests calculate a test statistic from the data to find out how likely the obtained result would be under the null hypothesis. To do so, a probability distribution of the test statistic is theoretically derived assuming the null hypothesis. The probability of the test statistic from the data, given that the null hypothesis is true, is then found using this theoretical distribution. This probability is termed the P-value. Common statistical tests usually have an outcome of "significant", meaning that the alternative hypothesis is accepted, or "not significant", meaning that the alternative hypothesis is discarded and the null hypothesis accepted. How is this decision made? If the test statistic calculated from the data happens to be a value that is very rare under the null hypothesis, usually occurring at a probability of less than 5% (P-value < 0.05), the null hypothesis is discarded and the alternative hypothesis accepted. If the test statistic happens to have a commonly occurring value under the null hypothesis, the alternative hypothesis is discarded instead. This is illustrated for one- and two-tailed tests in Figure 1-7. Importantly, all statistical tests make assumptions about the data and are only valid if these are met. You will come across these assumptions in the section Basic analyses with R.

Figure 1-7 Illustration of significance (P < 0.05) ranges in one- and two-tailed tests.
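For a test statistic that follows a standard normal distribution, the cut-off values that mark these significance ranges can be obtained with qnorm() (a brief illustration added here, not part of the original figure):

qnorm(0.975)   # about 1.96: two-tailed test at the 5% level (2.5% in each tail)
qnorm(0.95)    # about 1.64: one-tailed test at the 5% level (5% in one tail)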
Test outcomes, error types, significance and power

A common statistical test can have four potential outcomes: two are correct and two are false (Table 1-1).

Table 1-1 Possible outcomes of statistical tests with a significance level of 0.05.

Test result                                   Reality: H0 true                Reality: HA true
P-value >= 0.05: H0 accepted, HA discarded    Correct!                        Type II error (false negative)
P-value < 0.05: H0 discarded, HA accepted     Type I error (false positive)   Correct!

Note that the P-values correspond to Type I errors (false positives), i.e. accepting the alternative hypothesis when it is not true. The significance level is commonly set to 0.05 in biological studies and P < 0.01 or P < 0.001 are regarded as highly significant. Importantly, the choice of significance level has direct implications for the two error types. Let's look at this further.

The two probability density curves on the graph below represent theoretical normal distributions of measurements for our example of dispersal distance: female (black) and male (red) butterflies. When the significance level is set to 0.05 (P-values < 0.05 taken as significant, upper graph, Figure 1-8), the black areas under the black curve for females represent the type I error, i.e. erroneously accepting the alternative hypothesis. The same cut-off applies to the red curve for males: here the area in red represents the type II error, erroneously accepting the null hypothesis when the alternative hypothesis is true. If the significance level, and thus the type I error, is decreased to 0.01 (lower graph), the type II error is inevitably increased! This means that even if you can be more certain that the alternative hypothesis is true when you do accept it, you also have to live with a higher chance of missing cases where the alternative hypothesis is true.

Figure 1-8 Trade-off between type I and type II errors (upper panel: significance level 0.05; lower panel: significance level 0.01), for the example of female (black) vs. male (red) dispersal distance in butterflies.
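This trade-off can be made concrete with a small sketch in R. The distributions below are hypothetical and chosen only for illustration (female measurements following a standard normal distribution, male measurements shifted to a mean of 3):

# one-tailed cut-off that fixes the type I error at 0.05
cutoff.05 <- qnorm(0.95, mean = 0, sd = 1)
pnorm(cutoff.05, mean = 3, sd = 1)   # type II error, about 0.09

# lowering the type I error to 0.01 increases the type II error
cutoff.01 <- qnorm(0.99, mean = 0, sd = 1)
pnorm(cutoff.01, mean = 3, sd = 1)   # type II error, about 0.25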
Summary

Distributions can be displayed as histograms and show how often different values (or classes of values) occur. Probabilities express how likely events or outcomes are. Probability density functions show how likely it is to obtain values under a certain distribution. The normal distribution is fundamentally linked to many common statistical tests. Normal distributions are described by their mean and standard deviation. Statistical tests use samples to make inferences about large populations and generally evaluate a null hypothesis (usually no difference) against an alternative hypothesis (a difference). They do so by comparing a test statistic calculated from the data against a theoretical distribution of this statistic under the null hypothesis. The significance level used in statistical testing is related to both type I errors (false positives) and type II errors (false negatives).

1.2 Descriptive statistics

Goals

In this section you will learn how to describe your data in terms of
range, quartiles and median
mean and standard deviation
standard error of the mean

Range, median and quartiles

Once you obtain data you often wish to gain an overview before you start conducting analyses. One of the most basic measures of data series is their range. The range refers to the interval between the smallest value, the minimum, and the largest value, the maximum. Indeed, looking at the range is highly recommended as it allows you to conduct a first check of the data: are the values actually in the expected (or reasonable) range? In addition to the range it is also a good idea to calculate the median, the value that is exactly in the middle of the data: 50% of the values are larger than the median and the other 50% are smaller than the median. Whether the median is more or less in the middle of the range will show you whether your data is distributed symmetrically. For example, data with a range of 1 to 10 and a median of 2 is NOT symmetrically distributed. The data in the illustration below is symmetrically distributed (Figure 1-9).

Figure 1-9 Median and quartiles on a histogram of a normal distribution

Similar to the median you can also calculate the 25% and 75% quartiles, values that are larger than 25% or 75% of the (ordered) data. The example data above, with a range of 1 to 10 and a median of 2, has a 25% quartile of 2 and a 75% quartile of 3; this would indicate that there probably are some outliers causing the range to extend to 10.

Mean and standard deviation

Very common descriptive statistics are the mean and the standard deviation. They make most sense for symmetrically distributed data. The mean x̄ is calculated as the sum of all values xi divided by the number of values n (Figure 1-10):

x̄ = (Σ xi) / n

The standard deviation s (also sd) is a measure of the spread of the data and is calculated as:

s = √( Σ (xi − x̄)² / (n − 1) )

Figure 1-10 Histogram of a normal distribution showing mean, median, quartiles and standard deviation

Standard error and confidence interval of the mean

The standard error of the mean (SE or se) gives a measure of the precision of the estimate of the mean:

SE = s / √n

The standard error can be used to calculate a confidence interval (CI) for the mean. The 95% confidence interval around the sample mean should contain the mean of the population with 95% probability. For reasonably large samples it is calculated approximately as:

95% CI = x̄ ± 1.96 × SE

Note that the standard error and the confidence interval of the mean become smaller the larger the sample is. This reflects the greater trust you can have in a mean calculated from a large sample as opposed to a mean calculated from a small sample.

How to deal with data that is not normally distributed

Biological data is often not normally distributed, especially size measurements. It is for example not rare that there are many small measurement values and fewer and fewer larger values, such that the data has a distribution as in the histogram below. In this case the mean and standard deviation are not such good measures of center and spread. On the graph below, the mean is rather far from the bulk of the data. A range of one standard deviation around the mean does not contain the same number of measurements on each side. Median and quartiles are better descriptive statistics for such data: the median indeed is in the center of the data and the quartiles nicely reflect the asymmetry in the distribution, i.e. the distance between the 25% quartile and the median is smaller than the distance between the 75% quartile and the median (Figure 1-11). Alternatively, you can use data transformations and calculate mean and standard deviation from transformed data (see Basic analyses with R).

Figure 1-11 Descriptive statistics on a right-skewed distribution

Summary

Descriptive statistics are important to check data and are used to summarize data. Range, quartiles and median are basic descriptive statistics for data with any distribution. Mean and standard deviation are more useful for symmetrically distributed data.
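All of these descriptive statistics are directly available in R (a quick sketch with made-up example values; R has no built-in standard error function, so the SE is computed from sd() and the sample size):

x <- c(1, 2, 2, 3, 3, 3, 4, 4, 10)    # made-up, right-skewed example data
range(x)                    # minimum and maximum
median(x)                   # the middle value
quantile(x, c(0.25, 0.75))  # 25% and 75% quartiles
mean(x)
sd(x)                       # standard deviation
sd(x) / sqrt(length(x))     # standard error of the mean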
1.3 Exercises: Getting started with statistics

1-A
Please have a look at the two distributions below, A and B. They correspond to commonly observed distributions of biological data. Read the statements below and select whether they are true for distribution A, distribution B, both distributions or neither of the distributions.
Most values are between 3 and 7
Values are between zero and 9
Corresponds to about 1000 values in total
The distribution is symmetric
The distribution is asymmetric
Corresponds to about 350 values in total
Less than 200 values are smaller than 4
Many small values occur
Many large values and few small values.
Few extreme values occur.
Most values are smaller than 4.

1-B
You and a friend wonder whether it is "normal" that some bottles of your favourite beer contain more beer than others although the volume is stated as 0.33 L. You find out from the manufacturer that the volume of beer in a bottle has a mean of 0.33 L and a standard deviation of 0.03 L. If you now measure the beer volume in the next 100 bottles that you drink with your friend, how many of those 100 bottles are expected to contain more than 0.39 L, given that the information of the manufacturer is correct?

1-C
Your data is distributed as shown below. Where do you expect the median to be? Select one:
To the left of the mean.
To the right of the mean.
At the same place as the mean.

1-D
Check all answers that are correct. A P-value of 0.051 for a t-test …
(a) …means that 0.949 % of the data are greater than the mean.
(b) …indicates a 5.1% probability of Type I error
(c) …proves that there is no difference between the groups.
(d) …shows that the difference would be significant if more data were used.
(e) …is regarded significant in biostatistical analyses.
(f) …is regarded non-significant in biostatistical analyses.

Solutions to Exercises: Getting started with statistics

1-A
Most values are between 3 and 7 – True for A
Values are between zero and 9 – True for A and B
Corresponds to about 1000 values in total – True for A and B
Symmetric – True for A
Asymmetric – True for B
Corresponds to about 350 values in total – True for neither
Less than 200 values are smaller than 4 – True for A
Many small values occur – True for B
Many large values and few small values – True for neither
Few extreme values occur – True for A
Most values are smaller than 4 – True for B

1-B
Correct answer: 0.39 L corresponds to the mean plus two standard deviations (0.33 + 2 × 0.03), and values larger than 0.39 L are thus expected to occur in 2.3% of the cases. 2.3% of 100 bottles is 2.3, or about 2 bottles. The correct answer is: 2.3

1-C
The correct answer is: To the right of the mean

1-D
(b), (f)

2 Getting started with R

2.1 What is R?

R is a versatile and powerful statistical programming language developed by the statistics professors Robert Gentleman and Ross Ihaka at the University of Auckland, New Zealand. Different from many other statistical programs, R is free and its source code is available. R was released in 1996 and is maintained by the R development core team (http://www.R-project.org/) with a very large number of international contributions, and it continues to develop at a fast pace. R is among the most widely used statistical programs at universities today. For more information please check out the list of webpages and books on R at the end of this section.

2.2 Downloading and installing R

R is available for Macintosh, PC and Linux operating systems and is easy to install. To download and install R, please go to the Comprehensive R Archive Network page (CRAN) and follow the instructions there.

2.3 How to work with R

Goal

In this section, you will learn:
1. What R scripts, R commands and the R console are
2. How to work with RStudio
3. Basic data types in R and how to assign them
4. Steps to follow when working with R (work flow)

Elements of R: Script, Console & Co.

Using R involves mostly writing commands (or “code”) rather than clicking on menus. The commands are usually assembled in a script that can be saved and reused. The R console receives the commands either from the script or by direct typing, shows the progress of analyses and displays the output. Graphical output will open in a separate window. This means that working with R can involve quite a lot of windows and files: the script, the console, graphical output and then, of course, your data files and other user-specified output.
You can assign a working directory where all of these files and outputs are saved by default. Nonetheless, it can be difficult to keep track of all these windows on the screen when working with R! (Figure 2-1)

Figure 2-1 An overview of the workflow of R

An excellent way of ordering and manipulating your R windows and files is to use the free and powerful interface RStudio (see below). We highly recommend that you use this program.

Using RStudio to work with R

RStudio incorporates the console, script, graphical output and various other elements in an accessible and easy-to-manipulate form. RStudio is free and available for both Windows and Macintosh operating systems and can be downloaded from http://www.rstudio.com/products/rstudio/. Note that the RStudio menu differs slightly between PC and Mac versions.

The RStudio screen is divided into four resizable parts (Figure 2-2). The upper left part contains a script editor where commands are written and saved. The various tabs in the upper left part can contain multiple scripts and also data files. Commands are sent to the console in the lower left part using the key combination Cmd + Return (Macintosh) or Ctrl + Enter (PC). On the right side, RStudio displays a workspace tab listing all objects in the current analysis and a history tab providing a record of executed commands. The lower right partition hosts a plots tab where graphical output is stored, a packages tab where packages can be viewed and installed, a files tab to manipulate files and a help tab where R help information can be searched and displayed.

Figure 2-2 A screenshot of RStudio: the script editor (upper left, write your commands here), the console (lower left, execute your script and view output), the Workspace/History tabs (upper right) and the Files/Plots/Packages/Help tabs (lower right)

In RStudio, you can bundle your analyses into projects using the project drop-down menu on the top (through “File” for PC and through “Project” for Mac) or the pop-up menu in the top right corner of RStudio (both versions). Projects will contain all elements of analyses, allowing you to continue a session exactly where you ended the previous time. For help with RStudio, you can go to https://support.rstudio.com.

You can set a new working directory following the menu options Session > Set working directory (Figure 2-3).

Figure 2-3 Setting the working directory in RStudio

To create a new script, you can follow “File > New File > R script”, or use the shortcut Ctrl + Shift + N. Save your scripts regularly. A file that has been modified but not saved again will show with a red title and a * at the end. You can navigate between different plots produced during a session using the blue arrows at the top left corner of the Plots tab. You can save your graphs by clicking on “Export”.

Typing commands and writing code

R commands always have the same structure (Figure 2-4). A command name is followed by parentheses without a space. Command names are often closely related to what the command does; for example, the command mean() will calculate the mean. The parentheses contain the arguments, separated by commas, that the command will use. For example, such arguments tell R what data to use. Arguments are further used to select options for analyses. Which arguments are needed differs between commands and is explained in the help pages of the commands.

Figure 2-4 Structure of an R command
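As a minimal illustration of this structure (an added example, not from the original figure), the command below calculates a mean; here the vector c(2, 4, 6) is the single argument, the data to average:

mean(c(2, 4, 6))
[1] 4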
In R, you can create scripts that are saved as a .R file, can be re-used and serve as documentation of your analysis. Scripts are written and manipulated in the script editor window, and should be saved to the working directory with a .R file extension. It is also possible to enter code directly into the console. Typing in the console can be faster when trying out code. However, any analysis that you need to keep a record of should be created as a script and saved.

Important notes for writing and executing code:

You send commands from the script to the console such that they are executed by highlighting one or multiple lines and pressing the keyboard shortcut Cmd + Return (Macintosh) or Ctrl + Enter (PC).

You can enter comments preceded by #. Everything written after # will be ignored and not executed in the console (Figure 2-5). In RStudio, you can use four of these symbols (####) after a title to organize your scripts and mark specific points that you can find easily at the bottom left of the script window, indicated by an orange # (Figure 2-5).

Figure 2-5 Screenshot of the RStudio script tab

Observe the ">" sign in the console. This is the command prompt, the place where you can type in your commands and execute them by pressing return. The command prompt indicates that R is ready to receive commands and has finished executing previous commands. R will display a "+" if a command is longer than a single line or incomplete. On the console, you can cycle through previous commands using the "arrow up" and "arrow down" keys on your keyboard.

Object assignment and the workspace

R stores data, analyses and outputs as objects in the current workspace that can be saved. Observe that the workspace is a collection of R objects with properties assigned through R commands, whereas the working directory basically is a folder on your computer that contains various files of any type. You assign information to a variable using the assignment operator, an "arrow" composed of a smaller-than sign and a minus sign (<-) pointing to the name of the variable. There are certain rules for naming your variables. The first character must be an English letter or a dot (a dot must not be directly followed by a number). You can use uppercase or lowercase letters. Blanks and special characters, except for the underscore and the dot, are not allowed to appear in variable names. For instance, the following command

greeting <- "hej"

will assign "hej" to an object named greeting. All text has to be in quotes (" "), otherwise R will look for an object with this name and create an error message; for example,

greeting <- hej

will result in the error message

Error: object 'hej' not found

To call an object and see what it contains you enter its name. Type the object name on the command line as below.

greeting

will result in the output

[1] "hej"

The [1] indicates that this is the first (and only) element of this object. You can check what variables have been created in R using the command

ls()

This will result in a list of the current objects in the workspace. In RStudio you can view and manipulate current objects in the workspace tab. You can also remove (delete) objects using rm(), for example

rm(greeting)

Variables are overwritten without notice whenever something else is assigned to the same name. If you want to delete all current objects, use

rm(list=ls())
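A few quick examples of valid and invalid names (added for illustration; the invalid lines are commented out because they would produce errors):

leaf.width.2014 <- 4.0   # valid: letters, numbers, dots and underscores
n_samples <- 25          # valid
# 2014.width <- 4.0      # invalid: a name must not start with a number
# leaf width <- 4.0      # invalid: blanks are not allowed in names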
Data types

R deals with numbers, characters (entered in quotes, " ", as the “hej” in the example above) and logical statements (TRUE or FALSE). The following types of data are commonly used by beginners:

Vector: one-dimensional, contains a sequence of one type of data, i.e. numbers OR categories (letters, group names) OR logical statements. Vectors can be created using c(element1, element2, element3, ...), which concatenates (connects one after the other) the different elements into a vector. Note that the elements can themselves be vectors. For example,

c("population1","population2","population3","population4")

will generate a vector as follows:

[1] "population1" "population2" "population3" "population4"

Number sequences can be created using the operator ':'. For instance,

x <- 1:7

creates the vector x that contains a sequence of numbers from 1 to 7:

[1] 1 2 3 4 5 6 7

Besides, there are a number of other functions for creating vectors, for example seq() for user-defined sequences and rep() for repeated elements. You can find out about these functions using the R help.

Factor: similar to vectors but also contains information on levels. Entries of a factor that are equal belong to the same factor level, or in other words, to the same category. Factors can be created from vectors using factor(). For example, you can create a factor named sex using the code below:

sex <- c(rep("male",25), rep("female", 35))
sex <- factor(sex)

Data frame: collection of vectors and factors of the same length that can contain different data types. This is the format commonly used for data analysis, where each row corresponds to an observation and each column corresponds to a variable (vector or factor). The section Getting data into R explains how to create data frames from your data, and the sections Accessing and changing individual entries, Accessing and changing entire rows or columns, and Adding and deleting columns explain how to handle and manipulate the contents of data frames.

List: collection of elements of any format and type, can be created using list(). Outputs of statistical analyses are often lists. Other types of objects include matrices and arrays.

Work flow in R

1. Define/create a folder on your computer that is to be used as a working directory.
2. Open RStudio and create a new file in the script editor (see R scripts in RStudio). The hash sign (#) is used in scripts to identify text that is NOT a command, i.e., titles and comments, and to prevent R from trying to execute such text.
3. Add a title (preceded by a hash sign (#)), for example, # My first R session, my name, today's date
4. Set the working directory to your prepared folder (see 1). The working directory can be changed at any time. If you want to make sure that the working directory is correct, use the command getwd() to obtain the path to the current working directory.
5. Save the script file now, and regularly later on, using the menu or symbol on the script window or the shortcut Ctrl + S. The script file has the extension .R and will be saved in the working directory by default. This script can be re-used and shared.
6. Load data from the working directory into an object (see Loading data files).
7. Conduct analyses, produce output and graphs creating further objects, and save the script and outputs/graphs.
8. Quit R using the command q() that you can type into the console, or by closing the RStudio window. This will result in a question whether you want to save the workspace. You can safely answer no to that; save the script and the data instead. Saving the workspace is only recommended for analyses that take a long time to complete. A minimal example script following these steps is sketched below.
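A minimal sketch of such a script (the file name my_data.csv is only a placeholder for your own data file; read.table() is explained in section 2.4):

# My first R session, my name, 2014-09-01
getwd()   # check that the working directory is the prepared folder

# load the data (here: a semicolon-separated .csv file with a header)
my.data <- read.table("my_data.csv", header = TRUE, sep = ";", dec = ",")

str(my.data)   # check that the data were loaded correctly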
Basic calculations in R

R is sometimes referred to as an "overgrown" calculator. You can perform calculations directly in R; for example, the line

3 + 4

will result in

[1] 7

Other common arithmetic operators are: - (minus), / (division), * (multiplication), ^ (exponentiation). Calculations can also be done on vectors. Basic calculations (+, -, /, *, ^, etc.) are conducted on each element, and vectors of the same length will be combined element-wise in calculations. This also applies to columns or rows of data frames.

numbers.1 <- c(1,2,3)

will assign the numbers 1, 2, and 3 to a vector called numbers.1.

numbers.1 * 2

will multiply each element of the vector by 2. Create a second vector with three elements, numbers.2, and add these two vectors:

numbers.2 <- c(2,2,2)
numbers.1 + numbers.2

You can use this to establish functional relationships of interest, plot and examine them. Try out this code! Note that c(1:100) creates a vector with the numbers from 1 to 100.

plot(c(1:100)^2)

Summary

In R, you use a script window to enter the commands. Commands are transferred to the R console for execution. Scripts can be saved and re-used. Data, output and scripts are saved in a designated working directory. R stores data and analysis outputs as objects that often are vectors, factors, data frames or lists. They contain data as numbers, characters or logical statements. Factor levels or other text must be in quotes "". Objects are given names with the assignment arrow "<-". The workflow in R involves setting the working directory at the beginning and saving the script file repeatedly. In R, you can conduct basic mathematical calculations directly and element-wise, for example on vectors and on columns of data frames.

2.4 Handling data

Goal

In this section you will learn how to
enter and load data into R
check data structure
explore data graphically
change and subset data

In R, you can input data either by directly typing within the script or by loading data files.

Entering data in the script

You can enter data using the functions c() and data.frame(). Below is an example from plant experiments. You measured the width and the length of six leaves in the plant species Silene dioica in cm. The first three plants were flowering and the last three were not flowering. You can enter the data directly as arguments to the function data.frame().

Silene.leaves <- data.frame(plant.number = c(1,2,3,4,5,6),
    leaf.width = c(4.0, 4.7, 2.8, 4.1, 3.5, 3.7),
    leaf.length = c(5.3, 4.9, 5.7, 5.0, 5.5, 4.3),
    flowering.state = c("flowering", "flowering", "flowering",
        "vegetative", "vegetative", "vegetative"))

Each column is set by an argument: column.name = c(data). Note that the data entries within c() must be separated by commas and the arguments (columns in this case) within data.frame() must also be separated by commas. The function c() produces a vector from the data (compare data types). You can choose column names freely. Note that data for the flowering state is entered in quotes because flowering state is a category. Alternatively, the data can be assigned to vectors first and then combined into a data frame. Vectors can also be used for analysis by themselves.
width.vector <- c(4.0, 4.7, 2.8, 4.1, 3.5, 3.7)
length.vector <- c(5.3, 4.9, 5.7, 5.0, 5.5, 4.3)
flowering.vector <- c("flowering", "flowering", "flowering",
    "vegetative", "vegetative", "vegetative")
Silene.leaves <- data.frame(leaf.width = width.vector,
    leaf.length = length.vector, flowering.state = flowering.vector)

Preparing data files

If you want to load data from a file instead, you first need to prepare a suitable file from another program, for example Excel. Follow these points:

1. Arrange the variables (measurements) in columns and observations in rows (see example).
2. Make sure that column headings and table entries contain only letters from the English alphabet or numbers; in particular they should have no spaces and no slashes. Headings should start with a letter and must not contain commas. If you want to make your headings clearer, you can use points or underscores, for example, height_august. This is advisable even though some newer versions of R (sometimes) tolerate spaces and more letters.
3. Save your table as .csv.
4. Open your file in a text editor. Observe two things: Firstly, what is the decimal separator, i.e., is it 1,5 or 1.5 (comma or point)? Secondly, how are the entries separated? This can for example be a space ( ), tabulator (\t), comma (,) or semicolon (;). You need this information to ensure correct loading of your data in R, explained on the next page.

Loading data files

There are many ways to load data into R. In RStudio you can also load data through the workspace tab using the import data pull-down menu. This allows you to view the files and directly specify the separator between entries, the decimal separator and whether or not you have headings. Note that using this menu option in RStudio will also produce code in the console that you can copy into your script for later use if desired. You can also view the data in the tab in the top left partition.

If you want to use a direct command to read data we recommend read.table() because it is universally applicable. Within the read.table() command you can specify to browse your computer for the file to load using the argument file = file.choose(). Or you can enter the path of your file: file = "document_name.txt", for example. You indicate whether your data contains a header (a row of titles) with header = TRUE (has a header) or header = FALSE (no header). The table separator is set using sep, for example, sep = ";" (semicolon), sep = "," (comma), or sep = "\t" (tabulator). The decimal separator is specified by the argument dec, for example dec = "," (comma) or dec = "." (point). The input file needs to be assigned to an object using the arrow <-. In most cases, this will automatically be a data frame object.

For a .csv file with a header, semicolon-separated entries and decimal commas (as usually used for Swedish settings for .csv files saved from Excel) the command looks like this:

my.data <- read.table(file = file.choose(), header = TRUE, sep = ";", dec = ",")

For a .csv file with a header, comma-separated entries and decimal points (as common in North America) the command looks like this:

my.data <- read.table(file = file.choose(), header = TRUE, sep = ",", dec = ".")

Note: when you execute the commands nothing happens! See the next page for how to access the data.
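Nothing visible happens because the data are stored in the object my.data. A quick way to convince yourself that the import worked (a small suggestion in addition to the original text):

str(my.data)    # structure: number of observations, variables and their types
head(my.data)   # shows the first six rows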
Common problems when loading data

Additional entries. Loading a file with additional entries (sometimes invisible ones such as spaces) in cells outside the data will yield an error message similar to this one:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 148 did not have 5 elements

Remedy: copy your data (and only your data!) to a new sheet in Excel, save it as .csv and reload.

Non-English characters and signs. Non-English characters (ä, ö, é, etc.), signs (for example * ! & / | \ + - > < $ ? =) or spaces in the column names will produce an error message similar to this:

Error in make.names(col.names, unique = TRUE) : invalid multibyte string 1

Remedy: change the names in the Excel file or directly in the .csv file, save it as .csv and reload. Note: Non-English characters elsewhere in the data might also lead to an error message and thus should be avoided.

Checking data structure

To control what sort of object your data is stored in and whether it has the correct structure, you can use the structure command str(); for example, after loading the example file on Silene leaves, we use

str(Silene.leaves)

This yields the following output:

'data.frame': 6 obs. of 4 variables:
$ plant.number : num 1 2 3 4 5 6
$ leaf.width : num 4 4.7 2.8 4.1 3.5 3.7
$ leaf.length : num 5.3 4.9 5.7 5 5.5 4.3
$ flowering.state: Factor w/ 2 levels "flowering","vegetative": 1 1 1 2 2 2

This indicates that the object Silene.leaves is a data frame with six observations, i.e. six rows in our input file, and four variables, i.e. four columns in our input file. This matches well with the six plants and four columns in our input file.
This line will bring up the third element: width.vector[3] [1] 2.8 Should we realize that this element needs to be changed from 2.8 to 3.0 we can do that using the assignment arrow: 33 width.vector[3] <- 3.0 calling width.vector again shows that this has happened width.vector [1] 4.0 4.7 3.0 4.1 3.5 3.7 We can access the elements of data frames in the same way, except that data frames have two dimensions, rows and columns, such that two numbers separated by comma are needed with the square brackets. The first number always refers to rows, the second to columns. To access the element in the third row and the second column in our Silene.leaves data frame we use Silene.leaves[3, 2] [1] 2.8 Incidentally this is the same measurement as in the vector example above. To change it to 3.0 in the data frame we use the same kind of assignment operation: Silene.leaves[3, 2] <- 3.0 We can check whether this has happened by Silene.leaves[3,2] [1] 3.0 Entire rows and columns of data frames can be accessed by leaving column (or row) number empty in the square brackets. Note that the comma must always be entered because data frames have two dimensions. Accessing rows and columns is needed to conduct analyses and to make changes or calculation. For example, Silene.leaves[ ,2] [1] 4 4.7 2.8 4.1 3.5 3.7 brings up entries in the entire second column and Silene.leaves[3, ] plant.number leaf.width leaf.length flowering.state 3 3 2.8 5.7 "flowering" brings up the entire third row, for example to check that plant´s measurements. The first 3 is the row number. 34 Column names can be used in place of the numbers. R has a special notation for columns involving the dollar sign as you may have noticed in the output of the structure command. The following line will also bring up the third column. Silene.leaves$width Alternatively, the column names can be entered in quotes directly within the square brackets (note the comma!). Silene.leaves[ ,"width"] Should you now realize that the width measurements all need to be increased by 0.2 you can do that using Silene.leaves$width <- Silene.leaves$width + 0.2 OR Silene.leaves[,"width"] <- Silene.leaves[,"width"] + 0.2 OR Silene.leaves[,2] <- Silene.leaves[,2] + 0.2 Which of these options is most convenient depends on your column names, the size of your data file and your preferences. Adding and deleting columns Additional columns can be assigned at any times. For example, you may wish to create column of the ratio of leaf width and leaf length in our Silene.leaves data frame. Silene.leaves$width.length.ratio <- Silene.leaves$leaf.width / Silene.leaves$leaf.length Calling the structure command shows that the column has been added and is numeric. str(Silene.leaves) 'data.frame': 6 obs. of 5 variables: $ plant.number : num 1 2 3 4 5 6 $ leaf.width : num 4 4.7 2.8 4.1 3.5 3.7 $ leaf.length : num 5.3 4.9 5.7 5 5.5 4.3 $ flowering.state : Factor w/ 2 levels "flowering","vegetative": 1 1 1 2 2 2 $ width.length.ratio: num 0.755 0.959 0.491 0.82 0.636 ... Deleting one or several columns can be done using the minus sign within the square brackets. This only works with column numbers not with column names. This line removes the newly added width-length ratio column: 35 Silene.leaves <- Silene.leaves[ ,-5] str(Silene.leaves) 'data.frame': 6 obs. 
'data.frame': 6 obs. of 4 variables:
$ plant.number : num 1 2 3 4 5 6
$ leaf.width : num 4 4.7 2.8 4.1 3.5 3.7
$ leaf.length : num 5.3 4.9 5.7 5 5.5 4.3
$ flowering.state : Factor w/ 2 levels "flowering","vegetative": 1 1 1 2 2 2

Removing rows, for example when you realize that the measurements of an entire row are faulty, works in the same way. This line removes the first row (observe the placement of the comma!).

Silene.leaves <- Silene.leaves[ -1, ]
str(Silene.leaves)
'data.frame': 5 obs. of 4 variables:
$ plant.number : num 2 3 4 5 6
$ leaf.width : num 4.7 2.8 4.1 3.5 3.7
$ leaf.length : num 4.9 5.7 5 5.5 4.3
$ flowering.state: Factor w/ 2 levels "flowering","vegetative": 1 1 2 2 2

If you need to remove more than one row or column, use the c() command within the square brackets:

Silene.leaves[-c(1:3), ]

will remove rows one to three.

Silene.leaves[, -c(1,4)]

will remove columns one and four.

Subsetting data

There are many situations where only a specific subset of the data needs to be accessed. In R this is done by entering logical statements into the square brackets for row and column selection. (The examples below again use the original six-row data frame.) If you want to select, for example, only the flowering plants in the Silene.leaves data frame, you use a Silene.leaves$flowering.state == "flowering" statement for row selection.

Silene.flowering <- Silene.leaves[Silene.leaves$flowering.state == "flowering" , ]

will produce a new data frame named Silene.flowering containing only the flowering plants.

str(Silene.flowering)
'data.frame': 3 obs. of 4 variables:
$ plant.number : num 1 2 3
$ leaf.width : num 4 4.7 2.8
$ leaf.length : num 5.3 4.9 5.7
$ flowering.state: Factor w/ 2 levels "flowering","vegetative": 1 1 1

Let's have a closer look at the logical statement Silene.leaves$flowering.state == "flowering": In words this statement means something like "check for each element of Silene.leaves$flowering.state whether it reads "flowering" or not". If you execute only the logical statement, you create a vector with six elements; the first three are TRUE (corresponding to the flowering plants) and the last three are FALSE (corresponding to the vegetative plants).

Silene.leaves$flowering.state == "flowering"
[1] TRUE TRUE TRUE FALSE FALSE FALSE

When you use such statements for row selection, all rows corresponding to TRUE will be selected, in this case the first three rows. Note that R does not assume that you will use only columns of the same data frame for the logical statements; in fact, you can also use columns from other data frames or vectors. For this reason you need to write Silene.leaves$flowering.state == "flowering" and not only flowering.state == "flowering".

You can use the following logical operators:

==   identical
!=   not identical
>    greater than
<    smaller than

and combine conditions using

|    logical OR, one of the conditions fulfilled
&    logical AND, all conditions fulfilled

Here are some more examples:

Selecting plants with leaf width over 4.0:

Silene.leaves[Silene.leaves$leaf.width > 4.0, ]

Selecting plants with either width or length under 3.5:

Silene.leaves[Silene.leaves$leaf.width < 3.5 | Silene.leaves$leaf.length < 3.5, ]

Selecting plants with both width and length over 4.0:

Silene.leaves[Silene.leaves$leaf.width > 4.0 & Silene.leaves$leaf.length > 4.0, ]

Further, subset() is a useful function to perform these kinds of selections and subset a data set. The first argument specifies the data frame to subset. The second argument is a logical expression, as explained above, used to select specific rows in the data frame, and the third argument indicates the columns to be selected by their names (if several columns are selected, the names have to be in a vector). If you only want to omit one column, use - (minus) in front of the column name; for example,

New.Silene.leaves <- subset(Silene.leaves, flowering.state == "flowering", select = -flowering.state)

will create a new data frame containing only the rows concerning the flowering plants, and all columns except the flowering.state column (which is not needed any longer since we know all the plants in the data set are flowering). Since you specify the data frame to subset in the first argument of the subset() function, you can directly refer to the different variables (columns) by their names, without using the $ sign.

Summary

Data can be entered in the script using the data.frame() command. Loading data from files involves preparing a .csv file with observations as rows and measurements and grouping factors as columns. These files should only contain numbers and letters from the English alphabet. In grouping variables and headers, points and underscores can also be used. Data can be loaded through the menu in RStudio. Alternatively, data is read and then assigned to an object using the command read.table(). Data structure can be checked using the str(data.name) command. Exploring data graphically can involve pair-wise plots of all variables with plot(), histograms with hist() and boxplots with boxplot(). Individual data entries, rows and columns can be accessed and changed using their row and column subscripts. Data can be subset using logical statements involving ==, !=, |, and &.
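The graphical checks mentioned in this summary can be tried directly on the example data (a small sketch using the base R functions named above):

hist(Silene.leaves$leaf.width)                                 # histogram of one variable
boxplot(leaf.width ~ flowering.state, data = Silene.leaves)    # boxplots by group
plot(Silene.leaves[, c("leaf.width", "leaf.length")])          # pair-wise plot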
Further, subset() is a useful function to perform this kind of selection and subset a data set. The first argument specifies the data frame to subset. The second argument is a logical expression, as explained above, used to select specific rows of the data frame, and the third argument indicates the columns to be selected by their names (if several columns are selected, the names have to be given in a vector). If you only want to omit one column, use - in front of the column name. For example,

New.Silene.leaves <- subset(Silene.leaves, flowering.state == "flowering", select = -flowering.state)

will create a new data frame containing only the rows concerning the flowering plants, and all columns except the flowering.state column (which is not needed any longer, since we know all the plants in the data set are flowering). Since you specify the data frame to subset in the first argument of the subset() function, you can refer directly to the different variables (columns) by their names, without using the $ sign.

Summary

Data can be entered in the script using the data.frame() command.
Loading data from files involves preparing a .csv file with observations as rows and measurements and grouping factors as columns. These files should only contain numbers and letters from the English alphabet; points and underscores can also be used in grouping variables and headers.
Data can be loaded through the menu in RStudio. Alternatively, data is read and then assigned to an object using the command read.table().
The data structure can be checked using the str(data.name) command.
Exploring data graphically can involve pair-wise plots of all variables with plot(), histograms with hist() and boxplots with boxplot().
Individual data entries, rows and columns can be accessed and changed using their row and column subscripts.
Data can be subset using logical statements involving ==, !=, |, and &.

2.5 Dealing with missing values

Goal

In this section you will learn how to
1. interpret the different types of missing value indicators in R
2. handle missing values in common functions
3. identify, count and set missing values

Types of missing values

NA (not available) codes missing data in R. When preparing a data file it is good practice to enter NA into "empty cells" of your Excel table. NA also appears as a result when a command cannot be executed, for example because the data contains NA and the command is not prepared to handle NA.

NaN (not a number) appears when a calculation does not yield a mathematically defined answer. R often gives a warning when NaN are generated, as in

log(-1)
[1] NaN
Warning message:
In log(-1) : NaNs produced

Handling missing values in common commands

R is very cautious. Most of the basic commands return NA as soon as an NA is present in the data. However, they usually have an optional argument to tell R to ignore NA, although this differs between commands. For example,

mean(c(1,2,3,NA))
[1] NA

yields NA. Setting the optional argument na.rm (for NA remove) to TRUE tells R to consider only the non-NA values in the calculation; thus,

mean(c(1,2,3,NA), na.rm=TRUE)
[1] 2

yields the mean of the three non-NA values. This also works for range(), sd(), var(), sum(), median(), max(), min() and many other commands. An exception is the command length(). It gives the number of cases regardless of the presence of NA. Thus,

length(c(NA, NA, NA))
[1] 3

The commands cor() for correlation and cov() for covariance ignore NA with the argument use="complete.obs":

cor(n.1, n.2, use="complete.obs")

Here, n.1 and n.2 are two vectors of the same length. Other commands such as lm() for calculating linear models ignore NA in the default setting.
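As a quick illustration, here is a sketch using two made-up vectors named n.1 and n.2 to match the notation above:

n.1 <- c(1.2, 3.4, NA, 5.6, 7.8)
n.2 <- c(2.1, NA, 4.3, 6.5, 8.7)
mean(n.1)                           # returns NA
mean(n.1, na.rm=TRUE)               # ignores the NA
cor(n.1, n.2, use="complete.obs")   # uses only the positions where both vectors are non-NA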
Consult the help files to find out how NA is dealt with by specific commands (see the lesson on interpreting help files).

Finding and counting missing values

To find out whether your data contains NA, use

is.na(data.name)

or, more specifically,

is.na(data.name$column.name)

This command can be applied to any data structure or part thereof. is.na() returns a logical value for each element of the data, with TRUE for both NA and NaN and FALSE for other entries. To find only NaN use is.nan(). For example,

is.na(c(1,NA,3,NA,5))

returns

[1] FALSE TRUE FALSE TRUE FALSE

Vectors of logical values can be summed, because TRUE is automatically converted to 1 and FALSE to 0. This way the number of missing values can be obtained. For example,

sum(is.na(c(1,NA,3,NA,5)))

will yield the answer

[1] 2

indicating the number of NA in the data. To access rows that have no NA in any of the columns use

complete.cases(data.name)

Note that the function summary(data.name) will also provide the number of NA in each column. To find out where the NA are located in the data use the command which(). For example,

which(is.na(c(1,NA,3,NA,5)))

returns

[1] 2 4

because elements 2 and 4 are NA.

Setting missing values

To set certain data points to NA, for example when you realize that there is a problem with them, access the elements of the data frame using row and column numbers and assign NA to them. For example,

numbers.1[1] <- NA

will set the first element of the vector to NA.

data.name[2,3] <- NA

will set row 2, column 3 to NA. Note that these changes are made to the data frame object stored in R's current workspace, NOT to your original data file.

Summary

Missing data types in R are NA (not available, to be used in data tables) and NaN (not a number).
Many commands have optional arguments to deal with missing values; for example, na.rm=TRUE will tell R to ignore missing values in mean(), range(), sum() and other basic functions.
The command is.na(data.name) is used to identify NA and NaN. sum(is.na(data.name)) will return the number of missing values in the data, and which(is.na(data.name)) will return the subscript numbers of the elements that are NA or NaN.
Data entries can be set to NA with the assignment arrow, as in numbers[1] <- NA.

2.6 Understanding help() functions

R provides a large number of standardized help functionalities and web resources. You can
1. find information on the use of commands (that you know the name of),
2. search for terms or words, perhaps related to an analysis you want to do, or
3. use web-based search functions that allow you to find commands, but also packages, tutorials and forum entries.

Typing a question mark followed by a command, for example

?t.test

will open a help file, try it out! At the top of the page, the package that the command originates from is given in braces. Here, t.test{stats} shows that the command t.test() originates from the package stats. Further sections contain a description, the usage, the arguments and the value or object returned by the command. The help file for t.test() indicates that its arguments include x, y, alternative and mu. It further explains that the t.test() command returns a list object including the value of the t-statistic, the estimated mean or difference in means, the degrees of freedom and the P-value. The help pages end with references, similar commands (the "see also" section) and, importantly, examples. Example code can be copy-pasted directly into the console, as it uses only data that comes with R.
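As an aside, instead of copy-pasting you can also run all the examples of a help page in one go with the example() command:

example(t.test)   # runs the example section of the t.test() help page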
Running example code is a very good way to examine how to work with a command.

When looking for a command or term, you use two question marks followed by the term you are looking for. Information on correlation analyses, for example, is found by typing

??correlation

This command will open a table of commands related to correlation in some way. The table lists the commands and the packages they originate from, as well as a short description. Clicking on these entries will take you to the help files for those commands.

A number of webpages are dedicated to the use of R. For beginning users a search in the R help archive can be very helpful. It collects questions on R, and the answers there are often given by well-known authors of R books and packages. The Namazu R search page is accessible directly or from the R console using the command RSiteSearch("search.word"); make sure to enter the search word in quotes (" "). This page often leads to newer and more advanced topics. Try it out!

2.7 Exercises

2-A Vector creation
Write R code to generate the following vectors; explore the functions seq() and rep() using the help on commands:

A. 1.0 1.3 1.6 1.9 2.2 2.5 2.8 3.1 3.4 3.7 4.0 4.3 4.6 4.9
B. 1 2 3 4 1 2 3 4 1 2 3 4
C. 1 2 3 4 1 2 3 4 1 2 3 4 85
D. 14 12 10 8 6 4 2 0
E. 5 5 12 12 13 13 20 20

2-B What is the correct code to load the data in a .csv file that looks like this?

Fertilizer;plant.biomass;plant.height;seed.weight
high;0,180476248;21,31200596;0,029829762
…

Select one:
(a) read.table(file=file.choose(), header=T, sep=",", dec=",")
(b) read.table(file=file.choose(), header=T, sep=",", dec=";")
(c) read.table(file=file.choose(), header=F, sep=";", dec=",")
(d) read.table(file=file.choose(), header=T, sep=";", dec=",")
(e) read.table(file=file.choose(), header=T, sep=",", dec=".")

2-C Loading and exploring data structure
Load the iris data that R provides internally by typing

data(iris)

A. What sort of data type is iris?
B. How many rows (observations) and columns (variables) does the iris dataset have?
C. Which variable of the data frame iris is a factor and how many levels does it have? Select one:
(a) The variable Species is a factor and it has 5 levels.
(b) The variable Species is a factor and it has 3 levels.
(c) The variable 'data.frame' is a factor and it has 150 levels.
(d) The variable 'data.frame' is a factor and it has 5 levels.

2-D Loading and graphical exploration of data
Please download this file and load it into R: sunflower fertilizer file
A. Is there any indication in the graphs that plant height or seed weight differ between plants subjected to the two fertilizer treatments? Select one:
(a) Plant height appears to be considerably larger in plants treated with high nutrient fertilizer, whereas seed mass appears to be similar in plants from both treatments.
(b) Plant height appears to be considerably lower in plants treated with high nutrient fertilizer, whereas seed mass appears to be similar in plants from both treatments.
(c) Plants from the two treatments do not appear to differ in height or seed weight.
(d) Plant height and seed weight appear to be lower in plants from the low nutrient fertilizer treatment.
B. Create a new data frame containing only the rows of the "low" treatment.

2-E Subsetting data
Explore the dataset mtcars in R. You can get the structure and the column names of the data by typing the commands str(mtcars) and names(mtcars), respectively.
Write R code to subset the dataset mtcars according to the following requirements (NOTE: each requirement is independent):
A. Select the cars whose cyl value (a column in the dataset) is no smaller than 5.
B. Show all the fields (columns) of the first 10 cars.

2-F Data manipulation
Which of the following lines of script multiplies column 3 of the data frame my.data by 1.5?
(a) my.data[,3] *1.5 <- my.data[,3]
(b) my.data[3,] *1.5 <- my.data[3,]
(c) my.data[,3] <- my.data[,3] *1.5
(d) my.data[3] <- my.data[3] *1.5

Solutions:

2-A
A. v1 <- seq(1, 4.9, 0.3): you have to create a regular sequence, thus indicating the use of the function seq(), from 1 (first argument) to 4.9 (second argument) using an increment of 0.3 (third argument).
B. v2 <- rep(1:4, 3): this time you need to repeat the sequence of the integers 1 to 4 three times, so you should use the rep() function, giving the vector to be repeated (first argument) and the number of repeats (second argument). As the vector to be repeated is simply the sequence of integers from 1 to 4, you can use the : symbol to save time.
C. v3 <- c(v2, 85): the vector you have to create is basically the same as the vector in question B, with just 85 added at the end. So you should use the function c(), which concatenates several vectors into one, in the order specified.
D. v4 <- seq(14, 0, -2): again a sequence, but this time in decreasing order. You can generate it by using seq() with a negative increment (-2 here).
E. v5 <- rep(c(5,12,13,20), each=2): this time we do not want to repeat a whole vector as in B, but we want to repeat each element of a vector twice. This is done by using the argument each in the function rep(). The vector (5, 12, 13, 20) is not an obvious sequence, so we just use c() to provide the vector to the rep() function.

2-B d

2-C
A. data.frame
B. 150 observations, 5 variables (columns)
C. b: the variable Species is a factor with three levels: setosa, versicolor and virginica.

2-D
A. a. Read the table into R using this command:
t <- read.table("Downloads/sunflower.csv", sep=",", header=T)
Make the boxplots by typing:
boxplot(t$plant.height ~ t$Fertilizer)
boxplot(t$seed.weight ~ t$Fertilizer)
B. t.low <- t[t$Fertilizer=="low", ]

2-E
A. mtcars[mtcars$cyl >= 5, ]
B. mtcars[1:10, ]

2-F c

2.8 Web resources and books on R

Web resources

CRAN page
Main R page for downloading and information.

R-Studio
Effective user interface for R, free download.

Quick R
Useful web page for beginners and slightly more advanced users, related to the book R in action mentioned below.

R reference card - print it and fold it!
The perfect pocket card. Great for refreshing your memory or pointing you in the right direction.

Introduction to R - Webinar in Youtube
Some of the presentations from the famous web-based R introductory course by Paul H. Geissler (now retired).

Stackoverflow (forum)
Every time you type an R-related question into Google, this is one of the best hits to follow.

R-help info page
Questions and answers on R use. Note that the domain stat.ethz.ch often has good information.

R tutor
Very informative, with both beginner and selected advanced topics.

Teaching with data simulations
Inspiration for teachers.

Books
Click on a title to see the book in the Swedish libraries.

The R book (Crawley)
Large overview with many biological examples, suitable for beginning users with some statistical knowledge.

Getting started with R (Beckerman & Petchey)
Very good and concise introduction to R for people with experience in statistical analyses from other programs.
25 recipes for getting started with R (Teetor)
A quite brief (44 p.) list of basic tips on how to use R for data exploration. Only recommended for beginners; most of these tips are also found in Quick R.

Introductory statistics with R (Dalgaard)
A popular and well-written compendium of basic statistics in R. Suitable for beginners.

R in action (Kabacoff)
From the author of Quick R; this book follows a case-study approach with many practical data sets. Previous experience with R is desirable.

Data analysis and graphics using R (Maindonald & Braun)
A long compendium of case studies, some of them a bit outdated and quite technical. Ideal for more advanced students.

R graphics (Murrell)
A popular guide on how to make perfect graphs. Intended for those who want to improve their artwork in R.

The art of R programming (Matloff)
Teaches you how to use R for programming efficiently. Previous experience with R and programming concepts is required.

Modern applied statistics with S (Venables & Ripley)
Rather technical explanations on how to use the S language (the one R is based on) for statistics. For more advanced learners.

Check more books and free PDFs here!

3 Basic Statistics with R

3.1 Types of data

Most data can be broadly classified into two categories:

Continuous data, also known as numeric data, is any form of data in which data points can take any number within a given range. Common examples include measurements such as height or weight, and derived quantities such as slopes.

Categorical data, also known as factor data, is any form of data in which observations are grouped into categories. Examples include species identity, hair color, etc. Binary data is a subset of categorical data in which the data can only fall into one of two groups (e.g. dead or alive, heads or tails).

Being able to distinguish between these types of data is extremely important because, as we will see later, the type of data is an important factor in deciding the appropriate way to analyze the data statistically.

3.2 Exploring data with tables

One of the simplest diagnostics in R is the table() command. This command allows for the creation of contingency tables that report the counts of cases (rows) in the different categories of another variable or of several variable combinations. These tables are extremely useful for checking whether your data is complete and correctly structured.

To demonstrate this we will use the example dataset warpbreaks, which records the number of times different wools break at different levels of tension. The researchers set up this experiment so that each combination of wool and tension had an equal sample size. In order to check this we could simply call the data and count the measurements. However, this would become extremely tedious in larger datasets. We can instead use the table() command to answer the question:

data(warpbreaks)
table(warpbreaks$wool, warpbreaks$tension)

    L M H
  A 9 9 9
  B 9 9 9

We can thus quickly confirm that each combination of treatments does indeed have 9 measurements associated with it. Using statistics to analyze contingency tables is discussed further in an upcoming section.
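A related command that is not used elsewhere in this script, prop.table(), converts such a table of counts into proportions, which can be easier to compare between groups. A minimal sketch:

counts <- table(warpbreaks$wool, warpbreaks$tension)
prop.table(counts)              # proportions of the grand total
prop.table(counts, margin = 1)  # proportions within each row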
3.3 Exploring data graphically

In most situations you will benefit from first looking at your data graphically. This is in order to
- find out whether your data is reasonable,
- detect any large outliers (for example data entry mistakes),
- assess what the approximate distribution of the data is, and
- see the main patterns in the data.

The plot() command

A quick graphical check of the data is provided by the simple command plot(data.name), which will open a new window displaying plots of all pair-wise variable combinations (Figure 3-1).

plot(iris)

Figure 3-1 Pairwise plot for the iris dataset

Histograms

The command hist() produces a histogram displaying data values on the x-axis against their frequencies on the y-axis, allowing you to judge the distribution of the data. The command hist() is applied to individual variables (columns) of the data, which are given by the name of the data frame followed by a dollar sign and the name of the variable (column). The output is shown in Figure 3-2.

hist(iris$Sepal.Length)

Figure 3-2 Histogram of sepal length of iris

Boxplots

Boxplots display continuous data separated into the levels (groups) of a factor (grouping variable). In the default settings, the command boxplot() shows medians as thick black lines and quartiles as a box around the median. The t-bars ("whiskers") mark the range of the data that is within 1.5 times the inter-quartile distance from the median. Data points outside that range are regarded as outliers and are displayed as circles. The main argument of the boxplot() command is a formula statement relating the continuous variable on the left side to the grouping variable on the right side with a tilde symbol (~), as in continuous.variable.name ~ factor.variable.name. Boxplots (Figure 3-3) can be used to get an idea of whether there are large differences between groups, whether the data is distributed symmetrically within groups, and whether there are outliers.

boxplot(Sepal.Length ~ Species, data=iris)

Figure 3-3 Boxplot of sepal length of iris

You can learn how to produce nicer looking graphs in R in the section Basic graphs with R.

3.4 Descriptive statistics

Another quick way to explore data is to use the command summary(name.of.data.frame). This command gives you a number of descriptive statistics for each continuous variable (range, quartiles, mean, median) (see the getting started with statistics section). For factor variables the command tabulates the number of observations for each factor level. In our example:

summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  

You can calculate these and further descriptive statistics directly using the following commands (Table 3), which are all applied to vectors (or data frame columns) of continuous variables:

Table 3. Commands to calculate descriptive statistics

Statistic            Command
Mean                 mean(variable.name)
Median               median(variable.name)
Range                range(variable.name)
Standard deviation   sd(variable.name)
No. observations     length(variable.name)
Variance             var(variable.name)
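For example, applying some of these commands to the sepal length column of the iris data (outputs omitted):

mean(iris$Sepal.Length)     # arithmetic mean
sd(iris$Sepal.Length)       # standard deviation
range(iris$Sepal.Length)    # smallest and largest value
length(iris$Sepal.Length)   # number of observations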
Calculating descriptive statistics for groups of data

Often you will wish to apply these commands over groups of the data that are defined by a grouping factor. You can do this using tapply(). The main arguments of tapply() are X, the variable that you want to summarize, INDEX, one or more grouping variable(s), and FUN, the command you want to apply. The INDEX variable is given as a list object, which is typically created within the tapply() command using INDEX=list(variable.name):

tapply(X=variable.name, INDEX=list(variable.name), FUN=command.name)

You can state all kinds of functions in the FUN argument.

tapply(X=iris$Sepal.Length, INDEX=list(iris$Species), FUN=mean)
    setosa versicolor  virginica 
     5.006      5.936      6.588 

3.5 Comparing two groups of measurements

Identifying the type of test

One-sample test
One-sample tests are used when a single sample is to be compared with a specific hypothesized value for the mean. Examples include fixed-value comparisons such as whether average human height is 1.77 m.

Two independent sample test
Here, measurements on two samples from two different populations are compared. Examples include comparisons of males and females, comparisons of plants/animals subjected to two different treatments, and comparisons of two different species/localities.

Paired-sample test
Paired-sample tests are used when two different measurements were taken on the SAME experimental units. Examples are before-and-after studies on the effect of medical treatments.

Work-flow for group comparisons (one or two groups)

Assume that we ask the question whether males run faster than females. We suggest the following workflow (Figure 3-4) for group comparisons. First the data is plotted. The assumption of t-tests is that the data is normally distributed, and this needs to be assessed (see below) before proceeding further. If the data is normally distributed you can proceed to the t-test; otherwise data transformations are needed (i.e. transform the speed at which males and females run). If that does not work, a non-parametric test should be conducted.

Figure 3-4 Workflow for one- and two-group comparisons

Assessing normality using quantile-quantile plots

The main assumption of t-tests is that the data is normally distributed. Depending on the type of group comparison you need to assess normality for the following vector(s):
- one-sample test: the data of the sample
- two independent sample test: the data vector of each group separately
- paired-sample test: the vector of differences between the two treatments for each experimental unit

You can view the distribution of your data using the command hist() as described above. The command qqnorm(your.vector) produces a quantile-quantile plot (qq-plot), which is regarded as the best graphical assessment of whether or not data conforms to the normal distribution. A quantile is a value of the data that is just larger than a certain percentage of the data. The median, for example, is the 50% quantile, and the quartiles are the 25% and 75% quantiles (see here). The qq-plot displays two different types of quantiles. On the y-axis the sample quantiles, i.e. the data points themselves, are indicated. You can check this by comparing the histograms and qq-plots below; the histogram and the y-axis of the qq-plot have the same range. The x-axis of the qq-plot represents the standardized theoretical quantiles of a normal distribution corresponding to each data point. The qqnorm(your.vector) command first calculates the quantile of each value in the data, i.e. what percentage of the data is smaller than or equal to that value. It then looks up the corresponding quantile (i.e. value) of the standard normal distribution with a mean of 0 and a standard deviation of 1. Thus, points with values around zero for the theoretical quantile should be close to the mean of the data on the y-axis. qq-plots are evaluated with the aid of the command qqline(your.vector), which produces a line through the first and third quartiles of the data.
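A minimal sketch using 50 simulated values (rnorm() draws random numbers from a normal distribution, so your plot will look slightly different each time):

x <- rnorm(50)   # 50 values from a standard normal distribution
qqnorm(x)
qqline(x)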
You expect the points to be close to this line if the data is normally distributed.

Below you see examples of histograms and qq-plots (Figure 3-5) for 250 data points that are normally distributed, left-skewed and right-skewed. Data that looks like the left-skewed or right-skewed examples should be transformed before analysis. If this does not work, a non-parametric test should be used.

Figure 3-5 Histograms and qq-plots for normally distributed, right-skewed and left-skewed data

For smaller datasets, quite some deviation from the line is expected even in normally distributed data, especially at the extremes. Below you see three examples of histograms and corresponding qq-plots (Figure 3-6) for five and ten values sampled from a normal distribution. If your data looks like this you can use the parametric tests!

Figure 3-6 Histograms and qq-plots for small datasets

Transformations

To obtain normally distributed data for further analysis the following transformations (Table 3-2) are recommended:

Table 3-2 Useful transformations

Data                      Transformation   R code
Right-skewed (common!)    loge(Y)          log(your.data)
                          log10(Y)         log10(your.data)
                          1/Y              1/your.data
                          √(Y)             sqrt(your.data)
Left-skewed data (rare)   Y^a              your.data^a
Percent data              arcsin(√(Y))     asin(sqrt(your.data))

Remember that most of these commands, with the exception of the power transformation Y^a for left-skewed data, are not defined for negative values (log and 1/Y are also problematic at zero). You may need to add a constant value to all values in order to perform the transformation. You can transform your data, assign it to a new vector or data frame column and plot it again. You may need to try out several different transformations. If you are still not satisfied with the distribution, please use a non-parametric test. It is also possible to apply the transformations directly within other commands, for example hist(log(your.data)).

Once you have established that the distribution of your data is normal you are ready to conduct the appropriate t-test; otherwise proceed to the non-parametric alternatives.

3.6 Using t-tests with R

T-tests are calculated using the command t.test(). The arguments to this command identify the type of test to be conducted.

One-sample t-test

We use our example in which we want to test whether average human height is 1.77 m. This is our data:

height <- c(1.43,1.75,1.85,1.74,1.65,1.83,1.91,1.52,1.92,1.83)

Note that the alternative hypothesis is that average human height is different from 1.77 m. We declare the model in the following way and obtain the following output:

t.test(height, mu = 1.77)

        One Sample t-test
data: height
t = -0.5205, df = 9, p-value = 0.6153
alternative hypothesis: true mean is not equal to 1.77
95 percent confidence interval:
 1.625646 1.860354
sample estimates:
mean of x 
    1.743 

Therefore, we cannot reject the null hypothesis that average human height is 1.77 m.

Two independent sample t-test

We use our example of dispersal distance in male and female butterflies. This is your data:

distance <- c(3,5,5,4,5,3,1,2,2,3)
sex <- c("male","male","male","male","male","female","female","female","female","female")

Before running the test it is important to consider your alternative hypothesis, specifically whether you want to run a one-tailed or a two-tailed test. If no alternative hypothesis is specified, the command will assume a two-tailed test. The two-sample t-test has a second assumption in addition to the normality of the data: equal variance in the two samples.
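As an aside, one formal way to check the equal-variance assumption, not covered elsewhere in this script, is the F test provided by var.test(), which accepts the same formula notation as t.test(); a sketch using the distance and sex vectors defined above:

# a non-significant p-value is consistent with equal variances
var.test(distance ~ sex)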
If the variances can be assumed to be equal, this must be specified with the argument var.equal = TRUE; otherwise R automatically uses Welch's t-test, which does not assume equal variances. Here we assume equal variances and perform a two-tailed test.

t.test(distance ~ sex, var.equal = TRUE)

        Two Sample t-test
data: distance by sex
t = -4.0166, df = 8, p-value = 0.003859
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.4630505 -0.9369495
sample estimates:
mean in group female   mean in group male 
                 2.2                  4.4 

Thus, the dispersal distance of male butterflies differs significantly from that of female butterflies. We can also specify a one-sided alternative hypothesis by adding the argument alternative = "less" or alternative = "greater", depending on which tail is to be tested:

t.test(distance ~ sex, var.equal = TRUE, alternative = "greater")

        Two Sample t-test
data: distance by sex
t = -4.0166, df = 8, p-value = 0.9981
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -3.218516       Inf
sample estimates:
mean in group female   mean in group male 
                 2.2                  4.4 

The result of this test states that female dispersal distance is not significantly greater than male dispersal distance.

Paired sample t-test

As an example of a paired test, you investigate whether the sleep of students is affected by an exam. You ask 6 students how long they slept the night before an exam and the night after an exam. These are the answers you get:

sleep.before <- c(4,2,7,4,3,2)
sleep.after <- c(5,1,3,6,2,1)

Here you simply supply the two vectors and add the argument paired=TRUE to the t.test() command:

t.test(sleep.before, sleep.after, paired=TRUE)

        Paired t-test
data: sleep.before and sleep.after
t = 0.7906, df = 5, p-value = 0.465
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.501038  2.834372
sample estimates:
mean of the differences 
              0.6666667 

Well, this test is NOT significant, and thus the data does not support an effect of exams on students' sleeping time. But maybe you forgot about the party after the exam?

3.7 Non-parametric alternatives

In certain cases it is impossible to meet the assumption of normality required for standard t-tests, even after transformations. In such cases a non-parametric alternative such as the Wilcoxon family of tests may be appropriate. These include one-sample, two-sample (also known as the Mann-Whitney U test) and paired alternatives, all available through the command wilcox.test(). The syntax of wilcox.test() is similar to that of t.test(); see ?wilcox.test.
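For example, the butterfly dispersal comparison from section 3.6 could be re-run non-parametrically; a sketch reusing the distance and sex vectors defined there:

# two-sample Wilcoxon (Mann-Whitney U) test
# with tied values R warns that exact p-values cannot be computed
wilcox.test(distance ~ sex)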
3.8 Correlation analysis

In some cases you may wish to assess the relationship between two variables. One way to do this is correlation analysis. The goal of correlation analysis is to determine how strongly two variables are related. This differs from regression analysis, which seeks to determine a line of best fit for the relationship and assumes that the predictor variable directly affects (causes) the outcomes of the response variable.

Pearson correlation

One of the most common correlation analyses for parametric data is the Pearson product-moment correlation coefficient, commonly called Pearson's r. This test determines the level of relatedness of two variables using a score that runs from -1 (perfect negative correlation) to 1 (perfect positive correlation). A value of zero indicates no correlation.

Since Pearson's r is parametric, it is advisable to test the assumption of normality before running this test. In R, Pearson's r can be calculated using the cor.test() command. Here we once again use the sample dataset iris to assess the correlation between sepal length and petal length:

cor.test(iris$Sepal.Length, iris$Petal.Length)

        Pearson's product-moment correlation
data: iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8270363 0.9055080
sample estimates:
      cor 
0.8717538 

This output shows a highly significant (p-value < 2.2e-16) and strongly positive (0.87) correlation between these two variables. Note that in this case the p-value is used to reject the null hypothesis that the true correlation is equal to zero.

Spearman correlation

A non-parametric alternative to Pearson's r is Spearman's rank correlation coefficient, or Spearman's rho. Like Pearson's r, Spearman's rho measures the level of correlation of two variables on a scale from -1 to 1. The difference between the two measures is that Spearman uses the rank order of the data rather than the raw values. Spearman's rho can also be calculated using the cor.test() command. Below we repeat the previous correlation analysis, this time using Spearman's rho:

cor.test(iris$Sepal.Length, iris$Petal.Length, method="spearman")

        Spearman's rank correlation rho
data: iris$Sepal.Length and iris$Petal.Length
S = 66429.35, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.8818981 

Note that this test produced similar, but not identical, results compared with Pearson's r.

3.9 Cross-tabulation and the χ2 test

A basic contingency table cross-tabulates two categorical variables. In many cases we may wish to test whether the two grouping variables are independent. One of the most common ways to analyze contingency tables is the χ² test (chi-square test). The χ² test works by first calculating the difference between expected and observed counts:

X² = Σ (observed − expected)² / expected

The result of this calculation, the so-called X² score, is then compared to the χ² distribution to calculate a p-value that determines whether the observed values differ significantly from the expected values.

In order to demonstrate the usage of χ² tests we will use an example of eye color counts in two different groups of flies. The dataset can be found in the attachments as 3.9_flies_eyes_color.csv. We begin by loading the data and creating a contingency table:

flyeyes <- read.csv("3.9_flies_eyes_color.csv", header = T)
tab <- table(flyeyes$Group, flyeyes$Eyecolor)
tab

    Red White
  A  34    41
  B  16     9

Note that the ratio between red and white eyes differs between group A and group B. We will use the chi-squared test to determine whether the data is more compatible with the null hypothesis that eye color and group are independent of each other or with the alternative hypothesis that they are not independent.

chisq.test(tab)

        Pearson's Chi-squared test with Yates' continuity correction
data: tab
X-squared = 1.92, df = 1, p-value = 0.1659

In this case, based on a chi-squared value of 1.92 and 1 degree of freedom, we obtain a p-value of 0.1659. Thus, the data suggests that these variables are independent.
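As a brief aside, the expected counts that enter the formula above do not have to be computed by hand: chisq.test() returns them as part of its result.

test <- chisq.test(tab)
test$expected   # expected counts under the independence hypothesis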
Another useful command is prop.test():

prop.test(tab)

data: tab
X-squared = 1.92, df = 1, p-value = 0.1659
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.43264180  0.05930846
sample estimates:
   prop 1    prop 2 
0.4533333 0.6400000 

The first two lines return the same values as we observed with the chi-squared test. However, two additional groups of numbers are present in this output. The first is the 95% confidence interval. These two numbers represent the lower and upper estimates of the difference in the proportion of red-eyed flies between the two groups. Note that because the interval includes zero, we are not confident that there is a difference between the two groups at all. The sample estimates show the estimated proportion of red-eyed individuals in group A and in group B, respectively.

3.10 Summary

- It is important to distinguish between continuous and categorical data when determining which statistical test is most appropriate.
- Exploring data through tables and graphs is critical to understanding the data.
- Many statistical tests rely on an assumption of a normal distribution.
- It is sometimes possible to transform non-normal data into normal data.
- Non-parametric tests should be used on data that is not normal even after transformation.
- T-tests compare the means of one or two data groups.
- Correlation analysis determines whether two variables are associated.
- Chi-squared tests are used to determine whether two categorical variables are independent.

3.11 Exercises

3-A Below you find a number of descriptions of experiments. Please assign the appropriate test: one-sample test, two independent groups or paired samples.
A. You read that it costs on average 600 SEK to go to the hairdresser in Uppsala and you want to find out whether that is actually true. You walk through the city and obtain prices from 10 hairdressers.
B. You investigate whether the flower color of your grandmother's orchids becomes more intense after applying fertilizer. You score color intensity in 10 orchids before fertilizing and one week after fertilizing.
C. You want to investigate whether arrival times at lectures differ between male and female students. You come 15 minutes early to large lectures and record the arrival time and sex of the students. You obtain data from 60 women and 57 men.
D. You study two species of plants, red and white campion. You want to know which species has larger flowers and measure flower size in 50 individuals of each species.
E. You want to study whether the hand people write with is stronger than their other hand. You ask 25 people to participate in your experiment and measure how strongly they can squeeze balloons with either hand.

3-B A non-parametric test is applied when:
a. There are no parameters
b. The variables are not independent
c. The groups do not have the same sample size
d. The assumptions of parametric tests are not met

3-C Below you find data on snow-melt times in two different habitats, snow-beds and ridges. Use a t-test to find out whether there is a significant difference in snowmelt times between these two habitats. Assume that the variances are equal. What is the p-value? Generate the data using this code:

snowmelt <- c(110,120,109,101,105,99,106,108,95,98)
habitat <- c(rep("snowbed", times=5), rep("ridge", times=5))

3-D You are investigating two different nature conservation areas (area 1 and area 2). You would like to know whether interactions between poplar trees and leaf-eating insects differ between these two reserves.
For this purpose, you measured the leaf area that has been eaten (in %) on 10-20 year old poplars. 52 trees were sampled in each reserve. The data is available in the attachments as 3-D. Does the consumed leaf surface differ significantly between the two reserves? Hint: conduct the appropriate statistical test, produce an appropriate graph and interpret the results critically.

3-E You want to understand interactions between insects and oaks. Do older trees support more insects, and if so, how much more? For this purpose, you set up insect traps in 20 oak trees of different ages and measure the total dry weight of insects collected in one month (July). The data is available in the attachments as 3-E. Hint: conduct the appropriate statistical test, produce a graph and interpret the results.

3-F You want to study willow shrub responses to herbivory: do willows produce tannins (which are known to act as defense compounds) in response to herbivory? For this purpose, you selected 10 pairs of willow shrubs of similar age and size growing close to each other. In May and June you spray one shrub in each pair with insecticide each week and the other one with water. At the end of June you measure the tannin concentration in 50 leaves per tree. The data is available in the attachments as 3-F. Hint: conduct the appropriate statistical test, produce an appropriate graph and interpret the results.

3-G Explore the swiss data frame and indicate whether the following statements are true or false:
A. Fertility correlates positively with religiosity and agriculture, and negatively with education and examination.
B. Education and examination are not strongly positively correlated.
C. Swiss cities with better education are usually more catholic.
D. More extensive agriculture causes more fertility.
E. The more catholic you are, the more fertile you will be.

3-H Using Spearman correlation and Pearson correlation, the correlation between fertility and mortality is:
(a) 0.44 and 0.42
(b) 0.42 and 0.44
(c) 0.43 and 0.41
(d) 0.41 and 0.43

3-I I am counting people who arrive at a bank on a sunny summer afternoon and I record the color of their clothes. I find that 35 men wore a white t-shirt, 22 a blue one, and 7 a black one. On the other hand, 14 women wore a white dress, 12 a light-blue one and 8 a dark one. Answer the following questions:
A. Are the colors chosen at random?
B. Assuming that the female/male ratio in the city where the bank is situated is 50:50, do men go to the bank more often than women?
C. Is color selection gender-biased?

3-J You receive the following data regarding the behavior and colour of crabs:

            Blue  Red
Aggressive    36   24
Passive       32   28

Are colour and behavior independent?

Solutions

3-A A. one-sample test, B. paired samples, C. two independent groups, D. two independent groups, E. paired samples

3-B d

3-C p = 0.09112: t.test(snowmelt ~ habitat).
The data indicates that there is no significant difference between the habitats.

3-D
# Load the data and check the object
poplar <- read.table(file.choose(), sep=";", header=T, dec=",")
str(poplar)
# convert area to a factor
poplar$area <- as.factor(poplar$area)
# plot the data
boxplot(consumed_surface ~ area, data=poplar, xlab="Area", ylab="Consumed surface")
# check normality within each area
par(mfrow=c(2,2))
hist(poplar$consumed_surface[poplar$area=="1"], main="Area 1")
hist(poplar$consumed_surface[poplar$area=="2"], main="Area 2")
qqnorm(poplar$consumed_surface[poplar$area=="1"])
qqline(poplar$consumed_surface[poplar$area=="1"])
qqnorm(poplar$consumed_surface[poplar$area=="2"])
qqline(poplar$consumed_surface[poplar$area=="2"])
# The data have only minor deviations from normality, so a two-sample t-test is appropriate
t.test(consumed_surface ~ area, data=poplar)
# Output and graphs
means <- tapply(poplar$consumed_surface, poplar$area, mean)
se <- tapply(poplar$consumed_surface, poplar$area, function(x) sd(x)/sqrt(length(x)))
par(mfrow=c(1,1))
mp <- barplot(means, ylim=c(0,20), las=1, xlab="Area", ylab="Consumed surface")
arrows(mp, means-se, mp, means+se, angle=90, length=0.2, code=3, col="black", lty=1, lwd=1)

Interpretation: The consumed surface is around 17.5% in both areas and does not differ significantly between them.

3-E
# Load the data and check the object
grazing <- read.table(file.choose(), sep=";", header=T, dec=",")
str(grazing)
par(mfrow=c(1,1))
plot(grazing$insect ~ grazing$age, xlab="Age of the tree (year)", ylab="Occurrence of insects")
# Analysis: linear regression
model <- lm(grazing$insect ~ grazing$age)
# Checking assumptions
par(mfrow=c(2,2))
plot(model)
# Output and graphs
summary(model)
anova(model)
# This graph is optional
p_conf1 <- predict(model, interval="confidence")
p_pred1 <- predict(model, interval="prediction")
par(mfrow=c(1,1))
plot(grazing$insect ~ grazing$age, xlab="Age of the tree (year)", ylab="Occurrence of insects")
abline(model)
matlines(grazing$age, p_conf1[,c("lwr","upr")], col=2, lty=1, type="b", pch="+")
matlines(grazing$age, p_pred1[,c("lwr","upr")], col=2, lty=2, type="b", pch=1)

Interpretation: Insects occur more frequently on older trees; on average, an increase of one year in age is related to an increase of 16 mg in the dry insect biomass supported.

3-F
# Load the data and check the object
tannin <- read.table(file.choose(), sep=";", dec=",", header=T)
str(tannin)
# Testing assumptions: normality of the pairwise differences
tannin$diff <- tannin$water - tannin$insecticide
qqnorm(tannin$diff)
qqline(tannin$diff)
# Analysis: paired t-test
t.test(tannin$water, tannin$insecticide, paired=TRUE)
mean.diff <- mean(tannin$diff)
mean.diff
se.diff <- sd(tannin$diff)/sqrt(length(tannin$diff))
se.diff

Interpretation: Insecticide application significantly reduced the tannin content compared to the water treatment, suggesting that the presence of insects induces tannin production.

3-G A. True, B. False, C. False, D. False, E. False

3-H a

3-I
A. No, p-value = 0.0001381: chisq.test(c(35+14, 22+12, 7+8))
B. Yes, p-value = 0.002442: chisq.test(c(35+22+7, 14+12+8))
C. No, p-value = 0.2105:
summer <- as.table(rbind(c(35, 22, 7), c(14, 12, 8)))
dimnames(summer) <- list(gender=c("M","F"), color=c("White","Blue","Black"))
chisq.test(summer)

3-J Yes, p = 0.5805.

4 Linear models

4.1 Overview - In this section you will
- learn how to design and interpret linear models,
- assess whether models meet assumptions using analysis of residuals, and
- learn how to define and interpret models including interaction terms.

4.2 Classes of linear models

Linear models are a large family of statistical analyses that relate a continuous response to one or several explanatory variables. Explanatory variables can be grouping factors, continuous variables or a combination of both.

One-way analysis of variance (ANOVA)
Tests whether the means of more than two groups are the same, for example whether fruit production differs among five populations of a plant species. If there are only two groups, a t-test is the way to proceed. ANOVA relates the variance within groups to the variance between groups. The analysis does not, however, tell you which groups are significantly different from each other. For this purpose a Tukey test can be applied.

Two-way ANOVA
This analysis assesses the influence of two grouping factors on group means, for example whether irrigation and fertilization have an effect on plant growth. Importantly, two-way ANOVA can also analyze whether the two factors interact, in the example, whether the effect of irrigation depends on the fertilizer level (or the other way around). This is called a statistical interaction. The same methods can also be applied to studies with more than two grouping factors (multi-way ANOVA).

Linear regression
Linear regression analyzes to what extent changes in a continuous explanatory variable result in changes in the response variable, for example whether larger females cause longer male courtship behavior. If a causal relationship cannot be assumed, a correlation analysis should be used. This type of analysis can also be conducted with more than one continuous explanatory variable (multiple regression).

Analysis of covariance (ANCOVA)
ANCOVA allows more complicated analyses that involve effects of grouping factors, continuous explanatory variables and their interactions. An example is an analysis of whether the response to different doses of a medication differs between male and female patients. Such more complicated linear models can also include more than two explanatory variables.

4.3 Workflow for linear models

First, start by exploring the data through a basic plot (see the plot section). Based on this, define the model and analyze the residuals. If the residuals are normally distributed, obtain and interpret the results. Otherwise, try transforming the data and re-checking the residuals, or use a different model (Figure 4-1).

Figure 4-1 Flowchart of the linear model workflow: plot data, define the model, analyze the residuals; if they look OK, obtain and interpret the results, otherwise transform the data or choose a different model.

4.4 Defining the model

First you must define the linear model that you want to use, using the lm() function. Within this function a so-called formula statement defines the relationship of the variables to each other. The response variable is always on the left side of the tilde symbol (~) and the explanatory variable(s) are on the right side, as in lm(response.variable ~ explanatory.variables, ...).
For instance, if we use the airquality internal R dataset and want to build a model to predict the ozone content of the atmosphere from wind speed, the model definition would be as follows:

My.model <- lm(Ozone ~ Wind, data = airquality)

Observe that we are assigning the model to an object, My.model. This is good practice, since you can later use the object for testing assumptions and for extracting results. Whether a variable is categorical or continuous is defined in the data itself and can be checked with str() (see the data section); R will calculate the appropriate model by itself. Thus, the following formula statement will yield a one-way ANOVA:

airquality$Month <- as.factor(airquality$Month)  # turns Month into a factor
lm(Ozone ~ Month, data = airquality)

while the following formula statement will yield a regression analysis:

lm(Ozone ~ Temp, data = airquality)

Formula statements are further used to combine explanatory variables and to define interactions. If variables should be considered only by themselves (additive effects), for example in a multiple regression without interaction, you connect the variables by a plus sign, as in:

lm(Ozone ~ Temp + Wind, data = airquality)

On the other hand, if you want to consider interactions in addition to the additive effects, use an asterisk (*) between the explanatory variables, as in:

lm(Ozone ~ Temp * Wind, data = airquality)
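A formula can also mix a grouping factor with a continuous variable, which yields an ANCOVA (see section 4.2). A sketch, assuming Month has been converted to a factor as shown above:

lm(Ozone ~ Month * Wind, data = airquality)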
Checking assumptions with diagnostic plots

All linear models, including regression, one-way ANOVA and ANCOVA, have the following assumptions:
1. The experimental units are independent and sampled at random. The independence assumption depends heavily on the experimental design.
2. The residuals have constant variance across the values of the explanatory variables.
3. The residuals, i.e. the differences between the observed values of the response variable and the values fitted by the model, are normally distributed with a mean of zero.

Analysis of residuals is thus a key step when conducting linear model analyses. For now we can concentrate on two diagnostic plots for the analysis of residuals. The first is the Tukey-Anscombe plot, which displays residuals against fitted values (the values predicted by the model). The other diagnostic plot is the qq-plot, which was explained in the section on basic statistical analyses. You obtain both of these graphs from your model object using the plot(model.object, ...) command. In the following example, one is interested in analyzing whether wind speed is a good predictor of ozone levels, using the airquality internal dataset from R as in the previous section (defining the model). The output is presented in Figure 4-2.

My.model <- lm(Ozone ~ Wind, data = airquality)
par(mfrow=c(1,2))
plot(My.model, which = c(1,2))

Figure 4-2 Residual and qq-plot for a situation where the assumptions are not fulfilled.

In the first graph, the Tukey-Anscombe plot, we expect a random scatter of points around zero. If there is any pattern in the graph, such as a funnel shape, the model fit is not good. In our example, the residuals at higher fitted values are much larger than those at low values, violating assumption 2 above. This is very common, especially in measurement data, and can be related to a larger variation at higher values. A log-transformation of the response variable often improves model fit, as in this case (see Figure 4-3). The corresponding qq-plot, testing assumption 3 above, also improves after log-transformation, and thus this analysis should definitely be based on log-transformed values.

My.model2 <- lm(log(Ozone) ~ Wind, data = airquality)
plot(My.model2, which = c(1,2))

Figure 4-3 Residual and qq-plot for a situation where the assumptions are fulfilled.

4.5 Analyzing and interpreting the model

An ANOVA table shows how much of the variation in the response is explained by the explanatory variables. To get the ANOVA table, use the command anova(My.model), where My.model is the object that stores the defined model (here the month model from the section on defining the model, using the airquality internal dataset). Below you find the command and its corresponding output.

airquality$Month <- as.factor(airquality$Month)  # turns Month into a factor
My.model <- lm(Ozone ~ Month, data = airquality)
anova(My.model)

Analysis of Variance Table
Response: Ozone
           Df Sum Sq Mean Sq F value    Pr(>F)    
Month       4  29438  7359.5  8.5356 4.827e-06 ***
Residuals 111  95705   862.2                      
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the output, Df stands for degrees of freedom, the number of values in the final calculation of a statistic that are free to vary. The F value is the test statistic, calculated as the ratio between the explained and the unexplained variance, and Pr(>F) is the corresponding p-value for this F statistic, i.e. the probability of obtaining an F value at least this large if the null hypothesis (no effect of the explanatory variable on the response variable) were true.
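The ANOVA table tells you that Month has a significant effect, but, as noted in section 4.2, not which months differ from each other. A Tukey test can answer this; a sketch using aov(), which fits the same model in a form that TukeyHSD() accepts:

tukey.model <- aov(Ozone ~ Month, data = airquality)
TukeyHSD(tukey.model)   # pairwise differences between all months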
The command summary(model) will show the parameters estimated by the model, for example the slope of the regression line in a regression analysis or the differences between group means in an ANOVA. Here we use the log-transformed regression model defined above:

summary(My.model2)

Call:
lm(formula = log(Ozone) ~ Wind, data = airquality)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4219 -0.4662  0.0663  0.5021  1.4035 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.75331    0.21879  21.726  < 2e-16 ***
Wind        -0.13726    0.02153  -6.376 4.39e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.723 on 110 degrees of freedom
(37 observations deleted due to missingness)
Multiple R-squared: 0.2699, Adjusted R-squared: 0.2632
F-statistic: 40.65 on 1 and 110 DF, p-value: 4.389e-09

The first part of the output indicates the model that was run and the distribution of the residuals by means of quantiles (see the basic stats section). In the first row of the table under the title "Coefficients" you find the estimate of the intercept of the regression line, a t-value, which is the test statistic testing whether the estimate of the intercept is different from zero, and the p-value for this test. In the second row of the same table you find the slope, which is the effect of the explanatory variable on the response variable (i.e. the magnitude of the effect), and the same test statistic and p-value as before, this time for the test of whether the slope is significantly different from zero. If the slope is different from zero, there is an appreciable effect; the effect can be either positive or negative. Notice that for these two tests there is a significance coding that is explained at the bottom of the table. Finally, R2, the percentage of variance explained by the model, is presented at the bottom of the output. It indicates how much of the variation in the data is explained by the model.

Interpreting interactions

Sometimes the effect of an explanatory variable on a response variable depends on another explanatory variable; this is termed an interaction. Below you will find some examples of what interactions look like and how to interpret them. In the following graphs, different colors symbolize different groups. The x-axis is the explanatory variable, categorical in the case of the bar plots and continuous in the case of the dot plots with trend lines. The y-axis is the continuous response variable.

In the following graph (Figure 4-4), you can assume that the continuous response variable is seed productivity, the white and dark-gray colors correspond to watering treatments (irrigation and drought), and A and B are different populations. A two-way ANOVA is the right test to apply in a scenario such as this. The first two panels show cases where the interaction term of a two-way ANOVA is not significant, while in the last two cases the interaction term of the ANOVA is significant. In the first case, the response variable (i.e. seed productivity) differs between populations but not between treatments. In the second case it also differs between treatments. Observe that in the fourth case the overall mean of the two populations is the same. In this context the interaction cases (III & IV) can be interpreted as follows: the effect of the treatment (water availability) on the response variable (i.e. seed productivity) depends on the population.

Figure 4-4 Bar plots (cases I-IV) for two treatments (white and grey) and two populations (A and B)

Now assume that, instead of having the treatment as a categorical variable, there is a spectrum of different values. For instance, instead of having drought and watered treatments, we measured the amount of water naturally available in the soil (x-axis in Figure 4-5) in the two different populations (filled and empty dots). ANCOVA is the test to apply in this particular case. To the left of Figure 4-5 you see a case where the interaction term of an ANCOVA is not significant (case I), but in the next two cases the interaction term of the ANCOVA is significant. In other words, if the interaction term is significant, then how the response variable (i.e. seed productivity) varies as a function of the continuous explanatory variable (i.e. water availability) depends on the population. As discussed in the section on interpreting the model, this effect can be assessed by looking at the slopes. Observe that in the first two cases the response variable also varies between the populations, which is not the case in the third panel (mentally try projecting the empty and filled dots onto the y-axis).

Figure 4-5 Dot plots (cases I-III) for two populations (filled and empty dots)
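If you want a quick look at a possible interaction in your own data before modelling it, base R provides interaction.plot(), which draws one line of group means per level of a trace factor; roughly parallel lines suggest no interaction, while crossing or diverging lines suggest one. A minimal sketch using the built-in ToothGrowth data (not used elsewhere in this script):

# mean tooth length for each dose, one line per supplement type
interaction.plot(x.factor = ToothGrowth$dose,
                 trace.factor = ToothGrowth$supp,
                 response = ToothGrowth$len)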
4.6 Worked Examples
The following examples follow the workflow structure (see here): make an exploratory plot, define the model, check the assumptions, and analyze and interpret the summary and test statistics.

One-way ANOVA
To test whether fruit production differs between populations of Lythrum, fruits were counted on 10 individuals in each of 3 populations.

fruits <- data.frame(fruits = c(24, 19, 21, 20, 23, 19, 17, 20, 23, 20,
                                11, 15, 11, 9, 10, 14, 12, 12, 15, 13,
                                13, 11, 19, 12, 15, 15, 13, 18, 17, 13),
                     pop = c(rep(1, 10), rep(2, 10), rep(3, 10)))
fruits$pop <- as.factor(fruits$pop)
plot(fruits ~ pop, data = fruits)

Figure 4-6 Boxplot showing the distribution of fruits per population

model <- lm(fruits ~ pop, data = fruits)
par(mfrow = c(1, 2)); plot(model, which = c(1, 2))

Figure 4-7 Distribution and QQ plot for residuals of the one-way ANOVA

anova(model)

Analysis of Variance Table

Response: fruits
          Df Sum Sq Mean Sq F value    Pr(>F)
pop        2  420.0 210.000  19.104 6.767e-06 ***
Residuals 27  296.8  10.993
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The analysis shows that fruit production differs among populations. You can also make a bar plot out of this data; for that, please refer to the section on how to make plots.

Two-way ANOVA
In a study on pea cultivation methods, pea production was assessed in two treatments of irrigation (normal irrigation and drought) and in three treatments of radiation (low, medium and high). 10 plants in each of the six combinations were considered.

plants <- data.frame(seeds = c(39,39,39,40,40,39,41,42,40,40,
                               39,38,41,41,40,41,40,40,41,40,
                               38,40,40,39,42,40,39,41,39,40,
                               39,40,41,40,41,39,40,41,40,39,
                               42,40,39,39,42,40,39,39,39,39,
                               41,38,40,39,41,42,40,40,40,41),
                     irrigation = c(rep(1, 30), rep(2, 30)),
                     radiation = rep(c(1, 2, 3), 20))
plants$irrigation <- as.factor(plants$irrigation)
plants$radiation <- as.factor(plants$radiation)
par(mfrow = c(1, 2)); plot(seeds ~ irrigation * radiation, data = plants)

Figure 4-8 Boxplots for seed numbers (response variable) categorized by irrigation and radiation (explanatory variables)

model <- lm(seeds ~ irrigation * radiation, data = plants)
par(mfrow = c(1, 2)); plot(model, which = c(1, 2))

Figure 4-9 Distribution and QQ plot for residuals of the two-way ANOVA

anova(model)

Analysis of Variance Table

Response: seeds
                     Df Sum Sq Mean Sq F value   Pr(>F)
irrigation            1  0.067  0.0667  0.0747 0.785671
radiation             2  2.233  1.1167  1.2510 0.294370
irrigation:radiation  2 11.433  5.7167  6.4046 0.003192 **
Residuals            54 48.200  0.8926
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This analysis shows that the effect of irrigation depends on that of radiation, as the interaction term is significant. To find out more, this analysis should be followed by analyses within either the irrigation or the radiation treatments (split the dataset), as sketched below.
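One way to do that follow-up, as a minimal sketch using subset() on the plants data frame defined above (the interpretation would of course depend on the actual output):

irrig1 <- subset(plants, irrigation == "1")
irrig2 <- subset(plants, irrigation == "2")
anova(lm(seeds ~ radiation, data = irrig1))  # radiation effect within level 1
anova(lm(seeds ~ radiation, data = irrig2))  # radiation effect within level 2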
Linear regression
In an experiment testing whether the duration of male courtship behavior depends on female size, 16 pairs of earwigs were observed.

sex <- data.frame(pair = 1:16,
                  fem_size = c(58.84, 60.37, 57, 59.86, 61.42, 60.34, 60.1, 59.63,
                               58.06, 61, 58.61, 60.94, 60.83, 57.7, 60, 59.09),
                  male_court_hrs = c(8.37, 9.88, 10.12, 8.39, 9.93, 9.69, 8.68, 11.74,
                                     11.07, 8.69, 10.53, 10.38, 10.12, 11.14, 8.6, 11.26))
plot(male_court_hrs ~ fem_size, data = sex)

Figure 4-10 Plotting duration of male courtship against female size

model <- lm(male_court_hrs ~ fem_size, data = sex)
par(mfrow = c(1, 2)); plot(model, which = c(1, 2))

Figure 4-11 Distribution and QQ plot for residuals of the linear regression

summary(model)

Call:
lm(formula = male_court_hrs ~ fem_size, data = sex)

Residuals:
    Min      1Q  Median      3Q     Max
-1.7679 -0.8837  0.2574  0.6771  1.8334

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  27.3698    12.8304   2.133   0.0511 .
fem_size     -0.2929     0.2152  -1.361   0.1950
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.071 on 14 degrees of freedom
Multiple R-squared: 0.1168, Adjusted R-squared: 0.05376
F-statistic: 1.852 on 1 and 14 DF, p-value: 0.195

This analysis suggests that female size does not affect the duration of male courtship behavior.

4.7 Summary
Linear models specify a linear relationship between a response variable and one or more explanatory variables. When working with linear models, the workflow involves exploring the data with a plot, defining the model, analyzing the residuals and, depending on the outcome of this step, either obtaining and interpreting the results or transforming variables and/or trying another model. The assumptions for applying linear models are that the experimental units are independent and sampled at random, that the residuals have constant variance across the values of the explanatory variables, and that the residuals are normally distributed with a mean of zero. Interactions are cases in which the effect of an explanatory variable on a response variable depends on another explanatory variable. Use the command model <- lm() to define the model, the command plot(model) to check the assumptions, and the commands summary(model) and anova(model) to retrieve the model estimates and an ANOVA table of the analysis. When defining a model with two or more explanatory variables, use * to include both direct and interaction effects and + to include only direct (additive) effects, as illustrated below.
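A minimal sketch of this formula syntax, using hypothetical variables y, a and b in a data frame d (illustration only, not from the examples above):

lm(y ~ a + b, data = d)        # direct (additive) effects of a and b
lm(y ~ a * b, data = d)        # direct effects plus the a:b interaction
lm(y ~ a + b + a:b, data = d)  # equivalent expansion of a * b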
4.8 Exercises
Exercise 4-A
R has several internal datasets that do not require being uploaded. What sort of linear model is conducted by each line of code below? Choose between one-way ANOVA, two-way ANOVA, linear regression, multiple regression, and ANCOVA. In order to decide, use the command str() to understand the structure of each internal dataset (e.g. str(ChickWeight)). You can also plot the data and run the code.

A. lm(weight ~ Time, ChickWeight)
B. lm(weight ~ Time * Chick, ChickWeight)
C. lm(GNP.deflator ~ Unemployed + Population, longley)
D. lm(weight ~ Diet + Chick, ChickWeight)
E. lm(weight ~ Diet, ChickWeight)

Exercise 4-B
Seven economic indicators were collected during 16 years (1947 to 1962) and are available in the data frame called "longley". GNP.deflator is the GNP implicit price deflator, a measure of inflation. GNP is the gross national product, Unemployed/Employed is the number of unemployed/employed people, Armed.Forces is the number of people in the armed forces and Population is the 'non-institutionalized' population over 14 years of age. Answer the following questions:

A. Does unemployment imply a reduction in the gross national product? Does employment increase the gross national product? (a) YES / YES (b) NO / YES (c) NO / NO (d) YES / NO
B. What percentage of the variation in the gross national product is explained by the employment rate? (Hint: refer to the concept of percentage of explained variance covered here)
C. How much (in euros) does the gross national product increase for every person that is newly employed? (Hint: refer to the concept of magnitude of effect and slope covered here)

Exercise 4-C
Are the following models adequate in terms of the distribution of their residuals? (They use the "longley" R internal dataset.)

A. lm(GNP.deflator ~ Unemployed + Armed.Forces + Population, longley)
B. lm(GNP ~ Employed, longley)

Exercise 4-D
Forests are sometimes fertilized with nitrogen compounds to increase their growth. However, this could lead to a change in herbivory. 42 three-year-old birch trees were used in a greenhouse experiment. They were divided into six groups of seven trees each. Trees were subjected to two fertilization treatments (yes and no) and three herbivory treatments (none, low, high), resulting in six combinations of treatments. One tree died, so one treatment combination is missing a replicate. The data is available in the attachments as 4-D. How do trees react to fertilization and herbivory? Are these effects independent? Does fertilization increase herbivory risk? Hint: conduct the appropriate statistical test, produce an appropriate graph and interpret the results.

Exercise 4-E
In this exercise you are going to use linear models in order to perform a selection analysis in the orchid Gymnadenia conopsea. A selection analysis is actually “simply” a multiple regression analysis in which the response variable is the fitness and the explanatory variables are the different phenotypic quantitative traits. The idea is that if there is a relationship between a phenotypic trait and fitness, then some values of the trait are favored, i.e. the trait is under selection. The strength of selection is represented by the slope of the regression line between fitness and the phenotypic trait (see next figure). The data and full exercise are available in the attachments as 4Ea and 4Eb. Hint: you will have to practice your skills in handling a data set (checking for outliers, subsetting, computing and adding new variables), getting and interpreting descriptive statistics (means, correlations), graphics (exploration, bar plots with error bars) and linear models (multiple regression), making variable transformations and standardization of variables, and extracting values from a statistical output for plotting.

Solutions
4-A
A. Linear regression, B. ANCOVA, C. Multiple regression, D. Two-way ANOVA, E. One-way ANOVA.

4-B
A. (b):
summary(lm(GNP ~ Unemployed, longley))
summary(lm(GNP ~ Unemployed, longley))$coefficients["Unemployed", ]
summary(lm(GNP ~ Employed, longley))
summary(lm(GNP ~ Employed, longley))$coefficients["Employed", ]
B. 97 %:
summary(lm(GNP ~ Employed, longley))$r.squared
C. 28 EUR:
summary(lm(GNP ~ Employed, longley))
4-C
A. T: plot(lm(GNP.deflator ~ Unemployed + Armed.Forces + Population, longley))
B. T: plot(lm(GNP ~ Employed, longley))

4-D
# Load data, check object and data
fertilization <- read.table(file.choose(), sep = ";", header = T, dec = ",")
str(fertilization)
# You need to make sure that the response is numeric
fertilization$growth <- as.numeric(fertilization$growth)
interaction.plot(x.factor = fertilization$fertilization,
                 trace.factor = fertilization$herbivory,
                 response = fertilization$growth, cex.axis = 1)
# Analysis: two-way ANOVA
lm.fert.lab <- lm(growth ~ fertilization * herbivory, data = fertilization)
# Testing the assumptions: residual analysis
par(mfrow = c(2, 2)); plot(lm.fert.lab)
# The assumptions of the analysis are met: residuals are normally
# distributed and the model fit is satisfactory.
# Result output and graph
anova(lm.fert.lab)
# Means, standard errors and barplot
means <- tapply(fertilization$growth,
                list(fertilization$fertilization, fertilization$herbivory), mean)
se <- tapply(fertilization$growth,
             list(fertilization$fertilization, fertilization$herbivory),
             function(x) sd(x) / sqrt(length(x)))
mp <- barplot(means, beside = T, ylim = c(0, 55), las = 1,
              xlab = "Herbivory", ylab = "Growth (cm)", col = c(0, 8))
legend(5, 55, legend = c("No", "Yes"), fill = c(0, 8), bty = "n",
       title = "Fertilization", horiz = T)
arrows(mp, means - se, mp, means + se, angle = 90, length = 0.05,
       code = 3, col = "black", lty = 1, lwd = 1)

Interpretation: the interaction between herbivory and fertilization has a significant effect on growth, indicating that the effect of fertilization depends on whether or not there is herbivory; vice versa, the effect of herbivory depends on the fertilization. It is difficult to interpret the main effects (i.e., fertilization and herbivory) when the interaction is significant. If this is desired, the data need to be split into the different levels of one of the factors and reanalyzed with one-way ANOVAs. From the graph of means it is clear that fertilization generally increases growth. The higher the herbivory intensity, the more pronounced is the growth response to fertilization.

4-E
Get the solution with detailed explanations from the attachments as 4Ec.

5 Basic graphs with R
5.1 Bar-plots
Goal
In this section, you will learn how to script a grouped barplot of means with standard errors indicated as T-bars (Figure 5-1) and how to adjust the layout of this type of graph.

Figure 5-1 Example of a grouped barplot of means with standard errors (as T-bars)

How to do it
We are going to use the internal dataset ToothGrowth (available with the R installation), which contains measurements of tooth length in guinea pigs that received three levels of vitamin C and two supplement types (Figure 5-1). To explore this dataset you can use ?ToothGrowth, str(ToothGrowth) and summary(ToothGrowth). We want to produce a barplot of the mean tooth length for all six combinations of the two factors (supplement type: 2 levels, dose: 3 levels). We first need to calculate the mean tooth length for each of the combinations. For this, we use the command tapply(). tapply() can return a table with mean tooth lengths for all six combinations, and this table will be the input for the barplot. Importantly, tapply() will create a matrix with two rows and three columns corresponding to the factor levels in the dataset, as you can see below. This structure is needed to produce a grouped barplot.

mean.tg <- tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose),
                  FUN = mean)
mean.tg

     0.5     1     2
OJ 13.23 22.70 26.06
VC  7.98 16.77 26.14
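If you want to convince yourself of this structure, the object can be inspected directly; a small optional check, not part of the original workflow:

class(mean.tg)     # "matrix" (plus "array" in recent R versions)
dim(mean.tg)       # 2 3
dimnames(mean.tg)  # the levels of supp and dose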
You are now ready to use the command barplot(). The first argument is the data to use, in this case our mean.tg matrix. The argument beside = T indicates that the bars should be plotted beside each other instead of stacked on top of each other.

barplot(mean.tg, beside = T)

We can customize the layout of the barplot using further arguments; otherwise default options will be used. The labels of the axes can be specified with the arguments xlab and ylab, and the labels below each group of bars are controlled with the argument names. The font size of these labels can be changed with cex.lab and cex.names. These arguments are set to 1 by default and changes are relative to this default; for example, cex.lab = 2 will double the font size. The limits of the y-axis are specified with ylim; here we use zero and the maximum of the dataset. The orientation of the axis labels can be altered with the argument las, which has four options (0, 1, 2, 3). Here, las = 1 produces horizontal axis labels. The colors of the bars are determined by col, in our example by a vector of length two for the two groups, specifying 0 (white) and 8 (grey). Colors can be specified either with numbers (1 to 8) or with color names. To get an overview of all available color names, type colors(). You can further explore colors at http://research.stowersinstitute.org/efg/R/Color/Chart/.

barplot(mean.tg, beside = T, xlab = "Dose (mg)", ylab = "Tooth length (cm)",
        names = c("0.5", "1.0", "2.0"), cex.lab = 1.3, cex.names = 1.2,
        col = c(0, 8), ylim = c(0, max(ToothGrowth$len)), las = 1)

The next step is to add error bars to the barplot. There is no standard command to add error bars; instead, we have to draw them ourselves with the command arrows(). First, we need the standard error of the mean for all six groups. We do this in the same way as calculating the means: we use tapply(), but ask for the standard error instead. Besides the length of the error bars, we also need the horizontal locations of the bars, so that the error bars end up in the middle of the bars. These midpoints, in the same matrix format as the means above, can be extracted from a basic barplot(). We assign the barplot() command to an object, here named midpoints, and use plot = F to suppress the plotting, as we want to keep the improved barplot we produced above.

sem.tg <- tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose),
                 FUN = function(x) sd(x) / sqrt(length(x)))
midpoints <- barplot(mean.tg, beside = T, plot = F)

Now we are ready to draw the error bars using the command arrows(). Within this command we first state the position of the error bars by two sets of x and y coordinates corresponding to the start and end of the error bars. The coordinates are given in our matrix format and correspond to the six bars on the graph. The starting coordinate set, midpoints, mean.tg - sem.tg, identifies the six midpoints as x coordinates and the means minus the standard errors as the six y coordinates. Likewise, midpoints, mean.tg + sem.tg is used for the end of the error bars. We further use the arguments code = 3 and angle = 90 so that we get bars with T's on both ends rather than arrowheads. The arguments length and lwd set the size of the T's and the line width of the entire error bars.

arrows(midpoints, mean.tg - sem.tg, midpoints, mean.tg + sem.tg,
       angle = 90, length = 0.1, code = 3, lwd = 1.5)
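The anonymous function passed to tapply() above can equally well be written once as a named helper, which keeps the call shorter if you compute standard errors repeatedly; a stylistic alternative, not part of the original example:

sem <- function(x) sd(x) / sqrt(length(x))  # same computation as above
sem.tg <- tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose),
                 FUN = sem)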
We can further use the command legend() to add a legend with our groups to the graph. We can specify the place of the legend in the graph either with coordinates (here, 0.75, 30) or with options such as "topright" or "topleft" (see the help for legend()). The fill argument produces boxes with the specified colors next to the legend text. bty determines whether a box is drawn around the legend (default: bty = "o", with box); here, bty = "n" removes the box. The font size in the legend is determined by cex, as explained above.

legend(0.75, 30, legend = c("Orange juice", "Ascorbic acid"),
       fill = c(0, 8), bty = "n", cex = 1.1)

There are many more details of the plot that can be controlled and changed. For an overview of the graphical parameters that can be changed by arguments, use ?par.

Summary
A barplot can be made with the command barplot(), a higher-level plotting command that creates a new graph. Mean values to be plotted should be calculated first with the command tapply(). Error bars, calculated with tapply(), and a legend can be added with the lower-level plotting commands arrows() and legend(), which add extra features to an existing graph. A large number of graphical parameters can be used to customize plots.

5.2 Grouped scatter plot with regression lines
Goal
In this section, you will learn how to script a grouped scatterplot with regression lines (Figure 5-2) and how to adjust the layout of this type of graph.

Figure 5-2 Example of a grouped scatterplot with regression lines

How to do it
To produce a scatterplot, we will use the plot() command. plot() is a higher-level plotting command that creates a new graph. We are going to use part of the internal dataset iris (available with the R installation) as an example (Figure 5-2). iris contains flower measurements of three different Iris species. You can explore the dataset with ?iris, summary(iris) and str(iris). To reduce the dataset to two species and to plot all the datapoints use:

iris.short <- iris[1:100, ]
plot(iris.short$Sepal.Length, iris.short$Sepal.Width)

We can now assign two different plotting symbols for the species by creating a new column in the data frame iris.short, named iris.short$pch, that contains the number of the plotting symbol to be used. There are 26 different plotting symbols, ranging from 0 to 25. Here we use symbol 1 for Iris setosa and symbol 16 for Iris versicolor. You can use the same procedure to assign different colors to the two species (see the sketch below). We can then set the axis labels, range and orientation as well as the font size using xlab, ylab, xlim, ylim, las, cex.axis and cex.lab, as explained above.

iris.short$pch[iris.short$Species == "setosa"] <- 1
iris.short$pch[iris.short$Species == "versicolor"] <- 16
plot(iris.short$Sepal.Length, iris.short$Sepal.Width,
     xlab = "Sepal length (mm)", ylab = "Sepal width (mm)", xlim = c(4, 7.5),
     las = 1, cex.axis = 1.2, cex.lab = 1.3, pch = iris.short$pch)
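A minimal sketch of that color assignment, using a hypothetical col column built the same way as the pch column above (the column name is an illustration, not from the original text):

iris.short$col[iris.short$Species == "setosa"] <- 1      # black
iris.short$col[iris.short$Species == "versicolor"] <- 8  # grey
# Then pass col = iris.short$col (together with pch) to plot().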
The next step is to add a regression line for each species, assuming that sepal length causes changes in sepal width (which may or may not be reasonable). For this, we have to fit the regression models first. Subsequently, we plot lines corresponding to these models with the lower-level plotting command lines(). The line is specified by x and y coordinates, which are both vectors: the x vector contains the sepal lengths and the y vector contains the sepal widths predicted by the model. We increase the line width using lwd = 1.5.

model.seto <- lm(Sepal.Width ~ Sepal.Length,
                 data = iris.short[iris.short$Species == "setosa", ])
lines(iris.short$Sepal.Length[iris.short$Species == "setosa"],
      predict(model.seto), lwd = 1.5)
model.versi <- lm(Sepal.Width ~ Sepal.Length,
                  data = iris.short[iris.short$Species == "versicolor", ])
lines(iris.short$Sepal.Length[iris.short$Species == "versicolor"],
      predict(model.versi), lwd = 1.5)

We should also add a legend to the figure. This is similar to above, and we can produce species names in italics using the command expression(italic()) for each legend entry.

legend("topright",
       legend = c(expression(italic("Iris setosa")),
                  expression(italic("Iris versicolor"))),
       pch = c(1, 16), cex = 1.1, bty = "n")

There are many more details of the plot that can be customized. An overview of the graphical parameters that can be changed can be viewed using ?par.

Summary
Scatter plots can be created with the higher-level plotting command plot(). A new vector in the data frame can be used to specify the plotting symbol and color. The lower-level plotting command lines() can be used to add regression lines from a linear model.

5.3 Exercises
Exercise 5-A
Please use the dataset below to produce a scatterplot where each point has a different color and symbol.

x <- c(2, 3, 4, 5, 7, 8, 9, 10)
y <- c(10, 14, 14, 17, 18, 22, 23, 26)

Exercise 5-B
Starting from the graph produced by the code below, change the symbols to filled red triangles that are twice as large.

plot(iris$Sepal.Length, iris$Sepal.Width, xlab = "Sepal length (mm)",
     ylab = "Sepal width (mm)")

Exercise 5-C
Please use the internal datafile CO2 and create the barplot below. Hint: use str(CO2) and ?CO2 to find out more about the file; the argument horiz = T to legend() will place the legend entries horizontally.

Solutions
5-A
x <- c(2, 3, 4, 5, 7, 8, 9, 10)
y <- c(10, 14, 14, 17, 18, 22, 23, 26)
plot(x, y, las = 1, cex.lab = 1.5, cex.axis = 1.5, cex = 1.5,
     pch = c(1:8), col = c(1:8))

5-B
plot(iris$Sepal.Length, iris$Sepal.Width, xlab = "Sepal length (mm)",
     ylab = "Sepal width (mm)", pch = 17, col = "red", cex = 2)

5-C
means <- tapply(CO2$uptake, INDEX = list(CO2$Type, CO2$Treatment), FUN = mean)
se <- tapply(CO2$uptake, INDEX = list(CO2$Type, CO2$Treatment),
             FUN = function(x) sd(x)/sqrt(length(x)))
uptake <- barplot(means, beside = T, col = c(0, 2), las = 1, ylim = c(0, 50))
arrows(uptake, means - se, uptake, means + se, angle = 90, length = 0.1, code = 3)
legend(x = "top", fill = c(0, 2), legend = c("Quebec", "Mississippi"),
       bty = "n", horiz = T)

6 Logistic Regression
6.1 Goals
In this section you will learn
how and when logistic regression analyses should be used
how to create and interpret logistic regression models in R
how to create simple plots of logistic regression models

6.2 How to do it
Background
Logistic regression models are used in situations where we want to know how a binary response variable is affected by one or more continuous variables. Common biological examples of this include assessing the probability of survival, the probability of reproducing, or the probability of an individual possessing a certain allele. On the natural scale, logistic regression is non-linear and cannot be analyzed using linear models. However, this problem is circumvented by using the logit transformation to linearize the model:

logit = log(p / (1 - p))
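As a quick numeric illustration of this transform (an addition, not part of the original text), the logit maps probabilities from the interval (0, 1) onto the whole real line:

p <- c(0.1, 0.5, 0.9)
log(p / (1 - p))   # -2.197  0.000  2.197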
Creating and analyzing the model
In R, logistic regression models are created using the generalized linear model function glm(). This takes the general form of:

Model <- glm(probability_data ~ continuous_predictor, family = "binomial")

The argument family = "binomial" tells the function to create a binomial logistic regression model. As with the lm() function, we can use summary() to obtain summary data of the model.

To demonstrate this, we will use survival data collected by Quintana-Ascencio et al. on Hypericum cumulicola, a plant endemic to the southeastern United States. This dataset, which can be obtained in the attachments as Hypericum, contains both a binary response variable (survival) as well as continuous predictor variables (log-transformed number of fruits produced and height in the previous year). First we create the generalized linear model and use the summary() function to obtain a summary:

LModel <- glm(survival ~ height, family = binomial, data = Hypericum)
summary(LModel)

Call:
glm(formula = survival ~ height, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4470  -1.2380   0.6166   0.8544   1.2199

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   3.9931     0.4201   9.505  < 2e-16 ***
height       -2.1885     0.2912  -7.515 5.71e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1018.68 on 878 degrees of freedom
Residual deviance: 947.12 on 877 degrees of freedom
AIC: 951.12

Number of Fisher Scoring iterations: 4

As can be seen from this summary, height has a significant negative effect on survival. One thing that should be noted is that the intercept in this case is far larger than 1. This is because summary() presents values after the logit transformation, so values are no longer bound between 0 and 1.

There are several ways to determine the goodness of fit and significance of the overall model rather than of the individual parameters. One way is to use the G2 statistic (similar to a chi-squared statistic) to compare the null and residual deviance. This technique compares the unexplained variation in a null model (that is, one that has no predictive value) to the unexplained variation in the model being tested (i.e. the residual deviance). A greater difference between null and residual deviance indicates lower deviance in the model and a better model fit. This difference is then tested against the chi-squared distribution to determine a p-value.

G_sq <- LModel$null.deviance - LModel$deviance
pchisq(G_sq, 1, lower.tail = F)

This technique produces a p-value of approximately 2.7e-17, so the overall model in this case is highly significant.

Plotting logistic regression models
Logistic regressions can be plotted either with survival presented on the linearized logit scale or on the natural logistic curve. The plotted logit can be useful for diagnostic purposes to determine the quality of fit of the model. However, the logistic curve is often more intuitive to present as a result. In order to plot on the logit scale, you must first define a sequence using the seq() function.

Sequence <- seq(0, 4, 0.1)

The first and second numbers define the minimum and maximum values of the sequence and the third value specifies the step between consecutive values.
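To see what this produces, a small optional check (not in the original text):

length(Sequence)  # 41 values in total
head(Sequence)    # 0.0 0.1 0.2 0.3 0.4 0.5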
Next, we generate predicted values from the model using the predict() function.

PLMlogit <- predict(LModel, list(height = Sequence))
plot(Sequence, PLMlogit, type = "l", xlab = "Log(height)", ylab = "Logit Survival")

The resulting figure should look like this (Figure 6-1):

Figure 6-1 Relationship between height and survival on the logit scale

In order to plot a logistic curve we again use the predict() and plot() functions, but we add an additional argument to predict(): type = "response".

PLMcurve <- predict(LModel, list(height = Sequence), type = "response")
plot(Sequence, PLMcurve, type = "l", xlab = "Log(height)", ylab = "Survival")

This argument tells the predict() function to output the response variable (survival) on its original scale rather than on the transformed scale. The resulting figure should look like this (Figure 6-2):

Figure 6-2 Relationship between log(height) and survival on the natural scale.

6.3 Summary
Logistic regressions are used when you have a probability as a response variable and a continuous predictor variable. Logistic regressions are analyzed as generalized linear models, glm(), through the use of the logit transformation. Logistic regressions can be plotted either as a logistic curve on the natural scale or as a linear function on the logit scale.

6.4 Exercises
6-A
Repeat the example above using fruits as the predictor variable rather than height.
A. Compare the overall significances of both models. Which predictor variable is a better fit for the data? Why?
B. Compare the graphs of the two predictor variables. How are they similar? How are they different?

6-B (Yes/No) Which of the following questions could be answered using logistic regression?
A. Is the probability of getting a head in a coin flip affected by wind speed?
B. Is there a correlation between rate of coffee consumption and hours worked?
C. Is the probability of successfully building a nest related to body size in doves?
D. Are yellow or red crabs more likely to occupy holes at the beach?
E. Does advertising spending affect the proportion of people who receive flu shots?
F. How does increasing the concentration of a drug affect mortality rates in mice?
G. Do different species of bears produce differing numbers of offspring?
H. Is the ratio of body length to width similar throughout the family Mustelidae?

6-C
You created a logistic model which has a null deviance of 432 and a residual deviance of 425. Is the overall model significant?

Solutions:
6-A
a) Fruits is a better fit because the difference between null and residual deviance is larger.
b) Both effects are negative and both logit plots are linear; however, height has a stronger negative effect and the shapes of the logistic plots are different.
6-B Y; N; Y; N; Y; Y; N; N
6-C Yes, P = 0.008

7 R programming structures
7.1 Flow control
Goal
As a full-fledged programming language, R comes with various looping and conditional constructs. In this section, we take a first look at iteration and conditionals. R provides three basic C-style constructs for writing explicit loops: for(), while() and repeat(). Conditional evaluation can be carried out with the functions if() and ifelse().

How to do it
The syntax of the looping functions is listed below.

for (VAR in SEQ) {EXPR}
while (COND) {EXPR}
repeat {EXPR}

VAR is the abbreviation of variable. SEQ is the abbreviation of sequence, which is equivalent to a vector (including a list) in R. COND is the abbreviation of conditional, which can evaluate to TRUE or FALSE. EXPR is the abbreviation of expression in the formal sense.
The first one, for(), iterates through each component VAR of the sequence SEQ: in the first iteration VAR = SEQ[1], in the second iteration VAR = SEQ[2], and so on. The following code uses the for() structure to print the square of each component of a vector.

for ( i in 1:5 ) {
  print( paste('square of', i, '=', i^2) )
}

[1] "square of 1 = 1"
[1] "square of 2 = 4"
[1] "square of 3 = 9"
[1] "square of 4 = 16"
[1] "square of 5 = 25"

The other two loop structures, while() and repeat(), rely on a change of state in an expression, or on the use of break to leave the loop. The function break halts the execution of the innermost loop and passes control to the first statement outside it. Similarly, next stops the processing of the current iteration and causes the execution of the next iteration. When using repeat() or while(), special attention should be paid to averting an infinite loop, that is, a loop which iterates without end. Below is an example showing two different ways of accomplishing the same job.

i_w <- 1
while ( i_w <= 10 ) {
  i_w <- i_w + 1
}
i_w
[1] 11

i_r <- 1
repeat {
  i_r <- i_r + 1
  if (i_r > 10) {break}
}
i_r
[1] 11

Note that excessive use of loops can make your R code rather inefficient. Although loops in R are very straightforward and convenient, you should sometimes avoid them due to their high cost in computational time, especially when working on long vectors. A better alternative is to use vectorized functions, for example which(), ifelse(), all(), etc. In the case of matrix computations, you can use rowSums(), colSums(), and so on.

Now it is time to move on to conditionals. The syntax of the if() statement looks like this:

if ( COND ) {EXPR1} else {EXPR2}

The conditional COND is evaluated first; if it is TRUE, then expression EXPR1 is executed, and if COND evaluates to FALSE, then EXPR2 is executed. In particular, when COND evaluates to the numeric value zero, R treats it as FALSE, and when COND evaluates to any non-zero number, it is treated as TRUE. We can also easily extend or shrink if() structures by adding or removing one or several else clauses, as they are optional. But note that in the case of extension, the order of the conditional clauses is vital, because once a condition is satisfied, R ignores the rest of the whole if-else structure and jumps out of it. Here is a simple example:

x <- 3
if ( ! is.numeric(x) ) {
  stop( paste(x, 'is not numeric') )
} else if ( x%%2 == 1 ) {
  print( paste(x, 'is an odd number') )
} else if ( x == round(x) ) {
  print( paste(x, 'is an integer') )
} else {
  print( paste(x, 'is a number') )
}
[1] "3 is an odd number"

You can assign other values to x, for example x <- 1.3 or x <- 'abc', and then copy and execute the if-else structure to check the result.
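ifelse(), mentioned at the start of this section but not demonstrated above, is the vectorized relative of if/else: it evaluates its condition element-wise over a whole vector. A minimal sketch:

x <- c(-2, 0, 5)
ifelse(x > 0, 'positive', 'not positive')
[1] "not positive" "not positive" "positive"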
7.2 Write your own function
Goal
In this section, you will learn how to develop your own R functions.

How to do it
R provides a convenient way to define custom functions and make good use of them. All functions read and parse input, referred to as arguments, and then return output. An R function is actually a first-class object in R. It can be created using the command function(), which is followed by a comma-separated list of formal arguments enclosed in a pair of parentheses, and then the expression that forms the body of the function. If the body consists of only one statement, it can be entered directly; when there are multiple expressions, they have to be enclosed in braces {}.

The value returned by an R function can either be yielded by the built-in function return() or simply be the value of the last evaluated expression. Here is an example of a function that returns x to the power of n (x^n):

expon <- function(x, n) {
  if ( x%%1 != 0 ) {
    stop('x must be an integer!')
  } else if ( n == 0 ) {
    return(1)
  } else {
    prod <- x
    while ( n > 1 ) {
      prod <- x * prod
      n <- n - 1
    }
    return(prod)
  } # end of else
} # end of the function

Now let it calculate 3 to the power of 4.

expon(3, 4)
[1] 81

The formal arguments and the body of the function expon() can later be accessed via the R functions formals() and body(), as follows:

formals(expon)
$x

$n

body(expon)
{
    if (x%%1 != 0) {
        stop("x must be an integer!")
    }
    (…)

Another point worth mentioning is that you can print out a built-in R function when you are not sure what it does. By looking at the code, you may get a better idea. For instance, if you are curious about the details of the function cat(), you can glance over its code by typing its name without parentheses.

cat
function (..., file = "", sep = " ", fill = FALSE, labels = NULL,
    append = FALSE)
{
    if (is.character(file))
        if (file == "")
            file <- stdout()
    (…)

It is very handy to view a lengthy function via the command page().

page(cat)

However, as most fundamental functions in R are written directly in C, they are not viewable in this manner. There is a lot more to writing custom functions in R than what is shown here; nonetheless, you won't need it unless you move to the advanced level.

7.3 Summary
R programs are made up of expressions. Basic control-flow constructs are needed to control compound expressions. R provides for(), while() and repeat() to write loops. If-else statements are also available for choosing between two or several expressions. Sometimes, however, loops should be avoided due to their low efficiency. One prominent advantage of R over other statistical languages is its extensibility: you can easily add new functionality to the system by defining new functions.
8 Appendix: Code References
The following appendix provides a list and general descriptions of commands used in each section of this document. Clicking on the names links to the CRAN reference pages.

Getting Started with R
Command      Description                                             Example
rm           removes an object                                       rm(x)
c            creates a vector                                        a <- c(1,2,3,4)
seq          creates a sequence                                      seq(1, 10, by=2)
rep          replicates vector elements                              rep(1:10, 2)
factor       defines a variable as categorical                       factor(x)
list         constructs a list                                       z <- list(x, y)
q            quits the R session                                     q()
data.frame   creates a data frame                                    data.frame(x)
read.table   reads an external file and converts it to a data frame  read.table("example.txt", header=T)
str          displays the structure of an object                     str(x)
?            place in front of a function to get its description     ?lm
subset       creates a subset of a vector or matrix                  subset(data, x>10)
library      loads packages that are already downloaded              library(ggplot2)
attach       attaches a data frame                                   data <- read.table("example.txt", header=T); attach(data)

Basic Statistics with R
Command      Description                                               Example
table        creates a contingency table                               table(x,y)
summary      produces summaries of model objects                       ab <- lm(x~y); summary(ab)
tapply       applies a function across an array of categories          tapply(x, y, mean)
qqnorm       creates a qqnorm plot to visualize deviations from        qqnorm(x)
             normality
t.test       runs t tests (one-sample, paired or two-sample) on data   t.test(y, mu=value); t.test(y1, y2, paired=TRUE); t.test(y ~ group)
wilcox.test  runs non-parametric Wilcoxon tests on data                wilcox.test(y, mu=value); wilcox.test(y1, y2, paired=TRUE); wilcox.test(y ~ group)
cor.test     tests for correlations among variables                    cor.test(x,y)
chisq.test   runs a chi-squared test                                   chisq.test(x,y)
prop.test    tests for equal proportions                               prop.test(x,y)

Linear Models
Command      Description                                               Example
lm           constructs a linear model                                 x <- lm(x~y)
anova        calculates an analysis of variance table for a model      anova(x)

Basic Graphs with R
Command      Description                                               Example
barplot      creates and edits barplots (see link for a list of        barplot(x)
             plotting arguments)
arrows       adds arrows (e.g. error bars) to a plot                   arrows(midpoints, mean.tg-sem.tg, midpoints, mean.tg+sem.tg, angle = 90, length = 0.1, code = 3, lwd = 1.5)
legend       adds a legend to a plot                                   legend(0.75, 30, legend = c("Orange juice", "Ascorbic acid"), fill = c(0,8), bty = "n", cex = 1.1)
plot         creates an x,y scatterplot (see link for a list of        plot(x,y)
             plotting arguments)
lines        adds lines to a plot                                      lines(x,y)
expression   calls a subsettable list (e.g. for italic legend text)    expression(x)
par          sets and edits graphical parameters (see link for a       par(mfrow=c(2,2))
             list of arguments)

Logistic Regression
Command      Description                                               Example
glm          creates a generalized linear model                        a <- glm(y~x, family=gaussian)
predict      generates predictions from the model                      predict(a)