Introduction to R and R-Studio
Math Talk – January 30, 2014
Rachel Saidi

Introduction to R

1. What is R?
R is a statistical software system for data analysis and graphics, developed by Ihaka and Gentleman of the University of Auckland Department of Statistics (1995). It is considered a dialect of S and S-Plus. R is freely distributed and has many built-in functions for statistical analysis, along with excellent graphics.

2. How is R-Studio different from R?
R-Studio is a more user-friendly interface for working with R. It does not replace R: it runs R for you inside a single window with 4 panels.

3. Where can I get R and R-Studio?
First download R: http://cran.us.r-project.org/bin/windows
Then you can download R-Studio: http://www.rstudio.com/products/rstudio/download/
You need R before you can use R-Studio.
[Screenshot: R-Studio with its 4 panels]

4. How can I get help when in R-Studio?
Type help() and put the particular command you are interested in within the parentheses. Highlight that line and press "Control" "R", and the information will come up in the bottom right panel.

Example: I would like more information about the command mean.
Type: help(mean)

The resulting information that appears is shown below:

mean {base}                                             R Documentation

Arithmetic Mean

Description
    Generic function for the (trimmed) arithmetic mean.

Usage
    mean(x, ...)

    ## Default S3 method:
    mean(x, trim = 0, na.rm = FALSE, ...)

Arguments
    x       An R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects. Complex vectors are allowed for trim = 0, only.
    trim    the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.
    na.rm   a logical value indicating whether NA values should be stripped before the computation proceeds.
    ...     further arguments passed to or from other methods.
Value
    If trim is zero (the default), the arithmetic mean of the values in x is computed, as a numeric or complex vector of length one. If x is not logical (coerced to numeric), numeric (including integer) or complex, NA_real_ is returned, with a warning.
    If trim is non-zero, a symmetrically trimmed mean is computed with a fraction of trim observations deleted from each end before the mean is computed.

References
    Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

See Also
    weighted.mean, mean.POSIXct, colMeans for row and column means.

Examples
    x <- c(0:10, 50)
    xm <- mean(x)
    c(xm, mean(x, trim = 0.10))

5. How do I import data?
It is easy to enter small data sets by typing, but for large data sets it is more convenient to import them from external sources. Use the read.csv() command for comma-separated (.csv) files, such as a spreadsheet that Excel has saved in CSV format:
a. Choose a name for the data in R-Studio, such as mydata.
b. Follow the name with <-
c. Add the command: read.csv(...)
d. Within the parentheses and in quotes, type the name of the file you are importing. To keep the first line as the headings for the columns, include header=TRUE:
    mydata <- read.csv("Pollution.csv", header=TRUE)
e. Use the command attach(mydata) to be able to refer to the columns of the data set by name.
*** Note: R is case sensitive, so capital letters are read differently than lowercase letters.

6. Now let's try to type data directly into R to find a linear regression equation and correlation.
For fun, I have found some silly data with "Spurious (false) Correlation" at www.tylervigen.com. Feel free to browse this site on your own, but remember: CORRELATION DOES NOT IMPLY CAUSATION!
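Before typing data in by hand, the import steps from item 5 can be tried end to end without any external file. The following is only a minimal sketch: it first writes a tiny stand-in CSV to a temporary folder and then reads it back. The columns City and PM10, their values, and the temporary file location are invented for illustration and are not the contents of the real Pollution.csv.

```r
# Hypothetical stand-in for Pollution.csv, written to a temporary folder
# so that the example runs on any computer.
csv_path <- file.path(tempdir(), "Pollution.csv")

# Invented example data (NOT the real pollution data set).
example_data <- data.frame(City = c("A", "B", "C"),
                           PM10 = c(21.5, 34.0, 18.2))
write.csv(example_data, csv_path, row.names = FALSE)

# Steps a-d from item 5: name the data, use <-, call read.csv() with the
# file name in quotes, and keep the first line as column headings.
mydata <- read.csv(csv_path, header = TRUE)

# Look at the structure of the imported data and use one column.
str(mydata)
mean(mydata$PM10)
```

The sketch refers to the column as mydata$PM10 instead of calling attach(mydata); either approach works, but the $ form makes it explicit which data set a column comes from. Because R is case sensitive, mydata$pm10 would not find the column.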
Example: Per capita consumption of cheese (US) correlates with number of people who died by becoming tangled in their bedsheets

    Year    Per capita consumption of cheese (US)    People who died by becoming tangled in their bedsheets
            Pounds (USDA)                            Deaths (US) (CDC)
    2000    29.8                                     327
    2001    30.1                                     456
    2002    30.5                                     509
    2003    30.6                                     497
    2004    31.3                                     596
    2005    31.7                                     573
    2006    32.6                                     661
    2007    33.1                                     741
    2008    32.7                                     809
    2009    32.8                                     717

Create R code to make a linear regression and correlation for per capita consumption of cheese in pounds (US) and the number of people who died by becoming tangled in their bedsheets.
***** But remember: CORRELATION DOES NOT IMPLY CAUSATION!

You can copy from here as we go through the code together, or you can copy the entire code provided in the appendix at the end of this document.

    # Clear all
    rm(list = ls())

    # Define variables
    percapcheese = c(29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8)
    deathbysheets = c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717)

    # Create a histogram
    hist(percapcheese, probability = TRUE, breaks = seq(29, 34, 0.5), col = "lightblue")

    # Create a normal curve on top of the histogram
    curve(dnorm(x, mean = mean(percapcheese), sd = sd(percapcheese)), col = "red", add = TRUE)

    # Create a scatterplot - start with a new plot window
    # (dev.new() works on any platform; on Windows, windows() also works)
    dev.new()
    plot(percapcheese, deathbysheets, main = "Scatterplot of Death By Sheet Entanglement vs Cheese Consumption")

    # Compute statistical properties
    summary(percapcheese)
    summary(deathbysheets)

The output appears as:

    > summary(percapcheese)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      29.80   30.52   31.50   31.52   32.68   33.10
    > summary(deathbysheets)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      327.0   500.0   584.5   588.6   703.0   809.0

    # Perform a simple linear regression (or linear model - lm)
    linearfit = lm(deathbysheets ~ percapcheese)
    summary(linearfit)

The output appears as:

    Call:
    lm(formula = deathbysheets ~ percapcheese)

    Residuals:
        Min      1Q  Median      3Q     Max
    -67.011 -33.560  -1.964  31.229  86.903

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)  -2977.35     427.56  -6.964 0.000117 ***
    percapcheese   113.13      13.56   8.346 3.22e-05 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 50.06 on 8 degrees of freedom
    Multiple R-squared:  0.897,  Adjusted R-squared:  0.8841
    F-statistic: 69.66 on 1 and 8 DF,  p-value: 3.216e-05

From the coefficients, the fitted regression equation is deathbysheets = -2977.35 + 113.13 × percapcheese.

Notice that the "Multiple R-squared" value, 0.897, is the square of the correlation coefficient, not the correlation coefficient itself. Taking the square root (positive, because the slope is positive) gives a correlation coefficient of about 0.947. This is why 0.897 differs from the correlation presented on the website.

7. What should you do next?
Of course, this is just a very brief introduction to R code. Download R and R-Studio on your own computer and try to explore more. I have posted sample data sets and sites to find more data on my site: http://cms.montgomerycollege.edu/EDU/Department2.aspx?id=73735
It is also very easy to google most commands and procedures for R code. If you are interested, here are just a few of the many online courses and resources available:
o A pdf manual for an introduction to R
  http://cran.r-project.org/doc/manuals/R-intro.pdf
o Try R Code
  http://tryr.codeschool.com/
o The Johns Hopkins Data Science Specialization on Coursera
  https://www.coursera.org/specialization/jhudatascience/1/courses
o Stanford University StatLearning: Statistical Learning
  https://class.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about

Thank you for attending this presentation!
Appendix - Code

# Clear all
rm(list = ls())

# Define variables
percapcheese = c(29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8)
deathbysheets = c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717)

# Create a histogram
hist(percapcheese, probability = TRUE, breaks = seq(29, 34, 0.5), col = "lightblue")

# Create a normal curve on top of the histogram
curve(dnorm(x, mean = mean(percapcheese), sd = sd(percapcheese)), col = "red", add = TRUE)

# Create a scatterplot - start with a new plot window
# (dev.new() works on any platform; on Windows, windows() also works)
dev.new()
plot(percapcheese, deathbysheets, main = "Scatterplot of Death By Sheet Entanglement vs Cheese Consumption")

# Compute statistical properties
summary(percapcheese)
summary(deathbysheets)

# Perform a simple linear regression (or linear model - lm)
linearfit = lm(deathbysheets ~ percapcheese)
summary(linearfit)
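As a supplement to the appendix code, the correlation coefficient can be computed directly with cor() and compared with the Multiple R-squared that summary() reports for the linear model. This is a minimal sketch using the same data; the variable names r and r_squared are introduced here for illustration and are not part of the original handout.

```r
# Same data as in the handout.
percapcheese  <- c(29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8)
deathbysheets <- c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717)

# Fit the same simple linear regression as in the appendix.
linearfit <- lm(deathbysheets ~ percapcheese)

# Pearson correlation coefficient between the two variables.
r <- cor(percapcheese, deathbysheets)

# Multiple R-squared, extracted from the model summary.
r_squared <- summary(linearfit)$r.squared

# For a regression with a single predictor, the square of the
# correlation coefficient equals the Multiple R-squared.
print(round(c(r = r, r_squared_from_cor = r^2, r_squared_from_lm = r_squared), 4))
```

Squaring r reproduces the Multiple R-squared from the regression output, which shows that the two numbers measure the same relationship on different scales.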