Introduction to R and R-Studio
Math Talk – January 30, 2014
Rachel Saidi
Introduction to R
1. What is R?
R is a statistical software system for data analysis and graphics, developed by Ihaka and Gentleman of the University
of Auckland Department of Statistics (1995). It is considered a dialect of S and S-Plus. R is freely distributed
and has many built-in functions for statistical analysis and excellent graphics.
2. How is R-Studio different from R?
R-Studio is not a separate version of R. It is a more user-friendly interface for working with R, organized into 4 window panels.
3. Where can I get R and R-Studio?
First download R:
http://cran.us.r-project.org/bin/windows
Then you can download R-studio:
http://www.rstudio.com/products/rstudio/download/
You need R before you can use R-Studio.
[Figure: the R-Studio window with its 4 panels]
4. How can I get help when in R-Studio?
Type: help() and put the particular command you are interested in within the parentheses.
Highlight that line and press “Control” + “R” (in current versions of R-Studio, “Control” + “Enter”), and the information will come up in the bottom right panel.
Example:
I would like more information about the command mean
Type: help(mean)
(R is case sensitive, so Help(mean) with a capital H produces an error.)
The resulting information that appears is shown below:
mean {base}
R Documentation
Arithmetic Mean
Description
Generic function for the (trimmed) arithmetic mean.
Usage
mean(x, ...)
## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)
Arguments
x      An R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects.
       Complex vectors are allowed for trim = 0, only.
trim   the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.
       Values of trim outside that range are taken as the nearest endpoint.
na.rm  a logical value indicating whether NA values should be stripped before the computation proceeds.
...    further arguments passed to or from other methods.
Value
If trim is zero (the default), the arithmetic mean of the values in x is computed, as a numeric or complex vector of length one. If x is not logical (coerced to
numeric), numeric (including integer) or complex, NA_real_ is returned, with a warning.
If trim is non-zero, a symmetrically trimmed mean is computed with a fraction of trim observations deleted from each end before the mean is computed.
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
See Also
weighted.mean, mean.POSIXct, colMeans for row and column means.
Examples
x <- c(0:10, 50)
xm <- mean(x)
c(xm, mean(x, trim = 0.10))
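As a small worked illustration of the help page example above (a sketch added for clarity, not part of the original handout): the vector has 12 values, so trim = 0.10 drops floor(12 × 0.10) = 1 value from each end before averaging. The names x_trimmed, sorted and by_hand are introduced here only for illustration.

```r
# Same vector as in the help page example: 0, 1, ..., 10 plus one outlier, 50
x <- c(0:10, 50)

# Ordinary mean: the outlier pulls it upward
xm <- mean(x)                      # 105 / 12 = 8.75

# 10% trimmed mean: drops the smallest and the largest value, then averages
x_trimmed <- mean(x, trim = 0.10)

# The same trimmed mean computed by hand (illustrative helper variables)
sorted  <- sort(x)
by_hand <- mean(sorted[2:(length(sorted) - 1)])   # mean of 1, ..., 10 = 5.5

print(c(xm, x_trimmed, by_hand))

# Other ways to open the same help page:
# ?mean
# help("mean")
```

The trimmed mean (5.5) is much less affected by the single large value than the ordinary mean (8.75).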
5. How do I import data?
It is easy to enter small data sets by typing, but for large data sets it is more convenient to import them
from external sources. Use the read.csv() command for comma-separated (.csv) files, such as those saved from Excel:
a. Choose a name for the data set in R-Studio, a name like “mydata”
b. Use <- following your name for the data set
c. Add the command: read.csv(……)
d. Within the parentheses and in quotes, type the name of the file you are importing. If you want to keep the
first line as the headings for the columns, include header=TRUE after the file name:
mydata <- read.csv("Pollution.csv", header=TRUE)
e. Use the command: attach(mydata)
to be able to refer to the columns of the data set by name.
*** Note: capital letters are read differently than lower case letters in R
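The import steps above can be sketched as follows. This is a minimal, self-contained example under stated assumptions: the real Pollution.csv is not available here, so the code first writes a tiny stand-in file, and its contents and column names (city, pm25) are made up purely for illustration.

```r
# Write a tiny stand-in file so the example runs anywhere.
# The contents and the column names (city, pm25) are hypothetical.
csv_path <- file.path(tempdir(), "Pollution.csv")
writeLines(c("city,pm25", "A,12.5", "B,30.1", "C,18.4"), csv_path)

# Read the file; header = TRUE treats the first line as column names
mydata <- read.csv(csv_path, header = TRUE)

# Inspect what was imported
str(mydata)
head(mydata)

# Refer to a column with $ (an alternative to attach(mydata))
mean(mydata$pm25)
```

Using mydata$pm25 rather than attach() is a design choice in this sketch: it makes explicit which data set a column comes from, which helps once more than one data set is loaded.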
6. Now let’s try to type data directly into R to find a linear regression equation and correlation. For fun, I have
found some silly data with “Spurious (false) Correlation” at www.tylervigen.com . Feel free to browse this site
on your own, but remember….
CORRELATION DOES NOT IMPLY CAUSATION!!!!!
Example:
Per capita consumption of cheese (US)
correlates with
Number of people who died by becoming tangled in their bedsheets

Year    Per capita consumption of cheese (US),    Number of people who died by becoming tangled
        Pounds (USDA)                             in their bedsheets, Deaths (US) (CDC)
2000    29.8                                      327
2001    30.1                                      456
2002    30.5                                      509
2003    30.6                                      497
2004    31.3                                      596
2005    31.7                                      573
2006    32.6                                      661
2007    33.1                                      741
2008    32.7                                      809
2009    32.8                                      717
Create R-Code to make a linear regression and correlation for per capita consumption of cheese in pounds (US)
correlated with numbers of people who died by becoming tangled in their bedsheets:
***** but remember….
CORRELATION DOES NOT IMPLY CAUSATION!!!!!
You can copy from here as we go through the code together, or you can copy the entire code provided at the end of this
document in the appendix
# Clear all
rm (list = ls())
# Define variables
percapcheese = c (29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8)
deathbysheets = c (327, 456, 509, 497, 596, 573, 661, 741, 809, 717)
# Create a histogram
hist (percapcheese, probability = TRUE, breaks=seq(29, 34, 0.5), col="lightblue")
# Create a normal curve on top of the histogram
curve (dnorm (x, mean=mean(percapcheese), sd=sd(percapcheese)), col = "red", add=TRUE)
# Create a scatterplot - start with a new graphics window
# (dev.new() works on any platform; window() is a time-series function and fails here)
dev.new()
plot (percapcheese, deathbysheets, main = "Scatterplot of Death By Sheet Entanglement vs Cheese Consumption")
# Compute statistical properties
summary (percapcheese)
summary (deathbysheets)
> summary (percapcheese)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  29.80   30.52   31.50   31.52   32.68   33.10
> summary (deathbysheets)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  327.0   500.0   584.5   588.6   703.0   809.0
# Perform a simple linear regression (or linear model – lm)
linearfit = lm(deathbysheets ~ percapcheese)
summary (linearfit)
Call:
lm(formula = deathbysheets ~ percapcheese)

Residuals:
    Min      1Q  Median      3Q     Max
-67.011 -33.560  -1.964  31.229  86.903

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2977.35     427.56  -6.964 0.000117 ***
percapcheese   113.13      13.56   8.346 3.22e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 50.06 on 8 degrees of freedom
Multiple R-squared: 0.897, Adjusted R-squared: 0.8841
F-statistic: 69.66 on 1 and 8 DF, p-value: 3.216e-05
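Notice the “Multiple R-squared” value: 0.897. This is the square of the correlation coefficient, not the correlation
coefficient itself. Taking the square root gives r ≈ 0.947 (positive, because the slope is positive), so 0.897 should
not be expected to match the correlation presented on the website.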
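The relationship between R-squared and the correlation coefficient can be checked directly. In a simple linear regression with one predictor, Multiple R-squared equals the square of Pearson's correlation coefficient r. The sketch below reuses the vectors from the handout; the names r and r_squared are introduced here for illustration.

```r
# Data from the handout
percapcheese  <- c(29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8)
deathbysheets <- c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717)

linearfit <- lm(deathbysheets ~ percapcheese)

# Pearson correlation coefficient, computed directly
r <- cor(percapcheese, deathbysheets)

# R-squared as reported by summary(); with one predictor it equals r^2
r_squared <- summary(linearfit)$r.squared

print(round(c(r = r, r_squared = r_squared, r_times_r = r^2), 4))

# Intercept and slope of the fitted line:
# deaths = intercept + slope * cheese
print(coef(linearfit))
```

Here r is about 0.947 and r squared is about 0.897, which matches the Multiple R-squared line in the regression summary.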
7. What should you do next?
Of course, this is just a very brief introduction to R code. Download R and R-Studio on your own computer and try
to explore more. I have posted sample data sets and sites to find more data on my site:
http://cms.montgomerycollege.edu/EDU/Department2.aspx?id=73735
Also, it is very easy to google most commands and procedures for R code.
If you are interested, here are just a few of the many online courses and resources available:
o A pdf manual for an introduction to R
  http://cran.r-project.org/doc/manuals/R-intro.pdf
o Try R Code
  http://tryr.codeschool.com/
o The Johns Hopkins Data Science Specialization on Coursera
  https://www.coursera.org/specialization/jhudatascience/1/courses
o Stanford University StatLearning: Statistical Learning
  https://class.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about
Thank you for attending this presentation!
Appendix - Code
# Clear all
rm (list = ls())
# Define variables
percapcheese = c (29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8)
deathbysheets = c (327, 456, 509, 497, 596, 573, 661, 741, 809, 717)
# Create a histogram
hist (percapcheese, probability = TRUE, breaks=seq(29, 34, 0.5), col="lightblue")
# Create a normal curve on top of the histogram
curve (dnorm (x, mean=mean(percapcheese), sd=sd(percapcheese)), col = "red", add=TRUE)
# Create a scatterplot - start with a new graphics window
# (dev.new() works on any platform; window() is a time-series function and fails here)
dev.new()
plot (percapcheese, deathbysheets, main = "Scatterplot of Death By Sheet Entanglement vs Cheese Consumption")
# Compute statistical properties
summary (percapcheese)
summary (deathbysheets)
# Perform a simple linear regression (or linear model – lm)
linearfit = lm(deathbysheets ~ percapcheese)
summary (linearfit)