R Lab #2: Measures of Spread
STAT 250
01/24/2017
How to Get Graphs for Homework
I forgot to show you how to copy your graphs for the homework assignment, so here is how to do it.
bullhead <- read.csv("http://tiny.cc/pubh2w")
hist(bullhead$length,main="Bullhead Length",xlab="mm")
In the lower right window of RStudio, the Plots tab has an “Export” option. You can copy the plot to the
clipboard and then paste it into a Word document. You can put multiple graphs on one page.
Measures of Spread
The standard deviation, variance, and quartiles have built-in commands in R. The coefficient of variation
needs to be calculated, but is straightforward to find.
bullhead <- read.csv("http://tiny.cc/pubh2w")
sd(bullhead$length)
## [1] 24.51624
var(bullhead$length)
## [1] 601.0458
quantile(bullhead$length,c(.25,.75))
## 25% 75%
## 313 345
summary(bullhead$length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   263.0   313.0   331.0   329.6   345.0   405.0
So, for bullhead length, the data appear to be unimodal and symmetric. Note that the mean (329.6 mm) and the
median (331.0 mm) are nearly identical, indicating a lack of skewness. The middle 50% of lengths are between
313 mm and 345 mm.
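To see what sd and var are computing, the following sketch works through the sample variance by hand on a small vector (the seven values used for the quartile check later in this lab) and compares the result with the built-in commands. The names var_by_hand and sd_by_hand are only illustrative.

```r
# Seven values borrowed from the hand-calculation example in this lab
x <- c(4.58, 3.80, 4.01, 4.05, 4.27, 4.35, 4.21)

n <- length(x)
# Sample variance: sum of squared deviations from the mean, divided by n - 1
var_by_hand <- sum((x - mean(x))^2) / (n - 1)
# The standard deviation is the square root of the variance
sd_by_hand <- sqrt(var_by_hand)

# Both comparisons should return TRUE
all.equal(var_by_hand, var(x))
all.equal(sd_by_hand, sd(x))
```

The same relationship shows up in the bullhead output: 24.51624 squared is about 601.05, the variance reported by var.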
A word about the quantile command. The quantile command finds any percentile: the xth percentile is the
value below which x% of the data fall. Normally, we use it just to find Q1 (the 25th percentile)
and Q3 (the 75th percentile), but it can find others.
quantile(bullhead$length,c(.05)) # Lower 5% of fish lengths
##    5%
## 286.2
quantile(bullhead$length,c(.90)) # Upper 10% of fish lengths
## 90%
## 360
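As a quick check on what a quantile means, here is a small sketch using made-up data (the integers 1 to 100, not the bullhead data). The names vals, q05, and q90 are hypothetical.

```r
# Made-up data for illustration only: the integers 1 through 100
vals <- 1:100

# Ask for several quantiles at once by passing a vector of proportions
q <- quantile(vals, c(0.05, 0.90))
q05 <- q[["5%"]]
q90 <- q[["90%"]]

# Proportion of observations falling below each cutoff
mean(vals < q05)  # about 5% of the data
mean(vals < q90)  # about 90% of the data
```

The value returned for the 90% quantile is the cutoff above which the upper 10% of the data lie, which is how the comment on the bullhead example should be read.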
We talked about finding Q1 and Q3 in class by hand. Do the values calculated by hand match the values
given in R?
x <- c(4.58,3.80,4.01,4.05,4.27,4.35,4.21)
sort(x)
## [1] 3.80 4.01 4.05 4.21 4.27 4.35 4.58
quantile(x,c(.25,.75))
## 25% 75%
## 4.03 4.31
It turns out there are at least nine different methods for calculating Q1 and Q3. R's quantile command
implements nine of them, selected with its type argument; the default is type 7. For small data sets, the
difference between methods is noticeable. For large data sets, you would never notice. Either answer is acceptable.
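A short sketch of how the method can be changed, using the same seven values. Type 7 is R's default; type 6 is shown only as one alternative. Which type matches a hand calculation depends on the rule used, so treat the comparison as illustrative rather than as the method from class.

```r
x <- c(4.58, 3.80, 4.01, 4.05, 4.27, 4.35, 4.21)

# R's default method (type 7); same result as quantile(x, c(.25, .75))
q_default <- quantile(x, c(0.25, 0.75), type = 7)
q_default

# One of the alternative methods (type 6)
q_type6 <- quantile(x, c(0.25, 0.75), type = 6)
q_type6
```

On these seven values, type 6 returns 4.01 and 4.35 instead of 4.03 and 4.31, which shows how much the choice can matter for a small data set.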
The five-number summary is also generated by summary. Notice that R includes the sample mean as well as
the median. Regardless, the median is what is drawn when the boxplot is created.
boxplot(bullhead$length)
boxplot(bullhead$length,horizontal=TRUE,main="Brown Bullhead in PIB",xlab="Length (mm)")
[Figure: horizontal boxplot titled "Brown Bullhead in PIB"; x-axis "Length (mm)", ticks from 260 to 380]
According to the boxplot, there are two outliers. Outliers can arise for a few different reasons:
1. A typographic or transcription error
2. A miscalibrated instrument
3. Natural variation
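R's boxplot flags a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box (this is the default, set by the range argument). The arithmetic can be sketched from the summary values reported earlier. Note that boxplot builds its box from hinges, which can differ slightly from the quartiles used here, so this is an approximation of the rule rather than the exact internal calculation.

```r
# Values copied from the summary output for bullhead length (mm)
q1 <- 313
q3 <- 345
min_len <- 263
max_len <- 405

iqr <- q3 - q1                 # interquartile range: 32
lower_fence <- q1 - 1.5 * iqr  # 265
upper_fence <- q3 + 1.5 * iqr  # 393

# Both comparisons are TRUE, so the smallest and largest fish are flagged
min_len < lower_fence
max_len > upper_fence
```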
Sometimes it is argued that you should remove outliers. I think that if you do that, you need to report
results both with and without any outliers that can’t be attributed to human error.
mean(bullhead$length) # With outliers
## [1] 329.6036
sd(bullhead$length) # With outliers
## [1] 24.51624
which(bullhead$length == min(bullhead$length))
## [1] 32
which(bullhead$length == max(bullhead$length))
## [1] 188
boxplot(bullhead$length[-c(32,188)]) # Without outliers
mean(bullhead$length[-c(32,188)]) # Without outliers
## [1] 329.5636
sd(bullhead$length[-c(32,188)]) # Without outliers
## [1] 23.67108
In this case, removing the outliers has almost no effect on the mean (329.60 vs. 329.56), while the standard
deviation shrinks a little (24.52 vs. 23.67). The more observations you have, the less need there is to remove
outliers. There are 222 observations here, so I would not remove these outliers.
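The row numbers 32 and 188 were found by assuming the outliers are the minimum and the maximum. As an alternative sketch, boxplot.stats returns the values a boxplot would flag, so the rows can be located without typing indices by hand. The vector toy below is made up for illustration and is not the bullhead data.

```r
# Made-up lengths for illustration; 40 is an obvious outlier
toy <- c(10, 12, 13, 14, 15, 16, 40)

# Values that boxplot() would draw as individual points
flagged <- boxplot.stats(toy)$out
flagged

# Positions of the flagged values in the original vector
idx <- which(toy %in% flagged)

# Guard against the case where nothing is flagged: toy[-integer(0)] is empty
kept <- if (length(idx) > 0) toy[-idx] else toy

# Report the mean and standard deviation with and without the flagged values
c(mean = mean(toy), sd = sd(toy))
c(mean = mean(kept), sd = sd(kept))
```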
We can explore relationships in the data using side-by-side boxplots. There are two categorical variables of
interest, gender and sampling location within PIB, across which we may want to compare lengths.
table(bullhead$sex)
boxplot(length~sex,data=bullhead,main="Length (in mm) by Gender")
boxplot(length~sex,data=bullhead,main="Length (in mm) by Gender",
names=c("Unknown","Female","Male"))
table(bullhead$pibloc)
boxplot(length~pibloc,data=bullhead,main="Length (in mm) by Location",
names=c("Duck","Graveyard","Lagoons","Misery Bay","Sara's Cove"))
boxplot(age~pibloc,data=bullhead,main="Age (in years) by Location",
names=c("Duck","Graveyard","Lagoons","Misery Bay","Sara's Cove"))
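When supplying names to boxplot, the labels are applied in the order of the grouping variable's levels, so it is worth checking that order first. The sketch below uses a made-up data frame; the codes "F", "M", and "U" are hypothetical and may not match how sex is coded in the bullhead data.

```r
# Made-up data frame for illustration; column names mirror the lab's
toy <- data.frame(
  length = c(300, 320, 310, 340, 350, 330),
  sex    = factor(c("F", "F", "M", "M", "U", "U"))
)

# Groups are drawn in the order of the factor levels,
# so check that order before supplying names=
levels(toy$sex)

# Group medians: the centre line of each box in a side-by-side boxplot
group_medians <- tapply(toy$length, toy$sex, median)
group_medians
```

Comparing levels with the output of table is a quick way to confirm that each label in names lines up with the intended group.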
Coefficient of Variation
Another measure of spread, often used in biology, is the coefficient of variation (CV).
$$CV = \frac{s}{\bar{x}} \times 100\%$$
Uses for the coefficient of variation:
• The resulting answer is unitless
• Can compare spread between variables with different units
• Can compare spread between variables on different scales
• It is a measure of variability
• Also a measure of accuracy
The coefficient of variation has no built-in command in R, but it is straightforward to calculate from sd and mean.
Which is more variable in bullhead: age, length, or weight?
sd(bullhead$length)/mean(bullhead$length)*100
## [1] 7.438097
sd(bullhead$weight)/mean(bullhead$weight)*100
## [1] 26.43223
sd(bullhead$age)/mean(bullhead$age)*100
## [1] 31.44295
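From the output above, age has the largest CV (about 31.4%) and length the smallest (about 7.4%), so age is the most variable relative to its mean. Since the same expression is typed three times, it could be wrapped in a small function. The helper cv below is hypothetical (it is not a built-in R command), and the vectors a and b are made up to illustrate that the CV does not depend on the units.

```r
# Hypothetical helper: coefficient of variation as a percentage
cv <- function(x) {
  sd(x) / mean(x) * 100
}

# Made-up measurements, and the same measurements on a scale 1000 times larger
a <- c(10, 12, 14)
b <- a * 1000

cv(a)
cv(b)  # same as cv(a), because the CV is unitless
```

With the data loaded, cv(bullhead$length) should reproduce the value computed for length above.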