R Lab #2: Measures of Spread
STAT 250
01/24/2017

How to get graphs for Homework

I forgot to show you how to copy your graphs for the homework assignment. Let me show you.

    bullhead <- read.csv("http://tiny.cc/pubh2w")
    hist(bullhead$length, main="Bullhead Length", xlab="mm")

In the lower right window of RStudio, the Plots tab has an “Export” option. You can copy the plot to the clipboard and then paste it into a Word document. You can put multiple graphs on one page.

Measures of Spread

The standard deviation, variance, and quartiles have built-in commands in R. The coefficient of variation needs to be calculated, but it is straightforward to find.

    bullhead <- read.csv("http://tiny.cc/pubh2w")
    sd(bullhead$length)
    ## [1] 24.51624
    var(bullhead$length)
    ## [1] 601.0458
    quantile(bullhead$length, c(.25,.75))
    ## 25% 75%
    ## 313 345
    summary(bullhead$length)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##   263.0   313.0   331.0   329.6   345.0   405.0

So, for bullhead length, the data appear to be unimodal and symmetric. Note that the mean (329.6) and the median (331.0) are nearly identical, indicating a lack of skewness. The middle 50% of lengths are between 313 mm and 345 mm.

A word about the quantile command. The quantile command can find any percentile: the x-th percentile is the value below which x% of the data fall. Normally, we use it just to find Q1 (the 25th percentile) and Q3 (the 75th percentile), but it can find others.

    quantile(bullhead$length, c(.05))   # Lower 5% of fish lengths
    ## 5%
    ## 286.2
    quantile(bullhead$length, c(.90))   # Upper 10% of fish lengths
    ## 90%
    ## 360

We talked about finding Q1 and Q3 in class by hand. Do the values calculated by hand match the values given in R?

    x <- c(4.58,3.80,4.01,4.05,4.27,4.35,4.21)
    sort(x)
    ## [1] 3.80 4.01 4.05 4.21 4.27 4.35 4.58
    quantile(x, c(.25,.75))
    ## 25% 75%
    ## 4.03 4.31

It turns out there are at least nine different methods for calculating Q1 and Q3. R's quantile command implements nine of them, selected with its type argument (the default is type 7), so its answer may differ from the one you get by hand. For small data sets, the difference is notable.
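The effect of the method can be seen directly on the seven values used above. This is a small sketch, not a statement about which method was used in class: type 6 is picked here only as one example of an alternative to R's default, and the names q_default and q_alt are made up for illustration.

```r
# Same seven values as in the example above
x <- c(4.58, 3.80, 4.01, 4.05, 4.27, 4.35, 4.21)

# R's default method (type = 7)
q_default <- quantile(x, c(.25, .75), type = 7)

# An alternative method (type = 6), shown only for comparison
q_alt <- quantile(x, c(.25, .75), type = 6)

print(q_default)
print(q_alt)

# Difference between the two methods for each quartile
print(q_alt - q_default)
```

With only seven observations, the default gives 4.03 and 4.31, while type 6 gives 4.01 and 4.35.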
For large data sets, you would never notice the difference between methods. Either answer is acceptable.

The five-number summary is also generated by summary. Notice that R includes the sample mean as well as the median. Regardless, the median is used when the boxplot is created.

    boxplot(bullhead$length)
    boxplot(bullhead$length, horizontal=TRUE, main="Brown Bullhead in PIB", xlab="Length (mm)")

[Figure: horizontal boxplot titled "Brown Bullhead in PIB"; axis: Length (mm)]

According to the boxplot, there are two outliers. Outliers can arise for a few different reasons:

1. A typographic or transcription error
2. A miscalibrated instrument
3. Natural variation

Sometimes it is argued that you should remove outliers. I think that if you do that, you need to report results both with and without any outliers that can’t be attributed to human error.

    mean(bullhead$length)   # With outliers
    ## [1] 329.6036
    sd(bullhead$length)   # With outliers
    ## [1] 24.51624
    which(bullhead$length == min(bullhead$length))   # Position of the shortest fish
    ## [1] 32
    which(bullhead$length == max(bullhead$length))   # Position of the longest fish
    ## [1] 188
    boxplot(bullhead$length[-c(32,188)])   # Without outliers

[Figure: boxplot of bullhead length with the two outliers removed]

    mean(bullhead$length[-c(32,188)])   # Without outliers
    ## [1] 329.5636
    sd(bullhead$length[-c(32,188)])   # Without outliers
    ## [1] 23.67108

In this case, removing the outliers has almost no effect on the mean (329.60 vs. 329.56), while the standard deviation shrinks a little (24.52 vs. 23.67). The more observations you have, the less the need to remove outliers. There are 222 observations here, so I would not remove these outliers.

We can explore some relationships in the data using boxplots. There are two categorical variables of interest, gender and sampling location within PIB, across which we may want to compare lengths.
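The commands that follow use R's formula interface, where length~sex reads as "length split by sex". As a minimal sketch of what that split does, the made-up data frame below (fish, len, and grp are hypothetical names and values, not the bullhead data) compares the group medians from tapply with the center lines that boxplot would draw.

```r
# Hypothetical data: lengths (mm) for two groups, not the bullhead data
fish <- data.frame(
  len = c(300, 310, 320, 330, 340, 280, 290, 305, 315),
  grp = c("F", "F", "F", "F", "F", "M", "M", "M", "M")
)

# Counts per group; the boxes are drawn in this same order
print(table(fish$grp))

# Median length within each group
meds <- tapply(fish$len, fish$grp, median)
print(meds)

# plot = FALSE returns the boxplot statistics without drawing anything;
# row 3 of $stats holds the median of each group
bp <- boxplot(len ~ grp, data = fish, plot = FALSE)
print(bp$names)
print(bp$stats[3, ])
```

Because the boxes follow the group order shown by table, any labels supplied through names= need to be listed in that same order.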
    table(bullhead$sex)
    boxplot(length~sex, data=bullhead, main="Length (in mm) by Gender")
    boxplot(length~sex, data=bullhead, main="Length (in mm) by Gender",
            names=c("Unknown","Female","Male"))
    table(bullhead$pibloc)
    boxplot(length~pibloc, data=bullhead, main="Length (in mm) by Location",
            names=c("Duck","Graveyard","Lagoons","Misery Bay","Sara's Cove"))
    boxplot(age~pibloc, data=bullhead, main="Age (in years) by Location",
            names=c("Duck","Graveyard","Lagoons","Misery Bay","Sara's Cove"))

Coefficient of Variation

Another measure of spread, often used in biology, is the coefficient of variation (CV).

    CV = (s / x̄) × 100%

Here s is the sample standard deviation and x̄ is the sample mean.

Uses for the coefficient of variation:

• The resulting answer is unitless
• Can compare spread between variables with different units
• Can compare spread between variables on different scales
• It is a measure of variability
• Also used as a measure of precision

R has no built-in command for the coefficient of variation, but it is straightforward to calculate from the standard deviation and the mean.

Which is more variable in bullhead: age, length, or weight?

    sd(bullhead$length)/mean(bullhead$length)*100
    ## [1] 7.438097
    sd(bullhead$weight)/mean(bullhead$weight)*100
    ## [1] 26.43223
    sd(bullhead$age)/mean(bullhead$age)*100
    ## [1] 31.44295

Age has the largest CV (about 31.4%), followed by weight (26.4%) and length (7.4%), so relative to its mean, age is the most variable of the three.
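Since the same calculation is repeated for each variable, it can be wrapped in a small helper. This is only a sketch: cv is a hypothetical function name, and the lengths below are made-up values rather than the bullhead data. It also illustrates the unitless property, since converting millimeters to centimeters changes the standard deviation but not the CV.

```r
# Hypothetical helper: coefficient of variation as a percentage
cv <- function(x) {
  sd(x) / mean(x) * 100
}

# Made-up lengths in millimeters (not the bullhead data)
len_mm <- c(300, 310, 320, 330, 340)
len_cm <- len_mm / 10   # same fish, measured in centimeters

print(sd(len_mm))   # standard deviation depends on the units
print(sd(len_cm))
print(cv(len_mm))   # CV is the same in either unit (up to rounding)
print(cv(len_cm))
```

With the lab data loaded, cv(bullhead$length) would repeat the first calculation above.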