February 8, 2010

Class # 2: The Boxplot and Numerical Summaries of Data

Reading: Sections 1.3 & 1.4 of Devore

The Boxplot: The boxplot is another device meant to show where a data set is centered and how it is spread out. The "box" in the boxplot is composed of three lines. The left line of the box marks the first quartile (the 25th percentile), the middle line marks the median (the 50th percentile), and the right line marks the third quartile (the 75th percentile). A whisker is then drawn from each end of the box to either the most extreme observation in that direction or to the point that is 1.5 × IQR away from that end of the box, whichever is less extreme. Here IQR is the interquartile range, calculated as IQR = third quartile − first quartile. A boxplot can easily be drawn in R using the boxplot command.

    > grades = c(43, 72, 78, 81, 85, 87, 88, 92, 98)
    > boxplot(grades)

Measures of Location: There are two common measures of location, the median and the mean. The median is typically preferred in data sets that have outliers.

• mean: \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \). The mean of the data set grades can easily be calculated in R by typing

    > mean(grades)

• median:

\[
\tilde{x} =
\begin{cases}
x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[4pt]
\dfrac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2} & \text{if } n \text{ is even}
\end{cases}
\]

  where \( x_{(k)} \) denotes the kth smallest observation. The median of the data set grades can be calculated in R by typing

    > median(grades)

Measures of Spread: The four most common measures of spread are the IQR, the range, the sample variance (\( s^2 \)), and the sample standard deviation (\( s \), where \( s = \sqrt{s^2} \)). The range and sample variance are defined below.

• Range = \( x_{(n)} - x_{(1)} \). It is not preferred in data sets with extreme values.

• \( s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right) \). \( s = \sqrt{s^2} \) is typically preferred to \( s^2 \) since it is in the same units as x. The sample variance of the data set grades is calculated in R by typing

    > var(grades)

  and the standard deviation can be calculated by typing

    > sd(grades)

Homework: These problems are due at the beginning of class on February 10, 2010.

1. Exercise 11, page 20 in Devore. Don't worry about correctly labeling the stems as 6L and 6H, etc. Just construct an appropriate stem-and-leaf plot. The data set is on the course webpage.

   Help on this problem: The course website is http://williams.edu/Mathematics/cbotts/st201.html. To download the data set for this problem, go to the icon for the data set on problem 1, right click, and save the file as a .csv document. Specifically, save it as hw2.data1.csv. Remember where you put this file! Now type the following in R:

    > myfile = file.choose()

   This gives you a dialog box to choose the file. Find the file hw2.data1.csv. Now type

    > data = read.csv(myfile, header = TRUE)

   This is your data set. The actual grades in this data set are in data$exam.scores. The stem command, which creates a stem-and-leaf plot, is in the R package graphics. This package ships with R and is loaded automatically when R starts, so there is nothing to install. You can create the stem-and-leaf plot with the command

    > stem(data$exam.scores)

   This will produce a stem-and-leaf plot, but it might not be the exact stem-and-leaf plot you are looking for. To construct the plot you want, you might need to look into the options that are available with the stem command. To find these options, type

    > help("stem")

2. Exercise 15, page 21 in Devore. The data set is on the course webpage. Note: on this problem, there is no need to make a comparative stem-and-leaf plot. Just plot two stem-and-leaf plots and compare them in writing. To download the data for this problem, follow the same procedure as in the first problem. This time the data for the creamy peanut butter is in data$Creamy and the data for the crunchy is in data$Crunchy.

3. Would you expect the distributions of these variables to be uniform, unimodal, or bimodal? Symmetric or skewed? Explain why.
   (a) Ages of people at a Little League game.
   (b) Number of siblings of people in your class.
   (c) Pulse rates of college-age males.
   (d) Number of times each face of a die shows in 100 tosses.

4. The data set for this problem (on the course website) gives the heights of some of the singers in a chorus, collected so that the singers could be positioned on stage with shorter ones in front and taller ones in back. To download the data, follow the steps given in Problem 1. The heights will be in data$height.
   (a) Plot a histogram of this data set and describe the distribution.
   (b) Can you account for the features you see in the distribution of the data?

5. How many points do football teams score in the Super Bowl? The total number of points scored by both teams in each of the first 41 Super Bowl games is given on the course website. Plot a histogram of these scores and describe the distribution. To download the data, follow the steps given in Problem 1. This time the Super Bowl scores are in the vector data$SB.
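The whisker rule from the boxplot section can be checked by hand on the grades data. What follows is a minimal sketch and not part of the original handout. The variable names (q1, q3, lower_fence, and so on) are chosen here for illustration, and the quartiles come from R's default quantile method. One assumption is worth stating: R's boxplot() ends each whisker at the most extreme observation that lies within 1.5 × IQR of the box, rather than at the 1.5 × IQR point itself, so the sketch does the same.

```r
# Grades data from the notes
grades <- c(43, 72, 78, 81, 85, 87, 88, 92, 98)

# The three lines of the box
q1  <- unname(quantile(grades, 0.25))  # first quartile
med <- median(grades)                  # median
q3  <- unname(quantile(grades, 0.75))  # third quartile
iqr <- q3 - q1                         # same value as IQR(grades)

# Points that are 1.5 * IQR beyond each end of the box
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Each whisker stops at the most extreme observation inside its fence
lower_whisker <- min(grades[grades >= lower_fence])
upper_whisker <- max(grades[grades <= upper_fence])

# Observations beyond the fences are drawn as separate points
outliers <- grades[grades < lower_fence | grades > upper_fence]

print(c(q1 = q1, median = med, q3 = q3, IQR = iqr))
print(c(lower_whisker = lower_whisker, upper_whisker = upper_whisker))
print(outliers)

# The numbers boxplot() itself uses:
# lower whisker, lower hinge, median, upper hinge, upper whisker
print(boxplot.stats(grades)$stats)
```

For this data set the box runs from 78 to 88 with the median at 85, the whiskers end at 72 and 98, and the grade of 43 falls below the lower fence of 63, so it is plotted as an individual point.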
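The formulas for the mean and the median in the Measures of Location section can be compared against R's built-in commands. This is an illustrative sketch only. The helper name median_by_formula is hypothetical and introduced here, and dropping the grade of 43 is just a way to see why the median is preferred when outliers are present.

```r
grades <- c(43, 72, 78, 81, 85, 87, 88, 92, 98)

# Mean from the definition: add up the observations and divide by n
n <- length(grades)
xbar <- sum(grades) / n

# Median from the definition, using the sorted observations x_(1), ..., x_(n)
# (median_by_formula is a hypothetical helper, not an R built-in)
median_by_formula <- function(x) {
  s <- sort(x)
  m <- length(s)
  if (m %% 2 == 1) {
    s[(m + 1) / 2]                 # n odd: the middle observation
  } else {
    (s[m / 2] + s[m / 2 + 1]) / 2  # n even: average of the two middle ones
  }
}
xtilde <- median_by_formula(grades)

print(c(by_formula = xbar, built_in = mean(grades)))
print(c(by_formula = xtilde, built_in = median(grades)))

# Effect of the low grade of 43 on each measure
no_outlier <- grades[grades != 43]
print(c(mean_all = mean(grades), mean_without_43 = mean(no_outlier)))
print(c(median_all = median(grades), median_without_43 = median(no_outlier)))
```

Removing the 43 moves the mean from about 80.4 to 85.125, while the median only moves from 85 to 86.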
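The two expressions for the sample variance in the Measures of Spread section should give the same number, and both should agree with var(). A short sketch of that check on the grades data follows. The names s2_definition and s2_shortcut are chosen here for illustration.

```r
grades <- c(43, 72, 78, 81, 85, 87, 88, 92, 98)
n <- length(grades)
xbar <- mean(grades)

# Range: largest observation minus smallest observation
data_range <- max(grades) - min(grades)

# Sample variance from the definition
s2_definition <- sum((grades - xbar)^2) / (n - 1)

# Sample variance from the shortcut form: (sum of squares - n * xbar^2) / (n - 1)
s2_shortcut <- (sum(grades^2) - n * xbar^2) / (n - 1)

# Sample standard deviation, in the same units as the data
s <- sqrt(s2_definition)

print(data_range)
print(c(definition = s2_definition, shortcut = s2_shortcut, built_in = var(grades)))
print(c(by_formula = s, built_in = sd(grades)))
```

Note that range(grades) in R returns the smallest and largest values rather than their difference, which is why the sketch subtracts min from max directly.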
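For the stem-and-leaf problems, the option of stem that most often matters is scale, which controls how long the plot is. The sketch below is an illustration under stated assumptions: the exam scores are made up for this example and are not the contents of hw2.data1.csv, and the file is simulated with the text argument of read.csv so that the example runs without downloading anything. Only the column name exam.scores is taken from the handout.

```r
# Hypothetical scores standing in for the course data file (NOT the real data)
scores <- c(61, 64, 67, 70, 72, 73, 75, 78, 81, 84, 88, 93)

# Build the text of a small CSV file with a header row
csv_text <- paste(c("exam.scores", scores), collapse = "\n")

# Read it the same way as in Problem 1, but from text instead of a chosen file
data <- read.csv(text = csv_text, header = TRUE)

# Default stem-and-leaf plot
stem(data$exam.scores)

# scale = 2 makes the plot roughly twice as long,
# which typically splits each stem into two lines
stem(data$exam.scores, scale = 2)
```

With the real data, replacing the read.csv(text = ...) line by the file.choose() and read.csv(myfile, header = TRUE) steps from Problem 1 gives the same workflow.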
