February 8, 2010
Class # 2: The Boxplot and Numerical Summaries of Data
Reading: Sections 1.3 & 1.4 of Devore
The Boxplot: The boxplot is another device meant to show where a data set is centered
and how it is spread out. The "box" in the boxplot is composed of three lines. The left line of
the box marks the first quartile (the 25th percentile), the middle line marks the median (the
50th percentile), and the right line marks the third quartile (the 75th percentile). A whisker is then
drawn from each end of the box to either the most extreme observation in that direction or to the
point that is 1.5 × IQR away from that end of the box (whichever is less extreme). Here IQR is the
interquartile range, calculated as
IQR = third quartile − first quartile.
A boxplot can easily be drawn in R using the boxplot command.
> grades = c(43, 72, 78, 81, 85, 87, 88, 92, 98)
> boxplot(grades)
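The quantities behind the boxplot can also be computed by hand. The following is a minimal sketch, assuming R's default quantile() definition of the quartiles (which may differ slightly from the textbook's definition for other data sets) and R's convention of ending each whisker at the most extreme observation that lies within 1.5 × IQR of the box. The variable names are illustrative only.

```r
grades <- c(43, 72, 78, 81, 85, 87, 88, 92, 98)

# Quartiles and interquartile range (R's default quantile definition)
q1  <- quantile(grades, 0.25, names = FALSE)
q3  <- quantile(grades, 0.75, names = FALSE)
iqr <- q3 - q1

# Points lying 1.5 * IQR beyond each end of the box
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences;
# observations outside the fences are plotted as individual points
inside   <- grades[grades >= lower_fence & grades <= upper_fence]
whiskers <- range(inside)
outliers <- grades[grades < lower_fence | grades > upper_fence]

# The five numbers that boxplot() actually draws, for comparison
boxplot.stats(grades)$stats
```

By default R draws the boxplot vertically; typing boxplot(grades, horizontal = TRUE) gives the left-to-right orientation described above.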
Measures of Location: There are two common measures of location, the median and the mean.
The median is typically preferred in data sets that have outliers.
• mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. The mean for the data set grades can easily be calculated in R by
typing
> mean(grades)

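• median: $\tilde{x} = x_{\left(\frac{n+1}{2}\right)}$ if $n$ is odd, and $\tilde{x} = \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right)$ if $n$ is even,
where $x_{(k)}$ denotes the $k$th smallest observation. The median of the data set grades can be
calculated in R by typing
> median(grades)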
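As a check on the two definitions, the mean and median can be computed directly from the formulas and compared with R's built-in commands. This is only a sketch: the even-sized data set is made by dropping the lowest grade purely for illustration, and the variable names are not part of the course material.

```r
grades <- c(43, 72, 78, 81, 85, 87, 88, 92, 98)
n <- length(grades)

# Mean: add up the observations and divide by n
xbar <- sum(grades) / n

# Median, n odd: the middle value of the sorted data
sorted     <- sort(grades)
median_odd <- sorted[(n + 1) / 2]

# Median, n even: average the two middle values.
# Dropping the lowest grade here is for illustration only.
even        <- sorted[-1]
m           <- length(even)
median_even <- (even[m / 2] + even[m / 2 + 1]) / 2

# Compare with the built-in commands
c(xbar, mean(grades))
c(median_odd, median(grades))
c(median_even, median(even))
```

For grades the mean is about 80.4 while the median is 85: the single low score of 43 pulls the mean down but leaves the median unchanged, which is why the median is preferred when outliers are present.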
Measures of Spread: The four most common measures of spread are the IQR, the range, the
sample variance ($s^2$), and the sample standard deviation ($s$, where $s = \sqrt{s^2}$). The range and sample
variance are defined below.
• Range = $x_{(n)} - x_{(1)}$, and is not preferred in data sets with extreme values.
• $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)$. $s = \sqrt{s^2}$ is typically preferred to $s^2$ since it
is in the same units as x. The sample variance of the data set grades is calculated in R by
typing
> var(grades)
and the standard deviation can be calculated by typing
> sd(grades)
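The two expressions for the sample variance can be checked against var and sd numerically. The sketch below uses illustrative variable names and R's default quartile definition for the IQR.

```r
grades <- c(43, 72, 78, 81, 85, 87, 88, 92, 98)
n    <- length(grades)
xbar <- mean(grades)

# Sample variance from the definition
s2_definition <- sum((grades - xbar)^2) / (n - 1)

# Sample variance from the shortcut formula
s2_shortcut <- (sum(grades^2) - n * xbar^2) / (n - 1)

# Sample standard deviation: same units as the data
s <- sqrt(s2_definition)

# Range as a single number: largest minus smallest observation
data_range <- max(grades) - min(grades)

# Interquartile range (R's default quartile definition)
iqr <- IQR(grades)

# Both variance formulas should agree with var(), and s with sd()
c(s2_definition, s2_shortcut, var(grades))
c(s, sd(grades))
```

Note that the R command range(grades) returns the smallest and largest values rather than their difference; diff(range(grades)) gives the range as defined above.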
Homework: These problems are due at the beginning of class on February 10, 2010.
1. Exercise 11, page 20 in Devore. Don’t worry about correctly labeling the stems as 6L and 6H,
etc. Just construct an appropriate stem-and-leaf plot. The data set is on the course webpage.
Help on this problem: The course website is http://williams.edu/Mathematics/cbotts/st201.html.
To download the data set for this problem, go to the icon for the data set on problem 1, right
click, and save the file as a .csv document. Specifically save it as hw2.data1.csv. Remember
where you put this file! Now type in the following in R
> myfile = file.choose()
This gives you a dialog box to choose the filename. Find the file hw2.data1.csv.
Now type in
> data = read.csv(myfile, header = TRUE)
This is your data set. And the actual grades in this data set are in data$exam.scores.
The stem-and-leaf plot is produced by the stem command, which belongs to the R package
graphics. This package comes with every installation of R and is loaded automatically when R
starts, so there is nothing to install. If you want to load the package explicitly, you can type
> library("graphics")
You can now create the stem-and-leaf plot with the command
> stem(data$exam.scores)
This will produce a stem-and-leaf plot, but it might not be the exact stem-and-leaf plot you
are looking for. To construct the stem-and-leaf plot you want, you might need to look into the
options that are available with the stem command. To find these options, type in
> help("stem")
2. Exercise 15, page 21 in Devore. The data set is on the course webpage. Note: on this problem,
there is no need to make a comparative stem-and-leaf plot. Just plot two stem-and-leaf plots
and compare them in writing.
To download the data for this problem, follow the same procedure as for the first
problem. This time the data for the creamy peanut butter are in data$Creamy and the
data for the crunchy are in data$Crunchy.
3. Would you expect distributions of these variables to be uniform, unimodal, or bimodal? Symmetric or skewed? Explain why.
(a) Ages of people at a Little League game.
(b) Number of siblings of people in your class.
(c) Pulse rates of college-age males.
(d) Number of times each face of a die shows in 100 tosses.
4. The data set for this problem (on the course website) gives the heights of some of the singers in
a chorus, collected so that the singers could be positioned on stage with shorter ones in front
and taller ones in back. To download the data, follow the steps that are given in Problem 1.
The heights will be in data$height.
(a) Plot a histogram of this data set and describe the distribution.
(b) Can you account for the features you see in the distribution of the data?
5. How many points do football teams score in the Super Bowl? The total number of points
scored by both teams in each of the first 41 Super Bowl games is given on the course website.
Plot a histogram of these scores and describe the distribution. To download the data, follow
the steps that are given in Problem 1. This time the Super Bowl scores are in the vector
data$SB.
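The read-then-plot workflow used in these problems can be tried out before downloading anything. The sketch below writes a small made-up data set to a temporary .csv file and reads it back. The scores are invented for illustration and are NOT the course data; only the column name exam.scores is taken from Problem 1.

```r
# Made-up scores for illustration only (not the course data)
scores <- data.frame(exam.scores = c(61, 64, 67, 70, 72, 75, 78, 81, 84, 88, 93))

# Write them to a temporary .csv file, standing in for the downloaded file
tmp <- tempfile(fileext = ".csv")
write.csv(scores, tmp, row.names = FALSE)

# Read the file the same way as in Problem 1
# (file.choose() is replaced by the known path)
data <- read.csv(tmp, header = TRUE)
str(data)

# Stem-and-leaf plot with the default settings
stem(data$exam.scores)

# The scale argument stretches the plot, which splits the stems further
stem(data$exam.scores, scale = 2)

# Histogram of the same column; main and xlab set the title and axis label
hist(data$exam.scores, main = "Made-up exam scores", xlab = "Score")
```

With the real files, replace tmp by the path returned by file.choose() and use the column named in each problem (for example data$height or data$SB).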