Ch2 exercises: 2.5, 2.11, 2.12, 2.36, 2.39, 2.44, 2.51
2.5 Carbon dioxide emissions. Table 1.6 gives the 2007 carbon dioxide (CO2) emissions per
person for countries with populations of at least 30 million. Find the mean and the median for
these data. Make a histogram of the data. What features of the distribution explain why the mean
is larger than the median?
Read the data by this R command:
data<-read.csv("http://www.yorku.ca/nuri/econ2500/moorebps6e/data/tbl-1.6-co2emissions.csv",header=T)
attach(data)
names(data)
Answer
The histogram shows a right skew. Hence, the mean is larger than the median. Here, the mean is
4.61 and the median is 3.95 tons per person.
Co2<-data[,2]
sort(Co2) # 39 observations
0.0272 0.0389 0.0828 0.1046 0.1464 0.2685 0.2773 0.2850 0.2976 0.6449
0.7993 0.9031 1.2935 1.3844 1.4301 1.4862 1.7677 1.9373 2.3065 3.9543 # median: 20th observation
4.1384 4.1432 4.3862 4.6525 4.9194 6.0207 6.8472 6.8598 7.6923 8.1555
8.3231 8.8163 8.8608 9.5690 9.8476 10.4941 10.8309 16.9171 18.9144
hist(Co2,main="per capita carbon dioxide emissions in 2007",ylab="Frequencies",xlab="CO2 emissions (metric tons per person)",col="green")
[Figure: histogram titled "per capita carbon dioxide emissions in 2007"; x-axis: CO2 emissions (metric tons per person); y-axis: Frequencies]
round(summary(Co2),4)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0272 0.7221 3.9540 4.6110 7.9240 18.9100
The mean is larger than the median because the distribution is positively skewed.
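The comparison can also be reproduced without the download by typing in the 39 sorted values printed above. This is only a minimal sketch: the vector name co2 and the variables n, xbar, and med are introduced here for illustration and are not part of the course files.

```r
# The 39 sorted emissions values printed above (metric tons per person).
co2 <- c(0.0272, 0.0389, 0.0828, 0.1046, 0.1464, 0.2685, 0.2773, 0.2850,
         0.2976, 0.6449, 0.7993, 0.9031, 1.2935, 1.3844, 1.4301, 1.4862,
         1.7677, 1.9373, 2.3065, 3.9543, 4.1384, 4.1432, 4.3862, 4.6525,
         4.9194, 6.0207, 6.8472, 6.8598, 7.6923, 8.1555, 8.3231, 8.8163,
         8.8608, 9.5690, 9.8476, 10.4941, 10.8309, 16.9171, 18.9144)

n    <- length(co2)   # number of countries
xbar <- mean(co2)     # arithmetic mean
med  <- median(co2)   # middle (20th) sorted value

# A long right tail pulls the mean above the median.
cat("n =", n, " mean =", round(xbar, 4), " median =", med, "\n")
cat("mean - median =", round(xbar - med, 4), "\n")
```

The two largest values (16.9171 and 18.9144) sit far from the bulk of the data, which is what lifts the mean above the median.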
2.11 xbar (mean) and s (standard deviation) are not enough. The mean and standard
deviation s measure center and spread but are not a complete description of a distribution. Data
sets with different shapes can have the same mean and standard deviation. To demonstrate this
fact, use your calculator to find xbar and s for these two small data sets. Then make a stemplot of
each and comment on the shape of each distribution.
Answer
Both data sets have the same mean and standard deviation (about 7.5 and 2.0, respectively).
Stemplots reveal that Data A have a very left-skewed distribution, while Data B have a slightly
right-skewed distribution.
data<-read.table("http://www.yorku.ca/nuri/econ2500/moore-bps6e/data/xrs-2.11data.txt",header=T)
attach(data)
names(data)
value<-data[,2]
dataA<-value[1:11] # data A
dataB<-value[12:22] # data B
sort(dataA)
3.10 4.74 6.13 7.26 8.10 8.14 8.74 8.77 9.13 9.14 9.26
mean(dataA) # 7.500909
sd(dataA) # 2.031657
stem(dataA,scale=2)
The decimal point is at the |
3 | 1
4 | 7
5 |
6 | 1
7 | 3
8 | 1178
9 | 113
sort(dataB)
5.25 5.56 5.76 6.58 6.89 7.04 7.71 7.91 8.47 8.84 12.50
mean(dataB) # 7.500909
sd(dataB) # 2.030579
stem(dataB,scale=2)
The decimal point is at the |
5 | 368
6 | 69
7 | 079
8 | 58
9 |
10 |
11 |
12 | 5
The two data sets have identical means and nearly identical standard deviations, but the first data
set is left-skewed, while the second is right-skewed with a high outlier.
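The same comparison can be checked from the sorted values printed above, without the download. This is a rough sketch: describe is a hypothetical helper name used only here, and the sign of (mean − median) is only a crude indicator of skew direction, not a formal measure.

```r
# Sorted values of the two data sets, copied from the output above.
dataA <- c(3.10, 4.74, 6.13, 7.26, 8.10, 8.14, 8.74, 8.77, 9.13, 9.14, 9.26)
dataB <- c(5.25, 5.56, 5.76, 6.58, 6.89, 7.04, 7.71, 7.91, 8.47, 8.84, 12.50)

# Center, spread, and a crude skew indicator for one data set.
# (describe is a hypothetical helper name used only in this sketch.)
describe <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x),
    mean_minus_median = mean(x) - median(x))
}

result <- rbind(A = describe(dataA), B = describe(dataB))
print(round(result, 4))
```

The means and standard deviations agree almost exactly, yet the mean lies below the median for Data A and above it for Data B, which matches the opposite skew directions seen in the stemplots.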
2.12 Choose a summary. The shape of a distribution is a rough guide to whether the mean and
standard deviation are a helpful summary of center and spread. For which of the following
distributions would xbar and s be useful? In each case, give a reason for your decision.
(a) Percents of high school graduates in the states taking the SAT, Figure 1.8 (page 18)
(b) Iowa Test scores, Figure 1.7 (page 17)
(c) New York travel times, Figure 2.1 (page 46)
Answer
(a) The mean and standard deviation are not good summaries because the distribution is bimodal and
asymmetric.
(b) This distribution is symmetric and single-peaked, without outliers, so the mean and standard
deviation are appropriate measures of centre and spread.
(c) The distribution is positively skewed, so the mean and standard deviation are not good
summaries.
2.36 Never on Sunday: also in Canada?
For reference, Exercise 1.5 (page 11) and its answer read as follows.
1.5 Never on Sunday? Births are not, as you might think, evenly distributed across the days of the
week. Here are the average numbers of babies born on each day of the week in 2008: [table not
reproduced]
Present these data in a well-labeled bar graph. Would it also be correct to make a pie chart?
Suggest some possible reasons why there are fewer births on weekends.
Answer to Exercise 1.5: A pie chart would make it more difficult to distinguish between the weekend
days and the weekdays. Some births are scheduled (induced labor, for example), and probably most
are scheduled for weekdays.
Exercise 2.36 itself: Exercise 1.5 gives the number of births in the United States on each day of
the week during an entire year. The boxplots in Figure 2.5 (page 62) are based on more detailed
data from Toronto, Canada: the number of births on each of the 365 days in a year, grouped by day
of the week. Based on these plots, compare the day-of-the-week distributions using shape, center,
and spread. Summarize your findings.
Answer
The most striking result is that there are fewer births on the weekend. There are also slightly
fewer births on Mondays than on the other weekdays. The distributions for each day do not have
strikingly different spreads, although Wednesdays are somewhat more spread out than the other days
(judging by the interquartile ranges). Most of the daily distributions are reasonably symmetric.
The general patterns are similar in the US and Toronto, but of course there are many more births
in the US than in Toronto.
2.39 A standard deviation contest. This is a standard deviation contest. You must choose four
numbers from the whole numbers 0 to 10, with repeats allowed.
(a) Choose four numbers that have the smallest possible standard deviation.
(b) Choose four numbers that have the largest possible standard deviation.
(c) Is more than one choice possible in either (a) or (b)? Explain.
Answer
(a) Choose four identical numbers, e.g., (4,4,4,4) or (6,6,6,6). All deviations from the mean are
then zero, so the standard deviation is 0, the smallest possible value.
(b) (0,0,10,10). These four numbers have the largest possible standard deviation because they are
as far as possible from their mean (xbar = 5).
(c) There is more than one possible answer for (a): any of the 11 whole numbers from 0 to 10,
repeated four times, gives s = 0. There is only one answer for (b).
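These claims can be checked by brute force, since there are only 11^4 = 14641 ordered ways to pick four numbers from 0 to 10. The following is a sketch of such a check, not part of the original solution; the names grid, sds, pick_distinct, smallest, and largest are introduced here for illustration.

```r
# All ordered 4-tuples drawn from 0..10 (11^4 = 14641 rows).
grid <- as.matrix(expand.grid(0:10, 0:10, 0:10, 0:10))

# Sample standard deviation of each row.
sds <- apply(grid, 1, sd)

# Sort within each row so that reorderings of the same four numbers coincide,
# then keep the distinct choices that reach the minimum or the maximum.
# (pick_distinct is a hypothetical helper name used only in this sketch.)
pick_distinct <- function(rows) unique(t(apply(rows, 1, sort)))
smallest <- pick_distinct(grid[abs(sds - min(sds)) < 1e-12, , drop = FALSE])
largest  <- pick_distinct(grid[abs(sds - max(sds)) < 1e-12, , drop = FALSE])

cat("min sd:", min(sds), "reached by", nrow(smallest), "choices\n")
cat("max sd:", round(max(sds), 4), "reached by", nrow(largest), "choice\n")
print(largest)
```

Treating reorderings of the same four numbers as one choice, the search finds 11 choices with s = 0 and a single choice, (0,0,10,10), with the largest s.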
2.44 Athletes’ salaries. The Montreal Canadiens were founded in 1909 and are the longest
continuously operating professional ice hockey team. They have won 24 Stanley Cups, making
them one of the most successful professional sports teams of the traditional four major sports of
Canada and the United States. Table 2.2 gives the salaries of the 2010-2011 roster. Provide the
team owner with a full description of the distribution of salaries and a brief summary of its most
important features.
Answer
Read the data in R:
data<-read.table("http://www.yorku.ca/nuri/econ2500/moorebps6e/data/xrs-2.44-data.txt",header=T)
attach(data)
names(data)
salary<-data[,2]
sort(salary)
500000 500000 550000 600000 600000 637500 875000 875000
875000 1000000 1300000 1350000 1500000 2250000 2500000 3250000
3250000 3833000 5000000 5000000 5000000 5500000 5750000 8000000
Below is a stem-and-leaf plot of the data, using millions of dollars for the stems and 100,000s of
dollars as the leaf unit. The distribution is clearly positively skewed and possibly bimodal, so I
decided to make a five-number summary of the data.
stem(salary,scale=2)
The decimal point is 6 digit(s) to the right of the |
0 | 556666999
1 | 0345
2 | 35
3 | 338
4 |
5 | 00058
6 |
7 |
8 | 0
and the five-number summary is
fivenum(salary)
500000 756250 1425000 4416500 8000000
As mentioned, the distribution of salaries is strongly positively skewed, with salaries ranging
from half a million dollars to 8 million dollars. The median salary is about 1.4 million dollars,
and the middle half of the players earn between about 0.8 and 4.4 million dollars. There appear to
be two modes: one near the lower end of the distribution and the other near 5 million dollars. The
highest salary, 8 million dollars, appears to be an outlier.
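The five-number summary can be reproduced from the sorted salaries printed above, and the 1.5 × IQR rule can be applied to the top salary as an extra check. This is a sketch under the assumption that the hinges returned by fivenum stand in for the quartiles; the names five, spread, and upper_fence are introduced here for illustration.

```r
# The 24 sorted salaries printed above (dollars).
salary <- c(500000, 500000, 550000, 600000, 600000, 637500, 875000, 875000,
            875000, 1000000, 1300000, 1350000, 1500000, 2250000, 2500000,
            3250000, 3250000, 3833000, 5000000, 5000000, 5000000, 5500000,
            5750000, 8000000)

five <- fivenum(salary)              # min, lower hinge, median, upper hinge, max
spread <- five[4] - five[2]          # hinge-based interquartile range
upper_fence <- five[4] + 1.5 * spread

print(five)
cat("upper fence:", upper_fence, "\n")
cat("salaries above the fence:", sum(salary > upper_fence), "\n")
```

With these hinges the upper fence is above 8 million dollars, so the rule does not flag the top salary, even though it stands apart from the rest in the stemplot.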
2.51 Carbon dioxide emissions. Table 1.6 gives the 2007 carbon dioxide (CO2) emissions per
person for countries with populations of at least 30 million in that year. A stemplot or histogram
shows that the distribution is strongly skewed to the right. The United States and several other
countries appear to be high outliers.
(a) Give the five-number summary. Explain why this summary suggests that the distribution is
right-skewed.
(b) Which countries are outliers according to the 1.5 × IQR rule? Make a stemplot of the data or
look at your stemplot from Exercise 1.36. Do you agree with the rule’s suggestions about
which countries are and are not outliers?
Answer
Read the data by this R command:
data<-read.csv("http://www.yorku.ca/nuri/econ2500/moorebps6e/data/tbl-1.6-co2emissions.csv",header=T)
attach(data)
names(data)
Co2<-data[,2]
sort(Co2)
0.0272 0.0389 0.0828 0.1046 0.1464 0.2685 0.2773 0.2850 0.2976 0.6449[=Q1]
0.7993 0.9031 1.2935 1.3844 1.4301 1.4862 1.7677 1.9373 2.3065 3.9543[=Q2]
4.1384 4.1432 4.3862 4.6525 4.9194 6.0207 6.8472 6.8598 7.6923 8.1555[=Q3]
8.3231 8.8163 8.8608 9.5690 9.8476 10.4941 10.8309 16.9171 18.9144
stem(Co2)
The decimal point is at the |
0 | 001113333689344589
2 | 3
4 | 011479
6 | 0897
8 | 238968
10 | 58
12 |
14 |
16 | 9
18 | 9
(a) Min = 0.0272, Q1 = 0.7221, Median = 3.9543, Q3 = 7.9239, Max = 18.9144. The maximum is farther
from Q3 than the minimum is from Q1. This suggests right skew.
fivenum(Co2)
0.0272 0.7221 3.9543 7.9239 18.9144
boxplot(Co2,col="green")
[Figure: boxplot of Co2, with axis ticks at 0, 5, 10, 15]
(b) IQR = 7.9239 − 0.7221 = 7.2018. Hence, 1.5 × IQR = 10.8027.
Now Q1 − 1.5 × IQR = 0.7221 − 10.8027 = −10.0806 < 0, so no values are more than 1.5 × IQR below
Q1. Also, Q3 + 1.5 × IQR = 7.9239 + 10.8027 = 18.7266, so the United States' value (18.9144) is an
outlier by this rule. Canada's value (16.9171) falls below the cutoff and is not flagged, although
by eye it might be considered an outlier too.
The authors' answer, which computes Q1, Q2, and Q3 by hand, is the following:
(a) Min = 0.0272, Q1 = 0.6449, Median = 3.954, Q3 = 8.1555, Max = 18.9144. Notice that the
maximum is farther from Q3 than the minimum is from Q1. This suggests right skew.
(b) IQR = 8.1555 − 0.6449 = 7.5106. Hence, 1.5 × IQR = 11.2659. Now Q1 − 1.5 × IQR = 0.6449 −
11.2659 < 0, so no values are more than 1.5 × IQR below Q1. Also, Q3 + 1.5 × IQR = 8.1555 +
11.2659 = 19.4214, so there are no high outliers. This rule is rather conservative: most people
would easily call the United States' value (18.9144) a far outlier, and perhaps Canada would be
considered an outlier, too.
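The two answers differ only in how the quartiles are computed, and that difference decides whether the United States is flagged. The sketch below puts the two conventions side by side using the values printed above; flag_outliers is a hypothetical helper name, and "manual" here means the median of each half of the sorted data with the overall median excluded, which reproduces the authors' Q1 and Q3 for these 39 values.

```r
# The 39 emissions values printed above (metric tons per person).
co2 <- sort(c(0.0272, 0.0389, 0.0828, 0.1046, 0.1464, 0.2685, 0.2773, 0.2850,
              0.2976, 0.6449, 0.7993, 0.9031, 1.2935, 1.3844, 1.4301, 1.4862,
              1.7677, 1.9373, 2.3065, 3.9543, 4.1384, 4.1432, 4.3862, 4.6525,
              4.9194, 6.0207, 6.8472, 6.8598, 7.6923, 8.1555, 8.3231, 8.8163,
              8.8608, 9.5690, 9.8476, 10.4941, 10.8309, 16.9171, 18.9144))
n <- length(co2)                     # 39, an odd count

# Convention 1: the hinges reported by fivenum().
hinges <- fivenum(co2)[c(2, 4)]

# Convention 2: medians of the lower and upper halves, median excluded.
half <- (n - 1) / 2
manual <- c(median(co2[1:half]), median(co2[(n - half + 1):n]))

# Values outside Q1 - 1.5*IQR or Q3 + 1.5*IQR for a given pair of quartiles.
# (flag_outliers is a hypothetical helper name used only in this sketch.)
flag_outliers <- function(x, q) {
  step <- 1.5 * (q[2] - q[1])
  x[x < q[1] - step | x > q[2] + step]
}

out_hinges <- flag_outliers(co2, hinges)
out_manual <- flag_outliers(co2, manual)
print(rbind(hinges = hinges, manual = manual))
cat("flagged with hinges:", out_hinges, "\n")
cat("flagged with manual quartiles:", length(out_manual), "values\n")
```

With the hinges the upper cutoff is 18.7266 and only 18.9144 is flagged; with the hand-computed quartiles the cutoff rises to 19.4214 and nothing is flagged.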