boxplot-outliers.txt / ch2-image

Ch2 exercises: 2.5, 2.11, 2.12, 2.36, 2.39, 2.44, 2.51

2.5 Carbon dioxide emissions. Table 1.6 gives the 2007 carbon dioxide (CO2) emissions per person for countries with populations of at least 30 million. Find the mean and the median for these data. Make a histogram of the data. What features of the distribution explain why the mean is larger than the median?

Read the data with these R commands:

data<-read.csv("http://www.yorku.ca/nuri/econ2500/moorebps6e/data/tbl-1.6-co2emissions.csv",header=TRUE)
attach(data)
names(data)

Answer

The histogram shows a right skew. Hence, the mean is larger than the median. Here, the mean is 4.61 and the median is 3.95 tons per person.

Co2<-data[,2]
sort(Co2) # 39 observations
 0.0272  0.0389  0.0828  0.1046  0.1464  0.2685  0.2773  0.2850  0.2976  0.6449
 0.7993  0.9031  1.2935  1.3844  1.4301  1.4862  1.7677  1.9373  2.3065  3.9543 # median, 20th observation
 4.1384  4.1432  4.3862  4.6525  4.9194  6.0207  6.8472  6.8598  7.6923  8.1555
 8.3231  8.8163  8.8608  9.5690  9.8476 10.4941 10.8309 16.9171 18.9144

hist(Co2,main="per capita carbon dioxide emissions in 2007",ylab="Frequencies",xlab="CO2 emissions (metric per person)",col="green")

[Figure: histogram titled "per capita carbon dioxide emissions in 2007"; x-axis: CO2 emissions (metric per person); y-axis: Frequencies]

round(summary(Co2),4)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0272  0.7221  3.9540  4.6110  7.9240 18.9100

The mean is larger than the median because the distribution is positively skewed: the few large values in the long right tail pull the mean upward but have little effect on the median.

2.11 xbar (mean) and s (standard deviation) are not enough. The mean xbar and standard deviation s measure center and spread but are not a complete description of a distribution. Data sets with different shapes can have the same mean and standard deviation. To demonstrate this fact, use your calculator to find xbar and s for these two small data sets. Then make a stemplot of each and comment on the shape of each distribution.

Answer

Both data sets have the same mean and standard deviation (about 7.5 and 2.0, respectively).
Stemplots reveal that Data A have a very left-skewed distribution, while Data B have a slightly right-skewed distribution.

data<-read.table("http://www.yorku.ca/nuri/econ2500/moore-bps6e/data/xrs-2.11data.txt",header=TRUE)
attach(data)
names(data)
value<-data[,2]
dataA<-value[1:11] # data A
dataB<-value[12:22] # data B

sort(dataA)
3.10 4.74 6.13 7.26 8.10 8.14 8.74 8.77 9.13 9.14 9.26
mean(dataA) # 7.500909
sd(dataA) # 2.031657
stem(dataA,scale=2)

  The decimal point is at the |

  3 | 1
  4 | 7
  5 |
  6 | 1
  7 | 3
  8 | 1178
  9 | 113

sort(dataB)
5.25 5.56 5.76 6.58 6.89 7.04 7.71 7.91 8.47 8.84 12.50
mean(dataB) # 7.500909
sd(dataB) # 2.030579
stem(dataB,scale=2)

  The decimal point is at the |

   5 | 368
   6 | 69
   7 | 079
   8 | 58
   9 |
  10 |
  11 |
  12 | 5

The two data sets have identical means and nearly identical standard deviations, but the first data set is left-skewed, while the second is right-skewed with a high outlier.

2.12 Choose a summary. The shape of a distribution is a rough guide to whether the mean and standard deviation are a helpful summary of center and spread. For which of the following distributions would xbar and s be useful? In each case, give a reason for your decision.
(a) Percents of high school graduates in the states taking the SAT, Figure 1.8 (page 18)
(b) Iowa Test scores, Figure 1.7 (page 17)
(c) New York travel times, Figure 2.1 (page 46)

Answer
(a) The mean and standard deviation are not good summaries because the distribution is bimodal and asymmetric.
(b) This distribution is symmetric and single-peaked, without outliers, so the mean and standard deviation are appropriate measures of center and spread.
(c) The distribution is positively skewed, so the mean and standard deviation are not good summaries.

2.36 Never on Sunday: also in Canada? Exercise 1.5 (page 11), reproduced here for reference:

1.5 Never on Sunday? Births are not, as you might think, evenly distributed across the days of the week.
Here are the average numbers of babies born on each day of the week in 2008:

Present these data in a well-labeled bar graph. Would it also be correct to make a pie chart? Suggest some possible reasons why there are fewer births on weekends.

Correct Answer
A pie chart would make it more difficult to distinguish between the weekend days and the weekdays. Some births are scheduled (induced labor, for example), and probably most are scheduled for weekdays.

Exercise 1.5 gives the number of births in the United States on each day of the week during an entire year. The boxplots in Figure 2.5 (page 62) are based on more detailed data from Toronto, Canada: the number of births on each of the 365 days in a year, grouped by day of the week. Based on these plots, compare the day-of-the-week distributions using shape, center, and spread. Summarize your findings.

Answer
The most striking result is that there are fewer births on the weekend. There are also slightly fewer births on Mondays than on the other weekdays. The distributions for each day do not have strikingly different spreads, although Wednesdays are somewhat more spread out than the other days (judging by the interquartile ranges). Most of the daily distributions are reasonably symmetric. The general patterns are similar in the US and Toronto, but of course there are many more births in the US than in Toronto.

2.39 A standard deviation contest. This is a standard deviation contest. You must choose four numbers from the whole numbers 0 to 10, with repeats allowed.
(a) Choose four numbers that have the smallest possible standard deviation.
(b) Choose four numbers that have the largest possible standard deviation.
(c) Is more than one choice possible in either (a) or (b)? Explain.

Answer
(a) Pick any four numbers that are all the same, e.g., (4,4,4,4), (5,5,5,5), or (6,6,6,6). This produces the smallest possible standard deviation, 0.
(b) (0,0,10,10).
The four numbers 0, 0, 10, 10 have the largest possible standard deviation because they are as far as possible from their mean (xbar = 5).
(c) There is more than one possible answer for (a), but only one for (b).

2.44 Athletes' salaries. The Montreal Canadiens were founded in 1909 and are the longest continuously operating professional ice hockey team. They have won 24 Stanley Cups, making them one of the most successful professional sports teams of the traditional four major sports of Canada and the United States. Table 2.2 gives the salaries of the 2010-2011 roster. Provide the team owner with a full description of the distribution of salaries and a brief summary of its most important features.

Answer
Read the data in R:

data<-read.table("http://www.yorku.ca/nuri/econ2500/moorebps6e/data/xrs-2.44-data.txt",header=TRUE)
attach(data)
names(data)
salary<-data[,2]
sort(salary)
  500000  500000  550000  600000  600000  637500  875000  875000
  875000 1000000 1300000 1350000 1500000 2250000 2500000 3250000
 3250000 3833000 5000000 5000000 5000000 5500000 5750000 8000000

A stem-and-leaf plot of the data follows, using millions of dollars for the stems and 100,000s of dollars as the leaf unit. The distribution is clearly positively skewed and possibly bimodal, so I decided to make a five-number summary of the data.

stem(salary,scale=2)

  The decimal point is 6 digit(s) to the right of the |

  0 | 556666999
  1 | 0345
  2 | 35
  3 | 338
  4 |
  5 | 00058
  6 |
  7 |
  8 | 0

The five-number summary is

fivenum(salary)
500000 756250 1425000 4416500 8000000

As mentioned, the distribution of salaries is strongly positively skewed, with salaries ranging from half a million dollars to 8 million dollars. The median salary is 1.4 million dollars, with half the players earning between 0.8 and 4.4 million.
There appear to be two modes: one near the lower end of the distribution, and the other near 5 million dollars. The highest salary, 8 million dollars, appears to be an outlier.

2.51 Carbon dioxide emissions. Table 1.6 gives the 2007 carbon dioxide (CO2) emissions per person for countries with populations of at least 30 million in that year. A stemplot or histogram shows that the distribution is strongly skewed to the right. The United States and several other countries appear to be high outliers.
(a) Give the five-number summary. Explain why this summary suggests that the distribution is right-skewed.
(b) Which countries are outliers according to the 1.5 × IQR rule? Make a stemplot of the data or look at your stemplot from Exercise 1.36. Do you agree with the rule's suggestions about which countries are and are not outliers?

Answer
Read the data with these R commands:

data<-read.csv("http://www.yorku.ca/nuri/econ2500/moorebps6e/data/tbl-1.6-co2emissions.csv",header=TRUE)
attach(data)
names(data)
Co2<-data[,2]
sort(Co2)
 0.0272  0.0389  0.0828  0.1046  0.1464  0.2685  0.2773  0.2850  0.2976  0.6449[=Q1]
 0.7993  0.9031  1.2935  1.3844  1.4301  1.4862  1.7677  1.9373  2.3065  3.9543[=Q2]
 4.1384  4.1432  4.3862  4.6525  4.9194  6.0207  6.8472  6.8598  7.6923  8.1555[=Q3]
 8.3231  8.8163  8.8608  9.5690  9.8476 10.4941 10.8309 16.9171 18.9144

stem(Co2)

  The decimal point is at the |

   0 | 001113333689344589
   2 | 3
   4 | 011479
   6 | 0897
   8 | 238968
  10 | 58
  12 |
  14 |
  16 | 9
  18 | 9

(a) Min = 0.0272, Q1 = 0.7221, Median = 3.954, Q3 = 7.9239, Max = 18.9144. The maximum is farther from Q3 than the minimum is from Q1. This suggests right skew.

fivenum(Co2)
0.0272 0.7221 3.9543 7.9239 18.9144

boxplot(Co2,col="green")

[Figure: boxplot of Co2; axis ticks at 0, 5, 10, 15]

(b) IQR = 7.9239 − 0.7221 = 7.2018. Hence, 1.5 × IQR = 10.8027. Now Q1 − 1.5 × IQR = 0.7221 − 10.8027 = −10.0806 < 0, so no values are more than 1.5 IQRs below Q1.
Also, Q3 + 1.5 × IQR = 7.9239 + 10.8027 = 18.7266, so the United States' value (18.9144) is an outlier. Canada's value (16.9171) lies below this cutoff, so the rule does not flag it, although it might perhaps be considered an outlier too.

The authors' answer, computing Q1, Q2, and Q3 manually, is the following:
(a) Min = 0.0272, Q1 = 0.6449, Median = 3.954, Q3 = 8.1555, Max = 18.9144. Notice that the maximum is farther from Q3 than the minimum is from Q1. This suggests right skew.
(b) IQR = 8.1555 − 0.6449 = 7.5106. Hence, 1.5 × IQR = 11.2659. Now Q1 − 1.5 × IQR = 0.6449 − 11.2659 = −10.6210 < 0, so no values are more than 1.5 IQRs below Q1. Also, Q3 + 1.5 × IQR = 8.1555 + 11.2659 = 19.4214, so there are no high outliers. This rule is rather conservative: most people would easily call the United States' value (18.9144) a far outlier, and perhaps Canada would be considered an outlier, too.
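
The two answers to (b) differ only in how the quartiles are computed. The following R sketch reproduces both sets of fences so the difference can be checked directly. It assumes the 39 values are typed in from the sorted output in this solution rather than read from the course URL, and iqr_outliers is a hypothetical helper name introduced here for illustration, not a base R function.

```r
# CO2 emissions per person, copied from the sorted output in this solution
# (typed in by hand here; an assumption made so the sketch is self-contained).
co2 <- c(0.0272, 0.0389, 0.0828, 0.1046, 0.1464, 0.2685, 0.2773, 0.2850,
         0.2976, 0.6449, 0.7993, 0.9031, 1.2935, 1.3844, 1.4301, 1.4862,
         1.7677, 1.9373, 2.3065, 3.9543, 4.1384, 4.1432, 4.3862, 4.6525,
         4.9194, 6.0207, 6.8472, 6.8598, 7.6923, 8.1555, 8.3231, 8.8163,
         8.8608, 9.5690, 9.8476, 10.4941, 10.8309, 16.9171, 18.9144)

# Hypothetical helper: apply the 1.5 x IQR rule for a given pair of quartiles.
iqr_outliers <- function(x, q1, q3) {
  iqr <- q3 - q1
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr
  list(lower = lower, upper = upper, outliers = x[x < lower | x > upper])
}

# Convention 1: the hinges returned by fivenum(). With 39 values these
# average the 10th/11th and the 29th/30th sorted observations.
h <- fivenum(co2)
r_rule <- iqr_outliers(co2, h[2], h[4])

# Convention 2: manual quartiles, taken as the medians of the observations
# below and above the overall median (the median itself is left out when
# the number of observations is odd).
s <- sort(co2)
n <- length(s)
half <- floor(n / 2)
q1_manual <- median(s[1:half])
q3_manual <- median(s[(n - half + 1):n])
book_rule <- iqr_outliers(co2, q1_manual, q3_manual)

r_rule$upper        # upper fence using fivenum() hinges
r_rule$outliers     # values flagged under convention 1
book_rule$upper     # upper fence using manual quartiles
book_rule$outliers  # values flagged under convention 2
```

With the hinges from fivenum() the upper fence is 18.7266 and only 18.9144 is flagged. With the manual quartiles the upper fence is 19.4214 and nothing is flagged, which matches the two answers given for part (b).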