Describing Distributions Problems Part 2 Answer Key

1. Determine if any of the data sets contain outliers.

Cars99
          CityMPG   HwyMPG   FuelCapacity   Acc030   Acc060   QtrMile
Min       17        23       10.3           2.4      5.6      14.1
Q1        19        26       14.5           3.3      8.8      16.8
Median    20.5      28.5     16.2           3.5      9.5      17.4
Q3        23        31       18.45          3.9      10.9     18.2
Max       30        38       23.7           4.5      12.5     19.1

CityMPG
IQR = 23 – 19 = 4
1.5(4) = 6
[19 – 6, 23 + 6] = [13, 29]
Since 30 is larger than 29, there is at least one outlier on the upper end.

HwyMPG
IQR = 31 – 26 = 5
1.5(5) = 7.5
[26 – 7.5, 31 + 7.5] = [18.5, 38.5]
Since all the numbers are within this interval, there are no outliers.

FuelCapacity
IQR = 18.45 – 14.5 = 3.95
1.5(3.95) = 5.925
[14.5 – 5.925, 18.45 + 5.925] = [8.575, 24.375]
Since all the numbers are within this interval, there are no outliers.

Acc030
IQR = 3.9 – 3.3 = 0.6
1.5(0.6) = 0.9
[3.3 – 0.9, 3.9 + 0.9] = [2.4, 4.8]
Since all the numbers are within this interval, there are no outliers.

Acc060
IQR = 10.9 – 8.8 = 2.1
1.5(2.1) = 3.15
[8.8 – 3.15, 10.9 + 3.15] = [5.65, 14.05]
Since 5.6 is smaller than 5.65, there is at least one outlier on the lower end.

QtrMile
IQR = 18.2 – 16.8 = 1.4
1.5(1.4) = 2.1
[16.8 – 2.1, 18.2 + 2.1] = [14.7, 20.3]
Since 14.1 is smaller than 14.7, there is at least one outlier on the lower end.

2. Effect of changing values on center and spread
a. Create a box plot and stem-and-leaf plot for Bonds Homeruns.
b. Compute summary statistics for Bonds Homeruns.
c. Multiply each original value by 3 and re-compute. What changed?
d. Add 2 to each original value and re-compute. What changed?
e. Change Bonds' highest number of homeruns to 100 and re-compute. What changed?
f.
What conclusion can be drawn from your results in parts (c) – (i)?

Bonds_Home_Runs
5 16 19 24 25 25 26 28 33 33 34 34 37 37 40 42 45 45 46 46 49 73

[Stem-and-leaf plot of Bonds_Home_Runs (Collection 2), stems 0 to 7]
[Box plot of Bonds_Home_Runs, axis from 10 to 80]

When you add or subtract a number to every value of a data set, the measures of center (mean and median) change by the amount you added or subtracted. However, the measures of spread (range, IQR, and standard deviation) are unaffected.

When you multiply or divide every value of a data set by a number, the measures of both center and spread are multiplied or divided by that number.

When you adjust high or low extreme values, keeping them as high or low extreme values, the measures that remain the same (Q1, median, Q3, and IQR) are called resistant statistical measures, and the measures that are altered (mean, standard deviation, and range) are called non-resistant statistical measures.

3. Comparing Highway MPG for Car Types

a. Which car type tends to get the best fuel efficiency? Second best? Worst?
Small is best, Family is second, and Luxury is worst.

b. Which car type has the most variability? Least variability?
Sports is the most variable and Upscale is the least variable.

c. Compare the family and large vehicle types.
Family and Large are both symmetric data sets. Family cars have a smaller spread, with an IQR of 2 compared to an IQR of 4 for Large cars. The center for Family cars is higher at 29 mpg compared to 27 mpg for Large cars. I would conclude that Family cars are more fuel-efficient than Large cars, because Family cars have a larger center and a tighter spread, which concentrates the points around the center. Also, more than 75% of Family cars are above 27 mpg compared to only 50% of Large cars.

d. Compare the large and upscale vehicle types.
Large cars are symmetric and Upscale cars are skewed left. The medians for each are about the same at 27 mpg.
The spread for Large cars, with an IQR of 4, is bigger than the spread for Upscale cars, with an IQR of 2. Large cars are more fuel-efficient because 25% of Large cars get above 29 mpg and none of the Upscale cars get above 29 mpg.

[Figure: Boxplot of hwy mpg vs type; vertical axis: hwy mpg (25 to 40); horizontal axis: type (family, large, luxury, small, sports, upscale)]

4. The other day I reached into my penny jar and pulled out 1000 pennies. I then recorded the age of each penny. A graph of my findings is shown below. Describe the distribution seen below. Also state the relationship between the mean and median.

[Figure: Histogram, 1000Pennies; horizontal axis: Age (0 to 70); vertical axis: count (0 to 140)]

The graph is skewed to the right, the median age is around 12.5 years, and the ages of the pennies have a range of about 60 years. There is a cluster around 2.5 years old and there appears to be an outlier at 60 years old. Since the graph is skewed right, the mean is greater than the median.

5. Below is a graph of the number of losses for National League teams during the 1999 season. Describe the distribution seen below. Also state the relationship between the mean and median.

[Figure: Box plot, NLPayroll; horizontal axis: Losses (60 to 100)]

The graph is skewed to the left, the median number of losses is about 85 games, and the numbers of losses have a range of 40 games with an IQR of 30 games. There is a cluster centered on 87. There are no apparent outliers in the data set. Since the graph is skewed left, the mean is less than the median.

6. The graph below shows skull measurements of 150 male Egyptian skulls from 5 different time periods. Describe the distribution seen below. Also state the relationship between the mean and median.

[Figure: Dot plot, Collection 1; horizontal axis: BL (80 to 120)]

The graph is symmetric, the median skull measurement is 95, and the measurements have a range of 35. There is a cluster around 95 and there appear to be two outliers, one at 81 and one at 115. Since the graph is symmetric, the mean and median are about the same.

7.
The five number summary of a data set is (17, 27, 35, 49, 90).

a) Are there any outliers in this data set? Explain. (Show all work)
IQR = 49 – 27 = 22
1.5(22) = 33
[27 – 33, 49 + 33] = [-6, 82]
Outliers lie outside [-6, 82]. Since the maximum lies outside this interval, there is at least one outlier, at 90.

b) Is the mean less than, equal to, or greater than the median? Why?
Max – Median = 90 – 35 = 55
Median – Min = 35 – 17 = 18
Since the right side is more stretched out, the graph is skewed right; therefore the mean > median.

c) If the maximum is changed from 90 to 60, what will happen to the following statistics?
(i) Mean (Decrease)
(ii) Median (Stay the Same)
(iii) Standard Deviation (Decrease)
(iv) Interquartile Range (Stay the Same)

d) If 7 is added to every value in the data set, what will happen to the following statistics?
(i) Mean (Increase by 7)
(ii) Median (Increase by 7)
(iii) Standard Deviation (Stay the Same)
(iv) Interquartile Range (Stay the Same)

e) If every value in the data set is multiplied by 4, what will happen to the following statistics?
(i) Mean (Multiplied by 4)
(ii) Median (Multiplied by 4)
(iii) Standard Deviation (Multiplied by 4)
(iv) Interquartile Range (Multiplied by 4)

8. The statistics below summarize the money earned, in millions of dollars, during the first and second weekends for movies that opened in 1999.

                 Weekend 1   Weekend 2
Minimum          1           2
Lower Quartile   23          17
Median           34          23
Upper Quartile   43          26
Maximum          66          51

a. Construct parallel boxplots using the given data. Show your plots on the grid below.

[Figure: parallel boxplots of Weekend 1 and Weekend 2; horizontal axis from 0 to 70]

b. Compare the distribution of the two data sets (Use IQR for Spread).
The graph for Weekend 1 is symmetric and the graph for Weekend 2 is skewed right. The median for Weekend 2 is 23 million, which is smaller than the median for Weekend 1 at 34 million. The spread for Weekend 2, with an IQR of 9 million, is smaller than the spread for Weekend 1, with an IQR of 20 million.

c.
Based on the boxplots, what conclusion can you make? Explain.
Since Weekend 2 has a smaller center and a smaller spread, which concentrates the values around the center, I would conclude that movies make less money during their second weekend of release than during their first weekend of release. Since the upper 75% of the movies during Weekend 1 make at least as much as the upper 50% of movies during Weekend 2, I would conclude that movies make more money during Weekend 1.

11. Two identical footballs, one air-filled and one helium-filled, were used outdoors on a windless day at The Ohio State University's athletic complex. Each football was kicked 39 times and the two footballs were alternated with each kick. The experimenter recorded the distance traveled by each ball.

Air
15 16 18 19 20 20 22 22 22 23 24 24 25 25 25 25 25 26 26 26 27 27 27 28 28 28 28 28 28 29 29 29 31 31 31 32 33 34 35

Helium
11 12 14 14 16 22 22 23 23 24 25 25 25 26 26 26 26 26 27 28 28 28 29 29 29 29 29 30 30 30 30 31 31 32 32 33 34 35 39

a. Find the five number summary for each and determine if either data set contains outliers.

Collection 1
          Air   Helium
Min       15    11
Q1        23    24
Median    26    28
Q3        29    30
Max       35    39

Air
IQR = 29 – 23 = 6
1.5(6) = 9
[23 – 9, 29 + 9] = [14, 38]
There are no outliers.

Helium
IQR = 30 – 24 = 6
1.5(6) = 9
[24 – 9, 30 + 9] = [15, 39]
11, 12, and 14 are outliers.

b. Create side-by-side box plots and back-to-back stem-and-leaf plots for each. For the boxplots, properly indicate outliers.

[Figure: side-by-side box plots, Collection 1; horizontal axis from 0 to 40]

                 Air |   | Helium
                     | 1 | 1244
                9865 | 1 | 6
            44322200 | 2 | 22334
99988888877766655555 | 2 | 55566666788899999
              432111 | 3 | 0000112234
                   5 | 3 | 59

14. The five-number summary for the weights (in pounds) of fish caught in a bass tournament is:
2.3 2.8 3.0 3.3 4.5

a. Would you expect the mean weight of all fish caught to be higher or lower than the median? Explain.
Median – Min = 3 – 2.3 = 0.7
Max – Median = 4.5 – 3 = 1.5
Since the right side is more stretched out, the graph would be skewed right, making the mean greater than the median.

b. You caught 3 bass weighing 2.3 pounds, 3.9 pounds, and 4.2 pounds. Were any of your fish outliers? Explain.
IQR = 3.3 – 2.8 = 0.5
1.5(0.5) = 0.75
[2.8 – 0.75, 3.3 + 0.75] = [2.05, 4.05]
4.2 would be an outlier.

c. If the maximum weight was recorded incorrectly and it was actually 5.4 instead of 4.5, which statistical measures would change and which would stay the same?
Mean, standard deviation, and range would increase.
Median, Q1, Q3, and IQR would stay the same.

d. It was discovered that the scale used to measure the weights was overweighing by 1 pound. How would this affect the measures of center and how would it affect the measures of spread?
The overweighing shifts the measures of center, mean and median, up by 1 pound (so the corrected values are 1 pound lower).
The measures of spread, IQR and standard deviation, stay the same.

e. If the data were converted into kilograms, how would this affect the measures of center and how would it affect the measures of spread?
1 pound = 0.454 kilograms. Therefore the measures of both center and spread would be multiplied by 0.454.

15. Here are the weekly payrolls for two imaginary restaurants, Mooseburgers and McTofu.

a. Create parallel boxplots. Label your graph clearly.

[Figure: parallel box plots, Collection 1; horizontal axis from 100 to 350]

b. Write a few sentences comparing the distributions.
McTofu is skewed right, whereas Mooseburgers is symmetric. McTofu has a median of 120, which is smaller than Mooseburgers' median of 134. Both McTofu and Mooseburgers have an IQR of about 20. McTofu has an outlier at 360, but Mooseburgers does not contain any outliers. The box plots do not indicate any clusters.

c. Which restaurant pays the higher average salary?
McTofu has a higher average because the extreme outlier pulls the mean up.

d. Why is the mean salary misleading?
Since skewness and outliers can alter the mean, it is not always a good value to use when describing center.

e. At which restaurant would you rather work? Give a sound statistical justification for your decision.
I would choose to work at Mooseburgers because 75% of its workers make more than $120 compared to only 50% of McTofu’s workers. I have a better chance of making more than $120 at Mooseburgers than at McTofu.
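
The two computations that recur throughout this answer key, the 1.5(IQR) outlier interval and the effect of adding, multiplying, or changing an extreme value, can be checked with a short script. The following is a minimal Python sketch and not part of the original answer key. The function names (fences, extremes_outside, center_and_spread) are illustrative assumptions. It uses the five-number summaries from problems 1 and 7 and the Bonds_Home_Runs data from problem 2.

```python
import statistics


def fences(summary):
    """Return the interval [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as a (lower, upper) pair.

    `summary` is a five-number summary: (min, Q1, median, Q3, max).
    """
    _, q1, _, q3, _ = summary
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step


def extremes_outside(summary):
    """Return (min is below the interval, max is above the interval)."""
    lower, upper = fences(summary)
    return summary[0] < lower, summary[4] > upper


def center_and_spread(values):
    """Collect the measures of center and spread discussed in problem 2."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "range": max(values) - min(values),
    }


# Five-number summaries taken from problem 1 (CityMPG) and problem 7.
city_mpg = (17, 19, 20.5, 23, 30)
problem_7 = (17, 27, 35, 49, 90)
print(fences(city_mpg), extremes_outside(city_mpg))    # interval (13.0, 29.0)
print(fences(problem_7), extremes_outside(problem_7))  # interval (-6.0, 82.0)

# Bonds_Home_Runs data from problem 2.
bonds = [5, 16, 19, 24, 25, 25, 26, 28, 33, 33, 34,
         34, 37, 37, 40, 42, 45, 45, 46, 46, 49, 73]

original = center_and_spread(bonds)
shifted = center_and_spread([x + 2 for x in bonds])          # add 2 to each value
scaled = center_and_spread([3 * x for x in bonds])           # multiply each value by 3
capped = center_and_spread(sorted(bonds)[:-1] + [100])       # highest value becomes 100

for label, stats in [("original", original), ("add 2", shifted),
                     ("times 3", scaled), ("max = 100", capped)]:
    print(label, stats)
```

Comparing the printed rows shows the pattern stated in problem 2: adding 2 moves the mean and median but leaves the standard deviation and range alone, multiplying by 3 scales all four, and raising the maximum to 100 changes the mean, standard deviation, and range while the median stays put.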