CHAPTER 4 PART II

Measures of central tendency
A value that represents a typical, or central, entry of a data set.
Most common measures of central tendency:
• Mean
• Median
• Mode
Larson/Farber 4th ed. Copyright © 2010, 2007, 2004 Pearson Education, Inc.

Median
The value that lies in the middle of the data when the data set is ordered. Measures the center of an ordered data set by dividing it into two equal parts.
If the data set has an
• odd number of entries: median is the middle data entry.
• even number of entries: median is the mean of the two middle data entries.

Example: Finding the Median
The prices (in dollars) for a sample of roundtrip flights from Chicago, Illinois to Cancun, Mexico are listed. Find the median of the flight prices.
872 432 397 427 388 782 397

Example: Finding the Median
The flight priced at $432 is no longer available. What is the median price of the remaining flights?
872 397 427 388 782 397

Finding the Median in a Histogram
The median is the value with exactly half the data values below it and half above it.
• It is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas.
• It has the same units as the data.

Mean or Median?
Regardless of the shape of the distribution, the mean is the point at which a histogram of the data would balance.

Summarizing Symmetric Distributions - The Mean
The mean feels like the center because it is the point where the histogram balances.
HOW DOES THE MEAN COMPARE TO THE MEDIAN?

Mean or Median? (cont.)
In symmetric distributions, the mean and median are approximately the same in value, so either measure of center may be used. For skewed data, though, it’s better to report the median than the mean as a measure of center.

Comparing the Mean, Median, and Mode
All three measures describe a typical entry of a data set.
• Advantage of using the mean: The mean is a reliable measure because it takes into account every entry of a data set.
• Disadvantage of using the mean: Greatly affected by outliers (an outlier is a data entry that is far removed from the other entries in the data set).

Example: Comparing the Mean, Median, and Mode
Find the mean, median, and mode of the sample ages of a class shown. Which measure of central tendency best describes a typical entry of this data set?
Ages in a class
20 20 20 20 20 20 21 21 21 21
22 22 22 23 23 23 23 24 24 65

Solution: Comparing the Mean, Median, and Mode
Mean: \(\bar{x} = \frac{\sum x}{n} = \frac{20 + 20 + \cdots + 24 + 65}{20} \approx 23.8\) years
Median: \(\frac{21 + 22}{2} = 21.5\) years
Mode: 20 years (the entry occurring with the greatest frequency)

MEASURES OF SPREAD AND THE BOXPLOT!
CONSIDER THE FOLLOWING 3 SAMPLE DATA SETS:
I    20 40 50 30 60 70
II   47 43 44 46 20 70
III  44 43 40 50 46 47
COMPUTE THE RANGE, MEDIAN & MEAN FOR EACH DATA SET. WHAT DO YOU NOTICE? NOW TAKE A LOOK AT COMPARING THE DOT PLOTS.

How Spread Out is the Distribution?
• Variation matters, and Statistics is about variation.
• Are the values of the distribution tightly clustered around the center or more spread out?
• Always report a measure of spread along with a measure of center when describing a distribution numerically.
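The median, mean, mode, and range computations in the examples above can be checked with a short script. This is only a sketch using Python's standard `statistics` module; the helper name `summarize` and the variable names are illustrative and do not come from the slides.

```python
from statistics import mean, median, multimode


def summarize(data):
    """Return the mean, median, mode(s), and range of a list of numbers."""
    return {
        "mean": mean(data),
        "median": median(data),
        "modes": multimode(data),  # every value tied for the highest count
        "range": max(data) - min(data),
    }


# Flight prices from the median examples (seven fares, then six once $432 is removed).
flight_prices = [872, 432, 397, 427, 388, 782, 397]
remaining_prices = [p for p in flight_prices if p != 432]

# Ages in a class; the single entry of 65 is far from the other entries.
ages = [20] * 6 + [21] * 4 + [22] * 3 + [23] * 4 + [24] * 2 + [65]

# The three sample data sets I, II, and III.
data_sets = {
    "I": [20, 40, 50, 30, 60, 70],
    "II": [47, 43, 44, 46, 20, 70],
    "III": [44, 43, 40, 50, 46, 47],
}

print("median fare, 7 flights:", median(flight_prices))
print("median fare, 6 flights:", median(remaining_prices))
print("ages:", summarize(ages))
for name, values in data_sets.items():
    s = summarize(values)
    print(name, "range:", s["range"], "median:", s["median"], "mean:", s["mean"])
```

Run this way, sets I and II report the same range, median, and mean even though their values are arranged quite differently, which is the kind of difference a dot plot shows and these three numbers alone do not.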

Measures of Variability
• range (max – min)
• interquartile range (Q3 – Q1)
• deviations: \(x - \bar{x}\)
• variance: \(\sigma^2\) (lower case Greek letter sigma, squared)
• standard deviation: \(\sigma\)

Spread: Home on the Range
• The range of the data is the difference between the maximum and minimum values: Range = max – min
• A disadvantage of the range is that a single extreme value can make it very large and, thus, not representative of the data overall.

Spread: The Interquartile Range
• The interquartile range (IQR) lets us ignore extreme data values and concentrate on the middle of the data.
• To find the IQR, we first need to know what quartiles are…

Spread: The Interquartile Range (cont.)
• Quartiles divide the data into four equal sections.
– One quarter (25%) of the data lies below the lower quartile, Q1.
– One quarter of the data lies above the upper quartile, Q3; that is, Q3 is the 75% mark.
– The quartiles border the middle half of the data.
• A simple way to find quartiles is to start by splitting the data into 2 halves at the median. Q1 is the median of the lower half and Q3 the median of the upper half.

Example: Finding Quartiles
The numbers of nuclear power plants in the top 15 nuclear power-producing countries in the world are listed. Find the first, second, and third quartiles of the data set.
7 18 11 6 59 17 18 54 104 20 31 8 10 15 19

Solution: Finding Quartiles
• THE MEDIAN divides the data set into two halves.
Lower half: 6 7 8 10 11 15 17
MEDIAN: 18
Upper half: 18 19 20 31 54 59 104
© 2012 Pearson Education, Inc. All rights reserved.
• The first and third quartiles are the medians of the lower and upper halves of the data set.
Q1 = 10, Q2 = 18, Q3 = 31
INTERPRETING THE QUARTILES: About one fourth of the countries have 10 or fewer nuclear power plants; about one half have 18 or fewer; and about three fourths have 31 or fewer.
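The split-at-the-median method for quartiles described above can be sketched in a few lines of Python. The function name `quartiles` is an illustrative choice, and the sketch assumes the convention used in the solution, where the middle value is left out of both halves when the number of entries is odd. Calculators and statistical libraries may use other quartile conventions and can give slightly different values.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the split-at-the-median method.

    Assumption: when the number of entries is odd, the middle entry
    belongs to neither the lower half nor the upper half.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


# Nuclear power plants in the top 15 nuclear power-producing countries.
plants = [7, 18, 11, 6, 59, 17, 18, 54, 104, 20, 31, 8, 10, 15, 19]

q1, q2, q3 = quartiles(plants)
print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3)
```

For the power plant data this reproduces the quartiles 10, 18, and 31 from the worked solution.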

FINDING QUARTILES
• Listed below are the lengths of the touchdown passes for the Green Bay Packers over the course of several games:
28, 18, 20, 30, 32, 27, 32, 20, 22, 31, 35, 39, 33, 19, 18
Find Q1, the median, and Q3 and explain what these values tell about the distribution.

THE IQR
• The difference between the quartiles is the interquartile range (IQR), so IQR = upper quartile – lower quartile, or Q3 – Q1.
Find the IQR of the Green Bay data and write a sentence explaining the meaning of this value.

FINDING THE IQR
• Find each of the quartiles and compute the IQR of the following data set of New York travel times:
5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
Larson/Farber 5th ed.

Definition: The 1.5 x IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile.
In the New York travel time data, we found Q1 = 15 minutes, Q3 = 42.5 minutes, and IQR = 27.5 minutes. For these data,
1.5 x IQR = 1.5(27.5) = 41.25
Q1 – 1.5 x IQR = 15 – 41.25 = –26.25
Q3 + 1.5 x IQR = 42.5 + 41.25 = 83.75
Any travel time shorter than –26.25 minutes or longer than 83.75 minutes is considered an outlier.
SO DO WE HAVE ANY OUTLIERS?
0 | 5
1 | 005555
2 | 0005
3 | 00
4 | 005
5 |
6 | 005
7 |
8 | 5

Describing Quantitative Data: Identifying Outliers
In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers.

To Determine Outliers
1. Find Quartile 1 & Quartile 3
2. Determine the interquartile range: IQR = Q3 – Q1
3. Multiply 1.5 x IQR
4. Set up “fences”: Q1 – (1.5 x IQR) and Q3 + (1.5 x IQR)
5. Observations “outside” the fences are outliers.
Why 1.5? According to John Tukey, 1 IQR seemed like too little and 2 IQRs seemed like too much...
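The fence procedure above can be sketched as a small Python function. The names `quartiles` and `iqr_outliers` are illustrative, and the quartiles are assumed to come from the split-at-the-median method used earlier in these slides; other quartile conventions could move the fences slightly.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) by splitting the ordered data at the median."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


def iqr_outliers(data, k=1.5):
    """Return ((lower fence, upper fence), outliers) for the k x IQR rule."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return (lower_fence, upper_fence), outliers


# New York travel times (minutes) from the example above.
travel_times = [5, 10, 10, 15, 15, 15, 15, 20, 20, 20,
                25, 30, 30, 40, 40, 45, 60, 60, 65, 85]

fences, outliers = iqr_outliers(travel_times)
print("fences:", fences)
print("outliers:", outliers)
```

With the travel time data, the fences match the values –26.25 and 83.75 worked out above, and the only observation outside them is the 85-minute trip.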

IDENTIFYING OUTLIERS
• USE THE 1.5 IQR RULE TO DECIDE IF THERE ARE ANY OUTLIERS IN THE FOLLOWING DATA SET:
17 23 24 27 32 35 16 70 12 15 22 35 34 18 0 26

5-Number Summary
• The 5-number summary of a distribution reports its median, quartiles (Q1 & Q3), and extremes (maximum and minimum).
• The 5-number summary for the recent tsunami earthquake magnitudes looks like this:
Interpret these values.

USING THE CALCULATOR
• TO FIND THE 5-NUMBER SUMMARY ON THE CALCULATOR:
1. ENTER DATA INTO A LIST
2. USING THE STAT MENU SCROLL TO STAT AND RUN 1-VAR STATS ON THE LIST
Find the 5-number summary for the data list:
7 18 11 6 59 17 18 54 104 20 31 8 10 15 19

FIVE NUMBER SUMMARIES
Of course, in real life, where data sets are often large and full of “messy” numbers, you’ll use a calculator to find the 5-number summary, but for now let’s try calculating one by hand!
Find the 5-number summary for the following data set, which lists the number of calories in 9 different candy bars:
280 250 290 240 210 220 190 220 230

Box-and-Whisker Plot
• Exploratory data analysis tool.
• Highlights important features of a data set.
• Requires the five-number summary:
– Minimum entry
– First quartile Q1
– Median Q2
– Third quartile Q3
– Maximum entry

Drawing a Box-and-Whisker Plot
1. Find the five-number summary of the data set.
2. Construct a horizontal scale that spans the range of the data.
3. Plot the five numbers above the horizontal scale.
4. Draw a box above the horizontal scale from Q1 to Q3 and draw a vertical line in the box at the median.
5. Draw whiskers from the box to the minimum and maximum entries if there are no outliers.
Figure: a box-and-whisker plot with the box, the two whiskers, the minimum entry, Q1, the median (Q2), Q3, and the maximum entry labeled.

Example: Drawing a Box-and-Whisker Plot
Draw a box-and-whisker plot that represents the 15 test scores.
Recall Min = 5, Q1 = 10, Q2 = 15, Q3 = 18, Max = 37
Solution:
Figure: box-and-whisker plot marked at 5, 10, 15, 18, and 37.
About half the scores are between 10 and 18. By looking at the length of the right whisker, you can conclude 37 is a possible outlier.

Modified boxplots
• display outliers
• fences mark off mild & extreme outliers
• whiskers extend to the largest (smallest) data value inside the fence
ALWAYS use modified boxplots in this class!!!

Outlier Example
IQR = 45.72 – 19.06 = 26.66
1.5 IQR = 1.5(26.66) = 39.99
Lower fence: 19.06 – 39.99 = –20.93
Upper fence: 45.72 + 39.99 = 85.71
Figure: modified boxplot of Spending ($) on a scale from 0 to 100, with the outliers marked.

A report from the U.S. Department of Justice gave the following percent increase in federal prison populations in 20 northeastern & mid-western states in 1999.
5.9 4.5 2.3 3.5 5.0 8.2 5.9 6.4 4.5 5.5
5.6 5.3 4.1 10.9 6.3 4.4 4.8 8.5 6.9 3.2
Construct a modified boxplot. Describe the distribution.

Why use boxplots?
• ease of construction
• convenient handling of outliers
• used with medium or large size data sets (n > 10)
• useful for comparative displays

More About Spread: The Standard Deviation
A more powerful measure of spread than the IQR is the standard deviation, which takes into account how far each data value is from the mean.
A deviation is the distance that a data value is from the mean.

Example: Finding the Sample Standard Deviation
The starting salaries are for the Chicago branches of a corporation. The corporation has several other branches, and you plan to use the starting salaries of the Chicago branches to estimate the starting salaries for the larger population. Find the sample standard deviation of the starting salaries.
Starting salaries (1000s of dollars)
41 38 39 45 47 41 44 41 37 42

Solution: Finding the Sample Standard Deviation
• First find the mean.
• Next calculate each deviation from the mean.
• Sum these deviations. What do you notice???

Salary, x    Deviation: x – x̄       Squares: (x – x̄)²
41           41 – 41.5 = –0.5       (–0.5)² = 0.25
38           38 – 41.5 = –3.5       (–3.5)² = 12.25
39           39 – 41.5 = –2.5       (–2.5)² = 6.25
45           45 – 41.5 = 3.5        (3.5)² = 12.25
47           47 – 41.5 = 5.5        (5.5)² = 30.25
41           41 – 41.5 = –0.5       (–0.5)² = 0.25
44           44 – 41.5 = 2.5        (2.5)² = 6.25
41           41 – 41.5 = –0.5       (–0.5)² = 0.25
37           37 – 41.5 = –4.5       (–4.5)² = 20.25
42           42 – 41.5 = 0.5        (0.5)² = 0.25
             Σ(x – x̄) = 0           SSx = 88.5

Since adding all deviations together would total zero, we square each deviation and find an average of sorts for the squared deviations.

Solution: Finding the Sample Standard Deviation
• Sample Variance: \( s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{88.5}{10 - 1} \approx 9.8 \)
• Sample Standard Deviation: \( s = \sqrt{s^2} = \sqrt{\frac{88.5}{9}} \approx 3.1 \)
The sample standard deviation is about 3.1, or $3100.

Example: Using Technology to Find the Standard Deviation
Sample office rental rates (in dollars per square foot per year) for Miami’s central business district are shown in the table. Use a calculator or a computer to find the mean rental rate and the sample standard deviation. (Adapted from: Cushman & Wakefield Inc.)
Office Rental Rates
35.00 33.50 37.00
23.75 26.50 31.25
36.50 40.00 32.00
39.25 37.50 34.75
37.75 37.25 36.75
27.00 35.75 26.00
37.00 29.00 40.50
24.50 33.00 38.00

Solution: Using Technology to Find the Standard Deviation
Figure: technology output with the sample mean and the sample standard deviation labeled.
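Both standard deviation examples above can be reproduced on a computer. The sketch below follows the by-hand steps (deviations, sum of squares, division by n – 1, square root) and then compares the result with `statistics.stdev` from Python's standard library. The function name `sample_sd_by_hand` and the variable names are illustrative, not part of the slides.

```python
from math import sqrt
from statistics import mean, stdev


def sample_sd_by_hand(data):
    """Follow the by-hand steps and return each intermediate result."""
    x_bar = mean(data)
    deviations = [x - x_bar for x in data]
    sum_of_squares = sum(d * d for d in deviations)   # SSx in the table
    variance = sum_of_squares / (len(data) - 1)       # divide by n - 1 for a sample
    return {
        "mean": x_bar,
        "sum_of_deviations": sum(deviations),
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "sd": sqrt(variance),
    }


# Starting salaries (1000s of dollars) for the Chicago branches.
salaries = [41, 38, 39, 45, 47, 41, 44, 41, 37, 42]

# Office rental rates (dollars per square foot per year).
rental_rates = [35.00, 33.50, 37.00, 23.75, 26.50, 31.25, 36.50, 40.00,
                32.00, 39.25, 37.50, 34.75, 37.75, 37.25, 36.75, 27.00,
                35.75, 26.00, 37.00, 29.00, 40.50, 24.50, 33.00, 38.00]

steps = sample_sd_by_hand(salaries)
print("salaries:", steps)
print("library check:", stdev(salaries))
print("rental mean:", round(mean(rental_rates), 2))
print("rental sample sd:", round(stdev(rental_rates), 2))
```

For the salaries, the deviations sum to zero and the sum of squares is 88.5, as in the table. For the rental rates, the mean comes out near 33.73 and the sample standard deviation near 5.09 dollars per square foot per year.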

Interpreting Standard Deviation
• Standard deviation is a measure of the typical amount an entry deviates from the mean.
• The more the entries are spread out, the greater the standard deviation.

Thinking About Variation
• Since Statistics is about variation, spread is an important fundamental concept of Statistics.
• Measures of spread help us talk about what we don’t know.
• When the data values are tightly clustered around the center of the distribution, the IQR and standard deviation will be small.
• When the data values are scattered far from the center, the IQR and standard deviation will be large.

Tell -- Draw a Picture
When telling about quantitative variables, start by making a histogram, boxplot, or stem-and-leaf display and discuss the shape of the distribution.

Tell -- Shape, Center, and Spread
Next, always report the shape of its distribution, along with a center and a spread.
• If the shape is skewed, report the median and IQR.
• If the shape is symmetric, report the mean and standard deviation and possibly the median and IQR as well.

Tell -- What About Unusual Features?
• If there are multiple modes, try to understand why. If you identify a reason for the separate modes, it may be good to split the data into two groups.
• If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing.
• Note: The median and IQR are not likely to be affected by the outliers.

What have we learned?
We’ve learned how to make a picture for quantitative data to help us see the story the data have to Tell.
• We can display the distribution of quantitative data with a histogram, stem-and-leaf display, or dotplot.
We’ve learned how to summarize distributions of quantitative variables numerically.
• Measures of center for a distribution include the median and mean.
• Measures of spread include the range, IQR, and standard deviation.
• Use the median and IQR when the distribution is skewed. Use the mean and standard deviation if the distribution is symmetric.

What have we learned? (cont.)
We’ve learned to Think about the type of variable we are summarizing.
• All methods of this chapter assume the data are quantitative.
• The Quantitative Data Condition serves as a check that the data are, in fact, quantitative.
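As a recap, the advice to report summaries with and without a clear outlier can be tried on the class ages data from earlier in the chapter. This is a sketch only: the helper names `quartiles` and `describe` are illustrative, and the quartiles again assume the split-at-the-median method.

```python
from statistics import mean, median, stdev


def quartiles(data):
    """Return (Q1, Q2, Q3) by splitting the ordered data at the median."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


def describe(data):
    """Return the five-number summary, IQR, mean, and sample standard deviation."""
    q1, q2, q3 = quartiles(data)
    return {
        "five_number": (min(data), q1, q2, q3, max(data)),
        "iqr": q3 - q1,
        "mean": mean(data),
        "sd": stdev(data),
    }


# Ages in a class, with and without the entry of 65.
ages = [20] * 6 + [21] * 4 + [22] * 3 + [23] * 4 + [24] * 2 + [65]
with_outlier = describe(ages)
without_outlier = describe([a for a in ages if a != 65])

for label, summary in (("with 65", with_outlier), ("without 65", without_outlier)):
    print(label, summary)
```

Removing the 65 leaves the IQR unchanged and moves the median only from 21.5 to 21, while the mean and standard deviation drop noticeably, in line with the note that the median and IQR are not likely to be affected by outliers.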