Chapter 4 ST 101 Reiland
Displaying and Summarizing Quantitative Data

Chapter Objectives: At the end of this chapter you should be able to:
1) Create appropriate displays to graphically depict quantitative data (frequency tables, histograms, stem-and-leaf displays, dotplots, timeplots; the use of software will be emphasized)
2) Describe the important features of the distribution of a quantitative variable: shape, center, spread, and any unusual features such as outliers, gaps, or clusters.

Throughout the course we will emphasize the paradigm "Think, Show, Tell". The above objectives fit into this paradigm as follows: "Think" about what graphical display is appropriate for the data at hand; create the display to "show" the data (objective 1). "Tell" what characteristics of the data are conveyed by the graphical display (objective 2).

Reading Assignment: Text: Chapter 4.

Histograms

A histogram shows three general types of information:
• It provides a visual indication of where the approximate center of the data is.
• We can gain an understanding of the degree of spread, or variation, in the data.
• We can observe the shape of the distribution.

Construction of a histogram (automate!):
i) identify the smallest and largest measurements in the data set
ii) divide the interval between the smallest and largest measurements into between 5 and 20 subintervals (called bins in Excel)
iii) count the number of data values that are in each bin (the bins and the count in each bin give the distribution of the quantitative variable)
iv) plot the bin counts as bars over the bins; the height of the bar over a bin indicates the count for that bin

EXAMPLE: (Number of daily employee absences from a large corporation; 106 days)
106 obs.
approx # of classes = ______

146 144 140 140 138 140 148 140 129 153
143 141 140 140 143 136 148 142 139 143
148 143 139 138 141 143 138 140 133 158
148 144 148 140 139 143 149 144 140 140
135 138 138 141 145 147 134 136 136 139
141 132 149 150 145 141 139 146 141 145
139 145 148 146 148 141 142 141 134 143
143 144 148 142 141 138 131 137 142 143
137 138 139 145 142 145 142 141 133 141
142 146 136 145 144 145 140 132 149 140
146 153 141 121 137 142

[Figure: StatCrunch histogram "Histogram of Employee Absences"; x-axis: Absences from Work (bin edges 125.5 to 160.5), y-axis: Frequency]

Heights of students in ST101
[Figure: Excel histogram "Student Heights ST 101"; x-axis: Height (inches), y-axis: Frequency]
[Figure: DataDesk display of the student heights]

Stem-and-Leaf Displays

Partition each number in the data set into a "stem" and a "leaf".

Constructing a stem-and-leaf display:
i) determine the stem and leaf you want to use (5 - 20 stems);
ii) write the stems in a column with the smallest stem at the top; include all stems in the range of the data, even those without leaves;
iii) include only 1 digit in the leaves; drop digits after the first digit or round off;
iv) record the leaf for each measurement in the row corresponding to its stem; ordering the leaves in a row is optional, but it does make the display more informative.

EXAMPLE: Below is a list of the number of home runs that Roger Maris hit during his 10 years in the American League. Make a stemplot of the data.
8 13 14 16 23 26 28 33 39 61

EXAMPLE: Number of touchdown passes thrown by each of the 31 teams in the NFL during the 2000 season.
37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20, 19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6

STEMS ARE 10'S DIGIT
stem  leaf
3   | 7
3   | 233
2   | 889
2   | 001112223
1   | 56888899
1   | 22444
0   | 69

EXAMPLE: Nielsen ratings for week of Aug. 8 - Aug. 14, 2005.

Rank  Program                       Network  Time     Day  Rating  Share  Households
 1    CSI                           CBS      9:00PM   Thu  9.3     16     10,225,000
 2    WITHOUT A TRACE               CBS      10:01PM  Thu  8       14     8,742,000
 3    CSI: MIAMI                    CBS      10:00PM  Mon  7.9     13     8,668,000
 4    60 MINUTES                    CBS      7:00PM   Sun  7.6     14     8,368,000
 5    TWO AND A HALF MEN 930P       CBS      9:30PM   Mon  7       11     7,659,000
 6    TWO AND A HALF MEN            CBS      9:00PM   Mon  6.9     11     7,540,000
 7    EXTREME MAKEOVER:HM ED-8P     ABC      8:00PM   Sun  6.8     12     7,423,000
 8    NCIS                          CBS      8:00PM   Tue  6.5     12     7,175,000
 9    AFC-NFC HALL OF FAME GAME(S)  ABC      8:08PM   Mon  6.2     11     6,846,000
10    LAW AND ORDER:CRIM INTENT     NBC      9:00PM   Sun  6       10     6,625,000
11    AFC-NFC HALL-FME SHOWCASE(S)  ABC      8:00PM   Mon  5.9     11     6,478,000
12    EVERYBODY LOVES RAYMOND       CBS      8:30PM   Mon  5.8     10     6,390,000
13    LAW AND ORDER:SVU             NBC      10:00PM  Tue  5.8     10     6,409,000
14    MT&R: UNFORGET MOMNTS TV(S)   NBC      8:30PM   Wed  5.8     10     6,331,000
15    COLD CASE                     CBS      8:00PM   Sun  5.6     10     6,174,000
16    CSI: NY                       CBS      10:00PM  Wed  5.6     10     6,104,000
17    LAW AND ORDER                 NBC      10:00PM  Wed  5.6     10     6,121,000
18    BIG BROTHER 6-TUE             CBS      9:00PM   Tue  5.5     9      5,981,000
19    CROSSING JORDAN               NBC      10:00PM  Sun  5.5     9      6,065,000
20    DATELINE FRI                  NBC      8:00PM   Fri  5.1     10     5,592,000

*There are an estimated 105.5 million television households in the USA. A single ratings point represents 1%, or 1,055,000 households, for the 2005-06 season. Share is the percentage of television sets in use tuned to a specific program.

Stem-and-leaf for Shares
stems are 10's
0   | 9 9
1*  | 0 0 0 0 0 0 0 0 1 1 1 1
1t  | 2 2 3
1f  | 4 4
1s  | 6
1** |

Stem-and-Leaf for Rating
stems are 1's
5*  | 1
5.  | 556668889
6*  | 02
6.  | 589
7*  | 0
7.  | 69
8*  | 0
8.  |
9*  | 3

EXAMPLE: (beginning of class pulses)
BPULSE  Unit = 1.000000  n = 138  missing = 0

  #   Stem   Leaves
 ---  ----   -----------------------
  .   4*   |
  3   4.   | 588
  9   5*   | 001233444
 10   5.   | 5556788899
 23   6*   | 00011111122233333344444
 23   6.   | 55556666667777788888888
 16   7*   | 0000011222233444
 23   7.   | 55555666666777888888999
 10   8*   | 0000112224
 10   8.   | 5555667789
  4   9*   | 0012
  2   9.   | 58
  4   10*  | 0223
  .   10.  |
  1   11*  | 1

Advantages of stem-and-leaf displays:
i) each measurement is displayed
ii) the data appear in ascending order
iii) relatively simple to construct (if the data set is not too large)

Disadvantage:
i) the display becomes unwieldy for large data sets

EXAMPLE: Population of 185 US cities with between 100,000 and 500,000 residents.
Since a stem-and-leaf plot shows only two-place accuracy, we had to round the numbers to the nearest 10,000. For example, the largest number (493,559) was rounded to 490,000 and then plotted with a stem of 4 and a leaf of 9. The fourth highest number (463,201) was rounded to 460,000 and plotted with a stem of 4 and a leaf of 6. Thus, the stems represent units of 100,000 and the leaves represent units of 10,000. Notice that each stem value is split into five parts: 0-1, 2-3, 4-5, 6-7, and 8-9.

Dotplots

A dotplot is a simple display: it places a dot along an axis for each case in the data. It is similar to a stem-and-leaf display.
EXAMPLE: Kentucky Derby winning times, plotting each race as its own dot.
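The stem-and-leaf construction steps i) to iv) given earlier can be sketched in Python. This is a minimal illustration, not the output format of any particular package: the function name `stem_and_leaf` and its `leaf_unit` parameter are hypothetical, the data are assumed to be nonnegative, stems are not split, and extra digits are dropped rather than rounded.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return (stem, leaves) rows, smallest stem first.

    leaf_unit is the place value of the leaf digit (1 = ones digit).
    Digits below the leaf unit are dropped, not rounded.
    Assumes nonnegative data.
    """
    rows = defaultdict(list)
    for v in values:
        scaled = int(v // leaf_unit)     # drop digits after the leaf digit
        stem, leaf = divmod(scaled, 10)  # the leaf is a single digit
        rows[stem].append(leaf)
    lo, hi = min(rows), max(rows)
    # Include every stem in the range of the data, even those without leaves,
    # and order the leaves within each row.
    return [
        (stem, "".join(str(d) for d in sorted(rows.get(stem, []))))
        for stem in range(lo, hi + 1)
    ]


# Roger Maris home runs from the example earlier in these notes
maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]
for stem, leaves in stem_and_leaf(maris):
    print(f"{stem} | {leaves}")
```

Stems 4 and 5 print with no leaves, which makes the gap between 39 and the 61-home-run season visible in the display.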
Timeplots

[Figure: timeplot "Winning Times in Olympic 100m Dash"; x-axis: year (1880 to 2020), y-axis: winning time in seconds (9 to 13)]
[Figure: Excel histogram; x-axis: Bin (9.84, 10.38, 10.92, 11.46, More), y-axis: Frequency]

The Shape of a Distribution

skewness

skewed to the right (positively skewed)
[Figure: histogram "2006 Baseball Salaries"; x-axis: 2006 Salary ($1,000's), y-axis: Frequency]

skewed to the left (negatively skewed)
[Figure: "Histogram of Exam Scores"; x-axis: Exam Scores, y-axis: Frequency]

symmetric
[Figure: histogram "Bank Customers: 10:00-11:00 am"; x-axis: Number of Customers, y-axis: Frequency]

outliers
[Figure: histogram "200 m Races 20.2 secs or less (approx. 700)"; x-axis: Times, y-axis: Frequency; annotated points: Usain Bolt 2008, 19.30; Michael Johnson 1996, 19.32]

BIMODAL DISTRIBUTIONS (two peaks)
(frequently results from measurements on two populations, such as heights of male and female adults)
[Figure: Excel histogram with two peaks; x-axis: Bin (51 to 73.5, More), y-axis: Frequency]

Describing Distributions Numerically

Section Objectives: At the end of this section you should be able to:
1) Calculate appropriate numerical summaries of quantitative data to describe center (median, mean, quartiles) and spread (range, interquartile range, standard deviation) [the use of software will be emphasized!]
2) Describe the characteristics of various numerical summaries, with emphasis on the effects of outliers
3) Interpret the values of the numerical summaries for a particular data set.
4) Match graphical displays of quantitative data to the values of the summary statistics.
5) Apply graphical and numerical procedures to compare 2 or more sets of data

Throughout the course we will emphasize the paradigm "Think, Show, Tell". The above objectives fit into this paradigm as follows: "Think" about what numerical summaries of center and spread are appropriate for the data at hand; calculate the values of the numerical summaries to "show" the center and spread. "Tell" what characteristics of the data are conveyed by the values of the numerical summaries.

Finding Center and Spread

We would like to numerically summarize two characteristics of quantitative data:
i) center
ii) spread

Finding the center: the median

median: the value that falls in the middle when the data are arranged in order of magnitude

Calculating the Median
Given a set of $n$ data values arranged in order of magnitude,

$$\text{Median} = \begin{cases} \text{middle value} & \text{if } n \text{ is odd} \\ \text{mean of the two middle values} & \text{if } n \text{ is even} \end{cases}$$

Graphically, the median splits the histogram of the data into two halves of equal area.
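The median rule above can be sketched in a few lines of Python. The function name `median` and the small data lists are illustrative only; in practice, automate this with software (Python's standard library also provides `statistics.median`).

```python
def median(values):
    """Median by the rule above: sort the data, then take the middle value
    (n odd) or the mean of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of the two middle values


print(median([7, 1, 5]))     # n = 3 (odd)
print(median([7, 1, 5, 3]))  # n = 4 (even)
```

Note that replacing the 7 by 700 in either list leaves the median unchanged, since only the middle one or two ordered values matter.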
EXAMPLES:
1) Below is a list of the home runs hit by Babe Ruth in each of his seasons as a Yankee:
54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
median = ______

2) student pulse rates - ordered values:
38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76, 77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103
median = ______

3) Year 2002 baseball salaries: n = 805
median = $900,000; maximum = $25,000,000 (Alex Rodriguez); minimum = $200,000

4) Median fan age: MLB: 45; NFL: 43; NBA: 41; NHL: 39 (Scarborough Research)

Measuring spread: home on the range

range = max - min

EXAMPLE: Year 2002 baseball salaries: range = $25,000,000 - $200,000 = $24,800,000

disadvantage of the range: it is too crude and too sensitive; a single extreme value can make the range very large.

Measuring spread: the interquartile range (IQR)

focus on the middle of the data instead of the extremes of the data
find the range of the middle half of the data:
i) divide the data in half at the median
ii) now divide both halves in half again, cutting the data into quarters

1/4 of the data lies below the lower quartile
1/4 of the data lies above the upper quartile
half the data lies between the quartiles (the interquartile range)

IQR = upper quartile - lower quartile

quartiles are NOT well-defined; different software packages give different answers

FINDING QUARTILES BY HAND
when n is odd, include the overall median in both halves
when n is even, do NOT include the overall median in either half

EXAMPLES:
1) odd number of observations in data set
Below is a list of the home runs hit by Babe Ruth in each of his seasons as a Yankee:
54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
ordered values: 22 25 34 35 41 41 46 46 46 47 49 54 54 59 60
median = 46
lower half (including median): 22 25 34 35 41 41 46 46
$Q_1 = \text{lower quartile} = \frac{35 + 41}{2} = 38$
upper half (including median): 46 46 47 49 54 54 59 60
$Q_3 = \text{upper quartile} = \frac{49 + 54}{2} = 51.5$
IQR = 51.5 - 38 = 13.5
software
Excel: Q1 = 38; Q3 = 51.5; IQR = ______
DataDesk: Q1 = 36.5; Q3 = 52.75; IQR = 16.25

2) even number of observations in data set
ten "distance of hometown from NCSU campus" values:
300 500 65 180 200 120 270 10 100 10
ordered values: 10 10 65 100 120 180 200 270 300 500
$\text{median} = \frac{120 + 180}{2} = 150$
lower half: 10 10 65 100 120
Q1 = lower quartile = 65
upper half: 180 200 270 300 500
Q3 = upper quartile = 270
IQR = 270 - 65 = 205
software
Excel: Q1 = 73.75; Q3 = 252.5; IQR = 252.5 - 73.75 = 178.75
DataDesk: Q1 = 65; Q3 = 270; IQR = ______

3) median, quartiles from stem-and-leaf plot
class beginning pulse rates

BPULSE  Unit = 1.000000  n = 138  missing = 0

  #   Stem   Leaves
 ---  ----   -----------------------
  .   4*   |
  3   4.   | 588
  9   5*   | 001233444
 10   5.   | 5556788899
 23   6*   | 00011111122233333344444
 23   6.   | 55556666667777788888888
 16   7*   | 0000011222233444
 23   7.   | 55555666666777888888999
 10   8*   | 0000112224
 10   8.   | 5555667789
  4   9*   | 0012
  2   9.   | 58
  4   10*  | 0223
  .   10.  |
  1   11*  | 1

median = ______
lower quartile = ______
upper quartile = ______

5-Number Summary

minimum   Q1   median   Q3   maximum

5-number summary for the above 138 student pulses: ______

Summarizing Symmetric Distributions

EXAMPLE (body temperature of 93 adults)
median = 98.2 °F
mean = 98.12 °F

Finding the center: the mean

median: determined by counting the data; it doesn't care how large or how small the data values are (except the middle one or two data values). Often we do care about the actual data values; we would like a measure of center that uses each data value.
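Before moving on to the mean, the by-hand quartile rule and the 5-number summary above can be sketched in Python. This is only one of several quartile conventions (as noted, Excel and DataDesk can give different values); the function names `median` and `five_number_summary` are illustrative.

```python
def median(values):
    """Middle value (n odd) or mean of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the by-hand rule:
    when n is odd, the overall median is included in both halves;
    when n is even, it is included in neither half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[:half + 1]  # lower half, including the median
        upper = ordered[half:]      # upper half, including the median
    else:
        lower = ordered[:half]
        upper = ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
low, q1, med, q3, high = five_number_summary(ruth)
print(low, q1, med, q3, high)
print("IQR =", q3 - q1)
```

For the Babe Ruth data this reproduces the by-hand values Q1 = 38, Q3 = 51.5, and IQR = 13.5; for the ten hometown distances it gives Q1 = 65 and Q3 = 270.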
NOTATION
$y$        represents an observation in a data set
$n$        number of observations in the data set
$\bar{y}$  denotes the sample mean

Consider any set of data values represented by $y$'s; then

$$\bar{y} = \frac{\sum y}{n} = \frac{\text{sum of } y\text{'s}}{n}$$

IMPORTANT: the mean is an appropriate measure of the middle only when the shape is approximately symmetric and there are no outliers.

Connection to histogram
A histogram balances when supported at the mean.
median = 57.7 years; mean = 55.26 years

Mean or median? It makes a difference (sometimes)

EXAMPLE: 2004 major league baseball salaries
n = 826
$\bar{y}$ = $2,482,530
median = $787,500
min = $300,000
max = $21,726,881

[Figure: histogram "2004 Major League Baseball Salaries"; x-axis: Salary ($100,000's), y-axis: Frequency]
[Figure: timeplot "Mean, Median, and Maximum Baseball Salaries"; x-axis: Year (1976 to 2002), left y-axis: Mean, Median Salary, right y-axis: Maximum Salary]

Finding spread: the standard deviation

IQR: uses only Q1 and Q3 to measure spread
standard deviation: takes into account how far each observation is from the mean

$\sum (y - \bar{y}) = \ ?$

variance

$$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$

units: square gallons, square dollars

standard deviation

$$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$$

automate this calculation!

IMPORTANT:
1) the standard deviation is an appropriate measure of spread only when the shape is approximately symmetric and there are no outliers.
2) Always (always!) report a spread along with any summary of the center.

EXAMPLE
1 3 5 9

Thinking about the standard deviation:
1) Note that $s$ is always nonnegative, that is, $s \ge 0$. When does $s = 0$?
2) The larger the value of $s$, the greater the spread of the data. Given two data sets, the standard deviation is useful as a relative measure of spread.
3) The standard deviation is the most commonly used measure of risk in many areas such as finance, business, education, social sciences, etc.
4) Why divide by $n - 1$ instead of $n$ when computing the sample standard deviation?
i) to drive you crazy.
ii) dividing by $n$ when computing the standard deviation of a small sample would underestimate the variability present in the larger group the sample represents.
iii) the above formula for $s$ includes the sample mean $\bar{y}$. Since $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$, only $n - 1$ of the data values are free to vary.
example:

Reporting shape, center, and spread of quantitative data
1) when telling about a quantitative variable, always report shape, along with a center and a spread
2) if the shape is skewed, report the median and IQR; the mean and standard deviation are sensitive to outliers (you can include the mean and standard deviation, but you should point out why the mean and median differ)
3) if the shape is symmetric, report the mean and standard deviation.
4) if there are obvious outliers and you are reporting the mean and standard deviation, report them with the outliers included and with the outliers removed (the median and IQR will be affected very little by the outliers).

SUMMARY
We can now summarize distributions of quantitative variables numerically.
• The 5-number summary displays the min, Q1, median, Q3, and max.
• Measures of center include the mean and median.
• Measures of spread include the range, IQR, and standard deviation.
We know which measures to use for symmetric distributions and skewed distributions.
We can also display distributions with boxplots.
• While histograms better show the shape of the distribution, boxplots reveal the center, middle 50%, and any outliers in the distribution.
• Boxplots are useful for comparing groups.
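As a worked illustration of the variance and standard deviation formulas from this chapter, the following Python sketch applies them to the example data 1, 3, 5, 9. The function names `sample_variance` and `sample_sd` are illustrative; Python's standard library offers `statistics.variance` and `statistics.stdev`, which also divide by n - 1.

```python
import math


def sample_variance(values):
    """s^2 = sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)


def sample_sd(values):
    """s = square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [1, 3, 5, 9]
mean = sum(data) / len(data)            # 4.5
deviations = [y - mean for y in data]   # -3.5, -1.5, 0.5, 4.5

# The deviations always sum to zero, which is why only n - 1 of them
# are free to vary.
print("sum of deviations:", sum(deviations))
print("variance:", sample_variance(data))        # 35 / 3
print("standard deviation:", sample_sd(data))    # about 3.42
```

The squared deviations are 12.25, 2.25, 0.25, and 20.25, which sum to 35; dividing by n - 1 = 3 gives the variance, and its square root gives s.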