Chapter 3: Displaying and Summarizing Quantitative Data

Display: Histograms, Stem and Leaf Plots
Numerical Summaries: Median, Mean, Quartiles, Standard Deviation

[Figure: Relative Frequency Histogram of Exam Grades; x-axis: Grade (40 to 100); y-axis: Relative frequency (0 to .30)]

Frequency Histograms
[Figure: Baker City Hospital, Length of Stay Distribution; x-axis: classes 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; y-axis: Frequency (0 to 70)]

Frequency Histograms
A histogram shows three general types of information:
* It provides a visual indication of where the approximate center of the data is.
* We can gain an understanding of the degree of spread, or variation, in the data.
* We can observe the shape of the distribution.

[Figure: All 200 m Races 20.2 secs or less (approx. 700); x-axis: Times (19.2 to 20.19 secs); y-axis: Frequency (0 to 60); labeled: Usain Bolt 2008, 19.30; Michael Johnson 1996, 19.32]

Histograms Showing Different Centers
[Figure: two frequency histograms on classes 0<2 through 16<18; y-axis: Frequency (0 to 70)]

Histograms - Same Center, Different Spread
[Figure: two frequency histograms on classes 0<2 through 16<18; y-axis: Frequency (0 to 70)]

Frequency and Relative Frequency Histograms
* identify the smallest and largest values in the data set
* divide the interval between the largest and smallest values into between 5 and 20 subintervals called classes
  * each data value falls in one and only one class
  * no data value is on a boundary

How Many Classes?
Can choose from two formulas, where n is the sample size:
* $(2n)^{1/3}$
* Sturges' Rule: $1 + \frac{\log(n)}{\log(2)}$

Histogram Construction (cont.)
* compute the frequency or relative frequency of observations in each class
* x-axis: class boundaries; y-axis: frequency or relative frequency scale
* over each class draw a rectangle with height corresponding to the frequency or relative frequency in that class

Example: number of daily employee absences from work
106 observations; approximate number of classes:
* $(2 \cdot 106)^{1/3} = 212^{1/3} \approx 5.96$
* $1 + \log(106)/\log(2) = 1 + 6.73 = 7.73$
There is no single “correct” answer for the number of classes. For example, you can choose 6, 7, 8, or 9 classes; don’t choose 15 classes.

EXCEL Histogram
[Figure: Histogram of Employee Absences; x-axis: Absences from Work; y-axis: Frequency (0 to 45)]

Absences from Work (cont.)
* 6 classes; class width: (158 - 121)/6 = 37/6 = 6.17, rounded up to 7
* 6 classes, each of width 7; the classes span 6(7) = 42 units
* the data span 158 - 121 = 37 units
* the classes overlap the span of the actual data values by 42 - 37 = 5 units
* lower boundary of the 1st class: (1/2)(5) units below 121 = 121 - 2.5 = 118.5

EXCEL histogram
[Figure: Histogram of Employee Absences; x-axis: Absences from Work, class boundaries 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5; y-axis: Frequency (0 to 70)]

Grades on a statistics exam
Data:
75 66 77 66 64 73 91 65 59 86
61 86 61 58 70 77 80 58 94 78
62 79 83 54 52 45 82 48 67 55

Frequency Distribution of Grades
Class Limits     Frequency
40 up to 50      2
50 up to 60      6
60 up to 70      8
70 up to 80      7
80 up to 90      5
90 up to 100     2
Total            30

Relative Frequency Distribution of Grades
Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067

[Figure: Relative Frequency Histogram of Grades; x-axis: Grade (40 to 100); y-axis: Relative frequency (0 to .30)]

Based on the histogram, about what percent of the values are between 47.5 and 52.5?
1. 50%
2. 5%
3. 17%
4. 30%

Stem and leaf displays
Have the following general appearance:
stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4

Stem and Leaf Displays
Partition each number in the data into a “stem” and a “leaf”.

Constructing a stem and leaf display
1) determine the stem and leaf partition (5-20 stems)
2) write the stems in a column with the smallest stem at the top; include all stems in the range of the data
3) only 1 digit in the leaves; drop digits or round off
4) record the leaf for each number in the corresponding stem row; ordering the leaves in each row helps

Example: employee ages at a small company
18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39
stem: 10’s digit; leaf: 1’s digit
18: stem = 1; leaf = 8; 18 = 1 | 8
stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4

Suppose a 95 yr. old is hired
stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
7 |
8 |
9 | 5

Number of TD passes by NFL teams: 2010 season (stems are 10’s digit)
stem | leaf
3 | 011337
2 | 5566667889
2 | 0123444
1 | 03447889
0 | 9

Pulse Rates, n = 138
(# is the number of leaves in the row; * rows hold leaves 0-4, . rows hold leaves 5-9)
#    Stem   Leaves
     4*
3    4.     588
9    5*     001233444
10   5.     5556788899
23   6*     00011111122233333344444
23   6.     55556666667777788888888
16   7*     00000112222334444
23   7.     55555666666777888888999
10   8*     0000112224
10   8.     5555667789
4    9*     0012
2    9.     58
4    10*    0223
     10.
1    11*    1

Advantages/Disadvantages of Stem-and-Leaf Displays
Advantages
1) each measurement is displayed
2) ascending order in each stem row
3) relatively simple (when the data set is not too large)
Disadvantages
* the display becomes unwieldy for large data sets

[Figure: stem-and-leaf display, Population of 185 US cities with between 100,000 and 500,000; multiply stems by 100,000]

Back-to-back stem-and-leaf displays
TD passes by NFL teams: 1999, 2009; multiply stems by 10
[Figure: back-to-back display. Stems: 4 3 3 2 2 1 1. 1999 leaves, by row: 2; 6; 2; 6655; 43322221100; 9998887666; 421. 2009 leaves, by row: 0444; 6677788899; 011113; 55666788; 0122]

Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77? Stems are 10’s digits.
1. 4
2. 6
3. 8
4. 10
5. 12

Interpreting Graphical Displays: Shape

Symmetric distribution
A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
Skewed distribution
A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution
Not all distributions have a simple overall shape, especially when there are few observations.

Shape (cont.)
Female heart attack patients in New York state
* Age: left-skewed
* Cost: right-skewed

Shape (cont.): Outliers
An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.
The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.
A large gap in the distribution is typically a sign of an outlier.
[Figure: distribution by state, with Alaska and Florida labeled]

Center: typical value of frozen personal pizza? ~$2.65

Spread: fuel efficiency, 4 and 8 cylinders
* 4 cylinders: more spread
* 8 cylinders: less spread

Other Graphical Methods for Economic Data
Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis
** Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)

[Figures: Unemployment Rate, by Educational Attainment; Water Use During Super Bowl; Winning Times 100 M Dash; Annual Mean Temperature]

End of Histograms, Stem and Leaf plots

Describing Distributions Numerically: Medians and Quartiles

2 characteristics of a data set to measure
* center: measures where the “middle” of the data is located
* variability: measures how “spread out” the data is

The median: a measure of center
Given a set of n measurements arranged in order of magnitude,
* n odd: median = middle value
* n even: median = mean of the 2 middle values
Ex. 2, 4, 6, 8, 10; n = 5; median = 6
Ex. 2, 4, 6, 8; n = 4; median = (4+6)/2 = 5

Student Pulse Rates (n = 62)
38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76, 77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103
Median = (75+76)/2 = 75.5

Medians are used often
* Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)
* Median fan age: MLB 45; NFL 43; NBA 41; NHL 39
* Median existing home sales price: May 2011 $166,500; May 2010 $174,600
* Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

The median splits the histogram into 2 halves of equal area

Examples
Example: n = 7
17.5 2.8 3.2 13.9 14.1 25.3 45.8
Example: n = 7 (ordered); m = 14.1
2.8 3.2 13.9 14.1 17.5 25.3 45.8
Example: n = 8
17.5 2.8 3.2 13.9 14.1 25.3 35.7 45.8
Example: n = 8 (ordered); m = (14.1+17.5)/2 = 15.8
2.8 3.2 13.9 14.1 17.5 25.3 35.7 45.8

Below are the annual tuition charges at 7 public universities. What is the median tuition?
4429 4960 4960 4971 5245 5546 7586
1. 5245
2. 4965.5
3. 4960
4. 4971

Below are the annual tuition charges at 7 public universities. What is the median tuition?
4429 4960 5245 5546 4971 5587 7586
1. 5245
2. 4965.5
3. 5546
4. 4971

Measures of Spread
The range and interquartile range

Ways to measure variability
* range = largest - smallest
* OK sometimes; in general, too crude; sensitive to one large or small data value
The range measures spread by examining the ends of the data.
A better way to measure spread is to examine the middle portion of the data.

Quartiles: Measuring spread by examining the middle
The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).
The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).
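The median rule and the range described above can be sketched in a few lines of Python. This is only an illustration: the function names `median` and `data_range` are assumptions made for this sketch and are not part of the slides.

```python
def median(values):
    """Middle value when n is odd; mean of the 2 middle values when n is even."""
    ordered = sorted(values)          # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


# Examples from the slides
print(median([2, 4, 6, 8, 10]))                                   # n = 5
print(median([2, 4, 6, 8]))                                       # n = 4
print(median([17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8]))           # n = 7
print(data_range([2, 4, 6, 8, 10]))
```

Sorting first matters: the n = 7 example is given unordered, and the middle value is only meaningful once the data are arranged in order.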
[Figure: 25 sorted observations, numbered 1 to 25, with Q1, the median, and Q3 marked]
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
Q1 = first quartile = 2.3
m = median = 3.4
Q3 = third quartile = 4.2

Quartiles and median divide data into 4 pieces
1/4 | Q1 | 1/4 | M | 1/4 | Q3 | 1/4

Quartiles are common measures of spread
http://www2.acs.ncsu.edu/UPA/admissions/fresprof.htm
http://www2.acs.ncsu.edu/UPA/peers/current/ncsu_peers/sat.htm
University of Southern California
UNC-CH

Rules for Calculating Quartiles
Step 1: find the median of all the data (the median divides the data in half)
Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3
Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.

Example
2 4 6 8 10 12 14 16 18 20
n = 10
Median m = (10+12)/2 = 22/2 = 11
Q1: median of lower half 2 4 6 8 10; Q1 = 6
Q3: median of upper half 12 14 16 18 20; Q3 = 16

Pulse Rates, n = 138
#    Stem   Leaves
     4*
3    4.     588
9    5*     001233444
10   5.     5556788899
23   6*     00011111122233333344444
23   6.     55556666667777788888888
16   7*     00000112222334444
23   7.     55555666666777888888999
10   8*     0000112224
10   8.     5555667789
4    9*     0012
2    9.     58
4    10*    0223
     10.
1    11*    1
Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70
Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63
Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?
1. 287
2. 257.5
3. 263.5
4. 262.5
#     stem | leaf
2     22 | 55
4     23 | 57
6     24 | 26
7     25 | 7
10    26 | 257
12    27 | 59
(4)   28 | 1567
15    29 | 35599
10    30 | 333
7     31 | 45
5     32 | 155
2     33 | 6
1     34 | 0
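The quartile rule from the slides above (Step 1, Step 2a, Step 2b, and the note on odd versus even n) can be sketched as follows. This is a minimal sketch, not a standard library routine: the helper names `_median_sorted` and `quartiles` are assumptions for illustration, and other software may use a different quartile convention and give slightly different values.

```python
def _median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    n odd:  the overall median is included in both halves.
    n even: the overall median is included in neither half.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2             # size of each half under the rule above
    lower = ordered[:half]          # smallest `half` values
    upper = ordered[n - half:]      # largest `half` values
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


# n = 10 example from the slides: Q1 = 6, median = 11, Q3 = 16
q1, m, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
print(q1, m, q3)
```

For n = 10 each half has 5 values, and for the 25-value example each half has 13 values because the median (the 13th value) is counted in both halves, which is how the slide arrives at Q1 = 2.3 and Q3 = 4.2.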
Interquartile range
* lower quartile: Q1
* middle quartile: median
* upper quartile: Q3
* interquartile range (IQR): IQR = Q3 - Q1; measures the spread of the middle 50% of the data
Example: beginning pulse rates
Q3 = 78; Q1 = 63
IQR = 78 - 63 = 15

Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?
1. 23.5
2. 39.5
3. 46
4. 69.5
#     stem | leaf
2     22 | 55
4     23 | 57
6     24 | 26
7     25 | 7
10    26 | 257
12    27 | 59
(4)   28 | 1567
15    29 | 35599
10    30 | 333
7     31 | 45
5     32 | 155
2     33 | 6
1     34 | 0

5-number summary of data
             Minimum   Q1   median   Q3   maximum
Pulse data   45        63   70       78   111

End of Medians and Quartiles

Numerical Summaries of Symmetric Data
Measure of Center: Mean
Measure of Variability: Standard Deviation

Symmetric Data
[Figure: Body temp. of 93 adults]

Recall: 2 characteristics of a data set to measure
* center: measures where the “middle” of the data is located
* variability: measures how “spread out” the data is

Measure of Center When Data Approx. Symmetric
mean (arithmetic mean)
notation:
* $x_i$: ith measurement in a set of observations $x_1, x_2, x_3, \ldots, x_n$
* n: number of measurements in the data set; sample size
* $\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n$
Sample mean: $\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
Population mean (value typically not known); N = population size: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$

Connection Between Mean and Histogram
A histogram balances when supported at the mean.
[Figure: Histogram of Absences from Work, class boundaries 118.5 to 160.5; mean $\bar{x} = 140.6$]

Mean: balance point
Median: 50% area each half
[Figure: two histograms; right histogram: mean 55.26 yrs, median 57.7 yrs]

Properties of Mean, Median
1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).
2. The mean uses the value of every number in the data set; the median does not.
Ex. 2, 4, 6, 8: $\bar{x} = \frac{20}{4} = 5$; $m = \frac{4+6}{2} = 5$
Ex. 2, 4, 6, 9: $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$; $m = \frac{4+6}{2} = 5$

Example: class pulse rates
53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140
n = 23
$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$
m: location: 12th obs.; m = 85

2010, 2011 baseball salaries
* 2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000
* 2011: n = 848; mean = $3,305,393; median = $1,450,000; max = $32,000,000

Disadvantage of the mean
Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data.

Mean, Median, Maximum BB Salaries
[Figure: Baseball Salaries: Mean, Median and Maximum, 1985-2006; x-axis: Year; left axis: Mean, Median Salary; right axis: Maximum Salary]

Skewness: comparing the mean and median
Skewed to the right (positively skewed): mean > median
[Figure: 2011 Baseball Salaries; x-axis: Salary ($1,000's); y-axis: Frequency (0 to 600)]
Skewed to the left (negatively skewed): mean < median
mean = 78; median = 87
[Figure: Histogram of Exam Scores; x-axis: Exam Scores (20 to 100); y-axis: Frequency (0 to 30)]
Symmetric data: mean, median approx. equal
[Figure: Bank Customers: 10:00-11:00 am; x-axis: Number of Customers; y-axis: Frequency (0 to 20)]

DESCRIBING VARIABILITY OF SYMMETRIC DATA

Describing Symmetric Data (cont.)
Measure of center for symmetric data:
Sample mean: $\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
Measure of variability for symmetric data?

Example
2 data sets:
$x_1 = 49$, $x_2 = 51$; $\bar{x} = 50$
$y_1 = 0$, $y_2 = 100$; $\bar{y} = 50$
On average, they’re both comfortable.

Ways to measure variability
1. range = largest - smallest; OK sometimes; in general, too crude; sensitive to one large or small obs.
2. measure spread from the middle, where the middle is the mean $\bar{x}$:
* deviation of $x_i$ from the mean: $x_i - \bar{x}$
* $\sum_{i=1}^{n} (x_i - \bar{x})$: sum the deviations of all the $x_i$'s from $\bar{x}$
* $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ always; tells us nothing

Previous Example
Sum of deviations from the mean:
$x_1 = 49$, $x_2 = 51$; $\bar{x} = 50$
$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$
$y_1 = 0$, $y_2 = 100$; $\bar{y} = 50$
$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$

The Sample Standard Deviation, a measure of spread around the mean
Square the deviation of each observation from the mean; find the square root of the “average” of these squared deviations:
* square each deviation: $(x_i - \bar{x})^2$
* sum them, $\sum_{i=1}^{n} (x_i - \bar{x})^2$, and find the “average”
* then take the square root of the average
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$ is called the sample standard deviation.

Calculations …
Women height (inches)
i     x_i   x-bar   x_i - x-bar   (x_i - x-bar)^2
1     59    63.4    -4.4          19.0
2     60    63.4    -3.4          11.3
3     61    63.4    -2.4          5.6
4     62    63.4    -1.4          1.8
5     62    63.4    -1.4          1.8
6     63    63.4    -0.4          0.1
7     63    63.4    -0.4          0.1
8     63    63.4    -0.4          0.1
9     64    63.4    0.6           0.4
10    64    63.4    0.6           0.4
11    65    63.4    1.6           2.7
12    66    63.4    2.6           7.0
13    67    63.4    3.6           13.3
14    68    63.4    4.6           21.6
Mean = 63.4; sum of deviations = 0.0; sum of squared deviations from the mean = 85.2
(n − 1) = 13; (n − 1) is called the degrees of freedom (df)
s² = variance = 85.2/13 = 6.55 inches squared
s = standard deviation = √6.55 = 2.56 inches

We’ll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.

1. First calculate the variance s²:
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
[Figure: distribution of women's heights with the interval Mean ± 1 s.d. marked around $\bar{x}$]
2. Then take the square root to get the standard deviation s.
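The two steps above (variance first, then its square root) can be sketched in Python using the women's height data from the table. The function names `sample_variance` and `sample_std_dev` are assumptions for this sketch; in practice a calculator, Excel, or a statistics package does the same job.

```python
import math


def sample_variance(values):
    """s^2 = sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


def sample_std_dev(values):
    """s = square root of the sample variance."""
    return math.sqrt(sample_variance(values))


# Women's heights (inches) from the table above
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]
print(round(sample_variance(heights), 2))   # about 6.55 inches squared
print(round(sample_std_dev(heights), 2))    # about 2.56 inches
```

The sketch uses the unrounded mean (about 63.36) rather than 63.4, which is why the table's rounded squared deviations still add up to about 85.2.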
$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

Population Standard Deviation
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
The value of the population standard deviation is typically not known; use s to estimate the value of $\sigma$.

Remarks
1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Remarks (cont.)
2. Note that s and $\sigma$ are always greater than or equal to zero.
3. The larger the value of s (or $\sigma$), the greater the spread of the data.
When does s = 0? When does $\sigma$ = 0? When all data values are the same.

Remarks (cont.)
4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).
5. Variance: $s^2$ = sample variance; $\sigma^2$ = population variance.
Units are squared units of the original data: square $, square gallons ??

Remarks 6): Why divide by n-1 instead of n?
* degrees of freedom: each observation has 1 degree of freedom
* however, when you estimate an unknown population parameter like $\mu$, you lose 1 degree of freedom
In the formula for s, we use $\bar{x}$ to estimate the unknown value of $\mu$:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

Remarks 6) (cont.): Why divide by n-1 instead of n?
Example: Suppose we have 3 numbers whose average is 9.
* Choose ANY values for $x_1$ and $x_2$.
* Since the average (mean) is 9, $x_1 + x_2 + x_3$ must equal 9*3 = 27, so $x_3$ must be $27 - (x_1 + x_2)$.
* Once we selected $x_1$ and $x_2$, $x_3$ was determined since the average was 9.
3 numbers but only 2 “degrees of freedom”

Computational Example
observations 1, 3, 5, 9; $\bar{x} = \frac{18}{4} = 4.5$
$s = \sqrt{\frac{(1 - 4.5)^2 + (3 - 4.5)^2 + (5 - 4.5)^2 + (9 - 4.5)^2}{4 - 1}}$
$= \sqrt{\frac{(-3.5)^2 + (-1.5)^2 + (.5)^2 + (4.5)^2}{3}}$
$= \sqrt{\frac{12.25 + 2.25 + .25 + 20.25}{3}} = \sqrt{\frac{35}{3}} = \sqrt{11.67} = 3.42$
$s^2 = 11.67$

class pulse rates
53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140
n = 23; $\bar{x} = 84.48$; m = 85
$s^2 = 290.26$ (beats per minute)$^2$
$s = 17.037$ beats per minute

Review: Properties of s and $\sigma$
* s and $\sigma$ are always greater than or equal to 0; when does s = 0? $\sigma$ = 0?
* The larger the value of s (or $\sigma$), the greater the spread of the data.
* The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.

Summary of Notation
SAMPLE                           POPULATION
$\bar{y}$  sample mean           $\mu$  population mean
m  sample median                 m  population median
$s^2$  sample variance           $\sigma^2$  population variance
s  sample stand. dev.            $\sigma$  population stand. dev.

End of Chapter 3