Introduction to Business Statistics, 6e
Kvanli, Pavur, Keeling

Chapter 3 – Data Summary Using Descriptive Measures

Slides prepared by Jeff Heyl, Lincoln University
Thomson/South-Western Learning™
©2003 South-Western/Thomson

Types of Descriptive Measures

- Measures of central tendency
- Measures of variation
- Measures of position
- Measures of shape

Measures of Central Tendency

- The Mean
- The Median
- The Midrange
- The Mode

The Mean

The Mean is simply the average of the data.

A Sample Mean

Each value in the sample is represented by x. To get the sample mean, add all the values in the sample and divide by the number of values in the sample (n):

$$\bar{x} = \frac{\sum x}{n}$$

The Population Mean

Each value in the population is represented by x. To get the population mean (μ), add all the values in the population and divide by the number of values in the population (N):

$$\mu = \frac{\sum x}{N}$$

The Accident Data Set

$$\bar{x} = \frac{6 + 9 + 7 + 23 + 5}{5} = 10.0$$

If we remove the last value from the data set, then

$$\bar{x} = \frac{6 + 9 + 7 + 23}{4} = 11.25$$

The Median

The Median (Md) of a set of data is the value in the center of the data values when they are arranged from lowest to highest.

Accident Data

Ordered array: 5, 6, 7, 9, 23

The value that has an equal number of items to the right and left is the median: Md = 7.

If n is an odd number, Md is the center data value of the ordered data set:

$$Md = \left(\frac{n + 1}{2}\right)\text{st ordered value}$$

Even Numbered Data

Ordered array: 3, 8, 12, 14

The value that has an equal number of items to the right and left is the median: Md = (8 + 12)/2 = 10.

If n is an even number, Md is the average of the two center values of the ordered data set.

The Midrange

The Midrange (Mr) provides an easy-to-grasp measure of central tendency, where L is the lowest data value and H is the highest:

$$Mr = \frac{L + H}{2}$$

Accident Data

Ordered array:
5, 6, 7, 9, 23

$$Mr = \frac{5 + 23}{2} = 14$$

Note that the Midrange is severely affected by outliers. Compare Mr to $\bar{x} = 10$ and Md = 7.

The Mode

The Mode (Mo) of a data set is the value that occurs more than once and the most often.

The Mode is not always a measure of central tendency; this value need not occur in the center of the data.

Bellaire College Example

Figures 3.2, 3.3, and 3.4

Level of Measurement and Measure of Central Tendency

Summary of levels of measurement and appropriate measure of central tendency. A "Y" indicates this measure can be used with the corresponding level of measurement.

Table 3.1

| Level of Measurement | Mean | Median | Midrange | Mode |
|---|---|---|---|---|
| Nominal | | | | Y |
| Ordinal | | Y | | Y |
| Interval | Y | Y | Y | Y |
| Ratio | Y | Y | Y | Y |

Measures of Variation

Homogeneity refers to the degree of similarity within a set of data.

The more homogeneous a set of data is, the better the mean will represent a typical value.

Variation is the tendency of data values to scatter about the mean, $\bar{x}$.

Common Measures of Variation

- Range
- Variance
- Standard Deviation
- Coefficient of Variation

The Range

For the Accident data:

Range = H - L = 23 - 5 = 18

The range is a rather crude measure, but it is easy to calculate and contains valuable information in some situations.

The Variance and Standard Deviation

Both measures describe the variation of the values about the mean.

| Data Value (x) | $x - \bar{x}$ | $(x - \bar{x})^2$ |
|---|---|---|
| 5 | -5 | 25 |
| 6 | -4 | 16 |
| 7 | -3 | 9 |
| 9 | -1 | 1 |
| 23 | 13 | 169 |
| Sum | $\sum (x - \bar{x}) = 0$ | $\sum (x - \bar{x})^2 = 220$ |

Sample Variance

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

Using the accident data:

$$s^2 = \frac{220}{5 - 1} = \frac{220}{4} = 55.0$$

Sample Standard Deviation

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

Using the accident data: $s = \sqrt{55.0} =$
7.416

Population Variance and Standard Deviation

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$

The Coefficient of Variation

The Coefficient of Variation (CV) is used to compare the variation of two or more data sets where the values of the data differ greatly:

$$CV = \frac{s}{\bar{x}} \cdot 100$$

Machined Parts Example

Figure 3.6

Measures of Position

Percentile (Quartile)
- Most common measure of position
- Quartiles are percentiles with the data divided into quarters

Z-Score
- The relative position of a data value expressed in terms of the number of standard deviations above or below the mean

Percentile Example

The 35th Percentile (P35) is that value such that at most 35% of the data values are less than P35 and at most 65% of the data values are greater than P35.

Aptitude Test Scores

Table 3.2: Ordered array of aptitude test scores for 50 applicants ($\bar{x} = 60.36$, s = 18.61)

22 25 28 31 34 35 39 39 40 42
44 44 46 48 49 51 53 53 55 55
56 57 59 60 61 63 63 63 65 66
68 68 69 71 72 72 74 75 75 76
78 78 80 82 83 85 88 90 92 96

Percentile

Texon Industries Data

Number of data values, n = 50
Percentile, P = 35

$$n \cdot \frac{P}{100} = 50 \cdot 0.35 = 17.5$$

17.5 represents the position of the 35th percentile.

Percentile Location Rules

Rule 1: If n · P/100 is not a counting number, round it up, and the Pth percentile will be the value in this position of the ordered data.

Rule 2: If n · P/100 is a counting number, the Pth percentile is the average of the number in this location (of the ordered data) and the number in the next larger location.

Aptitude Scores Example

Ms. Jensen received a score of 83 on the aptitude test. What is her percentile value?

83 is the 45th value in the ordered array of 50.
A guess of the percentile would be:

$$P = \frac{45}{50} \cdot 100 = 90$$

Examining the surrounding values clarifies the true percentile (Example 3.5):

| P | (n · P)/100 | Pth Percentile |
|---|---|---|
| 88 | 50 · .88 = 44 | (82 + 83)/2 = 82.5 |
| 89 | 50 · .89 = 44.5 | 45th value = 83 |
| 90 | 50 · .90 = 45 | (83 + 85)/2 = 84 |

A score of 83 is therefore the 89th percentile, not the 90th.

Quartiles

Quartiles are merely particular percentiles that divide the data into quarters, namely:

Q1 = 1st quartile = 25th percentile (P25)
Q2 = 2nd quartile = 50th percentile = median (P50)
Q3 = 3rd quartile = 75th percentile (P75)

Quartile Example

Using the applicant data, the first quartile is located at:

$$n \cdot \frac{P}{100} = (50)(.25) = 12.5$$

Rounded up, Q1 = 13th ordered value = 46.

Similarly, the third quartile is located at:

$$n \cdot \frac{P}{100} = (50)(.75) = 37.5$$

Rounded up, this is the 38th ordered value, so Q3 = 75.

Interquartile Range

The interquartile range (IQR) is the range spanned by the middle 50% of the data set:

IQR = Q3 - Q1

Using the applicant data, the IQR is:

IQR = 75 - 46 = 29

Z-Scores

The Z-score determines the relative position of any particular data value x and is based on the mean and standard deviation of the data set.

The Z-score expresses the number of standard deviations the value x is from the mean.

A negative Z-score implies that x is to the left of the mean and a positive Z-score implies that x is to the right of the mean.

Z-Score Equation

$$z = \frac{x - \bar{x}}{s}$$

For a score of 83 from the aptitude data set,

$$z = \frac{83 - 60.36}{18.61} = 1.22$$

For a score of 35 from the aptitude data set,

$$z = \frac{35 - 60.36}{18.61} = -1.36$$

Standardizing Sample Data

The process of subtracting the mean and dividing by the standard deviation is referred to as standardizing the sample data. The corresponding z-score is the standardized score.
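The standardizing step can be checked numerically. The following is a minimal sketch in Python, assuming the 50 scores from Table 3.2 and the standard library's `statistics` module; the function name `z_score` is chosen here for illustration and is not part of the slides.

```python
from statistics import mean, stdev

# Aptitude test scores from Table 3.2 (50 applicants).
scores = [
    22, 25, 28, 31, 34, 35, 39, 39, 40, 42,
    44, 44, 46, 48, 49, 51, 53, 53, 55, 55,
    56, 57, 59, 60, 61, 63, 63, 63, 65, 66,
    68, 68, 69, 71, 72, 72, 74, 75, 75, 76,
    78, 78, 80, 82, 83, 85, 88, 90, 92, 96,
]


def z_score(x, data):
    """Number of sample standard deviations that x lies from the sample mean."""
    # stdev() divides by n - 1, matching the sample standard deviation s.
    return (x - mean(data)) / stdev(data)


x_bar = mean(scores)   # sample mean
s = stdev(scores)      # sample standard deviation

print(f"mean = {x_bar:.2f}, s = {s:.2f}")
print(f"z(83) = {z_score(83, scores):.2f}")
print(f"z(35) = {z_score(35, scores):.2f}")
```

Rounded to two decimals, this should agree with the summary values quoted for Table 3.2 and with the two z-scores worked out above.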
Measures of Shape

Skewness
- Skewness measures the tendency of a distribution to stretch out in a particular direction

Kurtosis
- Kurtosis measures the peakedness of the distribution

Skewness

In a symmetrical distribution the mean, median, and mode would all be the same value and Sk = 0.

A positive Sk number implies a shape which is skewed right, with mode < median < mean.

In a data set with a negative Sk value, mean < median < mode.

Skewness Calculation

Pearsonian coefficient of skewness:

$$Sk = \frac{3(\bar{x} - Md)}{s}$$

Values of Sk will always fall between -3 and 3.

Frequency Histogram of Symmetric Data

Figure 3.7: $\bar{x}$ = Md = Mo

Relative Frequency Histogram with Right (Positive) Skew

Figure 3.8: Sk > 0, with Mode (Mo) < Median (Md) < Mean ($\bar{x}$)

Relative Frequency Histogram with Left (Negative) Skew

Figure 3.9: Sk < 0, with Mean ($\bar{x}$) < Median (Md) < Mode (Mo)

Kurtosis

Kurtosis is a measure of the peakedness of a distribution.

Large values occur when there is a high frequency of data near the mean and in the tails.

The calculation is cumbersome and the measure is used infrequently.

Chebyshev's Inequality

1. At least 75% of the data values are between $\bar{x} - 2s$ and $\bar{x} + 2s$, or at least 75% of the data values have a z-score value between -2 and 2.
2. At least 89% of the data values are between $\bar{x} - 3s$ and $\bar{x} + 3s$, or at least 89% of the data values have a z-score value between -3 and 3.
3. In general, at least $(1 - 1/k^2) \times 100\%$ of the data values lie between $\bar{x} - ks$ and $\bar{x} + ks$ for any k > 1.

Empirical Rule

Under the assumption of a bell-shaped population:

1. Approximately 68% of the data values lie between $\bar{x} - s$ and $\bar{x} + s$ (have z-scores between -1 and 1)
2. Approximately 95% of the data values lie between $\bar{x} - 2s$ and $\bar{x} + 2s$ (have z-scores between -2 and 2)
3.
Approximately 99.7% of the data values lie between $\bar{x} - 3s$ and $\bar{x} + 3s$ (have z-scores between -3 and 3)

A Bell-Shaped (Normal) Population

Figure 3.10

Chebyshev's Versus Empirical

Table 3.3 (aptitude test data; Md = 62, Sk = -.26)

| Between | Actual Percentage | Chebyshev's Inequality Percentage | Empirical Rule Percentage |
|---|---|---|---|
| $\bar{x} - s$ and $\bar{x} + s$ | 66% (33 out of 50) | n/a | ≈ 68% |
| $\bar{x} - 2s$ and $\bar{x} + 2s$ | 98% (49 out of 50) | ≥ 75% | ≈ 95% |
| $\bar{x} - 3s$ and $\bar{x} + 3s$ | 100% (50 out of 50) | ≥ 89% | ≈ 100% |

Allied Manufacturing Example

Is the Empirical Rule applicable to these data? Probably yes. The histogram is approximately bell shaped.

$\bar{x} - 2s = 10.275$ and $\bar{x} + 2s = 10.3284$

96 of the 100 data values fall between these limits, closely approximating the 95% called for by the Empirical Rule.

Grouped Data

When raw data are not available, estimate $\bar{x}$ by assuming data values are equal to the midpoint of their class.

Table 3.4

| Class Number | Class (Age in years) | Frequency |
|---|---|---|
| 1 | 20 and under 30 | 5 |
| 2 | 30 and under 40 | 14 |
| 3 | 40 and under 50 | 9 |
| 4 | 50 and under 60 | 6 |
| 5 | 60 and under 70 | 2 |
| Total | | 36 |

Grouped Data

Each class is represented by its midpoint:

5 values at (20 + 30)/2 = 25
14 values at (30 + 40)/2 = 35
9 values at (40 + 50)/2 = 45
6 values at (50 + 60)/2 = 55
2 values at (60 + 70)/2 = 65

$$\bar{x} = \frac{(5)(25) + (14)(35) + (9)(45) + (6)(55) + (2)(65)}{36} = \frac{1480}{36} = 41.1$$

Grouped Data

When raw data are not available, estimate $s^2$ by assuming data values are equal to the midpoint of their class and using the usual computational formula:

$$s^2 = \frac{\sum (\text{each data value})^2 - \left[\sum (\text{each data value})\right]^2 / n}{n - 1}$$

$$s^2 = \frac{65{,}100 - (1480)^2 / 36}{35} = 121.59$$

$$s = \sqrt{121.59} = 11.03$$

Grouped Data

Summary of calculations (f = class frequency, m = class midpoint)

| Class Number | Class | f | m |
|---|---|---|---|
| 1 | 20 and under 30 | 5 | 25 |
| 2 | 30 and under 40 | 14 | 35 |
| 3 | 40 and under 50 | 9 | 45 |
| 4 | 50 and under 60 | 6 | 55 |
| 5 | 60 and under 70 | 2 | 65 |
Products by class, in class order (Table 3.5):

| Class Number | f · m | f · m² |
|---|---|---|
| 1 | 125 | 3,125 |
| 2 | 490 | 17,150 |
| 3 | 405 | 18,225 |
| 4 | 330 | 18,150 |
| 5 | 130 | 8,450 |

Totals: ∑f = 36, ∑f · m = 1,480, ∑f · m² = 65,100

Grouped Data

Figure 3.11

Box Plots

Box plots are graphical representations of data sets that illustrate the lowest data value (L), the first quartile (Q1), the median (Q2, Md), the third quartile (Q3), the interquartile range (IQR), and the highest data value (H).

Box Plots

Given the aptitude test data:

L = 22
Q1 = 46
Q2 = Md = 62
Q3 = 75
IQR = 75 - 46 = 29
H = 96

Figure 3.12: Box plot of the aptitude test data on a scale from 20 to 100, marking L = 22, Q1 = 46, Md = 62, Q3 = 75, and H = 96

Box Plots

Figures 3.13, 3.14, 3.15, 3.16a, and 3.16b

Box Plots

Figure 3.17: Box plots for aptitude scores (vertical axis: aptitude score, 20 to 100; horizontal axis: samples 1 and 2)
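The values a box plot displays for the aptitude data can be reproduced from the percentile location rules given earlier in the chapter. The following is a minimal Python sketch; the function names `percentile` and `five_number_summary` are chosen here for illustration, and the code assumes an integer P strictly between 0 and 100 and data already sorted in ascending order. Note that library routines (for example `statistics.quantiles`) use different interpolation conventions and may give slightly different quartiles.

```python
# Aptitude test scores from Table 3.2, already in ascending order.
scores = [
    22, 25, 28, 31, 34, 35, 39, 39, 40, 42,
    44, 44, 46, 48, 49, 51, 53, 53, 55, 55,
    56, 57, 59, 60, 61, 63, 63, 63, 65, 66,
    68, 68, 69, 71, 72, 72, 74, 75, 75, 76,
    78, 78, 80, 82, 83, 85, 88, 90, 92, 96,
]


def percentile(ordered, p):
    """P-th percentile of an ascending list, using the slides' location rules.

    Assumption: p is an integer with 0 < p < 100.
    """
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    n = len(ordered)
    if (n * p) % 100 == 0:
        # Rule 2: n*P/100 is a counting number, so average the value at
        # that (1-based) position with the value at the next position.
        k = n * p // 100
        return (ordered[k - 1] + ordered[k]) / 2
    # Rule 1: n*P/100 is not a counting number, so round the position up.
    k = -(-n * p // 100)  # integer ceiling of n*p/100
    return ordered[k - 1]


def five_number_summary(ordered):
    """Lowest value, quartiles, and highest value shown on a box plot."""
    return {
        "L": ordered[0],
        "Q1": percentile(ordered, 25),
        "Md": percentile(ordered, 50),
        "Q3": percentile(ordered, 75),
        "H": ordered[-1],
    }


summary = five_number_summary(scores)
iqr = summary["Q3"] - summary["Q1"]
print(summary, "IQR =", iqr)
```

Using integer arithmetic to decide whether n · P/100 is a counting number avoids floating-point rounding in that test; this is a design choice of the sketch, not something the slides prescribe.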