4A-1 Chapter 4A: Descriptive Statistics (Part 1)
Numerical Description
Central Tendency
Dispersion
McGraw-Hill/Irwin © 2008 The McGraw-Hill Companies, Inc. All rights reserved.

4A-3 Numerical Description
• Statistics are descriptive measures derived from a sample (n items).
• Parameters are descriptive measures derived from a population (N items).

4A-4 Numerical Description
• Three key characteristics of numerical data:

| Characteristic | Interpretation |
| Central Tendency | Where are the data values concentrated? What seem to be typical or middle data values? |
| Dispersion | How much variation is there in the data? How spread out are the data values? Are there unusual values? |
| Shape | Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal? |

4A-5 Numerical Description: Example: Vehicle Quality
• Consider the data set of vehicle defect rates from J. D. Power and Associates.
• $\text{Defect rate} = \dfrac{\text{total no. defects}}{\text{no. inspected}} \times 100$
• Numerical statistics can be used to summarize this random sample of brands.
• Must allow for sampling error since the analysis is based on sampling.

4A-6 Numerical Description
• Number of defects per 100 vehicles, 1004 models.

4A-7 Numerical Description
• Sorted data provides insight into central tendency and dispersion.

4A-8 Numerical Description: Visual Displays
• The dot plot offers a visual impression of the data.

4A-9 Central Tendency
• The central tendency is the middle or typical values of a distribution.
• Central tendency can be assessed using a dot plot, a histogram or, more precisely, numerical statistics.

4A-10 Central Tendency: Six Measures of Central Tendency

| Statistic | Formula | Excel Formula | Pro | Con |
| Mean | $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$ | =AVERAGE(Data) | Familiar and uses all the sample information. | Influenced by extreme values. |
| Median | Middle value in sorted array | =MEDIAN(Data) | Robust when extreme data values exist. | Ignores extremes and can be affected by gaps in data values. |
4A-11 Central Tendency: Six Measures of Central Tendency

| Statistic | Formula | Excel Formula | Pro | Con |
| Mode | Most frequently occurring data value | =MODE(Data) | Useful for attribute data or discrete data with a small range. | May not be unique, and is not helpful for continuous data. |
| Midrange | $\dfrac{x_{\min} + x_{\max}}{2}$ | =0.5*(MIN(Data)+MAX(Data)) | Easy to understand and calculate. | Influenced by extreme values and ignores most data values. |

4A-12 Central Tendency: Six Measures of Central Tendency

| Statistic | Formula | Excel Formula | Pro | Con |
| Geometric mean (G) | $\sqrt[n]{x_1 x_2 \cdots x_n}$ | =GEOMEAN(Data) | Useful for growth rates and mitigates high extremes. | Less familiar and requires positive data. |
| Trimmed mean | Same as the mean except omit highest and lowest k% of data values (e.g., 5%) | =TRIMMEAN(Data, %) | Mitigates effects of extreme values. | Excludes some data values that could be relevant. |

4A-13 Central Tendency: Mean
• A familiar measure of central tendency.
• Population formula: $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$
• Sample formula: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
• In Excel, use function =AVERAGE(Data) where Data is an array of data values.

4A-14 Central Tendency: Mean
• For the sample of n = 37 car brands:
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{87 + 93 + 98 + \dots + 159 + 164 + 173}{37} = \dfrac{4639}{37} = 125.38$

4A-15 Central Tendency: Characteristics of the Mean
• Arithmetic mean is the most familiar average.
• Affected by every sample item.
• The balancing point or fulcrum for the data.

4A-16 Central Tendency: Characteristics of the Mean
• Regardless of the shape of the distribution, the signed deviations of the data points from the mean always sum to zero: $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
• Consider the following asymmetric distribution of quiz scores whose mean = 65.
$\sum_{i=1}^{n} (x_i - \bar{x}) = (42 - 65) + (60 - 65) + (70 - 65) + (75 - 65) + (78 - 65)$
$= (-23) + (-5) + (5) + (10) + (13) = -28 + 28 = 0$

4A-17 Central Tendency: Median
• The median (M) is the 50th percentile or midpoint of the sorted sample data.
• M separates the upper and lower half of the sorted observations.
• If n is odd, the median is the middle observation in the data array.
• If n is even, the median is the average of the middle two observations in the data array.

4A-18 Central Tendency: Median
• For n = 8, the median is between the fourth and fifth observations in the data array.

4A-19 Central Tendency: Median
• For n = 9, the median is the fifth observation in the data array.

4A-20 Central Tendency: Median
• Consider the following n = 6 data values: 11 12 15 17 21 32
• What is the median?
• For even n, Median = $\dfrac{x_{n/2} + x_{(n/2+1)}}{2}$
• n/2 = 6/2 = 3 and n/2 + 1 = 6/2 + 1 = 4
• M = (x3 + x4)/2 = (15 + 17)/2 = 16, which lies between 15 and 17 in the sorted array.

4A-21 Central Tendency: Median
• Consider the following n = 7 data values: 12 23 23 25 27 34 41
• What is the median?
• For odd n, Median = $x_{(n+1)/2}$
• (n+1)/2 = (7+1)/2 = 8/2 = 4
• M = x4 = 25

4A-22 Central Tendency: Median
• Use Excel’s function =MEDIAN(Data) where Data is an array of data values.
• For the 37 vehicle quality ratings (odd n) the position of the median is (n+1)/2 = (37+1)/2 = 19.
• So, the median is x19 = 121.
• When there are several duplicate data values, the median does not provide a clean “50-50” split in the data.

4A-23 Central Tendency: Characteristics of the Median
• The median is insensitive to extreme data values.
• For example, consider the following quiz scores for 3 students:

| Student | Scores | Mean | Median | Total |
| Tom | 20, 40, 70, 75, 80 | 57 | 70 | 285 |
| Jake | 60, 65, 70, 90, 95 | 76 | 70 | 380 |
| Mary | 50, 65, 70, 75, 90 | 70 | 70 | 350 |

• What does the median for each student tell you?

4A-24 Central Tendency: Mode
• The most frequently occurring data value.
• Similar to mean and median if data values occur often near the center of sorted data.
• May have multiple modes or no mode.
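
The odd/even positional rule for the median (slides 4A-17 to 4A-21) and the zero-sum property of deviations from the mean (slide 4A-16) can be checked with a short Python sketch. The function names `median_by_position` and `deviation_sum` are illustrative choices, not part of the slides; the data values are the ones quoted on the slides.

```python
def median_by_position(values):
    """Median via the positional rule: the middle value for odd n,
    the average of the two middle values for even n."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # 1-based position (n + 1) / 2 is 0-based index n // 2
        return ordered[mid]
    # 1-based positions n/2 and n/2 + 1 are 0-based indices mid - 1 and mid
    return (ordered[mid - 1] + ordered[mid]) / 2


def deviation_sum(values):
    """Sum of signed deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(x - mean for x in values)


even_sample = [11, 12, 15, 17, 21, 32]      # slide 4A-20, n = 6
odd_sample = [12, 23, 23, 25, 27, 34, 41]   # slide 4A-21, n = 7
quiz_scores = [42, 60, 70, 75, 78]          # slide 4A-16, mean = 65

print(median_by_position(even_sample))  # average of 15 and 17
print(median_by_position(odd_sample))   # fourth sorted value
print(deviation_sum(quiz_scores))       # signed deviations cancel out
```

Sorting inside the function means the caller does not need to pre-sort the data, which mirrors how Excel's =MEDIAN(Data) is used on the slides.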
4A-25 Central Tendency: Mode
• For example, consider the following quiz scores for 4 students:

| Student | Scores | Mean | Median | Mode |
| Lee | 60, 70, 70, 70, 80 | 70 | 70 | 70 |
| Pat | 45, 45, 70, 90, 100 | 70 | 70 | 45 |
| Sam | 50, 60, 70, 80, 90 | 70 | 70 | none |
| Xiao | 50, 50, 70, 90, 90 | 70 | 70 | 50, 90 |

• What does the mode for each student tell you?

4A-26 Central Tendency: Mode
• Easy to define, not easy to calculate in large samples.
• Use Excel’s function =MODE(Array)
  - will return #N/A if there is no mode.
  - will return first mode found if multimodal.
• May be far from the middle of the distribution and not at all typical.

4A-27 Central Tendency: Mode
• Generally isn’t useful for continuous data since data values rarely repeat.
• Best for attribute data or a discrete variable with a small range (e.g., Likert scale).

4A-28 Central Tendency: Example: Price/Earnings Ratios and Mode
• Consider the following P/E ratios for a random sample of 68 Standard & Poor’s 500 stocks.

7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91

• What is the mode?

4A-29 Central Tendency: Symptoms of Skewness

| Distribution’s Shape | Histogram Appearance | Statistics |
| Skewed left (negative skewness) | Long tail of histogram points left (a few low values but most data on right) | Mean < Median |
| Symmetric | Tails of histogram are balanced (low/high values offset) | Mean ≈ Median |
| Skewed right (positive skewness) | Long tail of histogram points right (most data on left but a few high values) | Mean > Median |

4A-30 Central Tendency: Geometric Mean
• The geometric mean (G) is a multiplicative average: $G = \sqrt[n]{x_1 x_2 \cdots x_n}$
• For the J. D. Power quality data (n = 37):
$G = \sqrt[37]{(87)(93)(98)\cdots(164)(173)} = \sqrt[37]{2.37667 \times 10^{77}} = 123.38$
• In Excel use =GEOMEAN(Array)
• The geometric mean tends to mitigate the effects of high outliers.

4A-31 Central Tendency: Midrange
• The midrange is the point halfway between the lowest and highest values of X.
• Easy to use but sensitive to extreme data values.
$\text{Midrange} = \dfrac{x_{\min} + x_{\max}}{2}$
• For the J. D. Power quality data (n = 37):
$\text{Midrange} = \dfrac{x_{\min} + x_{\max}}{2} = \dfrac{x_1 + x_{37}}{2} = \dfrac{87 + 173}{2} = 130$
• Here, the midrange (130) is higher than the mean (125.38) or median (121).

4A-32 Central Tendency: Trimmed Mean
• To calculate the trimmed mean, first remove the highest and lowest k percent of the observations.
• For example, for the n = 68 P/E ratios, we want a 5 percent trimmed mean (i.e., k = .05).
• To determine how many observations to trim, multiply k × n = 0.05 × 68 = 3.4, which is rounded down to 3 observations.
• So, we would remove the three smallest and three largest observations before averaging the remaining values.

4A-33 Central Tendency: Trimmed Mean
• Here is a summary of all the measures of central tendency for the n = 68 P/E values.

| Measure | Value | Excel Formula |
| Mean | 22.72 | =AVERAGE(PERatio) |
| Median | 19.00 | =MEDIAN(PERatio) |
| Mode | 13.00 | =MODE(PERatio) |
| Geometric Mean | 19.85 | =GEOMEAN(PERatio) |
| Midrange | 49.00 | =(MIN(PERatio)+MAX(PERatio))/2 |
| 5% Trim Mean | 21.10 | =TRIMMEAN(PERatio,0.1) |

• The trimmed mean mitigates the effects of very high values, but still exceeds the median.

4A-34 Dispersion
• Variation is the “spread” of data points about the center of the distribution in a sample. Consider the following measures of dispersion:

Measures of Variation

| Statistic | Formula | Excel | Pro | Con |
| Range | $x_{\max} - x_{\min}$ | =MAX(Data)-MIN(Data) | Easy to calculate. | Sensitive to extreme data values. |
| Variance ($s^2$) | $\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ | =VAR(Data) | Plays a key role in mathematical statistics. | Non-intuitive meaning. |

4A-35 Dispersion: Measures of Variation

| Statistic | Formula | Excel | Pro | Con |
| Standard deviation (s) | $\sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$ | =STDEV(Data) | Most common measure. Uses same units as the raw data (\$, £, ¥, etc.). | Non-intuitive meaning. |
| Coefficient of variation (CV) | $100 \times \dfrac{s}{\bar{x}}$ | None | Measures relative variation in percent so can compare data sets. | Requires nonnegative data. |

4A-36 Dispersion: Measures of Variation

| Statistic | Formula | Excel | Pro | Con |
| Mean absolute deviation (MAD) | $\dfrac{\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert}{n}$ | =AVEDEV(Data) | Easy to understand. | Lacks “nice” theoretical properties. |

4A-37 Dispersion: Range
• The difference between the largest and smallest observation: Range = xmax – xmin
• For example, for the n = 68 P/E ratios, Range = 91 – 7 = 84

4A-38 Dispersion: Variance
• The population variance ($\sigma^2$) is defined as the sum of squared deviations around the mean $\mu$ divided by the population size:
$\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• For the sample variance ($s^2$), we divide by n – 1 instead of n, otherwise $s^2$ would tend to underestimate the unknown population variance $\sigma^2$:
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

4A-39 Dispersion: Standard Deviation
• The square root of the variance.
• Explains how individual values in a data set vary from the mean.
• Units of measure are the same as X.
• Population standard deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
• Sample standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

4A-40 Dispersion: Standard Deviation
• Excel’s built-in functions are:

| Statistic | Excel population formula | Excel sample formula |
| Variance | =VARP(Array) | =VAR(Array) |
| Standard deviation | =STDEVP(Array) | =STDEV(Array) |

4A-41 Dispersion: Calculating a Standard Deviation
• Consider the following five quiz scores for Stephanie.

4A-42 Dispersion: Calculating a Standard Deviation
• Now, calculate the sample standard deviation:
$s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{2380}{5 - 1}} = \sqrt{595} = 24.39$
• Somewhat easier, the two-sum formula can also be used:
$s = \sqrt{\dfrac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}} = \sqrt{\dfrac{28300 - \dfrac{(360)^2}{5}}{5 - 1}} = \sqrt{\dfrac{28300 - 25920}{5 - 1}} = \sqrt{595} = 24.39$

4A-43 Dispersion: Calculating a Standard Deviation
• The standard deviation is nonnegative because deviations around the mean are squared.
• When every observation is exactly equal to the mean, the standard deviation is zero.
• Standard deviations can be large or small, depending on the units of measure.
• Compare standard deviations only for data sets measured in the same units and only if the means do not differ substantially.
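
The Excel functions listed on the slides have close analogues in Python's standard `statistics` module. The sketch below is one possible way to reproduce the P/E summary from slide 4A-33 and to compare the two standard deviation formulas from slide 4A-42. The helpers `midrange`, `trimmed_mean`, `std_definitional` and `std_two_sum` are illustrative names, not from the slides. Stephanie's five scores are not shown in this text, so `quiz_scores` is an assumed data set chosen only because it matches the totals on slide 4A-42 (sum 360, sum of squares 28300).

```python
import math
import statistics

# P/E ratios for the 68 sampled stocks, copied from slide 4A-28.
pe_ratios = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]


def midrange(data):
    """Point halfway between the smallest and largest values."""
    return (min(data) + max(data)) / 2


def trimmed_mean(data, k):
    """Mean after dropping the lowest and highest fraction k of the values.
    The trim count k * n is truncated, e.g. 0.05 * 68 = 3.4 becomes 3."""
    ordered = sorted(data)
    cut = int(k * len(ordered))
    return statistics.mean(ordered[cut:len(ordered) - cut])


def std_definitional(data):
    """Sample standard deviation from squared deviations about the mean."""
    xbar = sum(data) / len(data)
    squared_devs = sum((x - xbar) ** 2 for x in data)
    return math.sqrt(squared_devs / (len(data) - 1))


def std_two_sum(data):
    """Sample standard deviation from the sum and the sum of squares."""
    n = len(data)
    numerator = sum(x * x for x in data) - sum(data) ** 2 / n
    return math.sqrt(numerator / (n - 1))


central = {
    "mean": statistics.mean(pe_ratios),
    "median": statistics.median(pe_ratios),
    "mode": statistics.mode(pe_ratios),
    "geometric mean": statistics.geometric_mean(pe_ratios),
    "midrange": midrange(pe_ratios),
    "5% trimmed mean": trimmed_mean(pe_ratios, 0.05),
}
for name, value in central.items():
    print(f"{name}: {value:.2f}")

# ASSUMPTION: hypothetical scores consistent with the slide's totals
# (sum = 360, sum of squares = 28300); the actual scores are not shown.
quiz_scores = [40, 55, 75, 95, 95]
print(f"definitional s: {std_definitional(quiz_scores):.2f}")
print(f"two-sum s: {std_two_sum(quiz_scores):.2f}")
```

Note that `trimmed_mean` takes the fraction trimmed from each tail (0.05), whereas Excel's =TRIMMEAN takes the total fraction trimmed from both tails (0.1), which is why the slide passes 0.1 for a 5 percent trimmed mean.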