4B-1 Chapter 4B: Descriptive Statistics (Part 2)
Standardized Data
Percentiles and Quartiles
Box Plots
McGraw-Hill/Irwin © 2008 The McGraw-Hill Companies, Inc. All rights reserved.

4B-3 Standardized Data
Chebyshev's Theorem
• Developed by mathematicians Jules Bienaymé (1796-1878) and Pafnuty Chebyshev (1821-1894).
• For any population with mean $\mu$ and standard deviation $\sigma$, the percentage of observations that lie within $k$ standard deviations of the mean (for $k > 1$) must be at least $100[1 - 1/k^2]$.

4B-4 Standardized Data
Chebyshev's Theorem
• For $k = 2$ standard deviations, $100[1 - 1/2^2] = 75\%$.
• So, at least 75.0% will lie within $\mu \pm 2\sigma$.
• For $k = 3$ standard deviations, $100[1 - 1/3^2] = 88.9\%$.
• So, at least 88.9% will lie within $\mu \pm 3\sigma$.
• Although applicable to any data set, these limits tend to be too wide to be useful.

4B-5 Standardized Data
The Empirical Rule
• The normal or Gaussian distribution was named for Karl Gauss (1777-1855).
• The normal distribution is symmetric and is also known as the bell-shaped curve.
• The Empirical Rule states that for data from a normal distribution, we expect that for
  $k = 1$, about 68.26% will lie within $\mu \pm 1\sigma$
  $k = 2$, about 95.44% will lie within $\mu \pm 2\sigma$
  $k = 3$, about 99.73% will lie within $\mu \pm 3\sigma$

4B-6 Standardized Data
The Empirical Rule
• Distance from the mean is measured in terms of the number of standard deviations.
• Note: no upper bound is given. Data values outside $\mu \pm 3\sigma$ are rare.

4B-7 Standardized Data
Example: Exam Scores
• If 80 students take an exam, how many will score within 2 standard deviations of the mean?
• Assuming exam scores follow a normal distribution, the Empirical Rule states that about 95.44% will lie within $\mu \pm 2\sigma$, so $95.44\% \times 80 \approx 76$ students will score within $\pm 2\sigma$ of $\mu$.
• How many students will score more than 2 standard deviations from the mean?

4B-8 Standardized Data
Unusual Observations
• Unusual observations are those that lie beyond $\mu \pm 2\sigma$.
• Outliers are observations that lie beyond $\mu \pm 3\sigma$.
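
The Chebyshev bound and the Empirical Rule percentages above can be compared with a short calculation. This is a minimal Python sketch, not part of the original slides; the function names `chebyshev_bound` and `normal_coverage` are hypothetical. The normal coverage is computed exactly from the error function, so it gives 68.27% and 95.45% where the slides quote the table-rounded 68.26% and 95.44%.

```python
import math


def chebyshev_bound(k: float) -> float:
    """Minimum percentage of observations within k standard deviations (any distribution)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is informative only for k > 1")
    return 100 * (1 - 1 / k**2)


def normal_coverage(k: float) -> float:
    """Percentage of a normal distribution lying within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))


for k in (2, 3):
    print(f"k = {k}: Chebyshev at least {chebyshev_bound(k):.1f}%, "
          f"normal about {normal_coverage(k):.2f}%")

# Exam example: 80 students, scores assumed normal, within 2 standard deviations.
students = 80
within_2sd = round(students * normal_coverage(2) / 100)
print(f"About {within_2sd} of {students} students score within 2 standard deviations")
```

The gap between the two columns (75% versus about 95% at k = 2) illustrates why the Chebyshev limits are described as too wide to be useful when the data are roughly normal.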
4B-9 Standardized Data
Unusual Observations
• For example, the P/E ratio data contains several large data values. Are they unusual or outliers?
  7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
  14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
  19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
  26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91

4B-10 Standardized Data
The Empirical Rule
• If the sample came from a normal distribution, then the Empirical Rule states
  $\bar{x} \pm 1s = 22.72 \pm 1(14.08) = (8.6, 36.8)$
  $\bar{x} \pm 2s = 22.72 \pm 2(14.08) = (-5.4, 50.9)$
  $\bar{x} \pm 3s = 22.72 \pm 3(14.08) = (-19.5, 65.0)$

4B-11 Standardized Data
The Empirical Rule
• Are there any unusual values or outliers?
[Figure: number line from 7 to 91 marking $\bar{x} = 22.72$ and the limits -19.5, -5.4, 8.6, 36.8, 50.9, and 65.0. Values beyond $\bar{x} \pm 2s$ are labeled unusual (55) and values beyond $\bar{x} \pm 3s$ are labeled outliers (68 and 91).]

4B-12 Standardized Data
Defining a Standardized Variable
• A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.
  Standardization formula for a population: $z_i = \frac{x_i - \mu}{\sigma}$
  Standardization formula for a sample: $z_i = \frac{x_i - \bar{x}}{s}$

4B-13 Standardized Data
Defining a Standardized Variable
• $z_i$ tells how far away the observation is from the mean.
• For example, for the P/E data, the first value is $x_1 = 7$. The associated z value is
  $z_1 = \frac{x_1 - \bar{x}}{s} = \frac{7 - 22.72}{14.08} = -1.12$

4B-14 Standardized Data
Defining a Standardized Variable
• A negative z value means the observation is below the mean.
• Positive z means the observation is above the mean. For $x_{68} = 91$,
  $z_{68} = \frac{x_{68} - \bar{x}}{s} = \frac{91 - 22.72}{14.08} = 4.85$

4B-15 Standardized Data
Defining a Standardized Variable
• Here are the standardized z values for the P/E data:
• What do you conclude for these four values?

4B-16 Standardized Data
Defining a Standardized Variable
• MegaStat calculates standardized values as well as checks for outliers.
• In Excel, use =STANDARDIZE(Array, Mean, STDev) to calculate a standardized z value.

4B-17 Standardized Data
Outliers
• What do we do with outliers in a data set?
• If due to erroneous data, then discard.
• An outrageous observation (one completely outside of an expected range) is certainly invalid.
• Recognize unusual data points and outliers and their potential impact on your study.
• Research books and articles on how to handle outliers.

4B-18 Standardized Data
Estimating Sigma
• For a normal distribution, the range of values is $6\sigma$ (from $\mu - 3\sigma$ to $\mu + 3\sigma$).
• If you know the range $R$ (high - low), you can estimate the standard deviation as $\sigma = R/6$.
• Useful for approximating the standard deviation when only $R$ is known.
• This estimate depends on the assumption of normality.

4B-19 Percentiles and Quartiles
Percentiles
• Percentiles are data that have been divided into 100 groups.
• For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.
• Deciles are data that have been divided into 10 groups.
• Quintiles are data that have been divided into 5 groups.
• Quartiles are data that have been divided into 4 groups.

4B-20 Percentiles and Quartiles
Percentiles
• Percentiles are used to establish benchmarks for comparison purposes (e.g., health care, manufacturing, and banking industries use the 5th, 25th, 50th, 75th, and 90th percentiles).
• Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.
• Percentiles are used in employee merit evaluation and salary benchmarking.

4B-21 Percentiles and Quartiles
Quartiles
• Quartiles are scale points that divide the sorted data into four groups of approximately equal size.
  Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
• The three values that separate the four groups are called Q1, Q2, and Q3, respectively.

4B-22 Percentiles and Quartiles
Quartiles
• The second quartile Q2 is the median, an important indicator of central tendency.
  Lower 50% | Q2 | Upper 50%
• Q1 and Q3 measure dispersion since the interquartile range $Q_3 - Q_1$ measures the degree of spread in the middle 50 percent of data values.
  Lower 25% | Q1 | Middle 50% | Q3 | Upper 25%

4B-23 Percentiles and Quartiles
Quartiles
• The first quartile Q1 is the median of the data values below Q2, and the third quartile Q3 is the median of the data values above Q2.
  Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
  For the first half of the data, 50% lie above and 50% below Q1.
  For the second half of the data, 50% lie above and 50% below Q3.

4B-24 Percentiles and Quartiles
Quartiles
• Depending on n, the quartiles Q1, Q2, and Q3 may be members of the data set or may lie between two of the sorted data values.

4B-25 Percentiles and Quartiles
Method of Medians
• For small data sets, find quartiles using the method of medians:
  Step 1. Sort the observations.
  Step 2. Find the median Q2.
  Step 3. Find the median of the data values that lie below Q2.
  Step 4. Find the median of the data values that lie above Q2.

4B-26 Percentiles and Quartiles
Excel Quartiles
• Use the Excel function =QUARTILE(Array, k) to return the kth quartile.
• Excel treats quartiles as a special case of percentiles. For example, to calculate Q3, use either
  =QUARTILE(Array, 3)
  =PERCENTILE(Array, 0.75)
• Excel calculates the quartile positions as:
  Quartile | Position
  Q1 | 0.25n + 0.75
  Q2 | 0.50n + 0.50
  Q3 | 0.75n + 0.25

4B-27 Percentiles and Quartiles
Example: P/E Ratios and Quartiles
• Consider the following P/E ratios for 68 stocks in a portfolio.
  7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
  14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
  19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
  26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
• Use quartiles to define benchmarks for stocks that are low-priced (bottom quartile) or high-priced (top quartile).
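
The method of medians and the Excel position formulas described above can be sketched in Python and applied to the 68 P/E ratios. This is an illustrative sketch, not the textbook's or Excel's own code; the function names are hypothetical. It assumes that "below Q2" and "above Q2" refer to positions in the sorted list, and that the middle value is left out of both halves when n is odd.

```python
from statistics import median

# The 68 P/E ratios from the example, already sorted.
PE = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]


def quartiles_by_medians(data):
    """Return (Q1, Q2, Q3) using the method of medians."""
    xs = sorted(data)                 # Step 1: sort
    n = len(xs)
    lower = xs[:n // 2]               # values positioned below the median
    upper = xs[n // 2 + n % 2:]       # values positioned above the median
    return median(lower), median(xs), median(upper)


def quartile_by_position(data, k):
    """Return the k-th quartile (k = 1, 2, 3) by interpolating at the
    1-based position (k/4)*n + (1 - k/4), e.g. 0.25n + 0.75 for Q1."""
    xs = sorted(data)
    n = len(xs)
    p = k / 4
    pos = p * n + (1 - p)
    lo = int(pos)                     # 1-based index of the lower neighbor
    frac = pos - lo                   # fractional distance to the next value
    if lo >= n:                       # position falls on the last observation
        return xs[-1]
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])


print(quartiles_by_medians(PE))
print([quartile_by_position(PE, k) for k in (1, 2, 3)])
```

For these data both approaches give Q1 = 14, Q2 = 19, and Q3 = 26, because the neighboring sorted values at each quartile position are identical.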
4B-28 Percentiles and Quartiles
Example: P/E Ratios and Quartiles
• Using Excel's method of interpolation, the quartile positions are:
  Quartile | Position formula | Interpolate between
  Q1 | 0.25(68) + 0.75 = 17.75 | $X_{17}$ and $X_{18}$
  Q2 | 0.50(68) + 0.50 = 34.50 | $X_{34}$ and $X_{35}$
  Q3 | 0.75(68) + 0.25 = 51.25 | $X_{51}$ and $X_{52}$

4B-29 Percentiles and Quartiles
Example: P/E Ratios and Quartiles
• The quartiles are:
  Quartile | Formula
  First (Q1) | $Q_1 = X_{17} + 0.75(X_{18} - X_{17}) = 14 + 0.75(14 - 14) = 14$
  Second (Q2) | $Q_2 = X_{34} + 0.50(X_{35} - X_{34}) = 19 + 0.50(19 - 19) = 19$
  Third (Q3) | $Q_3 = X_{51} + 0.25(X_{52} - X_{51}) = 26 + 0.25(26 - 26) = 26$

4B-30 Percentiles and Quartiles
Example: P/E Ratios and Quartiles
• So, to summarize:
  Lower 25% of P/E ratios | Q1 = 14 | Second 25% of P/E ratios | Q2 = 19 | Third 25% of P/E ratios | Q3 = 26 | Upper 25% of P/E ratios
• These quartiles express central tendency and dispersion. What is the interquartile range?
• Because of clustering of identical data values, these quartiles do not provide clean cut points between groups of observations.

4B-31 Percentiles and Quartiles
Tip
• Whether you use the method of medians or Excel, your quartiles will be about the same. Small differences in calculation techniques typically do not lead to different conclusions in business applications.

4B-32 Percentiles and Quartiles
Caution
• Quartiles generally resist outliers.
• However, quartiles do not provide clean cut points in the sorted data, especially in small samples with repeating data values.
  Data set A: 1, 2, 4, 4, 8, 8, 8, 8 gives Q1 = 3, Q2 = 6, Q3 = 8
  Data set B: 0, 3, 3, 6, 6, 6, 10, 15 gives Q1 = 3, Q2 = 6, Q3 = 8
• Although they have identical quartiles, these two data sets are not similar. The quartiles do not represent either data set well.

4B-33 Percentiles and Quartiles
Dispersion Using Quartiles
• Some robust measures of central tendency and dispersion using quartiles are:
  Statistic | Formula | Excel | Pro | Con
  Midhinge | $\frac{Q_1 + Q_3}{2}$ | =0.5*(QUARTILE(Data,1)+QUARTILE(Data,3)) | Robust to presence of extreme data values. | Less familiar to most people.

4B-34 Percentiles and Quartiles
Dispersion Using Quartiles
  Statistic | Formula | Excel | Pro | Con
  Midspread | $Q_3 - Q_1$ | =QUARTILE(Data,3)-QUARTILE(Data,1) | Stable when extreme data values exist. | Ignores magnitude of extreme data values.
  Coefficient of quartile variation (CQV) | $100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$ | None | Relative variation in percent so we can compare data sets. | Less familiar to nonstatisticians.

4B-35 Percentiles and Quartiles
Midhinge
• The mean of the first and third quartiles:
  $\text{Midhinge} = \frac{Q_1 + Q_3}{2}$
• For the 68 P/E ratios,
  $\text{Midhinge} = \frac{Q_1 + Q_3}{2} = \frac{14 + 26}{2} = 20$
• A robust measure of central tendency since quartiles ignore extreme values.

4B-36 Percentiles and Quartiles
Midspread (Interquartile Range)
• A robust measure of dispersion:
  $\text{Midspread} = Q_3 - Q_1$
• For the 68 P/E ratios,
  $\text{Midspread} = Q_3 - Q_1 = 26 - 14 = 12$

4B-37 Percentiles and Quartiles
Coefficient of Quartile Variation (CQV)
• Measures relative dispersion by expressing the midspread as a percent of $Q_3 + Q_1$ (twice the midhinge):
  $\text{CQV} = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$
• For the 68 P/E ratios,
  $\text{CQV} = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1} = 100 \times \frac{26 - 14}{26 + 14} = 30.0\%$
• Similar to the CV, the CQV can be used to compare data sets measured in different units or with different means.

4B-38 Box Plots
• A useful tool of exploratory data analysis (EDA).
• Also called a box-and-whisker plot.
• Based on a five-number summary: Xmin, Q1, Q2, Q3, Xmax
• Consider the five-number summary for the 68 P/E ratios:
  Xmin | Q1 | Q2 | Q3 | Xmax
  7 | 14 | 19 | 26 | 91

4B-39 Box Plots
[Figure: box plot of a right-skewed data set, labeling the minimum, Q1, the median (Q2), Q3, the maximum, the box, and the whiskers. The center of the box is the midhinge.]

4B-40 Box Plots
Fences and Unusual Data Values
• Use quartiles to detect unusual data points.
• The cutoff points are called fences and can be found using the following formulas:
  Fence | Inner fences | Outer fences
  Lower fence | $Q_1 - 1.5(Q_3 - Q_1)$ | $Q_1 - 3.0(Q_3 - Q_1)$
  Upper fence | $Q_3 + 1.5(Q_3 - Q_1)$ | $Q_3 + 3.0(Q_3 - Q_1)$
• Values outside the inner fences are unusual while those outside the outer fences are outliers.
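
The quartile-based measures and the fence formulas above can be sketched in Python. This is a minimal illustration under the slides' definitions, using the quartiles Q1 = 14 and Q3 = 26 found for the P/E ratios; the function names `fences` and `classify` are hypothetical.

```python
# The 68 P/E ratios from the example.
PE = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]
Q1, Q3 = 14, 26  # quartiles of the P/E ratios

midhinge = (Q1 + Q3) / 2                 # robust center
midspread = Q3 - Q1                      # interquartile range
cqv = 100 * (Q3 - Q1) / (Q3 + Q1)        # relative dispersion in percent


def fences(q1, q3):
    """Return ((inner_lower, inner_upper), (outer_lower, outer_upper))."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer


def classify(data, q1, q3):
    """Split data into unusual values (outside the inner fences but not the
    outer fences) and outliers (outside the outer fences)."""
    inner, outer = fences(q1, q3)
    outliers = [x for x in data if x < outer[0] or x > outer[1]]
    unusual = [x for x in data
               if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]]
    return unusual, outliers


print(midhinge, midspread, cqv)          # 20.0 12 30.0
print(fences(Q1, Q3))                    # ((-4.0, 44.0), (-22.0, 62.0))
print(classify(PE, Q1, Q3))              # ([45, 48, 55], [68, 91])
```

Keeping the unusual values and the outliers in separate lists mirrors how a box plot draws them as separate groups of dots beyond the truncated whisker.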
4B-41 Box Plots
Fences and Unusual Data Values
• For example, consider the P/E ratio data:
  Fence | Inner fences | Outer fences
  Lower fence | $14 - 1.5(26 - 14) = -4$ | $14 - 3.0(26 - 14) = -22$
  Upper fence | $26 + 1.5(26 - 14) = +44$ | $26 + 3.0(26 - 14) = +62$
• Ignore the lower fence since it is negative and P/E ratios are only positive.

4B-42 Box Plots
Fences and Unusual Data Values
• Truncate the whisker at the fences and display unusual values and outliers as dots.
[Figure: box plot of the P/E ratios with the inner fence and outer fence marked; unusual values and outliers are plotted as dots.]
• Based on these fences, there are three unusual P/E values and two outliers.

4B-43 Grouped Data
Nature of Grouped Data
• Although some information is lost, grouped data are easier to display than raw data.
• When bin limits are given, the mean and standard deviation can be estimated.
• Accuracy of grouped estimates depends on
  - the number of bins
  - distribution of data within bins
  - bin frequencies
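
One common way to estimate the mean and standard deviation from grouped data is to treat every observation as if it sat at the midpoint of its bin. The slides do not spell out a formula, so the sketch below is an assumption-laden illustration rather than the chapter's own method: the bin width of 10, the bin edges starting at 0, and the function name `grouped_estimates` are all chosen here for demonstration, and the 68 P/E ratios from the earlier examples are reused as raw data.

```python
import math
from collections import Counter

# The 68 P/E ratios from the earlier examples.
PE = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]
BIN_WIDTH = 10  # assumed bin width, for illustration only


def grouped_estimates(data, width):
    """Estimate the mean and sample standard deviation from bin counts,
    using bins [0, width), [width, 2*width), ... and their midpoints."""
    counts = Counter(x // width for x in data)      # bin index -> frequency
    n = sum(counts.values())
    mids = {j: (j + 0.5) * width for j in counts}   # bin index -> midpoint
    mean = sum(f * mids[j] for j, f in counts.items()) / n
    var = sum(f * (mids[j] - mean) ** 2 for j, f in counts.items()) / (n - 1)
    return mean, math.sqrt(var)


g_mean, g_sd = grouped_estimates(PE, BIN_WIDTH)
print(f"grouped mean = {g_mean:.2f}, grouped standard deviation = {g_sd:.2f}")
```

With these bins the grouped estimates come out at roughly 23.2 and 14.3, compared with 22.72 and 14.08 from the raw data, which shows the small loss of information from grouping. A different number of bins, or data clustered away from the bin midpoints, would change the size of that gap.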