Chapter 3: Numerically Summarizing Data

3.1 Measures of Central Tendency
3.2 Measures of Dispersion
3.3 Measures of Central Tendency and Dispersion from Grouped Data
3.4 Measures of Position
3.5 The Five-Number Summary and Boxplots

September 25, 2008

The Mean of a Set

Suppose we have a set of numerical values $\{x_1, x_2, \ldots, x_n\}$ in an observation of a population. We define the mean of this set to be the number

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}.$$

Hence, the mean is the sum of the observations divided by the number of observations.

Section 3.1

Remark

Your book distinguishes between two types of means: population and sample.

Population: $x_1, x_2, \ldots, x_N$, with $\mu = \frac{1}{N}\sum_{j=1}^{N} x_j$.
Sample: $x_1, x_2, \ldots, x_n$, with $\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j$, $n \le N$.

Note that they are both calculated in exactly the same way.

The Median of a Set

Consider a set of numerical observations, $x_1, x_2, \ldots, x_n$, such that $x_1 \le x_2 \le \cdots \le x_n$.
If $n$ is an odd integer, then the median of the set is the data point $x_{(n-1)/2+1}$.
If $n$ is an even integer, then the median of the set is the number $\frac{x_{n/2} + x_{n/2+1}}{2}$.
In other words, the median is the midpoint of the observations when they are ordered from smallest to largest or vice versa.

Example 1

Find the mean and median of the set of observations: {20, -3, 4, 10, 6, -1}.
Here $n = 6$. Therefore, mean $= \bar{x} = \frac{20 - 3 + 4 + 10 + 6 - 1}{6} = \frac{36}{6} = 6$.
Since $n = 6$ is an even integer, the median is the average of the two "central" data points after ordering. We reorder the set: -3, -1, 4, 6, 10, 20. Then median $= \frac{4 + 6}{2} = \frac{10}{2} = 5$.

Example 2

Find the mean and median of the set of observations: {-10, -6, 0, 4, 9}.
Here $n = 5$. Therefore, mean $= \bar{x} = \frac{-10 - 6 + 0 + 4 + 9}{5} = -\frac{3}{5} = -0.6$.
Since $n = 5$ is an odd integer, the median is the "central" data point after ordering. The set is already ordered. Then median $= 0$.

Mean and Dot Plot

Consider the set of observations: 1, 1, 2, 3, 5, 5, 5, 6, 7, 7. The mean of this set is

$$\bar{x} = \frac{1 + 1 + 2 + 3 + 5 + 5 + 5 + 6 + 7 + 7}{10} = \frac{42}{10} = 4.2.$$

Next we construct the dot plot of this set and show the location of the mean on it.
Notice that the mean is a fulcrum for the distribution of point masses on the lever (x-axis).

Add Points (“Weights”)

Suppose that we add two new points to the set 1, 1, 2, 3, 5, 5, 5, 6, 7, 7; namely, -2 and -4. The mean of the new set is

$$\bar{x} = \frac{-4 - 2 + 1 + 1 + 2 + 3 + 5 + 5 + 5 + 6 + 7 + 7}{12} = \frac{36}{12} = 3.$$

Next we construct the dot plot of this set and show the location of the mean on it. The fulcrum has moved 1.2 units to the left (from 4.2 to 3).

Shape, Mean and Median

Right-skewed: median < mean
Left-skewed: mean < median
Symmetric: mean = median

Outlier

• An outlier is an observation (data point) that falls well above or below the overall set of data.
• The mean can be highly influenced by an outlier.
• The median is said to be resistant to outliers, i.e., its value is not changed significantly by the addition or removal of an outlier.

Example

Consider the two sets $S_1 = \{1, 3, 5, 6, 7\}$ and $S_2 = \{1, 3, 5, 6, 7, 25\}$. The point 25 in $S_2$ is an outlier. We compute the mean and median for each set. For $S_1$, the mean is 4.4 and the median is 5. For $S_2$, the mean is approximately 7.8 and the median is 5.5.

Mode

• The mode is the most frequent observation of the variable.
• It is most often used with categorical data.
• For numerical data, it can be used when the data is discrete.

Color   Count
Black   20
White   10
Red     35
Blue    15
Green   10
Other   20

The mode of the categorical variable color is Red (with a count of 35).

Example

Mia Hamm, who retired at the 2004 Olympics, is considered to be the most prolific player in international soccer. Here is a list of the number of goals scored over her 18-year career.

MHG = {0, 0, 0, 4, 10, 1, 10, 10, 19, 9, 18, 20, 13, 13, 2, 7, 8, 13}.

Considering the population as the number of goals scored by Mia Hamm, find the mean, median, and mode of this set.
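A computation like this can be checked with Python's standard `statistics` module. The following is only a minimal sketch and is not part of the original slides; the variable name `mhg` is ours. Note that `multimode` returns every value that attains the highest frequency, which matters when a set has more than one mode.

```python
import statistics

# Goals per year over Mia Hamm's 18-year career (the set MHG from the example).
mhg = [0, 0, 0, 4, 10, 1, 10, 10, 19, 9, 18, 20, 13, 13, 2, 7, 8, 13]

mean = statistics.mean(mhg)        # sum of the observations divided by n
median = statistics.median(mhg)    # n is even, so this averages the two central values
modes = statistics.multimode(mhg)  # all values tied for the highest frequency

print(round(mean, 4), median, modes)
```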
Sorted: MHG = {0, 0, 0, 1, 2, 4, 7, 8, 9, 10, 10, 10, 13, 13, 13, 18, 19, 20}

mean: $\mu = \frac{1}{18}\sum_{j=1}^{18} x_j = \frac{157}{18} \approx 8.7222$
median: $\frac{9 + 10}{2} = \frac{19}{2} = 9.5$
mode: the values 0, 10, and 13 each occur 3 times, so the set has three modes.

Mean, Median and Mode and Distribution Shape

Measures of Dispersion

Consider the following sets of observations:
S1 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
S2 = {-5, -4, -3, -2, -1, 1, 2, 3, 4, 5}.
Both sets have the same mean and median (namely, 0). However, their histograms and dot plots are quite different. Notice that the difference between the smallest and largest number is very different in the two sets.

Section 3.2

Range of a Set of Observations

Consider a set of numerical observations: $S = \{x_1, x_2, \ldots, x_n\}$. Let $a = \min_{1 \le i \le n} x_i$ and $b = \max_{1 \le i \le n} x_i$. The number $r = b - a$ is called the range of the set. It is a measure of how spread out the observations are.

Example: S = {3, 1, 9, 2, -4, 6, 8, 1, 9, 8, 9}.
min S = -4
max S = 9
r = 9 - (-4) = 13

Remark: The range is completely determined by only two points of the set of observations.

Example

Lance Armstrong won the Tour de France seven consecutive times (1999-2005). Here is data about his victories.

Year   Winning Time (h)   Distance (km)   Winning Speed (km/h)   Winning Margin (min)
1999   91.538             3687            40.28                  7.617
2000   92.552             3662            39.46                  6.033
2001   86.291             3453            40.02                  6.733
2002   82.087             3278            39.93                  7.283
2003   83.687             3427            40.94                  1.017
2004   83.601             3391            40.56                  6.317
2005   86.251             3593            41.65                  4.667

The ranges for each category are:
Winning Time: range = 92.552 - 82.087 = 10.465
Distance: range = 3687 - 3278 = 409
Winning Speed: range = 41.65 - 39.46 = 2.19
Winning Margin: range = 7.617 - 1.017 = 6.600

The Spread of Quantitative Data

Consider the frequency distributions of two different data sets. Notice how the tails of each distribution change from being close together to being far apart.

The Deviation from the Mean

Consider a set of numerical observations: $S = \{x_1, x_2, \ldots, x_n\}$. Let $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ be the mean.
If $z \in S$, e.g., $z = x_j$, then the deviation of $z$ is defined to be the number $v = z - \bar{x}$. If $z > \bar{x}$, then $v > 0$; if $z < \bar{x}$, then $v < 0$.

Example: S = {-2, 0, 1, 3, 4}

$\bar{x} = \frac{-2 + 0 + 1 + 3 + 4}{5} = \frac{6}{5}$
$z = -2 \Rightarrow v = -2 - \frac{6}{5} = -\frac{16}{5}$
$z = 1 \Rightarrow v = 1 - \frac{6}{5} = -\frac{1}{5}$
$z = 4 \Rightarrow v = 4 - \frac{6}{5} = \frac{14}{5}$

Variance and Standard Deviation

Definition: The “average” of the squares of all deviations in a sample is called the variance of the sample. The standard deviation of a sample is defined as the square root of the variance.

$v_i = x_i - \bar{x}$ (deviation of $x_i$)

$$s^2 = \frac{\sum_{i=1}^{n} v_i^2}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \quad \text{(variance)}$$

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \quad \text{(standard deviation)}$$

Question: Why $n - 1$ instead of $n$ in these formulas?

Remark

There is an unfortunate ambiguity in how the words variance and standard deviation are used. These quantities are computed in different ways, depending on whether the set under consideration is a population or a sample of a population. It turns out that if we use the formulas for variance and standard deviation where we divide by $n$ instead of $n - 1$, then the standard deviation of the sample will consistently underestimate the standard deviation of the population. This is called bias. Hence, we will sometimes use the following definitions and will distinguish between sample standard deviation and population standard deviation.

Population:

$$\text{variance of population: } \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}, \qquad \text{standard deviation of population: } \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$

Sample:

$$\text{variance of sample: } s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}, \qquad \text{standard deviation of sample: } s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

Example

For the set of observations (sample) {0, -3, 10, 7, 5, -3, 0},
• Find the range of the sample.
• Find the mean and median of the sample.
• Find the variance of the sample.
• Find the standard deviation of the sample.
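The population and sample formulas differ only in the divisor. The sketch below spells this out in plain Python for the sample in the example; the function names `deviations`, `variance`, and `std_dev` are ours and are meant only as an illustration of the definitions, not as a library API.

```python
import math

def deviations(data):
    """Deviation of each observation from the mean: v_i = x_i - mean."""
    mean = sum(data) / len(data)
    return [x - mean for x in data]

def variance(data, sample=True):
    """Sum of squared deviations divided by n - 1 (sample) or n (population)."""
    n = len(data)
    divisor = n - 1 if sample else n
    return sum(v * v for v in deviations(data)) / divisor

def std_dev(data, sample=True):
    """Square root of the variance."""
    return math.sqrt(variance(data, sample))

obs = [0, -3, 10, 7, 5, -3, 0]
s2 = variance(obs)                   # sample variance, divisor n - 1 = 6
s = std_dev(obs)                     # sample standard deviation
sigma = std_dev(obs, sample=False)   # population formula, divisor n = 7

print(round(s2, 4), round(s, 4), round(sigma, 4))
```

Dividing by n always gives the smaller of the two values, which is the underestimation the remark on bias refers to. The deviations themselves always sum to zero, which is why they are squared before averaging.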
min{0, -3, 10, 7, 5, -3, 0} = -3
max{0, -3, 10, 7, 5, -3, 0} = 10
r = 10 - (-3) = 13

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{0 - 3 + 10 + 7 + 5 - 3 + 0}{7} = \frac{16}{7}$

Ordered set: -3, -3, 0, 0, 5, 7, 10, so median = 0.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{1}{6}\left[\left(0 - \tfrac{16}{7}\right)^2 + \left(-3 - \tfrac{16}{7}\right)^2 + \cdots + \left(0 - \tfrac{16}{7}\right)^2\right] = \frac{544}{21} \approx 25.9$$

$$s = \sqrt{\frac{544}{21}} = 4\sqrt{\frac{34}{21}} \approx 5.1$$

Example

For the two sets of observations, S = {-1, 0, 0, 0, 1} and T = {-1, -1, -1, -1, 0, 1, 1, 1, 1},
• Find the mean and median for each set.
• Find the standard deviation for each set.

For the set S: $\bar{x} = 0$, median = 0, $s = \frac{1}{\sqrt{2}} \approx 0.71$.
For the set T: $\bar{x} = 0$, median = 0, $s = 1$.

We see from the dot plot that the set T has more points that vary from the mean and hence has a larger standard deviation.

Properties of the Standard Deviation

• The larger the spread (variation) in the data, the larger the standard deviation.
• The standard deviation is zero if and only if all elements of the set are the same, in which case the mean of the set is this common value.
• The standard deviation is influenced by outliers. This is true because the deviation from the mean of the set to the outlier is a large number in absolute value.
• The standard deviation yields more information than the range of the set. (Why?)

Example

The following data represents the walking time (in minutes) from the dorm or apartment to Professor Bisch’s course on operator algebras. We treat the nine students as the population of Prof. Bisch’s class.

Student   Time      Student   Time
T.S.      39        S.Q.      45
P.C.      21        E.W.      11
A.A.      9         T.B.      12
C.S.      32        G.W.      39
N.G.      30

(a) Find the population mean and standard deviation.
(b) Choose a sample of 4 and compute the mean and standard deviation of the sample.

population = {39, 21, 9, 32, 30, 45, 11, 12, 39}

$$\mu = \frac{\sum x_i}{9} = \frac{39 + 21 + 9 + 32 + 30 + 45 + 11 + 12 + 39}{9} = \frac{238}{9} \approx 26.444$$

$$\sigma = \sqrt{\frac{\sum_{j=1}^{9} \left(x_j - \frac{238}{9}\right)^2}{9}} = \frac{\sqrt{13358}}{9} \approx 12.8419$$

sample = {21, 30, 45, 12}

$$\bar{x} = \frac{\sum x_i}{4} = \frac{21 + 30 + 45 + 12}{4} = 27$$

$$s = \sqrt{\frac{\sum_{j=1}^{4} (x_j - 27)^2}{4 - 1}} = \sqrt{\frac{594}{3}} = \sqrt{198} \approx 14.0712$$

Bell-shaped (symmetric) Distributions

Consider a set of observations that is bell-shaped. All three distributions have different standard deviations.
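The walking-time example can also be checked with Python's standard `statistics` module, where `pstdev` uses the population formula (divide by n) and `stdev` uses the sample formula (divide by n - 1). This is a minimal sketch; the variable names are ours, and the sample is just one possible choice of 4 students.

```python
import statistics

# Walking times (minutes) for the nine students, treated as the population.
population = [39, 21, 9, 32, 30, 45, 11, 12, 39]
mu = statistics.mean(population)        # population mean, 238/9
sigma = statistics.pstdev(population)   # population formula: divide by n

# One possible sample of 4 students.
sample = [21, 30, 45, 12]
x_bar = statistics.mean(sample)         # sample mean
s = statistics.stdev(sample)            # sample formula: divide by n - 1

print(f"population: mean={mu:.3f}, sd={sigma:.4f}")
print(f"sample:     mean={x_bar:.3f}, sd={s:.4f}")
```

A different choice of 4 students would give a different sample mean and standard deviation, which is the point of part (b).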

Empirical Rule for Almost Bell-shaped Distributions

Let $\sigma$ denote the standard deviation of the distribution (population) and $\mu$ the mean of the distribution.
About 68% of the observations fall within the interval $(\mu - \sigma, \mu + \sigma)$.
About 95% of the observations fall within the interval $(\mu - 2\sigma, \mu + 2\sigma)$.
More than 99% of the observations fall within the interval $(\mu - 3\sigma, \mu + 3\sigma)$.

Caution

The Empirical Rule for bell-shaped distributions is an empirical law, not a fact. The closer the distribution is to being perfectly bell-shaped, the more accurate the rule. It is useful in telling us how the data is concentrated about the mean of the distribution.

Example

Consider the population: 1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9. Note that the histogram of this data is approximately bell-shaped. The mean and standard deviation of this set are

$$\mu = \frac{98}{19} \approx 5.2 \quad \text{and} \quad \sigma = \sqrt{\frac{85}{19}} \approx 2.1$$

(this value uses the divisor $n - 1$; dividing by $n$ gives approximately 2.06, which also rounds to 2.1). Hence,

$(\mu - \sigma, \mu + \sigma) = (3.1, 7.3)$ and $(\mu - 2\sigma, \mu + 2\sigma) = (1.0, 9.4)$.

Think of the area between the yellow lines. Each bar is of width 1 and hence its area is 1 times its height. For example, the total area of all bars is 19, and the area between the yellow lines in the first plot, counting the partial bars at each end fractionally, is about $\frac{25}{2}$. Then $\frac{25/2}{19} = \frac{25}{38} \approx 0.658$.

Detailed Empirical Rule

Example

The distribution of the length of bolts produced by the Acme Bolt Company is approximately bell-shaped with a mean of 4 inches and a standard deviation of 0.007 inches.
(a) What is the range of length for 68% of the bolts produced by this company?
(b) What percentage of bolts will be between 3.986 inches and 4.014 inches?
(c) If the company discards any bolts that are less than 3.986 inches or greater than 4.014 inches, what percentage of bolts will be discarded?
(d) What percentage of the bolts will be between 4.007 inches and 4.021 inches?

(a) $\mu = 4$ and $\sigma = 0.007$, so $(\mu - \sigma, \mu + \sigma) = (4 - 0.007, 4 + 0.007) = (3.993, 4.007)$. The Empirical Rule states that 68% of the bolts will lie in this interval.
(b) $(\mu - 2\sigma, \mu + 2\sigma) = (4 - 0.014, 4 + 0.014) = (3.986, 4.014)$. The Empirical Rule states that 95% of the bolts will lie in this interval.
(c) The bolts that lie outside of the interval (3.986, 4.014) comprise 100% - 95% = 5% of the bolts.
(d) $4.007 = 4 + 0.007 = \mu + \sigma$ and $4.021 = 4 + 0.021 = \mu + 3\sigma$. Hence, the interval is $(4.007, 4.021) = (\mu + \sigma, \mu + 3\sigma)$. The area under the distribution for this interval is 13.5% + 2.35% = 15.85%.

Chebyshev Inequality

Theorem: Consider a set of data, $S = \{x_1, x_2, \ldots, x_n\}$, with mean $\mu$ and standard deviation $\sigma$. Let $k > 1$ be any number (not necessarily an integer). Then at least $100\left(1 - \frac{1}{k^2}\right)\%$ of the points of $S$ will lie in the interval $(\mu - k\sigma, \mu + k\sigma)$.

Example: Suppose that a population has a mean of 73.5 and a standard deviation of 5.5. Find an interval that contains at least 75% of the data points in the population.

$$75 = 100\left(1 - \frac{1}{k^2}\right) \;\Rightarrow\; 0.75 = 1 - \frac{1}{k^2} \;\Rightarrow\; \frac{1}{k^2} = 0.25 \;\Rightarrow\; k^2 = 4 \;\Rightarrow\; k = 2$$

$(\mu - k\sigma, \mu + k\sigma) = (73.5 - 2 \cdot 5.5,\; 73.5 + 2 \cdot 5.5) = (62.5, 84.5)$

Example

In December 2004, the average price of regular unleaded gasoline excluding taxes in the United States was $1.37 per gallon. Researchers in the Department of Energy estimated that the standard deviation for this mean price was $0.05. Using Chebyshev’s Inequality, estimate the percentage of gasoline stations that had prices within 3 standard deviations of the mean. What percentage had prices within 2.5 standard deviations?

From Chebyshev's Inequality, at least $100\left(1 - \frac{1}{3^2}\right)\% = 100 \cdot \frac{8}{9}\% \approx 88.9\%$ of the gas stations were selling gas in the range $(\mu - 3\sigma, \mu + 3\sigma) = (1.22, 1.52)$.

From Chebyshev's Inequality, at least $100\left(1 - \frac{1}{2.5^2}\right)\% = 100 \cdot \frac{21}{25}\% = 84\%$ of the gas stations were selling gas in the range $(\mu - 2.5\sigma, \mu + 2.5\sigma) = (1.245, 1.495)$.

Remark

• Chebyshev’s Inequality does not place any preconditions on the shape of the data set.
• It is true for populations and samples.
• The theorem does not say that exactly $100\left(1 - \frac{1}{k^2}\right)\%$ of the points lie within $k$ standard deviations of the mean, but rather that at least this percentage do.

Mean and Standard Deviation for Grouped Data

Suppose that we have a set (sample or population), S, for which we have a histogram.
Let $x_1^b, x_2^b, \ldots, x_k^b$ be the midpoints of the bins for the histogram and let $f_1, f_2, \ldots, f_k$ be the frequencies for the $k$ bins. Then an approximation for the mean is given by

$$\frac{x_1^b f_1 + x_2^b f_2 + \cdots + x_k^b f_k}{f_1 + f_2 + \cdots + f_k}.$$

Section 3.3

Example

S = {1, -1, 1, 0, 1, 0, 2, 3, 1, 0, 2, 1}
$x_1^b = -0.5,\; x_2^b = 0.5,\; x_3^b = 1.5,\; x_4^b = 2.5,\; x_5^b = 3.5$
$f_1 = 1,\; f_2 = 3,\; f_3 = 5,\; f_4 = 2,\; f_5 = 1$

Mean of the raw data: $\bar{x} = \frac{11}{12} \approx 0.9167$

Approximation from the grouped data: $\frac{x_1^b f_1 + x_2^b f_2 + \cdots + x_5^b f_5}{f_1 + f_2 + \cdots + f_5} = \frac{17}{12} \approx 1.41667$

Example

S = {-0.233419, -1.74643, -1.17611, 0.115127, -0.387499, -0.243923, 0.935241, -1.40094, 1.00318, -0.29893, -1.1775, -1.05954, -1.75079, -0.570382, 1.78043, -0.890746, 0.274231, -1.88105, 0.431684, -1.52741, 1.05588, -0.122219, 1.14102, -0.00826077, 0.81772, -1.66893, -0.26497, -1.99627, -0.279399, 0.0530089, -1.15805, -1.72074, -1.93831, -1.45983, 1.0851, -0.532795, 0.0568446, -0.447141, 1.53799, 0.989186, 0.0532697, -0.178675, 1.68054, -0.0318339, -1.51951, 0.519102, -0.545774, -0.64818, -1.76854, -0.0157137, -1.56891, 1.55986, -1.37954, -1.81756, -0.357188, 0.430748, 1.49016, -1.32359, 0.503981, 1.88901, -0.690596, 0.457233, 1.29942, 0.431846, 0.538415, 1.48462, 0.979356, 1.18019, 1.30296, 1.50126, 1.75375, 0.281253, 0.917936, -1.57578, -1.93716, -0.876824, 1.87008, -1.8755, 0.117552, 0.851759, -1.47976, 0.37836, -0.826459, -1.94213, 1.21858, -1.91226, -0.0167282, -0.716761, -0.383359, 1.00214, 0.853372, 0.668228, 0.395186, 0.913779, -0.749079, -0.198149, 1.77186, 0.41528, -1.9636, -1.23352}

$n = 100$
$x_1^b = -1.75,\; x_2^b = -1.25,\; x_3^b = -0.75,\; \ldots,\; x_8^b = 1.75$
$f_1 = 18,\; f_2 = 10,\; f_3 = 10,\; f_4 = 16,\; f_5 = 14,\; f_6 = 12,\; f_7 = 11,\; f_8 = 9$

Mean of the raw data as listed: $\bar{x} \approx -0.1349$

Approximation from the grouped data: $\frac{x_1^b f_1 + x_2^b f_2 + \cdots + x_8^b f_8}{f_1 + f_2 + \cdots + f_8} = \frac{-13.5}{100} = -0.135$

Weighted Mean of a Set

Given a set of numbers, suppose that we believe that some of the numbers are more important than other numbers in the set. To reflect this notion, we define the weighted mean of a set of numbers.

Consider a set of numbers: $S = \{x_1, x_2, \ldots, x_n\}$. Furthermore, suppose each number of the set is assigned a weight: $w_1, w_2, \ldots, w_n$.
The weighted mean of the set with respect to its weights is defined as

$$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{j=1}^{n} w_j x_j}{\sum_{j=1}^{n} w_j}.$$

Example

Consider the set S = {-3, 1, 0, 3, -1, 1, 0} and the weights {1.5, 0, 1, -1, 1, 2, 1}. Find the weighted mean of this set with respect to the given weights.

$$\bar{x}_w = \frac{(1.5)(-3) + (0)(1) + (1)(0) + (-1)(3) + (1)(-1) + (2)(1) + (1)(0)}{1.5 + 0 + 1 - 1 + 1 + 2 + 1} = \frac{-6.5}{5.5} \approx -1.18182$$

Approximation for Standard Deviation and Variance for Grouped Data

Suppose that we have a set (sample or population), S, for which we have a histogram. Let $x_1^b, x_2^b, \ldots, x_k^b$ be the midpoints of the bins for the histogram and let $f_1, f_2, \ldots, f_k$ be the frequencies for the $k$ bins. Then approximations for the standard deviation (depending on population or sample) are given by

$$\sigma = \sqrt{\frac{\sum_{j=1}^{k} f_j \left(x_j^b - \mu\right)^2}{\sum_{j=1}^{k} f_j}} \quad \text{and} \quad s = \sqrt{\frac{\sum_{j=1}^{k} f_j \left(x_j^b - \bar{x}\right)^2}{\sum_{j=1}^{k} f_j - 1}}$$

where $\mu$ is the mean of the population and $\bar{x}$ is the mean of the sample.

Example

sample = {1, -1, 1, 0, 1, 0, 2, 3, 1, 0, 2, 1}
$x_1^b = -0.5,\; x_2^b = 0.5,\; x_3^b = 1.5,\; x_4^b = 2.5,\; x_5^b = 3.5$
$f_1 = 1,\; f_2 = 3,\; f_3 = 5,\; f_4 = 2,\; f_5 = 1$

$$\bar{x} = \frac{x_1^b f_1 + x_2^b f_2 + \cdots + x_5^b f_5}{f_1 + f_2 + \cdots + f_5} = 1.41667$$

$$s = \sqrt{\frac{\sum_{j=1}^{5} f_j \left(x_j^b - \bar{x}\right)^2}{\sum_{j=1}^{5} f_j - 1}} = \sqrt{\frac{\left(x_1^b - \bar{x}\right)^2 f_1 + \cdots + \left(x_5^b - \bar{x}\right)^2 f_5}{f_1 + f_2 + \cdots + f_5 - 1}} = 1.08362$$

Note: for the raw data, $\bar{x} = \frac{11}{12}$ and $s = \frac{\sqrt{155/33}}{2} \approx 1.08362$.

Approximating the Median of Grouped Data

In the problem section of 3.3 there is a formula for approximating the median of data that is given in frequency tables:

$$\text{median} = L_{\text{median}} + \frac{\frac{n}{2} - CF}{f_{\text{median}}} \, x^b_{\text{median}}$$

where
$L_{\text{median}}$ = lower limit of the bin that contains the median
$n$ = number of data points
$CF$ = cumulative frequency of the bin before the bin that contains the median
$f_{\text{median}}$ = frequency of the bin that contains the median
$x^b_{\text{median}}$ = width of the bin that contains the median

The bin that contains the median is the first bin whose cumulative frequency reaches $\frac{n}{2}$.

Example

Bin       Frequency   Cumulative Frequency
[0,10)    24          24
[10,20)   14          38
[20,30)   39          77
[30,40)   18          95
[40,50]   5           100

$\frac{n}{2} = 50$, so the median bin is the third bin, i.e., the data in the interval [20, 30).
$n = 100,\; L_{\text{median}} = 20,\; CF = 38,\; x^b_{\text{median}} = 10,\; f_{\text{median}} = 39$

$$\text{median} = 20 + \frac{50 - 38}{39} \cdot 10 = \frac{300}{13} \approx 23.0769$$

Measures of Position in a Distribution

• The mean and median give us information about the “center” of a set of observations (the distribution).
• The range and standard deviation give us information about the “spread” of the distribution.
• We now introduce a concept that describes the “position” within a distribution. It uses the concept of percentiles. Percentiles show how the distribution can be divided into parts (sometimes equal), which in turn gives us the notion of position within the distribution.

Section 3.4

z-score

Consider a population distribution with mean $\mu$ and standard deviation $\sigma$. Let $x$ be a point in the distribution. Then the z-score of $x$ in the population is defined as the number

$$z_{\text{population}} = \frac{x - \mu}{\sigma}.$$

Consider a sample with mean $\bar{x}$ and standard deviation $s$. Let $x$ be a point in the sample. Then the z-score of $x$ in the sample is defined as the number

$$z_{\text{sample}} = \frac{x - \bar{x}}{s}.$$

Remark: $z = \frac{x - \mu}{\sigma}$ means $x = \mu + z\sigma$, i.e., $x$ is $z$ standard deviations from the mean. When $z = 1$, $x = \mu + \sigma$. When $z = 2$, $x = \mu + 2\sigma$. When $z = -2.5$, $x = \mu - 2.5\sigma$.

Example

Consider the sample: {-1, 0, 1, 5, 19}. Compute the z-score for each data point.

$$\bar{x} = \frac{-1 + 0 + 1 + 5 + 19}{5} = \frac{24}{5} = 4.8$$

$$s = \sqrt{\frac{(-1 - 24/5)^2 + (0 - 24/5)^2 + (1 - 24/5)^2 + (5 - 24/5)^2 + (19 - 24/5)^2}{4}} = \sqrt{\frac{341}{5}} \approx 8.26$$

$x = -1 \Rightarrow z = -\frac{29}{\sqrt{1705}} \approx -0.70$
$x = 0 \Rightarrow z = -\frac{24}{\sqrt{1705}} \approx -0.58$
$x = 1 \Rightarrow z = -\frac{19}{\sqrt{1705}} \approx -0.46$
$x = 5 \Rightarrow z = \frac{1}{\sqrt{1705}} \approx 0.02$
$x = 19 \Rightarrow z = \frac{71}{\sqrt{1705}} \approx 1.72$

Application of z-score

The average 20- to 29-year-old man is 69.6 inches tall with a standard deviation of 2.7 inches. The average 20- to 29-year-old woman is 64.1 inches tall with a standard deviation of 2.6 inches. With respect to their population, who is relatively taller: a 75-inch man or a 70-inch woman?

As a measure of relative standing within each population we use the z-score.

Man: $z = \frac{x - \mu}{\sigma} = \frac{75 - 69.6}{2.7} = 2.0$
Woman: $z = \frac{x - \mu}{\sigma} = \frac{70 - 64.1}{2.6} \approx 2.26923$

Hence, the 75-inch man is 2 standard deviations above the mean of his population and the woman is about 2.27 standard deviations above the mean of her population. Hence, she is relatively taller.
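The two z-score computations above can be reproduced in a few lines of Python. This is a minimal sketch for illustration; the helper name `z_score` is ours, and the sample standard deviation is computed with the n - 1 divisor.

```python
import math

def z_score(x, mean, sd):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Height comparison (inches): each person is measured against their own population.
z_man = z_score(75, 69.6, 2.7)
z_woman = z_score(70, 64.1, 2.6)

# z-scores for the sample {-1, 0, 1, 5, 19}, using the sample standard deviation.
sample = [-1, 0, 1, 5, 19]
x_bar = sum(sample) / len(sample)
s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (len(sample) - 1))
z_values = [round(z_score(x, x_bar, s), 2) for x in sample]

print(round(z_man, 2), round(z_woman, 2))
print(z_values)
```

Because a z-score is unit-free, it allows values from populations with different means and spreads to be compared directly.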

Percentile

Definition: The kth percentile in a distribution, $P_k$, is a value such that $k$ percent of the observations fall at or below it. In other words, it subdivides the total area enclosed by the distribution into two sub-areas, $A_1$ and $A_2$, so that the total area is divided into two parts: $k\%$ and $(100 - k)\%$.

Algorithm for Percentiles

$S = \{x_1, x_2, \ldots, x_n\}$ (sample or population), ordered so that $x_1 \le x_2 \le \cdots \le x_n$.
Let $k$ be the percentile to be computed.
Compute $i = \frac{k}{100}(n + 1)$. We need $1 \le i \le n$.
If $i$ is an integer, then $P_k = x_i$.
If $i$ is not an integer, then let $j$ be the largest integer such that $j < i$. Then $P_k = \frac{x_j + x_{j+1}}{2}$.
For example, if $i = 10.31$, then $j = 10$ and $P_k = \frac{x_{10} + x_{11}}{2}$.

Example

Find the 20th percentile of the set S = {-1, 0, 3, 5, 9, 12, 15, 18, 25}. Next find the 45th percentile.

$n = 9$, $k = 20$: $i = \frac{k}{100}(n + 1) = \frac{20}{100} \cdot 10 = 2$, so $P_{20} = x_2 = 0$.
$n = 9$, $k = 45$: $i = \frac{k}{100}(n + 1) = \frac{45}{100} \cdot 10 = 4.5$ (not an integer), so $P_{45} = \frac{x_4 + x_5}{2} = \frac{5 + 9}{2} = 7$.

Remark

Your book proposes an algorithm on page 169 which tells us what percentile a particular data point represents in the set.

$S = \{x_1, x_2, \ldots, x_n\}$ so that $x_1 \le x_2 \le \cdots \le x_n$. Let $x_k \in S$. Then the percentile of $x_k$ is

$$P_{x_k} = \frac{\text{number of data points less than } x_k}{n} \cdot 100.$$

If $P_{x_k}$ is not an integer, we round to the nearest integer.

Example: What percentile does the number 3 represent in the set {2, 3, 1, 1, 4, 2}?
$n = 6$
Ordered set: 1, 1, 2, 2, 3, 4
Number of data points less than 3: 4
$P_3 = \frac{4}{6} \cdot 100 \approx 66.67\%$, so 3 is at the 67th percentile.

Quartiles

When k = 50, half of the observations are above and half are below this position. One can argue that this is equivalent to the notion of the median of the set of observations. When k = 25, one quarter of the observations are below this position and three quarters are above. Similarly for k = 75. These demarcation points are given special names. When k = 25, it is called the first quartile (Q1). When k = 50, it is called the second quartile or median (Q2), and finally, when k = 75, it is called the third quartile (Q3).

To Find Quartiles

• To calculate Q1, we calculate $P_{25}$.
• To calculate Q2, we calculate $P_{50}$.
• To calculate Q3, we calculate $P_{75}$.

Example

Find the quartiles for the set {-1, 1, 5, 5, 0, 7, 2, 7}.

Reordered set: -1, 0, 1, 2, 5, 5, 7, 7, so $n = 8$.
$k = 25$: $i = \frac{k}{100}(n + 1) = \frac{25 \cdot 9}{100} = 2.25$, so $P_{25} = \frac{x_2 + x_3}{2} = \frac{0 + 1}{2} = 0.5$ and Q1 = 0.5.
$k = 50$: $i = \frac{k}{100}(n + 1) = \frac{50 \cdot 9}{100} = 4.5$, so $P_{50} = \frac{x_4 + x_5}{2} = \frac{2 + 5}{2} = 3.5$ and Q2 = 3.5.
$k = 75$: $i = \frac{k}{100}(n + 1) = \frac{75 \cdot 9}{100} = 6.75$, so $P_{75} = \frac{x_6 + x_7}{2} = \frac{5 + 7}{2} = 6$ and Q3 = 6.

Example

Find the quartiles for the set {-1, 1, 5, 5, 0, 7, 2, 7, 2}. This is the same set as in the previous example with the data point 2 added.

Reordered set: -1, 0, 1, 2, 2, 5, 5, 7, 7, so $n = 9$.
$k = 25$: $i = \frac{k}{100}(n + 1) = \frac{25 \cdot 10}{100} = 2.5$, so $P_{25} = \frac{x_2 + x_3}{2} = \frac{0 + 1}{2} = 0.5$ and Q1 = 0.5.
$k = 50$: $i = \frac{k}{100}(n + 1) = \frac{50 \cdot 10}{100} = 5$, so $P_{50} = x_5 = 2$ and Q2 = 2.
$k = 75$: $i = \frac{k}{100}(n + 1) = \frac{75 \cdot 10}{100} = 7.5$, so $P_{75} = \frac{x_7 + x_8}{2} = \frac{5 + 7}{2} = 6$ and Q3 = 6.

Example

Find the median, Q1, and Q3 for the set of data:
{68, 76, 60, 88, 69, 80, 75, 67, 71, 100, 63, 62, 71, 74, 64, 48, 100, 72, 65, 50, 72, 100, 63, 45, 54, 60, 75, 57, 74, 84, 83}.

Reordered set: {45, 48, 50, 54, 57, 60, 60, 62, 63, 63, 64, 65, 67, 68, 69, 71, 71, 72, 72, 74, 74, 75, 75, 76, 80, 83, 84, 88, 100, 100, 100}, so $n = 31$.
$k = 25$: $i = \frac{k}{100}(n + 1) = \frac{25 \cdot 32}{100} = 8$, so $P_{25} = x_8 = 62$ and Q1 = 62.
$k = 50$: $i = \frac{k}{100}(n + 1) = \frac{50 \cdot 32}{100} = 16$, so $P_{50} = x_{16} = 71$ and Q2 = 71.
$k = 75$: $i = \frac{k}{100}(n + 1) = \frac{75 \cdot 32}{100} = 24$, so $P_{75} = x_{24} = 76$ and Q3 = 76.

Interquartile Range

Definition: Let Q1, Q2, and Q3 denote the quartiles for a set of observations. The interquartile range (IQR) of the set is defined as IQR = Q3 - Q1. Hence, it is simply the distance between the first and third quartiles.

Example: Consider {-1, 1, 5, 5, 0, 7, 2, 7}. Previously, we showed that Q1 = 0.5 and Q3 = 6.0. Hence, IQR = 6.0 - 0.5 = 5.5.

IQR and Outlier Criterion

Criterion: Consider a set of observations. An observation is a possible outlier on the left if it lies below Q1 and its distance to Q1 is larger than 1.5(IQR). It is a possible outlier on the right if it lies above Q3 and its distance to Q3 is larger than 1.5(IQR).
We call these demarcation values the lower and upper fences of the set:
LF = Q1 - 1.5(IQR)
UF = Q3 + 1.5(IQR)

Example

Consider a set of data points: {-1, 0, 3, 5, 9, 10, 26}. Does it have any potential outliers?

Reordered set: {-1, 0, 3, 5, 9, 10, 26}, so $n = 7$.
$k = 25$: $i = \frac{k}{100}(n + 1) = \frac{25 \cdot 8}{100} = 2$, so $P_{25} = x_2 = 0$ and Q1 = 0.
$k = 50$: $i = \frac{k}{100}(n + 1) = \frac{50 \cdot 8}{100} = 4$, so $P_{50} = x_4 = 5$ and Q2 = 5.
$k = 75$: $i = \frac{k}{100}(n + 1) = \frac{75 \cdot 8}{100} = 6$, so $P_{75} = x_6 = 10$ and Q3 = 10.
IQR = Q3 - Q1 = 10 - 0 = 10.
Since 1.5(IQR) = 15 and 26 > Q3 + 1.5(IQR) = 25, we consider the largest data point to be a possible outlier.

Example

The following is a sample of the concentration of dissolved organic carbon (mg/L) in mineral soil:
{8.5, 10.3, 5.5, 8.05, 3.02, 12.57, 8.37, 4.6, 7.9, 9.11, 3.91, 11.56, 4.71, 10.72, 7.45, 12.89, 7.92, 8.5, 11.72, 8.79, 9.29, 7, 7.66, 21.82, 11.33, 9.81, 17.9, 4.8, 4.85, 21, 3.99, 11.72, 22.62, 7.11, 17.99, 7.31, 4.9, 11.97, 10.89, 3.79, 11.8, 10.74, 9.6, 21.4, 16.92, 9.1, 7.85}.
Calculate the quartiles and IQR for this sample. Lastly, compute the upper and lower fences.

We first sort the set:
{3.02, 3.79, 3.91, 3.99, 4.6, 4.71, 4.8, 4.85, 4.9, 5.5, 7, 7.11, 7.31, 7.45, 7.66, 7.85, 7.9, 7.92, 8.05, 8.37, 8.5, 8.5, 8.79, 9.1, 9.11, 9.29, 9.6, 9.81, 10.3, 10.72, 10.74, 10.89, 11.33, 11.56, 11.72, 11.72, 11.8, 11.97, 12.57, 12.89, 16.92, 17.9, 17.99, 21, 21.4, 21.82, 22.62}
and notice that there are 47 points. Hence, the median (Q2) is the middle point of the sorted set: 9.1 (the 24th point). Therefore, Q2 = 9.1.

To calculate Q1 and Q3, we use the 25th and 75th percentiles: Q1 = 7.16 and Q3 = 11.72. (The value 7.16 appears to come from interpolating between the 12th and 13th sorted values, 7.11 + 0.25(7.31 - 7.11); the rule $i = \frac{k}{100}(n + 1)$ gives $i = 12$ and $x_{12} = 7.11$.) Therefore, IQR = 11.72 - 7.16 = 4.56.

The lower and upper fences are:
LF = Q1 - 1.5(IQR) = 7.16 - 1.5(4.56) = 0.32
UF = Q3 + 1.5(IQR) = 11.72 + 1.5(4.56) = 18.56

The Five Number Summary of Position

It is often convenient to summarize the quartile information together with the smallest and largest values in the set as a 5-tuple: (smallest, Q1, median, Q3, largest).

Example: Find the five number summary for the set: {-2, -1, 0, 1, 5, 6, 6, 8, 10, 11, 12}.
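The percentile rule $i = \frac{k}{100}(n+1)$, the fences, and the five number summary fit together in a few short functions. The following is a minimal sketch under that rule only; the function names `percentile`, `fences`, and `five_number_summary` are ours, and statistical software often uses other interpolation rules that give slightly different quartiles.

```python
def percentile(data, k):
    """k-th percentile using the position rule i = (k/100)(n + 1)."""
    xs = sorted(data)
    n = len(xs)
    i = k * (n + 1) / 100
    if not 1 <= i <= n:
        raise ValueError("percentile position falls outside the data")
    j = int(i)                       # largest integer not exceeding i
    if i == j:
        return xs[j - 1]             # x_i (the list is 0-based, the rule is 1-based)
    return (xs[j - 1] + xs[j]) / 2   # average of x_j and x_{j+1}

def fences(data):
    """Lower and upper fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, q3 = percentile(data, 25), percentile(data, 75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def five_number_summary(data):
    """(smallest, Q1, median, Q3, largest)."""
    return (min(data), percentile(data, 25), percentile(data, 50),
            percentile(data, 75), max(data))

data = [-1, 0, 3, 5, 9, 10, 26]
lower, upper = fences(data)
possible_outliers = [x for x in data if x < lower or x > upper]
summary = five_number_summary([-2, -1, 0, 1, 5, 6, 6, 8, 10, 11, 12])

print(lower, upper, possible_outliers)
print(summary)
```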
The smallest number is -2, the largest number is 12, Q2 = 6, Q1 = 0 and Q3 = 10. Hence, the 5-tuple is (-2, 0, 6, 10, 12).

Section 3.5

Box-whisker Plot of the Five Number Summary

(smallest, Q1, median, Q3, largest)

Some Remarks

• A box-whisker plot is a very compact way of summarizing the spread of the distribution.
• It does not give the shape of the distribution and hence a histogram and a box-whisker plot often go together.
• A box-whisker plot is a convenient way to compare two sets of data.

Comparing Two Sets

Which set has the larger mean? Which set has the larger median? Which set has the largest member? Which has the larger standard deviation?

Can Descriptive Summaries be Misleading?

Example: Suppose a sample of Vanderbilt students is asked to estimate how many miles they drove during the month of August. After receiving the sample for this population of Vanderbilt students, we compute the following statistics:
• number in sample: 954
• smallest value: 0
• largest value: 25,000
• mean: 2,072.6
• median: 1,903
• standard deviation: 1,662.9
• IQR: 1,908

Is it reasonable to say that the average Vanderbilt student drove approximately 2,073 miles, with a median of 1,903, during the month of August?

Actual Data Set

{0, 1, 5, 9, 13, ..., 2997, 3000, 3010, 3020, 3030, …, 5000, 24000, 25000}
954 data points
25,000 and 24,000 are outliers.
Remove outliers: mean = 2,025.5, median = 1,899, SD = 1,307.5

Home Ownership in America

• SmartMoney Magazine, October 2006.
• Question: Is real estate making millionaires of the average citizen?
• Median value of homes has risen from $131,000 in 2001 to $160,000 in 2004.
• 69% of Americans own homes.
• Average net worth of homeowners: $625,000.
• Median net worth of homeowners: $184,000.
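The gap between mean and median in the last two slides is the usual signature of a few very large values. Since the full data sets are not listed, the sketch below reuses the small sets S1 = {1, 3, 5, 6, 7} and S2 = S1 plus the outlier 25 from the earlier outlier example; the variable names are ours, and the code is only an illustration of how one extreme value moves the mean much more than the median.

```python
import statistics

s1 = [1, 3, 5, 6, 7]
s2 = s1 + [25]   # the same data plus one outlier

for name, data in (("S1", s1), ("S2", s2)):
    print(name, round(statistics.mean(data), 2), statistics.median(data))

# How far each summary moves when the outlier is added.
mean_shift = statistics.mean(s2) - statistics.mean(s1)
median_shift = statistics.median(s2) - statistics.median(s1)
print(round(mean_shift, 2), median_shift)
```

When a reported mean sits well above the median, as with homeowner net worth, it is worth looking at the largest values before treating the mean as "typical".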

Summary

• Center of a Distribution
  – Mean
  – Median
  – Mode
• Spread of a Distribution
  – Range
  – Variance
  – Standard Deviation
• Position in a Distribution
  – Quartile
  – Percentile
  – IQR
  – Five number summary
  – Box-whisker plot
  – z-score
• Grouped Data