3.3 MEASURING VARIATION OR SPREAD

Both sets of data have the same mean, median, and mode, but the values obviously differ in another respect: the variation or spread of the values. The values in List 1 are much more tightly clustered around the center value of 60. The values in List 2 are much more dispersed or spread out.

List 1: 55, 56, 57, 58, 59, 60, 60, 60, 61, 62, 63, 64, 65
mean = median = mode = 60
[Dot plot of List 1 on a scale from 35 to 85]

List 2: 35, 40, 45, 50, 55, 60, 60, 60, 65, 70, 75, 80, 80
mean = median = mode = 60
[Dot plot of List 2 on a scale from 35 to 85]

Range

The range is the simplest measure of variability or spread. It is the difference between the largest value and the smallest value. The range can give a distorted picture of the actual pattern of variation: two distributions can have the same range but different patterns of variation. The first distribution has most of its values far from the center, while the second distribution has most of its values closer to the center.
[Dot plots of the two distributions, each on a scale from 20 to 30]

Interquartile Range

The interquartile range measures the spread of the middle 50% of the data. You first find the median (represented by Q2, the value that divides the data into two halves), and then find the median for each half. The three values that divide the data into four parts are called the quartiles, represented by Q1, Q2, and Q3. The difference between the third quartile and the first quartile is called the interquartile range, denoted by IQR = Q3 - Q1.

Finding the Quartiles
1. Find the median of all of the observations.
2. First Quartile = Q1 = median of observations that fall below the median.
3. Third Quartile = Q3 = median of observations that fall above the median.

Notes
When the number of observations is odd, the middle observation is the median.
This observation is not included in either of the two halves when computing Q1 and Q3.
Although different books, calculators, and computers may use slightly different ways to compute the quartiles, they are all based on the same idea.
In a left-skewed distribution, the first quartile will be farther from the median than the third quartile is. If the distribution is symmetric, the quartiles should be the same distance from the median.

Example: Quartiles for Age

The ages of the 20 subjects in the medical study are listed below in order.
32, 37, 39, 40, 41, 41, 41, 42, 42, 43, 44, 45, 45, 45, 46, 47, 47, 49, 50, 51
The histogram of the ages is also provided.
(a) Calculate the median age.
(b) Calculate the first quartile Q1 for this age data.
(c) Calculate the third quartile Q3 for this age data.
(d) Calculate the range for this age data.

median = 43.5, Q1 = 41, Q3 = 46.5

[Histogram of age: horizontal axis from 30 to 55, vertical axis Count from 2 to 8]

We see that the distribution of age is approximately symmetric and that the quartiles are about the same distance from the median.

The quartiles are actually the 25th, 50th, and 75th percentiles.

DEFINITION: The pth percentile is the value such that p% of the observations fall at or below that value and (100 - p)% of the observations fall at or above that value.

Five-Number Summary

Five-number summary: Minimum, Q1, Median, Q3, Maximum
[Boxplot diagram with labels Min, Q1, Q2 = Median, Q3, Max]

To Build a Basic Boxplot
1. List the data values in order from smallest to largest.
2. Find the five-number summary: minimum, Q1, median, Q3, and maximum.
3. Locate the values for Q1, the median, and Q3 on the scale. These values determine the “box” part of the boxplot. The quartiles determine the ends of the box, and a line is drawn inside the box to mark the value of the median.
4. Draw lines (called whiskers) from the midpoints of the ends of the box out to the minimum and maximum.
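The quartile rule described in these notes (median of each half, leaving out the middle observation when the count is odd) can be sketched in a few lines of Python. This is only an illustration: the function name `five_number_summary` is made up for this sketch, and other software may use slightly different quartile rules, so results can differ a little from a calculator or spreadsheet.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    lower = data[:n // 2]              # observations below the median position
    upper = data[n // 2 + n % 2:]      # skips the middle observation when n is odd
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Ages of the 20 subjects in the medical study (from the example above)
ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

summary = five_number_summary(ages)
print(summary)
```

For the age data this gives min = 32, Q1 = 41, median = 43.5, Q3 = 46.5, and max = 51, which are the values needed to draw the basic boxplot.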
Example: Five-Number Summary and Boxplot for Age

Problem
Consider the (ordered) ages of the 20 subjects in a medical study:
32, 37, 39, 40, 41, 41, 41, 42, 42, 43, 44, 45, 45, 45, 46, 47, 47, 49, 50, 51
The five-number summary for the age data is given by: min = 32, Q1 = 41, median = 43.5, Q3 = 46.5, and max = 51.
Draw the basic boxplot.
The distance between the median and the quartiles is roughly the same, supporting the rough symmetry of the distribution as seen previously from the histogram.

Side-by-side boxplots are helpful for comparing two or more distributions with respect to the five-number summary. Although the median of the first process is closer to the target value of 20.000 cm, the second process produces a less variable distribution.

Using the 1.5 x IQR Rule to Identify Outliers and Build a Modified Boxplot
1. List the data values in order from smallest to largest.
2. Find the five-number summary: minimum, Q1, median, Q3, and maximum.
3. Locate the values for Q1, the median, and Q3 on the scale. These values determine the “box” part of the boxplot. The quartiles determine the ends of the box, and a line is drawn inside the box to mark the value of the median.
4. Find the IQR = Q3 - Q1.
5. Compute the quantity STEP = 1.5 x IQR.
6. Find the location of the inner fences by taking 1 step out from each of the quartiles: lower inner fence = Q1 - STEP; upper inner fence = Q3 + STEP.
7. Draw the lines (whiskers) from the midpoints of the ends of the box out to the smallest and largest values WITHIN the inner fences.
8. Observations that fall OUTSIDE the inner fences are considered potential outliers. If there are any outliers, plot them individually along the scale using a solid dot.

[Modified boxplot for data with five-number summary min = 1, Q1 = 21, median = 32, Q3 = 66, max = 325. Labels: inner fences; potential outliers (outside value, far outside value); farthest observations that are not potential outliers]

Example: Any Age Outlier?

Let’s apply the "rule of thumb" to our age data set to assess if there are any outliers.
(a) Construct the fences for the modified boxplot based on the 1.5 x IQR rule.
(b) Are there any outliers using the 1.5 x IQR rule?
(c) Construct the modified boxplot.

Let's Do It! 1 (3 min) Five-Number Summary and Outliers

Let's Do It! 2 (3 min) Cost of Running Shoes

The prices for 12 comparable pairs of running shoes produced the following boxplot.
[Boxplot of PRICE on a scale from 40 to 120, with one value marked *]
(a) What was the approximate range of prices for such running shoes? Range = ______________
(b) Twenty-five percent of the shoes cost more than approximately what amount? $ _____________

Let's Do It! 3 (10 min) Comparing Ages: Antibiotic Study

Variable = age for 23 children randomly assigned to one of two treatment groups.
(a) Give the five-number summary for each of the two treatment groups. Comment on your results.
Amoxicillin Group (n=11): 8 9 9 10 Cefadroxil Group (n=12): 7 8 9 9 10 11 11 12 14 14 17 Five-number summary: 9 10 10 11 12 13 14 16 Five-number summary:
(b) Make side-by-side boxplots for the antibiotic study data in part (a).
(c) Using our “rule of thumb,” are there any outliers for the Amoxicillin group? If so, modify your boxplot above.
(d) Using our “rule of thumb,” are there any outliers for the Cefadroxil group? If so, modify your boxplot above.

Standard Deviation

The standard deviation is a measure of the spread of the observations from the mean. Think of the standard deviation as an “average (or standard) distance of the observations from the mean.”

Example 5.9 Standard Deviation: What Is It?
Deviations: -4, 1, 3
Squared Deviations: 16, 1, 9

--------------------------------------------------------------
Observation     Deviation        Squared Deviation
x               x - x̄            (x - x̄)^2
--------------------------------------------------------------
0               0 - 4 = -4       16
5               5 - 4 = 1        1
7               7 - 4 = 3        9
--------------------------------------------------------------
mean = 4        sum always = 0   sum = 26

sample variance:
$$ s^2 = \frac{(-4)^2 + 1^2 + 3^2}{3 - 1} = \frac{16 + 1 + 9}{2} = \frac{26}{2} = 13 $$
sample standard deviation:
$$ s = \sqrt{13} \approx 3.6 $$

Interpretation of the Standard Deviation

Think of the standard deviation as roughly an average distance of the observations from their mean. If all of the observations are the same, then the standard deviation will be 0 (i.e., no spread). Otherwise the standard deviation is positive, and the more spread out the observations are about their mean, the larger the value of the standard deviation.

If $x_1, x_2, \ldots, x_n$ denote a sample of n observations, the sample variance is denoted by:
$$ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\sum x_i^2 - \left(\sum x_i\right)^2 / n}{n - 1} = \frac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n(n - 1)} $$

The sample standard deviation, denoted by s, is the square root of the variance: $s = \sqrt{s^2}$.

The population standard deviation, denoted by the Greek letter $\sigma$ (sigma), is the square root of the population variance and is computed as:
$$ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} $$

Remarks:
The variance is measured in squared units. By taking the square root of the variance we bring this measure of spread back into the original units.
Just as the mean is not a resistant measure of center, the standard deviation, which uses the mean in its definition, is not a resistant measure of spread. It is heavily influenced by extreme values.
There are statistical arguments that support why we divide by n - 1 instead of n in the denominator of the sample standard deviation.

Let's Do It! 4 (4 min) 5.13 Increasing Spread

Consider the following three data sets.
I: 20 20 20
II: 18 20 22
III: 17 20 23
(a) Which data set will have the smallest standard deviation?
(b) Which data set will have the largest standard deviation?
(c) Find the standard deviation for each data set and check your answers to (a) and (b).

Think About It (3 min)
Given that two (or more) sets of n observations yield the same standard deviation, will these sets show the same amount of variability? Just what is variability anyway?

Example: There Are Many Measures of Variability

Consider the following four data sets along with their histograms:
Data Set I: 2 3 3 3 4 4 4 4 5 5 5 5 5
Data Set II: 3 3 3 3 3 4 4 4 4 5 5 5 6
Data Set III: 2 3 3 4 4 4 4 4 4 4 5 5 6
Data Set IV: 3 3 3 3 3 3 4 5 5 5 5 5 5
[Histograms of Distributions I, II, III, and IV, each on a horizontal scale from 1 to 6]
(a) Calculate the mean for each data set.
(b) Calculate the range for each data set.
(c) Calculate the interquartile range, IQR, for each data set.
(d) Calculate the standard deviation for each data set.
(e) Which data set is most variable? Explain.

The mean for all four distributions is $\bar{x} = 4$. The table presents three measures of variability for each of the four distributions.

If we look at the range, Distribution III is most variable; if we look at the IQR, Distribution III is least variable; while all four distributions have the SAME standard deviation. Some people associate variability with range, while others associate variability with how values differ from the mean. There are many measures of variability, with the standard deviation being the most widely used measure. But keep in mind, a distribution with the smallest standard deviation is not necessarily the distribution that is least variable with respect to other definitions or to your own definition of variability. (Reference: A. J. Nitko, (1983), Educational Tests and Measurement: An Introduction.)
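The three measures for the four data sets can be recomputed with a short Python sketch. It assumes the median-of-halves quartile rule from these notes and the sample standard deviation (dividing by n - 1); the helper name `iqr` and the dictionary names are chosen only for this illustration.

```python
from statistics import median, stdev

def iqr(values):
    """Interquartile range using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    lower = data[:n // 2]
    upper = data[n // 2 + n % 2:]   # middle observation is excluded when n is odd
    return median(upper) - median(lower)

data_sets = {
    "I":   [2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5],
    "II":  [3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6],
    "III": [2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6],
    "IV":  [3, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 5],
}

measures = {}
for name, values in data_sets.items():
    # (range, IQR, sample standard deviation)
    measures[name] = (max(values) - min(values), iqr(values), stdev(values))
    print(name, measures[name])
```

Each data set has squared deviations from the mean of 4 that sum to 12, so dividing by n - 1 = 12 gives a variance of 1 in every case, even though the ranges and IQRs differ.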
Measures of variability for Distributions I to IV:

Distribution   Range   IQR   Std dev
I              3       2     1
II             3       2     1
III            4       1     1
IV             2       2     1

Think About It
What do you think would happen to the measures of variability if the last value in all four of the preceding data sets were changed to 16?

IQR and Standard Deviation

The interquartile range, IQR, is the distance between the first and third quartiles (Q3 - Q1), and measures the spread of the middle 50% of the data. When the median is used as a measure of center, the IQR is often used as a measure of spread. For skewed distributions, or distributions with outliers, the IQR tends to be a better measure of spread if your goal is to summarize that distribution. Adding the minimum and maximum values to the median and quartiles results in the five-number summary. A graphical display of the five-number summary is a boxplot, and the length of the box corresponds to the IQR.

The standard deviation is roughly the average distance of the observed values from their mean. The mean and the standard deviation are most useful for approximately symmetric distributions with no outliers. In the next chapter we will discuss an important family of symmetric distributions, called the normal distributions, for which the standard deviation is a very useful summary.

Tip: The numerical summaries presented in this chapter provide information about the center and spread of a distribution, but a graph, such as a histogram or stem-and-leaf plot, provides the best picture of the overall shape of the distribution. Graph your data first!

Variance and Standard Deviation for Grouped Data

The procedure for finding the variance and standard deviation for grouped data is similar to that for finding the mean for grouped data, and it uses the midpoints of each class.

Example
The data represent the number of miles that 20 runners ran during one week. Find the variance and the standard deviation for the frequency distribution of the data.

Solution
Step 1 Make a table as shown, and find the midpoint of each class.
Step 2 Multiply the frequency by the midpoint for each class, and place the products in column D.
$1 \cdot 8 = 8$, $2 \cdot 13 = 26$, . . . , $2 \cdot 38 = 76$
Step 3 Multiply the frequency by the square of the midpoint, and place the products in column E.
$1 \cdot 8^2 = 64$, $2 \cdot 13^2 = 338$, . . . , $2 \cdot 38^2 = 2888$
Step 4 Find the sums of columns B, D, and E. The sum of column B is $n$, the sum of column D is $\sum f \cdot x_m$, and the sum of column E is $\sum f \cdot x_m^2$. The completed table is shown.
Step 5 Substitute in the formula
$$ s^2 = \frac{n \sum f \cdot x_m^2 - \left(\sum f \cdot x_m\right)^2}{n(n - 1)} $$
and solve for $s^2$ to get the variance.
Step 6 Take the square root to get the standard deviation.

Let's Do It! 5

The data show the distribution of the birth weight (in oz.) of 100 consecutive deliveries. Find the variance and the standard deviation.

Interval          Frequency
29.50-69.45       5
69.50-89.45       10
89.50-99.45       11
99.50-109.45      19
109.50-119.45     17
119.50-129.45     20
129.50-139.45     12
139.50-169.45     6

Practice Exercises from Textbook for Section 3.3
Page 129: 1-7 all, 9-11 all, 16, 18-21 all
Page 157: 1-12 all, 16, 17, and 18

TI Quick Steps

Obtaining Summary Measures
Step 1 Clear data.
Step 2 Enter data to be summarized.
Step 3 Obtain the summary measures for the data in L1. Summary measures are obtained by requesting the 1-Var Stats from under the STAT CALC menu list. The sequence of buttons is as follows:
The 1-Var Stats are now displayed in the window. Notice that both the sample standard deviation s and the population standard deviation $\sigma$ are displayed; which one to use depends on whether the values in L1 are a sample or the entire population. The only mean provided is $\bar{x}$, which is the population mean $\mu$ if the values are the entire population. To find more information, in particular the five-number summary, press the down arrow button.

Producing a Boxplot
Step 1 Clear data and plots.
Step 2 Enter data to be plotted.
Step 3 Set the STAT PLOT options for a boxplot. Set the stat plot options for producing a boxplot of the data in L1 as Plot 1. The sequence of steps is as follows:
Press the ZOOM button and then “9” to have the boxplot displayed.
Use the TRACE button and the right and left arrow keys to see values for the five-number summary. Note that the modified boxplot type is the 4th graph icon in the Type list.
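For readers without a TI calculator, the summary measures and the modified-boxplot fences can be approximated in Python. This is a sketch under stated assumptions, not the calculator's own algorithm: it uses the median-of-halves quartile rule and the 1.5 x IQR rule from these notes, applied to the age data, and the helper name `quartiles` is hypothetical.

```python
from statistics import mean, median, pstdev, stdev

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    lower = data[:n // 2]
    upper = data[n // 2 + n % 2:]   # middle observation is excluded when n is odd
    return median(lower), median(data), median(upper)

# Age data from the medical study example
ages = [32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
        44, 45, 45, 45, 46, 47, 47, 49, 50, 51]

# Center and spread: stdev divides by n - 1 (sample), pstdev divides by N (population)
x_bar = mean(ages)
s = stdev(ages)
sigma = pstdev(ages)

# Inner fences for the modified boxplot
q1, med, q3 = quartiles(ages)
step = 1.5 * (q3 - q1)
lower_fence = q1 - step
upper_fence = q3 + step

# Potential outliers fall outside the fences; whiskers stop at the
# smallest and largest observations within the fences.
outliers = [x for x in ages if x < lower_fence or x > upper_fence]
inside = [x for x in ages if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))

print("mean:", x_bar, "s:", round(s, 2), "sigma:", round(sigma, 2))
print("fences:", lower_fence, upper_fence)
print("potential outliers:", outliers, "whisker ends:", whiskers)
```

With Q1 = 41 and Q3 = 46.5, the step is 1.5 x 5.5 = 8.25, so the inner fences are 32.75 and 54.75. The age 32 falls just below the lower fence and is flagged as a potential outlier, and the whiskers run from 37 to 51.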