Chapter 3 Numerical Summaries of Data

Introduction

In the previous chapter we learned some of the graphical methods of summarizing data. In this chapter we will begin to learn some of the numerical methods of doing the same thing. The numerical methods are actually more important because they are also used in inferential statistics. Remember that inferential statistics are the guesses we make about the population based on sample information. We will cover:

Measures of Central Tendency
  o Mean*
  o Median
  o Mode
Measures of Dispersion
  o Range
  o Mean Absolute Deviation (MAD)
  o Standard Deviation*
  o Variance*
Z-scores
The empirical rule
Chebychev's Theorem

We will also learn how some of these measures are used in the empirical rule and Chebychev's Theorem to describe how data are distributed. The '*' indicates the most important measures.

Measures of Central Tendency

Mode

The mode of a set of data is the most frequently occurring value. Consider X = {1, 2, 3, 3, 3, 4}. In this set the value 3 occurs more times than any other value, so Mode = 3. In general,

  Mode = the most frequently occurring value in a set of data.

You can compute the mode using Excel with the command =MODE(data range).

Median

The median is a value that splits the data in half. One half of the data will have values less than or equal to the median; the other half will have values greater than or equal to the median. Consider the following set of data

  X = {1.1, 2.2, 2.5, 3.0, 3.5, 6.3, 7.2}

which has N=7 observations. Note that the data has been ordered (sorted) from lowest to highest value. Look at the 4-th observation, x=3.0. It is pretty clear that the 3 values to the left of this observation are less than 3.0, and that the 3 observations to the right are greater than 3.0. So the middle observation splits the data into two parts. Note also that the middle observation is observation (N+1)/2. For this data it is the (N+1)/2 = (7+1)/2 = 4-th observation.
So if N is odd,

  Median = the ((N+1)/2)-th ordered observation.

If N is an even number we must average the two middle values of the ordered data. So if the data were

  X = {1.1, 2.2, 2.5, 3.0, 3.5, 6.3, 7.2, 10.0}

with N=8, we want to average the 4-th and 5-th data values. Then Median = (3.0+3.5)/2 = 3.25. Note that 4 values will be below 3.25 and 4 will be above 3.25. So if N is even,

  Median = (x_{N/2} + x_{N/2+1}) / 2.

There is an Excel command that can be used to compute the median: =MEDIAN(data range).

Mean

When calculating the mean we have to be concerned with whether the data is population or sample data. A statistical convention is to indicate numerical summaries calculated from population data with lower case Greek letters. The population mean is indicated with the lower case Greek letter 'mu': μ = population mean. Numerical summaries calculated from sample data are indicated by lower case English letters, so that x̄ = sample mean. A further statistical convention is to call the population values parameters and the sample numbers statistics. Yet another convention is to indicate the number of observations in the population by N (often called the population size) and the number of observations in the sample by n (called the sample size). The formula for the population mean is

  μ = (1/N) Σ_{i=1}^{N} x_i

and the formula for the sample mean is

  x̄ = (1/n) Σ_{i=1}^{n} x_i.

You may recall from a previous chapter that statistical inference, otherwise known as guessing, is an important topic in statistics. As you will learn in subsequent chapters, the sample mean is the best guess we can make for the value of the population mean given the information available in a sample. That is one of the reasons why the sample mean is important to us in statistics. The mean is useful; it indicates the average of a set of data, and in many cases the data will tend to group about the mean. The following examples will show how the mean is calculated.
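As a cross-check on the Excel functions, the three measures of central tendency can also be sketched with Python's standard library, using the data sets from the text (a sketch, not part of the text's Excel workflow):

```python
import statistics

# Mode: the most frequently occurring value
print(statistics.mode([1, 2, 3, 3, 3, 4]))  # 3

# Median with odd N: the ((N+1)/2)-th ordered observation
print(statistics.median([1.1, 2.2, 2.5, 3.0, 3.5, 6.3, 7.2]))  # 3.0

# Median with even N: average of the two middle ordered observations
print(statistics.median([1.1, 2.2, 2.5, 3.0, 3.5, 6.3, 7.2, 10.0]))  # 3.25

# Mean: the sum of the values divided by the number of observations
print(statistics.mean([1.1, 2.2, 2.5, 3.0, 3.5, 6.3, 7.2]))
```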
Consider the plight of a statistics instructor deciding how to assign grades in a class, particularly if the administration has determined that the class average should be a C. Suppose that the University has the typical grading system where A=4.0, B=3.0, etc. Then the easiest way to ensure that a C average results is to not grade tests or homework and to give every student a grade of C=2.0. Suppose the class consists of N=4 students. Using this scheme the instructor assigns the following grades: X = {2.0, 2.0, 2.0, 2.0}. Then the population mean for such a class, let's call this Class 1, is given by:

  μ = (1/N) Σ_{i=1}^{N} x_i = (2+2+2+2)/4 = 2.

Now consider another class where two students really annoy the instructor, so these students receive the grade of D=1.0, and the instructor gives the remaining students the grade of B=3.0 in hopes of getting an average of C for the class as a whole. This class, Class 2, gets the following distribution of grades: X = {1.0, 1.0, 3.0, 3.0}. The population mean in this case is

  μ = (1/N) Σ_{i=1}^{N} x_i = (1+1+3+3)/4 = 2.

So the instructor's grading scheme achieved the desired average of μ = 2 = C. Now consider Class 3, which has two very brilliant students. They show this brilliance by bribing the instructor to give them A's. The other two poor souls then get F's to give the desired grade point average for the class. Calculate μ for this class yourself.

Now suppose that the administration decides to check up on the instructor to determine if the C-average policy is being maintained. The administration, being lazy and wanting to reduce costs (typical), decides to take a sample of the grades given by the instructor. The administration takes a sample of n=3 grades from the second class and gets sample results X = {1.0, 3.0, 3.0}, which gives a sample mean of

  x̄ = (1/n) Σ_{i=1}^{n} x_i = (1+3+3)/3 = 2.33.

They use this as a guess for the numerical value of the population mean. These examples may seem trite, but there are a couple of very important points which can be drawn from them.
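The class-average calculations above can be verified in a few lines of Python (a sketch; the Class 3 grades are the two A's and two F's described in the text):

```python
from statistics import mean

# Grades in the three hypothetical classes from the text
class1 = [2.0, 2.0, 2.0, 2.0]  # everyone receives a C
class2 = [1.0, 1.0, 3.0, 3.0]  # two D's and two B's
class3 = [4.0, 4.0, 0.0, 0.0]  # two A's and two F's

for grades in (class1, class2, class3):
    print(mean(grades))  # each class averages 2.0, a C

# The administration's sample of n=3 grades drawn from Class 2
sample = [1.0, 3.0, 3.0]
print(round(mean(sample), 2))  # 2.33, the guess for the population mean
```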
First of all, the mean is a number about which the values seem to balance in some sense. The second point, and the most critical, is that the mean would not be of much use to you in deciding which class you would want to be in. In every case the class mean is μ = 2, a C, but your individual grade could vary drastically from the mean, depending on which class you were in. In Class 1 there is no variation from the mean. We might say there is no risk in attending Class 1: you know precisely what your grade will be. On the other hand, there is considerable risk associated with Class 3; you will receive either a very good or a very bad grade. The grades vary from the mean a great deal in the last class. So while the mean may provide some guidance as to a desirable class, it by no means gives you all the information you would need in deciding on a particular class (assuming grade point average is your only consideration). Excel computes means using the =AVERAGE(data range) function.

Measures of Variation

Other useful numerical descriptions of data are measures of variation, or measures of how spread out the data are. These measures include the range, the mean absolute deviation (MAD), the variance, and the standard deviation.

Range

The simplest measure of variation is the range. The range is found by subtracting the smallest value in a set of data from the largest value:

  range = largest value - smallest value.

So if the data set is X = {1.0, 2.2, 4.4, 6.5}, the range of the data is range = 6.5 - 1.0 = 5.5. Now consider the range for the three classes discussed above.

  Class 1: range = 2 - 2 = 0
  Class 2: range = 3 - 1 = 2
  Class 3: range = 4 - 0 = 4

Note that the range gets larger as the data becomes more spread out. The range then measures the spread -- the larger the range, the greater the spread.
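The range calculations above can be sketched in Python (mirroring the =MAX minus =MIN approach used with Excel below):

```python
def data_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)

print(data_range([1.0, 2.2, 4.4, 6.5]))  # 5.5
print(data_range([2.0, 2.0, 2.0, 2.0]))  # Class 1: 0.0
print(data_range([1.0, 1.0, 3.0, 3.0]))  # Class 2: 2.0
print(data_range([4.0, 4.0, 0.0, 0.0]))  # Class 3: 4.0
```

Note how the printed ranges grow from Class 1 to Class 3 as the grades become more spread out.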
Excel does not have a RANGE function, but it does have a MAX function and a MIN function, so the range can be computed using

  =MAX(data range) - MIN(data range)

Using MAX and MIN to find the range of data

Mean Absolute Deviation

The mean absolute deviation (MAD) is another common measure of variation. It is defined to be

  MAD = (1/N) Σ_{i=1}^{N} |x_i - μ|

for population data and

  MAD = (1/n) Σ_{i=1}^{n} |x_i - x̄|

for sample data. The value of the MAD for Class 2 is

  MAD = (|1-2| + |1-2| + |3-2| + |3-2|)/4 = (1+1+1+1)/4 = 1.

The value of MAD for Class 1 is MAD = 0.0, indicating no variation or spread in the data. The value of MAD for Class 3 is MAD = 2.0 (try it yourself). So MAD is a measure of variation -- again, the larger the spread, the larger the value of MAD. Finally, note the reason for the absolute value signs: if absolute values are not taken, the resulting sum will always be zero. Getting a value of zero regardless of what the data looks like would not make a good measure of variation. Excel has a MAD function called AVEDEV:

  MAD = AVEDEV(data range)

The AVEDEV (MAD) function in Excel

Variance and Standard Deviation

The most important measures of the variation in a set of data are the variance and the standard deviation. These measures are used in the empirical rule and Chebychev's Theorem (to be discussed in the next section). The population variance is defined to be

  σ² = (1/N) Σ_{i=1}^{N} (x_i - μ)²

and the population standard deviation is defined to be

  σ = √σ²

where σ is the lower case Greek letter 'sigma'. Recall the convention that lower case Greek letters indicate population parameters. The sample variance is defined to be

  s² = (1/(n-1)) Σ_{i=1}^{n} (x_i - x̄)²

and the sample standard deviation is

  s = √s².

Again note the convention: the lower case English letter s designates a sample statistic. If we calculate σ² for Class 1 we find that σ² = 0.0 because x_i - μ = 0 for every term. The calculation of σ² for Class 2 is

  σ² = (1/N) Σ_{i=1}^{N} (x_i - μ)² = ((1-2)² + (1-2)² + (3-2)² + (3-2)²)/4 = (1+1+1+1)/4 = 1

and the population standard deviation is σ = √1 = 1. The value of σ² for Class 3 is σ² = 4.0, so that σ = √4 = 2 for this class.
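The MAD and variance formulas above translate directly into Python. The sketch below writes them out term by term so the divide-by-N (population) versus divide-by-(n-1) (sample) distinction is explicit:

```python
from math import sqrt

def mad(values):
    """Mean absolute deviation: average distance of each value from the mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

def pvariance(values):
    """Population variance: average squared deviation, dividing by N."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)

def svariance(values):
    """Sample variance: divide by n - 1 instead of n."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)

class2 = [1.0, 1.0, 3.0, 3.0]
print(mad(class2))               # 1.0
print(pvariance(class2))         # 1.0
print(sqrt(pvariance(class2)))   # population standard deviation sigma = 1.0
```

Note that without the absolute value (or the squaring), the sum of the raw deviations x_i - μ is always zero, which is exactly the point made in the text.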
Both σ² and σ are measures of spread because both get larger as the data becomes more spread out. One of the reasons for using the standard deviation rather than the variance is the units that arise from the calculation of the variance. Suppose the data measured were gallons of paint. Then a particular term in the variance would be (1.0 gallon - 2.0 gallon)² = (-1.0 gallon)² = 1.0 square gallon. Had the data been measured in dollars, the value of the variance would have been in square dollars. Now square dollars and square gallons do not exist. Taking the square root converts these values into dollars and gallons, which are familiar units. So one of the reasons for using the standard deviation is that it has the same units as the data.

Again, things are easier with Excel. Excel has functions for computing the population variance, =VARPA(data range), and the population standard deviation, =STDEVPA(data range). If sample data is used, the functions are =VAR(data range) and =STDEV(data range) for the sample variance and standard deviation, respectively. If you have the Analysis ToolPak add-in installed, you can use the Descriptive Statistics feature to get a lot of information at once.

The descriptive statistics command using the Analysis ToolPak add-in of Excel

Measures of Position

Percentiles

Percentiles are additional measures that help describe data. For example, the 10-th percentile is a value such that 10% of the data lies to the left of that value and 90% of the data lies to the right. So if the 10-th percentile value for Flagstaff new home prices is $86,000, then 10% of new homes will have a price less than $86,000 and 90% of the homes will have prices greater than $86,000.

Quartiles

Quartiles are percentiles. Quartile one, Q1, is the 25-th percentile; Q2 is the 50-th percentile (which is also the median, of course); and Q3 is the 75-th percentile. Fifty percent of the data lies between Q1 and Q3.
Twenty-five percent lies between Q1 and Q2, and twenty-five percent also lies between Q2 and Q3. Excel has a percentile function:

  =PERCENTILE(data range, percentile)

The 23-rd percentile for the data in column A of the following spreadsheet could be found with =PERCENTILE(A1:A10,0.23), which returns 15.56236457 for the data

  1.4496292, 10.06805628, 13.85845515, 38.20001831, 40.74221015,
  59.64842677, 86.3246559, 88.46095157, 89.91058077, 95.84643086

Finding the 23-rd percentile of a set of data

The Empirical Rule

After several years of collecting and plotting data, people noticed a certain consistency among data whose histograms looked bell shaped. They found the following:

  Approximately this much of the data    Lies in this interval
  68%                                    [μ - σ, μ + σ]
  95%                                    [μ - 2σ, μ + 2σ]
  Almost all                             [μ - 3σ, μ + 3σ]

As an example of the application of the empirical rule, suppose that we know that the average household income in Flagstaff is μ = 30000, that σ = 2000, and that household incomes in Flagstaff are normally distributed. Then according to the empirical rule, approximately 68% of the data will lie in

  [μ - σ, μ + σ] = [30000-2000, 30000+2000] = [28000, 32000],

about 95% of the data will lie in

  [μ - 2σ, μ + 2σ] = [30000-2(2000), 30000+2(2000)] = [26000, 34000],

and almost all of the data will lie in

  [μ - 3σ, μ + 3σ] = [30000-3(2000), 30000+3(2000)] = [24000, 36000].

So using the empirical rule we can make some statements about how the data is distributed -- that is, how much of the data is located in specific intervals.

Chebychev's Theorem

The empirical rule is very useful, but it applies only to data which is normally distributed. Suppose the data has some other distribution, such as the one shown below.
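The empirical-rule intervals for the Flagstaff income example above can be generated with a short loop (a sketch; it assumes the same μ = 30000 and σ = 2000 and normally distributed incomes):

```python
# Empirical-rule intervals: mu +/- k*sigma for k = 1, 2, 3
mu, sigma = 30000, 2000
for k, share in [(1, "about 68%"), (2, "about 95%"), (3, "almost all")]:
    low, high = mu - k * sigma, mu + k * sigma
    print(f"{share} of incomes lie in [{low}, {high}]")
```

Running this reproduces the three intervals worked out above: [28000, 32000], [26000, 34000], and [24000, 36000].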
The Russian mathematician Chebychev was able to show that the following relationship holds for any distribution of data:

  Chebychev's Theorem: At least 1 - 1/k² of the data will lie in [μ - kσ, μ + kσ], for k > 1.

Let's apply Chebychev's Theorem to the Flagstaff income problem where μ = 30000 and σ = 2000. The Theorem says that for k=2, at least

  1 - 1/k² = 1 - 1/2² = 1 - 1/4 = 3/4 = 0.75

of the data lies in

  [μ - kσ, μ + kσ] = [30000-2(2000), 30000+2(2000)] = [26000, 34000],

and we find that at least 75% of households in Flagstaff have incomes between $26000 and $34000. If we try k=3 we get at least

  1 - 1/k² = 1 - 1/3² = 1 - 1/9 = 8/9 ≈ 0.89

of the data lies in

  [μ - kσ, μ + kσ] = [30000-3(2000), 30000+3(2000)] = [24000, 36000],

or that at least 89% of Flagstaff households have incomes in the indicated range.

One question that comes to mind is what sort of value of k we should use. Remember that k can have any value greater than one; a value of k=1.57 is perfectly acceptable. Unfortunately, there is no rule to indicate what values of k to use. Values of k=2 and k=3 usually work well. The only rule that works is to look at the results produced, and if the results look silly, forget them and try something else. If we tried k=20 and found that almost all Flagstaff households have incomes between -$10,000 and $70,000, the result would be correct but would not give any useful information.

Again, the benefit of Chebychev's Theorem is that it works on any distribution, while the empirical rule works only on normally distributed data. Like anything else, a cost is associated with the benefit. The cost here is the "at least" part of the Theorem. The statement that at least 75% of the data lies in [μ - 2σ, μ + 2σ] means that 75% or more of the data lies in that interval. So the cost of using Chebychev's Theorem is a loss in precision.

Z-Scores

Both the empirical rule and Chebychev's Theorem indicate that most of the observations will be within a couple of standard deviations of the mean.
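The Chebychev bounds for the Flagstaff example can be sketched the same way, with the 1 - 1/k² fraction computed alongside each interval:

```python
# Chebychev's Theorem: at least 1 - 1/k^2 of any data set lies
# within k standard deviations of the mean (for k > 1).
mu, sigma = 30000, 2000
for k in (2, 3):
    bound = 1 - 1 / k**2
    low, high = mu - k * sigma, mu + k * sigma
    print(f"at least {bound:.2f} of the data lies in [{low}, {high}]")
```

Unlike the empirical-rule loop, this works for any distribution -- the price is the weaker "at least" guarantee (0.75 versus roughly 0.95 for k=2 under normality).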
Using the empirical rule, for example, we saw that almost all of the observations would be within three standard deviations of the mean. It is useful to think in terms of standard deviations in doing statistical analysis. Suppose we are told that the temperature in Podunk on a particular day is 20 degrees Fahrenheit. That may or may not imply unusual weather for Podunk. If we find out this temperature is three standard deviations above the average temperature, we could conclude that a heat wave has occurred -- for Podunk, that is. Z-scores give an easy way to convert raw data into standard deviations. The formula for the Z-score is

  Z = (x - μ) / σ.

Let's consider an example to see what the z-score formula tells us. Suppose we have a population with μ = 100 and σ = 10. It is pretty clear that x=110 is one standard deviation to the right of the mean and x=90 is one standard deviation to the left of the mean. The z-scores for these values of x are:

  For x = 110: Z = (x - μ)/σ = (110 - 100)/10 = 10/10 = 1
  For x = 90:  Z = (x - μ)/σ = (90 - 100)/10 = -10/10 = -1
  For x = 100: Z = (x - μ)/σ = (100 - 100)/10 = 0/10 = 0

The z-score z=0 indicates that the mean is zero standard deviations away from the mean, the z-score z=+1 indicates that x=110 is one standard deviation to the right of the mean, and z=-1 indicates that x=90 is one standard deviation to the left of the mean. Note that a positive value of z means the value is to the right of the mean, while a negative z value means the value is to the left of the mean.

Now then, if the average snowfall in Flagstaff is 123.4 inches per year and the standard deviation of the snowfall is 13.3 inches per year, then a snowfall of x=114.2 inches gives

  Z = (x - μ)/σ = (114.2 - 123.4)/13.3 = -0.692,

or a snowfall 0.692 standard deviations below normal. A snowfall of x=165.0 inches gives

  Z = (x - μ)/σ = (165.0 - 123.4)/13.3 = 3.13.

Note that 3.13 standard deviations represents a very rare event if the distribution is normal (consider the empirical rule).

Appendix

It is useful to be able to identify each observation in a set of data.
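The snowfall z-scores above can be checked with a one-line function implementing Z = (x - μ)/σ:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Flagstaff snowfall: mean 123.4 inches/year, standard deviation 13.3 inches/year
print(round(z_score(114.2, 123.4, 13.3), 3))  # -0.692
print(round(z_score(165.0, 123.4, 13.3), 2))  # 3.13
```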
To do this, let the symbol x_i indicate the i-th observation in a set of data. So the symbol x_5 means the fifth observation in a set of data. Suppose the data is organized as follows:

  i:    1    2    3    4    5
  x_i:  0.1  0.2  0.5  0.3  1.2

Then the value of the first observation is x_1 = 0.1, the value of the third observation is x_3 = 0.5, etc. The set of instructions

  Σ_{i=1}^{N} x_i

means to add all of the x values from i=1 to i=N, that is,

  Σ_{i=1}^{N} x_i = 0.1 + 0.2 + 0.5 + 0.3 + 1.2.

If a is some constant, then

  Σ_{i=1}^{3} a = a + a + a = 3a,

which means add a three times. If a=10 this becomes

  Σ_{i=1}^{3} 10 = 10 + 10 + 10 = 30.

In general,

  Σ_{i=1}^{N} x_i = x_1 + x_2 + ... + x_N.

Problems

1. Suppose that the median price of a new Flagstaff home is $100,000. What does this say about the distribution of new home prices in Flagstaff?

2. Consider the following set of data:

  X = {3.67, 4.12, 5.56, 5.11, 1.22, 1.22, 6.13}

Find the mean, median, and mode of this data. (Assume population data.)

3. Consider the following set of data:

  X = {3.67, 4.12, 5.56, 5.11, 1.22, 1.22, 613.0}

which is the same as in Problem 2, except for the last observation, which has been multiplied by 100. Find the mean, median, and mode of this data. Do the values of these measures differ from those in Problem 2? (Assume population data.)

4. Assume that Flagstaff household income is normally distributed with μ = 30000 and σ = 2000. Use the empirical rule to determine the proportion of households with incomes
  a) between $28000 and $32000
  b) between $26000 and $34000
  c) between $24000 and $36000

5. Assume that Flagstaff household income is normally distributed with μ = 30000 and σ = 2000. Use the empirical rule to determine the proportion of households with incomes
  a) less than $26000
  b) greater than $26000
  c) greater than $34000

6. Suppose that a machine shop manufactures ball bearings for a critical assembly.
Extensive testing indicates that the ball bearing diameters are normally distributed with a mean of 1 cm and a standard deviation of 0.01 cm. Use the empirical rule to find the proportion of ball bearings that will be
  a) in the range of 0.99 cm to 1.01 cm
  b) in the range of 0.99 cm to 1.02 cm
  c) in excess of 1.02 cm
  d) Suppose that it is critical that the ball bearings fit in a range of 0.98 to 1.01 cm. What proportion of the bearings do NOT fit in this range?

7. The manager of B&J's Olde Tyme Ice Creame Parlor needs to know how much chocolate ice cream to have available on Monday morning for customers during the week. If too much is ordered, this increases the restaurant's inventory costs. If too little is ordered, the restaurant will run out and lose sales. His vendor will only deliver on Monday morning. After long and careful study, the manager determines that weekly chocolate ice cream sales are normally distributed with a mean of 100 gallons and a standard deviation of 6 gallons. The manager determines that he is willing to run out of chocolate 2.5% of the weeks. How much chocolate should he have on hand at the beginning of the week to ensure this?

8. Use the values for the mean and standard deviation in Problem 4. Use Chebychev's Theorem with k=2 and k=3 to set limits on Flagstaff household incomes if the distribution were not normal.

9. Use the values for the mean and standard deviation in Problem 6. Assume the population is not normally distributed. Use Chebychev's Theorem to set limits on the proportion of ball bearings in the range of the mean ± 2 standard deviations and in the range of the mean ± 3 standard deviations (use Chebychev's Theorem with k=2 and k=3).

10. Find the range, MAD, standard deviation, and variance for the data in Problem 2.

11. Find the range, MAD, standard deviation, and variance for the data in Problem 3.

Answers

1. Half of the homes in Flagstaff sold for a price less than or equal to $100,000.

2.
The results from Excel for the data {3.67, 4.12, 5.56, 5.11, 1.22, 1.22, 6.13}:

  Mean   = 3.861  =AVERAGE(A2:A8)
  Mode   = 1.22   =MODE(A2:A8)
  Median = 4.12   =MEDIAN(A2:A8)

3. The results from Excel for the data {3.67, 4.12, 5.56, 5.11, 1.22, 1.22, 613}:

  Mean   = 90.557  =AVERAGE(A2:A8)
  Mode   = 1.22    =MODE(A2:A8)
  Median = 4.12    =MEDIAN(A2:A8)

4. a) 68%, b) 95%, c) almost all

5. a) 2.5%, b) 97.5%, c) 2.5%

6. a) 68%, b) 81.5%, c) 2.5%, d) 18.5%

7. If B&J's manager orders enough chocolate to have 112 gallons available on Monday morning, they will run out only 2.5% of the time.

8. By Chebychev's Theorem with k=2, at least 75% of Flagstaff households will have incomes in the range [26000, 34000]. For k=3, at least 8/9 of Flagstaff households will have incomes in the range [24000, 36000].

9. At least 75% of the ball bearings will have diameters in the range [0.98, 1.02]. At least 8/9 of the ball bearings will have diameters in the range [0.97, 1.03].

10. The results from Excel for the data {3.67, 4.12, 5.56, 5.11, 1.22, 1.22, 6.13}:

  Range              = 4.91       =MAX(A2:A8)-MIN(A2:A8)
  MAD                = 1.5640816  =AVEDEV(A2:A8)
  Variance           = 3.9413810  =VAR(A2:A8)
  Standard deviation = 1.9852912  =SQRT(VAR(A2:A8))

11. The results from Excel for the data {3.67, 4.12, 5.56, 5.11, 1.22, 1.22, 613}:

  Range              = 611.780    =MAX(A2:A8)-MIN(A2:A8)
  MAD                = 149.26939  =AVEDEV(A2:A8)
  Variance           = 53075.879  =VAR(A2:A8)
  Standard deviation = 230.38203  =SQRT(VAR(A2:A8))