2. Descriptive Statistics

This chapter discusses various Excel charts that describe data and then looks at the statistical description of data.

2.1 Pie Chart

A Pie Chart is used to display the proportions of different categories in a data set. In the workbook named Chap002.xls, the sheet named Pie Chart has the template shown in Figure 2.1.1. To use this template, enter the data in the shaded area in columns B and C. To edit the chart in any way, double-click on the chart and use the menus that appear. You can move the labels on the chart by dragging them.

Figure 2.1.1. A Pie Chart [Workbook: Chap002.xls; Sheet: Pie Chart]

2.2 Histogram

Figure 2.2.1. A Histogram [Workbook: Chap002.xls; Sheet: Histogram]

When a numerical data set is large, it is convenient to group the data points into classes by choosing suitable class intervals. The class intervals must be chosen so as to have neither too few nor too many classes; the number of classes is usually kept between 5 and 20. The number of data points in a class is called the frequency of that class. A data set so grouped can then be represented by a histogram. A histogram of a grouped data set is a bar chart where each bar represents the frequency of the corresponding class.

Figure 2.2.1 shows the template that can be used for drawing histograms. Note that, besides the data, you have to enter the Start, Interval Width and End values in cells J24, M24 and P24 respectively. These values are arrived at by trial and error, keeping the number of classes somewhere between 5 and 20. The data currently entered in the template is the Gas Mileage data from the textbook.

2.3 Pareto Chart

A Pareto Chart is a combination of a histogram and a line graph that shows cumulative relative frequency. The line graph is helpful in locating the cutoff point at which the cumulative relative frequency equals a desired value. The template for creating a Pareto Chart is shown in Figure 2.3.1.
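
Outside Excel, the cumulative relative frequency line and its cutoff point can be illustrated with a short Python sketch. This is only a rough illustration and not the template's own calculation: the defect categories and counts are made up, and the function names are hypothetical.

```python
# Hypothetical defect counts, invented purely for illustration.
defects = {
    "Scratches": 45,
    "Dents": 30,
    "Misalignment": 15,
    "Discoloration": 6,
    "Cracks": 4,
}


def cumulative_relative_frequencies(counts):
    """Return (category, cumulative relative frequency) pairs,
    with categories sorted in descending order of frequency."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    running = 0
    result = []
    for category, frequency in ordered:
        running += frequency
        result.append((category, running / total))
    return result


def fewest_categories(counts, target):
    """Return the shortest leading run of categories whose cumulative
    relative frequency reaches the target (for example 0.80)."""
    chosen = []
    for category, cumulative in cumulative_relative_frequencies(counts):
        chosen.append(category)
        if cumulative >= target:
            break
    return chosen


for category, cumulative in cumulative_relative_frequencies(defects):
    print(f"{category:15s} {cumulative:.2f}")

print(fewest_categories(defects, 0.80))
```

With these made-up counts the cumulative line reaches 0.75 after two categories and 0.90 after three, so three categories are needed to cover 80% of the defects.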
To use the template, enter the data in columns B and C and sort the items in descending order of frequency. If a category called Other is present, it is always listed last, regardless of its frequency. The cumulative relative frequency is shown by the line graph, and it uses the vertical axis on the right. If a quality control manager wants to know the fewest categories of defects that account for, say, 80% of the total defects, this line graph helps to identify the answer. The line graph crosses 80% just after the first three categories. Thus, taking care of these three categories will take care of 80% of the defects.

Figure 2.3.1. A Pareto Chart [Workbook: Chap002.xls; Sheet: Pareto]

2.4 Box-and-Whiskers Display or the Box Plots

A common type of chart used in descriptive statistics is the box-and-whiskers display or the box plot. A box plot is shown in Figure 2.4.1 with its parts labeled. The plot contains enough data for the user to visualize the shape of the distribution. Additionally, it shows outliers and suspected outliers explicitly, while a histogram does not.

Figure 2.4.1. A Box-and-Whiskers Display, with a suspected outlier and an outlier labeled [Workbook: Chap002.xls; Sheet: Box Plot]

2.5 Percentile and Percentile Rank

Percentiles and percentile ranks are useful in describing the distribution of a numerical data set.

Definition 2.5.1: The n-th percentile of a data set is that value below which lie n% of the data.

Consider the following data of 20 customer satisfaction ratings from Example 2.7 of the textbook.

1, 3, 5, 5, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10

Figure 2.5.1 shows the template that can be used to compute all the descriptive statistics discussed here. The percentiles can be calculated using the range D4:D6. Once a number x is entered in this range, the x-th percentile appears in the adjacent cell to the right.

Figure 2.5.1. Descriptive Statistics [Workbook: Chap002.xls; Sheet: Statistics]

The percentile rank is the inverse of percentile, as seen in the next definition.

Definition 2.5.2: For a given data set, if the x-th percentile is y then x is the percentile rank of y.

In cell H4 of the template, the percentile rank of 8.0 has been calculated. One would expect the percentile rank to be 50 because the 50th percentile was, indeed, 8.0. The reason for this discrepancy is that for our small data set of only 20 numbers, it so happens that the 26th, 27th, ... up to the 52nd percentile are all equal to 8.0. The template returned the smallest possible value of 0.26 as the answer. [The lesson is that we should not calculate percentiles and percentile ranks for small data sets, or else such discrepancies would occur. A large data set of hundreds of numbers will not cause such problems.]

Definition 2.5.3: The median of a data set is its 50th percentile.

Definition 2.5.4: The first quartile of a data set is its 25th percentile. The third quartile is its 75th percentile.

2.6 Measures of Central Tendency

Often, we need to find one representative number from a data set around which all the other numbers are distributed. In other words, we need to find the center of a data set. A few alternatives are available for this purpose and, depending upon the situation, one of them may be more suitable than another.

The first alternative for the center has already been seen, and that is the median. It is a good choice because half the data will be above it and half below it.

The second alternative is the mode defined below.

Definition 2.6.1: The mode of a data set is the value that occurs most frequently.

At times, especially in small data sets, there can be more than one mode, in which case the template may report any one of them as the mode.

The third alternative is the most common one, and it is the mean.

Definition 2.6.2: The mean of a data set is the arithmetic average, calculated as the sum of all the data points divided by the number of data points.

The symbol used to denote the mean depends on whether the data is of a population or of a sample. If it is of a population, the symbol is $\mu$, and if it is of a sample, the symbol is $\bar{x}$. A problem with the mean is that it is affected by outliers, or extreme values in the data. An abnormally large data point will inflate the average and make it non-representative.

2.7 Measures of Variability

One measure of variability is the range defined below.

Definition 2.7.1: The range of a data set is the difference between the maximum and minimum values in the data set.

Another measure of variability is the Inter Quartile Range or IQR.

Definition 2.7.2: The inter quartile range of a data set is the difference between its third and first quartiles.

The deviation of a data point is calculated by subtracting the mean from it. Data points larger than the mean will have positive deviations and those smaller than the mean will have negative deviations. On summing, the positive and negative deviations will exactly cancel each other and yield zero. Because the sum of all the deviations from the mean will always be zero, the average deviation will also always be zero. One way around this is to take absolute deviations and then average them. Another way is to square the deviations and then average them.

Definition 2.7.3: The mean absolute deviation (MAD) of a data set is the average of all absolute deviations.

Definition 2.7.4: The variance of a data set is the average of all squared deviations.

The variance is the most commonly used measure of variability. If SSD denotes the sum of all the squared deviations and n denotes the number of data points in the data set, then the variance is SSD/n. The symbol used to denote variance is $\sigma^2$.
We can write the formula for variance as

$$\sigma^2 = \mathrm{SSD}/n$$

When the data set is of a sample, we do not know the true population mean $\mu$. We are forced to calculate the deviations from the sample mean $\bar{x}$ rather than the true mean $\mu$, and this introduces a small downward bias in the sum of all the squared deviations, SSD. To compensate for this, we divide SSD by $(n-1)$ rather than $n$ to find the average. The result, by definition, is the sample variance and it is denoted by $s^2$. We can write the formula for sample variance as

$$s^2 = \mathrm{SSD}/(n-1)$$

There is a small problem in using variance as the measure of variability. The deviations are squared during the calculation, and therefore, if we started with our data in, say, dollars, then the variance would be in dollars squared. In a subsequent calculation, dollars squared may not be suitable. To get back to the original unit of dollars, we take the (positive) square root of the variance and call it the standard deviation.

Definition 2.7.5: The standard deviation of a data set is the positive square root of the variance.

Definition 2.7.6: The Coefficient of Variation (CV) is defined as $\mathrm{CV} = (\sigma/\mu) \times 100\%$.

2.8 The z-Score

Because the standard deviation is in the same units as the original data, it can be used to standardize the deviation of any data point from the mean. Suppose the mean of a population data set is 1000 and the standard deviation is 10. A data point in the set is, say, 1012. The deviation of this data point from the mean is 12. We divide this 12 by the standard deviation of 10 to get 1.2. In other words, the data point is 1.2 standard deviations more than the mean. If the data set is “normally distributed” (we shall see in a later chapter what that means), then the z-score of the data point is said to be 1.2. Similarly, a data point of 986 has a z-score of (986 - 1000)/10 = -1.4. We shall take the liberty of calling it z-score even if the population is not normally distributed.
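
The variance formulas and the z-score arithmetic above can be checked with a short Python sketch. It is only an illustration and not part of the Excel template: the function names are hypothetical, and the data are the 20 customer satisfaction ratings quoted earlier in the chapter.

```python
import math

# Customer satisfaction ratings quoted earlier in the chapter (Example 2.7).
ratings = [1, 3, 5, 5, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10]


def sum_squared_deviations(data):
    """SSD: sum of the squared deviations from the mean of the data."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)


def population_variance(data):
    """SSD / n, for data that make up a whole population."""
    return sum_squared_deviations(data) / len(data)


def sample_variance(data):
    """SSD / (n - 1), for data that are a sample from a population."""
    return sum_squared_deviations(data) / (len(data) - 1)


def z_score(x, mean, std_dev):
    """Number of standard deviations by which x lies above the mean."""
    return (x - mean) / std_dev


print("Population variance:", population_variance(ratings))
print("Sample variance:", sample_variance(ratings))
print("Population standard deviation:", math.sqrt(population_variance(ratings)))

# The worked example from the text: mean 1000, standard deviation 10.
print("z-score of 1012:", z_score(1012, 1000, 10))
print("z-score of 986:", z_score(986, 1000, 10))
```

Dividing by n - 1 instead of n always gives the larger of the two variances, which is how the sample formula compensates for the downward bias in SSD.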
The formula for z-score is

$$z = (x - \mu)/\sigma$$

where $x$ is the data point, $\mu$ is the mean and $\sigma$ is the standard deviation. Since the sum of all deviations from the mean is zero for all data sets, the average z-score is always zero. The variance of all z-scores is always 1, because it works out to $\sigma^2/\sigma^2$.

2.9 Skewness and Kurtosis

Definition 2.8.1: Skewness of a data set is the average of the cubes of all the z-scores.

The cube of a negative number is negative. If a data set is symmetric, then for every positive z-score there will be a corresponding negative one and their cubes will cancel each other when summed. Thus the skewness of a symmetric data set is zero. In a negatively skewed data set, the negative cubes will outweigh the positives and we will end up with a negative value for skewness. By the same reasoning, the skewness of a positively skewed data set will be positive.

Because the true mean is not known and $\bar{x}$ is used in its place, there will be a downward bias. While calculating skewness, Excel always assumes that the data is of a sample and not of a population and carries out a bias correction. If the data is of a population, then Excel's result will be slightly wrong.

Definition 2.8.2: The (absolute) kurtosis of a data set is the average of the fourth powers of the z-scores.

Since the fourth power of a negative number is positive, the (absolute) kurtosis is always positive. The (absolute) kurtosis of the Normal Distribution is 3. Statisticians find it convenient to measure kurtosis relative to that of the Normal Distribution. Therefore, they subtract 3 from the average fourth power of the z-scores and declare the result as the (relative) kurtosis. The (relative) kurtosis of the Normal Distribution is thus zero. A shape that is more peaked than normal will have a positive (relative) kurtosis, and will be called leptokurtic. A shape that is less peaked than normal will have a negative (relative) kurtosis, and will be called platykurtic.
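
Definitions 2.8.1 and 2.8.2 can be tried out directly with a short Python sketch. It applies the definitions exactly as stated, treating the data as a population and applying no bias correction, so its results should not be expected to match Excel's bias-corrected values. The function names are hypothetical, and the data are the customer satisfaction ratings quoted earlier in the chapter.

```python
import math

# Customer satisfaction ratings quoted earlier in the chapter (Example 2.7).
ratings = [1, 3, 5, 5, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10]


def z_scores(data):
    """z-scores computed with the population mean and standard deviation."""
    n = len(data)
    mean = sum(data) / n
    std_dev = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return [(x - mean) / std_dev for x in data]


def skewness(data):
    """Average of the cubes of the z-scores (no bias correction)."""
    zs = z_scores(data)
    return sum(z ** 3 for z in zs) / len(zs)


def absolute_kurtosis(data):
    """Average of the fourth powers of the z-scores (no bias correction)."""
    zs = z_scores(data)
    return sum(z ** 4 for z in zs) / len(zs)


def relative_kurtosis(data):
    """Absolute kurtosis measured relative to the Normal Distribution's 3."""
    return absolute_kurtosis(data) - 3


print("Skewness:", skewness(ratings))
print("Absolute kurtosis:", absolute_kurtosis(ratings))
print("Relative kurtosis:", relative_kurtosis(ratings))

# A symmetric data set: the cubes of the z-scores cancel out.
print("Skewness of 1..5:", skewness([1, 2, 3, 4, 5]))
```

The ratings have a long tail of low values, so their skewness comes out negative, while the symmetric set 1, 2, 3, 4, 5 gives a skewness of zero up to floating-point rounding.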

If the data is of a sample, there would be the bias again. While calculating kurtosis, Excel always assumes the data to be of a sample and applies a bias correction. The template, thus, always displays the sample kurtosis.

2.10 Exercises

1. The household expenses of a family for a month are:

   Rent            $960
   Food            $680
   Entertainment   $156
   Telephone       $120
   Utilities       $208
   Miscellaneous   $210

   i. Draw a Pie Chart that shows the item names and percentages.
   ii. Sort the data in descending order of the dollar amount. Make sure the chart updates properly.

2. Construct a Histogram for the Trash Bag case data in Table 1.10 of the textbook.

3. Construct a Box Plot of the Trash Bag case data in Table 1.10 of the textbook.

4. Table 2.15 in the textbook contains data on elapsed time for completing and delivering medical lab tests. Draw a Box-and-Whiskers display for the data. Are there any suspected outliers or outliers?

5. Consider the Consumer Complaints data in Table 2.17. Draw a Pareto Chart for the year 1996. Find the fewest set of complaint categories that accounted for 75% of the complaints in that year.

6. Table 2.18 of the textbook contains data on Greater Cincinnati banks. Use the Assets column of data for the following questions:
   a. Find Mean, Median, Mode, Population Variance and Population Standard Deviation.
   b. Find the first and third quartiles.
   c. Find the 62nd percentile.
   d. Find the percentile rank of $100 million.