Descriptive Statistics

Descriptive statistics can best be viewed as a bag of tricks that allows one to present the essential information contained in a data set in a way that can be readily interpreted by the reader. Included in the bag of tricks are a number of pictures/graphs - histograms, stem plots, time plots, box plots - that give us a general picture of the data. There are also a number of formulas/indices - mean, median, quartiles, standard deviation - that give us a number to describe some facet of the data.

To see what we are talking about with descriptive statistics, let's talk about grades, something that should catch your interest. [For those who are new to statistics, you might also want to check out another, simpler example based on rolling dice.] You remember wanting to know, after each exam, the results and what they would mean for your grade in the course. Now let's look at grades from the instructor's side. Below you will find the complete data set for the first test in a recent course (ECN202). There were 52 students, who received the following scores on their exams. The information for this example can be found on the Histogram and Desc stats tabs of the Descriptive Statistics example.
Grades on Exam 1 in ECN202

Student  Grade       Student  Grade       Student  Grade       Student  Grade
   1     91.983        14     97.531        27     96.8659       40     89.0001
   2     91.7597       15     98.2979       28     76.2859       41     76.5876
   3     87.9158       16     70.4242       29     99.6804       42     93.3512
   4     77.0586       17     72.6251       30     87.6299       43     88.82
   5     98.7479       18     86.9584       31     89.6395       44     77.4919
   6     79.8029       19     95.2241       32     85.6969       45     94.7336
   7     80.5968       20     91.9544       33     95.2098       46     75.4139
   8     77.8953       21     80.2882       34     71.9719       47     86.5368
   9     96.1051       22     77.2291       35     92.3448       48     93.7865
  10     74.1581       23     93.1482       36     74.4269       49     73.8672
  11     83.152        24     75.1727       37     82.7137       50     75.7028
  12     91.7678       25     87.5282       38     77.8714       51     73.415
  13     78.1368       26     80.501        39     71.828        52     74.7755

While this entire data set may be useful to you for some purposes, I suspect you were not interested in studying the entire class results. Furthermore, I suspect if you looked at the table for 30 seconds and were asked to describe what you saw, you would have considerable difficulty capturing the essential features of the data set. It is for this reason that we have descriptive statistics that allow us to summarize these data with a few pictures and numbers. As we work our way through these summarizing techniques, you should realize that in general we will not be able to reproduce the exact data from the summary statistics. Some information is lost in the process, but the loss should be more than offset by the gains that we have in terms of insight into the underlying data.

As a starter, the data can be transformed from a table to a graph. One approach would be to simply create a bar graph as we did in a previous section. If you were to do this by hand the first step in creating the diagram would be to round off the grades to the nearest whole number and then sort them by score.
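The rounding and tallying step can also be done in a few lines of Python instead of by hand. This is only a minimal sketch, not the method used on the worksheet; the names `grades` and `tally` are made up for this illustration, and the scores are the 52 values from the grade table.

```python
from collections import Counter

# The 52 Exam 1 scores from the grade table, in student order.
grades = [
    91.983, 91.7597, 87.9158, 77.0586, 98.7479, 79.8029, 80.5968, 77.8953,
    96.1051, 74.1581, 83.152, 91.7678, 78.1368, 97.531, 98.2979, 70.4242,
    72.6251, 86.9584, 95.2241, 91.9544, 80.2882, 77.2291, 93.1482, 75.1727,
    87.5282, 80.501, 96.8659, 76.2859, 99.6804, 87.6299, 89.6395, 85.6969,
    95.2098, 71.9719, 92.3448, 74.4269, 82.7137, 77.8714, 71.828, 89.0001,
    76.5876, 93.3512, 88.82, 77.4919, 94.7336, 75.4139, 86.5368, 93.7865,
    73.8672, 75.7028, 73.415, 74.7755,
]

# Round each score to the nearest whole number, then count each value.
tally = Counter(round(g) for g in grades)

# Print the frequency table in score order (scores nobody received are skipped).
for score in sorted(tally):
    print(score, tally[score])
```

Sorting the keys of the tally replaces the manual sort; the counts for 77 and 92 are the four and five tests mentioned in the text.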
Test Scores

Score  # of tests      Score  # of tests
  70       1             86       1
  71       0             87       2
  72       2             88       3
  73       2             89       2
  74       3             90       1
  75       3             91       0
  76       2             92       5
  77       4             93       2
  78       3             94       1
  79       0             95       3
  80       2             96       1
  81       2             97       1
  82       0             98       2
  83       2             99       1
  84       0            100       1

If we take these data and use Excel to create a column graph with score on the horizontal axis and # of tests on the vertical axis, showing how many students received each score, we get the following diagram. In this example we can see that four students received a 77 and five students received a 92. We could also create a second graph that looks exactly like the first, except that it plots the relative frequency against the scores. For example, a score of 92 was received by approximately 10 percent of the students (5 of 52). These two graphs give us a picture of the distribution of ECN202 grades for the entire class.

You could also use the data analysis tools in Excel to create a histogram that looks very similar to what you created above. The most important difference is that you do not need to sort the data to construct the graph. To see how the histogram is constructed you should examine the Histogram tab of the Descriptive Statistics example. The first step was to create a column of grades. Once the table has been created in Excel, you select Data Analysis from the Tools menu and then select Histogram. The dialogue box lets you know that you need to tell it where the data are, where the categories (Bins) are, and where you want the results. You identify the input by dragging the cursor down the column of scores, and you tell it where the Bin data are by dragging the cursor down the column of bin numbers. Finally you give the location for the output by identifying the cell that will be the top left corner of the output. The results appear below and on the Histogram tab beginning in cell D3. [Note that I did a bit of double clicking on things to clean it up. You will notice the difference between the two histograms on this page.]
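The grouping that the Histogram tool performs, along with the relative frequencies, can be sketched outside Excel as well. The code assumes that a score is counted in the first bin whose value it does not exceed, with a final "More" category for scores above the last bin; that convention and the names `bin_edges`, `counts`, and `relative` are assumptions of this sketch, not taken from the worksheet.

```python
from bisect import bisect_left

# The 52 Exam 1 scores from the grade table, in student order.
grades = [
    91.983, 91.7597, 87.9158, 77.0586, 98.7479, 79.8029, 80.5968, 77.8953,
    96.1051, 74.1581, 83.152, 91.7678, 78.1368, 97.531, 98.2979, 70.4242,
    72.6251, 86.9584, 95.2241, 91.9544, 80.2882, 77.2291, 93.1482, 75.1727,
    87.5282, 80.501, 96.8659, 76.2859, 99.6804, 87.6299, 89.6395, 85.6969,
    95.2098, 71.9719, 92.3448, 74.4269, 82.7137, 77.8714, 71.828, 89.0001,
    76.5876, 93.3512, 88.82, 77.4919, 94.7336, 75.4139, 86.5368, 93.7865,
    73.8672, 75.7028, 73.415, 74.7755,
]

# Bin values used for the histogram of exam scores.
bin_edges = [73, 76, 79, 82, 85, 88, 91, 94, 97]

# counts[i] collects scores above the previous bin value and up to bin_edges[i];
# the extra last slot collects scores above the final bin value ("More").
counts = [0] * (len(bin_edges) + 1)
for g in grades:
    counts[bisect_left(bin_edges, g)] += 1

# Relative frequency: each count divided by the number of tests.
relative = [c / len(grades) for c in counts]

labels = [f"up to {edge}" for edge in bin_edges] + ["More"]
for label, c, r in zip(labels, counts, relative):
    print(f"{label:>8} {c:3d} {r:6.1%}")
```

Unlike the hand-built frequency table, nothing here requires the scores to be sorted first, which is the same convenience the text notes for the Excel tool.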
Bins: 73.0, 76.0, 79.0, 82.0, 85.0, 88.0, 91.0, 94.0, 97.0

Here the grades are grouped. We have four students who received below 73 and six students who received between 85 and 88. Before you leave histograms behind, you should check out an on-line interactive example of a histogram, and David Lane's site.

Once you are comfortable with looking at the data set graphically, we can then look at specific features of the distribution of scores. Generally there are four features of the distribution we are concerned with: modality, symmetry, central tendency, and variability. Of the four, modality and symmetry are the easiest to visualize.

Any value at which the frequency curve or relative frequency curve reaches a peak is called a mode. Most distributions in practice have one peak and are described as "unimodal." A distribution with two peaks is called "bimodal." The distribution of grades pictured above has a number of modes and would not be easily characterized by the mode.

A distribution is said to be symmetric if the relative frequency is the same at equal distances on either side of its center. In the above example, if the midpoint of the distribution were 85, then symmetry would imply that the same number of students received a grade of 90 as a grade of 80, and the number that received a grade of 75 equaled the number that received a 95. The mean and median, concepts that we will discuss in the next section, are equal in a symmetric distribution. An asymmetric frequency distribution is skewed to the left if the lower tail is longer than the upper tail and skewed to the right if the upper tail is longer than the lower tail. To understand the concept of symmetry fully, however, we need to look at the concepts of Central Tendency and Variability.

As you work your way through the statistical analysis, keep in mind that one of the wonders of modern computers and statistical software is that they can make statistical analysis quite simple.
Consider the situation where you have been asked to analyze the voter turnout data for the presidential elections provided by the Federal Election Commission (FEC), which appear below. With these data you can highlight the data in the second column and select Data Analysis in the Tools menu of Excel. This analysis can be found on the Desc stats tab of the Descriptive Statistics example. On that worksheet you will also find the results of running descriptive statistics on the grade data.

1992 Voter Turnout (%)

STATE                   Turnout
Alabama                 55.2%
Alaska                  65.4%
Arizona                 54.1%
Arkansas                53.8%
California              49.1%
Colorado                62.7%
Connecticut             63.8%
Delaware                55.2%
District of Columbia    49.6%
…..                     …..
Utah                    65.2%
Vermont                 67.5%
Virginia                52.8%
Washington              59.9%
West Virginia           50.7%
Wisconsin               69.0%
Wyoming                 62.3%
UNITED STATES           55.2%

Once you have selected Data Analysis, a dialogue box will appear and you should select Descriptive Statistics. You will then get a new dialogue box, and in the Input Range you should type in the first and last cell in the third column separated by a colon - in my example it was I4:I54. You will also want to tell it where you want the output. In this example it went on the same spreadsheet with the top left cell being K4, so K4 is what you put in the dialogue box.

Included here is the output you will get, which includes the basic measures of central tendency (mean, median, mode) and variability (standard deviation, variance, and range). You will also note the minimum and maximum values and the number of observations (count). As for measures of central tendency, the mean was 58% and the median was 59%. The voter participation rate ranged from approximately 42% to 72% with a standard deviation of 7.3%.
Column1

Mean                  0.581318
Standard Error        0.010279
Median                0.5894
Mode                  0.6515
Standard Deviation    0.073408
Sample Variance       0.005389
Kurtosis              -0.84522
Skewness              -0.03072
Range                 0.3004
Minimum               0.4194
Maximum               0.7198
Sum                   29.6472
Count                 51

A second example, one based on the earlier grade data for ECN202, can also be found on the Desc stats tab of the Descriptive Statistics example. The Desc stats sheet provides the summary statistics for a set of 52 grades that were generated by selecting Data Analysis under the Tools menu and then choosing Descriptive Statistics. The input is specified by simply highlighting the column of data and the output is specified to appear in cell D4. To understand the concepts that appear in this table you should look at the concepts of central tendency and variability.

Central Tendency

What's the average? We have heard the question many times: travelers will ask about on-time averages, university administrators will ask about retention rates (the average share of students who return after their first year), investors will ask about average rates of return, sports fans may ask about batting averages... But how do we go about answering the questions? How do we compute the averages?

The first thing to realize is that when people are talking about the average they are talking about a measure of central tendency. In this section we will talk about three measures of central tendency: the mean, median, and mode. To better understand the difference between the three measures, let's return to our example of grades. For an on-line discussion of measures of central tendency you should check out the UCLA On-line Statistics Course, the DAU Stat Refresher, and Hyperstat Online by David Lane at Rice University.

To demonstrate the measures of central tendency we will use our two grading examples. The frequency distribution of the exam scores for ECN202 is repeated in the first diagram below and the histogram for the grades in ECN201 follows it.
Mean: This is what people generally mean when they say average. The mean is the arithmetic mean of all the data and is defined as the sum of all observed values divided by the number of observations. In the table below we have the grade data for the class divided into four groups. For each group we added up the grades to get a total and then divided the total by 13, the number of students in each group. If we then add the four group averages and divide this sum by the number of groups, we get the average of 84.454. This is the same number you would have obtained if you had simply added up all 52 grades and divided by 52. [Note: this procedure works here only because the groups all had the same number of students. If the groups were of different sizes, you would have needed to weight the groups by their relative size.] In terms of the frequency distribution above, you see there are not any actual scores of 84 or 85, but those scores are in the middle of the distribution of grades. If there were more grades in the higher range, this would drag up the mean score.

Grades on ECN202 Exam 1

Student  Grade       Student  Grade       Student  Grade       Student  Grade
   1     91.983        14     97.531        27     96.8659       40     89.0001
   2     91.7597       15     98.2979       28     76.2859       41     76.5876
   3     87.9158       16     70.4242       29     99.6804       42     93.3512
   4     77.0586       17     72.6251       30     87.6299       43     88.82
   5     98.7479       18     86.9584       31     89.6395       44     77.4919
   6     79.8029       19     95.2241       32     85.6969       45     94.7336
   7     80.5968       20     91.9544       33     95.2098       46     75.4139
   8     77.8953       21     80.2882       34     71.9719       47     86.5368
   9     96.1051       22     77.2291       35     92.3448       48     93.7865
  10     74.1581       23     93.1482       36     74.4269       49     73.8672
  11     83.152        24     75.1727       37     82.7137       50     75.7028
  12     91.7678       25     87.5282       38     77.8714       51     73.415
  13     78.1368       26     80.501        39     71.828        52     74.7755

Total           1109.08              1106.88              1102.16              1073.48
Group Average   85.3138              85.1448              84.7819              82.5756
Average         84.454

One notable feature is the difference between the class average and the average for the groups.
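The group-by-group calculation of the mean can be checked with a short Python sketch. It assumes the four groups are simply students 1-13, 14-26, 27-39, and 40-52, as in the table; the variable names are made up for this illustration.

```python
from statistics import mean

# The 52 Exam 1 scores from the grade table, in student order.
grades = [
    91.983, 91.7597, 87.9158, 77.0586, 98.7479, 79.8029, 80.5968, 77.8953,
    96.1051, 74.1581, 83.152, 91.7678, 78.1368, 97.531, 98.2979, 70.4242,
    72.6251, 86.9584, 95.2241, 91.9544, 80.2882, 77.2291, 93.1482, 75.1727,
    87.5282, 80.501, 96.8659, 76.2859, 99.6804, 87.6299, 89.6395, 85.6969,
    95.2098, 71.9719, 92.3448, 74.4269, 82.7137, 77.8714, 71.828, 89.0001,
    76.5876, 93.3512, 88.82, 77.4919, 94.7336, 75.4139, 86.5368, 93.7865,
    73.8672, 75.7028, 73.415, 74.7755,
]

# Split the class into four groups of 13 students, in student order.
groups = [grades[i:i + 13] for i in range(0, len(grades), 13)]
group_means = [mean(g) for g in groups]

# With equal-sized groups, the mean of the group means equals the class mean.
class_mean = mean(grades)
mean_of_means = mean(group_means)

# With unequal groups, each group mean must be weighted by the group's size.
weighted = sum(len(g) * m for g, m in zip(groups, group_means)) / len(grades)

print([round(m, 4) for m in group_means])
print(round(class_mean, 3), round(mean_of_means, 3), round(weighted, 3))
```

The weighted version gives the same answer here because every group has 13 students; it is the form that would still be correct if the groups differed in size.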
As we will see in the inferential statistics section, group averages can be viewed as averages based on samples of the entire student population, and such sample averages will generally differ from the population average.

Median: The median is the midpoint in the distribution of grades. At the median there are as many people above as there are below the median score. In this ECN202 class, the median grade was 84.42, slightly below the mean value.

Grades on ECN202 Exam 1, sorted by grade

Grade  Student      Grade  Student      Grade  Student      Grade  Student
70.4      1         76.6     14         85.7     27         92.3     40
71.8      2         77.1     15         86.5     28         93.1     41
72.0      3         77.2     16         87.0     29         93.4     42
72.6      4         77.5     17         87.5     30         93.8     43
73.4      5         77.9     18         87.6     31         94.7     44
73.9      6         77.9     19         87.9     32         95.2     45
74.2      7         78.1     20         88.8     33         95.2     46
74.4      8         79.8     21         89.0     34         96.1     47
74.8      9         80.3     22         89.6     35         96.9     48
75.2     10         80.5     23         91.8     36         97.5     49
75.4     11         80.6     24         91.8     37         98.3     50
75.7     12         82.7     25         92.0     38         98.7     51
76.3     13         83.2     26         92.0     39         99.7     52

Median = (83.2 + 85.7) / 2 = 84.4244 (computed from the unrounded scores 83.152 and 85.6969)

The median for the population of ECN201 grades was 22 while the median for the sample was 21.5.

Mode: The mode refers to the most frequently observed number. If we round the scores to whole numbers and sort them, we find that the number 77 appears four times and the number 92 appears five times. These are the two peaks in the score distribution, with 92 the single most frequent score.

Grades   Grades   Grades   Grades
  70       77       86       92
  72       77       87       93
  72       77       87       93
  73       77       88       94
  73       78       88       95
  74       78       88       95
  74       78       89       95
  74       80       89       96
  75       80       90       97
  75       81       92       98
  75       81       92       98
  76       83       92       99
  76       83       92      100

In this example we have looked at three measures of central tendency, each of which gives us a bit of information on the class's performance on the exam. The fact that they are all different provides the sophisticated observer with additional information. For example, if the mean, median, and mode are the same, you are most likely looking at a symmetric distribution.
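The median and mode for the ECN202 scores can be reproduced with a short Python sketch. It is only an illustration using the standard library; the variable names are made up here, and the mode is taken on the scores after rounding to whole numbers, as in the text.

```python
from collections import Counter
from statistics import median

# The 52 Exam 1 scores from the grade table, in student order.
grades = [
    91.983, 91.7597, 87.9158, 77.0586, 98.7479, 79.8029, 80.5968, 77.8953,
    96.1051, 74.1581, 83.152, 91.7678, 78.1368, 97.531, 98.2979, 70.4242,
    72.6251, 86.9584, 95.2241, 91.9544, 80.2882, 77.2291, 93.1482, 75.1727,
    87.5282, 80.501, 96.8659, 76.2859, 99.6804, 87.6299, 89.6395, 85.6969,
    95.2098, 71.9719, 92.3448, 74.4269, 82.7137, 77.8714, 71.828, 89.0001,
    76.5876, 93.3512, 88.82, 77.4919, 94.7336, 75.4139, 86.5368, 93.7865,
    73.8672, 75.7028, 73.415, 74.7755,
]

ordered = sorted(grades)

# With 52 scores, the median is the average of the 26th and 27th ordered values.
middle_pair = (ordered[25], ordered[26])
class_median = median(grades)

# Mode of the scores after rounding to whole numbers.
tally = Counter(round(g) for g in grades)
top_score, top_count = tally.most_common(1)[0]

print(middle_pair, round(class_median, 4))
print(top_score, top_count, tally[77])
```

Because the number of scores is even, no single test sits at the midpoint, which is why the median is the average of the two middle values.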
If the mean is above the median, then you probably have an outlier to the right, which would give you a long upper tail that would translate into a skew to the right in the distribution. This is what we would be likely to see in income statistics if there were a few very wealthy people in the sample. If the mean is below the median, then there is a long lower tail - what you would expect to see in a distribution of grades when there was someone with a very low grade. Now it is time to move on to a discussion of variability.

Variability

As important as measures of central tendency are, they do not provide us with a complete picture of the underlying scores. We may know what the averages are, but what about the distribution of grades - the 'spread' of the data? Would you feel the same about two possible grading schemes with the same average, one where there will be only five A's and five F's and a second where there will be no A's or F's? Experience suggests most students would care, although there is no agreement as to the preferred scheme. Those with good academic records tend to favor the first distribution while those with weak records tend to favor the second. The differences between the two grade distributions will be captured in the variability of the scores.

As with the previous discussion of central tendency, there are a number of measures of variation. We will look at range, variance, and standard deviation. For an on-line discussion of measures of variability you should check out the UCLA On-line Statistics Course, the electronic textbook Introductory Statistics: Concepts, Models, and Applications (copyright 1996) by David W. Stockburger, the DAU Stat Refresher, and Hyperstat Online by David Lane at Rice University.

To demonstrate the measures of variability we will use our grading example. The frequency diagram of the exam scores for ECN202 is repeated below.

Range: This is the easy one.
Once you have your data sorted to identify your median, simply look at the highest and lowest values. The range is the difference between the lowest and highest values. In the ECN202 grade example, grades range from 70 to 100. No one received a score higher than 100 or lower than 70.

Variance: How likely is it that we would get a score close to the average? Did most of the students receive similar scores, or were they spread out roughly equally over the entire range? The variance can be thought of as a measure of variability derived by a two-step process. In the first step we compute a new variable that equals the test score minus the mean score (approximately 84.5). The result appears in the second column below. If we add the deviation in the first row (7.9) to the mean (84.5) we get the score (92.3). [Note: there may be a small difference due to rounding.]

ECN202 Grades

Score  Deviation  Deviation^2      Score  Deviation  Deviation^2
92.3      7.9      62.264          79.8     -4.7      21.633
85.7      1.2      1.54463         89.6      5.2      26.8889
93.1      8.7      75.5893         97.5     13.1      171.007
76.6     -7.9      61.8804         98.3     13.8      191.654
86.5      2.1      4.33774         74.2    -10.3      106.006
70.4    -14        196.835         74.4    -10        100.544
77.1     -7.4      54.6925         80.3     -4.2      17.3537
71.8    -12.6      159.417         91.8      7.3      53.3722
93.4      8.9      79.1603         91.8      7.3      53.4916
87        2.5      6.2721          98.7     14.3      204.316
93.8      9.3      87.0956         74.8     -9.7      93.6733
77.2     -7.2      52.1999         80.5     -4        15.6263
87.5      3.1      9.45035         80.6     -3.9      14.8779
94.7     10.3      105.67          92        7.5      56.2552
72      -12.5      155.804         99.7     15.2      231.843
…         …        …               …         …        …
95.2     10.8      115.994         76.3     -8.2      66.7191
96.1     11.7      135.747         89        4.5      20.667
73.4    -11        121.861         96.9     12.4      154.054
77.9     -6.6      43.0169         73.9    -10.6      112.082
88.8      4.4      19.0616         78.1     -6.3      39.9078

Mean                 84.454
Variance             76.45
Standard deviation   8.74

In the second step we square all of the deviation terms in column 2, which generates column 3. We now add these squared terms and divide by the number of observations to get the variance. All other things equal, the greater the variance the greater the spread of the scores.
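The range and the two-step variance calculation can be written out in Python as a minimal sketch. It divides by the number of observations, as the text describes; the Sample Variance reported by Excel's Descriptive Statistics tool divides by one less than the number of observations and so comes out slightly larger. The variable names are made up for this illustration.

```python
# The 52 Exam 1 scores from the grade table, in student order.
grades = [
    91.983, 91.7597, 87.9158, 77.0586, 98.7479, 79.8029, 80.5968, 77.8953,
    96.1051, 74.1581, 83.152, 91.7678, 78.1368, 97.531, 98.2979, 70.4242,
    72.6251, 86.9584, 95.2241, 91.9544, 80.2882, 77.2291, 93.1482, 75.1727,
    87.5282, 80.501, 96.8659, 76.2859, 99.6804, 87.6299, 89.6395, 85.6969,
    95.2098, 71.9719, 92.3448, 74.4269, 82.7137, 77.8714, 71.828, 89.0001,
    76.5876, 93.3512, 88.82, 77.4919, 94.7336, 75.4139, 86.5368, 93.7865,
    73.8672, 75.7028, 73.415, 74.7755,
]

n = len(grades)
mean_score = sum(grades) / n

# The range is the gap between the highest and lowest unrounded scores.
score_range = max(grades) - min(grades)

# Step 1: deviation of each score from the mean.
deviations = [g - mean_score for g in grades]

# Step 2: square the deviations, add them up, and divide by the count.
squared = [d * d for d in deviations]
variance = sum(squared) / n

print(round(score_range, 4), round(variance, 2))
```

The deviations themselves always sum to zero, which is why they have to be squared before they are averaged.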
What would the distribution look like if the variance were larger? You would see fewer observations close to the mean and more observations near the upper and lower limits.

Standard Deviation: There is one problem with the variance - it is measured in squared units (points squared in the grade example), so it is hard to read against the scores themselves and it exaggerates differences in scale. For example, if one teacher used a 4-point scale and another used the 100-point scale, the same spread of grades would produce variances that differ by a factor of 25 squared, or 625. To put the measure back into the units of the original scores, we take the square root of the variance. The result is the standard deviation, which in the ECN202 example is 8.74 points. The standard deviation still depends on the grading scale, so comparing the variability in two classes also requires putting their grades on a common scale. In the ECN201 example, the standard deviations for the entire class and the sample are 4.37 and 4.11. In the next section, where we examine probability and probability distributions, the standard deviation will take on special significance.
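A quick way to see how the variance and the standard deviation respond to a change of grading scale is to rescale the same scores and recompute both. In this sketch, dividing by 25 is a hypothetical conversion of the 100-point scores to a 4-point scale, and the variable names are made up for the illustration; `pvariance` and `pstdev` from Python's standard library divide by the number of observations, as the text does.

```python
from statistics import pstdev, pvariance

# The 52 Exam 1 scores from the grade table, in student order.
grades = [
    91.983, 91.7597, 87.9158, 77.0586, 98.7479, 79.8029, 80.5968, 77.8953,
    96.1051, 74.1581, 83.152, 91.7678, 78.1368, 97.531, 98.2979, 70.4242,
    72.6251, 86.9584, 95.2241, 91.9544, 80.2882, 77.2291, 93.1482, 75.1727,
    87.5282, 80.501, 96.8659, 76.2859, 99.6804, 87.6299, 89.6395, 85.6969,
    95.2098, 71.9719, 92.3448, 74.4269, 82.7137, 77.8714, 71.828, 89.0001,
    76.5876, 93.3512, 88.82, 77.4919, 94.7336, 75.4139, 86.5368, 93.7865,
    73.8672, 75.7028, 73.415, 74.7755,
]

# Variance and standard deviation on the original 100-point scale.
var_100 = pvariance(grades)
sd_100 = pstdev(grades)

# Hypothetical rescaling of the same scores to a 4-point scale.
grades_4 = [g / 25 for g in grades]
var_4 = pvariance(grades_4)
sd_4 = pstdev(grades_4)

print(round(sd_100, 2), round(sd_4, 4))
print(round(var_100 / var_4, 1), round(sd_100 / sd_4, 1))
```

The ratio of the variances is the square of the scale factor, while the ratio of the standard deviations is the scale factor itself, so the standard deviation moves in step with the units of the scores.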