AMS 5 NUMERICAL DESCRIPTIVE METHODS

Introduction

A histogram provides a graphical description of the distribution of a sample of data. If we want to summarize the properties of such a distribution, we can measure the center and the spread of the histogram.

These two histograms correspond to samples with the same center. The spread of the sample on top is smaller than that of the sample on the bottom.

The Average

57 83 92 237 51 82 127 65 87 66
76 85 70 198 152 95 110 117 83 52
69 134 74 70 53 165 116 78 161 129
70 74 156 93 53 66 58 77 88 65
49 86 64 100 95 80 70 156 81 65
81 70 94 65 80 222 105 59 83 86
49 80 168 78 161 63 62 92 69 72
102 92 68 67 72 53 57 55 143 75
110 64 56 53 50 57 79 89 48 107
100 77 103 63 145 82 105 85 65 79
59 76 57 47 118 91 167 94 154 56
72 69 55 106 53 145 123 83 65 119

Table: Maximum daily ozone values (in PSI) in L.A. on 120 typical days in 1989

[Figure: histogram of the ozone data. Horizontal axis: Maximum Daily Ozone (PSI), 50 to 250. Vertical axis: Frequency, 0 to 40.]

What was the “typical” ozone level in 1989? How should we define what we mean by a typical number? The most frequently used definition is a simple rule: add the numbers up and divide by how many there are.

The average (or arithmetic mean) of a list of numbers equals their sum divided by how many there are.

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = 10{,}729 / 120 = 89.4 \text{ PSI} \]

The Median

We can easily observe that only 42 out of the 120 observations are above the average (i.e. only 35%). This is because the histogram is not symmetric.

[Figure: the same histogram with the median and the average marked. Horizontal axis: Maximum Daily Ozone (PSI). Vertical axis: Frequency.]

A histogram balances when supported at the average. The median is the value with half the area of the histogram to the left and half to the right.

A symmetric histogram will look like this. In this case 50% of the data are above the average, i.e. the median and the average coincide.

Calculating the Median

Sort your data.
If you have an odd number of observations, the median is exactly the middle number of the ordered ones (e.g. with 9 ordered numbers, the median is the 5th ordered observation).

If you have an even number of observations, as in the ozone example where n = 120, then two numbers could be the middle: the 60th and the 61st ordered observations (with values 79 and 80 PSI respectively). To resolve this conflict we choose as the median the average of these two numbers, so the median in our example is (79 + 80)/2 = 79.5.

Average or Median

The Root-Mean-Square

Back in the ozone example, we saw that the typical ozone level is around either 80 or 90, depending on how we define “typical”. But give or take how much? In other words, what is the typical amount by which the numbers differ or deviate from the middle one? What we need is a numerical measure of the spread of the data.

Before defining this measure of spread, let's answer an even simpler question: how big are our numbers? To make things easier, let's suppose that we only had the following 5 numbers:

0, 5, -8, 7, -3

The average is (0 + 5 - 8 + 7 - 3)/5 = 0.2, but this is a very poor measure of size, since it allows the positives to cancel the negatives.

The simplest way around this problem is to wipe out the signs by averaging the absolute values. The average neglecting signs is then (0 + 5 + 8 + 7 + 3)/5 = 4.6, and it gives us a measure of size.

Another measure is the root-mean-square (r.m.s.):
1. SQUARE all the entries.
2. Take the MEAN of the squares.
3. Take the square ROOT of the mean.

Therefore:

\[ \text{r.m.s. size} = \sqrt{\frac{0^2 + 5^2 + (-8)^2 + 7^2 + (-3)^2}{5}} \approx 5.4 \]

The Standard Deviation

Back to our initial question: what is the typical amount by which the numbers differ or deviate from the middle one? If we choose the average as the middle number, what we really want to calculate is a measure of the size of each entry's deviation from the average.
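The three measures of size computed above for the numbers 0, 5, -8, 7, -3 can be checked with a minimal Python sketch. The function names (`average`, `average_abs`, `rms`) are illustrative choices, not part of the course material.

```python
import math


def average(xs):
    """Sum of the entries divided by how many there are."""
    return sum(xs) / len(xs)


def average_abs(xs):
    """Average neglecting signs: the mean of the absolute values."""
    return sum(abs(x) for x in xs) / len(xs)


def rms(xs):
    """Root-mean-square size, read right to left: square, mean, root."""
    squares = [x * x for x in xs]          # SQUARE all the entries
    mean_of_squares = sum(squares) / len(xs)  # take the MEAN of the squares
    return math.sqrt(mean_of_squares)      # take the square ROOT of the mean


numbers = [0, 5, -8, 7, -3]
print(average(numbers))         # 0.2 (positives and negatives cancel)
print(average_abs(numbers))     # 4.6
print(round(rms(numbers), 1))   # 5.4
```

The r.m.s. comes out a little larger than the average of the absolute values here because squaring gives extra weight to the large entries.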
We will use the root-mean-square as the measure of size, and we will call the resulting measure of spread the standard deviation (SD). Therefore

SD = r.m.s. deviation from the average = r.m.s. (entry − average)

To calculate the standard deviation of a sample, follow these steps:
1. Calculate the average.
2. Calculate the list of deviations from the average by taking the difference between each entry and the average.
3. Calculate the r.m.s. size of the resulting list.

Example: Let's go back to the simple example with only the following 5 numbers:

0, 5, -8, 7, -3

1. The average is (0 + 5 - 8 + 7 - 3)/5 = 0.2.
2. To find the deviations from the average we just have to subtract the average from each entry: -0.2, 4.8, -8.2, 6.8, -3.2
3. Find the r.m.s. size of the deviations:

\[ SD = \sqrt{\frac{(-0.2)^2 + 4.8^2 + (-8.2)^2 + 6.8^2 + (-3.2)^2}{5}} \approx 5.42 \]

Shortcut Formula

\[ SD = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2} \]

For the ozone example:

Observation number (i)    x_i      x_i^2
1                         57       3249
2                         83       6889
.                         .        .
.                         .        .
120                       107      11449
sum                       10729    1121637

\[ SD = \sqrt{\frac{1121637}{120} - \left( \frac{10729}{120} \right)^2} \approx 36.8 \]

The SD comes out in the same units as the data. Do not confuse the SD of a list with its r.m.s. size. The SD is the r.m.s. not of the original numbers on the list, but of their deviations from the average.

In many data sets (especially when the distribution is symmetric) the following rule of thumb applies:
1. Roughly 68% of the observations are within one SD of the average.
2. Roughly 95% of the observations are within two SDs of the average.
3. Roughly 99% of the observations are within three SDs of the average.

Using a Statistical Calculator
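As one possible stand-in for a statistical calculator, the sketch below recomputes the average, the median, and the SD of the ozone data in Python. The helper names (`average`, `median`, `sd`, `sd_shortcut`, `share_within`) are illustrative assumptions; only `statistics.pstdev` is a standard-library function, used here as a cross-check because it divides by n, as the r.m.s. definition of the SD does.

```python
import math
import statistics

# Maximum daily ozone values (PSI), copied from the table in reading order.
OZONE = [
    57, 83, 92, 237, 51, 82, 127, 65, 87, 66,
    76, 85, 70, 198, 152, 95, 110, 117, 83, 52,
    69, 134, 74, 70, 53, 165, 116, 78, 161, 129,
    70, 74, 156, 93, 53, 66, 58, 77, 88, 65,
    49, 86, 64, 100, 95, 80, 70, 156, 81, 65,
    81, 70, 94, 65, 80, 222, 105, 59, 83, 86,
    49, 80, 168, 78, 161, 63, 62, 92, 69, 72,
    102, 92, 68, 67, 72, 53, 57, 55, 143, 75,
    110, 64, 56, 53, 50, 57, 79, 89, 48, 107,
    100, 77, 103, 63, 145, 82, 105, 85, 65, 79,
    59, 76, 57, 47, 118, 91, 167, 94, 154, 56,
    72, 69, 55, 106, 53, 145, 123, 83, 65, 119,
]


def average(xs):
    """Sum of the entries divided by how many there are."""
    return sum(xs) / len(xs)


def median(xs):
    """Middle ordered value, or the average of the two middle values."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def sd(xs):
    """r.m.s. size of the deviations from the average (divides by n)."""
    avg = average(xs)
    return math.sqrt(sum((x - avg) ** 2 for x in xs) / len(xs))


def sd_shortcut(xs):
    """Shortcut formula: sqrt(mean of the squares - square of the mean)."""
    n = len(xs)
    return math.sqrt(sum(x * x for x in xs) / n - (sum(xs) / n) ** 2)


def share_within(xs, k):
    """Fraction of the entries that lie within k SDs of the average."""
    avg, spread = average(xs), sd(xs)
    return sum(abs(x - avg) <= k * spread for x in xs) / len(xs)


if __name__ == "__main__":
    print("average:", round(average(OZONE), 1))
    print("median:", median(OZONE))
    print("SD (definition):", round(sd(OZONE), 2))
    print("SD (shortcut):", round(sd_shortcut(OZONE), 2))
    print("SD (statistics.pstdev):", round(statistics.pstdev(OZONE), 2))
    for k in (1, 2, 3):
        print(f"share within {k} SD:", round(share_within(OZONE, k), 3))
```

Many calculators offer two SD keys, one dividing by n and one by n − 1. In Python the corresponding functions are `statistics.pstdev` (divides by n, matching the definition used here) and `statistics.stdev` (divides by n − 1), so the two give slightly different answers on the same list. The last loop reports how closely the rule of thumb holds for this skewed sample.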