Examples

“Real” Data

The following is a data set of body temperatures of 106 people, recorded at 12 am, by doctors in a Maryland institute. No further details are available, so we are merely describing (or, better, summarizing) the data with no further deductions.

98.6  98.6  98.0  98.0  99.0  98.4  98.4  98.4  98.4  98.6
98.6  98.8  98.6  97.0  97.0  98.8  97.6  97.7  98.8  98.0
98.0  98.3  98.5  97.3  98.7  97.4  98.9  98.6  99.5  97.5
97.3  97.6  98.2  99.6  98.7  99.4  98.2  98.0  98.6  98.6
97.2  98.4  98.6  98.2  98.0  97.8  98.0  98.4  98.6  98.6
97.8  99.0  96.5  97.6  98.0  96.9  97.6  97.1  97.9  98.4
97.3  98.0  97.5  97.6  98.2  98.5  98.8  98.7  97.8  98.0
97.1  97.4  99.4  98.4  98.6  98.4  98.5  98.6  98.3  98.7
98.8  99.1  98.6  97.9  98.8  98.0  98.7  98.5  98.9  98.4
98.6  97.1  97.9  98.8  98.7  97.6  98.2  99.2  97.8  98.0
98.4  97.8  98.4  97.4  98.0  97.0

The same data is sorted, and a rank and percentile position is assigned to each value. Tied values share the same (averaged) rank and the same percentile rank, so each distinct value is listed once, together with the number of times it occurs.

Data    Count   Rank    Percentile Rank
99.6    1       1       100.00%
99.5    1       2       99.05%
99.4    2       3.5     97.14%
99.2    1       5       96.19%
99.1    1       6       95.24%
99.0    2       7.5     93.33%
98.9    2       9.5     91.43%
98.8    7       14      84.76%
98.7    6       20.5    79.05%
98.6    15      31      64.76%
98.5    4       40.5    60.95%
98.4    12      48.5    49.52%
98.3    2       55.5    47.62%
98.2    5       59      42.86%
98.0    13      68      30.48%
97.9    3       76      27.62%
97.8    5       80      22.86%
97.7    1       83      21.90%
97.6    6       86.5    16.19%
97.5    2       90.5    14.29%
97.4    3       93      11.43%
97.3    3       96      8.57%
97.2    1       98      7.62%
97.1    3       100     4.76%
97.0    3       103     1.90%
96.9    1       105     0.95%
96.5    1       106     0.00%

Some indexes were computed by software:

Mean                  98.2
Median                98.4
Mode                  98.6
Standard Deviation    0.62290
Sample Variance       0.388
Range                 3.1
Minimum               96.5
Maximum               99.6
1st quartile          97.8
3rd quartile          98.6

Here is how the same data would be presented in a worksheet or test for our class:

99.6  99.5  99.4  99.4  99.2  99.1  99.0  99.0  98.9  98.9
98.8  98.8  98.8  98.8  98.8  98.8  98.8  98.7  98.7  98.7
98.7  98.7  98.7  98.6  98.6  98.6  98.6  98.6  98.6  98.6
98.6  98.6  98.6  98.6  98.6  98.6  98.6  98.6  98.5  98.5
98.5  98.5  98.4  98.4  98.4  98.4  98.4  98.4  98.4  98.4
98.4  98.4  98.4  98.4  98.3  98.3  98.2  98.2  98.2  98.2
98.2  98.0  98.0  98.0  98.0  98.0  98.0  98.0  98.0  98.0
98.0  98.0  98.0  98.0  97.9  97.9  97.9  97.8  97.8  97.8
97.8  97.8  97.7  97.6  97.6  97.6  97.6  97.6  97.6  97.5
97.5  97.4  97.4  97.4  97.3  97.3  97.3  97.2  97.1  97.1
97.1  97.0  97.0  97.0  96.9  96.5

count             106
Sum               10409.2
Sum of Squares    1022224

The last three rows would help us compute the mean and the variances (“population” and “sample”). The division in columns would help us determine median and quartiles. In particular, the mean is given by Sum/Count, or 10409.2/106. The “population” variance is given by “average of squares - square of average”:

$$\frac{1022224}{106} - \left(\frac{10409.2}{106}\right)^2.$$

The “population” standard deviation is the square root of this number. The “sample” variance is obtained by multiplying the “population” variance by 106/105, and its square root is the “sample” standard deviation. Don’t worry: a “cheat sheet” comes with every worksheet or test.
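The worksheet arithmetic described above can be checked with a short script. This is only a sketch, not part of the handout. The `tally` list is an assumption: it is a hand count of how often each value appears in the sorted listing. The helpers `average_rank` and `percent_rank` are hypothetical names; the formulas inside them are an inference that happens to reproduce the printed rank and percentile columns, not a description of what the original software did.

```python
import math

# Totals copied from the worksheet; the sum of squares appears to be rounded.
count = 106
total = 10409.2
sum_of_squares = 1022224

mean = total / count
# "average of squares - square of average"
pop_var = sum_of_squares / count - mean ** 2
pop_sd = math.sqrt(pop_var)
# Multiply by n/(n-1) to go from the "population" to the "sample" variance.
sample_var = pop_var * count / (count - 1)
sample_sd = math.sqrt(sample_var)

# Assumption: (value, number of occurrences) tallied by hand from the sorted data.
tally = [
    (99.6, 1), (99.5, 1), (99.4, 2), (99.2, 1), (99.1, 1), (99.0, 2),
    (98.9, 2), (98.8, 7), (98.7, 6), (98.6, 15), (98.5, 4), (98.4, 12),
    (98.3, 2), (98.2, 5), (98.0, 13), (97.9, 3), (97.8, 5), (97.7, 1),
    (97.6, 6), (97.5, 2), (97.4, 3), (97.3, 3), (97.2, 1), (97.1, 3),
    (97.0, 3), (96.9, 1), (96.5, 1),
]
data = [value for value, times in tally for _ in range(times)]

# Sample variance computed directly from the deviations, for comparison.
direct_sample_var = sum((x - mean) ** 2 for x in data) / (count - 1)


def average_rank(x, values):
    """Rank counted from the largest value; ties share the average position."""
    higher = sum(1 for v in values if v > x)
    ties = sum(1 for v in values if v == x)
    return higher + (ties + 1) / 2


def percent_rank(x, values):
    """Fraction of the other values that lie strictly below x."""
    below = sum(1 for v in values if v < x)
    return below / (len(values) - 1)


print(f"mean                        = {mean:.4f}")
print(f"population variance (sums)  = {pop_var:.5f}")
print(f"sample variance (sums)      = {sample_var:.5f}")
print(f"sample variance (from data) = {direct_sample_var:.5f}")
print(f"sample standard deviation   = {math.sqrt(direct_sample_var):.5f}")
print(f"rank of 98.8                = {average_rank(98.8, data)}")
print(f"percentile rank of 98.6     = {percent_rank(98.6, data):.2%}")
```

Because the printed Sum of Squares is a whole number, the shortcut lands near 0.386 for the sample variance, while working from the individual values gives the 0.388 reported by the software. The gap comes from rounding in the worksheet total, not from a different formula.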
The example above uses “real” data, on which, however, we have very little information as to how they were collected. This is the usual case when working with information from the web (or from a textbook, for that matter). Even with more information, it is often difficult to verify that the assumptions behind the mathematical models we will use to analyze the data are satisfied (if they are not, many “indexes”, like mean and variance, may have very little useful meaning). Many examples we will work on in class and at home will instead be “simulated data”, produced by computer in a way that makes sure that they do indeed satisfy specific models.

Summation (“Sigma”) Notation

Try to familiarize yourself with the following notation for sums. Sums are everywhere in statistics, and, more generally, in mathematics, so we have developed a shorthand symbol for the operation of summing numbers.

Suppose we have $n$ numbers and need their sum. Denote the numbers by $x_1, x_2, \ldots, x_n$. For example, if the numbers are $2, 4, -5, \frac{1}{2}$, this would correspond to $n = 4$, $x_1 = 2$, $x_2 = 4$, $x_3 = -5$, $x_4 = \frac{1}{2}$. The sum of our $n$ numbers is written as $\sum_{k=1}^{n} x_k$, and, for our example, this would mean the computation $2 + 4 + (-5) + \frac{1}{2} = \frac{3}{2}$.

Thus, the mean of a data set of $n$ numbers $x_1, x_2, \ldots, x_n$ is given by

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{k=1}^{n} x_k}{n},$$

and the “population” variance by

$$S^2 = \frac{\sum_{k=1}^{n} (x_k - \bar{x})^2}{n}.$$

Incidentally, it is sometimes convenient to compute the “population” variance using the equivalent¹ formula

$$S^2 = \frac{\sum_{k=1}^{n} x_k^2}{n} - (\bar{x})^2.$$

The “population” standard deviation is then given by $\sqrt{S^2}$, while the “sample” variance is given by $s^2 = \frac{n}{n-1} S^2$, and the “sample” standard deviation by $s = \sqrt{s^2}$.

¹ The two expressions are algebraically equivalent. However, they may behave differently when used with large data sets, as the rounding approximations that are inevitable may affect the second expression more than the first.
Of course, this issue will not concern us at all, since we will never face big calculations in our class.
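For readers who like to see the notation in action, here is a minimal sketch of the formulas above in Python. The function names are hypothetical, chosen only for this illustration. Exact fractions are used for the four-number example so that both variance formulas can be compared without any rounding. The last few lines use made-up numbers (not from the handout) with a large common offset, where ordinary floating-point arithmetic may make the two formulas disagree, as the footnote warns.

```python
from fractions import Fraction


def mean(xs):
    """x-bar = (x_1 + ... + x_n) / n"""
    return sum(xs) / len(xs)


def pop_variance(xs):
    """S^2 = (sum of (x_k - x-bar)^2) / n"""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def pop_variance_shortcut(xs):
    """S^2 = (sum of x_k^2) / n - (x-bar)^2"""
    m = mean(xs)
    return sum(x * x for x in xs) / len(xs) - m ** 2


def sample_variance(xs):
    """s^2 = n / (n - 1) * S^2"""
    n = len(xs)
    return pop_variance(xs) * n / (n - 1)


# The example from the text: 2, 4, -5, 1/2, kept as exact fractions.
xs = [Fraction(2), Fraction(4), Fraction(-5), Fraction(1, 2)]
print("sum             =", sum(xs))
print("mean            =", mean(xs))
print("S^2 (first)     =", pop_variance(xs))
print("S^2 (shortcut)  =", pop_variance_shortcut(xs))
print("s^2             =", sample_variance(xs))

# Illustrative floats with a large offset; the shortcut may lose accuracy here.
shifted = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print("float S^2 (first)    =", pop_variance(shifted))
print("float S^2 (shortcut) =", pop_variance_shortcut(shifted))
```

With exact fractions the two variance formulas agree, which is what "algebraically equivalent" means. Any difference in the last two printed lines comes purely from floating-point rounding.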