1 z-scores

A data set is normalised by subtracting the mean from each observation and then dividing by the standard deviation. The resulting observations are called normalised scores or z-scores. The new data set will have mean 0 and standard deviation 1. If x is a value and z is the corresponding z-score, then

z = (x − µ)/σ and x = σz + µ.

In Lecture 3, we considered the data set 2, 3, 6, 9, 10. We found that µ = 6 and σ ≈ 3.16. The observation 3 has a z-score of

z ≈ (3 − 6)/3.16 ≈ −0.949,

so we say that this observation is 0.949 standard deviations below the mean. The mean corresponds to a z-score of 0. The value at two standard deviations above the mean is x ≈ 3.16 · 2 + 6 ≈ 12.3. No observations were this high.

z-scores give a standardised measure of value.

2 Percentiles, Deciles, and Quartiles

Consider the data set

4, 24, 25, 27, 27, 27, 29, 30, 32, 32, 33, 33, 33, 35, 35, 37, 37, 39, 55, 100

consisting of 20 observations. Notice that this data is presented in increasing order.

Suppose we want to find a value such that 37% of the observations are less than this value. This value is called the 37th percentile and written as P37. To find it, we multiply the number of observations by 37% (0.37) and obtain 7.4. Since 7.4 is not a whole number, we round up and look at the 8th observation, which is 30. So we say that the 37th percentile is P37 = 30. Notice that 7 observations (which is actually 35% of 20) are less than 30. This is as good as we can do.

Now say we want to find a value such that 85% of the observations are less than this value. Again, multiply the number of observations by 85% (0.85) and obtain 17. Since 17 is a whole number, take the average of the 17th and 18th observations on the list. That is, we say that the 85th percentile is

P85 = (37 + 39)/2 = 38.

Notice that 17 observations (which is exactly 85% of 20) are less than 38.

In general, to find the pth percentile, we compute

i = Np/100,

where N is the size of the data set.
If i is a whole number, report the average of the observations in the ith and (i + 1)st positions. If i is not a whole number, report the observation whose position is the next integer larger than i.

A similar calculation works to obtain deciles (where the data is broken up into 10 equal pieces) and quartiles (where the data is broken up into 4 equal pieces). We can think of computing percentiles as putting the data into 100 bins (0th percentile to 99th percentile), deciles as putting the data into 10 bins (0th decile to 9th decile), and quartiles as putting the data into 4 bins (0th quartile to 3rd quartile). These bins are reported by their minimum value. That is, Q3 is the minimum value of the 3rd quartile bin.

To find the dth decile, compute

i = Nd/10,

and to find the qth quartile, compute

i = Nq/4.

Or, use the facts that, for example, the third decile is the same as the 30th percentile and the third quartile is the same as the 75th percentile.

As an example, to compute the 5th decile, we find i = (20 · 5)/10 = 10. Thus the 5th decile is the average of the 10th and 11th observations, which is D5 = (32 + 33)/2 = 32.5.

To compute the 1st quartile, we find i = (20 · 1)/4 = 5. Thus the 1st quartile is the average of the 5th and 6th observations, which is Q1 = (27 + 27)/2 = 27.

We could think of the 3rd quartile as the 75th percentile and find i = (20 · 75)/100 = 15. Thus the 3rd quartile is the average of the 15th and 16th observations, which is Q3 = (35 + 37)/2 = 36.

We can also give a percentile, decile, or quartile for each observation. To do this, find the fraction of observations which are smaller than this observation and multiply this by 100 for percentile, 10 for decile, and 4 for quartile. Truncate, do not round, the result.

Consider the observation 100. There are 19 out of 20 observations which are smaller than this. Since (19/20) × 100 = 95, we say that this is in the 95th percentile. The maximum observation is not at the 100th percentile because it is not larger than itself.
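The position rule and the rank calculation described above can be sketched in Python as follows. This is a minimal sketch that follows the rule exactly as stated in these notes; statistics libraries usually interpolate instead, so their results may differ. The names `value_at` and `rank_of` are illustrative choices and do not come from the notes.

```python
import math
from fractions import Fraction

# Data set from the notes, already sorted in increasing order.
DATA = [4, 24, 25, 27, 27, 27, 29, 30, 32, 32,
        33, 33, 33, 35, 35, 37, 37, 39, 55, 100]


def value_at(data, k, parts=100):
    """Return the k-th percentile (parts=100), decile (10) or quartile (4).

    Assumes `data` is sorted in increasing order and 0 < k < parts.
    """
    n = len(data)
    i = Fraction(n * k, parts)      # exact position i = N*k/parts, no float error
    if i.denominator == 1:          # i is a whole number
        i = int(i)
        # average of the i-th and (i+1)-st observations (1-based positions)
        return (data[i - 1] + data[i]) / 2
    # otherwise report the observation at the next integer position above i
    return data[math.ceil(i) - 1]


def rank_of(data, x, parts=100):
    """Return the percentile, decile or quartile bin that observation x is in."""
    smaller = sum(1 for v in data if v < x)   # observations strictly below x
    # integer division truncates, as the notes require, rather than rounding
    return (smaller * parts) // len(data)
```

For the data set above, `value_at(DATA, 37)` gives 30 and `value_at(DATA, 85)` gives 38, while `rank_of(DATA, 100)` gives 95, in agreement with the worked examples. Using `Fraction` for the position avoids deciding "whole number or not" from a rounded floating-point product.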
Consider the two observations of 32. There are 8 out of 20 observations which are smaller than these. Since (8/20) × 4 = 1.6, we say that this is in the 1st quartile. We do not round up. The second quartile starts at 33. Since (8/20) × 10 = 4, we say that this is in the 4th decile.

Percentiles, deciles, and quartiles give standardised measures of rank.

3 Inter-Quartile Range

The inter-quartile range is IQR = Q3 − Q1. This is a measure of spread of the data and indicates the range of the middle half of the observations. It is more robust than the ordinary range, which spans all the observations and can be significantly affected by outliers.

In the above example, the IQR is 36 − 27 = 9. The range is 100 − 4 = 96, and depends directly on the most extreme data.

4 Outliers

Any observation which is at least 1.5 IQRs above Q3 but no more than 3 IQRs above Q3 is considered a possible outlier. The same goes for any observation which is at least 1.5 IQRs below Q1 but no more than 3 IQRs below Q1.

Since the IQR is 9 here, 1.5 IQRs is 13.5 and 3 IQRs is 27. So possible outliers are between 36 + 13.5 = 49.5 and 36 + 27 = 63 on the high end, and between 27 − 27 = 0 and 27 − 13.5 = 13.5 on the low end. That is, 4 and 55 are possible outliers.

Any observation which is more than 3 IQRs above Q3 is considered a probable outlier. The same goes for any observation which is more than 3 IQRs below Q1. So probable outliers are more than 63 on the high end, and less than 0 on the low end. That is, 100 is a probable outlier.

5 Five Number Summary

The five number summary consists of the minimum, Q1, the median (also known as Q2 or D5 or P50), Q3, and the maximum. In the example considered here, it is 4, 27, 32.5, 36, 100.

One quarter of the data is between the minimum and Q1, one quarter of the data is between Q1 and the median, one quarter of the data is between the median and Q3, and the remaining quarter of the data is between Q3 and the maximum.
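The inter-quartile range, the outlier rule, and the five number summary can be checked together with a short Python sketch. It assumes the quartile position rule from these notes (not the interpolation most libraries use) and a data set whose IQR is positive; the function names are illustrative and do not come from the notes.

```python
# Data set from the notes, already sorted in increasing order.
DATA = [4, 24, 25, 27, 27, 27, 29, 30, 32, 32,
        33, 33, 33, 35, 35, 37, 37, 39, 55, 100]


def quartile(data, q):
    """q-th quartile (q = 1, 2 or 3) of sorted data, by the notes' position rule."""
    n = len(data)
    if (n * q) % 4 == 0:              # i = N*q/4 is a whole number
        i = n * q // 4
        return (data[i - 1] + data[i]) / 2
    # i is not whole: the next position above i is 0-based index floor(i)
    return data[(n * q) // 4]


def classify_outliers(data):
    """Return (possible, probable) outliers; assumes the IQR is positive."""
    q1, q3 = quartile(data, 1), quartile(data, 3)
    iqr = q3 - q1
    possible, probable = [], []
    for x in data:
        dist = max(q1 - x, x - q3)    # how far x lies outside the box
        if dist > 3 * iqr:
            probable.append(x)        # more than 3 IQRs beyond Q1 or Q3
        elif dist >= 1.5 * iqr:
            possible.append(x)        # between 1.5 and 3 IQRs beyond Q1 or Q3
    return possible, probable


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) of sorted data."""
    return (data[0], quartile(data, 1), quartile(data, 2),
            quartile(data, 3), data[-1])
```

With the data set above, `classify_outliers(DATA)` separates 4 and 55 (possible) from 100 (probable), and `five_number_summary(DATA)` reproduces 4, 27, 32.5, 36, 100.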
6 Box and Whisker Plot

A box is drawn between Q1 and Q3. The median is indicated with a line through this box. Whiskers extend from the box out to the most extreme non-outliers. Possible outliers are indicated with × and probable outliers are indicated with +.

Here, the largest non-outlier is 39 and the smallest non-outlier is 24. Therefore, the whisker coming off the top of the box extends up to 39 and the whisker coming off the bottom of the box extends down to 24.

[Figure: box and whisker plot of the data set; vertical axis from 0 to 110 in steps of 10.]

One quarter of the data is in the lower part of the box. One quarter of the data is in the upper part of the box. Half the data is within the box. Although the two parts of the box may not have the same size, they each contain the same number of observations.
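The whisker ends quoted above can be found mechanically. The sketch below is one way to do it in Python, taking Q1 = 27 and Q3 = 36 as already computed in these notes; the name `whisker_ends` is an illustrative choice. Plotting libraries often use a different quartile convention, so a box drawn by such a library may not match these values exactly.

```python
# Data set from the notes, already sorted in increasing order.
DATA = [4, 24, 25, 27, 27, 27, 29, 30, 32, 32,
        33, 33, 33, 35, 35, 37, 37, 39, 55, 100]

# Quartiles as computed in the notes (taken as given here).
Q1, Q3 = 27, 36


def whisker_ends(data, q1, q3):
    """Return (low, high): the most extreme observations that are not outliers."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # An observation at least 1.5 IQRs beyond Q1 or Q3 counts as an outlier,
    # so only values strictly inside the fences are kept.
    inside = [x for x in data if low_fence < x < high_fence]
    return min(inside), max(inside)
```

Calling `whisker_ends(DATA, Q1, Q3)` returns the pair 24 and 39, the whisker ends stated in the text.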