MATH2560 C F03 Elementary Statistics I
Lecture 2: Describing Distributions with Numbers

1 Outline

⇒ mean; ⇒ median; ⇒ quartiles; ⇒ boxplots; ⇒ variance; ⇒ standard deviation; ⇒ linear transformation.

2 Description of a Distribution with Numbers

A numerical summary of a distribution should report its center and its spread (variability). A brief description should also include its shape, which histograms and stemplots display.

3 Measuring Center: the Mean

The mean x̄ is the arithmetic average of the observations.

The Mean x̄
To find the mean of a set of observations, add their values and divide by the number of observations. If the n observations are x_1, x_2, ..., x_n, their mean is

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]

or, in more compact notation,

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i . \]

3.1 Examples

1. The Babe's mean number of home runs hit in a year is:

\[ \bar{x} = \frac{1}{15}(54 + 59 + \cdots + 22) = 43.9 . \]

2. Roger Maris's mean, from the data of Lecture 1, is:

\[ \bar{x} = \frac{1}{10}(8 + 13 + \cdots + 61) = \frac{261}{10} = 26.1 . \]

Ruth's superiority is evident from these averages: 43.9 > 26.1.

A numerical description that resists the influence of extreme observations is called a resistant measure. The mean does not: a single extreme observation can pull it toward itself, so the mean is not a resistant measure of center.

4 Measuring Center: the Median

The median M describes the midpoint of the observations.

The Median M
The median M is the midpoint of a distribution, the number such that half the observations are smaller and the other half are larger. To find the median of a distribution:

1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in the ordered list. Find the location of the median by counting (n + 1)/2 observations up from the bottom of the list.
3. If the number of observations n is even, the median M is the mean of the two center observations in the ordered list. The location of the median is again (n + 1)/2 from the bottom of the list.
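The two recipes above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `mean` and `median` are my own, and the data are the Ruth and Maris home run counts listed in the examples of Section 4.1.

```python
def mean(values):
    """Arithmetic average: add the observations, divide by their number."""
    return sum(values) / len(values)


def median(values):
    """Median found by sorting and taking the (n + 1)/2 location."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the center observation (0-based index n // 2).
        return ordered[middle]
    # Even n: the mean of the two center observations.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Home run counts as listed in Section 4.1 of these notes.
ruth = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60]
maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]

print(round(mean(ruth), 1), median(ruth))    # 43.9 46
print(round(mean(maris), 1), median(maris))  # 26.1 24.5
```

The printed values agree with the means in Section 3.1 and the medians worked out by hand in Section 4.1.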
4.1 Examples

1. Babe Ruth median:
   1. Arrange the data in increasing order: 22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60.
   2. The median is the middle 46, the eighth observation in the ordered list.
   3. You can also use the recipe (n + 1)/2 = 16/2 = 8 to locate the median in the list.
2. Roger Maris median:
   1. Arrange the data in increasing order: 8, 13, 14, 16, 23, 26, 28, 33, 39, 61.
   2. The number of observations is even: n = 10. Hence

\[ M = \frac{23 + 26}{2} = \frac{49}{2} = 24.5 . \]

   3. The recipe (n + 1)/2 = 11/2 = 5.5 for the position of the median in the list means that the median is at location "five and one-half", that is, halfway between the fifth and sixth observations.

The mean and median describe the center of a distribution in different ways.

4.2 Mean versus Median

The median is a "middle value", whereas the mean is an "arithmetic average value". For a symmetric distribution the mean and median are exactly the same. In a skewed distribution, the mean is farther out in the long tail than is the median.

5 Measuring Spread: The Quartiles

When the median is used to describe the center of a distribution, the quartiles are used to describe its spread. The pth percentile of a distribution is the value such that p percent of the observations fall at or below it. The most commonly used percentiles other than the median (the 50th percentile) are the quartiles. The first quartile is the 25th percentile, and the third quartile is the 75th percentile. The first quartile Q1 has 1/4 of the observations below it. The third quartile Q3 has 3/4 of the observations below it.

How to Calculate the Quartiles Q1 and Q3
To calculate the quartiles:

1. Arrange the observations in increasing order and locate the median M in the ordered list of observations.
2. The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median.
3. The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median.

5.1 Example 1.15

In Example 1.5 (shopping data) let us arrange the data in order after rounding to eliminate the cents. We obtain

3, 9, 9, 11, ..., 28, ||, 28, ..., 86, 86, 93.

There are n = 50 observations, so the position of the median is (50 + 1)/2 = 25.5, or midway between the 25th and 26th observations. This location is marked by ||. To find the first quartile, consider the 25 observations falling to the left of the location || of the median. The median of these 25 observations is the thirteenth in order, or $19. This is the first quartile. The third quartile is the thirteenth value above the median, or $45. Summarizing these results in compact form:

Q1 = $19, M = $28, Q3 = $45.

It is easy to find other percentiles. Since 0.95 × 50 = 47.5, we take the 95th percentile of the 50 observations to be the 48th in the ordered list, namely $86.

The median of Ruth's 15 home run totals, 22, 25, 34, ..., 46, 46, 46, ..., 59, 60, is 46. The first quartile is the median of the seven observations falling to the left of this point in the list, Q1 = 35. Similarly, Q3 = 54.

6 Measuring Spread: the Interquartile Range

The interquartile range (IQR) is the difference between the quartiles:

\[ IQR = Q_3 - Q_1 . \]

It is the spread of the center half of the data.

Example 1.15 (shopping data): IQR = 45 − 19 = $26.

The quartiles and the IQR are not affected by changes in either tail of the distribution. They are therefore resistant.

6.1 The 1.5 × IQR Criterion for Outliers

The 1.5 × IQR criterion flags observations more than 1.5 × IQR beyond the quartiles as possible outliers. It calls an observation a suspected outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.

Example 1.15. 1.5 × IQR = 1.5 × $26 = $39. Any values below 19 − 39 = −$20 or above 45 + 39 = $84 are flagged as possible outliers.
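The quartile rule and the 1.5 × IQR criterion can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the helper names (`quartiles`, `outlier_fences`) are hypothetical, and the quartile rule is the "median of each half" rule from these notes, which can differ slightly from the percentile rules built into statistical software. Because the full shopping data are not reproduced here, only their quartiles (19 and 45) are used.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-each-half rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # left of the median's location
    upper = ordered[half + n % 2:]   # right of the median's location
    return median(lower), median(ordered), median(upper)


def outlier_fences(q1, q3):
    """Cutoffs beyond which the 1.5 x IQR criterion flags an observation."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


ruth = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60]

print(quartiles(ruth))         # (35, 46, 54)
print(outlier_fences(19, 45))  # (-20.0, 84.0), the shopping-data cutoffs
```

When n is odd, the slice `ordered[half + n % 2:]` skips the median itself, so neither half contains it, matching steps 2 and 3 of the recipe.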
The flagged values $86 and $93 do not appear to be outliers in the sense of deviations from the overall pattern of the distribution.

7 The Five-Number Summary

The five-number summary (FNS) provides a quick overall description of a distribution. The FNS consists of the median, the two quartiles, and the smallest and largest individual observations (IO). The median describes the center, and the quartiles and extremes (smallest and largest IO) show the spread.

The Five-Number Summary
Minimum   Q1   M   Q3   Maximum

Example 1.15. The five-number summary is: 3, 19, 28, 45, 93.

8 Boxplots

Another visual representation of a distribution is the boxplot. Boxplots, which are based on the five-number summary, are useful for comparing several distributions. The box spans the quartiles and shows the spread of the central half of the distribution. The median is marked within the box. Lines extend from the box to the extremes and show the full spread of the data. (The points identified by the 1.5 × IQR criterion are often plotted individually.)

Boxplot
A boxplot is a graph of the five-number summary, with suspected outliers plotted individually.

1. A central box spans the quartiles.
2. A line in the box marks the median.
3. Observations more than 1.5 × IQR outside the central box are plotted individually as possible outliers.
4. Lines extend from the box out to the smallest and largest observations that are not suspected outliers.

8.1 Comparing Distributions

Boxplots are most useful for comparing distributions. Consider the following test results for calories and milligrams of sodium in a number of hot dogs (see Table 1.8). The five-number summaries of the distributions of calories are:

Type      Min.   Q1      M       Q3      Max.
Beef      111    140     152.5   178.5   190
Meat      107    138.5   153     180.5   195
Poultry    86    100.5   129     143.5   170

We can see that no observations fall more than 1.5 × IQR outside the quartiles. Figure 1.16 presents boxplots based on these calculations.
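The claim that no hot dog observation is a suspected outlier can be checked directly from the five-number summaries in the table. The sketch below is illustrative only; the dictionary layout and the function name `has_suspected_outliers` are assumptions of mine. It relies on the fact that if the minimum and maximum lie within the cutoffs, every other observation does too.

```python
# Five-number summaries of calories: (Min, Q1, M, Q3, Max), from the table.
five_number = {
    "Beef": (111, 140.0, 152.5, 178.5, 190),
    "Meat": (107, 138.5, 153.0, 180.5, 195),
    "Poultry": (86, 100.5, 129.0, 143.5, 170),
}


def has_suspected_outliers(summary):
    """True if the minimum or maximum lies more than 1.5 x IQR outside the box."""
    minimum, q1, _median, q3, maximum = summary
    iqr = q3 - q1
    return minimum < q1 - 1.5 * iqr or maximum > q3 + 1.5 * iqr


for kind, summary in five_number.items():
    print(kind, has_suspected_outliers(summary))
# Beef False
# Meat False
# Poultry False
```

For example, the beef IQR is 178.5 − 140 = 38.5, so the cutoffs are 140 − 57.75 = 82.25 and 178.5 + 57.75 = 236.25, and both 111 and 190 fall inside them.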
Figure 1.16 also illustrates why boxplots are generally inferior to stemplots and histograms as displays of a single distribution. Let us make a stemplot of the calorie content of the 17 brands of meat hot dogs:

Stemplot of the calorie content
10 | 7
11 |
12 |
13 | 5689
14 | 067
15 | 3
16 |
17 | 2359
18 | 2
19 | 015

There are two distinct clusters and one outlier in the lower tail. The boxplot hid the clusters.

9 Measuring Spread: the Standard Deviation

The variance s^2 and especially its square root, the standard deviation s, are common measures of spread about the mean as center.

Variance
The variance s^2 of a set of observations is the average of the squares of the deviations of the observations from their mean. In symbols, the variance of n observations x_1, x_2, ..., x_n is

\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} \]

or

\[ s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 . \]

The idea of the variance is clear: it is an average of the squares of the deviations (x_i − x̄), i = 1, ..., n, of the observations x_i from their mean x̄, with the sum divided by n − 1 rather than n. The standard deviation measures spread by looking at how far the observations are from their mean. The standard deviation s is zero when there is no spread. It gets larger as the spread increases.

The Standard Deviation s
The standard deviation s is the square root of the variance s^2:

\[ s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2} . \]

Example 1.18. Metabolic rates of seven men (in calories): 1792, 1666, 1362, 1614, 1460, 1867, 1439. Here x̄ = 1600 calories, and the first deviation is x_1 − x̄ = 1792 − 1600 = 192 calories. The squared deviations sum to 214870, so s^2 = 214870/6 ≈ 35811.67 and s = 189.24 calories.

9.1 Properties of Standard Deviation

Properties of s

1. s measures spread about the mean and should be used only when the mean is chosen as the measure of center.
2. s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise, s > 0. As the observations become more spread out about their mean, s gets larger.
3. s, like the mean x̄, is not resistant. A few outliers can make s very large.
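The calculation in Example 1.18 can be reproduced step by step. The sketch below follows the definition of s^2 with the n − 1 divisor and then compares the result with `statistics.stdev` from the Python standard library, which uses the same divisor. The variable names are my own.

```python
import math
import statistics

# Metabolic rates of seven men (calories), from Example 1.18.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(rates)
x_bar = sum(rates) / n
deviations = [x - x_bar for x in rates]

# Sum of squared deviations divided by n - 1, then the square root.
variance = sum(d ** 2 for d in deviations) / (n - 1)
s = math.sqrt(variance)

print(x_bar)                            # 1600.0
print(sum(deviations))                  # 0.0
print(round(variance, 2), round(s, 2))  # 35811.67 189.24

# The standard library's sample standard deviation gives the same value.
print(math.isclose(s, statistics.stdev(rates)))  # True
```

The deviations always sum to zero, which is one way to see why only n − 1 of them can vary freely and why the sum of squares is divided by n − 1.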
9.2 Choosing Measures of Center and Spread

Choosing a Summary
The five-number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with strong outliers. Use x̄ and s only for reasonably symmetric distributions that are free of outliers.

10 Changing the Unit of Measurement

Linear Transformations
A linear transformation changes the original variable x into the new variable x_new given by an equation of the form

\[ x_{new} = a + bx . \]

Adding the constant a shifts all values of x upward or downward by the same amount. Such a shift changes the origin (zero point) of the variable. Multiplying by the positive constant b changes the size of the unit of measurement.

10.1 Example 1.20

(a) Transforming kilometers into miles: x_new = 0.62x. So 10 km is 6.2 miles.

(b) Transforming Fahrenheit into Celsius:

\[ x_{new} = \frac{5}{9}(x - 32) = -\frac{160}{9} + \frac{5}{9}x . \]

So 95°F is 35°C:

\[ x_{new} = -\frac{160}{9} + \frac{5}{9} \cdot 95 = 35 . \]

A linear transformation (LT) changes the origin if a ≠ 0 and changes the size of the unit of measurement if the positive constant b differs from 1. LTs do not change the overall shape of a distribution. An LT multiplies a measure of spread by b, and changes a percentile or measure of center m into a + bm.

Effect of Linear Transformations
Apply the following rules to see the effect of an LT on measures of center and spread:

1. Multiplying each observation by a positive number b multiplies both measures of center (mean and median) and measures of spread (interquartile range and standard deviation) by b.
2. Adding the same number a (positive or negative) to each observation adds a to measures of center and to quartiles and other percentiles but does not change measures of spread.

For example, if x has mean x̄, the transformed variable x_new has mean a + bx̄.
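These rules can be checked numerically. The sketch below applies the Fahrenheit-to-Celsius transformation from Example 1.20(b) to a small set of temperatures. The five readings are hypothetical values invented for illustration, not data from the course. Comparisons use `math.isclose` because floating-point arithmetic is not exact.

```python
import math
import statistics

# x_new = a + b * x with the Fahrenheit-to-Celsius constants of Example 1.20(b).
a, b = -160 / 9, 5 / 9

# Hypothetical temperature readings in degrees Fahrenheit (illustration only).
fahrenheit = [50.0, 59.0, 68.0, 77.0, 95.0]
celsius = [a + b * x for x in fahrenheit]

# Measures of center are transformed like the observations: m -> a + b * m.
mean_ok = math.isclose(statistics.mean(celsius),
                       a + b * statistics.mean(fahrenheit))
median_ok = math.isclose(statistics.median(celsius),
                         a + b * statistics.median(fahrenheit))

# Measures of spread are multiplied by b; adding a has no effect on them.
spread_ok = math.isclose(statistics.stdev(celsius),
                         b * statistics.stdev(fahrenheit))

print(mean_ok, median_ok, spread_ok)  # True True True
```

The same check works for the kilometers-to-miles transformation by setting a = 0 and b = 0.62.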