Download Chapter 2 - Dr. George Fahmy

Chapter 2 DESCRIPTIVE STATISTICS 2.1 Frequency Distributions It is often useful to organize or arrange a body of data into a frequency distribution. This breaks up the data into groups or classes and shows the number of observations in each class. The number of classes is usually between 5 and 15. A relative frequency distribution ins obtained by dividing the number of observations in each class by total number of observations in the data as a whole. The sum of the relative frequencies equals 1. A histogram is a bar graph of a frequency distribution, where classes are measured along the horizontal axis and frequencies along the vertical axis. A frequency polygon is a line graph of a frequency distribution resulting from joining the frequency of each class plotted at the class midpoint. A cumulative frequency distribution shows, for each class, the total number of observations in all classes up to and including that class. When plotted, this gives a distribution curve, or ogive. Example 1. A student received the following grades (measured from 0 to 10) on the 10 quizzes he took during a semester: 6, 7, 6, 8, 5, 7, 6, 9, 10, and 6. These grades can be arranged into frequency distributions as in table 2.1 and shown graphically as in Fig. 2-1. Table 2.1 Frequency Distributions of Grades Grades 5 6 7 8 9 10 Absolute Frequency 1 4 2 1 1 1 10 Relative Frequency 0.1 0.4 0.2 0.1 0.1 0.1 1.0 Example 2. The cans in a sample of 20 cans of fruit contain weights of fruit ranging from 19.3 to 20.9 oz, as given in Table 2.2. If we want to group these data into 6 classes, we get class intervals of 0.3 oz [(21.019.2)/6=0.3 oz]. The weights given in Table 2.2 can be arranged into the frequency distributions given in Table 2.3 and shown graphically in Fig. 2-2. Table 2.2 Net Weight in Ounces of Fruits 19.7 19.9 20.2 19.9 20.0 20.6 19.3 20.4 19.9 20.3 20.1 19.5 20.9 20.3 20.8 19.9 20.0 20.6 19.9 19.8 Weight, oz 19.2-19.4 19.5-19.7 19.8-20.0 20.1-20.3 20.4-20.6 20.7-20.9 Table 2.3 Frequency Distribution of Weights Class Absolute Relative Cumulative Midpoint Frequency Frequency Frequency 19.3 1 0.05 1 19.6 2 0.10 3 19.9 8 0.40 11 20.2 4 0.20 15 20.5 3 0.15 18 20.8 2 0.10 20 20 1.00 2.2 Measures of Central Tendency Central tendency refers to the location of a distribution. The most important measures of central tendency are (1) the mean, (2) the median, and (3) the mode. We will be measuring these for populations (i.e., the collection of all the elements we are describing) and for samples drawn from populations, as well as for grouped and ungrouped data. 1. The arithmetic mean, or average, of a population is represented by μ (the Greek letter mu); and for a sample, by X’ (read “X’ bar”). For ungrouped data, μ and X’ are calculated by the following formulas:   n and X'   n (2.1a, b) Where ΣΧ refers to the sum of all the observations, while N and n refer to the number of observations in the population and sample, respectively. For grouped data, μ and X’ are calculated by   n and X'   n (2.2a, b) Where Σ∫Χ refers to the sum of the frequency of each class f times the class midpoint X. 2. The medium for ungrouped data is the value of the middle item when all the items are arranged in either ascending or descending order in terms of values. That is,  N 1 Median  the  th item in the data array  2  (2.3) Where N refers to the number of items in the population (n for a sample). The median for grouped data is given by the formula Median  L  n/2 - F c fm (2.4) Where L = lower limit of the median class (i.e., the class that contains the middle item of the distribution) n = the number of observations in the data set F = sum of the frequencies up to but not including the median class fm = frequency of the median class c = width of the class interval 3. The mode is the value that occurs most frequently in the data set. For grouped data, Median  L  d1 c d1  d 2 (2.5) Where L = lower limit of the median class (i.e., the class with the greatest frequency) d1 = frequency of the model class minus the frequency of the previous class d2 = frequency of the model class minus the frequency of the following class c = width of the class interval The mean is the most used measure of central tendency. The mean, however, is affected by extreme values in the data set, while the median and the mode are not. Other measures of central tendency are the weighted mean, the geometric mean, and the harmonic mean (see probs. 2.7 to 2.9). Example 3. The mean grade for the population on the 10 quizzes given in Example 1, using the formula for ungrouped data, is   N  6  7  6  8  5  7  6  9  10  6 70   7 points 10 10 To find the median for the ungrouped data, we first arrange the 10 grades in ascending order: 5, 6, 6, 6, 6, 7, 7, 8, 9, 10. Then we find the grade of the (N+1)/2 or (10+1)/2=5.5th item. Thus the median is the average of the 5th and 6th item in the array, or (6+7)/2=6.5. The mode for the ungrouped data is 6 (the value that occurs most frequently in the data set). Example 4. We can estimate the mean for the grouped data given in Table 2.3 with the aid of Table 2.4: '   f  401.6  20.08 oz n 20 This calculation could be simplified by coding (see Prob. 2.6) Table 2.4 Calculation of the Sample Mean for the Data in Table 2.3 Class Frequency Weight, oz Midpoint X F FX 19.2-19.4 19.3 1 19.3 19.5-19.7 19.6 2 39.2 19.8-20.0 19.9 8 159.2 20.1-20.3 20.2 4 80.8 20.4-20.6 20.5 3 61.5 20.7-20.9 20.8 2 41.6 f  n  20  fX  401.6 We can estimate the median (med) for the same grouped data as follows: Med  L  n/2 F 20 / 2  3 7 c  19.8  0.3  19.8  0.3 fm 8 8  19.8  0.2625  20.06 oz Where L = 19.8= lower limit of the median class (i.e., the 19.8-20.0 class which contains the 10th and 11th observations) n = 20 = number of observations or items F = 3 = sum of frequencies up to but not including the median class fm = 8 = frequency of the median class c = 0.3 = width of class interval Similarly, Mode  L d1 6 1.8 c  19.8  0.3  19.8   19.8  0.18  19.98 oz d1  d 2 64 10 As noted in Prob. 2.4, the mean, median, and mode for grouped data are estimates used when only the grouped data are available or to reduce calculations with a large ungrouped data set. 2.3 Measures of Dispersion Dispersion refers to the variability or spread in he data. The most important measures of dispersion are (1) the average deviation, (2) the variance, and (3) the standard deviation. We will measure these for the populations and samples, as well as for grouped and ungrouped data. 1. The average deviation (AD) for ungrouped data is given by AD   X  N for population s (2.6a) for samples (2.6b) and AD   - n where the two vertical bars indicate the absolute value, or the values omitting the sign, with the other symbols having the same meaning as in sec.2.2. For grouped data, AD  f X  N for population s (2.7a) for samples (2.6b) and AD   f - n where f refers to the frequency of each class and X to the class midpoints. 2. Variance. The population variance 2 (the Greek letter sigma squared) and the sample variance s2 for ungrouped data are given by        and N s 2  f      s 2 3.  f     2 2  (2.8a, b) n 1 For grouped data, 2      2 2 2 and N n 1 (2.9a, b) Standard deviation. The population standard deviation σ and sample standard deviation s are the positive square roots of their respective variances. For ungrouped data,        s and N For grouped data,  f     (2.10a, b) n 1  f  -  2 2     -  2 2 s and N n 1 (2.11a, b) The most widely used measure of (absolute) dispersion in the standard deviation. Other measures (beside the variance and average deviation) are the range, the interquartile range, and the quartile deviation (see Probs. 2.11 and 2.12). 4. The coefficient of variation V measures relative dispersion: V    for population s (2.12a) s  for samples (2.12b) and v Example 5. The average deviation, variance, standard deviation, and coefficient of variation for the ungrouped data given in Example 1 can be found with the aid of Table 2.5 (μ = 7; see Example 3): AD    2  N 2              N 12  1.2 points 10  22  2.2 points squared 10 2  22  2.2  1.48 points 10 N  1.48 V   0.21, or 21%  7 Table 2.5 Calculations on the Data in Example 1 μ 7 7 7 7 7 7 7 7 7 7 Grade X 6 7 6 8 5 7 6 9 10 6 X- μ -1 0 -1 1 -2 0 -1 2 3 -1 |X- μ| 1 0 1 1 2 0 1 2 3 1 (X- μ)2 1 0 1 1 4 0 1 4 9 1      0      0      2  22 Example 6. The average deviation, variance, standard deviation, and coefficient of variation for the frequency distribution of weights (grouped data) given in table 2.3 can be found with the aid of Table 2.6 (X’=20.08 oz; see Example 4):  f    f      6.36  0.318 oz 20   f  -   2.9520  0.1554 oz squared 19  AD  n 2 s 2 n 1 2 s 2.9520  0.1544  0.3942 oz 19 n 1 s 0.3942oz V   0.0196, or 1.96% 20.08oz  Note that in the formula for s2 and s, n-1 rather than n is used in the denominator (see Prob. 2.16 for the reason). From the formulas for σ2, σ, s2, and s given in this section, other may be derived that will simplify the calculations for a large body of data (see Prob. 2.17 to 2.19 for their derivation and application). Table 2.6 Calculations on the data in Table 2.4 Weight, oz Class Midpoint X 19.20-19.40 19.50-19.70 19.80-20.00 20.10-20.30 20.40-20.60 20.70-20.90 19.30 19.60 19.90 20.20 20.50 20.80  Frequency f Mean X’  1 2 8 4 3 2 20.08 20.08 20.08 20.08 20.08 20.08 -0.78 -0.48 -0.18 0.12 0.42 0.72 0.78 0.48 0.18 0.12 0.42 0.72 f  n  20  f        2 0.78 0.96 1.44 0.48 1.26 1.44 0.6084 0.2304 0.0324 0.0144 0.1764 0.5184 =6.36 2.4 Shape of Frequency Distributions The shape of a distribution refers to (1) its symmetry or lack of it (skewness) and (2) its peakedness (kurtosis). 1. Skewness. A distribution has zero skewness if it is symmetrical about its mean. For a symmetrical (unimodal) distribution, the mean, median, and mode are equal. A distribution is positively skewed if the right tail is longer. Then, mean>median>mode. A distribution is negatively skewed if the left tail is longer. Then, mode>median>mean (see Fig. 2-3). Skewness can be measured by the Pearson’s coefficient of skewness: Sk  and Sk  3  med    3   med s  for population s (2.13a) for samples (2.13b)  f   2 0.6084 0.4608 0.2592 0.0576 0.5292 1.0368 =2.9520 Skewness also can be measured by the third moment [the numerator of Eq. (2.14a, b)] divided by the cube of the standard deviation:  f     3 Sk  3 for population s (2.14a) for samples (2.14b) and  f     Sk  3 s3 2. Kurtosis. A peaked curve is called leptokurtic, as opposed to a flat on (platykurtic), relative to one that is mesokurtic (see Fig. 2-4). Kurtosis can be measured by the fourth moment [the enumerator of Eq. (2.15a, b)] divided by the standard deviation raised to the fourth power. The kurtosis for a mesokurtic curve is 3.  f     Kurtosis  4 4 and  f  -    for population s (2.15a) 4 Kurtosis s4 for samples (2.15b) Example 7. We can find the Pearson’s coefficient of skewness for the grades given in Example 1 by using μ=7, med=6.5 (see Example 3), and σ=1.48 (see Example 5): Sk  3  med    37  6.5 30.5   1.01 1.48 1.48 (seeFig. 2 - 1) Similarly, by using X’=20.08 oz, med=20.06 oz (see Example 4), and s=0.39 oz (see Example 6), we can find the Pearson’s coefficient of skewness for the frequency distribution of weights in Table 2.3 as follows: Sk    3   med 320.08  20.06   0.15 (see Fig. 2 - 2c). s 0.39 For kurtosis, see Prob. 2.23. SOLVED PROBLEMS Frequency Distributions 2.1 Table 2.7 gives the grades on a quiz for a class of 40 students. (a) Arrange these grades (raw data set) into an array from the lowest to the highest grade. (b) Construct a table showing class intervals and class midpoints and the absolute, relative, and cumulative frequency for each grade. (c) Present the data in the form of a histogram, relative-frequency histogram, frequency polygon, and ogive. Table 2.7 Grades on a Quiz for a Class of 40 Students 7 5 6 2 8 7 6 7 3 9 10 4 5 5 4 6 7 4 8 2 3 5 6 7 9 8 2 4 7 9 4 6 7 8 3 6 7 9 10 5 (One) See Table 2.8. Table 2.8 Data Array of Grades 2 2 2 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 7 7 8 8 8 8 9 9 9 9 10 10 (Two) See Table 2.9. Note that since we are dealing here with discrete data (i.e., data expressed in whole numbers), we used the actual grades as the class midpoints. Grade 1.5-2.4 2.5-3.4 3.5-4.4 4.5-5.4 5.5-6.4 6.5-7.4 7.5-8.4 8.5-9.4 9.5-10.4 Table 2.9 Frequency Distribution of Grades Class Absolute Relative Cumulative Midpoint Frequency Frequency Frequency 2 3 0.075 3 3 3 0.075 6 4 5 0.125 11 5 5 0.125 46 6 6 0.150 22 7 8 0.200 30 8 4 0.100 34 9 4 0.100 38 10 2 0.050 40 40 1.000 (Three) See Fig. 2-5. 2.2 A sample of 25 workers in a plant receives the hourly wages given in Table 2.10. (a) Arrange these raw data into an array from the lowest to the highest wage. (b) Group the data into classes. (c) Present the data in the form of a histogram, relative-frequency histogram, frequency polygon, and ogive. 3.65 3.60 3.88 3.78 3.90 3.95 Table 2.10 Hourly Wages in Dollars 3.85 3.95 4.00 4.10 4.25 3.55 4.26 3.75 3.95 4.05 4.08 4.15 4.06 4.18 4.05 3.85 3.80 3.96 4.05 Table 2.11 Data Array of Wages in Dollars 3.60 3.65 3.75 3.78 3.80 3.85 3.85 3.88 3.95 3.95 3.96 4.00 4.05 4.05 4.05 4.06 4.15 4.18 4.25 4.26 3.90 4.08 (One) See Table 2.11 3.55 3.95 4.10 (Two) The hourly wages in Table 2.10 range from $3.55 to $4.26. This can be conveniently subdivided into 8 equal classes of $0.10 each. That is, ($4.30-$3.50)/8 = $0.80/8 = $0.10. Note that the range was extended from $3.50 to $4.30 so that the lowest wage, $3.55, falls within the lowest class and the largest wage, $4.26, falls within the largest class. It is also convenient (and needed for plotting the frequency polygon) to find the class mark or midpoint of each class. These are shown in Table 2.12. Table 2.11 Data Array of Wages in Dollars Hourly Class Absolute Relative Cumulative Wage Midpoint Frequency Frequency Frequency $3.50-3.59 $3.55 1 0.04 1 3.60-3.69 3.65 2 0.08 3 3.70-3.79 3.75 2 0.08 5 3.80-3.89 3.85 4 0.16 9 3.90-3.99 3.95 5 0.20 14 4.00-4.09 4.05 6 0.24 20 4.10-4.19 4.15 3 0.12 23 4.20-4.29 4.25 2 0.08 25 25 1.00 (Three) See Fig. 2-6. Another way of getting the ogive is to plot the cumulative frequencies up to $3.595, 3.695, 3.795, and so on (so as to include the upper limit of each class). The values $3.595, 3.695, 3.795, etc. are often referred to as the class boundaries or exact limits. Note that the class midpoints are obtained by adding together the lower and upper class boundaries and dividing by 2. Four example, the second class midpoint is given by (3.595+3.695)/2 = 7.290/2 = 3.65 (see Table 2.12). Measures of Central Tendency 2.3 Find the mean, median, and mode (a) for the grade on the quiz for the class of 40 students given in Table 2.7 (the ungrouped data) and (b) for the grouped data of these grades given in Table 2.9. (One) Since we are dealing with all the grades, we want the population mean:     7  5  6  ...  5  240  6 points N 40 40 That is, μ is obtained by adding together all the 40 grades given in Table 2.7 and dividing by 40 (the three centered dots were put in to avoid repeating the 40 values in Table 2.7). the median is given by the value of the [(N+1)/2]th item in the data array in Table 2.8. Therefore, the median is the value of the (40+1)/2 or 20.5 th item, or the average of the 20th and 21st item. Since they are both equal to 6, the median is 6. The mode is 7 (the value that occurs most frequently in the data set). (Two) We can find the population mean for the grouped data in Table 2.9 with the aid of Table 2.13:   f  240  6 N 40 This is the same mean we found for the ungrouped data. Note that the sum of the frequencies, Σf, equals the number of observations in the population, N, and ΣX = Σ fX. The median for the grouped data of Table 2.13 is given by Med  L  N /2 F 40 / 2  16 c  5.5  1  5.5  0.67  6.17 fm 6 where L = 5.5 =lower limit of the median class (i.e., the 5.5-6.4 class, which contains the 20th and 21st observation) N = 40 = number of observations F = 16 = sum of observations up to but not including the median class fm = 6 = frequency of the median class c = 1 = width of class interval The mode for the grouped data in Table 2.13 is given by Mode  L  d1 2 c  6.5  1  6.5  0.33  6.83 d1  d 2 24 where L = 6.5 =lower limit of the model class (i.e., the 6.5-7.4 class with the highest frequency of 8) d1 = 2 = frequency of the modal class, 8, minus the frequency of the previous class, 6 d2 = 4 = frequency of the modal class, 8, minus the frequency of the following class, 4 c = 1 = width of class interval Note that while the mean calculated from the grouped data is in this case identical to the mean calculated for the ungrouped data, the median and the mode are only (good) approximations. Table 2.13 Calculation of the Population Mean for the Grouped Data in Table 2.9 Class Grade Midpoint X Frequency f FX 1.5-2.4 2 3 6 2.5-3.4 3 3 9 3.5-4.4 4 5 20 4.5-5.4 5 5 25 5.5-6.4 6 6 36 6.5-7.4 7 8 56 7.5-8.4 8 4 32 8.5-9.4 9 4 36 9.5-10.4 10 2 20 f 2.4  N  40  fX  240 Find the mean, median, and mode (a) for the sample of hourly wages received by the 25 workers recorded in Table 2.10 (the ungrouped data) and (b) for the grouped data of these wages given in Table 2.12. (a) X  X n  $3.65  $3.78  $3.85  ...  $4.05 $98.65   $3.946 or $3.95 25 25 Median = $3.95 [the value of the (n+1)/2 = (25 + 1)/2 = 13th item in the data array in Table 2.11]. Mode = $3.95 and $4.05, since there are three of each of these wages. Thus the distribution is bimodal (i.e., it has two modes). (Three) We can find the sample mean for the grouped data in Table 2.12 with the aid of Table 2.14:  fX $98.75  $3.95 n 25 Note that in this case  fX  $98.75   X  $98.65 (found in part a) since X   the average of the observations in each class is not equal to the class midpoint for all classes [as in Prob. 2.3(b)]. Thus X’ calculated from the grouped data is only a very good approximation for the true value of X’ calculated for the ungrouped data. In the real world, we often have only the grouped data, or if we have a very large body of ungrouped data, it will save on calculations to estimate the mean by first grouping the data. Med  L  n/2  F 25 / 2  9 c  $3.90  (0.10)  $3.90  $0.07  $3.97 fm 5 as compared with the true median of $3.95 found from the ungrouped data (see part a) Mode  L  d1 1 c  $4.00  (0.10)  $4.00  $0.025  $4.025 or $4.03 d1  d 2 1 3 as compared with the true modes of $3.95 and $4.05 found from the ungrouped data (see part a). Sometimes the mode is simply given as the midpoint of the modal class. Table 2.14 Calculation of the Sample Mean for the Grouped Data in Table 2.12 Hourly Class Wage Midpoint X Frequency f FX $3.50-3.59 $3.55 1 $3.55 3.60-3.69 3.65 2 7.30 3.70-3.79 3.75 4 7.50 3.80-3.89 3.85 5 15.40 3.90-3.99 3.95 6 19.75 4.00-4.09 4.05 3 24.30 4.10-4.19 4.15 2 12.45 4.20-4.29 4.25  fX  $98.75  f  n  25 2.5 Compare the advantages and disadvantages of (a) the mean, (b) the median, and (c) the mode as measures of central tendency. (One) The advantages of the mean are (1) it is familiar and understood by virtually everyone, (2) all the observations in the data are taken into account, and (3) it is used in performing many other statistical procedures and tests. The disadvantages of the mean are (1) it is affected by extreme values, (2) it is time-consuming to compute for a large body of ungrouped data, and (3) it cannot be calculated when the last class of grouped data is open-ended (i.e., it includes the lower limit of the last class “and over”). (Two) The advantages of the median are (1) it is not affected by extreme values, (2) it is easily understood (i.e., half the data are smaller than the median and half are greater), and (3) it can be calculated even when the last class in open-ended and when the data are qualitative rather than quantitative. The disadvantages of the median are (1) it does not use much of the information available, and (2) it requires that observations be arranged into an array, which is time-consuming for a large body of ungrouped data. (Three) The advantages of the mode are the same as those for the median. The disadvantages of the mode are (1) as for median, the mode does not use much of the information available, and (2) sometimes no value if the data is repeated more than once, so that there is no mode, while at other times there may be many modes. In general, the mean is the most frequently used measure of central tendency and the mode is the least used. 2.6 Find the mean for the grouped data in Table 2.12 by coding (i.e., by assigning the value of μ=0 to the 4th or 5th classes and μ =-1, μ =-2, etc. to each lower class and μ=1, μ=2, etc. to each larger class and then using the formula X  X0   f c (2.16) n where X0 is the midpoint of the class assigned μ=0 and c is the width of the class intervals). See Table 2.15 Table 2.15 Calculation of the Sample Mean by Coding for the Grouped Data in Table 2.12 Hourly Class Wage Midpoint X Code μ Frequency f FX $3.50-3.59 $3.55 -3 1 $3.55 3.60-3.69 3.65 -2 2 7.30 3.70-3.79 3.75 -1 4 7.50 3.80-3.89 3.85 0 5 15.40 3.90-3.99 3.95 1 6 19.75 4.00-4.09 4.05 2 3 24.30 4.10-4.19 4.15 3 2 12.45 4.20-4.29 4.25 4  f  n  25  f  25 X  X0   f c  $3.85  25 ($0.10)  $3.85  $0.10  $3.95 n 25 X’ for the grouped data formed by coding is identical to that found in Prob. 2.4b without coding. Coding eliminates the problem of having to deal with possibly large and inconvenient class midpoints; thus it may simplify the calculations. 2.7 A firm pays a wage of $4 per hour to its 25 unskilled workers, $6 to its 15 semiskilled workers, and $8 to its 10 skilled workers. What is the weighted average, or weighted mean, wage paid by this firm? In finding the weighted mean, or weighted average, of a population, μw, or sample Xm’ , the weights, w, have the same function as the frequency in finding the mean for the grouped data. Thus, X w or  w   wX w (2.17) For this problem, the weights are the number of workers employed at each wage, and Σw equals the sum of all the workers: w  $425  $615  $810  $100  $90  $80  $270  $5.40 25  15  10 50 50 This weighted average compares with the simple average of $6 [($4+$6+$8)/3=$6] and is a better measure of the average wages. 2.8 A nation faces a rate of inflation of 2% in one year, 5% in the second year, and 12,5% in the third year. Find the geometric mean of the inflation rates. (the geometric mean, μG or XG’, of a set of n positive numbers in the nth root of their product and is used mainly to average rates of change and index numbers.) That is, G XG  3 X1  X 2 ... X n or (2.18) where X1, X2, … , Xn refer to the n (or N) observations. G  3 2512.5  3 125  5% this compares with μ=(2+5+12.5)/3=19.5/3=6.5%. When all the numbers are equal, μG equals μ; otherwise μG is smaller than μ. In practice, μG is calculated by logarithms: log  G   log  (2.19) N The geometric mean is used primarily in the mathematics of finance and financial management. 2.9 A commuter drives 10mi on the highway at 60mi/h and 10mi on local streets at 15mi/h. Find the harmonic mean. The harmonic mean μH is used primarily to average ratios: H  H  2 N 1 160 115  X  (2.20) 2 1  4 as compared with    X N  60 2 60 120 2   24 mi/h 5 5 5 60  60  15 2  75  37.5 mi/h . Note that if the 2 commuter had averaged 37.5 mi/h, it would have taken her (20 mi/37.5mi) 60 min = 32 min to drive the 20 mi. Instead she drives 6 min on the highway (10mi at 60mi/h) and 40min on local streets (10mi at 15mi/h) for a total of 50min, and this is the (correct) answer we get by using μH=24mi/h. That is, (20mi/24mi/h) X 60min = 50min. 2.10 (a) For the ungrouped data in Table 2.7, find the first, second, and third quartiles and the third deciles and sixtieth percentiles. (b) Do the same for the grouped data in Table 2.12. (Quartiles divide the data into 4 parts, deciles into 10 parts, and percentiles into 100 parts). (One) Q1 (first quartile) = 4 (the average of the 10 th and 11th values in Table 2.8) Q2 (second quartile) = 6 = the value of the 20.5th item = the median Q3 (third quartile) = 7.5 = the value of the 30.5th item D3 (third decile) = 5 = the value of the 12.5th item P60 (sixtieth percentile) = 7 = the value of the 24.5th item (b) n F Q1  L  4 c ft  $3.80  25 4  5 ($0.10)  $3.80  $0.03125  $3.83 4 (2.21) n F Q2  L  2 c f2  $3.90  Q3  L  25 2  9 ($0.10)  $3.90  $0.07  $3.97  median 5 (2.22) 3n  F 4 c f3 75 4  14 ($0.10)  $4.00  $0.0792  $4.08 6 3n  F D3  L  10 c f3  $4.00   $3.80  75 10  5 ($0.10)  $3.80  $0.0625  $3.86 4 (2.23) (2.24) Measures of Dispersion 2.11 (a) Find the range for the ungrouped data in Table 2.7. (b) Find the range for the ungrouped data in Table 2.10 and for the grouped data in Table 2.12. (c) What are the advantages and disadvantages of the range? (One) The range for ungrouped data is equal to the value of the largest observation minus the value of the smallest observation in the data set. The range for the ungrouped data in Table 2.7 is from 2 to 10, or 8 points. (Two) The range for the ungrouped data in Table 2.10 is from $3.55 to $4.26, or $0.71. For grouped data, the range extends from the lower limit of the smallest class to the upper limit of the largest class. For the grouped data in Table 2.12, the range extends from $3.50 to $4.29. (Three) The advantages of the range are that it is easy to find data and understand. Its disadvantages are that it considers only the lowest and highest values of a distribution, it is greatly influenced by extreme values, and it cannot be found for open-ended distributions. Because of these disadvantages, the range is of limited usefulness (except in quality control). 2.12 Find the interquartile range and the quartile deviation (a) for the ungrouped data in Table 2.7 and (b) for the grouped data in Table 2.12. (One) The interquartile range is equal to difference between the third and first quartiles. That is, IR = Q3 – Q1 (2.26) For the ungrouped data in Table 2.7, IR=7.5-4=3.5 points [utilizing the values of Q3 and Q4 found in Prob. 2.10(a)]. Note that the interquartile range is not affected by extreme values because it utilize only the middle half of the data. It is thus better than the range, but it is not as widely used as the other measures of dispersion. For the quartile deviation. QD  Q3  Q1 2 (2.27) Therefore, QD = (7.5 –4)/2 = 3.5/2 = 1.75 points. Quartile deviation measures the average range of one-fourth of the data. (Two) IR = Q3 – Q1 = $4.08 - $3.83 = $0.25 [utilizing the values of Q3 and Q1 found in Prob. 2.10(b)]: QD  Q3  Q1 $4.08  $3.83   $0.125 2 2 2.13 Find the average deviation for (a) the ungrouped data in Table 2.7 and (b) for the grouped data in Table 2.9 (One) Since μ=6 [see Prob.2.3(a)],    11 0  4  2 1 0 1 3  3  4  2 11 2  0 1 2  2  4  3  1  0  1  3  2  4  2  1  3  2  0  1  2  3  0  1  3  4  1  72      72  1.8 points AD  N 40 Note that the average deviation takes every observation into account. It measures the average of the absolute deviation of each observation from the mean. It takes the absolute value (indicated by the two vertical bars) because      0 (see Example 5). (Two) We can find the average deviation for the same grouped data with the aid of Table 2.16: AD  F    N  72  1.8 points 40 The same as we found for the ungrouped data. Table 2.16 Calculations for the Average Deviation for the Grouped Data in Table 2.9 Class Grade Midpoint X Frequency f Mean μ X- μ |X- μ | f|X- μ | 1.5-2.4 2 3 6 -4 4 12 2.5-3.4 3 3 6 -3 3 9 3.5-4.4 4 5 6 -2 2 10 4.5-5.4 5 5 6 -1 1 5 5.5-6.4 6 6 6 0 0 0 6.5-7.4 7 8 6 1 1 8 7.5-8.4 8 4 6 2 2 8 8.5-9.4 9 4 6 3 3 12 9.5-10.4 10 2 6 4 4 8 f  X   N  40 2.14 Find the average deviation for the grouped data in Table 2.12 We can find the average deviation for the grouped data of hourly wages in Table 2.12 with aid of Table 2.17 [X’=$3.95; see Prob. 2.4(b)]: AD  f X X n  $3.60  $0.144 25 Note that the average deviation found for the grouped data is an estimate of the “true” average deviation that could be found for the ungrouped data. It usually differs slightly from the true average deviation because we use the estimate of the mean for the grouped data in our calculations [compare the values of X’ found in Prob. 2.4(a) and (b)].  72 Table 2.17 Calculations for the Average Deviation for the Grouped Data in Table 2.12 Hourly Wage $3.50-3.59 3.60-3.69 3.70-3.79 3.80-3.89 3.90-3.99 4.00-4.09 4.10-4.19 4.20-4.29 Class Midpoint $3.55 3.65 3.75 3.85 3.95 4.05 4.15 4.25 Mean X’ $3.95 3.95 3.95 3.95 3.95 3.95 3.95 Frequency f 1 2 2 4 5 6 3 2 25 X-X’ -$0.40 -0.30 -0.20 -0.10 0.00 0.10 0.20 0.30 |X-X’| $0.40 0.30 0.20 0.10 0.00 0.10 0.20 0.30 f|X-X.| $0.40 0.60 0.40 0.40 0.00 0.60 0.60 0.60  f     $3.60 2.15 Find the variance and the standard deviation for (a) the ungrouped data in Table 2.7 and (b) the grouped data in Table 2.9. (c) What is the advantage of the standard deviation over the variance?       2  (One) 2      2 and   6 (see Prob. 2.3a) N  1  1  0  16  4  1  0  1  9  9  13  4  1  1  4  0  1  4  4  16  9  1  0  1  9  4  16  4  1  9  4  0  1  4  9  0  1  9  10  1  192         X -    2  2 N 192  4.8 points squared 40 2   (Two) N 192  4.8  2.19 points 40 We can find the variance and the standard deviation for the grouped data of grades with the aid of Table 2.18:  f      2  2 N  192  4.8 points squared 40 and    2  4.8  2.19 points The same as we found for the ungrouped data. Table 2.18 Calculations for the Variance and Standard Deviation for the Data in Table 2.9 Class Grade Midpoint X Frequency f Mean μ X- μ (X- μ)2 f(X- μ)2 1.5-2.4 2 3 6 -4 16 48 2.5-3.4 3 3 6 -3 9 27 3.5-4.4 4 5 6 -2 4 20 4.5-5.4 5 5 6 -1 1 5 5.5-6.4 6 6 6 0 0 0 6.5-7.4 7 8 6 1 1 8 7.5-8.4 8 4 6 2 4 16 8.5-9.4 9 4 6 3 9 36 9.5-10.4 10 2 6 4 16 32 f  f      N  40 2 (c) The advantage of the standard deviation over the variance is that standard deviation is expressed in the same units as the data rather than in “the width squared,” which is how the variance is expressed. The standard deviation is by far the most widely used measure of (absolute) dispersion. 2.16 Find the variance and the standard deviation for the grouped data in Table 2.10. We can find the variance and the standard deviation for the grouped data of hourly wages with the aid of Table 2.19 [X’=$3.95; see Prob. 2.4(b)]:  f X  X   2 s 2 n 1  0.82  0.0342 dollars squared 24 and  f X - X  2 s n 1  0.0342  $0.18  192 Hourly Wage $3.50-3.59 3.60-3.69 3.70-3.79 3.80-3.89 3.90-3.99 4.00-4.09 4.10-4.19 4.20-4.29 Table 2.19 Calculations for the Variance and Standard Deviation for the Data in Table 2.12 Class Midpoint Frequency f Mean X’ X-X’ (X-X’)2 F(X-X’)2 $3.55 1 $3.95 -$0.40 0.16 0.16 3.65 2 3.95 -0.30 0.09 0.18 3.75 2 3.95 -0.20 0.04 0.08 3.85 4 3.95 -0.10 0.01 0.04 3.95 5 3.95 0.00 0.00 0.00 4.05 6 3.95 0.10 0.01 0.06 4.15 3 3.95 0.20 0.04 0.12 4.25 2 0.30 0.09 0.18 2 25 f      0.82  Note that in the formula for s2 and s, n-1 rather than n is used in the denominator. The reason for this is that if we take many samples from a population, the average of the sample variances does not tend to equal population variance, 2, unless we use n-1 in the denominator of the formula for s2 (more will be said on this in Chap.5). Furthermore, s 2 and s for the grouped data are estimates for the true s2 and s that could be found for the ungrouped data because we used the estimate of X’ from the grouped data in our calculations. 2.17 Starting with the formula for 2 and s2 given in Sec. 2.3, prove that  (One) 2 2 (Two)   2  N 2  f   N 2 2        (a)   N N    2 2 S    n  f  S2  2    2    N    N 2 2 2 (2.28a, b) n 1 and N 2 2 and N 2 2  n 2 n 1 2 (2.29a, b)  2     N 2 N 2  2 2   2  N We can get s by simply replacing  with X’ and N with n in the numerator and N with n-1 in the denominator of the formula for 2. 2 (b)  f      2   2  f N N   f  2 2  2   2 N  f  N  2    f 2  2  f  N 2 N 2  2 2   2  N We can get s2 in the same way as we did in part a. The preceding formulas will simplify the calculations for 2 and s2 for a large body of data. Coding also helps (see Prob. 2.6).

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Chapter 2 - Dr. George Fahmy