BBA240 Statistics for Economics and Finance

SCHOOL OF BUSINESS, ECONOMICS AND MANAGEMENT
BBA240 STATISTICS / QUANTITATIVE METHODS FOR BUSINESS AND ECONOMICS
Unit Two
Moses Mwale
e-mail: [email protected]

Contents

UNIT 2: Numerical Descriptions of Data
2.1 Measures of Central Tendency
    2.1.1 Mean, Median, and Mode
        Mean
            Finding the Mean of Grouped Data
            Finding the Mean of a Frequency Distribution
        Median
            Finding the Median of a Frequency Distribution
        Mode
            Finding the Mode of a Frequency Distribution
2.2 Measures of Variation
    2.2.1 Deviation, Variance, and Standard Deviation
        Finding the Sample Variance and Standard Deviation
        Standard Deviation for Grouped Data
    2.2.2 Chebyshev's Theorem
2.3 Measures of Position
    2.3.1 Quartiles and Interquartile Range
        Calculating Interquartile Range
    2.3.2 Percentiles and Other Fractiles

UNIT 2: Numerical Descriptions of Data

2.1 Measures of Central Tendency

In Section 1.4, you learned about the graphical representations of quantitative data. In this section, you will learn how to supplement graphical representations with numerical statistics that describe the center and variability of a data set.

A measure of central tendency is a value that represents a typical, or central, entry of a data set. The three most commonly used measures of central tendency are the mean, the median, and the mode.

2.1.1 Mean, Median, and Mode

Mean

The mean of a data set is the sum of the data entries divided by the number of entries. To find the mean of a data set, use one of the following formulas.

• Population Mean: $\mu = \frac{\Sigma x}{N}$
• Sample Mean: $\bar{x} = \frac{\Sigma x}{n}$

The lowercase Greek letter $\mu$ (pronounced mu) represents the population mean and $\bar{x}$ (read as "x bar") represents the sample mean. Note that N represents the number of entries in a population and n represents the number of entries in a sample. Recall that the uppercase Greek letter sigma $\Sigma$ indicates a summation of values.

Finding the Mean of Grouped Data

If data are presented in a frequency distribution, you can approximate the mean as follows.

Definition
The mean of a frequency distribution for a sample is approximated by

$\bar{x} = \frac{\Sigma (x \cdot f)}{n}$, where $n = \Sigma f$,

and x and f are the midpoint and frequency of each class, respectively.

Finding the Mean of a Frequency Distribution
1. Find the midpoint of each class: $x = \frac{(\text{Lower limit}) + (\text{Upper limit})}{2}$
2. Find the sum of the products of the midpoints and the frequencies: $\Sigma (x \cdot f)$
3. Find the sum of the frequencies: $n = \Sigma f$
4. Find the mean of the frequency distribution: $\bar{x} = \frac{\Sigma (x \cdot f)}{n}$

Example: Finding the Mean of a Frequency Distribution
Use the frequency distribution below to approximate the mean number of minutes that a sample of Internet subscribers spent online during their most recent session.

[Frequency distribution table not reproduced.]

Solution
$\bar{x} = \frac{\Sigma (x \cdot f)}{n} = \frac{2089.0}{50} \approx 41.8$

So, the mean time spent online was approximately 41.8 minutes.

Median

Another important measure of central tendency is the median. It is defined as follows.

Definition: Median
The median is the value of the middle term in a data set that has been ranked in increasing order.

As the definition shows, the median divides a ranked data set into two equal parts. The calculation of the median consists of the following two steps:
1. Rank the data set in increasing order.
2. Find the middle term. The value of this term is the median.

Note that if the number of observations in a data set is odd, then the median is given by the value of the middle term in the ranked data. However, if the number of observations is even, then the median is given by the average of the values of the two middle terms.

Example
The following data give the prices (in thousands of dollars) of seven houses selected from all houses sold last month in a city.

312 257 421 289 526 374 497

Find the median.

Solution
First, we rank the given data in increasing order as follows:

257 289 312 374 421 497 526

Since there are seven homes in this data set, the middle term is the fourth term, and the median is given by the value of the fourth term in the ranked data. Thus, the median price of a house is 374, or $374,000.

Finding the Median of a Frequency Distribution

To estimate the median, let's look at an example.
Example: Alex did a survey of how many games each of 20 friends owned, and got this:

9, 15, 11, 12, 3, 5, 10, 20, 14, 6, 8, 8, 12, 12, 18, 15, 6, 9, 18, 11

The frequency distribution for the data is as follows:

Number of games    Frequency
1 - 5              2
6 - 10             7
11 - 15            8
16 - 20            3

• The groups (1 - 5, 6 - 10, etc.), also called class intervals, are of width 5.
• The numbers 1, 6, 11 and 16 are the lower class limits.
• The numbers 5, 10, 15 and 20 are the upper class limits.
• The midpoints are halfway between the lower and upper class limits, so the midpoints are 3, 8, 13 and 18.

The median is in the class where the cumulative frequency reaches half the sum of the frequencies. With 20 values, the median is the mean of the middle two numbers (the 10th and 11th values). The cumulative frequencies are 2, 9, 17 and 20, so both values are in the 11 - 15 group.

We can say "the median group is 11 - 15". But if we need to estimate a single median value, we can use this formula:

$\text{Estimated Median} = L + \frac{(n/2) - cf_b}{f_m} \times w$

where:
• $L$ is the lower class limit of the group containing the median
• $n$ is the total number of data values
• $cf_b$ is the cumulative frequency of the groups before the median group
• $f_m$ is the frequency of the median group
• $w$ is the group width

For our example:
• $L = 11$
• $n = 20$
• $cf_b = 2 + 7 = 9$
• $f_m = 8$
• $w = 5$

$\text{Estimated Median} = 11 + \frac{(20/2) - 9}{8} \times 5 = 11 + \frac{1}{8} \times 5 = 11.625$

Mode

Mode is a French word that means fashion, that is, an item that is most popular or common. In statistics, the mode represents the most common value in a data set.

Definition: Mode
The mode is the value that occurs with the highest frequency in a data set.

Example
The following data give the speeds (in miles per hour) of eight cars that were stopped for speeding violations.

77 82 74 81 79 84 74 78

Find the mode.

Solution
In this data set, 74 occurs twice, and each of the remaining values occurs only once.
Because 74 occurs with the highest frequency, it is the mode. Therefore,

Mode = 74 miles per hour

A major shortcoming of the mode is that a data set may have none or may have more than one mode, whereas it will have only one mean and only one median. For instance, a data set with each value occurring only once has no mode. A data set with only one value occurring with the highest frequency has only one mode. The data set in this case is called unimodal. A data set with two values that occur with the same (highest) frequency has two modes. The distribution, in this case, is said to be bimodal. If more than two values in a data set occur with the same (highest) frequency, then the data set contains more than two modes and it is said to be multimodal.

Finding the Mode of a Frequency Distribution

Again, looking at our data:

Number of games    Frequency
1 - 5              2
6 - 10             7
11 - 15            8
16 - 20            3

We can easily identify the modal group (the group with the highest frequency), which is 11 - 15. We can say "the modal group is 11 - 15".

But the actual mode may not even be in that group! Or there may be more than one mode. Without the raw data we don't really know. But we can estimate the mode using the following formula:

$\text{Estimated Mode} = L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w$

where:
• $L$ is the lower class limit of the modal group
• $f_{m-1}$ is the frequency of the group before the modal group
• $f_m$ is the frequency of the modal group
• $f_{m+1}$ is the frequency of the group after the modal group
• $w$ is the group width

In this example:
• $L = 11$
• $f_{m-1} = 7$
• $f_m = 8$
• $f_{m+1} = 3$
• $w = 5$

$\text{Estimated Mode} = 11 + \frac{8 - 7}{(8 - 7) + (8 - 3)} \times 5 = 11 + \frac{1}{6} \times 5 = 11.833\ldots$

Exercises

1. Explain how the value of the median is determined for a data set that contains an odd number of observations and for a data set that contains an even number of observations.
2. Briefly explain the meaning of an outlier. Is the mean or the median a better measure of central tendency for a data set that contains outliers? Illustrate with the help of an example.
3. Using an example, show how outliers can affect the value of the mean.
4. Which of the three measures of central tendency (the mean, the median, and the mode) can be calculated for quantitative data only, and which can be calculated for both quantitative and qualitative data? Illustrate with examples.
5. Which of the three measures of central tendency (the mean, the median, and the mode) can assume more than one value for a data set? Give an example of a data set for which this summary measure assumes more than one value.
6. Is it possible for a (quantitative) data set to have no mean, no median, or no mode? Give an example of a data set for which this summary measure does not exist.
7. Explain the relationships among the mean, median, and mode for symmetric and skewed histograms. Illustrate these relationships with graphs.
8. Prices of cars have a distribution that is skewed to the right with outliers in the right tail. Which of the measures of central tendency is the best to summarize this data set? Explain.
9. The following data set belongs to a population:
   5 -7 2 0 -9 16 10 7
   Calculate the mean, median, and mode.
10. The following data set belongs to a sample:
   14 18 -10 8 8 -16
   Calculate the mean, median, and mode.
11. The following data give the 2007 gross domestic product (GDP) in billions of dollars for all 50 states. The data are entered in alphabetic order by state (Bureau of Economic Analysis, June 2005).
   166 45 247 95 62 51 610 246 352 382
   255 89 76 1103 399 28 34 244 1142 106
   1813 236 129 117 229 34 466 139 25 383
   216 60 735 397 154 216 48 269 80 127
   57 465 158 531 47 153 311 58 232 32
   a. Calculate the mean and median for these data. Are these values of the mean and the median sample statistics or population parameters? Explain.
   b. Do these data have a mode? Explain.
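The raw-data and grouped-data calculations from Section 2.1 can be checked with a short script. The following is a minimal Python sketch, not part of the course material: the function names (`grouped_mean`, `grouped_median`, `grouped_mode`) and the table layout are illustrative assumptions, and the grouped formulas follow this unit's convention of using the lower class limit for L.

```python
from statistics import median, multimode

# Raw-data examples from this section.
prices = [312, 257, 421, 289, 526, 374, 497]   # house prices, thousands of dollars
speeds = [77, 82, 74, 81, 79, 84, 74, 78]      # speeds in miles per hour

# Games table from this section: (lower limit, upper limit, frequency).
classes = [(1, 5, 2), (6, 10, 7), (11, 15, 8), (16, 20, 3)]
WIDTH = 5  # group width w


def grouped_mean(classes):
    """Approximate mean: sum of (midpoint * frequency) divided by n."""
    n = sum(f for _, _, f in classes)
    return sum((lo + hi) / 2 * f for lo, hi, f in classes) / n


def grouped_median(classes, width):
    """Estimated median: L + ((n/2 - cfb) / fm) * w."""
    n = sum(f for _, _, f in classes)
    cfb = 0  # cumulative frequency of the groups before the current one
    for lo, _, f in classes:
        if cfb + f >= n / 2:  # first group whose cumulative frequency reaches n/2
            return lo + (n / 2 - cfb) / f * width
        cfb += f
    raise ValueError("empty frequency table")


def grouped_mode(classes, width):
    """Estimated mode: L + (fm - f_prev) / ((fm - f_prev) + (fm - f_next)) * w."""
    freqs = [f for _, _, f in classes]
    m = freqs.index(max(freqs))  # first group with the highest frequency
    fm = freqs[m]
    f_prev = freqs[m - 1] if m > 0 else 0              # assumed 0 if no group before
    f_next = freqs[m + 1] if m + 1 < len(freqs) else 0  # assumed 0 if no group after
    return classes[m][0] + (fm - f_prev) / ((fm - f_prev) + (fm - f_next)) * width


print(median(prices))                            # 374
print(multimode(speeds))                         # [74]
print(grouped_mean(classes))                     # 11.0
print(grouped_median(classes, WIDTH))            # 11.625
print(round(grouped_mode(classes, WIDTH), 3))    # 11.833
```

`multimode` returns a list, which matches the point made above that a data set may have more than one mode. Treating a missing neighbouring group as frequency 0 in `grouped_mode` is a choice made for this sketch; the text does not cover that case.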
2.2 Measures of Variation

In this section, you will learn different ways to measure the variation of a data set. The simplest measure is the range of the set.

Definition: Range
The range of a data set is the difference between the maximum and minimum data entries in the set. To find the range, the data must be quantitative.

Range = (Maximum data entry) - (Minimum data entry)

Example
[Example not reproduced.]

2.2.1 Deviation, Variance, and Standard Deviation

As a measure of variation, the range has the advantage of being easy to compute. Its disadvantage, however, is that it uses only two entries from the data set. Two measures of variation that use all the entries in a data set are the variance and the standard deviation. However, before you learn about these measures of variation, you need to know what is meant by the deviation of an entry in a data set.

Definition: Deviation
The deviation of an entry x in a population data set is the difference between the entry and the mean $\mu$ of the data set.

Deviation of $x = x - \mu$

Example
[Example not reproduced.]

In the previous example, notice that the sum of the deviations is zero. Because this is true for any data set, it doesn't make sense to find the average of the deviations. To overcome this problem, you can square each deviation. When you add the squares of the deviations, you compute a quantity called the sum of squares, denoted $SS_x$. In a population data set, the mean of the squares of the deviations is called the population variance.

The population variance of a population data set of N entries is

Population variance $= \sigma^2 = \frac{\Sigma (x - \mu)^2}{N}$

The symbol $\sigma$ is the lowercase Greek letter sigma.

The population standard deviation of a population data set of N entries is the square root of the population variance.

Population standard deviation $= \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\Sigma (x - \mu)^2}{N}}$

How to Find the Population Variance and Standard Deviation
1. Find the mean of the population data set.
2. Find the deviation of each entry.
3. Square each deviation.
4. Add to get the sum of squares.
5. Divide by N to get the population variance.
6. Find the square root of the variance to get the population standard deviation.

Example
[Example not reproduced.]

Definition
The sample variance and sample standard deviation of a sample data set of n entries are listed below.

Sample variance $= s^2 = \frac{\Sigma (x - \bar{x})^2}{n - 1}$

Sample standard deviation $= s = \sqrt{s^2} = \sqrt{\frac{\Sigma (x - \bar{x})^2}{n - 1}}$

Finding the Sample Variance and Standard Deviation
1. Find the mean of the sample data set.
2. Find the deviation of each entry.
3. Square each deviation.
4. Add to get the sum of squares.
5. Divide by n - 1 to get the sample variance.
6. Find the square root of the variance to get the sample standard deviation.

Example
[Example not reproduced.]

When interpreting the standard deviation, remember that it is a measure of the typical amount an entry deviates from the mean. The more the entries are spread out, the greater the standard deviation.

Standard Deviation for Grouped Data

We have learned that large data sets are usually best represented by frequency distributions. The formula for the sample standard deviation for a frequency distribution is

Sample standard deviation $= s = \sqrt{\frac{\Sigma (x - \bar{x})^2 f}{n - 1}}$

where $n = \Sigma f$ is the number of entries in the data set, and x and f are the midpoint and frequency of each class.

Example
[Example not reproduced.]

EXERCISES: CONCEPTS AND PROCEDURES

1. The range, as a measure of spread, has the disadvantage of being influenced by outliers. Illustrate this with an example.
2. Can the standard deviation have a negative value? Explain.
3. When is the value of the standard deviation for a data set zero? Give one example. Calculate the standard deviation for the example and show that its value is zero.
4. Briefly explain the difference between a population parameter and a sample statistic. Give one example of each.
5. The following data set belongs to a population:
   5 -7 2 0 -9 16 10 7
   Calculate the range, variance, and standard deviation.
6. The following data set belongs to a sample:
   14 18 -10 8 8 -16
   Calculate the range, variance, and standard deviation.
7. The following data give the number of shoplifters apprehended during each of the past 8 weeks at a large department store.
   7 10 8 3 15 12 6 11
   a. Find the mean for these data. Calculate the deviations of the data values from the mean. Is the sum of these deviations zero?
   b. Calculate the range, variance, and standard deviation.
8. The following data give the prices of seven textbooks randomly selected from a university bookstore.
   $89 $170 $104 $113 $56 $161 $147
   a. Find the mean for these data. Calculate the deviations of the data values from the mean. Is the sum of these deviations zero?
   b. Calculate the range, variance, and standard deviation.
9. The following data give the numbers of car thefts that occurred in a city in the past 12 days.
   6 3 7 11 4 3 8 7 2 6 9 15
   Calculate the range, variance, and standard deviation.
10. Refer to the data in Exercise 3.23, which contained the numbers of tornadoes that touched down in 12 states that had the most tornadoes during the period 1950 to 1994. The data are reproduced here.
   1113 2009 1374 1137 2110 1086 1166 1039 1673 2300 1139 5490
   Find the variance, standard deviation, and range for these data.
11. The following data give the numbers of pieces of junk mail received by 10 families during the past month.
   41 33 28 21 29 19 14 31 39 36
   Find the range, variance, and standard deviation.
12. The following data give the number of highway collisions with large wild animals, such as deer or moose, in one of the northeastern states during each week of a 9-week period.
   7 10 3 8 2 5 7 4 9
   Find the range, variance, and standard deviation.
13. Attacks by stinging insects, such as bees or wasps, may become medical emergencies if either the victim is allergic to venom or multiple stings are involved. The following data give the number of patients treated each week for such stings in a large regional hospital during 13 weeks last summer.
   1 5 2 3 0 4 1 7 0 1 2 0 1
   Compute the range, variance, and standard deviation for these data.
14. The following data give the number of hot dogs consumed by 10 participants in a hot-dog-eating contest.
   21 17 32 8 20 15 17 23 9 18
   Calculate the range, variance, and standard deviation for these data.

2.2.2 Chebyshev's Theorem

Chebyshev's theorem gives a lower bound for the area under a curve between two points that are on opposite sides of the mean and at the same distance from the mean.

The portion of any data set lying within k standard deviations (k > 1) of the mean is at least

$1 - \frac{1}{k^2}$

• k = 2: In any data set, at least $1 - \frac{1}{2^2} = \frac{3}{4}$, or 75%, of the data lie within 2 standard deviations of the mean.
• k = 3: In any data set, at least $1 - \frac{1}{3^2} = \frac{8}{9}$, or 88.9%, of the data lie within 3 standard deviations of the mean.

Example
The average systolic blood pressure for 4000 women who were screened for high blood pressure was found to be 187 mm Hg with a standard deviation of 22. Using Chebyshev's theorem, find at least what percentage of women in this group have a systolic blood pressure between 143 and 231 mm Hg.

Solution
Let $\mu$ and $\sigma$ be the mean and the standard deviation, respectively, of the systolic blood pressures of these women. Then, from the given information,

$\mu = 187$ and $\sigma = 22$

To find the percentage of women whose systolic blood pressures are between 143 and 231 mm Hg, the first step is to determine k. Each of the two points, 143 and 231, is 44 units away from the mean: 187 - 143 = 44 and 231 - 187 = 44.
The value of k is obtained by dividing the distance between the mean and each point by the standard deviation. Thus,

$k = \frac{44}{22} = 2$ and $1 - \frac{1}{k^2} = 1 - \frac{1}{2^2} = 0.75$

Hence, according to Chebyshev's theorem, at least 75% of the women have systolic blood pressure between 143 and 231 mm Hg.

2.3 Measures of Position

A measure of position determines the position of a single value in relation to other values in a sample or a population data set. There are many measures of position; however, only quartiles, percentiles, and percentile rank are discussed in this section.

2.3.1 Quartiles and Interquartile Range

Quartiles are the summary measures that divide a ranked data set into four equal parts. Three measures will divide any data set into four equal parts. These three measures are the first quartile (denoted by Q1), the second quartile (denoted by Q2), and the third quartile (denoted by Q3). The data should be ranked in increasing order before the quartiles are determined. The quartiles are defined as follows.

Definition: Quartiles
Quartiles are three summary measures that divide a ranked data set into four equal parts. The second quartile is the same as the median of a data set. The first quartile is the value of the middle term among the observations that are less than the median, and the third quartile is the value of the middle term among the observations that are greater than the median.

Approximately 25% of the values in a ranked data set are less than Q1 and about 75% are greater than Q1. The second quartile, Q2, divides a ranked data set into two equal parts; hence, the second quartile and the median are the same. Approximately 75% of the data values are less than Q3 and about 25% are greater than Q3.

The difference between the third quartile and the first quartile for a data set is called the interquartile range (IQR).
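The quartile definition above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `quartiles` is hypothetical, and it interprets "observations less than (greater than) the median" by position in the ranked list, leaving out the middle term when the number of observations is odd. Other quartile conventions exist and can give slightly different values.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves definition above.

    Assumes at least two observations.
    """
    ranked = sorted(data)               # rank the data in increasing order
    n = len(ranked)
    half = n // 2
    lower = ranked[:half]               # observations below the median position
    # For odd n, skip the middle term itself; for even n, take the upper half.
    upper = ranked[half + 1:] if n % 2 else ranked[half:]
    return median(lower), median(ranked), median(upper)


# House prices (thousands of dollars) from the median example in Section 2.1.1.
prices = [312, 257, 421, 289, 526, 374, 497]
q1, q2, q3 = quartiles(prices)
iqr = q3 - q1                           # interquartile range = Q3 - Q1
print(q1, q2, q3, iqr)                  # 289 374 497 208
```

For the seven ranked prices 257 289 312 374 421 497 526, the median is the fourth term (374), Q1 is the middle of the three terms below it (289), and Q3 is the middle of the three terms above it (497).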
Calculating Interquartile Range

The difference between the third and the first quartiles gives the interquartile range; that is,

IQR = Interquartile range = Q3 - Q1

Example
[Example not reproduced.]

2.3.2 Percentiles and Other Fractiles

In addition to using quartiles to specify a measure of position, you can also use percentiles and deciles. These common fractiles are summarized as follows: quartiles divide a ranked data set into 4 equal parts (Q1, Q2, Q3), deciles divide it into 10 equal parts (D1, ..., D9), and percentiles divide it into 100 equal parts (P1, ..., P99).

Percentiles are often used in education and health-related fields to indicate how one individual compares with others in a group. They can also be used to identify unusually high or unusually low values. For instance, test scores and children's growth measurements are often expressed in percentiles. Scores or measurements in the 95th percentile and above are unusually high, while those in the 5th percentile and below are unusually low.

Exercises

1. The following data give the weights (in pounds) lost by 15 members of a health club at the end of 2 months after joining the club.
   5 10 8 7 25 12 5 14 11 10 21 9 8 11 18
   a. Compute the values of the three quartiles and the interquartile range.
   b. Calculate the (approximate) value of the 82nd percentile.
   c. Find the percentile rank of 10.
2. The following data give the speeds of 13 cars (in mph) measured by radar, traveling on I-84.
   73 75 69 68 78 69 74 76 72 79 68 77 71
   a. Find the values of the three quartiles and the interquartile range.
   b. Calculate the (approximate) value of the 35th percentile.
   c. Compute the percentile rank of 71.
3. The following data give the numbers of computer keyboards assembled at the Twentieth Century Electronics Company for a sample of 25 days.
   45 52 48 41 56 46 44 42 48 53 51 53 51
   48 46 43 52 50 54 47 44 47 50 49 52
   a. Calculate the values of the three quartiles and the interquartile range.
   b. Determine the (approximate) value of the 53rd percentile.
   c. Find the percentile rank of 50.
4. The following data give the numbers of minor penalties accrued by each of the 30 National Hockey League franchises during the 2007-08 regular season.
   318 336 337 339 362 363 366 369 372 375
   378 381 384 385 386 387 390 393 395 403
   405 409 417 431 433 434 438 444 461 480
   a. Calculate the values of the three quartiles and the interquartile range.
   b. Find the approximate value of the 57th percentile.
   c. Calculate the percentile rank of 417.
5. According to Fair Isaac, "The Median FICO (Credit) Score in the U.S. is 723" (The Credit Scoring Site, 2009). Suppose the following data represent the credit scores of 22 randomly selected loan applicants.
   494 728 468 533 747 639 430 690 604 422 356
   805 749 600 797 702 628 625 617 647 772 572
   a. Calculate the values of the three quartiles and the interquartile range. Where does the value 617 fall in relation to these quartiles?
   b. Find the approximate value of the 30th percentile. Give a brief interpretation of this percentile.
   c. Calculate the percentile rank of 533. Give a brief interpretation of this percentile rank.
6. The fatality rate on the nation's highways in 2007 was the lowest since 1994, but these numbers are still mind-boggling. The number of persons killed in motor vehicle traffic crashes, by state, in 2007 is listed here.
   1110 84 1066 650 3974 554 277 117 44 3214 1641
   138 252 1249 898 445 416 864 985 183 614
   417 1088 504 884 992 277 256 373 129 724
   413 1333 1675 111 1257 754 455 1491 69 1066
   146 1210 3363 299 66 1027 568 431 756 150
   a. Draw a dotplot of the fatality data.
   b. Draw a stem-and-leaf display of these data. Describe how the three large-valued data entries are handled.
   c. Find the 5-number summary and draw a box-and-whiskers display.
   d. Find P10 and P90.
   e. Describe the distribution of the number of fatalities per state, being sure to include information learned in parts a through d.
   f. Why might it be unfair to draw conclusions about the relative safety level of highways in the 51 states based on these data?
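For checking hand calculations from Section 2.2, the six-step variance procedures and the Chebyshev bound can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the house prices from Section 2.1.1 are reused purely as illustrative input.

```python
from math import sqrt


def sum_of_squares(data, center):
    """Steps 2-4: deviation of each entry, squared, then added."""
    return sum((x - center) ** 2 for x in data)


def population_variance(data):
    """Divide the sum of squares by N (population formula)."""
    mu = sum(data) / len(data)
    return sum_of_squares(data, mu) / len(data)


def sample_variance(data):
    """Divide the sum of squares by n - 1 (sample formula)."""
    x_bar = sum(data) / len(data)
    return sum_of_squares(data, x_bar) / (len(data) - 1)


def chebyshev_bound(k):
    """Minimum portion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem is stated here for k > 1")
    return 1 - 1 / k ** 2


# Illustrative input: house prices (thousands of dollars) from Section 2.1.1.
prices = [312, 257, 421, 289, 526, 374, 497]
sigma = sqrt(population_variance(prices))   # step 6, population version
s = sqrt(sample_variance(prices))           # step 6, sample version
print(sigma < s)                            # True: dividing by n - 1 gives the larger value

# Blood-pressure example from Section 2.2.2: mean 187, standard deviation 22.
k = (231 - 187) / 22
print(k, chebyshev_bound(k))                # 2.0 0.75
```

The only difference between the two variance functions is the divisor, which mirrors step 5 of the two procedures in Section 2.2.1.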