Chapter 3
Describing Distributions with Numbers

Overview
Measures of Center
Measures of Variation
Measures of Relative Standing
Exploratory Data Analysis (EDA)

Thinking Challenge
$400,000  $70,000  $50,000  $30,000  $20,000
... employees cite low pay -- most workers earn only $20,000.
... President claims average pay is $70,000!

Numerical Data Properties
Central Tendency (Center)
Variation (Spread)
Shape

Numerical Data Properties and Measures
Measures of Center: Mean, Median, Mode
Measures of Variation: Range, Interquartile Range, Variance, Standard Deviation
Shape: Symmetric, Skew

Mean
Measure of the center or central tendency
Most common measure
Affected by extreme values (‘outliers’)

Example
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7 (these are \(x_1, x_2, \ldots, x_6\))
\[ \bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{10.3 + 4.9 + 8.9 + 11.7 + 6.3 + 7.7}{6} = 8.30 \]

Median
Measure of the center or central tendency
Middle value in an ordered sequence
If n is odd, the middle value of the sequence
If n is even, the average of the 2 middle values
Position of the median in the sequence:
\[ \text{Positioning point} = \frac{n + 1}{2} \]
Not affected by extreme values

Median of a Data Set

Median: Odd-Sized Sample
Raw Data: 24.1 22.6 21.5 23.7 22.6
Ordered: 21.5 22.6 22.6 23.7 24.1
Position: 1 2 3 4 5
\[ \text{Position} = \frac{n + 1}{2} = \frac{5 + 1}{2} = 3 \]
Median = 22.6

Median: Even-Sized Sample
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
Position: 1 2 3 4 5 6
\[ \text{Position} = \frac{n + 1}{2} = \frac{6 + 1}{2} = 3.5 \]
\[ \text{Median} = \frac{7.7 + 8.9}{2} = 8.3 \]

Mode
Measure of the center or central tendency
Value that occurs most often
Not affected by extreme values
May be no mode or several modes
May be used for numerical and categorical data

Mode Example
No Mode. Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
One Mode. Raw Data: 6.3 4.9 8.9 6.3 4.9 4.9
More Than 1 Mode. Raw Data: 21 28 28 41 43 43

Mean versus Median

Selecting an Appropriate Measure of Center
a) A student takes four exams in a biology class. His grades are 88, 40, 95, and 100. If asked for his grade in the class, which measure of center is the student likely to report?
b) The National Association of REALTORS publishes data on resale prices of U.S. homes. Which measure of center is most appropriate for such resale prices?
c) In the 2003 Boston Marathon, there were two categories of official finishers: male and female, of which there were 10,737 and 6,309, respectively. Which measure of center should be used here?

Population Mean - Sample Mean

Possible interpretations for the mean of a data set

Notation for Sample Mean
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_{n-1} + x_n}{n} \]

Notation
\(\Sigma\) denotes the sum of a set of values.
x is the variable usually used to represent the individual data values.
n represents the number of values in a sample.
N represents the number of values in a population.

Notation for Population Mean
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]

Notation used for a sample and for the population

Best Measure of Center

Measuring Spread or Variation

Range
Measure of spread, variation or dispersion
Difference between largest and smallest observations:
\[ \text{Range} = \text{Largest}(X_i) - \text{Smallest}(X_i) \]
Ignores how data are distributed
[Figure: two data sets shown on a scale from 7 to 10]

Quartiles and Boxplots

Quartiles
Measure of spread, variation or dispersion
Split the ordered data set into 4 quarters:
Min | 25% | Q1 | 25% | Q2 | 25% | Q3 | 25% | Max

How To Calculate the Quartiles

Quartile (Q2) Example
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
Position: 1 2 3 4 5 6
Q2 = 8.3

Quartile (Q1) Example
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
Position: 1 2 3 4 5 6
Q1 = 6.3

Quartile (Q3) Example
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
Position: 1 2 3 4 5 6
Q3 = 10.3

Notice that:
Q1 (First Quartile) separates the bottom 25% of sorted values from the top 75%.
Q2 (Second Quartile) is the same as the median; it separates the bottom 50% of sorted values from the top 50%.
Q3 (Third Quartile) separates the bottom 75% of sorted values from the top 25%.

Percentiles
Just as there are three quartiles separating data into four parts, there are 99 percentiles, denoted P1, P2, . . .
P99, which partition the data into 100 groups. The kth percentile, Pk, is the value for which k% of all observations are below that value. For instance, Q1 = P25, Q2 = P50, and Q3 = P75.

Finding the Percentile of a Given Score
The following formula gives the percentile that a given score represents. Notice that the data set must be ordered. Round the result to the nearest integer.
\[ \text{Percentile of value } x = \frac{\text{number of values less than } x}{\text{total number of values}} \cdot 100 \]

Example: Ages of Best Actresses
[Original Data and Sorted Data tables not reproduced.]
\[ \text{Percentile of value } 30 = \frac{\text{number of values less than } 30}{\text{total number of values}} \cdot 100 = \frac{26}{76} \cdot 100 \approx 34\% \]
Interpretation: The age of 30 years is the 34th percentile, that is, P34 = 30.

Converting from the kth Percentile to the Corresponding Data Value
[Figure not reproduced.]

Example: Ages of Best Actresses
Refer to the sorted ages of Best Actresses to find the value of the 20th percentile, P20.
[Original Data and Sorted Data tables not reproduced.]
P20 is the value for which 20% of all observations are below that value.
\[ L = \frac{k}{100} \cdot n = \frac{20}{100} \cdot 76 = 15.2 \]
L is not a whole number, so it is rounded up. Therefore, L = 16, and the 16th value in the sorted list is P20.

Example: Ages of Best Actresses
Refer to the sorted ages of Best Actresses to find the value of the 75th percentile, P75.
[Original Data and Sorted Data tables not reproduced.]
\[ L = \frac{k}{100} \cdot n = \frac{75}{100} \cdot 76 = 57 \]
Therefore, L = 57 is a whole number, and the average of the 57th and 58th values in the sorted list is P75 = 39.5.

The Interquartile Range (IQR)
Measure of spread, variation or dispersion
Also called the midspread
Difference between the third and first quartiles:
\[ \text{IQR} = Q_3 - Q_1 \]
Spread in the middle 50%
Not affected by extreme values

The Interquartile Range (IQR)
Preferred measure of variation when the median is used as the measure of center.
Like the median, the interquartile range is a resistant measure.
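The percentile rules above can be sketched in Python. This is a minimal illustration under stated assumptions, not code from the slides: the function names `percentile_of_value` and `value_at_percentile` are hypothetical, k is assumed to be a whole number from 1 to 99, and because the Best Actresses ages are not reproduced here, the example reuses the six-value data set from the quartile slides.

```python
def percentile_of_value(values, x):
    """Percentile of x: 100 * (number of values less than x) / n, rounded."""
    n = len(values)
    below = sum(1 for v in values if v < x)
    return round(100 * below / n)


def value_at_percentile(values, k):
    """Value of the kth percentile, using the locator L = (k / 100) * n."""
    ordered = sorted(values)
    n = len(ordered)
    # Integer arithmetic keeps the "is L a whole number?" test exact.
    whole, remainder = divmod(k * n, 100)
    if remainder == 0:
        # L is whole: average the Lth and (L + 1)th sorted values.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not whole: round it up and take that sorted value
    # (position whole + 1 is index whole in a 0-based list).
    return ordered[whole]


data = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]
q1 = value_at_percentile(data, 25)  # P25
q2 = value_at_percentile(data, 50)  # P50, the median
q3 = value_at_percentile(data, 75)  # P75
iqr = q3 - q1                       # interquartile range

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", round(iqr, 2))
```

On this data set the locator rule gives the same quartiles as the quartile examples (6.3, 8.3 and 10.3, up to floating-point rounding). Other textbooks and software packages use slightly different percentile conventions, so results on other data may differ from library functions.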
Outliers

The Five-number Summary

Example: Supermarket Spending
M = Q2 = 28
Q1 = 19
Q3 = 45
The Five-number Summary (Min, Q1, M, Q3, Max) is: $3 $19 $28 $45 $93

Boxplot
[Figure: boxplot with Min, Q1, Median, Q3 and Max marked on a scale from 5 to 10]

Example
20 customer satisfaction ratings: 1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10
M = (8+8)/2 = 8
Q1 = (7+8)/2 = 7.5
Q3 = (9+9)/2 = 9
IQR = Q3 - Q1 = 9 - 7.5 = 1.5

Boxplot

Distribution shapes and boxplots

Modified Boxplots
Some statistical packages provide modified boxplots, which represent outliers as special points. A modified boxplot is constructed with these specifications:
A special symbol (such as an asterisk) is used to identify outliers.
The solid horizontal line extends only as far as the minimum data value that is not an outlier and the maximum data value that is not an outlier.

Example

Variance and Standard Deviation
Measures of spread, variation or dispersion
Most common measures
Consider how data are distributed
Show variation about the mean

Sample Variance and Sample Standard Deviation
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad s = \sqrt{s^2} \]

Properties of the Standard Deviation
The idea behind the variance and the standard deviation as measures of spread is as follows: The deviations \(x_i - \bar{x}\) display the spread of the values \(x_i\) about their mean \(\bar{x}\). Some of these deviations will be positive and some negative because some of the observations fall on each side of the mean. In fact, the sum of the deviations of the observations from their mean will always be zero.

Properties of the Standard Deviation
Squaring the deviations makes them all positive, so that observations far from the mean in either direction have large positive squared deviations. The variance is the average squared deviation. Therefore both \(s^2\) and s will be large if the observations are widely spread about their mean, and small if the observations are all close to the mean.

Properties of the Standard Deviation
s measures spread about the mean and should be used only when the mean is chosen as the measure of center.
s = 0 only when there is no spread.
This happens only when all observations have the same value. Otherwise, s > 0.
As the observations become more spread out about their mean, s gets larger.
s is not resistant. A few outliers can make s very large.

Example - Metabolic Rate
A person’s metabolic rate is the rate at which the body consumes energy. This rate is important in studies of weight gain, dieting, and exercise. Here are the metabolic rates of 7 men who took part in a study of dieting. The units are calories per 24 hours. These are the same calories used to describe the energy content of foods.
1792 1666 1362 1614 1460 1867 1439
Notice that \(\bar{x} = 1600\).

Example - Metabolic Rate
The table shows the observations \(x_i\), their deviations from the mean, and the squares of these deviations.

\(x_i\)     \(x_i - \bar{x}\)     \((x_i - \bar{x})^2\)
1792      192       36864
1666       66        4356
1362     -238       56644
1614       14         196
1460     -140       19600
1867      267       71289
1439     -161       25921

\[ \bar{x} = 1600, \qquad \sum_{i=1}^{7}(x_i - \bar{x})^2 = 214870 \]
\[ s^2 = \frac{1}{6}\sum_{i=1}^{7}(x_i - \bar{x})^2 = \frac{214870}{6} \approx 35811.67 \]
\[ s = \sqrt{\frac{1}{6}\sum_{i=1}^{7}(x_i - \bar{x})^2} \approx 189.24 \]

Example - Metabolic Rate
The figure plots these data as dots on the calorie scale, with their mean marked by an asterisk (∗). The arrows mark two of the deviations from the mean.
[Figure: Metabolic rates for seven men, with the mean (∗) and the deviations of two observations from the mean]

Choosing measures of center and spread
How do we choose between the five-number summary and \(\bar{x}\) and s to describe the center and spread of a distribution? Because the two sides of a strongly skewed distribution have different spreads, no single number such as s describes the spread well. The five-number summary, with its two quartiles and two extremes, does a better job.

Choosing a summary
The five-number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with strong outliers. Use \(\bar{x}\) and s only for reasonably symmetric distributions that are free of outliers.
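The metabolic-rate calculation can be checked with a few lines of Python. This is a minimal sketch that follows the steps in the example (mean, deviations, squared deviations, division by n − 1); the variable names are illustrative and do not come from the slides.

```python
import math

# Metabolic rates of the 7 men, in calories per 24 hours.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(rates)
mean = sum(rates) / n                       # sample mean
deviations = [x - mean for x in rates]      # x_i minus the mean
sum_sq = sum(d ** 2 for d in deviations)    # sum of squared deviations
variance = sum_sq / (n - 1)                 # sample variance: divide by n - 1
std_dev = math.sqrt(variance)               # sample standard deviation

print(f"mean = {mean:.0f}")                           # 1600
print(f"sum of deviations = {sum(deviations):.0f}")   # 0
print(f"sum of squared deviations = {sum_sq:.0f}")    # 214870
print(f"s^2 = {variance:.2f}")                        # 35811.67
print(f"s = {std_dev:.2f}")                           # 189.24
```

The sum of the deviations comes out as zero, which is the property behind dividing by n − 1: once n − 1 deviations are known, the last one is determined.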
Remarks
The idea of the variance is straightforward: it is the average of the squares of the deviations of the observations from their mean. The details we have just presented, however, raise some questions.
Why do we square the deviations?
Why do we emphasize the standard deviation rather than the variance?
Why do we average by dividing by n − 1 rather than n in calculating the variance?

Remarks
Why do we square the deviations? Why not just average the distances of the observations from their mean?
There are two reasons, neither of them obvious. First, the sum of the squared deviations of any set of observations from their mean is the smallest that the sum of squared deviations from any number can possibly be. This is not true of the unsquared distances. So squared deviations point to the mean as center in a way that distances do not.

Remarks
Second, the standard deviation turns out to be the natural measure of spread for a particularly important class of symmetric unimodal distributions, the normal distributions. We will meet the normal distributions in a later section. We commented earlier that the usefulness of many statistical procedures is tied to distributions of particular shapes. This is distinctly true of the standard deviation.

Remarks
Why do we emphasize the standard deviation rather than the variance?
One reason is that s, not \(s^2\), is the natural measure of spread for normal distributions. There is also a more general reason to prefer s to \(s^2\). Because the variance involves squaring the deviations, it does not have the same unit of measurement as the original observations. The variance of the metabolic rates, for example, is measured in squared calories. Taking the square root remedies this. The standard deviation s measures spread about the mean in the original scale.

Remarks
Why do we average by dividing by n − 1 rather than n in calculating the variance?
Because the sum of the deviations is always zero, the last deviation can be found once we know the other n − 1. So we are not averaging n unrelated numbers. Only n − 1 of the squared deviations can vary freely, and we average by dividing the total by n − 1. The number n − 1 is called the degrees of freedom of the variance or standard deviation. Many calculators offer a choice between dividing by n and dividing by n − 1, so be sure to use n − 1.

Population Standard Deviation
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}} \]

Standardized Variables
We can associate with any variable x a new variable z, called the standardized version of x or the standardized variable, defined as follows.
\[ z = \frac{x - \mu}{\sigma} \]

Example
Consider a simple variable x, namely, one with possible observations shown in the first row of the following table.
[Table not reproduced.]

Example - Continued
a. Determine the standardized version of x.
b. Find the observed value of z corresponding to an observed value of x of 5.
c. Obtain all possible observations of z.
d. Find the mean and standard deviation of z.
e. Obtain dotplots of the distributions of both x and z. Interpret the results.

Example - Continued
a. Determine the standardized version of x.
Using the definitions of µ and σ, we find that the mean and standard deviation of the variable x are µ = 3 and σ = 2. Therefore, the standardized version of x is
\[ z = \frac{x - 3}{2} \]

Example - Continued
b. Find the observed value of z corresponding to an observed value of x of 5.
The observed value of z corresponding to an observed value of x of 5 is
\[ z = \frac{5 - 3}{2} = 1 \]

Example - Continued
c. Obtain all possible observations of z.
Applying the formula z = (x − 3)/2 to each observation of the variable x shown in the first row of the table, we obtain the observations of the standardized variable z shown in the second row.

Example - Continued
d. Find the mean and standard deviation of z.
From the second row of the table we get a mean of 0 and a standard deviation of 1 for z, as is the case for any standardized variable.

Example - Continued
e. Obtain dotplots of the distributions of both x and z. Interpret the results.
The dotplots of the distributions of x and z are shown in a figure that is not reproduced here.

Standard Scores or z-Scores
An important concept associated with standardized variables is that of the z-score, or standard score, which we now define.

Standard Scores or z-Scores
The standard score, or z-score, represents the number of standard deviations that a data value, x, falls from the mean, µ.
\[ z = \frac{x - \mu}{\sigma} \]
That is, \(x = \mu + z\sigma\).

Empirical Rule (68-95-99.7%)
For data with a (symmetric) bell-shaped distribution, the standard deviation has the following characteristics.
1) About 68% of the data lie within one standard deviation of the mean.
2) About 95% of the data lie within two standard deviations of the mean.
3) About 99.7% of the data lie within three standard deviations of the mean.

Empirical Rule (68-95-99.7%)
[Figure: bell curve on a scale from –4 to 4 standard deviations. 68% within 1 standard deviation (34% on each side of the mean), 95% within 2 standard deviations (a further 13.5% on each side), 99.7% within 3 standard deviations (a further 2.35% on each side).]

Empirical Rule (68-95-99.7%)

Interpreting z-Scores
Ordinary values: z-score between -2 and 2
Unusual values: z-score < -2 or z-score > 2

Using the Empirical Rule
The mean value of homes on a street is $125 thousand with a standard deviation of $5 thousand. The data set has a bell-shaped distribution. Estimate the percent of homes between $120 and $130 thousand.
[Figure: bell curve on a scale from 105 to 145, with µ – σ = 120, µ = 125 and µ + σ = 130 marked and the central 68% shaded]
68% of the houses have a value between $120 and $130 thousand.

Standard Scores – Example 1
The weight data for the 2003 U.S. Women’s World Cup soccer team is given in the fourth column of the following table.
[Table not reproduced.]

Standard Scores – Example 1
So, in this case, with µ = 62.55 kg and σ = 4.9 kg, the standardized variable is
\[ z = \frac{x - 62.55}{4.9} \]
a. Find and interpret the z-score of Tiffany Roberts’s weight of 51 kg.
b. Find and interpret the z-score of Cindy Parlow’s weight of 70 kg.
c. Construct a graph showing the results obtained in parts (a) and (b).

Standard Scores – Example 1
a. The z-score for Tiffany’s weight of 51 kg is
\[ z = \frac{51 - 62.55}{4.9} \approx -2.36 \]
which means that Tiffany’s weight is 2.36 standard deviations below the mean.
b.
The z-score for Cindy’s weight of 70 kg is
\[ z = \frac{70 - 62.55}{4.9} \approx 1.52 \]
which means that Cindy’s weight is 1.52 standard deviations above the mean.

Standard Scores – Example 1
c. In the figure, we marked Tiffany’s weight of 51 kg with a color dot and Cindy’s weight of 70 kg with a black dot. Additionally, we located the mean, µ = 62.55 kg, and measured intervals equal in length to the standard deviation, σ = 4.9 kg.
[Figure: Dotplot for the weight data for the Women’s World Cup soccer team]
Weights (kg): 68 59 61 68 51 58 67 61 61 61 59 70 57 61 61 61 66 64 64 73
Sorted: 51 57 58 59 59 61 61 61 61 61 61 61 64 64 66 67 68 68 70 73

Standard Scores – Example 2
John received a 75 on a test whose class mean was 73.2 with a standard deviation of 4.5. Samantha received a 68.6 on a test whose class mean was 65 with a standard deviation of 3.9. Which student had the better test score?
John’s z-score:
\[ z = \frac{x - \mu}{\sigma} = \frac{75 - 73.2}{4.5} = 0.4 \]
Samantha’s z-score:
\[ z = \frac{x - \mu}{\sigma} = \frac{68.6 - 65}{3.9} \approx 0.92 \]
John’s score was 0.4 standard deviations higher than the mean, while Samantha’s score was 0.92 standard deviations higher than the mean. Samantha’s test score was better than John’s.

Shape

Skewness
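Returning to the standard-score examples (Standard Scores – Example 1 and Example 2), the z-score arithmetic can be sketched in Python. This is an illustrative sketch only: the helper names `z_score` and `is_unusual` are hypothetical, and the cutoff of 2 is the one given under Interpreting z-Scores.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def is_unusual(z):
    """Unusual values have a z-score below -2 or above 2."""
    return abs(z) > 2


# Example 1: soccer team weights, with mu = 62.55 kg and sigma = 4.9 kg.
tiffany = z_score(51, 62.55, 4.9)
cindy = z_score(70, 62.55, 4.9)

# Example 2: two students on different tests.
john = z_score(75, 73.2, 4.5)
samantha = z_score(68.6, 65, 3.9)

print(f"Tiffany:  z = {tiffany:.2f}, unusual: {is_unusual(tiffany)}")
print(f"Cindy:    z = {cindy:.2f}, unusual: {is_unusual(cindy)}")
print(f"John:     z = {john:.2f}")
print(f"Samantha: z = {samantha:.2f}")
```

Because z-scores are unit-free, the two test scores in Example 2 can be compared directly even though the tests have different means and standard deviations.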