Chapter 3 NUMERICAL METHODS FOR DESCRIBING DATA DISTRIBUTIONS
Created by Kathy Fritz

Suppose that you have just received your score on an exam in one of your classes. What would you want to know about the distribution of scores for this exam?
Measures of center
Measures of spread

The stress of the final years of medical training can contribute to depression and burnout. The authors of the paper “Rates of Medication Errors Among Depressed and Burnt Out Residents” (British Medical Journal [2008]: 488) studied 24 residents in pediatrics. Medical records of patients treated by these residents during a fixed time period were examined for errors in ordering or administering medications. The accompanying dotplot displays the total number of medication errors for each of the 24 residents.

Choosing Appropriate Measures for Describing Center and Spread
If the shape of the data distribution is … Describe Center and Spread Using …

Describing Center and Spread For Data Distributions That Are Approximately Symmetric
Mean
Standard Deviation

Mean
Definition:
In mathematics, the capital Greek letter Σ is short for “add them all up.” Therefore, the formula for the mean can be written in more compact notation:
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$

Measuring Center
Use the data below to calculate the mean of the commuting times (in minutes) of 20 randomly selected New York workers.
10 30 5 25 40 20 10 15 30 20
15 20 85 15 65 15 60 60 40 45

Measuring Variability
Consider the three sets of six exam scores displayed below: Each data set has a mean exam score of 75. Does that completely describe these data sets?

Range
Deviations
The most widely used measures of variability
50 60 70 80 90 100

Variance and Standard Deviation
Suppose that we are interested in finding the “typical” or average deviation from the mean. The deviations from the mean always sum to zero, so averaging them directly tells us nothing. So, to calculate the “typical” or average deviation from the mean, we must first square each deviation. Then all the squared deviations are positive. The deviations from the mean were -25, -15, -5, 5, 15, and 25.
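The squaring-and-averaging steps for the scores 50, 60, 70, 80, 90, 100 can be sketched in Python as follows. This is only an illustration: the slide text does not say which divisor is used for the average, so the sketch assumes the usual sample variance with divisor n - 1. The variable names are chosen here and are not from the slides.

```python
import statistics

scores = [50, 60, 70, 80, 90, 100]

mean = statistics.mean(scores)               # 75, as stated in the text
deviations = [x - mean for x in scores]      # distance of each score from the mean
squared = [d ** 2 for d in deviations]       # squaring makes every term non-negative

# Assumption: sample variance, i.e. divide by n - 1 rather than n.
variance = sum(squared) / (len(scores) - 1)
std_dev = variance ** 0.5                    # back to the original units

print("deviations:", deviations)
print("sum of deviations:", sum(deviations))  # always 0, which is why we square
print("squared deviations:", squared)
print("variance:", variance)
print("standard deviation:", round(std_dev, 2))
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor, so they can serve as a check on the hand calculation.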
The squares of these deviations from the mean are 625, 225, 25, 25, 225, and 625. Now we can average these.
50 60 70 80 90 100

Variance and Standard Deviation
Consider the following data on the number of pets owned by a group of 9 children.

Measuring Spread: The Standard Deviation
xi: 1 3 4 4 4 5 7 8 9
Columns to complete: (xi - mean), (xi - mean)^2

Notation to remember

Putting it Together

Describing Center and Spread For Data Distributions That Are Skewed or Have Outliers
Median
Interquartile Range

Median
The median M
The sample median is obtained by first ordering the n observations from smallest to largest (with any repeated values included, so that every sample observation appears in the ordered list). Then . . .

Forty students were enrolled in a statistical reasoning course at a California college. The instructor made course materials, grades, and lecture notes available to students on a class web site. Course management software kept track of how often each student accessed any of these web pages. The data set below (in order from smallest to largest) is the number of times each of the 40 students had accessed the class web page during the first month.
0 0 0 0 0 0 3 4 4 4 5 5 7 7 8 8 8 12 12 13
13 13 14 14 16 18 19 19 20 20 21 22 23 26 36 36 37 42 84 331

Comparing the Mean and the Median
The mean and median measure center in different ways, and both are useful. Don’t confuse the “average” value of a variable (the mean) with its “typical” value, which we might describe by the median.

Measuring Spread - Interquartile Range
The interquartile range (iqr) is based on quantities called quartiles, which divide the data set into four equal parts (quarters).
Lower quartile (Q1) = median of the lower half of the data
Upper quartile (Q3) = median of the upper half of the data
If n is odd, the median of the entire data set is excluded from both halves when computing quartiles.

Measuring Spread: The Interquartile Range
A measure of center alone can be misleading.
A useful numerical description of a distribution requires both a measure of center and a measure of spread.

How to Calculate the Quartiles and the Interquartile Range
To calculate the quartiles:
Recall the website data set:
0 0 0 0 0 0 3 4 4 4 5 5 7 7 8 8 8 12 12 13
13 13 14 14 16 18 19 19 20 20 21 22 23 26 36 36 37 42 84 331
The lower quartile (Q1) is the median of the lower 20 data values. The upper quartile (Q3) is the median of the upper 20 data values. The interquartile range (iqr) is the difference between the upper and lower quartiles.

Putting it Together
The Chronicle of Higher Education (Almanac Issue, 2009-2010) published the accompanying data on the percentage of the population with a bachelor’s degree or graduate degree in 2007 for each of the 50 U.S. states and the District of Columbia. The data distribution is shown in the histogram below.
Step 1: Select
Step 2: Calculations
Step 3: Interpret

Find and Interpret the IQR
Travel times to work for 20 randomly selected New Yorkers
10 30 5 25 40 20 10 15 30 20
15 20 85 15 65 15 60 60 40 45

Boxplots
General Boxplots
Modified Boxplots

Five-Number Summary
The five-number summary consists of the following: the smallest observation, the lower quartile, the median, the upper quartile, and the largest observation.

Boxplots
When to Use: Univariate numerical data
How to construct
What to look for: center, spread, and shape of the data distribution and if there are any unusual features

Boxplot Example

Comparative Boxplots
A comparative boxplot is …
Recall the video game study. There were two groups: 1) told to improve total score or 2) told to improve a different aspect, such as speed.
[Figure: comparative boxplot; group labels 1st and 2nd]

Identifying Outliers
In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers.
Definition: The 1.5 x IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 x IQR above the upper quartile or more than 1.5 x IQR below the lower quartile.
In the New York travel time data, we found Q1 = 15 minutes, Q3 = 42.5 minutes, and IQR = 27.5 minutes.

Modified boxplots
How to construct
1. Compute the values in the five-number summary.
2. Draw a horizontal line and add an appropriate scale.
3. Draw a box above the line that extends from the lower quartile (Q1) to the upper quartile (Q3).
4. Draw a line segment inside the box at the location of the median.

Construct a Boxplot
Consider our NY travel times data. Construct a boxplot.
10 30 5 25 40 20 10 15 30 20
15 20 85 15 65 15 60 60 40 45

Big Mac prices in U.S. dollars for 44 different countries were given in the article “Big Mac Index 2010”. The following 44 Big Mac prices are arranged in order from the lowest price (Ukraine) to the highest price (Norway).
1.84 1.86 1.90 1.95 2.17 2.19 2.19 2.28 2.33 2.34 2.45 2.46 2.50 2.51 2.60
2.62 2.67 2.71 2.80 2.82 2.99 3.08 3.33 3.34 3.43 3.48 3.54 3.56 3.59 3.67
3.73 3.74 3.83 3.84 3.84 3.86 3.89 4.00 4.33 4.39 4.90 4.91 6.19 6.56 7.20

Big Mac Prices Continued . . .
Smallest observation =
Lower quartile =
Median =
Upper quartile =
Largest observation =
[Figure: boxplot of Big Mac prices; axis scale from 1 to 8]

The 2009-2010 salaries of NBA players published on the web site hoopshype.com were used to construct the comparative boxplot of salary data for five teams.

Measures of Relative Standing
z-scores
Percentiles

Percentiles
For a number r between 0 and 100, the rth percentile is a value such that r percent of the observations fall AT or BELOW that value. This diagram illustrates the 90th percentile.

Measuring Position: Percentiles
One way to describe the location of a value in a distribution is to tell what percent of observations are less than it.
Definition:
Jenny earned a score of 86 on her test. How did she perform relative to the rest of the class?
6 | 7
7 | 2334
7 | 5777899
8 | 00123334
8 | 569
9 | 03

In addition to weight and length, head circumference is another measure of health in newborn babies. The National Center for Health Statistics reports the following summary values for head circumference (in cm) at birth for boys.
Percentile:               5     10    25    50    75    90    95
Head circumference (cm):  32.2  33.2  34.5  35.8  37.0  38.2  38.6
What value of head circumference is at the 75th percentile? What is the median value of head circumference?

z-scores
Definition:
The z-score tells you how many standard deviations a value lies from the mean, and in which direction.

Measuring Position: z-Scores
Jenny earned a score of 86 on her test. The class mean is 80 and the standard deviation is 6.07. What is her standardized score?

Using z-scores for Comparison
We can use z-scores to compare the position of individuals in different distributions.
Jenny earned a score of 86 on her statistics test. The class mean was 80 and the standard deviation was 6.07. She earned a score of 82 on her chemistry test. The chemistry scores had a fairly symmetric distribution with a mean of 76 and a standard deviation of 4. On which test did Jenny perform better relative to the rest of her class?

What do these z-scores mean?
-2.3
1.8

Suppose that two graduating seniors, one a marketing major and one an accounting major, are comparing job offers. The accounting major has an offer for $45,000 per year, and the marketing major has an offer for $43,000 per year.
Accounting: mean = 46,000, standard deviation = 1500
Marketing: mean = 42,500, standard deviation = 1000

Density Curve
Definition: A density curve is a curve that is always on or above the horizontal axis and has an area of exactly 1 underneath it.
A density curve describes the overall pattern of a distribution. The area under the curve and above any interval of values on the horizontal axis is the proportion of all observations that fall in that interval.
The overall pattern of this histogram of the scores of all 947 seventh-grade students in Gary, Indiana, on the vocabulary part of the Iowa Test of Basic Skills (ITBS) can be described by a smooth curve drawn through the tops of the bars.

Normal Distributions
One particularly important class of density curves is the Normal curves, which describe Normal distributions.
All Normal curves are symmetric, single-peaked, and bell-shaped.
A specific Normal curve is described by giving its mean µ and standard deviation σ.
[Figure: two Normal curves, showing the mean µ and standard deviation σ.]
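Returning to the z-score comparisons above (Jenny's two tests and the two job offers), the arithmetic can be sketched as follows. The sketch assumes the standard definition z = (value - mean) / standard deviation; the function name `z_score` and the variable names are illustrative and do not come from the slides.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

# Jenny's two tests (numbers taken from the text)
stats_z = z_score(86, 80, 6.07)
chem_z = z_score(82, 76, 4)

# The two job offers (numbers taken from the text)
accounting_z = z_score(45_000, 46_000, 1500)
marketing_z = z_score(43_000, 42_500, 1000)

print(f"statistics test: z = {stats_z:.2f}")
print(f"chemistry test:  z = {chem_z:.2f}")
print(f"accounting offer: z = {accounting_z:.2f}")
print(f"marketing offer:  z = {marketing_z:.2f}")
```

A larger z-score means a better position relative to the comparison group. Under these numbers the chemistry score sits further above its class mean than the statistics score does, and the marketing offer is above its field's mean while the accounting offer is below, even though the accounting salary is higher in dollars.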
Normal Distributions
Definition: A Normal distribution is described by a Normal density curve. Any particular Normal distribution is completely specified by two numbers: its mean µ and standard deviation σ.
Normal distributions are good descriptions for some distributions of real data.
Normal distributions are good approximations of the results of many kinds of chance outcomes.
Many statistical inference procedures are based on Normal distributions.

Empirical Rule
If the data distribution is mound shaped and approximately symmetric, then . . .
Approximately 68% of the observations are within 1 standard deviation of the mean.
Approximately 95% of the observations are within 2 standard deviations of the mean.
Approximately 99.7% of the observations are within 3 standard deviations of the mean.

Empirical Rule
This illustrates the percentages given by the Empirical Rule.

The distribution of Iowa Test of Basic Skills (ITBS) vocabulary scores for 7th grade students in Gary, Indiana, is close to Normal. Suppose the distribution is N(6.84, 1.55).
a) Sketch the Normal density curve for this distribution.
b) What percent of ITBS vocabulary scores are less than 3.74?
c) What percent of the scores are between 5.29 and 9.94?

Common Mistakes

Avoid these Common Mistakes
1. Watch out for categorical data that look numerical! Often, categorical data are coded numerically. For example, gender might be coded as 0 = female and 1 = male, but this does not make gender a numerical variable. Categorical data CANNOT be summarized using the mean and standard deviation or the median and interquartile range.

2. Measures of center don’t tell all. Although measures of center, such as the mean and the median, do give you a sense of what might be a typical value for a variable, this is only one characteristic of a data set. Without additional information about variability and distribution shape, you don’t really know much about the behavior of the variable.

3. Data distributions with different shapes can have the same mean and standard deviation.
For example, consider the following two histograms: Both histograms have the same mean of 10 and standard deviation of 2, but VERY different shapes.

4. Both the mean and the standard deviation are sensitive to extreme values in a data set, especially if the sample size is small. If the data distribution is markedly skewed or if the data set has outliers, the median and interquartile range are a better choice for describing center and spread.

5. Measures of center and measures of variability describe values of a variable, not frequencies in a frequency distribution or heights of bars in a histogram. For example, consider the following two frequency distributions and histograms:

6. Be careful with boxplots based on small sample sizes. Boxplots convey information about center, variability, and shape, but interpreting shape information is problematic when the sample size is small.

7. Not all distributions are mound shaped. Using the Empirical Rule in situations where you are not convinced that the data distribution is mound shaped and approximately symmetric can lead to incorrect statements.

8. Watch for outliers! Unusual observations in a data set often provide important information about the variable under study, so it is important to consider outliers in addition to describing what is typical. Outliers can also be problematic because the values of some summaries are influenced by outliers and because some methods for drawing conclusions from data are not appropriate if the data set has outliers.
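Mistakes 4 and 8 can be checked numerically with the two data sets used earlier in the chapter. The sketch below is an illustration rather than the slides' own procedure: it assumes quartiles are the medians of the lower and upper halves (dropping the overall median when n is odd) and applies the 1.5 x IQR rule. The function name `iqr_summary` is hypothetical.

```python
import statistics

def iqr_summary(data):
    """Return (Q1, Q3, IQR, outliers) using medians of halves and the 1.5 x IQR rule."""
    ordered = sorted(data)
    half = len(ordered) // 2           # when n is odd, the middle value is left out
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return q1, q3, iqr, outliers

# New York travel times (minutes); the text reports Q1 = 15, Q3 = 42.5, IQR = 27.5
travel = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
          15, 20, 85, 15, 65, 15, 60, 60, 40, 45]
q1, q3, iqr, travel_outliers = iqr_summary(travel)
print(q1, q3, iqr, travel_outliers)

# Class web site visits for the 40 students
visits = [0, 0, 0, 0, 0, 0, 3, 4, 4, 4, 5, 5, 7, 7, 8, 8, 8, 12, 12, 13,
          13, 13, 14, 14, 16, 18, 19, 19, 20, 20, 21, 22, 23, 26, 36, 36,
          37, 42, 84, 331]
_, _, _, visit_outliers = iqr_summary(visits)

# Compare center with and without the flagged outliers
kept = [v for v in visits if v not in visit_outliers]
print("outliers:", visit_outliers)
print("mean:  ", statistics.mean(visits), "->", round(statistics.mean(kept), 2))
print("median:", statistics.median(visits), "->", statistics.median(kept))
```

For the travel times the sketch reproduces the quartiles quoted in the text, which suggests the quartile convention assumed here matches the one used in the slides. For the web site data the mean changes far more than the median once the flagged values are set aside, which is the sensitivity that mistake 4 warns about.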