AP Stats Chapter 4 Part 3
Displaying and Summarizing Quantitative Data

Learning Goals
1. Know how to display the distribution of a quantitative variable with a histogram, a stem-and-leaf display, or a dotplot.
2. Know how to display the relative position of a quantitative variable with a Cumulative Frequency Curve and analyze the Cumulative Frequency Curve.
3. Be able to describe the distribution of a quantitative variable in terms of its shape.
4. Be able to describe any anomalies or extraordinary features revealed by the display of a variable.
5. Be able to determine the shape of the distribution of a variable by knowing something about the data.
6. Know the basic properties and how to compute the mean and median of a set of data.
7. Understand the properties of a skewed distribution.
8. Know the basic properties and how to compute the standard deviation and IQR of a set of data.
9. Understand which measures of center and spread are resistant and which are not.
10. Be able to select a suitable measure of center and a suitable measure of spread for a variable based on information about its distribution.
11. Be able to describe the distribution of a quantitative variable in terms of its shape, center, and spread.

Learning Goal 6
Know the basic properties and how to compute the mean and median of a set of data.

Learning Goal 6: Measures of Central Tendency
A measure of central tendency for a collection of data values is a number that is meant to convey the idea of centralness or center of the data set. The most commonly used measures of central tendency for sample data are the mean, median, and mode.

Learning Goal 6: Measures of Central Tendency Overview
• Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
• Median: midpoint of ranked values
• Mode: most frequently observed value

Learning Goal 6: The Mean
• Mean: The mean of a set of numerical (data) values is the (arithmetic) average for the set of values.
• When computing the value of the mean, the data values can be population values or sample values.
• Hence we can compute either the population mean or the sample mean.

Learning Goal 6: Mean Notation
• NOTATION: The population mean is denoted by the Greek letter µ (read as “mu”).
• NOTATION: The sample mean is denoted by $\bar{x}$ (read as “x-bar”).
• Normally the population mean is unknown.

Learning Goal 6: The Mean
The mean is the most common measure of central tendency. The mean is also the preferred measure of center, because it uses all the data in calculating the center. For a sample of size n:
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
Here n is the sample size and $X_1, \ldots, X_n$ are the observed values.

Learning Goal 6: The Mean - Example
• What is the mean of the following 11 sample values?
3 8 6 14 0 0 12 -7 0 -10 -4

Learning Goal 6: The Mean - Example (Continued)
• Solution: $\bar{x} = \frac{3 + 8 + 6 + 14 + 0 + (-4) + 0 + 12 + (-7) + 0 + (-10)}{11} = \frac{22}{11} = 2$

Learning Goal 6: Mean – Frequency Table
• When a data set has a large number of values, we summarize it as a frequency table.
• The frequencies represent the number of times each value occurs.
• When the mean is calculated from a frequency table it is an approximation, because the raw data is not known.

Learning Goal 6: Mean – Frequency Table Example
What is the mean of the following 11 sample values (the same data as before)?
Class          Frequency
-10 to < -4    2
-4 to < 2      4
2 to < 8       2
8 to < 14      2
14 to < 20     1

Learning Goal 6: Mean – Frequency Table Example
Solution:
Class          Midpoint   Frequency
-10 to < -4    -7         2
-4 to < 2      -1         4
2 to < 8       5          2
8 to < 14      11         2
14 to < 20     17         1
$\bar{x} \approx \frac{2(-7) + 4(-1) + 2(5) + 2(11) + 1(17)}{11} = \frac{31}{11} \approx 2.82$

Learning Goal 6: Calculate Mean on TI-84
Raw Data
1. Enter the raw data into a list, STAT/Edit.
2. Calculate the mean, STAT/CALC/1-Var Stats
   List: L1
   FreqList: (leave blank)
   Calculate

Learning Goal 6: Calculate Mean on TI-84
Frequency Table Data (Same Data)
Class      Mark   Freq
0-50       25     1
50-100     75     1
100-150    125    3
150-200    175    4
200-250    225    7
250-300    275    4
1. Enter the frequency table data into two lists (L1 – Class Midpoint, L2 – Frequency), STAT/Edit.
2. Calculate the mean, STAT/CALC/1-Var Stats
   List: L1
   FreqList: L2
   Calculate

Learning Goal 6: Calculate Mean on TI-84 – Your Turn
Raw Data:
548, 405, 375, 400, 475, 450, 412
375, 364, 492, 482, 384, 490, 492
490, 435, 390, 500, 400, 491, 945
435, 848, 792, 700, 572, 739, 572

Learning Goal 6: Calculate Mean on TI-84 – Your Turn
Frequency Table Data (same):
Class Limits    Frequency
350 to < 450    11
450 to < 550    10
550 to < 650    2
650 to < 750    2
750 to < 850    2
850 to < 950    1

Learning Goal 6: Median
The median is the midpoint of the observations when they are ordered from the smallest to the largest (or from the largest to the smallest). If the number of observations is:
• Odd, then the median is the middle observation.
• Even, then the median is the average of the two middle observations.

Center of a Distribution -- Median
The median is the value with exactly half the data values below it and half above it. It is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas. It has the same units as the data.

Learning Goal 6: Finding the Median
The location of the median:
$\text{Median position} = \frac{n+1}{2}$ position in the ordered data
If the number of values is odd, the median is the middle number. If the number of values is even, the median is the average of the two middle numbers. Note that $\frac{n+1}{2}$ is not the value of the median, only the position of the median in the ranked data.

Learning Goal 6: Finding the Median – Example (n odd)
• What is the median for the following sample values?
3 8 6 2 12 -7 14 0 -1 -10 -4

Learning Goal 6: Finding the Median – Example (n odd)
• First of all, we need to arrange the data set in order (STAT/SortA).
• The ordered set is: -10 -7 -4 -1 0 2 3 6 8 12 14 (the 6th value is 2).
• Since the number of values is odd, the median is found in the 6th position in the ordered set (to find the position, divide the number of data values by 2 and round up: 11/2 = 5.5 ⇒ 6).
• Thus, the value of the median is 2.

Learning Goal 6: Finding the Median – Example (n even)
• Find the median age for the following eight college students.
23 19 32 25 26 22 24 20

Learning Goal 6: Finding the Median – Example (n even)
• First we have to order the values as shown below.
19 20 22 23 24 25 26 32 (the middle two values, 23 and 24, are averaged)
• Since there is an even number of ages, the median is the average of the two middle values (to find them, divide the number of data values by 2; that position and the next hold the two middle values: 8/2 = 4 ⇒ the 4th and 5th values).
• Thus, median = (23 + 24)/2 = 23.5.

Learning Goal 6: The Median - Summary
The median is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger.
Data as listed on the slide (n = 25): 0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
1. Sort observations from smallest to largest. n = number of observations.
2. If n is odd, the median is observation n/2 (rounded up) down the list. For n = 25, n/2 = 12.5, which rounds up to 13, so Median = 3.4.
3. If n is even, the median is the mean of the two center observations. For n = 24 (the same list without the last value, 6.1), n/2 = 12, so the center observations are the 12th and 13th and Median = (3.3 + 3.4)/2 = 3.35.

Learning Goal 6: Finding the Median on the TI-84
1. Enter data into L1.
2. STAT; CALC; 1:1-Var Stats

Learning Goal 6: Find the Mean and Median – Your Turn
CO2 pollution levels in the 8 largest nations, measured in metric tons per person:
2.3 1.1 19.7 9.8 1.8 1.2 0.7 0.2
a. Mean = 4.6, Median = 1.5
b. Mean = 4.6, Median = 5.8
c. Mean = 1.5, Median = 4.6

Learning Goal 6: Mode
• A measure of central tendency.
• The value that occurs most often.
• Used for either numerical or categorical data.
• There may be no mode or several modes.
• Seldom used as the reported measure of center.
[Figure: two dotplots, one on a 0 to 14 axis with Mode = 9 and one on a 0 to 6 axis with no mode.]

Learning Goal 6: Mode - Example
The mode is the measurement which occurs most frequently.
• The set: 2, 4, 9, 8, 8, 5, 3. The mode is 8, which occurs twice.
• The set: 2, 2, 9, 8, 8, 5, 3. There are two modes, 8 and 2 (bimodal).
• The set: 2, 4, 9, 8, 5, 3. There is no mode (each value is unique).

Learning Goal 6: Summary Measures of Center

Learning Goal 7
Understand the properties of a skewed distribution.

Learning Goal 7: Where is the Center of the Distribution?
If you had to pick a single number to describe all the data, what would you pick? It’s easy to find the center when a histogram is unimodal and symmetric: it’s right in the middle. On the other hand, it’s not so easy to find the center of a skewed histogram or a histogram with outliers.

Learning Goal 7: Meaningful measure of Center
Your measure of center must be meaningful. The distribution of women’s height appears coherent and symmetrical. The mean is a good measure of center.
Height of 25 women in a class: $\bar{x} = 69.3$
Is the mean always a good measure of center?

Learning Goal 7: Impact of Skewed Data
Mean and median of a symmetric distribution and of a skewed distribution.
Disease X (symmetric): $\bar{x} = 3.4$, $M = 3.4$. Mean and median are the same.
Multiple myeloma (skewed): $\bar{x} = 3.4$, $M = 2.5$. The mean is pulled toward the skew.

Learning Goal 7: The Mean
Nonresistant – The mean is sensitive to the influence of extreme values and/or outliers.
Skewed distributions pull the mean away from the center towards the longer tail. The mean is located at the balancing point of the histogram. For a skewed distribution, the mean is not a good measure of center.

Learning Goal 7: Mean – Nonresistant Example
The most common measure of central tendency. Affected by extreme values (skewed distributions or outliers).
Values 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Values 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4

Learning Goal 7: The Median
Resistant – The median is said to be resistant, because extreme values and/or outliers have little effect on the median. In an ordered array, the median is the “middle” number (50% above, 50% below).

Learning Goal 7: Median – Resistant Example
Not affected by extreme values (skewed distributions or outliers).
[Figure: two dotplots on a 0 to 10 axis; Median = 3 in both.]

Learning Goal 7: Mean vs. Median with Outliers
[Figure: two histograms of percent of people dying. Without the outliers, $\bar{x} = 3.4$; with the outliers, $\bar{x} = 4.2$.]
The mean (non-resistant) is pulled to the right a lot by the outliers (from 3.4 to 4.2). The median (resistant), on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).

Learning Goal 7: Effect of Skewed Distributions
• The figure below shows the relative positions of the mean and median for right-skewed, symmetric, and left-skewed distributions.
• Note that the mean is pulled in the direction of skewness, that is, in the direction of the extreme observations.
• For a right-skewed distribution, the mean is greater than the median; for a symmetric distribution, the mean and the median are equal; and, for a left-skewed distribution, the mean is less than the median.

Learning Goal 7: Comparing the mean and the median
The mean and the median coincide when the distribution is symmetrical. The median is a measure of center that is resistant to skew and outliers. The mean is not.
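The nonresistant example in this section (the values 1, 2, 3, 4, 5, with the 5 then replaced by 10) can be checked with a few lines of Python. This is only a minimal sketch using the standard `statistics` module; the variable names are illustrative and are not part of the slides.

```python
from statistics import mean, median

# Values from the slide example; the largest value 5 is replaced by 10.
original = [1, 2, 3, 4, 5]
with_extreme = [1, 2, 3, 4, 10]

for label, data in (("original", original), ("with extreme value", with_extreme)):
    # The mean shifts when one value becomes extreme; the median does not.
    print(f"{label}: mean = {mean(data)}, median = {median(data)}")
```

Changing a single value moves the mean from 3 to 4, while the middle value of the ordered list stays where it was, which is what "resistant" means here.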
[Figure: positions of the mean and median for a symmetric distribution and for left-skewed and right-skewed distributions.]

Learning Goal 7: Which measure of location is the “best”?
Because the median considers only the order of values, it is resistant to values that are extraordinarily large or small; it simply notes that they are one of the “big ones” or “small ones” and ignores their distance from center. To choose between the mean and median, start by looking at the distribution.
• The mean is used for unimodal symmetric distributions, unless extreme values (outliers) exist.
• The median is used for skewed distributions or when there are outliers present, since the median is not sensitive to extreme values.

Learning Goal 7: Class Problem
Observed mean = 2.28, median = 3, mode = 3.1
What is the shape of the distribution and why?

Learning Goal 7: Example
Five houses on a hill by the beach.
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000

Learning Goal 7: Example – Measures of Center
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000
Sum: $3,000,000
Mean: ($3,000,000/5) = $600,000
Median: middle value of ranked data = $300,000
Mode: most frequent value = $100,000
Which is the best measure of center? Median

Conclusion – Mean or Median?
Mean – use with symmetrical distributions (no outliers), because it is nonresistant.
Median – use with skewed distributions or distributions with outliers, because it is resistant.

Learning Goal 8
Know the basic properties and how to compute the standard deviation and IQR of a set of data.

Learning Goal 8: How Spread Out is the Distribution?
Variation matters, and Statistics is about variation. Are the values of the distribution tightly clustered around the center or more spread out? Always report a measure of spread along with a measure of center when describing a distribution numerically.
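As a quick check of the five-house example in this section, the three measures of center can be computed with Python's standard `statistics` module. This is a sketch for illustration; the names `prices`, `house_mean`, `house_median`, and `house_mode` are hypothetical.

```python
from statistics import mean, median, mode

# House prices from the example, in dollars.
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

house_mean = mean(prices)      # pulled upward by the single $2,000,000 house
house_median = median(prices)  # middle value of the ranked prices
house_mode = mode(prices)      # most frequent price

print(f"mean = ${house_mean:,}, median = ${house_median:,}, mode = ${house_mode:,}")
```

Four of the five houses cost less than the mean, which is one way to see why the slides prefer the median for this skewed data set.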
Learning Goal 8: Measures of Spread
A measure of variability for a collection of data values is a number that is meant to convey the idea of spread for the data set. The most commonly used measures of variability for sample data are the:
• range
• interquartile range
• variance and standard deviation

Learning Goal 8: Measures of Variation
Variation: Range, Interquartile Range, Variance, Standard Deviation
Measures of variation give information on the spread or variability of the data values.
[Figure: two distributions with the same center but different variation.]

Learning Goal 8: The Interquartile Range
One way to describe the spread of a set of data might be to ignore the extremes and concentrate on the middle of the data. The interquartile range (IQR) lets us ignore extreme data values and concentrate on the middle of the data. To find the IQR, we first need to know what quartiles are…

Learning Goal 8: The Interquartile Range
Quartiles divide the data into four equal sections.
• One quarter of the data lies below the lower quartile, Q1.
• One quarter of the data lies above the upper quartile, Q3.
• The quartiles border the middle half of the data.
The difference between the quartiles is the interquartile range (IQR), so
IQR = upper quartile (Q3) – lower quartile (Q1)

Learning Goal 8: Interquartile Range
Eliminate some outlier or extreme value problems by using the interquartile range. Eliminate some high- and low-valued observations and calculate the range from the remaining values.
IQR = 3rd quartile – 1st quartile
IQR = Q3 – Q1

Learning Goal 8: Finding Quartiles
1. Order the data.
2. Find the median; this divides the data into a lower and an upper half (the median itself is in neither half).
3. Q1 is then the median of the lower half.
4. Q3 is the median of the upper half.
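The quartile steps above can be sketched in Python. The helper below is a hypothetical illustration of the median-of-halves method described on the slide (calculators and software packages may use slightly different quartile rules), applied to the 20 New York travel times from the "Find and Interpret IQR" example in this section.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumes at least two values. When the count is odd, the middle
    value belongs to neither half.
    """
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]           # values below the median position
    upper = ordered[n // 2 + n % 2 :]   # skips the middle value when n is odd
    return median(lower), median(ordered), median(upper)

# Travel times to work (minutes) for 20 New Yorkers, from the example.
travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

q1, m, q3 = quartiles(travel_times)
iqr = q3 - q1
print(f"Q1 = {q1}, M = {m}, Q3 = {q3}, IQR = {iqr}")
```

The results agree with the values reported in the travel-time example (Q1 = 15, M = 22.5, Q3 = 42.5, IQR = 27.5 minutes).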
Example
Even data: Q1 = 27, M = 39, Q3 = 50.5, IQR = 50.5 – 27 = 23.5
Odd data: Q1 = 35, M = 46, Q3 = 54, IQR = 54 – 35 = 19

Learning Goal 8: Quartiles
Example:
Minimum   Q1   Median (Q2)   Q3   Maximum
12        30   45            57   70
Each of the four segments holds 25% of the data; the middle fifty percent lies between Q1 and Q3.
Interquartile range = 57 – 30 = 27
Not influenced by extreme values (Resistant).

Learning Goal 8: Quartiles
Quartiles split the ranked data into 4 segments with an equal number of values per segment (25% each, separated by Q1, Q2, and Q3).
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger.
• Q2 is the same as the median (50% are smaller, 50% are larger).
• Only 25% of the observations are greater than the third quartile.

Learning Goal 8: The Interquartile Range - Histogram
The lower and upper quartiles are the 25th and 75th percentiles of the data, so the IQR contains the middle 50% of the values of the distribution.
[Figure: histogram with the IQR marked.]

Learning Goal 8: Find and Interpret IQR
Travel times to work for 20 randomly selected New Yorkers:
10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
Ordered:
5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
Q1 = 15, M = 22.5, Q3 = 42.5
IQR = Q3 – Q1 = 42.5 – 15 = 27.5 minutes
Interpretation: The range of the middle half of travel times for the New Yorkers in the sample is 27.5 minutes.

Learning Goal 8: Interquartile Range on the TI-84
• Use STAT/CALC/1-Var Stats to find Q1 and Q3.
• Then calculate IQR = Q3 – Q1.
Interquartile range = Q3 – Q1 = 9 – 6 = 3.

Learning Goal 8: Calculate IQR - Your Turn
The following scores for a 10-point statistics quiz were reported. What is the value of the interquartile range?
7 8 9 6 8 0 9 9 9 0 0 7 10 9 8 5 7 9

Learning Goal 8: 5-Number Summary
Definition: The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest.
Minimum   Q1   M   Q3   Maximum

Learning Goal 8: 5-Number Summary
The 5-number summary of a distribution reports its minimum, 1st quartile Q1, median, 3rd quartile Q3, and maximum, in that order. Obtain the 5-number summary from 1-Var Stats.
Min.   Q1    Med.   Q3    Max.
3.7    6.6   7      7.6   9

Learning Goal 8: Calculate 5 Number Summary
1. Enter data into L1.
2. STAT; CALC; 1:1-Var Stats; Enter.
3. List: L1.
4. Calculate.
5. Scroll down to the 5 number summary.

Learning Goal 8: Calculate 5 Number Summary – Your Turn
The grades of 25 students are given below:
42, 63, 47, 77, 46, 71, 68, 83, 91, 55, 67, 66, 63, 57, 50, 69, 73, 82, 77, 58, 66, 79, 88, 97, 86.
Calculate the 5 number summary for the students' grades.

Learning Goal 8: Calculate 5 Number Summary – Your Turn
A group of university students took part in a sponsored race. The number of laps completed is given in the table.
Number of laps (x)   Frequency
1 – 5                2
6 – 10               9
11 – 15              15
16 – 20              20
21 – 25              17
26 – 30              25
31 – 35              2
36 – 40              1
Calculate the 5 number summary.

Learning Goal 8: Standard Deviation
A more powerful measure of spread than the IQR is the standard deviation, which takes into account how far each data value is from the mean. A deviation is the distance that a data value is from the mean. Since adding all deviations together would total zero, we square each deviation and find an average of sorts for the deviations. But to calculate the standard deviation you must first calculate the variance.

Learning Goal 8: Variance
The variance is a measure of variability that uses all the data. It measures the average squared deviation of the measurements about their mean.

Learning Goal 8: Variance
The variance, notated by $s^2$, is found by summing the squared deviations and (almost) averaging them:
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Used to calculate the standard deviation. The variance will play a role later in our study, but it is problematic as a measure of spread: it is measured in squared units, not the same units as the data, a serious disadvantage!
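The sample variance formula above can be traced step by step in Python. The sketch below uses the nine pet counts from the worked standard deviation example later in this section (1, 3, 4, 4, 4, 5, 7, 8, 9); the variable names are illustrative, and `statistics.variance` and `statistics.stdev` from the standard library are used only as a cross-check.

```python
from math import sqrt
from statistics import stdev, variance

# Number of pets owned by each of 9 children (data from the worked example).
pets = [1, 3, 4, 4, 4, 5, 7, 8, 9]

n = len(pets)
x_bar = sum(pets) / n                                  # mean of the data
squared_deviations = [(x - x_bar) ** 2 for x in pets]  # (x - x_bar)^2 for each value
sample_variance = sum(squared_deviations) / (n - 1)    # divide by n - 1, not n
sample_sd = sqrt(sample_variance)                      # back to the original units

print(f"mean = {x_bar}, variance = {sample_variance}, sd = {sample_sd:.2f}")

# Cross-check against the standard library's sample variance and sample sd.
assert abs(variance(pets) - sample_variance) < 1e-9
assert abs(stdev(pets) - sample_sd) < 1e-9
```

The variance here is in "pets squared", which is the unit problem the slide points out; taking the square root returns the measure to pets.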
Learning Goal 8: Variance
The variance of a population of N measurements is the average of the squared deviations of the measurements about their mean µ.
Sigma squared: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
The variance of a sample of n measurements is the sum of the squared deviations of the measurements about their mean, divided by (n – 1).
S squared: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

Learning Goal 8: Standard Deviation
The standard deviation, s, is just the square root of the variance. It is measured in the same units as the original data, which is why it is preferred over the variance.
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Learning Goal 8: Standard Deviation
In calculating the variance, we squared all of the deviations, and in doing so changed the scale of the measurements. To return this measure of variability to the original units of measure, we calculate the standard deviation, the positive square root of the variance.
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$

Learning Goal 8: Finding Standard Deviation
The most common measure of spread looks at how far each observation is from the mean. This measure is called the standard deviation. Let’s explore it! Consider the following data on the number of pets owned by a group of 9 children.
1) Calculate the mean: $\bar{x} = 5$.
2) Calculate each deviation: deviation = observation – mean. For example, 1 - 5 = -4 and 8 - 5 = 3.

Learning Goal 8: Finding Standard Deviation
3) Square each deviation.
4) Find the “average” squared deviation. Calculate the sum of the squared deviations divided by (n-1)…this is called the variance.
5) Calculate the square root of the variance…this is the standard deviation.
xi   (xi - mean)    (xi - mean)^2
1    1 - 5 = -4     (-4)^2 = 16
3    3 - 5 = -2     (-2)^2 = 4
4    4 - 5 = -1     (-1)^2 = 1
4    4 - 5 = -1     (-1)^2 = 1
4    4 - 5 = -1     (-1)^2 = 1
5    5 - 5 = 0      (0)^2 = 0
7    7 - 5 = 2      (2)^2 = 4
8    8 - 5 = 3      (3)^2 = 9
9    9 - 5 = 4      (4)^2 = 16
     Sum = ?        Sum = ?
“Average” squared deviation = 52/(9-1) = 6.5. This is the variance.
Standard deviation = square root of variance = $\sqrt{6.5} \approx 2.55$

Learning Goal 8: Standard Deviation - Example
The standard deviation is used to describe the variation around the mean.
1) First calculate the variance $s^2$:
$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
2) Then take the square root to get the standard deviation s:
$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
[Figure: distribution with the mean $\bar{x}$ and the interval mean ± 1 s.d. marked.]

Learning Goal 8: Standard Deviation - Procedure
1. Compute the mean $\bar{x}$.
2. Subtract the mean from each individual value to get a list of the deviations from the mean, $(x - \bar{x})$.
3. Square each of the differences to produce the squares of the deviations from the mean, $(x - \bar{x})^2$.
4. Add all of the squares of the deviations from the mean to get $\sum (x - \bar{x})^2$.
5. Divide the sum by $n - 1$ to get $\frac{\sum (x - \bar{x})^2}{n - 1}$. [variance]
6. Find the square root of the result.

Learning Goal 8: Standard Deviation - Example
Find the standard deviation of the Mulberry Bank customer waiting times. Those times (in minutes) are 1, 3, 14. Use a table. We will not normally calculate standard deviation by hand.

Learning Goal 8: Calculate Standard Deviation
1. Enter data into L1.
2. STAT; CALC; 1:1-Var Stats; Enter.
3. List: L1; Calculate.
4. Sx is the sample standard deviation.
5. σx is the population standard deviation.

Learning Goal 8: Calculate Standard Deviation – Your Turn
The prices ($) of 18 brands of walking shoes:
90 70 70 70 75 70 65 68 60 74 70 95 75 70 68 65 40 65
Calculate the standard deviation.

Learning Goal 8: Calculate Standard Deviation – Your Turn
During 3 hours at Heathrow airport 55 aircraft arrived late. The number of minutes they were late is shown in the grouped frequency table.
Minutes late   Frequency
0 - 9          27
10 - 19        10
20 - 29        7
30 - 39        5
40 - 49        4
50 - 59        2
Calculate the standard deviation for the number of minutes late.

Learning Goal 8: Standard Deviation - Properties
• The value of s is never negative. s is zero only when all of the data values are the same number.
• Larger values of s indicate greater amounts of variation.
• The units of s are the same as the units of the original data. This is one reason s is preferred to $s^2$.
• Measures spread about the mean and should only be used to describe the spread of a distribution when the mean is used to describe the center (i.e., symmetrical distributions).
• Nonresistant (like the mean): s can increase dramatically due to extreme values or outliers.

Learning Goal 8: Standard Deviation - Example
Larger values of standard deviation indicate greater amounts of variation.
[Figure: one distribution with a small standard deviation and one with a large standard deviation.]

Learning Goal 8: Standard Deviation - Example
Standard Deviation: the more variation, the larger the standard deviation. Data set II has greater variation.

Learning Goal 8: Standard Deviation - Example
[Figure: Data Set I and Data Set II.]
Data set II has greater variation and the visual clearly shows that it is more spread out.

Learning Goal 8: Comparing Standard Deviations
The more variation, the larger the standard deviation.
[Figure: three dotplots on an 11 to 21 axis.]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.567
Values far from the mean are given extra weight (because deviations from the mean are squared).

Learning Goal 8: Spread: Range
The range of the data is the difference between the maximum and minimum values:
Range = max – min
A disadvantage of the range is that a single extreme value can make it very large and, thus, not representative of the data overall.

Learning Goal 8: Range
Simplest measure of variation. Difference between the largest and the smallest values in a set of data.
Range = Xlargest – Xsmallest
Example: [Figure: dotplot on a 0 to 14 axis.] Range = 14 - 1 = 13

Learning Goal 8: Disadvantages of the Range
Ignores the way in which data are distributed:
[Figure: two differently shaped dotplots on a 7 to 12 axis; in both, Range = 12 - 7 = 5.]
Sensitive to outliers:
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119

Learning Goal 8: Range
• The range is affected by outliers (large or small values relative to the rest of the data set).
• The range does not utilize all the information in the data set, only the largest and smallest values.
• Thus, range is not a very useful measure of spread or variation.

Learning Goal 8: Summary Measures
Describing Data Numerically
• Central Tendency: Mean, Median, Mode
• Quartiles
• Variation: Range, Interquartile Range, Variance, Standard Deviation
• Shape: Skewness

Learning Goal 9
Understand which measures of center and spread are resistant and which are not.

Learning Goal 9: Resistant or Non-Resistant
Which measures of center and spread are resistant?
1. Median – Extreme values and outliers have little effect.
2. IQR – Measures the spread of the middle 50% of the data, therefore extreme values and outliers have little effect.
3. When using Median to measure the center of a distribution, use IQR to measure the spread of the distribution.

Learning Goal 9: Resistant or Non-Resistant
Which measures of center and spread are Non-Resistant?
1. Mean – Extreme values and outliers pull the mean towards those values.
2. Standard Deviation – Measures the spread relative to the mean. Extreme values or outliers will increase the standard deviation of the distribution.
3. When using Mean to measure the center of a distribution, use Standard Deviation to measure the spread of the distribution.
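To see resistance in numbers, the sketch below reuses the two 25-value data sets from the "Disadvantages of the Range" slide, which differ only in their last value (5 versus 120). The `iqr` helper is a hypothetical implementation of the median-of-halves quartile method described in this section; other quartile conventions could give slightly different IQR values.

```python
from statistics import median, stdev

def iqr(values):
    """Interquartile range using the median-of-halves method (needs 2+ values)."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]           # lower half
    upper = ordered[n // 2 + n % 2 :]   # upper half, skipping the middle value if n is odd
    return median(upper) - median(lower)

# Data from the slide: eleven 1s, eight 2s, four 3s, a 4, and a 5.
base = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
# Same data with the largest value replaced by the outlier 120.
with_outlier = base[:-1] + [120]

for label, data in (("without outlier", base), ("with outlier", with_outlier)):
    data_range = max(data) - min(data)
    print(f"{label}: median = {median(data)}, IQR = {iqr(data)}, "
          f"range = {data_range}, s = {stdev(data):.2f}")
```

Under these assumptions the median and IQR are identical for both data sets, while the range and the standard deviation grow sharply once the outlier is present.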
Learning Goal 9: Resistant or Non-Resistant
Measures of Center:
• Mean (not resistant)
• Median (resistant)
Measures of Spread:
• Standard deviation (not resistant)
• IQR (resistant)
• Range (not resistant)
The mean and the standard deviation are used most often and are preferred, because they are calculated from all the data values and so use all the available information.

Learning Goal 9: Resistant or Non-Resistant
Animated Center and Spread
[Figure: animated dotplot of Quiz Scores on a 20 to 100 axis, shown in three stages as data points are added.]
Stage 1: Mean: 72.5, Median: 72.5, S: 10.16, IQR: 15
Stage 2: Mean: 68.82, Median: 70, S: 12.56, IQR: 20
Stage 3: Mean: 63.33, Median: 70, S: 16.84, IQR: 30
What is the difference between the center and spread of a distribution? Which measure of center (mean or median) was affected more by adding data points that skewed the distribution? Explain your answer.
In a symmetric distribution:
• The mean, non-resistant, is used to represent the center.
• The standard deviation (S), non-resistant, is used to represent the spread.
In a skewed distribution:
• The median, resistant, is used to represent the center.
• The interquartile range (IQR), resistant, is used to represent the spread.
For each distribution below, which measure of center and spread would you use? How do you know?
[Figure: distributions A and B. Answer choices: Mean & S; Median & IQR.]
©2013 All rights reserved. CCSS 6th Grade Statistics and Probability 2.0 Describe the distribution of a data set. Lesson to be used by EDI-trained teachers only.

Learning Goal 9: Resistant or Non-Resistant
Median and IQR are paired together – Resistant.
Mean and Standard Deviation are paired together – Non-Resistant.

Learning Goal 10
Be able to select a suitable measure of center and a suitable measure of spread for a variable based on information about its distribution.
Learning Goal 10: Choosing Measures of Center and Spread
We now have a choice between two descriptions for center and spread:
• Mean and Standard Deviation
• Median and Interquartile Range
Choosing Measures of Center and Spread
• The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.
• Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.
• NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!

Learning Goal 10: Choosing Measures of Center and Spread
1. Plot your data: Dotplot, Stemplot, Histogram.
2. Interpret what you see: Shape, Outliers, Center, Spread.
3. Choose numerical summary: $\bar{x}$ and s, or Median and IQR.

Learning Goal 10: Choosing Center and Spread - Practice
The distribution of a data set shows the arrangement of values in the data set. The center of a distribution is a number that represents all the values in the data set. The spread of a distribution is a number that describes the variability in the data set.
Symmetric – Center: Mean, Spread: S. Skewed – Center: Median, Spread: IQR.
The dot plots below show the ratings given to a new movie by two different audiences.
1. Audience #1 [Figure: dot plot of Audience Rating, 1 to 10.]
Mean: 7, Median: 7, S: 1.43, IQR: 2
Shape: The shape of the distribution is mostly symmetric.
Center: Because the distribution is symmetric, the mean of 7 can be used as the measure of center.
Spread: The S of the distribution is 1.43.
2. Audience #2 [Figure: dot plot of Audience Rating, 1 to 10.]
Mean: 5.71, Median: 6, S: 1.67, IQR: 3
Shape: The shape of the distribution is mostly symmetric.
Center: Because the distribution is symmetric, the mean of 5.71 can be used as the measure of center.
Spread: The S of the distribution is 1.67.
Learning Goal 10: Choosing Center and Spread - Practice
The distribution of a data set shows the arrangement of values in the data set. The center of a distribution is a number that represents all the values in the data set. The spread of a distribution is a number that describes the variability in the data set.
Symmetric – Center: Mean, Spread: S. Skewed – Center: Median, Spread: IQR.
The histograms below show the number of hours studied in a week for students in two math classes.
3. Class #1 [Figure: histogram of Students by Hours Studied, classes 0-2 through 15-17.]
Mean: 9.69, Median: 10.5, S: 3.6, IQR: 6.5
Shape: The shape of the distribution is skewed to the left.
Center: Because the distribution is skewed, the median of 10.5 can be used as the measure of center.
Spread: The IQR of the distribution is 6.5.
4. Class #2 [Figure: histogram of Students by Hours Studied, classes 0-2 through 15-17.]
Mean: 7.75, Median: 7, S: 2.93, IQR: 4.5
Shape: The shape of the distribution is skewed to the right.
Center: Because the distribution is skewed, the median of 7 can be used as the measure of center.
Spread: The IQR of the distribution is 4.5.

Learning Goal 10: Choosing Center and Spread - Practice
The distribution of a data set shows the arrangement of values in the data set. The center of a distribution is a number that represents all the values in the data set. The spread of a distribution is a number that describes the variability in the data set.
1. The dot plot below shows the number of hours of sleep per night for 33 students in a 6th-grade class. [Figure: dot plot of Hours of Sleep, 4 to 11.]
Mean: 8.4, Median: 9, S: 1.53, IQR: 3
Shape: The shape of the distribution is skewed left.
Center: Because the distribution is skewed, the median of 9 can be used as the measure of center.
Spread: The IQR of the distribution is 3.
2. The histogram below shows the number of hours of sleep per night for 33 adults selected at random. [Figure: histogram of Adults by Hours Slept, classes 0-1 through 10+.]
Mean: 6.8, Median: 7, S: 1.54, IQR: 2.5
Shape: The shape of the distribution is fairly symmetric, with a slight skew to the left.
Center: Because the distribution is mostly symmetric, the mean of 6.8 can be used as the measure of center.
Spread: The S of the distribution is 1.54.

Learning Goal 10: Choosing Center and Spread - Practice
The histograms below show the scores of 31 students on a pretest and posttest.
1. Pretest [Figure: histogram of Students by Score, classes 41-50 through 91-100.]
Mean: 57.67, Median: 54, S: 9.07, IQR: 14
Shape: The shape of the distribution is skewed right.
Center: Because the distribution is skewed, the median of 54 can be used as the measure of center.
Spread: The IQR of the distribution is 14.
2. Posttest [Figure: histogram of Students by Score, classes 41-50 through 91-100.]
Mean: 76, Median: 76, S: 9.81, IQR: 24
Shape: The shape of the distribution is mostly symmetric.
Center: Because the distribution is mostly symmetric, the mean of 76 can be used as the measure of center.
Spread: The S of the distribution is 9.81.
Did scores on the test improve from the pretest to the posttest? Explain your answer.
Yes, test scores improved from the pretest to the posttest. It can be seen by the noticeably higher center in the distribution of scores for the posttest.
Learning Goal 10: Choosing Center and Spread - Practice

1. The dot plot below shows the number of pets in each household of 28 students in a 6th-grade class.
[Dot plot: Number of Pets, axis 0 to 9]
Mean: 1.82 Median: 2 S: 1.13 IQR: 1.5
Shape: The shape of the distribution is skewed right.
Center: Because the distribution is skewed, the median of 2 can be used as the measure of center.
Spread: The IQR of the distribution is 1.5.

Learning Goal 10: Choosing Center and Spread - Questions

Choose Yes or No to indicate whether each statement is true about these distributions.
A. Both distributions are symmetric. O Yes O No
B. The median is the best measure of center for Distribution A. O Yes O No
C. Overall, scores were higher in Distribution A than Distribution B. O Yes O No
D. There is more variability in scores for Distribution A than Distribution B. O Yes O No
E. Distribution A is skewed to the right. O Yes O No
F. The standard deviation can be used to describe the spread for Distribution B. O Yes O No

Learning Goal 11
Be able to describe the distribution of a quantitative variable in terms of its shape, center, and spread.

Learning Goal 11: How to Analyze Quantitative Data
• Examine each variable by itself.
• Then study relationships among the variables.
2009 Fuel Economy Guide

    MODEL              MPG       MODEL                MPG       MODEL                 MPG
 1  Acura RL            22    9  Dodge Avenger         30   17  Mercury Milan          29
 2  Audi A6 Quattro     23   10  Hyundai Elantra       33   18  Mitsubishi Galant      27
 3  Bentley Arnage      14   11  Jaguar XF             25   19  Nissan Maxima          26
 4  BMW 528i            28   12  Kia Optima            32   20  Rolls Royce Phantom    18
 5  Buick Lacrosse      28   13  Lexus GS 350          26   21  Saturn Aura            33
 6  Cadillac CTS        25   14  Lincoln MKZ           28   22  Toyota Camry           31
 7  Chevrolet Malibu    33   15  Mazda 6               29   23  Volkswagen Passat      29
 8  Chrysler Sebring    30   16  Mercedes-Benz E350    24   24  Volvo S80              25

• Start with a graph or graphs.
• Add numerical summaries.

Learning Goal 11: How to Describe a Quantitative Distribution
The purpose of a graph is to help us understand the data. After you make a graph, always ask, “What do I see?”

How to Describe the Distribution of a Quantitative Variable
In any graph, look for the overall pattern and for striking departures from that pattern. Describe the overall pattern of a distribution by its:
• Shape
• Outliers
• Center
• Spread
Don’t forget your SOCS!
Note individual values that fall outside the overall pattern. These departures are called outliers.

Learning Goal 11: Describing a Quantitative Distribution
We describe a distribution (the values the variable takes on and how often it takes these values) using the acronym SOCS.
Shape: We describe the shape of a distribution in one of two ways: symmetric/approximately symmetric, or skewed right/skewed left.
[Dot plot: Babe Ruth’s Single Season Home Runs; axis: Number of Home Runs in a Single Season, 20 to 65; shape: approximately symmetric (with extreme values)]

Learning Goal 11: Describing a Quantitative Distribution
Outliers: Observations that we would consider “unusual”; data that don’t “fit” the overall pattern of the distribution. Babe Ruth had two seasons that appear to be somewhat different than the rest of his career. These may be “outliers”.
(We’ll learn a numerical way to determine if observations are truly “unusual” later.)
[Dot plot: Babe Ruth’s Single Season Home Runs; axis: Number of Home Runs in a Single Season, 20 to 65; possible outliers marked at 22 and 25]

Learning Goal 11: Describing a Quantitative Distribution
Center: A single value that describes the entire distribution. Use the mean for symmetric distributions and the median for skewed distributions.
[Dot plot: Babe Ruth’s Single Season Home Runs; axis: Number of Home Runs in a Single Season, 20 to 65; median is 46]

Learning Goal 11: Describing a Quantitative Distribution
Spread: Describes the variation of a distribution. Use the standard deviation for symmetric distributions and the IQR for skewed distributions.
[Dot plot: Babe Ruth’s Single Season Home Runs; axis: Number of Home Runs in a Single Season, 20 to 65; Q1 and Q3 marked; IQR is 19]

Learning Goal 11: Distribution Description using SOCS
The distribution of Babe Ruth’s number of home runs in a single season is approximately symmetric (1) with two possible outliers at 22 and 25 home runs (2). He typically hit about 46 home runs (3) in a season. Over his career, the number of home runs normally varied between 35 and 54 (4).
(1) Shape (2) Outliers (3) Center (4) Spread

Learning Goal 11: Describe the Distribution – Your Turn
The table and dotplot below display the Environmental Protection Agency’s estimates of highway gas mileage in miles per gallon (MPG) for a sample of 24 model year 2009 midsize cars. Describe the shape, center, and spread of the distribution. Are there any outliers?
2009 Fuel Economy Guide

    MODEL              MPG       MODEL                MPG       MODEL                 MPG
 1  Acura RL            22    9  Dodge Avenger         30   17  Mercury Milan          29
 2  Audi A6 Quattro     23   10  Hyundai Elantra       33   18  Mitsubishi Galant      27
 3  Bentley Arnage      14   11  Jaguar XF             25   19  Nissan Maxima          26
 4  BMW 528i            28   12  Kia Optima            32   20  Rolls Royce Phantom    18
 5  Buick Lacrosse      28   13  Lexus GS 350          26   21  Saturn Aura            33
 6  Cadillac CTS        25   14  Lincoln MKZ           28   22  Toyota Camry           31
 7  Chevrolet Malibu    33   15  Mazda 6               29   23  Volkswagen Passat      29
 8  Chrysler Sebring    30   16  Mercedes-Benz E350    24   24  Volvo S80              25

Learning Goal 11: Describe the Distribution – Your Turn
Smart Phone Battery Life: Here is the estimated battery life for each of 9 different smart phones in minutes. Describe the distribution.

Smart Phone         Battery Life (minutes)
Apple iPhone        300
Motorola Droid      385
Palm Pre            300
Blackberry Bold     360
Blackberry Storm    330
Motorola Cliq       360
Samsung Moment      330
Blackberry Tour     300
HTC Droid           460
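To check the center and spread parts of a SOCS description for the two data sets above, the numerical summaries can be computed with a short Python sketch. It is not part of the original slides. The helper names `quartiles` and `summarize` are made up for this illustration, and the quartiles use the median-of-halves convention (the overall median is left out of both halves when the count is odd), which is an assumption; calculators and software may use slightly different quartile rules. Shape and outliers still have to be judged from a graph.

```python
import statistics


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    lower = data[:half]    # values below the median position
    upper = data[-half:]   # values above the median position
    return statistics.median(lower), statistics.median(data), statistics.median(upper)


def summarize(values):
    """Collect the usual measures of center and spread in one dictionary."""
    q1, median, q3 = quartiles(values)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "s": statistics.stdev(values),  # sample standard deviation
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


# Highway MPG for the 24 midsize cars, in table order (models 1 to 24).
mpg = [22, 23, 14, 28, 28, 25, 33, 30,
       30, 33, 25, 32, 26, 28, 29, 24,
       29, 27, 26, 18, 33, 31, 29, 25]

# Battery life in minutes for the 9 smart phones, in table order.
battery = [300, 385, 300, 360, 330, 360, 330, 300, 460]

mpg_summary = summarize(mpg)
battery_summary = summarize(battery)

for name, summary in [("MPG", mpg_summary), ("Battery life", battery_summary)]:
    print(name)
    for key, value in summary.items():
        print(f"  {key}: {value:.2f}")
```

Under these assumptions the MPG data have a mean of 27, a median of 28, and an IQR of 5 (Q1 = 25, Q3 = 30), while the battery data have a median of 330 minutes and an IQR of 72.5 minutes (Q1 = 300, Q3 = 372.5).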