Chapter 3
Numerically Summarizing Data

Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 1 of 3

Chapter 3 – Numerically Summarizing Data
    3.1 Measures of Central Tendency
    3.2 Measures of Dispersion
    3.3 Measures of Central Tendency and Dispersion from Grouped Data
    3.4 Measures of Position
    3.5 The Five Number Summary and Boxplots

Chapter 3 – Section 1
Measures of Central Tendency

● Analyzing populations versus analyzing samples
● For populations
    We know all of the data
    Descriptive measures of populations are called parameters
    Parameters are often written using Greek letters ( μ )
● For samples
    We know only part of the entire data
    Descriptive measures of samples are called statistics
    Statistics are often written using Roman letters ( \(\bar{x}\) )

● The arithmetic mean of a variable is often what people mean by the “average” … add up all the values and divide by how many there are
● Compute the arithmetic mean of 6, 1, 5
● Add up the three numbers and divide by 3
    (6 + 1 + 5) / 3 = 4.0
● The arithmetic mean is 4.0

● The arithmetic mean is usually called the mean
● For a population … the population mean
    Is computed using all the observations in a population
    Is denoted μ
    Is a parameter
● For a sample … the sample mean
    Is computed using only the observations in a sample
    Is denoted \(\bar{x}\)
    Is a statistic

● The median of a variable is the “center”
● When the data is sorted in order, the median is the middle value
● The calculation of the median of a variable is slightly different depending on
    If there are an odd number of points, or
    If there are an even number of points

● To calculate the median (M) of a data set
    Arrange the data in order
    Count the number of observations, n
● If n is odd
    There is a value that’s exactly in the middle
    That value is the median M
● If n is even
    There are two values on either side of the exact middle
    Take their mean to be the median M

● An example with an odd number of observations (5 observations)
● Compute the median of 6, 1, 11, 2, 11
● Sort them in order
    1, 2, 6, 11, 11
● The middle number is 6, so the median is 6

● An example with an even number of observations (4 observations)
● Compute the median of 6, 1, 11, 2
● Sort them in order
    1, 2, 6, 11
● Take the mean of the two middle values
    (2 + 6) / 2 = 4
● The median is 4

● One interpretation
● The median splits the data into halves
    62, 68, 71, 74, 77, 82, 84, 88, 90, 94
    M = 79.5
    62, 68, 71, 74, 77 … 5 on the left
    82, 84, 88, 90, 94 … 5 on the right

● The mode of a variable is the most frequently occurring value
● Find the mode of 6, 1, 2, 6, 11, 7, 3
● The values are 1, 2, 3, 6, 7, 11
● The value 6 occurs twice, all the other values occur only once
● The mode is 6
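The mean, median, and mode calculations above can be checked with a short Python sketch. The function name `median_by_hand` and the variable names are illustrative choices, not part of the slides; the data sets are the ones used in the slide examples, and the standard-library `statistics` module is used only for the mode and for a cross-check.

```python
import statistics


def median_by_hand(values):
    """Median as on the slides: sort, then take the middle value (odd n)
    or the mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Data sets copied from the slide examples.
mean_example = [6, 1, 5]
odd_example = [6, 1, 11, 2, 11]
even_example = [6, 1, 11, 2]
halves_example = [62, 68, 71, 74, 77, 82, 84, 88, 90, 94]
mode_example = [6, 1, 2, 6, 11, 7, 3]

print("mean:", sum(mean_example) / len(mean_example))
print("median (odd n):", median_by_hand(odd_example))
print("median (even n):", median_by_hand(even_example))
print("median (halves example):", median_by_hand(halves_example))
print("mode:", statistics.mode(mode_example))

# The hand-written median agrees with the standard library.
assert median_by_hand(even_example) == statistics.median(even_example)
```

Writing the median by hand mirrors the odd/even rule on the slides; in practice `statistics.median` gives the same result.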
● Qualitative data
    Values are one of a set of categories
    Cannot add or order them … the mean and median do not exist
    The mode is the only one of these three measurements that exists
● Find the mode of blue, blue, blue, red, green
● The mode is “blue” because it is the value that occurs the most often

● Quantitative data
    The mode can be computed but sometimes it is not meaningful
    Sometimes each value will only occur once (which can often happen with precise measurements)
● Find the mode of 5.1, 6.6, 6.8, 9.3, 1.9
● Each value occurs only once
● The mode is not a meaningful measurement

● One interpretation
● In primary elections, the candidate who receives the most votes is often called “the winner”
● Votes (data values) are
    Candidate    Number of votes
    Henry        194
    Kayla        215
    Jason        172
● The mode is “Kayla” … Kayla is the winner

● The mean and the median are often different
● This difference gives us clues about the shape of the distribution
    Is it symmetric?
    Is it skewed left?
    Is it skewed right?
    Are there any extreme values?
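The three mode examples earlier in this section (colors, precise measurements, election votes) can be sketched in Python as follows. The helper name `modes` and its convention of returning an empty list when every value occurs once are assumptions made for illustration; the slides only say that the mode is "not meaningful" in that case.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or an empty list when every
    value occurs only once (the mode is then not meaningful)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return [value for value, count in counts.items() if count == highest]


colors = ["blue", "blue", "blue", "red", "green"]
measurements = [5.1, 6.6, 6.8, 9.3, 1.9]
votes = {"Henry": 194, "Kayla": 215, "Jason": 172}

print(modes(colors))
print(modes(measurements))
# With a table of counts, the mode is the category with the largest count.
print(max(votes, key=votes.get))
```

Returning a list lets the same helper report ties, which the slide examples do not cover.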
● Symmetric – the mean will usually be close to the median
● Skewed left – the mean will usually be smaller than the median
● Skewed right – the mean will usually be larger than the median

● If a distribution is symmetric, the data values above and below the mean will balance
    The mean will be in the “middle”
    The median will be in the “middle”
● Thus the mean will be close to the median, in general, for a distribution that is symmetric

● If a distribution is skewed left, there will be some data values that are smaller than the others
    The mean will decrease
    The median will not decrease as much
● Thus the mean will be smaller than the median, in general, for a distribution that is skewed left

● If a distribution is skewed right, there will be some data values that are larger than the others
    The mean will increase
    The median will not increase as much
● Thus the mean will be larger than the median, in general, for a distribution that is skewed right

● For a mostly symmetric distribution, the mean and the median will be roughly equal
● Many variables, such as birth weights below, are approximately symmetric
    [Figure: distribution of birth weights]

● What if one value is extremely different from the others? (Such a value is called an outlier.)
● What if we made a mistake and 6, 1, 2 was recorded as 6000, 1, 2
● The mean is now (6000 + 1 + 2) / 3 = 2001
● The median is still 2
● The median is “resistant to extreme values”

Summary: Chapter 3 – Section 1
● Mean
    The center of gravity
    Useful for roughly symmetric quantitative data
● Median
    Splits the data into halves
    Useful for highly skewed quantitative data
● Mode
    The most frequent value
    Useful for qualitative data

Chapter 3 – Section 2
Measures of Dispersion

● Learning objectives
    1 The range of a variable
    2 The variance of a variable
    3 The standard deviation of a variable
    4 Use the Empirical Rule
    5 Use Chebyshev’s inequality

● Comparing two sets of data
● The measures of central tendency (mean, median, mode) compare the “average” or “typical” values of two sets of data
● The measures of dispersion in this section compare how far “spread out” the data values are

● The range of a variable is the largest data value minus the smallest data value
● Compute the range of 6, 1, 2, 6, 11, 7, 3, 3
● The largest value is 11
● The smallest value is 1
● Subtracting the two … 11 – 1 = 10 … the range is 10

● The range only uses two values in the data set – the largest value and the smallest value
● The range is not resistant
● If we made a mistake and 6, 1, 2 was recorded as 6000, 1, 2
● The range is now (6000 – 1) = 5999

● The variance is based on the deviation from the mean
    \((x_i - \mu)\) for populations
    \((x_i - \bar{x})\) for samples
● So that positive and negative differences are treated alike and do not cancel, we square the deviations
    \((x_i - \mu)^2\) for populations
    \((x_i - \bar{x})^2\) for samples

● The population variance of a variable is the sum of these squared deviations divided by the number in the population
    \[ \sigma^2 = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N} = \frac{\sum (x_i - \mu)^2}{N} \]
● The population variance is represented by σ²
● Note: For accuracy, use as many decimal places as allowed by your calculator

● Compute the population variance of 6, 1, 2, 11
● Compute the population mean first
    μ = (6 + 1 + 2 + 11) / 4 = 5
● Now compute the squared deviations
    (1 – 5)² = 16, (2 – 5)² = 9, (6 – 5)² = 1, (11 – 5)² = 36
● Average the squared deviations
    (16 + 9 + 1 + 36) / 4 = 15.5
● The population variance σ² is 15.5

● The sample variance of a variable is the sum of these squared deviations divided by one less than the number in the sample
    \[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
● The sample variance is represented by s²
● We say that this statistic has n – 1 degrees of freedom

● Compute the sample variance of 6, 1, 2, 11
● Compute the sample mean first
    \(\bar{x}\) = (6 + 1 + 2 + 11) / 4 = 5
● Now compute the squared deviations
    (1 – 5)² = 16, (2 – 5)² = 9, (6 – 5)² = 1, (11 – 5)² = 36
● Sum the squared deviations and divide by n – 1 = 3
    (16 + 9 + 1 + 36) / 3 ≈ 20.7
● The sample variance s² is about 20.7

● Why are the population variance (15.5) and the sample variance (20.7) different for the same set of numbers?
● In the first case, { 6, 1, 2, 11 } was the entire population (divide by N)
● In the second case, { 6, 1, 2, 11 } was just a sample from the population (divide by n – 1)
● These are two different situations

● Why do we use different formulas?
● The reason is that using the sample mean is not quite as accurate as using the population mean
● If we used “n” in the denominator for the sample variance calculation, we would get a “biased” result
● Bias here means that we would tend to underestimate the true variance

● The standard deviation is the square root of the variance
● The population standard deviation
    Is the square root of the population variance (σ²)
    Is represented by σ
● The sample standard deviation
    Is the square root of the sample variance (s²)
    Is represented by s

● If the population is { 6, 1, 2, 11 }
    The population variance σ² = 15.5
    The population standard deviation \(\sigma = \sqrt{15.5} \approx 3.9\)
● If the sample is { 6, 1, 2, 11 }
    The sample variance s² ≈ 20.7
    The sample standard deviation \(s = \sqrt{20.7} \approx 4.5\)
● The population standard deviation and the sample standard deviation apply in different situations

● The standard deviation is very useful for estimating probabilities

● The empirical rule
● If the distribution is roughly bell shaped, then
    Approximately 68% of the data will lie within 1 standard deviation of the mean
    Approximately 95% of the data will lie within 2 standard deviations of the mean
    Approximately 99.7% of the data (i.e. almost all) will lie within 3 standard deviations of the mean

● For a variable with mean 17 and standard deviation 3.4
    Approximately 68% of the values will lie between (17 – 3.4) and (17 + 3.4), i.e. 13.6 and 20.4
    Approximately 95% of the values will lie between (17 – 2 × 3.4) and (17 + 2 × 3.4), i.e. 10.2 and 23.8
    Approximately 99.7% of the values will lie between (17 – 3 × 3.4) and (17 + 3 × 3.4), i.e. 6.8 and 27.2
● A value of 2.1 and a value of 33.2 would both be very unusual

● Chebyshev’s inequality gives a lower bound on the percentage of observations that lie within k standard deviations of the mean (where k > 1)
● This lower bound is
    An estimated percentage
    The actual percentage for any variable cannot be lower than this number
● Therefore the actual percentage must be this value or higher

● Chebyshev’s inequality
● For any data set, at least
    \[ \left( 1 - \frac{1}{k^2} \right) \cdot 100\% \]
  of the observations will lie within k standard deviations of the mean, where k is any number greater than 1

● How much of the data lies within 1.5 standard deviations of the mean?
● From Chebyshev’s inequality
    \[ \left( 1 - \frac{1}{1.5^2} \right) \cdot 100\% \approx 55.6\% \]
  so that at least 55.6% of the data will lie within 1.5 standard deviations of the mean

● If the mean is equal to 20 and the standard deviation is equal to 4, how much of the data lies between 14 and 26?
● 14 to 26 are 1.5 standard deviations from 20
    \[ \left( 1 - \frac{1}{1.5^2} \right) \cdot 100\% \approx 55.6\% \]
  so that at least 55.6% of the data will lie between 14 and 26

Summary: Chapter 3 – Section 2
● Range
    The maximum minus the minimum
    Not a resistant measurement
● Variance and standard deviation
    Measures deviations from the mean
    Not a resistant measurement
● Empirical rule
    About 68% of the data is within 1 standard deviation
    About 95% of the data is within 2 standard deviations

Chapter 3 – Section 3
Measures of Central Tendency and Dispersion from Grouped Data

● Learning objectives
    1 The mean from grouped data
    2 The weighted mean
    3 The variance and standard deviation for grouped data

● Data may come in groups rather than individually
● The values may have been summarized in frequency distributions
    Ranges of ages (20 – 29, 30 – 39, ...)
    Ranges of incomes ($10,000 – $19,999, $20,000 – $39,999, $40,000 – $79,999, ...)
● The exact values for the mean, variance, and standard deviation cannot be calculated Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 48 of 3 Chapter 3 – Section 3 ● Learning objectives 1 The mean from grouped data 2 The weighted mean 3 The variance and standard deviation for grouped data Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 49 of 3 Chapter 3 – Section 3 ● To compute the mean for grouped data Assume that, within each class, the mean of the data is equal to the class midpoint Use the class midpoint in the formula for the mean The number of times the class midpoint value is used is equal to the frequency of the class ● If 6 values are in the interval [ 8, 10 ] , then we assume that all 6 values are equal to 9 (the midpoint of [ 8, 10 ] Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 50 of 3 Chapter 3 – Section 3 ● As an example, for the following frequency table, 0 – 1.9 2 – 3.9 4 – 5.9 6 – 7.9 Midpoint 1 3 5 7 Frequency 3 7 6 1 Class we calculate the mean as if The value 1 occurred 3 times The value 3 occurred 7 times The value 5 occurred 6 times The value 7 occurred 1 time Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 51 of 3 Chapter 3 – Section 3 0 – 1.9 2 – 3.9 4 – 5.9 6 – 7.9 Midpoint 1 3 5 7 Frequency 3 7 6 1 Class ● The calculation for the mean would be 1 1 1 3 3 3 3 3 3 3 5 5 5 5 5 5 7 17 or (1 3) (3 7) (5 6) (7 1) 17 Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 52 of 3 Chapter 3 – Section 3 ● Evaluating this formula (1 3) (3 7) (5 6) (7 1) 61 3.6 3 7 6 1 17 ● The mean is about 3.6 ● In mathematical notation xi fi fi ● This would be μ for the population mean and x for the sample mean Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 53 of 
3 Chapter 3 – Section 3 ● Sometimes not all data values are equally important ● To compute a grade point average (GPA), a grade in a 4 credit class is worth more than a grade in a 1 credit class ● The weights wi quantify the relative importance of the different values ● Higher weights correspond to more important values Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 54 of 3 Chapter 3 – Section 3 ● As an example, the following grades Course Statistics French Literature Biochemistry Badminton Credits 3 3 Grade A B 5 1 B D would yield a GPA (on a 4 point scale) of (3 4) (3 3) (5 3) (1 1) 37 3.08 3 3 5 1 12 Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 55 of 3 Chapter 3 – Section 3 ● In mathematical notation, if wi is the weight corresponding to the data value xi, then the weighted mean is w i xi xw wi ● This formula looks similar to one for the mean for grouped data, and the concepts are similar Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 56 of 3 Chapter 3 – Section 3 ● To compute the variance for grouped data Assume again that, within each class, the mean of the data is equal to the class midpoint Use the class midpoint in the formula for the variance The number of times the class midpoint value is used is equal to the frequency of the class ● If 6 values are in the interval [ 8, 10 ] , then we assume that all 6 values are equal to 9 (the midpoint of [ 8, 10 ] ● The same approach as for the mean Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 57 of 3 Chapter 3 – Section 3 ● As an example, for the following frequency table, 0 – 1.9 2 – 3.9 4 – 5.9 6 – 7.9 Midpoint 1 3 5 7 Frequency 3 7 6 1 Class we calculate the variance as if The value 1 occurred 3 times The value 3 occurred 7 times The value 5 occurred 6 times The value 7 occurred 1 time Sullivan – 
Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 58 of 3 Chapter 3 – Section 3 0 – 1.9 2 – 3.9 4 – 5.9 6 – 7.9 Midpoint 1 3 5 7 Frequency 3 7 6 1 Class ● From our previous example, the mean is 3.6 ● Just as for the mean, the calculation for the variance would then be ((1 3.6)2 3) ((3 3.6)2 7) ((5 3.6)2 6) ((7 3.6)2 1) 17 Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 59 of 3 Chapter 3 – Section 3 ● Evaluating this formula ((1 3.6)2 3) ((3 3.6)2 7) ((5 3.6)2 6) ((7 3.6)2 1) 17 46.1 2.7 17 ● The variance is about 2.7 ● The standard deviation would be about 2.7 1.6 Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 60 of 3 Chapter 3 – Section 3 ● In mathematical notation ● The population variance would be 2 ( xi ) fi fi 2 ● The sample variance would be 2 ( x x ) fi 2 i s ( fi ) 1 ● The standard deviations would be the corresponding square roots Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 61 of 3 Summary: Chapter 3 – Section 3 ● The mean for grouped data Use the class midpoints Obtain an approximation for the mean ● The variance and standard deviation for grouped data Use the class midpoints Obtain an approximation for the variance and standard deviation Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 62 of 3 Chapter 3 Section 4 Measures of Position Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 63 of 3 Chapter 3 – Section 4 ● Learning objectives 1 Determine and interpret z-scores 2 Determine and interpret percentiles 3 Determine and interpret quartiles 4 Check a set of data for outliers Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 64 of 3 Chapter 3 – Section 4 ● Mean / median describe the “center” of the data ● 
Variance / standard deviation describe the “spread” of the data ● This section discusses more precise ways to describe the relative position of a data value within the entire set of data Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 65 of 3 Chapter 3 – Section 4 ● The standard deviation is a measure of dispersion that uses the same dimensions as the data (remember the empirical rule) ● The distance of a data value from the mean, calculated as the number of standard deviations, would be a useful measurement ● This distance is called the z-score Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 66 of 3 Chapter 3 – Section 4 ● If the mean was 20 and the standard deviation was 6 The value 26 would have a z-score of 1.0 (1.0 standard deviation higher than the mean) The value 14 would have a z-score of –1.0 (1.0 standard deviation lower than the mean) The value 17 would have a z-score of –0.5 (0.5 standard deviations lower than the mean) The value 20 would have a z-score of 0.0 Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 67 of 3 Chapter 3 – Section 4 ● The population z-score is calculated using the population mean and population standard deviation z x ● The sample z-score is calculated using the sample mean and sample standard deviation xx z s Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 68 of 3 Chapter 3 – Section 4 ● z-scores can be used to compare the relative positions of data values in different samples Pat received a grade of 82 on her statistics exam where the mean grade was 74 and the standard deviation was 12 Pat received a grade of 72 on her biology exam where the mean grade was 65 and the standard deviation was 10 Pat received a grade of 91 on her kayaking exam where the mean grade was 88 and the standard deviation was 6 Sullivan – Statistics: 
Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 69 of 3 Chapter 3 – Section 4 ● Statistics Grade of 82 z-score of (82 – 74) / 12 = .67 ● Biology Grade of 72 z-score of (72 – 65) / 10 = .70 ● Kayaking Grade of 81 z-score of (91 – 88) / 6 = .50 ● Biology was the highest relative grade Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 70 of 3 Chapter 3 – Section 4 ● Learning objectives 1 Determine and interpret z-scores 2 Determine and interpret percentiles 3 Determine and interpret quartiles 4 Check a set of data for outliers Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 71 of 3 Chapter 3 – Section 4 ● The median divides the lower 50% of the data from the upper 50% ● The median is the 50th percentile ● If a number divides the lower 34% of the data from the upper 66%, that number is the 34th percentile Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 72 of 3 Chapter 3 – Section 4 ● The computation is similar to the one for the median ● Calculation Arrange the data in ascending order Compute the index i using the formula k i n 1 100 ● If i is an integer, take the ith data value ● If i is not an integer, take the mean of the two values on either side of i Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 73 of 3 Chapter 3 – Section 4 ● Compute the 60th percentile of 1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 34 ● Calculations There are 14 numbers (n = 14) The 60th percentile (k = 60) The index k 60 14 1 9 i n 1 100 100 ● Take the 9th value, or P60 = 23, as the 60th percentile Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 74 of 3 Chapter 3 – Section 4 ● Compute the 28th percentile of 1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54 ● Calculations There are 14 numbers (n = 14) The 
28th percentile (k = 28) The index k 28 i n 1 14 1 4.2 100 100 ● Take the average of the 4th and 5th values, or P28 = (7 + 8) / 2 = 7.5, as the 28th percentile Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 75 of 3 Chapter 3 – Section 4 ● Learning objectives 1 Determine and interpret z-scores 2 Determine and interpret percentiles 3 Determine and interpret quartiles 4 Check a set of data for outliers Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 76 of 3 Chapter 3 – Section 4 ● The quartiles are the 25th, 50th, and 75th percentiles Q1 = 25th percentile Q2 = 50th percentile = median Q3 = 75th percentile ● Quartiles are the most commonly used percentiles ● The 50th percentile and the second quartile Q2 are both other ways of defining the median Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 77 of 3 Chapter 3 – Section 4 ● Quartiles divide the data set into four equal parts ● The top quarter are the values between Q3 and the maximum ● The bottom quarter are the values between the minimum and Q1 Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 78 of 3 Chapter 3 – Section 4 ● Quartiles divide the data set into four equal parts ● The interquartile range (IQR) is the difference between the third and first quartiles IQR = Q3 – Q1 ● The IQR is a resistant measurement of dispersion Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 79 of 3 Chapter 3 – Section 4 ● Learning objectives 1 Determine and interpret z-scores 2 Determine and interpret percentiles 3 Determine and interpret quartiles 4 Check a set of data for outliers Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 80 of 3 Chapter 3 – Section 4 ● Extreme observations in the data are referred to as outliers ● 
Outliers should be investigated ● Outliers could be Chance occurrences Measurement errors Data entry errors Sampling errors ● Outliers are not necessarily invalid data Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 81 of 3 Chapter 3 – Section 4 ● One way to check for outliers uses the quartiles ● Outliers can be detected as values that are significantly too high or too low, based on the known spread ● The fences used to identify outliers are Lower fence = LF = Q1 – 1.5 IQR Upper fence = UF = Q3 + 1.5 IQR ● Values less than the lower fence or more than the upper fence could be considered outliers Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 82 of 3 Chapter 3 – Section 4 ● Is the value 54 an outlier? 1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54 ● Calculations Q1 = (4 + 7) / 2 = 5.5 Q3 = (27 + 31) / 2 = 29 IQR = 29 – 5.5 = 23.5 UF = Q3 + 1.5 IQR = 29 + 1.5 23.5 = 64 ● Using the fence rule, the value 54 is not an outlier Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 83 of 3 Summary: Chapter 3 – Section 4 ● z-scores Measures the distance from the mean in units of standard deviations Can compare relative positions in different samples ● Percentiles and quartiles Divides the data so that a certain percent is lower and a certain percent is higher ● Outliers Extreme values of the variable Can be identified using the upper and lower fences Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 84 of 3 Chapter 3 Section 5 The Five-Number Summary And Boxplots Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 85 of 3 Chapter 3 – Section 5 ● Learning objectives 1 Compute the five-number summary 2 Draw and interpret boxplots Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 
86 of 3 Chapter 3 – Section 5 ● Learning objectives 1 Compute the five-number summary 2 Draw and interpret boxplots Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 87 of 3 Chapter 3 – Section 5 ● The five-number summary is the collection of The smallest value The first quartile (Q1 or P25) The median (M or Q2 or P50) The third quartile (Q3 or P75) The largest value ● These five numbers give a concise description of the distribution of a variable Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 88 of 3 Chapter 3 – Section 5 ● The median Information about the center of the data Resistant ● The first quartile and the third quartile Information about the spread of the data Resistant ● The smallest value and the largest value Information about the tails of the data Not resistant Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 89 of 3 Chapter 3 – Section 5 ● Compute the five-number summary for 1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54 ● Calculations The minimum = 1 Q1 = P25, the index i = 3.75, Q1 = (4 + 7) / 2 = 5.5 M = Q2 = P50 = (16 + 19) / 2 = 17.5 Q3 = P75, the index i = 11.25, Q3 = (27 + 31) / 2 = 29 The maximum = 54 ● The five-number summary is 1, 5.5, 17.5, 29, 54 Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 90 of 3 Chapter 3 – Section 5 ● Learning objectives 1 Compute the five-number summary 2 Draw and interpret boxplots Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – Chapter 3 Introduction – Slide 91 of 3 Chapter 3 – Section 5 ● The five-number summary can be illustrated using a graph called the boxplot ● An example of a (basic) boxplot is ● The middle box shows Q1, Q2, and Q3 ● The horizontal lines (sometimes called “whiskers”) show the minimum and maximum Sullivan – Statistics: Informed Decisions Using Data – 2nd Edition – 
● To draw a (basic) boxplot:
    Calculate the five-number summary
    Draw a horizontal line that will cover all the data from the minimum to the maximum
    Draw a box with the left edge at Q1 and the right edge at Q3
    Draw a line inside the box at M = Q2
    Draw a horizontal line from the Q1 edge of the box to the minimum and one from the Q3 edge of the box to the maximum

● To draw a (basic) boxplot
    Draw the middle box
    Draw in the median
    Draw the minimum and maximum
    Voila!

● An example of a more sophisticated boxplot is
    [Figure not reproduced: a boxplot with an outlier marked by an asterisk]
● The middle box shows Q1, Q2, and Q3
● The horizontal lines (sometimes called “whiskers”) show the minimum and maximum of the values that are not outliers
● The asterisk on the right shows an outlier (determined by using the upper fence)

● To draw this boxplot (in a slightly different way than the text)
    Draw the center box and mark the median, as before
    Compute the upper fence and the lower fence
    Temporarily remove the outliers as identified by the upper fence and the lower fence (but we will add them back later with asterisks)
    Draw the horizontal lines to the new minimum and new maximum
    Mark each of the outliers with an asterisk

● To draw this boxplot
    Draw the middle box and the median
    Draw in the fences, remove the outliers (temporarily)
    Draw the minimum and maximum
    Draw the outliers as asterisks
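The steps for the boxplot with outliers (compute the fences, set the outliers aside, find the new minimum and maximum) can be sketched as a calculation. This is only an illustration under stated assumptions: boxplot_stats and quartile are hypothetical names, the quartile index rule i = (k / 100) × (n + 1) is assumed, and the second data set is a made-up variant of the Section 4 example (54 replaced by 70) so that an outlier actually appears.

```python
import math


def quartile(sorted_values, k):
    """k-th percentile via the assumed index rule i = (k / 100) * (n + 1)."""
    n = len(sorted_values)
    i = k * (n + 1) / 100
    lower_pos = min(n, max(1, math.floor(i)))
    upper_pos = min(n, max(1, math.ceil(i)))
    return (sorted_values[lower_pos - 1] + sorted_values[upper_pos - 1]) / 2


def boxplot_stats(values):
    """Return the fences, whisker ends, and outliers for non-empty data."""
    data = sorted(values)
    q1, q3 = quartile(data, 25), quartile(data, 75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Values beyond a fence are set aside and later drawn as asterisks.
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "whisker_min": inside[0],   # new minimum once outliers are removed
        "whisker_max": inside[-1],  # new maximum once outliers are removed
        "outliers": outliers,
    }


slide_data = [1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54]
variant = slide_data[:-1] + [70]  # hypothetical data, not from the slides

print(boxplot_stats(slide_data))
print(boxplot_stats(variant))
```

For the original data no value lies beyond a fence, so the whiskers run to 1 and 54. In the hypothetical variant, 70 exceeds the upper fence of 64.25, so the right whisker stops at 33 and 70 would be drawn as an asterisk.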
● The distribution shape and boxplot are related
    Symmetry (or lack of symmetry)
    Quartiles
    Maximum and minimum
● Relate the distribution shape to the boxplot for
    Symmetric distributions
    Skewed left distributions
    Skewed right distributions

● Symmetric distributions
    Distribution: Q1 is equally far from the median as Q3 is
    Boxplot: the median line is in the center of the box
    Distribution: the min is equally far from the median as the max is
    Boxplot: the left whisker is equal to the right whisker
    [Figures not reproduced: distribution and boxplot, each marked Min, Q1, M, Q3, Max]

● Skewed left distributions
    Distribution: Q1 is further from the median than Q3 is
    Boxplot: the median line is to the right of center in the box
    Distribution: the min is further from the median than the max is
    Boxplot: the left whisker is longer than the right whisker
    [Figures not reproduced: distribution and boxplot, each marked Min, Q1, M, Q3, Max]

● Skewed right distributions
    Distribution: Q1 is closer to the median than Q3 is
    Boxplot: the median line is to the left of center in the box
    Distribution: the min is closer to the median than the max is
    Boxplot: the left whisker is shorter than the right whisker
    [Figures not reproduced: distribution and boxplot, each marked Min, Q1, M, Q3, Max]

● We can compare two distributions by examining their boxplots
● We draw the boxplots on the same horizontal scale
    We can visually compare the centers
    We can visually compare the spreads
    We can visually compare the extremes

● Comparing the “flight” with the “control” samples
    Center
    Spread
    [Figure not reproduced]
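The shape rules for symmetric, skewed left, and skewed right distributions can be turned into a rough check on a five-number summary. This is a heuristic sketch only, not a formal test of skewness: shape_hints and side are hypothetical names, the tolerance for calling two distances equal is an assumption, and the symmetric example uses made-up numbers.

```python
def side(left_length, right_length, tol=1e-9):
    """Compare a left-hand distance with a right-hand distance."""
    if abs(left_length - right_length) <= tol:
        return "symmetric"
    # A longer left side suggests a left skew, a longer right side a right skew.
    return "skewed left" if left_length > right_length else "skewed right"


def shape_hints(minimum, q1, median, q3, maximum):
    """Read the box and the whiskers separately, as the slide rules do."""
    return {
        "box": side(median - q1, q3 - median),
        "whiskers": side(q1 - minimum, maximum - q3),
    }


# Five-number summary of the worked example: 1, 5.5, 17.5, 29, 54
print(shape_hints(1, 5.5, 17.5, 29, 54))

# Hypothetical, perfectly symmetric summary
print(shape_hints(0, 2, 5, 8, 10))
```

For the worked example the two readings disagree: the box leans slightly left (12 versus 11.5) while the right whisker is much longer (25 versus 4.5). Real data often gives mixed signals like this, which is one reason to look at the picture and not only at the rule.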
Summary: Chapter 3 – Section 5
● 5-number summary
    Minimum, first quartile, median, third quartile, maximum
    Resistant measures of center (median) and spread (interquartile range)
● Boxplots
    Visual representation of the 5-number summary
    Related to the shape of the distribution
    Can be used to compare multiple distributions

Chapter 3 Summary
● Numeric summaries of data
    Means, medians, modes
    Ranges, variances, standard deviations, IQRs
    Calculations for grouped data
● Measures of relative position
    z-scores
    Percentiles and quartiles
● Exploratory data analysis
    Five-number summaries
    Boxplots