Chapter 2: Describing Distributions with Numbers
5/24/2017

Numerical Summaries of:
• Central location
  – mean
  – median
• Spread
  – Range
  – Quartiles
  – Standard deviation / variance
• Shape measures (not covered)

Arithmetic Mean
• Most common measure of central location
• Notation ("xbar"):
  $\bar{x} = \frac{1}{n}\left(x_1 + x_2 + \cdots + x_n\right) = \frac{1}{n}\sum_{i=1}^{n} x_i$
  where n is the sample size and ∑ is the summation symbol

Example: Sample Mean
Data: metabolic rates, calories / day:
1792 1666 1362 1614 1460 1867 1439
$\bar{x} = \frac{1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439}{7} = \frac{11{,}200}{7} = 1600$

Median (M)
• Half the values are less than the median, half are greater
• If n is odd, the median is the middle ordered value
• If n is even, the median is the average of the two middle ordered values

Examples: Median
• Example 1: 2 4 6
  Median = 4
• Example 2: 2 4 6 8
  Median = 5 (average of 4 and 6)
• Example 3: 6 2 4
  Median ≠ 2 (values must be ordered first: 2 4 6, so Median = 4)

Example: Median
The location of the median in the ordered array: L(M) = (n + 1) / 2
Data = metabolic rates from the Sample Mean example (n = 7), so L(M) = (7 + 1) / 2 = 4
Ordered array: 1362 1439 1460 1614 1666 1792 1867
The 4th ordered value is the median: Median = 1614

The Median is robust to outliers
This data set:
1362 1439 1460 1614 1666 1792 1867
has median 1614 and mean 1600.
This similar data set with a high outlier:
1362 1439 1460 1614 1666 1792 9867
still has median 1614 but now has mean 2742.9.

The skew pulls the mean
• The average salary at a high tech firm is $250K / year
• The median salary is $60K
• What does this tell you?
• Answer: There are some very highly paid executives, but most of the workers make modest salaries, i.e., there is a positive skew to the distribution

Spread = Variability
• Amount of spread around the center!
• Statistical measures of spread
  – Range
  – Inter-Quartile Range
  – Standard deviation

Range and IQR
• Range = maximum - minimum
• Easy, but NOT as good as the…
• Quartiles & Inter-Quartile Range (IQR)
  – Quartile 1 (Q1) cuts off the bottom 25% of the data ("25th percentile")
  – Quartile 2 (Q2) cuts off two-quarters of the data; same as the Median!
  – Quartile 3 (Q3) cuts off three-quarters of the data ("75th percentile")

Obtaining Quartiles
• Order the data
• Find the median
• Look at the lower half of the data set
  – Find the "median" of this lower half
  – This is Q1
• Look at the upper half of the data set
  – Find the "median" of this upper half
  – This is Q3

Example: Quartiles
Consider these 10 ages:
05 11 21 24 27 | 28 30 42 50 52
The median falls between 27 and 28.
The median of the bottom half (Q1) = 21
05 11 21 24 27
The median of the top half (Q3) = 42
28 30 42 50 52

Example 2: Quartiles, n = 53
100 101 106 106 110 110 119 120 120 123
124 125 127 128 130 130 133 135 139 140
148 150 150 152 155 157 165 165 165 170
170 170 172 175 175 180 180 180 180 185
185 185 186 187 192 194 195 203 210 212
215 220 260

L(M) = (53 + 1) / 2 = 27
Median = 165

Bottom half has n* = 26
L(Q1) = (26 + 1) / 2 = 13.5 from the bottom
Q1 = avg(127, 128) = 127.5

Top half has n* = 26
L(Q3) = 13.5 from the top!
Q3 = avg(185, 185) = 185

Example 2: Quartiles
Q1 = 127.5
Q2 = 165
Q3 = 185
"5 point summary" = {Min, Q1, Median, Q3, Max} = {100, 127.5, 165, 185, 260}

Stemplot of the data (leaf = ones digit):
10 | 0166
11 | 009
12 | 0034578
13 | 00359
14 | 08
15 | 00257
16 | 555
17 | 000255
18 | 000055567
19 | 245
20 | 3
21 | 025
22 | 0
23 |
24 |
25 |
26 | 0

Inter-Quartile Range (IQR)
• Q1 = 127.5
• Q3 = 185
Inter-Quartile Range (IQR) = Q3 - Q1 = 185 - 127.5 = 57.5
"spread of middle 50%"

Simple Boxplot
5-point summary shown graphically
[Figure: boxplot marking min, Q1, M, Q3, and max; axis: Weight, 100 to 275]

Boxplots are useful for comparing groups

Standard Deviation & Variance
• Most popular measures of spread
• Each data value has a deviation, defined as: $x_i - \bar{x}$

Example: Deviations
Metabolic data (n = 7)
For the value 1439: $x_i - \bar{x} = 1439 - 1600 = -161$
For the value 1792: $x_i - \bar{x} = 1792 - 1600 = 192$

Variance
• Find the mean
• Find the deviation of each value
• Square the deviations
• Sum the squared deviations
• Divide by (n − 1)
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$

Data
Data: Metabolic rates, n = 7
1792 1666 1362 1614 1460 1867 1439

"Sum of Squares"
Obs x_i    Deviation x_i - xbar     Squared deviation (x_i - xbar)^2
1792       1792 - 1600 = 192        (192)^2 = 36,864
1666       1666 - 1600 = 66         (66)^2 = 4,356
1362       1362 - 1600 = -238       (-238)^2 = 56,644
1614       1614 - 1600 = 14         (14)^2 = 196
1460       1460 - 1600 = -140       (-140)^2 = 19,600
1867       1867 - 1600 = 267        (267)^2 = 71,289
1439       1439 - 1600 = -161       (-161)^2 = 25,921
SUMS       11,200    0              214,870
$\bar{x} = \frac{11{,}200}{7} = 1600$
$\sum\left(x_i - \bar{x}\right)^2 = 214{,}870$ is the "Sum of Squares"

Variance
$s^2 = \frac{1}{n-1}\sum\left(x_i - \bar{x}\right)^2 = \frac{1}{7-1}\,(214{,}870) = 35{,}811.67$

Standard Deviation
Square root of the variance:
$s = \sqrt{s^2} = \sqrt{35{,}811.67} = 189.24$

Standard Deviation: Direct Formula
$s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar{x}\right)^2} = \sqrt{\frac{1}{7-1}\,(214{,}870)} \approx 189$

Use calculator to check work!
I’m supporting the TI-30XIIS only.
TI-30XIIS sequence:
• On > CLEAR > 2nd > STAT > Scroll > Clear Data > Enter
• 2nd > STAT > 1-VAR or 2-VAR
• DATA > enter data
• STATVAR key

Choosing Summary Statistics
• Use the mean and standard deviation to describe symmetrical distributions & distributions free of outliers
• Use the median and quartiles (IQR) to describe distributions that are skewed or have outliers

Example: Number of Books Read
n = 52, ordered array:
0 0 0 0 0 0 0 0 0 1
1 1 1 1 2 2 2 2 2 2
2 2 2 3 3 3 3 4 4 4
4 4 4 5 5 5 5 5 5 6
10 10 12 13 14 14 15 15 20 20
30 99
L(M) = (52 + 1) / 2 = 26.5, so the median M lies between the 26th and 27th ordered values (both 3)

Example: Books read, n = 52
5-point summary: 0, 1, 3, 5.5, 99
Highly asymmetric distribution
[Figure: plot of the distribution; axis: Number of books, 0 to 100]
The mean ("xbar" = 7.06) and standard deviation (s = 14.43) give false impressions of location and spread for this distribution and are considered inappropriate. Use the median and 5-point summary instead.
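
As a cross-check on the hand calculations in this chapter, the following is a minimal Python sketch of the same procedures: the mean, the median located at (n + 1) / 2, quartiles taken as the medians of the lower and upper halves, and the variance with the (n − 1) divisor. The function names (`median`, `five_point_summary`, `sample_variance`) are illustrative choices, not part of the slides, and the quartile rule is the one described here; statistical software often uses different quartile conventions and may report slightly different Q1 and Q3.

```python
import math


def median(values):
    """Median of a list: middle ordered value, or average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_point_summary(values):
    """Return (min, Q1, median, Q3, max) using the halves-of-the-data quartile rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2                  # size of each half; the middle value is excluded when n is odd
    lower = ordered[:half]         # bottom half of the ordered data
    upper = ordered[n - half:]     # top half of the ordered data
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    xbar = sum(values) / n
    sum_of_squares = sum((x - xbar) ** 2 for x in values)
    return sum_of_squares / (n - 1)


# Metabolic rates (calories / day) from the sample mean example
metabolic = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
xbar = sum(metabolic) / len(metabolic)
s2 = sample_variance(metabolic)
s = math.sqrt(s2)

# The 10 ages from the first quartiles example
ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]

print("mean:", xbar)
print("median:", median(metabolic))
print("variance:", round(s2, 2))
print("standard deviation:", round(s, 2))
print("ages 5-point summary:", five_point_summary(ages))
```

For the metabolic data this gives a mean of 1600, a median of 1614, a variance of about 35,811.67, and a standard deviation of about 189.24, matching the worked slides; for the ages it gives Q1 = 21 and Q3 = 42.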