Summary Statistics: Mean, Median, Standard Deviation, and More
“Seek simplicity and then distrust it.”
(Dr. Monticino)

Assignment Sheet
  Read Chapter 4
  Homework #3: Due Wednesday Feb. 9th
    Chapter 4
      exercise set A: 1-6, 8, 9
      exercise set C: 1, 2, 3
      exercise set D: 1-4, 8
      exercise set E: 4, 5, 7, 8, 11, 12
  Quiz #2 will be over Chapter 2
  Quiz #3 on basic summary statistic calculations – mean, median, standard deviation, IQR, SD units
  If you’d like a copy of notes, email me

Overview
  Measures of central tendency
    Mean (average)
    Median
    Outliers
  Measures of dispersion
    Standard deviation
    Standard deviation units
    Range
    IQR
  Review and applications

Central Tendency
  Measures of central tendency - mean and median - are useful in obtaining a single number summary of a data set
    Mean is the arithmetic average
    Median is a value such that at least 50% of the data is less than or equal to it and at least 50% is greater than or equal to it

Example
  Calculate mean and median for the following data sets
    37 44 55 78 100 111 125 151 161
    37 44 55 69 90 120 125 152 157 161

Outliers and Robustness
  Mean can be sensitive to outliers in data set
    Not robust to data collection errors or a single unusual measurement
  Blind calculation can give misleading results
  Data
    162 166 158 154 147 150 141 233 278 288 148 152 149
    265 212 154 148 158 150 137 142 149 148 145 143 152
  mean = 170.35
  median = 151

Outliers and Robustness
  Always a good idea to plot data in the order that it was collected
    Spot outliers
    Identify possible data collection errors
  [Figure: Value plotted against Data, in collection order]
  mean without outliers = 150.14
  median without outliers = 149

Outliers and Robustness
  Median can be a more robust measure of central tendency than mean
    Life expectancy
      U.S. males: mean = 80.1, median = 83
      U.S. females: mean = 84.3, median = 87
    Household income
      Mean = $51,855, median = $38,885
      0.3% account for 12% of income
    Net worth
      Mean = $282,500, median = $71,600

Which Central Tendency Measure?
  Calculate mean, median and mode
  Plot data
    Create histogram to inspect mode(s)
  Do not delete data points
    If you analyze data without outliers, report and explain the outliers
  Many statistical studies involve studying the difference between population means
    Reporting the mean may be dictated by objective of study

Which Central Tendency Measure?
  If data is
    Unimodal
    Fairly symmetric
    Mean is approximately equal to median
  Then mean is a reasonable measure of central tendency
  [Figure: histogram (Frequency by Bin) and plot of Value against Data Points]

Which Central Tendency Measure?
  If data is
    Unimodal
    Asymmetric
  Then report both median and mean
    Difference between mean and median indicates asymmetry
    Median will usually be the more reasonable summary of central tendency
  [Figure: histogram (Frequency by Bin) and plot of Value against Data Points]

Which Central Tendency Measure?
  If data is
    Not unimodal
  Then report modes and, cautiously, mean and median
    Analyze data for differences in groups around the modes
  [Figure: histogram (Frequency by Bin) and plot of Value against Data Points]

Limitations of Central Tendency
  Any single number summary may not adequately represent data and may hide differences between data sets
  Example
    First data set:  2, 50, 100, 150, 198
    Second data set: 98, 99, 100, 101, 102

Measures of Dispersion
  Including an additional statistic - a measure of dispersion - can help distinguish between data sets which have similar central tendencies
    Range: max - min
    Standard deviation: root mean square difference from the mean
      $SD = \sqrt{\frac{(x_1 - m)^2 + (x_2 - m)^2 + \cdots + (x_{n-1} - m)^2 + (x_n - m)^2}{n}}$

Measures of Dispersion
  Examples
    Range
      First data set (2, 50, 100, 150, 198): 198 - 2 = 196
      Second data set (98, 99, 100, 101, 102): 102 - 98 = 4

Measures of Dispersion
  Examples
    Standard deviation
      Both data sets have m = 100
      First data set (2, 50, 100, 150, 198):
        $SD = \sqrt{\frac{(2 - 100)^2 + (50 - 100)^2 + (100 - 100)^2 + (150 - 100)^2 + (198 - 100)^2}{5}} \approx 69.6$

Measures of Dispersion
  Both range and standard deviation can be sensitive to outliers
  However, many data sets can be characterized by mean and SD
    If the values of the data set are distributed in an approximately bell shape, then ~68% of the data will be within 1 SD unit of mean, ~95% will be within 2 SD units and nearly all will be within 3 SD units
  [Figure: bell-shaped histogram of Count, horizontal axis from -3.00 to 3.00]

Measures of Dispersion
  Example
    Suppose data set has mean = 35 and SD = 7
    How many SD units away from the mean is 42?
      (42 - 35) / 7 = 1
    How many SD units away from the mean is 38?
      (38 - 35) / 7 ≈ 0.43
    How many SD units away from the mean is 30?
      (30 - 35) / 7 ≈ -0.714
    Assuming bell shape distribution, ~95% are between what two values?
      between (35 - 2 * 7) = 21 and (35 + 2 * 7) = 49

Measures of Dispersion
  A robust measure of dispersion is the interquartile range
    Q1: value such that 25% of the data is less than it and 75% is greater
    Q3: value such that 75% of the data is less than it and 25% is greater
    IQR = Q3 - Q1

Example
  Calculate range, standard deviation and interquartile range for the following data sets
    1 98 99 100 100 100 102 102 104 107
    95 98 99 100 100 100 102 102 104 107

Assignment, Discussion, Evaluation
  Read Chapter 4
  Discussion problems
    Chapter 4
      exercise set A: 1-6, 8, 9
      exercise set C: 1, 2, 3
      exercise set D: 1-4, 8
      exercise set E: 4, 5, 7, 8, 11, 12
  Quiz #3 on basic summary statistic calculations – mean, median, standard deviation, IQR, SD units

Review of Definitions
  Measures of central tendency
    Mean (average): $\frac{x_1 + x_2 + \cdots + x_n}{n}$
    Median
      If odd number of data points, “middle” value
      If even number of data points, average of two “middle” values

Question and Examples
  Can mean be larger than median? Can median be larger than mean?
  Give examples
  Can mean be a negative number? Can the median?
  The average height of three men is 69 inches. Two other men enter the room of heights 73 and 70 inches. What is the average height of all five men?

Questions and Examples
  The average of a data set is 30.
    A value of 8 is added to each element in the data set. What is the new average?
    Each element of the data set is increased by 5%. What is the new average?
  Suppose that data consists of only 1’s and 0’s
    What does the average represent?
    Application: an experiment is performed and only two outcomes can occur
      Label one type of outcome 1 and the other 0
  For the data set 31, 45, 72, 86, 62, 78, 50, find the median, Q1 (25th percentile) and Q3 (75th percentile)

Review of Definitions
  Measures of dispersion
    Standard deviation
      $SD = \sqrt{\frac{(x_1 - m)^2 + (x_2 - m)^2 + \cdots + (x_{n-1} - m)^2 + (x_n - m)^2}{n}}$
    Range = max - min
    IQR = Q3 - Q1

Questions and Examples
  Can the SD be negative? Can the range? Can the IQR?
  Can the SD equal 0?
  For the data set 3, 1, 5, 2, 1, 6 find the SD, range and IQR
  The average weight for U.S. men is 175 lbs and the standard deviation is 20 lbs
    If a man weighs 190 lbs., how many standard deviation units away from the mean weight is he?
    Assuming a normal (bell-shaped) distribution for weight, ninety-five percent of U.S. men weigh between what two values?

Questions and Examples
  The average of a data set is 23 and the standard deviation is 5
    A value of 8 is added to each element in the data set. What is the new standard deviation?
    Each element of the data set is increased by 5%. What is the new standard deviation?
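
The worked numbers from the earlier slides can be reproduced with a short Python sketch. This is only an illustration, not part of the original notes: the function names `sd_population` and `sd_units` are hypothetical, and the cutoff of 200 used to set the unusual values aside is an assumption chosen for this data set. The SD divides by n, as in the definition reviewed above.

```python
import math
import statistics


def sd_population(values):
    """Root mean square difference from the mean (divides by n)."""
    m = statistics.mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))


def sd_units(value, mean, sd):
    """How many SD units a value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# Two data sets with the same mean but very different spread.
spread_out = [2, 50, 100, 150, 198]
tight = [98, 99, 100, 101, 102]
print(statistics.mean(spread_out), statistics.mean(tight))  # 100 100
print(round(sd_population(spread_out), 1))                  # 69.6
print(max(spread_out) - min(spread_out))                    # range: 196

# The outlier example: the mean moves a lot, the median barely moves.
with_outliers = [
    162, 166, 158, 154, 147, 150, 141, 233, 278, 288, 148, 152, 149,
    265, 212, 154, 148, 158, 150, 137, 142, 149, 148, 145, 143, 152,
]
# Assumption for illustration: treat values of 200 or more as the outliers.
typical = [x for x in with_outliers if x < 200]
print(round(statistics.mean(with_outliers), 2), statistics.median(with_outliers))
print(round(statistics.mean(typical), 2), statistics.median(typical))

# SD units for a data set with mean 35 and SD 7.
print(sd_units(42, 35, 7))  # 1.0
```

The standard library also offers `statistics.pstdev`, which computes the same divide-by-n standard deviation; `statistics.stdev` divides by n - 1 instead and would give slightly larger values.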