Other Numerical Measures
• Median
• Mode
• Range
• Percentiles
• Quartiles, Interquartile range

BUS304 – Data Characterization

Median
The middle value: the value that divides the data in half, with equal numbers of observations above and below.
Steps:
1. Put your data in an ordered array (sort).
2. If n (or N) is odd, the median is the middle number, i.e. the $\frac{n+1}{2}$th number.
3. If n (or N) is even, the median is the average of the two middle numbers, i.e. the average of the $\frac{n}{2}$th and the $\left(\frac{n}{2}+1\right)$th numbers.

Sensitivity to outliers
[Figure: three dot plots on a 0 to 10 axis, with Median = 3, Median = 2.5, and Median = 3]
The median is not affected by extreme values.

Mode
The value that occurs most often.
Steps:
1. Put your data in an ordered array (sort).
2. Find the data value(s) that repeat most frequently.
The mode is not affected by extreme values either.
[Figure: four examples. Dot plot on a 0 to 14 axis: no mode. Dot plot on a 0 to 6 axis: Mode = 5. Dot plot on a 0 to 14 axis: Mode = 5 and 9. Categories Boston, Austin, San Diego, Los Angeles: Mode = San Diego]

Find Mode and Median from Frequency Table
Below is a frequency table showing the number of days the teams take to finish their projects.

Days to Complete    Frequency    Relative Frequency
5                   4            ?
6                   12           ?
7                   8            ?
8                   6            ?
9                   4            ?
10                  2            ?

• Find the mean, median and mode.
• Create a histogram and locate the mean, median and mode on it.
• Describe the shape of the histogram, and find the relationship between mean, median and mode.

Shape of a distribution
• Left-Skewed: Mean < Median < Mode (longer tail extends to the left)
• Symmetric: Mean = Median = Mode
• Right-Skewed: Mode < Median < Mean (longer tail extends to the right)
Note that the mean is the measure most affected by extreme values, so it is pulled further toward the tail than the median or the mode.
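The median steps and the frequency-table exercise above can be checked with a short Python sketch. It is only one way to work the exercise, not the course's official solution: the helper names `median_by_steps` and `modes` are made up for illustration, and returning an empty list when no value repeats is an assumption that mirrors the "No Mode" case on the Mode slide.

```python
from collections import Counter

# Frequency table from the slide: days to complete -> number of teams
freq = {5: 4, 6: 12, 7: 8, 8: 6, 9: 4, 10: 2}

# Step 1: expand the table into a sorted list of individual observations
data = sorted(day for day, count in freq.items() for _ in range(count))
n = len(data)


def median_by_steps(sorted_values):
    """Median following the slide's steps (positions on the slide are 1-based)."""
    size = len(sorted_values)
    if size % 2 == 1:
        # Odd n: the (n + 1) / 2 th value
        return sorted_values[(size + 1) // 2 - 1]
    # Even n: average of the n/2 th and (n/2 + 1) th values
    lower = sorted_values[size // 2 - 1]
    upper = sorted_values[size // 2]
    return (lower + upper) / 2


def modes(values):
    """Most frequent value(s); empty list if no value repeats (assumed 'No Mode')."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


mean = sum(data) / n
median = median_by_steps(data)
mode = modes(data)

# Relative frequency = frequency / total number of observations
relative_freq = {day: count / n for day, count in freq.items()}

print(mean, median, mode)  # 7.0 7.0 [6]
```

Here the mode (6) lies below the median and mean (both 7), which matches a histogram whose longer tail extends to the right.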
Measures of center location
• Mean
• Median
• Mode
The mean is generally used unless extreme values (outliers) exist. The median is the next most common choice, since it is not sensitive to extreme values. The mode is sometimes used when one value has a very large frequency. Think of the example of house prices.

Range
• Simplest measure of variation
• Describes how widely the data are spread
Formula: Range = Maximum Value - Minimum Value
Example: [Figure: dot plot on a 0 to 14 axis] Range = 14 - 1 = 13

Disadvantage of Range
• Ignores the way in which the data are distributed
  [Figure: two dot plots on a 7 to 12 axis] Range = 12 - 7 = 5 for both
• Sensitive to outliers
  1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5      Range = 5 - 1 = 4
  1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120    Range = 120 - 1 = 119
The range is affected the most by outliers.

Break

Other measures
1. Percentiles: a percentile marks the value below which a given percentage of the data falls.
e.g. if the 60th percentile is 1240 (SAT score), then 60% of students scored less than 1240 and, correspondingly, 40% of students scored 1240 or higher.
How to find a percentile?
The pth percentile in an ordered array of n values is the value in the ith position, where
$i = \frac{p}{100}(n + 1)$

Example
Find the 80th percentile from the annual income data.
Steps:
1. Sort the data.
2. Find the location of the 80th percentile:
$i = \frac{p}{100}(n + 1) = \frac{80}{100}(100 + 1) = 80.8 \approx 81$
3. Find the 81st person's income.
Think: what does this income mean?
Exercise 1: find the value where 30% of people have that income or higher.
Exercise 2: find the value where 30% of people have an income less than it.
Exercise 3: find the value where 50% of people have an income less than it. What is this measure also called?
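As a rough check of the range and percentile-position formulas above, here is a minimal Python sketch. The function names are hypothetical. Rounding a fractional position to the nearest whole position is an assumption taken from the example's 80.8 ≈ 81 step. Because the annual income data is not reproduced on the slides, the value lookup reuses the 20-value list from the "Disadvantage of Range" slide instead.

```python
def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


def percentile_position(p, n):
    """1-based position i = (p / 100) * (n + 1) in an ordered array of n values."""
    return p * (n + 1) / 100


def percentile_by_rounding(values, p):
    """Value at the position nearest to i (assumed rounding, as in 80.8 -> 81)."""
    ordered = sorted(values)                    # Step 1: sort the data
    n = len(ordered)
    i = int(percentile_position(p, n) + 0.5)    # Step 2: locate, round half up
    i = min(max(i, 1), n)                       # keep the position inside 1..n
    return ordered[i - 1]                       # Step 3: read off the value


# The two lists from the "Disadvantage of Range" slide
base = [1] * 6 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 6 + [2] * 8 + [3] * 4 + [4, 120]

print(value_range(base))                        # 4
print(value_range(with_outlier))                # 119
print(percentile_position(80, 100))             # 80.8
print(percentile_by_rounding(base, 80))         # 3
print(percentile_by_rounding(with_outlier, 80)) # 3
```

The single outlier moves the range from 4 to 119, while the 80th percentile of the same two lists stays at 3.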
Quartiles
• The 25th, 50th, and 75th percentiles
• Called the first, second, and third quartiles, respectively
• Written as Q1, Q2, Q3, respectively
• The quartiles split the ranked data into 4 equal groups:
  25% | Q1 | 25% | Q2 | 25% | Q3 | 25%

Example
Find the first quartile in the data sample: 22 12 14 16 17 16 13 20 18
Median = the 50th percentile = the second quartile

Interquartile Range
Recall: Range? Disadvantage of range?
Interquartile Range = Q3 - Q1
Example: 12 13 14 16 16 17 18 20 22
Q1 = 13.5, Q3 = 19
Interquartile range = Q3 - Q1 = 19 - 13.5 = 5.5

Summary
Understand and compute the following two sets of data measures:
Measures of central tendency
• Mean, Median, and Mode
Measures of variation
• Range, Variance, and Standard deviation
Other ways to describe data:
• Percentiles, Quartiles, Interquartile range
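The quartile and interquartile-range example can be reproduced with the sketch below. It assumes that a fractional position from i = (p/100)(n + 1) is resolved by interpolating between the two neighbouring values. The slides do not state this rule, but it is consistent with their answers Q1 = 13.5 and Q3 = 19 (positions 2.5 and 7.5 in nine sorted values). The function name `percentile_interpolated` is made up for illustration. Python's `statistics.quantiles` with `method="exclusive"` uses the same (n + 1) convention and serves as a cross-check.

```python
import statistics


def percentile_interpolated(values, p):
    """Value at 1-based position (p / 100) * (n + 1), interpolating between
    neighbours when the position is fractional (an assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    pos = p * (n + 1) / 100
    pos = min(max(pos, 1), n)      # keep the position inside 1..n
    lower = int(pos)               # whole part of the position
    frac = pos - lower             # fractional part of the position
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


# Data sample from the quartile example slide
sample = [22, 12, 14, 16, 17, 16, 13, 20, 18]

q1 = percentile_interpolated(sample, 25)
q2 = percentile_interpolated(sample, 50)   # second quartile = median
q3 = percentile_interpolated(sample, 75)
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 13.5 16.0 19.0 5.5

# Cross-check against the standard library's (n + 1)-based method
print(statistics.quantiles(sample, n=4, method="exclusive"))
```

Other tools may use a different percentile convention by default, so quartiles computed elsewhere can differ slightly from the slide's values.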