Ch2.2, Ch2.3 Numerical Summary Measures of Variability for Data

Topics:
Measures of variability (spread):
o Deviation
o Variance/standard deviation
o Interquartile range
Statistical definition of outliers: the 1.5 x IQR criterion for outliers

-----------------------------------------------------------

II: Measures of Variability (Spread)

(1) Deviation: the difference between the observation and the mean

The deviation of $x_i$ is defined as $d_i = x_i - \bar{x}$. What is the mean deviation $\bar{d}$?

$$\bar{d} = \frac{1}{n}(d_1 + d_2 + \dots + d_n) = \frac{1}{n}\left[(x_1 - \bar{x}) + (x_2 - \bar{x}) + \dots + (x_n - \bar{x})\right] = \frac{1}{n}\left[(x_1 + x_2 + \dots + x_n) - n\bar{x}\right] = \frac{1}{n}(n\bar{x} - n\bar{x}) = 0$$

So $\bar{d}$ is always 0 and does not measure the spread in the data. We might use the average of the absolute deviations:

$$\frac{1}{n}\left(|d_1| + |d_2| + \dots + |d_n|\right)$$

But this measure is mathematically inconvenient, so we consider the following definition instead.

(2) Variance / Standard Deviation

Sample Variance: the sample variance $s^2$ of n observations is the average (using n-1) of the squared deviations:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

► An alternative form convenient for calculation:

$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right]$$

Sample Standard Deviation (SD): the sample standard deviation s is the square root of $s^2$:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Ex. Grades of 9 students on a HW assignment: 86, 85, 81, 82, 84, 84, 83, 84, 87. Here $\bar{x} = 84$. What is the SD? (We also know that $\sum_{i=1}^{9} x_i^2 = 63532$.)

$$s^2 = \frac{1}{8}\left[\sum_{i=1}^{9} x_i^2 - n\bar{x}^2\right] = \frac{63532 - 9 \cdot 84^2}{8} = \frac{28}{8} = 3.5$$

$$s = \sqrt{s^2} = \sqrt{3.5} \approx 1.87$$

Interpretation: A typical student's score is about 1.87 points away from the mean score (84).

Comments:
1. SD should be used as a measure of spread only when the mean is used as the measure of center.
2. SD = 0 implies all data points are the same (no variability).
3. Like the mean, SD is strongly influenced by outliers.
Ex. If a coding error turns 87 into 870, then $\bar{x} = 171$ and the SD becomes about 262.

(3) Interquartile Range (IQR)

(a) Quartiles:
Q1 = first quartile = median of the lower half of the data.
Q3 = third quartile = median of the upper half of the data.
(Q2 = median.)
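Returning briefly to the homework-grade example in part (2): the short Python sketch below recomputes the sample variance both from the definition and from the computational form, checks that the deviations sum to zero, and then recomputes the mean and SD after the 87 to 870 coding error. The function names `sample_variance` and `sample_variance_shortcut` are illustrative choices, not part of the notes; `statistics.variance` and `statistics.stdev` from the Python standard library are used only as a cross-check.

```python
import math
import statistics

# Homework grades from the example in part (2).
grades = [86, 85, 81, 82, 84, 84, 83, 84, 87]

def sample_variance(xs):
    """Definition: sum of squared deviations divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Computational form: (sum of x_i^2 - n * mean^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)

mean = sum(grades) / len(grades)
deviations = [x - mean for x in grades]
print("sum of deviations:", sum(deviations))   # always 0, so it cannot measure spread

s2 = sample_variance(grades)
s = math.sqrt(s2)
print("variance (definition):", s2)
print("variance (shortcut)  :", sample_variance_shortcut(grades))
print("variance (statistics):", statistics.variance(grades))
print("SD:", round(s, 2))

# Sensitivity to an outlier: the coding error that turns 87 into 870.
corrupted = [870 if x == 87 else x for x in grades]
bad_mean = statistics.mean(corrupted)
bad_sd = statistics.stdev(corrupted)
print("with coding error: mean =", bad_mean, " SD =", round(bad_sd, 1))
```

Both variance functions divide by n - 1, matching the sample variance defined in the notes. A library routine that divides by n (a population variance) would give a smaller value.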
IQR = Q3 - Q1 (a good measure of spread in the presence of outliers)

To find quartiles:
1. Sort the data and divide the data points into 2 halves. (If there is an odd number of observations, include the median in each half.)
2. Lower quartile Q1 = median of the lower half of the data.
3. Upper quartile Q3 = median of the upper half of the data.

(b) Inter-Quartile Range (IQR)

IQR = Q3 - Q1

Interpretation of IQR: a measure of variability (spread) of the data, similar to s but usually larger than s.

Ex. Grades of 9 students on a HW assignment: 86, 85, 81, 82, 84, 84, 83, 84, 87.
The sorted HW scores are: 81 82 83 84 84 84 85 86 87.
The median is 84. Including it in each half, the lower half is 81 82 83 84 84 and the upper half is 84 84 85 86 87.
So median = 84, Q1 = 83, Q3 = 85, IQR = 85 - 83 = 2.

Ex. Rainfall in NC in some 15 months (stem-and-leaf plot)

1 | 0
2 | 25
3 | 45
4 | 11667
5 | 449
6 | 0
7 |
8 | 2

Stem: ones digit
Leaf: tenths digit

Find the quartiles and the IQR.

Remark 1: The 1.5 x IQR Criterion for Outliers
An observation is called an outlier if it lies more than 1.5*IQR above Q3 or more than 1.5*IQR below Q1. An extreme outlier (more than 3*IQR above Q3 or more than 3*IQR below Q1) may indicate a data entry error or unusual characteristics in the data that need careful investigation.

Ex. Use the 1.5 x IQR rule to check whether there is any outlier in the Rainfall dataset.
The 15 observations are 1.0, 2.2, 2.5, 3.4, 3.5, 4.1, 4.1, 4.6, 4.6, 4.7, 5.4, 5.4, 5.9, 6.0, 8.2, with median 4.6.
Q1 = 3.45, Q3 = 5.4, IQR = 5.4 - 3.45 = 1.95, so 1.5 x IQR = 2.925.
Q1 - 2.925 = 3.45 - 2.925 = 0.525 (no data point is smaller than 0.525)
Q3 + 2.925 = 5.4 + 2.925 = 8.325 (no data point is larger than 8.325)
So there are no outliers.

Remark 2: Standard numerical summaries of a data set include sample size, center, and spread.
For a reasonably symmetric distribution with no outliers, use $\bar{x}$ and s.
In all other situations, use the median and IQR.

Remark 3: The 5-number summary
min, Q1, Median, Q3, max

Remark 4: Change of Unit
1. Adding (or subtracting) a constant to each observation will NOT change the measures of spread, such as SD and IQR (but it does change the measures of center and the quartiles).
If the new unit = 60 + the old unit, does the spread of the data change?
2. Multiplying each observation by a constant a multiplies the measures of spread (SD and IQR) by |a|.

Conclusion: If the new unit is aX + b, then the new spread, in terms of SD or IQR, is |a| times the original spread. (Recall that the new center is $a\bar{x} + b$.)

Ex. Temperatures are read in Fahrenheit; the SD of the temperatures is s and the IQR is r. What are the new SD and IQR if we switch to Centigrade? Note that

$$C = \frac{5}{9}(F - 32)$$

Here a = 5/9, so the new SD is 5s/9 and the new IQR is 5r/9.
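The quartile rule, the 1.5 x IQR criterion, and the change-of-unit result can be checked with a short Python sketch. It assumes the convention stated in these notes (the median is included in both halves when n is odd); library routines such as `statistics.quantiles` use other conventions and may return slightly different quartiles. The helper names `quartiles` and `iqr_outliers`, and the constants `a = -2`, `b = 60` used for the unit change, are illustrative choices and do not come from the notes.

```python
import math
import statistics

def quartiles(data):
    """Return (Q1, median, Q3) using the rule in these notes:
    when n is odd, the median is included in both halves."""
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2          # size of each half (median shared if n is odd)
    lower, upper = xs[:half], xs[n - half:]
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)

def iqr_outliers(data, k=1.5):
    """Observations more than k * IQR below Q1 or more than k * IQR above Q3."""
    q1, _, q3 = quartiles(data)
    width = k * (q3 - q1)
    return [x for x in data if x < q1 - width or x > q3 + width]

# Rainfall values read off the stem-and-leaf plot (stem = ones, leaf = tenths).
rainfall = [1.0, 2.2, 2.5, 3.4, 3.5, 4.1, 4.1, 4.6, 4.6, 4.7,
            5.4, 5.4, 5.9, 6.0, 8.2]

q1, med, q3 = quartiles(rainfall)
iqr = q3 - q1
five_number = (min(rainfall), q1, med, q3, max(rainfall))
print("five-number summary:", five_number)
print("IQR:", iqr)
print("fences:", q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print("outliers:", iqr_outliers(rainfall))
print("extreme outliers (3 x IQR):", iqr_outliers(rainfall, k=3))

# Change of unit on the homework grades: new = a * old + b.
# a and b are arbitrary illustrative constants, not taken from the notes.
grades = [86, 85, 81, 82, 84, 84, 83, 84, 87]
a, b = -2, 60
rescaled = [a * x + b for x in grades]

g_q1, _, g_q3 = quartiles(grades)
r_q1, _, r_q3 = quartiles(rescaled)
print("IQR before and after:", g_q3 - g_q1, r_q3 - r_q1)
print("SD scales by |a|:",
      math.isclose(statistics.stdev(rescaled), abs(a) * statistics.stdev(grades)))
```

A negative `a` is used on purpose: it reverses the order of the data, yet the spread is still multiplied by |a|, while the shift `b` has no effect on SD or IQR.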