STAT 113: Variability
Colin Reimer Dawson, Oberlin College
13 February 2017

Outline: Last Time: Shape and Center; Boxplots and the IQR; Variance and Standard Deviation; Transformations

Distribution of a Quantitative Variable
The distribution of a quantitative variable is characterized by:
A. Shape (symmetric, skewed, bimodal, etc.)
B. Center (mean, median)
C. Spread (Interquartile Range, Standard Deviation)
D. Outliers (if any)

Skewness
• A distribution is skewed when the extreme values on one side are more extreme than those on the other.
• We call a distribution right-skewed when the longer “tail” is on the right, and left-skewed when the longer tail is on the left.

[Figure: Right Skew. Density of 2001 Household Income (Thousands of 2016$), axis from 0 to 2000.]

Central Tendency
• Often we want to summarize data with a single number,
• usually one representing a “typical” or “middle” value.
• How this is defined depends on the data and the question.

The Mean
Intuitively, the mean is the “balance point” of the data.
Mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• $x_i$ is the $i$th observation
• $n$ is the sample size (number of observations)

Household Incomes
[Figure: Distribution of Household Income. Density of 2001 Household Income (Thousands of 2016$), axis from 0 to 2000. Source: 2001 Survey of Consumer Finances.]
Here, the mean is greater than 70% of the cases, due to the heavy right skew.

Skewed Distributions
• The mean is strongly affected by skew and by outliers.
• The mean is pulled toward the extreme values.
• In these cases, we generally prefer a measure of central tendency that is resistant to the influence of extreme values (also called robust).

The Median
Median: The median is the point that cuts the data in half: an equal amount of data lies above and below the median.

The median is less affected by extreme values, so it is nearer to the bulk of households.
[Figure: Distribution of 2001 Household Income (Thousands of 2016$), axis from 0 to 2000.]
Zooming in...
[Figure: Distribution of 2001 Household Income (Thousands of 2016$), axis from 0 to 250.]

Median vs. Mean Income
Figure: Median Household Income vs. Per Capita GDP (1947-2007).
Source: www.lanekenworthy.net

Measures of Variability
• We want to quantify the consistency, or lack thereof, of the data.
• A general term for “lack of consistency” is variability.
• We will look at:
  • Range
  • Interquartile Range
  • Variance / Standard Deviation

The Range
The range is easy to compute, but not very reliable.
[Figure: Historical Annual Returns for Two Hypothetical Index Funds. Dot plots for Fund C1 and Fund C2 on a common axis from −30 to 30.]
[Figure: Annual Returns for 3 random samples of 5 years. Dot plots for Fund E (Full Data Set) and Fund Samples 1, 2, and 3 on a common axis from −10 to 15.]

Robust Measures of Variability
• We’d like a more robust measure of variability, one that is not affected so much by extreme values.
• Analogous to the median: describe the “middle” part of the data.
• The idea: find the “middle half” of the data, and then take its range.
• Specifically, exclude the lowest 25% and the highest 25%, and take the difference between the highest and lowest remaining values.

Quartiles
• The median divides the data in two.
• Percentiles divide the data into 100 pieces.
• Quartiles divide the data into four pieces. The $k$th quartile (written $Q_k$) is the point below which $k$ quarters of the data lie.
• So, in terms of quartiles, the median is $Q_2$, the minimum value is $Q_0$, and the maximum value is $Q_4$.
• We can calculate the range using quartiles as $Q_4 - Q_0$.

[Figure: Dot plot of Height (in.), axis from 20 to 50, with Q0, Q1, Q2, Q3, and Q4 marked.]

The Inter-Quartile Range (IQR)
The Inter-Quartile Range (or IQR) is the distance between the first and third quartiles:
$$\mathrm{IQR} = Q_3 - Q_1$$
Pedantic Note: The IQR is a single number, not the two quartiles themselves.

[Figure: Dot plot of Height (in.), axis from 20 to 50, with Q0 through Q4 marked and the Range and IQR indicated.]

The Five-Number Summary
• The quartiles are very natural to report together to describe the center and spread of a distribution.
• Q0 through Q4 collectively form the five-number summary of a quantitative distribution.
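As a rough illustration of quartiles, the IQR, and the five-number summary, here is a minimal Python sketch. The height values are hypothetical (they are not the data behind the slides), and the `"inclusive"` interpolation rule is an assumption: different software uses slightly different quartile conventions, so values may differ a little elsewhere.

```python
import statistics

# Hypothetical heights in inches (illustrative values, not the slide data).
heights = [20, 25, 30, 33, 35, 36, 38, 40, 50]

# statistics.quantiles with n=4 returns the three interior cut points Q1, Q2, Q3.
# method="inclusive" is one of several quartile conventions (an assumption here).
q1, q2, q3 = statistics.quantiles(heights, n=4, method="inclusive")

q0, q4 = min(heights), max(heights)   # Q0 and Q4 are the minimum and maximum
five_number_summary = (q0, q1, q2, q3, q4)

data_range = q4 - q0                  # the range, written with quartiles
iqr = q3 - q1                         # the range of the "middle half"

# Replace the largest value with an extreme one: the range changes, the IQR does not.
extreme = heights[:-1] + [500]
e1, _, e3 = statistics.quantiles(extreme, n=4, method="inclusive")
extreme_range = max(extreme) - min(extreme)
extreme_iqr = e3 - e1

print("Five-number summary:", five_number_summary)
print("Range:", data_range, "IQR:", iqr)
print("With an extreme value -> Range:", extreme_range, "IQR:", extreme_iqr)
```

In this example the single extreme value stretches the range considerably while leaving the IQR untouched, which is the sense in which the IQR is resistant.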
$$\text{Five-Number Summary} = (x_{\min}, Q_1, \text{Median}, Q_3, x_{\max}) = (Q_0, Q_1, Q_2, Q_3, Q_4)$$

Box-and-Whisker Plots
From the five-number summary, we construct a graph called a box-and-whisker plot (or just box plot, for short):
1. Draw an axis.
2. Draw a rectangle (box) from $Q_1$ to $Q_3$.
3. Draw a line across the box (or place a dot) at $Q_2$.
4. Draw lines (whiskers) extending outward from the box on both sides to either
   (a) (Simplest version) $x_{\min}$ and $x_{\max}$, or
   (b) (R default) the most extreme data points that lie within $Q_1 - 1.5\,\mathrm{IQR}$ and $Q_3 + 1.5\,\mathrm{IQR}$.
5. In version (b), plot points beyond the whiskers individually.

[Figure: Dot plot of Height (in.), axis from 20 to 50, with Q0 through Q4, the Range, and the IQR marked, above the corresponding box plot. Shown twice: once with whiskers to the extremes, and once with a point beyond the whisker plotted individually.]

Box-and-Whisker Plot: Right Skew
[Figure: Density of 2001 Household Income (Thousands of 2016$), axis from 0 to 2000, above the corresponding box plot. Shown twice: once with whiskers to the extremes, and once with the many points beyond the whisker plotted individually.]

Deviations
Rather than simply measuring the distance between extremes, we can develop measures based on distance from “center”.
Deviation Scores
For each data point, its deviation score is its “distance” from the mean:
$$\text{Deviation}_i = x_i - \bar{x}, \quad \text{for each } i = 1, \ldots, n$$

[Figure: Dot plot of Height (in.), axis from 20 to 50, with the mean (36.76) marked and two example deviations drawn: Deviation = 6.24 and Deviation = −12.76.]
How can we use these for an overall measure of spread?

Variance
• If we square all the deviations from the mean and average them, we get the variance.
Variance: The variance, written $s^2$, is the average of the squared deviations from the mean. That is,
$$s^2 = \frac{\sum_{i=1}^{n} \text{Deviation}_i^2}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

What’s with that denominator?
• With an average, you’re supposed to divide by the number of things, aren’t you? Why $n - 1$?
• Usually we are working with a sample, and are interested in estimating the population variability.
• We get no information about variability from the first observation, so there are only $n - 1$ “degrees of freedom” in the sample.
• Interesting math side fact: the variance is equal to half the average squared distance between all distinct pairs of data points.
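The variance calculation can be sketched step by step in Python. The five data values are hypothetical, chosen only so the arithmetic is easy to follow by hand. The last step checks the pairwise side fact: the average squared difference over all distinct pairs works out to twice $s^2$, so the sketch halves it.

```python
from itertools import combinations
import statistics

# Hypothetical sample (illustrative values only).
x = [2, 4, 4, 6, 9]
n = len(x)

mean = sum(x) / n                          # x-bar = 5.0
deviations = [xi - mean for xi in x]       # x_i - x-bar; these sum to zero
sum_sq = sum(d ** 2 for d in deviations)   # sum of squared deviations = 28.0

variance = sum_sq / (n - 1)                # sample variance s^2, dividing by n - 1
naive_average = sum_sq / n                 # dividing by n gives a smaller number

# Pairwise view: half the average squared difference over all distinct pairs.
pairs = list(combinations(x, 2))
pairwise = sum((a - b) ** 2 for a, b in pairs) / len(pairs) / 2

print("s^2 =", variance)
print("divide-by-n version =", naive_average)
print("pairwise version =", pairwise)
print("library check =", statistics.variance(x))
```

Python's `statistics.variance` also divides by $n - 1$, so it agrees with the hand calculation.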
Standard Deviation
• Variance ($s^2$) is in squared units relative to the data.
• No problem: just take the square root.
Standard Deviation: $s = \sqrt{s^2}$ is the standard deviation.
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} \text{Deviation}_i^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

Same range, different s
[Figure: Dot plots of Fund C1 (s = 18.2) and Fund C2 (s = 8.1) on a common axis from −30 to 30.]
The standard deviation uses all the data.

Outliers
• Skewness can be an important feature of a distribution.
• But sometimes a few unusual data points make an otherwise “well-behaved” distribution look skewed/multimodal.
• When not part of the overall pattern, these are called outliers.
  • Sometimes they reflect measurement errors (e.g., a misplaced decimal).
  • Sometimes they represent genuinely unusual observations.

On-Base Percentage
A common statistic for batters in baseball is On-Base Percentage.
[Figure: Distribution of major-league hitters with at least 100 Plate Appearances in 2002. Density of On-Base Percentage, axis from 0.0 to 0.6, with Barry Bonds labeled; Skewness = 0.630.]
On-Base Percentage
[Figure: Distribution without Bonds. Density of On-Base Percentage, axis from 0.0 to 0.6; Skewness = 0.199.]

Visualizing Outliers
[Figure: Two plots of On-Base Percentage, axis from 0.2 to 0.5; the second shows outlying points individually.]

Problems with s and s²
• These measures, even more than the mean itself, are heavily influenced by extreme values.
[Figure: Density of 2001 Household Income (Thousands of 2016$), axis from 0 to 2000.]
[Figure: Three density panels on a common axis from 0 to 2000 (Thousands of 2016$): 2001 Household Income; 2001 Income (Top 0.1% Excluded); 2001 Income (Top 1% Excluded).]
[Figure: The same three panels on an axis from 0 to 300.]

Variance-Stabilizing Transformations
• The mean and standard deviation are unstable in the presence of skew.
• However, they have such useful properties otherwise that it is often better to try to “remove” skew, rather than fall back on other measures.
• The most common way to remove skew is by a nonlinear transformation of the underlying scale.
• Take the original variable, $x$, and define a new variable $y = f(x)$, where $f$ is a one-to-one function.
• Most common case: right-skewed data with positive values
  • Logarithmic transform (take $y = \log(x)$)
  • Square root (take $y = \sqrt{x}$)

[Figure: Original vs. Logarithmic Income Distribution. Top: density of 2001 Household Income (Thousands of 2016$), axis from 0 to 250. Bottom: density of 2001 Household Income (2016$) on a logarithmic axis from 10^2 to 10^6.]

Summary: Quantitative Data
Visualizing a quantitative variable
• Dot Plots
• Box-and-Whisker Plots
• Histograms
• Density curves
Describing the distribution of a numeric variable
• Shape (symmetry, skew, modes)
• Center (mean, median)
• Spread (IQR, standard deviation)
• Outliers (if any)

Summary: Shape and Center
• A distribution is skewed when the extreme values on one end are more extreme than on the other.
• We say that it is skewed in the direction of the more extreme values (e.g., right-skewed if there are a few very large values).
• The mean is the “balance point of the data”, written $\bar{x}$.
  • The mean has nice mathematical properties, but is affected by skew.
• The median divides the cases in half.
  • It is resistant to outliers/skewness.

Summary: Variability
• The range is unstable for a sample, and is extremely vulnerable to outliers/skew.
• The Interquartile Range (IQR) is the range of the “middle half” of the data, and is “resistant” (like the median).
• The variance is the “average” of the squared deviations of the observations from the mean.
• The standard deviation is the square root of the variance, which restores units to the original scale.
• Nonlinear transformations (log, square root, etc.) can be used when appropriate to reduce skew and stabilize variance.
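To make the last point concrete, here is a small Python sketch of a logarithmic transformation applied to right-skewed data. The values are hypothetical (each is double the previous one, so the effect is easy to see), and the helper `summarize` is an illustrative name rather than anything from the course materials.

```python
import math
import statistics

# Hypothetical right-skewed, positive values (illustrative only).
x = [1, 2, 4, 8, 16, 32, 64]

def summarize(values):
    """Return (mean, median, standard deviation) of a list of numbers."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.stdev(values))   # stdev divides by n - 1

raw_mean, raw_median, raw_sd = summarize(x)

# Logarithmic transform y = log(x); valid only because every x is positive.
y = [math.log(v) for v in x]
log_mean, log_median, log_sd = summarize(y)

# Gap between mean and median, in standard-deviation units, on each scale.
raw_gap = (raw_mean - raw_median) / raw_sd
log_gap = (log_mean - log_median) / log_sd

print(f"raw: mean={raw_mean:.2f} median={raw_median} sd={raw_sd:.2f} gap={raw_gap:.2f}")
print(f"log: mean={log_mean:.2f} median={log_median:.2f} sd={log_sd:.2f} gap={log_gap:.2f}")
```

On the raw scale the mean sits well above the median, as expected under right skew. On the log scale the values are evenly spaced, so the mean and median essentially coincide. Real data will rarely become perfectly symmetric, but the direction of the effect is the same.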