STAT 101, Module 2: Numerical Summaries of Variables

Questions one wants to quantify

If we are to examine data about Penn students (PennStudents.JMP), one might ask:
How many students are 19 years old? What fraction of the total are they? Are they fewer or more than the 18 year old students?
On average how tall are male and female students? How spread out are the heights? How strong is the overlap between male and female heights?

Numerical versus Graphical Summaries

Graphical methods allow us to
o see the data as a whole
o discover unexpected facts

Numerical summaries give us
o simplicity by condensing a lot of data into few numbers
o precision, for example, when comparing groups
o ways to reason about uncertainty (stay tuned)

Neither replaces the other.

Numerical Summaries according to Variable Type

Textbook: part of Chap. 3

Qualitative variables: “how many in each group?”
o Counts/Frequencies
o Proportions

Quantitative variables:
o Measures of Location: “where is the data?”
  Mean, Median, Quantiles, Minimum, Maximum
o Measures of Dispersion: “how wide is the data?”
  Standard Deviation, Range, Interquartile Range

Qualitative Variables: Counts/Frequencies and Proportions

The following example is from the data PennStudents.JMP. Age is used as an ordinal variable.

[Figure: bar plot of AGE, levels 18 to 24]

Frequencies
Level    Count    Prob
18         128    0.32821
19         139    0.35641
20          70    0.17949
21          33    0.08462
22          14    0.03590
23           4    0.01026
24           2    0.00513
Total      390    1.00000
N Missing 0
7 Levels

The barplot gives a good comparison of the frequencies across the age groups. The table gives a list of exact counts and proportions (“Prob” in JMP).
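Each “Prob” value is simply the count divided by the total. As a minimal sketch outside of JMP (Python is used here only for illustration, and the variable names are made up), the proportions can be recomputed from the counts in the table above:

```python
# Counts per age level, copied from the JMP frequency table above.
counts = {18: 128, 19: 139, 20: 70, 21: 33, 22: 14, 23: 4, 24: 2}

total = sum(counts.values())  # total number of students

# Proportion = Count / Total; Percentage = Proportion * 100
proportions = {level: n / total for level, n in counts.items()}

for level, n in counts.items():
    p = proportions[level]
    print(f"{level}  {n:4d}  {p:.5f}  {100 * p:5.1f}%")
print(f"Total {total}")
```

The printed proportions should agree with the “Prob” column up to rounding.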
Count = Frequency (synonyms)
Proportion = Count / Total (= Fraction)
Percentage = Proportion * 100

Algebraic notation:
n_i = count of the i’th label
p_i = proportion of the i’th label
Example above:
n_1 = 128, n_2 = 139, …, n_7 = 2
p_1 = .328, p_2 = .356, …, p_7 = .005
where label 1 is ‘18’, label 2 is ‘19’, …

Terminology: Level = label, group name
Example: JMP reports 7 levels for ‘Age’.

JMP: To reproduce the above output, you need to convert the quantitative variable ‘Age’ to qualitative before you do Analyze > Distribution. The conversion is done as follows:
Right-click the label ‘Age’ above the Age column > Modeling Type > Ordinal

Quantitative Variables: Measures of Location and Dispersion

Again, the following example is from PennStudents.JMP:

[Figure: boxplot of HEIGHT]

Quantiles
100.0%   maximum    80.000
99.5%               76.180
97.5%               75.000
90.0%               73.500
75.0%    quartile   71.000
50.0%    median     67.500
25.0%    quartile   65.000
10.0%               63.000
2.5%                60.000
0.5%                57.478
0.0%     minimum    57.000

Moments
Mean              67.754103
Std Dev           3.9749694
Std Err Mean      0.2012804
upper 95% Mean    68.149836
lower 95% Mean    67.358369
N                 390

(Ignore “Std Err Mean”, “upper 95% Mean” and “lower 95% Mean” in this table. Everything else will be explained in the next two bullets.)

Measures of Location: “Where is the data?” (Sec. 3.1)

Textbook: “central tendency”, Sec. 3.1 (ignore ‘population’)

o Mean: average of the values x_1, x_2, …, x_N in column ‘x’
  mean(x) = (x_1 + x_2 + … + x_N)/N
  In the height data, the mean is reported to be 67.75…
o Median: the middle value of the sorted values in a column if N is odd, and the average of the two middle values if N is even.
  Examples: If the values in a column are 1,2,3,4,5, the median is 3. If the values are 1,2,3,4, the median is 2.5.
  In the height data, the median is reported to be 67.5.
o Quantiles: The idea of quantiles is that they divide the values in a column roughly into, for example, 20% of values less and 80% greater. This would be called the 20% quantile. The same applies to any other percentage.
  Sometimes one calls, for example, the 90% quantile the “upper 10% quantile”. If nothing is said to the contrary, the percentage of a quantile refers to the fraction of values that are less.
  (Don’t worry about the fine points of defining quantiles! Trust that JMP has a reasonable general definition.)
  Special cases:
  50% quantile = median
  25% and 75% quantiles = lower and upper quartiles
  10%, 20%, …, 90% quantiles = deciles
  0% quantile = minimum
  100% quantile = maximum
  In the height data, JMP gives us the 0%, 0.5%, 2.5%, 10% and 25% quantiles from both the lower and the upper end, plus the 50% quantile.

Abbreviations: mean(Height), med(Height), max(Height), min(Height)

Note 0: In general, mean ≠ median.
[Move transformation properties after introducing dispersion measures. Then explain that it is these properties that distinguish them]

Note 1: Shifting the values of a variable
If you add a constant value to all the values in a column, that value also gets added to the location measures. For the mean this can be expressed as follows:
mean(x+c) = mean(x)+c
Example: If you re-express degrees Celsius in degrees Kelvin, you add 273 (more precisely, 273.15). Therefore, add 273 to the means and quantiles of degrees Celsius and you obtain the means and quantiles in degrees Kelvin:
K° = C° + 273

Note 2: Rescaling the values of a variable
If you multiply all the values in a column by a constant value, the location measures also get multiplied by that value. For the mean this can be expressed as follows:
mean(cx) = c·mean(x)
Example: If you convert $ to €, you have to multiply by a factor 0.770831727 (2007/01/21). Therefore, multiply means and quantiles in $ by this factor to obtain the means and quantiles in €.
[Caution: Quantiles other than the median do not strictly follow this formula when the factor c is negative. Lower quantiles become upper quantiles and vice versa. Ex.: The lower quartile becomes the upper quartile if c<0.]

Note 3: Shifting and rescaling the values of a variable
Notes 1 and 2 can be combined.
Example: For translating means and quantiles from degrees Celsius to degrees Fahrenheit, apply the well-known conversion formula to the means and quantiles in Celsius and you obtain the means and quantiles in Fahrenheit:
F° = (9/5)·C° + 32

Problem: Make up a new measure of location.

Notation: Because the mean is the most important measure of location, we often abbreviate it as italic m. That is, m = mean(x). If more than one variable is in play and we need to indicate the variable, we may write m_x and m_y. For example, we might write m_Height and m_Weight.

Measures of Dispersion: “How wide is the data?” (Sec. 3.2)

Textbook: “variability”, Sec. 3.2 (ignore ‘population’)

o Range = maximum – minimum
  This is the (vertical) width from the topmost point to the bottommost point in the boxplot.
  In the height data, the range is 80 – 57 = 23.
o Interquartile Range (IQR):
  IQR = upper quartile – lower quartile
  This is the (vertical) width of the box in the boxplot.
  In the height data: IQR = 71 – 65 = 6.
o Standard Deviation (s, sdev, sd, SD, std dev, …):
  \[ s = \sqrt{\frac{(x_1 - m)^2 + (x_2 - m)^2 + \dots + (x_N - m)^2}{N - 1}} \]
  where m = mean(x).
  In the height data, s is reported to be 3.97.
  This is the most important measure of dispersion! Questions arise, however: Why squared deviations from the mean? Why a square root? Why N–1? This will require more explanation. Stay tuned.

Abbreviations: If we have standard deviations of more than one column, x and y, say, we have to distinguish the measures of dispersion. We would then use the symbols s_x or s(x) and s_y or s(y) for the respective standard deviations. Similarly, we might use IQR_x or IQR(x) and IQR_y or IQR(y). For the height data above, we could write IQR_Height = 6 and s_Height = 3.97.

Terminology: s^2 = Variance

A look ahead: The variance of stock returns is used in finance as a measure of “volatility” or “risk” of stock investments.
(Of course the standard deviation could serve the same purpose, and so could any other measure of dispersion, but finance math dictates the use of variances.)

Note 1: Shifting the values of a variable
If you add a constant value to all values in a column, measures of dispersion do not change. For the standard deviation this can be expressed as follows:
s_(x+c) = s_x or s(x+c) = s(x)
Idea: The width does not depend on where the distribution is.
Example: If you convert C° to K°, the standard deviation does not change. Neither does the range or the IQR.

Note 2: Rescaling the values of a variable
If you multiply all values in a column by a constant value, measures of dispersion get multiplied by the absolute value of the constant. For the standard deviation this can be expressed as follows:
s_(cx) = |c| s_x or s(cx) = |c| s(x)
Idea: If you double the numbers, you double the width.
Example: If you convert $ to €, you have to multiply by a factor 0.770831727 (2007/01/21). Therefore, multiply standard deviations, ranges, IQRs in $ by this factor to obtain the standard deviations, ranges, IQRs in €.

Note 3: Shifting and rescaling the values of a variable
Notes 1 and 2 can again be combined.
Example: For translating standard deviations, ranges, IQRs from degrees Celsius to degrees Fahrenheit, multiply them by a factor 9/5. (The shift by 32 has no effect on them.)

Problem: Make up a new measure of dispersion.

Appendix on Standard Deviations and Variances

\[ s^2 = \frac{(x_1 - m)^2 + (x_2 - m)^2 + \dots + (x_N - m)^2}{N - 1} \]

o Q: Why is the variance not a measure of dispersion?
  A: If the values x_1, x_2, ..., x_N are multiplied by a constant c, then s^2 gets multiplied by c^2 and not |c|. For a measure of dispersion we want that doubling the values entails doubling the measure of dispersion, not quadrupling, as is the case for the variance s^2. This explains the root in the formula for s!
o Q: Why do we divide by N–1 and not N?
  A: The deviations from the mean, x_i–m, are not independent.
  If we know x_1–m, …, x_(N–1)–m, then we know x_N–m, because these values sum up to zero:
  (x_1–m) + (x_2–m) + … + (x_(N–1)–m) + (x_N–m) = 0
  which we can solve for (x_N–m). The complete answer is more technical, so take this as a hint.
  Proof of the identity:
  (x_1–m) + (x_2–m) + … + (x_(N–1)–m) + (x_N–m) = (x_1 + x_2 + … + x_N) – Nm = Nm – Nm = 0.
o Q: Why squares in the first place? Why not absolute values |x_i–m|? This would do away with the root!
  A: A simple reason is that we can do algebra with squares but not easily with absolute values. (A deeper reason has to do with Pythagoras and probabilities. Stay tuned!)

A Few Data Examples

Counts and Proportions: the Titanic data

[Figures: bar plots of CLASS, AGE, SEX and SURVIVED]

CLASS Frequencies
Level    Count    Prob
1st        325    0.14766
2nd        285    0.12949
3rd        706    0.32076
crew       885    0.40209
Total     2201    1.00000
N Missing 0
4 Levels

AGE Frequencies
Level    Count    Prob
adult     2092    0.95048
child      109    0.04952
Total     2201    1.00000
N Missing 0
2 Levels

SEX Frequencies
Level    Count    Prob
female     470    0.21354
male      1731    0.78646
Total     2201    1.00000
N Missing 0
2 Levels

SURVIVED Frequencies
Level    Count    Prob
no        1490    0.67697
yes        711    0.32303
Total     2201    1.00000
N Missing 0
2 Levels

Lesson: For extreme differences in frequencies, numbers are superior to pictures. For example, we see that there are almost no children on the Titanic, but how few really? The table shows that there were 109 children or about 5% of the total. This would be difficult to estimate by eyeballing the bar plot.
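The Titanic proportions, including the “about 5%” for children, follow directly from the counts. A small sketch in Python (for illustration only; the dictionary and the helper name `proportions` are made up, while the counts are copied from the tables above):

```python
# Counts copied from the Titanic frequency tables above.
titanic_counts = {
    "CLASS": {"1st": 325, "2nd": 285, "3rd": 706, "crew": 885},
    "AGE": {"adult": 2092, "child": 109},
    "SEX": {"female": 470, "male": 1731},
    "SURVIVED": {"no": 1490, "yes": 711},
}

def proportions(counts):
    """Return {level: count / total} for one qualitative variable."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}

for variable, counts in titanic_counts.items():
    for level, p in proportions(counts).items():
        print(f"{variable:9s} {level:7s} {counts[level]:5d} {p:.5f}")
```

For example, the child proportion comes out as 109/2201, which is roughly 0.0495, a number that is hard to read off a bar that small.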
Measures of Location and Dispersion: CEO compensation

[Figure: boxplot of Total comp + opt exer /1000]

Total comp + opt exer /1000

Quantiles
100.0%   maximum    156168
99.5%                59045
97.5%                25266
90.0%                10493
75.0%    quartile     4412
50.0%    median       1884
25.0%    quartile      903
10.0%                  508
2.5%                   254
0.5%                    16
0.0%     minimum         0

Moments
Mean              4563.4621
Std Dev           9235.1532
Std Err Mean      238.37119
upper 95% Mean    5031.0383
lower 95% Mean    4095.8858
N                 1501

[Figure: boxplot of log(TotComp+optexer)]

log(TotComp+optexer)

Quantiles
100.0%   maximum    8.1936
99.5%               7.7714
97.5%               7.4028
90.0%               7.0224
75.0%    quartile   6.6461
50.0%    median     6.2763
25.0%    quartile   5.9577
10.0%               5.7128
2.5%                5.4165
0.5%                4.4635
0.0%     minimum    4.4e-16

Moments
Mean              6.3104396
Std Dev           0.5715899
Std Err Mean      0.0147732
upper 95% Mean    6.3394179
lower 95% Mean    6.2814613
N                 1497

Lessons:
1) The distribution of raw compensations is extremely skewed upwards. This is the reason why the mean and median are extremely different:
   mean(Tot Comp + …) = 4563
   med(Tot Comp + …) = 1884
   (both in $1000s)
   The median is a better measure of a typical value here because the mean gets pulled up by the upper extremes and is no longer a typical value. (If we asked, however, how much each CEO would get if the sum of all compensations were equally redistributed among CEOs, we would have to use the mean, never mind the skewed distribution.)
2) The textbook has a measure of skewness (p. 76f), but we will not use it. Instead we take a discrepancy between mean and median as a sign of skewness. The direction of skewness follows from the order of the two measures:
   o mean > median: skewed upwards
   o mean < median: skewed downwards
   Remember that the mean gets pulled by extreme values, the median doesn’t. Therefore the side of the median on which the mean lies tells you to which side the distribution is skewed.
3) The sdev is even more problematic than the mean for very skewed distributions (forming squares blows up the extremes even more than the raw values). By comparison, the IQR does not lose its meaning: it always tells how far the upper and lower quartiles are apart.
   For the raw compensations, the CEOs at the upper and lower quartile make about $4.4 million and $0.9 million, respectively, with a spread of about $3.5 million = IQR.

Messages:
o The mean and sdev are problematic for extremely skewed distributions. They are more meaningful for bell-shaped, nearly-symmetric distributions.
o The median and IQR remain meaningful for skewed distributions.

Another Appendix: Mean versus Median

Below is a physical illustration of the difference between mean and median.

The mean corresponds to the balance point of the data values on a seesaw balance as drawn on the left, assuming all data values have the same weight.

The median requires a scale that only counts how much is on the left and how much is on the right. The scale on the right does this: the distance of the points from the balance point is irrelevant as long as they stay on the same side. The reason is that all their weights get transmitted to equal distances on either side. (Old-fashioned scales are constructed like the median scale, so it doesn’t matter where on the platforms one places the goods and the metal weights.)

[xxx To be added next time:
 o s_x = 0 iff x = const
 o a_x = mean(|x–med(x)|) before s_x
 o use of location and dispersion measures for standardization (z-scores), with example of equalizing midterm scores (then remove standardization from Module 3 where it is an afterthought)
 o introduce also the empirical rule and the normal distribution, even the normal probability plot, to have an interpretation for the SD
]
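The different behavior of mean and sdev versus median and IQR can be tried out on a small made-up data set. The sketch below uses Python's standard `statistics` module rather than JMP. The values are hypothetical (they are not taken from the CEO data), the helper `summarize` is invented for this illustration, and `statistics.quantiles` follows its own quantile convention, which need not match JMP's exactly.

```python
import statistics

# Hypothetical values in $1000s, NOT taken from the CEO data set.
base = [300, 500, 700, 900, 1200, 1900, 2600, 3400, 4400, 6000, 9000]
# Same data, but the largest value is replaced by an extreme one.
skewed = base[:-1] + [150000]

def summarize(values):
    """Return location and dispersion measures for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sdev": statistics.stdev(values),  # divides by N-1
        "IQR": q3 - q1,
    }

for name, values in [("base", base), ("skewed", skewed)]:
    print(name, summarize(values))
```

In this example the single extreme value pulls the mean and the sdev upward, while the median and the IQR stay where they were, because only the ordering around the middle and the quartiles matters for them.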