Stat I, Professor Vinod, class notes for Chapter 4

Let us use the same example as in my Chapter 3 notes. The original unclassified data are Xi for i=1 to i=10:
50 98 82 23 46 40 63 52 92 54
For example, X1=50, X3=82.

The mean xbar = 60 is a measure of centering. We cannot calculate the measure of variability called variance without first calculating the mean.
The Range = Xmax - Xmin = 98 - 23 = 75 is a measure of dispersion or spread.

Sample variance s^2 and mean absolute deviation (MAD) for unclassified data
Write the data values one below the other and make a table like this:
------------------------------------------------
   i    (Xi - xbar)   (Xi - xbar)^2   |Xi - xbar|
------------------------------------------------
   1        -10            100             10
   2         38           1444             38
   3         22            484             22
   4        -37           1369             37
   5        -14            196             14
   6        -20            400             20
   7          3              9              3
   8         -8             64              8
   9         32           1024             32
n=10         -6             36              6
------------------------------------------------
Totals        0           5126            190
------------------------------------------------
Check the sum of deviations from the mean: it should be 0!
Sum of squared deviations = 5126
Denominator for sample variance = n-1 = 9
Unclassified sample variance s^2 = ∑(Xi - xbar)^2/(n-1) = 5126/9 = 569.556
The standard deviation is the square root of the variance: s = 23.8654

Mean Absolute Deviation (MAD)
Definition: MAD = ∑|Xi - xbar|/n
The sum of absolute deviations, from the last column of the table above, is 190.
Mean absolute deviation = MAD = 190/10 = 19. This is less sensitive to extreme observations than the sample variance because it avoids squaring large deviations (we take absolute values instead of squares).

Coefficient of Variation: CV = 100*s/xbar
If both the mean and the standard deviation are multiplied by the same constant, CV is unchanged. If we measure stock returns in British pounds, US dollars, or euros, CV remains fixed. This is its main advantage.
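As a quick numerical check of the table above, here is a short Python sketch (the variable names are mine) that recomputes the mean, sample variance, standard deviation, MAD, and CV:

```python
# Recompute the Chapter 4 summary measures for the ten unclassified values.
data = [50, 98, 82, 23, 46, 40, 63, 52, 92, 54]
n = len(data)
mean = sum(data) / n                 # 60.0, the measure of centering
dev = [x - mean for x in data]       # deviations from the mean; they sum to 0
ss = sum(d * d for d in dev)         # sum of squared deviations, 5126.0
s2 = ss / (n - 1)                    # sample variance, 5126/9
s = s2 ** 0.5                        # standard deviation
mad = sum(abs(d) for d in dev) / n   # mean absolute deviation, 190/10
cv = 100 * s / mean                  # coefficient of variation
print(mean, ss, round(s2, 3), round(s, 4), mad, round(cv, 4))
```

This reproduces the values in the notes: mean 60, sum of squares 5126, s^2 = 569.556, s = 23.8654, MAD 19, and CV = 39.7756.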
Otherwise it is just another measure of dispersion or spread (bunching of the data).
CV = 100*(std dev)/xbar = 100*23.8654/60 = 39.7756

Computation of the median and other percentiles for unclassified data
First we find the order statistics x(1) to x(n) from the sorted data. Make a note of the notation: the subscripts are in parentheses. The original data Xi are re-ordered as:
Xmin = x(1) = 23, x(2) = 40, x(3) = 46, x(4) = 50, x(5) = 52, x(6) = 54, x(7) = 63, x(8) = 82, x(9) = 92, x(10) = x(n) = 98 = Xmax.

MEDIAN for unclassified data (a measure of relative position)
The median is not at all sensitive to extreme observations (outliers); hence it is said to be resistant or robust. The location of the median is at el = n/2 = 5. Since this is an integer, the median is the average of x(5) and x(6), that is, the average of the fifth smallest value 52 and the sixth smallest value 54. The median for unclassified data = 53 here.
http://lib.stat.cmu.edu/DASL/Stories/ceo.html has an example of CEO salaries.

Now we know the shape, or skewness, from the unclassified mean and median as follows. In this example, since mean > median, the underlying frequency distribution is skewed to the right (long tail to the right side, or positively skewed). If the mean were equal to the median it would be symmetric, and if mean < median it would be skewed to the left.

TRIMMED MEAN (can be complicated)
Compute the 10% trimmed mean: n=10, P = percent trimmed = 10, lower-case p = 10/100 = 0.10, k = np = 1.
Integer part of k = 1, fractional part of k = 0.
Retained number of observations after trimming = 8.
The answer is obtained by simply averaging the following 8 retained numbers: 40 46 50 52 54 63 82 92.
10% trimmed mean = 479/8 = 59.875

Practical example of the trimmed mean (Olympic scoring)
It gives a more reliable, less biased average. A good application is Olympic scoring in events like gymnastics, where the judgment of the judge is involved. Imagine a biased judge who wants to favor a particular candidate and make sure that a competing candidate does not win.
If the ordinary average of the scores given by 10 judges is used, the biased judge can give a very high score to the favored candidate and a very low score to the other. Simple averaging will bring up one and bring down the other. Not so with trimmed averaging. The Olympics do not use the simple average but the trimmed mean: they always ignore the highest and lowest scores and average only the middle scores. This foils the biased scoring mentioned above.

The following discussion of advanced trimming methods may be skipped.
How do we compute a more complicated 13% trimmed mean? It needs many steps:
n=10, P = percent trimmed = 13, small p = 13/100 = 0.13, k = np = 1.3.
Integer part of k = 1, fractional part of k = 0.3.
Retained number of observations after trimming 13% on each side:
R = n(1 - 2p) = 10(1 - 2*0.13) = 10(1 - 0.26) = 10(0.74) = 7.4
Bottom-end trimming: from the second order statistic x(2) = 40 we throw out 0.3 and keep the (1 - 0.3) = 0.7 part. 0.7*40 = 28, so we keep 28 as the contribution to the trimmed mean from x(2).
The middle order statistics retained in full for the 13% trimmed mean are: 46 50 52 54 63 82.
Top-end trimming: from the ninth order statistic x(9) = 92 we throw out 0.3 and keep the (1 - 0.3) = 0.7 part: 0.7*92 = 64.4.
Finally, an example of the advanced trimming method: the 13% trimmed mean is the average of the 7.4 retained middle observations,
(28 + 46 + 50 + 52 + 54 + 63 + 82 + 64.4)/7.4 = 439.4/7.4 = 59.3784

In the R software the following gives a different answer:
z = c(23, 40, 46, 50, 52, 54, 63, 82, 92, 98)
mean(z, trim=0.13) # 59.875, the simple-trimming result, not 59.3784
R does not do advanced (fractional) trimming, only simple trimming: it drops one whole observation from each end.

PERCENTILES for unclassified data
Now we go beyond the median and report additional percentiles (used later for a notched box plot). Such percentiles are useful for various practical purposes. For example, if we want to define the poor as those earning income among the lowest 5%, we need to calculate the 5th percentile of the income distribution.
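Both trimming schemes above can be imitated in a few lines of Python (a sketch with variable names of my own choosing); note again that R's mean(z, trim=0.13) performs only the simple version:

```python
# Simple vs. advanced (fractional) trimmed means for the sorted data.
z = sorted([23, 40, 46, 50, 52, 54, 63, 82, 92, 98])   # order statistics x(1)..x(10)
n = len(z)

# Simple 10% trim: k = n*p = 1 whole observation dropped from each end.
simple10 = sum(z[1:n - 1]) / (n - 2)                   # average of the 8 middle values

# Advanced 13% trim: drop 1 whole observation per end, plus 0.3 of the next one in.
p = 0.13
k = int(n * p)                                         # integer part of np -> 1
frac = n * p - k                                       # fractional part of np -> 0.3
keep = 1 - frac                                        # keep 0.7 of x(2) and x(9)
total = keep * z[k] + sum(z[k + 1:n - k - 1]) + keep * z[n - k - 1]
adv13 = total / (n - 2 * k - 2 * frac)                 # 439.4 / 7.4
print(simple10, round(adv13, 4))
```

The simple trim gives 59.875 and the fractional trim gives 59.3784, matching the hand calculations above.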
Alternatively, if we define the rich as those earning among the top 1%, then we need to compute the 99th percentile of the income data.
5th percentile: the location is el = n*5/100 = 0.5. This is a fraction, so we round up to 1 and choose the first order statistic x(1) = Xmin = 23 as our 5th percentile.
10th percentile: here n=10, P=10, p=0.1, el = np = 1. The location is an integer, so the percentile is the average of x(el) and x(el+1), that is, the average of x(1) = 23 and x(2) = 40, so the 10th percentile is 31.5.
Q1 = first quartile = 25th percentile = 46
Median for unclassified data = 50th percentile = 53
Q3 = third quartile = 75th percentile = 82
90th percentile = 95
95th percentile = 98
IQR = interquartile range = Q3 - Q1 = 82 - 46 = 36
This is a measure of dispersion, less sensitive to extreme observations than the variance.

Outlier detection limits defined by John Tukey of Princeton
An outlier might be present on the low end or at the upper end, so we need two detection limits. Outlier means that the measurement is even more extreme than the designated limit. The choice of 1.5 and IQR may seem arbitrary, but there is a lot of theory behind it, which is beyond our scope.
Lower detection limit = Low = Q1 - 1.5*IQR = 46 - 54 = -8
Upper detection limit = Upper = Q3 + 1.5*IQR = 82 + 54 = 136
Since there is no data point smaller than the lower detection limit Low = -8, we conclude that there is no outlier on the left side in this data set. Xmin is only 23, certainly NOT more negative than -8.
Since there is no data point larger than the upper detection limit Upper = 136, we conclude that there is no outlier on the right side in this data set. Xmax is only 98, certainly NOT more than 136.
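The textbook percentile rule used above (round a fractional location up; average two order statistics at an integer location) can be sketched in Python together with the Tukey fences; the function name is mine:

```python
def pctl(xs, P):
    """Textbook percentile of sorted data xs at location el = n*P/100."""
    n = len(xs)
    el = n * P / 100
    if el != int(el):                 # fractional location: round up
        return xs[int(el)]            # 1-based ceil(el) is 0-based int(el)
    el = int(el)
    return (xs[el - 1] + xs[el]) / 2  # integer location: average x(el) and x(el+1)

xs = sorted([50, 98, 82, 23, 46, 40, 63, 52, 92, 54])
q1, med, q3 = pctl(xs, 25), pctl(xs, 50), pctl(xs, 75)   # 46, 53.0, 82
iqr = q3 - q1                                            # 36
low, up = q1 - 1.5 * iqr, q3 + 1.5 * iqr                 # Tukey fences -8.0 and 136.0
print(q1, med, q3, iqr, low, up)
```

The same function returns 23 for the 5th percentile, 31.5 for the 10th, 95.0 for the 90th, and 98 for the 95th, matching the list above.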
Software R commands use sophisticated interpolation for quantiles:

y = c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
quantile(y, c(0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99)) # computes quantiles by interpolation
#    1%    5%   25%   50%   75%   95%   99%
# 24.53 30.65 47.00 53.00 77.25 95.30 97.46
quantile(y, c(0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99), type=2)
# type=2 is the method used by elementary stats textbooks

library(fBasics)
basicStats(y) # gives mean, median, quartiles, etc.
# nobs         10.0000000
# NAs           0.0000000
# Minimum      23.0000000
# Maximum      98.0000000
# 1. Quartile  47.0000000
# 3. Quartile  77.2500000
# Mean         60.0000000
# Median       53.0000000
# Sum         600.0000000
# SE Mean       7.5468905
# LCL Mean     42.9277477
# UCL Mean     77.0722523
# Variance    569.5555556
# Stdev        23.8653631
# Skewness      0.2588899
# Kurtosis     -1.3180867

In the following example the numbers are such that 105 is not beyond the 3-sigma limits, yet it is an outlier by the Tukey method.

y = c(2, 14, 21, 56, 45, 33, 42, 105)
sort(y)
# [1]   2  14  21  33  42  45  56 105
# basicStats(y) now gives nobs 8, Minimum 2, Maximum 105,
# Mean 39.75, Median 37.5, Stdev 31.7119

Old outlier detection limits using the 3-sigma rule:
mean(y) + 3*sd(y) # 134.8856
mean(y) - 3*sd(y) # -55.38562
# These limits are defective: they fail to call 105 an outlier.
# They were criticized by Tukey because they depend on the outliers themselves.
quantile(y, type=2)
#   0%  25%  50%  75% 100%
#  2.0 17.5 37.5 50.5 105.0
Q1 = quantile(y, type=2)[2]
Q3 = quantile(y, type=2)[4]
iqr = Q3 - Q1
TukeyLower = Q1 - 1.5*iqr
TukeyUpper = Q3 + 1.5*iqr
TukeyLower # -32
TukeyUpper # 100
# The Tukey method correctly captures 105 as an outlier (105 > 100),
# while the old 3-sigma method fails to capture it (105 < 134.8856).

Mean, Median, Mode and Variance for Classified Data
The number of classes in which to classify the data 50 98 82 23 46 40 63 52 92 54 is given to be k = 3. Width = Range/(No.
of classes) = (98 - 23)/3 = 25.
The width of the ultimate class intervals should be at least 25. Let us choose Width = 30 to avoid orphan points upon classification.
Chosen lower limit of the 1st class interval = 20 = LowLim. This has to be a round number lower than or equal to Xmin = 23; hence 20 seems to be a good choice here.
The lower limit PLUS the width defines the first interval to be 20 to 50. The midpoint is simply the average of these limits: (20+50)/2 = 70/2 = 35.
Hence the first class interval's midpoint = M1 = 35.
Second class interval's midpoint = M2 = 65.
Third class interval's midpoint = M3 = 95.
Note that these also change by the width. Hence the midpoint of the first dummy interval is M1 - Width = 35 - 30 = 5, and the midpoint of the last dummy interval is M3 + Width = 95 + 30 = 125. These midpoints of dummy intervals must be shown on the horizontal axis to complete the frequency polygon (drawn on top of the frequency histogram by joining consecutive midpoints).
Lower limits for the 3 honest-to-goodness (not dummy) classes are: 20 50 80.
Total frequency = summation of the fj = ∑fj = 10 = n (remember the notation n).
-------------------------------------------------
  j    Low    Up    Mj    fj    Mj*fj
-------------------------------------------------
  1     20    50    35     3     105
  2     50    80    65     4     260
  3     80   110    95     3     285
-------------------------------------------------
Totals                    10     650
-------------------------------------------------
Total frequency = sum of the fj column = denominator for the mean = 10.
The sum of the last column, ∑Mj*fj, is 650.

MEAN for grouped or classified data: xbarC = ∑(Mj*fj)/∑fj
For the example here, the mean from classified data = ∑(Mj*fj)/n = 650/10 = 65.
Compare the above to the earlier computed mean from unclassified data, 60. These need not agree, since when we classify the data we pretend that the data in any class interval are all concentrated at the midpoint. Mj*fj gives the contribution of the j-th interval to the mean.
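The grouped-mean formula can be checked in a couple of Python lines (a sketch; the names M and f are mine, taken from the table above):

```python
# Grouped mean: class midpoints Mj weighted by class frequencies fj.
M = [35, 65, 95]                          # class midpoints M1, M2, M3
f = [3, 4, 3]                             # class frequencies f1, f2, f3
n = sum(f)                                # total frequency, 10
Mjfj = [m * w for m, w in zip(M, f)]      # per-class contributions 105, 260, 285
xbarC = sum(Mjfj) / n                     # 650/10 = 65.0
print(Mjfj, xbarC)
```

Each entry of Mjfj is exactly the last column of the table, and their sum divided by n recovers the classified mean 65.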
It makes sense to think of the grouped mean as a weighted average. A typical weighted average is ∑(Mj*wj)/∑(wj), where Mj is the j-th measurement and wj is its weight. Weighted averages are ubiquitous in the form of index numbers. The Consumer Price Index (CPI) and the Dow Jones Industrial Average (DJIA) are weighted averages of prices. A price index = ∑(Pj*wj)/∑(wj) is an average of prices Pj weighted by their "importance" in the family budget. We want to focus on important price changes (rent, tuition), not small things (pencils, lunch). This is done by making the weights higher for important things. Similarly, the S&P 500 or the DJIA focuses on price changes for the important companies in the economy.

MODE for classified data (the most frequent measurement)
The mode for classified data is the midpoint of the class interval containing the largest frequency. In our example the maximum of the frequencies is 4. Students often make the mistake of reporting the largest frequency 4 as the mode. This is wrong! The mode resides in the second class interval, 50 to 80, where j=2, the frequency fj is 4, and the midpoint M2 is 65. Hence the mode = 65.

SAMPLE VARIANCE s^2 FOR CLASSIFIED OR GROUPED DATA
s^2 = ∑fj*(Mj - xbarC)^2/(n-1)
Let us calculate the formula directly instead of using the shortcut given in the book. The latter is numerically less accurate when the data set is large and there are rounding errors that propagate. First we need to extend the above table with 3 more columns: j is the class number, Mj is the midpoint of the j-th interval, (Mj - xbarC) are the deviations from the group mean, (Mj - xbarC)^2 are the squared deviations, and fj*(Mj - xbarC)^2 are the squared deviations weighted by fj, the corresponding frequencies in the j-th class. (An example with all these columns together appears toward the end of this file.)
-----------------------------------------------------
  j   (Mj - xbarC)   (Mj - xbarC)^2   fj*(Mj - xbarC)^2
-----------------------------------------------------
  1       -30             900              2700
  2         0               0                 0
  3        30             900              2700
-----------------------------------------------------
Sums        0            1800              5400
-----------------------------------------------------
Denominator for the sample variance s^2 = n-1 = 9.
Sample variance for classified data = ∑fj*(Mj - xbarC)^2/(n-1) = 5400/9 = 600.
Classified data standard deviation = square root of the variance = sqrt(600) = 24.4949.
For comparison, the unclassified data standard deviation above was 23.8654.

Classified Data Median (found graphically)
Recall from the notes to Chapter 3 that we find the classified data median graphically. We plot the less-than ogive as well as the greater-than ogive and find their point of intersection. Then we find which measurement corresponds to the intersection point to get the median. Interpolation methods are also available. Remember to plot the less-than-ogive cumulative frequencies against the upper limits of the class intervals (50, 80, 110), and the greater-than-ogive cumulative frequencies against the lower limits (20, 50, 80).

Interval j     dummy   j=1   j=2   j=3=k   dummy
Lower Lim       -10     20    50     80     110
UpLim            20     50    80    110     140
Freq fj           0      3     4      3       0

Less-than ogive (cumulative frequency at each class upper limit):
  20 -> 0,  50 -> 3,  80 -> 7,  110 -> 10,  140 -> 10
Greater-than ogive (cumulative frequency at each class lower limit):
  -10 -> 10,  20 -> 10,  50 -> 7,  80 -> 3,  110 -> 0

A less-than ogive (LTO) shows how many items in the distribution have a value less than the upper limit of each class. It is plotted with the measurements on the horizontal axis (covering all ranges, including the dummy intervals) and total frequency on the vertical axis. It is a useful graphic when shown with the measurement (e.g., income) on the horizontal (x) axis and the LTO on the vertical (y) axis. Given any x-axis value, the graphic tells us how many items are below that value. Conversely, given any y-axis value of the LTO (e.g.
3 persons out of 10, or 30% of the people), the graphic tells us the measurement (income) such that 30% of incomes are below that income level (a 50K income).

A greater-than ogive (GTO) shows how many items in the distribution have a value greater than or equal to the lower limit of each class. The GTO graphic is similar to the LTO, mutatis mutandis (with the necessary changes made). For example, there are 30% earning above 80K in the above table.

A median is defined as that measurement with equal frequency below and above it. For example, a median income of 75K means that 50% of persons earn below 75K and 50% earn above 75K. Therefore the median is obviously given by the point of intersection of the two ogives. The median of classified data is found graphically by dropping a perpendicular from the point of intersection of the two ogives onto the horizontal axis. The answer is read off the horizontal axis, not the height at which the intersection occurs.

R program to draw the plot:
xy = matrix(c(20,0,-10,10, 50,3,20,10, 80,7,50,7, 110,10,80,3, 140,10,110,0), 5, 4, byrow=T)
#> xy
#     [,1] [,2] [,3] [,4]
#[1,]   20    0  -10   10
#[2,]   50    3   20   10
#[3,]   80    7   50    7
#[4,]  110   10   80    3
#[5,]  140   10  110    0
plot(xy[,1], xy[,2], xlim=c(-15,145), type="l", xlab="Income",
  main="Less than and Greater than Ogives", ylab="Cumulative frequency")
lines(xy[,3], xy[,4], lty=2)

Exercise: Criticize the above figure. (The legend is missing; the source is missing.) What is the income of the poorest 20%? Of the richest 20%?
http://www.abs.gov.au/websitedbs/D3310116.NSF/0/c7e40ae1fa39e31e4a2567ac001ffb61?OpenDocument has an example of ogives.
http://www.shodor.org/interactivate/activities/boxplot/ has a good discussion of box plots.
http://lib.stat.cmu.edu/DASL/Stories/mortgagerefusals.html
Acorn is the acronym for the Association of Community Organizations for Reform Now. These data were presented by Acorn to a Joint Congressional Hearing on discrimination in lending.
Acorn concluded that "banks generally have exhibited a pervasive pattern of lending practices that have the effect, intended or not, of racial discrimination. Wide disparities in rejection rates for minority and white applicants, even in comparable income groups, were found in all SMA's, and at nearly every institution studied."
http://lib.stat.cmu.edu/DASL/Datafiles/mortgagerefusalsdat.html has the data.
Number of cases: 20. Variable names:
1. MIN = refusal rate for minority applicants
2. WHITE = refusal rate for white applicants
3. HIMIN = refusal rate for high-income minority applicants
4. HIWHITE = refusal rate for high-income white applicants
The box-and-whisker plots are excellent for comparisons. Note that the plots show low refusal rates for groups 2 and 4, involving white applicants. The plots tell the whole story in an efficient and compact manner.
http://lib.stat.cmu.edu/DASL/Stories/distribution.html has yet another example.

Now consider an example of notched box plots for comparison. The percentiles plotted are 5, 10, 25, 50, 75, 90, 95. The box begins at 10 and ends at 90, the whiskers run between the 5th and 10th percentiles and between the 90th and 95th percentiles, and the notch is at the median.

-------------------------------------------------------------------------------
  j   Lower Lim   UpLim   fj    Mj    fj*Mj   dev = Mj-xbar   dev^2   fj*dev^2
-------------------------------------------------------------------------------
  1      300       320     4   310     1240        -22          484      1936
  2      320       340    10   330     3300         -2            4        40
  3      340       360     6   350     2100         18          324      1944
-------------------------------------------------------------------------------
Sums                      20           6640         -6          812      3920
-------------------------------------------------------------------------------
xbar = 6640/20 = 332
Variance = 3920/(20-1) = 206.3158, sd = 14.3637
Note that the sum of the (unweighted) deviations Mj - xbar is NOT zero for classified data; in the above table it is -6. It is the frequency-weighted sum ∑fj*(Mj - xbar) = -88 - 20 + 108 = 0 that vanishes.
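As a final check, the entire income table above can be recomputed in a short Python sketch (the variable names are mine):

```python
# Grouped mean, variance, and sd for the income classes 300-320, 320-340, 340-360.
low = [300, 320, 340]                           # class lower limits (width 20)
f = [4, 10, 6]                                  # class frequencies
M = [lo + 10 for lo in low]                     # midpoints 310, 330, 350
n = sum(f)                                      # total frequency, 20
xbar = sum(fi * mi for fi, mi in zip(f, M)) / n # 6640/20 = 332.0
dev = [m - xbar for m in M]                     # -22, -2, 18; unweighted sum is -6
var = sum(fi * d * d for fi, d in zip(f, dev)) / (n - 1)   # 3920/19
sd = var ** 0.5
print(xbar, sum(dev), round(var, 4), round(sd, 4))
# The frequency-weighted deviations, unlike the unweighted ones, sum to zero:
assert sum(fi * d for fi, d in zip(f, dev)) == 0
```

This reproduces xbar = 332, variance 206.3158, and sd 14.3637, and confirms the closing remark: the unweighted deviations sum to -6 while the frequency-weighted deviations sum to 0.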