Stat I, professor Vinod, Class notes for chapter 4
Let us use the same example as in my Chapter 3 notes
Original unclassified data are Xi for i=1 to i=10:
50 98 82 23 46 40 63 52 92 54. For example X1=50, X3=82.
The sample mean xbar = 60 is a measure of centering. We cannot calculate the
measure of variability called the variance without first calculating the mean.
The Range = Xmax - Xmin = 98 - 23 = 75 is a measure of dispersion or spread.
Sample Variance s2 and mean absolute deviation (MAD) for unclassified data
Write the data values one below the other and make a table like this:
----------------------------------------------------
 i        (Xi - xbar)   (Xi - xbar)^2   |Xi - xbar|
----------------------------------------------------
 1           -10             100            10
 2            38            1444            38
 3            22             484            22
 4           -37            1369            37
 5           -14             196            14
 6           -20             400            20
 7             3               9             3
 8            -8              64             8
 9            32            1024            32
 10 (=n)      -6              36             6
----------------------------------------------------
Totals         0            5126           190
----------------------------------------------------
Check the sum of deviations from the mean. It should be 0!
Sum of squared deviations= 5126
Denominator for sample variance= n-1 = 9
Unclassified sample variance s^2 = Σ(Xi - xbar)^2 / (n-1) = 5126/9 = 569.556
The standard deviation is the square root of the variance: s = 23.8654
Mean Absolute Deviation (MAD)
Definition: MAD = Σ|Xi - xbar| / n
The sum of absolute deviations from the last column of the table above = 190.
Mean absolute deviation = MAD = 190/10 = 19. This is less sensitive to extreme
observations than the sample variance, because it avoids squaring of large
deviations. (We take absolute values instead of squares.)
Coefficient of Variation: CV = 100*s/xbar
If both the mean and the standard deviation are multiplied by a constant, the CV
is unchanged. If we measure stock returns in British pounds, US dollars, or
euros, the CV remains fixed. This is its main advantage. Otherwise it is just
another measure of dispersion or spread (bunching of the data).
CV = 100*(std dev)/xbar = 100*23.8654/60 = 39.7756
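A minimal R sketch of these unclassified-data computations (variance, standard
deviation, MAD, and CV); the vector name x is just a placeholder:
x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
xbar <- mean(x)                          # 60
s2 <- sum((x - xbar)^2)/(length(x) - 1)  # 569.5556, same as var(x)
s <- sqrt(s2)                            # 23.8654, same as sd(x)
MAD <- sum(abs(x - xbar))/length(x)      # 19 (note: R's own mad() is a different, median-based measure)
CV <- 100*s/xbar                         # 39.7756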
Computation of the Median and other percentiles for unclassified data
First we find the order statistics x(1) to x(n) from Sorted data:
Make a note of the notation where subscripts are in parentheses.
Original data Xi are re-ordered as:
Xmin = x(1)=23, x(2)=40, x(3)=46, x(4)=50, x(5)=52, x(6)=54,
x(7)=63, x(8)=82, x(9)=92, x(10)=98 = x(n) = Xmax.
MEDIAN for unclassified data: (measure of relative position)
The median is not at all sensitive to extreme observations (outliers), hence it
is said to be resistant or robust. The location of the median is at el = n/2 = 5.
Since this is an integer, the median is the average of x(5) and x(6), that is,
the average of the fifth smallest value 52 and the sixth smallest value 54.
Median for unclassified data = 53 here.
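In R, a quick check of the order statistics and the median (a small sketch using
the same data vector):
x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
sort(x)    # 23 40 46 50 52 54 63 82 92 98, the order statistics x(1) to x(10)
median(x)  # 53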
http://lib.stat.cmu.edu/DASL/Stories/ceo.html
has an example of CEO salaries
Now we know the shape or skewness from the unclassified mean and median as follows:
In this example, since mean > median, the underlying frequency distribution is skewed
to the right! (long tail to the right side, or positively skewed)
If the mean were equal to the median it would be symmetric, and if mean < median
it would be skewed to the left.
TRIMMED MEAN can be complicated
Compute 10% trimmed mean
n=10, P=percent trimmed =10, lower case p=10/100=0.10, k=np=1
integer part of (k)=1
fractional part of k=0
Retained number of observations after trimming = 8
The answer is obtained by just averaging the following 8 (retained) numbers:
40, 46, 50, 52, 54, 63, 82, 92
10% trimmed mean = 59.875
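A one-line R check of this simple 10% trimmed mean (a sketch; mean() with trim=
drops floor(n*p) observations from each end):
x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
mean(x, trim = 0.10)  # 59.875: drops the smallest (23) and largest (98) values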
Practical Example of Trimmed Mean (Olympics scoring)
It gives a more reliable, less biased average. A good application is
Olympic scoring in events like gymnastics where the judgment of the
judge is involved. Imagine a biased judge who wants to favor a particular
candidate and make sure that a competing candidate does not win. If the
ordinary average of scores given by 10 judges is used, the biased judge
can give a very high score to the favored candidate and a very low score
to the other. Simple averaging will bring up one and bring down the other.
Not so with trimmed averaging.
The Olympics do not use a simple average, but use trimmed means. They always
ignore the highest and lowest scores and average only the middle scores.
This foils the biased scoring mentioned above.
The following discussion on advanced trimming methods may be skipped.
How to compute a more complicated 13% trimmed mean? It needs many steps:
n=10, P=percent trimmed =13, lower case p=13/100=0.13, k=np=1.3
integer part of k = 1
fractional part of k = 0.3
Retained number of observations after trimming 13% on each side = 7.4
R = n(1-2p) = 10(1-2*0.13) = 10(1-0.26) = 10(0.74) = 7.4
Bottom end trimming
From the second order statistic x(2)=40 we throw out 0.3
and keep the (1-0.3) or 0.7 part. 0.7*40 = 28,
so we keep 28 as the contribution to the trimmed mean from x(2).
The middle order statistics retained for the 13% trimmed mean are:
46, 50, 52, 54, 63, 82
Top end trimming
From the ninth order statistic x(9)=92 we throw out 0.3
and keep the (1-0.3) or 0.7 part. 0.7*92 = 64.4,
so we keep 64.4 as the contribution to the trimmed mean from x(9).
Finally, an example of the advanced trimming method:
the 13% trimmed mean is the average of the 7.4 middle observations,
(28+46+50+52+54+63+82+64.4)/7.4
Answer = 439.4/7.4 = 59.3784
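A hand-rolled R check of this advanced (fractional) trimming; a sketch, not a
built-in function:
xs <- c(23, 40, 46, 50, 52, 54, 63, 82, 92, 98)  # sorted data
p <- 0.13; n <- length(xs); k <- n*p             # k = 1.3
frac <- k - floor(k)                             # 0.3 thrown out of x(2) and x(9)
num <- (1 - frac)*xs[2] + sum(xs[3:8]) + (1 - frac)*xs[9]
num / (n*(1 - 2*p))                              # 439.4/7.4 = 59.3784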
In the R software the following gives the wrong result.
z = c(23, 40, 46, 50, 52, 54, 63, 82, 92, 98)
mean(z, trim=0.13) # wrong result is 59.875
R does not do advanced trimming but only simple trimming.
PERCENTILES for unclassified data
Now we go beyond median and report ADDITIONAL percentiles (used later for a
Notched Box Plot). Such percentiles are useful for various practical purposes.
For example, we may want to define the poor as those earning income among the lowest 5%.
We then need to calculate the fifth percentile of the income distribution.
Alternatively, if we define the rich as those earning among the top 1%, then we need
to compute the 99th percentile of the income data.
Fifth percentile (5%): the location is el = n*5/100 = 0.5.
This is a fraction, so we round up to 1 and choose the first order statistic
x(1) = Xmin = 23 as our fifth percentile.
Tenth percentile (10%): here n=10, P=10, p=0.1, el = np = 1. The location is an
integer, so the percentile is the average of x(el) and x(el+1), that is, the
average of x(1)=23 and x(2)=40, so the tenth percentile is 31.5.
First Quartile = Q1 = 25th percentile = 46
Median for unclassified data = 50th percentile = 53
Third Quartile = Q3 = 75th percentile = 82
Ninetieth percentile (90%) = 95
Ninety-fifth percentile (95%) = 98
IQR = Interquartile range = Q3 - Q1 = 82 - 46 = 36
This is a measure of dispersion, less sensitive to extreme observations than the
variance.
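These textbook percentiles can be checked in R with the type=2 quantile
definition (a sketch; type=2 matches the averaging rule used above):
x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
quantile(x, c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95), type = 2)
#   5%   10%   25%   50%   75%   90%   95%
# 23.0  31.5  46.0  53.0  82.0  95.0  98.0
IQR(x, type = 2)  # 36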
Outlier detection
Limits defined by John Tukey of Princeton.
An outlier might be present on the low end or at the upper end so we need
two detection limits. Outlier means that the measurement is even more
extreme than the designated limit. The choice of 1.5 and IQR may seem
arbitrary, but there is a lot of theory behind it, which is beyond your scope.
Lower detection limit= Low=(Q1-1.5*IQR)= 46-54= 8
Upper detection limit= Upper=(Q3+1.5*IQR)= 82+54= 136
Since there is no data point smaller than the lower detection limit Low= 8 we conclude that there is no outlier on the
left side in this data set. Xmin is only 23, certainly NOT more negative than -8.
Since there is no data point larger than the upper detection limit Upper=136 we conclude that there is no outlier on the
right side in this data set. Xmax is only 98 certainly NOT more than 136.
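A short R sketch of these Tukey fences for the same data, using the textbook
type=2 quartiles:
x <- c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
q <- quantile(x, c(0.25, 0.75), type = 2)      # Q1 = 46, Q3 = 82
iqr <- q[2] - q[1]                             # 36
q[1] - 1.5*iqr                                 # lower fence  -8
q[2] + 1.5*iqr                                 # upper fence 136
any(x < q[1] - 1.5*iqr | x > q[2] + 1.5*iqr)   # FALSE, so no outliers here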
Software R commands use sophisticated interpolation for quantiles.
y=c(50, 98, 82, 23, 46, 40, 63, 52, 92, 54)
quantile(y, c(0.01,0.05, 0.25, 0.50, 0.75, 0.95, 0.99)) # this computes quantiles by interpolation
   1%    5%   25%   50%   75%   95%   99%
24.53 30.65 47.00 53.00 77.25 95.30 97.46
quantile(y, c(0.01,0.05, 0.25, 0.50, 0.75, 0.95, 0.99), type=2)
This is the method (type=2) used by elementary stats textbooks.
library(fBasics)
basicStats(y) #gives mean, median, quartiles,etc
nobs          10.0000000
NAs            0.0000000
Minimum       23.0000000
Maximum       98.0000000
1. Quartile   47.0000000
3. Quartile   77.2500000
Mean          60.0000000
Median        53.0000000
Sum          600.0000000
SE Mean        7.5468905
LCL Mean      42.9277477
UCL Mean      77.0722523
Variance     569.5555556
Stdev         23.8653631
Skewness       0.2588899
Kurtosis      -1.3180867
In the following example, the numbers are such that 105 is not beyond the 3-sigma limits, but is an outlier
by the Tukey method.
> y=c(2,14,21,56,45,33,42,105)
> sort(y)
[1]   2  14  21  33  42  45  56 105
> basicStats(y)
            round.ans..digits...6.
nobs                      8.000000
NAs                       0.000000
Minimum                   2.000000
Maximum                 100.000000
1. Quartile              19.250000
3. Quartile              47.750000
Mean                     39.125000
Median                   37.500000
Sum                     313.000000
SE Mean                  10.697859
LCL Mean                 13.828582
UCL Mean                 64.421418
Variance                915.553571
Stdev                    30.258116
Skewness                  0.710248
Kurtosis                 -0.580865
Old outlier detection limits using 3 sigma rule
mean(y)+3*sd(y) # 134.8856
mean(y)-3*sd(y) #-55.38562
#These limits are defective: they fail to call 105 an outlier
#criticized by Tukey because they depend on the outliers themselves
quantile(y,type=2)
#   0%   25%   50%   75%  100%
#  2.0  17.5  37.5  50.5 105.0
Q1=quantile(y,type=2)[2]
Q3=quantile(y,type=2)[4]
iqr=Q3-Q1
TukeyLower=Q1-1.5*iqr
TukeyUpper=Q3+1.5*iqr
TukeyLower #-32
TukeyUpper #100
#this shows that the Tukey method correctly captures 105 as an outlier
#the old method fails to capture 105 as an outlier
Mean, Median, Mode and Variance for Classified Data
The number of classes in which to classify the data
50 98 82 23 46 40 63 52 92 54 is given to be 3 = k.
Width ≥ Range/(No. of classes) = (98-23)/3 = 25
The width of the ultimate class intervals should be at least 25. Let us choose
Width = 30 to avoid orphan points upon classification.
LowLim ≤ Xmin
Chosen lower limit of the 1st class interval = 20 = LowLim.
This has to be a round number lower than or equal to Xmin = 23,
hence 20 seems to be a good choice here.
Lower limit PLUS the width defines the first interval to be 20 to 50
Mid Point is simply the average of these limits. (20+50)/2=70/2=35
Hence First class interval's midpoint= M1= 35
Second class interval's midpoint= M2= 65
Third class interval's midpoint= M3= 95
Note that these midpoints also increase by the width.
Hence the midpoint of the first dummy interval is (M1 - Width) = 35 - 30 = 5
(its limits are LowLim - Width = -10 and 20), and the midpoint of the last dummy
interval is (M3 + Width) = 95 + 30 = 125 (its limits are 110 and 140).
These midpoints of dummy intervals must be shown on the horizontal axis to
complete the frequency polygon (drawn on top of the frequency histogram by
joining consecutive midpoints).
Lower limits for the 3 (honest to goodness, not dummy) classes are: 20, 50, 80.
Total frequency = Σfj = 10 = n (remember the notation n)
-------------------------------------------------------------------
 j        Low       Up        Mj        fj       Mj*fj
-------------------------------------------------------------------
 1         20       50        35         3        105
 2         50       80        65         4        260
 3         80      110        95         3        285
-------------------------------------------------------------------
Totals                                  10        650
-------------------------------------------------------------------
Total frequency = sum of the fj column = denominator for the mean = 10
Sum of the last column Mj*fj = 650
MEAN for Grouped or Classified Data: xbarC = Σ(Mj*fj)/Σfj
For the example here, the mean from classified data = Σ(Mj*fj)/n = 650/10 = 65
Compare above to the earlier computed mean from unclassified data=60
These need not agree since when we classify the data we pretend that the data in any class interval
are simply all concentrated at the midpoint. Mj*fj gives the contribution of j-th interval to the mean.
It makes sense to think of the grouped mean as a weighted average. A typical weighted
average is Σ(Mj*wj)/Σ(wj), where Mj is the j-th measurement and wj is its weight.
Weighted averages are ubiquitous in the form of index numbers. The Consumer Price Index
(CPI) and the Dow Jones Industrial Average (DJIA) are weighted averages of prices.
A price index = Σ(Pj*wj)/Σ(wj) is an average of prices Pj weighted by their "importance" in the family
budget. We want to focus on important price changes (rent, tuition), not small things (pencils,
lunch). This is done by making the weights higher for important things. Similarly the S&P 500 or DJIA
focuses on price changes for important companies in the economy.
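A small R sketch of the grouped mean as a weighted average of midpoints (M and f
are the midpoints and frequencies from the table above):
M <- c(35, 65, 95)   # class midpoints
f <- c(3, 4, 3)      # class frequencies
sum(M*f)/sum(f)      # 65, the classified-data mean
weighted.mean(M, f)  # same thing, using the frequencies as weights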
MODE for Classified Data (most frequent measurement)
Mode for classified data is the midpoint of the class interval containing the largest frequency.
In our example Max of frequencies=4
Students often make the mistake of reporting the largest frequency 4 as the mode. This is wrong!
The mode resides in the second class interval 50 to 80, where j=2, frequency fj is 4 and the
midpoint M2 is 65. Hence the mode=65.
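A tiny R sketch of picking the mode for classified data (reusing M and f from
above; the mode is the midpoint of the most frequent class, not the frequency):
M <- c(35, 65, 95); f <- c(3, 4, 3)
M[which.max(f)]  # 65, the modal class midpoint (not max(f) = 4)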
SAMPLE VARIANCE s2 FOR CLASSIFIED OR GROUPED DATA
s^2 = Σ fj*(Mj - xbarC)^2 / (n-1)
Let us calculate the formula directly instead of using a shortcut given in the book. The latter is
numerically less accurate when the data set is large and there are rounding errors which propagate.
First we need to extend the above table with 3 more columns: j is the class number, Mj is the
midpoint of the j-th interval, (Mj - xbarC) are deviations from the grouped mean, (Mj - xbarC)^2 are
squared deviations, and fj*(Mj - xbarC)^2 are squared deviations weighted by fj, the corresponding
frequencies in the j-th class. I wish I could have all these columns together! An
example with all columns together is toward the end of this file.
-----------------------------------------------------------
 j      (Mj - xbarC)    (Mj - xbarC)^2    fj*(Mj - xbarC)^2
-----------------------------------------------------------
 1          -30               900                2700
 2            0                 0                   0
 3           30               900                2700
-----------------------------------------------------------
Sums          0              1800                5400
-----------------------------------------------------------
Denominator for s^2 = sample variance = n-1 = 9
Sample variance for classified data = Σ fj*(Mj - xbarC)^2/(n-1) = 5400/9 = 600
Classified data standard deviation = square root of variance = sqrt(600) = 24.4949
For comparison, the unclassified data std dev. above was 23.8654
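A short R sketch of the classified-data variance and standard deviation (same M,
f, and grouped mean as above):
M <- c(35, 65, 95); f <- c(3, 4, 3)
n <- sum(f)                          # 10
xbarC <- sum(M*f)/n                  # 65
s2C <- sum(f*(M - xbarC)^2)/(n - 1)  # 5400/9 = 600
sqrt(s2C)                            # 24.4949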
Classified Data Median (found graphically)
Recall from the notes to chapter 3 that we find the classified data median graphically. We plot the
Less-Than-Ogive as well as plot Greater-Than-Ogive and find the point of intersection. Then we
find which measurement represents the intersection point to get the Median. Interpolation methods
are also available.
Remember to plot the Less-than-ogive cumulative frequencies top to bottom
against upper limits of the class intervals (50, 80, 110).
Also, the greater-than-ogive plots cumulative frequencies from bottom to top
against the lower limits (20, 50, 80).
-----------------------------------------------------------------------------------
Interval j   Income      Income    Freq fj   Less-than ogive      Greater-than ogive
             LowerLim    UpLim               (cum freq, plotted   (cum freq, plotted
                                             at the upper limit)  at the lower limit)
-----------------------------------------------------------------------------------
dummy          -10         20         0         20:  0               -10: 10
j=1             20         50         3         50:  3                20: 10
j=2             50         80         4         80:  7                50:  7
j=3=k           80        110         3        110: 10                80:  3
dummy          110        140         0        140: 10               110:  0
-----------------------------------------------------------------------------------
A less-than ogive (LTO) shows how many items in the distribution have a value less than the upper
limit of each class. This is plotted with measurements on the horizontal axis to cover all ranges
including the dummy and total frequency on the vertical axis. It is a useful graphic when shown
with the measurement (e.g., income) on the horizontal (x-axis) and the LTO on the vertical axis (y-axis). Given any x-axis value, the graphic tells us how many items are below the x-axis value.
Conversely given any y-axis value of LTO, (e.g. 3 persons out of 10 or 30% of people) the graphic
tells us the measurement (income) such that 30% of incomes are below that income level (50K
income).
A greater-than ogive (GTO) shows how many items in the distribution have a value greater than or
equal to the lower limit of each class. The GTO graphic is similar to LTO, mutatis mutandis (with
the necessary changes being made). For example, there are 30% earning above 80K in above table.
A median is defined as that measurement where there is equal frequency below and above this
measurement. For example median income 75K means that there are 50% persons earning below
75K and 50% persons earning above 75K. Therefore the median is obviously given by the point of
intersection of the two ogives.
Median of classified data is found graphically by dropping a perpendicular from the point of
intersection of the two ogives on the horizontal axis. The answer is found on the horizontal axis,
not the height where intersection occurs.
R program to draw the plot
xy = matrix(c(20,  0, -10, 10,
              50,  3,  20, 10,
              80,  7,  50,  7,
             110, 10,  80,  3,
             140, 10, 110,  0), 5, 4, byrow=TRUE)
#> xy
#     [,1] [,2] [,3] [,4]
#[1,]   20    0  -10   10
#[2,]   50    3   20   10
#[3,]   80    7   50    7
#[4,]  110   10   80    3
#[5,]  140   10  110    0
plot(xy[,1], xy[,2], xlim=c(-15,145), type="l", xlab="Income",
     main="Less than and Greater than Ogives", ylab="Cumulative frequency")
lines(xy[,3], xy[,4], lty=2)
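The graphical median is the income value where the two ogives cross. A small R
sketch locating that intersection numerically (approxfun and uniroot are base R;
the search interval is just the range of class limits):
lto <- approxfun(c(20, 50, 80, 110, 140), c(0, 3, 7, 10, 10))  # less-than ogive
gto <- approxfun(c(-10, 20, 50, 80, 110), c(10, 10, 7, 3, 0))  # greater-than ogive
uniroot(function(x) lto(x) - gto(x), interval = c(20, 110))$root
# about 65: the classified-data median, at cumulative frequency n/2 = 5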
Exercise: Criticize the above figure. (Legend missing, source missing)
What is the income for the poorest 20%? For the richest 20%?
http://www.abs.gov.au/websitedbs/D3310116.NSF/0/c7e40ae1fa39e31e4a2567ac001ffb61
?OpenDocument
has an example of ogives
http://www.shodor.org/interactivate/activities/boxplot/
has good discussion of box plots
http://lib.stat.cmu.edu/DASL/Stories/mortgagerefusals.html
Acorn is the acronym for Association of Community Organizations for Reform Now. These data
were presented by Acorn to a Joint Congressional Hearing on discrimination in lending. Acorn
concluded that "banks generally have exhibited a pervasive pattern of lending practices that have
the effect, intended or not, of racial discrimination. Wide disparities in rejection rates for minority
and white applicants, even in comparable income groups, were found in all SMA's, and at nearly
every institution studied."
http://lib.stat.cmu.edu/DASL/Datafiles/mortgagerefusalsdat.html
has the data
Number of cases: 20
The Box and Whisker plots are excellent for comparisons. Note that the plots
show low refusal rates for groups 2 and 4, involving white applicants. The plots
tell the whole story in an efficient and compact manner.
Variable Names:
1. MIN = refusal rate for minority applicants
2. WHITE = refusal rate for white applicants
3. HIMIN = refusal rate for high income minority applicants
4. HIWHITE = refusal rate for high income white applicants
http://lib.stat.cmu.edu/DASL/Stories/distribution.html
has yet another example
Now consider an example of notched box plots for comparison
The percentiles plotted are 5, 10, 25, 50, 75, 90, 95.
The box begins at the 10th percentile and ends at the 90th.
Whiskers run between the 5th and 10th percentiles and between the 90th and 95th percentiles.
The notch is at the median.
--------------------------------------------------------------------------------------
 j   LowerLim   UpLim    fj     Mj    fj*Mj    Mj-xbar   (Mj-xbar)^2   fj*(Mj-xbar)^2
--------------------------------------------------------------------------------------
 1      300      320      4    310     1240       -22          484             1936
 2      320      340     10    330     3300        -2            4               40
 3      340      360      6    350     2100        18          324             1944
--------------------------------------------------------------------------------------
Sums                     20            6640        -6          812             3920
--------------------------------------------------------------------------------------
xbar = 6640/20 = 332
var = 3920/19 = 206.3158
sd = 14.3637
Note that the unweighted sum of the midpoint deviations (Mj - xbar) is NOT zero
for classified data; in the above table it is -6. (The frequency-weighted sum
Σ fj*(Mj - xbar) is still zero.)
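A closing R sketch reproducing the computations in this last table:
M <- c(310, 330, 350); f <- c(4, 10, 6)
n <- sum(f)                          # 20
xbar <- sum(f*M)/n                   # 6640/20 = 332
sum(M - xbar)                        # -6, the unweighted midpoint deviations
varC <- sum(f*(M - xbar)^2)/(n - 1)  # 3920/19 = 206.3158
sqrt(varC)                           # 14.3637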