STAT 101, Module 2: Numerical Summaries of Variables
Questions one wants to quantify
When we examine data about Penn students
(PennStudents.JMP), we might ask:
• How many students are 19 years old?
What fraction of the total are they?
Are there fewer or more of them than 18-year-old students?
• On average, how tall are male and female students?
How spread out are the heights?
How much do male and female heights overlap?
Numerical versus Graphical Summaries
• Graphical methods allow us to
o see the data as a whole
o discover unexpected facts
• Numerical summaries give us
o simplicity by condensing a lot of data into few numbers
o precision, for example, when comparing groups
o ways to reason about uncertainty (stay tuned)
Neither replaces the other.
Numerical Summaries according to Variable Type
• Textbook: part of Chap. 3
• Qualitative variables: “how many in each group?”
o Counts/Frequencies
o Proportions
• Quantitative variables:
o Measures of Location: “where is the data?”
Mean, Median, Quantiles, Minimum, Maximum
o Measures of Dispersion: “how wide is the data?”
Standard Deviation, Range, Interquartile Range
Qualitative Variables: Counts/Frequencies and Proportions
• The following example is from the data PennStudents.JMP
Age is used as an ordinal variable.
AGE (bar plot of the age levels 18 to 24 not reproduced)
Frequencies
Level    Count    Prob
18         128    0.32821
19         139    0.35641
20          70    0.17949
21          33    0.08462
22          14    0.03590
23           4    0.01026
24           2    0.00513
Total      390    1.00000
N Missing 0
7 Levels
• The bar plot gives a good comparison of the frequencies
across the age groups.
• The table gives a list of exact counts and proportions
(“Prob” in JMP).
Count = Frequency (synonyms)
Proportion = Count / Total (= Fraction)
Percentage = Proportion * 100
Algebraic notation: n_i = count of the i’th label
p_i = proportion of the i’th label
Example above: n_1 = 128, n_2 = 139, …, n_7 = 2
p_1 = .328, p_2 = .356, …, p_7 = .005
where label 1 is ‘18’, label 2 is ‘19’, …
Terminology: Level = label, group name
Example: JMP reports 7 levels for ‘Age’.
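The definitions of count, proportion and percentage can be checked with a few lines of Python. This is only a sketch: the counts are re-entered by hand from the JMP frequency table for Age, and the variable names are illustrative, not anything JMP provides.

```python
# Counts per age level, copied by hand from the JMP frequency table for Age.
counts = {18: 128, 19: 139, 20: 70, 21: 33, 22: 14, 23: 4, 24: 2}

total = sum(counts.values())  # total number of students

# Proportion = Count / Total; Percentage = Proportion * 100
proportions = {level: n / total for level, n in counts.items()}
percentages = {level: 100 * p for level, p in proportions.items()}

print(f"{'Level':>5} {'Count':>5} {'Prob':>8} {'Percent':>8}")
for level, n in counts.items():
    print(f"{level:>5} {n:>5} {proportions[level]:>8.5f} {percentages[level]:>7.2f}%")
print(f"{'Total':>5} {total:>5} {sum(proportions.values()):>8.5f}")
```

The proportions always add up to 1 (up to rounding), which is a quick sanity check on any frequency table.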
JMP: To reproduce the above output, you need to convert the
quantitative variable ‘Age’ to qualitative before you do
Analyze > Distribution. The conversion is done as follows:
Right-click the label ‘Age’ above the Age column
> Modeling Type > Ordinal
Quantitative Variables: Measures of Location and Dispersion
• Again, the following example is from PennStudents.JMP:
HEIGHT (histogram and boxplot not reproduced; height axis with ticks at 60, 70, 80)
Quantiles
100.0%    maximum     80.000
99.5%                 76.180
97.5%                 75.000
90.0%                 73.500
75.0%     quartile    71.000
50.0%     median      67.500
25.0%     quartile    65.000
10.0%                 63.000
2.5%                  60.000
0.5%                  57.478
0.0%      minimum     57.000
Moments
Mean              67.754103
Std Dev           3.9749694
Std Err Mean      0.2012804
upper 95% Mean    68.149836
lower 95% Mean    67.358369
N                 390
(Ignore “Std Err Mean”, “upper 95% Mean” and “lower 95%
Mean” in this table. Everything else will be explained in the
next two bullets.)
• Measures of Location: “Where is the data?” (Sec. 3.1)
Textbook: “central tendency”, Sec. 3.1 (ignore ‘population’)
o Mean: average of the values x_1, x_2, …, x_N in column ‘x’
mean(x) = (x_1 + x_2 + … + x_N)/N
In the height data, the mean is reported to be 67.75…
o Median: the middle value of the sorted values in a
column if N is odd, and the average of the two middle
values if N is even.
Examples: If the values in a column are 1,2,3,4,5, the
median is 3. If the values are 1,2,3,4, the median is 2.5
In the height data, the median is reported to be 67.5
o Quantiles: The idea of quantiles is that they divide the
values in a column roughly into, for example, 20% of
values less and 80% greater.
This would be called the 20% quantile.
The same applies to any other percentage.
Sometimes one calls, for example, the 90% quantile the
“upper 10% quantile”. If nothing is said to the
contrary, the percentage of a quantile refers to the
fraction of values that are less.
(Don’t worry about the fine points of defining
quantiles! Trust that JMP has a reasonable general
definition.)
Special cases:
• 50% quantile = median
• 25% and 75% quantiles = lower and upper quartiles
• 10%, 20%, …, 90% quantiles = deciles
• 0% quantile = minimum
• 100% quantile = maximum
In the height data, JMP gives us the 0%, 0.5%, 2.5%, 10%,
25% and 50% quantiles together with their upper counterparts
(100%, 99.5%, 97.5%, 90% and 75%).
Abbreviations: mean(Height), med(Height),
max(Height), min(Height)
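The location measures can be tried out on the two small examples used for the median. The sketch below uses Python's standard `statistics` module. The choice `method="inclusive"` for the quartiles is an assumption made here for illustration; JMP's exact quantile definition may differ in the fine points.

```python
import statistics

odd = [1, 2, 3, 4, 5]   # N is odd: the median is the middle value
even = [1, 2, 3, 4]     # N is even: the median averages the two middle values

mean_odd = statistics.mean(odd)
median_odd = statistics.median(odd)
median_even = statistics.median(even)

# Quartiles = 25%, 50% and 75% quantiles (three cut points).
# The interpolation method is an illustrative choice, not JMP's definition.
quartiles = statistics.quantiles(odd, n=4, method="inclusive")

# The 0% and 100% quantiles are simply the minimum and maximum.
minimum, maximum = min(odd), max(odd)

print("mean:", mean_odd)
print("median (odd N):", median_odd)
print("median (even N):", median_even)
print("lower quartile, median, upper quartile:", quartiles)
print("min, max:", minimum, maximum)
```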
Note 0: In general, mean ≠ median (in the height data, 67.75 versus 67.5).
[Move transformation properties after introducing dispersion
measures. Then explain that it is these properties that
distinguish them]
Note 1: Shifting the values of a variable
If you add a constant value to all the values in a column, every
location measure increases by that value. For the mean
this can be expressed as follows:
mean(x+c) = mean(x)+c
Example: If you re-express degrees Celsius in degrees
Kelvin, you add 273 (more precisely, 273.15). Therefore, add 273 to the means and
quantiles in degrees Celsius and you obtain the means and
quantiles in degrees Kelvin: K° = C° + 273
Note 2: Rescaling the values of a variable
If you multiply all the values in a column by a constant
value, the location measures also get multiplied by that
value. For the mean this can be expressed as follows:
mean(cx) = c·mean(x)
Example: If you convert $ to €, you have to multiply by a
factor of 0.770831727 (2007/01/21). Therefore multiply means
and quantiles in $ by this factor to obtain the means and
quantiles in €.
[Caution: Quantiles other than the median do not strictly
follow this formula when the factor c is negative. Lower
quantiles become upper quantiles and vice versa. Ex.: The
lower quartile becomes the upper quartile if c<0.]
Note 3: Shifting and rescaling the values of a variable
Notes 1 and 2 can be combined.
Example: For translating means and quantiles from degrees
Celsius to degrees Fahrenheit, apply the well-known
conversion formula to the means and quantiles in Celsius and
you obtain the means and quantiles in Fahrenheit:
F° = (9/5) ·C° + 32
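Notes 1 to 3 can be verified numerically. The sketch below uses made-up Celsius temperatures (hypothetical values, not from any course data set) and checks that converting the data and then summarizing gives the same result as summarizing and then converting. It also illustrates the caution about negative factors. The quartile method is an illustrative assumption.

```python
import math
import statistics

# Hypothetical temperatures in degrees Celsius (made up for illustration).
celsius = [10.0, 15.0, 20.0, 25.0, 40.0]


def to_fahrenheit(c):
    """Shift and rescale: F = (9/5) * C + 32."""
    return 9 / 5 * c + 32


def quartiles(values):
    """Lower quartile, median, upper quartile (illustrative method choice)."""
    return statistics.quantiles(values, n=4, method="inclusive")


fahrenheit = [to_fahrenheit(c) for c in celsius]

mean_c, mean_f = statistics.mean(celsius), statistics.mean(fahrenheit)
med_c, med_f = statistics.median(celsius), statistics.median(fahrenheit)

# Summarize-then-convert agrees with convert-then-summarize.
print("mean:  ", mean_f, "vs", to_fahrenheit(mean_c))
print("median:", med_f, "vs", to_fahrenheit(med_c))

# Caution: with a negative factor (here c = -1) the quartiles swap roles.
q_c = quartiles(celsius)
q_neg = quartiles([-c for c in celsius])
print("quartiles of x: ", q_c)
print("quartiles of -x:", q_neg)
```

With c = –1 the lower quartile of –x equals minus the upper quartile of x, which is exactly the swap described in the caution under Note 2.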
Problem: Make up a new measure of location.
Notation: Because the mean is the most important measure
of location, we often abbreviate it as an italic m. That is, m =
mean(x). If more than one variable is in play and we need to
indicate the variable, we may write m_x and m_y. For example,
we might write m_Height and m_Weight.
• Measures of Dispersion: “How wide is the data?” (Sec. 3.2)
Textbook: “variability”, Sec. 3.2 (ignore ‘population’)
o Range = maximum – minimum
This is the (vertical) distance from the topmost point to
the bottommost point in the boxplot.
In the height data, the range is 80 – 57 = 23
o Interquartile Range (IQR):
IQR = upper quartile – lower quartile
This is the (vertical) width of the box in the boxplot.
In the height data: IQR = 71 – 65 = 6
o Standard Deviation (s, sdev, sd, SD, std dev,…):
$$ s = \sqrt{\frac{1}{N-1}\left[(x_1 - m)^2 + (x_2 - m)^2 + \dots + (x_N - m)^2\right]} $$
where m = mean(x). In the height data, s is reported to
be 3.97
This is the most important measure of dispersion!
Questions arise, however: Why squared deviations from
the mean? Why a square root? Why N–1?
This will require more explanation. Stay tuned.
Abbreviations: If we have standard deviations of more
than one column, x and y, say, we have to distinguish
the measures of dispersion. We would then use the
symbols s_x or s(x) and s_y or s(y) for the respective
standard deviations. Similarly, we might use IQR_x or
IQR(x) and IQR_y or IQR(y). For the height data above,
we could write IQR_Height = 6 and s_Height = 3.97.
Terminology:
s^2 = Variance
A look ahead: The variance of stock returns is used in
finance as a measure of “volatility” or “risk” of stock
investments. (Of course the standard deviation could
serve for the same purpose, and so could any other
measure of dispersion, but finance math dictates the use
of variances.)
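The three dispersion measures can be computed by hand on a small example. The sketch below uses a made-up column of eight values (hypothetical, not the height data) and writes out the standard deviation formula with the N–1 divisor, then compares it with `statistics.stdev`, which uses the same divisor. The quartile method is an illustrative assumption, so an IQR computed this way may differ slightly from JMP's.

```python
import math
import statistics

# Hypothetical column of values (made up for illustration).
x = [2, 4, 4, 4, 5, 5, 7, 9]

# Range = maximum - minimum
data_range = max(x) - min(x)

# IQR = upper quartile - lower quartile (illustrative quantile method)
q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
iqr = q3 - q1

# Standard deviation: square root of the summed squared deviations
# from the mean, divided by N - 1.
m = statistics.mean(x)
n = len(x)
variance = sum((xi - m) ** 2 for xi in x) / (n - 1)
s = math.sqrt(variance)

print("range:", data_range)
print("IQR:", iqr)
print("variance s^2:", variance)
print("s by hand:", s, " statistics.stdev:", statistics.stdev(x))

# Range and IQR of the height data, from the quantiles JMP reports.
height_range = 80.0 - 57.0
height_iqr = 71.0 - 65.0
print("height range:", height_range, " height IQR:", height_iqr)
```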
Note 1: Shifting the values of a variable
If you add a constant value to all values in a column,
measures of dispersion do not change. For the standard
deviation this can be expressed as follows:
s_{x+c} = s_x or s(x+c) = s(x)
Idea: The width does not depend on where the distribution is.
Example: If you convert C° to K°, the standard deviation
does not change. Neither do the range nor the IQR.
Note 2: Rescaling the values of a variable
If you multiply all values in a column by a constant value,
measures of dispersion get multiplied by the absolute
value of the constant. For the standard deviation this can be
expressed as follows:
s_{cx} = |c| s_x or s(cx) = |c| s(x)
Idea: If you double the numbers, you double the width.
Example: If you convert $ to €, you have to multiply by a
factor of 0.770831727 (2007/01/21). Therefore, multiply
standard deviations, ranges, IQRs in $ by this factor to
obtain the standard deviations, ranges, IQRs in €.
Note 3: Shifting and rescaling the values of a variable
Notes 1 and 2 can again be combined.
Example: For translating standard deviations, ranges, IQRs
from degrees Celsius to degrees Fahrenheit, multiply them
by the factor 9/5. The shift by 32 has no effect on them.
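The behavior of dispersion measures under shifting and rescaling can be checked numerically. The sketch below uses a made-up column (hypothetical values) and a small helper, `spread`, introduced here only for illustration. The factor –3 is chosen to show that the absolute value |c| is what matters.

```python
import math
import statistics

# Hypothetical column of values (made up for illustration).
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


def spread(values):
    """Return (range, IQR, standard deviation) of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (max(values) - min(values), q3 - q1, statistics.stdev(values))


base = spread(x)
shifted = spread([xi + 273 for xi in x])            # Note 1: add a constant
scaled = spread([-3 * xi for xi in x])              # Note 2: multiply by c = -3
fahrenheit = spread([9 / 5 * xi + 32 for xi in x])  # Note 3: both combined

for name, result in [("base", base), ("shifted", shifted),
                     ("scaled by -3", scaled), ("9/5 x + 32", fahrenheit)]:
    print(f"{name:>12}: range={result[0]:.3f} IQR={result[1]:.3f} s={result[2]:.3f}")
```

Shifting leaves all three measures unchanged, multiplying by –3 triples them, and the Celsius-to-Fahrenheit formula multiplies them by 9/5 regardless of the added 32.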
Problem: Make up a new measure of dispersion.
• Appendix on Standard Deviations and Variances
s^2 = ((x_1–m)^2 + (x_2–m)^2 + … + (x_N–m)^2)/(N–1)
o Q: Why is the variance not a measure of dispersion?
A: If the values x_1, x_2, ..., x_N are multiplied by a
constant c, then s^2 gets multiplied by c^2 and not |c|.
For a measure of dispersion we want that doubling the
values entails doubling the measure of dispersion, not
quadrupling, as is the case for the variance s^2.
This explains the root in the formula for s!
o Q: Why do we divide by N–1 and not N?
A: The deviations from the mean, x_i–m, are not
independent. If we know x_1–m, …, x_{N–1}–m, then we
know x_N–m, because these values sum up to zero:
(x_1–m) + (x_2–m) + … + (x_{N–1}–m) + (x_N–m) = 0
which we can solve for (x_N–m).
The complete answer is more technical, so take this as a
hint. Proof of the identity:
(x_1–m) + (x_2–m) + … + (x_{N–1}–m) + (x_N–m) =
(x_1 + x_2 + … + x_N) – Nm = Nm – Nm = 0.
o Q: Why squares in the first place? Why not absolute
values |x_i–m| ? This would do away with the root!
A: A simple reason is that we can do algebra with
squares but not easily with absolute values.
(A deeper reason has to do with Pythagoras and
probabilities. Stay tuned!)
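Two of the answers in this appendix can be checked numerically: the variance scales with c^2 while s scales with |c|, and the deviations from the mean sum to zero, so the last one is determined by the others. The sketch below uses a made-up column of values (hypothetical data).

```python
import math
import statistics

# Hypothetical column of values (made up for illustration).
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Doubling the values quadruples the variance but only doubles s.
c = 2.0
doubled = [c * xi for xi in x]
var_ratio = statistics.variance(doubled) / statistics.variance(x)
sd_ratio = statistics.stdev(doubled) / statistics.stdev(x)
print("variance ratio:", var_ratio, " sd ratio:", sd_ratio)

# The deviations from the mean sum to zero ...
m = statistics.mean(x)
deviations = [xi - m for xi in x]
total_deviation = sum(deviations)
print("sum of deviations:", total_deviation)

# ... so the last deviation can be recovered from the first N - 1.
last_from_others = -sum(deviations[:-1])
print("last deviation:", deviations[-1], " recovered:", last_from_others)
```

Only N–1 of the N deviations are free to vary, which is the hint behind dividing by N–1.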
A Few Data Examples
• Counts and Proportions: the Titanic data
(Bar plots of the four variables not reproduced.)
CLASS
Frequencies
Level    Count    Prob
1st        325    0.14766
2nd        285    0.12949
3rd        706    0.32076
crew       885    0.40209
Total     2201    1.00000
N Missing 0
4 Levels
AGE
Frequencies
Level    Count    Prob
adult     2092    0.95048
child      109    0.04952
Total     2201    1.00000
N Missing 0
2 Levels
SEX
Frequencies
Level    Count    Prob
female     470    0.21354
male      1731    0.78646
Total     2201    1.00000
N Missing 0
2 Levels
SURVIVED
Frequencies
Level    Count    Prob
no        1490    0.67697
yes        711    0.32303
Total     2201    1.00000
N Missing 0
2 Levels
Lesson: For extreme differences in frequencies, numbers are
superior to pictures. For example, the bar plot shows that there
were very few children on the Titanic, but how few exactly? The
table shows that there were 109 children or about 5% of the
total. This would be difficult to estimate by eyeballing the
bar plot.
• Measures of Location and Dispersion: CEO compensation
Total comp + opt exer /1000 (histogram and boxplot not reproduced)
Quantiles
100.0%    maximum     156168
99.5%                  59045
97.5%                  25266
90.0%                  10493
75.0%     quartile      4412
50.0%     median        1884
25.0%     quartile       903
10.0%                    508
2.5%                     254
0.5%                      16
0.0%      minimum          0
Moments
Mean              4563.4621
Std Dev           9235.1532
Std Err Mean      238.37119
upper 95% Mean    5031.0383
lower 95% Mean    4095.8858
N                 1501
log(TotComp+optexer) (histogram and boxplot not reproduced)
Quantiles
100.0%    maximum     8.1936
99.5%                 7.7714
97.5%                 7.4028
90.0%                 7.0224
75.0%     quartile    6.6461
50.0%     median      6.2763
25.0%     quartile    5.9577
10.0%                 5.7128
2.5%                  5.4165
0.5%                  4.4635
0.0%      minimum     4.4e-16
Moments
Mean              6.3104396
Std Dev           0.5715899
Std Err Mean      0.0147732
upper 95% Mean    6.3394179
lower 95% Mean    6.2814613
N                 1497
Lessons:
1) The distribution of raw compensations is extremely
skewed upwards. This is the reason why the mean and
median are extremely different:
mean(Tot Comp +…) = 4563
med(Tot Comp + …) = 1884 (both in $1000s)
The median is a better measure here because the mean gets pulled
up by the upper extremes and is no longer a typical value.
(If we asked, however, how much each CEO would get if the
sum of all compensations were equally redistributed among
CEOs, we would have to use the mean, never mind the skewed
distribution.)
2) The textbook has a measure of skewness (P. 76f), but we
will not use it. Instead we take a discrepancy between mean
and median as a sign of skewness. The direction of skewness
follows from the order of the two measures:
o mean > median: skewed upwards
o mean < median: skewed downwards
Remember that the mean gets pulled by extreme values, the
median doesn’t. Therefore the mean tells you to which side
the distribution is skewed.
3) The sdev is even more problematic than the mean for very
skewed distributions (squaring blows up extreme values even more
than the raw values). By comparison, the IQR does not lose
its meaning: it always tells how far the upper and lower
quartiles are apart. For the raw compensations, the CEOs at
the upper and lower quartile make about $4.4 million and
$0.9 million, respectively, with a spread of about $3.5 million = IQR.
Messages:
o The mean and sdev are problematic for extremely
skewed distributions. They are more meaningful for
bell-shaped, nearly-symmetric distributions.
o The median and IQR remain meaningful for skewed
distributions.
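The pull of extreme values on the mean can be seen on a tiny example. The sketch below uses made-up compensations in $1000s (hypothetical values, not the CEO data) and a helper, `skew_direction`, introduced here only to encode the mean-versus-median rule from Lesson 2. The last lines redo the arithmetic on the numbers JMP reports for the CEO data.

```python
import statistics

# Hypothetical compensations in $1000s (made up; NOT the CEO data).
comp = [300, 500, 900, 1900, 4400, 10500, 150000]
trimmed = comp[:-1]  # same data without the single extreme value


def skew_direction(mean, median):
    """Rule of thumb from Lesson 2: compare mean and median."""
    if mean > median:
        return "upwards"
    if mean < median:
        return "downwards"
    return "none"


mean_all, med_all = statistics.mean(comp), statistics.median(comp)
mean_trim, med_trim = statistics.mean(trimmed), statistics.median(trimmed)

print("with extreme value:    mean =", round(mean_all, 1), " median =", med_all)
print("without extreme value: mean =", round(mean_trim, 1), " median =", med_trim)
print("skew:", skew_direction(mean_all, med_all))

# Numbers reported by JMP for the CEO data (in $1000s).
reported_mean, reported_median = 4563.4621, 1884
reported_iqr = 4412 - 903  # upper quartile - lower quartile
print("CEO data skew:", skew_direction(reported_mean, reported_median))
print("CEO data IQR:", reported_iqr)
```

Removing one extreme value changes the mean far more than the median, which is why the median and IQR remain meaningful for skewed data.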
Another Appendix: Mean versus Median
Below is a physical illustration of the difference between mean and
median.
[Figure not reproduced: a seesaw balance (left) and a median scale (right)]
• The mean corresponds to the balance point of the data values
on a seesaw balance as drawn on the left, assuming all data
values have the same weight.
• The median requires a scale that only counts how much is
left and how much is right. The scale on the right does this:
the distance of the points from the balance point is irrelevant
as long as they stay on the same side. The reason is that all
their weights get transmitted to equal distances on either side.
(Old-fashioned scales are constructed like the median scale, so
it doesn’t matter where on the platforms one places the goods
and the metal weights.)
[xxx To be added next time:
• s_x = 0 iff x = const
• a_x = mean(|x–med(x)|) before s_x
• use of location and dispersion measures for standardization
(z-scores), with example of equalizing midterm scores (then
remove standardization from Module 3 where it is an
afterthought)
• introduce also the empirical rule and the normal distribution,
even the normal probability plot, to have an interpretation
for the SD
]