Download Numeric Data II - Colin Reimer Dawson

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Transcript
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
STAT 113
Variability
Colin Reimer Dawson
Oberlin College
13 February 2017
Transformations
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Distribution of a Quantitative Variable
The distribution of a quantitative variable is characterized by:
A. Shape (symmetric, skewed, bimodal, etc.)
B. Center (mean, median)
C. Spread (Interquartile Range, Standard Deviation)
D. Outliers (if any)
Transformations
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Skewness
• A distribution is skewed when the extreme values on one side
are more extreme than those on the other.
• We call a distribution right-skewed when the longer “tail” is
on the right, and left-skewed when the longer tail is on the
left.
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
0.008
0.004
0.000
Density
0.012
Right Skew
0
500
1000
1500
2001 Household Income (Thousands of 2016$)
2000
Transformations
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Distribution of a Quantitative Variable
The distribution of a numeric variable is characterized by:
A. Shape (symmetric, skewed, bimodal, etc.)
B. Center (mean, median)
C. Spread (Interquartile Range, Standard Deviation)
D. Outliers (if any)
Transformations
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Central Tendency
• Often we want to summarize data with a single number
• Usually representing a “typical”, or “middle” value.
• How this is defined depends on the data and the question.
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
The Mean
Intuitively, the mean is the “balance point” of the data.
Mean
n
X
x̄ = (
xi )/n
i=1
• xi is the ith observation
• n is the sample size (number of observations)
Transformations
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
0.008
0.004
0.000
Density
0.012
Household Incomes
0
500
1000
1500
2000
2001 Household Income (Thousands of 2016$)
Figure: Distribution of Household Income (Source: 2001 Survey of
Consumer Finances)
Here, the mean is greater than 70% of the cases, due to the heavy
right skew.
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Skewed Distributions
• The mean is strongly affected by skew and by outliers
• The mean is pulled toward the extreme values.
• In these cases, we generally prefer a measure of central
tendency which is resistant to the influence of extreme values
(also called robust).
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
The Median
Median
The median is the point that cuts the data in half: an equal
amount of data lies above and below the median.
Transformations
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
The Median
0.008
0.004
0.000
Density
0.012
The median is less affected by extreme values, so it is nearer to the
bulk of households.
0
500
1000
1500
2001 Household Income (Thousands of 2016$)
Figure: Distribution of 2001 Household Income
2000
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
The Median
0.008
0.004
0.000
Density
0.012
Zooming in...
0
50
100
150
200
2001 Household Income (Thousands of 2016$)
Figure: Distribution of 2001 Household Income
250
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Median vs. Mean Income
Figure: Median Household Income vs. Per Capita GDP (1947-2007).
Source: www.lanekenworthy.net
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Distribution of a Quantitative Variable
The distribution of a numeric variable is characterized by:
A. Shape (symmetric, skewed, bimodal, etc.)
B. Center (mean, median)
C. Spread (Interquartile Range, Standard Deviation)
D. Outliers (if any)
Transformations
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Measures of Variability
• We want to quantify the consistency, or lack thereof, of the
data.
• A general term for “lack of consistency” is variability.
• We will look at:
• Range
• Interquartile Range
• Variance / Standard Deviation
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
The Range
The range is easy to compute, but not very reliable.
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
−30
−20
−10
0
10
20
30
Fund C1
●
−30
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
−20
−10
0
●
10
20
30
Fund C2
Figure: Historical Annual Returns for Two Hypothetical Index Funds
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
The Range
The range is easy to compute, but not very reliable.
●
●
−10
●●
●● ●
●●
●●
●●●
●●
●●●●●
●
●
●
●●
●●
●●
●
●
●
●●
●
●●
●●
●
●●
●●
●
●
●
●
●
●●●
●●●●●●●●● ●
−5
0
5
●
10
15
10
15
5
10
15
5
10
15
Fund E (Full Data Set)
●
●
−10
● ●
−5
●
0
5
Fund Sample 1
●
−10
−5
●
●
0
●●
Fund Sample 2
●● ●●
−10
−5
●
0
Fund Sample 3
Figure: Annual Returns for 3 random samples of 5 years
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Robust Measures of Variability
• We’d like a more robust measure of variability, which is not
affected so much by extreme values.
• Analogous to the median: describe the “middle” part of the
data.
• The idea: find the “middle half” of the data, and then take its
range.
• Specifically, exclude the lowest 25% and the highest 25%, and
take the difference between the highest and lowest remaining
values.
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Quartiles
• The median divides the data in two.
• Percentiles divide the data into 100 pieces.
. The k th
quartile (written Qk ) is the point below which k quarters of
the data lies.
• Quartiles divide the data into
• So, in terms of quartiles, the median is
value is
, the maximum value is
• We can calculate the range using quartiles as
, the minimum
.
.
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Quartiles
Q0
●
●
20
25
Q1 Q2 Q3
●
30
●
Q4
●
●
●
●
● ●●●●
●●●●●● ● ●
35
Height (in.)
40
45
50
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
The Inter-Quartile Range (IQR)
The Inter-Quartile Range (IQR)
The Inter-Quartile Range (or IQR) is the distance between the
first and third quartiles:
IQR = Q3 − Q1
Pedantic Note
The IQR is a single number, not the two quartiles themselves.
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
The Inter-Quartile Range (IQR)
Q0
Q1 Q2 Q3
Q4
Range
IQR
●
●
20
25
●
30
●
●
●
●
●
● ●●●●
●●●●●● ● ●
35
Height (in.)
40
45
50
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
The Five-Number Summary
Five-number Summary
• The quartiles are very natural to report together to describe
the center and spread of a distribution.
• Q0 through Q4 collectively form the five-number summary
of a quantitative distribution.
Five Number Summary = (xmin , Q1 , Median, Q3 , xmax )
= (Q0 , Q1 , Q2 , Q3 , Q4 )
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Box-and-Whisker Plots
Box-and-Whisker Plots
From the five-number summary, we construct a graph called a
box-and-whisker plot (or just box plot, for short)
1. Draw an axis
2. Draw a rectangle (box) from Q1 to Q3
3. Draw a line across the box (or place a dot) at Q2
4. Draw lines (whiskers) extending outward from the box on both
sides to either
(a) (Simplest version) xmin and xmax .
(b) (R default) Q1 − 1.5IQR and Q3 + 1.5IQR.
5. In version (b), plot points beyond the whiskers individually.
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Box-and-Whisker Plots
Q0
●
●
20
25
Q1 Q2 Q3
Range
●
30
●
Q4
●
● IQR
●
●
● ●●●●
●●●●●● ● ●
35
40
45
50
40
45
50
Height (in.)
20
25
30
35
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Box-and-Whisker Plots
Q0
●
●
20
25
Q1 Q2 Q3
Range
●
30
●
Q4
●
● IQR
●
●
● ●●●●
●●●●●● ● ●
35
40
45
50
40
45
50
Height (in.)
●
20
25
30
35
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
0.000
Density
Box-and-Whisker Plot: Right Skew
0
500
1000
1500
2000
2001 Household Income (Thousands of 2016$)
0
500
1000
1500
2001 Household Income (Thousands of 2016$)
2000
Transformations
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
0.000
Density
Box-and-Whisker Plot: Right Skew
0
500
1000
1500
2000
2001 Household Income (Thousands of 2016$)
●
●
●
●
●
●
●
●●
●
●●●●
●
●●●
●
●
● ●●●
●
0
●
●●●●●
500
●
●
1000
●
●
●
1500
2001 Household Income (Thousands of 2016$)
●
●
2000
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Deviations
Rather than simply measuring the distance between extremes, we
can develop measures based on distance from “center”.
Deviation Scores
For each data point, its deviation score is its “distance” from the
mean.
Deviationi = xi − x̄,
for each i = 1, . . . , n
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Deviations
mean = 36.76
●
●
20
25
●
30
●
●
●
●
●
● ●●●●
●●●●●● ● ●
35
Height (in.)
40
45
50
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Deviations
mean = 36.76
●
●
20
25
●
30
●
●
●
●
●
● ●●●●
●●●●●● ● ●
35
Height (in.)
40
45
50
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Deviations
mean = 36.76
●
●
20
25
●
30
●
●
●
Deviation = 6.24
●
●
● ●●●●
●●●●●● ● ●
35
Height (in.)
40
45
50
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Deviations
mean = 36.76
Deviation = −12.76
●
●
20
25
●
30
●
●
●
Deviation = 6.24
●
●
● ●●●●
●●●●●● ● ●
35
40
45
50
Height (in.)
How can we use these for an overall measure of spread?
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Variance
• If we square all the deviations from the mean and average
them, we get the variance.
Variance
The variance, written s2 , is the average of the squared deviations
from the mean. That is,
Pn
Pn
2
(xi − x̄)2
2
i=1 Deviationi
= i=1
s =
n−1
n−1
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
What’s with that denominator?
• With an average, you’re supposed to divide by the number of
things, aren’t you? Why n − 1?
• Usually we are working with a sample, and are interested in
estimating the population variability.
• We get no information about variability from the first
observation, so there are only n − 1 “degrees of freedom” in
the sample.
• Interesting math side fact: Variance is equivalent to average
squared distance between all distinct pairs of data points.
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Standard Deviation
• Variance (s2 ) is in squared units relative to the data.
• No problem: just take the square root.
Standard Deviation
s=
√
s2 is the standard deviation
s
sP
Pn
n
2
2
√
Deviation
i
i=1
i=1 (xi − x̄)
s = s2 =
=
n−1
n−1
Transformations
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Same range, different s
s = 18.2
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
−30
−20
−10
0
10
20
30
Fund C1
s = 8.1
●
−30
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
−20
−10
0
Fund C2
The standard deviation uses all the data.
●
10
20
30
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Distribution of a Quantitative Variable
The distribution of a numeric variable is characterized by:
A. Shape (symmetric, skewed, bimodal, etc.)
B. Center (mean, median)
C. Spread (Interquartile Range, Standard Deviation)
D. Outliers (if any)
Transformations
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Outliers
• Skewness can be an important feature of a distribution.
• But sometimes a few unusual data points make an otherwise
“well-behaved” distribution look skewed/multimodal.
• When not part of the overall pattern, these are called outliers.
• Sometimes reflect measurement errors (e.g., misplaced
decimal)
• Sometimes represent genuinely unusual observations
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
On-Base Percentage
10
12
A common statistic for batters in baseball is On-Base Percentage
6
Barry Bonds
0
2
4
Density
8
Skewness = 0.630
0.0
0.1
0.2
0.3
0.4
0.5
0.6
On−Base Percentage
Figure: Distribution of major-league hitters with at least 100 Plate
Appearances in 2002.
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
On-Base Percentage
10
12
Distribution without Bonds
6
4
2
0
Density
8
Skewness = 0.199
0.0
0.1
0.2
0.3
0.4
On−Base Percentage
0.5
0.6
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Visualizing Outliers
0.2
0.3
0.4
0.5
On−Base Percentage
●● ● ●
0.2
● ●●
●
0.3
0.4
On−Base Percentage
●
0.5
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Problems with s and s2
• These measures, even more than the mean itself, are heavily
0.008
0.004
0.000
Density
0.012
influenced by extreme values.
0
500
1000
1500
2001 Household Income (Thousands of 2016$)
2000
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Density
0.000
Problems with s and s2
0
500
1000
1500
2000
Density
0.000
2001 Household Income (Thousands of 2016$)
0
500
1000
1500
2000
Density
0.000
2001 Income (Thousands of 2016$) (Top 0.1% Excluded)
0
500
1000
1500
2001 Income (Thousands of 2016$) (Top 1% Excluded)
2000
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Density
0.000
Problems with s and s2
0
50
100
150
200
250
300
250
300
250
300
Density
0.000
2001 Household Income (Thousands of 2016$)
0
50
100
150
200
Density
0.000
2001 Income (Thousands of 2016$) (Top 0.1% Excluded)
0
50
100
150
200
2001 Income (Thousands of 2016$) (Top 1% Excluded)
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Variance-Stabilizing Transformations
• The mean and standard deviation are unstable in the presence
of skew.
• However, they have such useful properties otherwise that it is
often better to try to “remove” skew, rather than fall back on
other measures.
• The most common way to remove skew is by a nonlinear
transformation of the underlying scale.
• Take the original variable, x, and define a new variable
Y = f (x), where f is a one-to-one function.
• Most common case: right-skewed data with positive values
• Logarithmic transform (take y = log(x))
√
• Square Root (take y = x)
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Variance-Stabilizing Transformations
0.000
Density
Original vs. Logarithmic Income Distribution:
0
50
100
150
200
1.0
0.0
Density
2001 Household Income (Thousands of 2016$)
102
103
104
105
2001 Household Income (2016$)
106
250
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Summary
Quantitative Data
Visualizing a quantitative variable
• Dot Plots
• Box-and-Whisker Plots
• Histograms
• Density curves
Describing the distribution of a numeric variable
• Shape (symmetry, skew, modes)
• Center (mean, median)
• Spread (IQR, standard deviation)
• Outliers (if any)
Transformations
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Summary
Shape and Center
• A distribution is skewed when the extreme values on one end
are more extreme than on the other
• We say that it is skewed in the direction of the more extreme
values (e.g., right-skewed if there are a few very large values)
• The mean is the “balance point of the data”, written x̄.
• Mean has nice math properties, but is affected by skew
• The median divides the cases in half
• It is resistant to outliers/skewness
Last Time: Shape and Center
Boxplots and the IQR
Variance and Standard Deviaton
Transformations
Summary
Variability
• The range is unstable for a sample, and is extremely vulnerable
to outliers/skew
• The Interquartile Range (IQR) is the range of the “middle
half” of the data, and is “resistant” (like the median)
• The variance is the “average” of the squared deviations from
each observation to the mean
• The standard deviation is the square root of the variance, in
order to restore units to the original scale
• Nonlinear transformations (log, square root, etc.) can be used
when appropriate to reduce skew and stabilize variance