Explaining Variability
Paul Cohen ISTA 370
Spring, 2012
Checkpoint
Every data value is the result of a complex causal story that
involves many factors, most unmeasured. In general, any datum
might have been different if the causal story had been different.
The fundamental model of data is y = f(x, ε), where x stands for
measured/controlled factors, ε stands for all other factors, and f
combines the effects of these factors.
Variability of y is due to the causal effects of both x and ε.
Variability of y due to x is explained; that due to ε is unexplained.
The job of science is to measure and control factors (x), and
figure out how they combine (f), so that the unexplained
proportion of variability is low enough for one's purposes.
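A minimal simulation sketch of the fundamental model may help here. The linear form of f, the coefficient 3, and the distributions of x and ε are all assumptions made up for illustration; none of them come from the course data. The sketch shows that the sample variance of y splits into a part due to x, a part due to ε, and a cross term that is close to zero when the two are independent.

```r
# Hypothetical illustration: f, its coefficient, and the noise level are assumptions.
set.seed(370)                      # fixed seed so the run is repeatable
n   <- 1000
x   <- rnorm(n, mean = 0, sd = 2)  # a factor we measure
eps <- rnorm(n, mean = 0, sd = 1)  # everything we did not measure
f   <- function(x, eps) 3 * x + eps
y   <- f(x, eps)

var_due_to_x   <- var(3 * x)           # the "explained" part
var_due_to_eps <- var(eps)             # the "unexplained" part
cross_term     <- 2 * cov(3 * x, eps)  # small when x and eps are independent

# The three pieces add up to var(y), up to floating-point error.
total <- var_due_to_x + var_due_to_eps + cross_term
print(c(var_y = var(y), pieces = total))
```

With these made-up settings most of the variability in y comes from x; increasing the standard deviation of eps shifts the balance toward the unexplained part.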
Load HeightC
> read.table.ISTA370<-function(filename){
+   dataURL<-"http://www.sista.arizona.edu/~cohen/ISTA%20370/
+   # Reads a data frame from a URL path rooted at ISTA370 da
+   read.table(paste(dataURL,filename,sep=""))
+ }
> heightC<-read.table.ISTA370("heightC.txt")
> attach(heightC)
How to measure variability
Measuring Variability
> options(width = 60)
> table(height)
height
   61    62    63    64 64.75    66    67  67.5    68  68.5
    2     1     3     3     1     2     3     1     4     1
   69    70    71    72    76
    3     1     3     4     1
Clearly there’s variability to be explained, but to know whether
we’re explaining it, we need a way to measure it!
The Range
> max(height)
[1] 76
> min(height)
[1] 61
> range(height)
[1] 61 76
The range measures variability, but it focuses on the two most
unusual members of a distribution, so it doesn’t tell us much about
how the average person varies from others.
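A small sketch of why the range says little about typical variation. The two samples below are hypothetical heights, not the heightC data, and `spread` is an illustrative helper name. The samples are identical except for one value, yet the range changes a great deal.

```r
# Hypothetical heights; only the last value differs between the two samples.
typical      <- c(64, 66, 67, 68, 68, 69, 70)
with_outlier <- c(64, 66, 67, 68, 68, 69, 84)

spread <- function(v) diff(range(v))  # max minus min

print(spread(typical))
print(spread(with_outlier))
```

Six of the seven people are the same in both samples, so a measure that reflects how the average person differs from others should change much less than this.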
The Average Squared Difference
To get a sense of how much variability there is on average, find the
average squared difference between all pairs of data points:

$$D^2 = \frac{\sum_{i,j=1,\, i \neq j}^{N} (x_i - x_j)^2}{N^2 - N}$$

All pairs of four values (each entry is the row value minus the column value):

        2    7    9   13
  2     0   -5   -7  -11
  7     5    0   -2   -6
  9     7    2    0   -4
 13    11    6    4    0

$$D^2 = \frac{(2 - 7)^2 + (2 - 9)^2 + \ldots + (13 - 9)^2}{4^2 - 4} = 41.833, \qquad D = \sqrt{41.833} \approx 6.468$$

D is the root-mean-square difference between points, a measure of the typical distance between them.
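The calculation for the four values 2, 7, 9, 13 can be sketched in R. This is one possible way to do it, using `outer` to build the table of pairwise differences; the variable names are illustrative.

```r
x <- c(2, 7, 9, 13)             # the four values from the slide
N <- length(x)

diffs <- outer(x, x, "-")       # N x N matrix of x_i - x_j; the diagonal is 0
D2 <- sum(diffs^2) / (N^2 - N)  # diagonal terms add nothing to the sum
D  <- sqrt(D2)

print(diffs)
print(c(D2 = D2, D = D))
```

Because the diagonal entries are zero, summing over the whole matrix gives the same total as summing over the pairs with i ≠ j; only the divisor N² − N reflects the exclusion.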
The average squared difference compares each point to every other
point; it doesn't compare each point to itself. Thus we ignore the
diagonal (note the restriction i ≠ j on the summation), and to find
the average squared difference we divide by N² − N.
The Sample Variance
The sample variance, s², is the average squared distance between
the mean of a sample, x̄, and the points in the sample, x_i (we'll
discuss the N − 1 later). For the four values 2, 7, 9, 13:

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{2 + 7 + 9 + 13}{4} = 7.75$$

$$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1} = \frac{(2 - 7.75)^2 + \ldots + (13 - 7.75)^2}{3} = 20.917$$
Sample Variance vs. Average Squared Difference
$$s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2 = 20.917$$

$$D^2 = \frac{\sum_{i,j=1,\, i \neq j}^{N} (x_i - x_j)^2}{N^2 - N} = 41.833$$

So D² = 2s². The sample variance, s², is proportional to the
average squared difference between individual points in a sample.
The sample variance, which is more commonly used, is half the
average squared difference between points, so it is a good
representation of variability between points.
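The relation D² = 2s² holds for any sample, not only for the four values used here. A short derivation, using only the definitions of x̄, s², and D² given above, is sketched below; the first step adds and subtracts x̄ inside each difference, and the sum may include the terms with i = j because they are zero.

```latex
\begin{align*}
\sum_{i,j=1}^{N} (x_i - x_j)^2
  &= \sum_{i,j=1}^{N} \bigl[(x_i - \bar{x}) - (x_j - \bar{x})\bigr]^2 \\
  &= \sum_{i,j=1}^{N} (x_i - \bar{x})^2
   + \sum_{i,j=1}^{N} (x_j - \bar{x})^2
   - 2 \sum_{i=1}^{N} (x_i - \bar{x}) \sum_{j=1}^{N} (x_j - \bar{x}) \\
  &= 2N \sum_{i=1}^{N} (x_i - \bar{x})^2 - 0 ,
\end{align*}
since $\sum_{i=1}^{N} (x_i - \bar{x}) = 0$. Dividing by $N^2 - N = N(N - 1)$:
\[
D^2 = \frac{2N \sum_{i=1}^{N} (x_i - \bar{x})^2}{N(N - 1)}
    = \frac{2}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2
    = 2 s^2 .
\]
```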
Standard Deviation vs. Average Squared Difference
D² is the average squared difference between two points in a sample,
and D is its square root. This is an easy-to-interpret measure of variability.
The standard deviation, s, is √(s²).
Since D² = 2s², D = s√2.
So the standard deviation times √2 is the root-mean-square
difference between two points in a sample.
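A quick numerical check of D = s√2, sketched on a hypothetical sample (the five values below are made up, not taken from heightC).

```r
x <- c(61, 64, 67.5, 70, 76)               # hypothetical sample
N <- length(x)

D2 <- sum(outer(x, x, "-")^2) / (N^2 - N)  # average squared pairwise difference
D  <- sqrt(D2)
s  <- sd(x)                                # sd() uses the N - 1 denominator

print(c(D = D, s_times_root2 = s * sqrt(2)))
```

The two printed numbers agree because `sd` is the square root of the sample variance, which is the quantity that appears in D² = 2s².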
Sample and Population Variance
Variance
What’s with the N − 1?
Sample variance, population variance:

$$s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2, \qquad \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$$
We’ll ignore the difference for a while and work with the population
variance. Unfortunately, the var function in R computes the sample
variance, so...
> popvar<-function(x){
+   n<-length(x)
+   (((n-1) / n) * var(x))}
> s<-c(2,7,9,13)
> var(s)
[1] 20.91667
> popvar(s)
[1] 15.6875
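As a cross-check, the population variance can also be computed straight from its definition rather than by rescaling `var`. The name `popvar_direct` is hypothetical and not part of the slides; this is only a sketch of the same quantity written another way.

```r
# popvar_direct is an illustrative name, not a function from the slides.
popvar_direct <- function(x) mean((x - mean(x))^2)  # divides by N, not N - 1

s <- c(2, 7, 9, 13)
n <- length(s)

direct   <- popvar_direct(s)
rescaled <- ((n - 1) / n) * var(s)  # the rescaling used by popvar above

print(c(direct = direct, rescaled = rescaled))
```

Both routes give 15.6875 for these four values, matching the output of popvar(s).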
Another form for the Mean and Variance
Suppose a population contains unique elements E = {e_1, e_2, ..., e_k}, and let the
i-th element have probability p_i. Then the mean and variance are:

$$\mu = \sum_{e_i \in E} p_i \times e_i \qquad \sigma^2 = \sum_{e_i \in E} p_i \times (e_i - \mu)^2$$

> s<-c(2,7,9,13)
> p<-c(.25,.25,.25,.25)
> m<-sum(p*s)
> m
[1] 7.75
> sum(p*(s-m)^2)
[1] 15.6875
> popvar(s)
[1] 15.6875
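The probability-weighted form is most useful when values repeat, so that the probabilities are not all equal. The sketch below uses a hypothetical data vector with one repeated value and takes each p_i to be the relative frequency of the corresponding unique element; that choice of p_i is an assumption for illustration.

```r
x <- c(2, 2, 7, 9)                     # hypothetical data with a repeated value

e <- sort(unique(x))                   # unique elements: 2, 7, 9
p <- as.numeric(table(x)) / length(x)  # relative frequencies: 0.5, 0.25, 0.25

mu     <- sum(p * e)
sigma2 <- sum(p * (e - mu)^2)

# Same quantities computed over the raw data, for comparison
print(c(mu = mu, mean_x = mean(x)))
print(c(sigma2 = sigma2, direct = mean((x - mean(x))^2)))
```

`table` orders its counts by the sorted values, which is why `e` is sorted before the two vectors are multiplied together.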
Variance as prediction accuracy
Another Interpretation of Variance
Variance is related to the average distance between points, as we
have seen. It is also related to the accuracy of predictions.
Suppose you have a population E = {2, 7, 9, 13}, where the i-th
element has probability p_i = 1/4. Someone asks you to guess the
value of an element drawn at random. Your best guess would be
µ = 7.75, the population mean. This guessing game has a mean
squared error:

$$\sigma^2 = \sum_{e_i \in E} p_i \times (e_i - \mu)^2$$

So the population variance, σ², is your average squared error when
you play a guessing game and always guess µ, the population mean.
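The claim that the mean is the best guess can be checked numerically. The sketch below uses the values 2, 7, 9, 13 (whose mean is 7.75) with equal probabilities, tries a grid of candidate guesses, and reports the one with the smallest mean squared error. The grid and the helper name `mse` are illustrative choices.

```r
E <- c(2, 7, 9, 13)
p <- rep(1 / 4, 4)

# Mean squared error of always answering `guess`
mse <- function(guess) sum(p * (E - guess)^2)

guesses <- seq(5, 10, by = 0.25)  # candidate guesses; the grid includes 7.75
errors  <- sapply(guesses, mse)
best    <- guesses[which.min(errors)]

print(c(best_guess = best, error_at_best = mse(best)))
```

The smallest error on the grid occurs at the mean, and its value is the population variance of these four numbers.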
Explaining Variability, Redux
The fundamental model of data says y = f(x, ε), where y is
something we want to explain, x is one or more factors, and ε is
other factors that we don't measure or control.
The job of science is to discover x's that explain variability in y.
If variability in y is measured by the variance of y, what does
explaining variability mean?
It means that your error when you play a guessing game about y
is lower if you know the value of x than if you don't.
It means that the variance of y is lower when you know x than
when you don't.
Back to Height...
[Figure: histogram of height; x-axis height (60 to 75), y-axis Frequency (0 to 5)]
variance: s² = 13.025
standard deviation: s = √(s²) = 3.6
average squared difference: D² = 2s² = 26.05
root-mean-square difference: D = s√2 = 5.1 in.
Explaining Variance
Explaining variability in y in terms of x means that the variance of
y is lower when you know x than when you don’t.
> males<-height[gender==0]
> females<-height[gender==1]
> nm<-length(males) ; nf<-length(females)
> popvar(height)
[1] 13.02525      # Average squared error when you don't know gender
> popvar(males) ; popvar(females)
[1] 5.378893      # Average squared error when you know gender=male
[1] 7.18335       # Average squared error when you know gender=female
> ((popvar(males)*nm)+popvar(females)*nf)/(nf+nm)
[1] 6.253781      # Weighted avg. squared error when you know gender
Variance is reduced from 13.03 to 6.25, so the proportion of
variance explained by gender is (13.03 − 6.25)/13.03 = 52%
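The steps above can be wrapped in a small function. This is a sketch on hypothetical toy data, not the heightC data; `prop_explained` is an illustrative name, and `popvar` is redefined so the block runs on its own.

```r
# Hypothetical toy data; popvar is redefined here so the block is self-contained.
popvar <- function(x) mean((x - mean(x))^2)

# Proportion of the variance of y explained by a grouping variable g
prop_explained <- function(y, g) {
  # size-weighted average of the within-group variances
  within <- sum(tapply(y, g, function(v) popvar(v) * length(v))) / length(y)
  (popvar(y) - within) / popvar(y)
}

y <- c(70, 72, 71, 69, 64, 65, 63, 66)  # made-up heights
g <- c(0, 0, 0, 0, 1, 1, 1, 1)          # made-up group labels
toy <- prop_explained(y, g)

# The same arithmetic with the variances printed on the slide
slide <- (13.02525 - 6.253781) / 13.02525

print(c(toy = toy, slide = slide))
```

In the toy data the two groups barely overlap, so the grouping explains most of the variance; with the slide's numbers the proportion comes out at about 0.52.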
Explaining Variability when x is continuous
What if x isn’t a label such as “male” or “female” but is a
continuous variable such as father’s height?
Warning: what follows will be explained later in the semester.
[Figure: scatterplot of height (y-axis, 65 to 75) against dadHeight (x-axis, 66 to 76)]
Explaining Variability in Height
as a Linear Function of dadHeight
> regHeight<-lm(height~dadHeight) # lm means linear model
> regHeight$coefficients
(Intercept)   dadHeight
 24.8031485   0.6130105
height = 0.613 dadHeight + 24.8 + ε
Discuss: If dadHeight explained all of the variability in height, then
all the points would be on a line, and using the line to guess height
given dadHeight, instead of the mean of height, would produce no errors.
[Figure: scatterplot of height against dadHeight with the fitted regression line]
> summary(regHeight)[c("r.squared")]
$r.squared
[1] 0.2130919
r² measures how much better we do in the guessing game by using the
line instead of the mean of height.
Only 21% of the variability in height is explained by dadHeight.
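The guessing-game reading of r² can be checked directly: compare the mean squared error of guessing the mean with that of guessing from the fitted line. The data below are hypothetical father and child heights, not the heightC data, and the variable names are illustrative.

```r
# Hypothetical father/child heights, not the heightC data.
dad   <- c(66, 68, 69, 70, 72, 74, 75)
child <- c(65, 69, 66, 71, 70, 72, 76)

fit <- lm(child ~ dad)

mse_mean <- mean((child - mean(child))^2)  # error when always guessing the mean
mse_line <- mean(residuals(fit)^2)         # error when guessing from the line

r2_by_hand <- 1 - mse_line / mse_mean      # proportional reduction in error
r2_from_lm <- summary(fit)$r.squared

print(c(by_hand = r2_by_hand, from_lm = r2_from_lm))
```

For a least-squares line with an intercept, the two numbers agree, so r² is the proportion by which the line reduces the average squared error of the guesses.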
Explaining Variability in Height: Males Only
> m<-subset(heightC,gender==0)
> regHeightM1<-lm(m$height~m$dadHeight) #linear model
> # Variance explained by dadHeight
> summary(regHeightM1)[c("r.squared")]
$r.squared
[1] 0.36726
> plot(m$dadHeight,m$height)
> abline(regHeightM1)
[Figure: scatterplot of m$height against m$dadHeight with the fitted regression line]
However, if we restrict ourselves to males and use the line to guess height,
then dadHeight explains about 37% of the variability in male height.
> m<-subset(heightC,gender==0)
> regHeightM2<-lm(m$height~m$dadHeight+m$momHeight)
> regHeightM2$coefficients
(Intercept) m$dadHeight m$momHeight
 16.9256258   0.4202007   0.3763106
> # Variance explained by dadHeight and momHeight
> summary(regHeightM2)[c("r.squared")]
$r.squared
[1] 0.4472294
About 45% of the variability in male height is explained
by dadHeight and momHeight together.
> # Variance explained by dadHeight only
> summary(regHeightM1)[c("r.squared")]
$r.squared
[1] 0.36726
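When comparing the two r² values, it helps to know that adding a predictor to a least-squares model with an intercept cannot lower r² on the data used for fitting. The sketch below illustrates this on hypothetical heights (not the heightC data); the variable names are made up.

```r
# Hypothetical heights, not the heightC data.
dad   <- c(66, 68, 69, 70, 72, 74, 75)
mom   <- c(62, 65, 61, 66, 63, 64, 67)
child <- c(65, 69, 66, 71, 70, 72, 76)

r2_one <- summary(lm(child ~ dad))$r.squared        # one predictor
r2_two <- summary(lm(child ~ dad + mom))$r.squared  # two predictors

print(c(dad_only = r2_one, dad_and_mom = r2_two))
```

So an increase in r² after adding a second predictor is expected; the size of the increase is what indicates how much extra variability the new factor helps explain.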
Summary: Explaining Variance
A variable y (e.g., height) has some variability, measured by its variance.
Other factors x, such as gender or father's height, may be used to explain height.
We set up a guessing game, using a function y = f(x, ε) to guess y.
Explaining variance means f(x, ε) does better than the mean of y at guessing y.
Unexplained variance is attributed to factors ε.
[Figure: diagram of the fundamental model of data]
x1, x2, x3, ..., xn: factors we measure or control
y: a datum, such as height
e: all influences that we do not measure or control
X = (x1, x2, ..., xn): the factors we measure/control
y = f(X, e): the fundamental model of data