Explaining Variability
Paul Cohen, ISTA 370, Spring, 2012

Checkpoint

Every data value is the result of a complex causal story that involves many factors, most unmeasured. In general, any datum might have been different if the causal story had been different.

The fundamental model of data is $y = f(x, \epsilon)$, where $x$ stands for measured/controlled factors, $\epsilon$ stands for all other factors, and $f$ combines the effects of these factors.

Variability of $y$ is due to the causal effects of both $x$ and $\epsilon$. Variability of $y$ due to $x$ is explained; that due to $\epsilon$ is unexplained.

The job of science is to measure and control factors ($x$), and figure out how they combine ($f$), so that the unexplained proportion of variability is low enough for one's purposes.

Load HeightC

    > read.table.ISTA370<-function(filename){
    + dataURL<-"http://www.sista.arizona.edu/~cohen/ISTA%20370/
    + # Reads a data frame from a URL path rooted at ISTA370 da
    + read.table(paste(dataURL,filename,sep=""))
    + }
    > heightC<-read.table.ISTA370("heightC.txt")
    > attach(heightC)

Measuring Variability

    > options(width = 60)
    > table(height)
    height
       61    62    63    64 64.75    66    67  67.5    68  68.5
        2     1     3     3     1     2     3     1     4     1
       69    70    71    72    76
        3     1     3     4     1

Clearly there's variability to be explained, but to know whether we're explaining it, we need a way to measure it!

Measuring Variability: The Range

    > max(height)
    [1] 76
    > min(height)
    [1] 61
    > range(height)
    [1] 61 76

The range measures variability, but it focuses on the two most unusual members of a distribution, so it doesn't tell us much about how the average person varies from others.
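The weakness of the range can be seen in a small sketch. The two samples `a` and `b` below are made up for illustration (they are not from heightC); they share the same extremes, so their ranges agree, yet the typical difference between two points is quite different.

```r
# Two made-up samples (hypothetical, not from heightC) with the same extremes.
a <- c(61, 68, 68, 68, 68, 76)   # most values clustered in the middle
b <- c(61, 62, 65, 72, 75, 76)   # values spread across the whole interval

range(a)   # 61 76
range(b)   # 61 76

# Average absolute difference between all pairs of points.
# dist() returns the pairwise distances; as.vector() drops its attributes.
avg_diff_a <- mean(as.vector(dist(a)))
avg_diff_b <- mean(as.vector(dist(b)))

avg_diff_a < avg_diff_b   # TRUE: b is more variable, yet the ranges match
```

A measure built from all pairs of points, rather than only the two extremes, is one way to capture this difference.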
Measuring Variability: The Average Squared Difference

To get a sense of how much variability there is on average, find the average squared difference between all pairs of data points.

All pairs of four values (each entry is the row value minus the column value):

          2    7    9   13
     2    0   -5   -7  -11
     7    5    0   -2   -6
     9    7    2    0   -4
    13   11    6    4    0

$$D^2 = \frac{\sum_{i,j=1,\, i \neq j}^{N} (x_i - x_j)^2}{N^2 - N} = \frac{(2-7)^2 + (2-9)^2 + \ldots + (13-9)^2}{4^2 - 4} = 41.833$$

$$D = \sqrt{D^2} \approx 6.47$$

$D$, the square root of the average squared difference, serves as a measure of the average distance between points.

The average squared difference compares each point to every other point; it doesn't compare each point to itself. Thus we ignore the diagonal (note the restriction $i \neq j$ on the summation), and to find the average squared distance we divide by $N^2 - N$.

Measuring Variability: The Sample Variance

The sample variance, $s^2$, is the average squared distance between the mean of a sample, $\bar{x}$, and the points in the sample, $x_i$ (we'll discuss the $N - 1$ later). For the values 2, 7, 9, 13:

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{2 + 7 + 9 + 13}{4} = 7.75$$

$$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1} = \frac{(2 - 7.75)^2 + \ldots + (13 - 7.75)^2}{3} = 20.917$$

Measuring Variability: Sample Variance vs. Average Squared Difference

$$s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2 = 20.917$$

$$D^2 = \frac{\sum_{i,j=1,\, i \neq j}^{N} (x_i - x_j)^2}{N^2 - N} = 41.833$$

So $D^2 = 2s^2$.
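The identity $D^2 = 2s^2$ can be checked numerically on the four example values. The following is a minimal sketch; the names `diffs`, `d2`, and `s2` are chosen here for illustration and do not appear in the slides.

```r
# Check D^2 = 2 * s^2 on the four-value example.
x <- c(2, 7, 9, 13)
n <- length(x)

# All pairwise differences x_i - x_j as an n-by-n matrix; the diagonal is zero.
diffs <- outer(x, x, "-")

# Average squared difference over the n^2 - n off-diagonal pairs.
# The diagonal contributes 0 to the sum, so summing the whole matrix is safe.
d2 <- sum(diffs^2) / (n^2 - n)

# Sample variance, computed by R with the N - 1 denominator.
s2 <- var(x)

d2       # 41.83333
2 * s2   # 41.83333
```

Dividing by `n^2 - n` rather than `n^2` is what makes the match with the sample variance exact; dividing by `n^2` would instead give twice the population variance.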
The sample variance, $s^2$, is proportional to the average squared distance between individual points in a sample. The sample variance, which is more commonly used, is half the average squared distance between points, so it is a good representation of variability between points.

Measuring Variability: Standard Deviation vs. Average Squared Difference

$D^2$ is the average squared difference between two points in a sample, and $D$ is its square root. This is an easy-to-interpret measure of variability.

The standard deviation, $s$, is $\sqrt{s^2}$.

Since $D^2 = 2s^2$, $D = s\sqrt{2}$.

So the standard deviation times $\sqrt{2}$ is the square root of the average squared difference between two points in a sample.

Variance: What's with the $N - 1$?

Sample variance and population variance:

$$s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2, \qquad \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$$

We'll ignore the difference for a while and work with the population variance. Unfortunately, the var function in R computes the sample variance, so...

    > popvar<-function(x){
    + n<-length(x)
    + (((n-1) / n) * var(x))}

Sample and Population Variance

    > s<-c(2,7,9,13)
    > var(s)
    [1] 20.91667
    > popvar(s)
    [1] 15.6875

Another Form for the Mean and Variance

Suppose a population contains unique elements $E = \{e_1, e_2, \ldots, e_k\}$, and let the $i$th element have probability $p_i$.
Then the mean and variance are:

$$\mu = \sum_{e_i \in E} p_i \times e_i \qquad \sigma^2 = \sum_{e_i \in E} p_i \times (e_i - \mu)^2$$

    > s<-c(2,7,9,13)
    > p<-c(.25,.25,.25,.25)
    > m<-sum(p*s)
    > m
    [1] 7.75
    > sum(p*(s-m)^2)
    [1] 15.6875
    > popvar(s)
    [1] 15.6875

Another Interpretation of Variance

Variance is related to the average distance between points, as we have seen. It is also related to the accuracy of predictions.

Suppose you have a population $E = \{2, 7, 9, 13\}$, where the $i$th element has probability $p_i = 1/4$. Someone asks you to guess the value of an element drawn at random. Your best guess would be $\mu = 7.75$, the population mean. This guessing game has a mean squared error:

$$\sigma^2 = \sum_{e_i \in E} p_i \times (e_i - \mu)^2$$

So the population variance, $\sigma^2$, is your average squared error when you play a guessing game and always guess $\mu$, the population mean.

Explaining Variability, Redux

The fundamental model of data says $y = f(x, \epsilon)$, where $y$ is something we want to explain, $x$ is one or more factors, and $\epsilon$ is other factors that we don't measure or control.

The job of science is to discover $x$'s that explain variability in $y$.

If variability in $y$ is measured by the variance of $y$, what does explaining variability mean?

It means that your error when you play a guessing game about $y$ is lower if you know the value of $x$ than if you don't.

It means that the variance of $y$ is lower when you know $x$ than when you don't.

Back to Height...

[Figure: histogram of height; x-axis: height, y-axis: Frequency]

average difference: $\sqrt{D} = \sqrt{5.1} = 2.26$ in.
variance: $s^2 = 13.025$
standard deviation: $s = \sqrt{s^2} = 3.6$
average squared difference: $D^2 = 26.05$; its square root is $D = s\sqrt{2} = 5.1$

Explaining Variance

Explaining variability in $y$ in terms of $x$ means that the variance of $y$ is lower when you know $x$ than when you don't.

    > males<-height[gender==0]
    > females<-height[gender==1]
    > nm<-length(males) ; nf<-length(females)
    > popvar(height)
    [1] 13.02525     # Average squared error when you don't know gender
    > popvar(males) ; popvar(females)
    [1] 5.378893     # Average squared error when you know gender=male
    [1] 7.18335      # Average squared error when you know gender=female
    > ((popvar(males)*nm)+popvar(females)*nf)/(nf+nm)
    [1] 6.253781     # Weighted avg. squared error when you know gender

Variance is reduced from 13.03 to 6.25, so the proportion of variance explained by gender is (13.03 − 6.25)/13.03 = 52%.

Explaining Variance when x is continuous

Explaining variability in $y$ in terms of $x$ means that the variance of $y$ is lower when you know $x$ than when you don't.

What if $x$ isn't a label such as "male" or "female" but is a continuous variable such as father's height?
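The proportion-of-variance-explained calculation for a label such as gender can be wrapped in a small function. This is only a sketch: `prop_explained` is a hypothetical helper that does not appear in the slides, and the data `y` and `g` are made up rather than taken from heightC.

```r
# Population variance, as defined earlier in the slides.
popvar <- function(x) {
  n <- length(x)
  ((n - 1) / n) * var(x)
}

# Hypothetical helper: proportion of the variance of y that is
# explained by knowing the group label g.
prop_explained <- function(y, g) {
  total  <- popvar(y)
  # Population variance within each group, and each group's size.
  within <- tapply(y, g, popvar)
  sizes  <- tapply(y, g, length)
  # Size-weighted average of the within-group variances.
  unexplained <- sum(within * sizes) / sum(sizes)
  (total - unexplained) / total
}

# Made-up data: two groups of three values each.
y <- c(60, 62, 64, 70, 72, 74)
g <- c(1, 1, 1, 0, 0, 0)
result <- prop_explained(y, g)
result
```

With the real data attached, `prop_explained(height, gender)` would be expected to reproduce the 52% figure, assuming `gender` is coded 0/1 as in the slides.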
Warning: what follows will be explained later in the semester.

[Figure: scatterplot of height (y-axis) against dadHeight (x-axis)]

Explaining Variability in Height as a Linear Function of dadHeight

    > regHeight<-lm(height~dadHeight) # lm means linear model
    > regHeight$coefficients
    (Intercept)   dadHeight
     24.8031485   0.6130105

$$\text{height} = 0.613 \times \text{dadHeight} + 24.8 + \epsilon$$

[Figure: scatterplot of height against dadHeight]

Discuss: If dadHeight explained all of the variability in height, then all the points would be on a line, and using the line to guess height given dadHeight, instead of the mean of height, would produce no errors.

    > summary(regHeight)[c("r.squared")]
    $r.squared
    [1] 0.2130919

$r^2$ measures how much better we do in the guessing game by using the line instead of the mean of height.
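The guessing-game reading of $r^2$ can be made concrete by computing it by hand. The sketch below uses simulated data, since heightC is not reproduced here; the seed, the coefficients, and the names `mse_mean`, `mse_line`, and `r2_by_hand` are all assumptions made for illustration.

```r
set.seed(1)                              # arbitrary seed, for reproducibility only
x <- rnorm(50, mean = 70, sd = 3)        # made-up predictor
y <- 25 + 0.6 * x + rnorm(50, sd = 3)    # made-up response with noise

fit <- lm(y ~ x)

# Mean squared error when always guessing the mean of y.
mse_mean <- mean((y - mean(y))^2)
# Mean squared error when guessing with the fitted line.
mse_line <- mean(residuals(fit)^2)

# Proportional reduction in squared error from using the line.
r2_by_hand <- (mse_mean - mse_line) / mse_mean

# Agrees with the value reported by summary().
all.equal(r2_by_hand, summary(fit)$r.squared)
```

Note that `mse_mean` is the population variance of `y`, so this is the same "variance before minus variance after, over variance before" calculation used for gender.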
Only 21% of the variability in height is explained by dadHeight.

Explaining Variability in Height: Males Only

    > m<-subset(heightC,gender==0)
    > regHeightM1<-lm(m$height~m$dadHeight) #linear model
    > # Variance explained by dadHeight
    > summary(regHeightM1)[c("r.squared")]
    $r.squared
    [1] 0.36726
    > plot(m$dadHeight,m$height)
    > abline(regHeightM1)

[Figure: scatterplot of m$height against m$dadHeight with the fitted regression line]

However, if we restrict ourselves to males and use the line to guess height, then dadHeight explains about 37% of the variability in male height.

    > m<-subset(heightC,gender==0)
    > regHeightM2<-lm(m$height~m$dadHeight+m$momHeight)
    > regHeightM2$coefficients
    (Intercept) m$dadHeight m$momHeight
     16.9256258   0.4202007   0.3763106
    > # Variance explained by dadHeight and momHeight
    > summary(regHeightM2)[c("r.squared")]
    $r.squared
    [1] 0.4472294
    > # Variance explained by dadHeight only
    > summary(regHeightM1)[c("r.squared")]
    $r.squared
    [1] 0.36726

About 45% of the variability in male height is explained by dadHeight and momHeight together.

Summary: Explaining Variance

A variable $y$ (e.g., height) has some variability, measured by its variance.

Other factors $x$, such as gender or father's height, may be used to explain height.

We set up a guessing game, using a function $y = f(x, \epsilon)$ to guess $y$.

Explaining variance means $f(x, \epsilon)$ does better than the mean of $y$ at guessing $y$.

Unexplained variance is attributed to factors $\epsilon$.

[Diagram: the fundamental model of data]
x1, x2, x3, ...,
xn: factors we measure or control
y: a datum, such as height
e: all influences that we do not measure or control
X = (x1, x2, ..., xn): the factors we measure/control
y = f(X, e): the fundamental model of data
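The fundamental model can be simulated to watch explained variance grow as more factors are measured. In this sketch the function f is assumed to be a simple weighted sum, and `x1`, `x2`, and `e` are simulated; none of these values come from the slides.

```r
set.seed(2)                  # arbitrary seed, for reproducibility only
n  <- 200
x1 <- rnorm(n)               # first measured factor
x2 <- rnorm(n)               # second measured factor
e  <- rnorm(n)               # all influences we do not measure
y  <- 2 * x1 + 1 * x2 + e    # an assumed f; in practice f is unknown

# Proportion of variance explained when measuring x1 only.
r2_x1   <- summary(lm(y ~ x1))$r.squared
# Proportion of variance explained when measuring x1 and x2.
r2_both <- summary(lm(y ~ x1 + x2))$r.squared

c(x1_only = r2_x1, x1_and_x2 = r2_both)
```

Adding a measured factor to a linear model never lowers $r^2$, and the part that remains unexplained is the contribution of `e`.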