Measures of Variability
• Chapter 5 of Howell (except 5.3 and 5.4)
• People are all slightly different (that’s what makes it fun)
• Not everyone scores the same on the same scale
• This is interesting for us - we must take it into account
• The variation tells us about the people we studied

Example of variability
• Imagine this variable:
  • 5 7 3 8 2 2 9 1 9 3
• The mean is 4.9
• We sort of expect 4.9 to be representative of the scores, but:
[Figure: histogram of the sample, values 1 to 9 on the horizontal axis]
• The data is at the edges - not at all close to 4.9!

A second sample:
• Look at this one:
  • 4 4 4 5 5 5 5 6 6
• The mean is also 4.9 (rounded)
• But the distribution:
[Figure: histogram of the second sample, values 1 to 9 on the horizontal axis]
• Same mean as before, but the numbers are clustered very close to the mean!

How do we explain this difference?
[Figure: the two histograms side by side, first sample on the left, second sample on the right]
• Both have the same mean
  • the mean obviously doesn’t tell the whole story!
• What is the actual difference between those data sets?
  • The left one is more “spread out” than the one on the right

Variability
• Measures of variability capture this “spreadness” of the data
  • not applicable to nominal variables
• Various ways to measure it
  • How far does the data stretch?
  • How far, on average, is it spread from the mean?
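The two samples above can be checked with a short Python sketch. It assumes each digit in the slides is one single-digit score (that reading reproduces the stated mean of 4.9); the names `sample_a`, `sample_b` and `describe` are made up for illustration.

```python
from collections import Counter
from statistics import mean

# Assumption: each digit on the slide is one score.
sample_a = [5, 7, 3, 8, 2, 2, 9, 1, 9, 3]   # the "spread out" sample
sample_b = [4, 4, 4, 5, 5, 5, 5, 6, 6]      # the "clustered" sample

def describe(sample):
    """Return the mean (rounded to 1 decimal) and a value -> frequency table."""
    counts = Counter(sample)
    return round(mean(sample), 1), dict(sorted(counts.items()))

mean_a, freq_a = describe(sample_a)
mean_b, freq_b = describe(sample_b)

print(mean_a, freq_a)  # same rounded mean ...
print(mean_b, freq_b)  # ... but a very different frequency table
```

The frequency tables play the role of the histograms: the first sample has values from 1 to 9, while the second only uses 4, 5 and 6.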
Extents of the data - the range
• The range is the total width of the data
• Consider x, with a sample
  • 7 4 3 4 5 6 3
• These values range all the way from 3 (the smallest value) to 7 (the biggest value) - its range is 4
• Easy to calculate:
  • $\text{range}_x = \max(x) - \min(x)$
  • (the largest value of x minus the smallest value of x)
• A high range value means the data is very spread out

Example: calculating the range
• Calculate the range for x, from the sample:
  • 26 28 32 15 25 12
• Step 1 - find the largest value of x
  • in this sample, it is 32
• Step 2 - find the smallest value of x
  • in this sample, it is 12
• Step 3 - biggest minus smallest
  • 32 - 12 = 20
• The range is 20

Why the range is cool / why it sucks
• Gives an idea of how far spread the data is
  • a higher range number means the data is more spread apart
• Can compare various samples’ ranges to see which is spread the most
• But: can’t distinguish between these two samples (both have range = 10)
[Figure: two histograms, values 1 to 11 on the horizontal axis; both samples have a range of 10]

A better idea of variation
• The right histogram shows more clustering, but has a few values which “throw off” the range
• Range can be fooled by “extreme values” (outliers)
• There exist better measures which are “outlier proof”

Outlier proofing - Variance
• The variance presents a better measure of data spread
  • not as easily influenced by outliers
• Variance is based on the average distance of the scores from the mean
• It is not on the variable’s scale
  • the variance is not in the same units as the variable
• Still useful - bigger values mean more spread

Calculating variance (brace yourself)
• Variance is calculated using a formula:
  $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
• Variance is (roughly) the mean of the squared deviations of the observations from their mean; the sum is divided by n - 1 rather than n

Calculating variance (in English)
• Easy if broken down into 5 small steps!
• Step 1: Work out the mean of x, and n
• Step 2: For each data point, work out the deviation (x minus the mean of x)
• Step 3: For each data point, square the deviation you got above
• Step 4: Add all the squared deviations together
• Step 5: Divide your sum by n minus 1

Example: working out s²
• Work out the variance for x, based on the sample:
  • 16, 12, 15, 14, 20
• By the numbers!
• Step 1: work out the mean and n
  • n is 5
  • 16 + 12 + 15 + 14 + 20 = 77
  • 77 / 5 = 15.4
  • The mean is 15.4

Example: working out s²
• For the remaining steps, make yourself a table:

  x     x - x̄     (x - x̄)²

• Each column is a step - fill in one at a time

Example: working out s²
• Step 2: Work out the deviation (x minus mean of x)

  x     x - x̄     (x - x̄)²
  16     0.6
  12    -3.4
  15    -0.4
  14    -1.4
  20     4.6

Example: working out the variance
• Step 3: Square the deviations (column 2 times column 2)

  x     x - x̄     (x - x̄)²
  16     0.6       0.36
  12    -3.4      11.56
  15    -0.4       0.16
  14    -1.4       1.96
  20     4.6      21.16

Example: working out the variance
• Step 4: sum the squared deviations
  • 0.36 + 11.56 + 0.16 + 1.96 + 21.16 = 35.2
• Step 5: divide the sum by (n - 1)
  • n = 5
  • n - 1 = 4
  • 35.2 / 4 = 8.8
• The variance of this data set is 8.8
• Simple, but tedious!

Variance: The bad news
• Variance is a good measure of spread, but it is in odd units
• A bigger number means more spread, but the number itself means very little
• Because we square in the formula, we cause the numbers to lose their scale
• The variance of an IQ scale is not in IQ points
• Would be nice to have a measure of variation which is in the correct units!
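The five steps above can be sketched in Python. This is a minimal illustration, not part of the original slides; the function name `sample_variance` is made up here, and the standard library's `statistics.variance` (which also divides by n - 1) is used only as a cross-check.

```python
from statistics import variance

def sample_variance(xs):
    """Sample variance following the five steps from the slides."""
    n = len(xs)
    mean_x = sum(xs) / n                       # Step 1: mean of x, and n
    deviations = [x - mean_x for x in xs]      # Step 2: x minus the mean
    squared = [d ** 2 for d in deviations]     # Step 3: square each deviation
    total = sum(squared)                       # Step 4: add them together
    return total / (n - 1)                     # Step 5: divide by n - 1

x = [16, 12, 15, 14, 20]
s2 = sample_variance(x)

print(round(s2, 2))             # 8.8, matching the worked example
print(round(variance(x), 2))    # cross-check with the standard library
```

Rounding is only there because floating-point arithmetic can leave tiny errors in the last digits.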
The Standard Deviation
• The standard deviation is a measure of variation
• Has all the good properties of the variance
• PLUS it is in the same scale as the variable
  • Standard deviation of IQ scores is expressed in IQ points
• Gives an intuitive understanding of how far apart the scores truly are spread
  – “Scores were centered at 100 and spread by 15”

Calculating the standard deviation
• Very simple formula:
  $s = \sqrt{s^2}$
• To work it out, calculate the variance and then take its square root

Example: working out s
• Work out the standard deviation for x, based on the sample:
  • 16, 12, 15, 14, 20
• Step 1: Work out the variance
  • s² = 8.8 (from the previous example)
• Step 2: find the square root:
  • √8.8 = 2.966
• The standard deviation is 2.966

Variance and standard deviation
• If you have the variance, it is easy to work out the standard deviation
  • Take the square root of the variance
• If you have the standard deviation, it is easy to work out the variance
  • Square it

Using the standard deviation with the mean
• By looking at the mean and std deviation at the same time, we can get a good idea of a variable:
[Figure: histograms of two samples, values 1 to 8 on the horizontal axis. A: mean 5.35, std dev 2.3. B: mean 5.35, std dev 1.008]

Understanding distributions
• The mean tells us the “middle” of the distribution
• The standard dev tells us the “spreadness” of the data
• From this we can derive a lot
  • A low std dev means that everyone scored almost the same
  • A high std dev tells you there was a lot of disagreement
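As a closing sketch, the range and the standard deviation can be computed in a few lines of Python. The helper `data_range` and the variable names are invented for illustration, and the last comparison assumes the two digit strings from the first example are read as single-digit scores.

```python
import math
from statistics import stdev

def data_range(xs):
    """Range: the largest value minus the smallest value."""
    return max(xs) - min(xs)

# Range example from the slides: 32 - 12
scores = [26, 28, 32, 15, 25, 12]
print(data_range(scores))          # 20

# Standard deviation is the square root of the variance (s2 = 8.8 from the example)
s = math.sqrt(8.8)
print(round(s, 3))                 # 2.966

# Cross-check against the standard library's sample standard deviation
x = [16, 12, 15, 14, 20]
print(math.isclose(s, stdev(x)))   # True

# Assumption: each digit in the first example is one score.
spread_out = [5, 7, 3, 8, 2, 2, 9, 1, 9, 3]
clustered = [4, 4, 4, 5, 5, 5, 5, 6, 6]
print(stdev(spread_out) > stdev(clustered))   # True: same mean, more spread
```

The last line puts a number on what the histograms showed: two samples with nearly the same mean can have very different standard deviations.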