Stat 1793
Section 3-3: Measures of Variation (variability, dispersion, spread)

How do the values in a data set vary amongst themselves? A data set whose values are close to each other has lower measures of variation than a set whose values are spread farther apart. Measures of central tendency do not tell the full picture.

Examples:
- A producer of time bombs aims for small variability; it would not be good for 30-minute fuses to range from 10 to 50 minutes.
- A teacher interested in distinguishing between strong students and weak students aims to design assessments with large variability in results; it would not be helpful if all students scored exactly the same mark.
- Two basketball players may have the same average points scored per game, but with different ranges. If one player had a range from 0 to 52 points while the other had a range of 30 to 36 points, it says something about their consistency.

We are going to examine five measures of variation:
1. Range
2. Variance
3. Standard Deviation
4. Coefficient of Variation
5. Interquartile Range (in a later section)

Range

- The difference between the greatest value and the least value in the data set.
- It is the simplest, most easily calculated measure of variation and gives some impression of variation.
- It can be misleading (especially for large sets of data) because it depends entirely on the two extreme values and is insensitive to the values in between.

One use of the range is to evaluate samples with very few items.
Some quality control techniques involve taking periodic small samples and basing further action on the range found in several such samples.

Cholesterol data: Range = 221 - 199 = 22

Variance

The average of the squared differences of all the values from the mean.

The population variance is denoted \(\sigma^2\) and is found by the formula:

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \]

where \(N\) is the population size and \(\mu\) is the population mean.

The sample variance is denoted \(s^2\) and is found by the formula:

\[ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]

where \(n\) is the sample size and \(\bar{X}\) is the sample mean.

Why do we divide by n - 1 instead of n? It can be shown that dividing by n tends to yield sample variances that underestimate the population variance. Decreasing the denominator to n - 1 gives larger values that correct for this. It is also related to degrees of freedom: the number of values in a sample that are free to vary once the sample mean is fixed.

Why do we use the squared distances instead of the actual distances? One reason is that the sum of the actual (signed) distances from the mean will always be zero.

There is also a quicker computational formula for the sample variance. It is:

\[ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n - 1} \]

Example

Consider the sample: 4, 2, 3, 5, 6. Calculate the sample variance.

\[ \bar{X} = \frac{20}{5} = 4 \]

         X_i    X_i^2    X_i - Xbar    (X_i - Xbar)^2
           4       16             0                 0
           2        4            -2                 4
           3        9            -1                 1
           5       25             1                 1
           6       36             2                 4
Total     20       90             0                10

From the definition:

\[ s^2 = \frac{10}{4} = 2.5 \]

From the computational formula:

\[ s^2 = \frac{90 - \dfrac{20^2}{5}}{4} = \frac{10}{4} = 2.5 \]

It appears that the shortcut version saves no time at all, but be assured it does, particularly with large data sets or with means that have a lot of decimals!

Example

Consider the sample: 5.2, 4.2, 3.1, 3.6, 4.7, 4.8, 4.1. Calculate the sample variance.

\[ \bar{X} = \frac{29.7}{7} = 4.242857 \]

         X_i     X_i^2    X_i - Xbar    (X_i - Xbar)^2
         5.2     27.04        .95714            .91612
         4.2     17.64       -.04286            .00184
         3.1      9.61      -1.14286           1.3061
         3.6     12.96       -.64286            .41327
         4.7     22.09        .45714            .20898
         4.8     23.04        .55714            .3104
         4.1     16.81       -.14286            .02041
Total   29.7    129.19       -.00002           3.1771

(The total of -.00002 in the deviations column, rather than 0, is rounding error.)

From the definition:

\[ s^2 = \frac{3.1771}{6} = .5295 \]

From the computational formula:

\[ s^2 = \frac{129.19 - \dfrac{29.7^2}{7}}{6} = .52952381 \]

Much quicker!!

Standard Deviation

Standard deviation is the square root of the variance.
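As a rough check on the two variance examples above, the following Python sketch computes the sample variance both from the definition and from the computational (shortcut) formula, then takes the square root to get the standard deviation. The function names are illustrative and are not part of the course notes.

```python
import math

def sample_variance_definition(values):
    """Sample variance from the definition:
    sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """Sample variance from the computational formula:
    (sum of X^2 - (sum of X)^2 / n) / (n - 1)."""
    n = len(values)
    total = sum(values)
    total_of_squares = sum(x * x for x in values)
    return (total_of_squares - total ** 2 / n) / (n - 1)

# The two samples used in the worked examples.
sample_1 = [4, 2, 3, 5, 6]
sample_2 = [5.2, 4.2, 3.1, 3.6, 4.7, 4.8, 4.1]

for sample in (sample_1, sample_2):
    var_def = sample_variance_definition(sample)
    var_short = sample_variance_shortcut(sample)
    std_dev = math.sqrt(var_def)  # standard deviation = square root of variance
    print(f"definition: {var_def:.4f}  shortcut: {var_short:.4f}  std dev: {std_dev:.4f}")
```

Both functions should agree with the hand calculations (2.5 for the first sample and about .5295 for the second), up to floating-point rounding.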
Why is it necessary to take the square root? Because the variance uses squared distances, its units are the squares of the units of the original data. Taking the square root of the variance puts the standard deviation back into the same units as the data set.

Remember that standard deviation and variance are never negative! (They are zero only when every value in the data set is the same.)

To roughly estimate standard deviation, one rule of thumb is to use Range/4. Because of this, an observation is said to be "unusual" if it falls farther away from the mean than 2 standard deviations.

3-2 Exercises, pages 104-107: # 2, 3, 4, 8, 17

Uses of the Variance and Standard Deviation

- Used to determine the spread of the data. If the variance is large, the data is more dispersed. The information is useful in comparing two or more data sets to determine which is more variable.
- Used to determine the consistency of the variable. For example, in the manufacture of fittings, like nuts and bolts, the variation in the diameters must be small, or the parts will not fit together.
- Used to determine the number of data values that fall within a specified interval in a distribution (Empirical Rule and Chebyshev's Theorem).
- Used often in statistical inference, later in the course.

P.L. Chebyshev (1821-1894), a Russian mathematician, specified proportions of the spread or dispersion of a variable in terms of the standard deviation.

Chebyshev's Theorem

The proportion of values from a data set that will fall within k standard deviations of the mean will be at least

\[ 1 - \frac{1}{k^2} \]

where k is a number greater than 1.

For example, when k = 2, \( 1 - \frac{1}{k^2} = \frac{3}{4} \), so at least 75 % will fall within 2 standard deviations of the mean.

When k = 3, \( 1 - \frac{1}{k^2} = \frac{8}{9} \), so at least 88.9 % will fall within 3 standard deviations of the mean.

Examples

1. The mean price of houses in a certain neighbourhood is $50 000 and the standard deviation is $10 000.
Find the price range within which at least 75 % of the houses will fall.
Answer: $30 000 to $70 000 (at least 75 % corresponds to k = 2, so the range is $50 000 plus or minus 2 x $10 000).

2. A survey of local companies found that the mean amount of travel allowance for executives is $0.25 per mile with a standard deviation of $0.02. What is the minimum percentage of data values that fall between $0.20 and $0.30?
Answer: at least 84 %. The interval (0.20, 0.30) is 0.25 plus or minus 0.05, so 0.05 = k(0.02), giving k = 2.5. Then

\[ 1 - \frac{1}{2.5^2} = 1 - .16 = .84 \]

so at least 84 % will fall between those limits.

Chebyshev's Theorem applies to any distribution regardless of its shape. When a distribution is bell-shaped (or normal) we can apply the following.

Empirical Rule

For data sets having an approximately bell-shaped distribution:
- About 68 % of all values fall within 1 standard deviation of the mean
- About 95 % of all values fall within 2 standard deviations of the mean
- About 99.7 % of all values fall within 3 standard deviations of the mean

Do Exercises 33 and 32 on page 109.

Coefficient of Variation

Comparing variation for 2 or more data sets is fine as long as the units measured are the same for each data set. But what if someone wishes to compare variation between two variables with different units of measurement, for example, number of cars sold and sales commissions? For this we can use coefficients of variation.

The coefficient of variation is the standard deviation divided by the mean, multiplied by 100, and is expressed as a percentage:

\[ CV = \frac{s}{\bar{X}} \cdot 100\% \]

Example

The mean number of car sales over a three-month period is 87 and the standard deviation is 5. The mean of sales commissions is $5225 and the standard deviation is $773. Compare the variations of the two.

Solution: The coefficients of variation are

\[ CV_{\text{car sales}} = \frac{5}{87} \cdot 100\% = 5.7\% \]

\[ CV_{\text{commissions}} = \frac{773}{5225} \cdot 100\% = 14.8\% \]

Commissions are more variable than car sales.
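The Chebyshev and coefficient-of-variation examples in this section can be checked with a short Python sketch. The helper names below are hypothetical, chosen only for illustration, and the numbers are the ones given in the examples.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations
    of the mean, by Chebyshev's Theorem (requires k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 1 - 1 / k ** 2

def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100

# House prices: mean 50 000, standard deviation 10 000, k = 2 (at least 75 %).
house_mean, house_sd, k_house = 50_000, 10_000, 2
low = house_mean - k_house * house_sd
high = house_mean + k_house * house_sd
print(f"At least {chebyshev_lower_bound(k_house):.0%} of prices lie in [{low}, {high}]")

# Travel allowance: mean 0.25, standard deviation 0.02, interval 0.20 to 0.30.
k_travel = 0.05 / 0.02  # half-width of the interval divided by the standard deviation
print(f"k = {k_travel:.1f}, at least {chebyshev_lower_bound(k_travel):.0%}")

# Coefficient of variation: car sales versus sales commissions.
cv_cars = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773, 5225)
print(f"CV car sales: {cv_cars:.1f}%  CV commissions: {cv_commissions:.1f}%")
```

Because the coefficient of variation is unit-free, the two percentages can be compared directly even though one variable is a count and the other is in dollars.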