4.2 Describing Variability

A. Range

Definition - The range of a data set is defined as

range = largest observation - smallest observation

Is this measure of spread affected by extreme scores?

B. Deviations from the Mean

Definition - The n deviations from the sample mean are the differences

$(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$

Miles from school: 1, 2, 3, 4

$\bar{x}$ = ______

x        $(x - \bar{x})$
1        ______
2        ______
3        ______
4        ______
         Σ = ______

The sum of the deviations from the mean is always _______

How can we “fix” this? __________________

C. The Variance and Standard Deviation

The sample variance is denoted by $s^2$. The variance is the average of the squared deviations from the MEAN (using n - 1 as the divisor):

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

The sample standard deviation is the positive square root of the sample variance and is denoted by s. The standard deviation describes the typical (roughly average) deviation from the MEAN:

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Variance is the average of the squared deviations from the mean.

x        $(x - \bar{x})$        $(x - \bar{x})^2$
1        ______                 ______
2        ______                 ______
3        ______                 ______
4        ______                 ______

The sum of $(x - \bar{x})^2$ is large and doesn’t describe the “typical” spread of the data set. How do we “fix” this problem? _____________________

Our value ($s^2$) describes the average of the SQUARED deviations from the mean. This value is not in the same unit of measurement as our original data. How do we “fix” this problem? __________________

Standard deviation is the typical (roughly average) deviation from the mean.

Is this measure of spread affected by extreme scores?

Standard Deviation

1st: the sum of the deviations (differences) from the mean is ALWAYS ZERO.
2nd: so, we square each of the differences (negatives become positives).
3rd: this sum is large and doesn’t reflect the “typical” squared difference.
4th: so, we find the average of the squared deviations (dividing by n - 1). This is $s^2$, the variance.
5th: this value is in squared units and doesn’t describe the spread of our original data, which is not in squared units.
6th: so, we “remove” the square by taking the square root of the variance. This is s, the standard deviation.
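The six steps can be traced in a short Python sketch. It uses the "miles from school" values 1, 2, 3, 4 from the worksheet; the function name `deviation_steps` and the dictionary keys are illustrative choices, not part of the handout.

```python
import math


def deviation_steps(data):
    """Walk through the six steps and return every intermediate result."""
    n = len(data)
    mean = sum(data) / n

    # Step 1: deviations from the mean (these always sum to zero).
    deviations = [x - mean for x in data]

    # Step 2: square each deviation so negatives become positives.
    squared = [d ** 2 for d in deviations]

    # Steps 3-4: average the squared deviations, dividing by n - 1.
    sum_of_squares = sum(squared)
    variance = sum_of_squares / (n - 1)

    # Steps 5-6: take the square root to return to the original units.
    std_dev = math.sqrt(variance)

    return {
        "mean": mean,
        "deviations": deviations,
        "squared": squared,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_dev": std_dev,
    }


miles = [1, 2, 3, 4]
for name, value in deviation_steps(miles).items():
    print(name, value)
```

For these four values the squared deviations sum to 5, so the variance is 5/3 (about 1.67) and the standard deviation is about 1.29 miles.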
Again (because it is important): the standard deviation describes the typical (roughly average) deviation from the mean.

**The derivation of the standard deviation formula is on your test.

Sample statistics estimate population parameters. These estimates should be unbiased.

statistic                                   parameter
$\bar{x}$      unbiased estimator of        $\mu$
$s^2$          unbiased estimator of        $\sigma^2$
s              estimator of                 $\sigma$
$\hat{p}$      unbiased estimator of        p

(Strictly, it is $s^2$ that is exactly unbiased for $\sigma^2$; s slightly underestimates $\sigma$ on average, but it is the standard estimator.)

Why divide by (n - 1)?

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

To find the average of a set of data, we add the data values, then divide by the number of data values (n). But when finding the variance and the standard deviation, we divide by the number of data values minus 1 (n - 1).

In a nutshell, dividing by n - 1 provides a sample variance that is an unbiased estimator of the population variance $\sigma^2$, and dividing by n does not. When we calculate a SAMPLE variance by dividing by n, the result tends to UNDERESTIMATE the population variance, because the deviations are measured from the sample mean $\bar{x}$, which sits closer to the sample values than the population mean $\mu$ does. So, dividing by (n - 1) slightly inflates $s^2$ to make it closer to $\sigma^2$. The same (n - 1) divisor is used for the SAMPLE standard deviation.

D. The Interquartile Range

Definition - The interquartile range (iqr) is a measure of variability that is not as sensitive to the presence of outliers as the standard deviation is. Specifically, the iqr measures the spread of the middle 50% of the data set.

iqr = upper quartile (Q3) - lower quartile (Q1)

(Q1) lower quartile = median of the lower half of the sample
(Q3) upper quartile = median of the upper half of the sample
(If n is odd, the median of the entire sample is excluded from both halves.)

2, 4, 4, 5, 7, 8, 10, 11, 12, 15
2, 4, 4, 5, 7, 8, 10, 11, 12, 15, 16
2, 4, 4, 5, 7, 8, 10, 11, 12, 15, 1600

Is this measure of spread affected by extreme scores?

Resistant statistics are those measures that are not affected by extreme values. That is, an extremely large or small value in a data set does not pull the statistic toward that extreme value.
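The quartile rule in section D can be sketched in Python as follows. The helper names `quartiles` and `iqr` are illustrative; the rule itself (split the sorted sample into halves and drop the overall median when n is odd) is the one stated above. Statistical software often uses slightly different quartile conventions, so results from other tools may not match exactly.

```python
from statistics import median, stdev


def quartiles(data):
    """Return (lower quartile, upper quartile) using the median-of-halves rule."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower_half = values[:half]
    # When n is odd, skip the middle value so it is in neither half.
    upper_half = values[half:] if n % 2 == 0 else values[half + 1:]
    return median(lower_half), median(upper_half)


def iqr(data):
    """Interquartile range: upper quartile minus lower quartile."""
    lower_q, upper_q = quartiles(data)
    return upper_q - lower_q


set_a = [2, 4, 4, 5, 7, 8, 10, 11, 12, 15]
set_b = [2, 4, 4, 5, 7, 8, 10, 11, 12, 15, 16]
set_c = [2, 4, 4, 5, 7, 8, 10, 11, 12, 15, 1600]

for label, data in (("a", set_a), ("b", set_b), ("c", set_c)):
    print(label, "quartiles:", quartiles(data), "iqr:", iqr(data),
          "std dev:", round(stdev(data), 1))
```

The second and third data sets differ only in their largest value (16 versus 1600). Under this rule their iqr is the same, while the standard deviation changes greatly.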
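The claim that dividing by n underestimates the population variance can be checked by brute force on a tiny example. The sketch below makes two assumptions that are not in the handout: it treats the values 1, 2, 3, 4 as an entire population, and it draws every possible sample of size 2 with replacement. It then averages the two versions of the sample variance over all of those samples.

```python
from itertools import product

# Assumption for illustration: these four values are the whole population.
population = [1, 2, 3, 4]
pop_mean = sum(population) / len(population)
pop_variance = sum((x - pop_mean) ** 2 for x in population) / len(population)


def sum_squared_deviations(sample):
    """Sum of squared deviations from the sample's own mean."""
    sample_mean = sum(sample) / len(sample)
    return sum((x - sample_mean) ** 2 for x in sample)


n = 2
# Every possible ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

avg_divide_by_n = sum(sum_squared_deviations(s) / n for s in samples) / len(samples)
avg_divide_by_n_minus_1 = (
    sum(sum_squared_deviations(s) / (n - 1) for s in samples) / len(samples)
)

print("population variance:      ", pop_variance)
print("average using n:          ", avg_divide_by_n)
print("average using n - 1:      ", avg_divide_by_n_minus_1)
```

In this small case the n - 1 version averages out to the population variance, while the n version averages to a smaller value, which is what "unbiased" means here.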
Data set: 1, 2, 3, 4, 5
shape:
center:
mean = ____   median = ____   iqr = ____   standard deviation = 1.6

Data set: 1, 2, 3, 4, 500
shape:
center:
mean = ____   median = ____   iqr = ____   standard deviation = 222.5

resistant measures vs. non-resistant measures

Remember, when describing distributions, start with the shape. The shape of the distribution tells you which measures of center and spread are appropriate for the data set. Skewed distributions and distributions with extreme values are best described by the median and iqr, since the median and iqr are resistant to extreme values (long tails). For symmetric distributions, the mean and standard deviation are the best way to describe the center and spread.

(Mode? - Most of our distributions will be unimodal, so a mention of this at the beginning of the description is appropriate. Range? - Even though the range is non-resistant, a mention of the range in your description is a good idea.)

All descriptions must be in context, in complete sentence format, and include the unit of measurement from the problem.

Cost-to-Charge

Shape? So, which measure of center is appropriate? Which measure of spread is appropriate?

MINITAB

Descriptive Statistics: acrylamide

Variable      N    Mean     Median   TriMean   STDev    SE Mean
acrylamide    7    287.7    270.0    287.7     112.3    42.4

Variable      Minimum   Maximum   Q1       Q3
acrylamide    155.0     497.0     193.0    328.0

Introduction to Statistics and Data Analysis
Measuring Spread (Variability)

Part I. For each of the four pairs of histograms below, choose the statement from the box at the right that best describes the situation.

HINT: Mark the value of the mean on the x-axis and consider where the data lies relative to the mean.
[Figure: four pairs of frequency histograms, labeled A1/B1 through A4/B4. x-axis: value; y-axis: Frequency.]

Pair #1 (A1, B1): means shown are 0.4 and 4.4
Pair #2 (A2, B2): both means = 2.5
Pair #3 (A3, B3): both means = 2.0
Pair #4 (A4, B4): means shown are 3.33 and 2.57

Answer choices for each pair:
A has larger std dev
B has larger std dev
Both graphs have the same std dev
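One way to check an answer for a pair like these is to compute the standard deviation directly from the bar heights. The sketch below uses two made-up histograms (`hist_a` and `hist_b` are hypothetical frequency tables, not the ones from the worksheet) that share the same mean. In one, most of the data sits close to the mean; in the other, most of it sits far away.

```python
import math


def mean_and_std_from_counts(counts):
    """Compute the mean and sample std dev from a {value: frequency} table."""
    n = sum(counts.values())
    mean = sum(value * freq for value, freq in counts.items()) / n
    # Each bar contributes its squared deviation once per observation in it.
    sum_of_squares = sum(freq * (value - mean) ** 2 for value, freq in counts.items())
    return mean, math.sqrt(sum_of_squares / (n - 1))


# Hypothetical histograms: same mean, different spread around it.
hist_a = {0: 1, 1: 2, 2: 7, 3: 7, 4: 2, 5: 1}  # data piled up near the mean
hist_b = {0: 7, 1: 2, 2: 1, 3: 1, 4: 2, 5: 7}  # data piled up far from the mean

mean_a, std_a = mean_and_std_from_counts(hist_a)
mean_b, std_b = mean_and_std_from_counts(hist_b)

print("A: mean", mean_a, "std dev", round(std_a, 2))
print("B: mean", mean_b, "std dev", round(std_b, 2))
```

Both hypothetical histograms have mean 2.5, but B has the larger standard deviation because more of its observations lie far from the mean. This is the comparison the HINT asks for.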