Outline and Review of Chapter 4

Measures of Variability

The measures of central tendency can provide an anchor, or a point that tells you where most of the scores can be found. Often, but not always, these points are near the center of the distribution. Measures of variability tell you how widely the scores are scattered or distributed around the measures of central tendency.

The Range

The range and interquartile range are useful ways to describe the variability of any distribution, regardless of its shape. To determine the range, you must first consider the upper real limit and lower real limit of your distribution. If you look at Figure 4.1, you will notice that the blocks representing the 24 individual scores are centered above the corresponding numbers along the x-axis. The x-axis is a number line and, although it only displays whole numbers, you can find the location of values like 2.5 and 9.1, or any other non-whole number, along that line. Therefore, the left edge of the block over the number 1 is actually over 0.5; this is the lower real limit (LRL) of the distribution. Similarly, the right edges of the two blocks over the number 10 are actually at 10.5, which is the upper real limit (URL) of the distribution. The range of the distribution is calculated as

Range = URL - LRL

and for Figure 4.1

Range = 10.5 - 0.5 = 10

If you measure the width of the distribution in blocks, you will find that this equals the range. Another approach is to take the value of the highest score (Max) minus the lowest score (Min) and add one:

Range = Max - Min + 1

Interquartile Range

Quartiles are the points along the x-axis that divide the distribution into four equal groups. Determining the values of the quartiles is simple when the sample size is a multiple of four. For example, Figure 4.1 shows a frequency distribution histogram for a sample of 24 scores, so the quartiles should split this distribution into four groups of 6 blocks.
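As a quick check on the two range formulas given above, here is a minimal Python sketch. The scores are hypothetical whole-number values chosen only so that the minimum is 1 and the maximum is 10, as in Figure 4.1; they are not the figure's actual data, and the function names are illustrative. The sketch assumes scores are recorded in whole units, so each real limit sits half a unit beyond the extreme score.

```python
def real_limits(scores, unit=1):
    """Return (LRL, URL), assuming scores are recorded in steps of `unit`."""
    half = unit / 2
    return min(scores) - half, max(scores) + half


def range_from_real_limits(scores, unit=1):
    """Range = URL - LRL."""
    lrl, url = real_limits(scores, unit)
    return url - lrl


def range_from_extremes(scores, unit=1):
    """Range = Max - Min + 1 (the "+ 1" is one measurement unit)."""
    return max(scores) - min(scores) + unit


# Hypothetical scores for illustration (NOT the data behind Figure 4.1).
scores = [1, 3, 4, 4, 5, 6, 6, 7, 8, 10, 10]

print(real_limits(scores))             # lower and upper real limits
print(range_from_real_limits(scores))  # URL - LRL
print(range_from_extremes(scores))     # Max - Min + 1
```

Both functions should report the same range for whole-number scores, which is the point of the "add one" step in the second formula.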
The first quartile (Q1) is the number that separates the six lowest blocks (the bottom 25%) from the rest of the group. If you wanted to cut the six lowest blocks from the rest, you would place the cut on the x-axis at 4.5. This is Q1. The second quartile (Q2) separates the bottom 50% from the rest of the group and, therefore, divides the distribution into two equal groups. You have been introduced to Q2 before; it is the median. For the data in Figure 4.1, Q2 is 5.5. Finally, Q3 is the value that separates the six highest scores from the rest. Q3 is 7.5.

Figure 4.1. Frequency distribution histogram showing a sample of 24 scores and three quartiles.

The interquartile range (IQR) is often used to describe the variability of the data. It reports the range of the center 50% of scores instead of the whole distribution. The interquartile range is calculated as

IQR = Q3 - Q1

and for Figure 4.1 it is

IQR = 7.5 - 4.5 = 3

Some may also report the semi-interquartile range (SIQR), which is simply IQR/2.

Notice that the interquartile range is generally unaffected by the extreme scores. If the lowest score changes from 1 to 0, the interquartile range will not change. If the highest scores change from 10 to 100, the interquartile range will not change.

If you know the range as well as the interquartile range, you can imagine how the data are grouped. The range will always be at least as large as the IQR, but a distribution with a large range and a very small interquartile range will have half of its scores in a narrow central cluster. If the IQR is almost as large as the range, you would expect to see scores stacked up at the extreme values. These numbers should provide you with enough information to make some educated guesses about the nature of a distribution.

Complicated Quartiles

Determining quartiles can be a bit complicated when you cannot easily divide the data into four equal groups.
For example, if you had a set of 22 scores, you would need to place the quartiles so that they create four groups of 5.5 blocks. This gets tricky and, although there are more complicated ways to do it, a simple and effective way, for our purposes, is to carefully construct a frequency distribution histogram and see where you would need to slice up the distribution. For example, you cannot neatly cut 5.5 whole blocks from the bottom of the distribution, but if you placed Q1 exactly at 4, you would have four complete blocks as well as three half blocks (4 + 1.5 = 5.5). This separates 5.5 blocks from the bottom of the distribution and, therefore, Q1 = 4. The median (Q2) is located at the point that divides the distribution into two groups of eleven (n = 22). This point is at 5.75. Q3 is a little easier to see at 8.

Figure 4.2. Frequency distribution histogram showing a set of 22 scores and three quartiles.

Standard Deviation and Variance of a Population

The standard deviation is a popular and powerful measure of variability, but it is only appropriate for describing normal distributions. Some researchers do not appreciate this, and you may find many violations of this rule in published literature. The standard deviation, as its name implies, measures a standard or typical deviation. In our case, deviation is distance from the mean. The calculation of a standard deviation begins with calculating the mean and then assigning a deviation score to each score in the population.

Deviation Score (the distance between any score and the mean) = X - μ

For example, if a population has a mean of 100 (μ = 100) and one of the people in that population has a raw score of 80, the deviation score for that person would be -20. If a person's score is 107, the deviation score would be 7. Imagine now that you wanted to know the "typical" deviation score for the population.
Just as you would calculate the mean of all the raw scores to get an idea of a typical raw score, you might be inclined to calculate the mean of the deviation scores to get a typical deviation. However, it won't work. The average (or mean) of any set of deviation scores will always be zero because the sum of the deviation scores will always be zero. Numbers below the mean have negative deviation scores that are cancelled out by the positive deviation scores of the numbers above the mean. So, as a rule, the sum of all deviation scores in a population will always be zero.

∑(X - μ) = 0

This makes the average deviation score useless as a measure of variability, since

∑(X - μ)/N = 0/N = 0

There is a way around this problem. If we square the deviation scores before adding them up, we are guaranteed to get a non-negative number (the sum of the squared deviation scores will be zero only if all of the scores in a population are identical, but this is unlikely). The sum of the squared deviation scores is the Sum of Squares (SS), and the formula for calculating SS is:

SS = ∑(X - μ)²

This formula is referred to as the definitional formula for sum of squares. Although this formula makes it clear that SS is the sum of the squared distances between each score and the mean, there is a simpler and more popular formula that will get you the same result:

SS = ∑X² - (∑X)²/N

Because this formula is more commonly used when computing SS, it is referred to as the computational formula for sum of squares.

Once we calculate the sum of the squared deviation scores, we can calculate the average squared distance by dividing SS by the number of scores. The average of the squared deviation scores is the variance, and for a population the symbol for variance is sigma-squared.

Population Variance (σ²) = the average of the squared deviation scores
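Before moving on to the variance formula, the zero-sum rule and the two sum-of-squares formulas above can be checked numerically. The following is a minimal Python sketch using a small hypothetical population (the scores and function names are assumptions made for illustration only). It shows that the deviation scores sum to zero and that the definitional and computational formulas give the same SS.

```python
def deviation_scores(scores):
    """Deviation score for each X: X - mu."""
    mu = sum(scores) / len(scores)
    return [x - mu for x in scores]


def ss_definitional(scores):
    """SS as the sum of squared deviation scores."""
    return sum(d ** 2 for d in deviation_scores(scores))


def ss_computational(scores):
    """SS as (sum of X squared) minus (sum of X) squared over N."""
    n = len(scores)
    return sum(x ** 2 for x in scores) - sum(scores) ** 2 / n


# Hypothetical population of N = 5 scores; its mean is 6.
population = [1, 9, 5, 8, 7]

print(deviation_scores(population))       # negative below the mean, positive above
print(sum(deviation_scores(population)))  # the deviations cancel out
print(ss_definitional(population))
print(ss_computational(population))
```

With real-valued data the two SS results may differ by a tiny floating-point rounding error, so comparing them with a tolerance (for example `math.isclose`) is safer than testing for exact equality.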
σ² = SS/N

The variance is the average squared distance from the mean and, although it is a useful indicator of variability, it is more appropriate to report a typical distance rather than a typical squared distance. The standard deviation gives us this measure of typical distance, and it is simply the square root of the variance:

Population Standard Deviation (σ) = the square root of the Population Variance (σ²)

σ = √(SS/N) = √σ²

Standard Deviation of a Sample

Remember that samples are used to help us make estimates about the population parameters that we can almost never measure directly. Therefore, if we want to use a sample statistic as an estimate of a population parameter, we must consider things that will influence the accuracy of the estimate.

When you sample people from a population, most of your scores will probably come from the middle of the distribution, since that's where most people are. Therefore, a sample mean (M) will probably be the result of averaging a few scores from people who are above the mean and a few scores from people who are below the mean, and your sample mean (M) will be a good estimate of the population mean (μ). Although you may not be sure about the accuracy of your estimate, you will have no particular reason to believe that you have overestimated μ, nor any reason to believe that you have underestimated it. A sample mean is therefore an unbiased estimate of the population mean.

There are some other things that you know about your sample. You know that the sample size (n) is most likely much smaller than the size of the population (N) and, because of this, you are missing quite a few scores. Most of the scores that you are missing are probably "extreme" scores at the very high and very low ends of the distribution. This means that your sample will almost never have a range (or a distribution) that is as wide as the population's.
Therefore, a straight calculation of the sample standard deviation is likely to give you a number that is smaller than the population standard deviation. Unless we compensate for this, the sample variance (s²) and sample standard deviation (s) will be biased estimates of the population variance (σ²) and population standard deviation (σ).

If you want to use the sample standard deviation to estimate the population standard deviation, then instead of dividing SS by the number of scores in your sample (n) when calculating the variance, you should divide it by n - 1, also known as the degrees of freedom (df). The fraction SS/(n - 1) is a little bigger than the fraction SS/n, and this small correction makes the sample variance (s²) and sample standard deviation (s) better estimates of the population variance (σ²) and population standard deviation (σ).

Sample Variance (s²) = an estimate of the Population Variance

s² = SS/(n - 1) = SS/df

Sample Standard Deviation (s) = the square root of the Sample Variance

s = √(SS/df) = √s²

Summary:

Measures of variability almost always accompany a measure of central tendency and describe how widely the data are scattered around it. Almost any set of ordinal, interval, or ratio scale data can be described using the range, quartiles, and interquartile range. The standard deviation offers more predictive power than ranges and quartiles but should only be applied to normally distributed data (often interval or ratio scale data). We will explore the usefulness of the standard deviation in the following chapters.
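To make the difference between the population and sample formulas concrete, here is a minimal Python sketch. The scores are hypothetical, the function names are illustrative, and the same five numbers are treated first as a complete population and then as a sample purely for comparison. The results are cross-checked against Python's standard `statistics` module, where `pvariance` and `pstdev` divide by N while `variance` and `stdev` divide by n - 1.

```python
import math
import statistics


def sum_of_squares(scores):
    """SS: sum of squared deviations from the mean of the scores."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)


def population_variance(scores):
    """sigma squared = SS / N."""
    return sum_of_squares(scores) / len(scores)


def sample_variance(scores):
    """s squared = SS / df, where df = n - 1."""
    df = len(scores) - 1
    return sum_of_squares(scores) / df


# Hypothetical scores for illustration; SS works out to 40 for this set.
scores = [1, 9, 5, 8, 7]

sigma2 = population_variance(scores)  # treats the scores as a whole population
s2 = sample_variance(scores)          # treats the scores as a sample
sigma = math.sqrt(sigma2)
s = math.sqrt(s2)

print(sigma2, sigma)
print(s2, s)

# Cross-check against the standard library.
print(math.isclose(sigma2, statistics.pvariance(scores)))
print(math.isclose(s, statistics.stdev(scores)))
```

Dividing by n - 1 instead of n always gives the larger of the two variances, which is the small upward correction described in this section; the gap shrinks as the sample size grows.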