Outline and Review of Chapter 4
Measures of Variability
The measures of central tendency can provide an anchor, or a point that tells you where most
of the scores can be found. Often, but not always, these points are near the center of the
distribution. Measures of variability tell you how widely the scores are scattered or distributed
around the measures of central tendency.
The Range
The range and interquartile range are useful ways to describe the variability of any distribution,
regardless of its shape. To determine the range, you must first consider the upper real limit and
lower real limit of your distribution. If you look at Figure 4.1, you will notice that the blocks representing the 24 individual scores are centered above the corresponding numbers along the x-axis. The x-axis is a number line and, although it only displays whole numbers, you can find the locations of values like 2.5 and 9.1, or any non-whole number you can imagine, along that line. Therefore, the left edge of the block over the number 1 is actually at 0.5; this is the lower real limit (LRL) of the distribution. Similarly, the right edges of the two blocks over the number 10 are actually at 10.5, which is the upper real limit (URL) of the distribution. The range of the distribution is calculated as
Range = URL-LRL
and for Figure 4.1
Range = 10.5 - 0.5 = 10
If you measure the width of the distribution in blocks, you will find that this equals the range.
Another approach is to take the value of the highest score (Max) minus the lowest score (Min)
and add one:
Range = Max - Min + 1
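Both versions of the formula are easy to check in code. The sketch below uses a small made-up set of whole-number scores (not the actual Figure 4.1 data) and shows that the real-limits method and the Max - Min + 1 shortcut agree:

```python
# Range of whole-number scores, computed two equivalent ways.
# These scores are made up for illustration, not the Figure 4.1 data.
scores = [1, 3, 4, 4, 5, 5, 6, 7, 9, 10]

# Real-limits method: each whole-number score occupies a one-unit block,
# so LRL = Min - 0.5 and URL = Max + 0.5.
lrl = min(scores) - 0.5
url = max(scores) + 0.5
print(url - lrl)                       # 10.0

# Shortcut for whole-number scores: Max - Min + 1.
print(max(scores) - min(scores) + 1)   # 10
```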
Interquartile Range
Quartiles are the points along the x-axis that divide the distribution into four equal groups.
Determining the values of the quartiles is simple when the sample size is a multiple of four. For example, Figure 4.1 shows a frequency distribution histogram for a sample of 24 scores, so the quartiles should split this distribution into four groups of 6 blocks. The first
quartile (Q1) is the number that separates the six lowest blocks (the bottom 25%) from the rest of
the group. If you wanted to cut the six lowest blocks from the rest, you would place the cut on
the x-axis at 4.5. This is Q1. The second quartile (Q2) separates the bottom 50% from the rest
of the group and, therefore, divides the distribution into two equal groups. You have been
introduced to Q2 before; it is the median. For the data in Figure 4.1, Q2 is 5.5. Finally, Q3 is the value that separates the six highest scores from the rest. Q3 is 7.5.
Figure 4.1. Frequency distribution histogram showing a sample of 24 scores and three quartiles.
The interquartile range (IQR) is often used to describe the variability of the data and it reports
the range of the center 50% of scores instead of the whole distribution. The interquartile range is
calculated as
IQR = Q3 - Q1
and for Figure 4.1 it is
IQR = 7.5 - 4.5 = 3
Some may also report the Semi-Interquartile Range (SIQR), which is simply IQR/2. Notice
that the interquartile range is generally unaffected by the extreme scores. If the lowest score
changes from 1 to 0, the interquartile range will not change. If the highest scores change from 10
to 100, the interquartile range will not change. If you know the range as well as the
interquartile range you can imagine how the data are grouped. The range will always be larger
than the IQR, but a distribution with a large range and a very small interquartile range will have half of its scores in a narrow central cluster. If the IQR is almost as large as the range, you would expect to see scores stacked up at the extreme values. These numbers should provide you with enough information to make some educated guesses about the nature of a distribution.
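If you have the raw scores in hand, the quartiles and IQR can also be computed in software. The sketch below uses Python's statistics module on a made-up sample of 24 scores (not the actual Figure 4.1 data); note that statistical packages use several slightly different quartile conventions, so their results may not exactly match the block-counting method described above.

```python
import statistics

# Made-up sample of 24 whole-number scores (not the Figure 4.1 data).
scores = [1, 2, 3, 3, 4, 4, 5, 5, 5, 5, 5, 5,
          6, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10, 10]

# statistics.quantiles with n=4 returns the three cut points [Q1, Q2, Q3].
q1, q2, q3 = statistics.quantiles(scores, n=4)
print(q1, q2, q3)          # the three quartiles
print(q3 - q1)             # IQR = Q3 - Q1
print((q3 - q1) / 2)       # SIQR = IQR / 2
```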
Complicated Quartiles
Determining quartiles can be a bit complicated when you cannot easily divide the data into four
equal groups. For example, if you had a set of 22 scores, you would need to place quartiles that create four groups of 5.5 blocks. This gets tricky and, although there are more complicated ways to do it, a simple and effective way for our purposes is to carefully
construct a frequency distribution histogram and see where you would need to slice up the
distribution. For example, you cannot neatly cut 5.5 blocks from the bottom 25%, but if you placed Q1 exactly at 4, you would have four complete blocks as well as three half-blocks below the cut. This
separates 5.5 blocks from the bottom of the distribution and, therefore, Q1 = 4. The median (Q2)
is located at the point that divides the distribution into two groups of eleven (n=22). This point is
at 5.75. Q3 is a little easier to see at 8.
Figure 4.2. Frequency distribution histogram showing a set of 22 scores and three quartiles.
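There is no single agreed-upon convention for these in-between cases, which is why software packages can disagree with each other (and with the graphical method above). As a quick illustration, using a made-up set of 22 scores rather than the actual Figure 4.2 data, Python's two built-in quantile methods place the quartiles slightly differently:

```python
import statistics

# Made-up set of 22 scores (not the actual Figure 4.2 data).
scores = [1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6,
          6, 6, 6, 7, 7, 8, 8, 8, 9, 9, 10]

# When n is not a multiple of four, different conventions interpolate the
# quartile positions differently; the statistics module offers two.
print(statistics.quantiles(scores, n=4, method='exclusive'))
print(statistics.quantiles(scores, n=4, method='inclusive'))
```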
Standard Deviation and Variance of a Population
The standard deviation is a popular and powerful measure of variability, but it is only
appropriate for describing normal distributions, although some researchers do not appreciate
this and you might find many violations of this rule in published literature.
The standard deviation, as its name implies, measures a standard or typical deviation. In our
case, deviation is distance from the mean. The calculation of a standard deviation begins with the calculation of the mean, followed by the assignment of a Deviation Score to each score in the population.
Deviation Score (The distance between any score and the mean) = X - µ
For example, if a population has a mean of 100 (μ = 100) and one of the people in that
population has a raw score of 80, the deviation score for that person would be -20. If a person’s
score is 107, the deviation score would be 7. Imagine now that you wanted to know the “typical”
deviation score for the population. Just like you would calculate the mean of all the raw scores
to get an idea of a typical raw score, you might be inclined to calculate the mean of the deviation
scores to get a typical deviation. However, it won’t work. The average (or mean) of any set of
deviation scores will always be zero because the sum of the deviation scores will always be zero.
Numbers below the mean will have negative deviation scores that are cancelled out by the positive deviation scores of numbers above the mean. So, as a rule…
The sum of all deviation scores in a population will always be zero.
∑ (X - µ) = 0
This makes it impossible to calculate an average deviation score since
∑ (X - µ)/N = 0/N = 0
There is a way around this problem. If we square the deviation scores before adding them up, we are guaranteed to get a non-negative number (the sum of the squared deviation scores will be zero only if all of the scores in a population are identical, but this is unlikely). The sum of the squared deviation scores is the Sum of Squares (SS) and the formula for calculating SS is:
SS = ∑(X - µ)²
This formula is referred to as the definitional formula for sum of squares. Although this formula
makes it clear that SS is the sum of the squared distance between each score and the mean, there
is a simpler and more popular formula that will get you the same result:
SS = ∑X² – (∑X)²/N
Because this formula is more commonly used when computing SS, it is referred to as the
computational formula for sum of squares.
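A short sketch can confirm both points at once: the deviation scores sum to zero, and the definitional and computational formulas produce the same SS. The population below is made up for illustration.

```python
# Made-up population with mean 100.
population = [80, 95, 100, 107, 118]
N = len(population)
mu = sum(population) / N                # mu = 100.0

# The deviation scores always sum to zero (up to rounding).
deviations = [x - mu for x in population]
print(sum(deviations))                  # 0.0

# Definitional formula: SS = sum of the squared deviation scores.
ss_def = sum(d ** 2 for d in deviations)

# Computational formula: SS = sum(X^2) - (sum(X))^2 / N.
ss_comp = sum(x ** 2 for x in population) - sum(population) ** 2 / N

print(ss_def, ss_comp)                  # 798.0 798.0
```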
Once we calculate the sum of the squared deviation scores, we can calculate the average squared
distance by dividing SS by the number of scores. The average of the squared deviation scores is
the variance, and for a population the symbol for variance is sigma-squared (σ²).
Population Variance (σ²) = the average of the squared deviation scores.
σ² = SS/N
The variance is the average squared distance from the mean and, although it is a useful indicator of variability, it is more appropriate to report a typical distance rather than a typical squared distance. The standard deviation gives us this measure of typical distance and it is simply the square root of the variance:
Population Standard Deviation (σ) = the square root of the Population Variance (σ²)
σ = √(SS/N) = √σ²
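Continuing with the same made-up population, the sketch below turns SS into the population variance and standard deviation, and checks the result against the standard library's population functions.

```python
import math
import statistics

# Same made-up population as above (mean = 100, SS = 798).
population = [80, 95, 100, 107, 118]
N = len(population)
mu = sum(population) / N
ss = sum((x - mu) ** 2 for x in population)

variance = ss / N                # sigma squared = SS / N
sigma = math.sqrt(variance)      # sigma = square root of the variance
print(variance, sigma)           # 159.6  ~12.63

# pvariance/pstdev treat the data as a complete population (divide by N).
print(statistics.pvariance(population), statistics.pstdev(population))
```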
Standard Deviation of a Sample
Remember that samples are used to help us make estimates about the population parameters that
we can almost never measure directly. Therefore, if we want to use a sample statistic as an estimate of a population parameter, we must consider things that will influence the accuracy of the
estimate. When you sample people from a population, most of your scores will probably come
from the middle of the distribution since that’s where most people are. Therefore, a sample
mean (M) will probably be the result of averaging a few scores from people who are above the
mean and a few scores from people who are below the mean and your sample mean (M) will be a
good estimate of the population mean (μ). Although you may not be sure about the accuracy of
your estimate, you will have no particular reason to believe that you have overestimated μ nor do
you have any reason to believe that you have underestimated μ. A sample mean is therefore an
unbiased estimate of the population mean.
There are some other things that you know about your sample; you know that the sample size (n)
is most likely much smaller than the size of the population (N) and, because of this, you are
missing quite a few scores. Most of the scores that you are missing are probably “extreme”
scores at the very high and very low ends of the distribution. This means that your sample will
never have a range (or a distribution) that is as wide as the population’s. Therefore, a straight
calculation of the sample standard deviation is likely to give you a number that is smaller than
the population standard deviation. Unless we compensate for this, the sample variance (s²) and sample standard deviation (s) will be biased estimates of the population variance (σ²) and population standard deviation (σ).
If you want to use the sample standard deviation to estimate the population standard deviation,
instead of dividing SS by the number of scores in your sample (n) when calculating variance,
you should divide it by n - 1, also known as the degrees of freedom (df). The fraction SS/(n - 1) is a little bigger than the fraction SS/n, and this small correction makes the sample variance (s²) and sample standard deviation (s) better estimates of the population variance (σ²) and population standard deviation (σ).
Sample Variance (s²) = an estimate of the Population Variance
s² = SS/(n - 1) = SS/df
Sample Standard Deviation (s) = the square root of the Sample Variance
s = √(SS/df) = √s²
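The sketch below treats the same five made-up scores as a sample and applies the n - 1 correction; statistics.variance and statistics.stdev use the same correction, so they match, while dividing by n gives the smaller, biased value.

```python
import math
import statistics

# The same five made-up scores, now treated as a sample (n = 5).
sample = [80, 95, 100, 107, 118]
n = len(sample)
m = sum(sample) / n
ss = sum((x - m) ** 2 for x in sample)

s_squared = ss / (n - 1)         # divide by df = n - 1, not n
s = math.sqrt(s_squared)
print(s_squared, s)              # 199.5  ~14.12

# statistics.variance/stdev also divide by n - 1, so they agree.
print(statistics.variance(sample), statistics.stdev(sample))

# Dividing by n instead gives the smaller, biased estimate.
print(ss / n)                    # 159.6
```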
Summary: Measures of variability almost always accompany a measure of central tendency and describe how widely the data are scattered around it. Almost any set of ordinal, interval, or ratio scale data can be described using the range, quartiles, and interquartile range. The standard deviation offers more predictive power than ranges and quartiles but should only be applied to normally distributed data (often interval or ratio scale data). We will explore the usefulness of the standard deviation in the following chapters.