CHAPTER FOUR: Variability
In order to determine the degree to which members of a distribution vary amongst
themselves, measures of spread or VARIABILITY had to be developed. These measures
could, in turn, assess the appropriateness of the measures of central tendency. The mean
could be especially misleading when a distribution included outliers or was severely
skewed.
Consider this scenario. If everyone who took a class test either failed or got a perfect
score, the mean would suggest that class performance was average. Yet no one in that
hypothetical class did average work. A measure of variability would indicate extreme
spread among scores, and the teacher would be forewarned not to rely upon the mean for
a realistic assessment of class progress. This is not unusual when assessing
HETEROGENEOUS (diversified) samples. When the sample or population is
HOMOGENEOUS (similar in terms of the dependent measure), the measure of spread is
small, and the measures of central tendency are more reliable.
Types of Measures of Spread:
The True Range-The simplest way of measuring spread is to determine the span of
possible places on the scale of interest. This can be done by simply subtracting the lowest
score from the highest. Be sure to add 1 to count the starting point of the scale.
Range =[(Highest Score - Lowest Score) + 1]
If this seems confusing, just subtract the lower limit (LL) of the lowest score (X) from the
upper limit (UL) of the highest score, and that will give the correct answer as well.
Range =[UL of Highest Score - LL of Lowest Score]
Remember, you must account for all the spaces occupied on the scale, including all those
in between. If the highest number is 9, and the lowest is 0, there are ten spaces included
on that scale. Count them: (0, 1, 2, 3, 4, 5, 6, 7, 8, 9 )
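The two range formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample scores are hypothetical, and the sketch assumes whole-number scores, so each real limit lies half a unit from the score.

```python
def true_range(scores):
    """Range of whole-number scores, counting both end points."""
    return (max(scores) - min(scores)) + 1


def true_range_from_limits(scores, unit=1):
    """Same range via real limits: UL of highest minus LL of lowest."""
    half = unit / 2
    upper_limit = max(scores) + half
    lower_limit = min(scores) - half
    return upper_limit - lower_limit


scores = [0, 3, 9, 5, 7]  # hypothetical scores on a 0-9 scale
print(true_range(scores))              # 10
print(true_range_from_limits(scores))  # 10.0
```

Both functions report ten spaces for a scale that runs from 0 to 9, matching the count given above.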
The Inter-quartile Range(IQR)-You recall that you could determine the median by
locating the percentile (score) at the fiftieth percentile rank (50th%). The same methods
can be used to locate the score (X) at the first quartile (25th%) and the third quartile
(75th%). By definition, QUARTILES divide a distribution into quarters, or fourths. The
IQR is the range between the 1st and 3rd quartiles (Q). Therefore, simply subtract the
score at the 25th% (Q1) from the score at the 75th% (Q3):
IQR = X(75th%) - X(25th%)
The Semi-Interquartile Range(SIQR)-When the data is very skewed or incomplete, the
SIQR replaces the IQR. To compute the SIQR, simply divide the IQR by 2:
SIQR = (IQR)/2
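As a rough sketch of how the IQR and SIQR might be computed, the Python below locates the score at a percentile rank by interpolating within the real limits of the score that contains it. That interpolation rule is an assumption about the method used for the median in the earlier chapter; the function name and the scores are hypothetical.

```python
from collections import Counter


def score_at_percentile(scores, pct, unit=1):
    """Score at a percentile rank, interpolating within real limits."""
    counts = Counter(scores)
    target = pct / 100 * len(scores)   # cumulative frequency to reach
    cum_below = 0
    for x in sorted(counts):
        f = counts[x]
        if cum_below + f >= target:
            lower_limit = x - unit / 2
            return lower_limit + (target - cum_below) / f * unit
        cum_below += f
    raise ValueError("pct must be between 0 and 100")


scores = [0, 1, 2, 2, 2, 2, 3, 4]  # hypothetical whole-number scores
q1 = score_at_percentile(scores, 25)
q3 = score_at_percentile(scores, 75)
iqr = q3 - q1
siqr = iqr / 2
print(q1, q3, iqr, siqr)  # 1.5 2.5 1.0 0.5
```

Other software may use different quartile conventions, so results from a statistics package will not always agree with this real-limits approach.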
The Standard Deviation (S)-The best way to determine the appropriateness of the mean
is to determine the average (mean) amount of dispersion from the mean. Unfortunately,
because the mean ($\bar{X}$) is the sum of the scores (∑X) divided by the number of scores (N),
the distances of the scores from the mean always sum to zero, which makes their average useless.
Recall:
Sum of the Distances = $\sum (X - \bar{X}) = 0$
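A short derivation shows why this sum must be zero for any set of scores. It uses only the definition of the mean, $\bar{X} = \sum X / N$:

```latex
\begin{aligned}
\sum (X - \bar{X}) &= \sum X - N\bar{X} \\
                   &= \sum X - N \cdot \frac{\sum X}{N} \\
                   &= \sum X - \sum X = 0
\end{aligned}
```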
To correct this problem, each distance score is squared before the scores are summed. This sum
of the squared distance scores is called the SUM OF SQUARES (SS). The formula, then, is:
SS = The Sum of (The Score - The Sample Mean) Squared = $\sum (X - \bar{X})^2$
This can serve as the sum of the distances that is divided by the number of distances to
suggest a mean distance. But to return to the original scale, the square root of this squared
mean distance should be computed. The standard deviation is the square root of the
variance:
$S^2 = SS/N$, so $S = \sqrt{SS/N}$
The Variance ($S^2$)-The standard deviation (S) squared is called the variance. This is the
very number one takes the square root of to determine S: the squared mean distance
from the mean. The purpose of the variance was primarily to calculate S. It can be
thought of as a squared measure of variability.
$S^2 = \sum (X - \bar{X})^2 / N$
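The chain from SS to variance to standard deviation can be traced with a minimal Python sketch. The scores are hypothetical and the denominator is N, as in the formulas above.

```python
import math


def sum_of_squares(scores):
    """SS: sum of squared distances of each score from the mean."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores, mean = 5
ss = sum_of_squares(scores)
variance = ss / len(scores)        # squared mean distance from the mean
sd = math.sqrt(variance)           # back on the original scale
print(ss, variance, sd)  # 32.0 4.0 2.0
```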
The SS can be found with the definitional formula above or with the equivalent
computational formula, $SS = \sum X^2 - (\sum X)^2 / N$. Use the formula you are more comfortable with.
They measure the same thing.
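For readers who want to see why a definitional and a computational form of SS agree, here is a brief derivation. It assumes the computational form is the usual $\sum X^2 - (\sum X)^2 / N$ and uses $\bar{X} = \sum X / N$:

```latex
\begin{aligned}
SS = \sum (X - \bar{X})^2 &= \sum X^2 - 2\bar{X}\sum X + N\bar{X}^2 \\
                          &= \sum X^2 - \frac{2(\sum X)^2}{N} + \frac{(\sum X)^2}{N} \\
                          &= \sum X^2 - \frac{(\sum X)^2}{N}
\end{aligned}
```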
DEGREES OF FREEDOM: Variability can be determined in both the
inferential and descriptive cases. Descriptive statistics are based upon
populations. That is what the above formulas apply to. The formulas for both
the variance and the standard deviation should be adjusted for the case of
samples, because there is a risk of bias in the estimate. The sample may not be
the most representative of the true population. To assure unbiased estimates,
subtract 1 from the denominator, N. This adjusted denominator is called the
DEGREES OF FREEDOM (df = n - 1).
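To make the adjustment concrete, the sketch below computes the variance both ways for a hypothetical sample and checks the results against Python's standard `statistics` module, where `pvariance` divides by N and `variance` divides by n - 1.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample of n = 8 scores

n = len(sample)
mean = sum(sample) / n
ss = sum((x - mean) ** 2 for x in sample)

var_descriptive = ss / n          # denominator N
var_inferential = ss / (n - 1)    # denominator df = n - 1

# The standard library exposes both versions.
assert abs(var_descriptive - statistics.pvariance(sample)) < 1e-9
assert abs(var_inferential - statistics.variance(sample)) < 1e-9
print(var_descriptive, round(var_inferential, 3))  # 4.0 4.571
```

The same SS gives a larger variance when divided by df, and the gap narrows as n grows.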
NOTATION: To further distinguish the descriptive from the inferential case,
English letters will be reserved for inferential statistics. The descriptive case
will have Greek letters for notation. This signals that the denominator is an N,
and not the df, as that is not required in the descriptive case.
APPLICATION: While S and $S^2$ require continuous data, the ranges only
require an ordinal scale. They, therefore, complement measures of central tendency
of the same scale. When all the members of the distribution are multiplied by a
constant, both the mean and the SS change: the mean is multiplied by the constant,
and the SS by the constant squared. If the members of the distribution all have a
constant added to them, the mean shifts by that constant, but the SS does not
change. This is true in both the descriptive and inferential cases.
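A quick numerical check can help keep straight how the mean and the SS respond to a constant. In this sketch (hypothetical scores and helper name), adding 10 shifts the mean by 10 and leaves the SS alone, while multiplying by 3 triples the mean and multiplies the SS by 9.

```python
def mean_and_ss(scores):
    """Return the mean and the sum of squares of a list of scores."""
    mean = sum(scores) / len(scores)
    ss = sum((x - mean) ** 2 for x in scores)
    return mean, ss


scores = [0, 1, 2, 2, 2, 2, 3, 4]   # hypothetical scores
shifted = [x + 10 for x in scores]  # add a constant
scaled = [x * 3 for x in scores]    # multiply by a constant

print(mean_and_ss(scores))   # (2.0, 10.0)
print(mean_and_ss(shifted))  # (12.0, 10.0)
print(mean_and_ss(scaled))   # (6.0, 90.0)
```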
Consider the data set X = 1, 4, 3, 2, 2, 0, 2, 2
X     f    cf    c%      Xf    X - Mn    (X - Mn) Sq
4     1    8     100.0   4      2        4
3     1    7     87.5    3      1        1
2     4    6     75.0    8      0        0
1     1    2     25.0    1     -1        1
0     1    1     12.5    0     -2        4
Sum   8                  16     0        10
Mn = SumX/N = 16/8 = 2
Mo = 2
Md = 50th% = 2
Rg = 4.5 - (-.5) = 5
IQR = X75% - X25% = 2.5 - 1.5 = 1
SIQR = IQR/2 = 1/2 = .5
It sometimes eases interpolation to see the set in a line.
In this case, there are 2 scores in each quartile.
X = 0 1 (25%)  2 2 (50%)  2 2 (75%)  3 4 (100%)
Now that the ordinal measures are complete, consider the continuous solutions:
SS = $\sum (X - \bar{X})^2$ = 10
Descriptive: $\sigma^2 = SS/N = 10/8 = 1.25$ and $\sigma = \sqrt{1.25} = 1.118$
Inferential: $S^2 = SS/df = 10/7 = 1.429$ and $S = \sqrt{1.429} = 1.195$
Notice that the measures of variability based upon SS are always larger in the inferential
case. That is the effect that the df has upon them. The estimate of spread based upon a
sample has an element of risk, depending upon the degree of similarity of the sample to
the population from which it is drawn. The df will inflate that 'best guess' of variability,
to 'hedge your bet' and make it more likely to cover the true variability.
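The continuous solutions for the worked example can be reproduced with a short Python sketch. It reads the frequencies from the example's table (X = 4, 3, 2, 1, 0 with f = 1, 1, 4, 1, 1); the variable names are hypothetical, and values are rounded to three decimals.

```python
import math

# Frequency distribution read from the worked example's table (X: f).
freq = {4: 1, 3: 1, 2: 4, 1: 1, 0: 1}

n = sum(freq.values())                                  # 8
mean = sum(x * f for x, f in freq.items()) / n          # 16 / 8
ss = sum(f * (x - mean) ** 2 for x, f in freq.items())  # 10

var_pop = ss / n            # descriptive: SS / N
var_sample = ss / (n - 1)   # inferential: SS / df
print(n, mean, ss)                                   # 8 2.0 10.0
print(var_pop, round(math.sqrt(var_pop), 3))         # 1.25 1.118
print(round(var_sample, 3), round(math.sqrt(var_sample), 3))  # 1.429 1.195
```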