Soci 4003: Statistics for the Social Sciences
Chapter 2
Aron, Aron, and Coups: The Mean, Variance, SD, and Z scores
I. Measures of Central Tendency – Mean, Median, and Mode
A) The mean is the statistic social scientists typically use to describe data.
However, the mean has limitations; in particular, it is sensitive to outliers.
3 Characteristics of the MEAN
1) The mean is always the center of any distribution of scores in the sense
that it is the point around which all of the scores cancel out.
Symbolically:
Σ (Xi – Mean) = 0
Or, we take each score in a distribution, subtract the mean from it, and
add all of the differences; the resultant sum will always be zero. To
illustrate
(The mean of these scores is (65 + 73 + 77 + 85 + 90) / 5 = 390 / 5 = 78.)
65 – 78 = -13
73 – 78 = -5
77 – 78 = -1
85 – 78 = 7
90 – 78 = 12
Sum = 0
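A quick way to check this property is to compute the deviations in SPSS and sum them. This is only a minimal sketch, assuming the five scores above; the variable name dev is illustrative.
* Verify that the deviations from the mean sum to zero.
data list free / X.
begin data.
65 73 77 85 90
end data.
* The mean of these five scores is 78.
COMPUTE dev = X - 78.
EXECUTE.
* The reported SUM for dev should be 0.
DESCRIPTIVES VARIABLES=dev /STATISTICS=SUM.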
2) Sum of least squares principle, which means that the mean is the
point in a distribution around which the variation of the scores (as
indicated by the squared differences) is minimized.
If the differences between the scores and the mean are squared and
then added, the resulting sum will be less than the sum of the squared
differences between the scores and any other point in the distribution.
This simply means that the mean is closer to all the scores than
any other measure of central tendency.
Σ (X – mean)² is minimized
Remember our last example (mean = 78). Compare the squared deviations
around the mean with the squared deviations around another point, say 77:
Around the mean (78):
65 – 78 = -13, squared = 169
73 – 78 = -5, squared = 25
77 – 78 = -1, squared = 1
85 – 78 = 7, squared = 49
90 – 78 = 12, squared = 144
Sum = 388
Around 77:
65 – 77 = -12, squared = 144
73 – 77 = -4, squared = 16
77 – 77 = 0, squared = 0
85 – 77 = 8, squared = 64
90 – 77 = 13, squared = 169
Sum = 393
The sum of squared deviations around the mean (388) is smaller than the sum
around any other point, such as 77 (393).
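The least-squares property can be checked the same way. A minimal sketch, again assuming the five scores above; the variable names sqdev78 and sqdev77 are illustrative.
* Compare sums of squared deviations around the mean (78) and around 77.
data list free / X.
begin data.
65 73 77 85 90
end data.
COMPUTE sqdev78 = (X - 78)**2.
COMPUTE sqdev77 = (X - 77)**2.
EXECUTE.
* The SUM for sqdev78 (388) should be smaller than the SUM for sqdev77 (393).
DESCRIPTIVES VARIABLES=sqdev78 sqdev77 /STATISTICS=SUM.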
3) Unlike the Mode and Median, all scores affect the mean
B) Median – the number that falls directly in the middle of a distribution
C) Mode – the score that occurs most frequently in the distribution
When should each of these statistics be used?
1) Mode – when the variable is nominal and you want to report the most
common score
2) Median – when the variable is ordinal, or when an interval-ratio variable is
highly skewed (lying with statistics: a mixed-income community's income can
be made to look higher with the mean); or when you want to report the middle score
3) Mean – when the variable is interval or ratio, you want to report the typical
score, or you want to do further statistical analysis
D) Rate – provides another useful way of summarizing the distribution of a single
variable.
Rate = number of actual occurrences / number of possible occurrences, per some
unit of time
Crude death rate example: (100 / 7,000) x 1,000 = 14.29
14.29 deaths per 1,000 people
Ex) see Healey page 43
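The same arithmetic can be done in SPSS; this is a minimal sketch, and the variable names deaths, pop, and cdr are illustrative.
* Crude death rate: 100 deaths in a population of 7,000, expressed per 1,000 people.
data list free / deaths pop.
begin data.
100 7000
end data.
COMPUTE cdr = deaths / pop * 1000.
EXECUTE.
* cdr should be 14.29.
LIST.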
II. Range, Variance, and Standard Deviation – Measures of Dispersion
To provide a full description of a distribution of scores, measures of central
tendency should be combined with measures of dispersion.
1) Range – the distance between the highest and lowest scores in a
distribution
Unfortunately, since it is based on only two scores (the highest and
lowest), the range is often deceptive as a measure of dispersion. Often, a
distribution will have outliers that make the range problematic
2) Variance and Standard Deviation
The range is also problematic because it does not use all the scores in the
distribution and, in this sense, does not capitalize on all the available
data.
A good measure of dispersion should:
a) Use all scores in the distribution
b) Describe the average or typical deviation of the scores. The
statistic should give us an idea of how far the scores are
from each other or from the center of the distribution
c) Increase in value as the distribution of scores becomes more
diverse
To create such a statistic, here are the logical steps:
1) To describe the distance between each score and the mean, we want
to assess the deviation (X – mean)
2) However, this gives us one deviation per score. To create a single useful
statistic, we might think about summing the deviations, but, as you know,
that sum will always equal 0
3) Therefore, the solution is to square the deviations so that they cannot
cancel out.
4) One more problem reveals itself. Although this sum will reflect
variability (higher values = more spread, and vice versa), its
size depends heavily on the size of the sample or
population. Therefore, we must standardize the score by dividing by
the sample or population size (as with rates).
5) With this said, the formula for the VARIANCE is:
s² = Σ (X – mean)² / N
Note on N – 1: with sample data this formula tends to underestimate the
population variance, so we divide by N – 1 instead to correct for that
6) The standard deviation is even simpler: it is the square root of the
variance.
Interpreting the standard deviation (s) – You may be asking yourself
why this matters. This measure is meaningful
for several reasons:
a) it ties into the ideas of deviation and the normal curve
b) think of it as an index that increases as dispersion increases
c) it lets you compare the dispersion of two samples
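A minimal SPSS sketch of these statistics, assuming the five scores from the earlier example. Note that the DESCRIPTIVES procedure reports the sample (N – 1) versions of the variance and standard deviation.
* Mean, variance, and standard deviation for the five example scores.
data list free / X.
begin data.
65 73 77 85 90
end data.
* DESCRIPTIVES uses the N - 1 formula for VARIANCE and STDDEV.
DESCRIPTIVES VARIABLES=X /STATISTICS=MEAN VARIANCE STDDEV.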
III. The Normal or “Bell-Shaped” Curve and Z Scores
The normal curve is central to the theory that underlies inferential statistics.
The normal curve is a THEORETICAL model, or line chart, that is:
a) unimodal (has a single mode or peak)
b) perfectly smooth and symmetrical (unskewed) so that its mean,
median, and mode are all exactly the same value
Of course, no empirical distribution has a shape that perfectly matches this ideal
model, but many variables (standardized test scores, test results of large classes,
heights and weights) are close enough to permit the assumption of normality, as
long as the cases are randomly sampled.
In turn, this assumption makes possible one of the most important uses of the
normal curve—the description of empirical distributions based on our knowledge
of the theoretical normal curve
NOTE: On any normal curve, distances along the abscissa (horizontal axis), when
measured in standard deviations, always encompass exactly the same proportion
of the total area under the curve.
±1 standard deviation – 68.26% of the area (34.13% on each side of the mean)
±2 standard deviations – 95.44% of the area (47.72% on each side of the mean)
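These areas can be checked with SPSS's CDF.NORMAL function, which returns the cumulative area under a normal curve. A minimal sketch; the single dummy case and the variable names are only there so COMPUTE has something to run on.
* Areas under the standard normal curve within 1 and 2 standard deviations.
data list free / dummy.
begin data.
1
end data.
COMPUTE within1 = CDF.NORMAL(1,0,1) - CDF.NORMAL(-1,0,1).
COMPUTE within2 = CDF.NORMAL(2,0,1) - CDF.NORMAL(-2,0,1).
EXECUTE.
* within1 should be about .6826 and within2 about .9544.
LIST.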
The relationship between distance from the mean and area allows us to describe
empirical distributions that are at least approximately normal. The position of
individual scores can be described with respect to the mean, the distribution as a
whole, or any other score in the distribution
Computing Z-Scores
To find the percentage of the total area (or number of cases) above or below
scores in an empirical distribution, the original scores must first be expressed in
units of the standard deviation, or converted into Z-SCORES. The original scores
could be in any unit of measurement (feet, IQ points, dollars), but Z scores always
have the same mean (0) and standard deviation (1).
- Think of converting the original scores into Z scores as a process of changing
scales, similar to changing from meters to yards, or kilometers to miles.
The original (or raw) scores and Z scores are two equally valid but different ways
of measuring distances under the normal curve
When computing Z scores, we convert the original units of measurement to Z
scores and thus “standardize” the normal curve to a distribution that has a mean
of 0 and an sd of 1:
Z = (X – mean) / s
A z score of +1 indicates that the original score lies 1 sd above the mean.
Normal curve table – allows you to find the proportion of scores between a given
z score and the mean, or above and below that z score (see index)
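In SPSS, Z scores can be obtained with the /SAVE subcommand of DESCRIPTIVES, which adds a standardized copy of each variable (here ZX). A minimal sketch, assuming the five scores from the earlier example; note that SPSS standardizes with the sample (N – 1) standard deviation.
* Convert the raw scores to Z scores.
data list free / X.
begin data.
65 73 77 85 90
end data.
* /SAVE creates a new variable ZX holding each case's Z score.
DESCRIPTIVES VARIABLES=X /SAVE.
LIST.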
IV. SPSS Example
* Univar.sps.
* Sample SPSS descriptive statistics example. Replicates examples in handout.
* This program is really quite short, but these painstakingly detailed
* comment lines stretch it out. Comment lines are very handy though
* if you are ever trying to figure out why you did something the way you did.
* Also, while I am giving you this program, this could all easily be done
* interactively using SPSS Menus. In effect, SPSS will generate most
* of this syntax for you.
* First, enter the data. Normally I would create a separate data file, but for
* now I will enter the data directly into the program using the
* data list, begin data and end data commands.
data list free / X.
begin data.
100
150
200
250
250
250
250
325
325
400
end data.
* The formats command tells SPSS that X is measured in dollars.
* Not essential, but it helps make the display easier to read. This could
* also be done using the SPSS Data Editor. The Var Labels Command
* will also make the output easier to read.
Formats X (dollar8).
Var Labels X "Weekly Income".
* Next, run the frequencies command, indicating what stats I want.
* I used SPSS menus to generate the syntax for this command, but it
* could also be typed in directly.
FREQUENCIES
VARIABLES=x
/STATISTICS=STDDEV VARIANCE MEAN MEDIAN MODE SUM
/ORDER= ANALYSIS .
* Now, here is how to run the problem when the data are already grouped
* in a frequency distribution.
* The variable WGT indicates how often the value occurs in the data.
data list free / X WGT.
begin data.
100 1
150 1
200 1
250 4
325 2
400 1
end data.
Formats X (dollar8).
Var Labels X "Weekly Income"/ Wgt "Weighting Var".
* The Weight command causes cases to be weighted by the # of times
* the value occurs.
Weight by Wgt.
* Now just run the frequencies again.
FREQUENCIES
VARIABLES=x
/STATISTICS=STDDEV VARIANCE MEAN MEDIAN MODE SUM
/ORDER= ANALYSIS .