Summary of Richard Lowry's course - http://faculty.vassar.edu/lowry/webtext.html
Michel Beaudouin-Lafon
Chapter 1 - Measurement
Chapter 2 - Distributions
Chapter 3 - Linear Correlation and Regression
Chapter 4 - Statistical significance
Chapter 5 - Basic concepts of probability
Chapter 6 - Introduction to probability sampling distributions
Chapter 7 - Tests of statistical significance: Three overarching concepts
Chapter 8 - Chi-square procedures for the analysis of categorical frequency data
Chapter 9 - Introduction to procedures involving sample means
Chapter 10 - T-procedures for estimating the mean of a population
Chapter 11 - T-test for the significance of the difference between the means of two independent samples
Chapter 12 - T-test for the significance of the difference between the means of two correlated samples
Chapter 13 - Conceptual introduction to the analysis of variance
Chapter 14 - One-way analysis of variance for independent samples
Chapter 15 - One-way analysis of variance for correlated samples
Chapter 16 - Two-way analysis of variance for independent samples
Chapter 17 - One-way analysis of covariance for independent samples
Summary of statistical tests

One independent variable (continuous, equal-interval scale),
one dependent variable (continuous, equal-interval scale):
        Pearson correlation (chap 3)

N independent variables (categorical),
one dependent variable (categorical frequency data):
        Chi-square (chap 8)

One independent variable (categorical),
one dependent variable (equal-interval scale):
        Two samples:
                Parametric:     independent samples: t-test (chap 11)
                                correlated samples:  t-test (chap 12)
                Nonparametric:  independent samples: Mann-Whitney (chap 11)
                                correlated samples:  Wilcoxon (chap 12)
        N samples:
                Parametric:     independent samples: 1-way ANOVA (chap 14)
                                correlated samples:  1-way ANOVA (chap 15)
                Nonparametric:  independent samples: Kruskal-Wallis (chap 14)
                                correlated samples:  Friedman (chap 15)

Two independent variables (categorical),
one dependent variable (equal-interval scale):
        Independent samples, parametric test: 2-way ANOVA (chap 16), ANCOVA (chap 17)
Chapter 1 - Measurement
Variable = the property being measured
Variate = a specific measure
Types of measures:
Counting = counting the number of units, e.g. size or weight
=> Scalar variables
Ordering = arranging in order, e.g. things you worry about a bit, some, a lot
=> Ordinal variables
Sorting = putting into categories, e.g. male/female
=> Nominal variables
Scalar variables
Absolute scale = counting a number of items
Relative scale = measuring relative to a unit scale, e.g. cm or kg
Equal interval scales => can compute compound measures
sum, difference and average (see below)
Unequal interval scales, e.g. dB => cannot compute compound measures
Continuous scale => values = real numbers
Discrete scale => values = integer numbers
Ratio scale if there is an absolute zero, e.g. cm or kg, but not ºC
=> permit the computation of ratios, e.g. twice as large
Non-ratio scales do not permit this,
but differences are a ratio scale, e.g.
t1 ºC is twice t2 ºC is meaningless
but (t2 ºC - t1 ºC) is twice (t3 ºC - t2 ºC) is OK
Compound measures = sum, difference, average
All require equal interval scales for the individual measures,
and the result has the same properties (abs/rel, cont/discr, ratio)
as the individual measures
Sum : result has same properties as individual measure
Average : result is always continuous
e.g. average the number of students (discrete) => continuous
Difference : result is always a ratio scale (see example above)
Cross-classification of equal-interval scales
Discrete / Non-ratio => does not exist in practice (counting has abs. 0)
Discrete / Ratio => number of discrete items or events
Continuous / Non-ratio => fairly rare, e.g. Fahrenheit and Celsius degrees
Continuous / Ratio => most measures of continuous variables
Ordinal variables
Specify the order of size, quantity or magnitude among the measured items
Rank-order scales
Example : rank what's important in your life from highest to lowest
Family > Friends > Career > Money
=> intrinsically not an equal-interval scale
Rating scales
Assign a numeral rating
0 completely unimportant
1 slightly important
2 moderately important
3 quite important
4 exceedingly important
to the items above, e.g.
Family = 4.0, Friends = 3.8, Career = 3.6, Money = 3.4
vs Family = 4.0, Friends = 3.5, Career = 2.0, Money = 1.0
More information than rank-order but still not an equal-interval scale
even when presented as such
Often treated as such by using compound measures, e.g. average
Nominal variables
Sort into categories, e.g. male/female, without any order among categories
No compound measure makes sense
however can use cardinal scale to count number of items in each category
=> twice as many males as females
Cross-categorizing, e.g. male/female & geographical origin
=> bivariate or multivariate classification
study whether two or more categories are systematically associated
with each other
Direct vs Indirect measures
Direct measure : use a tape measure to measure the desk
Indirect measure : assess the knowledge of students by grading a test
Direct measure of how well the students did
Indirect measure of their knowledge of the material
More often than not, we need to use indirect measures
Reliability
Whether repeated measures give the same answer
=> metal tape measure is more reliable than an elastic one
Validity
Whether a direct measure is actually a valid indirect measure
=> "fair" vs "unfair" exam, or measuring produberances on the head to
measure personality and mentality (19th century's phrenology)
Chapter 2 - Distributions
Distribution = list of individual measures
Graphic representation = absolute or relative (i.e. %) frequencies
Histogram
Frequency polygon
Frequency curve
Examples of distributions
Parameters of a distribution
Central tendency - measures tend to cluster around a point of aggregation
Variability - measures tend to spread out away from each other
Skew - one tail of the distribution more elongated than the other
Negatively skewed -> elongated tail to the left [A]
Positively skewed -> elongated tail to the right [B]
No skew = symmetric [C-D-E]
Kurtosis - whether there is a cluster of measures, i.e. a peak in the distribution
Platykurtic -> mostly flat [D]
Leptokurtic -> high peak [E]
Mesokurtic -> medium [C]
Modality - number of distinct peaks (1 = unimodal, 2 = bimodal, 3 - trimodal)
Bimodal: major peak = highest / minor peak = lowest [F]
Can reveal the mix of 2 unimodal distributions,
e.g. well prepared vs not so well prepared students
Measures of central tendency
Measures the location of the clustering: measured by mode, median, and mean
Mode = point or region with highest number of individual measures
Median = midpoint of all individual measures
Mean = arithmetic average of all individual measures
Mean = Median = Mode when distribution is unimodal and symmetric
Skewed distribution : mean towards the tail, mode away from tail, median
somewhere in between
Median and mode rather useless in analytical and inferential statistics,
because only mean has the property of equal interval scale
The mean is noted M: MX is the mean of the Xi, i.e. MX = (∑ Xi) / N.
Measures of variability
Measures the strength of the clustering
Range = distance between lowest and highest measures
Interquartile range = distance between lowest and highest of the
middle 50% of the measures
Both are rather useless in analytical and inferential statistics
Variance & Standard deviation
Deviate = Xi - MX
Sum of square deviates = SSX = ∑ (Xi - MX)2
Variance (also called mean square) = s2 = SSX / N
Standard deviation (also called root mean square) = s = sqrt (SSX / N)
A better formula to compute the sum of square deviates is
SSX = ∑ (Xi2) - (∑ Xi)2 / N
Standard deviation can be visualized as a scale for the distribution:
MX ± 1s represent a range within the distribution,
similar to the interquartile range, but computed with all Xi
MX ± 1s typically encompasses 2/3 of the values
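
Not in Lowry's text: a minimal Python sketch of these variability measures, using numpy; the data values are made up for illustration.

    import numpy as np

    X = np.array([12.0, 15.0, 14.0, 10.0, 18.0, 16.0, 13.0])   # illustrative measures
    N = len(X)

    M_X = X.sum() / N                                  # mean
    SS_X = ((X - M_X) ** 2).sum()                      # sum of square deviates
    SS_X_alt = (X ** 2).sum() - X.sum() ** 2 / N       # computational formula, same value
    variance = SS_X / N                                # mean square, s2
    std_dev = np.sqrt(variance)                        # root mean square, s

    print(M_X, SS_X, SS_X_alt, variance, std_dev)
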
Varieties of distributions
Empirical distribution - set of variates that have been (or could be) observed
Theoretical distribution - derived from mathematical properties
Theoretical distributions are at the heart of inferential statistics
The best-known is the normal distribution, also called bell-shape curve
Normal distribution
+1z and -1z fall at the point where the curvature changes (convex <> concave)
68.26% of the distribution is in the -1z to +1z range
each tail encompasses 15.87% of the distribution
Populations vs. Samples
Population distribution = ALL the measures of a variable
Sample distribution = SOME measures of a variable
Inferential statistics = inferring properties of populations from samples
Chapter 3 - Linear Correlation and Regression
Correlation and Regression = relationship between two variables X and Y, when each
Xi is paired with one Yi, e.g. height vs weight
If the more of X, the more of Y => positive correlation
If the more of X, the less of Y => negative correlation
Scatterplot = bivariate coordinate plotting
Plot dots at coordinates (Xi, Yi) for each i
If a causality is suspected between the variables
the independent variable (the one capable of influencing the other) is X
the dependent variable (the one being influenced by the other) is Y
A linear relationship can be represented by a line that best fits the scatterplot
Positive correlation = upward line
Negative correlation = downward line
Zero correlation = no apparent pattern
Physical phenomena will tend to fit linear relationships perfectly,
while behavioral and biological phenomena will be more imperfect
Measuring linear correlation
Pearson product-moment correlation coefficient = r
Ranges from -1.0 (perfect negative correlation) to +1.0 (perfect
positive correlation), 0.0 meaning zero correlation
Coefficient of determination = r2, ranges from 0.0 to 1.0
Provides an equal interval and ratio scale of the strength of the
correlation, but not its sign, which is given by r
This strength of the correlation can be turned into a percentage (0.44 ->
44%) and interpreted as the amount of the variability of Y that is associated with (tied
to / linked with / coupled with) the variability of X (and vice versa).
Regression line = line that best fits the scatterplot
Criterion for best fit : the sum of the squared vertical
distances between the data points and the line is minimal
Line goes through the point (MX, MY)
Covariance = similar to variance but for paired bivariate instances of X and Y
Co-deviateXY = deviateX . deviateY
covariance = (∑ co-deviateXY) / N = SCXY / N
Better formula to compute SCXY : SCXY = ∑ (Xi Yi) - (∑ Xi)(∑ Yi) / N
r = observed covariance / maximum possible positive covariance
observed covariance : see above
maximum possible positive covariance : geometric mean of the
variances of X and Y = sqrt (varianceX . varianceY)
Computationally easier formula for r:
r = SCXY / sqrt (SSX . SSY)
where SCXY = sum of co-deviates of (X,Y)
SSX and SSY = sum of square deviates of X and Y
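
Not from the course text: a small Python sketch of this computation of r from the deviates, cross-checked against scipy; the paired data are made up for illustration.

    import numpy as np
    from scipy import stats

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])             # illustrative paired data
    Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    MX, MY = X.mean(), Y.mean()
    SC_XY = ((X - MX) * (Y - MY)).sum()                  # sum of co-deviates
    SS_X = ((X - MX) ** 2).sum()                         # sum of square deviates of X
    SS_Y = ((Y - MY) ** 2).sum()
    r = SC_XY / np.sqrt(SS_X * SS_Y)

    r_check, p_value = stats.pearsonr(X, Y)              # scipy's implementation, same r
    print(r, r_check, r ** 2)                            # r and the coefficient of determination
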
Interpretation of correlation
Correlation represents the co-variation of the paired
instances of Xi and Yi
It can be represented by overlapping circles: amount of
overlap is r2 of each circle, non overlap is the residual variance, 1 - r2.
Where does this correlation come from? Could it be cause and effect?
Correlation is a tool, and like any tool, it can do harm if misused
Interpreting correlation as causal relationships requires care and cannot result
solely from the observed correlation. Here are the 3 questions to ask oneself:
1- Statistical significance - could it be that the observed pattern was a mere
effect of chance, i.e. it exists in the sample but not the population? See the next
chapter for a first cut on statistical significance
2- Does X cause Y, or Y cause X (in which case you should swap them), or
could it be a reciprocal causal relationship where X affects Y and Y affects X? For
example, in American football, a higher score of the winner is correlated with a
higher score of the loser. It is likely that they influence each other in a reciprocal
relationship.
3- Could X and Y be influenced by another variable Z, not taken into account?
For example the number of births of baby boys is highly correlated with the number
of births of baby girls, but it's hard to imagine a causal relationship (even reciprocal).
Instead, they are both influenced by a third factor, the birth rate.
Regression line: y = b x + a
The slope of the line is b = SCXY / SSX
The intercept of the line is a = MY - b MX
The line of regression is not just useful to show the correlation, it can also be
used to make predictions: given a value x, what is the likely value of y:
predicted Yi = b Xi + a ± SE
where SE is the standard error of estimate = sqrt (SSresidual / (N-2))
SSresidual is the sum of square residuals, where the residual is the
vertical deviate between a point and the line of regression. SSresidual = SSY . (1 - r2).
Since r2 is the proportion of variability of Y that is explained with the variability of X,
1 - r2 is the residual proportion, i.e. the variability in Y that is not associated with
variability in X. Multiplied by SSY, it gives the amount of SSY that is residual, i.e. not
accounted for by the correlation between X and Y.
Note that SE is almost the standard deviation of the residuals, the difference
being that we divide by N-2 instead of N. This will be explained later, but relates to
the fact that whereas the standard deviation is a measure of the sample, the standard
error has to do with the overall population represented by this sample.
The standard error can be visualized by drawing two lines, above and below
the regression line, at a distance of SE.
We can predict that approximately 2/3 of the XiYi in the actual
population will be within these two lines.
We can also say that we have a 2/3 confidence that the value of Yi falls
within ±SE of the predicted value for a given Xi.
[see subchapters 3a and 3b on partial correlation and rank-order correlation]
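
A sketch (not part of the original summary) of the regression line and the standard error of estimate in Python, reusing the made-up data from the previous sketch.

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    N = len(X)

    MX, MY = X.mean(), Y.mean()
    SS_X = ((X - MX) ** 2).sum()
    SS_Y = ((Y - MY) ** 2).sum()
    SC_XY = ((X - MX) * (Y - MY)).sum()

    b = SC_XY / SS_X                        # slope
    a = MY - b * MX                         # intercept
    r2 = SC_XY ** 2 / (SS_X * SS_Y)         # coefficient of determination

    SS_residual = SS_Y * (1 - r2)           # variability of Y not tied to X
    SE = np.sqrt(SS_residual / (N - 2))     # standard error of estimate

    x_new = 3.5                             # a new X value (illustrative)
    print("predicted Y:", b * x_new + a, "+/-", SE)
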
Chapter 4 - Statistical significance
The essential task of inferential statistics is to determine what can reasonably be
concluded about a population on the basis of a limited sample from that population.
A sample is like a "window" into the population. Sometimes, it truly represents the
entire population, sometimes it misrepresents it, leading to erroneous conclusions.
This is particularly the case when the studied phenomenon contains some element of
random variability.
Statistical significance is the apparatus by which one can assess the extent to which
the observed facts do not result from mere chance coincidence, i.e. the confidence one
can have in the inferred relationships.
Statistical significance for correlation
The question is whether the observed correlation, measured by r on a sample,
corresponds to the actual correlation of the entire population, called ρ ("rho").
In particular, how likely is it that we measure r ≠ 0 when ρ = 0, i.e. we detect a
relationship when none exists?
A small experiment
Get a pair of dice of different colors, say white and blue, and toss them multiple
times. Record the pair of numbers (Xi = white die, Yi = blue die).
There is no reason to expect any correlation between Xi and Yi on the entire
population, i.e. ρ = 0
However, if you take samples of N=5 tosses and compute the coefficient of
correlation r for each sample, you will get values that deviate by a large amount from 0,
with about 40% of the values beyond ±0.5.
If you draw a relative frequency histogram of the r values, you get a pretty flat
(platykurtic) distribution.
Now repeat the experiment with sample sizes of N=10, N=20, N=30 and draw
the same frequency histograms.
With N=10, 12% of the samples are outside the ±0.5 range
With N=20, 2.5% of the samples are outside the ±0.5 range
With N=30, 0.5% of the samples are outside the ±0.5 range
As N increases, the distribution approaches a normal distribution, and the
tendency of sample correlation coefficients to deviate from 0 decreases.
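
Not in the source: a small simulation of the dice experiment, which roughly reproduces the percentages quoted above (exact figures vary from run to run).

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_r(n_tosses):
        # toss two independent dice n_tosses times and return the sample correlation;
        # a rare all-identical sample gives nan and is simply not counted as beyond 0.5
        x = rng.integers(1, 7, size=n_tosses)
        y = rng.integers(1, 7, size=n_tosses)
        return np.corrcoef(x, y)[0, 1]

    for n in (5, 10, 20, 30):
        rs = np.array([sample_r(n) for _ in range(10000)])
        beyond = np.mean(np.abs(rs) > 0.5)
        print(f"N={n}: {beyond:.1%} of sample r values beyond +/-0.5")
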
Another experiment
Suppose a team of medical investigators is exploring the properties of a new
nutrient, called X. They hypothesize that it should have the effect of increasing the
production of a certain blood component, called Y. They conduct an experiment with
some adult subjects and they get r = +0.50.
The question of statistical significance is: what confidence can they have that
this result is not just due to mere chance coincidence?
This of course depends on the size of the sample.
In the dice experiment, with N=5, there is a 20% chance to observe a positive
correlation as large as r=+0.5 when the correlation within the entire population is ρ=0.
With N=10, it drops to 6%, with N=20, to 1.25%, with N=30, to 0.25%.
In most areas of scientific research, the criterion for statistical significance is set at the
5% level, i.e. a result is regarded as statistically significant only if it had 5% or less
likelihood of occurring by mere chance coincidence.
For correlation analysis, the r-values required for statistical significance at the 5%
level according to the sample size N can be computed.
Two cases need to be considered: directional and non-directional, reflecting
whether or not the investigators specify, in advance, the direction (positive or
negative) they expect the correlation to take:
Positive/Negative directional hypothesis: the relationship between X
and Y in the population is positive/negative (the more X, the more/less Y), so this
particular sample of XiYi pairs will show a positive/negative correlation.
Non-directional hypothesis: the relationship between X and Y in the
population is something other than 0, either positive or negative, so this particular
sample of XiYi pairs will show a non-zero correlation.
As shown in the table below, the r-values required for statistical significance
are higher with a non-directional hypothesis than with a directional one.
N       ±r (directional)      ±r (non-directional)
5       0.81                  0.88
10      0.55                  0.63
12      0.50  <=              0.58
15      0.44                  0.51
20      0.38                  0.44
25      0.34                  0.40
30      0.31                  0.36
In the medical experiment, the +0.50 correlation is statistically significant at the 5%
level only if the sample size N is at least 12 (marked in the table), since the hypothesis is
directional. For larger sample sizes, the correlation is statistically significant beyond
the 5% level.
Test for the significance of the Pearson product-moment correlation coefficient (see
chapters 9-12):
t = r / sqrt[(1-r2)/(N-2)]
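
A sketch (not from the text) of this significance test in Python; scipy's t distribution supplies the tail probability, and the numbers plugged in are the r = +0.50, N = 12 case discussed above.

    import numpy as np
    from scipy import stats

    def r_significance(r, N, directional=True):
        # t-test for a Pearson correlation coefficient, as given above
        t = r / np.sqrt((1 - r ** 2) / (N - 2))
        p = stats.t.sf(abs(t), df=N - 2)          # one-tailed probability
        return t, (p if directional else 2 * p)

    print(r_significance(0.50, 12, directional=True))    # p close to .05
    print(r_significance(0.50, 12, directional=False))   # roughly twice as large
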
Chapter 5 - Basic concepts of probability
Imagine someone pretending that he can control the outcome of tossing a coin to
make it come up heads, but not 100% of the times. He offers to be tested on a series
of 100 tosses. How many heads would have to turn up for you to find the result
"impressive"? In general, the answer is between 70 and 80, i.e. 70% to 80%. Our
(rightful) intuition is that if he does not control the toss, the result would be around
50% - i.e. you wouldn't be impressed if it were 51, 52 or even 53. Let's say you draw
the line at 70%. Now what if the test had involved 10 tosses rather than 100. Would
you be impressed if heads turned up 7 times out of the 10 tosses? Probably not.
Indeed, as we'll show, the chance of getting 70% or more heads in 10 tosses is 17%, while
it is 0.005% for 100 tosses. With the standard 5% criterion for significance, 7 out of
10 is not enough.
Laplace about the theory of probability: "at bottom only common sense reduced to
calculation".
Probability of event x = P(x) = (number of possibilities favorable to the occurrence of
x) divided by (total number of pertinent possibilities).
A priori probability: known precisely in advance, by counting or enumerating
e.g., drawing a blue ball out of a box whose content is known
A posteriori probability: estimated based on large number of observations
e.g., success rate of a surgery procedure
Compound probability: common sense explained by calculation
Conjunction: A and B
Disjunction: A or B
Conjunctive probability: P(A and B)
P(A and B) = P(A) x P(B) if A and B are independent
If A and B are not independent, be careful to count each probability properly
Example: 4 females and 6 males in a room, pick 3,
probability that all 3 are female
P(first pick is female) = 4/10
P(second pick is female) = 3/9 since we've already picked one
P(third pick is female) = 2/8 since we've already picked two
P(all three are female) = 4/10 * 3/9 * 2/8 = .033 = 3.3%
probability that all 3 are male
similar reasoning: 6/10 * 5/9 * 4/8 = .167 = 16.7%
(i.e., 5 times as likely as picking 3 females!)
Disjunctive probability: P(A or B)
P(A or B) = P(A) + P(B) if A and B are mutually exclusive
If A and B are not mutually exclusive, remove common occurrences
Example: 26 students, 12 sophomores and 14 juniors,
12 sophomores: 7 females and 5 males
14 juniors: 8 females and 6 males
probability of picking either a sophomore or a female
P(S or F) = P(S) + P(F) = 12/26 + 15/26 = 27/26 > 1 WRONG
P(S or F) = P(S) + P(F) - P(S and F) = P(S)+P(F) - P(S)P(F)
= 12/26 + 15/26 - (12/26 * 15/26) WRONG
The probability of drawing a sophomore is 12/26 but once
we've done that, the probability that it is a female is not 15/26 but 7/12. Or the other
way around: the probability of drawing a female is 15/26 but once we've done that,
the probability that she is a sophomore is 7/15. We get
P(S and F) = 12/26 * 7/12 = 7 / 26
or
P(S and F) = 15/26 * 7/15 = 7 / 26
So the result is P(S or F) = 12/26 + 15/26 - 7/26 = 20/26 = .769
It is easier to make such computations with the complementary
probability, especially when there are more than 2 events (A or B or C) and use the
fact that P(x) = 1 - P(!x)
Here, the probability of not drawing a sophomore or a woman
is 6/26 since there are 6 male juniors: P(S or F) = 1 - P(M and J) = 1 - 6/26 = 20/26
More complex calculations
An unknown disease causes 40% of patients to recover after 2 months (and
therefore 60% don't). The a posteriori probability to recover is P(r) = .4. Researchers
want to test the effect of a plant, and start with a small sample of 10 patients. 7 of
them recover within 2 months, while 3 don't. Is this 70% recovery rate significantly
better than the 40%, i.e. is the plant having an effect?
We need to compute the probability that at least 7 patients out of 10
recover with a recovery probability of 40% and see if it is below the 5% threshold of
significance. This probability is P(7, 8, 9 or 10 patients recover), which is the sum of
the individual probabilities P(7r) + P(8r) + P(9r) + P(10r).
P(10r) is easy since there is only one possibility: that all
patients recover, hence P(10r) = .4 x .4 x ... x .4 = .4 ^ 10 = .000105
P(9r) is the sum of the probability that all patients but the first
one recovers, plus the probability that all but the second one recover, etc. Each
probability is (.6 x .4^9) so the total is 10 times that = .00157
P(8r) gets even more complicated since we need to look at all
the combinations where 8 patients recover and 2 don't.
Fortunately, there is a general formula to compute the number
of ways in which it is possible to obtain "k out of N" (10 out of 10, 9 out of 10, 8 out
of 10, etc.): C(n,k) = n! / (k! (n-k)!)
Since the probability for each individual event is p^k q^(n-k), where
q = 1-p, we get the general formula:
P(k out of n) = C(n,k) p^k q^(n-k) = [n! / (k! (n-k)!)] p^k q^(n-k)
With this formula we get P(7r) = .0425, P(8r) = .0106, P(9r) =
.0016 and P(10r) = .0001, so the probability that 7 or more patients recover is .0548.
This is above the 5% threshold so it cannot be considered significant, although it is
close enough to that threshold that it is worth doing a larger-scale test.
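
Not part of the text: the same computation in Python, using the binomial formula above and scipy's binomial distribution as a cross-check.

    from math import comb
    from scipy import stats

    p, q, n = 0.4, 0.6, 10

    def P(k):
        # P(k out of n) = C(n,k) p^k q^(n-k)
        return comb(n, k) * p ** k * q ** (n - k)

    p_at_least_7 = sum(P(k) for k in range(7, 11))
    print(p_at_least_7)                       # about .0548, just above the 5% threshold

    print(stats.binom.sf(6, n, p))            # same value: P(X >= 7) = P(X > 6)
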
In the larger test, they take a sample of 1000 patients and observe that 430 of them
recover within two months. Although this is only 43% compared with the 40%
without the plant, it is actually significant, as we will show later. However using the
above formula to make the calculation is impractical, even with a computer. We need
a better tool - fortunately there is one!
Chapter 6 - Introduction to probability sampling distributions
Binomial probability: situation where an event can occur or not, e.g. heads when
tossing coins, whether a patient recovers. It is characterized by:
p - probability that the event will occur
q - probability that the event will not occur
N - number of instances in which the event has the opportunity to occur
Let's consider tossing a coin twice (p = q = .5, N = 2) and let's count the occurrence of
Heads. There are four possible outcomes (H means heads, -- means not heads):
        -- -- : probability = .5 x .5 = .25  => 0 Heads (25%)
        -- H  : probability = .5 x .5 = .25  \
        H --  : probability = .5 x .5 = .25  /  => 1 Head (50%)
        H H   : probability = .5 x .5 = .25  => 2 Heads (25%)
With 2 patients and p = .4, q = .6 (R means recovery):
        -- -- : probability = .6 x .6 = .36  => 0 Recoveries (36%)
        -- R  : probability = .6 x .4 = .24  \
        R --  : probability = .4 x .6 = .24  /  => 1 Recovery (48%)
        R R   : probability = .4 x .4 = .16  => 2 Recoveries (16%)
Sampling distribution of these probabilities:
Since these are distributions, we can look at the measures of central tendency and
variability. For the binomial sampling probability distribution, we get:
Mean: µ = Np
Variance: σ2 = Npq
Standard deviation: σ = sqrt(Npq)
(We use the greek letters µ and σ because we are referring to a population)
For our examples:
        coins:    µ = 1.0,  σ2 = 0.5,   σ = ±0.71
        patients: µ = 0.8,  σ2 = 0.48,  σ = ±0.69
If we increase N, the distributions come closer and closer to a normal distribution.
The graphs below show the histograms, computed with the formula from Chapter 5,
and the normal distribution for N = 10 and N = 20:
In practice, we can use the approximation of the normal distribution when N, p and q
are such that Np ≥ 5 and Nq ≥ 5
Now let's look closer at the distribution for N=20, p=.4, q=.6 :
We compute µ = 8.0 and σ = ± 2.19
The graph above superimposes the binomial distribution, in blue with horizontal axis
k, and the normal distribution, in red with horizontal axis z. The formula to convert
between k and z is as follows:
z = ((k - µ) ± .5) / σ
The component ± .5 is a correction for continuity, accounting for the fact that
the normal distribution is continuous while the binomial distribution is stepwise.
When k is smaller than µ, add a half unit (+ .5),
when k is larger than µ, subtract half a unit (- .5).
Example:
For k = 5, we get z = ((5 - 8) + .5) / 2.19 = -1.14
For k = 11 we get z = ((11 - 8) - .5) / 2.19 = +1.14
From these values we can look up the table giving the area beyond ±z (see
Chapter 2). For z = 1.14 we get .1271.
This means that, out of 20 patients, there is a 12.71% chance that 5 or less
recover, and 12.71% chance that 11 or more recover.
Note that these values are approximate: computing the exact binomial probabilities
leads to 12.56% for 5 or less and 12.75% for 11 or more. The differences with the above
computation are only .0015 and .0004 respectively.
If we were to make the same computation for N=1000 patients, there is no way we
could do the exact calculation. With the normal distribution we get:
µ = 400, σ = ±15.49
Let's say we get 430 patients who recover, i.e. 30 more than the expected mean. We
compute the corresponding z value as z = ((430 - 400) - .5) / 15.49 = +1.90. Referring
to the table of the normal distribution, this gives us .0287, i.e. a 2.87% chance that
this many recoveries could be obtained just by chance. Therefore the effect is
statistically significant, as announced (but not demonstrated) in chapter 5!
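
A sketch (not in the source) of this normal approximation in Python, with scipy's exact binomial tail as a comparison; the numbers are the N = 1000, 430-recoveries case above.

    import numpy as np
    from scipy import stats

    N, p, q = 1000, 0.4, 0.6
    mu = N * p                            # 400
    sigma = np.sqrt(N * p * q)            # about 15.49

    k = 430                               # observed recoveries
    z = ((k - mu) - 0.5) / sigma          # continuity correction: k > mu, subtract half a unit
    print(z, stats.norm.sf(z))            # z about +1.90, tail probability about .028

    print(stats.binom.sf(k - 1, N, p))    # exact P(X >= 430), for comparison
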
Chapter 7 - Tests of statistical significance: Three overarching concepts
Tests of statistical significance are the devices by which scientific researchers can
rationally determine how confident they may be that their observed results reflect
anything more than mere chance coincidence.
Mean chance expectation
All tests of statistical significance involve a comparison between
(i) an observed value - e.g. an observed correlation coefficient; an observed
number of heads in N tosses; an observed number of recoveries in N patients
(ii) the value that one would expect to find, on average, if nothing other than
chance and random variability were operating in the situation - e.g. a correlation
coefficient of r=0, assuming the correlation within the general population is ρ=0;
or 5 heads in 10 tosses, assuming the probability of heads is .5; or 400 recoveries in
1000 patients, assuming the probability of recovery for any patient is .4.
The second value is often called the mean chance expectation or MCE. For binomial
probability situations, MCE is equal to the mean.
The null hypothesis and the research hypothesis
When performing a test of statistical significance, it is useful to distinguish between
(i) the particular hypothesis that the research is seeking to examine, e.g. that a
medication has some degree of effectiveness
and
(ii) the logical antithesis of the research hypothesis, e.g. that the medication
has no effectiveness at all.
The second one is commonly called the null hypothesis or H0. The first one is the
research hypothesis (or experimental hypothesis, or alternative hypothesis) or H1.
In general:
H0: observed value = MCE
H1: observed value != MCE
(here 'a = b' means a and b do not statistically differ, i.e. a equals b within the
limits of statistical significance, 'a != b' means that they statistically differ, i.e. the
difference is beyond what could be expected from random variability).
The research hypothesis may be directional, when the expected difference has a
particular direction. For example, it is expected that the medication has the effect of
increasing the number of patients that recover. In this case
H1: observed value > MCE
In other cases we may have
H1: observed value < MCE
e.g. the effect of sleep deprivation on the outcome of the disease.
('a > b' and 'a < b' here mean that the difference is statistically significant)
In fact, the so-called non-directional hypothesis (observed value != MCE) should be
called bi-directional as we are really testing
(observed value < MCE) or (observed value > MCE).
Directional vs non-directional research hypotheses
One-way vs two-way tests of significance
Let us get back to our coin-tossing experiment with the claimant of paranormal
powers aiming to produce "an impressive number of heads". We toss a coin 100 times
and get 59 heads. Is that impressive?
N=100, p=.5, q=.5 gives a z-ratio of +1.7 and a probability of .0446
This passes the test of 5% significance (see figure below, left).
If instead the claim was: "I have a power but I cannot control its direction. All I can
say is that the number of heads/tails will be significantly different from 50/50". This
is now a non-directional hypothesis, resulting in a probability of .0892, well above the
5% threshold and hence not significant (see figure below, right).
Because these tests often refer to the left and right "tails" of the distribution, they are
often described as one-tail and two-tail tests. In general, the probability measured
with a two-tail test is approximately twice that measured with a one-tail test (with
the binomial probability, it is exactly twice).
The logic of the non-directional hypothesis is also the default position when there is
no prior hypothesis at all. Let's say you toss a coin 16 times just to see what happens
and get 12 heads. With a one-tail test, the probability is p=.028 and you may think
that you have paranormal powers. But since you didn't know (or specify) what to
expect, you need to use a two-tail test (p=.056) and accept that the result is indeed not
significant at the 5% level.
Chapter 8 - Chi-square procedures for the analysis of categorical frequency data
Chapters 5/6 dealt with situations with two mutually exclusive categories, e.g. heads or
tails. Chi-square extends the logic of binomial procedures to situations with more than
two categories (e.g., a patient's condition improves, stays the same or worsens), and to
situations with more than one dimension of classification (e.g., students from college
A or college B, and their political inclination as conservative or liberal).
Chi-square for one dimension of categorization
Suppose we have a river with 3 species of fish in equal proportions. We pick 300 fish
and count 89 of species A, 120 of species B and 91 of species C. The question of
statistical significance is whether the observed difference with the expected counts
(100 of each) is due to chance or to some ecological disorder in the river. We can look
at the amount by which each species deviates from the expected count :
(observed frequency - expected frequency) / expected frequency = (O - E) / E
species A : (89 - 100) / 100 = -11%
species B : (120 - 100) / 100 = +20%
species C : (91 - 100) / 100 = -9%
Note that the sum of these relative differences is 0. The Chi-square (χ2) simply consists
of taking the square of the differences and adding them all up : χ2 = ∑ (O - E)2 / E
        species A : (89 - 100)2 / 100 = 1.21
        species B : (120 - 100)2 / 100 = 4.00
        species C : (91 - 100)2 / 100 = 0.81
        total : χ2 = 6.02
We now need to know the properties of the sampling distribution to which this value
belongs. We could take a large number of samples (say 10000) of 300 fish in a river
with equal proportions of each species, compute their χ2 and see where the above
value (6.02) lies. The result is depicted below (left): it shows that there is a 4.99%
chance of picking a sample with a chi-square of 6.02.
The theoretical distribution (above, right) shows the various values of χ2 for different
p-values. Unlike the one on the left, this distribution is general: it applies to any
situation with 3 categories, any sample size, and any expected values, e.g. a sample
size of 100 with expected values of 25/50/25 or a sample size of 62 with expected
values 13/22/27. Our value of 6.02 is slightly beyond the 5% significance level.
If we have more categories, the distribution would be different. It depends on the
number of degrees of freedom (df), which, for one dimension of categorization, is the
number of categories minus 1. This is because, given a sample size and n categories,
you can pick any value for n-1 categories, but the last one is forced. The graph and the
table below show the distributions and level of significance for various p-values. Note
that the χ2 is by nature non-directional since it is based on the squared differences.
        Level of significance (non-directional test)
df      .05      .025     .01      .005     .001
1       3.84     5.02     6.63     7.88     10.83
2       5.99     7.38     9.21     10.60    13.82
3       7.81     9.35     11.34    12.84    16.27
4       9.49     11.14    13.28    14.86    18.47
5       11.07    12.83    15.09    16.75    20.52
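
Not in the original: a Python sketch of the one-dimensional chi-square computation for the fish example, with scipy supplying the p-value.

    import numpy as np
    from scipy import stats

    observed = np.array([89, 120, 91])          # fish counted per species
    expected = np.array([100, 100, 100])        # equal proportions expected

    chi2 = ((observed - expected) ** 2 / expected).sum()
    df = len(observed) - 1
    print(chi2, stats.chi2.sf(chi2, df))        # 6.02 and a p-value just under .05

    print(stats.chisquare(observed, expected))  # scipy's one-line equivalent
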
Chi-square with two dimensions of categorization
The best way to represent two-dimensional data is with a contingency table. Suppose we
have two groups of female patients, one of which received estrogen and the other did not,
and we measure the number of those which developed Alzheimer's disease within 5 years.
We get the following observed frequencies:

                                    Alzheimer's onset
                                    No          Yes        Total
    Received estrogen    Yes        O = 147     O = 9        156
                         No         O = 810     O = 158      968
    Total                               957         167     1124

First we need to figure out what the expected values would be. With a single dimension,
these values are easy to guess. Here they are computed as E = R*C / N, where R is the sum
for the row of the cell being considered, C is the sum for the column, and N the total size
of the sample:

                                    Alzheimer's onset
                                    No             Yes         Total
    Received estrogen    Yes        E = 132.82     E = 23.18     156
                         No         E = 824.18     E = 143.82    968
    Total                               957            167      1124

Now that we know the expected values, we can compute the value (O-E)2/E in each cell, and
sum them to obtain the χ2. In the special case where there are exactly two rows and two
columns, we need a correction for continuity (as in Chapter 6):
    χ2 = ∑ (|O-E| - 0.5)2 / E = 11.01 in this example.
Next we need to compute the degrees of freedom to select the proper distribution.
Following the earlier explanation, we note that here, once we know the totals per row
and per column, setting the value of one cell determines all three others unequivocally.
Hence df = 1 in this case. In the general case of r rows and c columns, a similar reasoning
would show that df = (r - 1)(c - 1). Going back to the table of critical values above, we
can see that our χ2 exceeds the value for significance at the .1% level (10.83).
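
A sketch (not from the text) of the 2x2 computation in Python; scipy's chi2_contingency applies the same continuity (Yates) correction by default for 2x2 tables.

    import numpy as np
    from scipy import stats

    # observed frequencies: rows = received estrogen (yes, no),
    # columns = Alzheimer's onset (no, yes)
    observed = np.array([[147, 9],
                         [810, 158]])

    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    N = observed.sum()
    expected = row_totals * col_totals / N                            # E = R*C / N

    chi2 = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()
    print(chi2)                                                       # about 11.01, df = 1

    chi2_s, p, dof, exp = stats.chi2_contingency(observed, correction=True)
    print(chi2_s, p, dof)
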
Note about non-significant Chi-square values : A non-significant χ2 should not be
interpreted as a "goodness-of-fit": All it tells you is that you cannot reject the null
hypothesis. It does not tell you that you can accept the null hypothesis since many
other null hypotheses could have ended up with a non-significant χ2.
Two limitations of Chi-square
1- Chi-square can only be applied if the categories are independent of each other, i.e.
the categories are both exhaustive and mutually exclusive.
2- The logical validity of chi-square decreases when the values of E become smaller.
It should not be used if any E is 5 or below. In this case, you can use the Fisher Exact
Probability Test (which also has the advantage of being directional) [see chapter 8a].
Chapter 9 - Introduction to procedures involving sample means
We now turn to the analysis of situations that involve continuous variables. Rather
than looking at whether a patient has recovered or not, we measure how much he or she
has recovered, for example by measuring the concentration of a certain component in
the patient's blood that is known to decrease in proportion to the degree of the disease.
We want to be able to say, e.g., that on average, patients who received the
experimental treatment showed greater improvement than those who did not. The key
word here is "on average", which means that we are focusing on the mean of a sample
or the means of two or more samples. The question then is to assess the significance
of the results drawn from a sample with respect to the overall population.
Let's start with a normal distribution, whose
mean and standard deviation are known to be
µ = 18 and σ = ±3. This distribution allows
us to answer questions such as: what is the
probability that a random Xi is between 18
and 21, i.e. within 1 standard deviation
(answer: 34.13%).
More generally, we can compute the z-ratio
of any value as z = (Xi - µ) / σ. (Note that we
don't need the correction for continuity introduced in chapter 6 for the values are now
continuous). For example Xi = 22 gives z = +1.33 and Xi = 12 gives z = -2.0, which
translate respectively into probabilities of P = .0918 and P = .0228.
The sampling distribution of sample means
Let us now draw samples of size N=10 in this
population and calculate the mean of each
such sample. It turns out that the distribution
M of such means (see right) is itself a normal
distribution, whose mean µM and standard
deviation σM are closely related to those of
the source population :
µM = µsource , σ2M = σ2source / N
and therefore σM = σsource / sqrt(N)
In other words, as the sample size gets larger, the measured means cluster more
tightly around the actual mean of the population.
So if we know that a population is normally distributed and if we know its mean and
standard deviation, we can answer questions such as: What is the probability that a
sample has a mean MX greater than 20? We compute the z-ratio for 20 in the M
distribution: z = (MX - µM) / σM = +2.11, corresponding to P = 1.74%.
The sampling distribution of sample-means differences
Now we suppose that we draw a large number of pairs of samples: in each pair, we
have sample A of size Na and sample B of size Nb and we consider the difference
between the means of the two samples in each pair: d = MXa - MXb. Here again, it
turns out that the distribution M-M of these sample-mean differences is normal and
closely related to the source population:
µM-M = 0 , σ2M-M = σ2source / Na + σ2source / Nb
and therefore σM-M = sqrt [σ2source / Na + σ2source / Nb]
This allows us to answer questions such as: what is the probability that the mean of
sample A is greater than the mean of sample B by 2 or more points? Again, we
compute the z-ratio for 2 in the M-M distribution : z = ((MXa - MXb) - µM-M) / σM-M =
1.49, corresponding to a probability of 6.81%.
Estimating the mean and variability of a normally distributed source population
The problem with the above computations is that they assume that we already know
the mean and standard deviation of the source population. In many cases we don't.
But since we know (from the above) that a random sample will tend to reflect the
properties of the overall population, we can say that the sample mean is "in the
vicinity" of the mean of the source population:
estimated µsource = Mx ± [definition of "vicinity"]
Intuitively, the larger the sample, the closer the approximation of the real value. But
how is this vicinity computed exactly? This will be the subject of chapter 10.
The next question is to estimate the variability of the source population given that of
a given sample. In general, the variability that appears in a sample will tend to be
smaller than the variability that exists in the population. We recall from Chapter 2 that
the variance of a population is the average of the square deviates: σ2 = SS / N. The
relationship between the sample variance and the population variance is:
mean sample variance = s2 = σ2source (N-1) / N
From this we can compute the estimated source variance from the sample variance :
estimated σ2source = {s2} = s2 N / (N-1) = SS / (N-1)
est.σsource = {s} = sqrt[SS / (N-1)].
Estimating the standard deviation of the sampling distribution of sample means
We now turn to the sampling distribution of sample means. We recall from part 1 of
this chapter that σ2M = σ2source / N. We can estimate σM by substituting {s2} for σ2source.
This gives an estimated σM = sqrt[{s2} / N]. We know that this estimated variance
is "somewhere in the vicinity" of the real variance. Since we don't know the real
variance of the population, we cannot calculate the z-ratio for a given sample mean,
but we can make an estimate of the z-ratio, called t, using the estimated variance :
t = (MX - µM) / est.σM
For example, suppose we have a sample of size N=25 with mean MX = 53 and SS = 625. If
we believe the source mean is µM = 50, we can compute t = +2.94. We will see a bit
later how this translates into a probability with the so-called t-distribution.
Estimating the std deviation of the sampling distribution of sample mean differences
Here again, we cannot compute the z-ratio for a given pair of samples since we don't
know the variability of the source population, but we can use estimates:
estimated variance of sample A = SSa / (Na - 1)
estimated variance of sample B = SSb / (Nb - 1)
The so-called pooled variance {s2p} is defined as (SSa + SSb) / ((Na-1)+(Nb-1))
and the estimated variance of the sample mean differences is
est.σM-M = sqrt[{s2p} / Na + {s2p} / Nb].
As before, we can compute t, an estimate of the z-ratio using the estimated variance:
t = (MXa - MXb) / est.σM-M
For example, if we have sample A of size Na = 20 with Ma = 105.6 and SSa = 4321
and sample B of size Nb = 20 with Mb = 101.3 and SSb = 4563, we can compute the
pooled variance {s2p} = (4321+4563) / (19+19) = 233.79 and est.σM-M = ±4.84. From
this we compute the t-ratio t = +0.89. Again, we need to study the t-distribution to
translate this into a probability.
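
Not part of the course: a Python sketch of this two-sample computation from the summary statistics above, with the non-directional probability taken from scipy's t distribution.

    import numpy as np
    from scipy import stats

    Na, Ma, SSa = 20, 105.6, 4321.0     # summary statistics from the example
    Nb, Mb, SSb = 20, 101.3, 4563.0

    df = (Na - 1) + (Nb - 1)
    pooled_var = (SSa + SSb) / df                          # {s2p}
    se_mm = np.sqrt(pooled_var / Na + pooled_var / Nb)     # est. sigma(M-M)
    t = (Ma - Mb) / se_mm

    print(t, df)                                           # about +0.89 with df = 38
    print(2 * stats.t.sf(abs(t), df))                      # non-directional probability
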
The t distribution
Suppose we select a large number of pairs of
samples of size 10 and for each pair we
compute its t-ratio as above. The distribution of
these t-values would follow the red curve to the
right. This is not exactly a normal distribution
but very similar.
Unlike a normal distribution, the t-distribution
depends on the degrees of freedom, a concept
already introduced in Chapter 8. Here, for a
single sample of size N, df = N-1, while for two
samples of sizes Na and Nb, df = (Na-1)+(Nb-1).
The figure to the right shows t-distributions for
df=5 and df=40. The latter is almost
indistinguishable from the normal distribution.
The use of the t-distribution is exactly the same as for the normal distribution: given a
t-ratio, one can look-up in a table the corresponding probability. The table below
gives some examples:
                       Level of significance
directional test:      .05     .025    .01     .005    .0005
non-directional test:  --      .05     .02     .01     .001
df = 5                 2.02    2.57    3.36    4.03    6.87
df = 10                1.81    2.23    2.76    3.17    4.59
df = 18                1.73    2.10    2.55    2.88    3.92
df = 20                1.72    2.09    2.53    2.85    3.85
To the right is the t-distribution for df=18
(typically Na = Nb = 10), with the .05 and
.025 one-tailed levels of significance marked.
For a non-directional ("two-tailed") test to be
significant to the .05 level, t would have to be
beyond ±2.10.
For a directional ("one-tailed") test to be significant at the same 0.05 level, t would
have to be beyond ±1.73.
Chapter 10 - T-procedures for estimating the mean of a population
Assumption of the t-test for one sample:
- The scale of measurement has the properties of an equal interval scale.
- The sample is randomly drawn from the source population
- The source population can be reasonably supposed to have a normal distribution
Step 1 - For the sample of N values Xi, calculate
MX - the mean of the sample - MX = (∑ Xi) / N
SS - the sum of square deviates - SS = ∑ (Xi2) - (∑ Xi)2 / N
Step 2 - estimate the variance of the source population as
{s2} = SS / (N-1)
Step 3 - estimate the standard deviation of the sampling distribution of sample means
(also called "standard error of the mean") as
est.σM = sqrt[{s2} / N]
Note that you can merge steps 2 and 3 into one step:
est.σM = sqrt[SS / (N (N-1))]
Step 4 - perform the point and interval estimate as
est.µsource = MX ± (tcritical * est.σM)
with df = N - 1
tcritical is taken from a table such as the one at the end of Chapter 9 and depends on df,
the level of significance and whether the test is directional or non-directional.
± (tcritical * est.σM) is called the confidence interval.
Example:
An archeologist finds 19 intact specimens of a certain prehistoric artifact, and
measures their lengths :
17.3 18.9 17.7 23.8 16.0 22.1 18.4 18.2 13.3 26.8
18.6 24.5 22.8 13.4 18.1 14.8 20.6 17.4 16.1
She assumes the population to have a normal distribution and has no reason to believe
that her sample is not random. She estimates the mean length of the population :
Step 1:
MX = 18.9
SS = 248.5
df = 18
Step 2:
{s2} = 13.81
Step 3:
est.σM = ±0.85
Step 4:
est.µsource = 18.9 ± 1.79 with a 95% confidence level
est.µsource = 18.9 ± 2.45 with a 99% confidence level
The figure to the right shows the critical
t-values for the calculated standard error
of the mean.
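
Not in the original: a Python sketch reproducing this interval estimate from the lengths listed above (scipy assumed available).

    import numpy as np
    from scipy import stats

    lengths = np.array([17.3, 18.9, 17.7, 23.8, 16.0, 22.1, 18.4, 18.2, 13.3, 26.8,
                        18.6, 24.5, 22.8, 13.4, 18.1, 14.8, 20.6, 17.4, 16.1])

    N = len(lengths)
    MX = lengths.mean()
    SS = ((lengths - MX) ** 2).sum()
    est_var = SS / (N - 1)                       # {s2}
    se_mean = np.sqrt(est_var / N)               # est. sigma(M), standard error of the mean

    for conf in (0.95, 0.99):
        t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=N - 1)    # non-directional critical t
        print(f"{conf:.0%}: {MX:.1f} +/- {t_crit * se_mean:.2f}")
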
Chapter 11 - T-test for the significance of the difference between the means of
two independent samples
This is probably the most widely used statistical test of all time. It is simple, easy to
use, straightforward, and adaptable to a broad range of situations. This is because
scientific research often examines a phenomenon two variables at a time, trying to
answer the question: are these two variables related? If we alter the level of one (the
independent variable), will we thereby alter the level of the other (the dependent
variable)? Or: if we examine two levels of a variable, will we find them to be
associated with different levels of the other?
Example 1: does the presence of a certain fungus enhance the growth of a plant?
Procedure: take a set of seeds and randomly sort them into groups A and B
Group A is the experimental group: grow them in a soil with the fungus
Group B is the control group: grow them in a soil without the fungus
Harvest the two groups and measure their growth. If the fungus affects
growth, the mean of group A should be significantly larger than the mean of group B.
Example 2: do two types of music have different effects on the performance of mental
tasks?
Procedure: take a pool of subjects, assign them randomly to groups A and B
Group A will have music type I playing in the background
Group B will have music type II playing in the background
Measure the performance of each subject for a set of mental tasks.
Any difference between the effects of the two types of music should show up
as a difference between the mean level of performance of the two groups.
Example 3: Do two strains of mice differ with respect to their ability to learn a
particular behavior
Procedure: here the two subject pools are already given: strain A and strain B
Draw a random sample of size Na from pool A and a sample of size Nb from B
Run the members of each group through the experimental protocol and
measure how well and how quickly they learn the behavior
Any difference between their abilities to learn this behavior should show up as
a difference between the group means.
Recall from Chapter 7 that whenever you perform a statistical test, what you are
testing is the null hypothesis. In general, the null hypothesis is that there is no effect
(of the fungus, of the type of music, of the strain of mice). The research hypothesis
may be directional (Example 1) or non-directional (Examples 2 and 3), which defines
the type of test of statistical significance.
Assumptions of the t-test for two independent samples:
- Both samples are independently and randomly drawn from the source population(s)
- The source population(s) can be reasonably supposed to have a normal distribution
- The scale of measurement for both samples has the properties of equal interval scale
Step 1 - For the two samples A and B of sizes Na and Nb, calculate
MXa and SSa
MXb and SSb
Step 2 - estimate the variance of the source population as
df = (Na - 1) + (Nb - 1)
{s2p} = (SSa + SSb) / df
Step 3 - estimate the standard deviation of the sampling distribution of sample-mean
differences (the "standard error of MXa - MXb") as
est.σM-M = sqrt[{s2p}/Na + {s2p}/Nb]
Step 4 - calculate t as
t = (MXa - MXb) / est.σM-M
Step 5 - Refer to a table of the t-distribution for the critical values of t for the given df
and the given level of significance, taking into account the directionality of the test.
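
A sketch (not from the text) of steps 1 to 5 in Python on two made-up samples, cross-checked against scipy's pooled-variance t-test.

    import numpy as np
    from scipy import stats

    def t_independent(a, b):
        # steps 1-4 above, from the raw measures of the two samples
        a, b = np.asarray(a, float), np.asarray(b, float)
        Na, Nb = len(a), len(b)
        SSa = ((a - a.mean()) ** 2).sum()
        SSb = ((b - b.mean()) ** 2).sum()
        df = (Na - 1) + (Nb - 1)
        pooled_var = (SSa + SSb) / df
        se_mm = np.sqrt(pooled_var / Na + pooled_var / Nb)
        return (a.mean() - b.mean()) / se_mm, df

    groupA = [23, 25, 21, 24, 26, 22, 23]            # made-up scores
    groupB = [20, 22, 19, 23, 21, 20, 22]
    t, df = t_independent(groupA, groupB)
    print(t, df, 2 * stats.t.sf(abs(t), df))         # non-directional probability (step 5)

    print(stats.ttest_ind(groupA, groupB))           # scipy gives the same t and p
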
Example:
Let us use example 2 and assume the following values:
Group A (music type I):
Na = 15
Ma = 23.13
Group B (music type II):
Nb = 15
Mb = 20.86
Ma - Mb = 2.26
SSa = 119.73
SSb = 19.73
Step 2: {s2p} = 10.55
Step 3: est.σM-M = ±1.19
Step 4: t = +1.9 with df = 28
Step 5: looking at the figure, we can see that
with a non-directional test at the 5% level, we
fall short of significance: the critical t-value for
.025 significance is ±2.05. However, if we had
made a directional test, it would have reached
significance at the 5% level since the critical t
in this case is 1.70.
Mann-Whitney test
The t-test described above is called a parametric test because of the assumptions it
makes about the samples and the population. The Mann-Whitney test is a nonparametric variant that makes the following assumptions:
- The two samples are randomly and independently drawn
- The dependent variable is intrinsically continuous, capable in principle, if not in
practice, of producing measures carried out to the nth decimal place
- The measures within the two samples have the properties of at least an ordinal scale
of measurement, so that it is meaningful to speak of "greater than", "less than" and
"equal to".
For the details of the procedure, see Chapter 11a on-line.
The logic of the procedure is to sort the measures and convert them into a rank order
(the lowest is rank 1, the next higher one is rank 2, etc.), then to work with these ranks
instead of the measures, taking advantage of the properties of rank ordering.
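
Not in the original: scipy provides the rank-based procedure directly; the scores below are made up.

    from scipy import stats

    groupA = [23, 25, 21, 24, 26, 22, 23]     # made-up scores for two independent groups
    groupB = [20, 22, 19, 23, 21, 20, 22]

    u, p = stats.mannwhitneyu(groupA, groupB, alternative='two-sided')
    print(u, p)
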
Chapter 12 - T-test for the significance of the difference between the means of
two correlated samples
We start with a rather strange hypothesis: that people with shoes are taller than
without shoes. A researcher tests this hypothesis and measures (in inches) a set of 15
adults first with their shoes (sample A) and then without (sample B). He observes that
they are all taller with their shoes, and computes
Ma = 65.8
Mb = 64.2
Ma - Mb = 1.6
SSa = 378.4   SSb = 384.1   (σa = ±5.0, σb = ±5.1)
If you compute the t-value as we've done in the previous chapter, you get t = +0.84,
which is not significant at the 5% level for a directional test! Why is that?
Looking at the distributions of samples A and
B (right), one can see that they each have a
large internal variability, and the computation
of t depends ultimately on SSa and SSb, the
raw measure of these internal variabilities. In
our example, the individual differences with respect to height are entirely extraneous
to the question of whether people on average are taller with shoes. The t-test for
independent samples treats this extraneous variability as though it were not extraneous,
and in consequence overestimates the standard deviation in the relevant sampling
distribution. This in turn results in an underestimate of the significance of the observed
mean difference.
The t-test for correlated samples avoids this pitfall by disregarding the extraneous
variability and looking only at what is relevant for the question at hand. The name
comes from the fact that the two sets of measures are arranged in pairs and are thus
potentially correlated. This procedure is also spoken of as the repeated-measures t-test
or the within-subject t-test, because it typically involves situations in which each
subject is measured twice, once in condition A, and then again in condition B.
However it is not essential that the measures in conditions A and B come from the
same subjects, only that each individual item in sample A is intrinsically linked with a
corresponding item in sample B. Here, the height of subject 1 with shoes is linked to
the height of subject 1 without shoes, and so on for each subject.
In fact, since we are only concerned with the difference between the shoes-on and
shoes-off conditions, there is only one sample here, whose variable D is defined for
each linked pair as Di = XAi - XBi
If we now compute the mean of this new variable, we still get MD = 1.6, but the
variability is now a lot smaller: SSD = 2.59 (s = 0.42). From there, the rest of the
procedure will look familiar : rather than comparing the means of two samples, we
look at a single sample made of the differences between the measures, and test
whether the observed mean of the differences is significantly different from 0.
The t-test for correlated samples is particularly useful in research involving human and
animal subjects, precisely because it is so effective in removing the extraneous effects
of pre-existing individual differences. Note that in some contexts, these individual
differences are not extraneous. In some cases they might be the very essence of the
phenomena of interest. But there are many situations where the facts of interest are
merely obscured by the variability of individual differences.
Assumptions of the t-test for two correlated samples:
- The scale of measurement of Xa and Xb has the properties of an equal-interval scale
- The values of Di have been randomly drawn from the source population
- The source population from which the values Di have been drawn can be reasonably
supposed to have a normal distribution.
Step 1 - for the sample of N values of Di, where Di = XAi - XBi, calculate the mean and
the sum of square deviates
MD = (∑ Di) / N
SSD = ∑ (Di2) - (∑ Di)2 / N
Step 2 - estimate the variance and standard deviation of the source population
{s2} = SSD / (N-1)
est. σMD = sqrt[{s2} / N]
Step 3 - calculate t as
t = MD / est. σMD
Step 4 - refer the calculated value of t to the table of critical values for df = N-1 and
the given level of significance, taking into account the directionality of the test.
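The four steps above can be sketched in Python as follows (NumPy and SciPy assumed; the paired measurements are invented for illustration):

import numpy as np
from scipy import stats

# Invented paired measurements: each subject measured in condition A and in condition B
xa = np.array([65.0, 68.2, 61.5, 70.1, 66.3])
xb = np.array([63.4, 66.9, 60.2, 68.0, 65.1])

d = xa - xb                                  # Step 1: differences Di
n = len(d)
m_d = d.mean()                               # mean of the differences
ss_d = ((d - m_d) ** 2).sum()                # sum of square deviates of D

s2 = ss_d / (n - 1)                          # Step 2: estimated source variance
sigma_md = np.sqrt(s2 / n)                   # estimated standard error of MD

t = m_d / sigma_md                           # Step 3
p_one_tailed = stats.t.sf(abs(t), df=n - 1)  # Step 4: one-tailed probability
print(t, p_one_tailed)

print(stats.ttest_rel(xa, xb))               # cross-check with SciPy's paired t-test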
Example: in the shoe example, we have t=+14.17, df = 14. The one-tailed probability
of getting this result by mere chance coincidence is 0.0000000005 ...
Example 2: Suppose we want to determine whether two types of music, A and B,
differ with respect to their effects on sensory-motor coordination. If we draw two
independent samples of human subjects and test the first one with type-A music and
the other with type-B music, it is very unlikely that we will get a significant result, for
the individual differences in sensory-motor coordination are likely to obscure any
effect of the type of music. In a design with correlated samples, we test all subjects
with both types of music, arranging for half of the subjects to be tested with type-A
then type-B and the other half with type-B then type-A, in order to obviate the
potential effects of practice and test sequence.
Assume we get a sample of differences with a mean MD = -1.53 and SSD = 55.45 (s =
±1.92). We compute {s2} = 3.96, est.σMD = ±0.51 and t = -3.0 with df = 14. Looking
up the corresponding t table, we find that the critical t value at the 1% level of
significance for the non-directional test is 2.98. We therefore meet and slightly exceed
this critical value.
Wilcoxon signed-rank test
Just as there is a non-parametric counterpart of the t-test for independent samples (the
Mann-Whitney test of the previous chapter), there is a non-parametric counterpart of the
t-test for correlated samples, called the Wilcoxon signed-rank test. This test assumes the following:
- The paired values of XA and XB have been randomly drawn from the source population
- The differences Di = XAi - XBi can be meaningfully calculated and rank-ordered by
magnitude (at least an ordinal scale)
- The distribution of these differences in the source population can be reasonably
supposed to be symmetrical; unlike the t-test, no normal distribution is required
See the on-line Chapter 12b for the details of the procedure. It is similar to the Mann-Whitney test, but takes into account the signs of the ranked differences (hence its name).
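A minimal sketch with SciPy (the paired values are invented for illustration):

from scipy import stats

# Invented paired measurements under conditions A and B
xa = [20, 23, 30, 25, 34, 40, 27, 38]
xb = [19, 26, 27, 22, 30, 35, 25, 33]

# The routine ranks the absolute differences and sums the signed ranks
stat, p = stats.wilcoxon(xa, xb)
print(stat, p)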
Chapter 13 - Conceptual introduction to the analysis of variance
The t-test (chapter 11) compares the means of two samples, but cannot be used to
compare more than two samples. For example, if we have three samples A, B, C, we
cannot simply run 3 t-tests to compare A & B, B & C and A & C: with each test run at
the .05 level, the probability of finding at least one "significant" difference by mere
chance rises to about .15 rather than .05!
The analysis of variance or ANOVA solves this problem. Here is how it works with
k=3 samples A, B, C:
- let Na, Nb and Nc be the size of each sample and NT = Na + Nb + Nc
- compute Ma, Mb and Mc, the mean of each sample, and MT, the total mean
- compute SSa, SSb and SSc, the sum of square deviates of each sample,
and SST, the sum of square deviates of the complete set of measures
- compute the measure of aggregate differences among samples :
for each group g = a, b, c, this is the value Ng (Mg - MT)2
This is similar to the sum of square deviates, but for each sample as a whole
- the sum of these three values is called SSbg,
standing for the sum of square deviates between samples.
Now we want to figure out if this measure differs significantly from the zero that
would be specified by the null hypothesis. Remember the t value from chapter 11:
t = (MXa - MXb) / est.σM-M
- The numerator is the difference between the two means. It corresponds to
our value SSbg which measures the difference among any number of samples.
- The denominator can be interpreted as the variability that can be expected
from a random sample. In the t-test, it is computed from SSa and SSb. Here, we
introduce the sum of square deviates within groups as SSwg = SSa + SSb + SSc.
- Note that SSbg + SSwg = SST, i.e. we have separated the variability of the
three samples into two parts: variability between the samples vs. within the samples.
In chapters 9 through 12, we encountered several versions of the basic concept that the
variance of the source population can be estimated as
est.σ2source = (sum of square deviates) / (degrees of freedom)
We now proceed to estimate the mean of square deviates (MS), which is the term
used in the context of ANOVA for an estimate of the source population variance:
- let MSbg = SSbg / dfbg where dfbg is the degrees of freedom between groups,
i.e. the number of groups k minus 1 (here: dfbg = k - 1 = 3 - 1 = 2).
- let MSwg = SSwg / dfwg where dfwg is the degrees of freedom within samples,
which is the sum of the degrees of freedom of each sample, i.e.
dfwg = (Na - 1) + (Nb - 1) + (Nc - 1) = NT - k
What do MSbg and MSwg estimate? In short, the same thing. If the null hypothesis is
true, both values are estimates of the variance of the source population. If the null
hypothesis is not true, MSbg will tend to be greater than MSwg.
The F-Ratio
The relationship between two values of MS is conventionally described by
F = MSeffect / MSerror
where MSeffect is a variance estimate pertaining to the fact whose significance we wish
to assess (here the difference between the means of independent samples) and MSerror
is a variance estimate of random variability present in the situation.
When the null hypothesis is true, the F-ratio tends to be equal to or less than 1.0.
When the null hypothesis is false, the F-ratio tends to be greater than 1.0.
In our case we define F as F = MSbg / MSwg.
The next question is to assess F against the corresponding sampling distribution.
F can be written as (SSbg / dfbg) / (SSwg / dfwg). Similar to the t and Chi-square which
have a different distribution for different values of df, here we have a different
distribution for each value of the pair (dfbg, dfwg). For example, with k=3 and
Na=Nb=Nc=5, we have dfbg = k-1 = 2 and dfwg = NT - k = 12. The corresponding
distribution is noted F2,12:
For large values of df, the peak of the distribution lies close to F = 1.0; for smaller
values, it lies to the left of F = 1.0. Here is a table of critical values of F for some
values of dfbg and dfwg at the .05 and .01 significance levels:
df denominator              df numerator
                    1          2          3          4
10     (.05)       4.96       4.10       3.71       3.48
       (.01)      10.04       7.56       6.55       5.99
11     (.05)       4.84       3.98       3.59       3.36
       (.01)       9.65       7.21       6.22       5.67
12     (.05)       4.75       3.89       3.49       3.26
       (.01)       9.33       6.93       5.95       5.41
13     (.05)       4.67       3.81       3.41       3.18
       (.01)       9.07       6.70       5.74       5.20
Example
We consider k=3 samples: A = (16, 15, 17, 15, 20), B = (20, 19, 21, 16, 18),
C = (18, 19, 18, 23, 18) of sizes Na=Nb=Nc=5, and therefore NT = 15.
We compute Ma = 16.6, Mb = 18.8, Mc = 19.2, MT = 18.2.
Then we get SSa = 17.2, SSb = 14.8, SSc = 18.8, SST = 70.4.
From there we get SSbg = 12.8+1.8+5.0 = 19.6 and SSwg = 17.2+14.8+18.8 = 50.8
We can verify that SSbg + SSwg = SST
Finally we get MSbg = 19.6/2 = 9.8 and MSwg = 50.8/12 = 4.23, and therefore F =
2.32, which is not significant at the .05 significance level (the critical value is 3.89).
The relationship between F and t
The ANOVA can be used for situations with two samples, in which case the results
are similar to those obtained with a t-test: the F-ratio would be equal to the square of
the t-ratio. The only difference is that a t-test can be either directional or non-directional, while the ANOVA (like Chi-square) is intrinsically non-directional.
Applications of F
The general form F = MSeffect / MSerror applies to many situations described in the next
chapters: one-way analysis of variance with independent (chap 14) and correlated
(chap 15) samples, and two-way analysis of variance (chap 16), which analyses the
effects of two independent variables concurrently, including their interaction effects.
Chapter 14 - One-way analysis of variance for independent samples
Assumptions of the one-way ANOVA for independent samples:
- the scale on which the dependent variable is measured is an equal interval scale;
- the k samples are independently and randomly drawn from the source population(s);
- the source population(s) can be reasonably supposed to have a normal distribution;
- the k samples have approximately equal variance, i.e. the estimated variances of the
source populations SSg/(Ng - 1) are within a ratio of 1.5.
The ANOVA is robust to violations of assumptions 1, 3 & 4 if all samples have the same size.
Step 1 - Combine all k groups and calculate SST = ∑ (Xi2) - (∑ Xi)2 / NT
Step 2 - For each group g calculate SSg = ∑ (Xgi2) - (∑ Xgi)2 / Ng
Step 3 - Calculate SSwg = ∑SSg
Step 4 - Calculate SSbg = SST - SSwg
Step 5 - Calculate dfT = NT - 1, dfbg = k - 1, dfwg = NT - k
Step 6 - Calculate MSbg = SSbg / dfbg and MSwg = SSwg / dfwg
Step 7 - Calculate F = MSbg / MSwg
Step 8 - Refer the calculated F-ratio to the table of critical values of Fdfbg, dfwg
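Steps 1 to 8 can be sketched in Python as follows (NumPy and SciPy assumed; the three groups of scores are invented for illustration):

import numpy as np
from scipy import stats

# Invented data: k = 3 independent groups
groups = [np.array([23., 26, 21, 25, 24]),
          np.array([28., 30, 27, 29, 31]),
          np.array([22., 24, 23, 21, 25])]

k = len(groups)
all_x = np.concatenate(groups)
n_t = all_x.size

ss_t = (all_x**2).sum() - all_x.sum()**2 / n_t                 # Step 1
ss_g = [(g**2).sum() - g.sum()**2 / g.size for g in groups]    # Step 2
ss_wg = sum(ss_g)                                              # Step 3
ss_bg = ss_t - ss_wg                                           # Step 4
df_bg, df_wg = k - 1, n_t - k                                  # Step 5
ms_bg, ms_wg = ss_bg / df_bg, ss_wg / df_wg                    # Step 6
f = ms_bg / ms_wg                                              # Step 7

f_crit = stats.f.ppf(0.95, df_bg, df_wg)                       # Step 8
print(f, f_crit, stats.f.sf(f, df_bg, df_wg))

print(stats.f_oneway(*groups))        # cross-check with SciPy's one-way ANOVA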
It is good practice to present a summary table of the results as follows:
Source                        SS        df      MS       F        P
between groups ("effect")   140.10       3     46.70    6.42     <.01
within groups ("error")     116.32      16      7.27
TOTAL                       256.42      19
Post-ANOVA comparisons: Tukey HSD test
A significant F-ratio tells you that the aggregate difference among the means of the
several samples is significantly greater than zero, but it does not tell you whether any
particular sample mean significantly differs from any particular one. Again, the t-test
cannot be used because we need to make several comparisons at once, e.g. A-B, A-C,
A-D, B-C, B-D, C-D.
The Tukey HSD (Honestly Significant Difference) test is based on the Studentized
range statistic, Q. Let us call ML and MS the largest and smallest means of all k
groups. Then
Q = (ML - MS) / sqrt [MSwg / Np/s]
where Np/s is the number of values per sample. If the k samples are of different sizes,
Np/s is the harmonic mean of the sample sizes : Np/s = k / ∑ (1 / Ng)
Q belongs to a distribution defined by two parameters: the number of groups (k) and
the degrees of freedom of the denominator of the F-ratio (dfwg).
Let us call Qc the critical value of Q for a given k and dfwg and a level of confidence c.
Juggling with the formula for Q, we get:
ML - MS = Q . sqrt [MSwg / Np/s]
Replacing Q by Qc, this gives us the minimal difference between the means of any
two groups for these two groups to be significantly different at the c level:
HSDc = Qc . sqrt [MSwg / Np/s]
In our example, we get
HSD.05 = 4.05 . sqrt [7.27 / 5] = 4.88
HSD.01 = 5.2 . sqrt [7.27 / 5] = 6.27
As can be seen in the table below, only two pairs are significantly different at both
levels: AC and AD
Pair      Means                      Difference
A·B       Ma=28.86   Mb=25.04         3.82
A·C       Ma=28.86   Mc=22.50         6.36
A·D       Ma=28.86   Md=22.30         6.56
B·C       Mb=25.04   Mc=22.50         2.54
B·D       Mb=25.04   Md=22.30         2.74
C·D       Mc=22.50   Md=22.30         0.20

HSD.05 = 4.88    HSD.01 = 6.27
Note that the fact that the difference between Ma and Mb is not significant does not
mean that there is no effect. It just means that the Tukey HSD test was not able to
detect it.
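The HSD computation can be sketched as follows (SciPy 1.7 or later assumed, since it provides the Studentized range distribution; MSwg = 7.27, dfwg = 16, k = 4 and 5 measures per sample are taken from the example above):

import numpy as np
from scipy import stats

ms_wg, df_wg, k, n_per_sample = 7.27, 16, 4, 5

for level in (0.05, 0.01):
    q_crit = stats.studentized_range.ppf(1 - level, k, df_wg)
    hsd = q_crit * np.sqrt(ms_wg / n_per_sample)
    print(level, round(hsd, 2))     # approximately 4.88 and 6.27, as above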
One-way ANOVA and correlation
In the above example, it is pretty clear from the plot that there is a correlation between
the dosage and the pull. It is also pretty obvious this is not a linear relationship, but
rather a curvilinear one. Within the context of analysis of variance, a useful measure
of the strength of a curvilinear relationship between the independant and dependant
variable is given by eta-square :
eta2 = SSbg / SST
In our example, we have eta2 = 0.55, which is
interpreted as: of all the variability that exists
within the dependent variable, 55% is associated
with the variability in the independent variable
"dosage". This is similar to the r2 from chapter 3.
In fact, when the relationship is linear, the values
of eta2 and r2 are the same. If the relationship is
curvilinear, then the r2 is smaller than the eta2
(see below). Note that this comparison with r2 is
only valid when the independent variable is measured on an equal-interval scale. If it is
ordinal or categorical, only the more general interpretation given above applies.
Kruskal-Wallis test for 3 or more independent samples
This is a non-parametric version of the ANOVA for independent samples. Since the
ANOVA is quite robust when the sample sizes are the same, this is particularly useful
when the dependent variable is ordinal and the sample sizes are different.
The mechanics of the Kruskal-Wallis test begins in a way very similar to the Mann-Whitney test from subchapter 11a, where the measures are replaced by their ranks.
Then it proceeds very similarly to an ANOVA.
See subchapter 14a online for the details.
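A minimal sketch with SciPy (the ordinal ratings and the unequal sample sizes are invented for illustration):

from scipy import stats

# Invented ordinal ratings for three independent samples of unequal size
a = [3, 4, 2, 5, 4]
b = [5, 6, 6, 4, 5, 6]
c = [2, 3, 2, 4]

# The routine replaces the measures by their ranks across all samples,
# then proceeds much like a one-way ANOVA on the ranks
h, p = stats.kruskal(a, b, c)
print(h, p)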
Chapter 15 - One-way analysis of variance for correlated samples
The analysis of variance for correlated samples is an extension of the correlated-samples t-test from Chapter 12. In the latter, we have a certain number of subjects,
each measured under two conditions, or, alternatively, we have a certain number of
matched pairs with one member measured under condition A and the other under
condition B. It is the same structure for correlated-samples ANOVA, except that now
the number of conditions (k) is 3 or more.
When each subject is measured under each of the k conditions, we have:
repeated measures or within subject design
Subject / Condition      A        B        C
1                       S1-A     S1-B     S1-C
2                       S2-A     S2-B     S2-C
3                       S3-A     S3-B     S3-C
and so on               (each row = one subject, measured under each condition)
When it involves subjects matched in sets of k, with the subjects in each
matched set randomly assigned to one of the k conditions, we have a:
randomized blocks design
(each set of k matched subjects is a block)
Subject / Condition      A        B        C
1                       S1a-A    S1b-B    S1c-C
2                       S2a-A    S2b-B    S2c-C
3                       S3a-A    S3b-B    S3c-C
and so on               (each row = k matched subjects, each measured under one condition)
The correlated samples ANOVA starts like the
independent-samples version, by computing MSbg
and MSwg (see right).
Then it removes the variability that exists inside each
of the k groups by identifying the portion
of SSwg that is attributable to pre-existing
individual differences. This portion,
SSsubj, is dropped from the analysis while
the remaining portion, SSerror, is used to
measure the sheer, random variability
(see right).
Identification and removal of SSsubj
For each subject (repeated measures) or each block (randomized block), we compute
the mean Msubj* (subj* represents any subject).
Then we compute the weighted square deviate = k (Msubj* - MT)2
The sum of these for all subjects is precisely SSsubj = ∑ k (Msubj* - MT)2
In practice a more efficient formula is used: SSsubj = [∑ (∑Xsubj*)2] / k - (∑XTi)2 / NT
From there, we easily get SSerror = SSwg - SSsubj
The degrees of freedom are computed as before (dfT = NT - 1, dfbg = k-1, dfwg = NT -k)
but now we need to split dfwg into dfsubj and dferror, corresponding to SSsubj and SSerror:
dfsubj = Nsubj - 1
dferror = dfwg - dfsubj
Finally we compute MSbg = SSbg / dfbg and MSerror = SSerror/dferror and the F-ratio
F = MSbg / MSerror, which is looked up against the critical F-value for df=dfbg, dferror.
The ANOVA summary table looks like this:
Source                        SS        df      MS       F        P
between groups ("effect")    120.0       2     60.0     20.0     <.01
within groups               2285.0      51
  · error                    103.3      34      3.0
  · subjects                2181.7      17
TOTAL                       2405.0      53
Note in this example the large value of SSsubj, which, once removed, makes the
difference significant.
Assumptions of the one-way ANOVA for correlated samples:
- the scale on which the dependent variable is measured is an equal interval scale;
- the measures within each group are independent of each other;
- the source population(s) can be reasonably supposed to have a normal distribution;
- the k samples have approximately equal variance, i.e. the estimated variances of the
source populations SSg/(Ng - 1) are within a ratio of 1.5;
- the correlation coefficients (r) among the k groups are positive and of approximately
the same magnitude (homogeneity of covariance).
This ANOVA is robust to violations of assumptions 3 & 4 (the samples necessarily have the same size here).
Step 1 - Combine all k groups and calculate SST = ∑ (Xi2) - (∑ Xi)2 / NT
Step 2 - For each group calculate SSg = ∑ (Xgi2) - (∑ Xgi)2 / Ng
Step 3 - Calculate SSwg = ∑ SSg
Step 4 - Calculate SSbg = SST - SSwg
Step 5 - Calculate SSsubj = [ ∑ (∑Xsubj*)2 ] / k - (∑XTi) 2 / NT
Step 6 - Calculate SSerror = SSwg - SSsubj
Step 7 - Calculate dfT= NT-1, dfbg= k-1, dfwg= NT-k, dfsubj= Nsubj-1, dferror= dfwg-dfsubj
Step 8 - Calculate MSbg = SSbg / dfbg and MSerror = SSerror / dferror
Step 9 - Calculate F = MSbg / MSerror
Step 10 - Refer the calculated F-ratio to the table of critical values of Fdfbg, dferror
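Steps 1 to 10 can be sketched in Python as follows (NumPy and SciPy assumed; the data matrix, one row per subject and one column per condition, is invented for illustration):

import numpy as np
from scipy import stats

# Invented repeated-measures data: rows = subjects, columns = k conditions
x = np.array([[20., 24, 27],
              [18., 22, 25],
              [25., 27, 31],
              [22., 24, 28],
              [19., 21, 26]])

n_subj, k = x.shape
n_t = x.size
all_x = x.ravel()

ss_t = (all_x**2).sum() - all_x.sum()**2 / n_t                  # Step 1
ss_g = (x**2).sum(axis=0) - x.sum(axis=0)**2 / n_subj           # Step 2 (one SS per condition)
ss_wg = ss_g.sum()                                              # Step 3
ss_bg = ss_t - ss_wg                                            # Step 4
ss_subj = (x.sum(axis=1)**2).sum() / k - all_x.sum()**2 / n_t   # Step 5
ss_error = ss_wg - ss_subj                                      # Step 6

df_bg, df_wg = k - 1, n_t - k                                   # Step 7
df_subj = n_subj - 1
df_error = df_wg - df_subj

ms_bg = ss_bg / df_bg                                           # Step 8
ms_error = ss_error / df_error
f = ms_bg / ms_error                                            # Step 9
print(f, stats.f.sf(f, df_bg, df_error))                        # Step 10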
Post-ANOVA comparisons: the Tukey HSD test
The Tukey HSD test introduced in Chapter 14 can be extended to correlated samples
by using MSerror instead of MSwg and dferror instead of dfwg. Since the number of
measures per group is equal to the number of subjects in a repeated measures design
or the number of matched sets of subjects in a randomized block design, we will also
be substituting Nsubj for Np/s.
The critical value for the difference between any two means to be significant is then
HSDc = Qc . sqrt [MSerror / Nsubj]
where Qc is the critical value from the Q distribution for k and dferror at the level of
confidence c.
Friedman test for 3 or more correlated samples
This is a non-parametric version of the ANOVA for correlated samples. Given that
the ANOVA is pretty robust, it is used mainly when the measures are mere rank-orderings or mere ratings, or intrinsically non-linear (e.g. decibels).
The procedure resembles the Kruskal-Wallis test (Chap 14) - see chapter 15a online.
Chapter 16 - Two-way analysis of variance for independent samples
The two-way ANOVA examines the effects of two independent variables
concurrently. In addition to looking at the effect of each variable, it also looks at their
interaction, i.e. whether the two variables interact with respect to their effect on the
dependent variable.
Suppose we want to test the effect of two drugs A and B on the blood level of a
certain hormone. We randomly and independently sort the subjects into 4 groups:
r1c1 is a control group (it will get a placebo), r1c2 will get one unit of B, r2c1 will
get one unit of A, and r2c2 will get one unit of A and one unit of B. This is
conventionally represented with a matrix, and it is conventional to speak of the row
variable (here, A) and the column variable (here, B):

                              Drug B
                       0 units               1 unit
Drug A    0 units      r1c1:                 r1c2:
                       0 units of A          0 units of A
                       0 units of B          1 unit of B
          1 unit       r2c1:                 r2c2:
                       1 unit of A           1 unit of A
                       0 units of B          1 unit of B
The differences between the means of the row variables and the means of the column
variables are called the main effects. The interaction effect is something above and
beyond the two main effects. Here are four scenarios:
Scenario 1 - This scenario shows no effect at all, either separately or in combination.
This corresponds to the null hypothesis.
              B: 0u     B: 1u     row mean
A: 0u           5         5         5
A: 1u           5         5         5
col. mean       5         5         5

Scenario 2 - In this scenario, both drugs increase the level of the hormone, but their
combined effect is a simple addition.
              B: 0u     B: 1u     row mean
A: 0u           5        10         7.5
A: 1u          10        15        12.5
col. mean       7.5      12.5      10

Scenario 3 - Here the drugs have the same effects as above when presented separately,
but now they interact: the effect of the two in combination is greater than the sum of
the separate effects.
              B: 0u     B: 1u     row mean
A: 0u           5        10         7.5
A: 1u          10        20        15
col. mean       7.5      15        11.25

Scenario 4 - Here, too, there is an interaction effect, but in the opposite direction:
the combined effect is a mutual cancellation.
              B: 0u     B: 1u     row mean
A: 0u           5        10         7.5
A: 1u          10         5         7.5
col. mean       7.5       7.5       7.5
Note in the last scenario that there is no main effect, since the row means are equal
and the column means are equal, even though each drug obviously has an effect when
presented separately. This means that the interpretation of the presence or absence of main
effects is not always simple and straightforward. There is an interaction when the
combination of the variables produces effects that are different from the simple sum
of their separate effects. When represented graphically, the absence of interaction
appears as lines that are approximately parallel, while the presence of an interaction
appears as lines that diverge, converge or cross over each other.
Procedure
The two-way ANOVA proceeds similarly to the one-way ANOVA: we compute SSbg and
SSwg. SSwg, divided by its degrees of freedom, becomes MSerror, the denominator of the
F-ratios, while SSbg is further divided into three parts: the two main effects (MSrows
and MScolumns) and the interaction effect (MSinteraction). This results in three F-ratios
and three tests of significance to perform.
Breaking SSbg into its component parts
SSrows (or SSr) measures the differences among the means of the two or more rows. It is
expected to be 0 if there is no main row effect. It is computed as the sum of square
deviates of each row mean from the total mean:
SSr = Nr . ∑ (Mri - MT)2 where Nr is the number of measures in each row
As usual, there is a more efficient way to compute this:
SSr = ∑ [(∑Xri)2 / Nri] - (∑XT) 2 / NT
Similary, SScolumns (or SSc) is the sum of square deviates between each column mean
and the expected column mean. It can be computed as
SSc = ∑ [(∑Xci)2 / Nci] - (∑XT) 2 / NT
SSinteraction (or SSrxc) is a bit more complicated.
Lets us call Mg* the mean of any particular group, Mr* the mean of the row to which
that group belongs and Mc* the mean of the column to which that group belongs. If
there were no interaction, we would expect the mean of each group to be a simple
additive combination of Mr* and Mc*, namely:
[null] Mg* - MT = (Mr* - MT) + (Mc* - MT)
or
[null] Mg* = Mr* + Mc* - MT
SSrxc is then the sum of square deviates between the observed group means Mgi and
the corresponding expected group means Mr* + Mc* - MT .
However, since we are breaking out SSbg into three components, a much more
efficient way to compute SSrxc is
SSrxc = SSbg - SSr - SSc
Next, we compute the degrees of freedom for each of these values of SS:
dfr = r - 1
dfc = c - 1
dfrxc = (r-1)(c-1)
From there we get the three mean squares:
MSr = SSr / dfr
MSc = SSc / dfc
MSrxc = SSrxc / dfrxc
and the three F-ratios:
Fr = MSr / MSerror
Fc = MSc / MSerror
Frxc = MSrxc / MSerror
which we can compare with the critical values in the corresponding F tables, i.e. for
df = (dfr,dferror)
df = (dfc,dferror)
df = (dfrxc,dferror)
It is useful to present the results in a table, as follows:
Source                     SS        df      MS        F        P
between groups           395.19       3
  rows                    31.33       1     31.33     4.47     <.05
  columns                 49.28       1     49.28     7.03     <.05
  interaction            314.58       1    314.58    44.88     <.01
within groups (error)    252.34      36      7.01
TOTAL                    647.53      39
One must be careful in interpreting main and interaction effects. If we plot the row
and column means (below, left), it looks like both A and B decrease the amount of
hormone. If however we plot the group means (below, center), the pattern is more
clear: there is a negative interaction effect between A and B. The interaction plot
(below, right), shows crossing lines, which is typical of this scenario.
Assumptions of the two-way ANOVA for independent samples:
- the scale on which the dependent variable is measured is an equal interval scale;
- the samples are independently and randomly drawn from the source population(s);
- the source population(s) can be reasonably supposed to have a normal distribution;
- the samples have approximately equal variance.
The ANOVA is robust to violations of assumptions 3 & 4 if all samples have the same
size and, to a certain extent, to violations of assumption 1 (see below).
Step 1 - Combine all rc groups and calculate SST = ∑ (Xi2) - (∑ Xi)2 / NT
Step 2 - For each of the rc group g calculate SSg = ∑ (Xgi2) - (∑ Xgi)2 / Ng
Step 3 - Calculate SSwg = ∑SSg
Step 4 - Calculate SSbg = SST - SSwg
Step 5 - Calculate SSr = ∑ [(∑Xri)2 / Nri] - (∑XT) 2 / NT
Step 6 - Calculate SSc = ∑ [(∑Xci)2 / Nci] - (∑XT) 2 / NT
Step 7 - Calculate SSrxc = SSbg - SSr - SSc
Step 8 - Calculate dfT = NT - 1, dfwg = NT - rc, dfr = r - 1, dfc = c - 1, dfrxc = (r-1)(c-1)
Step 9 - Calculate MSr = SSr / dfr, MSc = SSc / dfc, MSrxc = SSrxc / dfrxc and MSerror = SSwg / dfwg
Step 10 - Calculate Fr = MSr / MSerror , Fc = MSc / MSerror , Frxc = MSrxc / MSerror
Step 11 - Refer the calculated F-ratios to the table of critical values of Fdfnum, dfden
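Steps 1 to 11 can be sketched in Python for an r x c design with equal cell sizes (NumPy and SciPy assumed; the cell data are invented for illustration):

import numpy as np
from scipy import stats

# Invented data: 2 x 2 design, one independent group of scores per cell
cells = {(0, 0): np.array([5., 6, 4, 5, 7]),
         (0, 1): np.array([9., 10, 8, 11, 9]),
         (1, 0): np.array([10., 9, 11, 10, 12]),
         (1, 1): np.array([6., 5, 7, 6, 4])}
r, c = 2, 2

all_x = np.concatenate(list(cells.values()))
n_t = all_x.size
corr = all_x.sum()**2 / n_t                # the (sum of X)^2 / NT term

ss_t = (all_x**2).sum() - corr                                             # Step 1
ss_g = {g: (x**2).sum() - x.sum()**2 / x.size for g, x in cells.items()}   # Step 2
ss_wg = sum(ss_g.values())                                                 # Step 3
ss_bg = ss_t - ss_wg                                                       # Step 4

rows = [np.concatenate([cells[(i, j)] for j in range(c)]) for i in range(r)]
cols = [np.concatenate([cells[(i, j)] for i in range(r)]) for j in range(c)]
ss_r = sum(x.sum()**2 / x.size for x in rows) - corr                       # Step 5
ss_c = sum(x.sum()**2 / x.size for x in cols) - corr                       # Step 6
ss_rxc = ss_bg - ss_r - ss_c                                               # Step 7

df_wg, df_r, df_c, df_rxc = n_t - r * c, r - 1, c - 1, (r - 1) * (c - 1)   # Step 8
ms_error = ss_wg / df_wg                                                   # Step 9
for name, ss, df in [("rows", ss_r, df_r), ("columns", ss_c, df_c),
                     ("interaction", ss_rxc, df_rxc)]:
    f = (ss / df) / ms_error                                               # Step 10
    print(name, f, stats.f.sf(f, df, df_wg))                               # Step 11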
Robustness of two-way analysis of variance with ordinal scale data
If we plug ordinal rating-scale data into an analysis of variance and end up with
"significant" effects, are those effects really significant in the technical statistical
sense of the term?
We can test this by generating random samples for which we know the null hypothesis to
be true, and checking whether the chance of ending up with a "significant" effect at the
.05 level is still about 5%. The simulator available on-line shows that this is indeed
the case, which provides evidence that the ANOVA is quite robust with respect to the
scale of the dependent variable.
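Such a simulation can be sketched as follows (NumPy and SciPy assumed; the particular set-up, a 2 x 2 design with 5-point ratings drawn from the same distribution in every cell, is just one possible choice):

import numpy as np
from scipy import stats

def interaction_p(cells):
    # p-value of the interaction F-ratio for an r x c array of equal-sized cells
    r, c, n = cells.shape
    all_x = cells.ravel()
    corr = all_x.sum()**2 / all_x.size
    ss_t = (all_x**2).sum() - corr
    ss_wg = sum((cells[i, j]**2).sum() - cells[i, j].sum()**2 / n
                for i in range(r) for j in range(c))
    ss_bg = ss_t - ss_wg
    ss_r = sum(cells[i].sum()**2 / (c * n) for i in range(r)) - corr
    ss_c = sum(cells[:, j].sum()**2 / (r * n) for j in range(c)) - corr
    ss_rxc = ss_bg - ss_r - ss_c
    df_rxc, df_wg = (r - 1) * (c - 1), all_x.size - r * c
    f = (ss_rxc / df_rxc) / (ss_wg / df_wg)
    return stats.f.sf(f, df_rxc, df_wg)

rng = np.random.default_rng(0)
n_sims, n_per_cell = 2000, 10
hits = 0
for _ in range(n_sims):
    # Null hypothesis true: every cell drawn from the same 5-point rating scale
    cells = rng.integers(1, 6, size=(2, 2, n_per_cell)).astype(float)
    if interaction_p(cells) < 0.05:
        hits += 1
print(hits / n_sims)    # should stay close to .05 if the ANOVA is robust here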
Chapter 17 - One-way analysis of covariance for independent samples
The analysis of covariance or ANCOVA combines the analysis of variance from
Chapters 13 to 16 and the linear regression and correlation from Chapter 3. It can
remove the obscuring effects of pre-existing individual differences among subjects
without resorting to the repeated-measures strategy. In addition, it can compensate for
systematic biases among the samples.
Assume we want to assess the effectiveness of two methods of elementary mathematics
instruction. Obviously, we cannot use repeated measures, so one group receives Method A
and the other Method B. Since we want to make sure the measures are not biased by the
level of intelligence of each group, we measure the level of intelligence X of each
subject beforehand, and we measure the amount of learning Y at the end of the experiment.

         Method A                  Method B
Subj.    Xa       Ya      Subj.    Xb       Yb
a1       88       66      b1       90       62
a2       98       85      b2      100       87
a3      100       90      b3      110       91
a4      110       97      b4      120       98
Mean     99.0     84.5            105.0     84.5

The resulting table shows two things:
- the higher the intelligence, the more learning takes place, i.e. there is a
covariance between X and Y;
- the means of Y are the same, as if the same amount of learning had taken
place. However, if there were no difference between the groups, given the correlation
between intelligence and learning and since group B is more intelligent on average
than group A, group B should have learned more.
In short, the ANCOVA does two things:
- it removes from Y the extraneous variability that derives from pre-existing
individual differences, insofar as those differences are reflected in the covariate X;
- it adjusts the means of the Y variable to compensate for the fact that different
groups started with different mean levels of the covariate X.
This allows us to run what-if scenarios, such as: what would have happened if group A
had started with the same level of intelligence as group B?
The rest of the chapter will use the following example comparing two methods of
hypnotic induction, where the covariate X is the score on an index of primary
suggestibility, which is known to affect the response to hypnotic induction Y. Note how
the difference between the means of Y is small when compared to the spread of the
measures, which would make the difference between the means non-significant with a
regular ANOVA.

         Method A                  Method B
Subj.    Xa       Ya      Subj.    Xb       Yb
a1        5       20      b1        7       19
a2       10       23      b2       12       26
a3       12       30      b3       27       33
a4        9       25      b4       24       35
a5       23       34      b5       18       30
a6       21       40      b6       22       31
a7       14       27      b7       26       34
a8       18       38      b8       21       28
a9        6       24      b9       14       23
a10      13       31      b10       9       22
Mean     13.1     29.2             18.0     28.1
Note also the correlation (r = +.803)
between X and Y (right), which the
ANCOVA will take into account.
The ANCOVA requires four sets of calculations:
1- SS values for Y, i.e. analysis of variance for the dependent variable in
which we are chiefly interested => SST(Y), SSwg(Y) and SSbg(Y) ;
2- SS values for X, the covariate whose effects upon Y we wish to bring
under statistical control => SST(X) and SSwg(X) ;
3- SC measures for the covariance of X and Y => SCT and SCwg ;
4- A final set of calculations to remove from Y the portion of its variability
that is attributable to its covariance with X.
1- Calculations for the dependent variable Y
As with the usual ANOVA, we compute SST(Y), SSwg(Y) and SSbg(Y).
2- Calculations for the covariate X
Next we compute SST(X) and SSwg(X) (SSbg(X) is not needed).
3- Calculations for the covariance of X and Y
The sum of codeviates is (Chapter 3): SC = ∑ (Xi - MX)(Yi - MY)
It can be calculated more efficiently as SC = ∑ XiYi - (∑ Xi) (∑Yi) / N
We compute two values of SC:
- SCT, the covariance of X and Y within the total array of data
SCT = ∑ XTiYTi - (∑ XTi) (∑YTi) / NT
- SCwg, the covariance within the two groups
SCwg = SCwg(a) + SCwg(b)
= ∑ XaiYai - (∑ Xai) (∑Yai) / Na + ∑ XbiYbi - (∑ Xbi) (∑Ybi) / Nb
4- Final set of calculations
4a- Adjustment of SST(Y)
The overall correlation between X and Y (both groups combined) is:
rT = SCT / sqrt[SST(X)SST(Y)]
The proportion of the total variability of Y attributable to its
covariance with X is rT2. We adjust SST(Y) by removing this proportion:
[adj]SST(Y) = (1 - rT2) SST(Y) = SST(Y) - SCT2 / SST(X)
4b- Adjustment of SSwg(Y)
The same goes for the aggregate correlation between X and Y within
the two groups, calculated as rwg = SCwg / sqrt[SSwg(X)SSwg(Y)].
SSwg(Y) is adjusted by removing the proportion of the within-groups
variability of Y attributable to covariance with X:
[adj]SSwg(Y) = (1 - rwg2) SSwg(Y) = SSwg(Y) - SCwg2 / SSwg(X)
4c- Adjustment of SSbg(Y)
This is simply obtained by subtraction:
[adj]SSbg(Y) = [adj]SST(Y) - [adj]SSwg(Y)
4d- Adjustment of the means of Y for groups A and B
We compute the slope of the regression line for the aggregate
correlation between X and Y within the two groups: bwg = SCwg / SSwg(X)
We can now adjust the means for Y to take into account the fact that
the two groups started with different means for X and the correlation between X & Y:
[adj]MYa = MYa - bwg(MXa - MXT)
[adj]MYb = MYb - bwg(MXb - MXT)
4e- Analysis of covariance using adjusted values of SS
The F-ratio is, as usual, MSeffect / MSerror = (SSbg/dfbg)/(SSwg/dfwg)
except that we use the adjusted values of SSbg(Y) and SSwg(Y) and an adjusted value of dfwg:
[adj]dfwg(Y) = NT - k - 1
dfbg(Y) = k - 1
F = ([adj]SSbg(Y) / dfbg(Y)) / ([adj]SSwg(Y) / [adj]dfwg(Y))
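The whole sequence can be sketched in Python for k = 2 groups (NumPy and SciPy assumed; the X and Y values are those of the hypnotic-induction example above):

import numpy as np
from scipy import stats

# Data from the hypnotic-induction example
xa = np.array([5., 10, 12, 9, 23, 21, 14, 18, 6, 13])
ya = np.array([20., 23, 30, 25, 34, 40, 27, 38, 24, 31])
xb = np.array([7., 12, 27, 24, 18, 22, 26, 21, 14, 9])
yb = np.array([19., 26, 33, 35, 30, 31, 34, 28, 23, 22])

def ss(v):       # sum of square deviates
    return (v**2).sum() - v.sum()**2 / v.size

def sc(x, y):    # sum of codeviates
    return (x * y).sum() - x.sum() * y.sum() / x.size

k, n_t = 2, xa.size + xb.size
xt, yt = np.concatenate([xa, xb]), np.concatenate([ya, yb])

# 1 & 2 - SS values for Y and for the covariate X
ss_t_y, ss_wg_y = ss(yt), ss(ya) + ss(yb)
ss_t_x, ss_wg_x = ss(xt), ss(xa) + ss(xb)

# 3 - SC values for the covariance of X and Y
sc_t, sc_wg = sc(xt, yt), sc(xa, ya) + sc(xb, yb)
print(sc_t / np.sqrt(ss_t_x * ss_t_y))     # overall rT, about +.80 for these data

# 4 - adjusted SS values, adjusted means and the F-ratio
adj_ss_t_y = ss_t_y - sc_t**2 / ss_t_x
adj_ss_wg_y = ss_wg_y - sc_wg**2 / ss_wg_x
adj_ss_bg_y = adj_ss_t_y - adj_ss_wg_y
df_bg, adj_df_wg = k - 1, n_t - k - 1
f = (adj_ss_bg_y / df_bg) / (adj_ss_wg_y / adj_df_wg)
print(f, stats.f.sf(f, df_bg, adj_df_wg))

b_wg = sc_wg / ss_wg_x                     # within-groups regression slope
adj_mya = ya.mean() - b_wg * (xa.mean() - xt.mean())
adj_myb = yb.mean() - b_wg * (xb.mean() - xt.mean())
print(adj_mya, adj_myb)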
The analysis of the significance of the F-ratio is a bit trickier than with the ANOVA.
A significant F-ratio does not mean that the difference between the sample means
MYA and MYB is significant in and of itself. What it means is:
- If the correlation between X and Y within the general population is
approximately the same as the one observed within the samples; and
- If we remove from Y the covariance that it has with X, so as to remove the
pre-existing individual differences that are measured by the covariate X; and
- If we adjust the group means of Y in accordance with the observed
correlation between X and Y;
- Then we can conclude that the two adjusted means [adj]MYa and [adj]MYb
significantly differ.
Assumptions for the one-way ANCOVA for independent samples:
- the scale on which the dependent variable is measured is an equal interval scale;
- the samples are independently and randomly drawn from the source population(s);
- the source population(s) can be reasonably supposed to have a normal distribution;
- the samples have approximately equal variance;
- the slopes of the regression lines for each of the groups considered separately are all
approximately the same (homogeneity of regression, see below).
The ANCOVA is robust to violations of assumptions 1, 3 & 4 if all samples have the same size.
Example
We compare three methods (A, B, C) of instruction for computer programming (Y),
taking into account the pre-existing levels of computer familiarity (X).
We get MXa = 13.0, MYa = 25.4, MXb = 13.8, MYb = 26.3, MXc = 10.1, MYc = 23.8.
After running the ANCOVA, we get the following adjusted means, which differ quite
sharply from the observed ones (see figure to the right):
[adj]MYa = 24.3, [adj]MYb = 23.9, [adj]MYc = 27.3
The ANCOVA gives an F-ratio of 3.8 (df=2,32), which is
significant beyond the .05 level.
However we need to test for the homogeneity of regression.
Homogeneity of regression
The figure to the right shows the 4 regression lines for the
above example: three in red for each group and one in blue for
the overall within-groups regression. Although the lines seem
to have similar slopes, we need a more rigorous test. To that
end, we once again use an F-ratio of the form
F = MSeffect / MSerror = (SSeffect/dfeffect)/(SSerror/dferror)
- SSeffect is a measure of the aggregate degree to which the separate regression-line
slopes of the groups differ from the slope (bwg) of the overall within-groups
regression. It is called SSb-reg and is computed as
SSb-reg = ∑ [SCg2 / SSg(X)] - SCwg2 / SSwg(X)
- SSerror is what is left over from [adj]SSwg(Y), the adjusted error of the original
ANCOVA, after SSb-reg has been removed from it. It is called SSY(remainder):
SSY(remainder) = [adj]SSwg(Y) - SSb-reg
- the corresponding degrees of freedom are
dfb-reg = k - 1
dfY(remainder) = [adj]dfwg(Y) - dfb-reg = NT - 2k
For the above example, we get an F-ratio of 0.30 (df=2, 30), which is not significant
and therefore the homogeneity of regression hypothesis is satisfied.
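This test can be sketched as follows (NumPy and SciPy assumed; the two groups of X and Y values are invented for illustration, and the helpers mirror the formulas above):

import numpy as np
from scipy import stats

def ss(v):    return (v**2).sum() - v.sum()**2 / v.size            # sum of square deviates
def sc(x, y): return (x * y).sum() - x.sum() * y.sum() / x.size    # sum of codeviates

# Invented data for k = 2 groups: (covariate X, dependent variable Y)
groups = [(np.array([5., 9, 12, 14, 20]), np.array([21., 24, 30, 29, 38])),
          (np.array([7., 11, 15, 18, 22]), np.array([20., 25, 28, 33, 35]))]

k = len(groups)
n_t = sum(x.size for x, _ in groups)
ss_wg_x = sum(ss(x) for x, _ in groups)
ss_wg_y = sum(ss(y) for _, y in groups)
sc_wg = sum(sc(x, y) for x, y in groups)
adj_ss_wg_y = ss_wg_y - sc_wg**2 / ss_wg_x

# Aggregate departure of the separate group slopes from the overall slope bwg
ss_b_reg = sum(sc(x, y)**2 / ss(x) for x, y in groups) - sc_wg**2 / ss_wg_x
ss_y_remainder = adj_ss_wg_y - ss_b_reg
df_b_reg, df_remainder = k - 1, n_t - 2 * k

f = (ss_b_reg / df_b_reg) / (ss_y_remainder / df_remainder)
print(f, stats.f.sf(f, df_b_reg, df_remainder))   # a small, non-significant F supports homogeneity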