Summary of Richard Lowry's course - http://faculty.vassar.edu/lowry/webtext.html
Michel Beaudouin-Lafon

Table of contents
Chapter 1 - Measurement
Chapter 2 - Distributions
Chapter 3 - Linear Correlation and Regression
Chapter 4 - Statistical significance
Chapter 5 - Basic concepts of probability
Chapter 6 - Introduction to probability sampling distributions
Chapter 7 - Tests of statistical significance: Three overarching concepts
Chapter 8 - Chi-square procedures for the analysis of categorical frequency data
Chapter 9 - Introduction to procedures involving sample means
Chapter 10 - T-procedures for estimating the mean of a population
Chapter 11 - T-test for the significance of the difference between the means of two independent samples
Chapter 12 - T-test for the significance of the difference between the means of two correlated samples
Chapter 13 - Conceptual introduction to the analysis of variance
Chapter 14 - One-way analysis of variance for independent samples
Chapter 15 - One-way analysis of variance for correlated samples
Chapter 16 - Two-way analysis of variance for independent samples
Chapter 17 - One-way analysis of covariance for independent samples

Summary of statistical tests
One independent variable, continuous equal-interval scale; one dependent variable, continuous equal-interval scale => Pearson correlation (chap 3)
N independent variables, categorical; one dependent variable, categorical frequency data => Chi-square (chap 8)
One independent variable, categorical; one dependent variable, equal-interval scale:
    Two samples, independent: t-test (parametric, chap 11) or Mann-Whitney (nonparametric, chap 11)
    Two samples, correlated: t-test (parametric, chap 12) or Wilcoxon (nonparametric, chap 12)
    N samples, independent: 1-way ANOVA (parametric, chap 14) or Kruskal-Wallis (nonparametric, chap 14)
    N samples, correlated: 1-way ANOVA (parametric, chap 15) or Friedman (nonparametric, chap 15)
Two independent variables, categorical; one dependent variable, equal-interval scale; independent samples, parametric test: 2-way ANOVA (chap 16), ANCOVA (chap 17)

Chapter 1 - Measurement
Variable = the property being measured
Variate = a specific measure
Types of measures:
Counting = counting the number of units, e.g. size or weight => Scalar variables
Ordering = arranging in order, e.g. things you worry about a bit, some, a lot => Ordinal variables
Sorting = putting into categories, e.g. male/female => Nominal variables
Scalar variables
Absolute scale = counting a number of items
Relative scale = measuring relative to a unit scale, e.g. cm or kg
Equal interval scales => can compute compound measures: sum, difference and average (see below)
Unequal interval scales, e.g. dB => cannot compute compound measures
Continuous scale => values are real numbers
Discrete scale => values are integer numbers
Ratio scale if there is an absolute zero, e.g. cm or kg, but not ºC => permits the computation of ratios, e.g. twice as large
Non-ratio scales do not permit this, but differences are a ratio scale: "t1 ºC is twice t2 ºC" is meaningless, but "(t2 ºC - t1 ºC) is twice (t3 ºC - t2 ºC)" is OK
Compound measures = sum, difference, average
All require equal interval scales for the individual measures, and the result has the same properties (abs/rel, cont/discr, ratio) as the individual measures
Sum : result has the same properties as the individual measures
Average : result is always continuous, e.g. averaging the number of students (discrete) => continuous
Difference : result is always a ratio scale (see example above)
Cross-classification of equal-interval scales
Discrete / Non-ratio => does not exist in practice (counting has an absolute zero)
Discrete / Ratio => number of discrete items or events
Continuous / Non-ratio => fairly rare, e.g.
Fahrenheit and Celsius degrees
Continuous / Ratio => most measures of continuous variables
Ordinal variables
Specify the order of size, quantity or magnitude among the measured items
Rank-order scales
Example : rank what's important in your life from highest to lowest
Family > Friends > Career > Money => intrinsically non equal interval
Rating scales
Assign a numeral rating to the items above:
0 completely unimportant
1 slightly important
2 moderately important
3 quite important
4 exceedingly important
e.g. Family = 4.0, Friends = 3.8, Career = 3.6, Money = 3.4
vs Family = 4.0, Friends = 3.5, Career = 2.0, Money = 1.0
More information than rank-order, but still not an equal-interval scale, even when presented as such
Often treated as such by using compound measures, e.g. average
Nominal variables
Sort into categories, e.g. male/female, without any order among categories
No compound measure makes sense; however, we can use a cardinal scale to count the number of items in each category => twice as many males as females
Cross-categorizing, e.g. male/female & geographical origin => bivariate or multivariate classification
Study whether two or more categories are systematically associated with each other
Direct vs Indirect measures
Direct measure : use a tape measure to measure the desk
Indirect measure : assess the knowledge of students by grading a test
= direct measure of how well the students did, indirect measure of their knowledge of the material
More often than not, we need to use indirect measures
Reliability
Whether repeated measures give the same answer => a metal tape measure is more reliable than an elastic one
Validity
Whether a direct measure is actually a valid indirect measure => "fair" vs "unfair" exam, or measuring protuberances on the head to assess personality and mentality (19th-century phrenology)

Chapter 2 - Distributions
Distribution = list of individual measures
Graphic representation = absolute or relative (i.e. %) frequencies: histogram, frequency polygon, frequency curve
Examples of distributions [figure]
Parameters of a distribution
Central tendency - measures tend to cluster around a point of aggregation
Variability - measures tend to spread out away from each other
Skew - one tail of the distribution is more elongated than the other
Negatively skewed -> elongated tail to the left [A]
Positively skewed -> elongated tail to the right [B]
No skew = symmetric [C-D-E]
Kurtosis - whether there is a cluster of measures, i.e. a peak in the distribution
Platykurtic -> mostly flat [D]
Leptokurtic -> high peak [E]
Mesokurtic -> medium [C]
Modality - number of distinct peaks (1 = unimodal, 2 = bimodal, 3 = trimodal)
Bimodal: major peak = highest / minor peak = lowest [F]
Can reveal the mix of 2 unimodal distributions, e.g.
well prepared vs not so well prepared students
Measures of central tendency
Measure the location of the clustering: mode, median, and mean
Mode = point or region with the highest number of individual measures
Median = midpoint of all individual measures
Mean = arithmetic average of all individual measures
Mean = Median = Mode when the distribution is unimodal and symmetric
Skewed distribution : mean towards the tail, mode away from the tail, median somewhere in between
Median and mode are rather useless in analytical and inferential statistics, because only the mean has the property of an equal interval scale
The mean is noted M: MX is the mean of the Xi = (∑ Xi) / N
Measures of variability
Measure the strength of the clustering
Range = distance between the lowest and highest measures
Interquartile range = distance between the lowest and highest of the middle 50% of the measures
Both are rather useless in analytical and inferential statistics
Variance & Standard deviation
Deviate = Xi - MX
Sum of squared deviates = SSX = ∑ (Xi - MX)²
Variance (also called mean square) = s² = SSX / N
Standard deviation (also called root mean square) = s = sqrt(SSX / N)
An easier formula to compute the sum of squared deviates is SSX = ∑ Xi² - (∑ Xi)² / N
The standard deviation can be visualized as a scale for the distribution: MX ± 1s represents a range within the distribution, similar to the interquartile range, but computed with all Xi
MX ± 1s typically encompasses about 2/3 of the values
Varieties of distributions
Empirical distribution - set of variates that have been (or could be) observed
Theoretical distribution - derived from mathematical properties
Theoretical distributions are at the heart of inferential statistics
The best-known is the normal distribution, also called the bell-shaped curve
Normal distribution: +1z and -1z fall at the points where the curvature changes (convex <-> concave)
68.26% of the distribution is in the -1z to +1z range; each tail beyond encompasses 15.87% of the distribution
Populations vs. Samples
Population distribution = ALL the measures of a variable
Sample distribution = SOME measures of a variable
Inferential statistics = inferring properties of populations from samples

Chapter 3 - Linear Correlation and Regression
Correlation and Regression = relationship between two variables X and Y, when each Xi is paired with one Yi, e.g. height vs weight
If the more of X, the more of Y => positive correlation
If the more of X, the less of Y => negative correlation
Scatterplot = bivariate coordinate plot: plot a dot at coordinates (Xi, Yi) for each i
If a causality is suspected between the variables:
the independent variable (the one capable of influencing the other) is X
the dependent variable (the one being influenced by the other) is Y
A linear relationship can be represented by a line that best fits the scatterplot
Positive correlation = upward line
Negative correlation = downward line
Zero correlation = no apparent pattern
Physical phenomena tend to fit linear relationships almost perfectly, while behavioral and biological phenomena fit them more imperfectly
Measuring linear correlation
Pearson product-moment correlation coefficient = r
Ranges from -1.0 (perfect negative correlation) to +1.0 (perfect positive correlation), 0.0 meaning zero correlation
Coefficient of determination = r², ranges from 0.0 to 1.0
Provides an equal interval and ratio scale of the strength of the correlation, but not its sign, which is given by r
This strength of the correlation can be turned into a percentage (0.44 -> 44%) and interpreted as the amount of the variability of Y that is associated with (tied to / linked with / coupled with) the variability of X (and vice versa).
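The product-moment correlation coefficient can be sketched in Python; this is a minimal illustration, and the height/weight data are made up for the example:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - mx) ** 2 for x in xs)                      # sum of squared deviates of X
    ss_y = sum((y - my) ** 2 for y in ys)                      # sum of squared deviates of Y
    sc_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # sum of co-deviates of (X, Y)
    return sc_xy / math.sqrt(ss_x * ss_y)

heights = [150, 160, 165, 172, 180]   # made-up paired data
weights = [52, 58, 63, 64, 80]
r = pearson_r(heights, weights)       # strongly positive for these data
r2 = r * r                            # coefficient of determination
```

The same `pearson_r` gives exactly +1.0 for perfectly linearly related pairs, and values near 0 when no pattern relates X and Y.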
Regression line = line that best fits the scatterplot
Criterion for best fit : the sum of the squared vertical distances between the data points and the line is minimal
The line goes through the point (MX, MY)
Covariance = similar to variance, but for paired bivariate instances of X and Y
Co-deviateXY = deviateX · deviateY
covariance = (∑ co-deviateXY) / N = SCXY / N
An easier formula to compute SCXY is SCXY = ∑ XiYi - (∑ Xi)(∑ Yi) / N
r = observed covariance / maximum possible positive covariance
observed covariance : see above
maximum possible positive covariance : geometric mean of the variances of X and Y = sqrt(varianceX · varianceY)
Computationally easier formula for r: r = SCXY / sqrt(SSX · SSY)
where SCXY = sum of co-deviates of (X,Y), and SSX and SSY = sums of squared deviates of X and Y
Interpretation of correlation
Correlation represents the co-variation of the paired instances of Xi and Yi
It can be represented by overlapping circles: the amount of overlap is r² of each circle; the non-overlap is the residual variance, 1 - r²
Where does this correlation come from? Could it be cause and effect?
Correlation is a tool, and like any tool, it can do harm if misused
Interpreting correlations as causal relationships requires care and cannot result solely from the observed correlation. Here are the 3 questions to ask oneself:
1- Statistical significance - could it be that the observed pattern is a mere effect of chance, i.e. it exists in the sample but not in the population? See the next chapter for a first cut on statistical significance.
2- Does X cause Y, or does Y cause X (in which case you should swap them), or could it be a reciprocal causal relationship where X affects Y and Y affects X? For example, in American football, a higher score of the winner is correlated with a higher score of the loser. It is likely that they influence each other in a reciprocal relationship.
3- Could X and Y both be influenced by another variable Z, not taken into account? For example, the number of births of baby boys is highly correlated with the number of births of baby girls, but it is hard to imagine a causal relationship (even a reciprocal one). Instead, they are both influenced by a third factor, the birth rate.
Regression line: y = b x + a
The slope of the line is b = SCXY / SSX
The intercept of the line is a = MY - b MX
The line of regression is not just useful to show the correlation, it can also be used to make predictions: given a value Xi, what is the likely value of Yi?
predicted Yi = b Xi + a ± SE
where SE is the standard error of estimate = sqrt(SSresidual / (N-2))
SSresidual is the sum of squared residuals, where the residual is the vertical deviate between a point and the line of regression: SSresidual = SSY · (1 - r²)
Since r² is the proportion of the variability of Y that is explained by the variability of X, 1 - r² is the residual proportion, i.e. the variability in Y that is not associated with the variability in X. Multiplied by SSY, it gives the amount of SSY that is residual, i.e. not accounted for by the correlation between X and Y.
Note that SE is almost the standard deviation of the residuals, the difference being that we divide by N-2 instead of N. This will be explained later, but relates to the fact that whereas the standard deviation is a measure of the sample, the standard error has to do with the overall population represented by this sample.
The standard error can be visualized by drawing two lines, above and below the regression line, at a distance of SE. We can predict that approximately 2/3 of the XiYi pairs in the actual population will fall between these two lines. We can also say that we have 2/3 confidence that the value of Yi falls within ±SE of the predicted value for a given Xi.
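These regression formulas can be sketched directly; the sample data below are made up, and the helper name `regression` is ours:

```python
import math

def regression(xs, ys):
    """Least-squares line y = b*x + a, plus the standard error of estimate SE."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    sc_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sc_xy / ss_x                        # slope = SCXY / SSX
    a = my - b * mx                         # intercept; the line passes through (MX, MY)
    r2 = sc_xy ** 2 / (ss_x * ss_y)         # coefficient of determination
    ss_residual = ss_y * (1 - r2)           # variability in Y not tied to X
    se = math.sqrt(ss_residual / (n - 2))   # standard error of estimate
    return a, b, se

xs = [1.0, 2.0, 3.0, 4.0, 5.0]              # made-up paired data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b, se = regression(xs, ys)
predicted = b * 3.5 + a                     # prediction for x = 3.5, within ±se about 2/3 of the time
```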
[see subchapters 3a and 3b on partial correlation and rank-order correlation]

Chapter 4 - Statistical significance
The essential task of inferential statistics is to determine what can reasonably be concluded about a population on the basis of a limited sample from that population. A sample is like a "window" into the population. Sometimes it truly represents the entire population; sometimes it misrepresents it, leading to erroneous conclusions. This is particularly the case when the studied phenomenon contains some element of random variability.
Statistical significance is the apparatus by which one can assess the extent to which the observed facts do not result from mere chance coincidence, i.e. the confidence one can have in the inferred relationships.
Statistical significance for correlation
The question is whether the observed correlation, measured by r on a sample, corresponds to the actual correlation of the entire population, called ρ ("rho"). In particular, how likely is it that we measure r ≠ 0 when ρ = 0, i.e. that we detect a relationship when none exists?
A small experiment
Get a pair of dice of different colors, say white and blue, and toss them multiple times. Record the pair of numbers (Xi = white die, Yi = blue die). There is no reason to expect any correlation between Xi and Yi over the entire population, i.e. ρ = 0.
However, if you collect sets of N=5 tosses and compute the coefficient of correlation r for each set, you will get values that deviate by a large amount from 0, with about 40% of the values beyond ±0.5. If you draw a relative frequency histogram of the r values, you get a pretty flat (platykurtic) distribution.
Now repeat the experiment with sample sizes of N=10, N=20, N=30 and draw the same frequency histograms.
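This dice experiment can be simulated; the sketch below is ours (not part of the original course), and it skips the rare degenerate samples where all five throws of one die are identical, since r is undefined there:

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson r; returns None when a variable has zero variance (r undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        return None
    sc_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sc_xy / math.sqrt(ss_x * ss_y)

def fraction_beyond(sample_size, trials=20000, cutoff=0.5, seed=42):
    """Toss two independent dice sample_size times per trial; return the
    fraction of trials whose sample r falls beyond ±cutoff."""
    rng = random.Random(seed)
    hits, total = 0, 0
    for _ in range(trials):
        xs = [rng.randint(1, 6) for _ in range(sample_size)]
        ys = [rng.randint(1, 6) for _ in range(sample_size)]
        r = pearson_r(xs, ys)
        if r is None:
            continue
        total += 1
        if abs(r) > cutoff:
            hits += 1
    return hits / total

frac_n5 = fraction_beyond(5)   # roughly 40% of sample r values beyond ±0.5, even though ρ = 0
```

Raising the sample size per trial shows the same shrinking spread described in the text.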
With N=10, 12% of the samples are outside the ±0.5 range
With N=20, 2.5% of the samples are outside the ±0.5 range
With N=30, 0.5% of the samples are outside the ±0.5 range
As N increases, the distribution approaches a normal distribution, and the tendency of sample correlation coefficients to deviate from 0 decreases.
Another experiment
Suppose a team of medical investigators is exploring the properties of a new nutrient, called X. They hypothesize that it should have the effect of increasing the production of a certain blood component, called Y. They conduct an experiment with some adult subjects and they get r = +0.50. The question of statistical significance is: what confidence can they have that this result is not just due to mere chance coincidence?
This of course depends on the size of the sample. In the dice experiment, with N=5, there is a 20% chance of observing a positive correlation as large as r=+0.5 when the correlation within the entire population is ρ=0. With N=10, it drops to 6%; with N=20, to 1.25%; with N=30, to 0.25%.
In most areas of scientific research, the criterion for statistical significance is set at the 5% level, i.e. a result is regarded as statistically significant only if it had a 5% or smaller likelihood of occurring by mere chance coincidence.
For correlation analysis, the r-values required for statistical significance at the 5% level can be computed as a function of the sample size N. Two cases need to be considered, directional and non-directional, reflecting whether the investigators specify, in advance, whether they expect the correlation to be positive or negative:
Positive/Negative directional hypothesis: the relationship between X and Y in the population is positive/negative (the more X, the more/less Y), so this particular sample of XiYi pairs will show a positive/negative correlation.
Non-directional hypothesis: the relationship between X and Y in the population is something other than 0, either positive or negative, so this particular sample of XiYi pairs will show a non-zero correlation.
As shown in the table below, the r-values required for statistical significance are higher with a non-directional hypothesis than with a directional one.

N                       5     10    12    15    20    25    30
±r (directional)        0.81  0.55  0.50  0.44  0.38  0.34  0.31
±r (non-directional)    0.88  0.63  0.58  0.51  0.44  0.40  0.36

In the medical experiment, the +0.50 correlation is statistically significant at the 5% level only if the sample size N is at least 12 (the 0.50 entry in the table), since the hypothesis is directional. For larger sample sizes, the correlation is statistically significant beyond the 5% level.
Test for the significance of the Pearson product-moment correlation coefficient (see chapters 9-12): t = r / sqrt[(1-r²)/(N-2)]

Chapter 5 - Basic concepts of probability
Imagine someone pretending that he can control the outcome of tossing a coin to make it come up heads, though not 100% of the time. He offers to be tested on a series of 100 tosses. How many heads would have to turn up for you to find the result "impressive"? In general, the answer is between 70 and 80, i.e. 70% to 80%. Our (rightful) intuition is that if he does not control the toss, the result will be around 50% - i.e. you wouldn't be impressed if it were 51, 52 or even 53. Let's say you draw the line at 70%.
Now what if the test had involved 10 tosses rather than 100? Would you be impressed if heads turned up 7 times out of the 10 tosses? Probably not. Indeed, as we'll show, the chance of getting 70% or more heads in 10 tosses is 17%, while it is 0.005% for 100 tosses. With the standard 5% criterion for significance, 7 out of 10 is not enough.
Laplace on the theory of probability: "at bottom only common sense reduced to calculation".
Probability of event x = P(x) = (number of possibilities favorable to the occurrence of x) divided by (total number of pertinent possibilities)
A priori probability: known precisely in advance, by counting or enumerating
e.g., drawing a blue ball out of a box whose content is known
A posteriori probability: estimated from a large number of observations
e.g., success rate of a surgical procedure
Compound probability: common sense explained by calculation
Conjunction: A and B
Disjunction: A or B
Conjunctive probability: P(A and B)
P(A and B) = P(A) x P(B) if A and B are independent
If A and B are not independent, be careful to count each probability properly
Example: 4 females and 6 males in a room, pick 3; probability that all 3 are female:
P(first pick is female) = 4/10
P(second pick is female) = 3/9 since we've already picked one
P(third pick is female) = 2/8 since we've already picked two
P(all three are female) = 4/10 * 3/9 * 2/8 = .033 = 3.3%
Probability that all 3 are male, by similar reasoning: 6/10 * 5/9 * 4/8 = .167 = 16.7% (i.e., 5 times as likely as picking 3 females!)
Disjunctive probability: P(A or B)
P(A or B) = P(A) + P(B) if A and B are mutually exclusive
If A and B are not mutually exclusive, remove the common occurrences
Example: 26 students, 12 sophomores and 14 juniors
12 sophomores: 7 females and 5 males
14 juniors: 8 females and 6 males
Probability of picking either a sophomore or a female:
P(S or F) = P(S) + P(F) = 12/26 + 15/26 = 27/26 > 1    WRONG
P(S or F) = P(S) + P(F) - P(S and F) = P(S) + P(F) - P(S)P(F) = 12/26 + 15/26 - (12/26 * 15/26)    WRONG
The probability of drawing a sophomore is 12/26, but once we've done that, the probability that it is a female is not 15/26 but 7/12. Or the other way around: the probability of drawing a female is 15/26, but once we've done that, the probability that she is a sophomore is not 12/26 but 7/15.
We get P(S and F) = 12/26 * 7/12 = 7/26, or P(S and F) = 15/26 * 7/15 = 7/26
So the result is P(S or F) = 12/26 + 15/26 - 7/26 = 20/26 = .769
It is easier to make such computations with the complementary probability, especially when there are more than 2 events (A or B or C), using the fact that P(x) = 1 - P(!x)
Here, the probability of drawing neither a sophomore nor a female is 6/26, since there are 6 male juniors: P(S or F) = 1 - P(M and J) = 1 - 6/26 = 20/26
More complex calculations
An unknown disease causes 40% of patients to recover after 2 months (and therefore 60% don't). The a posteriori probability of recovery is P(r) = .4. Researchers want to test the effect of a plant, and start with a small sample of 10 patients. 7 of them recover within 2 months, while 3 don't. Is this 70% recovery rate significantly better than the 40%, i.e. is the plant having an effect?
We need to compute the probability that at least 7 patients out of 10 recover with a recovery probability of 40%, and see whether it is below the 5% threshold of significance. This probability is P(7, 8, 9 or 10 patients recover), which is the sum of the individual probabilities P(7r) + P(8r) + P(9r) + P(10r).
P(10r) is easy since there is only one possibility, that all patients recover: P(10r) = .4 x .4 x ... x .4 = .4^10 = .000105
P(9r) is the probability that all patients but the first recover, plus the probability that all but the second recover, etc. Each of these probabilities is (.6 x .4^9), so the total is 10 times that = .00157
P(8r) gets even more complicated, since we need to look at all the combinations where 8 patients recover and 2 don't. Fortunately, there is a general formula for the number of ways in which it is possible to obtain "k out of N" (10 out of 10, 9 out of 10, 8 out of 10, etc.):
C(n,k) = n! / (k! (n-k)!)
Since the probability of each individual arrangement is p^k q^(n-k), where q = 1-p, we get the general formula:
P(k out of n) = C(n,k) p^k q^(n-k) = p^k q^(n-k) n! / (k! (n-k)!)
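This formula can be evaluated numerically as a check; a minimal sketch, using `math.comb` for C(n,k):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(k out of n) = C(n,k) * p^k * q^(n-k), with q = 1 - p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability that at least 7 of 10 patients recover when P(recovery) = .4
p_at_least_7 = sum(binom_pmf(k, 10, 0.4) for k in range(7, 11))
# ≈ .0548, just above the 5% significance threshold
```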
With this formula we get P(7r) = .0425, P(8r) = .0106, P(9r) = .0016 and P(10r) = .0001, so the probability that 7 or more patients recover is .0548. This is above the 5% threshold, so the result cannot be considered significant, although it is close enough to that threshold that it is worth doing a larger-scale test.
In the larger test, they take a sample of 1000 patients and observe that 430 of them recover within two months. Although this is only 43%, compared with the 40% without the plant, it is actually significant, as we will show later. However, using the above formula to make the calculation is impractical, even with a computer. We need a better tool - fortunately there is one!

Chapter 6 - Introduction to probability sampling distributions
Binomial probability: situation where an event can occur or not, e.g. heads when tossing coins, or whether a patient recovers. It is characterized by:
p - probability that the event will occur
q - probability that the event will not occur
N - number of instances in which the event has the opportunity to occur
Let's consider tossing a coin twice (p = q = .5, N = 2) and count the occurrences of Heads. There are four possible outcomes (H means heads, -- means not heads):
-- -- : probability = .5 x .5 = 0.25 (25%) 0 Heads
-- H  : probability = .5 x .5 = 0.25 \
H --  : probability = .5 x .5 = 0.25 / (50%) 1 Head
H H   : probability = .5 x .5 = 0.25 (25%) 2 Heads
With 2 patients and p = .4, q = .6 (R means recovery):
-- -- : probability = .6 x .6 = 0.36 (36%) 0 Recoveries
-- R  : probability = .6 x .4 = 0.24 \
R --  : probability = .4 x .6 = 0.24 / (48%) 1 Recovery
R R   : probability = .4 x .4 = 0.16 (16%) 2 Recoveries
Sampling distribution of these probabilities: since these are distributions, we can look at the measures of central tendency and variability.
For the binomial sampling probability distribution, we get:
Mean: µ = Np
Variance: σ² = Npq
Standard deviation: σ = sqrt(Npq)
(We use the greek letters µ and σ because we are referring to a population)
For our examples:
coins: µ = 1.0, σ² = 0.5, σ = ± 0.71
2 patients: µ = 0.8, σ² = 0.48, σ = ± 0.69
If we increase N, the distributions come closer and closer to a normal distribution. The graphs below show the histograms, computed with the formula from Chapter 5, and the normal distribution for N = 10 and N = 20.
In practice, we can use the approximation by the normal distribution when N, p and q are such that Np ≥ 5 and Nq ≥ 5.
Now let's look closer at the distribution for N=20, p=.4, q=.6: we compute µ = 8.0 and σ = ± 2.19.
The graph above superimposes the binomial distribution, in blue with horizontal axis k, and the normal distribution, in red with horizontal axis z. The formula to convert between k and z is as follows:
z = ((k - µ) ± .5) / σ
The component ± .5 is a correction for continuity, accounting for the fact that the normal distribution is continuous while the binomial distribution is stepwise. When k is smaller than µ, add a half unit (+ .5); when k is larger than µ, subtract half a unit (- .5).
Example:
For k = 5, we get z = ((5 - 8) + .5) / 2.19 = -1.14
For k = 11, we get z = ((11 - 8) - .5) / 2.19 = +1.14
From these values we can look up the table giving the area beyond ±z (see Chapter 2). For z = 1.14 we get .1271. This means that, out of 20 patients, there is a 12.71% chance that 5 or fewer recover, and a 12.71% chance that 11 or more recover. Note that these values are approximate: computing the exact binomial probabilities gives 12.56% for 5 or fewer and 12.75% for 11 or more. The differences from the above computation are only .0015 and .0004 respectively.
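These approximations can be checked directly; a sketch using the standard normal tail area via `math.erfc` (the helper names are ours), for both the N=20 example and the N=1000 case of chapter 5:

```python
from math import comb, erfc, sqrt

def binom_cdf(k, n, p):
    """Exact P(X <= k) for a binomial distribution."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_tail(z):
    """Area under the standard normal curve beyond z (upper tail)."""
    return 0.5 * erfc(z / sqrt(2))

# N=20, p=.4: compare the normal approximation with the exact binomial for k = 5
n, p = 20, 0.4
mu, sigma = n * p, sqrt(n * p * (1 - p))   # µ = 8.0, σ ≈ 2.19
z = (mu - (5 + 0.5)) / sigma               # continuity-corrected |z| for k = 5 (k < µ: add .5)
p_approx = normal_tail(z)                  # ≈ .127
p_exact = binom_cdf(5, n, p)               # ≈ .1256

# N=1000, p=.4, 430 observed recoveries: the larger medical test
sigma1000 = sqrt(1000 * 0.4 * 0.6)         # ≈ 15.49
z1000 = ((430 - 400) - 0.5) / sigma1000    # ≈ +1.90 (k > µ: subtract .5)
p1000 = normal_tail(z1000)                 # ≈ .029, below the 5% significance level
```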
If we were to make the same computation for N=1000 patients, there is no way we could do the exact calculation. With the normal distribution we get: µ = 400, σ = ±15.49.
Let's say 430 patients recover, i.e. 30 more than the expected mean. We compute the corresponding z value as z = ((430 - 400) - .5) / 15.49 = +1.90. Referring to the table of the normal distribution, this gives us .0287, i.e. a 2.87% chance that this many recoveries could be obtained just by chance. Therefore the effect is statistically significant, as announced (but not demonstrated) in chapter 5!

Chapter 7 - Tests of statistical significance: Three overarching concepts
Tests of statistical significance are the devices by which scientific researchers can rationally determine how confident they may be that their observed results reflect anything more than mere chance coincidence.
Mean chance expectation
All tests of statistical significance involve a comparison between:
(i) An observed value - e.g. an observed correlation coefficient; an observed number of heads in N tosses; an observed number of recoveries in N patients
(ii) The value that one would expect to find, on average, if nothing other than chance and random variability were operating in the situation - e.g. a correlation coefficient of r=0, assuming the correlation within the general population is ρ=0; or 5 heads in 10 tosses, assuming the probability of heads is .5; or 400 recoveries in 1000 patients, assuming the probability of recovery for any patient is .4
The second value is often called the mean chance expectation or MCE. For binomial probability situations, MCE is equal to the mean.
The null hypothesis and the research hypothesis
When performing a test of statistical significance, it is useful to distinguish between (i) the particular hypothesis that the researcher is seeking to examine, e.g.
that a medication has some degree of effectiveness, and (ii) the logical antithesis of the research hypothesis, e.g. that the medication has no effectiveness at all.
The second one is commonly called the null hypothesis, or H0. The first one is the research hypothesis (or experimental hypothesis, or alternative hypothesis), or H1. In general:
H0: observed value = MCE
H1: observed value != MCE
(here 'a = b' means a and b do not statistically differ, i.e. a equals b within the limits of statistical significance; 'a != b' means that they statistically differ, i.e. the difference is beyond what could be expected from random variability)
The research hypothesis may be directional, when the expected difference has a particular direction. For example, it is expected that the medication has the effect of increasing the number of patients that recover. In this case:
H1: observed value > MCE
In other cases we may have H1: observed value < MCE, e.g. the effect of sleep deprivation on the outcome of the disease.
('a > b' and 'a < b' here mean that the difference is statistically significant)
In fact, the so-called non-directional hypothesis (observed value != MCE) should be called bi-directional, as we are really testing (observed value < MCE) or (observed value > MCE).
Directional vs non-directional research hypotheses
One-way vs two-way tests of significance
Let us get back to our coin-tossing experiment with the claimant of paranormal powers aiming to produce "an impressive number of heads". We toss a coin 100 times and get 59 heads. Is that impressive?
N=100, p=.5, q=.5 gives a z-ratio of +1.7 and a probability of .0446
This passes the test of 5% significance (see figure below, left).
If instead the claim was: "I have a power but I cannot control its direction. All I can say is that the number of heads/tails will be significantly different from 50/50".
This is now a non-directional hypothesis, resulting in a probability of .0892, well above the 5% threshold and therefore not significant (see figure below, right). Because these tests often refer to the left and right "tails" of the distribution, they are often described as one-tail and two-tail tests. As a general rule, the probability measured with a two-way test is approximately twice that measured with a one-way test (with the binomial distribution, it is exactly twice).
The logic of the non-directional hypothesis is also the default position when there is no prior hypothesis at all. Let's say you toss a coin 16 times just to see what happens and get 12 heads. With a one-tail test, the probability is p=.028 and you may think that you have paranormal powers. But since you didn't know (or specify) what to expect, you need to use a two-tail test (p=.056) and accept that the result is indeed not significant at the 5% level.

Chapter 8 - Chi-square procedures for the analysis of categorical frequency data

Chapters 5/6 dealt with situations with two mutually exclusive categories, e.g. head or tail. Chi-square extends the logic of binomial procedures to situations with more than two categories (e.g., a patient's condition improves, stays the same or worsens), and to situations with more than one dimension of classification (e.g., students from college A or college B, and their political inclination as conservative or liberal).

Chi-square for one dimension of categorization

Suppose we have a river with 3 species of fish in equal proportions. We pick 300 fish and count 89 of species A, 120 of species B and 91 of species C. The question of statistical significance is whether the observed difference with the expected counts (100 of each) is due to chance or to some ecological disorder in the river.
We can look at the amount by which each species deviates from the expected count:
(observed frequency - expected frequency) / expected frequency = (O - E) / E
species A: (89 - 100) / 100 = -11%
species B: (120 - 100) / 100 = +20%
species C: (91 - 100) / 100 = -9%
Note that the sum of these relative differences is 0. The chi-square (χ2) simply consists of taking the square of the differences and adding them all up:
χ2 = ∑ (O - E)2 / E
species A: (89 - 100)2 / 100 = 1.21
species B: (120 - 100)2 / 100 = 4.0
species C: (91 - 100)2 / 100 = .81
total: χ2 = 6.02
We now need to know the properties of the sampling distribution to which this value belongs. We could take a large number of samples (say 10000) of 300 fish in a river with equal proportions of each species, compute their χ2 and see where the above value (6.02) lies. The result is depicted below (left): it shows that there is a 4.99% chance of picking a sample with a chi-square of 6.02 or more.
The theoretical distribution (above, right) shows the various values of χ2 for different p-values. Unlike the one on the left, this distribution is general: it applies to any situation with 3 categories, any sample size, and any expected values, e.g. a sample size of 100 with expected values of 25/50/25 or a sample size of 62 with expected values 13/22/27. Our value of 6.02 is slightly beyond the 5% significance level.
If we had more categories, the distribution would be different. It depends on the number of degrees of freedom (df), which, for one dimension of categorization, is the number of categories minus 1. This is because, given a sample size and n categories, you can pick any value for n-1 categories, but the last one is forced. The graph and the table below show the distributions and levels of significance for various p-values. Note that the χ2 is by nature non-directional since it is based on the squared differences.
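The computation above can be sketched in a few lines of Python (the function name is mine, not from the course):

```python
def chi_square(observed, expected):
    """One-dimensional chi-square: chi2 = sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# the fish example: 300 fish, 3 species expected in equal proportions
x2 = chi_square([89, 120, 91], [100, 100, 100])
# with df = 2, the .05 critical value is 5.99, so 6.02 is (barely) significant
```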
     Level of significance (non-directional test)
df   .05     .025    .01     .005    .001
1    3.84    5.02    6.63    7.88    10.83
2    5.99    7.38    9.21    10.60   13.82
3    7.81    9.35    11.34   12.84   16.27
4    9.49    11.14   13.28   14.86   18.47
5    11.07   12.83   15.09   16.75   20.52

Chi-square with two dimensions of categorization

The best way to represent two-dimensional data is with a contingency table. Suppose we have two groups of female patients, one of which received estrogen and the other did not, and we measure the number of those which developed Alzheimer's disease within 5 years. We get this:

                        Alzheimer's onset
                        No          Yes
Received estrogen  Yes  O = 147     O = 9       156
                   No   O = 810     O = 158     968
                        957         167         1124

First we need to figure out what the expected values would be. With a single dimension, these values are easy to guess. Here they are computed as E = R*C / N, where R is the sum for the row of the cell being considered, C is the sum for the column and N the total size of the sample:

                        Alzheimer's onset
                        No           Yes
Received estrogen  Yes  E = 132.82   E = 23.18    156
                   No   E = 824.18   E = 143.82   968
                        957          167          1124

Now that we know the expected values, we can compute the value (O-E)2/E in each cell, and sum them to get the χ2. In the special case where there are exactly two rows and two columns, we need a correction for continuity (as in Chapter 6):
χ2 = ∑(|O-E| - 0.5)2 / E = 11.01 in this example.
Next we need to compute the degrees of freedom to select the proper distribution. Following the earlier explanation, we note that here, once we know the totals per row and per column, setting the value of one cell determines all three others unequivocally. Hence df = 1 in this case. In the general case of r rows and c columns, a similar reasoning would show that df = (r - 1)(c - 1). Going back to the table at the top of this page, we can see that our χ2 exceeds the value for significance at the .1% level (10.83).
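The two-dimensional procedure, including the computation of expected values from the row and column totals and the correction for continuity, can be sketched as follows (a hypothetical helper, not code from the course):

```python
def chi_square_2x2(table):
    """Chi-square for a 2x2 contingency table, with the correction for
    continuity: chi2 = sum of (|O - E| - 0.5)^2 / E, where E = R*C / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            e = row_totals[i] * col_totals[j] / n   # expected value E = R*C / N
            x2 += (abs(table[i][j] - e) - 0.5) ** 2 / e
    return x2

# the estrogen / Alzheimer's onset example
x2 = chi_square_2x2([[147, 9], [810, 158]])
```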
Note about non-significant chi-square values: a non-significant χ2 should not be interpreted as a "goodness-of-fit": all it tells you is that you cannot reject the null hypothesis. It does not tell you that you can accept the null hypothesis, since many other null hypotheses could have ended up with a non-significant χ2.

Two limitations of Chi-square

1- Chi-square can only be applied if the categories are independent of each other, i.e. the categories are both exhaustive and mutually exclusive.
2- The logical validity of chi-square decreases when the values of E become smaller. It should not be used if any E is 5 or below. In this case, you can use the Fisher Exact Probability Test (which also has the advantage of being directional) [see chapter 8a].

Chapter 9 - Introduction to procedures involving sample means

We now turn to the analysis of situations that involve continuous variables. Rather than looking at whether a patient has recovered or not, we measure how much he or she has recovered, for example by measuring the concentration of a certain component in the patient's blood that is known to decrease in proportion to the degree of the disease. We want to be able to say, e.g., that on average, patients who received the experimental treatment showed greater improvement than those who did not. The key word here is "on average", which means that we are focusing on the mean of a sample or the means of two or more samples. The question then is to assess the significance of the results drawn from a sample with respect to the overall population.
Let's start with a normal distribution, whose mean and standard deviation are known to be µ = 18 and σ = ±3. This distribution allows us to answer questions such as: what is the probability that a random Xi is between 18 and 21, i.e. within 1 standard deviation above the mean (answer: 34.13%).
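The probability just quoted can be checked numerically with the error function; a minimal sketch (the function `normal_cdf` is my own naming):

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# probability that a random Xi falls between 18 and 21, i.e. within one
# standard deviation above the mean (mu = 18, sigma = 3)
p = normal_cdf(21, 18, 3) - normal_cdf(18, 18, 3)   # about .3413
```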
More generally, we can compute the z-ratio of any value as z = (Xi - µ) / σ. (Note that we don't need the correction for continuity introduced in chapter 6, for the values are now continuous.) For example, Xi = 22 gives z = +1.33 and Xi = 12 gives z = -2.0, which translate respectively into probabilities of P = .0918 and P = .0228.

The sampling distribution of sample means

Let us now draw samples of size N=10 in this population and calculate the mean of each such sample. It turns out that the distribution M of such means (see right) is itself a normal distribution, whose mean µM and standard deviation σM are closely related to those of the source population:
µM = µsource , σ2M = σ2source / N and therefore σM = σsource / sqrt(N)
In other words, as the sample size gets larger, the measured means cluster more tightly around the actual mean of the population. So if we know that a population is normally distributed and if we know its mean and standard deviation, we can answer questions such as: what is the probability that a sample has a mean MX greater than 20? We compute the z-ratio for 20 in the M distribution: z = (MX - µM) / σM = +2.11, corresponding to P = 1.74%.

The sampling distribution of sample-mean differences

Now suppose that we draw a large number of pairs of samples: in each pair, we have sample A of size Na and sample B of size Nb, and we consider the difference between the means of the two samples in each pair: d = MXa - MXb. Here again, it turns out that the distribution M-M of these sample-mean differences is normal and closely related to the source population:
µM-M = 0 , σ2M-M = σ2source / Na + σ2source / Nb and therefore σM-M = sqrt[σ2source / Na + σ2source / Nb]
This allows us to answer questions such as: what is the probability that the mean of sample A is greater than the mean of sample B by 2 or more points?
Again, we compute the z-ratio for 2 in the M-M distribution: z = ((MXa - MXb) - µM-M) / σM-M = 1.49, corresponding to a probability of 6.81%.

Estimating the mean and variability of a normally distributed source population

The problem with the above computations is that they assume that we already know the mean and standard deviation of the source population. In many cases we don't. But since we know (from the above) that a random sample will tend to reflect the properties of the overall population, we can say that the sample mean is "in the vicinity" of the mean of the source population:
estimated µsource = MX ± [definition of "vicinity"]
Intuitively, the larger the sample, the closer the approximation of the real value. But how is this vicinity computed exactly? This will be the subject of chapter 10.
The next question is to estimate the variability of the source population given that of a given sample. In general, the variability that appears in a sample will tend to be smaller than the variability that exists in the population. We recall from Chapter 2 that the variance of a population is the average of the square deviates: σ2 = SS / N. The relationship between the sample variance and the population variance is:
mean sample variance = s2 = σ2source (N-1) / N
From this we can compute the estimated source variance from the sample variance:
estimated σ2source = {s2} = s2 N / (N-1) = SS / (N-1)
est.σsource = {s} = sqrt[SS / (N-1)]

Estimating the standard deviation of the sampling distribution of sample means

We now turn to the sampling distribution of sample means. We recall from part 1 of this chapter that σ2M = σ2source / N. We can estimate σM by substituting {s2} for σ2source. This gives est.σM = sqrt[{s2} / N]. We know that this estimated variance is "somewhere in the vicinity" of the real variance.
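When the source parameters are known, the two z-ratios of this chapter can be sketched as follows (the function names are mine):

```python
import math

def z_sample_mean(mx, mu, sigma, n):
    """z-ratio of a sample mean: sigma_M = sigma / sqrt(N)."""
    return (mx - mu) / (sigma / math.sqrt(n))

def z_mean_difference(diff, sigma, na, nb):
    """z-ratio of a difference between two sample means:
    sigma_M-M = sqrt(sigma^2/Na + sigma^2/Nb)."""
    return diff / math.sqrt(sigma ** 2 / na + sigma ** 2 / nb)

z1 = z_sample_mean(20, 18, 3, 10)      # the "mean greater than 20" example
z2 = z_mean_difference(2, 3, 10, 10)   # the "2 or more points" example
```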
Since we don't know the real variance of the population, we cannot calculate the z-ratio for a given sample mean, but we can make an estimate of the z-ratio, called t, using the estimated variance:
t = (MX - µM) / est.σM
For example, suppose we have a sample of size N=25 with mean MX = 53 and SS = 625. If we believe the source mean is µM = 50, we can compute t = +2.94. We will see a bit later how this translates into a probability with the so-called t-distribution.

Estimating the std deviation of the sampling distribution of sample-mean differences

Here again, we cannot compute the z-ratio for a given pair of samples since we don't know the variability of the source population, but we can use estimates:
estimated variance of sample A = SSa / (Na - 1)
estimated variance of sample B = SSb / (Nb - 1)
The so-called pooled variance {s2p} is defined as (SSa + SSb) / ((Na-1)+(Nb-1)) and the estimated standard deviation of the sample-mean differences is
est.σM-M = sqrt[{s2p} / Na + {s2p} / Nb]
As before, we can compute t, an estimate of the z-ratio, using the estimated variance:
t = (MXa - MXb) / est.σM-M
For example, if we have sample A of size Na = 20 with Ma = 105.6 and SSa = 4321, and sample B of size Nb = 20 with Mb = 101.3 and SSb = 4563, we can compute the pooled variance {s2p} = (4321+4563) / (19+19) = 233.79 and est.σM-M = ±4.84. From this we compute the t-ratio t = +0.89. Again, we need to study the t-distribution to translate this into a probability.

The t distribution

Suppose we select a large number of pairs of samples of size 10 and for each pair we compute its t-ratio as above. The distribution of these t-values would follow the red curve to the right. This is not exactly a normal distribution, but very similar. Unlike a normal distribution, the t-distribution depends on the degrees of freedom, a concept already introduced in Chapter 8.
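The single-sample t estimate described above can be sketched as follows, using the N=25 example (the function name is mine):

```python
import math

def estimated_t(mx, mu, ss, n):
    """Estimate of the z-ratio (the t-ratio) for a single sample,
    using the estimated source variance {s2} = SS / (N-1)."""
    s2 = ss / (n - 1)                  # estimated source variance
    est_sigma_m = math.sqrt(s2 / n)    # estimated standard error of the mean
    return (mx - mu) / est_sigma_m

t = estimated_t(53, 50, 625, 25)       # the example above
```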
Here, for a single sample of size N, df = N-1, while for two samples of sizes Na and Nb, df = (Na-1)+(Nb-1). The figure to the right shows t-distributions for df=5 and df=40. The latter is almost indistinguishable from the normal distribution. The use of the t-distribution is exactly the same as for the normal distribution: given a t-ratio, one can look up the corresponding probability in a table. The table below gives some examples:

                       Level of significance
directional test:      .05     .025    .01     .005    .0005
non-directional test:  --      .05     .02     .01     .001
df = 5                 2.02    2.57    3.36    4.03    6.87
df = 10                1.81    2.23    2.76    3.17    4.59
df = 18                1.73    2.10    2.55    2.88    3.92
df = 20                1.72    2.09    2.53    2.85    3.85

To the right is the t-distribution for df=18 (typically Na = Nb = 10), with the .05 and .025 one-tailed levels of significance marked. For a non-directional ("two-tailed") test to be significant at the .05 level, t would have to be beyond ±2.10. For a directional ("one-tailed") test to be significant at the same .05 level, t would have to be beyond ±1.73.

Chapter 10 - T-procedures for estimating the mean of a population

Assumptions of the t-test for one sample:
- The scale of measurement has the properties of an equal interval scale.
- The sample is randomly drawn from the source population
- The source population can be reasonably supposed to have a normal distribution
Step 1 - For the sample of N values Xi, calculate
MX - the mean of the sample - MX = (∑ Xi) / N
SS - the sum of square deviates - SS = ∑ (Xi2) - (∑ Xi)2 / N
Step 2 - Estimate the variance of the source population as {s2} = SS / (N-1)
Step 3 - Estimate the standard deviation of the sampling distribution of sample means (also called the "standard error of the mean") as est.σM = sqrt[{s2} / N]
Note that you can merge steps 2 and 3 into one step: est.σM = sqrt[SS / (N (N-1))]
Step 4 - Perform the point and interval estimate as est.µsource = MX ± (tcritical * est.σM) with df = N - 1
tcritical is taken from a table such as the one at the end of Chapter 9 and depends on df, the level of significance, and whether the test is directional or non-directional. ± (tcritical * est.σM) is called the confidence interval.
Example: an archeologist finds 19 intact specimens of a certain prehistoric artifact, and measures their lengths:
17.3 18.9 17.7 23.8 16.0 22.1 18.4 18.2 13.3 26.8 18.6 24.5 22.8 13.4 18.1 14.8 20.6 17.4 16.1
She assumes the population to have a normal distribution and has no reason to believe that her sample is not random. She estimates the mean length of the population:
Step 1: MX = 18.9, SS = 248.5, df = 18
Step 2: {s2} = 13.81
Step 3: est.σM = ±0.85
Step 4: est.µsource = 18.9 ± 1.79 with a 95% confidence level
        est.µsource = 18.9 ± 2.45 with a 99% confidence level
The figure to the right shows the critical t-values for the calculated standard error of the mean.

Chapter 11 - T-test for the significance of the difference between the means of two independent samples

This is probably the most widely used statistical test of all time. It is simple, easy to use, straightforward, and adaptable to a broad range of situations.
This is because scientific research often examines a phenomenon two variables at a time, trying to answer the question: are these two variables related? If we alter the level of one (the independent variable), will we thereby alter the level of the other (the dependent variable)? Or: if we examine two levels of a variable, will we find them to be associated with different levels of the other?
Example 1: does the presence of a certain fungus enhance the growth of a plant?
Procedure: take a set of seeds and randomly sort them into groups A and B.
Group A is the experimental group: grow them in a soil with the fungus.
Group B is the control group: grow them in a soil without the fungus.
Harvest the two groups and measure their growth. If the fungus affects growth, the mean of group A should be significantly larger than the mean of group B.
Example 2: do two types of music have different effects on the performance of mental tasks?
Procedure: take a pool of subjects, assign them randomly to groups A and B.
Group A will have music type I playing in the background.
Group B will have music type II playing in the background.
Measure the performance of each subject for a set of mental tasks. Any difference between the effects of the two types of music should show up as a difference between the mean levels of performance of the two groups.
Example 3: do two strains of mice differ with respect to their ability to learn a particular behavior?
Procedure: here the two subject pools are already given: strain A and strain B.
Draw a random sample of size Na from pool A and a sample of size Nb from B.
Run the members of each group through the experimental protocol and measure how well and how quickly they learn the behavior.
Any difference between their abilities to learn this behavior should show up as a difference between the group means.
Recall from Chapter 7 that whenever you perform a statistical test, what you are testing is the null hypothesis.
In general, the null hypothesis is that there is no effect (of the fungus, of the type of music, of the strain of mice). The research hypothesis may be directional (Example 1) or non-directional (Examples 2 and 3), which defines the type of test of statistical significance.
Assumptions of the t-test for two independent samples:
- Both samples are independently and randomly drawn from the source population(s)
- The source population(s) can be reasonably supposed to have a normal distribution
- The scale of measurement for both samples has the properties of an equal interval scale
Step 1 - For the two samples A and B of sizes Na and Nb, calculate MXa and SSa, MXb and SSb
Step 2 - Estimate the variance of the source population as
df = (Na - 1) + (Nb - 1)
{s2p} = (SSa + SSb) / df
Step 3 - Estimate the standard deviation of the sampling distribution of sample-mean differences (the "standard error of MXa - MXb") as est.σM-M = sqrt[{s2p}/Na + {s2p}/Nb]
Step 4 - Calculate t as t = (MXa - MXb) / est.σM-M
Step 5 - Refer to a table of the t-distribution for the critical values of t for the given df and the given level of significance, taking into account the directionality of the test.
Example: let us use example 2 and assume the following values:
Group A (music type I): Na = 15, Ma = 23.13, SSa = 119.73
Group B (music type II): Nb = 15, Mb = 20.86, SSb = 175.67
Ma - Mb = 2.26
Step 2: {s2p} = 10.55
Step 3: est.σM-M = ±1.19
Step 4: t = +1.9 with df = 28
Step 5: looking at the figure, we can see that with a non-directional test at the 5% level, we fall short of significance: the critical t-value for .025 significance is ±2.05. However, if we had made a directional test, it would have reached significance at the 5% level, since the critical t in this case is 1.70.
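Steps 1-4 can be sketched as follows (a hypothetical helper; the numbers are the pair of samples used at the end of chapter 9, which are self-consistent):

```python
import math

def t_independent(ma, ssa, na, mb, ssb, nb):
    """t-ratio for the difference between the means of two independent
    samples, using the pooled variance {s2p} = (SSa + SSb) / df."""
    df = (na - 1) + (nb - 1)
    s2p = (ssa + ssb) / df
    est_sigma_mm = math.sqrt(s2p / na + s2p / nb)   # standard error of Ma - Mb
    return (ma - mb) / est_sigma_mm, df

t, df = t_independent(105.6, 4321, 20, 101.3, 4563, 20)
```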
Mann-Whitney test

The t-test described above is called a parametric test because of the assumptions it makes about the samples and the population. The Mann-Whitney test is a nonparametric variant that makes the following assumptions:
- The two samples are randomly and independently drawn
- The dependent variable is intrinsically continuous, capable in principle, if not in practice, of producing measures carried out to the nth decimal place
- The measures within the two samples have the properties of at least an ordinal scale of measurement, so that it is meaningful to speak of "greater than", "less than" and "equal to"
For the details of the procedure, see Chapter 11a on-line. The logic of the procedure is to sort the measures and convert them into a rank order (the lowest is rank 1, the next higher one is rank 2, etc.), and then to work with the ranks instead of the measures, taking advantage of the properties of rank ordering.

Chapter 12 - T-test for the significance of the difference between the means of two correlated samples

We start with a rather strange hypothesis: that people are taller with their shoes on than without. A researcher tests this hypothesis and measures (in inches) a set of 15 adults, first with their shoes (sample A) and then without (sample B). He observes that they are all taller with their shoes, and computes
Ma = 65.8, Mb = 64.2, Ma - Mb = 1.6
SSa = 378.4, SSb = 384.1 (σa = ±5.0, σb = ±5.1)
If you compute the t-value as we've done in the previous chapter, you get t = +0.84, which is not significant at the 5% level for a directional test! Why is that? Looking at the distributions of samples A and B (right), one can see that they each have a large internal variability, and the computation of t depends ultimately on SSa and SSb, the raw measure of these internal variabilities.
In our example, the individual differences with respect to height are entirely extraneous to the question of whether people on average are taller with shoes. The t-test for independent samples treats this extraneous variability as though it were not extraneous, and in consequence overestimates the standard deviation in the relevant sampling distribution. This in turn results in an underestimate of the significance of the observed mean difference. The t-test for correlated samples avoids this pitfall by disregarding the extraneous variability and looking only at what is relevant for the question at hand.
The name comes from the fact that the two sets of measures are arranged in pairs and are thus potentially correlated. This procedure is also spoken of as the repeated-measures t-test or the within-subjects t-test, because it typically involves situations in which each subject is measured twice, once in condition A, and then again in condition B. However it is not essential that the measures in conditions A and B come from the same subjects, only that each individual item in sample A is intrinsically linked with a corresponding item in sample B. Here, the height of subject 1 with shoes is linked to the height of subject 1 without shoes, and so on for each subject.
In fact, since we are only concerned with the difference between the shoes-on and shoes-off conditions, there is only one sample here, whose variable D is defined for each linked pair as
Di = XAi - XBi
If we now compute the mean of this new variable, we still get MD = 1.6, but the variability is now a lot smaller: SSD = 2.59 (s = 0.42). From there, the rest of the procedure will look familiar: rather than comparing the means of two samples, we look at a single sample made of the differences between the measures, and test whether the observed mean of the differences is significantly different from 0.
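The difference-based procedure can be sketched as follows; the data here is illustrative, not from the course:

```python
import math

def correlated_t(pairs):
    """t-ratio for correlated samples: form the differences Di = XAi - XBi
    and test whether their mean differs significantly from 0."""
    diffs = [a - b for a, b in pairs]
    n = len(diffs)
    md = sum(diffs) / n                                    # mean of the differences
    ssd = sum(d * d for d in diffs) - sum(diffs) ** 2 / n  # sum of square deviates
    s2 = ssd / (n - 1)                                     # estimated source variance
    return md / math.sqrt(s2 / n)

# six hypothetical before/after measurements of the same subjects
t = correlated_t([(5, 4), (7, 5), (6, 5), (8, 6), (5, 4), (9, 7)])
```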
The t-test for correlated samples is particularly useful in research involving human and animal subjects precisely because it is so very effective in removing the extraneous effects of pre-existing individual differences. Note that in some contexts these individual differences are not extraneous; in some cases they might be the very essence of the phenomenon of interest. But there are many situations where the facts of interest are merely obscured by the variability of individual differences.

Assumptions of the t-test for two correlated samples:
- The scale of measurement of Xa and Xb has the properties of an equal-interval scale
- The values of Di have been randomly drawn from the source population
- The source population from which the values Di have been drawn can be reasonably supposed to have a normal distribution
Step 1 - For the sample of N values of Di, where Di = XAi - XBi, calculate the mean and the sum of square deviates
MD = (∑ Di) / N
SSD = ∑ (Di2) - (∑ Di)2 / N
Step 2 - Estimate the variance and standard deviation of the source population
{s2} = SSD / (N-1)
est.σMD = sqrt[{s2} / N]
Step 3 - Calculate t as t = MD / est.σMD
Step 4 - Refer the calculated value of t to the table of critical values for df = N-1 and the given level of significance, taking into account the directionality of the test.
Example: in the shoe example, we have t = +14.17, df = 14. The one-tailed probability of getting this result by mere chance coincidence is 0.0000000005 ...
Example 2: suppose we want to determine whether two types of music, A and B, differ with respect to their effects on sensory-motor coordination.
If we draw two independent samples of human subjects and test the first one with type-A music and the other with type-B music, it is very unlikely that we will get a significant result, for the individual differences in sensory-motor coordination are likely to obscure any effect of the type of music. In a design with correlated samples, we test all subjects with both types of music, arranging for half of the subjects to be tested with type A then type B and the other half with type B then type A, in order to obviate the potential effects of practice and test sequence.
Assume we get a sample of differences with a mean MD = -1.53 and SSD = 55.45 (s = ±1.92). We compute {s2} = 3.96, est.σMD = ±0.51 and t = -3.0 with df = 14. Looking up the corresponding t table, we find that the critical t value at the 1% level of significance for the non-directional test is 2.98. We therefore meet and slightly exceed this critical value.

Wilcoxon signed-rank test

Just as there is a non-parametric version of the t-test for uncorrelated samples (outlined in the previous chapter), there is a non-parametric version of the t-test for correlated samples, called the Wilcoxon signed-rank test. This test assumes the following:
- The measures of XA and XB have at least ordinal properties, so that the differences between the paired values can be meaningfully rank-ordered
- The differences between the paired values of XA and XB have been randomly drawn from the source population
- The distribution of these differences in the source population can be reasonably supposed to be symmetric
See the on-line Chapter 12b for the details of the procedure. It is similar to the Mann-Whitney test, but takes into account the signs of the ranked differences (hence its name).

Chapter 13 - Conceptual introduction to the analysis of variance

The t-test (chapter 11) compares the means of two samples, but cannot be used to compare more than two samples.
For example, if we have three samples A, B, C, we cannot use 3 t-tests to compare A & B, B & C, A & C: even if each individual t-test uses the .05 significance level, the probability that at least one of the three comes out significant by mere chance is about .15, not .05! The analysis of variance or ANOVA solves this problem. Here is how it works with k=3 samples A, B, C:
- let Na, Nb and Nc be the size of each sample and NT = Na + Nb + Nc
- compute Ma, Mb and Mc, the mean of each sample, and MT, the total mean
- compute SSa, SSb and SSc, the sum of square deviates of each sample, and SST, the sum of square deviates of the complete set of measures
- compute the measure of aggregate differences among samples: for each group g = a, b, c, this is the value Ng (Mg - MT)2. This is similar to the sum of square deviates, but for each sample as a whole
- the sum of these three values is called SSbg, standing for the sum of square deviates between groups
Now we want to figure out if this measure differs significantly from the zero that would be specified by the null hypothesis. Remember the t value from chapter 11:
t = (MXa - MXb) / est.σM-M
- The numerator is the difference between the two means. It corresponds to our value SSbg, which measures the difference among any number of samples.
- The denominator can be interpreted as the variability that can be expected from a random sample. In the t-test, it is computed from SSa and SSb. Here, we introduce the sum of square deviates within groups as SSwg = SSa + SSb + SSc.
- Note that SSbg + SSwg = SST, i.e. we have separated the variability of the three samples into two parts: variability between the samples vs. within the samples.
In chapters 9 through 12, we encountered several versions of the basic concept that the variance of the source population can be estimated as
est.σ2source = (sum of square deviates) / (degrees of freedom)
We now proceed to estimate the mean of square deviates (MS), which is the term used in the context of ANOVA for an estimate of the source population variance:
- let MSbg = SSbg / dfbg, where dfbg is the degrees of freedom between groups, i.e. the number of groups k minus 1 (here: dfbg = k - 1 = 3 - 1 = 2)
- let MSwg = SSwg / dfwg, where dfwg is the degrees of freedom within groups, which is the sum of the degrees of freedom of each sample, i.e. dfwg = (Na - 1) + (Nb - 1) + (Nc - 1) = NT - k
What do MSbg and MSwg estimate? In short, the same thing. If the null hypothesis is true, both values are estimates of the variance of the source population. If the null hypothesis is not true, MSbg will tend to be greater than MSwg.

The F-Ratio

The relationship between two values of MS is conventionally described by
F = MSeffect / MSerror
where MSeffect is a variance estimate pertaining to the fact whose significance we wish to assess (here the difference between the means of independent samples) and MSerror is a variance estimate of the random variability present in the situation.
When the null hypothesis is true, the F-ratio tends to be equal to or less than 1.0.
When the null hypothesis is false, the F-ratio tends to be greater than 1.0.
In our case we define F as F = MSbg / MSwg. The next question is to assess F against the corresponding sampling distribution. F can be written as (SSbg / dfbg) / (SSwg / dfwg). Similar to t and chi-square, which have a different distribution for different values of df, here we have a different distribution for each value of the pair (dfbg, dfwg). For example, with k=3 and Na=Nb=Nc=5, we have dfbg = k-1 = 2 and dfwg = NT - k = 12.
The corresponding distribution is noted F2,12. For large values of df, the peak of the distribution is at F=1.0; for smaller values, it lies to the left of F=1.0. Here is a table of critical values of F for some values of dfbg and dfwg, at the .05 (top) and .01 (bottom) significance levels:

df                df numerator
denominator       1           2           3           4
10                4.96        4.10        3.71        3.48
                  10.04       7.56        6.55        5.99
11                4.84        3.98        3.59        3.36
                  9.65        7.21        6.22        5.67
12                4.75        3.89        3.49        3.26
                  9.33        6.93        5.95        5.41
13                4.67        3.81        3.41        3.18
                  9.07        6.70        5.74        5.20

Example

We consider k=3 samples: A = (16, 15, 17, 15, 20), B = (20, 19, 21, 16, 18), C = (18, 19, 18, 23, 18) of sizes Na=Nb=Nc=5, and therefore NT = 15.
We compute Ma = 16.6, Mb = 18.8, Mc = 19.2, MT = 18.2. Then we get SSa = 17.2, SSb = 14.8, SSc = 18.8, SST = 70.4.
From there we get SSbg = 12.8+1.8+5.0 = 19.6 and SSwg = 17.2+14.8+18.8 = 50.8.
We can verify that SSbg + SSwg = SST.
Finally we get MSbg = 19.6/2 = 9.8 and MSwg = 50.8/12 = 4.23, and therefore F = 2.32, which is not significant at the .05 significance level (the critical value is 3.89).

The relationship between F and t

The ANOVA can be used for situations with two samples, in which case the results are similar to those obtained with a t-test: the F-ratio is equal to the square of the t-ratio. The only difference is that a t-test can be either directional or non-directional, while the ANOVA (like chi-square) is intrinsically non-directional.

Applications of F

The general form F = MSeffect / MSerror applies to many situations described in the next chapters: one-way analysis of variance with independent (chap 14) and correlated (chap 15) samples, and two-way analysis of variance (chap 16), which analyses the effects of two independent variables concurrently, including their interaction effects.
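The whole chapter can be condensed into a short sketch (the function is mine; the data is the k=3 example, with the last value of C taken as 18, which is what yields Mc = 19.2 and SSc = 18.8 as computed above):

```python
def one_way_anova(groups):
    """One-way ANOVA for independent samples: returns (SSbg, SSwg, F)."""
    k = len(groups)
    nt = sum(len(g) for g in groups)
    grand_mean = sum(x for g in groups for x in g) / nt

    def ss(xs):                      # sum of square deviates of one sample
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs)

    ss_wg = sum(ss(g) for g in groups)                       # within groups
    ss_bg = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                for g in groups)                             # between groups
    f = (ss_bg / (k - 1)) / (ss_wg / (nt - k))               # MSbg / MSwg
    return ss_bg, ss_wg, f

a = [16, 15, 17, 15, 20]
b = [20, 19, 21, 16, 18]
c = [18, 19, 18, 23, 18]
ss_bg, ss_wg, f = one_way_anova([a, b, c])
```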
Chapter 14 - One-way analysis of variance for independent samples
Assumptions of the one-way ANOVA for independent samples:
- the scale on which the dependent variable is measured is an equal-interval scale;
- the k samples are independently and randomly drawn from the source population(s);
- the source population(s) can be reasonably supposed to have a normal distribution;
- the k samples have approximately equal variance, i.e. the estimated variances of the source populations SSg/(Ng - 1) are within a ratio of 1.5.
The ANOVA is robust to assumptions 1, 3 & 4 if all samples have the same size.
Step 1 - Combine all k groups and calculate SST = ∑(Xi2) - (∑Xi)2 / NT
Step 2 - For each group g calculate SSg = ∑(Xgi2) - (∑Xgi)2 / Ng
Step 3 - Calculate SSwg = ∑SSg
Step 4 - Calculate SSbg = SST - SSwg
Step 5 - Calculate dfT = NT - 1, dfbg = k - 1, dfwg = NT - k
Step 6 - Calculate MSbg = SSbg / dfbg and MSwg = SSwg / dfwg
Step 7 - Calculate F = MSbg / MSwg
Step 8 - Refer the calculated F-ratio to the table of critical values of Fdfbg, dfwg
It is good practice to present a summary table of the results as follows:

Source                      SS      df   MS     F     P
between groups ("effect")   140.10  3    46.70  6.42  <.01
within groups ("error")     116.32  16   7.27
TOTAL                       256.42  19

Post-ANOVA comparisons: Tukey HSD test
A significant F-ratio tells you that the aggregate difference among the means of the several samples is significantly greater than zero, but it does not tell you whether any particular sample mean significantly differs from any particular other one. Again, the t-test cannot be used, because we need to make several comparisons at once, e.g. A-B, A-C, A-D, B-C, B-D, C-D.
The Tukey HSD (Honestly Significant Difference) test is based on the Studentized range statistic, Q. Let us call ML and MS the largest and smallest means of all k groups.
Then Q = (ML - MS) / sqrt[MSwg / Np/s], where Np/s is the number of values per sample. If the k samples are of different sizes, Np/s is the harmonic mean of the sample sizes:
Np/s = k / ∑(1 / Ng)
Q belongs to a distribution defined by two parameters: the number of groups (k) and the degrees of freedom of the denominator of the F-ratio (dfwg). Let us call Qc the critical value of Q for a given k and dfwg at confidence level c. Rearranging the formula for Q, we get:
ML - MS = Q . sqrt[MSwg / Np/s]
Replacing Q by Qc gives us the minimal difference between the means of any two groups for these two groups to be significantly different at the c level:
HSDc = Qc . sqrt[MSwg / Np/s]
In our example, we get HSD.05 = 4.05 . sqrt[7.27 / 5] = 4.88 and HSD.01 = 5.20 . sqrt[7.27 / 5] = 6.27.
As can be seen in the table below, only two pairs are significantly different, at both levels: A·C and A·D.

Pair   Means                   Difference
A·B    Ma=28.86  Mb=25.04      3.82
A·C    Ma=28.86  Mc=22.50      6.36
A·D    Ma=28.86  Md=22.30      6.56
B·C    Mb=25.04  Mc=22.50      2.54
B·D    Mb=25.04  Md=22.30      2.74
C·D    Mc=22.50  Md=22.30      0.20
(HSD.05 = 4.88, HSD.01 = 6.27)

Note that the fact that the difference between Ma and Mb is not significant does not mean that there is no effect. It just means that the Tukey HSD test was not able to detect it.
One-way ANOVA and correlation
In the above example, a plot of the data makes it pretty clear that there is a correlation between the dosage and the pull. It is also pretty obvious that this is not a linear relationship, but rather a curvilinear one.
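The Tukey HSD computation for the four-group example above can be sketched in Python. The group means, MSwg and the critical Q values (for k = 4 and dfwg = 16) are taken from the text; the code itself is an illustration, not Lowry's own.

```python
import math
from itertools import combinations

# Values taken from the drug example in the text.
MSwg = 7.27
means = {"A": 28.86, "B": 25.04, "C": 22.50, "D": 22.30}
Ns = [5, 5, 5, 5]                      # sample sizes

# Harmonic mean of the sample sizes (all equal here, so simply 5).
Nps = len(Ns) / sum(1 / n for n in Ns)

Q05, Q01 = 4.05, 5.20                  # critical Q for k=4, df=16 (from the text)
HSD05 = Q05 * math.sqrt(MSwg / Nps)    # minimal significant difference at .05
HSD01 = Q01 * math.sqrt(MSwg / Nps)    # minimal significant difference at .01

for (ga, ma), (gb, mb) in combinations(means.items(), 2):
    diff = abs(ma - mb)
    verdict = ".01" if diff > HSD01 else ".05" if diff > HSD05 else "ns"
    print(f"{ga}-{gb}: {diff:.2f} {verdict}")
# Only A-C (6.36) and A-D (6.56) exceed both critical values.
```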
Within the context of analysis of variance, a useful measure of the strength of a curvilinear relationship between the independent and dependent variables is given by eta-square:
eta2 = SSbg / SST
In our example, we have eta2 = 0.55, which is interpreted as: of all the variability that exists within the dependent variable, 55% is associated with the variability in the independent variable "dosage". This is similar to the r2 from Chapter 3. In fact, when the relationship is linear, the values of eta2 and r2 are the same. If the relationship is curvilinear, then r2 is smaller than eta2.
Note that this comparison with r2 is only valid when the independent variable is measured on an equal-interval scale. If it is ordinal or categorical, eta2 only carries the more general meaning stated above.
Kruskal-Wallis test for 3 or more independent samples
This is a non-parametric version of the ANOVA for independent samples. Since the ANOVA is quite robust when the sample sizes are the same, the Kruskal-Wallis test is particularly useful when the dependent variable is ordinal and the sample sizes are different.
The mechanics of the Kruskal-Wallis test begin in a way very similar to the Mann-Whitney test from subchapter 11a: the measures are replaced by their ranks. The test then proceeds very similarly to an ANOVA. See subchapter 14a online for the details.
Chapter 15 - One-way analysis of variance for correlated samples
The analysis of variance for correlated samples is an extension of the correlated-samples t-test from Chapter 12. In the latter, we have a certain number of subjects, each measured under two conditions, or, alternatively, we have a certain number of matched pairs with one member measured under condition A and the other under condition B. The correlated-samples ANOVA has the same structure, except that now the number of conditions (k) is 3 or more.
When each subject is measured under each of the k conditions, we have a repeated-measures (or within-subject) design:

Subject / Condition   A      B      C
1                     S1-A   S1-B   S1-C
2                     S2-A   S2-B   S2-C
3                     S3-A   S3-B   S3-C
(and so on: each row is one subject, measured under each condition)

When it involves subjects matched in sets of k, with the subjects in each matched set randomly assigned to one of the k conditions, we have a randomized-blocks design (each set of k matched subjects is a block):

Subject / Condition   A       B       C
1                     S1a-A   S1b-B   S1c-C
2                     S2a-A   S2b-B   S2c-C
3                     S3a-A   S3b-B   S3c-C
(and so on: each row is k matched subjects, each measured under one condition)

The correlated-samples ANOVA starts like the independent-samples version, by computing SSbg and SSwg. It then removes the variability that exists inside each of the k groups by identifying the portion of SSwg that is attributable to pre-existing individual differences. This portion, SSsubj, is dropped from the analysis, while the remaining portion, SSerror, is used to measure the sheer, random variability.
Identification and removal of SSsubj
For each subject (repeated measures) or each block (randomized blocks), we compute the mean Msubj* (where subj* represents any subject). Then we compute the weighted square deviate k(Msubj* - MT)2. The sum of these over all subjects is precisely
SSsubj = ∑ k (Msubj* - MT)2
In practice a more efficient formula is used:
SSsubj = [∑ (∑Xsubj*)2] / k - (∑XTi)2 / NT
From there, we easily get SSerror = SSwg - SSsubj.
The degrees of freedom are computed as before (dfT = NT - 1, dfbg = k - 1, dfwg = NT - k), but now we need to split dfwg into dfsubj and dferror, corresponding to SSsubj and SSerror:
dfsubj = Nsubj - 1
dferror = dfwg - dfsubj
Finally we compute MSbg = SSbg / dfbg and MSerror = SSerror / dferror, and the F-ratio F = MSbg / MSerror, which is looked up against the critical F-value for df = (dfbg, dferror).
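The identity between the definitional and computational formulas for SSsubj, and the rest of the correlated-samples procedure, can be sketched on invented repeated-measures data (5 subjects, 3 conditions; the numbers below are made up for illustration and are not from the course).

```python
# Sketch with invented repeated-measures data: 5 subjects (rows),
# k = 3 conditions (columns). Checks that the computational formula
# for SSsubj matches the definitional one, then carries on to F.
rows = [
    [3, 4, 6],
    [5, 5, 7],
    [2, 4, 5],
    [4, 6, 8],
    [6, 7, 9],
]
k = len(rows[0])
Nsubj = len(rows)
NT = k * Nsubj
allx = [x for r in rows for x in r]
MT = sum(allx) / NT

# One-way SS values, computed as for independent samples.
cols = list(zip(*rows))                 # one tuple per condition
SST = sum((x - MT) ** 2 for x in allx)
SSwg = sum(sum((x - sum(c) / Nsubj) ** 2 for x in c) for c in cols)
SSbg = SST - SSwg

# SSsubj, definitional and computational forms.
SSsubj_def = sum(k * (sum(r) / k - MT) ** 2 for r in rows)
SSsubj = sum(sum(r) ** 2 for r in rows) / k - sum(allx) ** 2 / NT
SSerror = SSwg - SSsubj

dfbg, dfwg = k - 1, NT - k
dfsubj = Nsubj - 1
dferror = dfwg - dfsubj
F = (SSbg / dfbg) / (SSerror / dferror)
print(round(F, 2))   # ≈ 48.86 on these invented data
```

Removing SSsubj from the error term is what makes the correlated design more sensitive: the same data analyzed as independent samples would use the much larger SSwg in the denominator.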
The ANOVA summary table looks like this:

Source                      SS       df   MS    F     P
between groups ("effect")   120.0    2    60.0  20.0  <.01
within groups               2285.0   51
  · error                   103.3    34   3.0
  · subjects                2181.7   17
TOTAL                       2405.0   53

Note in this example the large value of SSsubj which, once removed, makes the difference significant.
Assumptions of the one-way ANOVA for correlated samples:
- the scale on which the dependent variable is measured is an equal-interval scale;
- the measures within each group are independent of each other;
- the source population(s) can be reasonably supposed to have a normal distribution;
- the k samples have approximately equal variance, i.e. the estimated variances of the source populations SSg/(Ng - 1) are within a ratio of 1.5;
- the correlation coefficients (r) among the k groups are positive and of approximately the same magnitude (homogeneity of covariance).
This ANOVA is robust to assumptions 3 & 4 (the samples necessarily have the same size here).
Step 1 - Combine all k groups and calculate SST = ∑(Xi2) - (∑Xi)2 / NT
Step 2 - For each group calculate SSg = ∑(Xgi2) - (∑Xgi)2 / Ng
Step 3 - Calculate SSwg = ∑SSg
Step 4 - Calculate SSbg = SST - SSwg
Step 5 - Calculate SSsubj = [∑ (∑Xsubj*)2] / k - (∑XTi)2 / NT
Step 6 - Calculate SSerror = SSwg - SSsubj
Step 7 - Calculate dfT = NT - 1, dfbg = k - 1, dfwg = NT - k, dfsubj = Nsubj - 1, dferror = dfwg - dfsubj
Step 8 - Calculate MSbg = SSbg / dfbg and MSerror = SSerror / dferror
Step 9 - Calculate F = MSbg / MSerror
Step 10 - Refer the calculated F-ratio to the table of critical values of Fdfbg, dferror
Post-ANOVA comparisons: the Tukey HSD test
The Tukey HSD test introduced in Chapter 14 can be extended to correlated samples by using MSerror instead of MSwg and dferror instead of dfwg.
Since the number of measures per group is equal to the number of subjects in a repeated-measures design, or to the number of matched sets of subjects in a randomized-blocks design, we also substitute Nsubj for Np/s. The critical value for the difference between any two means to be significant is then
HSDc = Qc . sqrt[MSerror / Nsubj]
where Qc is the critical value from the Q distribution for k and dferror at the level of confidence c.
Friedman test for 3 or more correlated samples
This is a non-parametric version of the ANOVA for correlated samples. Given that the ANOVA is pretty robust, it is used mainly when the measures are mere rank-orderings or mere ratings, or are intrinsically non-linear (e.g. decibels). The procedure resembles the Kruskal-Wallis test (Chapter 14); see subchapter 15a online.
Chapter 16 - Two-way analysis of variance for independent samples
The two-way ANOVA examines the effects of two independent variables concurrently. In addition to looking at the effect of each variable, it also looks at their interaction, i.e. whether the two variables interact with respect to their effect on the dependent variable.
Suppose we want to test the effect of two drugs A and B on the blood level of a certain hormone. We randomly and independently sort the subjects into 4 groups: r1c1 is a control group (it will get a placebo), r1c2 will get one unit of B, r2c1 will get one unit of A, and r2c2 will get one unit of A and one unit of B. This is conventionally represented as a matrix, and it is conventional to speak of the row variable (here, A) and the column variable (here, B):

                         Drug B
                    0 units         1 unit
Drug A   0 units    r1c1            r1c2
                    0 A, 0 B        0 A, 1 B
         1 unit     r2c1            r2c2
                    1 A, 0 B        1 A, 1 B

The differences between the means of the row variable and the means of the column variable are called the main effects.
The interaction effect is something above and beyond the two main effects. Here are four scenarios (rows = Drug A, columns = Drug B, with row, column and grand means in the margins):

Scenario 1 - no effect at all, either separately or in combination. This corresponds to the null hypothesis.

        B=0u  B=1u
A=0u    5     5      5
A=1u    5     5      5
        5     5      5

Scenario 2 - both drugs increase the level of the hormone, but their combined effect is a simple addition.

        B=0u  B=1u
A=0u    5     10     7.5
A=1u    10    15     12.5
        7.5   12.5   10

Scenario 3 - the drugs have the same separate effects as above, but now they interact: the effect of the two in combination is greater than the sum of the separate effects.

        B=0u  B=1u
A=0u    5     10     7.5
A=1u    10    20     15
        7.5   15     11.25

Scenario 4 - here, too, there is an interaction effect, but in the opposite direction: the combined effect is a mutual cancellation.

        B=0u  B=1u
A=0u    5     10     7.5
A=1u    10    5      7.5
        7.5   7.5    7.5

Note in the last scenario that there is no main effect, since the row means are equal and the column means are equal, while each drug obviously has an effect when presented separately. This means that the interpretation of the presence or absence of main effects is not always simple and straightforward. There is an interaction when the combination of the variables produces effects that are different from the simple sum of their separate effects.
When represented graphically, the absence of interaction appears as lines that are approximately parallel, while the presence of an interaction appears as lines that diverge, converge or cross over each other.
Procedure
The two-way ANOVA proceeds similarly to the one-way ANOVA: we compute SSbg and SSwg. SSwg, divided by its degrees of freedom, ends up as the MSerror in the denominator of the F-ratios, while SSbg is further divided into three parts: the two main effects (MSrows and MScolumns) and the interaction effect (MSinteraction). This results in three F-ratios and three tests of significance to perform.
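The additivity idea behind the interaction effect can be checked numerically: under the null hypothesis of no interaction, every cell mean equals Mr + Mc - MT. A short sketch over the scenario tables above (cell values from the text; the helper function is an illustration invented here):

```python
# For each cell, compute the residual between the observed cell mean and
# the additive prediction Mr + Mc - MT. All-zero residuals mean no
# interaction; nonzero residuals mean the factors interact.
def interaction_residuals(cells):
    r, c = len(cells), len(cells[0])
    MT = sum(map(sum, cells)) / (r * c)            # grand mean
    Mr = [sum(row) / c for row in cells]           # row means
    Mc = [sum(col) / r for col in zip(*cells)]     # column means
    return [[cells[i][j] - (Mr[i] + Mc[j] - MT) for j in range(c)]
            for i in range(r)]

additive = [[5, 10], [10, 15]]   # scenario 2: main effects only
synergy = [[5, 10], [10, 20]]    # scenario 3: positive interaction

print(interaction_residuals(additive))  # all zeros: no interaction
print(interaction_residuals(synergy))   # nonzero: interaction present
```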
Breaking SSbg into its component parts
SSrows (or SSr) measures the differences among the means of the two or more rows. It is expected to be 0 if there is no main row effect. It is computed as the sum of square deviates between each row mean and the grand mean:
SSr = Nr . ∑ (Mri - MT)2, where Nr is the number of measures in each row.
As usual, there is a more efficient way to compute this:
SSr = ∑ [(∑Xri)2 / Nri] - (∑XT)2 / NT
Similarly, SScolumns (or SSc) is the sum of square deviates between each column mean and the grand mean. It can be computed as
SSc = ∑ [(∑Xci)2 / Nci] - (∑XT)2 / NT
SSinteraction (or SSrxc) is a bit more complicated. Let us call Mg* the mean of any particular group, Mr* the mean of the row to which that group belongs and Mc* the mean of the column to which that group belongs. If there were no interaction, we would expect the mean of each group to be a simple additive combination of Mr* and Mc*, namely:
[null] Mg* - MT = (Mr* - MT) + (Mc* - MT), or [null] Mg* = Mr* + Mc* - MT
SSrxc is then the sum of square deviates between the observed group means Mgi and the corresponding expected group means Mr* + Mc* - MT. However, since we are breaking SSbg into three components, a much more efficient way to compute SSrxc is
SSrxc = SSbg - SSr - SSc
Next, we compute the degrees of freedom for each of these values of SS:
dfr = r - 1
dfc = c - 1
dfrxc = (r-1)(c-1)
From there we get the three mean squares:
MSr = SSr / dfr
MSc = SSc / dfc
MSrxc = SSrxc / dfrxc
and the three F-ratios:
Fr = MSr / MSerror
Fc = MSc / MSerror
Frxc = MSrxc / MSerror
which we can compare with the critical values in the corresponding F tables, i.e.
for df = (dfr, dferror), df = (dfc, dferror) and df = (dfrxc, dferror), respectively.
It is useful to present the results in a table, as follows:

Source                  SS      df   MS      F      P
between groups          395.19  3
  rows                  31.33   1    31.33   4.47   <.05
  columns               49.28   1    49.28   7.03   <.05
  interaction           314.58  1    314.58  44.88  <.01
within groups (error)   252.34  36   7.01
TOTAL                   647.53  39

One must be careful in interpreting main and interaction effects. If we plot the row and column means, it looks like both A and B decrease the amount of hormone. If however we plot the group means, the pattern is clearer: there is a negative interaction effect between A and B. The interaction plot shows crossing lines, which is typical of this scenario.
Assumptions of the two-way ANOVA for independent samples:
- the scale on which the dependent variable is measured is an equal-interval scale;
- the samples are independently and randomly drawn from the source population(s);
- the source population(s) can be reasonably supposed to have a normal distribution;
- the samples have approximately equal variance.
The ANOVA is robust to assumptions 3 & 4 if all samples have the same size and, to a certain extent, to assumption 1 (see below).
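The SS decomposition and the three F-ratios can be sketched on invented 2x2 data (three observations per cell; the numbers are made up for illustration, so they do not reproduce the table above):

```python
# Sketch of the two-way decomposition on invented 2x2 data, using the
# raw-score formulas from the text.
cells = {  # (row, col) -> scores; invented data, 3 per cell
    (0, 0): [4, 5, 6],
    (0, 1): [6, 7, 8],
    (1, 0): [5, 6, 7],
    (1, 1): [9, 10, 11],
}
r = c = 2
allx = [x for xs in cells.values() for x in xs]
NT = len(allx)
sumT = sum(allx)

SST = sum(x * x for x in allx) - sumT ** 2 / NT
SSwg = sum(sum(x * x for x in xs) - sum(xs) ** 2 / len(xs)
           for xs in cells.values())
SSbg = SST - SSwg

def margin_ss(index):
    """SS for the row (index=0) or column (index=1) margins."""
    total = 0.0
    for level in range(2):
        xs = [x for key, v in cells.items() if key[index] == level for x in v]
        total += sum(xs) ** 2 / len(xs)
    return total - sumT ** 2 / NT

SSr, SSc = margin_ss(0), margin_ss(1)
SSrxc = SSbg - SSr - SSc            # interaction, by subtraction

dfr, dfc, dfrxc, dfwg = r - 1, c - 1, (r - 1) * (c - 1), NT - r * c
MSerror = SSwg / dfwg
Fr = (SSr / dfr) / MSerror
Fc = (SSc / dfc) / MSerror
Frxc = (SSrxc / dfrxc) / MSerror
print(round(Fr, 2), round(Fc, 2), round(Frxc, 2))  # 12.0 27.0 3.0
```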
Step 1 - Combine all rc groups and calculate SST = ∑(Xi2) - (∑Xi)2 / NT
Step 2 - For each of the rc groups g calculate SSg = ∑(Xgi2) - (∑Xgi)2 / Ng
Step 3 - Calculate SSwg = ∑SSg
Step 4 - Calculate SSbg = SST - SSwg
Step 5 - Calculate SSr = ∑ [(∑Xri)2 / Nri] - (∑XT)2 / NT
Step 6 - Calculate SSc = ∑ [(∑Xci)2 / Nci] - (∑XT)2 / NT
Step 7 - Calculate SSrxc = SSbg - SSr - SSc
Step 8 - Calculate dfT = NT - 1, dfwg = NT - rc, dfr = r - 1, dfc = c - 1, dfrxc = (r-1)(c-1)
Step 9 - Calculate MSr = SSr / dfr, MSc = SSc / dfc, MSrxc = SSrxc / dfrxc and MSerror = SSwg / dfwg
Step 10 - Calculate Fr = MSr / MSerror, Fc = MSc / MSerror and Frxc = MSrxc / MSerror
Step 11 - Refer the calculated F-ratios to the table of critical values of Fdfnum, dfden
Robustness of two-way analysis of variance with ordinal scale data
If we plug ordinal rating-scale data into an analysis of variance and end up with "significant" effects, are those effects really significant in the technical statistical sense of the term? We can test this by generating random samples for which we know that the null hypothesis is true and checking whether there is still only a 5% chance of ending up with a "significant" effect at the .05 level. The simulator available online shows that this is indeed the case, which provides evidence that the ANOVA is quite robust with respect to the scale of the dependent variable.
Chapter 17 - One-way analysis of covariance for independent samples
The analysis of covariance, or ANCOVA, combines the analysis of variance from Chapters 13 to 16 with the linear regression and correlation from Chapter 3. It can remove the obscuring effects of pre-existing individual differences among subjects without resorting to the repeated-measures strategy. In addition, it can compensate for systematic biases among the samples.
Assume we want to assess the effectiveness of two methods of elementary mathematics instruction. Obviously, we cannot use repeated measures, so one group receives Method A and the other Method B. Since we want to make sure the measures are not biased by the level of intelligence of each group, we measure the level of intelligence X of each subject beforehand, and we measure the amount of learning Y at the end of the experiment.

       Method A              Method B
Subj.  Xa      Ya     Subj.  Xb      Yb
a1     88      66     b1     90      62
a2     98      85     b2     100     87
a3     100     90     b3     110     91
a4     110     97     b4     120     98
Mean   99.0    84.5          105.0   84.5

The resulting table shows two things:
- the higher the intelligence, the more learning takes place, i.e. there is a covariance between X and Y;
- the means of Y are the same, as if the same amount of learning had taken place.
However, if there were no difference between the methods, then, given the correlation between intelligence and learning, and since group B is more intelligent on average than group A, group B should have learned more.
In short, the ANCOVA does two things:
- it removes from Y the extraneous variability that derives from pre-existing individual differences, insofar as those differences are reflected in the covariate X;
- it adjusts the means of the Y variable to compensate for the fact that different groups started with different mean levels of the covariate X.
This allows us to run what-if scenarios, such as: what would have happened if group A had started with the same level of intelligence as group B?
The rest of the chapter will use the following example of a comparison of two methods of hypnotic induction, where the covariate X is the score on an index of primary suggestibility, which is known to affect the response to hypnotic induction Y.

       Method A              Method B
Subj.  Xa      Ya     Subj.  Xb      Yb
a1     5       20     b1     7       19
a2     10      23     b2     12      26
a3     12      30     b3     27      33
a4     9       25     b4     24      35
a5     23      34     b5     18      30
a6     21      40     b6     22      31
a7     14      27     b7     26      34
a8     18      38     b8     21      28
a9     6       24     b9     14      23
a10    13      31     b10    9       22
Mean   13.1    29.2          18.0    28.1

Note how the difference between the means of Y is small when compared to the spread of the measures, which would make the difference between the means non-significant with a regular ANOVA. Note also the correlation (r = +.803) between X and Y, which the ANCOVA will take into account.
The ANCOVA requires four sets of calculations:
1- SS values for Y, i.e. an analysis of variance for the dependent variable in which we are chiefly interested => SST(Y), SSwg(Y) and SSbg(Y);
2- SS values for X, the covariate whose effects upon Y we wish to bring under statistical control => SST(X) and SSwg(X);
3- SC measures for the covariance of X and Y => SCT and SCwg;
4- a final set of calculations to remove from Y the portion of its variability that is attributable to its covariance with X.
1- Calculations for the dependent variable Y
As with the usual ANOVA, we compute SST(Y), SSwg(Y) and SSbg(Y).
2- Calculations for the covariate X
Next we compute SST(X) and SSwg(X) (SSbg(X) is not needed).
3- Calculations for the covariance of X and Y
The sum of codeviates is (Chapter 3): SC = ∑ (Xi - MX)(Yi - MY)
It can be calculated more efficiently as SC = ∑ XiYi - (∑Xi)(∑Yi) / N
We compute two values of SC:
- SCT, the covariance of X and Y within the total array of data:
SCT = ∑ XTiYTi - (∑XTi)(∑YTi) / NT
- SCwg, the covariance within the two groups:
SCwg = SCwg(a) + SCwg(b) = ∑ XaiYai - (∑Xai)(∑Yai) / Na + ∑ XbiYbi - (∑Xbi)(∑Ybi) / Nb
4- Final set of calculations
4a- Adjustment of SST(Y)
The overall correlation between X and Y (both groups combined) is:
rT = SCT / sqrt[SST(X) SST(Y)]
The proportion of the total variability of Y attributable to its covariance with X is rT2. We adjust SST(Y) by removing this proportion:
[adj]SST(Y) = (1 - rT2) SST(Y) = SST(Y) - SCT2 / SST(X)
4b- Adjustment of SSwg(Y)
The same goes for the aggregate correlation between X and Y within the two groups, calculated as rwg = SCwg / sqrt[SSwg(X) SSwg(Y)].
SSwg(Y) is adjusted by removing the proportion of the within-groups variability of Y attributable to covariance with X:
[adj]SSwg(Y) = (1 - rwg2) SSwg(Y) = SSwg(Y) - SCwg2 / SSwg(X)
4c- Adjustment of SSbg(Y)
This is simply obtained by subtraction:
[adj]SSbg(Y) = [adj]SST(Y) - [adj]SSwg(Y)
4d- Adjustment of the means of Y for groups A and B
We compute the slope of the regression line for the aggregate correlation between X and Y within the two groups:
bwg = SCwg / SSwg(X)
We can now adjust the means of Y to take into account the fact that the two groups started with different means for X, together with the correlation between X and Y:
[adj]MYa = MYa - bwg(MXa - MXT)
[adj]MYb = MYb - bwg(MXb - MXT)
4e- Analysis of covariance using adjusted values of SS
The F-ratio is, as usual, MSeffect / MSerror = (SSbg/dfbg)/(SSwg/dfwg), except that we use the adjusted values of SSbg and SSwg and an adjusted value of dfwg:
[adj]dfwg(Y) = NT - k - 1
dfbg(Y) = k - 1
F = ([adj]SSbg(Y) / dfbg(Y)) / ([adj]SSwg(Y) / [adj]dfwg(Y))
The analysis of the significance of the F-ratio is a bit trickier than with the ANOVA. A significant F-ratio does not mean that the difference between the sample means MYa and MYb is significant in and of itself. What it means is:
- if the correlation between X and Y within the general population is approximately the same as the one observed within the samples; and
- if we remove from Y the covariance that it has with X, so as to remove the pre-existing individual differences that are measured by the covariate X; and
- if we adjust the group means of Y in accordance with the observed correlation between X and Y;
- then we can conclude that the two adjusted means [adj]MYa and [adj]MYb significantly differ.
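The whole ANCOVA computation can be checked against the hypnotic-induction data listed above. This is a sketch, not the course's own code; it reproduces the stated r = +.803 and yields a clearly significant F.

```python
import math

# ANCOVA sketch for the hypnotic-induction example (data from the text).
Xa = [5, 10, 12, 9, 23, 21, 14, 18, 6, 13]
Ya = [20, 23, 30, 25, 34, 40, 27, 38, 24, 31]
Xb = [7, 12, 27, 24, 18, 22, 26, 21, 14, 9]
Yb = [19, 26, 33, 35, 30, 31, 34, 28, 23, 22]
k, NT = 2, len(Xa) + len(Xb)

def ss(v):      # sum of squared deviates, raw-score formula
    return sum(x * x for x in v) - sum(v) ** 2 / len(v)

def sc(x, y):   # sum of codeviates, raw-score formula
    return sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / len(x)

XT, YT = Xa + Xb, Ya + Yb
SST_Y, SSwg_Y = ss(YT), ss(Ya) + ss(Yb)
SST_X, SSwg_X = ss(XT), ss(Xa) + ss(Xb)
SCT, SCwg = sc(XT, YT), sc(Xa, Ya) + sc(Xb, Yb)

rT = SCT / math.sqrt(SST_X * SST_Y)       # overall correlation, ≈ +.803

adj_SST_Y = SST_Y - SCT ** 2 / SST_X      # step 4a
adj_SSwg_Y = SSwg_Y - SCwg ** 2 / SSwg_X  # step 4b
adj_SSbg_Y = adj_SST_Y - adj_SSwg_Y       # step 4c

bwg = SCwg / SSwg_X                       # pooled within-groups slope
MXT = sum(XT) / NT
adj_MYa = sum(Ya) / 10 - bwg * (sum(Xa) / 10 - MXT)   # step 4d
adj_MYb = sum(Yb) / 10 - bwg * (sum(Xb) / 10 - MXT)

F = (adj_SSbg_Y / (k - 1)) / (adj_SSwg_Y / (NT - k - 1))  # step 4e
print(round(rT, 3), round(F, 2))   # rT ≈ .803; F ≈ 16, significant beyond .01
```

Note how the raw difference between MYa = 29.2 and MYb = 28.1 is tiny, yet after adjustment the means move apart and the F-ratio becomes large.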
Assumptions of the one-way ANCOVA for independent samples:
- the scale on which the dependent variable is measured is an equal-interval scale;
- the samples are independently and randomly drawn from the source population(s);
- the source population(s) can be reasonably supposed to have a normal distribution;
- the samples have approximately equal variance;
- the slopes of the regression lines for each of the groups considered separately are all approximately the same (homogeneity of regression, see below).
The ANCOVA is robust to assumptions 1, 3 & 4 if all samples have the same size.
Example
We compare three methods (A, B, C) of instruction for computer programming (Y), taking into account the pre-existing levels of computer familiarity (X). We get MXa = 13.0, MYa = 25.4, MXb = 13.8, MYb = 26.3, MXc = 10.1, MYc = 23.8. After running the ANCOVA, we get the following adjusted means, which differ quite sharply from the observed ones:
[adj]MYa = 24.3, [adj]MYb = 23.9, [adj]MYc = 27.3
The ANCOVA gives an F-ratio of 3.8 (df = 2, 32), which is significant beyond the .05 level. However, we still need to test for the homogeneity of regression.
Homogeneity of regression
For the above example, there are 4 regression lines to consider: one for each of the three groups, and one for the overall within-groups regression. Although the lines have similar slopes when plotted, we need a more rigorous test. To that end, we once more use an F-ratio of the form
F = MSeffect / MSerror = (SSeffect/dfeffect) / (SSerror/dferror)
- SSeffect is a measure of the aggregate degree to which the separate regression-line slopes of the groups differ from the slope (bwg) of the overall within-groups regression. It is called SSb-reg and computed as
SSb-reg = ∑ (SCg2 / SSg(X)) - SCwg2 / SSwg(X)
- SSerror is what is left over from [adj]SSwg(Y), the adjusted error term of the original ANCOVA, after SSb-reg has been removed from it.
It is called SSY(remainder):
SSY(remainder) = [adj]SSwg(Y) - SSb-reg
- the corresponding degrees of freedom are
dfb-reg = k - 1
dfY(remainder) = [adj]dfwg(Y) - dfb-reg = NT - 2k
For the above example, we get an F-ratio of 0.30 (df = 2, 30), which is not significant; the homogeneity-of-regression assumption is therefore satisfied.
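The homogeneity-of-regression test can be sketched in the same way. The raw scores of the three-method example are not reproduced in this summary, so the sketch below applies the formulas to the two-group hypnotic-induction data from earlier in the chapter (an illustration only, not Lowry's worked example):

```python
# Homogeneity-of-regression check on the two-group hypnotic-induction
# data from Chapter 17 (k = 2 groups).
Xa = [5, 10, 12, 9, 23, 21, 14, 18, 6, 13]
Ya = [20, 23, 30, 25, 34, 40, 27, 38, 24, 31]
Xb = [7, 12, 27, 24, 18, 22, 26, 21, 14, 9]
Yb = [19, 26, 33, 35, 30, 31, 34, 28, 23, 22]
groups = [(Xa, Ya), (Xb, Yb)]
k = len(groups)
NT = sum(len(x) for x, _ in groups)

def ss(v):
    return sum(x * x for x in v) - sum(v) ** 2 / len(v)

def sc(x, y):
    return sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / len(x)

SSwg_X = sum(ss(x) for x, _ in groups)
SSwg_Y = sum(ss(y) for _, y in groups)
SCwg = sum(sc(x, y) for x, y in groups)
adj_SSwg_Y = SSwg_Y - SCwg ** 2 / SSwg_X       # error term of the ANCOVA

# Aggregate departure of the separate slopes from the pooled slope.
SS_breg = sum(sc(x, y) ** 2 / ss(x) for x, y in groups) - SCwg ** 2 / SSwg_X
SS_rem = adj_SSwg_Y - SS_breg
F = (SS_breg / (k - 1)) / (SS_rem / (NT - 2 * k))
print(round(F, 2))   # well below the .05 critical value of F(1,16) ≈ 4.49
```

Since this F is not significant, the homogeneity-of-regression assumption also holds for the two-group example, and its ANCOVA result can be taken at face value.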