Statistics for EES and MEME
3. t-test

Dirk Metzler

21 May 2012

Contents

1 Taking standard errors into account
2 t-test for paired samples
  2.1 Example: Orientation of pied flycatchers
  2.2 Example: direction-dependent thickness of cork
  2.3 Principle of statistical testing
3 t-test for unpaired samples
  3.1 H0: means and variances are equal
  3.2 H0: means are equal, variances may be different

1 Taking standard errors into account

Important consequence: Consider the interval

  [ x̄ − s/√n , x̄ + s/√n ]

This interval contains µ with a probability of ca. 2/3; the probability that µ lies outside the interval is ca. 1/3. Thus it may well happen that x̄ deviates from µ by more than s/√n.

Application 1: Which values of µ are plausible?

x̄ = 0.12, s/√n = 0.007. Question: Could it be that µ = 0.115? Answer: Yes, that is not unlikely: the deviation x̄ − µ = 0.120 − 0.115 = 0.005 is smaller than the standard error s/√n = 0.007, and deviations of this size are not unusual.

Application 2: Comparison of mean values

Example: Galathea — carapace lengths in a sample.
  Males:   x̄1 = 3.04 mm, s1 = 0.9 mm, n1 = 25
  Females: x̄2 = 3.23 mm, s2 = 0.9 mm, n2 = 29
The females are apparently larger. Is this significant, or could it be just random?

How precisely do we know the true mean values? For the males, s1/√n1 = 0.18 mm, so we have to assume an uncertainty of about ±0.18 mm in x̄1. For the females, s2/√n2 = 0.17 mm, so it is not unlikely that x̄2 deviates from the true mean by more than ±0.17 mm. The difference of the means, 3.23 − 3.04 = 0.19 mm, is not much larger than these expected inaccuracies.
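The standard-error arithmetic of the Galathea example can be reproduced with a few lines of code (Python is used here purely for illustration; the document's own examples use R, and the numbers below are the summary statistics given above):

```python
import math

def standard_error(s, n):
    """Standard error of the mean: s / sqrt(n)."""
    return s / math.sqrt(n)

# Summary statistics from the Galathea carapace-length example
se_males = standard_error(0.9, 25)    # males:   x̄1 = 3.04 mm, s1 = 0.9 mm, n1 = 25
se_females = standard_error(0.9, 29)  # females: x̄2 = 3.23 mm, s2 = 0.9 mm, n2 = 29
diff = 3.23 - 3.04                    # difference of the sample means

print(round(se_males, 2))    # 0.18
print(round(se_females, 2))  # 0.17
print(round(diff, 2))        # 0.19
```

The difference of 0.19 mm is of the same order as the two standard errors, which is why it cannot be called significant without a formal test.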
It could also be due to pure randomness that x̄2 > x̄1. More precisely: even if the true means are actually equal, µ_Females = µ_Males, it is still quite likely that the sample means x̄2 and x̄1 differ by this much. In the language of statistics: the difference of the mean values is (statistically) not significant. Not significant = can be just random.

Application 3: If mean values are represented graphically, you should also show their precision (±s/√n).

[Figure: carapace-length means for males and females, once without error bars ("not like this") and once with ± standard errors ("like that").]

Application 4: Planning an experiment — how many observations do I need? (How large should n be?)

If you know which precision you need (i.e. the desired standard error s/√n of x̄) and if you already have an idea of s, then you can estimate the necessary n from s/√n = g, where g is the desired standard error.

Example: Stressed transpiration values in another sorghum subspecies: x̄ = 0.18, s = 0.06, n = 13. How often do we have to repeat the experiment to get a standard error of ≈ 0.01? Which n do we need?

Solution: we want s/√n ≈ 0.01, and from the previous experiment we know s ≈ 0.06, so √n ≈ 0.06/0.01 = 6 and therefore n ≈ 6 · 6 = 36.

Summary

• Assume a population has mean value µ and standard deviation σ.
• We draw a sample of size n from this population with sample mean x̄.
• x̄ is a random variable with mean value µ and standard deviation σ/√n.
• Estimate the standard deviation of x̄ by s/√n.
• s/√n is the standard error (of the mean).
• Deviations of x̄ of the magnitude of s/√n are usual. They are not significant: they can be random.

2 t-test for paired samples

2.1 Example: Orientation of pied flycatchers

[Figure: sample means ± standard errors for four groups A, B, C, D.]

Is A significantly different from B? No: A is within the 2/3 range of µ_B. Is B significantly different from C?
No: the 2/3 ranges of µ_B and µ_C overlap. Is C significantly different from D? We do not know yet. (The answer will be "no".)

References

[WGS+04] Wiltschko, W.; Gesson, M.; Stapput, K.; Wiltschko, R. Light-dependent magnetoreception in birds: interaction of at least two different receptors. Naturwissenschaften 91.3, pp. 130-134, 2004.

[WRS+05] Wiltschko, R.; Ritz, T.; Stapput, K.; Thalau, P.; Wiltschko, W. Two different types of light-dependent responses to magnetic fields in birds. Curr Biol 15.16, pp. 1518-1523, 2005.

[WSB+07] Wiltschko, R.; Stapput, K.; Bischof, H. J.; Wiltschko, W. Light-dependent magnetoreception in birds: increasing intensity of monochromatic light changes the nature of the response. Front Zool 4, 2007.

[Figures: flight directions of one bird under blue light and under green light, with the corresponding exit points; arrowheads mark the barycenter of the exit points. The more variable the directions, the shorter the barycenter vectors.]

Question of interest: Does the color of monochromatic light influence the orientation?

Experiment: For 17 birds compare the barycenter vector lengths for blue and for green light.

[Figure: pied flycatchers, n = 17 — barycenter vector lengths for green light plotted against those for blue light, with the diagonal.]

Compute for each bird the distance to the diagonal, i.e.

  x := "green length" − "blue length"

Can the theoretical mean be µ = 0 (i.e. the expectation value for a randomly chosen bird)?

  x̄ = 0.0518
  s = 0.0912
  SEM = s/√n = 0.0912/√17 = 0.022

Is |x̄ − µ| ≈ 0.0518 a large deviation? Large? Large compared to what? Large compared to the standard error?
  |x̄ − µ| / (s/√n) ≈ 0.0518 / 0.022 ≈ 2.34

So x̄ is more than 2.3 standard errors away from µ = 0. How probable is this when 0 is the true theoretical mean? In other words: is this deviation significant?

We know: (x̄ − µ)/(σ/√n) is asymptotically (for large n) standard-normally distributed. Here σ is the theoretical standard deviation, which we do not know; instead of σ we use the sample standard deviation s. We have applied the "2/3 rule of thumb", which is based on the normal distribution. For questions about significance this approximation is too imprecise if σ is replaced by s (and n is not extremely large).

General rule 1. If X1, …, Xn are independently drawn from a normal distribution with mean µ, and

  s² = (1/(n−1)) · Σ_{i=1..n} (Xi − X̄)²,

then

  (X̄ − µ) / (s/√n)

is t-distributed with n − 1 degrees of freedom (df). The t-distribution is also called the Student distribution, since Gosset published it under this pseudonym.

[Figures: density of the t-distribution with 16 df compared with the standard normal density and with the normal density with µ = 0 and σ² = 16/14; density of the t-distribution with 5 df compared with the normal density with µ = 0 and σ² = 5/3.]

How (im)probable is a deviation by at least 2.34 standard errors? Pr(T = 2.34) = 0 does not help! Compute instead Pr(|T| ≥ 2.34), the so-called p-value: the total area of the two tail regions.
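Before handing the computation to R, it can be reproduced from first principles. The sketch below (Python, purely for illustration; the document's own code is R) evaluates the t-density from its definition and integrates the two tails numerically:

```python
import math

def t_density(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1.0 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t_obs, df, upper=50.0, steps=100000):
    """Pr(|T| >= |t_obs|): trapezoidal rule over the upper tail,
    doubled by symmetry. 'upper' truncates the integral; the mass
    beyond it is negligible for moderate df."""
    a = abs(t_obs)
    h = (upper - a) / steps
    area = 0.5 * (t_density(a, df) + t_density(upper, df))
    for i in range(1, steps):
        area += t_density(a + i * h, df)
    return 2.0 * h * area

# Pied-flycatcher example: x̄ = 0.0518, s = 0.0912, n = 17
t_obs = 0.0518 / (0.0912 / math.sqrt(17))
p = two_sided_p(t_obs, df=16)
print(round(t_obs, 2))  # 2.34
print(round(p, 2))      # 0.03 — cf. the 0.0326 from pt() in the text
```

This is only a numerical check; in practice one uses the distribution functions of a statistics package, as the notes do next with R's pt().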
R does that for us:

  > pt(-2.34,df=16)+pt(2.34,df=16,lower.tail=FALSE)
  [1] 0.03257345

Comparison with the normal distribution:

  > pnorm(-2.34)+pnorm(2.34,lower.tail=FALSE)
  [1] 0.01928374
  > pnorm(-2.34,sd=16/14)+
  +   pnorm(2.34,sd=16/14,lower.tail=FALSE)
  [1] 0.04060902

t-test with R:

  > x <- length$green-length$blue
  > t.test(x)

          One Sample t-test

  data:  x
  t = 2.3405, df = 16, p-value = 0.03254
  alternative hypothesis: true mean is not equal to 0
  95 percent confidence interval:
   0.004879627 0.098649784
  sample estimates:
  mean of x
  0.05176471

We note: p-value = 0.03254. That is: if the null hypothesis "just random", in our case µ = 0, holds, then a deviation at least as large as the observed one has a probability of only about 3.3%.

If we always reject the null hypothesis if and only if the p-value is below a significance level of 0.05, then the following holds: if the null hypothesis is true, the probability that we (wrongly) reject it is only 0.05.

If we set the level of significance to α = 0.05, we reject the null hypothesis H0 if and only if the t-value falls in one of the two rejection tails (example: t-distribution with df = 16).

Which t-values are significant on the "5% level"?

  df    |t| ≥ …
  5     2.57
  10    2.23
  20    2.09
  30    2.04
  100   1.98
  ∞     1.96

  > qt(0.025,df=c(5,10,20,30,100,1e100))
  [1] -2.570582 -2.228139 -2.085963 -2.042272 -1.983972 -1.959964

2.2 Example: direction-dependent thickness of cork

Simulated data! The data in the following example are simulated. The computer simulations are inspired by a classical example data set:

[RAO48e] Rao, C.R. Tests of significance in multivariate analysis. Biometrika 35, pp. 58-79, 1948.

For n = 28 trees the thickness of the cork [mm] was measured on each aspect:

  n:  72  60  56  41  32  30  39  …
  e:  66  53  57  29  32  35  39  …
  s:  76  66  64  36  35  34  31  …
  w:  77  63  58  38  36  26  27  …

Can some of these differences be significant?
[Figure: stripchart of cork thickness for each direction (n, e, s, w) with sample means and sample means ± standard errors.]

Can some of these differences be significant? Have we missed something? We have neglected which measurements came from the same tree! The trees differ in their size; we must compare values that come from the same tree (paired t-test).

[Figure: cork thickness on the west aspect (kork$w) plotted against cork thickness on the north aspect (kork$n) for the n = 28 trees.]

[Figure: differences in cork thickness between north and west aspect for the n = 28 trees, with sample mean and sample mean ± standard error.]

Is the difference significantly different from 0?

  x := (thickness north) − (thickness west)
  x̄ ≈ 5.36
  s_x ≈ 7.99
  s_x/√n ≈ 1.51
  t-value = x̄ / (s_x/√n) ≈ 3.547
  degrees of freedom: df = n − 1 = 27

  > t.test(cork$n-cork$w)

          One Sample t-test

  data:  cork$n - cork$w
  t = 3.5471, df = 27, p-value = 0.001447
  alternative hypothesis: true mean is not equal to 0
  95 percent confidence interval:
   2.258274 8.456012
  sample estimates:
  mean of x
   5.357143

All pairwise comparisons (paired t-tests):

  north−east:  p-value = 0.0072
  north−west:  p-value = 0.0014
  north−south: p-value = 0.574
  west−east:   p-value = 0.607
  west−south:  p-value = 0.0039
  east−south:  p-value = 0.0912

2.3 Principle of statistical testing

Example: Codon bias

• We see 16710 times CCT and 18895 times CCC as a codon for proline in a large sample of human genes.
• If both are equally probable, we expect 17802.5 of each.
• The observation differs from this value by 1092.5.
• z-test: the probability of a deviation like this or larger is smaller than 10⁻³⁰.
• We conclude that CCT and CCC do not seem to be equally probable.

Example: Bird orientation

• Does the variability of the flight direction depend on the ambient light color?
• Measure the variability by the length of the barycenter vector.
• Quantify the difference by X = (length green) − (length blue).
• If the light color has no influence, then EX = 0.
• But we observe X̄ = 0.0518 and SEM = 0.022.
• t-test: the p-value of this deviation is ca. 3.3%.
• It seems likely that the color of the light has some influence.

Example: Thickness of cork

• X = (thickness on north aspect) − (thickness on west aspect)
• If the direction has no effect, then EX = 0.
• But we see X̄ = 5.36 and SEM = 1.51.
• t-test: the p-value of this deviation is ca. 0.14%.
• Thus, it seems that the direction has some effect.

The general principle:

• We want to argue that some deviation in the data is not just random.
• To this end we first specify a null hypothesis H0, i.e. we define what "just random" means.
• Then we try to show: if H0 is true, then a deviation that is at least as large as the observed one is very improbable.
• If we can show this, we reject H0.
• How we measure deviation must be fixed before we see the data.

Null hypotheses

• H0 for the codon bias: CCT and CCC have the same probability of 1/2. Moreover: all positions decide independently between CCT and CCC.
• H0 for bird orientation and cork thickness: EX = 0. Moreover: X normally distributed, Xi independent.

Deviations and p-values

• Codon bias: the number of CCT deviates by 1092.5 from the H0-expectation. The binomial distribution model gives us σ² = n/4, and we can apply the z-test to compute the p-value: the probability that a bin(n, 1/2)-distributed random variable deviates from n/2 by 1092.5 or more.
• Bird orientation and cork thickness:

  t-value = (X̄ − EX) / (s/√n)

  p-value: the probability that a t-distributed value with n − 1 df deviates from 0 at least as much as the t-value calculated from the observed data.

Testing two-sided or one-sided?

Suppose we observe a value x̄ that is much larger than the H0 expectation value µ.
[Figures: two-sided test — rejection mass of 2.5% in each tail; one-sided test — rejection mass of 5% in the upper tail.]

  two-sided p-value = Pr_H0(|X̄ − µ| ≥ |x̄ − µ|)
  one-sided p-value = Pr_H0(X̄ ≥ x̄)

The pure teachings of statistical testing

• Specify a null hypothesis H0, e.g. µ = 0.
• Specify a level of significance α, e.g. α = 0.05.
• Specify an event A such that Pr_H0(A) = α (or at least Pr_H0(A) ≤ α), e.g. A = {X̄ > q} or A = {|X̄ − µ| > r}; in general: A = {p-value ≤ α}.
• AND ONLY AFTER THAT: look at the data and check whether A occurs.
• Then the probability that H0 is rejected although H0 is actually true ("type I error") is just α.

Violations of the pure teachings

"The two-sided test gave me a p-value of 0.06. Therefore, I tested one-sided and this worked out nicely." is as bad as: "At first glance I saw that x̄ is larger than µ_H0, so I immediately applied the one-sided test."

Important: The decision between one-sided and two-sided must not depend on the concrete data that are used in the test. More generally: if A is the event that will lead to the rejection of H0 (if it occurs), then A must be defined without being influenced by the data that are used for testing.

The choice of A should depend on the alternative H1, which is what we actually want to substantiate by rejecting H0. We need Pr_H0(A) = α and Pr_H1(A) as large as possible, to make the probability of the type II error as small as possible. (Type II error: H1 is true but H0 is not rejected.)

Examples

• If our original intention is to show that the flight direction of pied flycatchers varies more in blue light than in green light, then we can use a one-sided test.
• Even if the data then show quite clearly that the opposite is true, we cannot reject the null hypothesis that the color of the light has no effect (at least not if we really stick to the pure teachings).
• If we initially want to show that the cork on the northern aspect of the tree is thicker than on the western, we can use a one-sided test.
• If the data then clearly show that the opposite is true, we may, strictly speaking, not call this significant.

This means: use separate data sets for exploratory data analysis and for testing.

In some fields these rules are followed quite strictly, e.g. when testing new pharmaceuticals for approval. In some other fields a practical approach is more common: just inform the reader about the p-values of the different null hypotheses, and let the reader decide which null hypothesis would have been the most natural one.

If H0 is rejected on the 5% level, which of the following statements is true?

• The null hypothesis is wrong.
• H0 is wrong with a probability of 95%.
• If H0 is true, you will see such an extreme (or an even more extreme) event only in 5% of the data sets. ✓

If the test did not reject H0, which of the following statements are true?

• We have to reject the alternative H1.
• H0 is true.
• H0 is probably true.
• It is safe to assume that H0 was true.
• Even if H0 is true, it is not so unlikely that our test statistic takes a value that is as extreme as the one we observed. ✓
• In this respect, H0 is compatible with the data. ✓

3 t-test for unpaired samples

3.1 H0: means and variances are equal

[Photo: Tetranychus urticae, (c) J. Holopainen, http://en.wikipedia.org/wiki/File:Tetranychus-urticae.jpg]

Example: Do mites favor plants that have not been infected with mites before?
Infect cotton plants with mites (Tetranychus urticae) and count the mites on plants that had mites before and on plants that are infected for the first time. The data for this mites-cotton example are artificially generated by computer simulation, inspired by actual experiments, cf. e.g.

[1] S. Harrison, R. Karban: Behavioral response of spider mites (Tetranychus urticae) to induced resistance of cotton plants. Ecological Entomology 11:181-188, 1986.

[Figure: stripcharts of the mite counts in the two groups — y: first time mites, x: had mites before.]

  y (first time mites):  ȳ = 168.40,  sd(y) = 91.09763,  sd(y)/√20 = 20.37005
  x (had mites before):  x̄ = 121.65,  sd(x) = 47.24547,  sd(x)/√20 = 10.56441

Our null hypothesis H0: all data values are independent samples from the same normal distribution. This H0 includes that the theoretical means in the two groups are equal, but also that the theoretical standard deviations are equal.

Theorem 1 (two-sample t-test, unpaired with equal variances). Suppose that X1, …, Xn and Y1, …, Ym are independent and normally distributed random variables with the same mean µ and the same variance σ². Define the pooled sample variance

  s_p² = ((n − 1)·s_X² + (m − 1)·s_Y²) / (m + n − 2).

The statistic

  t = (X̄ − Ȳ) / ( s_p · √(1/n + 1/m) )

follows a t-distribution with n + m − 2 degrees of freedom.
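Theorem 1 can be checked directly against the summary statistics of the mite example. The sketch below (Python, purely for illustration; the document itself uses R) computes the pooled variance and the t statistic from the two formulas:

```python
import math

def pooled_t(mean_x, s_x, n, mean_y, s_y, m):
    """Two-sample t statistic with pooled sample variance (Theorem 1).
    Returns the statistic and its degrees of freedom n + m - 2."""
    sp2 = ((n - 1) * s_x**2 + (m - 1) * s_y**2) / (n + m - 2)
    t = (mean_x - mean_y) / (math.sqrt(sp2) * math.sqrt(1 / n + 1 / m))
    return t, n + m - 2

# Summary statistics of the mite example from the text
t, df = pooled_t(168.40, 91.09763, 20, 121.65, 47.24547, 20)
print(round(t, 2), df)  # 2.04 38 — cf. R: t = 2.0373, df = 38
```

The small discrepancy in the last digits comes from using the rounded summary statistics rather than the raw data.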
  > t.test(y,x,var.equal=TRUE)

          Two Sample t-test

  data:  y and x
  t = 2.0373, df = 38, p-value = 0.04862
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval:
    0.2970719 93.2029281
  sample estimates:
  mean of x mean of y
     168.40    121.65

[Figures: normal Q-Q plots of the pooled sample and of each group separately ("had mites before", "have mites the first time").]

Confirm the p-value by a permutation test:

1. Let t_orig be the original t-value from the comparison between plants that were infected for the first time and plants that had mites before.
2. Set a counter k to 0.
3. Repeat the following steps e.g. 100000 times:
   • randomly shuffle the labels "first time infected" and "had mites before";
   • compute the t-statistic t between these random groups;
   • if |t| ≥ |t_orig|, increase the counter k by 1.
4. The simulated p-value is k/100000, the fraction of simulated t-values that were more extreme than t_orig.

In R:

  > t.orig <- t.test(cont,ex,var.equal=TRUE)$statistic
  > count <- 0
  > for( i in 1:100000) {
  +     random.group <- sample(c(rep("cont",20),rep("ex",20)))
  +     y <- all[random.group=="cont"]
  +     x <- all[random.group=="ex"]
  +     t <- t.test(x,y,var.equal=TRUE)$statistic
  +     if( abs(t)>=abs(t.orig) ) {
  +         count <- count+1
  +     }
  + }
  > count/100000
  [1] 0.04895

3.2 H0: means are equal, variances may be different

Hipparion dataset

• Hipparion is an extinct genus of horse.
• Hipparion africanum lived 4 million years ago.
• Hipparion libycum lived 2.5 million years ago.
• 2.8 million years ago the climate in east Africa became cooler and drier.
• Hipparion then ate more grass and fewer leaves.
• Can we see this in the fossil teeth?
• Data: mesiodistal lengths of 77 back teeth, which were found in the Chiwondo Beds, Malawi, and are now in the Landesmuseum Darmstadt.

[Figure: stripcharts of the mesiodistal lengths (md) for africanum and libycum.]

Theorem 2 (Welch's t-test). Suppose that X1, …, Xn and Y1, …, Ym are independent and normally distributed random variables with EXi = EYj and potentially different variances VarXi = σ_X² and VarYi = σ_Y². Let s_X² and s_Y² be the sample variances. The statistic

  t = (X̄ − Ȳ) / √( s_X²/n + s_Y²/m )

approximately follows a t-distribution with

  ( s_X²/n + s_Y²/m )² / ( s_X⁴/(n²·(n−1)) + s_Y⁴/(m²·(m−1)) )

degrees of freedom.

  > t.test(md[Species=="africanum"],md[Species=="libycum"])

          Welch Two Sample t-test

  data:  md[Species == "africanum"] and md[Species == "libycum"]
  t = -3.2043, df = 54.975, p-value = 0.002255
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval:
   -4.1025338 -0.9453745
  sample estimates:
  mean of x mean of y
   25.91026  28.43421

[Figures: normal Q-Q plots of the pooled data and of each species separately (Hipparion africanum, Hipparion libycum).]

• The qqnorm plots show deviations from the normality assumption.
• It is not a priori clear whether the t-test is applicable.
• A permutation test is difficult here, because H0 allows the standard deviations to be different.
• Examine the robustness of the t-test by simulation studies,
• or apply some non-parametric test, e.g. a rank test.
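As a cross-check of Theorem 2, Welch's statistic and its approximate degrees of freedom can be computed from sample summaries. The sketch below (Python, purely for illustration; the sample sizes and variances in the demo call are made up, since the raw Hipparion data are not reproduced here) implements the two formulas:

```python
import math

def welch(mean_x, s_x, n, mean_y, s_y, m):
    """Welch's t statistic and approximate degrees of freedom (Theorem 2)."""
    vx, vy = s_x**2 / n, s_y**2 / m          # s_X²/n and s_Y²/m
    t = (mean_x - mean_y) / math.sqrt(vx + vy)
    df = (vx + vy)**2 / (vx**2 / (n - 1) + vy**2 / (m - 1))
    return t, df

# Hypothetical sample summaries (NOT the Hipparion data, whose raw
# values are not listed in the text):
t, df = welch(25.9, 1.4, 39, 28.4, 3.1, 38)
print(round(t, 2), round(df, 1))
```

A useful sanity check of the df formula: with equal variances and equal sample sizes it reduces to 2·(n − 1), which is exactly the n + m − 2 degrees of freedom of the pooled test in Theorem 1.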