Chapter 8. Inferences about More Than Two Population Central Values

Case Study: Effect of Timing of the Treatment of Port-Wine Stains with Lasers (1)
• To investigate whether treatment at a young age would yield better results than treatment at an older age.
• Data classified by age:

Case Study: Effect of Timing of the Treatment of Port-Wine Stains with Lasers (2)

Case Study: Effect of Timing of the Treatment of Port-Wine Stains with Lasers (3)
• From the boxplots, we can observe that the four groups do not appear to differ greatly in improvement.
• We can now develop the analysis of variance procedure to determine whether a statistically significant difference exists among the four age groups.

Within-Sample Variation
• Because the variability among the sample means is large in comparison to the within-sample variation, we might conclude intuitively that the corresponding population means are different.

Between-Sample Variation
• In this table, the sample means are the same as given in the previous table, but the variability within a sample is much larger, and the between-sample variation is small relative to the within-sample variability.
• We would be less likely to conclude that the corresponding population means differ based on these data.

An Analysis of Variance (1)
• A statistical test about more than two population means
• The t test is used to test the equality of two population means:

$$t = \frac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{(1/n_1) + (1/n_2)}}$$

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

• $s_p^2$ is a pooled estimate of the common population variance $\sigma^2$.
• Now suppose that we wish to extend this method to test the equality of more than two population means. A more general method of data analysis is the analysis of variance.
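As a rough illustration of the pooled t statistic above, the following Python sketch computes $s_p^2$ and $t$ for two small samples. The function name `pooled_t` and the data values are hypothetical and chosen only to make the arithmetic easy to follow; they do not come from the case study.

```python
import math
from statistics import mean, variance

def pooled_t(sample1, sample2):
    """Return (t, s_p^2, df) for the pooled-variance two-sample t statistic."""
    n1, n2 = len(sample1), len(sample2)
    # statistics.variance uses the n - 1 divisor, i.e. the sample variance s^2
    s1_sq, s2_sq = variance(sample1), variance(sample2)
    # Pooled estimate of the common population variance
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    t = (mean(sample1) - mean(sample2)) / math.sqrt(sp_sq * (1 / n1 + 1 / n2))
    return t, sp_sq, n1 + n2 - 2

# Hypothetical measurements, for illustration only
group_a = [5.0, 6.0, 7.0, 8.0, 9.0]
group_b = [3.0, 4.0, 5.0, 6.0, 7.0]
t_stat, sp_sq, df = pooled_t(group_a, group_b)
print(f"s_p^2 = {sp_sq:.3f}, t = {t_stat:.3f}, df = {df}")
```

Both samples here have variance 2.5, so the pooled variance is also 2.5, and the difference in means of 2 gives t = 2 on 8 degrees of freedom.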
An Analysis of Variance (2)
• Summary of the sample results for five populations
• If we are interested in testing the equality of the population means (i.e., $\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$), we might be tempted to run all possible pairwise comparisons of two population means. If we confirm that the five distributions are approximately normal with the same variance $\sigma^2$, we could run 10 t tests comparing all pairs of means.
• Although we may have the probability of a Type I error fixed at $\alpha = 0.05$ for each individual test, the probability of falsely rejecting at least one of those tests is larger than 0.05. For instance, if the 10 tests were independent, that probability would be $1 - 0.95^{10} \approx 0.40$.

An Analysis of Variance (3)
• The analysis of variance procedures are developed under the following conditions:
– Each of the five populations has a normal distribution.
– The variances of the five populations are equal; that is, $\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma_4^2 = \sigma_5^2 = \sigma^2$.
– The five sets of measurements are independent random samples from their respective populations.

An Analysis of Variance (4)
• Within-sample variance

$$S_W^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2 + (n_3 - 1)S_3^2 + (n_4 - 1)S_4^2 + (n_5 - 1)S_5^2}{(n_1 - 1) + (n_2 - 1) + (n_3 - 1) + (n_4 - 1) + (n_5 - 1)} = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2 + (n_3 - 1)S_3^2 + (n_4 - 1)S_4^2 + (n_5 - 1)S_5^2}{n_1 + n_2 + n_3 + n_4 + n_5 - 5}$$

• Note that this quantity is merely an extension of

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

• $S_W^2$ represents a combined estimate of the common variance $\sigma^2$, and it measures the variability of the observations within the five populations.

An Analysis of Variance (5)
• If the null hypothesis $\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$ is true, then the populations are identical, with mean $\mu$ and variance $\sigma^2$.
• Drawing single samples from the five populations is then equivalent to drawing five different samples from the same population.
• To evaluate the variation in the five sample means, we need to know the sampling distribution of the sample mean computed from a random sample of 25 observations from a normal population.
• We can estimate the variance of the distribution of sample means, $\sigma^2/25$, using the formula:

$$\text{sample variance of the means} = \frac{\sum_{i=1}^{5} (\bar{y}_{i.} - \bar{y}_{..})^2}{5 - 1}$$

An Analysis of Variance (6)
• Between-sample variance
– The sample variance of the means estimates $\sigma^2/25$, and hence 25 × (sample variance of the means) estimates $\sigma^2$.
– We denote this quantity as $S_B^2$. The subscript B denotes a measure of the variability among the sample means for the five populations.
• Under the null hypothesis that all five population means are identical, we have two estimates of $\sigma^2$, namely $S_B^2$ and $S_W^2$. Suppose the ratio $S_B^2 / S_W^2$ is used as the test statistic to test the hypothesis that $\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$.
• Under the null hypothesis, $S_B^2 / S_W^2$ follows an F distribution with degrees of freedom $df_1 = 4$ for $S_B^2$ and $df_2 = 120$ for $S_W^2$.

An Analysis of Variance (7)
• The test statistic used to test equality of the population means is

$$F = \frac{S_B^2}{S_W^2}$$

• When the null hypothesis is true, both $S_B^2$ and $S_W^2$ estimate $\sigma^2$, and we expect F to assume a value near F = 1.
• When the hypothesis of equality is false, $S_B^2$ will tend to be larger than $S_W^2$ due to the differences among the population means.
• If the calculated value of F falls in the rejection region, we conclude that not all five population means are identical.

Analysis of Variance -- Completely Randomized Design (1)
• Summary of sample data for a completely randomized design

Analysis of Variance -- Completely Randomized Design (2)
• Total sum of squares (TSS)
– Let $S_T^2$ be the sample variance of the $n_T$ measurements.

$$TSS = \sum_{i=1}^{t} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2 = (n_T - 1) S_T^2$$

– It is possible to partition the total sum of squares as follows:

$$\sum_{i,j} (y_{ij} - \bar{y}_{..})^2 = \sum_{i,j} (y_{ij} - \bar{y}_{i.})^2 + \sum_{i} n_i (\bar{y}_{i.} - \bar{y}_{..})^2$$

– Within-sample sum of squares (SSW; $S_W^2$)

$$SSW = \sum_{i,j} (y_{ij} - \bar{y}_{i.})^2 = (n_1 - 1)S_1^2 + (n_2 - 1)S_2^2 + \cdots + (n_t - 1)S_t^2$$

– Between-sample sum of squares (SSB; $S_B^2$)

$$SSB = \sum_{i} n_i (\bar{y}_{i.} - \bar{y}_{..})^2$$

Analysis of Variance -- Completely Randomized Design (3)
• Although the formulas for TSS, SSW and SSB are easily interpreted, they are not easy to use for calculations. Instead, we recommend using a computer software program.
• An analysis of variance for a completely randomized design with t populations has the following null and alternative hypotheses:
$H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_t$
$H_a$: At least one of the t population means differs from the rest.
• The quantities $S_W^2$ and $S_B^2$ can be computed using the shortcut formulas

$$S_B^2 = \frac{SSB}{t - 1}, \qquad S_W^2 = \frac{SSW}{n_T - t}$$

Analysis of Variance Table

Example 8.2
• A clinical psychologist wished to compare three methods for reducing hostility levels in university students, and used a certain test (HLT) to measure the degree of hostility.

Answer to Example 8.2

The Model for Observation in a Completely Randomized Design (One-Way Classification)
• Assumptions
– Independent random samples
– Each sample is selected from a normal population.
– The mean and variance for population i are, respectively, $\mu_i$ and $\sigma^2$.
• Four distributions

Model for Analysis of Variance (1)
• $y_{ij}$, the jth sample measurement selected from population i, is the sum of three terms:

$$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$

• $\mu$ denotes an overall mean that is an unknown constant.
• $\alpha_i$ denotes an effect due to population i; $\alpha_i$ is an unknown constant.
• $\varepsilon_{ij}$ represents the random deviation of $y_{ij}$ about the ith population mean, $\mu_i$. The $\varepsilon_{ij}$'s are often referred to as error terms. The term error simply refers to the fact that the observations from the t populations differ by more than just their means.

Model for Analysis of Variance (2)

$$\mu_i = E(y_{ij}) = E(\mu + \alpha_i + \varepsilon_{ij}) = \mu + \alpha_i + E(\varepsilon_{ij}) = \mu + \alpha_i$$

• The $\varepsilon_{ij}$'s are normally distributed with mean 0 and variance $\sigma_\varepsilon^2$. Also, the variance for each of the t populations can be shown to be $\sigma_\varepsilon^2$.
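The sums of squares, mean squares and F statistic defined for the completely randomized design can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a replacement for a statistics package: the function name `one_way_anova`, the dictionary keys and the data are all hypothetical.

```python
from statistics import mean

def one_way_anova(groups):
    """Return sums of squares, mean squares and F for a list of samples."""
    t = len(groups)                                      # number of populations
    n_total = sum(len(g) for g in groups)                # n_T
    grand_mean = sum(sum(g) for g in groups) / n_total   # y-bar..
    group_means = [mean(g) for g in groups]              # y-bar_i.

    # Within-sample, between-sample and total sums of squares
    ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    tss = sum((y - grand_mean) ** 2 for g in groups for y in g)

    s_b_sq = ssb / (t - 1)          # between-sample mean square S_B^2
    s_w_sq = ssw / (n_total - t)    # within-sample mean square S_W^2
    return {"TSS": tss, "SSW": ssw, "SSB": ssb,
            "df_between": t - 1, "df_within": n_total - t,
            "S_B^2": s_b_sq, "S_W^2": s_w_sq, "F": s_b_sq / s_w_sq}

# Hypothetical data for t = 3 populations (not taken from the examples)
samples = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 2.0, 3.0]]
table = one_way_anova(samples)
for name, value in table.items():
    print(f"{name}: {value}")
```

For these values SSW = 6, SSB = 54 and TSS = 60, so the partition TSS = SSW + SSB holds, and F = 27 on 2 and 6 degrees of freedom. Deciding whether F falls in the rejection region still requires an F table or a library routine, for example `scipy.stats.f.sf` if SciPy is available.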
• Summary of some of the assumptions for a completely randomized design

Model for Analysis of Variance (3)
• Using the model, the null and alternative hypotheses are:
$H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_t = 0$
$H_a$: At least one of the $\alpha_i$'s differs from 0.
• We need to verify that these conditions are satisfied prior to making inferences from the analysis of variance table.
– The normality condition is not as critical as the equal variance assumption when we have large sample sizes, unless the populations are severely skewed or have very heavy tails.
– The assumption of homogeneity (equality) of population variances is less critical when the sample sizes are nearly equal; in that case the variances can be markedly different and the p-values for an analysis of variance will still be only mildly distorted.

Model for Analysis of Variance (4) -- Residual Analysis
• The normality condition will be evaluated using residual analysis. Define

$$\varepsilon_{ij} = y_{ij} - \mu_i$$

• Then, if the condition of equal variances is valid, the $\varepsilon_{ij}$'s are a random sample from a normal population. However, $\mu_i$ is an unknown constant, so we estimate $\mu_i$ with $\bar{y}_{i.}$ and let

$$e_{ij} = y_{ij} - \bar{y}_{i.}$$

• Then we can use the $e_{ij}$'s to evaluate the normality assumption. Even when the individual $n_i$'s are small, we would have $n_T$ residuals, which would provide a sufficient number of values to evaluate the normality condition.
• We can plot the $e_{ij}$'s in a boxplot or a normal probability plot to evaluate whether the data appear to have been generated from normal populations.

Example 8.3
• An international organization wanted to determine whether the clerics from different religions have different levels of awareness with respect to the causes of mental illness. Three random samples were drawn, one containing ten Methodist ministers, a second containing ten Catholic priests, and a third containing ten Pentecostal ministers. Each of the 30 clerics was then examined, using a standard written test, to measure his or her knowledge about causes of mental illness.
The test scores are listed in the following table.

Answer to Example 8.3 (1)
• Residuals $e_{ij}$ for clerics' knowledge of mental illness

$$e_{ij} = y_{ij} - \bar{y}_{i.}$$

Answer to Example 8.3 (2)
• Normal probability plot for residuals
• The plot shows a lack of concentration of the residuals about the straight line.

Answer to Example 8.3 (3)

Answer to Example 8.3 (4)
• Equal variance test
– Levene's test statistic is L = MSB/MSW = 178.3/186.9 = 0.95, which is less than the critical value 3.35. Thus we fail to reject the null hypothesis that the standard deviations are equal.
• The Kruskal-Wallis test can be used when the populations are nonnormal but have identical distributions under the null hypothesis.

Transformations of the Data -- An Alternative Analysis (1)
• A transformation of the sample data is defined to be a process in which the measurements on the original scale are systematically converted to a new scale of measurement.
• Transformation to achieve uniform variance

Transformations of the Data -- An Alternative Analysis (2)
• When it appears that $\sigma_i^2 = k\mu_i$ with $k \approx 1$, the transformation $y_T = \sqrt{y + 0.375}$ is appropriate.
• The logarithmic transformation ($y_T = \log(y)$) is appropriate any time the coefficient of variation $\sigma_i/\mu_i$ is constant across the populations of interest.
• The transformation $y_T = \arcsin\sqrt{y}$ is particularly appropriate for data recorded as percentages or proportions.

A Nonparametric Alternative: The Kruskal-Wallis Test (1)
• Extension of the rank sum test for more than two populations
– $H_0$: The k distributions are identical.
– $H_a$: Not all the distributions are the same.

$$H = \frac{12}{n_T (n_T + 1)} \sum_{i} \frac{T_i^2}{n_i} - 3(n_T + 1)$$

where $n_i$ is the number of observations from sample i (i = 1, 2, ..., k), $n_T$ is the combined (total) sample size, that is, $n_T = \sum_i n_i$, and $T_i$ denotes the sum of the ranks for the measurements in sample i after the combined sample measurements have been ranked.
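The H statistic above can be sketched in plain Python as follows. The helper names `midranks` and `kruskal_wallis_h` and the data are hypothetical, and the sketch assumes that tied measurements receive the average of the ranks they occupy, which the slides do not state explicitly.

```python
def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1   # average of 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks

def kruskal_wallis_h(groups):
    """Return H for a list of samples (no tie correction applied)."""
    pooled = [y for g in groups for y in g]
    n_total = len(pooled)                        # n_T
    ranks = midranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        t_i = sum(ranks[start:start + len(g)])   # rank sum T_i for sample i
        total += t_i ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

# Hypothetical data: three completely separated samples of size 3
samples = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h = kruskal_wallis_h(samples)
print(f"H = {h:.3f}")
```

Here the rank sums are 6, 15 and 24, so H = (12/90)(279) − 30 = 7.2. In practice a library routine such as `scipy.stats.kruskal` would normally be used instead.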
A Nonparametric Alternative: The Kruskal-Wallis Test (2)
• Note: When there are a large number of ties in the ranks of the sample measurements, use

$$H' = \frac{H}{1 - \left[ \sum_{j} (t_j^3 - t_j) / (n_T^3 - n_T) \right]}$$

where $t_j$ is the number of observations in the jth group of tied ranks.

A Nonparametric Alternative: The Kruskal-Wallis Test (3)
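The tie correction for H' can be sketched as below. The function names, the pooled values and the uncorrected value H = 5.0 are hypothetical and serve only to show the arithmetic of the correction factor.

```python
from collections import Counter

def tie_correction_factor(pooled_values):
    """Return 1 - sum(t_j^3 - t_j) / (n_T^3 - n_T) for the pooled sample."""
    n_total = len(pooled_values)
    tie_sizes = Counter(pooled_values).values()    # t_j for each distinct value
    tie_sum = sum(t ** 3 - t for t in tie_sizes)   # untied values contribute 0
    return 1 - tie_sum / (n_total ** 3 - n_total)

def corrected_h(h, pooled_values):
    """Return H' = H / correction factor (undefined if every value is tied)."""
    return h / tie_correction_factor(pooled_values)

# Hypothetical pooled sample with one tie group of size 2 and one of size 3
pooled = [1, 2, 2, 3, 3, 3, 4]
factor = tie_correction_factor(pooled)
h_prime = corrected_h(5.0, pooled)   # 5.0 is an assumed uncorrected H
print(f"factor = {factor:.4f}, H' = {h_prime:.4f}")
```

The two tie groups contribute (8 − 2) + (27 − 3) = 30, and $n_T^3 - n_T = 336$, so the factor is 1 − 30/336 ≈ 0.9107. Because the factor is at most 1, H' is never smaller than H.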