Hypothesis Tests II
The normal distribution
Normally distributed data
Normally distributed means
First, let's consider a simpler problem…
𝐻0 : πœ‡ = πœ‡0
We are testing whether the mean of a population (Y) equals a particular
value.
Now, if 𝐻0 is assumed, what do we know? We have some idea
about the distribution of the sample mean YΜ„.
We need a measuring device that is sensitive to deviations from the
statement in 𝐻0…
z = (z₁, zβ‚‚) has a bivariate normal distribution:

$$f(z_1, z_2) = \frac{1}{2\pi\sigma_{z_1}\sigma_{z_2}\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(z_1-\mu_{z_1})^2}{\sigma_{z_1}^2} + \frac{(z_2-\mu_{z_2})^2}{\sigma_{z_2}^2} - \frac{2\rho(z_1-\mu_{z_1})(z_2-\mu_{z_2})}{\sigma_{z_1}\sigma_{z_2}}\right]\right)$$
If z = (z₁, zβ‚‚) has a bivariate normal distribution,
then the pdf of the points (z₁ βˆ’ zΜ„, zβ‚‚ βˆ’ zΜ„), where zΜ„ = (z₁ + zβ‚‚)/2, is a
one-dimensional normal distribution. This distribution
lies over the line z₁ + zβ‚‚ = 0, because all points of
the form (z₁ βˆ’ zΜ„, zβ‚‚ βˆ’ zΜ„) are situated on this line.
[Figure: points in the (z₁, zβ‚‚) plane projected onto the line z₁ + zβ‚‚ = 0.]
This is how one
dimension (degrees
of freedom) is lost!
If z = (z₁, zβ‚‚, z₃) has a multivariate normal
distribution, then the pdf of the points (z₁ βˆ’ zΜ„, zβ‚‚ βˆ’ zΜ„, z₃ βˆ’ zΜ„)
is a two-dimensional normal
distribution. This distribution lies over the plane
z₁ + zβ‚‚ + z₃ = 0, because all points of the form
(z₁ βˆ’ zΜ„, zβ‚‚ βˆ’ zΜ„, z₃ βˆ’ zΜ„) are situated on this plane.
(Hard to draw.)
That means that even though the points (z₁ βˆ’ zΜ„, zβ‚‚ βˆ’ zΜ„) lie in a two-
dimensional space, the probability distribution function defined
over them is essentially one-dimensional.
$$\sum_{i=1}^{n} z_i^2 \sim \chi^2_n$$

But,

$$\sum_{i=1}^{n} (z_i - \bar{z})^2 \sim \chi^2_{n-1}$$
The situation resembles the following: assume we have two
standard normally distributed random variables, 𝑋1 and 𝑋2 . Then the
distribution of the sum of their squares, i.e., 𝑋12 + 𝑋22 , does not
necessarily have a Chi-squared distribution with two degrees of
freedom. Why?
Consider the case where 𝑋2 = βˆ’π‘‹1 . Then 𝑋12 + 𝑋22 = 2𝑋12 , which
is a scaled Chi-squared variable with one degree of freedom. Hence,
unless 𝑋1 , 𝑋2 are independent, 𝑋12 + 𝑋22 need not have a Chi-squared
distribution with two degrees of freedom.
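This can be checked by simulation; a minimal sketch (seed and sample size are arbitrary):

```python
# With X2 = -X1 the sum of squares behaves like a scaled chi-square
# with 1 df, not a chi-square with 2 df.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(100_000)

# Dependent case: X2 = -X1, so X1^2 + X2^2 = 2*X1^2 (one df, scaled by 2).
dep = x1**2 + (-x1)**2

# Independent case: a fresh X2 gives a genuine chi-square with 2 df.
x2 = rng.standard_normal(100_000)
ind = x1**2 + x2**2

# A chi-square with k df has mean k and variance 2k.
print(dep.mean(), dep.var())  # ~2 and ~8: matches 2*chi2(1), not chi2(2)
print(ind.mean(), ind.var())  # ~2 and ~4: matches chi2(2)
```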
What is the χ² distribution?
The t-distribution
That is why we divide by (n βˆ’ 1)
in calculating the sample standard deviation.
One sample t-test
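A minimal sketch of a one-sample t-test with scipy; the data and the hypothesized mean ΞΌβ‚€ = 5 are invented for illustration:

```python
# One-sample t-test: does the sample mean differ from mu0?
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9])  # made-up data
mu0 = 5.0

t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# Equivalent by hand: t = (ybar - mu0) / (s / sqrt(n)), with s using n-1.
ybar, s, n = sample.mean(), sample.std(ddof=1), len(sample)
print((ybar - mu0) / (s / np.sqrt(n)))
```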
Two-sample tests
B and A are types of seeds.
Numerical Example (wheat again)
Summary: We have so far seen what a good
test statistic (null distribution) looks like. The
distribution that we have selected is a textbook
distribution. Could we pick others?
Choosing Test Statistic
The t statistic
The Kolmogorov-Smirnov statistic
Comparing the test statistics
Sensitivity to specific alternatives
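To illustrate sensitivity to specific alternatives, a small simulation (parameters are illustrative only): the t statistic targets shifts in the mean, while the Kolmogorov-Smirnov statistic reacts to any difference in distribution shape.

```python
# Two samples with equal means but different spreads.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(loc=0.0, scale=1.0, size=200)
b = rng.normal(loc=0.0, scale=2.0, size=200)  # same mean, larger spread

print(stats.ttest_ind(a, b))  # insensitive here: the means are equal
print(stats.ks_2samp(a, b))   # sensitive: the shapes differ
```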
Discussion
Or…
β€’ We need to add in additional assumptions,
such as equality of the standard deviations of
the samples.
Two-sample tests
B and A are types of seeds.
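A sketch of a pooled-variance two-sample t-test in the spirit of the seed example; the yield numbers for types A and B are hypothetical:

```python
# Two-sample t-test under the equal-standard-deviations assumption.
import numpy as np
from scipy import stats

yield_A = np.array([22.1, 20.4, 23.3, 21.7, 22.8])  # hypothetical yields
yield_B = np.array([19.5, 20.1, 18.7, 19.9, 20.6])

# equal_var=True encodes the extra assumption noted above:
# equal standard deviations in the two populations (pooled variance).
t_stat, p_value = stats.ttest_ind(yield_A, yield_B, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```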
Contingency Tables (Cross-Tabs)
We use cross-tabulation when:
β€’ We want to look at relationships among
two or three variables.
β€’ We want a descriptive statistical measure
to tell us whether differences among
groups are large enough to indicate some
sort of relationship among variables.
Cross-tabs are not sufficient to:
β€’ Tell us the strength or actual size of the relationships
among two or three variables.
β€’ Test a hypothesis about the relationship between two or
three variables.
β€’ Tell us the direction of the relationship among two or
more variables.
β€’ Look at relationships between one nominal or ordinal
variable and one ratio or interval variable unless the range
of possible values for the ratio or interval variable is small.
What do you think a table with a large number of ratio
values would look like?
Because we use tables in these ways,
we can set up some decision rules
about how to use tables
β€’ Independent variables should be column
variables.
β€’ If you are not looking at independent and
dependent variable relationships, use the variable
that can logically be said to influence the other as
your column variable.
β€’ Using this rule, always calculate column
percentages rather than row percentages.
β€’ Use the column percentages to interpret your
results.
For example,
β€’ If we were looking at the relationship between
gender and income, gender would be the column
variable and income would be the row variable.
Logically gender can determine income. Income
does not determine your gender.
β€’ If we were looking at the relationship between
ethnicity and location of a person’s home,
ethnicity would be the column variable.
β€’ However, if we were looking at the relationship
between gender and ethnicity, one does not
influence the other. Either variable could be the
column variable.
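A sketch of the column-percentage rule using pandas; the gender/income records below are invented for illustration:

```python
# Cross-tab with the independent variable (gender) as the column
# variable and percentages computed within each column.
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M", "F", "M"],
    "income": ["low", "high", "low", "high", "high", "low", "high", "low"],
})

table = pd.crosstab(df["income"], df["gender"], normalize="columns")
print(table)  # each column sums to 1: column percentages
```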
Contingency Tables (Cross-Tabs)

Marital Status by Gender:

             Male   Female
Married       37      41
Single        51      32

How do we measure the relationship?
What do we EXPECT if there is no
relationship?

             Gender
Result     Female   Male   Total
Cured                        78
Not                          83
Total        88      73     161
Observed      F      M
Cured        37     41
Not          51     32

Expected      F      M
Cured      42.6   35.4
Not        45.4   37.6
$$\chi^2 = \frac{(37-42.6)^2}{42.6} + \frac{(41-35.4)^2}{35.4} + \frac{(51-45.4)^2}{45.4} + \frac{(32-37.6)^2}{37.6} \approx 3.18$$
RESULT
● This test statistic has a Ο‡2 distribution with
(2-1)(2-1) = 1 degree of freedom
● The critical value at Ξ± = .01 of the Ο‡2 distribution with
1 degree of freedom is 6.63
● Thus we do not reject the null hypothesis that the
two proportions are equal, i.e., that the drug is equally
effective for female and male patients
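The hand computation can be verified with scipy; correction=False disables the Yates continuity correction so the statistic matches the formula above:

```python
# Chi-square test of independence on the 2x2 table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[37, 41],   # Cured:  F, M
                     [51, 32]])  # Not:    F, M

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof)  # ~3.18 with 1 degree of freedom
print(expected)   # the 42.6 / 35.4 / 45.4 / 37.6 table
print(p)          # well above 0.01, so H0 is not rejected
```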
INTRODUCTION TO ANOVA
β€’ The easiest way to understand ANOVA is to generate a tiny data set
(using the GLM, the general linear model):
Y = ΞΌ + Ξ± + e
As a first step, set the mean ΞΌ to 5 for a dataset with 10 cases. In the
table below all 10 cases have a score of 5 at this point.
π’‚πŸ
π’‚πŸ
CASE
SCORE
CASE
SCORE
𝑠1
5
𝑠6
5
𝑠2
5
𝑠7
5
𝑠3
5
𝑠8
5
𝑠4
5
𝑠9
5
𝑠5
5
𝑠10
5
β€’ The next step is to add the effects of the IV.
Suppose that the effect of the treatment at π‘Ž1 is
to raise scores by 2 units and the effect of the
treatment at π‘Ž2 is to lower scores by 2 units.
π’‚πŸ
π’‚πŸ
CASE
SCORE
CASE
SCORE
𝑠1
5+2=7
𝑠6
5-2=3
𝑠2
5+2=7
𝑠7
5-2=3
𝑠3
5+2=7
𝑠8
5-2=3
𝑠4
5+2=7
𝑠9
5-2=3
𝑠5
5+2=7
𝑠10
5-2=3
Ξ£π‘Œπ‘Ž1 = 35
Ξ£π‘Œπ‘Ž2 = 15
2
Ξ£π‘Œπ‘Ž1
= 245
2
Ξ£π‘Œπ‘Ž2
= 45
π‘Œπ‘Ž1 = 7
π‘Œπ‘Ž2 = 3
β€’ The changes produced by treatment are the
deviations of the scores from ΞΌ. Over all of
these cases the sum of squared deviations is
5(2)Β² + 5(βˆ’2)Β² = 40
This is the sum of the (squared) effects of
treatment if all cases are influenced identically
by the various levels of A and there is no error.
β€’ The third step is to complete the GLM with
the addition of error.
π’‚πŸ
π’‚πŸ
CASE
SCORE
CASE
SCORE
𝑠1
5+2+2=9
𝑠6
5-2+0=3
𝑠2
5+2+0=7
𝑠7
5-2-2=1
𝑠3
5+2-1=6
𝑠8
5-2+0=3
𝑠4
5+2+0=7
𝑠9
5-2+1=4
𝑠5
5+2-1=6
𝑠10
5-2+1=4
Ξ£π‘Œπ‘Ž1 = 35
Ξ£π‘Œπ‘Ž2 = 15
2
Ξ£π‘Œπ‘Ž1
= 251
2
Ξ£π‘Œπ‘Ž2
= 51
π‘Œπ‘Ž1 = 7
π‘Œπ‘Ž2 = 3
Ξ£π‘Œ = 50
Ξ£π‘Œ 2 = 302
π‘Œ=5
Then the variance for the a1 group is

$$a_1:\quad s^2_{N-1} = \frac{\Sigma Y^2 - (\Sigma Y)^2/N}{N-1} = \frac{251 - 35^2/5}{4} = 1.5$$

And the variance for the a2 group is

$$a_2:\quad s^2_{N-1} = \frac{51 - 15^2/5}{4} = 1.5$$

The average of these variances is also 1.5.
Check that these numbers represent error variance; that is, they
represent random variability in scores within each group, where all cases are
treated the same and are therefore uncontaminated by effects of the IV.
The variance for this group of 10 numbers, ignoring group membership, is

$$s^2_{N-1} = \frac{302 - 50^2/10}{9} = 5.78$$
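These variance computations can be reproduced with numpy (ddof=1 divides by N βˆ’ 1, matching the sample-variance formula):

```python
# Within-group and overall variances for the toy data above.
import numpy as np

a1 = np.array([9, 7, 6, 7, 6])
a2 = np.array([3, 1, 3, 4, 4])

print(a1.var(ddof=1))                        # 1.5, error variance in a1
print(a2.var(ddof=1))                        # 1.5, error variance in a2
print(np.concatenate([a1, a2]).var(ddof=1))  # ~5.78, ignoring groups
```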
Standard Setup for ANOVA

        a1      a2
         9       3
         7       1
         6       3
         7       4
         6       4
Sum: Ξ£Y_a1 = A1 = 35    Ξ£Y_a2 = A2 = 15    Ξ£Y = T = 50
     Ξ£YΒ²_a1 = 251       Ξ£YΒ²_a2 = 51        Ξ£YΒ² = 302
     YΜ„_a1 = 7           YΜ„_a2 = 3           YΜ„ = GM = 5
The difference between each score and the grand mean (Y_ij βˆ’ GM) is broken into
two components:
1. The difference between the score and its own group mean (Y_ij βˆ’ YΜ„_j)
2. The difference between that group mean and the grand mean (YΜ„_j βˆ’ GM)

$$Y_{ij} - GM = (Y_{ij} - \bar{Y}_j) + (\bar{Y}_j - GM)$$

The first component leads to the sum of squares for error; the second leads to
the sum of squares for treatment, the effect of the IV!
Each term is then squared and summed separately to produce the sum of squares for error
and the sum of squares for treatment. The basic partition holds because the
cross-product terms vanish.
π‘Œπ‘–π‘— βˆ’ 𝐺𝑀
𝑖
𝑗
2
=
π‘Œπ‘–π‘— βˆ’ π‘Œπ‘—
𝑖
𝑗
2
+
Yj βˆ’ GM
𝑛
𝑗
2
π‘Œπ‘–π‘— βˆ’ 𝐺𝑀
𝑖
𝑗
2
=
π‘Œπ‘–π‘— βˆ’ π‘Œπ‘—
𝑖
2
+
𝑗
Yj βˆ’ GM
𝑛
2
𝑗
This is the deviation form of basic ANOVA. Each of these terms is
a sum of squares (SS).
$$\sum_i \sum_j (Y_{ij} - GM)^2 = SS_{total} = SS_T$$
The average of this sum is the total variance in the set of scores,
ignoring group membership.

$$\sum_i \sum_j (Y_{ij} - \bar{Y}_j)^2 = SS_{error} = SS_{wg}$$
This term is called the sum of squares within groups.

$$n \sum_j (\bar{Y}_j - GM)^2 = SS_{treatment} = SS_{bg}$$
This term is called the SS between groups.

This sum is frequently symbolized as
$$SS_T = SS_{bg} + SS_{wg}$$
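A numerical check of the partition SS_T = SS_bg + SS_wg on the toy data above:

```python
# Verify that the total sum of squares splits into within- and
# between-group parts.
import numpy as np

groups = [np.array([9, 7, 6, 7, 6]),   # a1
          np.array([3, 1, 3, 4, 4])]   # a2
all_y = np.concatenate(groups)
gm = all_y.mean()                      # grand mean = 5

ss_total = ((all_y - gm) ** 2).sum()                        # 52
ss_wg = sum(((g - g.mean()) ** 2).sum() for g in groups)    # 12
ss_bg = sum(len(g) * (g.mean() - gm) ** 2 for g in groups)  # 40

print(ss_total, ss_wg + ss_bg)  # both 52: the partition holds
```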
At this point it is important to realize that the total variance in
the set of scores is partitioned into two sources. One is the
effect of the IV and the other is all remaining effects (which
we call error). Because the effects of the IV are assessed by
changes in the central tendencies of the groups, the
inferences that come from ANOVA are about differences in
central tendency.
However, sums of squares are not yet variances. To become
variances, they must be β€˜averaged’. The denominators for
averaging SS must be degrees of freedom so that the statistics
will have a proper πœ’ 2 distribution (remember the previous slides).
So far we know that the degrees of freedom of 𝑆𝑆𝑇
must be N βˆ’ 1.
π‘‘π‘“π‘‘π‘œπ‘‘π‘Žπ‘™ = 𝑁 βˆ’ 1 = π‘Žπ‘› βˆ’ 1
Furthermore,
𝑑𝑓𝑏𝑔 = π‘Ž βˆ’ 1
Also,
𝑑𝑓𝑀𝑔 = π‘Ž 𝑛 βˆ’ 1 = π‘Žπ‘› βˆ’ π‘Ž = 𝑁 βˆ’ π‘Ž
Thus we have (as expected)
π‘‘π‘“π‘‘π‘œπ‘‘π‘Žπ‘™ = 𝑑𝑓𝑏𝑔 + 𝑑𝑓𝑀𝑔
Variance is an β€˜averaged’ sum of squares (for empirical data, of
course). Then, to obtain the mean sums of squares (MS),

$$MS_{bg} = \frac{SS_{bg}}{a-1} \qquad MS_{wg} = \frac{SS_{wg}}{N-a}$$
The F distribution is the sampling distribution of the ratio of two
independent πœ’ 2 variables, each divided by its degrees of freedom.

$$F = \frac{MS_{bg}}{MS_{wg}}$$

This statistic is used to test the null hypothesis that πœ‡1 =
πœ‡2 = β‹― = πœ‡π‘Ž
Source table for basic ANOVA

Source     SS    df    MS     F
Between    40     1    40     26.67
Within     12     8    1.5
Total      52     9
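The same ANOVA can be reproduced with scipy.stats.f_oneway, which returns the F statistic in the source table:

```python
# One-way ANOVA on the two toy groups: F = MS_bg / MS_wg = 40 / 1.5.
from scipy.stats import f_oneway

a1 = [9, 7, 6, 7, 6]
a2 = [3, 1, 3, 4, 4]

f_stat, p_value = f_oneway(a1, a2)
print(f_stat)   # ~26.67, matching the source table
print(p_value)  # small, so the group means differ
```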