Ronald Heck
EDEP 768E: Seminar in Categorical Data Modeling (F2012)
Aug. 20, 2012

Week 2: Testing Whether an Observed Proportion is Consistent with a Population Proportion of Interest

This week we are going to look into the meaning of certain distributions other than the normal distribution. When we draw a sample, we must make certain assumptions about the population we draw it from. I drew the following sample of N = 105 from a data set of 10,000 or so individuals. Here is a distribution of ninth grade students in terms of how many core courses they failed during their first year of high school. This is an example of a Poisson distribution, which describes the number of events occurring over a specific period of time. The mean is just about 0.5. We can see that a few students failed all four courses, but the majority did not fail any. So this is obviously not a normal distribution. We might wish to test whether the sample is consistent with a Poisson distribution.

One-Sample Kolmogorov-Smirnov Test
                                        Fail
  N                                     105
  Poisson Parameter(a,b)    Mean        .48
  Most Extreme Differences  Absolute    .103
                            Positive    .103
                            Negative    -.044
  Kolmogorov-Smirnov Z                  1.052
  Asymp. Sig. (2-tailed)                .218
  a. Test distribution is Poisson.
  b. Calculated from data.

At the conventional .05 level, the significance of .218 gives no grounds to reject the Poisson distribution for these data.

In the next example, let’s suppose we wish to examine whether 10 students selected at random meet an intended target for proficiency in the population of 0.8 proficient.

Proficient
                  Frequency   Percent   Valid Percent   Cumulative Percent
  Valid   .00         3         30.0        30.0               30.0
          1.00        7         70.0        70.0              100.0
          Total      10        100.0       100.0

We can test whether the results (of .7 proficient) are different from the target using a binomial test. We can access this test in a couple of ways in SPSS. One way is through Analyze: Nonparametric Tests (One Sample).
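The binomial test just described (7 of 10 students proficient against a target of 0.8) can also be sketched outside SPSS. The following is a minimal illustration in Python using only the standard library; the function names `binom_pmf` and `binom_lower_tail` are hypothetical helpers introduced here, and the choice of a lower-tail (one-tailed) probability is an assumption about the direction of the test, not a description of what SPSS does internally.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_lower_tail(k, n, p):
    """One-tailed probability of observing k or fewer successes."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


# Values from the example: 7 proficient students out of 10, target 0.8.
n, k, target = 10, 7, 0.8
p_exactly_k = binom_pmf(k, n, target)
p_lower = binom_lower_tail(k, n, target)

print(f"P(X = {k})  = {p_exactly_k:.4f}")
print(f"P(X <= {k}) = {p_lower:.3f}")
```

Under these assumptions the lower-tail probability comes out at about 0.322, so observing 7 or fewer proficient students out of 10 would not be unusual if the population proportion really were 0.8.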
This opens a screen labeled “Choose tests,” where we select “Compare observed binary probability to hypothesized (Binomial test).” We then open Options, enter the hypothesized proportion (0.8), and define which category is “success” (the event of interest) by entering “1,” which identifies the proficient category. Double-clicking the resulting output opens additional output we might wish to have.

Alternatively, we can obtain the test through Analyze: Nonparametric Tests (Legacy Dialogs) by opening “Binomial.” When you do that, you place the variable you wish to test in the “Test Variable” dialog box, and for “Test Proportion” you have to enter the proportion for the first category (which is typically coded 0 for a dichotomous variable). Note that this is the “other” proportion from the previous test, so in this case the target proportion for “not proficient” students is 0.2. You can see the results are displayed differently but match.

Binomial Test
                          Category    N    Observed Prop.   Test Prop.   Exact Sig. (1-tailed)
  proficient   Group 1    .00         3    .3               .2           .322
               Group 2    1.00        7    .7
               Total                 10    1.0

We can also run a chi-square test of the same hypothesis. We again open Analyze: Nonparametric Tests (One Sample). On the “Choose tests” screen we select “Compare observed probabilities to hypothesized (chi-square test).” We then open Options and enter the hypothesized proportion for the category coded 1 (0.8) and for the category coded 0 (0.2). This defines which category is “success” (proficient, coded 1) and which is the not proficient category. This produces a “score” test of the observed proportion against a hypothesized proportion. If we double-click on the output, we can open up additional output we might want.

If the population proportion of proficient students is 0.8, we might want to compute the probability of obtaining 7 students (k = 7) who are proficient out of 10 randomly selected students (n = 10). This can be calculated as follows (see Week 1 notes):

\[
P(Y = 7) = \binom{10}{7}(0.8)^{7}(1 - 0.8)^{10-7} = 120\,(0.8)^{7}(0.2)^{3} = 120\,(0.2097)(0.008) \approx 0.2013
\]

Contingency Tables

The previous examples compare a sample to a theoretical distribution of interest. In other cases, we may wish to investigate the association between two categorical variables. Traditionally, the approach for conducting such analyses was to use contingency tables (or crosstabs). Let’s suppose we find a table in a journal with the following information concerning students’ proficiency in reading and their home language spoken. We might want to test whether there is an association between English speaking (1) or non-English speaking (0) language background and likelihood to be proficient in reading at grade 3.

English * proficient Crosstabulation (Count)
                      Proficient
                      .00     1.00    Total
  English   .00        35       5       40
            1.00       93      67      160
  Total               128      72      200

We can enter these data into SPSS in the following manner:

  Prof    English    Count
  .00     .00        35.00
  .00     1.00       93.00
  1.00    1.00       67.00
  1.00    .00         5.00

We can then use the “weight cases” command (Data: Weight Cases) and select “count” as the frequency variable. Next we open Analyze: Descriptive Statistics (Crosstabs). This opens a screen with row and column variables. We can place English as the row variable and proficient as the column variable. We can then open the Cells screen and request the observed and expected frequencies. The resulting table shows a discrepancy between the observed count in each cell and the expected count (which is calculated from the row and column marginal totals of the observed frequencies).
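As a rough illustration of where the expected counts come from, the sketch below applies the usual rule for a test of independence: each expected count is the row total times the column total, divided by the overall N. The variable names are illustrative only and are not part of the SPSS procedure.

```python
# Observed counts: rows are English (.00, 1.00), columns are proficient (.00, 1.00).
observed = [
    [35, 5],
    [93, 67],
]

row_totals = [sum(row) for row in observed]          # 40, 160
col_totals = [sum(col) for col in zip(*observed)]    # 128, 72
n = sum(row_totals)                                  # 200

# Expected count under independence: (row total * column total) / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for obs_row, exp_row in zip(observed, expected):
    for obs, exp in zip(obs_row, exp_row):
        print(f"observed = {obs:3d}   expected = {exp:6.1f}   difference = {obs - exp:+.1f}")
```

Every cell differs from its expected count by 9.4 in absolute value, which is what one would anticipate in a 2 x 2 table because the margins are fixed.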
English * proficient Crosstabulation
                                      proficient
                                      .00       1.00      Total
  English   .00    Count               35         5         40
                   Expected Count      25.6      14.4       40.0
            1.00   Count               93        67        160
                   Expected Count     102.4      57.6      160.0
  Total            Count              128        72        200
                   Expected Count     128.0      72.0      200.0

Next, we can open “Statistics” to obtain a number of tests of whether the categories are independent or not. The typical test is the chi-square test. Fisher’s exact test provides an alternative for small samples.

Chi-Square Tests
                                  Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                  (2-sided)     (2-sided)    (1-sided)
  Pearson Chi-Square              11.985(a)   1   .001
  Continuity Correction(b)        10.744      1   .001
  Likelihood Ratio                13.662      1   .000
  Fisher's Exact Test                                           .000         .000
  Linear-by-Linear Association    11.925      1   .001
  N of Valid Cases                200
  a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 14.40.
  b. Computed only for a 2x2 table

We can obtain the likelihood ratio test (G2) by multiplying the natural log (ln) of the observed/expected frequency in each cell by the observed frequency and adding the cells together. We then multiply the result by 2.

\[
\begin{aligned}
G^{2} &= 2\left[35\ln\frac{35}{25.6} + 5\ln\frac{5}{14.4} + 93\ln\frac{93}{102.4} + 67\ln\frac{67}{57.6}\right] \\
      &= 2\left[35(0.3128) + 5(-1.0578) + 93(-0.0963) + 67(0.1512)\right] \\
      &= 2\left[10.948 - 5.289 - 8.956 + 10.130\right] = 2\left[6.833\right] = 13.666
\end{aligned}
\]

(with a slight discrepancy from the SPSS value of 13.662 due to rounding)

One useful measure of association for a symmetric cross-tabulation table is the contingency coefficient (which ranges from 0 to .71 for a 2 x 2 table). It is calculated by obtaining the square root of χ2/(n + χ2). In this case that would be the square root of 11.985/211.985, or 0.238.

Symmetric Measures
                                                  Value   Approx. Sig.
  Nominal by Nominal   Contingency Coefficient    .238    .001
  N of Valid Cases                                200
  a. Not assuming the null hypothesis.
  b. Using the asymptotic standard error assuming the null hypothesis.

Using contingency tables, one can also introduce a third variable as a control.
It is then possible to assess whether the third variable is (a) unrelated to the two variables being compared, (b) specifies their relationship (i.e., changes it under one particular condition), or (c) makes it spurious (i.e., makes it disappear altogether).
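Returning to the two-variable English by proficient table, the Pearson chi-square, the likelihood ratio statistic (G2), and the contingency coefficient worked through earlier can be reproduced with a short script. This is a minimal sketch using only the Python standard library, with variable names chosen for illustration; it does not compute significance levels and is not meant to represent SPSS's own procedure.

```python
from math import log, sqrt

# Observed counts: rows are English (.00, 1.00), columns are proficient (.00, 1.00).
observed = [
    [35, 5],
    [93, 67],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

pearson = 0.0  # sum of (O - E)^2 / E over the cells
g2 = 0.0       # 2 * sum of O * ln(O / E) over the cells
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n
        pearson += (obs - exp) ** 2 / exp
        # Assumes every observed count is positive, as it is here;
        # a zero cell would need special handling before taking the log.
        g2 += 2 * obs * log(obs / exp)

# Contingency coefficient: sqrt(chi-square / (n + chi-square)).
contingency = sqrt(pearson / (n + pearson))

print(f"Pearson chi-square      = {pearson:.3f}")
print(f"Likelihood ratio (G2)   = {g2:.3f}")
print(f"Contingency coefficient = {contingency:.3f}")
```

Because the script carries full precision through each step, its G2 should land closer to the SPSS value of 13.662 than the hand calculation of 13.666, which used logs rounded to four decimals.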