Ronald Heck
Testing Whether an Observed Proportion is Consistent
EDEP 768E: Seminar in Categorical Data Modeling (F2012)
Aug. 20, 2012
Week 2:
Testing Whether an Observed Proportion is Consistent with a
Population Proportion of Interest
This week we are going to look into the meaning of certain distributions other than the
normal distribution. When we draw a sample we must make certain assumptions about the
population we draw the sample from. I drew the following sample of N = 105 from a data set of
about 10,000 individuals. Here is a distribution of ninth grade students in terms of how many core
courses they failed during their first year of high school. This is an example of a Poisson
distribution, which describes the number of events occurring over a specific period of time. The
mean is just about 0.5. We can see that a few students failed all four courses, but the majority
did not fail any. So this is obviously not a normal distribution.
We might wish to test whether the sample is consistent with a Poisson distribution.
One-Sample Kolmogorov-Smirnov Test
                                        Fail
N                                       105
Poisson Parameter(a,b)     Mean         .48
Most Extreme Differences   Absolute     .103
                           Positive     .103
                           Negative     -.044
Kolmogorov-Smirnov Z                    1.052
Asymp. Sig. (2-tailed)                  .218
a. Test distribution is Poisson.
b. Calculated from data.
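To see what a Poisson distribution with this mean implies, the following minimal sketch computes the expected share and number of students at each failure count. It assumes the mean of .48 and N = 105 from the table; the function name poisson_pmf is illustrative, and the observed frequencies are not listed in these notes, so only expected values are shown.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events under a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

MEAN = 0.48  # Poisson parameter reported in the table
N = 105      # sample size

# Expected proportion of students failing 0, 1, 2, 3, or 4 courses.
expected = {k: poisson_pmf(k, MEAN) for k in range(5)}

for k, p in expected.items():
    print(f"{k} failed: p = {p:.3f}, expected count = {N * p:.1f}")
```

Under these assumptions, a little over 60 percent of students would be expected to fail no courses, which agrees with the description of the sample.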
In the next example, let’s suppose we wish to examine whether a random sample of 10 students
is consistent with an intended population target of 0.8 proficient.
Proficient
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   .00      3           30.0      30.0            30.0
        1.00     7           70.0      70.0            100.0
        Total    10          100.0     100.0
We can test whether the observed result (.7 proficient) differs from the target using a binomial
test. We can access this test in a couple of ways in SPSS. One way is through Analyze:
Nonparametric Tests (One Sample). This opens a screen that says “Choose tests,” where we select
“Compare observed binary probability to hypothesized (Binomial test).” We then open the options,
enter the hypothesized proportion (0.8), and define which category is “success” (the event of
interest) by entering “1,” the code for the proficient category.
If we double-click on the resulting output, we can obtain additional output we might wish to have.
Alternatively, we can obtain the test through Analyze: Nonparametric Tests (Legacy
Dialogs) by opening “Binomial.” There we place the variable we wish to test in the
“Test Variable” dialog box, and for “Test Proportion” we enter the proportion for the
first category (which is typically coded 0 for a dichotomous variable). Note that this is the
“other” proportion from the previous test, so in this case the target proportion for “not
proficient” students is 0.2. The results are displayed differently, but they match.
Binomial Test
                        Category   N    Observed Prop.   Test Prop.   Exact Sig. (1-tailed)
proficient   Group 1    .00        3    .3               .2           .322
             Group 2    1.00       7    .7
             Total                 10   1.0
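The agreement between the two dialogs can be checked by hand. The sketch below sums binomial probabilities under each framing, assuming the one-tailed significance is the probability of a result at least as far from the target as the one observed. The helper binom_pmf is an illustrative name, not an SPSS function.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n = 10

# One Sample framing: 7 or fewer proficient students when the target is 0.8.
p_proficient = sum(binom_pmf(k, n, 0.8) for k in range(0, 8))

# Legacy framing: 3 or more non-proficient students when the target is 0.2.
p_not_proficient = sum(binom_pmf(k, n, 0.2) for k in range(3, n + 1))

print(round(p_proficient, 3), round(p_not_proficient, 3))
```

Both sums describe the same set of outcomes, which is why the two dialogs report the same significance level even though they are set up with different test proportions.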
We can also conduct a chi-square test for this. We can again open Analyze:
Nonparametric Tests (One Sample). We then open the “Choose tests” screen and select
“Compare observed probabilities to hypothesized (chi-square test).” We then open the options and
enter the hypothesized proportion for the category coded 1 (0.8) and the category coded 0 (0.2).
This defines which category is “success” (proficient, coded 1) and which is the not proficient
category. This produces a “score” test of the observed proportion against a hypothesized
proportion.
If we double-click on the output we can open up additional output we might want.
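As a rough check on what this test computes, the sketch below works out the chi-square statistic from the observed counts (3 and 7) and the hypothesized proportions (0.2 and 0.8). The SPSS output for this test is not reproduced in these notes, so the values here are a hand calculation rather than a copy of that output. The p-value uses the fact that, with 1 degree of freedom, the upper-tail chi-square probability equals erfc(sqrt(x/2)).

```python
import math

observed = [3, 7]          # not proficient (coded 0), proficient (coded 1)
hypothesized = [0.2, 0.8]  # hypothesized proportion for each category
n = sum(observed)

# Expected counts if the hypothesized proportions hold.
expected = [n * p for p in hypothesized]

# Sum of (observed - expected)^2 / expected over the two categories.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Upper-tail probability for a chi-square variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.3f}, p = {p_value:.3f}")
```

With expected counts as small as 2 and 8, the chi-square approximation is rough, which is one reason the exact binomial test is a natural choice for a sample of 10.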
If the population proportion of proficient students is 0.8, we might want to calculate the
probability of obtaining exactly 7 proficient students (k = 7) out of 10 randomly selected
students (n = 10). Using the binomial formula (see Week 1 notes):
$$P(Y = 7) = \binom{10}{7}(0.8)^{7}(1 - 0.8)^{10-7} = 120(0.8)^{7}(0.2)^{3} = 120(.2097)(.008) \approx .2013$$
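The same calculation can be sketched in a few lines, carrying full precision instead of rounding the intermediate terms. The variable names are illustrative.

```python
from math import comb

n, k, p = 10, 7, 0.8

# Number of ways to choose which 7 of the 10 students are proficient.
ways = comb(n, k)

# Binomial probability of exactly k successes in n trials.
prob = ways * p ** k * (1 - p) ** (n - k)

print(ways, round(prob, 4))
```

Rounding (0.8)^7 to two decimals before multiplying shifts the final answer in the fourth decimal place, so small differences between hand and software results are expected.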
Contingency Tables
The previous examples compare a sample to a theoretical distribution of interest. In other
cases, we may wish to investigate the association between two categorical variables.
Traditionally, the approach for conducting such analyses was to use contingency tables (or
crosstabs). Let’s suppose we find a table in a journal with the following information concerning
students’ proficiency in reading and their home language spoken. We might want to test whether
there is an association between English speaking (1) or non-English speaking (0) language
background and likelihood to be proficient in reading at grade 3.
English * proficient Crosstabulation
Count
                    proficient
                    .00      1.00     Total
English   .00       35       5        40
          1.00      93       67       160
Total               128      72       200
We can enter this data into SPSS in the following manner:
Prof     English   Count
.00      .00       35.00
.00      1.00      93.00
1.00     1.00      67.00
1.00     .00       5.00
We can then use the “weight cases” command (Data: Weight Cases) and select “count” as the
frequency variable. Next, we open Analyze: Descriptive Statistics: Crosstabs, which opens a
screen with row and column variables. We place English as the row variable and proficient
as the column variable. We then open the Cells screen and request the observed and expected
frequencies. We can now see that there is a discrepancy between the observed counts in each
cell and the expected counts (which are calculated from the marginal totals for each cell:
row total × column total / N).
English * proficient Crosstabulation
                                     proficient
                                     .00       1.00      Total
English   .00     Count              35        5         40
                  Expected Count     25.6      14.4      40.0
          1.00    Count              93        67        160
                  Expected Count     102.4     57.6      160.0
Total             Count              128       72        200
                  Expected Count     128.0     72.0      200.0
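The expected counts in this table can be reproduced from the marginal totals alone. The sketch below assumes the usual independence formula (row total times column total, divided by N); the list layout is an illustrative choice.

```python
# Observed counts: rows are English (.00, 1.00); columns are proficient (.00, 1.00).
observed = [[35, 5],
            [93, 67]]

row_totals = [sum(row) for row in observed]        # English .00, English 1.00
col_totals = [sum(col) for col in zip(*observed)]  # proficient .00, proficient 1.00
n = sum(row_totals)

# Expected count under independence = row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 1) for e in row])
```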
Next, we can open “Statistics” to obtain a number of tests on whether the categories are
independent or not. The typical test is the chi-square test. Fisher’s exact test provides an
alternative for small samples.
Chi-Square Tests
                               Value       df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             11.985(a)   1    .001
Continuity Correction(b)       10.744      1    .001
Likelihood Ratio               13.662      1    .000
Fisher's Exact Test                                           .000         .000
Linear-by-Linear Association   11.925      1    .001
N of Valid Cases               200
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 14.40.
b. Computed only for a 2x2 table
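The Pearson chi-square and the continuity-corrected value can be checked from the observed and expected counts. The sketch below assumes the continuity correction is the Yates form, which subtracts 0.5 from each absolute difference before squaring; the variable names are illustrative.

```python
import math

observed = [[35, 5], [93, 67]]
expected = [[25.6, 14.4], [102.4, 57.6]]

# Pair each observed count with its expected count, cell by cell.
cells = [(o, e)
         for o_row, e_row in zip(observed, expected)
         for o, e in zip(o_row, e_row)]

# Pearson chi-square: sum of (O - E)^2 / E.
pearson = sum((o - e) ** 2 / e for o, e in cells)

# Yates continuity correction: shrink each |O - E| by 0.5 before squaring.
yates = sum((abs(o - e) - 0.5) ** 2 / e for o, e in cells)

# Upper-tail probability for 1 degree of freedom.
p_pearson = math.erfc(math.sqrt(pearson / 2))

print(round(pearson, 3), round(yates, 3), round(p_pearson, 3))
```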
We can obtain the likelihood ratio statistic (G²) by multiplying the natural log (ln) of the
observed/expected frequency ratio in each cell by the observed frequency, adding the cells
together, and then multiplying the result by 2.
2[35ln(35/25.6) + 5ln(5/14.4) + 93ln(93/102.4) + 67ln(67/57.6)] =
2[35(.3128) + 5(-1.0578) + 93(-.0963) + 67(.1512)] =
2[10.948 + (-5.289) + (-8.956) + 10.130] =
2[6.833] = 13.666 (the slight discrepancy from 13.662 is due to rounding)
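Carrying the logarithms at full precision removes the rounding discrepancy. A minimal sketch of the same sum, with the four cells listed in the order used above:

```python
import math

observed = [35, 5, 93, 67]
expected = [25.6, 14.4, 102.4, 57.6]

# Likelihood ratio statistic: 2 * sum of O * ln(O / E) over the cells.
g_squared = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))

print(round(g_squared, 3))
```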
One useful measure of association for a symmetric cross-tabulation table is the
contingency coefficient (which ranges from 0 to .71 for a 2 x 2 table). It is calculated as
the square root of χ²/(n + χ²). In this case that would be the square root of
11.985/(200 + 11.985) = 11.985/211.985, which is 0.238.
Symmetric Measures
                                                 Value   Approx. Sig.
Nominal by Nominal     Contingency Coefficient   .238    .001
N of Valid Cases                                 200
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
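The contingency coefficient and its upper limit can be reproduced as follows. The sketch assumes the upper limit for a square table with k rows and columns is the square root of (k - 1)/k, which gives about .71 when k = 2.

```python
import math

chi_square = 11.985  # Pearson chi-square from the Chi-Square Tests table
n = 200              # number of valid cases

# Contingency coefficient: sqrt(chi-square / (n + chi-square)).
c = math.sqrt(chi_square / (n + chi_square))

# Largest value the coefficient can take in a 2 x 2 table.
k = 2
c_max = math.sqrt((k - 1) / k)

print(round(c, 3), round(c_max, 2))
```

Because the coefficient cannot reach 1, a value of .238 is best read against the .71 ceiling and not against a 0 to 1 scale.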
Using contingency tables, one can also introduce a third variable as a control. It is then
possible to assess whether the third variable is (a) unrelated to the two variables being compared,
(b) specifies their relationship (i.e., changes it under one particular condition), or (c) makes it
spurious (i.e., makes it disappear altogether).