χ² Testing for 2 Categorical Variables
Chi-Square Testing gives us a way to test COUNTS of categorical data. The test measures how
far the observed counts deviate from the counts (values) expected under the null hypothesis. The
chi-square test statistic, denoted χ², combined with the degrees of freedom, denoted df, is used
to calculate the probability of a difference between observed and expected counts at least as
extreme as the one observed.
We have practiced using the Goodness-of-Fit Chi-Square Test for a single categorical variable
with three or more outcomes. Now we will investigate how to use the Chi-Square Test on two
categorical variables.
When we sample from ONE population regarding two categorical variables and write the data into
a two-way (contingency) table, we test whether the 2 categorical variables are independent
(have no association). This is called a Chi-Square Test of Independence.
If we sample from TWO populations regarding two categorical variables and write the data into a
two-way (contingency) table, we test whether the population distributions of proportions are
the same for the specific categories. This is called a Chi-Square Test of Homogeneity.
Example: In a survey reported in a special issue of Newsweek magazine (Special Edition: Health
for Life, Spring/Summer 1999), n = 747 randomly selected women were asked, “How satisfied
are you with your overall appearance?” There were four possible responses to this question, and
the following table shows the distribution of counts for the possible responses for each of three
age groups. (Note: The counts were estimated from percents given in Newsweek.) MOS, p.482
How Satisfied Are You with Your Overall Appearance?

Age          Very    Somewhat    Not Too    Not at All    Total
Under 30       45          82         10             4      141
30-49          73         168         47             6      294
Over 50       106         153         41            12      312
Total         224         403         98            22      747
We must first calculate the EXPECTED VALUES using the row total, column total, and grand total
for each cell.

$$\text{Expected Value} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$$

$$E_{\text{Under 30, Very}} = \frac{141 \times 224}{747} = 42.281$$

...

$$E_{\text{30-49, Not Too}} = \frac{294 \times 98}{747} = 38.570$$

...

$$E_{\text{Over 50, Not at All}} = \frac{312 \times 22}{747} = 9.189$$
Normally in a computer output the expected values are calculated and listed under the
OBSERVED COUNT (EXPECTED VALUES usually in parentheses), as seen in the
following output:
Age          Very        Somewhat    Not Too     Not at All    Total
Under 30     45          82          10          4             141
             (42.281)    (76.068)    (18.498)    (4.153)
30-49        73          168         47          6             294
             (88.161)    (158.61)    (38.57)     (8.659)
Over 50      106         153         41          12            312
             (93.558)    (168.32)    (40.932)    (9.189)
Total        224         403         98          22            747
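The expected values in the table can be reproduced with a few lines of code. The following is a minimal Python sketch, not part of the original handout (which uses a TI-83+); the function name `expected_counts` is chosen here for illustration only.

```python
# Observed counts from the Newsweek table: rows are age groups,
# columns are Very, Somewhat, Not Too, Not at All.
observed = [
    [45, 82, 10, 4],     # Under 30
    [73, 168, 47, 6],    # 30-49
    [106, 153, 41, 12],  # Over 50
]


def expected_counts(table):
    """Return (row total * column total) / grand total for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    # Print each row of expected values rounded to 3 decimal places.
    print([round(e, 3) for e in row])
```

Each printed row can be compared against the values in parentheses in the table above.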
ASSUMPTIONS
1. All expected values are greater than five OR [all expected values are greater than one
AND no more than 20% of expected values are less than 5]
2. IO: Independent Observations
3. ME: Mutually exclusive
4. RS: Random sample
VERIFY ASSUMPTIONS (includes calculations of expected values; see the table above as an
example)
1. All expected values are greater than 1 AND only 1 of the 12 expected values (8.3%, which is
less than 20%) is less than 5.
2. The 747 women's responses are independent observations.
3. Each of the 747 women's responses is mutually exclusive (meaning each response falls
strictly into one cell).
4. We assume the 747 women were randomly sampled.
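The expected-value condition (assumption 1) is mechanical enough to check in code. The sketch below is an illustration in Python, not part of the handout; the function name `expected_counts_ok` is hypothetical, and the expected values are copied from the table above.

```python
# Expected values copied from the output table (rows: Under 30, 30-49, Over 50).
expected = [
    [42.281, 76.068, 18.498, 4.153],
    [88.161, 158.61, 38.57, 8.659],
    [93.558, 168.32, 40.932, 9.189],
]


def expected_counts_ok(expected_table):
    """Check assumption 1: all expected values > 5, OR
    (all expected values > 1 AND no more than 20% of them are < 5)."""
    cells = [e for row in expected_table for e in row]
    if all(e > 5 for e in cells):
        return True
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return all(e > 1 for e in cells) and share_below_5 <= 0.20


cells = [e for row in expected for e in row]
share_below_5 = sum(e < 5 for e in cells) / len(cells)
print(f"{share_below_5:.1%} of expected values are below 5")
print("Assumption 1 satisfied:", expected_counts_ok(expected))
```

The other three assumptions (independent observations, mutually exclusive categories, random sample) depend on how the data were collected and cannot be checked from the counts alone.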
STATE HYPOTHESES
Ho: There is NO ASSOCIATION between age group and satisfaction level of overall
appearance.
Ha: There is an ASSOCIATION between age group and satisfaction level of overall appearance.
SIGNIFICANCE LEVEL: α = .05
CALCULATE CHI-SQUARE TEST STATISTIC

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

$$\chi^2 = \frac{(45 - 42.281)^2}{42.281} + \frac{(82 - 76.068)^2}{76.068} + \frac{(10 - 18.498)^2}{18.498} + \cdots + \frac{(41 - 40.932)^2}{40.932} + \frac{(12 - 9.189)^2}{9.189} = 14.27799784$$
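Summing twelve terms by hand is error-prone, so it can help to check the statistic in code. This is a minimal Python sketch, offered as an illustration rather than the handout's own method; the function name `chi_square_statistic` is made up for this example, and the expected values are recomputed at full precision instead of being rounded.

```python
# Observed counts: rows are Under 30, 30-49, Over 50;
# columns are Very, Somewhat, Not Too, Not at All.
observed = [
    [45, 82, 10, 4],
    [73, 168, 47, 6],
    [106, 153, 41, 12],
]


def chi_square_statistic(table):
    """Sum (observed - expected)^2 / expected over every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            total += (obs - exp) ** 2 / exp
    return total


statistic = chi_square_statistic(observed)
print(round(statistic, 3))
```

The printed value can be compared with the 14.278 obtained above.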
CALCULATE DF: df = (number of row categories - 1)(number of column categories - 1)
= (3 – 1)(4 – 1) = 6
CALCULATE P-VALUE: Use a Chi-square table to approximate OR a TI-83+ for the value to 4
decimal places.
WRITE PROBABILITY STATEMENT: P(χ² ≥ 14.278) = .0267
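For readers without a table or a TI-83+, the upper-tail probability can also be computed directly. The Python sketch below is an illustration only: it relies on a closed-form sum that holds when df is EVEN (as it is here, df = 6), and the function name `chi_square_upper_tail_even_df` is hypothetical. For odd df a table, a calculator, or a statistics library is still needed.

```python
import math


def chi_square_upper_tail_even_df(x, df):
    """P(chi-square >= x) for a positive EVEN number of degrees of freedom.

    Uses exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!,
    which is exact only when df is even.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive even df")
    half = x / 2
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


alpha = 0.05
df = (3 - 1) * (4 - 1)  # 3 age groups, 4 satisfaction levels
p_value = chi_square_upper_tail_even_df(14.278, df)
print(round(p_value, 4))
print("Reject the null hypothesis:", p_value < alpha)
```

The result can be checked against the .0267 in the probability statement above.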
INTERPRETATION
Since the p-value (.0267) is less than alpha (.05), we reject the null hypothesis that there is no
association between age group and satisfaction level of overall appearance. The responses of
the 747 women provide evidence of an association between the two variables.
EXAMPLE: The data on drinking behavior for independently chosen random samples of male
and female students is similar to data that appeared in the article “Relationship of Health
Behaviors to Alcohol and Cigarette Use by College Students” (J. of College Student
Development (1992):163-170). Does there appear to be a gender difference with respect to
drinking behavior?
Drinking Level            None       Low (1-7        Moderate (8-24    High (25+       Row Marginal
                                     drinks/week)    drinks/week)      drinks/week)    Total
Gender
Men                       140        478             300               63              981
                          (158.6)    (554.0)         (230.1)           (38.4)
Women                     186        661             173               16              1036
                          (167.4)    (585.0)         (242.9)           (40.6)
Column Marginal Total     326        1139            473               79              2017
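Because the men and women were sampled independently, this is a Test of Homogeneity, but the arithmetic is the same as before. The Python sketch below is an illustration, not the handout's worked solution. It ASSUMES a significance level of .05 (the handout does not state one for this example) and uses the table approach: the critical value 7.815 is the standard chi-square table entry for df = 3 at the .05 level.

```python
# Observed counts: columns are None, Low, Moderate, High drinking levels.
observed = [
    [140, 478, 300, 63],   # Men
    [186, 661, 173, 16],   # Women
]

# Chi-square table entry for df = 3 with upper-tail area .05 (assumed alpha).
CRITICAL_VALUE_DF3_ALPHA_05 = 7.815

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Chi-square statistic: sum of (observed - expected)^2 / expected.
statistic = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / grand_total
        statistic += (obs - exp) ** 2 / exp

df = (len(observed) - 1) * (len(observed[0]) - 1)
reject_null = statistic > CRITICAL_VALUE_DF3_ALPHA_05

print("chi-square statistic:", round(statistic, 2))
print("df:", df)
print("Reject the hypothesis of equal distributions:", reject_null)
```

A statistic larger than the critical value corresponds to a p-value below .05; the remaining steps (verifying assumptions, stating hypotheses, interpreting in context) still need to be written out by hand.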
1. Are you working with categorical data?
2. How many variables are there, 1 or 2?
3. If one variable, use Goodness-of-fit Chi-square test. OR If two variables, use Association
(Independence) Chi-square test.
4. Calculate the expected values.
5. State and VERIFY assumptions (requirements) to perform test.
6. State null and alternative hypotheses.
7. Define alpha value.
8. Calculate chi-square statistic.
9. Calculate degrees of freedom.
10. Using table or TI-83+, determine p-value.
11. Sketch and shade distribution.
12. Write probability statement.
13. Interpret results in context of problem.
ASSIGNMENT:
1. The article “Factors Associated with Sexual Risk-Taking Behaviors Among Adolescents”
(J. Marriage and Family (1994): 663-632) examined the relationship between gender and
contraceptive use by sexually active teens. Each person in a random sample of sexually
active teens was classified according to gender and contraceptive use (with three
categories: rarely or never use, use sometimes or most of the time, and always use). Data
consistent with percentages in the article is given in the table. Is there evidence of an
association between gender and contraceptive use of active teens?
                                  Gender
Contraceptive Use          Female      Male      Row Marginal Total
Rarely/Never                  210       350                     560
Sometimes/Most Times          190       320                     510
Always                        400       530                     930
Column Marginal Total         800      1200                    2000
Remember to enter Observed Counts in Matrix A ("RC Cola", Row then Column), then
run the χ² Test.
2. Do women have different patterns of work behavior than men? The article “Workaholism
in Organizations: Gender Differences” (Sex Roles: A Journal of Research (1999): 333-346)
attempts to answer this question. Each person in a random sample of 423 graduates
of a business school in Canada was polled and classified by gender and workaholism
type.
                                  Gender
Workaholism Types            Female      Male
Work Enthusiasts                 20        41
Workaholics                      32        37
Enthusiastic Workaholics         34        46
Unengaged Workers                43        52
Relaxed Workers                  24        27
Disenchanted Workers             37        30
a. Test the hypothesis that gender and workaholism type are independent.
b. The author writes “women and men fell into each of the six workaholism types to
a similar degree.” Does the outcome of the test you performed in part (a) support
this conclusion? EXPLAIN.
3. Reference: Keppel, R. D., and Weis, J. G., in their article "Time and Distance as
Solvability Factor in Murder Cases," Journal of Forensic Science, Vol. 39, No 2, March
1994. Below is tabled information from a sample of single victim--single offender cases
in the state of Washington from January 1981 through December 1986. Assume it is
reasonable to regard this sample as a random sample of such murders in the United States
(a debatable and almost certainly false assumption!).
Time Elapsed and Distance between Victim Last Seen and Body Recovery

Distance                   0-24 hours    24 hours - 1 month    Greater than 1 month
0-199 feet                        505                    52                       9
200 feet to 1.5 miles              28                    10                       4
More than 1.5 miles                55                    60                      47
a) Test the hypothesis that the distance and the elapsed time between when the victim was last
seen and when the body was recovered are independent.
b) Notice the "greater than expected" and "less than expected" counts for the individual cells.
Do you see any pattern? If so, describe it in a few sentences.
4. In the summer of 1846--July 31, in fact--what was to become the famous Donner party
left Fort Bridger in Wyoming, headed for California. What with one thing and another
they ran a bit late and on November 1 found themselves only just west of the present-day
California-Nevada border. After an accumulation of snow in late October, a fierce
snowstorm blew up and trapped them at Truckee Lake--now renamed Donner Lake. To
make a long story short by leaving out the gruesome details, many of the Donner party
did not make it through the winter. Below you will find the breakdowns of who lived and
who died by age and sex. You are to test the hypotheses that living and dying were
independent of (a) age, and independent of (b) sex.
(a) Age by Fate Data
          1-4 years old    5-45 years old    46+ years old
Lived                 6                37                3
Died                  9                23                7

(b) Gender by Fate Data
          Male    Female
Lived       24        22
Died        29        10