Inference for Categorical Data
William P. Wattles, Ph. D.
Francis Marion University
Continuous vs. Categorical
• Continuous (measurement) variables have
many values
• Categorical variables have only certain
values representing different categories
• Ordinal: a type of categorical variable with a natural
order (e.g., year of college)
• Nominal: a type of categorical variable with no order
(e.g., brand of cola)
Categorical Data
• Tells which category an individual is in
rather than telling how much.
• Sex, race, occupation naturally categorical
• A quantitative variable can be grouped to
form a categorical variable.
• Analyze with counts or percents.
Describing relationships in
categorical data
• No single graph portrays the relationship
• Also, no single number summarizes the relationship
• Convert counts to proportions or percents
Prediction
Moving from descriptive to
Inferential
• Chi Square Inference involves a test of
independence.
• If the variables are independent, knowledge of
one variable tells you nothing about the
other.
Moving from descriptive to
Inferential
• Inference involves expected counts.
– Expected count=The count that would occur if
the variables are independent
Inference for two-way tables
• Chi Square test of independence.
• For more than two groups
• Cannot compare multiple groups one at a
time.
To Analyze Categorical Data
• First obtain counts
• In Excel, you can do this with a pivot table
• Put data in a Matrix or two-way table
Matrix or two-way table
            Republican  Democrat  Independent
Male        18          43        14
Female      39          23        18
Inference for two-way tables
• Expected count
• The count that would occur if the variables
are independent
Matrix or two-way table
• Rows
• Columns
• Distribution: how often each outcome
occurred
• Marginal distribution: Count for all entries
in a row or column
Row and column totals
            Republican  Democrat  Independent  Total
Male        18          43        14           75
Female      39          23        18           80
Total       57          66        32           155

            Republican  Democrat  Independent  Total
Male                                           75 (48%)
Female                                         80 (52%)
Total       57 (37%)    66 (43%)  32 (21%)     155
Expected counts
• 37% of all subjects are Republicans
• If independent 37% of females should be
Republican (expected value)
• 37% of 80= 29
• 37% of 75 = 28
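The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `expected_counts` is made up here, and the sketch uses the exact proportion 57/155 rather than the rounded 37%.

```python
# Observed counts from the party-by-sex two-way table.
observed = {
    "Male": {"Republican": 18, "Democrat": 43, "Independent": 14},
    "Female": {"Republican": 39, "Democrat": 23, "Independent": 18},
}

def expected_counts(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = {r: sum(cells.values()) for r, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for c, n in cells.items():
            col_totals[c] = col_totals.get(c, 0) + n
    grand_total = sum(row_totals.values())
    return {
        r: {c: row_totals[r] * col_totals[c] / grand_total for c in col_totals}
        for r in table
    }

expected = expected_counts(observed)
for sex, cells in expected.items():
    # Round to whole counts, as the slides do.
    print(sex, {party: round(value) for party, value in cells.items()})
```

Rounded to whole numbers, the Republican column gives 28 for males and 29 for females, matching the two bullet points.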
Expected counts rounded
            Republican  Democrat  Independent  total
Male        28          32        15           75
Female      29          34        17           80
total       57          66        32           155
Observed vs. Expected
Observed
            Republican  Democrat  Independent  total
Male        18          43        14           75
Female      39          23        18           80
total       57          66        32           155

Expected
            Republican  Democrat  Independent  total
Male        28          32        15           75
Female      29          34        17           80
total       57          66        32           155
Chi-Square
• Chi-square: a measure of how far the
observed counts are from the expected
counts
Chi-square test of
independence
$$X^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
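The chi-square statistic sums (observed - expected)^2 / expected over every cell of the table. A minimal sketch for the party-by-sex counts follows; the function name `chi_square_statistic` is illustrative, and unrounded expected counts are assumed.

```python
observed = [
    [18, 43, 14],  # Male: Republican, Democrat, Independent
    [39, 23, 18],  # Female
]

def chi_square_statistic(table):
    """Sum of (f_o - f_e)^2 / f_e over all cells of a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n  # expected count under independence
            stat += (f_o - f_e) ** 2 / f_e
    return stat

x2 = chi_square_statistic(observed)
print(f"X^2 = {x2:.2f}")
```

For these counts the statistic comes out at roughly 14.15. Using the rounded expected counts instead would shift it slightly.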
Chi Square test of
independence with SPSS
Chi-square test of
independence
• Degrees of Freedom
• df = (number of rows − 1) × (number of columns − 1)
• Compare the observed and expected counts.
• The P-value comes from comparing the chi-square statistic with critical values for a chi-square distribution
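As a rough sketch of how a P-value follows from the statistic and its degrees of freedom, the upper-tail probability of a chi-square distribution with whole-number df can be computed from the standard library alone. The helper names below are illustrative; in practice a statistics package or Excel would normally be used.

```python
import math

def degrees_of_freedom(n_rows, n_cols):
    """df = (rows - 1) * (columns - 1) for a two-way table."""
    return (n_rows - 1) * (n_cols - 1)

def chi_square_p_value(x2, df):
    """Upper-tail probability P(X >= x2) for a chi-square distribution, integer df >= 1."""
    y = x2 / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{k=0}^{df/2 - 1} y^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= y / k
            total += term
        return math.exp(-y) * total
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum_{k=1}^{(df-1)/2} y^(k - 1/2) / Gamma(k + 1/2)
    total = 0.0
    for k in range(1, (df - 1) // 2 + 1):
        total += y ** (k - 0.5) / math.gamma(k + 0.5)
    return math.erfc(math.sqrt(y)) + math.exp(-y) * total

print(degrees_of_freedom(2, 3))                 # 2 rows by 3 columns
print(round(chi_square_p_value(3.841, 1), 3))   # familiar .05 critical value, df = 1
print(round(chi_square_p_value(5.991, 2), 3))   # familiar .05 critical value, df = 2
```

The two critical values are the usual table entries for alpha = .05, so both P-values should come out very close to .05.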
Example
• Has the percentage of majors changed by
school?
Data collection
http://www.fmarion.edu/about/FactBook
2004/2005 Fall 2004 Graduates by Major
Marital Status, page 543
job grade  single  married  divorced  widowed
1          58      874      15        8
2          222     3927     70        20
3          50      2396     34        10
4          7       533      7         4
Marital Status, page 543
Test Statistics
                    Value   df  p-value
Pearson Chi-Square  67.491  9   0.0000
Olive Oil, page 578
              Olive Oil
              low   medium  high
Colon cancer  398   397     430
Rectal        250   241     217
Controls      1368  1377    1409
Olive Oil, page 578
Test Statistics
                                Value  df  p-value
Pearson Chi-Square              1.552  4   0.817
Continuity Adjusted Chi-Square  1.396  4   0.845
Likelihood Ratio Chi-Square     1.549  4   0.818
Business Majors, page 563
                Female  Male
Accounting      68      56
Administration  91      40
Economics       5       6
Finance         61      59
Business Majors, page 563
Test Statistics
                    Value   df  p-value
Pearson Chi-Square  10.827  3   0.013
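A significant chi-square says the counts depart from independence but not where. One way to look further is to list each cell's contribution to the statistic. The sketch below assumes the first column of counts (68, 91, 5, 61) is Female and the second (56, 40, 6, 59) is Male, which follows the header order in the slide. The total does not depend on that labeling, but the cell names do.

```python
majors = ["Accounting", "Administration", "Economics", "Finance"]
observed = {
    "Female": [68, 91, 5, 61],  # assumed column assignment
    "Male": [56, 40, 6, 59],
}

col_totals = {sex: sum(counts) for sex, counts in observed.items()}
row_totals = [sum(observed[sex][i] for sex in observed) for i in range(len(majors))]
n = sum(col_totals.values())

# Contribution of each cell: (observed - expected)^2 / expected.
contributions = {}
for i, major in enumerate(majors):
    for sex, counts in observed.items():
        f_e = row_totals[i] * col_totals[sex] / n
        contributions[(major, sex)] = (counts[i] - f_e) ** 2 / f_e

x2 = sum(contributions.values())
df = (len(majors) - 1) * (len(observed) - 1)
largest = max(contributions, key=contributions.get)
print(f"X^2 = {x2:.3f}, df = {df}")
print("largest contribution:", largest, round(contributions[largest], 2))
```

The total should agree with the reported Pearson chi-square of 10.827 up to rounding, and under the assumed labeling the Administration row accounts for most of it.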
Exam Three
• 37 multiple choice
questions, 4 short answer
• T-tests and chi square on
Excel
• General questions about
analyzing categorical data
and t-tests
• Review from earlier this
term
Inference as a decision
• We must decide if the null hypothesis is
true.
• We cannot know for sure.
• We choose an arbitrary standard that is
conservative and set alpha at .05
• Our decision will be either correct or
incorrect.
Type I and Type II errors
              Ho is really True           Ho is really False
We reject Ho  Type I Error (false alarm)  Correct decision
We accept Ho  Correct decision            Type II Error (miss)
Type I error
• If we reject Ho when in fact Ho is true, this
is a Type I error
• Statistical procedures are designed to
minimize the probability of a Type I error,
because such errors are more serious for science.
• With a Type I error we erroneously
conclude that an independent variable
works.
Type II error
• If we accept Ho when in fact Ho is false, this
is a Type II error.
• A Type II error is serious to the researcher.
• The Power of a test is the probability that
Ho will be rejected when it is, in fact, false.
Probability
              Ho is really True  Ho is really False
We reject Ho  p = α              p = 1 − β
We accept Ho  p = 1 − α          p = β
Power
• The goal of any scientific research is to
reject Ho when Ho is false.
• To increase power:
– a. increase sample size
– b. increase alpha
– c. decrease sample variability
– d. increase the difference between the means
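The four ways of raising power can be checked numerically. The sketch below is a simplified illustration only: it assumes a one-sided, one-sample z-test with known standard deviation rather than the two-group comparison the slide mentions, and the function name and all numbers (delta, sigma, n) are made up for the example.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(delta, sigma, n, alpha=0.05):
    """Power of a one-sided z-test when the true mean is delta above the Ho value."""
    z = NormalDist()                   # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha)      # rejection cutoff under Ho
    shift = delta * sqrt(n) / sigma    # true effect in standard-error units
    return 1 - z.cdf(z_crit - shift)

baseline = z_test_power(delta=2.0, sigma=10.0, n=50)
scenarios = {
    "a. larger sample size (n = 100)": z_test_power(2.0, 10.0, 100),
    "b. larger alpha (0.10)": z_test_power(2.0, 10.0, 50, alpha=0.10),
    "c. smaller variability (sigma = 5)": z_test_power(2.0, 5.0, 50),
    "d. larger difference (delta = 4)": z_test_power(4.0, 10.0, 50),
}
print(f"baseline power: {baseline:.3f}")
for label, power in scenarios.items():
    print(f"{label}: {power:.3f}")
```

Each scenario changes one input from the baseline, and each should report higher power than the baseline.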
Categorical data example
• African-American students more likely to
register via the web.
Table
Students University-Wide
Variable                    White          African-American
                            n     Percent  n     Percent
Register on the Web         447   34%      284   44%
Register with other method  876   66%      356   56%
Total                       1323           640
[Figure: Web Registration by Race. Percent of students registering on the web by Year (2000, 2001) for White and African-American students; data labels 25%, 29%, 34%, 44%.]
Categorical Data Example
• African-American students university-wide
(44%) were more likely than white students
(34%) to use web registration, X²(1, N =
1963) = 20.7, p < .001.
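The reported result can be reproduced from the counts in the registration table. This is a minimal sketch, not the original analysis: the variable names are illustrative, and the P-value uses the fact that for df = 1 the chi-square upper tail equals erfc(sqrt(X²/2)).

```python
import math

# Counts from the web-registration table.
web = {"White": 447, "African-American": 284}
other = {"White": 876, "African-American": 356}

groups = list(web)
group_totals = {g: web[g] + other[g] for g in groups}
n = sum(group_totals.values())
web_total = sum(web.values())

for g in groups:
    print(f"{g}: {web[g] / group_totals[g]:.0%} registered on the web")

# Chi-square statistic over the four cells of the 2 x 2 table.
x2 = 0.0
for g in groups:
    for count, method_total in ((web[g], web_total), (other[g], n - web_total)):
        f_e = group_totals[g] * method_total / n
        x2 += (count - f_e) ** 2 / f_e

p_value = math.erfc(math.sqrt(x2 / 2))  # upper tail of chi-square with df = 1
print(f"X^2(1, N = {n}) = {x2:.1f}, p < .001: {p_value < 0.001}")
```

The percentages round to 34% and 44%, and the statistic rounds to the reported 20.7.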
Smoking among French Men
• Do these data show a relationship between
education and smoking in French men?
The End
Benford’s Law page 550
• Faking data?
Problem 20.14
Digit  ratio  Expected  Observed  (O − E)²/E
1      0.301  13.545    6         4.20280731
2      0.176  7.92      4         1.94020202
3      0.125  5.625     6         0.025
4      0.097  4.365     7         1.59065865
5      0.079  3.555     3         0.08664557
6      0.067  3.015     5         1.30687396
7      0.058  2.61      6         4.40310345
8      0.051  2.295     4         1.26667756
9      0.046  2.07      4         1.7994686
Total                             16.6214371
Significance test
chitest p = 0.03430
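The Benford's Law calculation is a goodness-of-fit test: expected counts are the Benford ratios times the number of observations, and df is the number of categories minus 1. A short sketch that retraces the spreadsheet figures follows. The variable names are illustrative, and the P-value uses the closed-form upper tail that holds when df is even.

```python
import math

benford = [0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046]
observed = [6, 4, 6, 7, 3, 5, 6, 4, 4]  # counts of leading digits 1 through 9

n = sum(observed)
expected = [ratio * n for ratio in benford]
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Even df only: P(X >= x2) = exp(-y) * sum_{k=0}^{df/2 - 1} y^k / k!, with y = x2 / 2.
y = x2 / 2
p_value = math.exp(-y) * sum(y ** k / math.factorial(k) for k in range(df // 2))
print(f"n = {n}, X^2 = {x2:.4f}, df = {df}, p = {p_value:.4f}")
```

With 45 observations this gives a statistic of about 16.62 on 8 degrees of freedom and a P-value of about 0.034, in line with the chitest value.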
Example
• Survey2 Berk & Carey
page 261