Chapter 16
Chi Squared Tests
16.1 Introduction
• Two statistical techniques are presented, to
analyze nominal data.
– A goodness-of-fit test for the multinomial experiment.
– A contingency table test of independence.
• Both tests use the χ² distribution as the sampling distribution
of the test statistic.
16.2 Chi-Squared Goodness-of-Fit Test
• The hypothesis tested involves the probabilities p1, p2, …,
pk of a multinomial distribution.
• The multinomial experiment is an extension of the binomial
experiment.
– There are n independent trials.
– The outcome of each trial can be classified into one of k
categories, called cells.
– The probability pi that the outcome falls into cell i remains
constant for each trial. Moreover,
p1 + p2 + … + pk = 1.
– Trials of the experiment are independent
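The conditions listed above can be illustrated with a small simulation. This is only a sketch: the function name `simulate_multinomial`, the cell probabilities (.5, .3, .2), and the fixed seed are assumptions chosen for illustration, not values from the lecture.

```python
import random
from collections import Counter


def simulate_multinomial(n, probabilities, seed=0):
    """Run n independent trials and count how many outcomes fall in each cell."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("cell probabilities must sum to 1")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    cells = range(len(probabilities))
    # Each trial uses the same probabilities and is drawn independently.
    outcomes = rng.choices(cells, weights=probabilities, k=n)
    counts = Counter(outcomes)
    return [counts[i] for i in cells]


# Hypothetical experiment: k = 3 cells, n = 200 trials.
counts = simulate_multinomial(200, [0.5, 0.3, 0.2])
print(counts)       # observed frequency of each cell
print(sum(counts))  # the frequencies always add up to n
```

The list of counts plays the role of the actual frequencies that the goodness-of-fit test compares with the expected frequencies.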
• We test whether there is sufficient evidence to
reject a pre-specified set of values for pi.
• The hypotheses:
H0: p1 = a1, p2 = a2, …, pk = ak
H1: At least one pi ≠ ai
• The test compares the actual frequency with the expected
frequency of occurrences in each of the cells.
The multinomial goodness of fit test Example
• Example 16.1
– Two competing companies, A and B, have enjoyed a
dominant position in the market. The companies
conducted aggressive advertising campaigns.
– Market shares before the campaigns were:
• Company A = 45%
• Company B = 40%
• Other competitors = 15%.
• Example 16.1 – continued
– To study the effect of the campaign on the market
shares, a survey was conducted.
– 200 customers were asked to indicate their preference
regarding the product advertised.
– Survey results:
• 102 customers preferred company A's product,
• 82 customers preferred company B's product,
• 16 customers preferred a competitor's product.
• Example 16.1 – continued
Can we conclude at 5% significance level that
the market shares were affected by the
advertising campaigns?
• Solution
– The population investigated is the brand preferences.
– The data are nominal (A, B, or other).
– This is a multinomial experiment (three categories).
– The question of interest: Are p1, p2, and p3 different
after the campaign from their values before the campaign?
• The hypotheses are:
H0: p1 = .45, p2 = .40, p3 = .15
H1: At least one pi changed.
• The expected frequency for each category (cell) if the null
hypothesis is true, and the actual frequency the sample returned:
Cell             Expected frequency    Actual frequency
1 (Company A)    90 = 200(.45)         102
2 (Company B)    80 = 200(.40)         82
3 (Other)        30 = 200(.15)         16
• The test statistic is
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}, \qquad \text{where } e_i = n p_i$$
• The rejection region is
$$\chi^2 > \chi^2_{\alpha,\,k-1}$$
• Example 16.1 – continued
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(102 - 90)^2}{90} + \frac{(82 - 80)^2}{80} + \frac{(16 - 30)^2}{30} = 8.18$$
$$\chi^2_{\alpha,\,k-1} = \chi^2_{.05,\,3-1} = 5.99147$$
The p-value = P(χ² > 8.18) = .01674
[from Excel: CHIDIST(8.18, 2)]
• Example 16.1 – continued
Conclusion: Since 8.18 > 5.99, there is sufficient
evidence at the 5% significance level to reject the null
hypothesis. At least one of the probabilities pi is
different. Thus, at least two market shares have
changed.
[Figure: χ² density with 2 degrees of freedom. The rejection region (area alpha = .05) lies to the right of 5.99; the p-value is the area to the right of 8.18.]
Required conditions –
the rule of five
• The test statistic used to perform the test is only
approximately Chi-squared distributed.
• For the approximation to apply, the expected cell
frequency has to be at least 5 for all the cells
(npi ≥ 5).
• If the expected frequency in a cell is less than 5,
combine it with other cells.
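One possible way to apply the rule of five is sketched below. The lecture does not say which cells to combine, so the choice made here (repeatedly merging the two cells with the smallest expected frequencies) is an assumption, as are the function name `combine_small_cells` and the frequencies in the example. In practice cells are usually merged so that the combined category is still meaningful.

```python
def combine_small_cells(observed, expected, minimum=5.0):
    """Merge the two cells with the smallest expected frequencies
    until every expected frequency is at least `minimum`."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 1 and min(exp) < minimum:
        # Indices of the two smallest expected frequencies, in position order.
        smallest_two = sorted(range(len(exp)), key=lambda i: exp[i])[:2]
        keep, drop = sorted(smallest_two)
        obs[keep] += obs[drop]
        exp[keep] += exp[drop]
        del obs[drop], exp[drop]
    return obs, exp


# Hypothetical frequencies, not taken from the lecture.
observed = [12, 30, 40, 15, 3]
expected = [10.0, 32.0, 38.0, 16.0, 4.0]
merged_observed, merged_expected = combine_small_cells(observed, expected)
print(merged_observed, merged_expected)
print("degrees of freedom:", len(merged_expected) - 1)
```

Note that combining cells reduces k, so the degrees of freedom of the test are computed from the number of cells that remain.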
16.3 Chi-squared Test of a Contingency Table
• This test is used to determine whether
– two nominal variables are related, or
– there are differences between two or more
populations of a nominal variable.
• To accomplish the test objectives, we need to
classify the data according to two different
criteria.
Contingency table χ² test –
Example
• Example 16.2
– In an effort to better predict the demand for courses
offered by a certain MBA program, it was hypothesized
that students' academic backgrounds affect their choice
of MBA major and, thus, their course selection.
– A random sample of last year’s MBA students was
selected. The following contingency table summarizes
relevant data.
The observed values
Undergraduate degree    Accounting    Finance    Marketing    Total
BA                      31            13         16           60
BENG                    8             16         7            31
BBA                     12            10         17           39
Other                   10            5          7            22
Total                   61            44         47           152
There are two ways to address the problem:
– If each classification is considered a nominal variable, are these two
variables dependent?
– If each undergraduate degree is considered a population, do
these populations differ?
• Solution
– The hypotheses are:
H0: The two variables are independent
H1: The two variables are dependent
– The test statistic
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$
where k is the number of cells in the contingency table.
– Since ei = npi but pi is unknown, we need to estimate the unknown
probability from the data, assuming H0 is true.
– The rejection region
$$\chi^2 > \chi^2_{\alpha,\,(r-1)(c-1)}$$
Estimating the expected frequencies
MBA major               Accounting    Finance    Marketing    Total
Total                   61            44         47           152
Probability             61/152        44/152     47/152

Undergraduate degree    BA        BENG      BBA       Other     Total
Total                   60        31        39        22        152
Probability             60/152    31/152    39/152    22/152
Under the null hypothesis the two variables are independent:
P(Accounting and BA) = P(Accounting)*P(BA) = [61/152][60/152].
The number of students expected to fall in the cell “Accounting - BA” is
eAcct-BA = n(pAcct-BA) = 152(61/152)(60/152) = [61*60]/152 = 24.08
The number of students expected to fall in the cell “Finance - BBA” is
eFinance-BBA = npFinance-BBA = 152(44/152)(39/152) = [44*39]/152 = 11.29
The expected frequencies for a
contingency table
• The expected frequency of the cell in row i and
column j of the contingency table is calculated by
$$e_{ij} = \frac{(\text{Column } j \text{ total})(\text{Row } i \text{ total})}{\text{Sample size}}$$
• The test statistic is then
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$
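The expected-frequency rule translates directly into code. The sketch below assumes the observed table is stored as a list of rows, with undergraduate degrees as rows and MBA majors as columns; the function name `expected_frequencies` is illustrative.

```python
def expected_frequencies(observed):
    """Return e_ij = (row i total)(column j total) / n for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows: BA, BENG, BBA, Other. Columns: Accounting, Finance, Marketing.
observed = [
    [31, 13, 16],
    [8, 16, 7],
    [12, 10, 17],
    [10, 5, 7],
]
expected = expected_frequencies(observed)
for degree, row in zip(["BA", "BENG", "BBA", "Other"], expected):
    print(degree, [round(e, 2) for e in row])
```

The entries for Accounting–BA and Finance–BBA reproduce the two values worked out by hand, and the expected frequencies add up to the sample size.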
Calculation of the χ² statistic
• Solution – continued
Observed frequencies, with expected frequencies in parentheses:
Undergraduate degree    Accounting    Finance       Marketing     Total
BA                      31 (24.08)    13 (17.37)    16 (18.55)    60
BENG                    8 (12.44)     16 (8.97)     7 (9.59)      31
BBA                     12 (15.65)    10 (11.29)    17 (12.06)    39
Other                   10 (8.83)     5 (6.37)      7 (6.80)      22
Total                   61            44            47            152
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(31 - 24.08)^2}{24.08} + \cdots + \frac{(5 - 6.37)^2}{6.37} + \cdots + \frac{(7 - 6.80)^2}{6.80} = 14.70$$
• Solution – continued
– The critical value in our example is:
$$\chi^2_{\alpha,\,(r-1)(c-1)} = \chi^2_{.05,\,(4-1)(3-1)} = 12.5916$$
• Conclusion:
Since χ² = 14.70 > 12.5916, there
is sufficient evidence to infer at the 5% significance
level that students' undergraduate degree
and their choice of MBA major (and thus course selection)
are dependent.
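The whole contingency table test can be reproduced with the standard library. This is a sketch under one stated assumption: the p-value uses a closed form of the χ² upper tail that holds only when the degrees of freedom are even (here 6); the helper name `chi2_upper_tail_even_df` is illustrative. The critical value 12.5916 is taken from the text rather than computed.

```python
import math

# Rows: BA, BENG, BBA, Other. Columns: Accounting, Finance, Marketing.
observed = [
    [31, 13, 16],
    [8, 16, 7],
    [12, 10, 17],
    [10, 5, 7],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

statistic = 0.0
for i, row in enumerate(observed):
    for j, f in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected frequency under H0
        statistic += (f - e) ** 2 / e

df = (len(row_totals) - 1) * (len(col_totals) - 1)


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-squared variable; closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires even, positive df")
    half = x / 2
    terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms


p_value = chi2_upper_tail_even_df(statistic, df)
critical_value = 12.5916  # chi-squared critical value for alpha = .05, 6 df (from the text)

print(round(statistic, 4), df, round(p_value, 4))
print("reject H0" if statistic > critical_value else "do not reject H0")
```

Comparing the statistic with the critical value and comparing the p-value with .05 lead to the same decision.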
Using the computer
• Select the Chi squared / raw data option from Data Analysis Plus
under Tools. See Xm16-02.
• Define a code to specify each nominal value. Input the data in columns,
one column for each category.
Code:
Undergraduate degree
1 = BA
2 = BENG
3 = BBA
4 = OTHERS
MBA Major
1 = ACCOUNTING
2 = FINANCE
3 = MARKETING
Degree    MBA Major
3         1
1         1
1         1
1         1
2         2
1         3
.         .
.         .
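The step from coded raw data to a contingency table is a cross-tabulation. The sketch below is not the Data Analysis Plus procedure; it is a plain Python illustration. It assumes that each degree code in the listing is paired with the MBA major code beside it, and it uses only the first six rows that the listing shows, so the resulting counts are not the full table.

```python
from collections import Counter

DEGREE_CODES = {1: "BA", 2: "BENG", 3: "BBA", 4: "OTHERS"}
MAJOR_CODES = {1: "ACCOUNTING", 2: "FINANCE", 3: "MARKETING"}

# (degree code, MBA major code) pairs: only the rows visible in the listing.
raw_rows = [(3, 1), (1, 1), (1, 1), (1, 1), (2, 2), (1, 3)]


def cross_tabulate(rows):
    """Count how many observations fall in each (degree, major) cell."""
    counts = Counter(rows)
    return [
        [counts[(degree, major)] for major in sorted(MAJOR_CODES)]
        for degree in sorted(DEGREE_CODES)
    ]


table = cross_tabulate(raw_rows)
for degree, row in zip(sorted(DEGREE_CODES), table):
    print(DEGREE_CODES[degree], row)
```

Run on the complete data file, the same function would produce the table of observed frequencies used in the test.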
Contingency Table
         1     2     3     Total
1        31    13    16    60
2        8     16    7     31
3        12    10    17    39
4        10    5     7     22
Total    61    44    47    152
Test Statistic CHI-Squared = 14.7019
P-Value = 0.0227
Required condition – the rule of five
– The χ² distribution provides an adequate approximation to
the sampling distribution under the condition that eij ≥ 5 for
all the cells.
– When eij < 5, rows or columns must be combined so that the
condition is met.
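Example
Observed frequencies, with expected frequencies in parentheses:
Column 1     Column 2     Column 3
10 (10.1)    14 (12.8)    4 (5.1)
12 (12.7)    16 (16.0)    7 (6.3)
8 (7.2)      8 (9.2)      4 (3.6)
The expected frequency 3.6 is less than 5, so we combine
columns 2 and 3:
Column 1     Columns 2 and 3
10 (10.1)    14 + 4 = 18 (12.8 + 5.1 = 17.9)
12 (12.7)    16 + 7 = 23 (16 + 6.3 = 22.3)
8 (7.2)      8 + 4 = 12 (9.2 + 3.6 = 12.8)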
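The column-combining step in this example can be sketched in code. The observed counts are read from the example; the function names `expected_frequencies` and `merge_columns` are illustrative, and which columns to merge remains a judgment call made by the analyst, not by the code.

```python
def expected_frequencies(observed):
    """e_ij = (row i total)(column j total) / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def merge_columns(observed, first, second):
    """Add column `second` into column `first` and drop column `second`."""
    merged = []
    for row in observed:
        new_row = list(row)
        new_row[first] += new_row[second]
        del new_row[second]
        merged.append(new_row)
    return merged


observed = [
    [10, 14, 4],
    [12, 16, 7],
    [8, 8, 4],
]
before = expected_frequencies(observed)
print(min(min(row) for row in before) < 5)   # rule of five violated

combined = merge_columns(observed, 1, 2)     # combine columns 2 and 3
after = expected_frequencies(combined)
print(combined)
print([[round(e, 1) for e in row] for row in after])
```

After merging, the expected frequencies are recomputed from the new totals, and the degrees of freedom drop from (3-1)(3-1) = 4 to (3-1)(2-1) = 2.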
16.5 Chi-Squared test for Normality
• The goodness-of-fit Chi-squared test can be used to
determine whether data were drawn from a specified distribution.
• The general procedure:
– Hypothesize on the parameter values of the distribution we test
(i.e. μ = μ0, σ = σ0 for the normal distribution).
– For the variable tested X specify disjoint ranges that cover all its
possible values.
– Build a Chi squared statistic that (aggregately) compares the
expected frequency under H0 and the actual frequency of
observations that fall in each range.
– Run a goodness of fit test based on the multinomial experiment.
• Testing for normality in Example 12.1
For a sample of size n = 50 (see Xm12-01), the sample
mean was 460.38 with a standard deviation of 38.83. Can we
infer from the data provided that this sample was drawn
from a normal distribution with μ = 460.38 and σ =
38.83? Use a 5% significance level.
χ² test for normality
Solution
First let us select z values that define each cell (expected frequency > 5 for each cell.)
z1 = -1; P(z < -1) = p1 = .1587; e1 = np1 = 50(.1587) = 7.94
z2 = 0; P(-1 < z< 0) = p2 = .3413; e2 = np2 = 50(.3413) = 17.07
z3 = 1; P(0 < z < 1) = p3 = .3413; e3 = 17.07
P(z > 1) = p4 = .1587; e4 = 7.94
The cell boundaries are calculated from the corresponding z values
under H0.
z1 = (x1 - 460.38)/38.83 = -1;
x1 = 421.55
The expected frequencies can now be determined for each cell.
[Figure: normal curve divided at 421.55, 460.38, and 499.21 into four cells with probabilities .1587, .3413, .3413, .1587 and expected frequencies e1 = 7.94, e2 = 17.07, e3 = 17.07, e4 = 7.94.]
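The cell boundaries, cell probabilities, and expected frequencies can be reproduced as follows. This is a sketch using the standard library; the helper `standard_normal_cdf` is written here from `math.erf`, and the variable names are illustrative.

```python
import math


def standard_normal_cdf(z):
    """P(Z < z) for a standard normal variable."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


n, mean, std_dev = 50, 460.38, 38.83
z_cuts = [-1.0, 0.0, 1.0]  # z values that separate the four cells

# Cell boundaries on the original scale: x = mean + z * std_dev.
boundaries = [mean + z * std_dev for z in z_cuts]

# Cell probabilities are differences of the CDF at consecutive cut points.
cdf = [0.0] + [standard_normal_cdf(z) for z in z_cuts] + [1.0]
probabilities = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
expected = [n * p for p in probabilities]

print([round(x, 2) for x in boundaries])
print([round(p, 4) for p in probabilities])
print([round(e, 2) for e in expected])
```

Every expected frequency is at least 5, so the rule of five holds for this choice of cells.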
– The observed and expected frequencies are
f1 = 10, e1 = 7.94
f2 = 13, e2 = 17.07
f3 = 19, e3 = 17.07
f4 = 8, e4 = 7.94
– The test statistic
$$\chi^2 = \frac{(10 - 7.94)^2}{7.94} + \frac{(13 - 17.07)^2}{17.07} + \frac{(19 - 17.07)^2}{17.07} + \frac{(8 - 7.94)^2}{7.94} = 1.72$$
– The rejection region
$$\chi^2 > \chi^2_{\alpha,\,k-1-L}$$
where L is the number of parameters
estimated from the data (here L = 2: the mean and the standard deviation).
$$\chi^2_{\alpha,\,k-3} = \chi^2_{.05,\,4-3} = 3.84146$$
Conclusion: Since 1.72 < 3.84146, there is insufficient evidence to conclude
at the 5% significance level that the data are not normally
distributed.
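The decision can be checked numerically. The sketch below uses the observed and expected frequencies from the example. For the p-value it relies on the fact that a χ² variable with 1 degree of freedom is the square of a standard normal variable, so P(χ² > x) = erfc(√(x/2)); this shortcut applies to 1 degree of freedom only. The critical value is taken from the text.

```python
import math

observed = [10, 13, 19, 8]
expected = [7.94, 17.07, 17.07, 7.94]

statistic = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

estimated_parameters = 2  # mean and standard deviation estimated from the sample
df = len(observed) - 1 - estimated_parameters

# Valid for df = 1 only: chi-squared(1) is the square of a standard normal.
p_value = math.erfc(math.sqrt(statistic / 2))

critical_value = 3.84146  # chi-squared critical value for alpha = .05, 1 df (from the text)

print(round(statistic, 2), df, round(p_value, 3))
print("reject H0" if statistic > critical_value else "do not reject H0")
```

The p-value is well above .05, which agrees with the conclusion that normality is not rejected.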