R Programming
Data Analysis Module:
Chi Square
Data Analysis Module
- Basic Descriptive Statistics and Confidence Intervals
- Basic Visualizations
  - Histograms
  - Pie Charts
  - Bar Charts
  - Scatterplots
- T-tests/Bivariate Testing
  - One Sample
  - Paired
  - Independent Two Sample
- ANOVA
- Chi Square and Odds
- Regression Basics
Data Analysis Module: Chi Square
When presented with categorical data, one common
method of analysis is the “Contingency Table” or “Cross
Tab”. This is a great way to display frequencies. For example, let's say that a firm has the following data:
120 male and 80 female employees
40 males and 10 females have been promoted
Using this data, we could create the following 2x2
matrix:
          Promoted   Not Promoted   Total
Male      40         80             120
Female    10         70             80
Total     50         150            200
Now, a few questions…
1) From the data, what is the probability of being
promoted?
2) Given that you are MALE, what is the probability of
being promoted?
3) Given that you are promoted, what is the probability
that you are MALE?
4) Given that you are FEMALE, what is the probability of
being promoted?
5) Given that you are promoted, what is the probability
that you are FEMALE?
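As a rough sketch, the five probabilities can be read off the contingency table in R. The object names (observed, p_promoted, by_gender, by_status) are illustrative choices, not part of the original slides; the counts are the ones from the example.

```r
# Observed counts from the example (rows = gender, columns = promotion status)
observed <- matrix(c(40, 80,
                     10, 70),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Status = c("Promoted", "Not Promoted")))

# Question 1: overall probability of being promoted
p_promoted <- sum(observed[, "Promoted"]) / sum(observed)

# Questions 2 and 4: row proportions, i.e. P(status | gender)
by_gender <- prop.table(observed, margin = 1)

# Questions 3 and 5: column proportions, i.e. P(gender | status)
by_status <- prop.table(observed, margin = 2)

p_promoted
by_gender
by_status
```

If the row proportions for males and females differ noticeably, that is a first hint that promotion status and gender may be related.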
The answers to these questions help us start to understand
if promotion status and gender are related.
Specifically, we could test this relationship using a Chi-Square test. This is the test used to determine if two categorical variables
are related.
The relevant hypothesis statements for a Chi-Square test
are:
H0: Variable 1 and Variable 2 are NOT Related
Ha: Variable 1 and Variable 2 ARE Related
Develop the appropriate hypothesis statements and
testing matrix for the gender/promotion data.
The Chi-Square Test uses the χ² test statistic, which has a
distribution that is skewed to the right (it approaches
normality as the degrees of freedom increase).
The observed counts are provided in the dataset.
The expected counts are the counts which would be
expected if there was NO relationship between the two
variables.
Going back to our example, the data provided is
“observed”:
          Promoted   Not Promoted   Total
Male      40         80             120
Female    10         70             80
Total     50         150            200
What would the matrix look like if there was no relationship
between promotion status and gender? The resulting
matrix would be “expected”…
From the data, 25% of all employees were promoted.
Therefore, if gender plays no role, then we should see 25%
of the males promoted (75% not promoted) and 25% of the
females promoted…
          Promoted        Not Promoted    Total
Male      120*.25 = 30    120*.75 = 90    120
Female    80*.25 = 20     80*.75 = 60     80
Total     50              150             200
Notice that the marginal values did not change…only the
interior values changed.
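The same expected counts can be computed from the marginal totals alone: each cell is (row total * column total) / grand total. A minimal sketch in R, with the object names observed and expected chosen here for illustration:

```r
# Observed counts from the example (rows = gender, columns = promotion status)
observed <- matrix(c(40, 80,
                     10, 70),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Status = c("Promoted", "Not Promoted")))

# Expected count per cell = (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# The marginal totals of the expected table match the observed ones
rowSums(expected)
colSums(expected)
```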
Now, calculate the χ² statistic using the observed
and the expected matrices:
χ² = (40-30)²/30 + (80-90)²/90 + (10-20)²/20 + (70-60)²/60
   = 3.33 + 1.11 + 5.00 + 1.67 = 11.11
This is conceptually equivalent to a t-statistic or a
z-score.
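The hand calculation can be checked in R by summing (observed - expected)^2 / expected over the four cells. This is a sketch using the counts from the example; the variable names are illustrative.

```r
# Cell counts in the order: male promoted, male not promoted,
# female promoted, female not promoted
observed <- c(40, 80, 10, 70)
expected <- c(30, 90, 20, 60)

# Contribution of each cell to the statistic
contributions <- (observed - expected)^2 / expected

# The chi-square statistic is the sum of the cell contributions
chi_sq <- sum(contributions)

round(contributions, 2)
round(chi_sq, 2)
```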
To determine if this is in the rejection region, we
must determine the degrees of freedom (df).
df = (r-1)*(c-1), where r is the number of rows and c is the number of columns.
In the current example, we have two rows and
two columns, so df = 1*1 = 1.
At alpha = .05 and 1 df, the critical value is
3.84. Our value of 11.11 is clearly in the rejection
region…so what does this mean?
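The critical value and the corresponding p-value can be looked up with base R's chi-square distribution functions. A minimal sketch, assuming the statistic of 11.11 from the example; the variable names are illustrative.

```r
alpha <- 0.05

# Degrees of freedom for a table with 2 rows and 2 columns
df <- (2 - 1) * (2 - 1)

# Critical value: the cutoff that leaves alpha in the upper tail
critical_value <- qchisq(1 - alpha, df = df)

# Test statistic from the example
chi_sq <- 11.11

# p-value: upper-tail probability of a value at least this large under H0
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# TRUE means the statistic falls in the rejection region
reject_h0 <- chi_sq > critical_value

round(critical_value, 2)
p_value
reject_h0
```

Rejecting H0 here means the data are inconsistent with the assumption that promotion status and gender are unrelated.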
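# Here, the code is pretty simple. First install and load the "prettyR" package:
install.packages("prettyR")
library(prettyR)
# Then, you can run an xtab (var1, var2 and data are placeholders for your own variables and data frame):
xtab(var1 ~ var2, data = data)
# Then a Chi-Squared test (correct = FALSE turns off the continuity correction):
chisq.test(data$var1, data$var2, correct = FALSE)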
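For the gender/promotion example, the test can also be run directly on the table of counts using base R only. This is a sketch; the object names observed and result are illustrative, and the counts are typed in by hand rather than read from a dataset.

```r
# Observed counts from the example (rows = gender, columns = promotion status)
observed <- matrix(c(40, 80,
                     10, 70),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Status = c("Promoted", "Not Promoted")))

# correct = FALSE skips the continuity correction so the result
# matches the hand calculation
result <- chisq.test(observed, correct = FALSE)

result$statistic  # chi-square statistic
result$parameter  # degrees of freedom
result$p.value    # p-value
result$expected   # expected counts under no relationship
```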