The Chi-square goodness of fit test
Chi-square goodness of fit
• Core issue in statistics: When are you viewing just
random noise and when is there a real trend?
– Example: To see if squash shape & color are linked genes, do a test cross.
GgLl x ggll
1 : 1 : 1 : 1
When to use a chi-square test
• Your response variable is count data.
• You have more than one category of the
response variable.
• You have a hypothesis for the responses you
expect.
• You want to know if the difference between
the responses you observe and the responses
you expect is significant or not.
Turn a hypothesis into a number
Your hypothesis tells you what you expect any given response (observation) to be.
Turn your expectation into a fraction or percentage.
• Example hypothesis: “The MSU football team will win
every single game this season.” So, according to my
hypothesis, I expect MSU’s chance of winning any game is
___%. (Answer: 100%)
Does this mean MSU will win 100% of their games?
• Example hypothesis: “The MSU football team’s number of
wins and losses will be random.” So, according to this new
hypothesis, I expect the team’s chance of winning any
game is ___%. (Answer: 50%)
Turn a hypothesis into a number
Hyp.: “People over the age of 60 are 50% more likely to attend a baseball game
than younger people.” So, according to my hypothesis if I go to a baseball game
and find out the ages for all the fans in the audience, I expect the odds of any one
fan being > 60 to be…
Let x be the % of fans > 60 and (x - 50) the % of younger fans: x + (x - 50) = 100, solve for x.
x = 75% or 3 out of 4.
What are the odds a fan will be < 60 years old?
Turn a hypothesis into a number
• “Pre-hypothesis”: Given the choice, people prefer red and blue m&m’s over
the other 4 colors.
• But don’t know how strong their preference might be. So test
the “null hypothesis”—People choose m&m colors at random, i.e. they
don’t show preference. (vs. “alternative” or “experimental” hypothesis).
• So, according to my null hypothesis, if I hand around a bowl of m&ms, I
expect the chance of each color being chosen is…
1/6 or 16.67%.
• Use a chi-square test to see if what you actually observe is significantly different
from 1/6.
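The step of turning a hypothesis into expected numbers can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `expected_counts` and the sample size of 60 m&m's are assumptions chosen for the example.

```python
def expected_counts(total, proportions):
    """Scale hypothesized proportions to expected counts for `total` observations."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("hypothesized proportions must sum to 1")
    return [total * p for p in proportions]

# Null hypothesis from the slide: six colors, each chosen with probability 1/6.
# The total of 60 candies is a made-up sample size, for illustration only.
null_proportions = [1 / 6] * 6
mm_expected = expected_counts(60, null_proportions)
print(mm_expected)
```

The same helper works for any hypothesis that can be written as fractions summing to 1, for example `[0.25, 0.75]`.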
The chi-square test
Observed vs. expected % of fans > 60 years old, by game:
Game   Observed   Expected
1      69         75
2      80         75
3      20         75
4      55         75
5      67         75
6      76         75
7      47         75
8      81         75
9      70         75
10     68         75
The chi-square test determines
whether or not the difference
between the responses you
observe and the responses you
expect is significant.
Significant = not due to random
chance alone.
Calculate the “strength of the
difference”, get a value that tells
you the probability the difference
is due to chance (random noise)
alone.
If this probability is small (<5%), we
conclude there is a significant
difference (the difference is not
simply due to chance) between obs
and exp values.
Interpreting the chi-square test
Observed ≈ Expected, or Observed ≠ Expected?
(Observed % of fans > 60 years old for games 1-10: 69, 80, 20, 55, 67, 76, 47, 81, 70, 68; expected: 75 for every game.)
Hypothesis: “People over the age of
60 are 50% more likely to attend a
baseball game than younger
people.”
If the test tells you your data are
not significantly different from
what you expect, (your data have a
“good fit” to the expected values),
you support the hypothesis.
Note: no statistical test ever proves a
hypothesis!
If the test tells you your data are
significantly different from what
you expect, you reject the
hypothesis.
What is chi-square?
“Chi-square” symbol is χ² (Greek).
$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
Σ means “Sum of”. The Expected values are based on your hypothesis!
Table to fill in:
             Observed   Expected   Obs-Exp   (Obs-Exp)²   (Obs-Exp)²/Exp
Category 1
Category 2
…
χ² total
Degrees of Freedom: Number of categories minus 1 = N-1
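The formula and the degrees-of-freedom rule above can be sketched as a short Python function. This is an illustrative sketch, not the slides' own material: the name `chi_square` and the coin-flip counts are assumptions made up for the example.

```python
def chi_square(observed, expected):
    """Return (chi-square statistic, degrees of freedom) for a goodness of fit test."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need one value per category")
    # Sum of (Observed - Expected)^2 / Expected over all categories.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    degrees_of_freedom = len(observed) - 1  # number of categories minus 1
    return statistic, degrees_of_freedom

# Made-up data for illustration: 100 coin flips, hypothesis of a fair coin.
stat, df = chi_square([58, 42], [50, 50])
print(stat, df)
```

Each term divides by the expected count, so the function assumes every expected value is greater than zero.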
Example problem #1
A university biology department would like to hire a new professor. They advertised
the opening and received 220 applications, 25% of which came from women. The
department came up with a “short list” of their favorite 25 candidates, 5 women and 20
men, for the job. You want to know if there is evidence for the search committee being
biased against women. Note: If the committee is unbiased the proportion of women in
the short list should match the proportion of women in all the applications.
• Define your hypothesis.
Women: 25 * 0.25 = 6.25
Men: 25 * 0.75 = 18.75
• Set up table.
          Observed   Expected   Obs-Exp   (Obs-Exp)²   (Obs-Exp)²/Exp
Women     5          6.25       -1.25     1.5625       0.25
Men       20         18.75      1.25      1.5625       0.08
Total     25         25                                0.33 (χ² total)
Degrees of Freedom: 1
Chi-square probability table
Probabilities →
• p > 0.05: Observed values not significantly different from expected (differences due to random chance). Support hypothesis.
• p < 0.05: Observed values are significantly different from expected (differences not just due to random chance). Reject hypothesis.
Chi-square probability table
Probabilities →
Probability range: 0.5 < p < 0.6
Means that there is a 50-60% probability that the difference between obs & exp values is from random chance alone.
• p > 0.05: Observed values not significantly different from expected (differences due to random chance). Support hypothesis.
• p < 0.05: Observed values are significantly different from expected (differences not just due to random chance). Reject hypothesis.
So, is the department biased against women applicants?
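As a cross-check on the table lookup for Example problem #1, the probability can also be computed directly. The sketch below uses only the Python standard library and relies on a closed form that holds for 1 degree of freedom only; the variable names are illustrative assumptions.

```python
import math

observed = [5, 20]                   # women, men on the short list
expected = [25 * 0.25, 25 * 0.75]    # 6.25 and 18.75 if the committee is unbiased

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Closed form valid only for 1 degree of freedom:
# P(chi-square >= x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 2), round(p_value, 2))
```

The computed probability should fall inside the 0.5 < p < 0.6 range read from the table, well above the 5% cutoff.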
Example problem #2
Work in groups
Example problem #2
Hypothesis:
Body color and wing size are
unlinked genes.
Expected ratio?
9:3:3:1.
Expected values:
• Gray Normal wings (GgWw): 9/16 * 102 = 57.375
• Gray Vestigial wings (Ggww): 3/16 * 102 = 19.125
• Ebony Normal wings (ggWw): 3/16 * 102 = 19.125
• Ebony Vestigial (ggww): 1/16 * 102 = 6.375
              Observed   Expected   Obs-Exp   (Obs-Exp)²   (Obs-Exp)²/Exp
Gray Norm.    53         57.375     -4.375    19.141       0.333
Gray Vest.    16         19.125     -3.125    9.766        0.511
Ebony Norm.   25         19.125     5.875     34.516       1.805
Ebony Vest.   8          6.375      1.625     2.641        0.414
Total         102        102                               3.063 (χ² total)
Degrees of Freedom: 3
Chi-square probability table
Probabilities →
Probability range: 0.3 < p < 0.4
Means that there is a 30-40% probability that the difference between obs & exp values is from random chance alone.
Support hypothesis.
Biology?
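The arithmetic in Example problem #2 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper name `p_value_df3` is hypothetical, and its closed form is valid only for 3 degrees of freedom, which matches the four phenotype categories here.

```python
import math

observed = [53, 16, 25, 8]   # gray normal, gray vestigial, ebony normal, ebony vestigial
ratio = [9, 3, 3, 1]         # expected ratio if the genes are unlinked
total = sum(observed)        # 102 flies
expected = [total * r / sum(ratio) for r in ratio]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

def p_value_df3(x):
    """Upper-tail chi-square probability; closed form valid only for 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p = p_value_df3(chi2)
print(round(chi2, 3), df, round(p, 2))
```

The result should agree with the table lookup: a probability between 0.3 and 0.4, so the 9:3:3:1 hypothesis is not rejected.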
Example problem #3
Using Chi-square to test for
linked genes
Example problem #3
1. Hypothesis:
Squash color and shape are not linked genes.
OR Squash color and shape are linked genes.
2. Describe the phenotypes and circle the recombinants.
LlGg
llGg
llgg
Llgg
3. If the 2 genes are not linked the expected ratio is:
1:1:1:1
4. If the two genes are linked the expected phenotype ratio is:
1:0:0:1
Example problem #3
If you tested the hypothesis that squash shape and color ARE NOT LINKED (1:1:1:1):
5. Calculate the expected number of offspring for each phenotype:
Wild Wild (LlGg): 509/4 = 127.25
Wild Orange (Llgg): 127.25
Round Wild (llGg): 127.25
Round Orange (llgg): 127.25
               Observed   Expected   Obs-Exp    (Obs-Exp)²   (Obs-Exp)²/Exp
Wild Wild      228        127.25     100.75     10150.56     79.8
Wild Orange    17         127.25     -110.25    12155.06     95.5
Round Wild     21         127.25     -106.25    11289.06     88.7
Round Orange   243        127.25     115.75     13398.06     105.3
χ² total: 369.3
Degrees of Freedom: 3
Chi-square probability table
Probabilities →
Probability range: p < 0.01
Statistical meaning: < 1% probability that the difference between obs & exp values is from random chance alone.
The obs and exp values are significantly different. Reject hypothesis.
Biological meaning?
Example problem #3
If you tested the hypothesis that squash shape and color ARE LINKED (1:0:0:1):
5. Calculate the expected number of offspring for each phenotype:
Wild Wild (LlGg): 509/2 = 254.5
Wild Orange (Llgg): 0
Round Wild (llGg): 0
Round Orange (llgg): 509/2 = 254.5
               Observed   Expected   Obs-Exp   (Obs-Exp)²   (Obs-Exp)²/Exp
Wild Wild      228        254.5      -26.5     702.25       2.76
Wild Orange    17         0          17        289          (Undef.) 0
Round Wild     21         0          21        441          (Undef.) 0
Round Orange   243        254.5      -11.5     132.25       0.52
χ² total: 3.28
Degrees of Freedom: 3
Chi-square probability table
Probabilities →
Probability range: 0.3 < p < 0.4
Statistical meaning: 30-40% probability that the difference between obs & exp values is from random chance alone.
The obs and exp values are not significantly different. Support hypothesis.
Biological meaning?
Example problem #3
Hypothesis not linked → p < 0.01 → Reject hypothesis
Hypothesis linked → 0.3 < p < 0.4, in other words, p > 0.05 → Support hypothesis
Are these test results in agreement?
So do these data show that the genes
are linked or not?
If you weren’t very confident in your
test results, what could you do next to
improve your confidence?
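The two tests in Example problem #3 can be compared side by side in Python. This is a rough sketch, not the slides' own method. The helper names are hypothetical. Skipping categories with an expected count of 0 mirrors the slide's choice to score the undefined terms as 0, and the p-value formula assumes 3 degrees of freedom, as on the slides.

```python
import math

def chi_square_skip_zero(observed, expected):
    """Chi-square sum that skips categories with an expected count of 0.

    Skipping mirrors the slide, which scores the undefined terms as 0.
    Strictly, any offspring in a category expected to be empty already
    contradicts the hypothesis, so this is a simplification.
    """
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

def p_value_df3(x):
    """Upper-tail chi-square probability; closed form valid only for 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

observed = [228, 17, 21, 243]   # wild wild, wild orange, round wild, round orange
total = sum(observed)           # 509 offspring

hypotheses = {
    "not linked": [total / 4] * 4,               # 1:1:1:1
    "linked": [total / 2, 0, 0, total / 2],      # 1:0:0:1
}

results = {}
for name, expected in hypotheses.items():
    chi2 = chi_square_skip_zero(observed, expected)
    results[name] = (chi2, p_value_df3(chi2))
    print(name, round(chi2, 2), results[name][1])
```

Under these assumptions the "not linked" hypothesis gives a probability far below 0.01, while the "linked" hypothesis gives one between 0.3 and 0.4, matching the summary on the last slide.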