Probability and the Chi-Square Test
written by J. D. Hendrix
Learning Objectives
Upon completing the exercise, each student should be able:
• to determine the chance that a given state will occur in a system that consists of a finite
number of equivalent states;
• to determine the chance of two or more independent events occurring simultaneously by
using the product rule;
• to determine the chance that either one or the other of two mutually exclusive events will
occur by using the sum rule;
• to test hypotheses based on expected frequencies using the chi-square test.
Background
Many genetic events are random processes. These include the segregation and assortment
of genes during meiosis, the combination of gametes during fertilization, and crossover between
homologous chromosomes. Scientists describe random natural processes using the mathematical
tools of probability and statistics.
A. The Chance that an Event will Occur
Consider a system that consists of a finite number of equivalent states. The chance that a
given state will occur is given by the equation
$$C = \frac{a}{t}$$
in which “C” is the chance (probability) of the state, “a” is the number of times the state is
represented in the system, and “t” is the total number of equivalent states in the system.
For example, we can develop a mathematical model to describe a coin toss. We assume
that a coin toss is a system with two equivalent states, heads-up and tails-up. We describe each
state with a probability.
$$\text{Chance of heads} = \frac{\text{number of heads}}{\text{total sides on the coin}} = \frac{1}{2} = 0.5 = 50\%$$
$$\text{Chance of tails} = \frac{\text{number of tails}}{\text{total sides on the coin}} = \frac{1}{2} = 0.5 = 50\%$$
We can use probabilities to predict the frequency of an event, or how often an event will
occur. In an experiment with 50 coin tosses:
Expected number of heads = 50 x 0.5 = 25
Expected number of tails = 50 x 0.5 = 25
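The chance equation and the expected frequencies above can be sketched in a few lines of Python. The helper names `chance` and `expected_count` are made up for this illustration and are not part of the exercise; `fractions.Fraction` is used only to keep the probability exact.

```python
from fractions import Fraction

def chance(a, t):
    """Chance of a state: a = times the state is represented, t = total equivalent states."""
    return Fraction(a, t)

def expected_count(probability, trials):
    """Expected frequency of an event over a given number of trials."""
    return probability * trials

p_heads = chance(1, 2)   # one heads side out of two sides
p_tails = chance(1, 2)   # one tails side out of two sides

print(float(p_heads))                # 0.5
print(expected_count(p_heads, 50))   # 25
print(expected_count(p_tails, 50))   # 25
```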
We test the validity of the hypothesis (that there is an equal chance of getting heads or
tails) by comparing the observed number of heads and tails in a coin-toss experiment with the
expected values calculated from the probabilities. If the original assumptions in the hypothesis
are not valid (for example, if the coin is heavier on one side, or if it is deformed in some way),
then there could be a significant difference between the observed and expected values.
It is customary to express the expected outcome of an experiment involving frequencies
as a ratio. In the coin toss experiment, we expect a heads:tails ratio of 1:1.
Imagine a standard deck of 52 playing cards, randomly shuffled. What is the chance of
drawing an ace of hearts (A♥) from the deck?
$$\text{Chance of A}\heartsuit = \frac{\text{Number of A}\heartsuit}{\text{Total number of cards}} = \frac{1}{52} \approx 0.0192$$
What is the chance of drawing any ace?
$$\text{Chance of any ace} = \frac{\text{Number of aces}}{\text{Total number of cards}} = \frac{4}{52} \approx 0.0769$$
Imagine that an ace of hearts was drawn from a standard deck of 52 cards and returned to
the deck. Then, the deck was reshuffled and another card drawn. What is the chance that the card
will be an ace of hearts?
$$\text{Chance of A}\heartsuit = \frac{\text{Number of A}\heartsuit}{\text{Total number of cards}} = \frac{1}{52} \approx 0.0192$$
Notice that, since the card was returned to the deck, the total number of cards and the chance
remain the same.
Imagine that an ace of hearts was drawn from a standard deck of 52 cards and discarded.
Then, another card was drawn. What is the probability that the card will be an ace of hearts?
$$\text{Chance of A}\heartsuit = \frac{\text{Number of A}\heartsuit}{\text{Total number of cards}} = \frac{0}{51} = 0$$
Notice that the number of cards and the chance have changed. Since there is no longer an ace of
hearts in the deck, then the probability of drawing an ace of hearts is zero. What is the chance of
drawing one of the three remaining aces?
$$\text{Chance of drawing one of the three remaining aces} = \frac{3}{51} \approx 0.0588$$
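The card examples above differ only in the counts that go into the chance equation. The sketch below repeats them in Python; the function name `chance` and the variable names are illustrative assumptions, not part of the exercise.

```python
from fractions import Fraction

def chance(a, t):
    """a = number of cards of interest in the deck, t = total cards in the deck."""
    return Fraction(a, t)

# Full deck: 52 cards, 1 ace of hearts, 4 aces in total.
ace_of_hearts = chance(1, 52)
any_ace = chance(4, 52)

# Ace of hearts returned to the deck: the counts are unchanged.
ace_of_hearts_returned = chance(1, 52)

# Ace of hearts discarded: 51 cards remain, with 0 aces of hearts and 3 other aces.
ace_of_hearts_discarded = chance(0, 51)
remaining_ace = chance(3, 51)

results = [
    ("ace of hearts", ace_of_hearts),
    ("any ace", any_ace),
    ("ace of hearts, card returned", ace_of_hearts_returned),
    ("ace of hearts, card discarded", ace_of_hearts_discarded),
    ("one of three remaining aces", remaining_ace),
]
for label, p in results:
    print(f"{label}: {float(p):.4f}")
```

Note that `Fraction` reduces its value automatically, so 4/52 is stored as 1/13 and 3/51 as 1/17; the decimal values are the same.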
B. The Chance of Independent Events Occurring Together
The chance of two or more independent events occurring together is the product of their
individual probabilities. An example is the simultaneous tossing of two coins. The outcome of
the toss on one coin should not affect the outcome on the second coin (unless the coins are glued
together). Therefore, the events are independent of each other.
| Outcome of toss on Coin #1 | Outcome of toss on Coin #2 | Probability of toss on Coin #1 | Probability of toss on Coin #2 | Probability of both events occurring |
|---|---|---|---|---|
| Heads | Heads | 0.5 | 0.5 | 0.5 x 0.5 = 0.25 |
| Heads | Tails | 0.5 | 0.5 | 0.5 x 0.5 = 0.25 |
| Tails | Heads | 0.5 | 0.5 | 0.5 x 0.5 = 0.25 |
| Tails | Tails | 0.5 | 0.5 | 0.5 x 0.5 = 0.25 |
What is the probability of rolling a pair of sixes on a pair of standard dice? As you
probably know, a “die” is a game cube with six sides, each side marked with one
to six dots. Assuming that the mass of the cube is evenly distributed, the chance of rolling
any one of the numbers is 1/6. Therefore, the probability of rolling a pair of sixes on a pair of dice is
$$\frac{1}{6} \times \frac{1}{6} = \frac{1}{36}$$
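One way to check the product rule is to list every equally likely outcome and count. The sketch below does this for the two coins and the two dice; it assumes fair coins and fair dice, and all variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Two coins: every combination of outcomes, each with probability 1/2 x 1/2.
sides = ["Heads", "Tails"]
for first, second in product(sides, repeat=2):
    p_both = Fraction(1, 2) * Fraction(1, 2)   # product rule
    print(first, second, p_both)

# Two dice: product rule for a pair of sixes.
p_six = Fraction(1, 6)
p_pair_of_sixes = p_six * p_six

# Direct count: (6, 6) is one outcome among all 36 equally likely rolls.
rolls = list(product(range(1, 7), repeat=2))
counted = Fraction(rolls.count((6, 6)), len(rolls))

print(p_pair_of_sixes, counted)   # 1/36 1/36
```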
C. Mutually Exclusive Events (Either/Or Situations)
The chance that either one or the other of two mutually exclusive events will occur is the
sum of their individual probabilities. For example, consider a box containing two red beads, three
white beads, and four blue beads. If one bead is randomly chosen, what is the chance that the
bead will be either red or white?
$$\text{Chance of a red bead} = \frac{2}{9}$$
$$\text{Chance of a white bead} = \frac{3}{9}$$
$$\text{Chance of either red or white} = \frac{2}{9} + \frac{3}{9} = \frac{5}{9}$$
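The bead example can be written out as a short Python sketch of the sum rule. The dictionary `beads` and the other names are assumptions made for the illustration.

```python
from fractions import Fraction

beads = {"red": 2, "white": 3, "blue": 4}
total = sum(beads.values())   # 9 beads in the box

p_red = Fraction(beads["red"], total)
p_white = Fraction(beads["white"], total)

# Sum rule: one bead cannot be both red and white, so the chances add.
p_red_or_white = p_red + p_white

print(p_red_or_white)   # 5/9
```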
D. Hypothesis Testing Using the Chi-Square Test
Let’s develop a “formal hypothesis” for the coin toss experiment.
Hypothesis: If the mass of a coin is symmetrically distributed on both sides of the coin, then there
is an equal probability of a coin toss resulting in “heads” or “tails.”
From this hypothesis we can make the following prediction.
Prediction: If a specific coin is tossed 50 times, then 25 of the tosses will result in “heads” and 25
of the tosses will result in “tails.”
The prediction can be tested by performing the following experiment.
Experiment: Toss the coin 50 times and count the number of “heads” and “tails.”
Independent variable: Number of times the coin is tossed.
Dependent variable: Number of “heads” or “tails.”
The observed results in an experiment are almost never exactly equal to the expected
results. For example, in the coin toss experiment one expects 25 heads and 25 tails if a coin is
tossed 50 times. However, what if the result is 27 heads and 23 tails? Is this a significant
difference between the expected and the observed results, or can we attribute the difference to
random chance? It seems to “make sense” that a result of 27 heads, 23 tails is reasonable, but
how can we be sure? If we repeated the experiment 100 times, how often would we expect to see this
much deviation from the expected value (25:25)? The chi-square (χ²) test is a statistical test used
to determine whether the difference between an expected result and an observed result is
significant or whether the difference can be attributed to random chance.
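For a fair coin, the question of how often a deviation this large would appear can also be answered by counting outcomes directly. The sketch below is not the chi-square test; it is an exact binomial count, included only as a cross-check. It assumes a fair coin, requires Python 3.8 or later for `math.comb`, and the function name `prob_deviation_at_least` is hypothetical.

```python
from math import comb

def prob_deviation_at_least(tosses, deviation):
    """Chance that the number of heads from a fair coin differs from tosses/2
    by at least `deviation` in either direction."""
    expected = tosses / 2
    # Count the head/tail sequences whose number of heads is far enough from expected.
    favorable = sum(
        comb(tosses, heads)
        for heads in range(tosses + 1)
        if abs(heads - expected) >= deviation
    )
    # Every one of the 2**tosses sequences is equally likely for a fair coin.
    return favorable / 2 ** tosses

p = prob_deviation_at_least(50, 2)   # 27:23 or anything more lopsided
print(f"{p:.3f}")
```

Under these assumptions the result is roughly two out of three, so a 27:23 split or worse is an ordinary outcome for a fair coin.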
To analyze experimental data using the χ² test, the data must consist of a finite number of
mutually exclusive outcomes or classes. Also, we must know the probability of each class in
order to calculate the expected values. The number of degrees of freedom (df) in an experiment is the total
number of classes minus one:
df = k - 1, where k is the number of classes.
In the coin experiment, there are two outcomes or classes of results, heads and tails.
Therefore, there is one degree of freedom.
The value of χ² is given by the equation
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where O is the observed number of items in a given class, E is the expected number of items in
the class, and the summation sign (Σ) indicates the sum of all values of (O − E)²/E for every class
in the system.
Consider the following results of the coin toss experiment.
| Toss | # obtained (O) | # expected (E) | O – E | (O – E)² | (O – E)²/E |
|---|---|---|---|---|---|
| Heads | 27 | 50 x 0.5 = 25 | 2 | 4 | 0.160 |
| Tails | 23 | 50 x 0.5 = 25 | -2 | 4 | 0.160 |
| Total | 50 | | | | χ² = 0.320 |
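The arithmetic in the table can be reproduced with a short Python sketch. The function name `chi_square` and the list layout (heads first, then tails) are assumptions made for this illustration.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

tosses = 50
observed = [27, 23]                       # heads, tails
expected = [tosses * 0.5, tosses * 0.5]   # 25 heads, 25 tails

df = len(observed) - 1                    # number of classes minus one
chi2 = chi_square(observed, expected)

print(df, round(chi2, 3))
```

The same function works for any number of classes, as long as `observed` and `expected` list the classes in the same order.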
Is the difference between O and E significant? If so, then we reject the hypothesis. If not,
then we fail to reject the hypothesis. We evaluate the difference from a table of χ² values, such as
the one shown below.
P value = Probability that the Difference is due to Chance and is Not Significant

| df | P = 0.95 | P = 0.80 | P = 0.50 | P = 0.20 | P = 0.10 | P = 0.05 | P = 0.01 |
|---|---|---|---|---|---|---|---|
| 1 | 0.00393 | 0.0642 | 0.455 | 1.642 | 2.706 | 3.841 | 6.635 |
| 2 | 0.103 | 0.446 | 1.386 | 3.219 | 4.605 | 5.991 | 9.210 |
| 3 | 0.352 | 1.005 | 2.366 | 4.642 | 6.251 | 7.815 | 11.34 |
| 4 | 0.711 | 1.649 | 3.357 | 5.989 | 7.779 | 9.488 | 13.277 |
| 5 | 1.145 | 2.343 | 4.351 | 7.289 | 9.236 | 11.070 | 15.086 |
| 6 | 1.635 | 3.070 | 5.348 | 8.558 | 10.645 | 12.592 | 16.812 |
| 7 | 2.167 | 3.822 | 6.346 | 9.803 | 12.017 | 14.067 | 18.475 |
| 8 | 2.733 | 4.594 | 7.344 | 11.030 | 13.362 | 15.507 | 20.090 |
| 9 | 3.325 | 5.380 | 8.343 | 12.242 | 14.684 | 16.919 | 21.666 |
| 10 | 3.940 | 6.179 | 9.342 | 13.442 | 15.987 | 18.307 | 23.209 |
Locate the value of χ² in the row corresponding to the appropriate df value. In this
example, the value of χ² = 0.320, and the value of df = 2 - 1 = 1. Therefore, the χ² value is
between 0.0642 and 0.455.
0.0642 < χ² < 0.455
The probability, P, that the deviation is due to random chance, and is not significant, is
read from the top row of the table.
0.80 > P > 0.50
How do we interpret this nonsense?
In most genetics work, deviations are considered significant only if the probability value
from the χ² table is 0.05 (5%) or less. This is called a 5% level of significance (or a 95%
confidence level). If the probability is 0.05 or less, this means that there is a 95% or greater
probability that the deviation is not due to chance, and the hypothesis is rejected. If the
probability is greater than 0.05, then we cannot reject the hypothesis based on the data. In our
example, P is greater than 0.05, so the hypothesis is not rejected.
Whew! That’s a lot of words. To put it simply:
• If the P value from the χ² table is 0.05 or less, then the deviation of the observed values
from the expected values is significant and the data do not support the hypothesis.
• If the P value from the χ² table is greater than 0.05, then the deviation of the observed
values from the expected values is not significant and the data support the hypothesis.
In this example, the P value is between 0.50 and 0.80. This means that there is between a
50% and 80% probability that the deviation seen is due to chance. Since the P value is greater
than 0.05, the deviation is not significant at the 95% confidence level, and the data support the
hypothesis.
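Reading the table can be sketched as a small lookup. The lists below copy the P values from the top row of the table and the χ² values from the df = 1 row; the names `P_VALUES`, `CHI2_DF1`, `bracket_p`, and `is_significant` are hypothetical and exist only for this illustration.

```python
# P values across the top of the table, and the chi-square values in the df = 1 row.
P_VALUES = [0.95, 0.80, 0.50, 0.20, 0.10, 0.05, 0.01]
CHI2_DF1 = [0.00393, 0.0642, 0.455, 1.642, 2.706, 3.841, 6.635]

def bracket_p(chi2, row=CHI2_DF1, p_values=P_VALUES):
    """Return (upper, lower) bounds on P read from one table row.
    None means the bound lies beyond the edge of the table."""
    upper, lower = None, None
    for p, table_value in zip(p_values, row):
        if chi2 >= table_value:
            upper = p   # chi2 reaches this column, so P is at most p
        else:
            lower = p   # first column chi2 does not reach, so P is above p
            break
    return upper, lower

def is_significant(chi2, value_at_p_005=3.841):
    """True when chi2 reaches the P = 0.05 column (default is the df = 1 entry)."""
    return chi2 >= value_at_p_005

print(bracket_p(0.320))        # (0.8, 0.5)
print(is_significant(0.320))   # False
```

For other degrees of freedom, pass the matching row of the table as `row` and its P = 0.05 entry as `value_at_p_005`.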
Consider the results of another coin toss experiment, using a different coin.
| Toss | # obtained (O) | # expected (E) | O – E | (O – E)² | (O – E)²/E |
|---|---|---|---|---|---|
| Heads | 13 | 25 | -12 | 144 | 5.76 |
| Tails | 37 | 25 | 12 | 144 | 5.76 |
| Total | 50 | | | | χ² = 11.52 |
As before, df = 2 – 1 = 1. At df = 1, it looks like the χ² value we calculated is “off the
chart!” This simply means that the deviation is so big that χ² is larger than the largest
value in the df = 1 row of the table. The P value must therefore be smaller than 0.01 (and so also
smaller than 0.05). Hence:
χ² > 6.635
P < 0.01
Since P < 0.05, the deviation of the observed values from the expected in this coin toss is
significant and the data do not support the hypothesis. Can you suggest a reason why the coin
toss experiment failed to support the expected 1:1 ratio in this case? (Here’s a hint: read the first
sentence of the hypothesis for an important assumption that led us to the 1:1 ratio.)
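When a χ² value falls off the end of the table, the P value can be computed instead of bracketed. The sketch below works only for df = 1. It relies on the fact that a χ² variable with one degree of freedom is the square of a standard normal variable, so its upper-tail probability equals erfc(√(χ²/2)). The function name `p_value_df1` is an assumption made for this illustration.

```python
from math import erfc, sqrt

def p_value_df1(chi2):
    """Upper-tail probability of the chi-square distribution with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2))

# The two coin-toss experiments from this exercise.
for chi2 in (0.320, 11.52):
    p = p_value_df1(chi2)
    verdict = "significant" if p <= 0.05 else "not significant"
    print(f"chi-square = {chi2}: P = {p:.4f} ({verdict})")
```

Both results agree with the ranges read from the table: the first P value lies between 0.50 and 0.80, and the second is below 0.01.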
Probability and the Chi-Square Test
Laboratory Report Sheet
Name
Lab Partners
1. A standard deck of 52 playing cards has 13 cards of each suit (hearts, spades, diamonds,
or clubs). What is the probability of drawing a diamond?
2. If two coins are tossed, what is the probability that one coin will be heads and the other
coin will be tails, with either of the two coins being heads? To solve this problem, start
with the product rule as described in section B, then apply the sum rule from section C
to solve for an either/or situation.
3. What is the probability of rolling a seven in any combination on a pair of dice? To solve
this problem, you will need to use a combination of the product rule and the sum
rule. Try completing this table. Remember that each roll is a “mutually exclusive event”
(that is, if you roll a 1 and a 6, you can’t roll a 2 and a 5 at the same time).

| Roll on Die 1 | Roll on Die 2 | Probability of this Roll |
|---|---|---|
| 1 | 6 | 1/6 x 1/6 = 1/36 |
| 2 | | |

Probability of rolling a seven in any combination: _________________________
4. In corn, the genes for seed color (purple or yellow) and seed shape (smooth or wrinkled)
assort independently of each other. This means that the expected frequencies
(probabilities) of these traits in a cross can be treated as independent events occurring
simultaneously, so the product rule applies.
If hybrid purple corn is self-fertilized, the following offspring are expected:
¾ Purple
¼ Yellow
If hybrid smooth corn is self-fertilized, the following offspring are expected:
¾ Smooth
¼ Wrinkled
Here’s the question: If corn that is both hybrid purple and hybrid smooth is self-fertilized,
what results do you expect? Use the product rule to figure out how many purple smooth,
purple wrinkled, yellow smooth, and yellow wrinkled kernels you expect.
5. You will be provided a 6-sided die (game cube).
(a) Write a formal hypothesis, prediction, experiment, and variables about the probabilities of
tossing numbers on the die.
(b) Perform your experiment. You should have a sufficient sample size (i.e. several hundred
rolls) for a valid statistical sample.
(c) Use the χ² test to determine if your data support your hypothesis. Show your work.
(d) Write a brief conclusion summarizing your results. If the data do not support your
hypothesis, you should suggest reasons in your conclusion.