Lecture 12/3 (Chi-Square, nonparametric tests, and summing up)

Friday, Dec. 3
Chi-square Goodness of Fit
Chi-square Test of Independence: Two Variables.
Summing up!
[Figure: Punnett squares for pea color (y = yellow, g = green); offspring genotypes yy, yg, gy, and gg at 25% each.]
Pea Color    freq Observed    freq Expected
Yellow            158              150
Green              42               50
TOTAL             200              200
Chi Square Goodness of Fit
Pea Color    freq Observed    freq Expected
Yellow            158              150
Green              42               50
TOTAL             200              200
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_o - f_e)^2}{f_e}$$
d.f. = k - 1, where k = number of categories in the variable.
“… the general level of agreement between
Mendel’s expectations and his reported results
shows that it is closer than would be expected in
the best of several thousand repetitions. The data
have evidently been sophisticated systematically,
and after examining various possibilities, I have
no doubt that Mendel was deceived by a
gardening assistant, who knew only too well what
his principal expected from each trial made…”
-- R. A. Fisher
Chi Square Goodness of Fit
Pea Color    freq Observed    freq Expected
Yellow            151              150
Green              49               50
TOTAL             200              200
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_o - f_e)^2}{f_e}$$
d.f. = k - 1, where k = number of categories in the variable.
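The statistic for the two pea tables above can be checked with a few lines of Python. This is a minimal sketch, not part of the lecture: the function names `chi_square_gof` and `upper_tail_p_df1` are hypothetical, and the p-value helper is valid only for d.f. = 1, where a chi-square variable is the square of a standard normal.

```python
import math

def chi_square_gof(observed, expected):
    """Sum (fo - fe)^2 / fe over all k categories."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

def upper_tail_p_df1(chi2):
    """P(chi-square >= chi2), valid for d.f. = 1 only (two categories)."""
    return math.erfc(math.sqrt(chi2 / 2.0))

expected = [150, 50]                        # Yellow, Green, from the tables
first_table = chi_square_gof([158, 42], expected)
second_table = chi_square_gof([151, 49], expected)

for label, chi2 in [("158/42", first_table), ("151/49", second_table)]:
    p = upper_tail_p_df1(chi2)
    print(f"{label}: chi-square = {chi2:.3f}, d.f. = 1, p = {p:.3f}")
```

A smaller chi-square means the observed counts sit closer to expectation. The second table's value is far smaller than the first's, which is the kind of closeness Fisher's remark is about.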
Peas to Kids: Another Example
Goodness of Fit
At my children’s school science fair last year,
where participation was voluntary but strongly encouraged,
I counted about 60 boys and 40 girls who had
submitted entries. Since I expect a ratio of 50:50
if there were no gender preference for submission,
is this observation deviant, beyond chance level?
             Boys    Girls
Expected:     50      50
Observed:     60      40
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_o - f_e)^2}{f_e} = \frac{(60-50)^2}{50} + \frac{(40-50)^2}{50} = 4.00$$
For each of k categories, square the difference between the
observed and the expected frequency, divide by the expected
frequency, and sum over all k categories.
This value, chi-square, will be distributed with known probability
values, where the degrees of freedom is a function of the number of
categories (not n). In this one-variable case, d.f. = k - 1.
Critical value of chi-square at α = .05, d.f. = 1 is 3.84, so reject H0.
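The claim that chi-square is "distributed with known probability values" can be checked by simulation. The sketch below is an illustration under assumptions, not the lecture's method: it simulates many science fairs of 100 entries in which the null hypothesis (50:50) is true, and counts how often chance alone pushes chi-square past 3.84. The seed and the number of trials are arbitrary choices.

```python
import random

random.seed(12345)          # fixed seed so the run is repeatable (arbitrary)
N_KIDS = 100                # entries counted at the science fair
N_TRIALS = 10_000           # number of simulated fairs (assumption)
CRITICAL = 3.84             # critical value at alpha = .05, d.f. = 1

def simulate_chi_square():
    """One simulated fair under H0: each entry is a boy with probability .5."""
    boys = sum(random.random() < 0.5 for _ in range(N_KIDS))
    girls = N_KIDS - boys
    fe = N_KIDS / 2
    return (boys - fe) ** 2 / fe + (girls - fe) ** 2 / fe

exceed = sum(simulate_chi_square() >= CRITICAL for _ in range(N_TRIALS))
rejection_rate = exceed / N_TRIALS
print(f"Share of simulated fairs with chi-square >= {CRITICAL}: {rejection_rate:.3f}")
```

Because counts are whole numbers, the share should land near .05 rather than exactly on it. A 60:40 split or anything more extreme is that rare when there is truly no preference.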
Chi-square Test of Independence
Are two nominal-level variables related to, or independent of,
each other?
Is race related to SES, or are they independent?
SES      White    Black    Total
Hi         12        3       15
Lo         16       16       32
Total      28       19       47
The expected frequency of any given cell is
$$f_e = \frac{\text{Row } n \times \text{Column } n}{\text{Total } n}$$
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(f_o - f_e)^2}{f_e}$$
At d.f. = (r - 1)(c - 1), where r = number of rows and c = number of columns.
Expected frequencies (Row n x Column n / Total n):
SES      White                 Black                 Total
Hi       (15x28)/47 = 8.94     (15x19)/47 = 6.06       15
Lo       (32x28)/47 = 19.06    (32x19)/47 = 12.94      32
Total    28                    19                      47
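The expected-frequency rule applies to a table of any size. Here is a minimal Python sketch of it using the race by SES counts; the function name `expected_frequencies` is hypothetical, chosen for illustration.

```python
def expected_frequencies(observed):
    """Return row n * column n / total n for every cell of a table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

observed = [[12, 3],     # Hi SES: White, Black
            [16, 16]]    # Lo SES: White, Black

for row in expected_frequencies(observed):
    print([round(fe, 2) for fe in row])
```

The printed rows should match the 8.94, 6.06, 19.06, and 12.94 worked out by hand.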
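Please calculate:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(f_o - f_e)^2}{f_e}$$
Observed frequencies, with expected frequencies in parentheses:
SES      White          Black          Total
Hi       12 (8.94)       3 (6.06)        15
Lo       16 (19.06)     16 (12.94)       32
Total    28             19               47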
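One way to check a hand calculation is to let Python do the double sum. This is a sketch that assumes the observed counts are entered as a list of rows; it recomputes the expected frequencies without rounding, so it can differ slightly in the second decimal from a sum built on the rounded values.

```python
observed = [[12, 3],     # Hi SES: White, Black
            [16, 16]]    # Lo SES: White, Black

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / total   # expected frequency
        chi2 += (fo - fe) ** 2 / fe                  # this cell's contribution

df = (len(row_totals) - 1) * (len(col_totals) - 1)   # (r - 1)(c - 1)
print(f"chi-square = {chi2:.2f}, d.f. = {df}")
```

With unrounded expected frequencies the statistic comes out near 3.82 at d.f. = 1, just short of the 3.84 critical value at α = .05.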
Important assumptions:
Independent observations.
Observations are mutually exclusive.
Expected frequencies should be reasonably large:
  d.f. = 1: all expected frequencies at least 5
  d.f. = 2: all expected frequencies greater than 2
  d.f. > 3: all expected frequencies but one at least 5, and the remaining one at least 1
Univariate Statistics:
Interval    Mean      one-sample t-test
Ordinal     Median
Nominal     Mode      Chi-squared goodness of fit
Bivariate Statistics
[Table: tests by level of measurement of X and Y (Nominal, Ordinal, Interval).]
Tests listed: χ², Rank-sum, Kruskal-Wallis H, t-test, ANOVA, Spearman rs (rho), Pearson r, Regression
Who said this?
"The definition of insanity is doing the
same thing over and over again and
expecting different results".
• I don’t like it because from a statistical point of view, it
is insane to do the same thing over and over again
and expect the same results!
• More to the point, the wisdom of statistics lies in
understanding that repeating things in some ways produces
results that are more alike than repeating them in other ways.
Hmm. Think about this for a moment. Statistics
allows one to understand the expected variability in
results even when the same thing is done, as a
function of σ and N.
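The dependence on σ and N can be seen by literally doing the same thing over and over. The sketch below is an illustration under assumptions (a normal population, σ = 10, arbitrary seed and repeat count): it draws many samples of size N, takes each sample's mean, and compares the spread of those means with σ / √N, the standard error of the mean.

```python
import random
import statistics

random.seed(2024)       # arbitrary seed so the run is repeatable
SIGMA = 10.0            # population standard deviation (assumed value)
REPEATS = 4000          # how many times the "same thing" is done

def spread_of_means(n):
    """Standard deviation of sample means across repeated samples of size n."""
    means = [statistics.fmean(random.gauss(0.0, SIGMA) for _ in range(n))
             for _ in range(REPEATS)]
    return statistics.stdev(means)

for n in (4, 25, 100):
    print(f"N = {n:3d}: observed spread = {spread_of_means(n):.2f}, "
          f"sigma / sqrt(N) = {SIGMA / n ** 0.5:.2f}")
```

The results differ from run to run, but by a predictable amount: the spread of the means shrinks as N grows.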
Your turn!
• Given this start, explain why Uncle Albert
heads us down the wrong path. In your
answer, make sure you refer to the error
statistic (e.g., standard error of the mean,
standard error of the difference between
means, Mean Square within) as well as the
sample size N. In short, explain why
statistical thinking is beautiful, and why Albert
Einstein (if he ever said it) was wrong.