Statistical Data Analysis
2010/2011
M. de Gunst
Lecture 8
Statistical Data Analysis: Introduction
Topics
Summarizing data
Investigating distributions
Bootstrap
Robust methods
Nonparametric tests
Analysis of categorical data
Multiple linear regression
Analysis of categorical data (Reader: Chapter 7)
Categorical data of two variables
Goal: 1) investigate whether there is a relationship between the variables
2) if there is a relationship, investigate which categories are involved
1) Tests
Fisher’s exact test
chi-square test
2) Methods for further investigation
based on cell frequencies
using bootstrap
Examples
Question 1. Are “beta”-studies more for men than for women?
Data: 60 students categorized as follows
beta alpha
M 23 17
W 7 13
Question 2. Does frequency of nucleotides in DNA depend on the
position in the DNA sequence?
Data: 100 DNA-sequences of length 5
of which nucleotide sequence is known
pos     1    2    3    4    5
A      33   34   19   20   21
G      22   27   23   24   21
C      31   18   34   30   25
T      14   21   24   26   33
Question 3. Are nucleotides in DNA at consecutive positions
dependent?
Data: 413 pairs of consecutive nucleotides
        A    G    C    T
A      35   35   27   18
G      22   29   20   26
C      30   18   38   38
T      14   11   28   24
Example 1
Question 1. Are “beta”-studies more for men than for women?
Data: 60 students categorized as follows
beta alpha
M 23 17
W 7 13
Does the number 23 reflect what you would expect if there were no relationship between gender and study type?
Which number do we expect?
Why?
Thus:
We wish to test:
H0: no relationship between gender and study type
H1: beta studies are more for men than for women
Approach:
compare observed numbers with expected numbers under H0
Need:
model and test statistic
Data with marginals:
        beta  alpha  Total
M         23     17     40
W          7     13     20
Total     30     30     60
Four possibilities for how the data were obtained:
i) Row and column totals both given, as in the reasoning above (only for 2x2 tables): not in Reader!
ii) One sample of size 60 from students; total sample size given: Model A in Reader
iii) Two independent samples, one of size 40 from male students and one of size 20 from female students; sample sizes of row variable given: Model B in Reader
iv) Two independent samples, one of size 30 from beta students and one of size 30 from alpha students; sample sizes of column variable given: Model C in Reader
i) Given row and column totals (only for 2x2 tables)
Under H0 of no relationship:
the number N11 has the same distribution as the number of men in a random sample of 30 students drawn without replacement from a set of 60 students of which 40 are men.
Which distribution is this?
N11 ~ hypergeometric(60,40,30)
(R: parameters 40,(60-40),30)
Expectation of this distribution? 40*30/60 = 20
PH0(N11 >= 23) = probability of drawing 23 or more men in a random sample of 30 students drawn without replacement from a set of 60 students of which 40 are men: exactly computable → exact test
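The hypergeometric reasoning above can be checked numerically. The following is a minimal sketch in R, using the parameterization (40, 20, 30) mentioned on the slide; the variable names are illustrative and not taken from the Reader.

```r
# Hypergeometric model for N11 under H0, conditional on the marginals:
# 40 men, 20 women, and 30 students drawn (the beta column).
men <- 40
women <- 20
draws <- 30

k <- 0:draws
pmf <- dhyper(k, men, women, draws)   # P(N11 = k); zero outside 10..30

expected_n11 <- sum(k * pmf)          # should equal 40 * 30 / 60
p_tail <- sum(pmf[k >= 23])           # P(N11 >= 23)

print(expected_n11)
print(p_tail)
```

Summing the probability mass function over 23, ..., 30 is the same quantity as `1 - phyper(22, 40, 20, 30)`.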
H0: no relationship between gender and study type
H1: beta studies are more for men than for women
i) Given row and column totals (only for 2x2 tables)
Fisher’s exact test:
Test statistic: N11
Reject H0 if, conditionally on the marginals, PH0(N11 >= observed value) is too small
In R:
> 1-phyper(22,40,20,30)
[1] 0.08511085
Fisher test for 2x2 tables in R (continued)
# Alternative: Fisher’s exact test with fisher.test:
> ab=matrix(c(23,17,7,13),2,2,byrow=T)
> dimnames(ab) = list(c("M","W"),c("beta","alpha"))
> fisher.test(ab,alternative="greater")
Fisher's Exact Test for Count Data
data: ab
p-value = 0.08511
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
0.8634461 Inf
sample estimates:
odds ratio
2.47347
Conclusion: are “beta”-studies more for men than for women?
If we choose alpha=0.05 then, given the marginals, the null hypothesis that the frequencies in the table are due to chance is not rejected (p-value 0.085 > 0.05).
The other 3 situations are handled with an approximate test (later):
ii) One sample of size 60 from students; total sample size given = Model A in Reader
iii) Two independent samples, one of size 40 from male students and one of size 20 from female students; sample sizes of row variable given = Model B in Reader
iv) Two independent samples, one of size 30 from beta students and one of size 30 from alpha students; sample sizes of column variable given = Model C in Reader
General
Categorical data of two variables summarized in a k x r contingency table
Goal: 1) investigate the relationship between the variables A and B
2) further investigation
1) Investigate relationship - models
The statistical model for the Nij depends on how the data were collected
Model A: 1 sample; Model B: k indep. samples; Model C: r indep. samples
Important ingredient: pij = probability that an object is in both Ai and Bj
Which marginals are fixed? Which relationship(s) hold for the pij? Which distribution for the data?
Model A: total sample size n fixed; 1 multinomial
Model B: row marginals Ni. fixed; k multinomials
Model C: column marginals N.j fixed; r multinomials
1) Investigate relationship - null hypotheses
Model A: 1 sample; Model B: k indep. samples, Model C: r indep. samples
H0: no relationship between variables A and B
becomes
Model A: H0A : A and B independent
or
Model B: H0B : k samples from identical r-nomial distributions
or
Model C: H0C : ….
1) Investigate relationship - testing
Model A: 1 sample; Model B: k indep. samples, Model C: r indep. samples
For different models: which test statistic, which distribution under H0 ?
Will use
1) Investigate relationship – test statistics
Model A: 1 sample
Test statistic converges in distribution (→) for n large
Model B: k independent samples
Test statistic converges in distribution (→) for the sample sizes Ni. large
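The formulas on this slide did not survive extraction. As a hedged reconstruction, assuming the standard Pearson form for multinomial counts with known cell probabilities (and, for Model B, that the pij of sample i sum to 1 over j), the two statistics and their limits would read as follows. The notation is an assumption and may differ from the Reader.

```latex
\text{Model A:}\quad
\sum_{i=1}^{k}\sum_{j=1}^{r} \frac{(N_{ij} - n\,p_{ij})^2}{n\,p_{ij}}
\;\xrightarrow{\;d\;}\; \chi^2_{kr-1}
\qquad (n \text{ large})

\text{Model B:}\quad
\sum_{i=1}^{k}\sum_{j=1}^{r} \frac{(N_{ij} - N_{i\cdot}\,p_{ij})^2}{N_{i\cdot}\,p_{ij}}
\;\xrightarrow{\;d\;}\; \chi^2_{k(r-1)}
\qquad (N_{i\cdot} \text{ large for each } i)
```

Under this reading, subtracting the number of estimated parameters from these degrees of freedom gives (k-1)(r-1) in both models.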
1) Investigate relationship – unknown parameters
But … pij unknown: use MLEs based on the cell frequencies Nij
Model A: MLEs under H0; (k-1) + (r-1) parameters estimated
Model B: MLEs under H0; (r-1) parameters estimated
Plugging these estimators into the test statistics, we get ….
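The estimators themselves are missing from this slide. A plausible reconstruction, assuming the usual maximum likelihood estimators under H0, is given below; it agrees with the way the cell probabilities are estimated in Examples 2 and 3 of these slides (for instance (115/413)*(93/413) in Example 3). The symbols are assumptions, not necessarily the Reader's notation.

```latex
\text{Model A, under } H_0:\; p_{ij} = p_{i\cdot}\,p_{\cdot j}: \qquad
\hat{p}_{i\cdot} = \frac{N_{i\cdot}}{n}, \qquad
\hat{p}_{\cdot j} = \frac{N_{\cdot j}}{n}

\text{Model B, under } H_0:\; p_{1j} = \dots = p_{kj} = p_j: \qquad
\hat{p}_{j} = \frac{N_{\cdot j}}{n}
```

Because the estimated row probabilities sum to 1, and likewise the column probabilities, Model A has (k-1) + (r-1) free parameters and Model B has (r-1), matching the counts stated above.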
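Test statistic
X2 = sum over all cells (i,j) of (Nij - Ehat_ij)^2 / Ehat_ij, with Ehat_ij = Ni. N.j / n
→ chi-square distribution with (k-1)(r-1) degrees of freedom, for large samples
for all three models!
This is the chi-square test
- Rejects for large values of X2
- Only two-sided alternatives
- Rule of thumb: chi-square approximation appropriate if … (a condition on the estimated expected cell frequencies, checked in Examples 2 and 3)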
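To make the plug-in statistic concrete, the sketch below computes X2 by hand for the 2x2 table of Example 1, assuming the usual estimated expected counts Ni. N.j / n. Variable names are illustrative. No continuity correction is applied here, so the value differs from the default `chisq.test` output for 2x2 tables.

```r
# 2x2 table from Example 1 (rows: M, W; columns: beta, alpha)
ab <- matrix(c(23, 17, 7, 13), nrow = 2, byrow = TRUE,
             dimnames = list(c("M", "W"), c("beta", "alpha")))

n <- sum(ab)
expected <- outer(rowSums(ab), colSums(ab)) / n   # Ni. * N.j / n
x2 <- sum((ab - expected)^2 / expected)           # Pearson statistic, uncorrected
df <- (nrow(ab) - 1) * (ncol(ab) - 1)             # (k-1)(r-1)
p_value <- pchisq(x2, df, lower.tail = FALSE)     # reject for large X2

print(expected)
print(c(X2 = x2, df = df, p = p_value))
```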
Example 1
Question 1. Are “beta”-studies more for men than for women?
Data: 60 students categorized as follows, with marginals
        beta  alpha  Total
M         23     17     40
W          7     13     20
Total     30     30     60
H0: no relationship between gender and study type
H1: there is a relationship between gender and study type (now different: two-sided!)
For models A, B, C: chi-square test
> chisq.test(ab)
Pearson's Chi-squared test with Yates' continuity correction
data: ab
X-squared = 1.875, df = 1, p-value = 0.1709
Note 1: p-value = 0.1709 is about twice that of the one-sided Fisher’s exact test (0.0851), because the alternative hypothesis for the chi-square test is (always) two-sided.
Note 2: For 2x2 tables the R function chisq.test applies (Yates') continuity correction to bring the result closer to the exact p-value. Check this by applying the function with argument `correct=F` to see what happens when the correction is not used; next apply the function with argument `simulate.p.value=T` to obtain a p-value close to the exact p-value.
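The check suggested in Note 2 can be carried out as in the sketch below. The seed and the number of simulated tables B are arbitrary choices made here for reproducibility; they are not from the slides.

```r
ab <- matrix(c(23, 17, 7, 13), nrow = 2, byrow = TRUE,
             dimnames = list(c("M", "W"), c("beta", "alpha")))

with_yates <- chisq.test(ab)                   # default for 2x2: Yates' correction
no_yates   <- chisq.test(ab, correct = FALSE)  # plain Pearson statistic

set.seed(1)                                    # arbitrary seed (assumption)
simulated  <- chisq.test(ab, simulate.p.value = TRUE, B = 10000)

print(c(yates = with_yates$p.value,
        plain = no_yates$p.value,
        simulated = simulated$p.value))
```

The simulated p-value is a Monte Carlo estimate conditional on the marginals, so it varies slightly with the seed.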
Example 2
Question 2. Does frequency of nucleotides in DNA depend on the
position in the DNA sequence?
Data: 100 DNA-sequences of length 5
of which nucleotide sequence is known
pos       1    2    3    4    5  Total
A        33   34   19   20   21    127
G        22   27   23   24   21    117
C        31   18   34   30   25    138
T        14   21   24   26   33    118
Total   100  100  100  100  100    500
Which model is most appropriate?
5 independent samples, one for each category of the column variable
= Model C with null hypothesis of 5 identical 4-nomial samples
H0: for each nucleotide, the probability of occurrence is the same at each position (= sample)
H1: for at least one nucleotide, the probability of occurrence differs between at least two positions
Check rule of thumb
Estimates of cell probabilities under Model C and H0:
phat1 = 127/500 = 0.254, phat2 = 117/500 = 0.234,
phat3 = 138/500 = 0.276, phat4 = 118/500 = 0.236.
Estimated expected cell frequencies under H0:
pos       1     2     3     4     5
A      25.4  25.4  25.4  25.4  25.4
G      23.4  23.4  23.4  23.4  23.4
C      27.6  27.6  27.6  27.6  27.6
T      23.6  23.6  23.6  23.6  23.6
From these numbers: rule of thumb for applying chi-square test OK.
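The expected frequencies above can be reproduced with a few lines of R. This is a sketch: the slides use an object called `dna1` without showing how it was built, so the matrix construction below is an assumption based on the data table.

```r
# Counts from Example 2: rows = nucleotides, columns = positions 1..5
dna1 <- matrix(c(33, 34, 19, 20, 21,
                 22, 27, 23, 24, 21,
                 31, 18, 34, 30, 25,
                 14, 21, 24, 26, 33),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("A", "G", "C", "T"), as.character(1:5)))

p_hat <- rowSums(dna1) / sum(dna1)        # estimated nucleotide probabilities under H0
expected <- outer(p_hat, colSums(dna1))   # p_hat_i * N.j, with every N.j equal to 100

print(round(expected, 1))
print(min(expected))                      # smallest expected count, for the rule of thumb
```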
Chi-square test for homogeneity of samples:
> chisq.test(dna1)
Pearson's Chi-squared test
data: dna1
X-squared = 23.4967, df = 12, p-value = 0.02379
Conclusion: if we choose alpha=0.05, then the null hypothesis “for each nucleotide the probability of occurrence is the same at each position” is rejected.
Based on these data we may conclude that type of nucleotide is not independent of position.
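For completeness, a self-contained sketch that reproduces this test; the construction of `dna1` is again an assumption based on the data table, and `reject` is an illustrative name.

```r
dna1 <- matrix(c(33, 34, 19, 20, 21,
                 22, 27, 23, 24, 21,
                 31, 18, 34, 30, 25,
                 14, 21, 24, 26, 33),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("A", "G", "C", "T"), as.character(1:5)))

res <- chisq.test(dna1)        # no continuity correction: table is larger than 2x2
reject <- res$p.value < 0.05   # decision at level alpha = 0.05

print(res$statistic)           # Pearson X-squared
print(res$parameter)           # degrees of freedom (4-1)*(5-1)
print(res$p.value)
print(reject)
```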
Example 3
Question 3. Are nucleotides in DNA at consecutive positions
dependent?
Data: 413 pairs of consecutive nucleotides
          A    G    C    T  Total
A        35   35   27   18    115
G        22   29   20   26     97
C        30   18   38   38    124
T        14   11   28   24     77
Total   101   93  113  106    413
Which model is most appropriate?
1 sample of size 413
= Model A with null hypothesis that type of nucleotide on first position is
independent of type of nucleotide on second position
H0: nucleotide type at first and second position are independent
H1: nucleotide type at first and second position are dependent
Check rule of thumb
Estimates of cell probabilities under Model A and H0:
phat1.*phat.2 = (115/413)*(93/413) ≈ 0.0627; estimate of EN12 = 115*93/413 ≈ 25.90,
phat4.*phat.3 = (77/413)*(113/413) ≈ 0.0510; estimate of EN43 = 77*113/413 ≈ 21.07, etc.
Estimated expected cell frequencies under H0:
          A      G      C      T
A     28.12  25.90  31.46  29.52
G     23.72  21.84  26.54  24.90
C     30.32  27.92  33.93  31.83
T     18.83  17.34  21.07  19.76
From these numbers: rule of thumb for applying chi-square test OK.
Chi-square test for independence:
> chisq.test(dna2)
Pearson's Chi-squared test
data: dna2
X-squared = 26.1018, df = 9, p-value = 0.001966
Conclusion: for alpha=0.05 the null hypothesis “row and column variable independent” is rejected.
Based on these data we may conclude that nucleotides at consecutive positions are not independent.
What does this mean for the chi-square test in Example 2?
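The Example 3 computations can be reproduced as sketched below. The slides do not show how `dna2` was created, so the matrix construction is an assumption based on the table of 413 pairs.

```r
# Counts from Example 3: 413 pairs of consecutive nucleotides
dna2 <- matrix(c(35, 35, 27, 18,
                 22, 29, 20, 26,
                 30, 18, 38, 38,
                 14, 11, 28, 24),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("A", "G", "C", "T"), c("A", "G", "C", "T")))

res <- chisq.test(dna2)

print(round(res$expected, 2))   # estimated expected counts Ni. * N.j / n
print(res$statistic)
print(res$parameter)            # (4-1)*(4-1) degrees of freedom
print(res$p.value)
```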
2) Further investigation - which categories involved in relationship
Method 1. Per cell: compare the estimate of pij not under H0 with the estimate of pij under H0 (Models A and B each have their own pair of estimates)
Difference large: cell extreme
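The estimates compared in Method 1 were lost in extraction. A hedged reconstruction, assuming relative frequencies as the unrestricted estimates and the usual MLEs under H0, would be the following; the notation is an assumption and may differ from the Reader.

```latex
\text{Model A:}\quad
\frac{N_{ij}}{n}
\quad\text{with}\quad
\hat{p}_{i\cdot}\,\hat{p}_{\cdot j} = \frac{N_{i\cdot}}{n}\cdot\frac{N_{\cdot j}}{n}

\text{Model B:}\quad
\frac{N_{ij}}{N_{i\cdot}}
\quad\text{with}\quad
\hat{p}_{j} = \frac{N_{\cdot j}}{n}
```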
Method 2. Per cell: compare Nij with its estimate under H0
Residual: Nij minus its estimated expectation under H0 (observed - expected); same for all 3 models
Normalized residuals: Cij , Vij , Uij , Vij
All, conditionally on the marginals and for large samples, approximately N(0,1)
Value in tail of N(0,1): cell extreme
Method 3. For the whole table at once: for large tables (kr large), determine outliers of the empirical distribution of the (normalized) residuals of all cells
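The definitions of the normalized residuals are not preserved here. Assuming that Cij denotes the contributions and Uij the standardized residuals, the following forms reproduce the values reported for Example 2 in these slides (for cell (A,1): 7.6/sqrt(25.4) is about 1.51, and the standardized version is about 1.95). The Vij are not reconstructed.

```latex
\hat{E}_{ij} = \frac{N_{i\cdot}\,N_{\cdot j}}{n}, \qquad
C_{ij} = \frac{N_{ij} - \hat{E}_{ij}}{\sqrt{\hat{E}_{ij}}}, \qquad
U_{ij} = \frac{N_{ij} - \hat{E}_{ij}}
              {\sqrt{\hat{E}_{ij}\left(1 - \frac{N_{i\cdot}}{n}\right)\left(1 - \frac{N_{\cdot j}}{n}\right)}}
```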
2) Further investigation - bootstrap
Method 4. Bootstrap
for testing the same hypotheses with a general test statistic T, e.g.
T = X2 when the conditions of the chi-square test are not fulfilled
T = max(contributions) when interest is in detecting extreme cells
etc.
Conditionally on the marginals: Nij hypergeometric under H0 of all 3 models,
so that the table of cell frequencies has a known distribution under H0 conditionally on the marginals.
Then the distribution under H0, conditionally on the marginals, of the test statistic can be obtained by bootstrap simulation:
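Test statistic T
Bootstrap procedure for its distribution under H0: simulate tables with the observed marginals under H0 and compute T for each; the simulated values form an empirical distribution of T
If the observed value of T is in the tail of this empirical distribution, then reject H0
Note: in R, chisq.test has an option to perform the bootstrap for testing with the chi-square statistic (check yourself)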
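One way to carry out such a bootstrap is sketched below for T = max |contribution| on the Example 2 table. It uses `r2dtable` from base R to generate random tables with given marginals; the helper name `max_contribution`, the seed, and B = 1000 are illustrative choices and not part of the slides.

```r
dna1 <- matrix(c(33, 34, 19, 20, 21,
                 22, 27, 23, 24, 21,
                 31, 18, 34, 30, 25,
                 14, 21, 24, 26, 33),
               nrow = 4, byrow = TRUE)

# Test statistic: largest absolute contribution (Nij - Eij) / sqrt(Eij)
max_contribution <- function(tab) {
  e <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  max(abs(tab - e) / sqrt(e))
}

set.seed(2011)                                   # arbitrary seed (assumption)
B <- 1000
boot_tables <- r2dtable(B, rowSums(dna1), colSums(dna1))  # same marginals as dna1
t_star <- sapply(boot_tables, max_contribution)  # bootstrap values of T under H0
t_obs <- max_contribution(dna1)
p_boot <- mean(t_star >= t_obs)                  # upper-tail bootstrap p-value

print(t_obs)
print(p_boot)
```

A small `p_boot` would indicate that the most extreme cell is more extreme than expected under H0, given the marginals.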
Example 2- investigation extreme values
The hypothesis “for each nucleotide the probability of occurrence is the same at each position” was rejected.
Which categories are involved?
Observed - expected numbers under H0:
pos      1     2     3     4     5
A      7.6   8.6  -6.4  -5.4  -4.4
G     -1.4   3.6  -0.4   0.6  -2.4
C      3.4  -9.6   6.4   2.4  -2.6
T     -9.6  -2.6   0.4   2.4   9.4
Contributions compared with N(0,1); extreme at the 5% level if the absolute value exceeds 1.96 (the 0.975 quantile of N(0,1)):
pos      1      2      3      4      5
A     1.51   1.71  -1.27  -1.07  -0.87
G    -0.29   0.74  -0.08   0.12  -0.50
C     0.65  -1.83   1.22   0.46  -0.49
T    -1.98  -0.54   0.08   0.49   1.93
Extreme: -1.98 (T, position 1).
Uijs compared with N(0,1), 5% level:
pos      1      2      3      4      5
A     1.95   2.21  -1.64  -1.39  -1.13
G    -0.37   0.95  -0.11   0.16  -0.63
C     0.85  -2.40   1.60   0.60  -0.65
T    -2.53  -0.68   0.11   0.63   2.48
Extreme: 2.21 (A, position 2), -2.40 (C, position 2), -2.53 (T, position 1), 2.48 (T, position 5).
These comparisons give slightly different results, but cells (1,2), (3,2), (4,1), and (4,5) consistently stand out.
Conclusion: H0 was rejected because more A at position 2, fewer C at position 2, fewer T at position 1, and more T at position 5 seem to have been observed than expected if H0 were true.
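The residual tables can be obtained from the object returned by `chisq.test`. The sketch below assumes that the contributions correspond to its `residuals` component and the Uij to its `stdres` component, which matches the numbers shown; the construction of `dna1` is based on the data table.

```r
dna1 <- matrix(c(33, 34, 19, 20, 21,
                 22, 27, 23, 24, 21,
                 31, 18, 34, 30, 25,
                 14, 21, 24, 26, 33),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("A", "G", "C", "T"), as.character(1:5)))

res <- chisq.test(dna1)
raw <- res$observed - res$expected   # observed - expected
contrib <- res$residuals             # (observed - expected) / sqrt(expected)
u <- res$stdres                      # standardized residuals

# Cells whose standardized residual lies in the 5% tails of N(0,1)
extreme <- which(abs(u) > qnorm(0.975), arr.ind = TRUE)

print(round(raw, 1))
print(round(contrib, 2))
print(round(u, 2))
print(extreme)
```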