IZBI-Workshop, November 11, 2002, Leipzig
New Proposals for
Multiple Test Procedures,
Applied to Gene Expression Array Data
Siegfried Kropf, Otto von Guericke University Magdeburg
in cooperation with
Jürgen Läuter, Magdeburg
Peter H. Westfall, Lubbock, USA
Markus Eszlinger, Knut Krohn, Leipzig
Otto von Guericke
1602-2002
11.11.2002
Leipzig
2
Example data – gene expression arrays
Example 1
Atlas Human Cancer 1.2 Array
6 quadrants, each with 98 genes (double spotted):
588 genes + housekeeping genes
• applied to 6 patients with nodules in the thyroid gland
• 3 hot, 3 cold (not distinguished here), plus surrounding tissue
• logarithmic transformation, double spots averaged
• correction with housekeeping gene at position i5a
→ The distribution can hardly be checked with n = 6;
the standard deviations of the genes are not too different:
[Figure: histograms of the standard deviation of the 98 genes in quadrant A and of the standard deviation of all 588 genes; horizontal axis from .005 to .245]
Example 2
• 30 patients with thyroid nodules
• 15 hot nodules, 15 cold nodules
• tissue samples of nodules and surrounding tissue
• analysed with Affymetrix® Gene Chips
• signal log ratio nodule vs. surrounding from each patient for each of 12,625 genes
• outliers caught by an additional logistic transformation
• approximately multivariate normal distribution
• “similar” variances for all genes, expectation 0 if unaffected
Why familywise error rate?
• Discussion: non-statistical assessment – unadjusted statistical assessment – false discovery rate – familywise error rate
• The familywise error rate (FWE) is a rather demanding criterion, and it becomes more demanding as the dimension of the array grows (in contrast to the false discovery rate).
• If it can be attained, however, it gives the highest degree of security for the positive results of this one trial.
• Trials mostly have small or moderate sample sizes, not enough to rule out effects in case of non-significance; therefore at least the positive results should be as sure as possible.
• Results for the FWE could at least be given in addition to other versions.
Procedure with data-driven
ordering of hypotheses
Starting point:
Two well-known multiple comparison procedures (MCPs) controlling the FWE:
• Testing with a-priori ordered hypotheses (without α-adjustment)
• Bonferroni-Holm (data-dependent order, with adjustment)
In the analysis of high-dimensional gene expression arrays, neither is applicable or optimal.
→ We are looking for a method with data-dependent ordering of hypotheses but without α-adjustment.
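The two classical starting points can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names `fixed_sequence` and `holm`, the level 0.05 and the P-values are assumptions chosen for the example.

```python
def fixed_sequence(pvalues, alpha=0.05):
    """A-priori ordered testing: each hypothesis is tested at the full
    level alpha, in the given order, until the first non-significance."""
    rejected = []
    for i, p in enumerate(pvalues):
        if p > alpha:
            break  # all later hypotheses are retained
        rejected.append(i)
    return rejected


def holm(pvalues, alpha=0.05):
    """Bonferroni-Holm: order by P-value (data dependent) and compare the
    k-th smallest P-value with alpha / (m - k + 1)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = []
    for step, i in enumerate(order):
        if pvalues[i] > alpha / (m - step):
            break
        rejected.append(i)
    return rejected


pvals = [0.030, 0.001, 0.040, 0.200]  # illustrative values only
print(fixed_sequence(pvals))  # [0, 1, 2]
print(holm(pvals))            # [1]
```

With a favourable fixed order, the unadjusted sequence rejects three hypotheses where Holm rejects one. The order must not be chosen by looking at the P-values, though, which is the gap the data-driven ordering is meant to fill.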
New proposal (Kropf, 2000; Kropf and Läuter, 2002)
Consider the one-sample situation first:
data matrix from n iid p-dimensional normal data vectors
$$
X = \begin{pmatrix} x_{(1)}' \\ \vdots \\ x_{(n)}' \end{pmatrix}
  = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix},
\qquad x_{(j)} \sim N_p(\mu, \Sigma) \ (j = 1, \dots, n),
\qquad \mu = (\mu_i), \ \Sigma = (\sigma_{ij})
$$
Aim: test of the local hypotheses $H_i: \mu_i = 0$ with strong control of the FWE at level $\alpha$.
Procedure I:
• sort the variables by decreasing values of $w_{ii} = \sum_{j=1}^{n} x_{ji}^2$,
• in that order, carry out the unadjusted one-sample t tests for the variables as long as significance is attained.
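The two steps of Procedure I can be sketched as follows. This is only an illustration under stated assumptions: the function name `procedure_one` and the toy data are hypothetical, every column is assumed to have nonzero variance, and the two-sided critical value of the t distribution with n − 1 degrees of freedom is passed in by the caller (3.182 for 3 degrees of freedom at the 5% level) so that the sketch needs only the standard library.

```python
import math


def procedure_one(X, t_crit):
    """X: list of n rows with p values each.
    t_crit: two-sided critical value of t with n - 1 degrees of freedom.
    Returns the column indices declared significant, in testing order."""
    n, p = len(X), len(X[0])
    cols = [[row[i] for row in X] for i in range(p)]
    # Step 1: order the variables by decreasing sum of squares w_ii.
    w = [sum(v * v for v in col) for col in cols]
    order = sorted(range(p), key=lambda i: w[i], reverse=True)
    # Step 2: unadjusted one-sample t tests in that order, stop at first failure.
    selected = []
    for i in order:
        mean = sum(cols[i]) / n
        var = sum((v - mean) ** 2 for v in cols[i]) / (n - 1)
        t = mean / math.sqrt(var / n)
        if abs(t) <= t_crit:
            break
        selected.append(i)
    return selected


# Toy data (n = 4, p = 3); values are illustrative only.
X = [
    [2.1, 0.1, 1.0],
    [1.9, -0.2, 1.3],
    [2.2, 0.3, 0.7],
    [1.8, -0.1, 1.1],
]
print(procedure_one(X, t_crit=3.182))  # [0, 2]
```

The ordering uses only the sums of squares, and each test is carried out at the full, unadjusted level.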
Remark:
In order to yield an efficient order of variables, the variances of the variables should be approximately equal, because with
$$
\bar{x}_i = \frac{1}{n} \sum_{j=1}^{n} x_{ji}, \qquad
s_i^2 = \frac{1}{n-1} \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2 \qquad \text{and} \qquad
t_i = \frac{\sqrt{n}\, \bar{x}_i}{s_i}
$$
we have
$$
w_{ii} = \sum_{j=1}^{n} x_{ji}^2 = n \bar{x}_i^2 + \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2 = s_i^2 \, (t_i^2 + n - 1) .
$$
Thus, approximately equal variances are important for a high power of the procedure.
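The identity above is easy to confirm numerically. The following sketch uses arbitrary illustrative numbers and a hypothetical helper name, `sum_of_squares_identity`; it computes both sides for a single variable.

```python
import math
import statistics


def sum_of_squares_identity(x):
    """Return (w, s^2 * (t^2 + n - 1)) for one variable x_1, ..., x_n."""
    n = len(x)
    w = sum(v * v for v in x)          # w_ii, sum of squares about zero
    mean = statistics.mean(x)
    s2 = statistics.variance(x)        # sample variance, denominator n - 1
    t = math.sqrt(n) * mean / s2 ** 0.5
    return w, s2 * (t * t + n - 1)


x = [0.31, -0.12, 0.45, 0.27, 0.08, 0.39]  # illustrative values, n = 6
w, rhs = sum_of_squares_identity(x)
print(math.isclose(w, rhs))  # True
```

For equal variances, a large sum of squares therefore corresponds to a large |t|, which is why the ordering tends to put the promising variables first.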
Proof that the procedure keeps the FWE (draft):
The univariate t tests with the single variables are considered as special cases of the stabilized multivariate tests with scores $z_j = d' x_{(j)}$.
The weight vectors are $d = (d_1, \dots, d_p)'$ with
$$
d_i = \begin{cases}
1 & \text{if } \sum_{j=1}^{n} x_{ji}^2 = \max_{l = 1, \dots, p} \Big( \sum_{j=1}^{n} x_{jl}^2 \Big) \\
0 & \text{else}
\end{cases}
$$
That alone would, however, only ensure the multiple error rate under the global hypothesis (regardless of variances).
Additionally, we have special criteria for ordering and special tests: the non-null variables do not disturb the behaviour of the null variables; their influence is only conservative.
Example I: Comparison nodules vs. surrounding
(3 hot and 3 cold nodules together → one-sample test vs. 0)
Quadrant A only (98 genes, 2 spots averaged, corrected with housekeeping genes)

gene no.   sum of squares   unadjusted P-value
56         0.6020           0.0017
71         0.5481           0.0390
1          0.5334           0.0012
78         0.4551           0.0075
88         0.3629           0.0363
29         0.3573           0.0036
19         0.3456           0.0263
10         0.3454           0.0043
57         0.3451           0.0092
54         0.3430           0.0460
31         0.3052           0.0594
8          0.3023           0.0462
...        ...              ...

# locally significant genes:             33
# significant genes, Holm's procedure:    0
# significant genes, Westfall-Young:      0
# significant genes, Procedure I:        10
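The count of 10 genes for Procedure I can be retraced from the twelve listed rows. The sketch below assumes a level of α = 0.05, which the slide does not state explicitly; the helper name `genes_until_first_failure` is hypothetical.

```python
# (gene no., sum of squares, unadjusted P-value), as listed for quadrant A
rows = [
    (56, 0.6020, 0.0017), (71, 0.5481, 0.0390), (1, 0.5334, 0.0012),
    (78, 0.4551, 0.0075), (88, 0.3629, 0.0363), (29, 0.3573, 0.0036),
    (19, 0.3456, 0.0263), (10, 0.3454, 0.0043), (57, 0.3451, 0.0092),
    (54, 0.3430, 0.0460), (31, 0.3052, 0.0594), (8, 0.3023, 0.0462),
]


def genes_until_first_failure(rows, alpha=0.05):
    """Walk the rows by decreasing sum of squares and keep genes until the
    first unadjusted P-value exceeds alpha (assumed level)."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    kept = []
    for gene, _, p in ordered:
        if p > alpha:
            break
        kept.append(gene)
    return kept


selected = genes_until_first_failure(rows)
print(len(selected), selected)
# 10 [56, 71, 1, 78, 88, 29, 19, 10, 57, 54]
```

The procedure stops at gene 31 (P = 0.0594), so gene 8 is not reached although its P-value of 0.0462 lies below 0.05. For comparison, Holm's first threshold for 98 genes would be 0.05/98 ≈ 0.00051, which is below every P-value listed here.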
Example II: 15 cold nodules vs. surrounding (one-sample problem)
no.   gene   P value
1     6746   2 · 10^-4
2     6567   2 · 10^-4
3     848    0.38
4     3568   5 · 10^-3
5     6257   0.34
6     6518   0.04
7     8104   1 · 10^-6
8     8135   5 · 10^-5
9     6919   0.21
...   ...    ...

For comparison (number of significant genes):
without any adjustment: 1064
Bonferroni/Holm: 1 (gene 8104)
Westfall/Young: 0

The present procedure stops already after the 2nd gene.
The basic trend for sums of squares is present, but the procedure is sensitive to disturbances.
It should be smoothed (see below, hybridisation with Bonferroni/Holm).
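The contrast between the two approaches can be retraced for the nine listed genes. The sketch assumes α = 0.05 and takes the 12,625 genes of Example 2 as the number of hypotheses; only the listed rows are used, so it says nothing about the remaining genes.

```python
from itertools import takewhile

ALPHA = 0.05      # assumed level
N_GENES = 12625   # number of genes in Example 2

# (gene, unadjusted P-value) in the order used by the procedure
listed = [
    (6746, 2e-4), (6567, 2e-4), (848, 0.38), (3568, 5e-3), (6257, 0.34),
    (6518, 0.04), (8104, 1e-6), (8135, 5e-5), (6919, 0.21),
]

# Data-driven ordering: unadjusted tests until the first non-significance.
sequential = [gene for gene, p in takewhile(lambda r: r[1] <= ALPHA, listed)]

# Bonferroni: every P-value is compared with alpha / number of genes.
bonferroni = [gene for gene, p in listed if p <= ALPHA / N_GENES]

print(sequential)  # [6746, 6567]
print(bonferroni)  # [8104]
```

A single unfavourable P-value at position 3 blocks all later genes, including gene 8104, which passes even the Bonferroni bound of about 4 · 10^-6. This is the sensitivity to disturbances noted on the slide.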
Simulation experiments
Simulation experiments guided by example I:
n = 6,...,33 cases, p = 98 variables, normally distributed, variance 1, pairwise correlation 0.5,
expectation 0 for 88 variables, nonzero expectation μ for the other 10 variables (specified in terms of 12/n)
[Figure: average number of significant genes in Monte Carlo replications (0 to 10) versus sample size n (5 to 35) for unadjusted tests, Procedure I, Westfall/Young and Holm's procedure; plot labels A, B, C]
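Data of the kind used in such a simulation (unit variance, common pairwise correlation) can be generated with a shared random factor. This is a sketch under assumptions, not the simulation code behind the figure: the function name `simulate_equicorrelated`, the seed and the shift of 0.5 for the 10 non-null variables are placeholders.

```python
import math
import random


def simulate_equicorrelated(n, p, rho, means, rng):
    """Draw n rows of p normal variables with variance 1 and pairwise
    correlation rho (0 <= rho <= 1) via one shared factor per row."""
    rows = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        rows.append([
            means[i]
            + math.sqrt(rho) * shared
            + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for i in range(p)
        ])
    return rows


rng = random.Random(2002)           # arbitrary seed for reproducibility
means = [0.0] * 88 + [0.5] * 10     # 0.5 is a placeholder shift, not from the slides
X = simulate_equicorrelated(6, 98, 0.5, means, rng)
print(len(X), len(X[0]))  # 6 98
```

Each variable is the sum of a common component with variance rho and an individual component with variance 1 − rho, which gives variance 1 and covariance rho between any two variables.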
Extensions:
• Other test problems:
  – particularly comparison of two or more independent samples
    • ordering by sums of squares $w_{ii} = \sum_{k=1}^{K} \sum_{j=1}^{n_k} (x_{kji} - \bar{x}_i)^2$, i.e., related to the variablewise total mean of all samples,
    • then two-sample t tests or one-way ANOVA.
• Other subsets of variables (e.g., pairs of variables) → Kropf, Läuter (2002)
• “Distribution-free” version possible
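For several independent samples, the ordering criterion is the total sum of squares about the variablewise mean of all samples pooled. A minimal sketch, with a hypothetical function name `total_sums_of_squares` and made-up toy data:

```python
def total_sums_of_squares(samples):
    """samples: list of K groups, each a list of rows with p values.
    Returns w_ii for every variable i, computed about the total mean of
    all groups pooled (group membership is not used)."""
    pooled = [row for group in samples for row in group]
    n_total, p = len(pooled), len(pooled[0])
    w = []
    for i in range(p):
        col = [row[i] for row in pooled]
        total_mean = sum(col) / n_total
        w.append(sum((v - total_mean) ** 2 for v in col))
    return w


# Two groups, two variables; illustrative values only.
group1 = [[1.0, 0.0], [3.0, 0.2]]
group2 = [[5.0, -0.1], [7.0, 0.1]]
w = total_sums_of_squares([group1, group2])
order = sorted(range(len(w)), key=lambda i: w[i], reverse=True)
print(order)  # [0, 1]
```

The group labels enter only in the second step, when the two-sample t tests or the one-way ANOVA are carried out in this order.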
Weighted procedure (Procedure II)
In notation of the one-sample problem (Westfall, Kropf, Finos, 2002)
• Calculate the P-values $p_i$ (i = 1, …, p) of the usual unadjusted one-sample t test for each of the p variables.
• For each variable, determine the sum of squares $w_{ii} = \sum_{j=1}^{n} x_{ji}^2$ and the weight $g_i = w_{ii}^{\eta}$ for fixed $\eta \ge 0$.
• Calculate the weighted P-values $q_i = p_i / g_i$ and order the variables by increasing values of them.
• Then the hypothesis $H_{(j)}$ for the jth ordered variable is rejected iff
$$
q_{(i)} \le \frac{\alpha}{\sum_{h:\, x_h \in S_i} g_h} \qquad \forall\, i = 1, \dots, j .
$$
$S_i$: the ith ordered variable and all subsequent ones.
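The steps of the weighted procedure can be sketched as follows, taking the unadjusted P-values and the sums of squares as given inputs. The function name `weighted_procedure`, the level 0.05 and the toy numbers are assumptions made for illustration.

```python
def weighted_procedure(pvalues, sums_of_squares, eta, alpha=0.05):
    """Weighted step-down procedure with weights g_i = w_ii ** eta.
    Returns the indices of the rejected hypotheses in testing order."""
    m = len(pvalues)
    g = [w ** eta for w in sums_of_squares]
    q = [p / gi for p, gi in zip(pvalues, g)]     # weighted P-values
    order = sorted(range(m), key=lambda i: q[i])  # increasing q
    # tail[k] = sum of weights of the k-th ordered variable and all later ones
    tail = [0.0] * (m + 1)
    for k in reversed(range(m)):
        tail[k] = tail[k + 1] + g[order[k]]
    rejected = []
    for k, i in enumerate(order):
        if q[i] > alpha / tail[k]:
            break  # stop at the first violated condition
        rejected.append(i)
    return rejected


pvals = [0.04, 0.001, 0.03]   # illustrative unadjusted P-values
w = [4.0, 1.0, 2.0]           # illustrative sums of squares
print(weighted_procedure(pvals, w, eta=0))   # [1]
print(weighted_procedure(pvals, w, eta=20))  # [0, 2, 1]
```

With eta = 0 all weights equal 1 and the result is that of Holm's procedure. With a large eta the order follows the sums of squares and each comparison is made almost at the unadjusted level.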
How does this procedure fit to the others above?
• Procedure II utilises ideas from Bonferroni/Holm (fixed weights) as well as from Procedure I (data-driven through $w_{ii}$).
• η = 0, $g_i = w_{ii}^0 = 1$: Then the procedure is identical to the usual unweighted Bonferroni/Holm.
• η → ∞: According to Westfall and Krishen (2001), the influence of the weights completely overrides that of the P-values from Bonferroni-Holm; the critical function converges to that of Procedure I.
• Intermediate values of η: both parts are present; the “power assumption” of equal variances is only important for the Procedure I part.
In an application, η has to be fixed in advance!
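The two limiting cases can be made explicit from the rejection rule of Procedure II. The following is a worked sketch, assuming distinct sums of squares $w_{hh}$ and strictly positive P-values, so that for large η the order of the weighted P-values is determined by the sums of squares alone.

```latex
% Rejection rule of Procedure II, rewritten in terms of the unadjusted P-values:
%   q_{(i)} = p_{(i)} / g_{(i)} \le \alpha / \sum_{h \in S_i} g_h
\[
p_{(i)} \;\le\; \alpha \, \frac{g_{(i)}}{\sum_{h \in S_i} g_h}
        \;=\; \frac{\alpha}{\sum_{h \in S_i} \left( w_{hh} / w_{(i)(i)} \right)^{\eta}} .
\]
% Case eta = 0: every ratio equals 1 and S_i contains p - i + 1 variables,
% which gives the Bonferroni-Holm bounds.
\[
\eta = 0: \qquad p_{(i)} \;\le\; \frac{\alpha}{p - i + 1} .
\]
% Case eta -> infinity: the variables are ordered by decreasing w, so
% w_{hh} / w_{(i)(i)} < 1 for all other h in S_i and those terms vanish;
% only the term for h = (i) remains.
\[
\eta \to \infty: \qquad \sum_{h \in S_i} \left( \frac{w_{hh}}{w_{(i)(i)}} \right)^{\eta} \to 1,
\qquad\text{so}\qquad p_{(i)} \;\le\; \alpha .
\]
```

In the second case each variable is tested at the unadjusted level α in the order of decreasing sums of squares, which is the rule of Procedure I.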
Example 2 again
Cold nodules vs. surrounding
unadjusted: 1064, Westfall/Young: 0
[Figure: number of significant genes (0 to 14) as a function of the parameter η, ranging from Bonferroni/Holm (B/H) to Procedure I (Pr. I)]
Is the choice of genes stable?
[Figure: selected genes as a function of η, from B/H to Pr. I; genes labelled in the plot: 3568, 9685, 7704, 8137, 4682, 825, 7603, 6746, 6567, 8135, 5786, 7465, 8104]
Example 2, cont.
hot nodules vs. surrounding: unadjusted 2597, Westfall/Young 93
hot vs. cold nodules: unadjusted 1290, Westfall/Young 2
[Figures: number of significant genes as a function of the parameter η for both comparisons, ranging from Bonferroni/Holm (B/H) to Procedure I (Pr. I)]
Simulation experiments with weighted procedure guided by example II
• p = 12,625 variables
• n = 4, 6, 8, 12, 16, 20, 30, 50, 100
• number of significant genes: 10, 100, 1000
• pairwise correlation coefficient 0 or 0.5
• heterogeneity of variances in 5 levels

• The influence of the pairwise correlation on the optimal choice of η is small; the number of significant genes is also not so important.
• The sample size is influential (and known in practice).
• The heterogeneity of variances is important, too, but not known in practice; estimating it through $w_{ii}$ ensures only weak FWE control.
Summary
• A new technique for multiple testing with data-dependent ordering of hypotheses is proposed.
• It keeps the FWE in the strong sense for arbitrary multivariate normal data.
• In order to provide high power, the variables should have approximately equal variances.
• The proposal is advantageous in very small samples of high-dimensional data.
• The method is sensitive to disturbances.
• Westfall's proposal of the weighted procedure establishes a link between the above procedure and the Bonferroni-Holm method and smooths out these disturbances.
• The weighted procedure is a real alternative to existing analysis techniques for microarray data; the open problem is a suitable choice of η.
References
Fang, K.-T. and Zhang, Y.-T., 1990: General Multivariate Analysis. Science Press Beijing and Springer-Verlag Berlin Heidelberg.
Kropf, S., 2000: Hochdimensionale multivariate Verfahren in der medizinischen Statistik. Shaker Verlag, Aachen.
Kropf, S. and Läuter, J., 2002: Multiple Tests for Different Sets of Variables Using a Data-Driven Ordering of Hypotheses, with an Application to Gene Expression Data. Biometrical Journal 44, no. 7.
Läuter, J., 1996: Exact t and F Tests for Analysing Studies with Multiple Endpoints. Biometrics 52, 964-970.
Läuter, J., Glimm, E. and Kropf, S., 1998: Multivariate Tests Based on Left-Spherically Distributed Linear Scores. Annals of Statistics 26, 1972-1988. Erratum: Annals of Statistics 27, 1441.
Westfall, P.H., Kropf, S. and Finos, L., 2002: Weighted FWE-controlling methods in high-dimensional situations. Submitted for IMS Philadelphia companion volume.
Westfall, P.H. and Krishen, A., 2001: Optimally weighted, fixed sequence, and gatekeeping multiple testing procedures. Journal of Statistical Planning and Inference 99, 25-40.
Westfall, P.H. and Young, S.S., 1993: Resampling-Based Multiple Testing. John Wiley & Sons, New York.