Testing statistical significance scores of sequence comparison methods with structure similarity
Tim Hulsen
NCMLS PhD Two-Day Conference
2006-04-27
Introduction
• Sequence comparison: important for finding similar proteins (homologs) of a protein with unknown function
• Algorithms: BLAST, FASTA, Smith-Waterman
• Statistical scores: E-value (standard), Z-value
E-value or Z-value?
• Smith-Waterman sequence comparison with Z-value statistics: 100 randomized shuffles to test the significance of the SW score
[Figure: the original sequence (e.g. MFTGQEYHSV) is shuffled 100 times (GQHMSVFTEY, YMSHQFTVGE, etc.); the SW scores of the shuffled sequences form a random-score distribution, and an original score lying 5 standard deviations above the mean of that distribution gives Z = 5.]
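
To make the procedure concrete, here is a minimal Python sketch of the shuffle-based Z-value, assuming a hypothetical sw_score(a, b) callable that returns the Smith-Waterman score of two sequences (the slides' 2 x 100 randomizations shuffle both sequences; this sketch shuffles only the query for brevity):

    import random
    import statistics

    def z_value(query, target, sw_score, n_shuffles=100, seed=0):
        """Shuffle-based Z-value: how many standard deviations the real
        SW score lies above the mean score of shuffled sequences."""
        rng = random.Random(seed)
        original = sw_score(query, target)
        shuffled_scores = []
        for _ in range(n_shuffles):
            residues = list(query)
            rng.shuffle(residues)  # randomize residue order, keep composition
            shuffled_scores.append(sw_score("".join(residues), target))
        mean = statistics.mean(shuffled_scores)
        sd = statistics.stdev(shuffled_scores)
        return (original - mean) / sd  # e.g. 5 SDs above the mean -> Z = 5
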
E-value or Z-value?
• Z-value calculation takes much time (2 x 100 randomizations)
• Comet et al. (1999) and Bastien et al. (2004): the Z-value is theoretically more sensitive and more selective than the E-value
• BUT: the advantage of the Z-value has never been proven by experimental results
How to compare?
• Structural comparison is more reliable than sequence comparison
• ASTRAL SCOP: Structural Classification Of Proteins
• e.g. a.2.1.3, c.1.2.4; same numbers ~ same structure (class.fold.superfamily.family)
• Use the structural classification as a benchmark for sequence comparison methods (see the sketch below)
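
A minimal sketch of how a benchmark can use the classification, with an illustrative helper name (same_family is not part of any SCOP tool): two domains count as homologs when all four levels of their SCOP string (class.fold.superfamily.family) match.

    def same_family(sccs_a, sccs_b):
        """True if two SCOP classification strings such as 'a.3.1.1'
        agree on all four levels: class, fold, superfamily, family."""
        return sccs_a.split(".") == sccs_b.split(".")

    print(same_family("a.3.1.1", "a.3.1.1"))  # True: same family
    print(same_family("a.3.1.1", "a.3.1.4"))  # False: same superfamily only
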
ASTRAL SCOP statistics

max. % identity   members   families   avg. fam. size   max. fam. size   families =1   families >1
10%                  3631       2250            1.614               25          1655           595
20%                  3968       2297            1.727               29          1605           692
25%                  4357       2313            1.884               32          1530           783
30%                  4821       2320            2.078               39          1435           885
35%                  5301       2322            2.283               46          1333           989
40%                  5674       2322            2.444               47          1269          1053
50%                  6442       2324            2.772               50          1178          1146
70%                  7551       2325            3.248              127          1087          1238
90%                  8759       2326            3.766              405          1023          1303
95%                  9498       2326            4.083              479           977          1349
Methods (1)
• Smith-Waterman algorithms: dynamic programming; computationally intensive (a sketch of the shared recurrence follows after this list)
– Paracel with e-value (PC E):
• SW implementation of Paracel
– Biofacet with z-value (BF Z):
• SW implementation of Gene-IT
– ParAlign with e-value (PA E):
• SW implementation of Sencel
– SSEARCH with e-value (SS E):
• SW implementation of FASTA (see next page)
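
A minimal Python sketch of the dynamic programming recurrence these implementations share (Gotoh's affine-gap variant of Smith-Waterman), with a toy match/mismatch function standing in for BLOSUM62; the production tools differ in optimizations, not in the recurrence:

    def smith_waterman(a, b, score, gap_open=12, gap_ext=1):
        """Local alignment score with affine gaps: opening a gap costs
        gap_open, each further gapped position costs gap_ext."""
        NEG = float("-inf")
        n, m = len(a), len(b)
        H = [[0.0] * (m + 1) for _ in range(n + 1)]  # best local score at (i, j)
        E = [[NEG] * (m + 1) for _ in range(n + 1)]  # alignments ending in a gap in a
        F = [[NEG] * (m + 1) for _ in range(n + 1)]  # alignments ending in a gap in b
        best = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                E[i][j] = max(H[i][j-1] - gap_open, E[i][j-1] - gap_ext)
                F[i][j] = max(H[i-1][j] - gap_open, F[i-1][j] - gap_ext)
                H[i][j] = max(0.0, H[i-1][j-1] + score(a[i-1], b[j-1]),
                              E[i][j], F[i][j])
                best = max(best, H[i][j])  # local alignment: best cell anywhere
        return best

    toy = lambda x, y: 3.0 if x == y else -1.0  # stand-in for BLOSUM62
    print(smith_waterman("HEAGAWGHEE", "PAWHEAE", toy, gap_open=5, gap_ext=1))
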
Methods (2)
• Heuristic algorithms:
– FASTA (FA E):
• Pearson & Lipman, 1988
• Heuristic approximation; performs better than BLAST with strongly diverged proteins
– BLAST (BL E):
• Altschul et al., 1990
• Heuristic approximation; stretches local alignments (HSPs) to a global alignment
• Should be faster than FASTA
Method parameters
- all:
- matrix: BLOSUM62
- gap open penalty: 12
- gap extension penalty: 1
- Biofacet with z-value: 100 randomizations
Receiver Operating Characteristic
• R.O.C.: a statistical measure, mostly used in clinical medicine
• Proposed by Gribskov & Robinson (1996) for use in sequence comparison analysis
ROC50 Example
Query: d1c75a_ (family a.3.1.1); top 25 hits with Paracel E-values (pc e):

hit #   domain     family      pc e
 1      d1gcya1    b.71.1.1    0.31
 2      d1h32b_    a.3.1.1     0.4
 3      d1gks__    a.3.1.1     0.52
 4      d1a56__    a.3.1.1     0.52
 5      d1kx2a_    a.3.1.1     0.67
 6      d1etpa1    a.3.1.4     0.67
 7      d1zpda3    c.36.1.9    0.87
 8      d1eu1a2    c.81.1.1    0.87
 9      d451c__    a.3.1.1     1.1
10      d1flca2    c.23.10.2   1.1
11      d1mdwa_    d.3.1.3     1.1
12      d2dvh__    a.3.1.1     1.5
13      d1shsa_    b.15.1.1    1.5
14      d1mg2d_    a.3.1.1     1.5
15      d1c53__    a.3.1.1     2.4
16      d3c2c__    a.3.1.1     2.4
17      d1bvsa1    a.5.1.1     6.8
18      d1dvva_    a.3.1.1     6.8
19      d1cyi__    a.3.1.1     6.8
20      d1dw0a_    a.3.1.1     6.8
21      d1h0ba_    b.29.1.11   6.8
22      d3pfk__    c.89.1.1    6.8
23      d1kful3    d.3.1.3     6.8
24      d1ixrc1    a.4.5.11    14
25      d1ixsb1    a.4.5.11    14
- Take the 100 best hits
- True positives: in the same SCOP family; false positives: not in the same family
- For each of the first 50 false positives: calculate the number of true positives higher in the list (0,4,4,4,5,5,6,9,12,12,12,12,12,...)
- Divide the sum of these numbers by the number of false positives (50) and by the total number of possible true positives (family size - 1) = ROC50 (0.167)
- Take the average of the ROC50 scores over all entries
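
A sketch of that calculation in Python, assuming the hits are already sorted by E-value and labelled as true/false positives (following the slide, the sum is divided by 50 and by the family size minus one):

    def roc50(labels, family_size, n_fp=50):
        """ROC50 for one query.  labels: True/False per hit, best hit
        first; family_size: size of the query's SCOP family."""
        tp_seen = 0
        fp_counts = []  # true positives ranked above each false positive
        for is_tp in labels:
            if is_tp:
                tp_seen += 1
            else:
                fp_counts.append(tp_seen)
                if len(fp_counts) == n_fp:
                    break  # only the first 50 false positives count
        if family_size < 2:
            return 0.0  # no possible true positives
        return sum(fp_counts) / (n_fp * (family_size - 1))

The mean ROC50 is then the average of this value over all query entries.
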
ROC50 results
[Figure: mean ROC50 (y-axis, 0.00-0.50) of each method (pc e, bf z, bl e, fa e, ss e, pa e) on the ASTRAL SCOP sets pdb010-pdb095 (x-axis).]
Coverage vs. Error
• C.V.E. = Coverage vs. Error (Brenner et al., 1998)
• E.P.Q. = errors per query: selectivity indicator (how many false positives?)
• Coverage: sensitivity indicator (how many true positives out of the total?)
CVE Example
(Same query and hit list as in the ROC50 example above: query d1c75a_, family a.3.1.1, top 25 hits with Paracel E-values.)
- Vary the threshold above which a hit is seen as a positive: e.g. E=10, E=1, E=0.1, E=0.01
- True positives: in the same SCOP family; false positives: not in the same family
- For each threshold, calculate coverage: the number of true positives divided by the total number of possible true positives
- For each threshold, calculate errors-per-query: the number of false positives divided by the number of queries
- Plot coverage on the x-axis and errors-per-query on the y-axis; bottom right is best
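
A sketch of one way to compute these points, assuming each query's hits come as (E-value, is_true_positive) pairs and possible_tp is the total number of possible true positives over all queries:

    def cve_points(per_query_hits, possible_tp, thresholds=(10, 1, 0.1, 0.01)):
        """One (coverage, errors-per-query) point per E-value threshold."""
        n_queries = len(per_query_hits)
        points = []
        for t in thresholds:
            # keep only hits at or below the current E-value threshold
            hits = [h for query in per_query_hits for h in query if h[0] <= t]
            tp = sum(1 for e, is_tp in hits if is_tp)  # true positives
            fp = len(hits) - tp                        # false positives
            points.append((tp / possible_tp, fp / n_queries))
        return points
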
CVE results
[Figure: coverage vs. errors-per-query curves of each method for PDB010; curves toward the bottom right (high coverage, low error) are better.]
Mean Average Precision
• A.P.: borrowed from information retrieval (Salton, 1991)
• Recall: true positives divided by the number of homologs
• Precision: true positives divided by the number of hits
• A.P. = approximate integral calculating the area under the recall-precision curve
Mean AP Example
(Same query and hit list as in the ROC50 example above: query d1c75a_, family a.3.1.1, top 25 hits with Paracel E-values.)
- Take the 100 best hits
- True positives: in the same SCOP family; false positives: not in the same family
- For each true positive: divide its positive rank (1,2,3,4,5,6,7,8,9,10,11,12) by its rank in the hit list (2,3,4,5,9,12,14,15,16,18,19,20)
- Divide the sum of all of these numbers by the total number of hits (100) = AP (0.140)
- Take the average of the AP scores over all entries = mean AP
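
A sketch of the per-query AP as the slide defines it, dividing by the fixed number of hits considered (100) rather than by the number of homologs as some AP variants do:

    def average_precision(tp_ranks, n_hits=100):
        """tp_ranks: hit-list ranks of the true positives, ascending.
        The i-th true positive contributes i / rank_i."""
        total = sum(i / r for i, r in enumerate(sorted(tp_ranks), start=1))
        return total / n_hits

    # First twelve true positives from the example table; this excerpt alone
    # gives ~0.071, while the slide's 0.140 also includes true positives
    # beyond the 25 hits shown.
    print(average_precision([2, 3, 4, 5, 9, 12, 14, 15, 16, 18, 19, 20]))
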
Mean AP results
[Figure: mean AP (y-axis, 0.00-0.30) of each method (pc e, bf z, bl e, fa e, ss e, pa e) on the ASTRAL SCOP sets pdb010-pdb095 (x-axis).]
Time consumption
• PDB095 all-against-all comparison:
– Biofacet: multiple days (Z-value calc.!)
– SSEARCH: 5h49m
– ParAlign: 47m
– FASTA: 40m
– BLAST: 15m
Conclusions
• E-value scores better than Z-value(!)
• The SW implementations (SSEARCH, ParAlign and Biofacet) perform more or less the same, but SSEARCH with E-value scores best of all
• Use FASTA/BLAST only when speed is important
• A larger structural comparison database is needed for better analysis
Credits
Peter Groenen
Wilco Fleuren
Jack Leunissen