Biological discoveries in sequence databases

Using (and abusing) sequence analysis to make biological discoveries
Nothing in (computational) biology makes sense except in the light of evolution
after Theodosius Dobzhansky (1970)
Significant sequence similarity is evidence of homology
Only a small fraction of amino acid residues is directly involved in protein function (including enzymatic activity); the rest of the protein serves largely as a structural scaffold
Conserved sequence motifs are determinants of conserved ancestral functions
The evolving roles of computational analysis in biology
Pre-sequencing era (before 1978-80): study biological function
Pre-genomic era (1980-1996): study biological function -> clone/sequence gene -> analyze/interpret sequence
Post-genomic era (1996- ): sequence genome -> analyze/interpret sequences of all genes -> prioritize targets -> study biological function
Sequence complexity
Measure of the randomness of a sequence
Random sequence: highest complexity (entropy); typical of globular protein domains
Homopolymer: lowest complexity (entropy); typical of non-globular structures
Algorithmic complexity
QQQQQQQQQQQQQ = (Q)n
KRKRKRKRKRKR = (KR)n
ASDFGHKLCVNM: random sequence; no algorithm derives it from a simpler one
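As a rough illustration of the complexity measure, the sketch below computes the Shannon entropy (in bits per residue) of the residue composition of each example string above. This is a compositional proxy, not algorithmic complexity: it ignores residue order, so it is only an approximation of the idea on the slide. The function name shannon_entropy is chosen here for illustration.

```python
import math
from collections import Counter

def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition of seq."""
    n = len(seq)
    # Each residue type with count c contributes p * log2(1/p), where p = c / n.
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())

for s in ("QQQQQQQQQQQQQ", "KRKRKRKRKRKR", "ASDFGHKLCVNM"):
    print(s, round(shannon_entropy(s), 3))
# QQQQQQQQQQQQQ 0.0
# KRKRKRKRKRKR 1.0
# ASDFGHKLCVNM 3.585
```

The homopolymer scores 0 bits, the dipeptide repeat 1 bit, and the string of 12 distinct residues log2(12), about 3.585 bits, which matches the ordering from lowest to highest complexity given above.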
seg BRCA1 45 3.4 3.7 > BRCA1.seg
>gi|728984|sp|P38398|BRC1_HUMAN Breast cancer type 1 susceptibility protein
Non-globular regions are shown in lowercase; globular domains are shown in uppercase.

1-388 (globular)
MDLSALRVEEVQNVINAMQKILECPICLEL
IKEPVSTKCDHIFCKFCMLKLLNQKKGPSQ
CPLCKNDITKRSLQESTRFSQLVEELLKII
CAFQLDTGLEYANSYNFAKKENNSPEHLKD
EVSIIQSMGYRNRAKRLLQSEPENPSLQET
SLSVQLSNLGTVRTLRTKQRIQPQKTSVYI
ELGSDSSEDTVNKATYCSVGDQELLQITPQ
GTRDEISLDSAKKAACEFSETDVTNTEHHQ
PSNNDLNTTEKRAAERHPEKYQGSSVSNLH
VEPCGTNTHASSLQHENSSLLLTKDRMNVE
KAEFCNKSKQPGLARSQHNRWAGSKETCND
RRTPSTEKKVDLNADPLCERKEWNKQKLPC
SENPRDTEDVPWITLNSSIQKVNEWFSR

389-458 (non-globular)
sdellgsddshdgesesnakvadvldvlne
vdeysgssekidllasdphealickservh
sksvesnied

459-526 (globular)
KIFGKTYRKKASLPNLSHVTENLIIGAFVT
EPQIIQERPLTNKLKRKRRPTSGLHPEDFI
KKADLAVQ

527-635 (non-globular)
ktpeminqgtnqteqngqvmnitnsghenk
tkgdsiqneknpnpieslekesafktkaep
isssisnmelelnihnskapkknrlrrkss
trhihalelvvsrnlsppn

636-995 (globular)
CTELQIDSCSSSEEIKKKKYNQMPVRHSRN
LQLMEGKEPATGAKKSNKPNEQTSKRHDSD
TFPELKLTNAPGSFTKCSNTSELKEFVNPS
LPREEKEEKLETVKVSNNAEDPKDLMLSGE
RVLQTERSVESSSISLVPGTDYGTQESISL
LEVSTLGKAKTEPNKCVSQCAAFENPKGLI
HGCSKDNRNDTEGFKYPLGHEVNHSRETSI
EMEESELDAQYLQNTFKVSKRQSFAPFSNP
GNAEEECATFSAHSGSLKKQSPKVTFECEQ
KEENQGKNESNIKPVQTVNITAGFPVVGQK
DKPVDNAKCSIKGGSRFCLSSQFRGNETGL
ITPNKHGLLQNPYRIPPLFPIKSFVKTKCK

996-1089 (non-globular)
knlleenfeehsmsperemgnenipstvst
isrnnirenvfkeasssninevgsstnevg
(remainder of 996-1089 and residues 1090-1421 not shown)

1422-1513 (globular)
GSQPSNSYPSIISDSSALEDLRNPEQSTSE
KAVLTSQKSSEYPISQNPEGLSADKFEVSA
DSSTSKNKEPGVERSSPSKCPSLDDRWYMH
SC

1514-1616 (non-globular)
sgslqnrnypsqeelikvvdveeqqleesg
phdltetsylprqdlegtpylesgislfsd
dpesdpsedrapesarvgnipsstsalkvp
qlkvaesaqspaa

1617-1863 (globular)
AHTTDTAGYNAMEESVSREKPELTASTERV
NKRMSMVVSGLTPEEFMLVYKFARKHHITL
TNLITEETTHVVMKTDAEFVCERTLKYFLG
IAGGKWVVSYFWVTQSIKERKMLNEHDFEV
RGDVVNGRNHQGPKRARESQDRKIFRGLEI
CCYGPFTNMPTDQLEWMVQLCGASVVKELS
SFTLGTGVHPIVVVQPDAWTEDNGFHAIGQ
MCEAPVVTREWVLDSVALYQCQELDTYLIP
QIPHSHY
Removing spurious database hits for the low sequence complexity protein BRCA1 by modifying SEG parameters (a)

Parameter set of SEG (a)      Number of residues      E-values of the BLAST hits
                              Masked    Unmasked      Dentin    Plant BRCA1   Opossum BRCA1
No filtering                  0         1,863         3e-11     4e-15         1e-28
12 2.1 2.4                    35        1,828         4e-9      4e-15         1e-28
12 2.2 2.5 (default)          117       1,746         5e-4      5e-12         7e-22
12 2.3 2.6                    172       1,691                   5e-11         3e-21
12 2.4 2.7                    279       1,584                   5e-11         1e-14
12 2.5 2.8                    487       1,376                   6e-11         8e-10
12 2.6 2.9                    616       1,247                   2e-10         5e-9
12 2.7 3.0                    908       955                     4e-06         2e-8
12 2.8 3.1                    1,164     699                     0.003         6e-7
Composition-based filtering   0         1,863                   3e-12         1e-20

(Dentin column: the source shows a single "-", i.e. no hit, for the rows after the default parameter set; the per-row entries did not survive.)
(a) SEG parameters are the trigger window length, the trigger complexity, and the extension complexity
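To show how the three SEG parameters interact, here is a minimal sketch of window-based masking. It is a simplified stand-in, not the SEG program itself: it finds windows whose compositional entropy is at or below the trigger complexity, extends each over neighbouring windows at or below the extension complexity, and lowercases the covered residues. SEG's later refinement step is omitted, and the names entropy, mask_low_complexity and count_masked are hypothetical.

```python
import math
from collections import Counter

def entropy(window: str) -> float:
    """Compositional Shannon entropy of a window, in bits."""
    n = len(window)
    return sum((c / n) * math.log2(n / c) for c in Counter(window).values())

def mask_low_complexity(seq, window=12, trigger=2.2, extension=2.5):
    """Lowercase residues covered by low-complexity windows (simplified)."""
    n = len(seq)
    if n < window:
        return seq
    ent = [entropy(seq[i:i + window]) for i in range(n - window + 1)]
    masked = [False] * n
    for i, e in enumerate(ent):
        if e > trigger:
            continue  # not a trigger window
        # Extend left and right over adjacent windows at or below `extension`.
        lo = i
        while lo > 0 and ent[lo - 1] <= extension:
            lo -= 1
        hi = i
        while hi < len(ent) - 1 and ent[hi + 1] <= extension:
            hi += 1
        for j in range(lo, hi + window):
            masked[j] = True
    return "".join(c.lower() if m else c.upper() for c, m in zip(seq, masked))

def count_masked(masked_seq: str) -> int:
    """Number of masked (lowercase) residues."""
    return sum(c.islower() for c in masked_seq)

# Made-up test sequence: two BRCA1-like fragments around a serine run.
example = "MDLSALRVEEVQNVINAMQK" + "S" * 15 + "KIFGKTYRKKASLPNLSHVT"
for trig, ext in ((2.2, 2.5), (2.8, 3.1)):
    masked_seq = mask_low_complexity(example, 12, trig, ext)
    print(trig, ext, count_masked(masked_seq), masked_seq)
```

In this sketch, raising the trigger and extension thresholds can only enlarge the masked region, which mirrors the trend in the table: more residues masked as the thresholds increase.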
Paradigm shift in database searching
Traditional (PSI-BLAST): query sequence -> sequence database -> set of homologs -> PSSM
New (RPS-BLAST): query sequence -> PSSM database -> domain architecture
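For readers unfamiliar with the PSSM (position-specific scoring matrix) at the centre of both search modes, the sketch below builds a toy log-odds PSSM from a small ungapped alignment and slides it along a query. It is an illustration under simplifying assumptions (uniform background frequencies, a flat pseudocount, no gaps, no sequence weighting) and does not reproduce how PSI-BLAST or RPS-BLAST compute or use their matrices. The alignment, the query and the function names are made up for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Log-odds PSSM (bits) from equal-length ungapped sequences."""
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in alignment)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def best_match(pssm, query):
    """Slide the PSSM along query without gaps; return (best score, offset)."""
    width = len(pssm)
    scores = [
        (sum(pssm[j][query[i + j]] for j in range(width)), i)
        for i in range(len(query) - width + 1)
    ]
    if not scores:
        raise ValueError("query is shorter than the PSSM")
    return max(scores)

# Hypothetical toy alignment and query (standard amino acid letters only).
homologs = ["CPICLE", "CPVCLD", "CSICKE"]
pssm = build_pssm(homologs)
score, offset = best_match(pssm, "MKCPICLEL")
print(offset, round(score, 2))
```

Searching one query against a library of such matrices, one per domain family, is the idea behind the "new" mode above: each matrix that scores well marks one domain in the query's architecture.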
DOMAIN ARCHITECTURE OF SELECTED BRCT PROTEINS
[Figure: domain architecture diagrams]
Proteins shown: BRCA1; BARD1; BRCA1/BARD homolog (plant); REV1 (yeast); DPB11 (yeast); PARP (vertebrates); DNA ligase III (human); TdT (eukaryotes); RFC1 (eukaryotes); DNA ligase (bacteria)
Domains labeled: BRCT; RING; PHD-l; CMP-trans; AZF; PARP; ATP-dep ligase; HhH; polX; ATP and PCNA-binding; NAD-dep ligase
Use of profile libraries to examine domain representation in individual proteomes
[Figure: workflow] yeast (6,200) and worm (~20,000) proteomes, searched against a profile library -> detect domains using PSI-BLAST, IMPALA -> compare domain distributions
Chervitz SA, Aravind L, Sherlock G, Ball CA, Koonin EV, Dwight SS, Harris MA, Dolinski K, Mohr S, Smith
T, Weng S, Cherry JM, Botstein D. 1998. Comparison of the complete protein sets of worm and yeast:
orthology and divergence. Science 282: 2022-8
Normalized domain counts in worm and yeast
[Figure: scatter plot of normalized domain counts, Worm (vertical axis, 0-16) against Yeast (horizontal axis, 0-12); points are numbered 1-19 as in the key below]
1. Hormone receptor; 2. POZ; 3. EGF; 4. MATH; 5. PTPase; 6. Cation Channels; 7. PDZ; 8. SH2; 9. FNIII; 10. Homeodomain; 11. LRR; 12. EF hands; 13. Ankyrin; 14. RING finger; 15. C2H2 finger; 16. small GTPase; 17. RRM; 18. AAA+; 19. C6 finger
• Searching a domain library is often easier and more informative than searching the entire sequence database. However, the latter yields complementary information and should not be skipped if details are of interest.
• Varying the search parameters, e.g. switching composition-based statistics on and off, can make a difference.
• Using subsequences, preferably chosen according to objective criteria, e.g. separation from the rest of the protein by a low-complexity linker, may improve search performance.
• Trying different queries is a must when analyzing protein (super)families. Even hits below the threshold of statistical significance often are worth analyzing, albeit with extreme care. Transferring functional information between homologs on the basis of a database description alone is dangerous.
• Conservation of domain architectures, active sites and other features needs to be analyzed (hence automated identification of protein families is difficult and automated prediction of functions is extremely error-prone).
• Always do a reality check!