Using (and abusing) sequence analysis to make biological discoveries

"Nothing in (computational) biology makes sense except in the light of evolution"
after Theodosius Dobzhansky (1970)

• Significant sequence similarity is evidence of homology
• Only a small fraction of amino acid residues is directly involved in protein function (including enzymatic); the rest of the protein serves largely as structural scaffold
• Conserved sequence motifs are determinants of conserved ancestral functions

The evolving roles of computational analysis in biology

Pre-sequencing era (before 1978-80): Study biological function
Pre-genomic era (1980-1996): Study biological function; Clone/sequence gene; Analyze/interpret sequence
Post-genomic era (1996-): Sequence genome; Study biological function; Analyze/interpret sequences of all genes; Prioritize targets

Sequence complexity

Measure of the randomness of a sequence
• Random sequence - highest complexity (entropy) - globular protein domains
• Homopolymer - lowest complexity (entropy) - non-globular structures

Algorithmic complexity
QQQQQQQQQQQQQ = (Q)n
KRKRKRKRKRKR = (KR)n
ASDFGHKLCVNM - random sequence - no algorithm to derive from a simpler one

seg BRCA1 45 3.4 3.7 > BRCA1.seg

>gi|728984|sp|P38398|BRC1_HUMAN Breast cancer type 1 susceptibility protein

Non-globular regions are shown in lower case, globular domains in upper case.

1-388
MDLSALRVEEVQNVINAMQKILECPICLEL IKEPVSTKCDHIFCKFCMLKLLNQKKGPSQ CPLCKNDITKRSLQESTRFSQLVEELLKII
CAFQLDTGLEYANSYNFAKKENNSPEHLKD EVSIIQSMGYRNRAKRLLQSEPENPSLQET SLSVQLSNLGTVRTLRTKQRIQPQKTSVYI
ELGSDSSEDTVNKATYCSVGDQELLQITPQ GTRDEISLDSAKKAACEFSETDVTNTEHHQ PSNNDLNTTEKRAAERHPEKYQGSSVSNLH
VEPCGTNTHASSLQHENSSLLLTKDRMNVE KAEFCNKSKQPGLARSQHNRWAGSKETCND RRTPSTEKKVDLNADPLCERKEWNKQKLPC
SENPRDTEDVPWITLNSSIQKVNEWFSR

389-458
sdellgsddshdgesesnakvadvldvlne vdeysgssekidllasdphealickservh sksvesnied

459-526
KIFGKTYRKKASLPNLSHVTENLIIGAFVT EPQIIQERPLTNKLKRKRRPTSGLHPEDFI KKADLAVQ

527-635
ktpeminqgtnqteqngqvmnitnsghenk tkgdsiqneknpnpieslekesafktkaep isssisnmelelnihnskapkknrlrrkss
trhihalelvvsrnlsppn

636-995
CTELQIDSCSSSEEIKKKKYNQMPVRHSRN LQLMEGKEPATGAKKSNKPNEQTSKRHDSD TFPELKLTNAPGSFTKCSNTSELKEFVNPS
LPREEKEEKLETVKVSNNAEDPKDLMLSGE RVLQTERSVESSSISLVPGTDYGTQESISL LEVSTLGKAKTEPNKCVSQCAAFENPKGLI
HGCSKDNRNDTEGFKYPLGHEVNHSRETSI EMEESELDAQYLQNTFKVSKRQSFAPFSNP GNAEEECATFSAHSGSLKKQSPKVTFECEQ
KEENQGKNESNIKPVQTVNITAGFPVVGQK DKPVDNAKCSIKGGSRFCLSSQFRGNETGL ITPNKHGLLQNPYRIPPLFPIKSFVKTKCK

996-1089
knlleenfeehsmsperemgnenipstvst isrnnirenvfkeasssninevgsstnevg
(excerpt; the remaining residues of this segment and positions 1090-1421 do not appear in the source)

1422-1513
GSQPSNSYPSIISDSSALEDLRNPEQSTSE KAVLTSQKSSEYPISQNPEGLSADKFEVSA DSSTSKNKEPGVERSSPSKCPSLDDRWYMH
SC

1514-1616
sgslqnrnypsqeelikvvdveeqqleesg phdltetsylprqdlegtpylesgislfsd dpesdpsedrapesarvgnipsstsalkvp
qlkvaesaqspaa

1617-1863
AHTTDTAGYNAMEESVSREKPELTASTERV NKRMSMVVSGLTPEEFMLVYKFARKHHITL TNLITEETTHVVMKTDAEFVCERTLKYFLG
IAGGKWVVSYFWVTQSIKERKMLNEHDFEV RGDVVNGRNHQGPKRARESQDRKIFRGLEI CCYGPFTNMPTDQLEWMVQLCGASVVKELS
SFTLGTGVHPIVVVQPDAWTEDNGFHAIGQ MCEAPVVTREWVLDSVALYQCQELDTYLIP QIPHSHY

Removing spurious database hits for the low sequence complexity protein BRCA1 by modifying SEG parameters

Parameter set of SEG (a)       Number of residues       E-values of the BLAST hits
                               Masked     Unmasked      Dentin    Plant BRCA1   Opossum BRCA1
No filtering                   0          1,863         3e-11     4e-15         1e-28
12 2.1 2.4                     35         1,828         4e-9      4e-15         1e-28
12 2.2 2.5 (default)           117        1,746         5e-4      5e-12         7e-22
12 2.3 2.6                     172        1,691         -         5e-11         3e-21
12 2.4 2.7                     279        1,584         -         5e-11         1e-14
12 2.5 2.8                     487        1,376         -         6e-11         8e-10
12 2.6 2.9                     616        1,247         -         2e-10         5e-9
12 2.7 3.0                     908        955           -         4e-06         2e-8
12 2.8 3.1                     1,164      699           -         0.003         6e-7
Composition-based filtering    0          1,863         -         3e-12         1e-20

(a) SEG parameters are trigger window length, trigger complexity, and extension.
(-) The source carries a single dash for the Dentin column from the 12 2.3 2.6 row onward; it is repeated in each of those rows here.

Paradigm shift in database searching

Traditional (PSI-BLAST): query sequence searched against a sequence database; yields a set of homologs and a PSSM
New (RPS-BLAST): query sequence searched against a PSSM database; yields the domain architecture

Domain architecture of selected BRCT proteins

[Figure: domain architecture diagrams. Proteins shown: BRCA1; BARD1; BRCA1/BARD homolog (plant); REV1 (yeast); DPB11 (yeast); PARP (vertebrates); DNA ligase III (human); TdT (eukaryotes); RFC1 (eukaryotes); DNA ligase (bacteria). Domain labels: BRCT, RING, PHD-l, CMP-trans, AZF, PARP, ATP-dep ligase, HhH, polX, ATP and PCNA-binding, NAD-dep ligase.]

Use of profile libraries to examine domain representation in individual proteomes

[Diagram: the yeast (6,200) and worm (~20,000) protein sets are searched against a profile library; domains are detected using PSI-BLAST and IMPALA; the domain distributions are then compared.]

Chervitz SA, Aravind L, Sherlock G, Ball CA, Koonin EV, Dwight SS, Harris MA, Dolinski K, Mohr S, Smith T, Weng S, Cherry JM, Botstein D. 1998. Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science 282: 2022-8

Normalized domain counts in worm and yeast

[Scatter plot: normalized domain count in Worm (vertical axis) against Yeast (horizontal axis); points are numbered as follows.]
1. Hormone receptor; 2. POZ; 3. EGF; 4. MATH; 5. PTPase; 6. Cation Channels; 7. PDZ; 8. SH2; 9. FNIII; 10. Homeodomain; 11. LRR; 12. EF hands; 13. Ankyrin; 14. RING finger; 15. C2H2 finger; 16. small GTPase; 17. RRM; 18. AAA+; 19. C6 finger

• Searching a domain library is often easier and more informative than searching the entire sequence database. However, the latter yields complementary information and should not be skipped if details are of interest.
• Varying the search parameters, e.g. switching composition-based statistics on and off, can make a difference.
• Using subsequences, preferably chosen according to objective criteria, e.g. separation from the rest of the protein by a low-complexity linker, may improve search performance.
• Trying different queries is a must when analyzing protein (super)families.
• Even hits below the threshold of statistical significance often are worth analyzing, albeit with extreme care.
• Transferring functional information between homologs on the basis of a database description alone is dangerous.
• Conservation of domain architectures, active sites and other features needs to be analyzed (hence automated identification of protein families is difficult and automated prediction of functions is extremely error-prone).
• Always do a reality check!
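
The sequence complexity slide describes complexity as entropy: highest for a random sequence, lowest for a homopolymer. A minimal Python sketch of that idea follows, using Shannon entropy over a sliding window. It is not the SEG algorithm itself. The function names, the single-threshold masking rule, and the default values (window 12, threshold 2.2 bits, borrowed from the SEG defaults purely for illustration) are assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy, in bits per residue, of the residue composition."""
    counts = Counter(segment)
    total = len(segment)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Lower-case every residue covered by at least one window whose
    entropy is at or below the threshold (hypothetical one-pass rule,
    much simpler than SEG's trigger/extension procedure)."""
    seq = seq.upper()
    low = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            for i in range(start, start + window):
                low[i] = True
    return "".join(ch.lower() if flag else ch for ch, flag in zip(seq, low))


if __name__ == "__main__":
    # The three example strings from the algorithmic complexity slide.
    for example in ("QQQQQQQQQQQQQ", "KRKRKRKRKRKR", "ASDFGHKLCVNM"):
        print(example, round(shannon_entropy(example), 3))
    # A homopolymer run embedded between two copies of the random 12-mer.
    print(mask_low_complexity("ASDFGHKLCVNM" + "Q" * 12 + "ASDFGHKLCVNM"))
```

The homopolymer scores 0 bits, the (KR)n repeat scores 1 bit, and the 12-residue string with no repeated residue scores log2(12) bits, which matches the ordering on the slide. Because windows overlap, residues flanking a low-complexity run can also be masked, so the boundaries produced by this sketch should not be expected to coincide with those reported by seg.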