Elements of Bioinformatics (14F001)
TP2: Gene prediction
22 October 2012
CORRECTIONS

Notice: During this practical, you will need to use the ‘raw’ and ‘fasta’ sequence formats. For additional information on the available sequence formats, see http://www.genomatix.de/online_help/help/sequence_formats.html

ncRNA gene prediction
Choose the eukaryotic tRNA model: the general tRNA model does not give any result!

CpG island prediction
CpG island in the C. elegans cosmid. Length: 219 bp; positions 21954 to 22172.
cgttttctgtggtcaca cacgagtatc cggatcttct ggatcaactt gttctcgtct gcaacgtctt tgcaagaatg gcaccagaac agaaacaact actcgtggaa caccttcaag acgttgggca gacggtcgct atgtgtggcg atggagctaa tgattgtgct gctctgaaag cagctcacgc gggaatctca ctatcggagg ctgaagcatc ga
To confirm that this sequence could be part of a promoter (more than 80% of CpG islands extend into the 5’ flanking region of the associated gene), check from its position whether this CpG island lies in a gene promoter region (see later).

Gene prediction with HMM on the complete cosmid sequence
3 HMM models: firstex, exon_n, lastex
[Figure: HMM predictions along the cosmid, labelled Gene 1, Gene 2, Gene 3 and Gene 4, with one CDS marked ‘Wrong CDS?’]

Summary
[Figure: map of the cosmid showing the tRNA (169, 238) and genes 1, 2, 3 and 4]
Predicted CpG island: 21954 to 22172 -> in the middle of CDS4: not a ‘classical’ CpG island (not in the 5’ region of a gene)

Gene 1
Gene 1 prediction with HMMgene: one gene found.
Gene 1 prediction with HMMgene, with ‘human’: 2 genes found, one on each strand (the one on the minus strand has lower scores).
The programs are ‘trained’ with sequences from specific organisms. The ‘codon bias’, for example, is not the same in different species.
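To make the notion of codon bias concrete, here is a minimal sketch that counts the codons of an in-frame coding sequence and reports each codon's frequency per thousand codons. The function name `codon_usage` is chosen here for illustration, and the short example sequence is the first 11 codons of the Gene 1 CDS as it appears in the EST data used later in this practical. A real comparison between species would of course use many complete, annotated CDS rather than 11 codons.

```python
from collections import Counter


def codon_usage(cds):
    """Return {codon: frequency per thousand codons} for an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore an incomplete trailing codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: 1000.0 * n / total for codon, n in counts.items()}


# Illustrative input: start of the Gene 1 coding sequence (11 codons).
example_cds = "ATGAAGGTCGAAACCTGCGTTTACTCCGGATAC"
usage = codon_usage(example_cds)

for codon, per_thousand in sorted(usage.items()):
    print(codon, round(per_thousand, 1))
```

Running the same function on CDS sets from two organisms and comparing the two dictionaries is one simple way to see why a predictor trained on one species can perform poorly on another.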
Example of codon usage tables (-> codon bias): http://www.kazusa.or.jp/codon/

Gene 1 prediction with NetGene2
NetGene2 gives the positions of the first and last nucleotides of the intron (donor and acceptor splice sites): the intron starts with GT (donor) and ends with AG (acceptor).

Gene 1 prediction with GeneBuilder (organism: no choice, human only; option: first and last exon disabled; matrix: miscellaneous): one gene found.

Gene 1 prediction with GenScan: two genes found. No organism choice except vertebrate, maize and Arabidopsis!

FGENESH: one gene found.

Summary (gene prediction)
[Figure: Gene 1 region, 5’ to 3’, comparing the predictions. Labels in the figure: One gene; HMMgene and GenScan (organism = human!); GeneBuilder (organism = human!); NetGene2; GeneMark: finds a second gene in 3’!; FGENESH; + another potential gene from positions 2000 to 2900. Positions shown, in extraction order: 977 163 211 1003 1083 1305 1406 1452 1557 1914 1661 2000 1997]
NetGene2 splice sites: DO 1084 (1.00), AC 1304 (0.77), DO 1407 (0.89), AC 1451 (0.90), DO 1662 (1.00), AC 1913 (1.00)
DO: donor site; AC: acceptor site

ID FGENESH Unreviewed; 159 AA.
SQ SEQUENCE 159 AA; 17780 MW; F9A2C7DE9614425C CRC64;
MKVETCVYSG YKIHPGHGKR LVRTDGKVQI FLSGKALKGA KLRRNPRDIR WTVLYRIKNK KGTHGQEQVT RKKTKKSVQV VNRAVAGLSL DAILAKRNQT EDFRRQQREQ AAKIAKDANK AVRAAKAAAN KEKKASQPKT QQKTAKNVKT AAPRVGGKR
//
ID GENESCAN1 Unreviewed; 159 AA.
SQ SEQUENCE 159 AA; 17780 MW; F9A2C7DE9614425C CRC64;
MKVETCVYSG YKIHPGHGKR LVRTDGKVQI FLSGKALKGA KLRRNPRDIR WTVLYRIKNK KGTHGQEQVT RKKTKKSVQV VNRAVAGLSL DAILAKRNQT EDFRRQQREQ AAKIAKDANK AVRAAKAAAN KEKKASQPKT QQKTAKNVKT AAPRVGGKR
//
ID GENESCAN2 Unreviewed; 202 AA.
SQ SEQUENCE 202 AA; 23684 MW; 98A69FA21823F2F3 CRC64;
MRTLRIAQYS VLTVGFAIYM YRLIEEIPID IRNLNSDSLE GIINSDELCD VTVSNRNRGL LVRNDSLDLD ILKAKFTTFF SKRYLTRFLS EQVPFLHVID EALLVKRFVM CACFMVFCLT VIWFLVIRRM GNLIKRLSVL NQLEDAESVE WARCIREFTQ EKLAVLCFCI VPPFAQTDKL VSDKIKLFRE HKILRIRSVQ HI
//
ID GENEMARK1 Unreviewed; 184 AA.
SQ SEQUENCE 184 AA; 20255 MW; 85BB0234E6C14EA0 CRC64;
MGRCGSSGKR DGYGAKDSSS EGLSTMKVET CVYSGYKIHP GHGKRLVRTD GKVQIFLSGK ALKGAKLRRN PRDIRWTVLY RIKNKKGTHG QEQVTRKKTK KSVQVVNRAV AGLSLDAILA KRNQTEDFRR QQREQAAKIA KDANKAVRAA KAAANKEKKA SQPKTQQKTA KNVKTAAPRV GGKR
//
ID GENEMARK2 Unreviewed; 183 AA.
SQ SEQUENCE 183 AA; 21336 MW; 64F65D472A58046E CRC64;
MRTLRIAQYS VLTVGFAIYM YRLIEEIPID IRNLNSDSLE GIINSDELCD VTVSNRNRGL LVRNDSLDLD ILKAKFTTFF SKRYLTRFLS EQVPFLHVID EALLVKRFVM CACFMVFCLT VIWFLVIRRM GNLIKRLSVL NQLEDAESVE WARCIREFTQ EKLAVLCFCI VPPFAQTDNV QHI
//

For fun…
Compare the predictions of the same program (GeneMark) with different parameters (HMM trained on eukaryota or on prokaryota).
Two genes found
Gene 1 prediction with GeneMark (prokaryota specific; E. coli K12): Protein 1, Protein 2
CDS corresponds roughly to ‘exon’: there are no introns in prokaryota!

Summary (prokaryota gene prediction)
[Figure: Gene 1 region, 5’ to 3’, comparing HMMgene, GenScan, GeneMark (eukaryota), GeneBuilder, NetGene2 and GeneMark (prokaryota). Positions shown, in extraction order: 1914 1003 1083 1305 1406 1452 1997 1557 1661 2000. Positions shown with Protein 1 and Protein 2: 1254, 1433, 1437, 1688]
NetGene2 splice sites: DO 1084 (1.00), AC 1304 (0.77), DO 1407 (0.89), AC 1451 (0.90), DO 1662 (1.00), AC 1913 (1.00)
DO: donor site; AC: acceptor site

Alignment between the ‘eukaryota’ and ‘prokaryota’ predicted sequences

Gene prediction: similarity searches with ESTs
ESTs: expressed sequence tags (cDNAs that are sequenced rapidly and with low accuracy)
Two genes found
[Figure: BLAST 2012 hits on Gene A and Gene B; BLAST 2010 hits on Gene A and Gene B]

Gene A
EST1
>gi|47590759|gb|BJ750997.1|BJ750997 BJ750997 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 5', mRNA sequence
GGTTTAATTACCCAAGTTTGAGATTCGTCAAGCGAGGGCCTATCAGCAATGAAGGTCGAAACCTGCGTTTACTCCGGATACAAGATCCACCCAGGACACGGAAAGAGA
CTTGTCCGTACTGACGGAAAGGTGAGTTCAGTTTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTATATTGTAATTTTACGAGTGTTGAAGTATTGCAAAAGTAAA
GCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATACCATCGTTAAGCATGCCACGTGTTGAGTGCGACAAACTACCGTTTCATGATTTATTTATT CAAATTTCAGGTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGACTGTCCTCTACAGAATCAAGAACAAGAAGGGAA CCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTCCGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAA GACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAA EST2 >gi|47646579|gb|BJ775052.1|BJ775052 BJ775052 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 3', mRNA sequenceATAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGTCTTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCT TGTTGGCAGCAGCCTTGGCGGCACGGACAGCCTTGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTCTCTTGGCAAGGATA GCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGTCTTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAG GACAGTCCATCTGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCTGAAATTTGAATAAATAAATCATGAAACGGTAGTTTG TCGCACTCAACACGTGGCATGCTTAACGATGGTATAACTTCTAAAAACTAGAAGATATAGCACCAACACATACATAAGGTGATTATGCTTTACTTTTGCAATACTTCAACACTCGT AAAATTACAATATACCTTACGAGCTCTAACAGCATGCTAACGCCTTTCAAAGAGAAACTGAACTCACCTTTCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTG TATCCGGAGTAAACGCAGGTTTCGACCTTCATTGCTGATANGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCCCA EST3 >gi|47727995|gb|BJ818152.1|BJ818152 BJ818152 unpublished oligo-capped cDNA library, stage L4 Caenorhabditis elegans cDNA clone yk1685h11 3', mRNA sequence TAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGTC TTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGCCTTGGCGGCACGGACAGCCT TGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTCT CTTGGCAAGGATAGCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGTC TTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCATC TGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCTT TCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAAACGCAGGTTTCG 
ACCTTCATTGTTGATAGGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCTACAAATAAAAATG
AGATAAAGCATACTGCCATTCTACAACCGGAGAATAAGAAAACCGAAAACGAGAAAATTATTCTATTATG
ACAGATAGAATAAGTTAAAATGGGAAGAGTGCATTTGTCACTGATTTACTTGGTGACTTGGTGGAGAGCG
TGGGCAAGGTAAGCGACATTGTTCGATGAA

Blast result with EST1
Aligned regions: 975-1407, 1450-1615, 1692-1865
BUT: BLAST does not take the intron-exon boundaries into account when aligning genomic DNA with an mRNA-derived sequence -> we have to use a dedicated tool: SIM4
The 3rd part of EST1 is of very bad quality.

SIM4 alignment
Example with EST1 BJ750997 (partial)
The 3rd part of EST1 is of very bad quality and is not aligned by SIM4 -> EST1 is considered partial!

SIM4 alignment results
EST1 BJ750997 (partial), EST2 BJ775052, EST3 BJ818152

Summary (ESTs)
[Figure: SIM4 alignments of EST3 BJ818152.1, EST1 BJ750997.1 and EST2 BJ775052.1 on Gene A, 5’ to 3’. Positions shown, in extraction order: 1914 1003 1083 1305 1615 … 1997 1661 1406 1452]
Alternative splicing event (intron retention) -> 2 different mRNAs (EST BJ750997.1 is partial)

Translation and BLASTp
Translation (beware of the EST sequence orientation!)
>gi|47590759|gb|BJ750997.1|BJ750997 BJ750997 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 5', mRNA sequence
GGTTTAATTACCCAAGTTTGAGATTCGTCAAGCGAGGGCCTATCAGCAATGAAGGTCGAAACCTGCGTTT
ACTCCGGATACAAGATCCACCCAGGACACGGAAAGAGACTTGTCCGTACTGACGGAAAGGTGAGTTCAGT
TTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTATATTGTAATTTTACGAGTGTTGAAGT
ATTGCAAAAGTAAAGCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATA
CCATCGTTAAGCATGCCACGTGTTGAGTGCGACAAACTACCGTTTCATGATTTATTTATTCAAATTTCAG
GTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGAT
GGACTGTCCTCTACAGAATCAAGAACAAGAAGGGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGAC
CAAGAAGTCCGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGA
AACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAA
EST1
MIYLFKFQVQIFLSGKALKGAKLRRNPRDIRWTVLYRIKNKKGTHGQEQVTRKKTKKSVQ
VVNRAVAGLSLDAILAKRNQTEDFRRQQREQAAKIA
Blastp results
>gi|47646579|gb|BJ775052.1|BJ775052 BJ775052 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA
clone yk1360e06 3', mRNA sequence ATAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGT CTTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGCCTTGGCGGCACGGACAGCC TTGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTC TCTTGGCAAGGATAGCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGT CTTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCAT CTGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCT GAAATTTGAATAAATAAATCATGAAACGGTAGTTTGTCGCACTCAACACGTGGCATGCTTAACGATGGTA TAACTTCTAAAAACTAGAAGATATAGCACCAACACATACATAAGGTGATTATGCTTTACTTTTGCAATAC TTCAACACTCGTAAAATTACAATATACCTTACGAGCTCTAACAGCATGCTAACGCCTTTCAAAGAGAAAC TGAACTCACCTTTCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAA ACGCAGGTTTCGACCTTCATTGCTGATANGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCCC A EST2 MIYLFKFQVQIFLSGKALKGAKLRRNPRDIRWTVLYRIKNKKGTHGQEQVTRKKTKKSVQ VVNRAVAGLSLDAILAKRNQTEDFRRQQREQAAKIAKDANKAVRAAKAAANKEKKASQPK TQQKTAKNVKTAAPRVGGKR Blastp results >gi|47727995|gb|BJ818152.1|BJ818152 BJ818152 unpublished oligo-capped cDNA library, stage L4 Caenorhabditis elegans cDNA clone yk1685h11 3', mRNA sequence TAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGTC TTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGCCTTGGCGGCACGGACAGCCT TGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTCT CTTGGCAAGGATAGCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGTC TTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCATC TGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCTT TCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAAACGCAGGTTTCG ACCTTCATTGTTGATAGGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCTACAAATAAAAATG AGATAAAGCATACTGCCATTCTACAACCGGAGAATAAGAAAACCGAAAACGAGAAAATTATTCTATTATG ACAGATAGAATAAGTTAAAATGGGAAGAGTGCATTTGTCACTGATTTACTTGGTGACTTGGTGGAGAGCG T GGGCAAGGTAAGCGACATTGTTCGATGAA EST3 Gene A EST1 is partial in C-ter Gene A EST1 is partial. 
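The warning about EST orientation can be illustrated with a short sketch: a 3' EST read is the reverse complement of the mRNA, so it has to be reverse-complemented before translation. The code below is a minimal example, not the tool used in the practical. The function names (`reverse_complement`, `translate`, `six_frames`) are chosen here for illustration, the codon table is the standard genetic code, and the test fragment is a 33-nucleotide stretch taken from the 3' read BJ775052 (EST2).

```python
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate from offset `frame` (0, 1 or 2); unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[pos:pos + 3], "X")
        for pos in range(frame, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate both strands in all three frames, keyed '+1'..'+3', '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            frames[strand + str(frame + 1)] = translate(seq, frame)
    return frames


# Fragment of the 3' read BJ775052 (EST2): it only makes sense once reversed.
fragment_3prime = "GTATCCGGAGTAAACGCAGGTTTCGACCTTCAT"
print(reverse_complement(fragment_3prime))
print(six_frames(fragment_3prime)["-1"])
```

On this fragment, frame -1 gives MKVETCVYSGY, the start of the Gene A protein, whereas none of the forward frames does; this is why the 3' reads must be reverse-complemented before the BLASTp step.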
EST3 corresponds to the UniProtKB/Swiss-Prot RL24_CAEEL sequence.

Gene A
Some prediction programs give the correct protein sequence.
None of them predicted the alternative splicing event (EST2; retention of intron 1084-1304).

Summary (ESTs)
[Figure: Gene A, 5’ to 3’, with EST BJ818152 and EST BJ775052.1. Labels shown: 1010, MKVET…; 1284, MIYLF…. Positions shown, in extraction order: 1914 1003 1083 1305 1406 1452 1997 1661]
Alternative splicing events (intron retention) -> 2 different mRNAs

Gene 1 is on C. elegans chromosome I
BLAT results
[Figure: BLAT results, labelled Isoform 2, EST2, Gene A, Gene B]

RefSeq sequence
>NP_491399 length=159
MKVETCVYSGYKIHPGHGKRLVRTDGKVQIFLSGKALKGAKLRRNPRDIR
WTVLYRIKNKKGTHGQEQVTRKKTKKSVQVVNRAVAGLSLDAILAKRNQT
EDFRRQQREQAAKIAKDANKAVRAAKAAANKEKKASQPKTQQKTAKNVKT
AAPRVGGKR
InterPro scan results: the protein contains a ribosomal L24e domain

Conclusions (1)
Gene A: there are 2 different protein sequences due to alternative splicing (intron retention); the shortest isoform results from an intron retention and is rarely expressed (only 2 ESTs).

Conclusions (2)
Gene prediction programs cannot predict an alternative splicing event (they can only predict the alternative splice junctions).
The protein (Gene A) is a ribosomal protein which belongs to the ribosomal protein L24e family (UniProtKB/Swiss-Prot O01868).
The alternatively spliced sequence is not yet in the protein sequence databases, because it is ‘derived’ from EST sequences, which are submitted to public DNA/RNA databases without an annotated CDS.

Non-coding region analysis
3’ end of chromosome Y, EMBL #AJ271736
Example of Alu sequence

Gene 2
Summary diagram
[Figure: Gene 2 with Exon 1, Exon 2 and Exon 3, comparing HMMgene and NetGene2. Positions shown, in extraction order: 789 (Exon 3) 1111 1410 1557 (Exon 2) 1688 1636 1845 (Exon 1); 1112 1407 1637 1688]
NetGene2 splice sites: DO 1112 (0.56), AC 1409 (0.92), DO 1556 (0.96), AC 1637 (0.61)
DO: donor; AC: acceptor
The GeneBuilder prediction is not confirmed by any other program.

CDS2 (3 exons)
RefSeq NP_491393 (AF272397); UniProtKB/TrEMBL: G5EC89; 237 AA; 3 exons
MMMEYGGYFS SSAVAQQSGD VPTTAPSAVT NSFFYTPQSH NIYHQYATPY LQSGRALTTA HNTSSSSAGN STSSSSSSSN YRNTTHDSLQ AFFNTGLQYQ LYQKSQLIGS DTIQRTSSNV LNGLPRSSLV GALCSTGGAP LNPAERRKQR RIRTTFTSGQ LKELERSFCE THYPDIYTRE EIAMRIDLTE ARVQVWFQNR RAKYRKQEKI RRVKDEEEDP LKKEPGQISL EEIIDQI
A probable nuclear protein with a DNA-binding domain (homeobox)

Gene 3
Numbering: ‘direct strand’
CDS3
>tr|O01864|O01864_CAEEL Hypothetical protein - Caenorhabditis elegans.
METEVMKSFNNELSSLFDSKNMSKNKIQDITKAAIKAKSQYKHVVFSVEKLINKCKPDQR
LNVLYVIDSIVRASKHQLKEKDTFGPRFMKQFDKFLMPLLKCGQKEKMRTVRTLNLWMSN
KVFKESEIQPLREMCKASGLTIDFEEVELAVKGKQADMSIYSGVYKKKPKRSSSSSQPKS
RTPTNPHPDDGLLGAGPSSALRSVPDIPNFVLSEDYFLGTISEREMLELVQKFGIDRSGV
LSKDKNLLQRALQIFAGSLSQKVEEVLAENNRINGSSIQNVLTKDFEYSDDEEEKEKEPQ
PEKQKNLPHAQVLLLAQSLLTQPQILAKLAEVLIPQGNPFGLPFPGEHIVPTSSAALTLG
APPPNLMALQQSLPPGFPNQQLGLPNLSGLNQAQLMNVQNAQNMLQLQQRAAQLQALQGN
PNAQRNLLMLGNPLLNPFALQHGVNPMLNDLQAAAAAQQQAMLNEAAQSPEKKILELSGG
NSGINNSGDVERARLREKEKERESKERRRMGLPPVRIGFTIIASRTLWLKKIPTNIVEND
LKQAVESCGEASRVKVIGNRACAYITMENRRSANDVVSKMREVSVAKKMVKVYWARSPGM
DSDQFSDLWDSNRGVLEIPYEKLPLDLVALCEGAMLDIESLPIEKKLLYKETGETVISIP
PPNIQPPVPHPPPMGFPFQHQLTQLPGQPRPAGLPPGVPPMFNLNAPPPPGIPGYPPAPP
PPGVGPPPPQGIPPMGFDPNKPPPPMFQQGFNAGAPPPPFGRGAGPMSSFPPPPRGGMHH
MPPPPSFRGGRGGHGGPPPPHFDRRGGGGPPFRPENGRGRLLDQSEMWNREQREMRGGGG
AGRDGGREHRDYDRDRSQIDRRRQDDMGARRRSRWGDDDRRDDDRRDDRRDDRRESRRRS
PRSPRSPDRRTRRSPSYEREEPPVKKTSVEEETVSSTTLDELKPSVEPTPVPAPIPAPAP
ELKAAEEPVKIVAEHHEDQTDEVPMDLE

Gene 4
[Figure: exon coordinates for Gene 4 from the EST alignment and from the HMMgene, NetGene2 and WebGene predictions; numbers are given in extraction order]
EST 1346 1695
HMMgene 5841 6080 6993 7411 7520 9631 9770 9997
NetGene2 1411 1794 5679 6049 6908 7589 7800 7954
WebGene 5668 6049 6908 7187 7414 7564 7753 7911 8113 7800 7959 8223 8497 8710 9115 9574 9770 9705 9943 10350 5859 6864 7132 7328 7517 7589 7911 8135 8413 8659 9042 9528 9705 9946 (AG) 1691 (GT) 1795 5682 6048 6907 7186 7413 5842 6865 7133 7329 7518 5405 5683 6049 6908 7187 7414 5449 5841 6864 7132 7328 7517 7959 8153 7958 7589 7799 8154 7754 7912 8223 8497 8710 9115 9574 9770 8413 8659 9042 9528 9705 9942 8222 8496 8709 9114 9573 8414 8660 9043 9529 9706 9943 9996
Removed from gene 4: 1412-1691, 7518-7589, 8660-8709, 1795-5682, 7754-7999, 9043-9114, 5842-6048, 7912-7958, 9529-9573, 6865-6907, 8154-8222, 9706-9769, 7133-7413, 8414-8496, 9943-9996

Protein Q9N323
>tr|Q9N323|Q9N323_CAEEL Hypothetical protein - Caenorhabditis elegans.
MSTNNYQTLSQNKADRMGPGGSRRPRNSQHATASTPSASSCKEQQKDVEHEFDIIAYKTT
FWRTFFFYALSFGTCGIFRLFLHWFPKRLIQFRGKRCSVENADLVLVVDNHNRYDICNVY
YRNKSGTDHTVVANTDGNLAELDELRWFKYRKLQYTWIDGEWSTPSRAYSHVTPENLASS
APTTGLKADDVALRRTYFGPNVMPVKLSPFYELVYKEVLSPFYIFQAISVTVWYIDDYVW
YAALIIVMSLYSVIMTLRQTRSQQRRLQSMVVEHDEVQVIRENGRVLTLDSSEIVPGDVL
VIPPQGCMMYCDAVLLNGTCIVNESMLTGESIPITKSAISDDGHEKIFSIDKHGKNIIFN
GTKVLQTKYYKGQNVKALVIRTAYSTTKGQLIRAIMYPKPADFKFFRELMKFIGVLAIVA
FFGFMYTSFILFYRGSSIGKIIIRALDLVTIVVPPALPAVMGIGIFYAQRRLRQKSIYCI
SPTTINTCGAIDVVCFDKTGTLTEDGLDFYALRVVNDAKIGDNIVQIAANDSCQNVVRAI
ATCHTLSKINNELHGDPLDVIMFEQTGYSLEEDDSESHESIESIQPILIRPPKDSSLPDC
QIVKQFTFSSGLQRQSVIVTEEDSMKAYCKGSPEMIMSLCRPETVPENFHDIVEEYSQHG
YRLIAVAEKELVVGSEVQKTPRQSIECDLTLIGLVALENRLKPVTTEVIQKLNEANIRSV
MVTGDNLLTALSVARECGIIVPNKSAYLIEHENGVVDRRGRTVLTIREKEDHHTERQPKI
VDLTKMTNKDCQFAISGSTFSVVTHEYPDLLDQLVLVCNVFARMAPEQKQLLVEHLQDVG
QTVAMCGDGANDCAALKAAHAGISLSEAEASIAAPFTSKVADIRCVITLISEGRAALVTS
YSAFLCMAGYSLTQFISILLLYWIATSYSQMQFLFIDIAIVTNLAFLSSKTRAHKELAST
PPPTSILSTASMVSLFGQLAIGGMAQVAVFCLITMQSWFIPFMPTHHDNDEDRKSLQGTA
IFYVSLFHYIVLYFVFAAGPPYRASIASNKAFLISMIGVTVTCIAIVVFYVTPIQYFLGC
LQMPQEFRFIILAVATVTAVISIIYDRCVDWISERLREKIRQRRKGA

Prediction of mitochondrial genes (human)
Mitochondrial genome NC_012920.1 annotation
tRNA scan prediction for NC_012920.1
tRNA scan lists:
1- all the tRNAs on the current strand
2- all the tRNAs on the complementary strand
This tRNA is found at the end of the list

Conclusion
• Good tRNA prediction
• If you try: very bad protein-coding gene prediction…
– The mitochondrial genome does not have the same sequence content (codon bias, signals) as the nuclear genome.
– You might try a ‘prokaryota’-like gene model, but the results are not perfect!
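One concrete reason why predictors trained on nuclear genes struggle with the human mitochondrial genome, in addition to the differences in codon bias and signals noted above, is that the genetic code itself differs. The sketch below compares the standard code with the vertebrate mitochondrial code and lists the codons that are read differently. The two amino-acid strings are the standard and vertebrate mitochondrial translation tables (NCBI tables 1 and 2); the function names and the 12-nucleotide test sequence are invented here for illustration.

```python
BASES = "TCAG"

# Amino acids for the 64 codons in TCAG order ('*' = stop).
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERTEBRATE_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"


def build_table(amino_acids):
    """Map each codon (TCAG order) to its amino acid."""
    return {
        b1 + b2 + b3: amino_acids[16 * i + 4 * j + k]
        for i, b1 in enumerate(BASES)
        for j, b2 in enumerate(BASES)
        for k, b3 in enumerate(BASES)
    }


def translate(dna, table):
    """Translate an in-frame DNA string with the given codon table."""
    dna = dna.upper()
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


standard_table = build_table(STANDARD)
mito_table = build_table(VERTEBRATE_MITO)

# Codons whose meaning changes between the two codes: codon -> (standard, mito).
differences = {
    codon: (standard_table[codon], mito_table[codon])
    for codon in standard_table
    if standard_table[codon] != mito_table[codon]
}
print(differences)

# Invented test sequence containing three of the reassigned codons.
toy = "ATGATATGAAGA"
print(translate(toy, standard_table))
print(translate(toy, mito_table))
```

A predictor that assumes the standard code will treat TGA as a stop and AGA/AGG as arginine, so it truncates or extends mitochondrial ORFs even before codon bias is taken into account.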