Elements of Bioinformatics (14F001)
TP2: Gene prediction
22 October 2012
CORRECTIONS
Notice:
During this practical, you will need to use ‘raw’ and ‘fasta’
sequence formats.
For additional information on the different sequence formats
available, please have a look at
http://www.genomatix.de/online_help/help/sequence_formats.html
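As an illustration of the two formats, here is a minimal sketch of a conversion between them. It assumes that ‘raw’ means the bare residues with no header line, and that ‘fasta’ means a `>` header line followed by the sequence; the function names are hypothetical and not part of any tool used in this practical.

```python
def fasta_to_raw(fasta_text):
    """Return the bare sequence of the first record in a FASTA string."""
    residues = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header or residues:
                break  # a second record starts here: stop
            seen_header = True
            continue
        # Drop blanks inside the line (some tools print blocks of 10 residues).
        residues.append("".join(line.split()))
    return "".join(residues)


def raw_to_fasta(raw_sequence, name, width=60):
    """Wrap a raw sequence into a single FASTA record."""
    sequence = "".join(raw_sequence.split())
    lines = [">" + name]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"
```

Most of the web servers used here accept one format or the other, so a round trip of this kind is often needed before pasting a sequence.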
ncRNA gene prediction
Choose the eukaryotic tRNA model: the general tRNA model does not give any result!
CpG island prediction
CpG island in the C. elegans cosmid
Length 219 bp; position 21954 to 22172
cgttttctgtggtcaca cacgagtatc cggatcttct
ggatcaactt gttctcgtct gcaacgtctt tgcaagaatg
gcaccagaac agaaacaact actcgtggaa caccttcaag
acgttgggca gacggtcgct atgtgtggcg atggagctaa
tgattgtgct gctctgaaag cagctcacgc gggaatctca
ctatcggagg ctgaagcatc ga
To confirm that this sequence could be part of a promoter
(> 80 % of CpG islands extend into the 5’ flanking
region of the associated genes), check from its
position whether this CpG island lies in a gene promoter
region (see later).
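The two quantities usually reported for a candidate CpG island can be computed by hand, as in the sketch below. It assumes the commonly cited criteria (length of at least 200 bp, GC content above 50 %, observed/expected CpG ratio above 0.6); the prediction server used in this practical may apply different thresholds, and `cpg_stats` is a hypothetical helper name.

```python
def cpg_stats(sequence):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = "".join(sequence.split()).upper()
    length = len(seq)
    if length == 0:
        raise ValueError("empty sequence")
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # CpG dinucleotides cannot overlap each other
    gc_fraction = (c_count + g_count) / length
    if c_count == 0 or g_count == 0:
        obs_exp = 0.0
    else:
        # Expected CpG count under independence is (C * G) / length.
        obs_exp = cpg_count * length / (c_count * g_count)
    return gc_fraction, obs_exp
```

The function accepts the sequence as printed on the slide (lower case, blocks separated by blanks), since blanks are removed and the case is normalised first.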
Gene prediction
with HMM on the complete cosmid sequence
3 HMM models: firstex, exon_n, lastex
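(Figure: HMM predictions on the cosmid; panels labelled Gene 1 (wrong CDS?), Gene 2, Gene 3 and Gene 4)
Summary:
(Figure: map of the cosmid showing the tRNA, labelled 169 and 238, and genes 1 to 4)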
Predicted CpG island: 21954 to 22172
-> in the middle of CDS4: not a ‘classical’ CpG island (not
in the 5’ region of a gene)
Gene 1
Gene 1 prediction with HMMgene
One gene found
Gene 1 prediction with HMMgene
With ‘human’: 2 genes found, one on each strand (the minus-strand gene has lower scores)
The programs are ‘trained’ on sequences from specific organisms.
The ‘codon bias’, for example, differs between species.
Example of codon usage tables (-> codon bias)
http://www.kazusa.or.jp/codon/
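To make the notion of codon bias concrete, the sketch below counts codons in a coding sequence and expresses them per thousand codons, which is the unit used in the Kazusa tables. It is only an illustration with hypothetical function names; it assumes the input starts in frame and ignores a trailing incomplete codon.

```python
from collections import Counter


def codon_counts(cds):
    """Count in-frame codons of a coding sequence (frame 0)."""
    seq = "".join(cds.split()).upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def codon_usage_per_thousand(cds):
    """Return codon frequencies per 1000 codons."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    return {codon: 1000.0 * n / total for codon, n in counts.items()}
```

Comparing such a table for a predicted CDS with the reference table of the organism gives a rough idea of why a program trained on human sequences can behave differently on a C. elegans cosmid.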
Gene 1 prediction with Netgene2
Netgene2 gives the positions of the
first and last nucleotide of the intron
(donor and acceptor splice sites)
(Figure: an intron starting with GT at the donor site and ending with AG at the acceptor site)
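A predicted intron can be checked against the GT...AG rule directly on the genomic sequence. The sketch below assumes 1-based, inclusive intron coordinates on the direct strand, as Netgene2 reports them here; `has_canonical_splice_sites` is a hypothetical helper name.

```python
def has_canonical_splice_sites(genomic, intron_start, intron_end):
    """True if the intron starts with GT and ends with AG.

    Coordinates are 1-based and inclusive, on the direct strand.
    """
    seq = "".join(genomic.split()).upper()
    if not (1 <= intron_start and intron_start + 3 <= intron_end <= len(seq)):
        raise ValueError("intron coordinates outside the sequence or too short")
    intron = seq[intron_start - 1:intron_end]
    return intron.startswith("GT") and intron.endswith("AG")
```

For a gene on the complementary strand, the sequence has to be reverse-complemented (and the coordinates converted) before applying the same test.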
Gene 1 prediction with GeneBuilder
(organism: no choice….human; option: first and last exon disabled)
Matrix: miscellaneous
One gene found
Gene 1 prediction with GenScan
!! No choice except: vertebrate, maize and Arabidopsis !
Two genes found
FGENESH
One gene found
Summary (gene prediction)
(Figure: gene 1 on a 5’ to 3’ axis, one track per program; one track is labelled ‘One gene’. Coordinate labels: 163, 211, 977, 1003, 1083, 1305, 1406, 1452, 1557, 1661, 1914, 1997, 2000.)
Tracks: HMMgene and GenScan (organism = human !!); Genebuilder (organism = human !!); Netgene2; GeneMark: finds a second gene in 3’!!!; FGENESH
+ another potential gene from positions 2000 to 2900
Netgene2 splice sites (DO: donor site, AC: acceptor site; confidence in parentheses):
DO 1084 (1.00)   AC 1304 (0.77)
DO 1407 (0.89)   AC 1451 (0.90)
DO 1662 (1.00)   AC 1913 (1.00)
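Since Netgene2 reports the first and last nucleotide of each intron, the exon coordinates follow by simple arithmetic. The sketch below uses the gene 1 splice sites from this summary; the CDS bounds 1003 and 1997 are an assumption read from the diagram labels, and the function name is hypothetical.

```python
def exons_from_splice_sites(cds_start, cds_end, donors, acceptors):
    """Derive exon intervals (1-based, inclusive) from intron boundaries.

    donors: first nucleotide of each intron; acceptors: last nucleotide.
    """
    if len(donors) != len(acceptors):
        raise ValueError("need one acceptor per donor")
    exons = []
    exon_start = cds_start
    for donor, acceptor in zip(sorted(donors), sorted(acceptors)):
        if donor <= exon_start or acceptor < donor:
            raise ValueError("inconsistent splice sites")
        exons.append((exon_start, donor - 1))  # exon ends just before the donor
        exon_start = acceptor + 1              # next exon starts after the acceptor
    if exon_start > cds_end:
        raise ValueError("last exon is empty")
    exons.append((exon_start, cds_end))
    return exons


# Gene 1: Netgene2 sites from the summary; CDS bounds assumed from the diagram.
gene1_exons = exons_from_splice_sites(
    1003, 1997, donors=[1084, 1407, 1662], acceptors=[1304, 1451, 1913]
)
```

With these inputs the exons are 1003-1083, 1305-1406, 1452-1661 and 1914-1997, for a total of 477 nucleotides, that is 159 codons, which is consistent with the 159 AA proteins predicted by FGENESH and GenScan.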
ID   FGENESH   Unreviewed;   159 AA.
SQ   SEQUENCE   159 AA;  17780 MW;  F9A2C7DE9614425C CRC64;
MKVETCVYSG YKIHPGHGKR LVRTDGKVQI FLSGKALKGA KLRRNPRDIR WTVLYRIKNK
KGTHGQEQVT RKKTKKSVQV VNRAVAGLSL DAILAKRNQT EDFRRQQREQ AAKIAKDANK
AVRAAKAAAN KEKKASQPKT QQKTAKNVKT AAPRVGGKR
//
ID   GENESCAN1   Unreviewed;   159 AA.
SQ   SEQUENCE   159 AA;  17780 MW;  F9A2C7DE9614425C CRC64;
MKVETCVYSG YKIHPGHGKR LVRTDGKVQI FLSGKALKGA KLRRNPRDIR WTVLYRIKNK
KGTHGQEQVT RKKTKKSVQV VNRAVAGLSL DAILAKRNQT EDFRRQQREQ AAKIAKDANK
AVRAAKAAAN KEKKASQPKT QQKTAKNVKT AAPRVGGKR
//
ID   GENESCAN2   Unreviewed;   202 AA.
SQ   SEQUENCE   202 AA;  23684 MW;  98A69FA21823F2F3 CRC64;
MRTLRIAQYS VLTVGFAIYM YRLIEEIPID IRNLNSDSLE GIINSDELCD VTVSNRNRGL
LVRNDSLDLD ILKAKFTTFF SKRYLTRFLS EQVPFLHVID EALLVKRFVM CACFMVFCLT
VIWFLVIRRM GNLIKRLSVL NQLEDAESVE WARCIREFTQ EKLAVLCFCI VPPFAQTDKL
VSDKIKLFRE HKILRIRSVQ HI
//
ID   GENEMARK1   Unreviewed;   184 AA.
SQ   SEQUENCE   184 AA;  20255 MW;  85BB0234E6C14EA0 CRC64;
MGRCGSSGKR DGYGAKDSSS EGLSTMKVET CVYSGYKIHP GHGKRLVRTD GKVQIFLSGK
ALKGAKLRRN PRDIRWTVLY RIKNKKGTHG QEQVTRKKTK KSVQVVNRAV AGLSLDAILA
KRNQTEDFRR QQREQAAKIA KDANKAVRAA KAAANKEKKA SQPKTQQKTA KNVKTAAPRV
GGKR
//
ID   GENEMARK2   Unreviewed;   183 AA.
SQ   SEQUENCE   183 AA;  21336 MW;  64F65D472A58046E CRC64;
MRTLRIAQYS VLTVGFAIYM YRLIEEIPID IRNLNSDSLE GIINSDELCD VTVSNRNRGL
LVRNDSLDLD ILKAKFTTFF SKRYLTRFLS EQVPFLHVID EALLVKRFVM CACFMVFCLT
VIWFLVIRRM GNLIKRLSVL NQLEDAESVE WARCIREFTQ EKLAVLCFCI VPPFAQTDNV
QHI
//
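The predicted proteins above can be compared without running an alignment program when one is simply a longer version of the other. The sketch below checks whether the FGENESH protein is the C-terminal part of the GENEMARK1 protein and returns the extra N-terminal residues; the sequences are copied from the entries above and the helper names are hypothetical.

```python
def clean(sequence):
    """Remove the blanks and line breaks of the 10-residue block layout."""
    return "".join(sequence.split())


FGENESH = clean("""
MKVETCVYSG YKIHPGHGKR LVRTDGKVQI FLSGKALKGA KLRRNPRDIR WTVLYRIKNK
KGTHGQEQVT RKKTKKSVQV VNRAVAGLSL DAILAKRNQT EDFRRQQREQ AAKIAKDANK
AVRAAKAAAN KEKKASQPKT QQKTAKNVKT AAPRVGGKR
""")

GENEMARK1 = clean("""
MGRCGSSGKR DGYGAKDSSS EGLSTMKVET CVYSGYKIHP GHGKRLVRTD GKVQIFLSGK
ALKGAKLRRN PRDIRWTVLY RIKNKKGTHG QEQVTRKKTK KSVQVVNRAV AGLSLDAILA
KRNQTEDFRR QQREQAAKIA KDANKAVRAA KAAANKEKKA SQPKTQQKTA KNVKTAAPRV
GGKR
""")


def n_terminal_extension(longer, shorter):
    """Return the extra N-terminal residues of `longer`, or None if
    `shorter` is not its C-terminal part."""
    if longer.endswith(shorter):
        return longer[:len(longer) - len(shorter)]
    return None


extension = n_terminal_extension(GENEMARK1, FGENESH)
```

Here the GeneMark protein is the FGENESH protein preceded by 25 extra residues, which suggests that the two programs disagree only on the position of the start codon.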
For fun…
Compare the predictions of the same
program (GeneMark) with different
parameters (HMM trained on
eukaryota or prokaryota)
Two genes found
Gene 1 prediction with GeneMark (prokaryota specific; E.coli K12)
Protein 1
Protein 2
Gene 1 prediction with GeneMark (prokaryota specific)
Protein 1
Protein 2
Each predicted CDS corresponds roughly to an ‘exon’: there are no introns in prokaryota!
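Without introns, a coding region is an uninterrupted open reading frame, which is why a prokaryotic model reports exon-like pieces on a eukaryotic gene. The sketch below is a naive ORF scan (ATG to the next in-frame stop, direct strand only). It is not the method used by GeneMark, which relies on trained Markov models; it only illustrates the ‘no intron’ idea, and the names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=30):
    """Return (start, end) of ORFs on the direct strand, 1-based inclusive.

    An ORF runs from the first ATG of a frame to the next in-frame stop
    codon (stop included); min_codons counts the codons before the stop.
    """
    seq = "".join(sequence.split()).upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start + 1, pos + 3))
                start = None
    return sorted(orfs)
```

Scanning the complementary strand would require reverse-complementing the sequence first and converting the coordinates back.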
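Summary (prokaryota gene prediction)
(Figure: gene 1 on a 5’ to 3’ axis, one track per program. Coordinate labels: 1003, 1083, 1305, 1406, 1452, 1557, 1661, 1914, 1997, 2000.)
Tracks: HMMgene, GenScan and GeneMark (euka); Genebuilder; Netgene2; GeneMark (proka)
GeneMark (proka): Protein 1 (labels 1254 and 1433), Protein 2 (1437 to 1688)
Netgene2 splice sites (DO: donor site, AC: acceptor site; confidence in parentheses):
DO 1084 (1.00)   AC 1304 (0.77)
DO 1407 (0.89)   AC 1451 (0.90)
DO 1662 (1.00)   AC 1913 (1.00)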
Alignment between the ‘eukaryota and prokaryota’ predicted sequences
Gene prediction: similarity searches with ESTs
ESTs: expressed sequence tags
(cDNAs that are sequenced quickly and with many errors)
Two genes found
Blast 2012
Gene A
Gene B
Blast 2010
Gene A
Gene B
Gene A
EST1
>gi|47590759|gb|BJ750997.1|BJ750997 BJ750997 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 5', mRNA sequence
GGTTTAATTACCCAAGTTTGAGATTCGTCAAGCGAGGGCCTATCAGCAATGAAGGTCGAAACCTGCGTTTACTCCGGATACAAGATCCACCCAGGACACGGAAAGAGA
CTTGTCCGTACTGACGGAAAGGTGAGTTCAGTTTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTATATTGTAATTTTACGAGTGTTGAAGTATTGCAAAAGTAAA
GCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATACCATCGTTAAGCATGCCACGTGTTGAGTGCGACAAACTACCGTTTCATGATTTATTTATT
CAAATTTCAGGTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGACTGTCCTCTACAGAATCAAGAACAAGAAGGGAA
CCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTCCGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAA
GACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAA
EST2
>gi|47646579|gb|BJ775052.1|BJ775052 BJ775052 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 3', mRNA sequence
ATAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGTCTTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCT
TGTTGGCAGCAGCCTTGGCGGCACGGACAGCCTTGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTCTCTTGGCAAGGATA
GCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGTCTTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAG
GACAGTCCATCTGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCTGAAATTTGAATAAATAAATCATGAAACGGTAGTTTG
TCGCACTCAACACGTGGCATGCTTAACGATGGTATAACTTCTAAAAACTAGAAGATATAGCACCAACACATACATAAGGTGATTATGCTTTACTTTTGCAATACTTCAACACTCGT
AAAATTACAATATACCTTACGAGCTCTAACAGCATGCTAACGCCTTTCAAAGAGAAACTGAACTCACCTTTCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTG
TATCCGGAGTAAACGCAGGTTTCGACCTTCATTGCTGATANGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCCCA
EST3
>gi|47727995|gb|BJ818152.1|BJ818152 BJ818152 unpublished oligo-capped cDNA library, stage L4 Caenorhabditis elegans cDNA clone yk1685h11 3', mRNA sequence
TAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGTC
TTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGCCTTGGCGGCACGGACAGCCT
TGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTCT
CTTGGCAAGGATAGCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGTC
TTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCATC
TGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCTT
TCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAAACGCAGGTTTCG
ACCTTCATTGTTGATAGGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCTACAAATAAAAATG
AGATAAAGCATACTGCCATTCTACAACCGGAGAATAAGAAAACCGAAAACGAGAAAATTATTCTATTATG
ACAGATAGAATAAGTTAAAATGGGAAGAGTGCATTTGTCACTGATTTACTTGGTGACTTGGTGGAGAGCG
TGGGCAAGGTAAGCGACATTGTTCGATGAA
Blast result with EST1
975-1407
1450-1615
1692-1865
BUT: Blast does not take the
intron-exon boundaries into account when
aligning genomic DNA with an mRNA-derived sequence -> we
have to use a dedicated tool: SIM4
The 3rd part of EST1 is
of very bad quality
SIM4 alignment
Example with
EST 1 BJ750997
(partial)
The 3rd part of EST1 is
of very bad quality: it is not
aligned by SIM4 -> EST1 is
considered partial!
SIM4 alignment results
EST 1 BJ750997
(partial)
EST 2 BJ775052
EST 3 BJ818152
Summary (ESTs)
(Figure: Gene A on a 5’ to 3’ axis with the three aligned ESTs: EST3 BJ818152.1, EST1 BJ750997.1 (label 1615 …) and EST2 BJ775052.1. Coordinate labels: 1003, 1083, 1305, 1406, 1452, 1661, 1914, 1997.)
Alternative splicing event (intron retention)
-> 2 different mRNAs
(EST BJ750997.1 is partial)
Translation and BLASTp
Translation
(beware the EST sequence orientation !)
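The orientation issue can be illustrated with a few lines of code: a 3' read such as EST2 is the reverse complement of the mRNA, so it has to be reverse-complemented before translation. The sketch below assumes the standard genetic code (built from the usual TCAG-ordered table) and uses hypothetical function names; unknown codons, for example those containing N, are translated as X.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def translate(sequence, frame=0):
    """Translate a DNA sequence in the given frame (0, 1 or 2)."""
    seq = sequence.upper()
    protein = []
    for pos in range(frame, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[pos:pos + 3], "X"))
    return "".join(protein)


# Fragment of the 3' read EST2 (BJ775052), taken from the sequence above.
est2_fragment = "AACGCAGGTTTCGACCTTCAT"
sense_fragment = reverse_complement(est2_fragment)
```

After reverse complementation the fragment reads ATGAAGGTCGAAACCTGCGTT, which is also found in the 5' read EST1 and translates to MKVETCV, the start of the predicted protein.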
>gi|47590759|gb|BJ750997.1|BJ750997 BJ750997 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 5',
mRNA sequence
GGTTTAATTACCCAAGTTTGAGATTCGTCAAGCGAGGGCCTATCAGCAATGAAGGTCGAAACCTGCGTTT
ACTCCGGATACAAGATCCACCCAGGACACGGAAAGAGACTTGTCCGTACTGACGGAAAGGTGAGTTCAGT
TTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTATATTGTAATTTTACGAGTGTTGAAGT
ATTGCAAAAGTAAAGCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATA
CCATCGTTAAGCATGCCACGTGTTGAGTGCGACAAACTACCGTTTCATGATTTATTTATTCAAATTTCAG
GTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGAT
GGACTGTCCTCTACAGAATCAAGAACAAGAAGGGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGAC
CAAGAAGTCCGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGA
AACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAA
EST1
MIYLFKFQVQIFLSGKALKGAKLRRNPRDIRWTVLYRIKNKKGTHGQEQVTRKKTKKSVQ
VVNRAVAGLSLDAILAKRNQTEDFRRQQREQAAKIA
Blastp results
>gi|47646579|gb|BJ775052.1|BJ775052 BJ775052 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 3',
mRNA sequence
ATAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGT
CTTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGCCTTGGCGGCACGGACAGCC
TTGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTC
TCTTGGCAAGGATAGCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGT
CTTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCAT
CTGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCT
GAAATTTGAATAAATAAATCATGAAACGGTAGTTTGTCGCACTCAACACGTGGCATGCTTAACGATGGTA
TAACTTCTAAAAACTAGAAGATATAGCACCAACACATACATAAGGTGATTATGCTTTACTTTTGCAATAC
TTCAACACTCGTAAAATTACAATATACCTTACGAGCTCTAACAGCATGCTAACGCCTTTCAAAGAGAAAC
TGAACTCACCTTTCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAA
ACGCAGGTTTCGACCTTCATTGCTGATANGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCCC
A
EST2
MIYLFKFQVQIFLSGKALKGAKLRRNPRDIRWTVLYRIKNKKGTHGQEQVTRKKTKKSVQ
VVNRAVAGLSLDAILAKRNQTEDFRRQQREQAAKIAKDANKAVRAAKAAANKEKKASQPK
TQQKTAKNVKTAAPRVGGKR
Blastp results
>gi|47727995|gb|BJ818152.1|BJ818152 BJ818152 unpublished oligo-capped cDNA library, stage L4 Caenorhabditis elegans cDNA clone yk1685h11 3', mRNA sequence
TAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGTC
TTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGCCTTGGCGGCACGGACAGCCT
TGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTCT
CTTGGCAAGGATAGCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGTC
TTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCATC
TGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCTT
TCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAAACGCAGGTTTCG
ACCTTCATTGTTGATAGGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCTACAAATAAAAATG
AGATAAAGCATACTGCCATTCTACAACCGGAGAATAAGAAAACCGAAAACGAGAAAATTATTCTATTATG
ACAGATAGAATAAGTTAAAATGGGAAGAGTGCATTTGTCACTGATTTACTTGGTGACTTGGTGGAGAGCG
TGGGCAAGGTAAGCGACATTGTTCGATGAA
EST3
Gene A
EST1 is partial in C-ter
Gene A
EST1 is partial.
EST3 corresponds to the UniProtKB/Swiss-Prot RL24_CAEEL sequence
Gene A
Some prediction programs give the correct protein
sequence
None have predicted the alternative splicing event
(EST2; intron 1084-1304 retention)
Summary (ESTs)
(Figure: Gene A on a 5’ to 3’ axis with EST BJ818152 (label 1010, protein MKVET…..) and EST BJ775052.1 (label 1284, protein MIYLF…..). Coordinate labels: 1003, 1083, 1305, 1406, 1452, 1661, 1914, 1997.)
Alternative splicing events (intron retention)
-> 2 different mRNAs
Gene 1 is on C. elegans chromosome I
BLAT results
Isoform 2
EST2
Gene A
Gene B
RefSeq sequence
>NP_491399 length=159
MKVETCVYSGYKIHPGHGKRLVRTDGKVQIFLSGKALKGAKLRRNPRDIR
WTVLYRIKNKKGTHGQEQVTRKKTKKSVQVVNRAVAGLSLDAILAKRNQT
EDFRRQQREQAAKIAKDANKAVRAAKAAANKEKKASQPKTQQKTAKNVKT
AAPRVGGKR
InterPro scan results: the protein contains a ribosomal L24e domain
Conclusions (1)
Gene A
There are 2 different protein sequences due to alternative splicing:
the shorter isoform results from an intron retention
and is rarely expressed (only 2 ESTs)
Conclusions (2)
Gene prediction programs cannot predict an alternative splicing event
(they can only predict the alternative splice junctions)
The protein (Gene A) is a ribosomal protein which belongs to the
ribosomal protein L24e family (UniProtKB/Swiss-Prot O01868).
The alternatively spliced sequence is not yet in the protein
sequence databases, because it is ‘derived’ from EST sequences,
which are submitted to public DNA/RNA databases without
an annotated CDS
Non coding region analysis
3’end of chromosome Y
EMBL #AJ271736
Example of Alu sequence
Gene 2
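Summary diagram
(Figure: gene 2 exon structure with the 5’ and 3’ ends marked. Exon 1, Exon 2 and Exon 3 are drawn with the coordinate labels 789, 1111, 1410, 1557, 1636, 1688 and 1845; another track carries the labels 1112, 1407, 1637 and 1688.)
Tracks: HMMgene; Netgene2
Netgene2 splice sites (DO: donor, AC: acceptor; confidence in parentheses):
DO 1112 (0.56)   AC 1409 (0.92)
DO 1556 (0.96)   AC 1637 (0.61)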
GeneBuilder prediction is not confirmed anywhere else
CDS2 (3 exons)
RefSeq NP_491393 (AF272397)
UniProtKB/TrEMBL: G5EC89
237 AA; 3 exons
MMMEYGGYFS SSAVAQQSGD VPTTAPSAVT NSFFYTPQSH
NIYHQYATPY LQSGRALTTA HNTSSSSAGN STSSSSSSSN
YRNTTHDSLQ AFFNTGLQYQ LYQKSQLIGS DTIQRTSSNV
LNGLPRSSLV GALCSTGGAP LNPAERRKQR RIRTTFTSGQ
LKELERSFCE THYPDIYTRE EIAMRIDLTE ARVQVWFQNR
RAKYRKQEKI RRVKDEEEDP LKKEPGQISL EEIIDQI
A probable nuclear protein with a DNA binding domain (homeobox)
Gene 3
Numbering: direct strand
CDS3
>tr|O01864|O01864_CAEEL Hypothetical protein - Caenorhabditis elegans.
METEVMKSFNNELSSLFDSKNMSKNKIQDITKAAIKAKSQYKHVVFSVEKLINKCKPDQR
LNVLYVIDSIVRASKHQLKEKDTFGPRFMKQFDKFLMPLLKCGQKEKMRTVRTLNLWMSN
KVFKESEIQPLREMCKASGLTIDFEEVELAVKGKQADMSIYSGVYKKKPKRSSSSSQPKS
RTPTNPHPDDGLLGAGPSSALRSVPDIPNFVLSEDYFLGTISEREMLELVQKFGIDRSGV
LSKDKNLLQRALQIFAGSLSQKVEEVLAENNRINGSSIQNVLTKDFEYSDDEEEKEKEPQ
PEKQKNLPHAQVLLLAQSLLTQPQILAKLAEVLIPQGNPFGLPFPGEHIVPTSSAALTLG
APPPNLMALQQSLPPGFPNQQLGLPNLSGLNQAQLMNVQNAQNMLQLQQRAAQLQALQGN
PNAQRNLLMLGNPLLNPFALQHGVNPMLNDLQAAAAAQQQAMLNEAAQSPEKKILELSGG
NSGINNSGDVERARLREKEKERESKERRRMGLPPVRIGFTIIASRTLWLKKIPTNIVEND
LKQAVESCGEASRVKVIGNRACAYITMENRRSANDVVSKMREVSVAKKMVKVYWARSPGM
DSDQFSDLWDSNRGVLEIPYEKLPLDLVALCEGAMLDIESLPIEKKLLYKETGETVISIP
PPNIQPPVPHPPPMGFPFQHQLTQLPGQPRPAGLPPGVPPMFNLNAPPPPGIPGYPPAPP
PPGVGPPPPQGIPPMGFDPNKPPPPMFQQGFNAGAPPPPFGRGAGPMSSFPPPPRGGMHH
MPPPPSFRGGRGGHGGPPPPHFDRRGGGGPPFRPENGRGRLLDQSEMWNREQREMRGGGG
AGRDGGREHRDYDRDRSQIDRRRQDDMGARRRSRWGDDDRRDDDRRDDRRDDRRESRRRS
PRSPRSPDRRTRRSPSYEREEPPVKKTSVEEETVSSTTLDELKPSVEPTPVPAPIPAPAP
ELKAAEEPVKIVAEHHEDQTDEVPMDLE
Gene 4
(Figure: gene 4 exon structure on the cosmid, with one track each for the EST alignment, HMMgene, Netgene2 and WebGene. Exon boundary labels run from 1346 to 10350; an (AG) acceptor site is marked at 1691 and a (GT) donor site at 1795.)
Removed from gene 4:
1412-1691, 1795-5682, 5842-6048, 6865-6907, 7133-7413,
7518-7589, 7754-7999, 7912-7958, 8154-8222, 8414-8496,
8660-8709, 9043-9114, 9529-9573, 9706-9769, 9943-9996
Protein Q9N323
>tr|Q9N323|Q9N323_CAEEL Hypothetical protein - Caenorhabditis elegans.
MSTNNYQTLSQNKADRMGPGGSRRPRNSQHATASTPSASSCKEQQKDVEHEFDIIAYKTT
FWRTFFFYALSFGTCGIFRLFLHWFPKRLIQFRGKRCSVENADLVLVVDNHNRYDICNVY
YRNKSGTDHTVVANTDGNLAELDELRWFKYRKLQYTWIDGEWSTPSRAYSHVTPENLASS
APTTGLKADDVALRRTYFGPNVMPVKLSPFYELVYKEVLSPFYIFQAISVTVWYIDDYVW
YAALIIVMSLYSVIMTLRQTRSQQRRLQSMVVEHDEVQVIRENGRVLTLDSSEIVPGDVL
VIPPQGCMMYCDAVLLNGTCIVNESMLTGESIPITKSAISDDGHEKIFSIDKHGKNIIFN
GTKVLQTKYYKGQNVKALVIRTAYSTTKGQLIRAIMYPKPADFKFFRELMKFIGVLAIVA
FFGFMYTSFILFYRGSSIGKIIIRALDLVTIVVPPALPAVMGIGIFYAQRRLRQKSIYCI
SPTTINTCGAIDVVCFDKTGTLTEDGLDFYALRVVNDAKIGDNIVQIAANDSCQNVVRAI
ATCHTLSKINNELHGDPLDVIMFEQTGYSLEEDDSESHESIESIQPILIRPPKDSSLPDC
QIVKQFTFSSGLQRQSVIVTEEDSMKAYCKGSPEMIMSLCRPETVPENFHDIVEEYSQHG
YRLIAVAEKELVVGSEVQKTPRQSIECDLTLIGLVALENRLKPVTTEVIQKLNEANIRSV
MVTGDNLLTALSVARECGIIVPNKSAYLIEHENGVVDRRGRTVLTIREKEDHHTERQPKI
VDLTKMTNKDCQFAISGSTFSVVTHEYPDLLDQLVLVCNVFARMAPEQKQLLVEHLQDVG
QTVAMCGDGANDCAALKAAHAGISLSEAEASIAAPFTSKVADIRCVITLISEGRAALVTS
YSAFLCMAGYSLTQFISILLLYWIATSYSQMQFLFIDIAIVTNLAFLSSKTRAHKELAST
PPPTSILSTASMVSLFGQLAIGGMAQVAVFCLITMQSWFIPFMPTHHDNDEDRKSLQGTA
IFYVSLFHYIVLYFVFAAGPPYRASIASNKAFLISMIGVTVTCIAIVVFYVTPIQYFLGC
LQMPQEFRFIILAVATVTAVISIIYDRCVDWISERLREKIRQRRKGA
Prediction of mitochondrial genes
(human)
Mitochondrial genome
NC_012920.1 annotation
tRNA scan prediction
NC_012920.1
tRNA scan lists
1- all the tRNAs in the current strand
2- all the tRNAs in the complement strand
This tRNA is found at the end of the list
Conclusion
• Good tRNA prediction
• If you try: very bad protein-coding gene
prediction….
– The mitochondrial genome does not have the same
sequence content (codon bias, signals)
as the nuclear genome.
– You might try a ‘prokaryota’-like gene
model, but the results are not perfect… !