Download Protein sequence - Purdue Genomics Wiki

Seq5part2(50244-10000)
1. NCBI Blast
EST database
- EST1019 Zea mays embryo sac cDNA library Zea mays cDNA clone ES2376 5', mRNA sequence (GenBank: CF972490.1) “Transcriptome of Zea mays embryo sac”
- zmrww005_0B20-004-d01.s0 zmrww005 Zea mays cDNA 5', mRNA sequence (GenBank: CK371462.1) “Functional Genomics of Root Growth and Root Signaling Under Drought”
- zmrww00_0B20-004-d01.s2 zmrww00 Zea mays cDNA 3', mRNA sequence (GenBank: CF637313.1) “Functional Genomics of Root Growth and Root Signaling Under Drought”
- 25273824 CERES-504 Zea mays cDNA clone 1585441 3', mRNA sequence (GenBank: FL474732.1) “Insights into corn genes derived from large-scale cDNA sequencing”
  Tissue: root and shoot
- ZM_BFb0268O02.f ZM_BFb Zea mays cDNA 3', mRNA sequence (GenBank: DY536200.1) “Maize Full-length cDNA Project”
- Zm03_02c05_A Zm03_AAFC_ECORC_cold_stressed_maize_seedlings Zea mays cDNA clone Zm03_02c05, mRNA sequence (GenBank: BG319847.1) “Expressed Sequence Tags from Cold-Stressed Maize Seedlings Grown Under High Light Intensity”
  Tissue: leaf
- E0081 Zea mays egg cell cDNA library Zea mays cDNA clone 117 5' similar to retroposon, mRNA sequence (GenBank: DR452005.1) “Transcriptome of Zea mays egg cell”
- EST5083 Zea mays sperm cell cDNA library Zea mays cDNA clone Zmsp12591 5', mRNA sequence (GenBank: CK700981.1) “Sperm cells of Zea mays have a complex complement of mRNAs”
- EL01N0306E03.b Endosperm_3 Zea mays cDNA, mRNA sequence (GenBank: CD433258.1) “Characterization of the maize endosperm transcriptome and its comparison to the rice genome”
I then used these EST sequences as queries to search the corn (Zea mays) nucleotide database at
NCBI. All of the sequences hit on:
- a chromosome 8 clone (Genomic sequence for Zea mays clone ZMMBBb0614J24, from chromosome 8, complete sequence; GenBank: AC157487.1)
- Zea mays cultivar inbred line B73 teosinte glume architecture 1 (tga1) gene, complete cds (GenBank: AY883559.2)
- Zea mays cytochrome P450 monooxygenase CYP71C3v2 gene, complete cds (GenBank: AY072299.1)
Nucleotide
Next, I used the full-length (50000 bp) sequence as the query to search the corn nucleotide
database. The hits with the highest scores are:
1) Zea mays B transcriptional activator (b1) gene, b1-B' allele, exons 1 through 3 and partial
cds GenBank: AY078063.2
2) Genomic sequence for Zea mays clone ZMMBBb0614J24, from chromosome 8, complete
sequence GenBank: AC157487.1
3) Zea mays cultivar inbred line B73 teosinte glume architecture 1 (tga1) gene, complete cds
GenBank: AY883559.2
Also, there are a lot of Mu transposons in the sequences.
Here is the dotplot of the query sequence and the b1 gene:
I then searched the nucleotide database of all organisms and found a hit named “Zea
mays gypsy retrotransposon huck, and copia retrotransposon ji, complete sequence; and helitron
Mo17_14594, complete sequence” (GenBank: DQ002408.1). This result is consistent with the
RepeatMasker report, in which 83.18% of the sequence consists of Copia or Gypsy elements.
Here is the dotplot of the query sequence and the transposon sequences:
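A dotplot marks every position pair (i, j) where a short word from one sequence also occurs in the other, so shared regions show up as diagonal runs. The following is a minimal sketch of that idea, assuming exact k-mer matches on the forward strand only; it is not the tool used for the plots in this report, and the function names and toy sequences are hypothetical.

```python
def dotplot_matches(seq_a, seq_b, k=4):
    """Return (i, j) pairs where the k-mer starting at seq_a[i] equals the one at seq_b[j]."""
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    # Index every k-mer of seq_b by its start positions.
    index = {}
    for j in range(len(seq_b) - k + 1):
        index.setdefault(seq_b[j:j + k], []).append(j)
    matches = []
    for i in range(len(seq_a) - k + 1):
        for j in index.get(seq_a[i:i + k], []):
            matches.append((i, j))
    return matches


def render(seq_a, seq_b, matches):
    """Draw matches as a text grid: rows follow seq_a, columns follow seq_b."""
    grid = [["." for _ in seq_b] for _ in seq_a]
    for i, j in matches:
        grid[i][j] = "*"
    return "\n".join("".join(row) for row in grid)


if __name__ == "__main__":
    a = "ATGGCGACCGACAAC"  # toy sequences for illustration only
    b = "TTATGGCGACCGA"
    print(render(a, b, dotplot_matches(a, b, k=4)))
```

For sequences as long as the 50000 bp query, a larger word size (k) keeps the number of chance matches manageable.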
2. Gene Prediction
The next step is to predict the genes in this sequence. The programs I chose are FGENESH
and Augustus (http://bioinf.uni-greifswald.de/augustus/submission).
FGENESH
FGENESH predicted 7 genes. The predicted genes were then translated into peptides, and these
peptides were used as BLASTP queries against the Swiss-Prot database. Three of them had
significant hits.
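The translation step can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and a coding sequence that is already spliced and in frame; the name translate is hypothetical, and the actual translation in this report may have been done by the prediction tools themselves.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    peptide = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


if __name__ == "__main__":
    # First 21 bases of the FGENESH Segment 1 coding sequence.
    print(translate("ATGGCGACCGACAACTCGCCC"))
```

The example input is the start of the Segment 1 coding sequence, and the result can be compared with the first residues of the Segment 1 protein.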
Segment 1: 62665 - 64698
1 CDSf: 62665 - 63676 1011bp
2 CDSl: 63853 - 64698 846bp
Retrotrans_gag[pfam03732], Retrotransposon gag protein; Gag or Capsid-like proteins from LTR
retrotransposons.
>ATGGCGACCGACAACTCGCCCGCCGGCGGCGGAATCGACGACGTCTTCCCCGCGCGGTGGAAGAACAAC
ATTCGAGCTTGCCTCGTCCCCTCCCCCGCCGACGGAGGAGGAGGCGGGGCAACCCAAGGCCAAGCAGGA
GGCGGCACCTCGTCGGCTGTCGAGCGAGTCGACGGTGCCAGCGCCCCAATGGGGGGCACGTCGGGCATC
GACCTCGCGTCTGAGACGAAGACGAGCGCCGTCTCCCCGCAACACGTCAACCCCAAGCAAACGGACGACG
CCAACACGCTCGCAAGGGACTTGCTGGGCGTCACCCTCGTACCTGAGACGGCGGTGCAGTCTACCCCTGAC
GTGACTTCGTCACCGCCCGTCGACCAAGAGGTACCGACCGATTCCCATCTCGCGCCTTTTGGATTCAGCCTC
AACCCCCCAAGCGACTTCGCTTTGGTGGACGCTCTCATAGAGGCGAGTCCAAACCCTCTGGGGTATCGTATG
CGGTCACCATGGGACCGGCTGACGGCCGTCTCAACCTACGGGCCCTTAGGGTCCGAGGAAGATGACGAGC
CCGACTTTAGTTGGGATTTCTCTGGACTTGGTAACCCCAGTGCCATGCGGGACTTTATGACCGCGTGCGACT
ACTGCCTTTCCGACTGTTCCGACGGTAGCCGCAGCCTCGGCGACAAGGACTGCGGCCCAAGTCGTGAATGT
TTTCACGTCGATCTAGGGGGTCCCGACGAAGGCAACCATCTTGGTATGCCAGAGAATGGTGACCTTCCTAG
GCCTGTGCCTCACGTTGACATCCTTCGGGAGCTAGCTGTGGTCCCCGTTCCGGCAGGGGGTCATGACCCAC
AACTCGAGCAAATCCGCGAGATGCAGGCCAGGCTCGACGAGGGAGCAGGAACACTTGAGCCGTTCCGCC
GGGACAATAGGCAGGAATGGGCGGGCCAACCTCTGGCCGGAGAAGTGCGTCATCTACCCCAGGGCATCCA
GCACCGCGTCGCCGACGATGTCAGGgtaaggccgccaccggtttccagtggggtcggccagaacctggctgcagcggcaatact
tctccgcgcgatgccggagccatcaaccaccgaggggcggcgtatccagggagagctcaagaacctcctggaggacgccgcggtctgacg
ggccgaaagctccgcctcccgaaggcagGGGTACCCCTCGGAACATCGCGCCGCGACTTCCCGATTCATGCGGGAAG
CCTCGGTCCACACCGGCCGCATGCGTAACATAGCGCATGCGGCCCCGGGTCGCCTCGGCAACGAGCACCAT
CACCATAACTGTTGGGCCCACCTCGACGAGAGGGTGCGCCGAGGCTACCACCCCAGGCGTGGGGGACGCT
ACGACAGCGGGGAGGATCGGAGTCCCTCGCCCAAACCACCTGGTCCGCAGGCTTTCAACCGCGCCATACG
ACGGGCGCCGTTCCCGACCCGGTTCCGAACCCCGACTACTATCACAAAGTACTCGGGGGAGACGAGACCG
GAACTGTGGCTCGCAGACTACCGGCTGGCCTGCCAGCTGGGTGGAACGGACGATGACAACCTCATCATCTG
CAACCTCCCCCTGTTCCTTTCCGACACCGCTCGCGCCTGGCTGGAGCACCTGCCTCCGGGGCAGATCTCCAA
CTGGGACGACCTGGTCCAAGCCTTCGCCGGTAATTTCCAGGGCACGTACGTGCGCCCTGGAAACTCCTGGG
ATCTCCGAAGCTGCCGCCAGCAGCCGGGGGGGTCTCTCCGGGACTACATCCGGCGATTCTCGAAGCAGCG
CACCGAGCTGCCCAACATCGCCGATTCGGATGTCATCGGCGCGTTCCTCGCCGGCACCACCTGCCGTGACCT
GGTGAGCAAGCTGGGTCGCAAGACCCCCACCAGGGCGAGCGAGCTGATGGACATCGCCACCAAGTTCGCC
TCTGGCCAGGAGGCGGTTGAGGCCATCTTCCGGAAGGACAAGCAGCCCCAGGGCCGCCCACCGGAAGAT
GTCCCCGAGGCGTCAACTTAG
Protein sequence:
MATDNSPAGGGIDDVFPARWKNNIRACLVPSPADGGGGGATQGQAGGGTSSAVERVDGASAPMGGTSGID
LASETKTSAVSPQHVNPKQTDDANTLARDLLGVTLVPETAVQSTPDVTSSPPVDQEVPTDSHLAPFGFSLNPPS
DFALVDALIEASPNPLGYRMRSPWDRLTAVSTYGPLGSEEDDEPDFSWDFSGLGNPSAMRDFMTACDYCLSDC
SDGSRSLGDKDCGPSRECFHVDLGGPDEGNHLGMPENGDLPRPVPHVDILRELAVVPVPAGGHDPQLEQIRE
MQARLDEGAGTLEPFRRDNRQEWAGQPLAGEVRHLPQGIQHRVADDVRGYPSEHRAATSRFMREASVHTG
RMRNIAHAAPGRLGNEHHHHNCWAHLDERVRRGYHPRRGGRYDSGEDRSPSPKPPGPQAFNRAIRRAPFPT
RFRTPTTITKYSGETRPELWLADYRLACQLGGTDDDNLIICNLPLFLSDTARAWLEHLPPGQISNWDDLVQAFAG
NFQGTYVRPGNSWDLRSCRQQPGGSLRDYIRRFSKQRTELPNIADSDVIGAFLAGTTCRDLVSKLGRKTPTRAS
ELMDIATKFASGQEAVEAIFRKDKQPQGRPPEDVPEAST
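In the nucleotide listing above, the coding exons appear in upper case and the intron in lower case. This is an inference from the listing rather than something the report states: the lower-case run starts with gt and ends with ag, and the protein reads straight across it. Under that assumption, the spliced coding sequence can be recovered by dropping the lower-case runs. The function names below are hypothetical, and the example fragment is the exon/intron junction of this segment with the interior of the intron omitted.

```python
import re


def split_by_case(seq):
    """Split a mixed-case sequence into (kind, text) runs: upper = exon, lower = intron."""
    runs = []
    for m in re.finditer(r"[A-Z]+|[a-z]+", seq):
        text = m.group()
        runs.append(("exon" if text.isupper() else "intron", text))
    return runs


def spliced_cds(seq):
    """Join the upper-case (exon) runs; line breaks and other characters are ignored."""
    return "".join(text for kind, text in split_by_case(seq) if kind == "exon")


def has_canonical_splice_sites(intron):
    """Check for the common GT...AG intron boundaries."""
    intron = intron.lower()
    return intron.startswith("gt") and intron.endswith("ag")


if __name__ == "__main__":
    # Junction from Segment 1; the intron interior is omitted for brevity.
    fragment = "GATGTCAGGgtaaggccggcagGGGTACCCC"
    for kind, text in split_by_case(fragment):
        print(kind, text)
    print(spliced_cds(fragment))
```

Adjacent exon runs in the wrapped listing merge automatically, because the line breaks between them are skipped.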
Segment 2: 66287 - 69085
3 exons:
1 CDSf 66287 - 67405 1119bp
2 CDSi 67439 - 67615 177bp
3 CDSl 68270 - 69085 816bp
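The exon lengths in this table follow from the coordinates when both endpoints are counted (length = end - start + 1). A small sketch that recomputes them and checks that the total coding length is a multiple of three; the variable and function names are hypothetical.

```python
def exon_length(start, end):
    """Length of a feature whose start and end coordinates are both inclusive."""
    return end - start + 1


# FGENESH Segment 2 exons as (label, start, end), copied from the table above.
exons = [
    ("CDSf", 66287, 67405),
    ("CDSi", 67439, 67615),
    ("CDSl", 68270, 69085),
]

lengths = [exon_length(start, end) for _, start, end in exons]
total = sum(lengths)

if __name__ == "__main__":
    for (label, start, end), n in zip(exons, lengths):
        print(f"{label} {start} - {end} {n}bp")
    print("total coding length:", total, "bp; in frame:", total % 3 == 0)
```

A total that is not divisible by three would suggest a coordinate or frame problem in the predicted gene model.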
RNase_HI_archaeal_like[cd09279], RNAse HI family that includes Archaeal RNase HI;
RT_LTR[cd01647], RT_LTR: Reverse transcriptases (RTs) from retrotransposons and retroviruses
which have long terminal repeats (LTRs) in their DNA copies but not in their RNA template.
rve[pfam00665], Integrase core domain
RVT_3[pfam13456], Reverse transcriptase-like; This domain is found in plants and appears to be
part of a retrotransposon.
RNase_HI_RT_Ty3[cd09274], Ty3/Gypsy family of RNase HI in long-term repeat retroelements;
RNase_H[cd06222], RNase H is an endonuclease that cleaves the RNA strand of an RNA/DNA
hybrid in a sequence non-specific manner
RNase_H[pfam00075], RNase H; RNase H digests the RNA strand of an RNA/DNA hybrid.
Important enzyme in retroviral replication cycle.
RVT_1[pfam00078], Reverse transcriptase (RNA-dependent DNA polymerase)
PRK07238[PRK07238], bifunctional RNase H/acid phosphatase
PRK07708[PRK07708], hypothetical protein; Validated
>ATGCCATTCAGTTTGAGGAATGCGGGTGCAACGTACCAACGGTGCATGAACCACATGTTCGGCGAACACA
TTGGCCGAACGGTCGAGGCCTACGTCGATGACATCGTAGTCAAGACGAGGAAAGCCTCCGACCTCCTTTCC
GACCTTGAAGCGACATTCCGATGTCTCAAGGCGAAAGGCGTGAAGCTCAATCCCGAGAAATGTGTCTTCGG
GGTTCCACGAGGCATGCTCTTGGGGTTCATCGTCTCCGAGCGGGGCATCGAGGCCAACCCGGAGAAGATC
GCGGCCAACACCAGCATGGGGCCCATCAAGGACTTGAAAGGCGTACAGAGAGTCACAGGATGCCTTGCGG
CTCTGAGCCGTTTCATCTCGCGCCTCGGCGAAAGAGGCCTACCTCTGTACCGCCTCTTAAGGAAGGCCGAGT
GCTTCACTTGGACCCCTGAGGCCGAGGAAGCCCTCGGGAACCTGAAGGCGCTCCTCACGAACGCGCCCAT
CTTGGTGCCCCCCGCTGCCGGAGAAGCCCTCTTGATCTACGTCACCACGACCACTCAGGTGGTTAGCGCCG
CGATTGTGGTTGAGAGACGAGAAGAGGGGCATGCATTGCCCGTACAGAGGCCAGTCTACTTCATCAGTGAG
GTACTGTCCGAGACCAAGATCCGCTACCCACAAATTCAGAAGCTGCTGTACGCAGTGATCCTGACACGACGG
AAGTTGCGACACTACTTCAAGTCTCATCCGGTGACTGTGGTGTCATCCTTCCCCCTGGGGGAGATCATCCAG
TGCCGAGAGGCCTCGGCTAGAATTGCAAAGTGGGCGGTGGAAATCATGGGCGAGACGATCTCGTTCGCCC
CTCGGAAGGCCATCAAGTCCCAGGTCTTGGCGGACTTTGTGGCTGAATGGGTCGACACCCAGCTCCCAACA
GCTCCGATCCAACCGGAACTCTGGACCATGTTTTTCGACGGGTCACTGATGAAGACAGGAGCAGGCGCAG
GCCTGCTCTTGATCTCGCCCCTCAAGAAGCACCTACGCTACGTGCTACGCCTCCACTTCCCGGCGTCCAACAA
TGTGGCTAAGTACGAGGCTCTAGTCAACGGGTTGCGCATCGCCATCGAGCTGGGGgtctgacgcctcgacgctcgt
ggtgactcgcagCTCGTCATCGACCAAGTCATGAAGAACTCCCACTGCCACGACCCGAAGATGGAGGCCTACTG
CGATGAGGTTCGGCGCCTGGAAGACAAGTTCTACGGGCTCGAGCTCAACCACATCGCCCGACGCCACAAC
GAGACTGCGGACGAGCTGGCTAAAATAGCCTCGGGGCGAACAACGgttcccccagacgtcttctcccgagacctgcat
caaccctccgtcaagaccgacgacacgcccgagcccgagacaccctcggcttagtccgaggcaccctcggctcagtccgaggcgccatcgg
ctcggcccgaggcaccctcggctcaacccgaggcaccctcggcccccgagggtgaggcactgcgcatcgaggaggagcggagaggggtc
atgcctaatcgaaactggcagaccccgtacctgcaatatctccgccgaggagagctacccctcgaccaagccgaagcttggcggttggcgc
ggcgcgccaagtcgttcgtcttgctgggagacgagaaggagctctaccaccgcagcccctcgggcatcctccagcgatgcatttccatcgcc
gaaggccaggagctcctacaagagatacactcgggggcttgtggccatcacgcagcacctcgagcccttgttggaaacgccttccgacaag
gtttctactggccgacggcggtggccgacaccactagaattgtccgcacctgcgaagggtgtcagttctacacaaggcagacccacctaccc
gcttaggccctgcagaccatacccatcacctggtcatttgttgtgtggggtctggacctagttggccccttgcagAAGGCACCCGGGGG
CTACACGCATCTGTTGGTCGCCATCGACAAATTCTCCAAGTGGATCGAGGTCCGACCCCTAAACAGCATCAG
GTCCGAACAGGCGGTGGCGTTCTTCACCAACATCATCCATCGCTTTGGGGTCCCGAACTCCATCATCACCGA
CAACGGCACCCAGTTCACCGGCAGAAAGTTCCTGGACTTCTGCGAGGATCACCACATCTGGGTGGACTGG
GCCGCCGTGGCTCACCCCATGACGAATGGGCAAGTAGAGCGTGCCAACGGCATGATTCTACAAGGACTCAA
GCCTCGAATCTACAACGACCTCAACAAGTTCGGCAAGCGGTGGATGAAGGAACTCCCCTCGGTGGTCTGGA
GTCTGAGGACGACGCTGAGCCGGGCCACGGGCTTCACACCGTTCTTTCTAGTCTATGGGGCCGAGACCGTC
TTGCCCATAGACTTAGAATACGGTTCCCCGAGGACGAGGGCCTACGACGACCAAAGCAATCGAGCTAATCG
AGAAGACTCACCGGACCAGCTGGAAGAGGCTCGGGACATGGCCTTACTACACTCGGCGCGGTACCAGCAG
TCCTTGCGACGCTACCACGCCCGAGGGGTTCGGTCCCGAGACCTCCAGGTGGGCGACCTGGTGCTTCGGCT
GCGACAAGACGCCCGAGGGCGGCACAAGCTCATGCCTCCCTGGGAAGGGTCGTTCGTCATCGCCAAAGTT
CTGAAGCCTGGGACGTACAAGCTGGCCAACAGTCAAGGCGAGGTCTACAGCAACGCTTGGAACATCCGAC
AGCTACGTCGCTTCTACCCTTAA
Protein sequence:
MPFSLRNAGATYQRCMNHMFGEHIGRTVEAYVDDIVVKTRKASDLLSDLEATFRCLKAKGVKLNPEKCVFGVP
RGMLLGFIVSERGIEANPEKIAANTSMGPIKDLKGVQRVTGCLAALSRFISRLGERGLPLYRLLRKAECFTWTPEA
EEALGNLKALLTNAPILVPPAAGEALLIYVTTTTQVVSAAIVVERREEGHALPVQRPVYFISEVLSETKIRYPQIQKLL
YAVILTRRKLRHYFKSHPVTVVSSFPLGEIIQCREASARIAKWAVEIMGETISFAPRKAIKSQVLADFVAEWVDTQL
PTAPIQPELWTMFFDGSLMKTGAGAGLLLISPLKKHLRYVLRLHFPASNNVAKYEALVNGLRIAIELGLVIDQVM
KNSHCHDPKMEAYCDEVRRLEDKFYGLELNHIARRHNETADELAKIASGRTTKAPGGYTHLLVAIDKFSKWIEVR
PLNSIRSEQAVAFFTNIIHRFGVPNSIITDNGTQFTGRKFLDFCEDHHIWVDWAAVAHPMTNGQVERANGMIL
QGLKPRIYNDLNKFGKRWMKELPSVVWSLRTTLSRATGFTPFFLVYGAETVLPIDLEYGSPRTRAYDDQSNRAN
REDSPDQLEEARDMALLHSARYQQSLRRYHARGVRSRDLQVGDLVLRLRQDARGRHKLMPPWEGSFVIAKVL
KPGTYKLANSQGEVYSNAWNIRQLRRFYP
Segment 3: 82383 - 88664
7 exons:
1 CDSf 82383 - 83722 1338bp
2 CDSi 84124 - 84298 174bp
3 CDSi 84369 - 85018 684bp
4 CDSi 85130 - 85433 303bp
5 CDSi 85920 - 86500 579bp
6 CDSi 86862 - 87035 174bp
7 CDSl 87327 - 88664 1338bp
RT_LTR[cd01647], RT_LTR: Reverse transcriptases (RTs) from retrotransposons and retroviruses.
RNase_HI_archaeal_like[cd09279], RNAse HI family that includes Archaeal RNase HI;
rve[pfam00665], Integrase core domain;
DUF4370[pfam14290], Domain of unknown function (DUF4370);
RT_DIRS1[cd03714], RT_DIRS1: Reverse transcriptases (RTs) occurring in the DIRS1 group of
retrotransposons.
RVT_1[pfam00078], Reverse transcriptase (RNA-dependent DNA polymerase); A reverse
transcriptase gene is usually indicative of a mobile element such as a retrotransposon or
retrovirus.
PRK12829[PRK12829], short chain dehydrogenase; Provisional
PHA03307[PHA03307], transcriptional regulator ICP4; Provisional
>ATGGCGGCCGACAACCCGCCCGCCGGCGGCGGAATCGATGACGTCTTCCCCACGTGGCGGAAGAACGAC
ATTCGGGCTTGTCCCGTCCCCTCCCCCGTCGACGGAGGAGGAGGCGGGGCAACCAAGGCCAAGCAGGAG
GCGGCACCTCGTCGGCTATCGAGCGAGTCGACGGCGCCGGTGCCCCCAACGAGGGGCGCGATGGGCATCG
ACATCGCGTCTGAGACGAAGACGAGCGCCGTCTCCCCGCAACACGCCAACTCCAAGCAAACGGACGACGC
CAGCACGCTCGCAAAAGACTTGTTGGGCGTCACCCTCGTACCTGAGACGACGGTGCAGTCTACCCCTGACG
TGACTTCGTCACCGCCCGTCGACCAAGACGTACCGACCGATTCCCATCTCGCGCCTTTTGGATTCAGCCTCG
ACCCACCAAGCGACTTCGCTTTGGTGGACGCTTTCATAGAGGCGAGTCCAAACCCTCCGGGGTATCGTGTG
CGGTCACCCTGGGACCGGCTGACAGCCGTCTCGACCTACGGGCCCTCGGGTTCCGAGGAAGATGACGAGC
CCGACTTTTGTTGGGATTTCTCTGGACTTGGTAACCCCAGTGCCATGCGGGACTTCATGACCACATGCGACT
ACTGCCTTTCCGACTGTTCCGACGGTAGCCGCAGCCTCGGCGACGAGGACTATGGCCCAAGTCGTGAATGT
TTCCACGTCGACCTAGGGGGTCCCGGCGAAGGAAACCATCCTGGTATACCGGAAAATGGTGATCCCCCTAG
GCCTGCGCCTCGCGTTGACATCCTACGGGAGCTAGCTGTGGTCCCAGTCCCTGCGGGGGTCAGGACTCACA
GCTCGAGCAAATCTGCGAGATGCAGGCCAGGCTCGACGAGGGAGCAGGAACACTTGAGCCGTTCCGCCG
GGACATCGGGCAGGAATGGGCAGGCCAACCTCCGGCCGGAGAAGCGCGCCATCTACCCCAGGGCATCCAA
CACCGCATCGCCGACGATGTCAGGGCAAGGCCGCCACCGGCCTCCAGTGGGGTCGGCCAGAACCTGGCTG
CAGCGGCAATACTTCTCCGCGCGATGCCGGAGCCATCTACCACCGAGGGGCGGCGTATCCAGGGAGAGCTC
AAGAATCTCCTGGAGGATGTCGCGGTCCGACGGGCCGAAAGCTCCGCCTCCCGAAGGCAGGGGTACCCCT
CGGAACATCGCGCCGCGACTTCCCAATTCATGCGGAAAGCCTCGGTCCACACCGGGCGCACGCGCAACACA
GCGCCTGCGGCCCTGGGTCGCCTCGGCAACGAACACCCTCACCGCAACCGTCGAACCCACCTCGACGAGA
gggtgcgccgaggctaccaccccaggcgtgggggacgctacgacagcggggaggattggagtccctcgcccgaaccacccggtccgcag
gctttcagccgggccatacgacgggcgccgttcccgacccggttccgaaccccgactactatcacaaagtactcgggggagacgagaccgg
aactgtggctcgcggactaccggctagcctgccacctgggtggaacagacgatgacaatctcatcatccggaacctccccctgttcctctccg
acaccgctcgagcctggctggagcacctgcctccggggcagatctccaactaggacgacctggtccaagccttcgccggcaacttccagggt
acgtatgtgtgccctgggaactcctgggatctccaaaGCTGCCGCCAGCAGCCGGGGGAGTCTCTCTGGGACTACATCC
GGCAATTCTCGAAGCAGCGCACCGAGTTGCCCAATGTCACCGACTCGGATGTCATCGGCGCGTTCCTCGCC
GACACCACTTGCCGCGACCTGGTTAGCAAGCTGGGTCGCAAGACCCCCACCAGGGCGAGTGaggtgatggac
atcgccaccaagttcgcctctggctaGGATGCGGTTGAGGCCATCTTCCGGAAGGACAAGCAGCCCCAGGGCCGCC
CACCGGAAGATGTCCCCGAGGCGTCAACTCAGCGCGGCATCAAGAAGAAAGGCAAGAAGAAGTCGCAAG
CAAAACGCGACGCCGCCGATGCGAACTTTGTCGCCGCCGCCGAGTACAAGAACCCTCGGAAACCTCCTGG
AGGTGCCAATCTCTTCGACAAGATGCTCAAGGAGCCGTGCCCCTGTCATCAGGGGCCCGTCAAGCACACCC
TTGAGGAGTGCGCCATGCTTCGGCGCCACTTTCACAAAGCCGGGCCACCTGCGGAGGGTGGCCGGGCCCG
CGACGACGATAAGAAGGAGGATCACAAGGCAGGAGAGTTCCCCGAGGTCCACGACTGCTTCATGATCTAC
GGTGGGCAAGTGGCGAACGCCTCGGCTCGGCACCACAAGCAAGAGCGTCGGGAGGTCTGCTCGGTAAAG
GTGGCGGCGCCAGTCTACCTAGACTGGTCCGACAAGCCCATCACCTTCGACCAGGGCGACCACCCCGACCG
CGTGCCGAGCCTGGGGAAGTACCCGCTCGTTGTCGACCCCGTCATCGGCAACGTCAGGCTCACCAAGGTCC
TCATGGACGGAGGCAGCAGCCTCAACGTCATCTACGCCAAGACCCTCGGGCTCCTGCGGATCGATCTGTCCT
Cggtacgggcaggagctgcgccttttcacgggatcatccctgggaagcgcgtccagcccctcggacaactcgatctacccgtctgctttggg
acaccctccaacttctgaaagGAGACCCTCACGTTCGAGGTGGTCGGGTTTCGAGGAACCTACCACGCAGTGCTG
AGGAGGCCATGCTACGCCAAGTTCATGGTCGTCCCCAACTACACCTACCACAAGCTAAAGATGCCAGGCCCC
AACGGGGTCATCACCGTCGGCCCCACGTACCGACACGCGTACGAATGCGACGTGGAGTGCATGGAGTACGC
CGAGGCCCTCGCCAAATCCGAGGCCCTCATCGCCGACCTGGAGAGCCTCTCCAAGGAGGCGCCAGACGTG
AAGCGCCACACCAGCAACTTCGAGCCAACGGAGATGggtaagttcgtccctctcaacaccagcaacgatacctccaagctg
atccggatcgggctccgagctcgaccccaaataggaagcagtctcgtcgactttctccgtgcaaacaccgatgtttttgcatggaatccctcgg
acatgcccggcataccgagggatgtcgccgagcactcgctggatatccgagctagagcccgacccgtgaagcagcctctgcgccggttcga
cgaagaaaagcgcagagccataggcgaggagatccacaagctaatggcggtagggttcatcaaagaggtattccatcccgagtggcttgc
caaccctgtgcttgtgagaaagaaaggagggaaatggcgtatgtgtgtagactacactggtctaaacaaagcatgtccaaaagttccctacc
ctctgcctcgcatcgatcaaatcgtggattccactgctgggtgcgaaaccctgtctttcctcgatgcctactcagGGTATCGCCAAATCA
GGATGAAAGAGTCCGACCAGCTCGCGACTTCTTTCATCACACCTTTCGGCATGTACTGCTATGTTACCATGTC
GTTTGGTTTGAGGAATGCGGGTGCGACATACCAAAGGTGCATGAACCACGTGTTCGGCGAACACATTGGTC
GAACGGTCGAGGCTTACATCGATGACATCGTAGTCAAGACGAGGAAAGCCTCTGACCTCCTTTCCGACCTTG
AAACGACATTCTGGTGTCTCAAGGCGAAAGGTGTAAAGCTCAATCCCGAGAAGTGCGTCTTCGGGGTCCCC
CAAGGCTTGCTCTTGGGGTTTATCGTCTCCGAGCGGGGCATCGAGGCCAACCCAGAGAAAATCGTGGCCAT
CACCAACATGGGGCCCATCAAGGACTTGAAAGGCGTACAGAGGGTCACGGGGTGCCTTGCGGCTCTGAGC
CGTTTCATCTCACGCCTCGGCGAAAGAGGCCTGCCTCTGTACCGCCTCTTAAGGAAGGCCGAGTGCTTCACT
TGGACCCCTGAGGCCGAGGAAGCCCTCGGGAACCTGAAGGCGCTCCTCACGAACGCGCCCATCTtggtgccc
ccgcggccggagaagccctcttgatctacgtcgccgctaccactcaggtggtcagcgccgcgatcgtggttgagagacgagaagagggaca
tgcattgcctgtccagaggccagtctacttcgtcagtgaggtactgtccgagaccaagatccgctacccacaaattccgagtctcatccggtga
ctgtggtgtcatctttccccctgggggagatcatccagtgccgagaggcctcgggtaggattgcaaagtgggcggtggaaatcatgggcgag
acaatctcgttcgccactcgtaaggccataaagtcccaagtcttggcggactttgtggctgaatgggtcgatacccaGCTCCCGACAGC
TCCGATCCAACCGGAACTCTGGACCATGTTTTTTGACGGGTCGCTGATGAAGACAGGGGCAGGCGCGGGC
CTGCTCTTCATCTCGCCCCTCGGGAAGCACCTACGCTACGTGCTACGCCTCCACTTCCCGGCGTCCAACAATG
TGGCCGAGTACGAGGCTCTggtcaacgggttgcgcgtcgccatcgagctagggatccgacgtctcgacgctcgcggtgactcgtagc
tcgtcattgactaagtcatgaagaactcccacttctgcgactcgaagatggaagcctactgcgatgaggttcggcgcctggaggacaagttct
atgggctcgagttcaaccacatcgcccgacgctacaacgagactgcggacaagctggctaagatagcctcggggcaaacaacggttccccc
ggacgtcttctcctgagacctgcatcaaccctccgtcaagACCGACGACACGCCCGAGCCCGAGAAGGCCTCGGCCCAGC
CCGAGGCACCCTCGGCCCCCGAGGATGAGGCACTGCGTGTCGAGGAGGAGCGGAGCGGGGTCACGCCTA
ATCGAAACTGGCAGACCCCGAACCTGCAATATCTCCACCGAGGAGAGCTACCCCTCGACCGAGCCGAAGCT
CGGCGGTTGGCGCGGCGTGCCAAGTCGTTCGTCTTGCTGGGGGACGGGAAGGAGCTCTACCATCGCAGCC
CCTCAGGCATCCTCCAGCAATGCATATCCATCACCGAAGGCCAGGAGCTCTTACAAGAAATACACTCGGGGG
CTTGCGGGCATCACGCGGCGCCCCGAGCCCTTGTTGGGAACGCCTTCCGACAAGGTTTCTACTGGCCAACC
GCGGTGGCCGACGCCACTAGAATTGTTCGCACCTGCCAGGGGTGTCAATTCTACGCAAGGCAGACTCACCT
TCCCGCCCAGGCTCTACAGACCATACCCATCACCTGGTCGTTTGCTGTGTGGGGTCTGGACCTCGTCGGCAC
CTTGCAGAAGGCACCCGGGGGCTACACGCACCTGCTGGTCGCCATCGACAAATTCTCCAAGTGGATCGAGG
TCCGACCCCTAAACAGCATCAGGTCTGAACAGGCGGTGGCGTTCTTCACCAACATCATCCATCGCTTTGGGG
TCCCGAACTCCATCATCACCGACAACGACACCCAGTTCACCGACAGAAAGTTCCTGGACTTCTGCGAGGATC
ACCACATCCGGGTGGACTGGGCCGCCGTGGCTCACCCCATGACGAATGGGCAAGTAGAGCGTGCCAACGG
CATGATCCTGCAAGGACTCAAGCCGTGGATCTACAACAACCTTAACAAGTTCGGCAAGCGATGGATGAAGG
AGCTCCCCTCGGTGGTCTGGAGTCTGAGGACAACGCCGAGCCGAGCCACGGGCTTCACACCGTTCTTTCTA
GTCTATGGGGCCGAGGCCATCTTGCCCATAGACTTAGAATACGGTTCCCCAAGGACGAGGGCCTACAACGA
CCAAAGCAATCGAGCTAACCGAGAAGACTCACTGGACCAGCTGGAAGAGGCTCGGAACATGGCCTTCCTA
CACTCGGCGCGGTATCAGCAGTCCCTGCGACGCTACCACGCCCGAAGGGTTCGGTCCCGAGACCTCCAGGT
GGGCGACTTGGTGCTTCGGCTGCGACAAGACGCCCGAGGGCGGCACAAGCTCACGCCTCCCTGGGAAGG
GTCGTTCGTCATCGCCAAGGTTCTGAAGCCCGGGACGTATAAGCTGGCCAACAGTCAAGGCGAGGTCTACA
ACAACGCTTGGAACATCCGATAG
Protein sequence:
MAADNPPAGGGIDDVFPTWRKNDIRACPVPSPVDGGGGGATKAKQEAAPRRLSSESTAPVPPTRGAMGIDIA
SETKTSAVSPQHANSKQTDDASTLAKDLLGVTLVPETTVQSTPDVTSSPPVDQDVPTDSHLAPFGFSLDPPSDFA
LVDAFIEASPNPPGYRVRSPWDRLTAVSTYGPSGSEEDDEPDFCWDFSGLGNPSAMRDFMTTCDYCLSDCSDG
SRSLGDEDYGPSRECFHVDLGGPGEGNHPGIPENGDPPRPAPRVDILRELAVVPVPAGVRTHSSSKSARCRPGS
TREQEHLSRSAGTSGRNGQANLRPEKRAIYPRASNTASPTMSGQGRHRPPVGSARTWLQRQYFSARCRSHLP
PRGGVSRESSRISWRMSRSDGPKAPPPEGRGTPRNIAPRLPNSCGKPRSTPGARATQRLRPWVASATNTLTATV
EPTSTRGCRQQPGESLWDYIRQFSKQRTELPNVTDSDVIGAFLADTTCRDLVSKLGRKTPTRASEDAVEAIFRKD
KQPQGRPPEDVPEASTQRGIKKKGKKKSQAKRDAADANFVAAAEYKNPRKPPGGANLFDKMLKEPCPCHQG
PVKHTLEECAMLRRHFHKAGPPAEGGRARDDDKKEDHKAGEFPEVHDCFMIYGGQVANASARHHKQERREV
CSVKVAAPVYLDWSDKPITFDQGDHPDRVPSLGKYPLVVDPVIGNVRLTKVLMDGGSSLNVIYAKTLGLLRIDLS
SETLTFEVVGFRGTYHAVLRRPCYAKFMVVPNYTYHKLKMPGPNGVITVGPTYRHAYECDVECMEYAEALAKSE
ALIADLESLSKEAPDVKRHTSNFEPTEMGYRQIRMKESDQLATSFITPFGMYCYVTMSFGLRNAGATYQRCMN
HVFGEHIGRTVEAYIDDIVVKTRKASDLLSDLETTFWCLKAKGVKLNPEKCVFGVPQGLLLGFIVSERGIEANPEKI
VAITNMGPIKDLKGVQRVTGCLAALSRFISRLGERGLPLYRLLRKAECFTWTPEAEEALGNLKALLTNAPILLPTAPI
QPELWTMFFDGSLMKTGAGAGLLFISPLGKHLRYVLRLHFPASNNVAEYEALTDDTPEPEKASAQPEAPSAPED
EALRVEEERSGVTPNRNWQTPNLQYLHRGELPLDRAEARRLARRAKSFVLLGDGKELYHRSPSGILQQCISITEG
QELLQEIHSGACGHHAAPRALVGNAFRQGFYWPTAVADATRIVRTCQGCQFYARQTHLPAQALQTIPITWSFA
VWGLDLVGTLQKAPGGYTHLLVAIDKFSKWIEVRPLNSIRSEQAVAFFTNIIHRFGVPNSIITDNDTQFTDRKFLD
FCEDHHIRVDWAAVAHPMTNGQVERANGMILQGLKPWIYNNLNKFGKRWMKELPSVVWSLRTTPSRATGF
TPFFLVYGAEAILPIDLEYGSPRTRAYNDQSNRANREDSLDQLEEARNMAFLHSARYQQSLRRYHARRVRSRDL
QVGDLVLRLRQDARGRHKLTPPWEGSFVIAKVLKPGTYKLANSQGEVYNNAWNIR
Augustus gene prediction
Augustus predicted 13 genes. The predicted genes were then translated into peptides, and these
peptides were used as BLASTP queries against the Swiss-Prot database. Only 2 of them had
significant hits: one belongs to the reverse transcriptase (RT) superfamily, the other to
the RNase H superfamily.
Segment 1: 65858 --- 67411
CDS 65858 --- 67411 1553bp
RT_LTR[cd01647]: Reverse transcriptases (RTs) from retrotransposons and retroviruses which
have long terminal repeats (LTRs) in their DNA copies but not in their RNA template.
RT_Rtv[cd01645]: Reverse transcriptases (RTs) from retroviruses (Rtvs).
RT_ZFREV_like[cd03715]: A subfamily of reverse transcriptases (RTs) found in sequences similar
to the intact endogenous retrovirus ZFERV from zebrafish and to Moloney murine leukemia virus
RT.
>ATGCCCGGCATACCGAGGGATGTCGCCGAGCACTCGCTGGATATCCGAGCTGGAGCCCGACCCGTGAAGC
AGCCTTTGCGCCGATTCGACGAAGAAAAGCGCAGAGCCATAGGCGAGGAGATCCACAAGCTAATGGCGGC
AGGGTTCATCAAAGAGGTATTCCACCCCGAATGGCTTGCCAACCCTGTGCTTGTGAGAAAGAAAGGAGGG
AAATGGCGGATGTGTGTAGACTACACTGGTCTAAACAAAGCATGTCCGAAAGTTCCCTACCCTCTACCTCGCA
TCGATCAAATCGTGGATTCCACTGCTGGGTGCGAAACCCTATCTTTCCTTGATGCCTACTCGGGGTATCACCA
GATCAGGATGAAAGAGTCCGACCAGCTCGCGACTTCTTTCATCACACCCTTCGGCATGTACTGTTATGTTACC
ATGCCATTCAGTTTGAGGAATGCGGGTGCAACGTACCAACGGTGCATGAACCACATGTTCGGCGAACACATT
GGCCGAACGGTCGAGGCCTACGTCGATGACATCGTAGTCAAGACGAGGAAAGCCTCCGACCTCCTTTCCGA
CCTTGAAGCGACATTCCGATGTCTCAAGGCGAAAGGCGTGAAGCTCAATCCCGAGAAATGTGTCTTCGGGG
TTCCACGAGGCATGCTCTTGGGGTTCATCGTCTCCGAGCGGGGCATCGAGGCCAACCCGGAGAAGATCGC
GGCCAACACCAGCATGGGGCCCATCAAGGACTTGAAAGGCGTACAGAGAGTCACAGGATGCCTTGCGGCT
CTGAGCCGTTTCATCTCGCGCCTCGGCGAAAGAGGCCTACCTCTGTACCGCCTCTTAAGGAAGGCCGAGTG
CTTCACTTGGACCCCTGAGGCCGAGGAAGCCCTCGGGAACCTGAAGGCGCTCCTCACGAACGCGCCCATCT
TGGTGCCCCCCGCTGCCGGAGAAGCCCTCTTGATCTACGTCACCACGACCACTCAGGTGGTTAGCGCCGCG
ATTGTGGTTGAGAGACGAGAAGAGGGGCATGCATTGCCCGTACAGAGGCCAGTCTACTTCATCAGTGAGGT
ACTGTCCGAGACCAAGATCCGCTACCCACAAATTCAGAAGCTGCTGTACGCAGTGATCCTGACACGACGGA
AGTTGCGACACTACTTCAAGTCTCATCCGGTGACTGTGGTGTCATCCTTCCCCCTGGGGGAGATCATCCAGT
GCCGAGAGGCCTCGGCTAGAATTGCAAAGTGGGCGGTGGAAATCATGGGCGAGACGATCTCGTTCGCCCC
TCGGAAGGCCATCAAGTCCCAGGTCTTGGCGGACTTTGTGGCTGAATGGGTCGACACCCAGCTCCCAACAG
CTCCGATCCAACCGGAACTCTGGACCATGTTTTTCGACGGGTCACTGATGAAGACAGGAGCAGGCGCAGG
CCTGCTCTTGATCTCGCCCCTCAAGAAGCACCTACGCTACGTGCTACGCCTCCACTTCCCGGCGTCCAACAAT
GTGGCTAAGTACGAGGCTCTAGTCAACGGGTTGCGCATCGCCATCGAGCTGGGGGTCTGA
Protein sequence:
MPGIPRDVAEHSLDIRAGARPVKQPLRRFDEEKRRAIGEEIHKLMAAGFIKEVFHPEWLANPVLVRKKGGKWR
MCVDYTGLNKACPKVPYPLPRIDQIVDSTAGCETLSFLDAYSGYHQIRMKESDQLATSFITPFGMYCYVTMPFSL
RNAGATYQRCMNHMFGEHIGRTVEAYVDDIVVKTRKASDLLSDLEATFRCLKAKGVKLNPEKCVFGVPRGMLL
GFIVSERGIEANPEKIAANTSMGPIKDLKGVQRVTGCLAALSRFISRLGERGLPLYRLLRKAECFTWTPEAEEALG
NLKALLTNAPILVPPAAGEALLIYVTTTTQVVSAAIVVERREEGHALPVQRPVYFISEVLSETKIRYPQIQKLLYAVILT
RRKLRHYFKSHPVTVVSSFPLGEIIQCREASARIAKWAVEIMGETISFAPRKAIKSQVLADFVAEWVDTQLPTAPI
QPELWTMFFDGSLMKTGAGAGLLLISPLKKHLRYVLRLHFPASNNVAKYEALVNGLRIAIELG
Segment 2: 86898 --- 88664
2 exons:
1 CDS 86898 --- 87090 192bp
2 CDS 87304 --- 88664 1360bp
RNase_HI_archaeal_like[cd09279], RNAse HI family that includes Archaeal RNase HI
RVT_3[pfam13456], Reverse transcriptase-like; This domain is found in plants and appears to be
part of a retrotransposon.
RNase_H[cd06222], RNase H is an endonuclease that cleaves the RNA strand of an RNA/DNA
hybrid in a sequence non-specific manner
RnhA[COG0328], Ribonuclease HI [DNA replication, recombination, and repair]
PRK07238[PRK07238], bifunctional RNase H/acid phosphatase; Provisional
PRK07708[PRK07708], hypothetical protein; Validated
>ATGTTTTTTGACGGGTCGCTGATGAAGACAGGGGCAGGCGCGGGCCTGCTCTTCATCTCGCCCCTCGGGA
AGCACCTACGCTACGTGCTACGCCTCCACTTCCCGGCGTCCAACAATGTGGCCGAGTACGAGGCTCTGGTCA
ACGGGTTGCGCGTCGCCATCGAGCTAGGGATCCGACGTCTCGACGCTCGCggtgactcgtagctcgtcattgactaag
tcatgaagaactcccacttctgcgactcgaagatggaagcctactgcgatgaggttcggcgcctggaggacaagttctatgggctcgagttca
accacatcgcccgacgctacaacgagactgcggacaagctggctaagatagcctcggggcaaacaacggttcccccggacgtcttctcctg
agaCCTGCATCAACCCTCCGTCAAGACCGACGACACGCCCGAGCCCGAGAAGGCCTCGGCCCAGCCCGAGG
CACCCTCGGCCCCCGAGGATGAGGCACTGCGTGTCGAGGAGGAGCGGAGCGGGGTCACGCCTAATCGAA
ACTGGCAGACCCCGAACCTGCAATATCTCCACCGAGGAGAGCTACCCCTCGACCGAGCCGAAGCTCGGCGG
TTGGCGCGGCGTGCCAAGTCGTTCGTCTTGCTGGGGGACGGGAAGGAGCTCTACCATCGCAGCCCCTCAG
GCATCCTCCAGCAATGCATATCCATCACCGAAGGCCAGGAGCTCTTACAAGAAATACACTCGGGGGCTTGCG
GGCATCACGCGGCGCCCCGAGCCCTTGTTGGGAACGCCTTCCGACAAGGTTTCTACTGGCCAACCGCGGTG
GCCGACGCCACTAGAATTGTTCGCACCTGCCAGGGGTGTCAATTCTACGCAAGGCAGACTCACCTTCCCGCC
CAGGCTCTACAGACCATACCCATCACCTGGTCGTTTGCTGTGTGGGGTCTGGACCTCGTCGGCACCTTGCAG
AAGGCACCCGGGGGCTACACGCACCTGCTGGTCGCCATCGACAAATTCTCCAAGTGGATCGAGGTCCGACC
CCTAAACAGCATCAGGTCTGAACAGGCGGTGGCGTTCTTCACCAACATCATCCATCGCTTTGGGGTCCCGAA
CTCCATCATCACCGACAACGACACCCAGTTCACCGACAGAAAGTTCCTGGACTTCTGCGAGGATCACCACAT
CCGGGTGGACTGGGCCGCCGTGGCTCACCCCATGACGAATGGGCAAGTAGAGCGTGCCAACGGCATGATC
CTGCAAGGACTCAAGCCGTGGATCTACAACAACCTTAACAAGTTCGGCAAGCGATGGATGAAGGAGCTCCC
CTCGGTGGTCTGGAGTCTGAGGACAACGCCGAGCCGAGCCACGGGCTTCACACCGTTCTTTCTAGTCTATG
GGGCCGAGGCCATCTTGCCCATAGACTTAGAATACGGTTCCCCAAGGACGAGGGCCTACAACGACCAAAGC
AATCGAGCTAACCGAGAAGACTCACTGGACCAGCTGGAAGAGGCTCGGAACATGGCCTTCCTACACTCGGC
GCGGTATCAGCAGTCCCTGCGACGCTACCACGCCCGAAGGGTTCGGTCCCGAGACCTCCAGGTGGGCGAC
TTGGTGCTTCGGCTGCGACAAGACGCCCGAGGGCGGCACAAGCTCACGCCTCCCTGGGAAGGGTCGTTCG
TCATCGCCAAGGTTCTGAAGCCCGGGACGTATAAGCTGGCCAACAGTCAAGGCGAGGTCTACAACAACGCT
TGGAACATCCGATAG
Protein sequence:
MFFDGSLMKTGAGAGLLFISPLGKHLRYVLRLHFPASNNVAEYEALVNGLRVAIELGIRRLDARDLHQPSVKTDD
TPEPEKASAQPEAPSAPEDEALRVEEERSGVTPNRNWQTPNLQYLHRGELPLDRAEARRLARRAKSFVLLGDG
KELYHRSPSGILQQCISITEGQELLQEIHSGACGHHAAPRALVGNAFRQGFYWPTAVADATRIVRTCQGCQFYAR
QTHLPAQALQTIPITWSFAVWGLDLVGTLQKAPGGYTHLLVAIDKFSKWIEVRPLNSIRSEQAVAFFTNIIHRFGV
PNSIITDNDTQFTDRKFLDFCEDHHIRVDWAAVAHPMTNGQVERANGMILQGLKPWIYNNLNKFGKRWMK
ELPSVVWSLRTTPSRATGFTPFFLVYGAEAILPIDLEYGSPRTRAYNDQSNRANREDSLDQLEEARNMAFLHSAR
YQQSLRRYHARRVRSRDLQVGDLVLRLRQDARGRHKLTPPWEGSFVIAKVLKPGTYKLANSQGEVYNNAWNI
R
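The segments with significant hits from the two predictors can be compared by coordinate overlap. The following is a minimal sketch, assuming the segment coordinates quoted in this report share one coordinate system and are inclusive; the dictionary and function names are hypothetical.

```python
def overlap(a, b):
    """Number of positions shared by two inclusive intervals (0 if disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


# Segment spans with significant hits, as listed in this report.
fgenesh = {
    "FGENESH Segment 1": (62665, 64698),
    "FGENESH Segment 2": (66287, 69085),
    "FGENESH Segment 3": (82383, 88664),
}
augustus = {
    "Augustus Segment 1": (65858, 67411),
    "Augustus Segment 2": (86898, 88664),
}


def shared_regions(first, second):
    """Return {(name_a, name_b): overlap_bp} for every overlapping pair."""
    pairs = {}
    for name_a, span_a in first.items():
        for name_b, span_b in second.items():
            bp = overlap(span_a, span_b)
            if bp > 0:
                pairs[(name_a, name_b)] = bp
    return pairs


if __name__ == "__main__":
    for (name_a, name_b), bp in shared_regions(fgenesh, augustus).items():
        print(f"{name_a} overlaps {name_b} by {bp} bp")
```

Overlap by coordinates only shows that two predictions cover the same region; it does not show that the predicted exon structures agree.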