Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Sequence Analysis with Artemis and Artemis Comparison Tool (ACT) Carribean Bioinformatics Workshop 18th-29th January , 2010 atcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt tttaattaattcacattttatatctttaagtataatatcatttaacattatgttatcttcctcagtgtttttcattattatttgcatgtacagtttatca tttttatgtaccaaactatatcttatattaaatggatctctacttataaagttaaaatctttttttaattttttcttttcacttccaattttatattccg cagtacatcgaattctaaaaaaaaaaataaataatatataatatataataaataatatataataaataatatataatatataataaataatatataatat ataatatataataaataatatataatatataatatataataaataatatataataaataatatataatatataatatataatactttggaaagattattt atatgaatatatacacctttaataggatacacacatcatatttatatatatacatataaatattccataaatatttatacaacctcaaataaaataaaca tacatatatatatataaatatatacatatatgtatcattacgtaaaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggtattagg agatatatttactgattcctcatttttataaatgttaaaattattatccctagtccaaatatccacatttattaaattcacttgaatattgttttttaaa ttgctagatatattaatttgagatttaaaattctgacctatataaacctttcgagaatttataggtagacttaaacttatttcatttgataaactaatat tatcatttatgtccttatcaaaatttattttctccatttcagttattttaaacatattccaaatattgttattaaacaagggcggacttaaacgaagtaa ttcaatcttaactccctccttcacttcactcattttatatattccttaatttttactatgtttattaaattaacatatatataaacaaatatgtcactaa taatatatatatatatatatatatatatatatattataaatgttttactctattttcacatcttgtccttttttttttaaaaatcccaattcttattcat taaataataatgtattttttttttttttttttttttttattaattattatgttactgttttattatatacactcttaatcatatatatatatttatatat atatatatatatatatatatattattcccttttcatgttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatatttttataacatatgt attattaaaatgtatatataaaaatatatattccatttattattatttttttatatacattgttataagagtatcttctcccttctggtttatattacta ccatttcactttgaacttttcataaaaattaatagaatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaata tatatatatatatatatacatataatatatatttcatctaatcatttaaaattattattatatattttttaaaaaatatatttatgataacataaaaaga atttaattttaattaaatatatataattacatacatctaatattattatatatatataataagttttccaaatagaatacttatatattatatatatata tatatatatatatattcttccataaaaagaataaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacattgaatatatagttgtattt ataaaattaaagaaaaagcataaagttaccatttaatagtggagattagtaacattttcttcattatcaaaaatatttatttcctaattttttttttttg taaaatatatttaaaaatgtaatagattatgtattaaataatataaatatagcaaaatgttcaattttagaaatttgcctctttttgacaaggataattc aaaagatacaggtaaaaaaaaaaaaataaagtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatgacatgttataatataatataa taaataaaaattatgtaatatatcataatcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaatacatatataactaacattcata tctttatttttgtagatgatataaaaaattttataaactcttatgaagggatatatttttcatcatccaataaatttataaatgtatttctagacaaaat tctgatcattgatccgtcttccttaaatgttattacaataaatacagatctgtatgtagttgatttcctttttaatgagaaaaataagaatcttattgtt ttagggtaatgaaatatatatagatttatatttttatttatttattatatattattttttaatttttcttttatatatttattttatttagtgtataaaa tgatatcctttatatttatatttacatgggatattcaaataataacaaaaatgagtatacacatatatatatatatatatatatatatgtatattttttt tttttttttatgttcctataggaaagggaagaattcactgatttgtagtgtttacaatattagggaatgcaactttacacttttgaaaaaaattcagtta agcaaaaatattaataacattaaaaagacactgatagcaaaatgtaatgaatatataataacattagaaaataagaaaattactttttatttcttaaata aagattatagtataaatcaaagtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtcaaaaaatcatatcttgttagtaataaaaaa ttcatatgtatatatataccaattagatattaaaaattcccatattagttatacacttattgatagtttcaatttaaatttatcctacctcagagaatct ataaataataaaaaaaagcatataaataaaataaatgatgtatcaaataatgacccaaaaaaggataataatgaaaaaaatacttcatctaataatataa atcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt tttaattaattcacattttatatctttaagtataatatcatttaacattatgttatcttcctcagtgtttttcattattatttgcatgtacagtttatca tttttatgtaccaaactatatcttatattaaatggatctctacttataaagttaaaatctttttttaattttttcttttcacttccaattttatattccg cagtacatcgaattctaaaaaaaaaaataaataatatataatatataataaataatatataataaataatatataatatataataaataatatataatat ataatatataataaataatatataatatataatatataataaataatatataataaataatatataatatataatatataatactttggaaagattattt atatgaatatatacacctttaataggatacacacatcatatttatatatatacatataaatattccataaatatttatacaacctcaaataaaataaaca tacatatatatatataaatatatacatatatgtatcattacgtaaaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggtattagg agatatatttactgattcctcatttttataaatgttaaaattattatccctagtccaaatatccacatttattaaattcacttgaatattgttttttaaa ttgctagatatattaatttgagatttaaaattctgacctatataaacctttcgagaatttataggtagacttaaacttatttcatttgataaactaatat tatcatttatgtccttatcaaaatttattttctccatttcagttattttaaacatattccaaatattgttattaaacaagggcggacttaaacgaagtaa ttcaatcttaactccctccttcacttcactcattttatatattccttaatttttactatgtttattaaattaacatatatataaacaaatatgtcactaa taatatatatatatatatatatatatatatatattataaatgttttactctattttcacatcttgtccttttttttttaaaaatcccaattcttattcat taaataataatgtattttttttttttttttttttttttattaattattatgttactgttttattatatacactcttaatcatatatatatatttatatat atatatatatatatatatatattattcccttttcatgttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatatttttataacagatgt attattaaaatgtatatataaaaatatatattccatttattattatttttttatatacattgttataagagtatcttctcccttctggtttatattacta Extracting information & ccatttcactttgaacttttcataaaaattaatagaatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaata tatatatatatatatatacatataatatatatttcatctaatcatttaaaattattattatatattttttaaaaaatatatttatgataacataaaaaga interpreting atttaattttaattaaatatatataattacatacatctaatattattatatatatataataagttttccaaatagaatacttatatattatatatatata tatatatatatatattcttccataaaaagaataaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacattgaatatatagttgtattt ataaaattaaagaaaaagcataaagttaccatttaatagtggagattagtaagtttttcttcattatcaaaaatatttatttcctaattttttttttttg What´s there taaaatatatttaaaaatgtaatagattatgtattaaataatataaatatagcaaaatgttcaattttagaaatttgcctctttttgacaaggataattc where are the genes aaaagatacaggtaaaaaaaaaaaaataaagtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatgacatgttataatataatataa taaataaaaattatgtaatatatcataatcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaatacatatataactaacattcata which genes tctttatttttgtagatgatataaaaaattttataaactcttatgaagggatatatttttcatcatccaataaatttataaatgtatttctagacaaaat how to find them? tctgatcattgatccgtcttccttaggtgttattacaataaatacagatctgtatgtagttgatttcctttttaatgagaaaaataagaatcttattgtt ttagggtaatgaaatatatatagatttatatttttatttatttattatatattattttttaatttttcttttatatatttattttatttagtgtataaaa tgatatcctttatatttatatttacatgggatattcaaataataacaaaaatgagtatacacatatatatatatatatatatatatatgtatattttttt SEQUENCE ANNOTATION tttttttttatgttcctataggaaagggaagaattcactgatttgtagtgtttacaatattagggaatgcaactttacacttttgaaaaaaattcagtta agcaaaaatattaataacattaaaaagacactgatagcaaaatgtaatgaatatataataacattagaaaataagaaaattactttttatttcttaaata aagattatagtataaatcaaagtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtcaaaaaatcatatcttgttagtaataaaaaa ttcatatgtatatatataccaattagatattaaaaattcccatattagttatacacttattgatagtttcaatttaaatttatcctacctcagagaatct ataaataataaaaaaaagcatataaataaaataaatgatgtatcaaataatgacccaaaaaaggataataatgaaaaaaatacttcatctaataatataa Sequencing is just the beginning of the process Strategies for sequence annotation Predictive methods Interpretation of the DNA sequence into genes according to rules Comparative methods Experimental methods Strategies for sequence annotation Predictive methods Interpretation of the DNA sequence into genes according to rules Comparative methods Interpretation of the DNA sequence into genes according to similarities with other sequences Experimental methods Strategies for sequence annotation Predictive methods Interpretation of the DNA sequence into genes according to rules Comparative methods Interpretation of the DNA sequence into genes according to similarities with other sequences Experimental Interpretation of the DNA sequence into genes methods according to experimental results (e.g. cDNA) EST Blast Hit Gene prediction programs: ORFs and CDSs ORFs are not equivalent to CDSs Not all open reading frames are coding sequences Gene prediction Orpheus PHAT GeneMark Glimmer Gene finder Gene finding programs • Genefinding software packages use Hidden Markov Models. • Predict coding, intergenic and intron sequences • Need to be trained on a specific organism. • Never perfect! Gene prediction programs: Problems • ORFs are not equivalent to CDSs • Gene prediction programs find new genes that share properties with a given set of genes. • They can be confounded by: – – – – – Sequence constraints (ribosomal proteins etc.) Sequence biases Different sets of genes Horizontal gene transfer Non-coding DNA Gene prediction programs: Problems Different gene training sets: Plasmodium falciparum Original annotation Updated annotation Gene prediction programs: Problems Non-protein coding regions: S. typhi ribosomal RNA genes final genefinder orpheus glimmer glimmer orpheus genefinder final Gene prediction programs: Problems Non-protein coding regions: N. meningitidis DNA repeats final orpheus glimmer glimmer orpheus final Gene prediction programs: Problems Pseudogenes M. leprae Gene prediction programs: Problems Pseudogenes: M. leprae Glimmer Gene prediction programs: Problems Pseudogenes: M. leprae ORPHEUS Gene prediction programs: Problems Pseudogenes: M. leprae WUBLASTX vs. M. tuberculosis Gene prediction programs: Problems Pseudogenes: M. leprae Final annotation The Gene Prediction Process ESTs ANNALYSIS SOFTWARE DNA SEQUENCE FASTA BlastX Gene finders Codon Usage AT content Annotator Usefull CDS Prediction Eukaryotic gene 5’UTR Exon I intron ATG GT AG stop Exon III 3’UTR Exon II GT AG CAP AAAAAAAAAA CAP AAAAAAAAAA mRNA TTTTTTTTT cDNA TTTTTTTTT EST EST AT content • Coding regions have higher GC content in AT rich genomes AT content CODON USAGE • Codon bias is different for each organism. • DNA content in coding regions is restricted – but it is not restricted in non coding regions. • The codon usage for any particular gene can influence expression. Codon usage • All organisms have a preferred set of codons. Malaria GUU GUC GUA GUG 0.41 0.06 0.42 0.11 Trypanosoma GUU GUC GUA GUG 0.28 0.19 0.14 0.39 Codon Usage • http://www.kazusa.or.jp/codon/ Codon Usage in Artemis Forward frames Reverse frames Codon usage & gene finding in : Leishmania GC frame plot • Plots the third position GC content of each frame of a DNA sequence. • In coding DNA the GC content of the 3rd base is often higher. • Good prediction of coding in malaria and trypanosomes. GC frame plot of tubulin gene cluster on T. brucei Chr 1 Homology Data • Coding regions are more conserved than non coding regions due to selective pressure. • Comparing all possible translations against all known proteins will give clues to known genes. • Blastx Gene finding: using ACT P. yoelii P. falciparum P. knowlesi TBLASTX comparisons Gene finding by RNA-Seq (Transcriptional landscape of Neospora caninum Tachyzoites Day 3 Tachyzoites (RNAseq) Day 4 Tachyzoites (RNAseq) Transcriptome sequencing in Neospora (RNAseq is useful for predicting/confirming UTR boundaries) Day 3 Tachyzoites (RNAseq) Day 4 Tachyzoites (RNAseq) N. caninum Chr08 TBLASTX matches visualised in ACT T. gondii Chr08 5’ UTR 3’ UTR RNA-Seq: correcting gene models Before %GC __16hr, __32hr, __48hr After %GC