Genome Annotation: From Sequence to Biology
Ashley Bateman & Andrew Tritt
Genetics 677, Prof. Ahna Skop, Spring 2009

Introduction
-Over 450 organisms have been completely sequenced since 1995, and many more have working drafts
-361 prokaryotes, 28 archaea, 20 protists, 8 plants, 15 fungi, 26 mammals, and 21 “other” (Wikipedia: List of Sequenced Organisms)

Genome Sequencing
-454
-Sanger
-Solexa

454 Sequencing: Sequencing by synthesis
-Reads ~200 bp
1-Fix DNA strands to beads in a water-in-oil emulsion
2-DNA is amplified by PCR
3-Use the PPi released when a base is incorporated to determine the identity of the added base
http://www.nature.com/nrmicro/journal/vaop/ncurrent/images/nrmicro1901-f3.jpg

High Throughput Sanger Sequencing
-~900 bp reads
-DNA of interest is inserted into a plasmid and sequenced using primers for the plasmid

Solexa Sequencing
-~26-50 bp reads
-Newest sequencing technology --> cheaper and faster
-Small reads present problems when dealing with repetitive sequence
http://seqanswers.com/forums/showthread.php?t=21

Genome Annotation
The process of taking the DNA sequence produced by genome-sequencing projects and adding layers of analysis/interpretation to understand its biological significance in a larger context

Genome Annotation: A multistep process
3 general levels of annotation:
-1 Nucleotide-level (where)
-2 Protein-level (what)
-3 Process-level (how)
Stein, 2001.

Nucleotide-level Annotation: Mapping
-“…identify the punctuation marks…”
-Identification and placement of known landmarks into the genome (genes, genetic markers, etc.)
-Connects the pre-genomic literature with post-genomic research

Nucleotide-level Annotation: Finding Genomic Landmarks
-Short sequences: PCR-based genetic markers (ID with e-PCR program)
-Long sequences: RFLPs (ID with BLASTN, etc.)

Nucleotide-level Annotation: Gene Finding
Prokaryotes: ID ORFs
Eukaryotes: Sophisticated software needed (gene prediction)
-overlapping ORFs
-signal-to-noise ratio
-splicing
-unclear exon/intron delineations

Gene Prediction Software
-use algorithms that contain sensors to identify specific sequence features
  -neural networks
  -rule-based system
  -hidden Markov model
-sequence similarity to known CDS
  -BLAST
  -cDNA
  -ESTs
Ab initio: gene prediction without use of prior knowledge about similarities to other genes

Hidden Markov Models
-a set of states with transition and emission probabilities
-genes in a sequence predicted by finding most probable path
[Figure: two-state HMM with EXON and INTRON states; transition probabilities labeled 0.85, 1.0, 0.05, 0.10, 0.05, 0.95]
EXON emissions: A: 0.2 C: 0.3 G: 0.3 T: 0.2
INTRON emissions: A: 0.25 C: 0.25 G: 0.25 T: 0.25
Example:
DNA Sequence:       AGTTCGAATCGATGCTAAGACGA
Possible Path:      EEEEIIIIIIIIIIIIIIEEEEE
Most probable path: EEEIIIIIIIIIIIIIIIIIEEE

Sequence Similarity
-currently, most powerful tool for detecting CDS
-Problems exist:
  -Fragmentary ESTs
  -Repetitive cDNA sequences
  -Ortholog-paralog problem
  -Incomplete data
ab initio predictions + similarity data = more powerful model

Nucleotide-level Annotation: non-coding RNAs and regulatory regions
-include tRNAs, rRNAs, snRNAs, nRNAs
-transcription factor binding sites
-largely unknown; active area of bioinformatics research

Nucleotide-level Annotation: non-coding RNAs and regulatory regions
[Figure: sequences with red and blue boxes marking motifs]
-red and blue boxes represent unknown positions of motifs
-Gibbs Motif Sampler (1) and MEME infer models for motifs and identify motif locations within sequences
(1) Lawrence et al. 1993, Thompson et al. 2007

Nucleotide-level Annotation: Repetitive Elements & Segmental Duplications
Repetitive Elements:
-account for a large proportion of genome size variation
-important to (generally) exclude these from later assembly process
-problematic for next-gen sequencing technologies
Segmental Duplications:
-paralogs exist throughout many genomes

Nucleotide-level Annotation: Mapping Variation
-SNPs are important for population genetics and association mapping
AAGTCGATGCTAGCGCTACTAGCTAGGCTCGATGTT
AAGTCGATGCTAGCGCTACTAGCTAGGCTAGATGTT
AAGTCGATGCTAGCCCTACTAGCTAGGCTCGATGTT
AAGTCGATGCTAGCGCTACTAGCTAGGCTAGATGTT
AAGTCGATGCTAGCCCTACTAGCTAGGCTTGATGTT
AAGTCGATGCTAGCGCTACTAGCTAGGCTCGATGTT
SNPs (the variable columns in the alignment above)

Protein-level Annotation
-Assign putative functions to proteins of an organism
-Classify proteins into families:
  -using similarities to better-characterized proteins of other species (BLASTP)
  -on the basis of functional domains, motifs, and folds
-Search against protein databases of functional domains (e.g. PFAM)
-InterPro: integration of several protein databases
  -makes things much easier!

Process-level Annotation
-linking the genome to biological processes
-bench work required (e.g. microarrays, RNAi, etc.)
-classification scheme required: Gene Ontology (GO)
  -standardized vocabulary for molecular function, biological process, and cellular component
  -hierarchy of terms provides flexibility for new additions

Process-level Annotation
-hierarchical structure of GO terminology
[Figure: GO term hierarchy]

Organizing Annotation Efforts
Several models:
-factory
-museum
-cottage industry
-party
Bioinformatics research in biomedical text mining to automate annotation process

Conclusion
A synthesis of biology and annotation must be developed…
…change is constant, databases are updated sometimes hourly…
…the experimental literature of the past must be tied with the genome annotations of the future!
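
The exon/intron model on the Hidden Markov Models slide can be sketched in a few lines of Python. This is only a minimal illustration of finding the most probable state path with the Viterbi algorithm (the standard dynamic-programming method for that task), not the method of any particular gene finder. The emission probabilities are the ones on the slide. The assignment of the slide's transition labels (1.0, 0.85, 0.10, 0.05, 0.95, 0.05) to specific arrows is an assumption, and all names (viterbi, START, TRANS, END) are made up for this sketch.

```python
import math

STATES = ("E", "I")  # E = exon, I = intron

# Emission probabilities as given on the slide.
EMIT = {
    "E": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "I": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

# ASSUMPTION: the slide's numbers are mapped to arrows as follows.
START = {"E": 1.0, "I": 0.0}          # a sequence always starts in an exon
TRANS = {
    "E": {"E": 0.85, "I": 0.10},      # exon -> exon, exon -> intron
    "I": {"E": 0.05, "I": 0.95},      # intron -> exon, intron -> intron
}
END = {"E": 0.05, "I": 0.0}           # a sequence can only end in an exon


def log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return (most probable state path, its log-probability) for seq."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in s at this position.
            prev = max(STATES, key=lambda p: score[p] + log(TRANS[p][s]))
            new_score[s] = score[prev] + log(TRANS[prev][s]) + log(EMIT[s][base])
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Pick the best final state, including the assumed end transition.
    last = max(STATES, key=lambda s: score[s] + log(END[s]))
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), score[last] + log(END[last])


path, logp = viterbi("AGTTCGAATCGATGCTAAGACGA")
print(path)
print(logp)
```

The printed path depends on the assumed transition assignment, so it need not match the path shown on the slide. Working in log space avoids numerical underflow on long sequences, which matters once the input is a real genomic region instead of 23 bases.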
Student Question
“The paper was mostly about predicting the number of genes and proteins in an organism. Why do we need to predict the number of genes and proteins in the cell? It appears that most studies identify genes based on phenotypes. For proteins, many methodologies exist for identifying protein function. I cannot see the purpose of this prediction--pardon my short sightedness. Also, has a standardized format emerged in regard to the genome files?”

NCBI standardized format example
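
On the question of a standardized format: assuming the NCBI example refers to the GenBank flat-file format, an annotated record stores each landmark as a feature, with a key such as gene or CDS, a location, and /qualifier="value" lines. The sketch below reads the FEATURES section of a hand-written toy record. The record contents (TOY0001, toyA) and the function name parse_features are hypothetical, and the parser ignores parts of the real format such as join(...) and complement(...) locations and qualifier values that wrap across lines.

```python
# Hand-written toy record in GenBank flat-file style (hypothetical content).
TOY_RECORD = """\
LOCUS       TOY0001       60 bp    DNA
FEATURES             Location/Qualifiers
     source          1..60
                     /organism="hypothetical organism"
     gene            5..40
                     /gene="toyA"
     CDS             5..40
                     /gene="toyA"
                     /product="hypothetical protein"
ORIGIN
//
"""


def parse_features(record):
    """Collect feature keys, locations, and qualifiers from a record string."""
    features = []
    in_features = False
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line.startswith(" "):
            break  # the next top-level section (e.g. ORIGIN) ends the table
        text = line.strip()
        if not text:
            continue
        if text.startswith("/"):
            # Qualifier line: attach it to the most recent feature.
            name, _, value = text[1:].partition("=")
            features[-1]["qualifiers"][name] = value.strip('"')
        else:
            # Feature line: a key followed by its location.
            key, location = text.split(None, 1)
            features.append({"key": key, "location": location, "qualifiers": {}})
    return features


for feature in parse_features(TOY_RECORD):
    print(feature["key"], feature["location"], feature["qualifiers"])
```

The sketch splits on whitespace for simplicity; real GenBank records place keys and locations in fixed columns, and an established parser is the safer choice for production data.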