Gene Prediction
Benjamin King
Mount Desert Island Biological Laboratory

TAGTCAGACAGAAAGGCAGGCACAAAGTACGGTAGAGTCTTCTAGCACTA
AAATCCTATTTGACCTTCTCCTGGGCCTTTTCTTCTAACACAGCCACACT
ACCTTATATAATTCTTGTTGTAAGCAGAAAGTTGGCATGCCATCCAAACA
AAACAACTTCCTTCCAGAGGACAGGTCCATGAGAACTTTCCCACAGATAC
CCATTCACATACATTCAATGTCCTGGACAGGGCTCCTCCTCAGTCTGCCA
CGCAAGAAGAACACACAGGACACAGGGCATACTCTATTTGATTCAACTAG
TGCGTTCCACGGACACTTTCTAACACAGTAGCTCTGGACCTAGAACGCGG
CATCCAGCAGTACACTCTGCTAGATGAAGGGGGAGAAAAGGCATTTTGAA
TACATTCTCTAAAAATCCTGACAGCAAGGCTACAGGTATATCGAAGTATA
ATGGAACAGTCACGAGGCCCCGGGTTTGATCCTCAGTGTGGCTAAGCAAT
GAATCCACATAGCAACTCGGGAATAATTATTTTAGCTTATTATTTTAAAA
CGCCAGCGACTTTATTTTCTTCGCCCAAGCTCAAATTAATTAAAGGTTAT
AAATGGTCACTTCTCCGTAGAAGCCAGAACTCTCCCCCTCTTCAGAGCAG
GGGAATACCTCATAAATAAATTAGGCGAAACCATGGCTTGCTGATTGAAT
GAATGATAATCCACAGTCCATGTGGTTGCCAAGTCTTTCTCTAGACCTCT
CTACCGCAATGAGCAATCCCTGAACGTCAACGAAGAGGCTTACTTCATCA
GTTATCTGGAAGTCTGCGAGTCGTGAAGACAGCCCACAGAAATACTAGCT
TCTCCACTCAGCCTCGATTCACCGGAAGGACCATGAAAAGGAACAGCACC
AGTGAATCTGATGCGGCTCCCTTCCAACTCACTGCAGCTCAGTCAGCCTG

• Identify genes from genomic (DNA) sequence
• Elucidate gene structure ‒ exons, introns, promoter
• Use gene structure to predict transcripts and polypeptide (protein) sequences

Outline
• Prokaryotic vs. eukaryotic genes
• Genome analysis pipelines
• Review resources that represent genes
• Gene prediction programs

Genes
• Prokaryotic genes
• Eukaryotic genes

Prokaryotic Genes
• Small genomes, high gene density
  ‒ Haemophilus influenzae genome is 85% genic
• Operons
  ‒ One transcript, many genes
• No introns
  ‒ One gene, one protein
• Open reading frames
  ‒ One ORF per gene
  ‒ ORFs begin with a start codon and end with a stop codon

Eukaryotic Genes
• Much lower gene density in genome
  ‒ Gene-rich regions
  ‒ Gene-poor regions
• Gene desert: a stretch of 500 kb with no known, novel, or partial genes
• Transcripts undergo several post-transcriptional modifications
  ‒ 5' cap
  ‒ Poly(A) tail
  ‒ Splicing

How are genes predicted?
1. Transcript-based alignments
  ‒ RefSeq RNAs and ESTs are aligned to produce gene models
2. ab initio (de novo)
  ‒ Programs infer gene models
  ‒ Use features in the sequence and protein alignments
3. Hand curation
  ‒ Consolidation, pruning
  ‒ Non-automated (curated) annotation always prevails

The highest quality annotation is manual
Consensus CDS (CCDS) protein set
• Collaboration between EBI, NCBI, WTSI and UCSC
• Mouse and human genomes
• Manual curation is primarily conducted by
  ‒ Havana (Human and Vertebrate Analysis and Annotation) at Sanger
  ‒ RefSeq annotation group at NCBI
VEGA (Vertebrate Genome Annotation)
• Has its own browser
• Is linked to the Ensembl browser
• Manual annotation by Baylor College of Medicine, Broad Institute, DOE Joint Genome Institute, Genoscope, Havana at Sanger, and Washington University Genome Center

"De novo" annotation is more dubious
• NCBI's ab initio pipeline: the GenomeScan program
• Genscan: based on transcriptional, translational, and donor/acceptor splicing signals, as well as the length and compositional distributions of exons, introns, and intergenic regions
• Exoniphy: based on exon structure and exon evolution (relies on multispecies alignment)
• ACEScan: Alternative Conserved Exons (human-mouse conservation). Identifies exons that are present in some transcripts but skipped by alternative splicing in other transcripts, in both human and mouse

Gene Prediction Procedure
• Obtain genomic sequence
• Ensure vector sequences are removed (search NCBI's UniVec vector database, e.g., with VecScreen)
• Mask high-complexity repeats
  ‒ RepeatMasker: do not have it mask low-complexity repeats
  ‒ Uses REPBASE, a database of repeat sequences (SINEs, LINEs, etc.)
• Run a gene prediction program to predict exons and ORFs, e.g., GenomeScan
• Look for transcripts to verify exons
  ‒ Align full-length cDNAs and ESTs
  ‒ These can be from the same species and from similar species
• Look for protein sequences to verify ORFs (open reading frames)
  ‒ Align proteins
  ‒ Also from the same species and from similar species

Analysis Pipelines
1. Genomic sequence
2. Remove vector sequences
3. Mask high-complexity repeats (RepeatMasker)
4. Run gene prediction program(s) (e.g., GenomeScan)
5. Align evidence
  ‒ Align all full-length cDNAs (RefSeq, MGC)
  ‒ Align all ESTs
  ‒ Align all protein sequences (SWISS-PROT/TrEMBL)
  ‒ Align genomic sequences from similar species; identify conserved sequences
6. Compile results

Assembly and Annotation
[Figure: assembly and annotation example from NCBI]

Genome Browsers
• UCSC: http://genome.ucsc.edu
  ‒ University of California, Santa Cruz
  ‒ Annotates other gene builds
• Ensembl: http://www.ensembl.org
  ‒ EBI and Sanger collaboration
  ‒ Gene build, predicts novel genes
  ‒ Pay attention to gene nomenclature
• NCBI: http://www.ncbi.nlm.nih.gov/mapview/
  ‒ NCBI Map Viewer
  ‒ Gene build, predicts novel genes
• Build your own genome browser with GBrowse: http://www.gmod.org/ggb/

Genes Classified By Evidence
• Known genes, as catalogued by the reference sequence project
  ‒ Ensembl known genes (red genes)
  ‒ NCBI known genes
• Novel genes (1): based on similarity to known genes or cDNAs; the supporting evidence need not match 100%
  ‒ Ensembl novel genes
  ‒ NCBI LOC genes
• Novel genes (2): based on the presence of ESTs, which are also a resource for alternative splicing
  ‒ EST genes in Ensembl
  ‒ Database of Transcribed Sequences (DoTS)
  ‒ Acembly
• Gene prediction
  ‒ Single organism: Genscan
  ‒ Comparative information: Twinscan
• Pseudogenes: match a known gene but have a disrupted ORF

Genes Classified By Evidence ‒ Microbial Genomes
• Classified function
• Conserved, unknown function
• Species-specific, unknown function
• Strain-specific
• Hypothetical protein
• Other
TIGR nomenclature rules

Supporting Evidence For Genes
• Known gene (Nfkb1): lots of evidence
• Example of a novel gene
• mRNA is reverse transcribed into cDNA, which is sequenced either as an Expressed Sequence Tag (EST) or as a full-length cDNA sequence

Gene Prediction Programs

Gene Prediction Methods
• Compositional methods
  ‒ Scan for features in the sequence using consensus sequences
  ‒ ab initio methods
  ‒ Only 50% accurate (1996)
• Comparative methods
  ‒ Compare sequence to cDNA sequence databases
  ‒ Compare sequence to EST sequence databases
• Have to use both methods

Predominant Gene Prediction Programs
• GENSCAN
• GenomeScan
• FGENES
• N-SCAN
• Many others

Genie: Models Gene Structures as Grammars
• Searls (1988) introduced ideas from formal language theory into biosequence analysis
• Context-free grammar: recursive decomposition

Genie Gene Model (David Kulp)
States
  B = begin position
  S = start position
  D = donor site (gt)
  A = acceptor site (ag)
  T = termination site
  F = final position
Transitions
  U5 = 5' UTR
  U3 = 3' UTR
  EI = exon to intron boundary
  SE = single exon
  I = intron
  E = exon
  FE = final coding exon

Models and Graphs
[Figure: the default Genie gene model, the corresponding gene graph, and a parse φ through states q1 to q6. Credit: David Kulp]

Genie addresses the problem of stop codons that span two exons (David Kulp).

Other Gene Prediction Programs
• ORF detectors
  ‒ NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
• Promoter predictors
  ‒ CSHL: http://rulai.cshl.org/software/index1.htm
  ‒ BDGP: fruitfly.org/seq_tools/promoter.html
  ‒ ICG: TATA-Box predictor
• PolyA signal predictors
  ‒ CSHL: http://rulai.cshl.org/tools/polyadq/polyadq_form.html
• Splice site predictors
  ‒ BDGP: http://www.fruitfly.org/seq_tools/splice.html
• Start-/stop-codon identifiers
  ‒ DNALC: Translator/ORF-Finder
  ‒ BCM: Searchlauncher
• Genie (you have to download the source code, compile, and install it to run)
  http://brl.cs.umass.edu/Research/GenePredictionWithConstraints

Acknowledgements
David Kulp, University of Massachusetts - Amherst

Worked Examples
• Worked Example #1: Examine open reading frames in a full-length cDNA for skate SHH using NCBI ORF Finder.
• Worked Example #2: Run GenomeScan to predict the gene structure for the region of human Chr. 7 that encodes SHH.
• Worked Example #3: Run FGENESH to predict the gene structure for the region of human Chr. 7 that encodes SHH.
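
To make the idea behind Worked Example #1 concrete, here is a minimal Python sketch of an ORF scan: it walks all six reading frames and reports every stretch that begins with a start codon (ATG) and ends with the first in-frame stop codon. It is an illustration only, not the algorithm used by NCBI ORF Finder. The names find_orfs, reverse_complement, and min_codons are hypothetical, and the sketch assumes the standard genetic code and an input containing only A, C, G, and T.

```python
# Minimal six-frame ORF scan (illustrative sketch, not NCBI ORF Finder).
# Assumes the standard genetic code: ATG start; TAA, TAG, TGA stops.

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return a list of (strand, frame, start, end, orf_sequence) tuples.

    start and end are 0-based, end-exclusive offsets into the scanned
    strand; for the "-" strand they index the reverse complement, not
    the original sequence. The ORF length counted against min_codons
    includes the stop codon. ORFs with no in-frame stop are not reported.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # offset of the currently open ATG, if any
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # resume looking for the next ATG
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATGACC"):
        print(orf)
```

On a real full-length cDNA, the longest reported ORF is usually the first candidate to inspect, and raising min_codons filters out the many short ORFs that occur by chance.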