* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download D. melanogaster - GEP Community Server
Copy-number variation wikipedia , lookup
Gene therapy of the human retina wikipedia , lookup
Epigenetics of depression wikipedia , lookup
Oncogenomics wikipedia , lookup
History of genetic engineering wikipedia , lookup
Whole genome sequencing wikipedia , lookup
Ridge (biology) wikipedia , lookup
Epigenetics of diabetes Type 2 wikipedia , lookup
RNA interference wikipedia , lookup
No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Genome (book) wikipedia , lookup
Genomic library wikipedia , lookup
Public health genomics wikipedia , lookup
Gene expression programming wikipedia , lookup
Epitranscriptome wikipedia , lookup
Gene desert wikipedia , lookup
Designer baby wikipedia , lookup
Nucleic acid tertiary structure wikipedia , lookup
History of RNA biology wikipedia , lookup
Short interspersed nuclear elements (SINEs) wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Transposable element wikipedia , lookup
Microevolution wikipedia , lookup
Gene expression profiling wikipedia , lookup
Metagenomics wikipedia , lookup
Minimal genome wikipedia , lookup
Human genome wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Long non-coding RNA wikipedia , lookup
RNA silencing wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Non-coding DNA wikipedia , lookup
Pathogenomics wikipedia , lookup
Non-coding RNA wikipedia , lookup
Helitron (biology) wikipedia , lookup
Human Genome Project wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Genome editing wikipedia , lookup
Primary transcript wikipedia , lookup
Searching for Transcription Start Sites in Drosophila Wilson Leung 01/2017 Outline Transcription start sites (TSS) annotation goals Promoter architecture in D. melanogaster New D. melanogaster TSS datasets Find the initial transcribed exon Annotate putative transcription start sites Search for core promoter motifs Muller F element, heterochromatic, and euchromatic genes show similar expression levels Riddle NC, et al. Enrichment of HP1a on Drosophila chromosome 4 genes creates an alternate chromatin structure critical for regulation in this heterochromatic domain. PLoS Genet. 2012 Sep;8(9):e1002954. Goals for the transcription start sites (TSS) annotations Research goal: Identify motifs that enable Muller F element genes to function within a heterochromatic environment Annotation goals: Define search regions enriched in regulatory motifs Define precise location of TSS if possible Define search regions where TSS could be found Document the evidence used to support the TSS annotations Estimated evolutionary distances with respect to D. melanogaster D. melanogaster D. simulans D. sechellia D. yakuba D. erecta D. ficusphila D. eugracilis D. biarmipes D. takahashii D. elegans D. rhopaloa D. kikkawai D. bipectinata D. ananassae D. pseudoobscura D. persimilis D. willistoni D. mojavensis D. virilis D. grimshawi New species sequenced by modENCODE Species Substitutions per neutral site D. ficusphila 0.80 D. eugracilis 0.76 D. biarmipes 0.70 D. takahashii 0.65 D. elegans 0.72 D. rhopaloa 0.66 D. kikkawai 0.89 D. bipectinata 0.99 Data from table 1 of the modENCODE comparative genomics whitepaper GEP annotation projects Phylogenetic tree based on the analysis of 13 Type IIB restriction endonucleases D. simulans D. sechellia D. melanogaster D. yakuba D. santomea D. erecta D. eugracilis D. biarmipes D. takahashii D. elegans D. rhopaloa D. ficusphila D. kikkawai D. ananassae D. bipectinata D. persimilis D. pseudoobscura D. willistoni D. virilis D. mojavensis D. grimshawi Simulate restriction digests of 21 genomes DNA fragments range from 21-33 bp in size Calculate distance between two genomes based on number of shared fragments Seetharam AS and Stuart GW. Whole genome phylogeny for 21 Drosophila species using predicted 2b-RAD fragments. PeerJ. 2013 Dec 23;1:e226. D. biarmipes Muller F element TSS annotation projects Reconcile coding region annotations submitted by GEP students Reconciled annotations might be incorrect Based on older FlyBase release Possible misannotations See the “Revised gene models report form” section of the Transcription Start Sites Project Report Challenges with TSS annotations Fewer constraints on untranslated regions (UTRs) UTRs evolve more quickly than coding regions Open reading frames, compatible phases of donor and acceptor sites do not apply to UTRs Low percent identity (~50-70%) between D. biarmipes contigs and D. melanogaster UTRs Most gene finders do not predict UTRs Lack of experimental data Cannot use RNA-Seq data to precisely define the TSS TSS annotation workflow 1. Identify the ortholog 2. Note the gene structure in D. melanogaster 3. Annotate the coding exons 4. Classify the type of core promoter in D. melanogaster 5. Annotate the initial transcribed exon 6. Identify any core promoter motifs 7. Define TSS positions or TSS search regions RNA Polymerase II core promoter Initiator motif (Inr) contains the TSS TFIID binds to the TATA box and Inr to initiate the assembly of the preinitiation complex (PIC) Juven-Gershon T and Kadonaga JT. Regulation of gene expression via the core promoter and the basal transcriptional machinery. Dev Biol. 2010 Mar 15;339(2):225-9. Peaked versus broad promoters Peaked promoter (Single strong TSS) Broad promoter (Multiple weak TSS) 50-300 bp Kadonaga JT. Perspectives on the RNA polymerase II core promoter. Wiley Interdiscip Rev Dev Biol. 2012 Jan-Feb;1(1):40-51. RNA-Seq biases introduced by library construction RNA-Seq Read Count cDNA fragmentation Strong bias at the 3’ end RNA fragmentation More uniform coverage Miss the 5’ and 3’ ends of the transcript 5’ Gene Span 3’ Wang Z, et al. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. Techniques for finding TSS Identify the 5’ cap at the beginning of the mRNA Cap Analysis of Gene Expression (CAGE) RNA Ligase Mediated Rapid Amplification of cDNA Ends (RLM-RACE) Cap-trapped Expressed Sequence Tags (5’ ESTs) More information on these techniques: Takahashi H, et al. CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. Methods Mol Biol. 2012 786:181-200. Sandelin A, et al. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet. 2007 Jun;8(6):424-36. Promoter architecture in Drosophila Classify core promoter based on the Shape Index (SI) Determined by the distribution of CAGE and 5’ RLM-RACE reads Shape index is a continuum Most promoters in D. melanogaster contain multiple TSS Median width = 162 bp ~70% of vertebrate genes have broad promoters Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92. Genes with peaked promoters show stronger spatial and tissue specificity 46% of genes with broad promoters are expressed in all stages of embryonic development 19% of genes with peaked promoters are expressed in all stages Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92. Peaked and broad promoters are enriched in different core promoter motifs Rach EA, et al. Motif composition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome. Genome Biol. 2009;10(7):R73. Using modENCODE data to classify the type of core promoter in D. melanogaster Only a subset of the modENCODE data are available through FlyBase D. melanogaster GEP UCSC Genome Browser [Aug. 2014 (BDGP Release 6) assembly] FlyBase gene annotations (release 6.13) modENCODE TSS (Celniker) annotations DNase I hypersensitive sites (DHS) CAGE and RAMPAGE TSS datasets 9-state and 16-state chromatin models Transcription factor binding site (TFBS) HOT spots 9-state chromatin model Kharchenko PV, et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature. 2011 Mar 24;471(7339):480-5. DNaseI Hypersensitive Sites (DHS) correspond to accessible regions Aasland R and Stewart AF. Analysis of DNaseI hypersensitive sites in chromatin by cleavage in permeabilized cells. Methods Mol Biol. 1999;119:355-62. Ho JW, et al. Comparative analysis of metazoan chromatin organization. Nature. 2014 Aug 28;512(7515):449-52. modENCODE TSS annotations Two sets of modENCODE TSS predictions TSS (Celniker) Most recent dataset produced by modENCODE Available on the GEP UCSC Genome Browser TSS (Embryonic) Older dataset available from FlyBase GBrowse Use TSS (Celniker) dataset as the primary evidence Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92 Classify the D. melanogaster core promoter for each unique TSS TSS classification # Annotated TSS # DHS positions Peaked 1 0 1 0 1 1 Intermediate ≤1 >1 >1 ≤1 Broad >1 >1 Insufficient evidence 0 0 Consider DHS positions within a 300bp window surrounding the start of the D. melanogaster transcript DEMO: Classify the core promoter of D. melanogaster Rad23 Additional DHS data from different stages of embryonic development DHS data produced by the BDTNP project Evidence tracks: Detected DHS Positions (Embryos) DHS Read Density (Embryos) Thomas S, et al. Dynamic reprogramming of chromatin accessibility during Drosophila embryo development. Genome Biol. 2011;12(5):R43. New TSS evidence tracks available in FlyBase release 6.11 Batut P, Dobin A, Plessy C, Carninci P, Gingeras TR. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res. 2013 Jan;23(1):169-80. Benefits of RAMPAGE RAMPAGE = RNA Annotation and Mapping of Promoters for Analysis of Gene Expression CAGE only allows sequencing of short sequence tags (~27 bp) near the 5’ cap Ambiguous read mapping to large parts of the genome RAMPAGE produces long paired-end reads instead of short sequence tags Developed novel algorithm to identify TSS clusters Used paired-end information during peak calling Used Cufflinks to produce partial transcript models Batut P, Gingeras TR. RAMPAGE: promoter activity profiling by paired-end sequencing of 5'complete cDNAs. Curr Protoc Mol Biol. 2013 Nov 11;104:Unit 25B.11. Signals in the FlyBase RAMPAGE and MachiBase TSS tracks are off by one base “FlyBase: GBrowse Tracks” page on the FlyBase Wiki http://flybase.org/wiki/FlyBase:GBrowse_Tracks#Aligned_Evidence RAMPAGE results on the GEP UCSC Genome Browser Lifted RAMPAGE results from release 5 to release 6 Results from 36 developmental stages Combined TSS peak call from all samples Available under the “Expression and Regulation” section Analysis of MachiBase and modENCODE CAGE data using CAGEr Bioconductor package developed by RIKEN Standardize analysis of CAGE data compared to the custom protocols used by modENCODE Map datasets against release 6 assembly 37 modENCODE CAGE samples; 7 MachiBase samples Define TSS and promoters for each sample Define consensus promoters across all samples Haberle V, et al. CAGEr: precise TSS data retrieval and high-resolution promoterome mining for integrative analyses. Nucleic Acids Res. 2015 Apr 30;43(8):e51. TSS classifications based on CAGEr Peaked FlyBase Genes modENCODE CAGE Peaks modENCODE CAGE (Plus) Intermediate FlyBase Genes modENCODE CAGE Peaks modENCODE CAGE (Plus) Broad FlyBase Genes modENCODE CAGE Peaks modENCODE CAGE (Minus) Changes in the dominant TSS of Rad23 across different developmental stages Stages of Development Rad23 CAGE Tag Clusters Adult females Evidence for TSS annotations (in general order of importance) 1. Experimental data RNA-Seq RNA Pol II ChIP-Seq (D. biarmipes only) 2. Conservation Type of TSS (peaked/intermediate/broad) in D. melanogaster Sequence similarity to initial exon in D. melanogaster Sequence similarity to other Drosophila species (Multiz) 3. Core promoter motifs Inr, TATA box, etc. Determine the gene structure in D. melanogaster FlyBase: GBrowse Gene Record Finder: Transcript Details UTR CDS Identify the initial transcribed exon using NCBI blastn Retrieve the sequences of the initial exons from the Transcript Details tab of the Gene Record Finder Use placement of the flanking exons to reduce the size of the search region if possible Increase sensitivity of nucleotide searches Change Program Selection to blastn Change Word size to 7 Change Match/Mismatch Scores to +1, -1 Change Gap Costs to Existence: 2, Extension: 1 Optimize alignment parameters based on expected levels of conservation Derive alignment scores using information theory Relative entropy of target and background frequencies Match +2, Mismatch -3 optimized for 90% identity Match +1, Mismatch -1 optimized for 75% identity States DJ, et al. Improved Sensitivity of Nucleic Acid Database Searches Using ApplicationSpecific Scoring Matrices. Methods 3:66-70. RNA PolII ChIP-Seq tracks (D. biarmipes only) Show regions that are enriched in RNA Polymerase II compared to input DNA Gene Models RNA-Seq RNA PolII Peaks RNA PolII Enrichment Use core promoter motifs to support TSS annotations Some sequence motifs are enriched in the region (~300 bp) surrounding the TSS Some motifs (e.g., Inr, TATA) are well-characterized Other motifs are identified based on computational analysis Presence of core promoter motifs can be used to support the TSS annotations Inr motif (TCAKTY) overlaps with the TSS (-2 to +4) Absence of core promoter motifs is a negative result Most D. melanogaster TSS do not contain the Inr motif Use UCSC Genome Browser Short Match to find Drosophila core promoter motifs TATA box Initiator (Inr) Ohler U, et al. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 2002; 3(12):RESEARCH0087. Available under “Projects” “Annotation Resources” “Core Promoter Motifs” on the GEP web site: http://gander.wustl.edu/~wilson/core_promoter_motifs.html Core Promoter Motifs tracks Show core promoter motif matches for each contig Separated by strand Visualize matches to different core promoter motifs Use UCSC Table Browser (or other means) to export the list of motif matches within the search region DEMO: Use the Inr motif to determine the TSS position of Rad23 Using RNA-Seq and RNA PolII ChIP-Seq data to define the TSS search region D. mel Transcripts RNA-Seq RNA PolII Peaks RNA PolII Enrichment TSS search region TSS annotation for Rad23 TSS position: 28,936 Conservation with D. melanogaster blastn search of initial exon “D. mel Transcripts” track Location of the Inr motif TSS search region: 28,716-28,936 Enrichment of RNA PolII upstream of the TSS position RNA-Seq read coverage upstream of the TSS position Search region defined by the extent of the RNA PolII peak TSS annotation resources Walkthroughs: Annotation of Transcription Start Sites in Drosophila Sample TSS report for onecut Reference: TSS Annotation Workflow GEP Annotation Report: Classify the type of core promoter Evidence that supports or refutes the TSS annotation Distribution of core promoter motifs Additional TSS annotation resources The D. melanogaster gene annotations are the primary source of evidence Resources that could be useful if the D. melanogaster evidence is ambiguous Whole genome alignments of 14 Drosophila species PhastCons and PhyloP conservation scores Genome browsers for nine Drosophila species RNA Pol II ChIP-Seq (D. biarmipes only) RNA-Seq coverage, TopHat junctions, assembled transcripts Augustus and N-SCAN gene predictions TSS annotation summary Most of the D. melanogaster core promoters have multiple TSS Classify the type of promoter (peaked/intermediate/broad) based on the transcriptome evidence from D. melanogaster Define search regions that might contain TSS Use multiple lines of evidence to infer the TSS region Identify initial exon RNA-Seq coverage blastn (change search parameters) Distribution of core promoter motifs (e.g., Inr) D. biarmipes RNA PolII ChIP-Seq peaks Maintain conservation compared to D. melanogaster Questions? Structure of a typical mRNA Pesole G. et al. Untranslated regions of mRNAs. Genome Biology. 2002: 3(3) reviews0004.1-reviews0004.10. RAMPAGE protocol DEMO blastn search of the initial transcribed exon of Rad23 against D. biarmipes contig19 Use RNA PolII tracks on the D. biarmipes genome browser to identify putative TSS April 2013 (BCM-HGSC/Dbia_2.0) assembly Search for orthologous regions in D. elegans Gnomon predictions for eight Drosophila species Based on RNA-Seq data from either the same or closely-related species D. simulans, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. willistoni, D. virilis, and D. mojavensis Predictions include untranslated regions and multiple isoforms Records not yet available through the NCBI RefSeq database Conservation tracks on the D. melanogaster GEP UCSC Genome Browser Whole genome alignments of 14 Drosophila species Drosophila Chain/Net composite track Generate multiple sequence (Multiz) alignments from these pairwise alignments Identify conserved regions from Multiz alignments PhastCons: identify conserved elements PhyloP: measure level of selection at each nucleotide Multiz alignment of 27 insect species available on the official UCSC Genome Browser Aug. 2014 (BDGP Release 6 + ISO1 MT/dm6) assembly Use the conservation tracks to identify regions under selection PhyloP scores: Under negative selection Fast-evolving Examine the Multiz alignments to identify the orthologous TSS regions Use RNA-Seq data to predict untranslated regions and putative TSS TSS predictions available for 9 Drosophila species N-SCAN+PASA-EST, Augustus, TransDecoder D. mel Proteins N-SCAN Augustus TransDecoder RNA-Seq DEMO Genome browsers for other Drosophila species