RNA-Seq with the Tuxedo Suite - UC Davis Bioinformatics Core
RNA-Seq with the Tuxedo Suite
Monica Britton, Ph.D.
Sr. Bioinformatics Analyst
June 2015 Workshop

The Basic Tuxedo Suite

References
• Trapnell C, et al. 2009. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics
• Trapnell C, et al. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology
• Kim D, et al. 2013. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology
• Roberts A, et al. 2011. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology
• Roberts A, et al. 2011. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics
• Trapnell C, et al. 2013. Differential analysis of gene regulation at transcript resolution with RNA-Seq. Nature Biotechnology

Cufflinks assembles transcripts
Cuffdiff identifies differential expression of genes/transcripts/promoters

Alignment and Differential Expression
[Workflow diagram: Read set(s) and existing annotation (GTF) → TopHat → bam file(s) → Cuffdiff → Toptables, etc.]
We followed these steps with the single-end reads

But, do we have all the genes?
• For organisms with genomes, gene models are stored in a GTF file
• Assumptions:
  – The GTF file contains annotation for ALL transcripts and genes
  – All splice sites, start/stop codons, etc. are correct
• Are these assumptions correct for every sequenced organism?
• RNA-Seq reads can be used to independently construct genes and splice variants using limited or no annotation
• Method used depends on how much sequence information there is for the organism…

Gene Construction (Alignment) vs. Assembly
[Figure: gene construction by alignment for genome-sequenced organisms vs. assembly (Trinity software) for novel or non-model organisms. Haas and Zody (2010) Nat. Biotech. 28:421-3]

Gene / Transcriptome Construction
• Annotation can be improved, even for well-annotated model organisms
  – Identify all expressed exons
  – Combine expressed exons into genes
  – Find all splice variants for a gene
  – Discover novel transcripts
• For newly sequenced organisms
  – Validate ab initio annotation
  – Comparison between different annotation sets
• Can assist in finding some types of contamination
  – Reconstruction of rRNA genes
  – Genomic/mitochondrial DNA in RNA library preps

Reference Annotation Based Transcript (RABT) Assembly
[Workflow diagram: Read set(s) → TopHat → bam file(s) → Cufflinks → Read-set specific GTF(s) → Cuffmerge → Merged GTF → Cuffcompare → Final assembly (GTF and stats) → Cuffdiff → Toptables, etc. The existing annotation (GTF) is an optional input.]

TopHat Spliced Alignment to a Genome
[Figure]

Reference Annotation Based Transcript (RABT) Assembly
[Figure]

Cufflinks – Identification of Incompatible Fragments
[Figure; label: Incompatible alignment]

Cufflinks – Minimum Paths to Transcripts
[Figure]

Cufflinks – Abundance Estimation
[Figure]

Merging Cufflinks Assemblies
[Figure]

So Now We’ve Explored These Tools…

We’ve Used Other Software in Conjunction
[Workflow diagram: HTSeq-count → Raw Counts → edgeR]
(But HTSeq-count and edgeR are independent)

And Then Came Some Extensions…

Modules Introduced in 2014

Cuffquant
• Improves efficiency of running multiple samples
• Stores data in a compressed “.cxb” format that can later be analyzed with cuffdiff or cuffnorm

Cuffnorm
• Generates tables of expression values that are normalized for library size
• Tables are used as input to Monocle

Monocle
• Used to analyze single-cell expression data
• Trapnell, et al., 2014, Nat. Biotech. 32:381

…But Software Continues to Evolve

HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts)
• Kim et al., 2015, Nat. Methods
• Planned to be TopHat3
• Faster than other aligners
• More accurate on simulated reads

StringTie
• Pertea et al., 2015, Nat. Biotech
• Probable successor to Cufflinks2
• Assembles more transcripts (based on simulated reads)

Ballgown
• Frazee et al., 2015, Nat. Biotech
• Bioconductor R package
• Probable successor to Cuffdiff2
• Includes useful Tablemaker preprocessor

A New Potential Game-Changer (2015)

Kallisto (“Near-Optimal RNA-Seq Quantification”)
• Bray et al. (http://arxiv.org/abs/1505.02710)
• Extremely fast, uses pseudo-alignment based on k-mers and de Bruijn graphs
[Figure panels: Speed, Accuracy]
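
The k-mer pseudo-alignment idea mentioned for Kallisto can be illustrated with a toy sketch. This is not Kallisto's implementation: it uses a plain hash map instead of a de Bruijn graph, a very short k, and made-up sequences. The function names (build_index, pseudoalign) and the transcripts tx1 and tx2 are hypothetical. The point is only that a read is assigned to the set of transcripts sharing all of its k-mers, without computing a base-by-base alignment.

```python
from collections import defaultdict

K = 5  # toy k-mer length (assumption); real tools use much longer k-mers


def kmers(seq, k=K):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(transcripts, k=K):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def pseudoalign(read, index, k=K):
    """Return the transcripts that contain every k-mer of the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        # Intersect the candidate set with the transcripts holding this k-mer.
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()  # no transcript is consistent with all k-mers so far
    return compatible or set()  # reads shorter than k yield an empty set


# Hypothetical transcripts that share a prefix and differ at the 3' end.
transcripts = {
    "tx1": "ACGTACGGATTCAGGC",
    "tx2": "ACGTACGGATTGCCTA",
}
index = build_index(transcripts)

shared = pseudoalign("ACGTACGGAT", index)  # read from the shared region
unique = pseudoalign("GGATTCAGG", index)   # read spanning the tx1-specific end
print(sorted(shared), sorted(unique))
```

A read from the shared region stays compatible with both transcripts, while a read that crosses the transcript-specific sequence narrows to one. Quantification then has to distribute the ambiguous reads among compatible transcripts, which this sketch does not attempt.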