RNA-Seq with the Tuxedo Suite
UC Davis Bioinformatics Core
Monica Britton, Ph.D.
Sr. Bioinformatics Analyst
June 2015 Workshop
The Basic Tuxedo Suite
References
Trapnell C, et al. 2009 TopHat: discovering splice junctions with RNA-Seq. Bioinformatics
Trapnell C, et al. 2010 Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology
Kim D, et al. 2013 TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology
Roberts A, et al. 2011 Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology
Roberts A, et al. 2011 Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics
Trapnell C, et al. 2013 Differential analysis of gene regulation at transcript resolution with RNA-Seq. Nature Biotechnology
Cufflinks assembles transcripts
Cuffdiff identifies differential expression of genes/transcripts/promoters
Alignment and Differential Expression
[Workflow diagram: Read set(s) and existing annotation (GTF) → TopHat → bam file(s) → Cuffdiff → Toptables, etc.]
We followed these steps with the single-end reads
But, do we have all the genes?
• For organisms with sequenced genomes, gene models are stored in a GTF file
• Assumptions:
– The GTF file contains annotation for ALL transcripts and genes
– All splice sites, start/stop codons, etc. are correct
• Are these assumptions correct for every sequenced organism?
• RNA-Seq reads can be used to independently construct genes and splice variants using limited or no annotation
• Method used depends on how much sequence information there is for the organism…
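One quick way to see what a GTF file actually contains is to count the transcripts annotated for each gene. The sketch below is a minimal illustration, not part of the Tuxedo Suite: it assumes the usual nine tab-separated GTF columns with `gene_id` and `transcript_id` attributes on exon lines, and the gene and transcript IDs in the example data are made up.

```python
import io
import re
from collections import defaultdict

# Tiny made-up GTF fragment (hypothetical gene and transcript IDs).
EXAMPLE_GTF = (
    'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";\n'
    'chr1\texample\texon\t300\t400\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";\n'
    'chr1\texample\texon\t100\t400\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.2";\n'
    'chr2\texample\texon\t50\t150\t.\t-\t.\tgene_id "geneB"; transcript_id "geneB.1";\n'
)

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def summarize_gtf(handle):
    """Map each gene_id to the set of transcript_ids seen on exon lines."""
    transcripts_by_gene = defaultdict(set)
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue  # only exon records are considered in this sketch
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" in attrs and "transcript_id" in attrs:
            transcripts_by_gene[attrs["gene_id"]].add(attrs["transcript_id"])
    return transcripts_by_gene


summary = summarize_gtf(io.StringIO(EXAMPLE_GTF))
for gene, transcripts in sorted(summary.items()):
    print(gene, len(transcripts), "transcript(s)")
```

A gene with a single annotated transcript is not necessarily wrong, but a genome where nearly every gene has exactly one isoform is a hint that the annotation may be incomplete.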
Gene Construction (Alignment) vs. Assembly
[Figure: gene construction by alignment for genome-sequenced organisms vs. assembly (Trinity software) for novel or non-model organisms; from Haas and Zody (2010) Nat. Biotech. 28:421-3]
Gene / Transcriptome Construction
• Annotation can be improved – even for well-annotated model organisms
– Identify all expressed exons
– Combine expressed exons into genes
– Find all splice variants for a gene
– Discover novel transcripts
• For newly sequenced organisms
– Validate ab initio annotation
– Comparison between different annotation sets
• Can assist in finding some types of contamination
– Reconstruction of rRNA genes
– Genomic/mitochondrial DNA in RNA library preps.
Reference Annotation Based Transcript (RABT) Assembly
[Workflow diagram: Read set(s) and existing annotation (GTF, optional) → TopHat → bam file(s) → Cufflinks → read-set specific GTF(s) → Cuffmerge → merged GTF → Cuffcompare → final assembly (GTF and stats); Cuffdiff → Toptables, etc.]
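The RABT workflow on this slide can be written out as a sequence of command lines. The sketch below only builds and prints the commands; it does not run them. The options shown (`-G` for TopHat, `-g` for Cufflinks, `-g`/`-s` for Cuffmerge, `-r` for Cuffcompare, `-L` for Cuffdiff) are given as I recall them from the tool manuals, so check them against the help output of your installed versions. All file names, directory names and sample labels are hypothetical.

```python
import shlex


def rabt_commands(samples, bowtie_index, genome_fasta, ref_gtf, threads=4):
    """Build (not run) the command lines for a RABT workflow.

    samples maps a condition label to a list of (sample_name, fastq_path).
    Returns (commands, assembly_gtfs); assembly_gtfs should be written,
    one path per line, to the manifest file that cuffmerge reads.
    """
    p = str(threads)
    commands, assembly_gtfs, bams_by_condition = [], [], {}
    for condition, reads in samples.items():
        for name, fastq in reads:
            tophat_dir = f"tophat_{name}"
            cufflinks_dir = f"cufflinks_{name}"
            bam = f"{tophat_dir}/accepted_hits.bam"
            # Spliced alignment, using the existing annotation as a guide.
            commands.append(["tophat", "-p", p, "-G", ref_gtf,
                             "-o", tophat_dir, bowtie_index, fastq])
            # -g (not -G) asks Cufflinks for reference-guided (RABT) assembly.
            commands.append(["cufflinks", "-p", p, "-g", ref_gtf,
                             "-o", cufflinks_dir, bam])
            assembly_gtfs.append(f"{cufflinks_dir}/transcripts.gtf")
            bams_by_condition.setdefault(condition, []).append(bam)
    merged_gtf = "merged_asm/merged.gtf"
    # Merge the per-sample assemblies listed in the manifest file.
    commands.append(["cuffmerge", "-p", p, "-g", ref_gtf, "-s", genome_fasta,
                     "-o", "merged_asm", "assemblies.txt"])
    # Compare the merged assembly with the reference annotation.
    commands.append(["cuffcompare", "-r", ref_gtf, "-o", "cuffcmp", merged_gtf])
    # One comma-separated list of replicate bam files per condition.
    commands.append(["cuffdiff", "-p", p, "-o", "cuffdiff_out",
                     "-L", ",".join(bams_by_condition), merged_gtf]
                    + [",".join(b) for b in bams_by_condition.values()])
    return commands, assembly_gtfs


samples = {"control": [("c1", "c1.fastq")], "treated": [("t1", "t1.fastq")]}
commands, assembly_gtfs = rabt_commands(
    samples, "genome_index", "genome.fa", "genes.gtf")
for cmd in commands:
    print(shlex.join(cmd))
```

Building the commands as argument lists keeps quoting safe if they are later passed to `subprocess.run`, and makes it easy to inspect the whole plan before anything is executed.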
TopHat Spliced Alignment to a Genome
Cufflinks – Identification of Incompatible Fragments
[Figure label: Incompatible alignment]
Cufflinks – Minimum Paths to Transcripts
Cufflinks – Abundance Estimation
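Cufflinks reports abundances in FPKM (fragments per kilobase of transcript per million mapped fragments). The sketch below shows only the arithmetic behind the unit, with made-up numbers. It is not how Cufflinks itself estimates abundance: Cufflinks uses a statistical model to assign fragments among isoforms, which this simple formula does not attempt.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Made-up example: 500 fragments on a 2,000 bp transcript,
# in a library of 20 million mapped fragments.
print(fpkm(500, 2000, 20_000_000))  # 12.5
```

Dividing by length and library size is what makes values comparable between long and short transcripts and between deeply and shallowly sequenced samples.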
Merging Cufflinks Assemblies
So Now We’ve Explored These Tools…
We’ve Used Other Software in Conjunction
[Workflow diagram: HTSeq-count → Raw Counts → edgeR]
(But HTSeq-count and edgeR are independent)
And Then Came Some Extensions…
Modules Introduced in 2014
Cuffquant
• Improves efficiency of running multiple samples
• Stores data in a compressed “.cxb” format that can later be analyzed with Cuffdiff or Cuffnorm
Cuffnorm
• Generates tables of expression values that are normalized for library size
• Tables are used as input to Monocle
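To make "normalized for library size" concrete, here is a minimal sketch that rescales raw counts so every sample has the same total. This is plain total-count scaling chosen for illustration; it does not reproduce Cuffnorm's own normalization methods. The sample names, gene names and counts are made up.

```python
def scale_to_common_library_size(counts):
    """Rescale raw counts so every sample has the same total.

    counts: {sample: {gene: raw_count}}. Each sample is scaled so its
    total equals the mean library size across samples.
    """
    if not counts:
        raise ValueError("no samples given")
    totals = {sample: sum(genes.values()) for sample, genes in counts.items()}
    if any(total == 0 for total in totals.values()):
        raise ValueError("a sample has zero total counts")
    target = sum(totals.values()) / len(totals)
    return {
        sample: {gene: c * target / totals[sample] for gene, c in genes.items()}
        for sample, genes in counts.items()
    }


raw = {
    "s1": {"g1": 10, "g2": 30},   # total 40
    "s2": {"g1": 20, "g2": 60},   # total 80, same proportions as s1
}
normalized = scale_to_common_library_size(raw)
print(normalized)
```

In this example the second sample was simply sequenced twice as deeply, so after scaling both samples show identical values for each gene.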
Monocle
• Used to analyze single-cell expression data
• Trapnell, et al., 2014, Nat. Biotech. 32:381
…But Software Continues to Evolve
HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts)
• Kim et al., 2015, Nat. Methods
• Planned to be TopHat3
• Faster than other aligners
• More accurate on simulated reads
StringTie
• Pertea et al., 2015, Nat. Biotech.
• Probable successor to Cufflinks2
• Assembles more transcripts (based on simulated reads)
Ballgown
• Frazee et al., 2015, Nat. Biotech.
• Bioconductor R package
• Probable successor to Cuffdiff2
• Includes useful Tablemaker preprocessor
A New Potential Game-Changer (2015)
Kallisto (“Near-Optimal RNA-Seq Quantification”)
• Bray et al. (http://arxiv.org/abs/1505.02710)
• Extremely fast, uses pseudo-alignment based on k-mers and de Bruijn graphs
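The idea behind k-mer based pseudo-alignment can be shown with a toy example: rather than aligning a read base by base, look up each of its k-mers and intersect the sets of transcripts that contain them. The sketch below is a deliberately simplified illustration, not Kallisto's implementation. It uses a plain dictionary instead of a de Bruijn graph, ignores strand and sequencing errors, and the transcript sequences are made up.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts compatible with every k-mer of the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer)
        if not hits:
            return set()  # toy rule: an unseen k-mer means no match
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    return compatible or set()  # reads shorter than k match nothing


K = 5
transcripts = {            # made-up sequences sharing the prefix ACGTAC
    "t1": "ACGTACGGTCA",
    "t2": "ACGTACCTTGA",
}
index = build_index(transcripts, K)
print(pseudoalign("ACGTAC", index, K))   # read from the shared region
print(pseudoalign("TACGGTC", index, K))  # read specific to t1
```

The output is a set of compatible transcripts rather than a base-level alignment, which is the information a quantification step needs and part of why this approach can be so fast.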
[Figure panels: Speed; Accuracy]