RNA-Seq Workshop
AChemS 2017
Sunil K Sukumaran
Monell Chemical Senses Center
Philadelphia
Benefits & downsides of RNA-Seq
Benefits:
■ High resolution, sensitivity and large dynamic range
■ Independent of prior knowledge (in contrast to
predesigned probes in microarray analysis).
■ Unravel previously inaccessible complexities.
Downsides:
■ Data analysis is not straightforward; methods continue
to evolve.
■ Cost
Types of RNA-Seq analysis
■ Gene expression analysis
■ Single cell RNA-Seq (scRNA-Seq)
■ Small RNA-Seq (miRNA-Seq)
■ Analysis of RNA-protein/RNA-RNA-interaction
Goals of ‘typical’ RNA-Seq analysis
■ Identify expressed genes and transcripts
■ Quantify gene expression in different conditions or
tissues (differential expression).
■ Identify novel transcripts and genes (de novo
assembly)
– Alternative splicing
– Novel transcribed genes
– Transcriptome from non-model organisms
Comparison of sequencing
platforms
Figure: sequencing platforms grouped as “1st Gen”, “2nd Gen” and “3rd Gen”.
Overview of Illumina RNA-Seq
https://www.slideshare.net/ueb52/uebuat-bioinformatics-course-session-23-vhir-barcelona
Sequencing strategies
■ Which library preparation protocol to use?
■ How many replicates?
■ What is the optimal library size (sequencing depth)?
■ Paired end or single end?
■ Which data analysis pipeline to use?
Not all types of RNA encode
information
The bulk (~95%) of cellular
RNA is rRNA and tRNA.
http://finchtalk.blogspot.com/2009/05/small-rnas-get-smaller.html
Quality and quantity of input RNA
■ High quality RNA is preferred, but is often not
available.
– Needle biopsies, laser microdissection and
formalin-fixed paraffin-embedded samples
yield low integrity RNA.
■ The amount of RNA may be low by necessity or by
design (e.g. scRNA-Seq).
mRNA has to be selectively enriched
■ polyA selection (using magnetic beads)
■ rRNA depletion (RNase H, Ribo-Zero)
Stranded libraries are better!
■ Stranded libraries preserve information on the strand of origin
of the transcript
– Helpful when overlapping antisense transcripts occur in a
genomic region (~19% of genes in the human genome!),
e.g. the mouse Gng13 and Chtf18 genes.
How many replicates?
■ Considerations include:
– Technical variability of the RNA-Seq protocol.
– The intrinsic biological variability.
– The desired statistical power.
■ Multiple samples can be sequenced in the same lane
(multiplexing).
■ Prepare all replicate libraries at once, to avoid batch
effects.
Sequencing mode and length
■ Paired end preferred for de novo transcriptome
assembly and isoform level analysis
■ Single end sequencing sufficient for gene
expression studies
■ Illumina sequencer read lengths vary from 50 to 150 bp.
■ Longer read length = better mappability.
Library size
■ Only a subset of the genome is transcribed
■ The dynamic range of gene expression is huge
– Reliable detection of genes expressed at lower
levels needs a bigger library size.
– scRNA-Seq needs lower depth
■ Tools such as Scotty and RNASeqPower can help
calculate optimum library size and # replicates based
on pilot data.
■ The ENCODE consortium guidelines:
http://encodeproject.org/ENCODE/experiment_guidelines.html
RNA-Seq Library preparation
Library specific index sequences
allow pooling multiple libraries
~ 6 libraries are pooled per lane for typical RNA-Seq
100’s of libraries are pooled for scRNA-Seq.
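Pooling works because every read carries the index of its library, so reads can be sorted back into libraries after sequencing. The following is a minimal sketch of that sorting step, assuming exact index matches only; the index sequences and library names are hypothetical, and real demultiplexing software typically also tolerates sequencing errors in the index.

```python
from collections import defaultdict

def demultiplex(reads, index_to_library):
    """Group (index, sequence) pairs by library.

    Reads whose index is not in the lookup table go to 'undetermined'.
    """
    bins = defaultdict(list)
    for index, seq in reads:
        library = index_to_library.get(index, "undetermined")
        bins[library].append(seq)
    return dict(bins)

# Hypothetical 6-nt indexes for two libraries pooled in one lane.
index_to_library = {"ATCACG": "taste_rep1", "CGATGT": "taste_rep2"}
reads = [
    ("ATCACG", "GATTTGGG"),
    ("CGATGT", "TTCAAAGC"),
    ("NNNNNN", "ACGTACGT"),  # index could not be read
]
bins = demultiplex(reads, index_to_library)
```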
Digital RNA-Seq uses barcodes to
correct PCR bias
Proc Natl Acad Sci U S A. 2012 Jan 24;109(4):1347-52
Particularly useful when many cycles of PCR amplification
are used (e.g. scRNA-Seq)
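The idea can be sketched as counting distinct barcodes per gene instead of counting reads, so that PCR copies of one original molecule are counted once. This is a simplified illustration, assuming reads have already been assigned to genes and that barcodes are compared by exact match; the read data below are hypothetical.

```python
from collections import Counter, defaultdict

def count_molecules(reads):
    """Count distinct barcodes per gene from (gene, barcode) pairs."""
    barcodes_by_gene = defaultdict(set)
    for gene, barcode in reads:
        barcodes_by_gene[gene].add(barcode)
    return {gene: len(codes) for gene, codes in barcodes_by_gene.items()}

# Hypothetical reads: the first three are PCR copies of the same molecule.
reads = [
    ("Gng13", "AACGT"), ("Gng13", "AACGT"), ("Gng13", "AACGT"),
    ("Gng13", "TTGCA"),
    ("Chtf18", "CCATG"),
]
read_counts = Counter(gene for gene, _ in reads)   # inflated by PCR duplicates
molecule_counts = count_molecules(reads)           # one count per barcode
```

Here the raw read count for Gng13 is 4, while the barcode-based count is 2.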
Illumina Sequencing
From sequence to biological insights
■ FASTQ files (QC by FastQC/R)
■ Read mapping to genome/transcriptome/de novo assembly (QC by RSeQC)
■ Expression quantification: summarize read counts (EM/union of exons)
■ Differential expression analysis (gene/transcript level)
■ Functional interpretation: enriched pathways/GO terms, integration with other data
■ Biological insights & hypotheses
FASTQ file format
■ FASTQ format is used by modern sequencers. It bundles a FASTA
sequence and its quality data. Each record has four lines:
– Line 1: Sequence identifier (begins with @)
– Line 2: Raw sequence
– Line 3: Separator (+), may repeat the sequence identifier
– Line 4: Quality values for the sequence (! = lowest, ~ = highest)
@HWUSI-EAS100R:6:73:941:1973#0/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
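The four-line record above can be read with a few lines of Python. This is a minimal sketch, assuming unwrapped four-line records and Phred+33 quality encoding (consistent with ! being the lowest value); the function names are illustrative and not part of any library.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality) from an iterable of FASTQ lines."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual

def phred_scores(quality, offset=33):
    """Decode a quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]

record = [
    "@HWUSI-EAS100R:6:73:941:1973#0/1",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
name, seq, qual = next(parse_fastq(record))
scores = phred_scores(qual)  # one integer per base; the first base scores 0
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 20 means roughly a 1 in 100 chance that the base call is wrong.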
Sequencing QC, using FastQC
■ Basic information (total reads, sequence
length, etc.)
■ Per base sequence quality
■ Overrepresented sequences
■ GC content
■ Duplication level
■ Etc.
FastQC report
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Per base sequence quality
Overrepresented Sequences
Adapter
Challenges in RNA-Seq read
alignment
■ How to correctly align short reads to the parent gene?
– Theoretically, the chance of a 100 bp read occurring
more than once in a genome is infinitesimally small (4^100
= ~1.6 × 10^60 possible sequences, compared to the size of
a mammalian genome, ~3 × 10^9 bp).
– But repeated sequences, such as conserved regions in gene
families, and overlapping antisense genes abound in the
genome.
– About 1/3 of RNA-Seq reads span exon-exon junctions!
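The arithmetic behind the first point can be checked directly. The sketch below assumes a genome of random, independent bases, which is exactly the assumption that repeats and gene families violate.

```python
read_length = 100
genome_size = 3 * 10**9   # approximate mammalian genome size in bp

# Number of distinct sequences of length 100 over the alphabet A, C, G, T.
possible_sequences = 4 ** read_length

# Expected number of positions at which one given read would match
# by chance, if the genome were a random sequence.
expected_chance_matches = genome_size / possible_sequences

print(f"{possible_sequences:.2e} possible 100-mers")
print(f"{expected_chance_matches:.1e} expected chance matches per read")
```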
Are you awake?
The shredded book analogy for
short read alignment
Adapted from a lecture by Michael Schatz, JHU
Nature Methods. 10, 1165–1166 (2013)
de Bruijn Graph assembly
Repeat sequences make correct reconstruction
and quantification difficult.
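To see why repeats cause trouble, here is a toy de Bruijn graph in which nodes are (k-1)-mers and each k-mer in a read adds an edge. It is a sketch for illustration only, with hypothetical reads; real assemblers add error correction, coverage tracking and graph simplification.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

# Hypothetical reads that share "TGCA" but continue differently,
# as happens when a repeated sequence occurs in two contexts.
reads = ["ATGCAT", "TGCAC"]
graph = de_bruijn(reads, k=4)

# Nodes with more than one successor are branch points: the assembler
# cannot tell from the graph alone which path a transcript followed.
branching = {node: nxt for node, nxt in graph.items() if len(nxt) > 1}
```

In this example the node GCA has two successors (CAT and CAC), so the path through the graph is ambiguous at that point.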
Overview of read mapping and
transcript identification
Model organisms (reference sequence available):
■ RNA-Seq reads → splice-aware mapper (TopHat, STAR, HISAT) → align to genome
– With GTF: analyze known transcripts, using an EM algorithm (Cufflinks, RSEM…) or union of exons (featureCounts)
– W/O GTF: discover novel transcripts (StringTie, Cufflinks)
■ RNA-Seq reads → ungapped mapper (BWA, Bowtie) → align to transcriptome → analyze known transcripts (RSEM, Kallisto)
Non-model organisms (poor or no reference sequence):
■ RNA-Seq reads → de novo assembler (Trinity) → identify all transcripts
■ Ungapped mapper (BWA, Bowtie) → align to de novo transcriptome → analyze (RSEM, Kallisto)
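The "union of exons" approach to counting can be illustrated by merging the exons of all isoforms of a gene into one non-overlapping set and counting reads that fall inside it. This is a conceptual sketch with hypothetical coordinates, assuming 1-based inclusive intervals and one position per read; it is not the algorithm of any particular tool, which handle strand, multi-mapping and partial overlaps in more detail.

```python
def merge_exons(exons):
    """Merge overlapping or adjacent (start, end) intervals (1-based, inclusive)."""
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]

def count_reads(read_positions, union):
    """Count reads whose position falls inside any merged exon."""
    return sum(any(s <= pos <= e for s, e in union) for pos in read_positions)

# Hypothetical gene with two isoforms that share part of their first exon.
isoform_a = [(100, 200), (300, 400)]
isoform_b = [(150, 250), (300, 400)]

union = merge_exons(isoform_a + isoform_b)           # [(100, 250), (300, 400)]
union_length = sum(e - s + 1 for s, e in union)      # 252 bp
n_reads = count_reads([120, 240, 275, 350], union)   # position 275 is intronic
```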
Alignment and annotation files
■ SAM is a text based file format for storing sequences aligned
to a reference sequence.
– Consists of an optional header section (lines beginning with @)
and a mandatory alignment section.
– Alignment section has 11 mandatory fields specifying
alignment information
■ BAM files are compressed forms of SAM files.
■ GTF, GFF and BED files contain annotations of features such
as the coordinates of genes, transcripts and exons.
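As an illustration of the 11 mandatory fields, the sketch below splits one alignment line and decodes two FLAG bits. The field names and flag values follow the SAM specification; the alignment line itself is hypothetical, and in practice a library such as pysam or the samtools command line would normally be used instead.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

def parse_sam_line(line):
    """Split one alignment line into its 11 mandatory fields."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 11:
        raise ValueError("an alignment line needs at least 11 tab-separated fields")
    record = {name: (int(value) if name in INT_FIELDS else value)
              for name, value in zip(SAM_FIELDS, parts)}
    record["TAGS"] = parts[11:]                          # optional fields, kept raw
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    # An N operation in the CIGAR string marks a skipped reference region,
    # which for RNA-Seq usually means the read spans an intron.
    record["is_spliced"] = "N" in record["CIGAR"]
    return record

# Hypothetical line: a 10-nt read on the reverse strand, split by a 500-nt intron.
line = "read1\t16\tchr1\t1000\t255\t6M500N4M\t*\t0\t0\tGATTTGGGGT\tIIIIIIIIII"
rec = parse_sam_line(line)
```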
Genome browsers
■ Web based: UCSC
■ Desktop: IGV
Figure: Sashimi plot
Taste cell and tissue isolation for
RNA-Seq analysis
Figure (panels A-C): taste tissue before and after cutting. Labels: Circumvallate, Fungiform; T1R3GFP (Sweet/Umami), GustGFP (Mostly bitter), GADGFP (Type III), Lgr5GFP (Stem), Type III Sour, Type III Salt.
Mapping QC
■ Percentage of reads properly mapped or uniquely
mapped
■ Among the mapped reads, the percentage of
reads in exon, intron, and intergenic regions.
■ Splice junctions
■ 5' or 3' bias
■ Etc…
Popular software includes RSeQC and RNA-SeQC.
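The exon/intron/intergenic breakdown can be sketched as follows. This is a simplified illustration, assuming each read is represented by a single position and using a hypothetical gene model; dedicated QC tools work from BAM and annotation files and handle strand and spliced reads.

```python
from collections import Counter

def classify_position(pos, genes):
    """Return 'exon', 'intron' or 'intergenic' for one read position.

    genes: list of dicts with 'span' (start, end) and 'exons' [(start, end), ...].
    A position counts as exonic if it lies in an exon of any gene.
    """
    in_gene = False
    for gene in genes:
        start, end = gene["span"]
        if start <= pos <= end:
            in_gene = True
            if any(s <= pos <= e for s, e in gene["exons"]):
                return "exon"
    return "intron" if in_gene else "intergenic"

def region_percentages(positions, genes):
    """Percentage of read positions in each region type."""
    counts = Counter(classify_position(p, genes) for p in positions)
    total = sum(counts.values())
    return {region: 100 * counts[region] / total
            for region in ("exon", "intron", "intergenic")}

# Hypothetical gene model and read positions.
genes = [{"span": (100, 400), "exons": [(100, 200), (300, 400)]}]
percentages = region_percentages([150, 350, 250, 500], genes)
```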
Read mapping to gene features
Splice junction saturation
Taste cells express many novel
isoforms and genes
■ 42%-45% of the splice junctions in taste libraries are
either completely or partially novel.
■ But these novel splice junctions were rarely used (<5%).
■ Taste and olfactory tissue is barely represented in public
gene annotation efforts.
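One common convention for the "completely" versus "partially" novel distinction is to check each junction's two splice sites against the annotation. The sketch below assumes that convention and uses hypothetical coordinates; it is not necessarily the exact procedure used for the taste libraries.

```python
def classify_junction(donor, acceptor, known_donors, known_acceptors):
    """Label a junction by how many of its two splice sites are annotated."""
    n_known = (donor in known_donors) + (acceptor in known_acceptors)
    return {2: "known", 1: "partially novel", 0: "completely novel"}[n_known]

# Hypothetical annotation with a single intron from 200 to 300.
known_donors = {200}
known_acceptors = {300}

junctions = [(200, 300), (200, 320), (210, 330)]
labels = [classify_junction(d, a, known_donors, known_acceptors)
          for d, a in junctions]
```

The three hypothetical junctions are labeled known, partially novel and completely novel, in that order.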
Gene body coverage
Figure: normalized read mapping intensity (0 to 100) against normalized distance along the transcript, 5'->3' (%), for bulk taste libraries and single cell libraries.
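A gene body coverage curve of this kind can be approximated by rescaling each read's position to a percentage of its transcript's length and building a histogram. The sketch below is one possible way to do it, with hypothetical input; scaling the curve so that its peak equals 100 is an assumption about how the intensity axis is normalized.

```python
def gene_body_coverage(hits, n_bins=100):
    """Histogram of read positions along transcripts, peak scaled to 100.

    hits: iterable of (offset, transcript_length), where offset is the
    0-based distance of the read from the 5' end of its transcript.
    """
    bins = [0] * n_bins
    for offset, length in hits:
        index = min(offset * n_bins // length, n_bins - 1)
        bins[index] += 1
    peak = max(bins) or 1   # avoid division by zero when there are no reads
    return [100 * count / peak for count in bins]

# Hypothetical reads: three near the 3' end and one at the 5' end.
hits = [(0, 1000), (999, 1000), (950, 1000), (1900, 2000)]
coverage = gene_body_coverage(hits, n_bins=10)
```

A curve that rises toward the last bins, as in this toy input, indicates 3' bias.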
Motivation for re-annotating the
taste transcriptome
■ Not all transcripts are fully annotated, even in human
and mouse
■ Transcriptomes are annotated from ‘well studied’
tissues by RefSeq and Gencode.
■ The 3’ and 5’ UTRs of genes are poorly annotated
– This causes problems for 3’ end sequencing
– Especially problematic for scRNA-Seq
Strategies vary for model
organisms…
…And non-model organisms
Methods for transcriptome Assembly
Reference-based assembly
De novo assembly
Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682
Transcriptome assembly when
reference genome is available
https://galaxyproject.org/tutorials/rb_rnaseq/
When reference genome and
transcriptome are available
Bioinformatics (2011) 27 (17): 2325-2329.
Reference annotation based transcriptome assembly (RABT
assembly) leverages existing gene annotations for discovering
novel transcripts. Appropriate for model organisms.
Strategy for RABT assembly of
taste RNA-Seq data
■ Reference annotation based transcriptome assembly of taste
bud libraries using the Cufflinks and StringTie packages.
■ Results from the two workflows were combined.
■ Non-coding transcripts, pre-mRNA and transcripts containing
premature stop codons were removed.
■ Potential coding transcripts were functionally annotated.
■ More info: Poster # 520
Many novel genes and isoforms of
known genes were identified in the
taste buds
Transcript types                   De novo gene annotations
Identical to known                 111512*
Novel intronic                     115
Novel isoforms of known genes      50110
Novel intergenic transcripts       1649
Novel antisense transcripts        303
*Out of a total of 111706 transcripts in Gencode M7
Improved bitter taste receptor gene
annotations
Blue = de novo model, red = RefSeq model
23/35 Tas2r genes are multi-exonic.
Ten of them were verified by RT-PCR using cDNA from taste tissue.
Novel isoform of known genes:
e.g. Chromogranin A
Improved mouse OR gene annotations
913 (73.1%) OR and 246 (45.9%) VR genes had extended gene
models.
The de novo models are more sensitive at detecting OR gene
expression (B).
A : From PLoS Genet. 2014 Sep 4;10(9):e1004593
B: From Scientific Reports 5, Article number: 18178 (2015) doi:10.1038/srep18178
Thanks for your attention!
Many figures and slides in this presentation came from
publications, presentations, web pages etc.
I am grateful to the authors for making them available.