Download Introduction to RNA-Seq

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Schedule change
Galaxy server going down for maintenance on Thursday
• Day 2: AM - Introduction to RNA-Seq (and a touch of
miRNA-Seq)
• Day 2: PM - RNA-Seq practical (Tophat + Cuffdiff
pipeline on Galaxy)
• Day 3: AM – Introduction to Exome Sequencing and
Variant Discovery
• Day 3: PM - Exome sequence analysis practical (Galaxy)
Quick Recap
• NGS data production becoming commonplace
• Many applications -> research intent determines
technology platform choice
• High volume data BUT error prone
• FASTQ is accepted format standard
• Must assess quality scores before proceeding
• ‘Bad’ data can be rescued
Introduction to RNAseq
The Central Dogma of Molecular Biology
Reverse
Transcription
4
RNAseq Protocols
• cDNA, not RNA sequencing
• Types of libraries available:
– Total RNA sequencing (not advised)
– polyA+ RNA sequencing
– Small RNA sequencing (specific size range
targeted)
cDNA Synthesis
Genome-scale Applications
• Transcriptome analysis
• Identifying new transcribed regions
• Expression profiling
• Resequencing to find genetic polymorphisms:
– SNPs, micro-indels
– CNVs
– Question: Why even bother with exome sequencing
then?
Sequencing details
• Standard sequencing
–
–
–
–
polyA/total RNA
Size selection
Primers and adapters
Single- and paired-end sequencing
• Strand-specific sequencing
– still immature tech
– Sequencing only + or – strand
– Mostly paired-end
What about microarrays??!!!
• Assumes we know all
transcribed regions and
that spliceforms are not
important
• Cannot find anything
novel
• BUT may be the best
choice depending on
QUESTION
Arrays vs RNAseq (1)
• Correlation of fold change between arrays
and RNAseq is similar to correlation between
array platforms (0.73)
• Technical replicates almost identical
• Extra analysis: prediction of alternative
splicing, SNPs
• Low- and high-expressed genes do not
match
RNA-Seq promises/pitfalls
• can reveal in a single assay:
– new genes
– splice variants
– quantify genome-wide gene expression
• BUT
– Data is voluminous and complex
– Need scalable, fast and mathematically principled
analysis software and LOTS of computing resources
Experimental considerations
• Comparative conditions must make biological sense
• Biological replicates are always better than technical
ones
• Aim for at least 3 replicates per condition
• ISOLATE the target mRNA species you are after
• NOT looking for new transcripts can bias expression
estimates
Analysis strategies
• De novo assembly of transcripts:
+ re-constructs actual spliced transcripts
+ does not require genome sequence
easier to work post-transcriptional modifications
- requires huge computational resources (RAM)
- low sensitivity: hard to capture low abundance transcripts
• Alignment to the genome => Transcript assembly
+ computationally feasible
+ high sensitivity
+ easier to annotate using genomic annotations
- need to take special care of splice junctions
# 13
Basic analysis flowchart
Illumina
reads
Re-align
with different
number of
mismatches
etc
un-mapped
Remove
artifacts
AAA..., ...N...
Align
to the
genome
Clip adapters
(small RNA)
un-mapped
Pre-filter:
low complexity
synthetic
"Collapse"
identical
reads
mapped
Count
and
discard
mapped
Assemble:
contigs (exons)
+ connectivity
Filter out low
confidence
contigs
(singletons)
Annotate
# 14
Software
• Short reads aligners
•
Stampy, BWA, Novoalign, Bowtie, TOPHAT
• Data preprocessing
•
•
Fastx toolkit
samtools
• Expression studies
•
•
Cufflinks package
R packages (DESeq, edgeR, more…)
• Alternative splicing
•
•
Cufflinks
Augustus
The ‘Tuxedo’ protocol
• TOPHAT + CUFFLINKS
• TopHat aligns reads to genome and discovers
splice sites
• Cufflinks predicts transcripts present in dataset
• Cuffdiff identifies differential expression
Very widely adopted suite
‘Tuxedo’ protocol limitations
• Uses shortread data - Illumina OR SOLiD
• Requires a sequenced genome
• No GUI
• Versions implemented in GALAXY are old(ish)
Read alignment with TopHat
Splice junctions
R
RNA:
Lexon
Genome:
• In humans, terminal exons are ~1kb long, and since mRNAs
are ~2kb,
~half of the reads should originate in initial and internal
exons
• Initial and internal exons are ~200b long
=> for 75-mer reads, ~20% of reads are supposed to cross
splice junctions
Splice junctions strategies
• Create a splice junctions database joining together donors
and acceptors
• Typically, use known (annotated) splice junctions or known
splice sites
• TopHat: uses putative exons from mapped reads, database
is made of canonical splice sites around putative exons
Read alignment with TopHat (2)
• Uses BOWTIE aligner to align reads to genome
• BOWTIE cannot deal with large gaps (introns)
• Tophat segments reads that remain unaligned
• Smaller segments mostly end up aligning
Read alignment with TopHat (3)
• When there is a large gap between segments of
same read -> probable INTRON
• Tophat uses this to build an index of probable
splice sites
• Allows accurate measurement of spliceform
expression
• Possibility of detecting gene fusion events
Cufflinks package
• http://cufflinks.cbcb.umd.edu/
• Cufflinks:
– Expression values calculation
– Transcripts de novo assembly
• Cuffcompare:
– Transcripts comparison (de novo/genome
annotation)
• Cuffdiff:
– Differential expression analysis
Cufflinks: Transcript assembly
• Assembles individual transcripts based on aligned
reads
• Infers likely spliceforms of each gene
• Builds ‘transfrags’
• The smallest number of spliceforms that can be
explained by the data
• NOTE: assembly errors do occur -> sequencing depth
helps
Cufflinks: Transcript assembly (2)
• Quantifies expression level of each transfrag
• Filters out those likely to be premature
terminations, non-mature mRNAs, etc
Cuffmerge
• Merges transfrags into transcripts where
appropriate
• Also performs a reference based assembly of
transcripts using known transcripts
• Produces single annotation file which aids
downstream analysis
Cuffdiff: Differential expression
• Calculates expression level in two or more
samples
• Expression level relates to read abundance
• Because of bias sources, cuffdiff tries to model
the variance in its significance calculation
What else is important?
FPKM (RPKM): Expression Values
 Fragments Reads Per Kilobase of exon model per
Million mapped fragments
 Nat Methods. 2008, Mapping and quantifying
mammalian transcriptomes by RNA-Seq.
Mortazavi A et al.
C
FPKM = 10 ´
NL
9
C= the number of reads mapped onto the gene's exons
N= total number of reads in the experiment
L= the sum of the exons in base pairs.
Cufflinks (Expression analysis)
gene_id bundle_id
chr left right FPKM FPKM_conf_lo FPKM_conf_hi
ENSG00000236743 31390 chr1 459655 461954 0
0
0
OK
ENSG00000248149 31391 chr1 465693 688071 787.12 731.009 843.232 OK
ENSG00000236679 31391 chr1 470906 471368 0
0
0
OK
ENSG00000231709 31391 chr1 521368 523833 0
0
0
OK
ENSG00000235146 31391 chr1 523008 530148 0
0
0
OK
ENSG00000239664 31391 chr1 529832 532878 0
0
0
OK
ENSG00000230021 31391 chr1 536815 659930 2.53932 0
5.72637 OK
ENSG00000229376 31391 chr1 657464 660287 0
0
0
OK
ENSG00000223659 31391 chr1 562756 564390 0
0
0
OK
ENSG00000225972 31391 chr1 564441 564813 96.9279 77.2375 116.618 OK
ENSG00000243329 31391 chr1 564878 564950 0
0
0
OK
ENSG00000240155 31391 chr1 564951 565019 0
0
0
OK
status
Cuffdiff (differential expression)
• Pairwise or time series comparison
• Normal distribution of read counts
• Fisher’s test
test_id gene
locus
ENSG00000000003TSPAN6
ENSG00000000005TNMD
ENSG00000000419DPM1
ENSG00000000457SCYL3
sample_1
sample_2
chrX:99883666-99894988 q1
chrX:99839798-99854882 q1
chr20:49551403-49575092 q1
chr1:169631244-169863408 q1
status
q2
q2
q2
q2
value_1 value_2
NOTEST 0
NOTEST 0
NOTEST 15.0775
OK
32.5626
ln(fold_change) test_stat
p_value significant
0
0
0
1
no
0
0
0
1
no
23.8627 0.459116 -1.39556 0.162848 no
16.5208 -0.678541
15.8186 0
yes
Visualization: Genome Viewers
• Web based:
– UCSC Genome Browser (http://genome.ucsc.edu/)
• Standalone
– Integrated Genome Viewer
(http://www.broadinstitute.org/software/igv/)
RNAseq hands-on practical
(Galaxy)
• Data QC and trimming
• Aligning reads to reference genome
• Running CUFFLINKS and looking at some
transcripts using the UCSC genome browser
• Finding differentially expressed genes with
CUFFDIFF