Gene expression from RNA-Seq
Once sequenced the problem becomes computational
[Figure: pipeline from cells (cDNA, ChIP) through the sequencer to sequenced reads, which are aligned to the genome to give read coverage.]
Considerations and assumptions
• High library complexity
  • #molecules in library >> #sequenced molecules
• Short reads
  • Read length << sequenced molecule length
Not all applications satisfy this:
• miRNA sequencing
• Small input sequencing (e.g. single cell sequencing)
Corollaries
• Libraries satisfying assumptions 1 & 2 only measure relative abundance
• Key quantity: # fragments sequenced for each transcript. This requires answering:
  • Which transcript generated the observed read?
Isn’t this easy?
• Reads do not uniquely map
• Transcripts or genes have different isoforms
• Sequencing has a ~1% error rate
• Transcripts are not uniformly sequenced
The RNA-Seq quantification problem (simple case)
• Start with a set of previous gene/transcript annotations
• Assume only one isoform per gene
• Assume 1-1 read to transcript correspondence.
• The number of reads from each transcript is binomial in the total number of sequenced reads (the sequencing depth).
• Using the Poisson approximation to the binomial, we seek to maximize the likelihood of transcript frequencies given the data, which has a closed-form maximum likelihood estimate (MLE).
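A minimal sketch of the simple case, written out under the three assumptions listed above. The symbols N, x_i and p_i are assumptions introduced here for illustration and may not match the notation used on the original slide.

```latex
% N   : sequencing depth (total number of mapped reads)       [assumed symbol]
% x_i : number of reads mapped to transcript i                [assumed symbol]
% p_i : fraction of reads expected from transcript i,
%       with \sum_i p_i = 1                                   [assumed symbol]

x_i \sim \mathrm{Binomial}(N, p_i) \approx \mathrm{Poisson}(N p_i)

L(p \mid x) = \prod_i \frac{(N p_i)^{x_i} \, e^{-N p_i}}{x_i!}

\log L(p \mid x) = \sum_i \left[ x_i \log(N p_i) - N p_i - \log x_i! \right]

\frac{\partial \log L}{\partial p_i} = \frac{x_i}{p_i} - N = 0
\quad\Longrightarrow\quad
\hat{p}_i = \frac{x_i}{N}
```

Under this model the estimate is simply the observed fraction of reads per transcript, and the estimates sum to one because the counts sum to N.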
The process of RNA-Seq quantification
• Sequenced reads are aligned to a reference sequence:
  • the species genome, or
  • its transcriptome
• Transcript abundance is measured:
  • By counting reads mapped to each transcript (not accurate when multiple isoforms share sequence)
  • By maximizing the likelihood of the observed mapping given transcript abundance
• To compare samples, counts need to be normalized:
  • Libraries have different sequencing depth
  • Sample composition may be different
  • Most standard normalization: counts → Transcripts per Million (TPM) units
The gene expression table
Genes are quantified. Each gene or isoform has:
• A TPM value
• An (expected) fragment count value
All samples were quantified in the same fashion and arranged into a table of genes (22,000) x samples (24).
• Row i gives the expression of gene i across all samples
• Column j gives the expression of all genes in sample j.
gene      LD1,2.rep1  LD1,2.rep2  LD1,2.rep3  LD1.rep1  LD1.rep2  LD1.rep3  LD2.rep1  LD2.rep2  LD2.rep3
Mir301    0           0           0           0         0         0         0         0         0
Cpne2     157         158.98      88.04       69        111.99    114.33    93        208       140
Capn5     36          65          46          46        69        42        33        58        59.01
Lage3     313.06      241.23      276.23      218.9     285.19    359.65    269.7     359.04    417.47
Brd7      379         358.58      390         336       357.26    368.08    264       564.07    476
Dimt1     77          68          58          54        62        60        54        76        97.03
AK017068  0           0           0           0         0         0         0         0         0
But, how are these quantities computed?
[Figure 3: An overview of gene expression quantification with RNA-seq. (a) Illustration of transcripts of different lengths with different read coverage levels (left), as well as total read counts observed for each transcript (middle) and FPKM-normalized read counts (right). (b) Reads from alternatively spliced genes may be attributable to a single isoform or more than one isoform. Reads are color-coded when their isoform of origin is clear. Black reads indicate reads with uncertain origin. ‘Isoform expression methods’ estimate isoform abundances that best explain the observed read counts under a generative model. Samples near the original maximum likelihood estimate (dashed line) improve the robustness of the estimate and provide a confidence interval around each isoform’s abundance. (c) For a gene with two expressed isoforms, exons are colored according to the isoform of origin. Two simplified gene models used for quantification purposes, spliced transcripts from each model and their associated lengths, are shown to the right. The ‘exon union model’ (top) uses exons from all ... Panel labels: Isoform 1, Isoform 2, exon union method, exon intersection method, confidence interval. Axis: true FPKM.]
• Start with a set of previous gene/transcript annotations
• Assume only one isoform per gene
• Assume 1-1 read to transcript correspondence.
• Reads (fragments) are now short; one transcript generates many fragments.
Change: Transcripts of different lengths generate different numbers of fragments.
• Transcript effective length
• Model and MLE: [equations not recoverable from the source]
The RNA-Seq quantification problem. Isoform deconvolution
Main difference: quantification involves read assignment. Our model must capture read assignment uncertainty.
• Parameters: Transcript relative abundance
• Latent variables: Fragment alignment source
• Observed variables: N fragment alignments, transcripts, fragment length distribution
We can estimate the insert size distribution
[Figure: paired reads P1 and P2 with insert distances d1 and d2. Steps shown: get all single isoform reconstructions; splice and compute insert distance; estimate insert size empirical distribution. Plot: empirical density (0.000 to 0.004) against insert size (0 to 700).]
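… and use it for probabilistic read assignment
[Figure: a read pair placed on Isoform 1, Isoform 2 and Isoform 3, with candidate insert distances d1 and d2 and the tail probability P(d > di) marked on the insert size distribution. Plot: density (0.000 to 0.004) against insert size (0 to 700).]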
For methods such as MISO, Cufflinks and RSEM, it is critical to have paired-end data
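The two steps above (estimate an empirical insert size distribution, then use it to weigh candidate isoforms) can be sketched as follows. This is an illustrative toy and not how MISO, Cufflinks or RSEM implement it: the function names, the exact-size lookup, the pseudo-probability and the example numbers are all assumptions made here.

```python
from collections import Counter


def empirical_insert_distribution(insert_sizes):
    """Empirical probability of each observed insert size."""
    counts = Counter(insert_sizes)
    total = len(insert_sizes)
    return {size: n / total for size, n in counts.items()}


def tail_probability(dist, d):
    """P(D > d) under the empirical distribution."""
    return sum(p for size, p in dist.items() if size > d)


def assignment_weights(dist, implied_sizes, pseudo=1e-6):
    """Weight each candidate isoform by the probability of the insert size it implies.

    implied_sizes maps an isoform name to the insert size the read pair
    would have if it came from that isoform. `pseudo` is an assumed small
    constant that keeps the weights defined when no size was ever observed.
    """
    raw = {iso: dist.get(d, 0.0) + pseudo for iso, d in implied_sizes.items()}
    z = sum(raw.values())
    return {iso: w / z for iso, w in raw.items()}


# Hypothetical insert sizes measured on genes with a single isoform.
training = [180, 190, 200, 200, 200, 210, 210, 220]
dist = empirical_insert_distribution(training)

# The same read pair implies a 200 nt insert on isoform_1
# but a 500 nt insert on isoform_2 (an exon is spliced in between the mates).
weights = assignment_weights(dist, {"isoform_1": 200, "isoform_2": 500})
print(weights)
print(tail_probability(dist, 200))
```

With single-end reads there is no insert distance to compare, which is one way to see why paired-end data matters for these methods. A real implementation would smooth or bin the distribution instead of looking up exact sizes.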
The RNA-Seq quantification problem. Isoform deconvolution
[Figure: a read pair with insert distances d1 and d2, drawn from a mixture of isoforms (Isoform 1, Isoform 2).]
• Parameters: Transcript relative abundance
• Latent variables: Fragment alignment source
• Observed variables: N fragment alignments, transcripts, fragment length distribution
• Probability of the fragment alignment originating from t [equation not recoverable from the source]
It can be shown that the log-likelihood is concave, and hence it can be maximized by expectation maximization.
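A minimal sketch of expectation maximization for this kind of model, assuming each fragment is described only by the set of isoforms it is compatible with and that a fragment is equally likely to start anywhere along an isoform's effective length. The function name, its arguments and the toy data are assumptions for illustration; production tools also model fragment length, sequence bias and alignment quality.

```python
def em_isoform_abundance(compat, eff_len, n_iter=200):
    """Estimate relative transcript abundance by expectation maximization.

    compat  : list with one entry per fragment; each entry lists the indices
              of the isoforms that fragment aligns to (the latent source is
              one of them).
    eff_len : effective length of each isoform.
    Returns relative transcript abundances that sum to 1.
    """
    k = len(eff_len)
    n = len(compat)
    alpha = [1.0 / k] * k  # fraction of fragments generated by each isoform

    for _ in range(n_iter):
        expected = [0.0] * k
        # E-step: split each fragment across its compatible isoforms in
        # proportion to alpha_t / eff_len_t.
        for isoforms in compat:
            w = [alpha[t] / eff_len[t] for t in isoforms]
            z = sum(w)
            for t, wt in zip(isoforms, w):
                expected[t] += wt / z
        # M-step: new fragment fractions are the expected fragment counts.
        alpha = [c / n for c in expected]

    # Convert fragment fractions to transcript fractions (divide by length).
    rho = [alpha[t] / eff_len[t] for t in range(k)]
    total = sum(rho)
    return [r / total for r in rho]


# Toy example: 30 fragments unique to isoform 0, 10 unique to isoform 1,
# and 60 fragments in shared exons that are compatible with both.
fragments = [[0]] * 30 + [[1]] * 10 + [[0, 1]] * 60
abundance = em_isoform_abundance(fragments, eff_len=[1000.0, 1000.0])
print(abundance)
```

In the toy example the shared fragments end up split in the same 3:1 ratio as the uniquely assigned ones, which is the behaviour simple read counting cannot provide.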
Summary: Current quantification models are complex
• In its simplest form we assume that reads can be unequivocally mapped. Under this assumption:
  • Read counts follow a multinomial distribution, with rates estimated from the observed counts
• When this assumption breaks, the multinomial is no longer appropriate.
• More general models use:
  • Base quality scores
  • Sequence mappability
  • Protocol biases (e.g. 3’ bias)
  • Sequence biases (e.g. GC)
Handling each of these involves a more complex model where reads are assigned probabilistically not only to an isoform but also to different loci.
RNA-Seq libraries revisited: End-sequence libraries
• Target the start or end of transcripts.
• Source: End-enriched RNA
  • Fragmented then selected
  • Fragmented then enzymatically purified
• Uses:
  • Annotation of transcriptional start sites
  • Annotation of 3’ UTRs
  • Quantification and gene expression
    • Depth required: 3-8 million reads
    • Low quality RNA samples
    • Single cell RNA sequencing
RNA-Seq libraries: Summary
End-sequencing solution
Analysis of counting data requires 3 broad tasks
• Read mapping (alignment): Placing short reads in the genome
• Quantification:
  • Transcript relative abundance estimation
  • Determining whether a gene is expressed
  • Normalization
  • Finding genes/transcripts that are differentially represented between two or more samples.
• Reconstruction: Finding the regions that originated the reads
What are we normalizing?
A typical replicate scatter plot
TPM normalization
• Accounts for:
  • Differences in sequencing depth
  • Differences in the number of reads generated by transcripts of different length
• Formula terms:
  • Estimated reads/fragments for the gene
  • Total reads/fragments
  • Length of the transcript
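A minimal sketch of the standard TPM calculation, assuming one estimated fragment count and one (effective) length per transcript. The function name and the example numbers are hypothetical. In this form the total number of reads cancels out of the ratio, which is how TPM removes the dependence on sequencing depth.

```python
def tpm(counts, lengths):
    """Transcripts per million from fragment counts and transcript lengths.

    counts  : estimated reads/fragments for each transcript
    lengths : (effective) length of each transcript, same order as counts
    """
    # Reads per base: corrects for longer transcripts yielding more fragments.
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    # Rescale so the values of one sample sum to one million.
    return [r / total * 1e6 for r in rates]


# Hypothetical example: counts proportional to length give equal TPM.
values = tpm(counts=[100, 200, 300], lengths=[1000, 2000, 3000])
print(values)
```

Because every sample sums to one million, TPM values are comparable across libraries of different depth, but they remain relative abundances within each sample.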
Sample composition impacts transcript relative abundance
[Figure: Cell type I compared with Cell type II]
Normalizing by total reads does not work well for samples with very different RNA composition
Example normalization techniques
$$ s_j = \operatorname{median}_i \frac{k_{ij}}{\left( \prod_{v=1}^{m} k_{iv} \right)^{1/m}} $$
• Numerator: counts for gene i in experiment j
• Denominator: geometric mean for that gene over ALL experiments
• i runs through all n genes, j through all m samples
• k_ij is the observed count for gene i in sample j
• s_j is the normalization constant
Anders and Huber, 2010
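The quantities labelled above (each gene's count divided by that gene's geometric mean over all samples, with the median taken over genes) can be sketched as below. The function name and the toy counts are assumptions, and skipping genes that contain a zero count is a simplification chosen here so that the geometric mean stays positive.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios normalization constants, one per sample.

    counts[i][j] is the observed count for gene i in sample j.
    """
    n_samples = len(counts[0])
    usable = []          # genes with no zero counts
    log_geo_means = []   # log geometric mean of each usable gene
    for row in counts:
        if all(k > 0 for k in row):
            usable.append(row)
            log_geo_means.append(sum(math.log(k) for k in row) / n_samples)

    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical counts: sample 2 is sequenced twice as deeply as sample 1,
# and the last gene is also strongly up-regulated in sample 2.
toy = [
    [10, 20],
    [100, 200],
    [50, 100],
    [10, 2000],
]
s = size_factors(toy)
print(s)
```

Because the median ignores the one strongly changed gene, the ratio of the two factors reflects only the difference in depth. Dividing each column by its factor then puts the samples on a common scale.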
Let's do an experiment
• Similar read number, one transcript many-fold changed
• Size normalization results in apparent 2-fold changes in all transcripts
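A toy version of this experiment, with made-up numbers: two samples sequenced to the same depth, where only one transcript truly changes. The sample sizes and the 11-fold change are assumptions picked so that the arithmetic is easy to follow.

```python
def counts_per_million(counts):
    """Scale counts by library size (total reads)."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]


# Hypothetical sample A: ten transcripts, 100 reads each (1,000 reads total).
sample_a = [100] * 10

# Hypothetical sample B: transcript 0 is 11x more abundant, the other nine
# are unchanged, and the library is again sequenced to 1,000 reads.
molecules_b = [1100] + [100] * 9
sample_b = [m * 1000 // sum(molecules_b) for m in molecules_b]

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)
fold_change = [b / a for a, b in zip(cpm_a, cpm_b)]
print(fold_change)
```

The nine unchanged transcripts appear to drop 2-fold, because the one abundant transcript takes half of the reads in sample B. Total-read normalization cannot separate that composition effect from real changes.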
When everything changes: Spike-ins
Lovén et al, Cell 2012