Lecture: RNA-seq introduction

RNA-seq Introduction
Promises and pitfalls

RNA gives information on which genes are expressed
How DNA gets transcribed to RNA (and sometimes then translated to proteins) varies between, e.g.:
- Tissues
- Cell types
- Cell states
- Individuals
- Cells

RNA flavors (pre-sequencing era)
•  Housekeeping RNAs
   –  rRNAs, tRNAs, snoRNAs, snRNAs, SRP RNAs, catalytic RNAs (RNase E)
•  Protein-coding RNAs
   –  (1 coding gene ~ 1 mRNA)
•  Regulatory RNAs
   –  Few rare examples

ENCODE: ENCyclopedia Of DNA Elements
ENCODE, the Encyclopedia of DNA Elements, is a project funded by the National Human Genome Research Institute to identify all regions of transcription, transcription factor association, chromatin structure and histone modification in the human genome sequence.

Different kinds of RNAs have different expression values
Landscape of transcription in human cells, S Djebali et al. Nature 2012

What defines RNA depends on how you look at it
[Figure: coverage, variants and abundance for housekeeping RNAs, mRNAs, regulatory RNAs, novel intergenic, and none]
Adapted from Landscape of transcription in human cells, S Djebali et al. Nature 2012

Defining functional DNA elements in the human genome
•  Statement
   –  A priori, we should not expect the transcriptome to consist exclusively of functional RNAs.
•  Why is that
   –  Zero tolerance for errant transcripts would come at high cost in the proofreading machinery needed to perfectly gate RNA polymerase and splicing activities, or to instantly eliminate spurious transcripts.
   –  In general, sequences encoding RNAs transcribed by noisy transcriptional machinery are expected to be less constrained, which is consistent with data shown here for very low abundance RNA.
•  Consequence
   –  Thus, one should have high confidence that the subset of the genome with large signals for RNA or chromatin signatures coupled with strong conservation is functional and will be supported by appropriate genetic tests.
   –  In contrast, the larger proportion of genome with reproducible but low biochemical signal strength and less evolutionary conservation is challenging to parse between specific functions and biological noise.

This is of course not without debate
[Figure: title pages of three PLoS Biology Perspective articles]
–  Harm van Bakel, Corey Nislow, Benjamin J. Blencowe, Timothy R. Hughes: Most "Dark Matter" Transcripts Are Associated With Known Genes. Citation: van Bakel H, Nislow C, Blencowe BJ, Hughes TR (2010) Most "Dark Matter" Transcripts Are Associated With Known Genes. PLoS Biol 8(5): e1000371. doi:10.1371/journal.pbio.1000371
–  Michael B. Clark et al.: The Reality of Pervasive Transcription
–  Harm van Bakel, Corey Nislow, Benjamin J. Blencowe, Timothy R. Hughes: Response to "The Reality of Pervasive Transcription"
Biochemical evidence not enough to identify functional RNAs
Defining functional DNA elements in the human genome
Kellis M et al. PNAS 2014;111:6131-6138

One gene, many different mRNAs
RNA-seq: alternative splicing
The RNA-seq course
•  From RNA-seq to reads
•  Mapping reads programs
•  Transcriptome reconstruction using reference
•  Transcriptome reconstruction without reference
•  QC analysis
•  sRNA analysis
•  Differential expression analysis
   –  mRNAs
   –  miRNAs
•  Genome annotation using RNA and other sources
•  Differential expression using multivariate analysis
•  RNA long read analysis

From RNA to short reads
Promises and pitfalls

Long reads
•  Low throughput (-)
•  Complete transcripts (+)
•  Only highly expressed genes (--)
•  Expensive (-)
•  Low background noise (+)
•  Easy downstream analysis (+)

Micro Arrays
•  High throughput (+)
•  Only known sequences (-)
•  Limited dynamic range (-)
•  Cheap (+)
•  High background noise (-)
•  Not strand specific (-)
•  Well established downstream methods (+)

RNAseq
•  High throughput (+)
•  Fractions of transcripts (-)
•  Full dynamic range (+-)
•  Unlimited dynamic range (+)
•  Cheap (+)
•  Low background noise (+)
•  Strand specificity (+)
•  Re-sequencing (+)

[Figure: signal (1 to 10000) versus number of transcripts per cell (1 to 1000000) for EST, MicroArray and RNAseq]

How are RNA-seq data generated?
Sampling process
RNA-seq reads correspond directly to the abundance of RNAs in the sample

RNA to reads
RNA -> enrichments -> library -> reads
•  Enrichments
   –  PolyA (mRNA, AAAAAAAA)
   –  RiboMinus (- rRNA)
   –  Size <50 nt (miRNA)
   –  ...
•  Library
   –  Size of fragment
   –  Strand specific
   –  5' end specific
   –  3' end specific
   –  ...
•  Reads
   –  Single end (1 read per fragment)
   –  Paired end (2 reads per fragment)

Transcriptome assembly using reference
Transcriptome assembly without reference
Mapping long reads to reference

Quality control - samples might not be what you think they are
•  Experiments go wrong
   –  30 samples with 5 steps from samples to reads give 150 potential steps for errors
   –  An error rate of 1/100 per step over 5 steps suggests that for one of every 20 samples the reads do not represent the sample
•  Mixing samples
   –  30 samples with 5 steps from samples to reads give ~24M potential mix-ups of samples
   –  An error rate of 1/100 per step over 5 steps suggests that one of every 20 samples is mislabeled
•  Combine the two and approximately one of every 10 samples is wrong

RNA QC
Read quality
Mapping statistics
Transcript quality
Compare between samples

Differential expression analysis using univariate analysis
Typically univariate analysis (one gene at a time), even though we know that genes are not independent

Gene set analysis and data integration
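The numbers on the quality-control slide above can be checked with a few lines of arithmetic. The sketch below is not from the lecture. It assumes that each of the 5 steps fails independently with probability 1/100, that processing errors and mislabeling are independent of each other, and that the "~24M" figure corresponds to 30 to the power of 5. The function and variable names are illustrative only.

```python
def failure_probability(per_step_error: float, steps: int) -> float:
    """Probability that at least one of `steps` independent steps goes wrong."""
    return 1.0 - (1.0 - per_step_error) ** steps


samples = 30
steps = 5
per_step_error = 1 / 100  # assumed to apply to every step, independently

# Number of places where a processing error can occur: 30 * 5 = 150.
potential_error_points = samples * steps

# Assumed reading of "~24M potential mix-ups": 30 ** 5 = 24,300,000.
potential_mixups = samples ** steps

# Chance that the reads of a given sample do not represent that sample.
p_processing = failure_probability(per_step_error, steps)

# Chance that a given sample is mislabeled (same rate and step count).
p_mislabel = failure_probability(per_step_error, steps)

# Chance that a sample is affected by at least one of the two problems.
p_either = 1.0 - (1.0 - p_processing) * (1.0 - p_mislabel)

print(f"potential error points: {potential_error_points}")
print(f"potential mix-ups:      {potential_mixups:,}")
print(f"P(processing error):    {p_processing:.3f}  (about 1 in {1 / p_processing:.0f})")
print(f"P(mislabeled):          {p_mislabel:.3f}  (about 1 in {1 / p_mislabel:.0f})")
print(f"P(either problem):      {p_either:.3f}  (about 1 in {1 / p_either:.0f})")
```

Under these assumptions the per-sample failure probability is roughly 0.049 for each problem and roughly 0.096 for the two combined, which agrees with the slide's "one of every 20" and "one of every 10" estimates.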
[Figure labels: "cleaved by RNase Y"; "of total virulence factors are affected by RNase Y deletion (88% downregulated)"; "SCP hidden by gene downregulation?"]
microRNA analysis (Jakub) (Berezikov et al., Genome Research, 2011)
Single cell RNA-seq analysis (Sandberg, Nature Methods 2014)
Long reads
Short reads