Long and short/small RNA-seq data analysis
GEF5, 4.9.2015
Sami Heikkinen, PhD, Dos.
UEF // University of Eastern Finland
Topics
1.  RNA-seq in a nutshell
2.  Long vs short/small RNA-seq
3.  Bioinformatic analysis work flows
GEF5 / Heikkinen S / 4.9.2015
RNA-seq in a nutshell
RNA-seq project work flow
Bench
•  Planning
-  Define problem
-  Consult bioinformatician (et al)!
-  Get ethical permits
-  Select/get samples (groups, N)
-  Define sequencing strategy
•  Execution
-  Extract RNA
-  Generate sequencing libraries
-  Sequence
•  Bioinformatics
-  Perform QC
-  Preprocess, align, analyze / test
-  Summarize, visualize
-  Interpret!
Bed side
-  Individualized treatment?
-  Genetic risk?
-  etc!
Next Generation Sequencing (NGS)
•  =deep sequencing (Fin: syväsekvensointi)
•  “counting” applications
–  the count of reads aligning to a genomic location matters (the most)
–  e.g. ChIP-seq, RNA-seq, many others
•  “qualitative” applications
–  the sequence itself matters
–  e.g. whole genome / targeted / exome sequencing
Figures: Nature Reviews Genetics, Metzker 2010; Annu. Rev. Anal. Chem., Mardis 2013
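To make the idea of a “counting” application concrete, here is a minimal Python sketch that counts how many reads start inside each gene interval. The gene names, coordinates and read positions are made up for illustration, and real tools also consider strand, read length and spliced alignments.

```python
from collections import Counter

# Hypothetical gene intervals: name -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

def count_reads(alignments, genes=GENES):
    """Count reads whose alignment start falls inside each gene interval."""
    counts = Counter({name: 0 for name in genes})
    for chrom, pos in alignments:
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
    return counts

# Made-up alignments: (chromosome, leftmost aligned position).
reads = [("chr1", 120), ("chr1", 450), ("chr1", 900), ("chr2", 150), ("chr1", 600)]
print(dict(count_reads(reads)))
```

Only the counts per location are kept; the read sequences themselves are not needed once the reads have been aligned.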
Sample barcoding
•  allows for multiplexing
-  takes advantage of modern high-capacity sequencers: ~200 million reads per run on “old” Illumina HiSeq, recent versions up to 20x that!
-  typically up to 48 or even 96 barcodes
[Figure: barcoded sequencing libraries (linkers & adapters, barcode e.g. ATCACG, DNA fragment, linkers & adapters) from Sample 1 and Sample 2 are sequenced together; all reads are then de-barcoded into Sample 1 and Sample 2]
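The de-barcoding step can be sketched in a few lines of Python. This is only an illustration: it assumes the barcode sits at the start of each read (on Illumina instruments it is usually read separately as an index read), it requires an exact match, and the second barcode and the sample sheet are hypothetical.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample sheet; ATCACG is the example barcode from the slide.
BARCODES = {"ATCACG": "Sample 1", "CGATGT": "Sample 2"}

def demultiplex(reads, barcodes=BARCODES):
    """Split reads by their leading barcode and strip the barcode off."""
    bc_len = len(next(iter(barcodes)))  # assumes all barcodes have equal length
    by_sample = defaultdict(list)
    for seq in reads:
        sample = barcodes.get(seq[:bc_len], "undetermined")
        by_sample[sample].append(seq[bc_len:])
    return dict(by_sample)

# Made-up reads: barcode followed by a short insert.
pooled = ["ATCACGAAAA", "CGATGTCCCC", "TTTTTTGGGG"]
print(demultiplex(pooled))
```

Reads whose barcode is not on the sample sheet end up in an "undetermined" bin rather than being assigned to a sample.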
Read length
•  most counting applications: 50 bp
–  genome sequencing: 100 – 600 bp
•  long RNA-seq
–  only measure gene expression levels (etc)? 50 bp OK
–  interested in alternative splicing? Need 100+ bp!
•  short/small RNA-seq
–  e.g. mature miRNAs ~22 nt → ~40 bp
[Figure: 100 bp sequence reads from fragments of a target mRNA; 50 bp vs 100 bp reads on a sequenced fragment of a target mRNA with Exon 1, Exon 2 and Exon 3]
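One way to see why longer reads help with alternative splicing is to count how many read placements can cross a given exon-exon junction. The Python sketch below does this under simplifying assumptions: both exons are longer than the read, and an aligner needs at least `min_anchor` bases on each side of the junction (the value 8 is a hypothetical choice, not taken from the slides).

```python
def junction_spanning_starts(read_len, min_anchor=8):
    """Number of read start positions that cross one exon-exon junction
    with at least `min_anchor` bases on each side of it."""
    return max(0, read_len - 2 * min_anchor + 1)

for read_len in (50, 100):
    print(read_len, "bp:", junction_spanning_starts(read_len), "junction-spanning placements")
```

Under these assumptions a 100 bp read has more than twice as many junction-spanning placements as a 50 bp read, so a larger share of the reads carries direct evidence of which exons are joined.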
Single or paired-end?
•  single end → most counting applications, including typical long RNA-seq
•  paired-end → helps in alignment
–  alternative splicing in RNA-seq
–  genome sequencing
–  higher cost, longer sequencer run times
[Figure: single end OR paired-end reads from a sequenced fragment aligned to the genome (? -> !); 50 bp AND 100 bp sequence read pairs from a sequenced fragment of a target mRNA with Exon 1, Alternative exon 2 and Exon 3]
Sequencing “depth”
•  in RNA-seq, more depth = more reliability (for lower expressed genes)
Random result?!
                Low expressed gene      Higher expressed gene
                Sample 1   Sample 2     Sample 1   Sample 2
N reads             6          4           60         40
4 * N reads        24         16          240        160
•  long RNA-seq on mammalian-size transcriptome
-  gene expression: need 10-40 million single end reads per sample → multiplex ~6-12x
-  gene expression + alternatively spliced mRNA isoforms: need ≥100 million paired-end reads per sample → no/low multiplexing
•  short/small RNA-seq
-  need 2-3 million single end, ≤40 bp reads per sample → use lower capacity sequencer, and multiplex e.g. 12 x
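The read counts in the table can be checked with a small calculation. The Python sketch below asks how surprising each split between Sample 1 and Sample 2 would be if the gene were in fact equally expressed in both, using an exact two-sided binomial test. It assumes equal library sizes and considers sampling noise only; biological variation between individuals is ignored, so this is an illustration rather than a differential expression test.

```python
from math import comb

def two_sided_binomial_p(k, n):
    """Exact two-sided p-value for k of n reads coming from sample 1,
    under the null hypothesis that both samples are equally likely (p = 0.5)."""
    probs = [comb(n, i) * 0.5 ** n for i in range(n + 1)]
    observed = probs[k]
    # Sum all outcomes that are at most as likely as the observed one.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))

# Read counts from the table: (sample 1, sample 2).
for s1, s2 in [(6, 4), (24, 16), (60, 40), (240, 160)]:
    print(s1, "vs", s2, "p =", round(two_sided_binomial_p(s1, s1 + s2), 5))
```

The ratio between the samples is the same in all four cases, but only the higher counts make it unlikely that the difference is a random result.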
Replicates
•  RNA-seq, too, suffers from the inherent variation in e.g. gene expression levels between individuals
-  need samples from many individuals per group
-  probably at the very least tens
-  the smaller the expected difference, the bigger the N must be
-  power calculations…?
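As a rough illustration of a power calculation, the Python sketch below uses the textbook normal approximation for comparing two group means. It is not an RNA-seq specific method (dedicated tools model counts with e.g. negative binomial distributions), and the effect size and standard deviation in the example are hypothetical values on a log2 expression scale.

```python
from math import ceil
from statistics import NormalDist

def samples_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate N per group needed to detect a mean difference `delta`
    when the between-individual standard deviation is `sd`
    (two-sided test, normal approximation)."""
    z = NormalDist().inv_cdf
    n = 2 * ((z(1 - alpha / 2) + z(power)) * sd / delta) ** 2
    return ceil(n)

# Hypothetical numbers: log2 fold changes of 1.0 and 0.5, SD of 1.0.
for delta in (1.0, 0.5):
    print("difference", delta, "-> N per group:", samples_per_group(delta, sd=1.0))
```

Halving the expected difference roughly quadruples the required N, which is the quantitative form of "the smaller the expected difference, the bigger the N must be".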
Long vs short RNA-seq
Long RNA-seq
•  Target
-  any RNA present in the extracted RNA sample
-  messenger RNAs (mRNAs)
-  long non-coding RNAs (lncRNAs), processed pseudogenes etc
-  typical min length: ~200 bp
-  all expressed isoforms included
Short/small RNA-seq
•  Target
-  Small non-coding RNAs (sncRNAs)
-  microRNAs (miRNAs)
-  PIWI-interacting RNAs (piRNAs)
-  small nucleolar RNAs (snoRNAs)
-  utilizes chemical properties at the ends of small RNAs
[Figure: e.g. Genome, mRNA (5’ to 3’), Protein]
Long RNA-seq
•  Starting material
-  total RNA
-  mRNA-only sample
•  Complexity
-  e.g. 19797 known protein coding genes (through 79795 transcripts)
-  variable tissue specificity
-  variable lengths
→ high complexity
Short/small RNA-seq
•  Starting material
-  total RNA THAT MUST also include the <50 bp small RNA species → RNA extraction method matters!
-  small RNA-only sample
•  Complexity
-  e.g. 2588 known mature miRNAs
-  very short
-  higher tissue specificity
→ low complexity
RNA-seq analysis work flows
Long RNA-seq data analysis work-flow
•  Download
-  Raw data on server (vs Public data) → Raw data locally
•  Initial QC
-  ‘fastqc’ → QC results
•  Preprocessing
-  Trim: Trim adapter and Q-filter (‘Trimmomatic’), ‘fastqc’
-  Trim: Trim for 3’-An (‘homertools trim’), ‘fastqc’
-  Decontaminate: Align to rRNA+chrM+etc (‘bowtie2’ or ‘tophat2’) with a bowtie2 index → Aligned / Unaligned, ‘fastqc’
•  Alignment
-  Align, sort, index, and visualize: Align and index (‘tophat2’ & ‘samtools’) with a Transcriptome + genome bowtie2 index, ‘fastqc’
-  ‘igvtools’ → .tdf
•  Quantitate
-  transcriptome (re-)annotation? (‘cufflinks’, ‘cuffcompare’)
-  Quantitate (‘cuffquant’)
-  Export (‘cuffnorm’) → Norm’d GEx & counts
•  Analyze
-  Test (‘cuffdiff’) → DEG results
-  Gene expressions → Pathway analysis, Associations to clinical data, etc
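To show what the 3’-An trimming step does in principle, here is a minimal Python sketch that removes a trailing run of A bases from a read. It is not the ‘homertools trim’ implementation; the function name and the `min_run` threshold are hypothetical, and a real trimmer would also shorten the matching quality string and usually tolerate a few mismatches.

```python
def trim_3prime_polya(seq, min_run=5):
    """Remove a trailing run of A's if it is at least `min_run` bases long."""
    stripped = seq.rstrip("A")
    if len(seq) - len(stripped) >= min_run:
        return stripped
    return seq  # short A runs are kept, they may be genuine transcript sequence

# Made-up reads: one with a poly-A tail, one ending in just two A's.
print(trim_3prime_polya("ACGTACGTAAAAAAAA"))
print(trim_3prime_polya("ACGTAA"))
```

Removing such tails before alignment matters because the poly-A stretch is added after transcription and has no counterpart in the genome.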
RNA-seq pipeline architecture
•  Master Unix shell script
-  pipeline_settings.txt
-  Input (folder): filename(s).suffix
•  run_fastqc.sh → fastqc
-  Output (folders), log.txt
•  run_homertools.sh → homertools trim
-  Output (folder): filename(s).suffix, log.txt
•  run_trimmomatic.sh → trimmomatic
-  Output (folder): filename(s).suffix, log.txt
•  etc.sh → etc
-  Output (folder): filename(s).suffix, log.txt
•  reporting.sh(s) → summary.txt
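The same architecture (one master script, one wrapper per tool, one output folder and log.txt per step) can be sketched in Python instead of a Unix shell script. This is an illustration under assumptions, not the pipeline from the slides: the function name is hypothetical, and the commands passed in are whatever wrapper scripts exist locally.

```python
import subprocess
from pathlib import Path

def run_pipeline(steps, out_root):
    """Run each (name, command) step in order.

    Every step gets its own output folder containing a log.txt;
    the pipeline stops at the first step that fails.
    """
    out_root = Path(out_root)
    for name, command in steps:
        step_dir = out_root / name
        step_dir.mkdir(parents=True, exist_ok=True)
        with open(step_dir / "log.txt", "w") as log:
            result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
        if result.returncode != 0:
            raise RuntimeError(f"step {name!r} failed, see {step_dir / 'log.txt'}")
    return [name for name, _ in steps]

# Hypothetical usage, assuming the wrapper scripts named on the slide exist:
# run_pipeline([("fastqc", ["sh", "run_fastqc.sh"]),
#               ("trimmomatic", ["sh", "run_trimmomatic.sh"])], "output")
```

Stopping at the first failed step keeps later steps from running on incomplete input, and the per-step log makes it clear where to look.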
Some data formats and types
•  Raw sequence data (.fastq.gz)
•  FastQC quality control (.html)
•  Aligned reads (.sam, .bam, indexed and sorted .bam…)
•  Visualization in genome browser (.tdf, .bigBed, .bigWig…)
•  Gene expression test results (from ‘cuffdiff’, tab.delim.txt)
•  etc…
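As an example of what the raw sequence data looks like to a program, the Python sketch below reads a gzipped FASTQ file record by record. It assumes the common four-line record layout and Phred+33 quality encoding; the function name is hypothetical.

```python
import gzip

def read_fastq(path):
    """Yield (read_id, sequence, qualities) from a gzipped FASTQ file.

    Assumes four lines per record: @id, sequence, '+', quality string.
    """
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break  # end of file
            seq = handle.readline().rstrip()
            handle.readline()  # '+' separator line, ignored
            qual = handle.readline().rstrip()
            # Phred+33: quality score = ASCII code of the character minus 33
            yield header[1:], seq, [ord(c) - 33 for c in qual]
```

Each base has its own quality score, which is what quality filtering and the FastQC reports are built on.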
Pilot RNA-seq sample from human blood
[Figures: Read count across processing steps; Tissue Specific Expression Analysis (TSEA) of the top 1000 expressed genes]
Small RNA-seq data analysis work-flow
•  Raw data on server
•  Initial QC
•  Preprocessing
-  NOTE: with e.g. MiSeq (Mediteknia), adapter clipping done already on sequencer
-  Decontaminate
•  Align, QC, sort, index, visualize (40 bp reads)
-  mature miRNA index (e.g. 22 bp) → Aligned → QC, Viz.
-  Unaligned → hairpin miRNA index → Aligned → QC, Viz.
-  Unaligned → sncRNA index → Aligned → QC, Viz.
-  Unaligned → piRNA index → Aligned → QC, Viz.
•  Quantitate (DESeq2 depend.)
•  Annotate (DESeq2 depend.)
•  Test (DESeq2)
-  Phenotype data: Groups, Clin chem, Histology, Disease risk, etc
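The step-wise alignment, in which reads that do not align to one index are passed on to the next, can be sketched as follows. In this Python illustration exact lookup in a set stands in for a real aligner, and all sequences and index contents are made up.

```python
def sequential_assign(reads, indexes):
    """Assign each read to the first index that contains it.

    `indexes` is an ordered list of (label, set_of_sequences);
    reads found in no index are counted as 'unaligned'.
    """
    counts = {label: 0 for label, _ in indexes}
    counts["unaligned"] = 0
    for read in reads:
        for label, sequences in indexes:
            if read in sequences:
                counts[label] += 1
                break  # do not try later indexes for this read
        else:
            counts["unaligned"] += 1
    return counts

# Toy indexes in the order used in the work-flow; sequences are made up.
toy_indexes = [
    ("mature miRNA", {"AAAA", "AAAC"}),
    ("hairpin miRNA", {"AAAA", "CCCC"}),
    ("sncRNA", {"GGGG"}),
    ("piRNA", {"TTTT"}),
]
toy_reads = ["AAAA", "AAAA", "CCCC", "GGGG", "TTTT", "ACGT"]
print(sequential_assign(toy_reads, toy_indexes))
```

Because a read is assigned at the first index where it matches, the order of the indexes decides how reads that could fit several RNA classes are counted.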
Thank you!
uef.fi