Long and short/small RNA-seq data analysis
GEF5, 4.9.2015
Sami Heikkinen, PhD, Dos.
UEF // University of Eastern Finland

Topics
1. RNA-seq in a nutshell
2. Long vs short/small RNA-seq
3. Bioinformatic analysis work flows

RNA-seq in a nutshell

RNA-seq project work flow
• Bench side
• Planning
  - Define problem
  - Consult bioinformatician (et al)!
  - Get ethical permits
  - Select/get samples (groups, N)
  - Define sequencing strategy
• Execution
  - Extract RNA
  - Generate sequencing libraries
  - Sequence
• Bioinformatics
  - Perform QC
  - Preprocess, align, analyze / test
  - Summarize, visualize
  - Interpret!
• Bed side
  - Individualized treatment? Genetic risk? etc!

Next Generation Sequencing (NGS)
• = deep sequencing (Fin: syväsekvensointi)
• "counting" applications
  - the count of reads aligning to a genomic location matters (the most)
  - e.g. ChIP-seq, RNA-seq, many others
• "qualitative" applications
  - the sequence itself matters
  - e.g. whole genome / targeted / exome sequencing
Sources: Metzker 2010, Nature Reviews Genetics; Mardis 2013, Annu. Rev. Anal. Chem.

Sample barcoding
• Barcoded sequencing libraries: linkers & adapters with a barcode (e.g. ATCACG) on both sides of the DNA fragment
• Sample 1 + Sample 2 -> sequencing -> all reads -> de-barcoded reads for Sample 1 and Sample 2
• allows for multiplexing
  - takes advantage of modern high-capacity sequencers: ~200 million reads per run on "old" Illumina HiSeq
  - recent versions up to 20x that!
  - typically up to 48 or even 96 barcodes

Read length
• most counting applications: 50 bp
  - genome sequencing: 100 - 600 bp
• long RNA-seq
  - only measure gene expression levels (etc)? 50 bp OK
  - interested in alternative splicing? Need 100+ bp!
• short/small RNA-seq
  - e.g. mature miRNAs ~22 nt -> ~40 bp
[Figure: 50 bp and 100 bp sequence reads from sequenced fragments of a target mRNA with exons 1-3]

Single or paired-end?
• single end
  - most counting applications, including typical long RNA-seq
• paired-end helps in alignment
  - alternative splicing in RNA-seq
  - genome sequencing
  - higher cost, longer sequencer run times
[Figure: single end (one end of the sequenced fragment OR the other) vs paired-end (one end AND the other); 50 bp and 100 bp sequence read pairs from a fragment of a target mRNA with exon 1, alternative exon 2 and exon 3]

Sequencing "depth"
• in RNA-seq, more depth = more reliability (for lower expressed genes)

               Low expressed gene      Higher expressed gene
               Sample 1   Sample 2     Sample 1   Sample 2
  N reads      6          4            60         40
  4 * N reads  24         16           240        160

  - 6 vs 4 reads: random result?!
• long RNA-seq on mammalian-size transcriptome
  - gene expression: need 10-40 million single end reads per sample -> multiplex ~6-12x
  - gene expression + alternatively spliced mRNA isoforms: need ≥100 million paired-end reads per sample -> no/low multiplexing
• short/small RNA-seq
  - need 2-3 million single end, ≤40 bp reads per sample -> use lower capacity sequencer, and multiplex e.g. 12x

Replicates
• RNA-seq, too, suffers from the inherent variation in e.g. gene expression levels between individuals
  - need samples from many individuals per group
  - probably at the very least tens
  - the smaller the expected difference, the bigger the N must be
  - power calculations…?
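The depth and multiplexing figures above come down to simple arithmetic, sketched here in Python. Two assumptions are not from the slides: reads are split evenly across barcoded samples, and read counts behave like Poisson counts, so that the relative sampling noise of a count N is 1/sqrt(N). The function names are illustrative only.

```python
from math import sqrt


def max_multiplex(reads_per_run: int, reads_per_sample: int) -> int:
    """Number of barcoded samples that fit in one run at the required depth.

    Assumes reads are shared evenly between samples (an idealisation).
    """
    return reads_per_run // reads_per_sample


def poisson_cv(read_count: int) -> float:
    """Relative sampling noise (sd / mean) of a read count.

    Assumes Poisson counting noise; biological variation comes on top.
    """
    return 1 / sqrt(read_count)


RUN_CAPACITY = 200_000_000  # ~200 million reads per run, as quoted for the "old" HiSeq

for label, needed in [
    ("gene expression, 10 M reads/sample", 10_000_000),
    ("gene expression, 40 M reads/sample", 40_000_000),
    ("expression + isoforms, 100 M reads/sample", 100_000_000),
]:
    print(f"{label}: up to {max_multiplex(RUN_CAPACITY, needed)} samples per run")

# Read counts from the depth table: low vs higher expressed gene, N vs 4 * N reads.
for count in (6, 24, 60, 240):
    print(f"{count:>3} reads: relative noise ~{poisson_cv(count):.0%}")
```

Under these assumptions, quadrupling the depth halves the relative counting noise, which is why extra depth mostly benefits the low expressed genes.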
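The point about replicates ("the smaller the expected difference, the bigger the N must be") can be illustrated with a textbook sample-size approximation. This is only a sketch: it assumes a two-sided comparison of two group means with equal, normally distributed variation (for example on a log2 expression scale), which is not how RNA-seq count data are formally tested, and the function name and default values are illustrative. Dedicated RNA-seq power tools should be preferred for real study planning.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(difference: float, sd: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate N per group for a two-group comparison of means.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) * sd / difference)^2
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / difference) ** 2)


# Expected difference expressed in units of the between-individual sd.
for difference in (1.0, 0.5, 0.25):
    n = samples_per_group(difference, sd=1.0)
    print(f"difference = {difference} sd: about {n} samples per group")
```

Halving the expected difference roughly quadruples the required group size, so modest expression differences quickly push N into the tens or hundreds per group.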
Long vs short RNA-seq

Target
• Long RNA-seq
  - any RNA present in the extracted RNA sample
  - messenger RNAs (mRNAs)
  - long non-coding RNAs (lncRNAs), processed pseudogenes etc
  - typical min length: ~200 bp
  - all expressed isoforms included
• Short/small RNA-seq
  - small non-coding RNAs (sncRNAs)
  - microRNAs (miRNAs)
  - PIWI-interacting RNAs (piRNAs)
  - small nucleolar RNAs (snoRNAs)
  - utilizes chemical properties at the ends of small RNAs
[Figure: e.g. a protein-coding mRNA, 5' to 3', against the genome]

Starting material
• Long RNA-seq
  - total RNA
  - mRNA-only sample
• Short/small RNA-seq
  - total RNA that MUST also include the <50 bp small RNA species: RNA extraction method matters!
  - small RNA-only sample

Complexity
• Long RNA-seq
  - e.g. 19797 known protein coding genes (through 79795 transcripts)
  - variable tissue specificity
  - variable lengths
  -> high complexity
• Short/small RNA-seq
  - e.g. 2588 known mature miRNAs
  - very short
  - higher tissue specificity
  -> low complexity

RNA-seq analysis work flows

Long RNA-seq data analysis work-flow
• Raw data on server (public data: download) vs raw data locally
• Initial QC
  - 'fastqc' -> QC results
• Preprocessing
  - Trim adapter and Q-filter ('Trimmomatic'), then 'fastqc'
  - Trim for 3'-An ('homertools trim'), then 'fastqc'
• Decontamination
  - Align to rRNA+chrM+etc ('bowtie2' or 'tophat2') against a bowtie2 index
  - Aligned reads are the contaminants; unaligned reads continue
• Alignment
  - Align and index ('tophat2' & 'samtools') against the transcriptome + genome bowtie2 index
  - Sort, index, and visualize ('igvtools' -> .tdf)
• Quantitation
  - ? transcriptome (re-)annotation ('cufflinks', 'cuffcompare')
  - Quantitate ('cuffquant') -> gene expressions
  - Export ('cuffnorm') -> norm'd GEx & counts
• Analysis
  - Test ('cuffdiff') -> DEG results
  - Pathway analysis
  - Associations to clinical data etc

RNA-seq pipeline architecture
• Master Unix shell script
  - inputs: pipeline_settings.txt and an input folder (filename(s).suffix)
  - calls one script per step; each step writes log.txt and its own output folder (filename(s).suffix)
• run_fastqc.sh -> fastqc
• run_homertools.sh -> homertools trim
• run_trimmomatic.sh -> trimmomatic
• etc.sh -> etc . . .
• reporting.sh(s) -> summary.txt

Some data formats and types
• Raw sequence data (.fastq.gz)
• FastQC quality control (.html)
• Aligned reads (.sam, .bam, indexed and sorted .bam…)
• Visualization in genome browser (.tdf, .bigBed, .bigWig…)
• Gene expression test results (from 'cuffdiff', tab.delim.txt)
• etc…
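The pipeline architecture described above (a master script that calls one script per step, with a log.txt and an output folder for each step and a final summary.txt) can be sketched as follows. The slides describe Unix shell scripts; this Python version is only an illustration of the same control flow, and the function names, the step names and the placeholder commands are all hypothetical. In a real pipeline each command would be a wrapper such as run_fastqc.sh.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_step(name, command, work_dir):
    """Run one step in its own output folder and record the outcome in its log.txt."""
    out_dir = Path(work_dir) / name
    out_dir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(command, cwd=out_dir, capture_output=True, text=True)
    with (out_dir / "log.txt").open("a") as log:
        log.write(f"command: {command}\n")
        log.write(f"exit code: {result.returncode}\n")
        log.write(result.stdout + result.stderr)
    return result.returncode == 0


def run_pipeline(steps, work_dir):
    """Run the steps in order, stop at the first failure, and write summary.txt."""
    Path(work_dir).mkdir(parents=True, exist_ok=True)
    finished = []
    for name, command in steps:
        if not run_step(name, command, work_dir):
            break  # later steps depend on this one, so do not continue
        finished.append(name)
    summary = Path(work_dir) / "summary.txt"
    summary.write_text("".join(f"{name}\tOK\n" for name in finished))
    return finished


def placeholder(message):
    """Hypothetical stand-in for a real wrapper script: just prints a message."""
    return [sys.executable, "-c", f"print({message!r})"]


with tempfile.TemporaryDirectory() as tmp:
    steps = [
        ("fastqc", placeholder("initial QC done")),
        ("trim", placeholder("trimming done")),
        ("align", placeholder("alignment done")),
    ]
    print(run_pipeline(steps, tmp))
```

Stopping at the first failed step is a design choice of this sketch: every later step consumes the output of the previous one, so continuing would only produce misleading results.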
Pilot RNA-seq sample from human blood
[Figures: read count across processing steps; Tissue Specific Expression Analysis (TSEA) of the top 1000 expressed genes]

Small RNA-seq data analysis work-flow
• Raw data on server
• Initial QC
• Preprocessing
  - NOTE: with e.g. MiSeq (Mediteknia), adapter clipping is done already on the sequencer (40 bp read -> e.g. 22 bp)
• Decontaminate
• Align, QC, sort, index, visualize; reads left unaligned against one index are aligned against the next
  - mature miRNA index
  - hairpin miRNA index
  - sncRNA index
  - piRNA index
• Quantitate (DESeq2 depend.)
• Annotate (DESeq2 depend.)
• Test (DESeq2)
  - phenotype data: groups, clin chem, histology, disease risk, etc

Thank you!
uef.fi