Introduction to RNAseq

NGS - Quick Recap
• Many applications -> research intent determines technology platform choice
• High-volume data BUT error prone
• FASTQ is the accepted format standard
• Must assess quality scores before proceeding
• ‘Bad’ data can be rescued

The Central Dogma of Molecular Biology
[Figure: the central dogma, including reverse transcription]

RNAseq Protocols
• cDNA, not RNA sequencing
• Types of libraries available:
  – Total RNA sequencing (not advised)
  – polyA+ RNA sequencing
  – Small RNA sequencing (specific size range targeted)

cDNA Synthesis
[Figure]

Genome-scale Applications
• Transcriptome analysis
• Identifying new transcribed regions
• Expression profiling
• Resequencing to find genetic polymorphisms:
  – SNPs, micro-indels
  – CNVs
  – Question: Why even bother with exome sequencing then?

What about microarrays??!!!
• Assumes we know all transcribed regions and that spliceforms are not important
• Cannot find anything novel
• BUT may be the best choice depending on QUESTION

Arrays vs RNAseq (1)
• Correlation of fold change between arrays and RNAseq is similar to correlation between array platforms (0.73)
• Technical replicates almost identical
• Extra analysis: prediction of alternative splicing, SNPs
• Low- and high-expressed genes do not match

RNA-Seq promises/pitfalls
• Can reveal in a single assay:
  – new genes
  – splice variants
  – genome-wide gene expression levels
• BUT
  – Data is voluminous and complex
  – Need scalable, fast and mathematically principled analysis software and LOTS of computing resources

Experimental considerations
• Comparative conditions must make biological sense
• Biological replicates are always better than technical ones
• Aim for at least 3 replicates per condition
• ISOLATE the target mRNA species you are after

Analysis strategies
• De novo assembly of transcripts:
  + re-constructs actual spliced transcripts
  + does not require genome sequence
  + easier to work with post-transcriptional modifications
  - requires huge computational resources (RAM)
  - low sensitivity: hard to capture low-abundance transcripts
• Alignment to the genome => Transcript assembly
  + computationally feasible
  + high sensitivity
  + easier to annotate using genomic annotations
  - need to take special care of splice junctions

Basic analysis flowchart
[Flowchart; box labels as recovered from the slide, connected by edges labelled "mapped" / "un-mapped"]
• Illumina reads
• Re-align with different number of mismatches etc
• Remove artifacts (AAA..., ...N...)
• Align to the genome
• Clip adapters (small RNA)
• Pre-filter: low complexity, synthetic
• "Collapse" identical reads
• Count and discard
• Assemble: contigs (exons) + connectivity
• Filter out low-confidence contigs (singletons)
• Annotate

Software
• Short-read aligners
  – BWA, Novoalign, Bowtie, TOPHAT (eukaryotes)
• Data preprocessing
  – Fastx toolkit, samtools
• Expression studies
  – Cufflinks package, R packages (DESeq, edgeR, more…)
• Alternative splicing
  – Cufflinks, Augustus

The ‘Tuxedo’ protocol
• TOPHAT + CUFFLINKS
• TopHat aligns reads to genome and discovers splice sites
• Cufflinks predicts transcripts present in dataset
• Cuffdiff identifies differential expression
Very widely adopted suite

Read alignment with TopHat
• Uses BOWTIE aligner to align reads to genome
• BOWTIE cannot deal with large gaps (introns)
• TopHat segments reads that remain unaligned
• Smaller segments mostly end up aligning

Read alignment with TopHat (2)
[Figure]

Read alignment with TopHat (3)
• When there is a large gap between segments of the same read -> probable INTRON
• TopHat uses this to build an index of probable splice sites
• Allows accurate measurement of spliceform expression

Cufflinks package
• http://cufflinks.cbcb.umd.edu/
• Cufflinks:
  – Expression values calculation
  – Transcripts de novo assembly
• Cuffdiff:
  – Differential expression analysis

Cufflinks: Transcript assembly
• Assembles individual transcripts based on aligned reads
• Infers likely spliceforms of each gene
• Quantifies expression level of each

Cuffmerge
• Merges transfrags into transcripts where appropriate
• Also performs a reference-based assembly of transcripts using known transcripts
• Produces a single annotation file which aids downstream analysis

Cuffdiff: Differential expression
• Calculates expression level in two or more samples
• Expression level relates to read abundance
• Because of bias sources, Cuffdiff tries to model the variance in its significance calculation

FPKM (RPKM): Expression Values
Fragments (Reads) Per Kilobase of exon model per Million mapped fragments
Nat Methods. 2008, Mapping and quantifying mammalian transcriptomes by RNA-Seq. Mortazavi A et al.

$$\mathrm{FPKM} = \frac{10^{9}\, C}{N\, L}$$

C = the number of reads mapped onto the gene's exons
N = total number of reads in the experiment
L = the sum of the exons in base pairs

Cuffdiff (differential expression)
• Pairwise or time series comparison
• Normal distribution of read counts
• Fisher’s test

test_id          gene    locus                     sample_1  sample_2  status  value_1  value_2  ln(fold_change)  test_stat  p_value   significant
ENSG00000000003  TSPAN6  chrX:99883666-99894988    q1        q2        NOTEST  0        0        0                0          1         no
ENSG00000000005  TNMD    chrX:99839798-99854882    q1        q2        NOTEST  0        0        0                0          1         no
ENSG00000000419  DPM1    chr20:49551403-49575092   q1        q2        NOTEST  15.0775  23.8627  0.459116         -1.39556   0.162848  no
ENSG00000000457  SCYL3   chr1:169631244-169863408  q1        q2        OK      32.5626  16.5208  -0.678541        15.8186    0         yes

Recommendations
• You can use BOWTIE or BOWTIE2 but
• Use CUFFDIFF2
  – Better statistical model
  – Detection of truly differentially expressed genes
  – VERY easy to parse output file (See example on course page)
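
As a rough illustration of the FPKM definition and of how easily Cuffdiff output can be parsed, here is a minimal Python sketch. It is not the course-page example. The function names `fpkm` and `significant_genes` are hypothetical, the read counts in the usage lines are made-up numbers, and the column names are assumed to match the table shown on the Cuffdiff slide (real Cuffdiff versions may name or order columns differently). The rows are typed in from that slide as tab-separated text.

```python
import csv
import io


def fpkm(c, n, l):
    """FPKM = 10^9 * C / (N * L), using the slide's symbols.

    c: reads mapped onto the gene's exons
    n: total number of reads in the experiment
    l: summed exon length in base pairs
    """
    if n <= 0 or l <= 0:
        raise ValueError("n and l must be positive")
    return 1e9 * c / (n * l)


# Header and rows copied from the slide table (assumed tab-separated on disk).
HEADER = ["test_id", "gene", "locus", "sample_1", "sample_2", "status",
          "value_1", "value_2", "ln(fold_change)", "test_stat", "p_value",
          "significant"]
ROWS = [
    ["ENSG00000000003", "TSPAN6", "chrX:99883666-99894988", "q1", "q2",
     "NOTEST", "0", "0", "0", "0", "1", "no"],
    ["ENSG00000000005", "TNMD", "chrX:99839798-99854882", "q1", "q2",
     "NOTEST", "0", "0", "0", "0", "1", "no"],
    ["ENSG00000000419", "DPM1", "chr20:49551403-49575092", "q1", "q2",
     "NOTEST", "15.0775", "23.8627", "0.459116", "-1.39556", "0.162848", "no"],
    ["ENSG00000000457", "SCYL3", "chr1:169631244-169863408", "q1", "q2",
     "OK", "32.5626", "16.5208", "-0.678541", "15.8186", "0", "yes"],
]
CUFFDIFF_TEXT = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS)


def significant_genes(text):
    """Return (gene, ln fold change) for rows that were tested and significant."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        (row["gene"], float(row["ln(fold_change)"]))
        for row in reader
        if row["status"] == "OK" and row["significant"] == "yes"
    ]


# Hypothetical gene: 500 reads, 20 million total reads, 2,000 bp of exons.
print(fpkm(500, 20_000_000, 2_000))      # 12.5
print(significant_genes(CUFFDIFF_TEXT))  # [('SCYL3', -0.678541)]
```

For a real file, the same function could be fed the file contents, for example `significant_genes(open(path).read())`, after checking that the header matches the column names assumed here.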