Schedule change
Galaxy server going down for maintenance on Thursday
• Day 2: AM - Introduction to RNA-Seq (and a touch of miRNA-Seq)
• Day 2: PM - RNA-Seq practical (Tophat + Cuffdiff pipeline on Galaxy)
• Day 3: AM - Introduction to Exome Sequencing and Variant Discovery
• Day 3: PM - Exome sequence analysis practical (Galaxy)

Quick Recap
• NGS data production becoming commonplace
• Many applications -> research intent determines technology platform choice
• High volume data BUT error prone
• FASTQ is accepted format standard
• Must assess quality scores before proceeding
• ‘Bad’ data can be rescued

Introduction to RNAseq

The Central Dogma of Molecular Biology
[Figure: central dogma diagram, including Reverse Transcription]

RNAseq Protocols
• cDNA, not RNA sequencing
• Types of libraries available:
  – Total RNA sequencing (not advised)
  – polyA+ RNA sequencing
  – Small RNA sequencing (specific size range targeted)

cDNA Synthesis

Genome-scale Applications
• Transcriptome analysis
• Identifying new transcribed regions
• Expression profiling
• Resequencing to find genetic polymorphisms:
  – SNPs, micro-indels
  – CNVs
  – Question: Why even bother with exome sequencing then?

Sequencing details
• Standard sequencing
  – polyA/total RNA
  – Size selection
  – Primers and adapters
  – Single- and paired-end sequencing
• Strand-specific sequencing (still immature tech)
  – Sequencing only + or - strand
  – Mostly paired-end

What about microarrays??!!!
• Assumes we know all transcribed regions and that spliceforms are not important
• Cannot find anything novel
• BUT may be the best choice depending on QUESTION

Arrays vs RNAseq (1)
• Correlation of fold change between arrays and RNAseq is similar to correlation between array platforms (0.73)
• Technical replicates almost identical
• Extra analysis: prediction of alternative splicing, SNPs
• Low- and high-expressed genes do not match

RNA-Seq promises/pitfalls
• Can reveal in a single assay:
  – new genes
  – splice variants
  – genome-wide gene expression levels
• BUT
  – Data is voluminous and complex
  – Need scalable, fast and mathematically principled analysis software and LOTS of computing resources

Experimental considerations
• Comparative conditions must make biological sense
• Biological replicates are always better than technical ones
• Aim for at least 3 replicates per condition
• ISOLATE the target mRNA species you are after
• NOT looking for new transcripts can bias expression estimates

Analysis strategies
• De novo assembly of transcripts:
  + re-constructs actual spliced transcripts
  + does not require genome sequence
  + easier to work with post-transcriptional modifications
  - requires huge computational resources (RAM)
  - low sensitivity: hard to capture low abundance transcripts
• Alignment to the genome => Transcript assembly
  + computationally feasible
  + high sensitivity
  + easier to annotate using genomic annotations
  - need to take special care of splice junctions

Basic analysis flowchart
[Flowchart figure; boxes recovered: Illumina reads; Re-align with different number of mismatches etc.; Remove artifacts (AAA..., ...N...); edge label: un-mapped]
[Flowchart figure, continued; boxes recovered: Align to the genome; Clip adapters (small RNA); Pre-filter: low complexity, synthetic; "Collapse" identical reads; Count and discard; Assemble: contigs (exons) + connectivity; Filter out low confidence contigs (singletons); Annotate; edge labels: mapped, un-mapped]

Software
• Short reads aligners
  – Stampy, BWA, Novoalign, Bowtie, TOPHAT
• Data preprocessing
  – Fastx toolkit
  – samtools
• Expression studies
  – Cufflinks package
  – R packages (DESeq, edgeR, more…)
• Alternative splicing
  – Cufflinks
  – Augustus

The ‘Tuxedo’ protocol
• TOPHAT + CUFFLINKS
• TopHat aligns reads to genome and discovers splice sites
• Cufflinks predicts transcripts present in dataset
• Cuffdiff identifies differential expression
Very widely adopted suite

‘Tuxedo’ protocol limitations
• Uses short-read data: Illumina OR SOLiD
• Requires a sequenced genome
• No GUI
• Versions implemented in GALAXY are old(ish)

Read alignment with TopHat

Splice junctions
[Figure labels: R; RNA; L_exon; Genome]
• In humans, terminal exons are ~1kb long, and since mRNAs are ~2kb, ~half of the reads should originate in initial and internal exons
• Initial and internal exons are ~200b long => for 75-mer reads, ~20% of reads are supposed to cross splice junctions

Splice junctions strategies
• Create a splice junctions database joining together donors and acceptors
• Typically, use known (annotated) splice junctions or known splice sites
• TopHat: uses putative exons from mapped reads, database is made of canonical splice sites around putative exons

Read alignment with TopHat (2)
• Uses BOWTIE aligner to align reads to genome
• BOWTIE cannot deal with large gaps (introns)
• Tophat segments reads that remain unaligned
• Smaller segments mostly end up aligning

Read alignment with TopHat (3)
• When there is a large gap between segments of same read -> probable INTRON
• Tophat uses this to build an index of probable splice sites
• Allows accurate measurement of spliceform expression
• Possibility of detecting gene fusion events

Cufflinks package
• http://cufflinks.cbcb.umd.edu/
• Cufflinks:
  – Expression values calculation
  – Transcripts de novo assembly
• Cuffcompare:
  – Transcripts comparison (de novo/genome annotation)
• Cuffdiff:
  – Differential expression analysis

Cufflinks: Transcript assembly
• Assembles individual transcripts based on aligned reads
• Infers likely spliceforms of each gene
• Builds ‘transfrags’
• Reports the smallest number of spliceforms that can explain the data
• NOTE: assembly errors do occur -> sequencing depth helps

Cufflinks: Transcript assembly (2)
• Quantifies expression level of each transfrag
• Filters out those likely to be premature terminations, non-mature mRNAs, etc.

Cuffmerge
• Merges transfrags into transcripts where appropriate
• Also performs a reference-based assembly of transcripts using known transcripts
• Produces a single annotation file which aids downstream analysis

Cuffdiff: Differential expression
• Calculates expression level in two or more samples
• Expression level relates to read abundance
• Because of bias sources, cuffdiff tries to model the variance in its significance calculation

What else is important?

FPKM (RPKM): Expression Values
Fragments (Reads) Per Kilobase of exon model per Million mapped fragments
Mortazavi A et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008.

$$\mathrm{FPKM} = 10^{9} \times \frac{C}{N L}$$

C = the number of reads mapped onto the gene's exons
N = total number of reads in the experiment
L = the sum of the exons in base pairs
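As a rough illustration of the FPKM formula above, the short Python sketch below evaluates 10^9 × C / (N × L) for one gene. The function name `fpkm`, its parameter names and the example numbers are hypothetical and chosen only for illustration; this is not how Cufflinks itself estimates expression, which involves a statistical model over transcripts.

```python
def fpkm(c: int, n: int, l: int) -> float:
    """Evaluate FPKM = 10^9 * C / (N * L) as defined on the slide.

    c: number of reads (fragments) mapped onto the gene's exons
    n: total number of reads in the experiment
    l: summed exon length of the gene in base pairs
    """
    if n <= 0 or l <= 0:
        raise ValueError("n and l must be positive")
    return 1e9 * c / (n * l)


# Hypothetical example: 500 reads on a gene with 2,000 bp of exons,
# in an experiment with 20 million reads in total.
example = fpkm(c=500, n=20_000_000, l=2_000)
print(f"FPKM = {example:.2f}")
```

The factor 10^9 combines the two normalisations in the name: 10^3 for "per kilobase" of exon and 10^6 for "per million" reads, so values are comparable across genes of different length and runs of different depth.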
Cufflinks (Expression analysis)

gene_id          bundle_id  chr   left    right   FPKM     FPKM_conf_lo  FPKM_conf_hi  status
ENSG00000236743  31390      chr1  459655  461954  0        0             0             OK
ENSG00000248149  31391      chr1  465693  688071  787.12   731.009       843.232       OK
ENSG00000236679  31391      chr1  470906  471368  0        0             0             OK
ENSG00000231709  31391      chr1  521368  523833  0        0             0             OK
ENSG00000235146  31391      chr1  523008  530148  0        0             0             OK
ENSG00000239664  31391      chr1  529832  532878  0        0             0             OK
ENSG00000230021  31391      chr1  536815  659930  2.53932  0             5.72637       OK
ENSG00000229376  31391      chr1  657464  660287  0        0             0             OK
ENSG00000223659  31391      chr1  562756  564390  0        0             0             OK
ENSG00000225972  31391      chr1  564441  564813  96.9279  77.2375       116.618       OK
ENSG00000243329  31391      chr1  564878  564950  0        0             0             OK
ENSG00000240155  31391      chr1  564951  565019  0        0             0             OK

Cuffdiff (differential expression)
• Pairwise or time series comparison
• Normal distribution of read counts
• Fisher’s test

test_id          gene    locus                     sample_1  sample_2  status  value_1  value_2  ln(fold_change)  test_stat  p_value   significant
ENSG00000000003  TSPAN6  chrX:99883666-99894988    q1        q2        NOTEST  0        0        0                0          1         no
ENSG00000000005  TNMD    chrX:99839798-99854882    q1        q2        NOTEST  0        0        0                0          1         no
ENSG00000000419  DPM1    chr20:49551403-49575092   q1        q2        NOTEST  15.0775  23.8627  0.459116         -1.39556   0.162848  no
ENSG00000000457  SCYL3   chr1:169631244-169863408  q1        q2        OK      32.5626  16.5208  -0.678541        15.8186    0         yes

Visualization: Genome Viewers
• Web based:
  – UCSC Genome Browser (http://genome.ucsc.edu/)
• Standalone:
  – Integrative Genomics Viewer, IGV (http://www.broadinstitute.org/software/igv/)

RNAseq hands-on practical (Galaxy)
• Data QC and trimming
• Aligning reads to reference genome
• Running CUFFLINKS and looking at some transcripts using the UCSC genome browser
• Finding differentially expressed genes with CUFFDIFF
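When inspecting Cuffdiff output in the practical, it can help to check how the ln(fold_change) column relates to value_1 and value_2. The sketch below assumes the column is simply the natural log of value_2 / value_1, which is consistent with the DPM1 and SCYL3 rows of the Cuffdiff example table shown earlier. The helper name `ln_fold_change` is hypothetical, and returning `None` when either value is zero is a choice made here for illustration, not a description of what Cuffdiff does.

```python
import math
from typing import Optional


def ln_fold_change(value_1: float, value_2: float) -> Optional[float]:
    """Natural log of value_2 / value_1; None if either value is not positive.

    Assumption: this mirrors the ln(fold_change) column of the example table.
    """
    if value_1 <= 0 or value_2 <= 0:
        return None
    return math.log(value_2 / value_1)


# (gene, value_1, value_2, ln(fold_change) as reported in the example table)
rows = [
    ("DPM1", 15.0775, 23.8627, 0.459116),
    ("SCYL3", 32.5626, 16.5208, -0.678541),
]

for gene, v1, v2, reported in rows:
    computed = ln_fold_change(v1, v2)
    print(f"{gene}: computed {computed:.4f}, reported {reported:.4f}")
```

A positive value means higher expression in sample_2 (q2) than in sample_1 (q1), and a negative value means the opposite; SCYL3, for example, is roughly halved in q2.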