Download RNA-Seq workshop Achems 2017

RNA-Seq Workshop AChemS 2017
Sunil K Sukumaran
Monell Chemical Senses Center, Philadelphia

Benefits & downsides of RNA-Seq
Benefits:
■ High resolution, sensitivity and large dynamic range
■ Independent of prior knowledge (in contrast to predesigned probes in microarray analysis).
■ Unravel previously inaccessible complexities.
Downside:
■ Data analysis is not straightforward; methods continue to evolve.
■ Cost

Types of RNA-Seq analysis
■ Gene expression analysis
■ Single cell RNA-Seq (scRNA-Seq)
■ Small RNA-Seq (miRNA-Seq)
■ Analysis of RNA-protein/RNA-RNA interaction

Goals of ‘typical’ RNA-Seq analysis
■ Identify expressed genes and transcripts
■ Quantify gene expression in different conditions or tissues (differential expression).
■ Identify novel transcripts and genes (de novo assembly)
– Alternative splicing
– Novel transcribed genes
– Transcriptome from non-model organisms

Comparison of sequencing platforms
[Figure: “1st Gen”, “2nd Gen” and “3rd Gen” platforms]

Overview of Illumina RNA-Seq
https://www.slideshare.net/ueb52/uebuat-bioinformatics-course-session-23-vhir-barcelona

Sequencing strategies
■ Which library preparation protocol to use?
■ How many replicates?
■ What is the optimal library size (sequencing depth)?
■ Paired end or single end?
■ Which data analysis pipeline to use?

Not all types of RNA encode information
The bulk (~95%) of cellular RNA is rRNA and tRNA.
http://finchtalk.blogspot.com/2009/05/small-rnas-get-smaller.html

Quality and quantity of input RNA
■ High quality RNA is preferred, but many times not available.
– Needle biopsies, laser microdissection and formalin fixed paraffin embedded samples yield low integrity RNA.
■ The amount of RNA may be low by necessity or by design (e.g. scRNA-Seq).

mRNA has to be selectively enriched
[Figure: enrichment methods shown: polyA selection, RNase H, Ribo-Zero; legend: magnetic bead]

Stranded libraries are better!
■ Stranded libraries preserve information on the strand of origin of the transcript
– Helpful when overlapping antisense transcripts occur in a genomic region (~19% of genes in the human genome!), e.g. the mouse Gng13 and Chtf18 genes.

How many replicates?
■ Considerations include:
– Technical variability of the RNA-Seq protocol.
– The intrinsic biological variability.
– The desired statistical power.
■ Multiple samples can be sequenced in the same lane (multiplexing).
■ Prepare all replicate libraries at once, to avoid batch effects.

Sequencing mode and length
■ Paired end preferred for de novo transcriptome assembly and isoform level analysis
■ Single end sequencing sufficient for gene expression studies
■ Illumina sequencer read lengths vary from 50 to 150 bp.
■ Longer read length = better mappability.

Library size
■ Only a subset of the genome is transcribed
■ The dynamic range of gene expression is huge
– Reliable detection of genes expressed at lower levels needs a bigger library size.
– scRNA-Seq needs lower depth
■ Tools such as Scotty and RNASeqPower can help calculate the optimum library size and number of replicates based on pilot data.
■ The ENCODE consortium guidelines: http://encodeproject.org/ENCODE/experiment_guidelines.html

RNA-Seq Library preparation
■ Library specific index sequences allow pooling multiple libraries
■ ~6 libraries are pooled per lane for typical RNA-Seq
■ 100s of libraries are pooled for scRNA-Seq.

Digital RNA-Seq uses barcodes to correct PCR bias
Proc Natl Acad Sci U S A.
2012 Jan 24;109(4):1347-52
Particularly useful when many cycles of PCR amplification are used (e.g. scRNA-Seq)

Illumina Sequencing

From sequence to biological insights
■ FASTQ files (QC by FastQC/R)
■ Read mapping: to genome/transcriptome/de novo (QC by RSeQC)
■ Expression quantification: summarize read counts (EM/union of exons)
■ Differential expression analysis: gene/transcript level
■ Functional interpretation: enriched pathways/GO terms, integration with other data
■ Biological insights & hypothesis

FASTQ file format
■ FASTQ format is used by modern sequencers. It bundles a FASTA sequence and its quality data.
– Line 1: Sequence identifier
– Line 2: Raw sequence
– Line 3: '+' separator; carries no information, may repeat the sequence identifier
– Line 4: Quality values for the sequence (! = lowest, ~ = highest)

@HWUSI-EAS100R:6:73:941:1973#0/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Sequencing QC, using FastQC
■ Basic information (total reads, sequence length, etc.)
■ Per base sequence quality
■ Overrepresented sequences
■ GC content
■ Duplication level
■ Etc.

FastQC report
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
[Figures: per base sequence quality; overrepresented sequences (adapter)]

Challenges in RNA-Seq read alignment
■ How to correctly align short reads to the parent gene?
– Theoretically, the chance of a given 100 bp read occurring more than once in a genome is infinitesimally small (4^100 = ~1.6*10^60 possible sequences, compared to the size of a mammalian genome, ~3*10^9 bp).
– But repeat elements, such as conserved regions in gene families and overlapping antisense genes, abound in the genome.
– About 1/3 of RNA-Seq reads span exon-exon junctions!

Are you awake?

The shredded book analogy for short read alignment
Adapted from a lecture by Michael Schatz, JHU

Nature Methods. 10, 1165–1166 (2013)
de Bruijn Graph assembly
Repeat sequences make correct reconstruction and quantification difficult.
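As a minimal sketch of the four-line FASTQ layout described in the FASTQ file format slide, the example record can be parsed and its quality string decoded as shown here. The offset of 33 (Phred+33, where '!' is 0 and '~' is 93) is an assumption based on the slide's "! = lowest, ~ = highest" note; older Illumina pipelines used other offsets. The function names `parse_fastq` and `error_probability` are illustrative and not part of any library.

```python
from statistics import mean

# Example record copied from the FASTQ slide.
RECORD = """@HWUSI-EAS100R:6:73:941:1973#0/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"""

PHRED_OFFSET = 33  # assumption: Phred+33 encoding ('!' = 0, '~' = 93)


def parse_fastq(text):
    """Yield (identifier, sequence, quality scores) for each 4-line record."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Each quality character encodes one Phred score per base.
        yield header[1:], seq, [ord(c) - PHRED_OFFSET for c in qual]


def error_probability(q):
    """Convert a Phred score Q to a base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for name, seq, quals in parse_fastq(RECORD):
        print(name, len(seq), "bases")
        print("mean Phred score:", round(mean(quals), 1))
        print("lowest-quality base error probability:",
              error_probability(min(quals)))
```

Tools such as FastQC report the same per-base scores aggregated over all reads; this sketch only makes the encoding explicit for a single record.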
Overview of read mapping and transcript identification
[Flowchart, RNA-Seq reads as input]
Model organisms (reference sequence available)
■ Splice aware mapper (TopHat, STAR, HISAT): align to genome
– With GTF: analyze known transcripts (EM algorithm: Cufflinks, RSEM…; union of exons: FeatureCounts)
– W/O GTF: discover novel transcripts (StringTie, Cufflinks)
■ Ungapped mapper (BWA, Bowtie): align to transcriptome
– Analyze known transcripts (RSEM, Kallisto)
Non-model organisms (poor or no reference sequence)
■ De novo assembler (Trinity): identify all transcripts
■ BWA, Bowtie: align to de novo transcriptome
– Analyze (RSEM, Kallisto)

Alignment and annotation files
■ SAM is a text based file format for storing sequences aligned to a reference sequence.
– Consists of an optional header section (lines starting with @, e.g. reference sequence names) and an alignment section.
– Each alignment line has 11 mandatory fields specifying alignment information
■ BAM files are compressed binary forms of SAM files.
■ GTF, GFF and BED files contain annotations of features such as the coordinates of genes, transcripts and exons.

Genome browsers
■ Web based: UCSC
■ Desktop: IGV
[Figure: Sashimi plot]

Taste cell and tissue isolation for RNA-Seq analysis
[Figure panels A, B, C: circumvallate and fungiform tissue before and after cutting; labels: Type III Salt, Type III Sour, T1R3GFP (Sweet/Umami), GustGFP (Mostly bitter), GADGFP (Type III), Lgr5GFP (Stem)]

Mapping QC
■ Percentage of reads properly mapped or uniquely mapped
■ Among the mapped reads, the percentage of reads in exon, intron, and intergenic regions.
■ Splice junctions
■ 5' or 3' bias
■ Etc…
Popular software includes RSeQC and RNA-SeQC.
[Figures: read mapping to gene features; splice junction saturation]

Taste cells express many novel isoforms and genes
■ 42%-45% of the splice junctions in taste libraries are either completely or partially novel.
■ But these novel splice junctions were rarely used (<5%).
■ Taste and olfactory tissues are barely represented in public gene annotation efforts.
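The first two Mapping QC metrics can be illustrated directly on SAM alignment lines. The sketch below splits each line into the 11 mandatory SAM fields and tallies mapped, uniquely mapped and spliced reads. Several things here are assumptions rather than anything from the slides: the toy records in `TOY_SAM` are made up, reads are treated as single end, "uniquely mapped" is approximated by the optional `NH:i:1` tag (aligners differ in how they report uniqueness), and a spliced alignment is detected by an `N` operation in the CIGAR string. Dedicated tools such as RSeQC should be preferred for real data.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")

# Bit flags defined by the SAM format.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800

# Hypothetical toy records, tab separated, for illustration only.
TOY_SAM = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t255\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:1",
    "read2\t0\tchr1\t200\t3\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:2",
    "read2\t256\tchr1\t500\t3\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:2",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read4\t16\tchr1\t300\t255\t4M100N6M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:1",
]


def parse_alignment(line):
    """Split one SAM alignment line into its 11 mandatory fields plus tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("alignment line has fewer than 11 fields")
    rec = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        rec[key] = int(rec[key])
    rec["TAGS"] = cols[11:]  # optional TAG:TYPE:VALUE fields
    return rec


def mapping_summary(sam_lines):
    """Count primary records that are mapped, unique (NH:i:1) and spliced."""
    total = mapped = unique = spliced = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        rec = parse_alignment(line)
        if rec["FLAG"] & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        total += 1
        if rec["FLAG"] & FLAG_UNMAPPED:
            continue
        mapped += 1
        if "NH:i:1" in rec["TAGS"]:
            unique += 1
        if "N" in rec["CIGAR"]:
            spliced += 1
    return {"total": total, "mapped": mapped,
            "unique": unique, "spliced": spliced}


if __name__ == "__main__":
    summary = mapping_summary(TOY_SAM)
    for key in ("mapped", "unique", "spliced"):
        pct = 100.0 * summary[key] / summary["total"]
        print(f"{key}: {summary[key]}/{summary['total']} ({pct:.1f}%)")
```

For BAM input the records would first need to be decompressed to SAM text; that step is outside this sketch.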
Gene body coverage
[Figure: normalized read mapping intensity (0 to 100) against normalized distance along transcript 5'->3' (%), for bulk taste libraries and single cell libraries]

Motivation for re-annotating the taste transcriptome
■ Not all transcripts are fully annotated, even in human and mouse
■ Transcriptomes are annotated from ‘well studied’ tissues by RefSeq and Gencode.
■ The 3’ and 5’ UTRs of genes are poorly annotated
– This causes problems for 3’ end sequencing
– Especially problematic for scRNA-Seq

Strategies vary for model organisms…
…And non-model organisms

Methods for transcriptome assembly
■ Reference-based assembly
■ De novo assembly
Martin J.A. and Wang Z., Nat. Rev. Genet. (2011) 12:671–682

Transcriptome assembly when reference genome is available
https://galaxyproject.org/tutorials/rb_rnaseq/

When reference genome and transcriptome are available
Bioinformatics (2011) 27 (17): 2325-2329.
Reference annotation based transcriptome assembly (RABT assembly) leverages existing gene annotations for discovering novel transcripts. Appropriate for model organisms.

Strategy for RABT assembly of taste RNA-Seq data
■ Reference annotation based transcriptome assembly of taste bud libraries using the Cufflinks and StringTie packages
■ Results from the two workflows were combined.
■ Non coding, pre-mRNA and transcripts containing premature stop codons were removed.
■ Potential coding transcripts were functionally annotated.
■ More info: Poster # 520

Many novel genes and isoforms of known genes were identified in the taste buds
Transcript types                  De novo gene annotations
Identical to known                111512*
Novel intronic                    115
Novel isoforms of known genes     50110
Novel intergenic transcripts      1649
Novel antisense transcripts       303
*Out of a total of 111706 transcripts in Gencode M7

Improved bitter taste receptor gene annotations
[Figure: blue = de novo model, red = RefSeq model]
23/35 Tas2r genes are multi-exonic. Ten of them were verified by RT-PCR using cDNA from taste tissue

Novel isoform of known genes: e.g.
Chromogranin A

Improved mouse OR gene annotations
[Figure panels A and B]
913 (73.1%) OR and 246 (45.9%) VR genes had extended gene models. The de novo models are more sensitive at detecting OR gene expression (B).
A: From PLoS Genet. 2014 Sep 4;10(9):e1004593
B: From Scientific Reports 5, Article number: 18178 (2015) doi:10.1038/srep18178

Thanks for your attention!
[email protected]
Many figures and slides in this presentation came from publications, presentations, web pages etc. I am grateful to the authors for making them available.
