Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
+ RNAseq for differential gene expression analysis Molly Hammell, PhD 2016 1 + 2 RNAseq for differential gene expression: the tuxedo suite Why? How? Examples. . . . M. Hammell + 3 A good reference protocol M. Hammell + Analysis Outline What do you do first? How do you know when you’re done? What do you do with the results? You’ve got your data, what do you do with it? FastQC to check the quality of the reads Tophat Map to your favorite genome Cufflinks/Cuffdiff Assign reads to transcripts Differential expression Calculate your stats RNAseq QC Check that they mapped well, with little contamination, & replicates correlate Post-analysis CummeRbund (plots) heatmaps in R GSEA (pathways analysis) 4 M. Hammell + Mapping with Tophat 5 … Overview of the process. Making RNA-seq Libraries AAAAAAAA AAAAAAAA Isolate RNA Fragment RNA Convert RNA to cDNA fragments Ligate Adapters to each fragment PCR! M. Hammell + Mapping with Tophat 6 … Overview of the process. Mapping RNA-seq Libraries Making RNA-seq Libraries AAAAAAA A AAAAAAA A M. Hammell Trapnell C et al. Bioinformatics 2009;25:1105-1111 + Mapping with Tophat 7 … more parameters than you’d care to care about. What do you absolutely need to specify for each run? The genome index file (just the prefix) The reads files, one for each paired end The transcript files (gtf format) The insert size (-r). This is the size of the fragment minus reads length The library type (stranded or not, illumina-like or not) In the Green Line, these parameters are set for you as defaults in a Basic Run. You can use the default settings and probably be ok, but we’ll talk about those. M. Hammell + Mapping with Tophat 8 … more parameters than you’d care to care about. You need to give Tophat a reference genome for mapping your reads. In the Green Line, you choose the genome you are working with from the list of provided genomes. We actually give it an “index file” which is a compressed map of the genome. There are 6 genome index files. If you built these with bowtie1-build, they look like this Just specify the prefix, e.g.: hg19 The genome index file (just the prefix) The reads files, one for each paired end The transcript files (gtf format) The insert size (-r). This is the size of the fragment minus reads length The library type (stranded or not, illumina-like or not) M. Hammell + Mapping with Tophat 9 … more parameters than you’d care to care about. Next, we give Tophat your actual sequenced reads. These are FastQ files, which contain both sequence information as well as quality scores for each read. For paired end reads, we tell Tophat about both of them, (pair the reads) so it can try to map them together. In the Green Line, the output files from FastX toolkit are the input files to TopHat Example: First WT replicate (rep1); Read 1 paired with Read 2 wt-rep1_R1.fq.gz wt-rep1_R2.fq.gz The genome index file (just the prefix) The reads files, one for each paired end The transcript files (gtf format) The insert size (-r). This is the size of the fragment minus reads length The library type (stranded or not, illumina-like or not) M. Hammell + Mapping with Tophat 10 … more parameters than you’d care to care about. Tophat works much better if you give it the exon/intron structure of known genes, i.e., a gtf file. You can still choose to find novel transcripts later, but providing this file makes it easier to find reads that span known introns, which makes the search faster. Example: the human file is hg19_refseq_genes.gtf **Be careful – tophat is very picky about the format of this file. Either download one from their website, or get someone to give you one that plays nicely with tophat. In the Green Line, the gtf file is provided for you The genome index file (just the prefix) The reads files, one for each paired end The transcript files (gtf format) The insert size (-r). This is the size of the fragment minus reads length The library type (stranded or not, illumina-like or not) M. Hammell + Mapping with Tophat 11 … more parameters than you’d care to care about. 100 r=300 100 Mate Inner Distance Let’s say your RNAseq library prep kit preferentially fragments the RNA into 500 bp fragments. Then, you did a PE100bp run. Your “insert size” is 500 – 100 – 100 = 300. You did a PE300 run? Okay, your insert size is 0. Why does Tophat care? Tophat judges how well the two ends of a paired-end segment mapped by whether the inferred distance between them is consistent with the expected insert size. This is part of “mapping quality.” The genome index file (just the prefix) The reads files, one for each paired end The transcript files (gtf format) The mate inner distance (-r). This “insert size” is the size of the fragment minus reads length The library type (stranded or not, illumina-like or not) M. Hammell + Mapping with Tophat 12 … more parameters than you’d care to care about. In general, an Illumina Tru-seq Stranded RNA-seq library prep kit should use fr-firststrand. That means the library is stranded and the complementary strand is the first one sequenced. In other words, all your reads are the reversecomplement of the transcriptome! The other options: fr-unstranded fr-secondstrand common for non-stranded libraries used for ABI Solid libraries – not common The genome index file (just the prefix) The reads files, one for each paired end The transcript files (gtf format) The insert size (-r). This is the size of the fragment minus reads length The library type (stranded or not, illumina-like or not) M. Hammell + Mapping with Tophat 13 What other options might you care to change? Preset default options (In the Green Line in a Basic run) -N 2 number of mismatches per read (default is 2) -g 2 maximum number of alignments per read to report (default is 2) --suppress-hits set this if you want to suppress reads that map more than max (2) --no-mixed don’t report an alignment if you can’t map both ends of the fragment --no-novel-juncs don’t look for novel splice junctions – just use the ones in the GTF file (In the Green Line, if you will be skipping CuffLinks, you must run TopHat Advanced and choose this option to not look for novel junction) M. Hammell + Mapping with Tophat 14 What should your output look like? A typical tophat output directory should have these files: (The output files will be in your Project folder in the Data Store) accepted_hits.bam insertions.bed deletions.bed left_kept_reads.info junctions.bed logs right_kept_reads.info The bam file is your main output – the aligned reads. It’s in binary sam format. Here is an example of what one line of a bam file looks like: read name map flags MENDEL_0001_FC61FR7AAXX7:69:18748:7104#0 81 CGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTG read sequence map position map quality chr/start other PE read is mapped to same chr (=) at pos 709587, which is 699060 bp away chr1 10563 255 36M = 709587 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC read qualities 699060 NM:i:0 NH:i:1 tophat flags NM:i:0 = 0 mismatches NH:i:1 = aligns uniquely M. Hammell + 15 Going from bam files to differential expression After running tophat on each of your sample replicates, you have a set of bam files. Now, put all your bam files together and feed them to cuffdiff But, wait? The package was called cufflinks. Don’t we need to run cufflinks first? No. M. Hammell + There’s stuff I’m skipping, why? 16 Cufflinks was designed to find novel transcripts: maybe your genome doesn’t have good transcript annotations maybe you want to find novel splice junctions maybe you want to find novel transcripts The thing is, you need really deep data (100 million reads) to do a good job at this. So, if you don’t care about calling novel stuff, and/or if you don’t have deep enough sequencing data, you should skip cufflinks and go straight to cuffdiff. (In the Green line, you have the option of running CuffDiff directly after TopHat.) Cuffdiff always calculates the transcript abundances, even if you already did cufflinks. So, cufflinks saves you no time. M. Hammell + Cuffdiff for DE 17 differential expression, with stats Find reads mapping uniquely to one isoform Cuffdiff calculates: isoform abundances then fold changes then q-values. Iterate to find the most likely distribution of the rest of the reads Output a list of isoform abundances Now do some math to calculate log2FC, Q-values M. Hammell Trapnell C et al. Nat Biotechnol 2010 + Cuffdiff for DE 18 differential expression, with stats What do you absolutely need to specify for each CuffDiff run? A BAM file of input aligned reads for each condition (genotype, treatment, etc) and for biological replicates A GTF file that specifies all the transcripts you want to consider (Refseq genes) In the Green Line, the GTF file corresponding to your organism is provided by default The normalization strategy you want to use (FPKM is common) In the Green Line, FPKM is used by default M. Hammell + Cuffdiff for DE 19 differential expression, with stats What do you absolutely need to specify for each run? A BAM file of input reads for each condition (genotype, treatment, etc). A GTF file that specifies all the transcripts you want to consider (Refseq genes) The normalization strategy you want to use (FPKM is common) accepted_hits.bam Your Tophat output files, in BAM format, are the input to Cuffdiff. You should have at least two of these for each condition (biological replicates), the more the better. M. Hammell + Cuffdiff for DE 20 differential expression, with stats What do you absolutely need to specify for each run? A BAM file of input reads for each condition (genotype, treatment, etc). A GTF file that specifies all the transcripts you want to consider (Refseq genes) In the Green Line, the GTF file corresponding to your organism is provided by default The normalization strategy you want to use (FPKM is common) hg19_refseq_genes.gtf **Be careful –Cuffdiff is very picky about the format of this file. Either download one from their website, or get someone to give you one that plays nicely with Cuffdiff. M. Hammell + Cuffdiff for DE 21 differential expression, with stats What do you absolutely need to specify for each run? The normalization strategy you want to use: Fragments Per Kilobase of Exons Per Million Mapped Reads (FPKM) Transcript A Why per kilobase? Genes 2x as long will have 2x as many reads from the same number of starting transcripts. 200 bases 2 fragments/transcript Transcript B 400 bases 4 fragments/transcript M. Hammell + Cuffdiff for DE 22 differential expression, with stats What do you absolutely need to specify for each run? The normalization strategy you want to use Fragments Per Kilobase of Exons Per Million Mapped Reads (FPKM) Why per million mapped reads? Sequencing deeper gives you more reads. So, if you got 4x the # of reads from your control library (40M) and only 10M from your mutant library, before normalization it will seem as if all transcripts are expressed 4x more in the control than in the mutant Mutant Sample total library 10 million reads Control Sample total library 40 million reads Transcript A Transcript A M. Hammell + Cuffdiff for DE 23 differential expression, with stats What other options might you care to change? Other options: -c 10 You can specify the minimum reads per gene, below which no statistics are calculated -M You can specify a mask file of transcripts -- anything that should be removed from the analysis (like an rRNA file) -u You can tell it to take any multi-mapper reads, and try to assign the true locus In the Green Line, these options are pre-set defaults --dispersion-method (pooled, per condition, blind, poisson) M. Hammell + 24 Cuffdiff output A typical cuffdiff output directory should have these files: cds_exp.diff gene_exp.diff isoform_exp.diff tss_group_exp.diff cds.fpkm_tracking genes.fpkm_tracking isoforms.fpkm_tracking tss_groups.fpkm_tracking That’s a lot! What’s important? gene_exp.diff isoform_exp.diff abundances, fold changes & q-values of genes abundances, fold changes & q-values of isoforms Maybe important to you? tss cds tracking files alternate transcription start sites differential splicing that would produce distinct proteins lots of information about reads per replicate, stats M. Hammell + 25 A note on P-values The problem of multiple hypothesis testing FDR = False Discovery Rate Statisticians have been harping on something called P-value fishing for a long time. If you keep running tests over and over, you’ll eventually get something to come up as significant by random chance. How do you fix this? Adjust your P-values to q-values Benjamini & Hochberg (BH): rank genes by P-value (large to small) each P-value is adjusted by the number of genes with a smaller P-value than itself q-value = p-value * n/(n-k) n = number of genes k = rank in gene list M. Hammell + 26 So, what do I do now? A little QC & a lot of downstream analysis QC How well do the replicates correlate with each other? Does a PCA plot show that my samples group by genotype? What fraction of transcripts are expressed > 1 RPKM? Downstream analysis make lots of Excel tables of comparisons make some heatmaps (in R or using gplots, CummeRbund) figure out if any pathways are enriched among genes that change (DAVID, GSEA, PathDE, etc) M. Hammell + RNAseq post-analysis QC 27 RNAseq Replicate & Sample Correlations Replicates, Condition 1 Replicates, Condition 2 Condition 1 Vs. Condition 2 All axes show RPM (reads per million mapped) M. Hammell + RNAseq post-analysis & QC 28 PCA plots & corresponding heatmap Condition 1 + drug Condition 2 + drug Condition 1 Condition 2 PCA plot separates: (1) Condition 1 vs. Condition 2 (2) Lines grown in drug or normal media Condition 1 Condition 2 Condition 2 + drug M. Hammell + Standard Pathways Analysis 29 DAVID, Ingenuity’s IPA, etc. M. Hammell + 30 Gene Set Enrichment Analysis Are many of the altered genes changing in a single direction? M. Hammell Subramanian A et al. PNAS 2005;102:15545-15550 Molly Hammell Lab at CSHL 2016 Ray Ami Ying Nik Regina Wen-Wei Christos Funding: CSHL, NIH, NSF, Rita Allen Foundation, Ride for Life David 31 Yuan + Statistical Distributions gaussian, poisson, negative binomial 32 -- what does all this mean? Gaussian (normal) distributions -nice and easy to work with -describe smooth distributions -work well for microarrays -underlie the t-test (among others) The dispersion in this case is equal to the standard deviation You completely specify this distribution by the mean ( ) and the standard deviation (). M. Hammell + Statistical Distributions gaussian, poisson, negative binomial 33 -- what does all this mean? Poisson distributions - like a gaussian for nonsmooth distributions - describes things like stars in a small area on the sky - for very large numbers, this looks like a gaussian distribution The dispersion in this case is equal to the mean ( ). You completely specify this distribution by the mean. e.g., For a set of samples where your gene has an average of =100 reads overall, how likely is it that your gene randomly has k=10 reads in the M. mutant Hammell + Statistical Distributions gaussian, poisson, negative binomial 34 -- what does all this mean? Negative Binomial distributions - like a poisson but allows the variance to be different from the mean - often called “over-dispersed” poisson distribution - for very large numbers, this looks like a gaussian distribution The dispersion in this case is measured empirically from the data M. Hammell + Statistical Distributions gaussian, poisson, negative binomial 35 -- what does all this mean? RNA-seq data fits a Negative Binomial (NB) distribution. But really, that’s just saying that RNAseq looks like “counts” data with more variation than just statistical fluctuations– it also has biological variation in it. How do we know? Because, when you measure variance (per gene, between replicates), it’s not equal to the mean, and it’s not even a good linear fit poisson negative binomial fit * For real data, even the NB fit isn’t always great M. Hammell Anders & Huber, Genome Biology 2010 + 36 Okay, it’s an NB (ish), so what? how do I use this to help me set the cuffdiff parameters? --dispersion-method (pooled, per condition, blind, poisson) Pooled is the default. This means you average the within-type variances to fit an NB distribution to your data. Per condition means that you think the variance is different for each sample type and you want to build a separate NB distribution for each. You need lots of replicates for this. Blind means you ignore the sample type and act like all samples are replicates of each other. Only do this if you have no replicates. Poisson means you’re fitting to a poisson distribution and not measuring variance. Why would you do this? Idk. M. Hammell