Analysis of RNA-seq Data
Feb 8, 2017
Peikai CHEN (PhD)

Outline
• What is RNA-seq?
• What can RNA-seq do?
• How is RNA-seq measured?
• How to process RNA-seq data: the basics
• How to visualize and diagnose your RNA-seq data
• How to analyze RNA-seq data
• What is getting trendy in the RNA-seq field?
• Summary

What is RNA-seq?
A high-throughput way of measuring the transcriptome

Some biology:
• RNAs constitute the transcriptome; their levels are also called `gene expression`
• Gene expression patterns vary with:
  – Tissue type
  – Cell type
  – Developmental stage
  – Disease condition
  – Time point
  – Ethnicity and others
• Many types of RNA:
  – mRNA: usually protein-coding
  – microRNA
  – Non-coding RNA
  – tRNA, rRNA, snoRNA, siRNA
Nature Reviews Genetics 10, 57-63 (January 2009)

Its competitors and advantages
• Its main competitor was the microarray
• It is unbiased, high-throughput, de novo, sensitive, and becoming more economical

What can RNA-seq do?
At the very least, quantify the expression values of genes; but also much more
• Basic:
  – Quantification of genome-wide transcription
• Advanced:
  – Novel isoforms/splicing events
  – Novel intergenic transcripts
  – Novel coding variants
  – Allele-specific expression events
  – Novel gene fusion events
  – Copy number calling
  – Transcriptomes of single cells: clustering, sub-populations of cells, signatures, etc.

How is RNA-seq measured?
https://wikis.utexas.edu/display/bioiteam/Introduction+to+RNA+Seq+Course+2016
Paired-end vs. single-end

How to process RNA-seq data
The basics

Overview
Key steps:
• QC, initial look-up
• Alignment or assembly
• Quantification
• Gene-wise analyses: DEG identification, filtering, etc.
• Sample-wise analyses: PCA/clustering/pseudo-time, etc.
• Functional analyses: pathway, gene set
• Integration with multi-omics: you may develop your own methodologies
• Validations: wet-lab
Conesa et al.
Genome Biology (2016) 17:13

Tools/software most widely used
https://wikis.utexas.edu/display/bioiteam/RNA-Seq+Approaches

Step 1: look at your input data
• Input data:
  – Can be single-end or paired-end
  – Data format: mostly fastq, but Sequence Read Format (SRF) is also used
• A fastq file looks like this:
  – Every four lines make up one read
  – The first line is the read ID/info
  – The second is the sequence
  – The third is a separator line starting with "+"; the rest of it is optional and seldom used
  – The fourth is the per-base sequence quality, encoded as ASCII characters: the Phred scores
• Usually one fastq file (or one pair of them) is one sample: a mouse, a patient tissue, or a cell line

Step 1: look at your input data (continued)
• If you have N samples, you will have:
  – N fastq files, if single-end
  – 2N, if paired-end
• At this stage, your data have not been aligned, and you don’t know:
  – Each read’s coordinates
  – Whether a read is from your target transcriptome or from contamination
  – A read’s quality
  – The whole file’s quality
• QC is thus needed, and FastQC is frequently used

Step 2: do some read-level QC
• By looking at the FastQC report, you can check:
  – The average quality per read
  – The quality per position (usually the leading/trailing bases are lower in quality)
  – The GC content (whether it looks naturally occurring)
  – Any repetitive elements (might be linkers/adapters/barcodes)
• If one or more of your fastq files fail too many QC criteria, you might want to exclude them from further analyses
• Go to FastQC report examples

Step 3: alignment/assembly
• Just want to check known genes? Use the “alignment” approach:
  – Use TopHat/STAR/HISAT2 etc. to determine the locations of your reads
  – Use known gene models (like GENCODE or RefSeq genes) to determine the number of reads falling on the exons
• Want to check novel transcripts?
Use the “assembly” approach:
  – Cufflinks is the best tool for this job
  – It can assemble transcripts in a de novo manner, like the old-day shotgun method
  – But it can be highly unreliable for the many genes that are not highly expressed
  – Because today’s kits can’t capture reads evenly across the transcript
• Semi-alignment/semi-assembly approach:
  – Use Cufflinks: align reads to known genome coordinates, but don’t tell it where the genes are; let it figure that out
  – This approach works much better, but it will only give you transcripts from the provided genome

Step 3: alignment/assembly (continued)
• Important points:
  – Don’t use DNA alignment tools, like Bowtie
    – Because DNA doesn’t splice, these tools do not handle reads that span splice junctions
    – You will have extremely low mapping rates
  – Tune your parameters:
    – I usually allow 3 mismatches max.
    – But if your data are from cancer or bacteria/viruses, you might want to allow more, as they mutate a lot
  – Handle the low-quality reads: set some threshold
  – Set the number of bp trimmed from the lead/tail of reads, if the QC report tells you to do so
  – Make sure you map to both strands: otherwise you get half the mapping rate
  – Set the max number of locations a read is allowed to map to, usually 5

Step 3: alignment/assembly (continued)
• After alignment, you get a “sam/bam” file
• Bam is the binary version of sam; it takes less space
• You can use samtools to view your bam files
  (screenshot not recoverable; labeled columns: read IDs, chromosome mapped to, position mapped to, CIGAR code)

Step 3: alignment/assembly – check your alignment rates, and alignment structure
• Concordant vs. discordant mate pairs (diagrams not recoverable); discordant mates may also map to different chromosomes
• Too many discordant events might indicate deletions or inversions
• Mate mapping: only one mate is mapped
• Multi-reads don’t always mean bad mapping
  – Many homologous genes (paralogs) share the same domains
  – Many TFs also share DNA-binding domains, with the same sequence in them
  – A read from such a domain will map to the same domain in other genes too
  – Copy number increases will also cause multi-reads

Our real data as an example

Step 3: alignment/assembly – what else?
• You can:
  – Output your splice sites
  – Check read distributions across different chromosomes
• Most importantly, check the unaligned reads (they can be stored in separate output files):
  – BLAST them against all other genomes
  – Particularly bacteria/viruses
  – Or align them to spike-in sequences (like ERCC)
  – In all, make sure these reads are unaligned not because you set the wrong parameters, and understand their sources
• Visualize your alignment outputs:
  – Use the UCSC browser, or
  – Broad Institute IGV (recommended)

Step 3: alignment/assembly – visualization
• Sort and index your bam files, and load them into IGV
• First, pick a few well-known housekeeping genes, like GAPDH, to check
• Second, check some genes of interest to you
• You can even load other data types (like GWAS) and annotations (e.g. conservation scores)
• Many people ignore visualization and end up making serious mistakes
• Visualization is very informative, and produces publication-ready, multi-omics figures

Step 4: quantification
• The concept is simple: gene model + bam files → expression tables
• Tools:
  – Raw read counts: use HTSeq-count or featureCounts
  – Normalized read counts (e.g. FPKM): use RSEM or Cufflinks
• Important notes:
  – Make sure the same genome version is used throughout. Don’t use a GRCh37 (hg19) gene model with GRCh38 (hg38) bam files
  – Don’t convert between raw read counts and FPKM
• What else:
  – Check the genic vs.
non-genic read ratios
  – Generally, genic reads should be ~80%

Step 5: normalization
• Some simple facts:
  – Raw read counts tend to follow a Poisson/negative-binomial distribution
  – The variance scales with the mean
  – A log scale is used
  – A pseudo-count is usually added to genes, to avoid log(0)
  – Sometimes TPM (transcripts per million) is used: different biological assumptions
  – A minimum expression level is set: many use FPKM=1 as the minimum acceptable evidence of expression, which could be wrong, as it depends on library size
• Genes with too few expressing samples: excluded
• Same for samples
• Further normalization tricks:
  – Quantile normalization
  – Variance-stabilizing normalization

Examples (figures not recoverable):
• Normalization needed
• Normalized and comparable

How to analyze RNA-seq?
Clustering, DEGs, signature/marker, pathways

Step 1: visualization of expression tables
• By now you have converted GBs of fastq data into a table of expression values
• The heavyweight computation is finished; now on to the lightweight ones: use R
• Use all sorts of diagnostic plots to examine the characteristics of your expression tables:
  – Heatmap – check the `dropouts`, the gene patterns, etc.
  – Boxplots – check that the samples are properly normalized
  – Bar charts – check the number of genes expressed per sample
  – Dendrogram – check sample-wise clustering patterns
  – MA-plot – check fold change at different expression levels
  – Scatter plot – check sample reproducibility

Step 1: visualization – some examples
• Also check your expressed genes by gene family

Step 2: identification of `DEGs`
• DEGs = differentially expressed genes, thought to be the most biologically important in most studies
• Tools to detect them:
  – DESeq – needs raw read counts as input; biological replicates required
  – edgeR – also takes raw read counts as input, not FPKM
  – Cuffdiff – compares directly at the bam-file level!
  – limma – if you log-transform your FPKM, you can use limma too
  – SCDE – if your samples are single cells
• In case no replicates are available:
  – Use hard thresholding: a threshold on fold change, say at least a 10-fold change to be considered differentially expressed
  – Some statistical tests: Kal’s test (1999), but it inflates p-values a lot!

Step 3: functional analyses
• Pathway/GO term/gene-set enrichment:
  – IPA
  – DAVID
  – GSEA (recommended: really simple to use, credible results, comprehensive)
• Important notes:
  – Don’t use too many or too few genes
  – With too many (>2,000), you are bound to get some pathways, but not really biologically relevant ones
  – With too few (say <10), you will get nothing
  – Be careful with GO term analysis: it tends to give too many positives

Step 3: functional analyses (continued)
• Integrate with other omics data: GWAS, ChIP-seq
• Compare with data from a different species, e.g. human vs. mouse
• Molecular validations: knock-down, knock-out and knock-in

What is trendy?
Single-cell RNA-seq, non-coding RNAs, eRNAs …
• Single-cell RNA-seq data:
  – Offer unprecedented resolution of cellular heterogeneity
  – Can identify subpopulations, establish their lineage, and identify their signature genes
  – Many old techniques don’t apply; new tools are quickly being developed
  – An emerging technology with challenges: unstable quality, huge dropout rates
• Non-coding RNAs:
  – Intergenic transcripts
  – Don’t occur a lot in major cell types
  – Lowly expressed
  – Some are enhancer RNAs
  – Could have regulatory roles

Summary
• RNA-seq is the latest technology for massive transcriptomic profiling
• It is better than old technologies like microarrays, and getting cheaper
• Proper processing is needed to reduce technical noise, avoid biases, and delineate biological variation
• Use conventional tools, or develop your own methods, to perform functional analyses
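
Supplementary sketch: the fastq layout from Step 1
The four-line fastq record described in Step 1 can be illustrated with a few lines of Python. This is only a minimal sketch: the function names (`parse_fastq`, `mean_quality`) are made up for illustration, the toy reads are invented, and the Phred+33 offset is an assumption (the slides do not say which quality encoding the data use).

```python
from typing import Iterator, List, Tuple

def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from fastq text lines."""
    # Every four lines form one read: ID, sequence, '+' separator, quality.
    for i in range(0, len(lines) - 3, 4):
        header, seq, _sep, qual = (s.rstrip("\n") for s in lines[i:i + 4])
        read_id = header[1:] if header.startswith("@") else header
        # ASSUMPTION: Phred+33 encoding (quality = ASCII code minus 33).
        scores = [ord(ch) - 33 for ch in qual]
        yield read_id, seq, scores

def mean_quality(scores: List[int]) -> float:
    """Average Phred score of one read."""
    return sum(scores) / len(scores)

# Two invented reads, for illustration only.
example = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "GGCA", "+", "!!II",
]
records = list(parse_fastq(example))
for read_id, seq, scores in records:
    print(read_id, seq, scores, mean_quality(scores))
```

For real data, FastQC (as recommended in Step 2) remains the practical choice; the sketch only shows what the quality line encodes.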
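
Supplementary sketch: FPKM, TPM and the log pseudo-count from Step 5
The normalization ideas in Step 5 can be written out as a small worked example. This is a sketch under stated assumptions rather than what RSEM or Cufflinks do internally: the library size is taken to be the sum of the counts in the table, gene lengths are plain numbers in bp, and the function names and toy values are hypothetical.

```python
import math
from typing import List

def fpkm(counts: List[float], lengths_bp: List[float]) -> List[float]:
    """Fragments per kilobase of transcript per million mapped fragments."""
    # ASSUMPTION: library size = sum of counts in this table.
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]

def tpm(counts: List[float], lengths_bp: List[float]) -> List[float]:
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    rate_sum = sum(rates)
    return [r * 1e6 / rate_sum for r in rates]

def log2_with_pseudocount(values: List[float], pseudo: float = 1.0) -> List[float]:
    """Log scale with a pseudo-count, so that zero expression maps to log2(pseudo)."""
    return [math.log2(v + pseudo) for v in values]

# Invented counts and lengths for three genes.
counts = [100, 300, 0]
lengths = [1000, 3000, 500]

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
log_values = log2_with_pseudocount(fpkm_values)
print(fpkm_values, tpm_values, log_values)
```

Genes 1 and 2 have different raw counts but the same FPKM and TPM, because the longer gene collects proportionally more reads; TPM values always sum to one million within a sample.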
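
Supplementary sketch: hard fold-change thresholding without replicates
The no-replicate fallback from Step 2 (call a gene differentially expressed only if it changes at least 10-fold) might look like the following. This is an illustrative sketch, not a statistical test: the function name, the pseudo-count of 1 and the expression values are all assumptions, and the result carries no p-value.

```python
import math
from typing import Dict

def hard_threshold_degs(
    expr_a: Dict[str, float],
    expr_b: Dict[str, float],
    min_fold: float = 10.0,
    pseudo: float = 1.0,
) -> Dict[str, float]:
    """Return {gene: log2 fold change (b over a)} for genes passing the cutoff."""
    cutoff = math.log2(min_fold)
    calls = {}
    # Only genes measured in both samples can be compared.
    for gene in sorted(expr_a.keys() & expr_b.keys()):
        # ASSUMPTION: a pseudo-count avoids division by zero and log(0).
        lfc = math.log2((expr_b[gene] + pseudo) / (expr_a[gene] + pseudo))
        if abs(lfc) >= cutoff:
            calls[gene] = lfc
    return calls

# Invented expression values (e.g. FPKM) for two unreplicated samples.
sample_a = {"G1": 9.0, "G2": 50.0, "G3": 0.0}
sample_b = {"G1": 199.0, "G2": 60.0, "G3": 4.0}

degs = hard_threshold_degs(sample_a, sample_b)
print(degs)
```

Note how the pseudo-count keeps a lowly expressed gene such as G3 (0 vs. 4) from being called: its apparent change is large only because the baseline is near zero.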