Download Gene expression pipelining, applications and the wisdom

Genomic sequence of the pathogenic and allergenic filamentous fungus Aspergillus fumigatus, Nierman et al., Nature, 2005

Gene expression pipelining, applications and the wisdom of crowds
Carlos De Niz

What is gene expression?
[Image taken from Konrad J. Karczewski's website]

Outline
• Background
• Basic steps of a pipeline
• Transcriptome normalization and obtaining missing data
• The wisdom of crowds
• Examples and applications

Background
• The Cancer Genome Atlas (TCGA) kicked off in 2005 as a $100M pilot of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI):
  o Create an atlas of changes for specific cancer types
  o Pool the results
  o Make the data freely available worldwide
• DNA = A, G, C and T: these 4 "letters" give us a lot of headaches
• Cancer is a disease of the genome. Hallmarks of cancer (Gutschner):
  (1) sustaining proliferative signaling
  (2) evading growth suppressors
  (3) enabling replicative immortality
  (4) activating invasion and metastasis
  (5) inducing angiogenesis
  (6) resisting cell death
[Image: TCGA Genomics Brochure]
• Transcript abundance: RNASeq
• For the TCGA consortium, the Cancer Genomics Hub (CGHub, UCSC) is a repository for storing, cataloging, and accessing cancer genome sequences, alignments, and mutations
• How to find data: barcode and Universally Unique Identifier (UUID)
  TCGA-D8-A1JJ-01A-31R-A14M-07 -> 0307bd0b-b59a-4996-b89d-612e72652890

RNASeq is becoming more popular in clinical use

Transcript Quantification from RNASeq - Data Analysis Pipeline
• RNASeq produces millions of reads (ranging from 30-400 bp up to 10-15 kb) by sampling fragments of RNA: Illumina, SOLiD, 454, PacBio, etc.
• Once the sequencing is done, the tasks to achieve are:
  o Mapping/aligning the reads to a reference genome or transcriptome (or de novo assembly if there isn't one)
  o Estimating abundance at the gene/isoform level
  o Differential expression, mutations, SNVs, gene fusions, SNPs, TE ID, etc.

Pipeline (diagram):
  Reference genome (FASTA) [Bowtie index] + RNASeq reads (FASTQ)
    -> Alignment [Bowtie, Mapsplice]
    -> Reads aligned to the genome (SAM/BAM) + reference transcriptome (GFF/GTF)
    -> Expression / abundance estimation [RSEM, Cufflinks]
    -> Differential expression [Cuffdiff]

Each of these stages has to take place step by step, which is why the overall process is called a pipeline.

Mapping and Gene Expression
• De novo splice aligners (~6 hrs) - Mapping
  o TopHat
  o Mapsplice
  o Subread
  o STAR
• Quantitative analysis and differential expression - Gene Expression (GExp)
  o RSEM ~5 hrs
  o Cufflinks ~4 hrs
  o eXpress ~7 hrs
  o Sailfish - k-mers approach (no mapping required) ~0.09 hrs
  Platforms: Unix based; Matlab and R-based
The core of most of these programs is based on the Burrows-Wheeler Transform (mapping) and Expectation Maximization (GExp).

RNASeq - Normalized measuring units

$$\mathrm{FPKM/RPKM\ for\ gene\ } i = \frac{10^{9}\, c_i}{l_i'\, N}$$

where:
• RPKM = Reads Per Kilobase per Million mapped reads
• FPKM = Fragments Per Kilobase per Million mapped reads
• c_i = number of reads mapping to transcript i
• N = total number of mappable reads
• l_i' = length of transcript i (in bases, which is where the factor 10^9 = 10^3 x 10^6 comes from)

What happens when we have missing data?
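Before turning to missing data, the FPKM/RPKM definition above can be made concrete with a minimal Python sketch. The function name `fpkm`, the dictionary layout, and the toy numbers are assumptions made for illustration; they do not come from the slides or from any particular tool, and N is simply taken as the sum of the counts in the toy table.

```python
def fpkm(counts, lengths):
    """Compute FPKM/RPKM per transcript as 1e9 * c_i / (l_i * N).

    counts  -- dict: transcript id -> reads (or fragments) mapped to it (c_i)
    lengths -- dict: transcript id -> transcript length in bases (l_i')
    N is taken here as the sum of all counts (a simplifying assumption).
    """
    total_mapped = sum(counts.values())  # N
    if total_mapped == 0:
        raise ValueError("no mapped reads: FPKM is undefined")
    return {
        tx: 1e9 * c / (lengths[tx] * total_mapped)
        for tx, c in counts.items()
    }


# Made-up toy values, for illustration only (not real data).
toy_counts = {"tx1": 100, "tx2": 300, "tx3": 600}
toy_lengths = {"tx1": 1000, "tx2": 2000, "tx3": 3000}

toy_fpkm = fpkm(toy_counts, toy_lengths)
for tx, value in toy_fpkm.items():
    print(tx, value)
```

In this toy table tx3 has six times the reads of tx1 but is three times longer, so its FPKM comes out only twice as large. That length correction is the point of the normalization.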
Expectation Maximization (EM)
• EM is a method to find the maximum likelihood estimator of a parameter θ of a probability distribution
• Putting this into a practical context:
  o Let's say the temperature outside your room's window during the 24 hours of the day, x ∈ ℝ^24, depends on the season θ ∈ Θ = {summer, fall, winter, spring}, and that we know the seasonal temperature distribution p(x | θ) (with some West Texas exceptions of course, because #Lubbock)
  o But let's assume we can only measure the average temperature y = x̄ for the day
  o TASK: we want to guess what season θ it is
  o The maximum likelihood estimate of θ maximizes p(y | θ). In some cases it may be hard to find
• That's when EM is useful! EM takes the observed data y, iteratively makes guesses about the complete data x, and then finds the θ that maximizes p(x | θ) over Θ
• EM tries to find the maximum likelihood estimate of θ given y
• EM doesn't actually promise to find the θ that maximizes p(y | θ), but there are some theoretical guarantees, and it often does a good job in practice. However, it may need a little help in the form of multiple random starts
• In the standard formulation, each iteration m consists of two steps:
  o Expectation: $Q(\theta \mid \theta^{(m)}) = E\left[\log p(x \mid \theta) \mid y, \theta^{(m)}\right]$
  o Maximization: $\theta^{(m+1)} = \arg\max_{\theta \in \Theta} Q(\theta \mid \theta^{(m)})$

Required Data and Results
• FASTQ: a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores (3-10 GB)
    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCA
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
• FASTA: for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes (3 GB)
    NM_207127 2
    CCTCGCTCCGCCTCCGGCCTCCTCCGAGAGCTCCAGACCTCCCGGCTACTCAGAAGCCCTCGGACTGCCCGGACCGCGC
• GTF/GFF: contains 9 columns of data; each line describes one feature. Version 2 spec (20-100 MB)
    chr20 scripture exon 61747569 61747837 . + . gene_id "XLOC_013608.1"; transcript_id "TCONS_00028587.1"; exon_number "1"; oId "TCONS_00024272"; linc_name "linc-BIRC7-2"; tss_id "TSS21239"; class_code "u"; gene_name "linc-BIRC7-2";
• Annotation txt (for RSEM only): contains the names associated with the gene ID and the transcript ID
    XLOC_013608.1 TCONS_00028587.1
    XLOC_013608.2 TCONS_00028587.2
• BAM: a BAM file (.bam) is the binary version of a SAM file. A SAM file (.sam) is a tab-delimited text file that contains sequence alignment data
    UNC11-SN627_66:4:47:2750:9058/1 339 chr1 10061 69 50M = 10179 168 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAA DFAFFFDF@FHFHGFHHHFHEHHHHGCGGGHHFHHHHHHHHHHHHHHHHH RG:Z:110302_UNC11-SN627_0066_AB047KABXX_4_ IH:i:3 HI:i:3 NM:i:0
• Abundance reads: contains the transcript abundance per gene (~1 MB)
    gene_id  transcript_id(s)       length   effective_length  expected_count  TPM   FPKM
    A1BG|1   uc002qsd.3,uc002qsf.1  1976.30  1854.25           46.00           0.73  0.48

The Wisdom of Crowds (WOC)
[Figure: Francis Galton; individual guesses 1390, 987, 874, 1278, 977; mean ~1197 lb, real weight 1198 lb]

MAQC/SEQC Consortium Data
• The Sequencing Quality Control (SEQC/MAQC) project, along with the FDA, examined Illumina HiSeq, Life Technologies SOLiD and Roche 454 platforms at multiple laboratory sites, using reference RNA samples with built-in controls plus TaqMan and PrimePCR verification (for 843 selected genes):
  o Sample A - Universal Human Reference RNA
  o Sample B - Human Brain Reference RNA
• The images show the correlation between some of the RNA-Seq technologies, in order to compare gene expression consistency among them
• For Sample A, approximately 400 different samples were averaged for Illumina and 190 for Life Technologies
[Images: correlation plots between RNA-Seq technologies]
• Each plot also has a model fitted by linear regression (red line), which can help predict data that follow a linear trend

DREAM 5 - WOC (Dialogue on Reverse Engineering Assessment and Methods)
• Aggregation is robust and often better than the best performer (transcriptional gene regulatory networks)

BioViva - First gene therapy successful against human aging (April 21, 2016)
• Elizabeth Parrish, CEO of BioViva USA
• The first human to be 'successfully rejuvenated' by gene therapy, after her own company's experimental therapies reversed 20 years of normal telomere shortening
• Telomere score is calculated according to the telomere length of white blood cells (T-lymphocytes)
• The higher the telomere score, the 'younger' the cells
• Telomeres are short segments of DNA which cap the ends of every chromosome, acting as 'buffers' against wear and tear. They shorten with every cell division, eventually getting too short to protect the chromosome, causing the cell to malfunction and the body to age
• Her telomeres had lengthened by the equivalent of ~20 years, from 6.71 kb to 7.33 kb (the therapies aim to protect against loss of muscle mass and to battle stem cell depletion)

Takeaways
• Genomic data such as gene expression (transcriptome counts) is, among its many applications, becoming a helpful and popular clinical tool
• To obtain transcriptome counts, it is necessary to take raw data from the sequencers (FASTQ files) and pipeline it through a series of additional steps to assemble it and obtain gene expression
  o There are many programs available under different computational platforms that can be used
  o The results from the different programs depend on the assumptions they make, such as how they estimate missing data, or which underlying probability distribution they assume depending on whether the data come from technical or biological replicates (which can introduce bias), among many others
• The wisdom of crowds, or crowdsourcing, is an effective process that has proven to produce accurate results in many fields, based on the simple idea that the collective knowledge of a community is greater than the knowledge of any individual
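The aggregation idea behind the wisdom of crowds can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `aggregate_estimates`, the source names `tool_a`, `tool_b`, `tool_c`, and all numbers are hypothetical, and taking a per-gene median (or mean) is just one simple way to combine estimates, not the procedure used by the SEQC or DREAM 5 studies.

```python
from statistics import mean, median


def aggregate_estimates(estimates_by_source, how="median"):
    """Combine per-gene estimates from several sources into one consensus.

    estimates_by_source -- dict: source name -> {gene id -> estimate}
    how                 -- "median" (default) or "mean"
    Only genes reported by every source are kept.
    """
    if how not in ("median", "mean"):
        raise ValueError("how must be 'median' or 'mean'")
    reducer = median if how == "median" else mean
    tables = list(estimates_by_source.values())
    shared_genes = set.intersection(*(set(t) for t in tables))
    return {
        gene: reducer([t[gene] for t in tables])
        for gene in sorted(shared_genes)
    }


# Hypothetical expression estimates from three hypothetical programs.
toy_estimates = {
    "tool_a": {"GENE1": 10.0, "GENE2": 4.0},
    "tool_b": {"GENE1": 12.0, "GENE2": 5.0},
    "tool_c": {"GENE1": 20.0, "GENE2": 4.5},
}

consensus_median = aggregate_estimates(toy_estimates)
consensus_mean = aggregate_estimates(toy_estimates, how="mean")
print(consensus_median)
print(consensus_mean)
```

The median is less sensitive than the mean to a single outlying source (here `tool_c` on GENE1), which is one reason it is a common default when the individual estimators may be biased in different directions.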
References
1. The Cancer Genome Atlas (http://cancergenome.nih.gov/ , http://cancergenome.nih.gov/newsevents/newsannouncements/news_12_13_2005)
2. SAMtools (http://samtools.sourceforge.net/)
3. Bowtie: An ultrafast memory-efficient short read aligner (http://bowtie-bio.sourceforge.net/index.shtml)
4. TopHat: A spliced read mapper for RNA-Seq (http://ccb.jhu.edu/software/tophat/index.shtml)
5. RNA-Seq Tutorial 1 (https://www.msi.umn.edu/sites/default/files/RNA-Seq%20Module%201.pdf)
6. UUID (https://wiki.nci.nih.gov/display/TCGA/Universally+Unique+Identifier)
7. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome (http://www.biomedcentral.com/14712105/12/323, http://deweylab.biostat.wisc.edu/rsem/README.html)
8. The hallmarks of cancer: a long non-coding RNA point of view. Gutschner T, Diederichs S (http://www.ncbi.nlm.nih.gov/pubmed/22664915)
9. A survey of best practices for RNA-seq data analysis. Conesa A, et al. (http://genomebiology.biomedcentral.com/articles/10.1186/s13059-0160881-8)
10. Large Scale Comparison of Gene Expression Levels by Microarrays and RNAseq Using TCGA Data. Yan Guo, Quanhu Sheng, 2013
11. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. Bo Li and Colin N Dewey, 2011
12. Benchmarking RNA-Seq Quantification Tools. Raghu Chandramohan, Po-Yen Wu, 2013
13. http://bioviva-science.com/2016/04/21/first-gene-therapy-successful-against-human-aging/
14. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium (http://www.nature.com/nbt/journal/v32/n9/full/nbt.2957.html)
15. Wisdom of crowds for robust gene network inference (http://www.nature.com/nmeth/journal/v9/n8/full/nmeth.2016.html)
16. EM Demystified: An Expectation-Maximization Tutorial (https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2010-0002.pdf)
