Download Genomics Post-ENCODE

Nuevas perspectivas en análisis genomico: implicaciones del proyecto ENCODE Rory Johnson Bioinformatics and Genomics Centre for Genomic Regulation AEEH 21 / 2 / 14 1 This talk: • Our view of the human genome today thanks to ENCODE • What it means for translational research 2 Epigenetics: the intermediate between genome and phenotype 3 Our changing view of the genome 2000 2014 (Hong Kong) 4 Our changing view of the genome CAGGCATTAACCTTAGTCCTAATGGTTAGAGTCGTCCCTGATAATCTTAGTGAGGAAGGGACATTTCCAGAGTCGCCCAG CAGCAAATTCCAGATGTCTAAGGTCCCCAAACAGAACAAAATTGCATAAT Histones, + modifications Chromatin Transcription factors This organisation is encoded in non-protein coding genome sequence Enhancers 5 The Genome and Epigenome Genome sequence: Simple Static Epigenome sequence: Multi-layered Dynamic Cell-specific => Hence ENCODE 6 The human genome in numbers • 3 x10^9 base pairs • 20,345 protein coding genes • 13,870 Long noncoding RNA genes • 9013 Small noncoding RNA genes • 3x10^6 regulatory regions (enhancers) • 12,460 known trait-associated SNPs (short nucleotide variants) • 88% of trait-associated SNPs lie outside protein coding sequence 7 Next Generation Sequencing The high throughput reading of DNA or RNA. The main system now is Illumina Hiseq Statistics: Read length: ~150nt Reads per lane: ~150 million Lanes per run: 16 Total nt per run: ~400 billion Cost per run: ~16,000 euro (Human genome project took 13 years and $3billion to sequence 3 billion nt, ending 2003) http://www.labome.com/method/RNA-seq-Using-Next-Generation-Sequencing.html 8 NGS based methods for genome analysis: towards the clinic ChIP-seq (chromatin immunoprecipation) Transcription factor binding / chromatin state Dnase-seq Transcription factor binding / chromatin state RNAseq mRNA transcription / splicing Ribosome footprinting Translation rate Hiseq Genome 3D structure These methods have been demonstrated to be practical for continuous patient monitoring or diagnostics: • Rui et al Cell, Volume 148, Issue 6, 1293-1307, “iPOP” • Buenrostro et al Nat Methods Nature Methods 10, 1213–1218 (2013) “Using ATAC-seq maps of human CD4+ T cells from a proband obtained on consecutive days, we demonstrated the feasibility of analyzing an individual's epigenome on a timescale compatible with clinical decision-making.” 9 The ENCODE Project • ENCODE: Encyclopedia of DNA Elements (http://www.genome.gov/10005107) • International consortium dedicated to comprehensively mapping the human epigenome. • Created high quality ongoing gene annotations: GENCODE • 32 laboratories, $400million • In Spain: Roderic Guigo (CRG) was one of the leaders (with Tom Gingeras, CSHL) of the transcriptomics section. • 147 cell types (mainly transformed cell lines) • 1640 genome-wide datasets 10 ENCODE integrates multiple data types across cell types RNAseq Gene expression ChIP Chromatin ChIP Transcription Factors ChIA-PET Genome structure / folding GENCODE Gene annotation catalogue 11 Visualizing ENCODE data at the UCSC Genome Browser http://genome-euro.ucsc.edu/ 12 ENCODE data of relevance to hepatology Genes Chromatin Transcription Factors RNA ENCODE Tier 2: HepG2 cell line hepatocellular carcinoma (see http://www.genome.gov/26524238 for other cell types) Including: 8 RNAseq experiments 114 Transcription Factor ChIP experiments (inc CEBPB, HNF4A, HNF4G) http://genome-euro.ucsc.edu/ENCODE/dataMatrix/encodeDataMatrixHuman.html 13 Chromatin state is extremely cell type specific 14 Other projects of relevance: Epigenomics Roadmap Project 15 Other projects of relevance: eQTL • Gtex – Genotype Tissue Expression project • Hunting for genetic variants that influence gene expression  Linking genetic variants to changes in gene expression – regulatory variants or “expression quantitative trait loci” (eQTL)  These will be different between tissues 16 What does this mean for translational research? • Protein-focussed studies will miss the majority of functional disease causing variants / mutations • Non-coding variants will usually be regulatory • Non-coding variants will usually be cell type specific • Large projects like ENCODE are producing rich data that can be used to interpret clinical results ` 17 How can genetic variants (SNPs) in noncoding regions cause phenotype? • By altering the nucleotide sequence recognized by regulatory protein Hawkins et al Nature Reviews Genetics 11, 476-486 18 How can genetic variants (SNPs) in noncoding regions cause phenotype? Genetic Variant (SNP) Gene Expression Disease 19 How does ENCODE affect translational research projects? • Genome wide association study (GWAS) • Exome sequencing • Gene expression profiling 20 Translational research approaches 1: Genetic approaches Genomic approaches to identify genetic variants underlying disease: GWAS – genome wide association study Advantages Disadvantages Genome wide Depends on limited # of marker SNPs Not biased towards coding regions Low resolution Good at identifying common variants Does not yield insights into mechanism Exome sequencing – target genome sequencing Advantages Disadvantages Proteome wide No information about noncoding variants Can identify rare causative variants Likely missing most causative variants Usually yields mechanistic hypothesis High resolution 21 Interpretation of GWAS results GWAS gives an unbiased genome wide set of candidate SNPs The majority of these lie outside protein coding regions Two main challenges: 1. Identifying the causative SNP 2. Understanding the mechanism of action of that SNP Hepatocellular carcinoma Li et al PLoS Genet 8(7): e1002791. 22 Identifying the causative SNP using ENCODE data Hunt for the likely functional SNP in LD with marker Schaub et al Genome Res. 2012 Sep;22(9):1748-59. doi: 10.1101/gr.136127.111. e 23 Understanding the mechanism of a noncoding SNP using ENCODE data Schaub et al Genome Res. 2012 Sep;22(9):1748-59. doi: 10.1101/gr.136127.111. e 24 RegulomeDB: A web server for functional prediction of SNPs using ENCODE data 25 Exome sequencing Exome sequencing: targeted genome sequencing of protein coding exons Relies on capturing a selected subset of genome Advantages: • lower cost and • higher statistical power • can detect rare private mutations Disadvantages: • Presently ignoring the noncoding genome (~99%) 26 Exome sequencing: whats next? Whole genome sequence not likely to be practical: no statistical power Exome technology is highly customisable could be adapted to noncoding regions The main question: what are the target regions? • How to define the target space? • regulatory regions? • Noncoding RNAs? • Protein binding sites? • Likely to be organ / disease specific • Will require bioinformatic analysis to design reagents before experimental project begins. 27 Translational research approaches 2: Transcriptomic approaches ENCODE has made a major contribution to gene expression studies, by providing high quality annotations of novel noncoding genes through GENCODE. Microarray studies • Microarrays are restricted by the catalogue of probes chosen • Commercial arrays: usually protein coding genes • MicroRNA arrays available • Long noncoding RNA arrays available (CRG provide free designs) – based on ENCODE annotations 28 Translational research approaches 2: Transcriptomic approaches RNAseq • Unbiased > can discover novel RNAs • Can quantify expression of known and novel genes, and discover RNA from non “genic” loci • Analysis requires more bioinformatic analysis • Still more expensive than arrays 29 Translational research approaches 2: Transcriptomic approaches Problems: It is easy to discover and quantify the expression of novel genes It is difficult to understand the function of such genes We have no bioinformatic tools to predict the function of most novel ncRNAs We have limited experimental tools to investigate them 30 What does ENCODE mean for these studies? GWAS • GWAS study design will not likely be affected • ENCODE will allow better interpretation of discovered SNPs Exome • Whole genome cohort studies may never be feasible • Capture sequence approach can be redesigned to study noncoding variants in disease of choice • ENCODE and other public data will aid in the design of these projects Gene expression • New gene annotations can help in both microarray and RNAseq projects to discover novel noncoding gene targets. • RNAseq will eventually replace arrays as costs drop, but right now new array designs are competitive in large experiments and given bioinformatic requirements 31 Nothing would have been possible without… CRG Bioinformatics & Genomics ENCODE / GENCODE Roderic Guigó Jennifer Harrow Tim Hubbard (GENCODE, Sanger) Bioinformatics and Genomics group FUNDING Ramón y Cajal RYC-2011-08851 Plan Nacional BIO2011-27220 32 The main message of ENCODE To understand genotypes and phenotypes, we must look beyond the protein coding gene. Further reading: Interpreting noncoding genetic variation in complex traits and human disease •Lucas D Ward & Manolis Kellis •Affiliations Nature Biotechnology 30, 1095–1106 (2012) 33 How could variants in noncoding regions cause phenotype? • By altering the nucleotide sequence recognized by regulatory protein • By altering a noncoding RNA gene, either in expression levels or mature sequence Hawkins et al Nature Reviews Genetics 11, 476-486 Haas et al RNA Biol. 2012 Jun;9(6):924-37 34 Levels of genome regulation We now appreciate the genome is regulated at multiple levels: • “Epigenetically” – chromatin structure • Transcriptionally – RNA production • Post-transcriptionally – RNA processing (splicing, transport, stability) • Translationally – protein production at ribosome • Structurally – the folding structure of the genome => These sequences all have effects on phenotype and thus may contribute to disease => All of these are encoded in noncoding DNA sequence 35 Case study: Studying disease-associated regulatory SNPs incorporating cohort epigenome data A SNP for breast cancer creates a NFκB binding site Karczewski KJ et al Proc Natl Acad Sci U S A. 2013 Jun 4;110(23):9607-12 36

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Genomics Post-ENCODE