* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Ensembl. Going beyond A,T, G and C
Public health genomics wikipedia , lookup
Gene expression programming wikipedia , lookup
Protein moonlighting wikipedia , lookup
Genome (book) wikipedia , lookup
No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup
Genomic library wikipedia , lookup
Gene desert wikipedia , lookup
Transposable element wikipedia , lookup
Cre-Lox recombination wikipedia , lookup
Short interspersed nuclear elements (SINEs) wikipedia , lookup
Vectors in gene therapy wikipedia , lookup
Cancer epigenetics wikipedia , lookup
Metagenomics wikipedia , lookup
Epigenetics of diabetes Type 2 wikipedia , lookup
Minimal genome wikipedia , lookup
Epigenomics wikipedia , lookup
Histone acetyltransferase wikipedia , lookup
Pathogenomics wikipedia , lookup
Gene expression profiling wikipedia , lookup
Point mutation wikipedia , lookup
Human genome wikipedia , lookup
History of genetic engineering wikipedia , lookup
Designer baby wikipedia , lookup
Transcription factor wikipedia , lookup
Long non-coding RNA wikipedia , lookup
Microevolution wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Epigenetics in learning and memory wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Polycomb Group Proteins and Cancer wikipedia , lookup
Epigenetics of neurodegenerative diseases wikipedia , lookup
Genome evolution wikipedia , lookup
Genome editing wikipedia , lookup
Non-coding DNA wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Primary transcript wikipedia , lookup
ENCODE: understanding our genome Ewan Birney The ENCODE Project Consortium Biosapiens Network of Excellence QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. ENCODE experiments Area Assay Groups Proteins Manual annotation, Guigo, RT-PCR Harrow+Hubbard, Reymond Transcripts Tiling Arrays Gingeras, Snyder Transcripts Tag seq. Yijun, Riken General Chromatin Marks Tiling Arrays, ChIP Dunham, Reng Sequence sp. Factors Tiling Arrays, ChIP Snyder, Gingeras, Farnham, Dunham DNaseI sens. PCR, Tiling arrays Stam. , Crawford Replication Tiling arrays Dutta Conservation Comparative sequence Green, Sidow, Miller DNA structure Hydroxyl radical Leib Promoter Reporter assays Myers ENCODE Pilot • Considered too expensive and too risky to decide on winning technologies (started in 2004) • 1% of the genome (30MB) chosen - all experiments on the same 1% • Pilot phase ended – Analysis and publication – Scale up to genome wide now funded A lot of Chip/Chip Nowdays, a lot of Chip/seq Transcription Transcription • Lots of it – And not all of it genes – And even when it is inside a gene, not all of it with open reading frames – And even when it has an open reading frame, not all of it making sense! (evolutionary or structurally) • Not technical false positives Protein coding loci are far more complex than we think • On average 5 transcripts per locus • Many do not encode proteins (as far as we can see) • Even the ones which do encode proteins, many of these proteins look “weird” Unplausible structures Many effects on potential function QuickTime™ and a TIFF (LZW) decompressor are needed to see this picture. Signal peptides, TM Helices • 1097 protein transcripts from 487 loci – 219 have signal peptides (107 loci) – 12 loci have an isoform without the signal peptide – 41 transcripts have a gain or loss of a tansmembrane helix (sometimes up to 8!) The Clade B Serpins a inactive, "stressed" Potential Missing fragments b active (beta inserted) (c) (d) (e) (f) Transcription Start Sites Technologies on TSS Gencode Manual Ann. Unbiased TxFrag Ditag data Cage data Histone mod. Dnase I sens Sequence sp Factors (eg Myc) Integration Strategy Anchor on 5’ ends GenCode 5’ and CAGE/DiTag Categorise and assess using Transcript based evidence Exons, TxFrags, CpG islands Assess categories with Histone and TF data 16,051 unique TSS 8,587 TSS “tight clusters” 5 different classes First 4 low-Pvalues First 4 categories have Biological signals: 4,491 TSS TSS Categories Category GenCode 5’ Number P-value of (non-redundant) overlap 1730 2e-70 Exon(sense) 1437 6e-39 Exon(anti) 521 3e-8 TxFrag 639 7e-63 CpG 164 4e-90 No support 2666 GenCode 5’ ends Unsupported tags Novel TSSs Conclusion • There are 4,418 TSS with multiple lines of evidence supporting them • This is ~10 fold more than the number of Genes • Only 38% would be traditionally classified as TSS (less if one took Ensembl or RefSeq) Implications of many more TSSs • Consistent with considerable diversity of transcripts • Independently integrating Chip/Chip data suggested ~1,000 “Regulatory Clusters” – 25% proximal considering Ensembl/Refseq – 65% when this TSS catalog is considered More subtle conclusions • Sequence specific factors are distributed symmetrically around the TSS – Should we only be taking upstream regions for reporter genes? • Histone information is highly correlated with gene on/off status – Generalising many locus specific studies Gene On/Off Gene status prediction Distal sites Finding distal sites • Chip/Chip not “great” – Most look close to one of these new TSSs – Factor bias? • DNaseI Hypersenstive Sites – All factors give a DHS signal – 55% of DHSs are distal to any TSS Distal DHS Most surveyed factors are proximal Replication H3K27me3 is correlated Evolutionary conservation and ENCODE Evolutionary conservation …but not everything is constrained Why is there a discrepancy? • False positives in the experiments – But experiments validate at >80% and crossvalidate each other • False negatives in the constraint detection – But can detect up to 8bp elements, and within “neutral” zone of alignability • Neutral turnover model Neutral biochemical events Time Lineage specific Time “Functional” conservation Mouse Human Special case: Transcription Constrained sequence Gene Regulatory Information Constrained sequence Pre-miRNAs What should we learn from ENCODE • “whacky” transcription is real (but god knows what it does) – Unconventional Transcript • Lots more TSSs than we understand – Many “distal” regions are actually close to promoters • Broad specificity marks are more useful – DNaseI sites, Histone marks Neutral model for biochemical events on the genome • Because things happen reproducibly in multiple tissues does not imply selection • (this is not the same as experimental variance) • Could imply “functional” conservation outside of orthologous bases – Comparative genomics sequencing not enough (but a great starting point!) – Comparative functional investigation Consortia work • ENCODE – Experimentally lead consortia – Needs a lot of computational collaboration • Biosapiens – Computationally lead consortia – Needs experimental collaboration (!) • DNA: ENCODE • Protein: Biosapiens What happens next? Ensembl Regulatory Build Chr 14, 5677077-567896 elements GM06990 Cells, Myc bound Status Initial Regulatory Build • DNaseI Hypersenstive sites, 6 histone modifications, CTCF binding • ~110,000 elements, ~2MB of DNA • 6,000 “promoter associated” by inherent pattern (DNaseI + H3K36me3) • Available now • This year: Mouse, More classification Regulatory build Ensembl - at your service • Web browser www.ensembl.org • MySQL DB access • BioMart • “Geek for a week” – You send someone to use for a week • Xose for a day – We send someone to you for a day The ENCODE Project Consortium Damian Keefe, Yutao Fu, Zhiping Weng, Mike Snyder, Elliott Marguilles, John Stam., Manolis Dermitzakis, Tom Gingeras, Roderic Guigo, Ian Dunham, Christophe Koch, Anindya Dutta Paul Flicek and 293 others… The Biosapiens Network of Excellence Michael Tress, Alfonso Valencia, Janet Thornton, Roderic Guigo, Soren Brunak, David Jones, Martin Vingron, Anna Tramontano, Jacques van Helden and 57 others…