Thoughts on Integrative Genome Analysis -- as Stimulus for a Discussion towards Consortium Publication(s)

AWG Members: Srinka Ghosh, Sandra Michelsen, David MacAlpine, Matt Eaton, Steve Henikoff, Ann Hammonds, Gos Micklem, Manolis Kellis, Peter Park, Xiaole Liu, Mark Gerstein, Sue Celniker, Eric Lai, Kris Gunsalus, Bob Waterston, David Miller, Lincoln Stein
modENCODE Consortium meeting, Rockville, MD, 2008.06.17, 9:15-10:30 (10' near beginning of session)
Slides downloadable from Lectures.GersteinLab.org. (Please read permissions statement.) Paper references mostly from Papers.GersteinLab.org.
(Quick overview of the ENCODE pilot results, focusing on G&T results, pgenes + DART. [I:ENCODE], fit into time)
Do not reproduce without permission (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu modENCODE.org

Purpose of the Session
• Our charge: "NHGRI would like the Consortium to think early on about what would be involved in an integrative paper and what would be the steps forward to accomplish this goal."
What integrative analyses do we want to do? What have we done, a year in?
What do we need to do?
• Data freezes? Analysis discussions?
• Bringing in new types of data or analysts?
• Fast-track particular experiments or analyses?

Specific Things to think about
• What were integrative analyses in the framework of pilot ENCODE? (case study)
• Where is modENCODE with respect to these?
2 case studies on microRNAs in fly (E Lai) and the 3' UTRome in worm (K Gunsalus)

What is Integrative Analysis?
Brief presentations to give us "data" for our deliberations
• Case study in the annotation of unannotated transcription and relating it to other annotation

A Tale of TARs (& TxFrags)
• Vignette drawn from activity in pilot ENCODE Genes & Transcripts group (Gingeras TR, Guigó R, Snyder M, Birney E, Zhang ZD, Reymond A, Kapranov P, Rozowsky J, Zheng D, Castelo R, Frankish A, Harrow J, Ghosh S...)

Different types of analyses carried out in the modENCODE/ENCODE consortia
• Tech: Development of Sequence (and Array) Technology
• Production: Output of Production Pipelines and Surveying of Single Type of Annotation
• Integrated: Integrated Analysis Connecting Different Types of Annotation
Where are we now in modENCODE?

A Starting Point: Noisy Raw Signal from Tiling Arrays
Spectrum: Tech, Production
Li et al., PLoS ONE (2007); Johnson et al. (2005) TIG, 21, 93-102.

Signal Processing to Normalize and Standardize Signals to Get a Useable and Cross-experiment Comparable Signal Map
Spectrum: Tech, Production
• Array data can be normalized by mean, median, quantile, &c. How to do this consistently?
• Tile scoring using a (smoothing) sliding window generates the signal map and the P-value map.
Zhang et al. (2007) Genome Biology. Source: Bolstad, B.M., et al. (2003), Bioinformatics, 19, 185-93.

Calibrating Error Rates for Each Platform
Spectrum: Tech
• Sens. vs. Spec. for different platforms
• How does this extrapolate genome-wide, across samples?
• Attempting to score experiments in uniform fashion
• Understand "lab effects" vs. real ones
• Where is NextGen seq. technology in rel. to this?
Emanuelsson et al. ('07) Gen. Res.

Segmentation: Finding Discrete Annotation Blocks (TARs/TxFrags) from Processed Signal
Spectrum: Production
• Iterative Process of Building a Model, Segmenting Signal into discrete, easily useable "hits" (TARs/TxFrags), validating some of them
• Defining consistent definitions of "Hits" and TARs (e.g. point sources)
• Defining consistent thresholds
[Du et al. (2006) Bioinformatics; Fig. from Gerstein et al., Gen. Res. ('07)]

Statistics on the TxFrags: Surveys of a Single Type of Annotation
Spectrum: Tech, Production
• Annotated and unannotated TxFrags detected in different cell lines.
[ENCODE Consortium, Nature 447, 2007]

Phylogenetic Profiles or More Developed Annotation: Clustering and Classifying Blocks of Un-annotated Transcription into larger units
Spectrum: Production, Integrated
Rozowsky et al. Gen. Res. (2007)

Vast Amounts of Different Data Types to Integrate in pilot ENCODE
• Determining experimental signals for biochemical activity across each base of genome
• Large-scale sequence comparison in relation to the human genome

Feature Class | Expt. Tech. | Numb. Expt. Data Pts.
Transcription | Tiling array, Integrated annotation | 63,348,656
5′ Ends of transcripts | Tag sequencing | 864,964
Histone modifications | Tiling array | 4,401,291
Chromatin structure | QT-PCR, Tiling array | 15,318,324
Sequence-specific factors | Tiling array, tag sequencing, Promoter assays | 324,846,018
Replication | Tiling array | 14,735,740
Computational analysis | Computational methods | NA
Comparative sequence analysis | Genomic sequencing, multi-sequence alignments, computational analyses | NA
Polymorphisms | Resequencing, copy number variation | NA
[ENCODE Consortium, Nature 447, 2007]

Connecting TARs (TxFrags) in Integrative fashion to different types of Annotation
Spectrum: Integrated
[Figure: genome browser view; track labels: Composite ChIP hit, Special ψG (pseudogene) tracks in browser, diTAG, CAGE, TARs, ChIP-chip]
• Single Ex. of Pseudogene Intersecting with Transcriptional and Regulatory Evidence
• Are integrated experiments comparable -- i.e. done on consistent cell lines, on same coordinate sys., &c.?
Zheng et al. (2007) Gen. Res.

Integrating Transcriptional Evidence with Gene Annotation and Sequence Constraints
Spectrum: Integrated
Avg. Integration over many instances
• No Greater Tendency for Transcribed Pseudogenes to be under Selective Constraint
• Need a way of easily defining degree of constraint on sequence (not so easy for non-coding)
[Figure: Ka/Ks; Measurement of Short-time variation (pN+pS); legend: Processed pseudogene, Non-processed pseudogene, Gene, Transcribed]
Zheng et al. (2007) Gen. Res.

Biochemically Active Regions Don't All Appear to be Under Constraint
Spectrum: Integrated
• Integrating & averaging results over larger and larger sets
• Comparison of integrated quantities
[ENCODE Consortium, Nature 447, 2007]

Grand Summary: Biochemical Activity vs. Sequence Constraints
Spectrum: Integrated
• Not all constrained sequence annotated in some fashion
• Exactly how are things defined in terms of overlap?
• "At the outset of the ENCODE Project, many believed that the broad collection of experimental data would nicely dovetail with the detailed evolutionary information derived from comparing multiple mammalian sequences to provide a neat ‘dictionary’ of conserved genomic elements, each with a growing annotation about their biochemical function(s). In one sense, this was achieved; the majority of constrained bases in the ENCODE regions are now associated with at least some experimentally-derived information about function. However, we have also encountered a remarkable excess of unconstrained experimentally-identified functional elements, and these cannot be dismissed for technical reasons. This is perhaps the biggest surprise of the pilot phase of the ENCODE Project, and suggests that we take a more ‘neutral’ view of many of the functions ..." [ENCODE Consortium, Nature 447, 2007]

An Added Challenge for modENCODE: Comparing Results among Evolutionarily Distant Organisms
Spectrum: Integrated ++
• Making thresholds and statistics comparable across organisms (so we can really say whether or not worm has more or less novel transcription than fly)
• Can we relate tissues and developmental states betw. organisms?
• Can we deal with seq. constraint in a uniform fashion?
• Defining orthologs betw. organisms and lineage-specific genes
• Proper liaison with ENCODE

Tale of TARs: Where is modENCODE?
Spectrum: Tech, Production, Integrated ++ ?
• Where are we on Tech, Production, Integrated Spectrum?
  More action on seq. tech call than AWG
  Tech stuff is important, towards getting useable, comparable error-parameterized annotation blocks
• Importance of comparative genomics issues
  Constraint, orthologs
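
Illustrative sketch (not from the slides): the Tech-to-Production steps in the TAR vignette (quantile normalization across arrays, sliding-window tile scoring, and thresholded segmentation into TAR-like blocks) can be made concrete with a small pure-Python example. Everything named here is an assumption for illustration: the function names, the use of a plain median as the window statistic, and the `threshold`, `max_gap`, and `min_run` parameters are hypothetical and are not the consortium's actual pipeline or settings.

```python
from statistics import mean, median


def quantile_normalize(arrays):
    """Give every array the same empirical distribution.

    Each value is replaced by the mean, across arrays, of the values that
    hold the same rank. Ties are broken by probe position (a simplification).
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [mean(s[r] for s in sorted_arrays) for r in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


def sliding_window_score(signal, half_width=1):
    """Smooth a per-probe signal with a centered sliding-window median.

    The window is truncated at the ends of the signal. The plain median is
    an assumption; real pipelines may use a different robust estimator.
    """
    scores = []
    for i in range(len(signal)):
        lo = max(0, i - half_width)
        hi = min(len(signal), i + half_width + 1)
        scores.append(median(signal[lo:hi]))
    return scores


def segment(scores, threshold, max_gap=1, min_run=2):
    """Return (start, end) probe-index pairs, end inclusive, of hit blocks.

    Probes scoring >= threshold are hits. Hits separated by at most max_gap
    sub-threshold probes are merged into one block, and blocks spanning
    fewer than min_run probes are discarded.
    """
    blocks = []
    start = last_hit = None
    for i, s in enumerate(scores):
        if s < threshold:
            continue
        if start is not None and i - last_hit - 1 <= max_gap:
            last_hit = i  # extend the current block across a small gap
        else:
            if start is not None:
                blocks.append((start, last_hit))
            start = last_hit = i  # open a new block
    if start is not None:
        blocks.append((start, last_hit))
    return [(a, b) for a, b in blocks if b - a + 1 >= min_run]


# Toy usage: two replicate "arrays" over the same nine probes.
replicates = [[0, 5, 6, 0, 5, 0, 0, 0, 5], [1, 6, 7, 1, 6, 1, 1, 1, 6]]
normalized = quantile_normalize(replicates)
smoothed = sliding_window_score(normalized[0], half_width=1)
blocks = segment(smoothed, threshold=3)
```

Making these three choices explicit (which normalization, which window, which thresholds) is the kind of consistency the slides ask for before blocks from different labs, platforms, or organisms are compared.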