Genomics Algebra
A New, Integrating Data Model, Language, and Tool for Processing and Querying Genomic Information
Joachim Hammer and Markus Schneider
University of Florida
CIDR 2003, Asilomar, CA, Jan. 5-8, 2003

Overview
- Data Management Problems in Bioinformatics
- Proposed Solution
  - Genomics Algebra and Unifying Database
- Summary and Expected Impact

Bioinformatics
- Growing field of problems in biological sciences that require application of computing and mathematics
  - The term "bioinformatics" was coined in the mid-80s
- Genome Projects
  - Construct detailed genetic and physical maps of a variety of organisms
  - E.g., Human Genome Project
- Functional Genomics
  - What do genes do and how do they interact?
  - E.g., drug discovery, agro-food, pharmacogenomics (individualized medicine)

Why is Bioinformatics Important?
- Acquiring sequences is first step …
- Ultimate goal is to decipher structural, functional, evolutionary information encoded in language of biological sequences
  - Decoding an unknown language
  - Alphabet (amino acids), words (motifs), sentences (proteins)
- To date, unable to predict structure (i.e., words and sentences) from sequence
  - Mostly pattern-matching techniques: detect similarity between sequences and infer related structures and functions
  - Number of experimentally determined protein structures is VERY small

An Information Revolution …
- Emergence of rapid DNA sequencing and high-throughput gene analysis techniques
- Flood of genomic data
  - Nucleic acid and protein sequences, motifs, folding units, modules, interaction information, etc.
  - Complex data, e.g., sequential lists, deeply nested record structures, image & video data
- Data stored in more than 500 repositories
  - E.g., EMBL (150 GB, 2001), GenBank, SWISS-PROT, SANGER Centre (20 TB, 2001), …
- Sequence repositories increase 4x per year
- Known sequence data outweighs protein structural data ~100:1 (sequence/structure deficit)

… and the Resulting Problems for Biologists
- Scientists are overwhelmed by data which is awaiting further refinement and analysis
  - Number and size of available data sources continuously growing
  - Overlap and conflicting information
  - Proliferation of interfaces and portals
  - Familiar sources sometimes disappear or get merged
  - Little or no agreement on terminology
  - Unmanageable query results
- Forced to understand low-level data management
  - Often required to learn and write SQL or code in some other programming language (Perl)
- Noisy data
  - E.g., estimated that 30-60% of sequences in GenBank are erroneous

Corresponding CS Problems
- Management of heterogeneous, autonomous sources
  - Missing standard for genomic data representation
  - Formatted files prevail over conventional database representations (few sources use DBMSs)
  - Lots of redundancies and inconsistencies
  - Many different interfaces (e.g., Web-based, specialized GUIs and retrieval packages)
- Query languages not suitable for intended users
  - Limited interaction functionality of repositories
  - Query results are often unmanageable

CS Problems Cont’d
- Lack of extensibility of software managing sources
  - Not possible to integrate new, specialty evaluation functions
  - Extraction of new knowledge from existing sources without much computational support
  - Integration of new knowledge into repositories is tedious
  - E.g., no personal scratch pad that can be integrated with existing data
- Dealing with uncertainty and erroneous data
- Low-level treatment of data
  - Users manipulate strings and integers instead of genes and sequences
  - No high-level operations either
  - E.g., frameshift problem
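
The frameshift example above is easy to reproduce when sequences are handled as bare strings. The sketch below assumes "frameshift problem" refers to a single inserted (or deleted) base changing the reading frame, so that every downstream codon is read differently. The function name `codons` and the sample sequence are hypothetical and chosen only for illustration.

```python
from typing import List


def codons(seq: str, frame: int = 0) -> List[str]:
    """Split a nucleotide string into complete codons, starting at offset `frame`."""
    usable = seq[frame:]
    # Ignore a trailing partial codon (1 or 2 leftover bases).
    end = len(usable) - len(usable) % 3
    return [usable[i:i + 3] for i in range(0, end, 3)]


# Hypothetical sample sequence, not taken from any repository.
original = "ATGGCCATTGTA"
# Assumed sequencing error: one extra base inserted after the start codon.
shifted = original[:3] + "C" + original[3:]

print(codons(original))  # ['ATG', 'GCC', 'ATT', 'GTA']
print(codons(shifted))   # ['ATG', 'CGC', 'CAT', 'TGT']
```

With plain strings, nothing flags that the second reading is shifted; a high-level sequence type could carry the reading frame explicitly and offer operations that take it into account.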
State-of-the-Art
- Current research is focused mainly on integrating existing repositories
  - Federated and query-driven approaches (e.g., SRS, BioNavigator, DiscoveryLink, K2/Kleisli, Tambis, …)
  - Work on standardizing terminology and representations (e.g., Gene Ontology, EcoCyc, …)
- Analysis is performed outside of the repositories
  - Sequence similarity search: e.g., Basic Local Alignment Search Tool (BLAST) and its derivatives, …
  - Visualization tools: e.g., BEAUTY, BioWidgets, …
- Complex middleware tiers between end-users and the data servers
  - Inefficient, lots of user involvement (human query processor)

Iterative Query and Analysis
- While not done …
  - Construct a database query
  - Store query output
  - Analyze query results
[Flowchart: Query Relevant Database(s) → Store Query Output → Analyze Output → Done?]

Fundamental Challenge
- Development of a more principled approach to genomic data management
  - Leverage capabilities provided by modern DBMS
  - Services tightly integrated
  - Shields scientists from knowing low-level data management details as much as possible

Integrating Approach to Genomics Data Management
- Extensible Genomics Algebra
  - Formal data model, query language, and software for representing, storing, retrieving, querying, and manipulating genomic information
  - Provides a set of high-level genomic data types (GDTs) together with genomic operations or functions
- Unifying Database
  - Persistent storage for high-level, structured GDT values of Genomics Algebra
  - Warehouse for data from existing genomic repositories

Mini Genomics Algebra
types
  codon, aminoAcid, gene, primaryTranscript, mRNA, protein
operators
  decode: codon → aminoAcid
    “given a codon, computes the corresponding amino acid”
  transcribe: gene → primaryTranscript
    “given a gene, returns its primary transcript”
  splice: primaryTranscript → mRNA
    “given a primary transcript, removes its introns to produce the mRNA”
  translate: mRNA → protein
    “given a messenger RNA, determines the corresponding protein”
  . . .

What Can We Do with a Genomics Algebra?
- Can use the algebra to formally express existing biological operations
  - E.g., given a DNA fragment and a sequence, return true if the fragment contains the specified sequence: contains(frag, "ATTGCCATA")
- Create new operations using function composition
  - E.g., express central dogma of molecular biology as translate(splice(transcribe(g)))

Research Challenges
- What data types and operations do we need?
  - Need comprehensive ontology defining terminology, data objects, and operations
- Formalize definition of GDTs and operations
  - Vague or lacking knowledge of many biological processes makes this hard
- Implement algebra
  - Design of data structures and efficient algorithms for genomic operations
  - Must be extensible
  - Suitable for integration with a database system

Unifying Database
- Persistent storage manager for Genomics Algebra
- Integrated repository (warehouse) for genomics sources
  - GUS (U Penn) is only other known genomics warehouse prototype system
  - Provides superior query processing performance in multi-source environments
  - Ability to maintain and annotate extracted source data after it has been cleansed, reconciled and corrected
  - Option to preserve historical data from those repositories that do not archive their contents

Integrated System Architecture
[Architecture diagram: GUI; Genomics Algebra; DBMS-specific Adapter; Extensible DBMS (Oracle, DB2, …); Unifying Database with a public space and multiple user spaces; ETL from External Repositories (e.g., GenBank, NCBI, …)]

Implementation
- Adapter provides DBMS-specific coupling mechanism between Genomics Algebra and DBMS
  - Use UDT mechanism (opaque types and user-defined operators linked as external functions)
  - Supported by all major DB vendors
- User interface component consisting of
  - Biological query language together with graphical output
  - XML application as standardized exchange format for sharing genomics data

Research Challenges
- Design of the integrated schema
  - Iterative process with input from domain experts
- Detecting changes in underlying sources
  - Push capabilities are slowly being offered
  - Tools for computing what has changed
- Database maintenance
  - View maintenance problem
  - Derived data (annotations) based on update must be recomputed
  - Knowing provenance of data could be used to determine which annotations need to be recomputed

Vision and Expected Impact
- Advocate a “back to the roots” strategy of database technology for bioinformatics
- Fundamental change in way biologists analyze data
  - Single interface specifically designed for biologists
  - No need to become “computer scientists”
- New knowledge about design and implementation of biological type system and its operations
  - Demonstrate extensibility of modern DBMS
  - Help development of algebras for other applications
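
To make the Mini Genomics Algebra slide more concrete, here is a minimal Python sketch of its operators and of the composition translate(splice(transcribe(g))). It is not the authors' implementation. The class layouts, the exon-range representation, the helper name `express`, the sample gene, and the use of plain strings for codon and aminoAcid are all assumptions made for illustration. The codon table is only a small excerpt of the standard genetic code.

```python
from dataclasses import dataclass
from typing import Tuple

# Excerpt of the standard genetic code (RNA codons); "*" marks a stop codon.
CODON_TABLE = {
    "AUG": "M", "GCC": "A", "UUU": "F", "GGC": "G", "AAA": "K",
    "UAA": "*", "UAG": "*", "UGA": "*",
}


@dataclass(frozen=True)
class Gene:
    dna: str                            # coding-strand DNA (assumed representation)
    exons: Tuple[Tuple[int, int], ...]  # half-open [start, end) exon ranges


@dataclass(frozen=True)
class PrimaryTranscript:
    rna: str
    exons: Tuple[Tuple[int, int], ...]


@dataclass(frozen=True)
class MRNA:
    rna: str


@dataclass(frozen=True)
class Protein:
    residues: str                       # one-letter amino acid codes


def decode(codon: str) -> str:
    """decode: codon -> aminoAcid (both modeled as plain strings here)."""
    return CODON_TABLE[codon]


def transcribe(g: Gene) -> PrimaryTranscript:
    """transcribe: gene -> primaryTranscript (coding strand, so T becomes U)."""
    return PrimaryTranscript(g.dna.replace("T", "U"), g.exons)


def splice(t: PrimaryTranscript) -> MRNA:
    """splice: primaryTranscript -> mRNA, keeping only the exon ranges."""
    return MRNA("".join(t.rna[start:end] for start, end in t.exons))


def translate(m: MRNA) -> Protein:
    """translate: mRNA -> protein, reading codons until a stop codon."""
    residues = []
    for i in range(0, len(m.rna) - 2, 3):
        amino_acid = decode(m.rna[i:i + 3])
        if amino_acid == "*":
            break
        residues.append(amino_acid)
    return Protein("".join(residues))


def contains(fragment: str, sequence: str) -> bool:
    """True if the DNA fragment contains the specified sequence."""
    return sequence in fragment


def express(g: Gene) -> Protein:
    """New operation built by function composition (hypothetical name)."""
    return translate(splice(transcribe(g)))


# Hypothetical gene: exon "ATGGCC", intron "GTAAGT", exon "TTTGGCTAA".
gene = Gene("ATGGCC" + "GTAAGT" + "TTTGGCTAA", exons=((0, 6), (12, 21)))
print(express(gene).residues)                     # MAFG
print(contains("GGATTGCCATACC", "ATTGCCATA"))     # True
```

Using separate types for each stage means that an expression such as translate(transcribe(g)), which skips splicing, can be rejected by a static type checker instead of silently producing a wrong protein.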