Support on-the-fly bioinformatics data Integration

Supporting High-Performance Data Processing on Flat-Files
Xuan Zhang, Gagan Agrawal
Ohio State University

Motivation
• Challenges of bioinformatics integration
  – Data volume: overwhelming
    • DNA sequence: 100 gigabases (August, 2005)
  – Data growth: exponential
[Figure provided by PDB]

Existing Solutions
– (Relational) Databases
  • Support for indexing and high-level queries
  • Not suitable for biological data
– Flat Files with Scripts
  • Compact, Perl Scripts available
  • Lack indexing and high-level query processing
– Web-services
  • Significant overhead

Our Approach
• Enhance information integration systems on
  – Functionality
    • On-the-fly data incorporation
    • Flat file data process
  – Usability
    • Declarative interface
    • Low programming requirement
  – Performance
    • Incorporate indexing support

Approach Summary
• Metadata
  – Declarative description of data
  – Data mining algorithms for semi-automatic writing
  – Reusable by different requests on same data
• Code generation
  – Request analysis and execution separated
  – General modules with plug-in data module

System Overview
[Diagram. Labels: Understand Data; Process Data; Data File; User Request; Metadata Description (Layout Descriptors, Schema Descriptors); Layout Miner; Schema Miner; Code Generation; Request Processor; Information Integration System; Answer]

Advantages
• Simple interface
  – At metadata level, declarative
• General data model
  – Semi-structured data
  – Flat file data
• Low human involvement
  – Semi-automatic data incorporation
  – Low maintenance cost
• OK Performance
  – Linear scale guaranteed
  – Can improve by using indexing

System Components
• Understand data
  – Layout mining
  – Schema mining
• Process data
  – Wrapper generation
  – Query Process
  – Query Process with indices

Data Process Overview
• Automatic code generation approach
• Input
  – Metadata about datasets involved
  – Optional:
    • Implicit data transformation task
    • Request by users
    • Indexing functions
• Output
  – Executable programs
    • General modules
    • Task-specific data module

Metadata Description
• Two aspects of data in flat files
  – Logical view of the data
  – Physical data organization
• Two components of every data descriptor
  – Schema description
  – Layout description
• Design goals
  – Powerful
  – Easy for writing and interpretation

Schema Descriptors
• Follow XML DTD standard for semi-structured data
    <?xml version='1.0' encoding='UTF-8'?>
    <!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
    <!ELEMENT ID (#PCDATA)>
    <!ELEMENT DESCRIPTION (#PCDATA)>
    <!ELEMENT SEQ (#PCDATA)>
• Simple attribute list for relational data
    [FASTA]                 //Schema Name
    ID = string             //Data type definitions
    DESCRIPTION = string
    SEQ = string

Layout Descriptors
• Overall structure (FASTA example)
    DATASET "FASTAData" {        //Dataset name
      DATATYPE {FASTA}           //Schema name
      DATASPACE LINESIZE=80 {
        // ---- File layout details go here ----
      }
      DATA {osu/fasta}           //File location
    }

Wrapper Generation System Overview
[Diagram. Labels: Layout Descriptor; Schema Descriptors; Layout Parser; Mapping Generator; Data Entry Representation; Wrapper generation system; Source Dataset; Mapping File; Mapping Parser; Schema Mapping; Application Analyzer; WRAPINFO; DataReader; DataWriter; Synchronizer; wrapper; Target Dataset]

Query With Indices Motivation
• Goal
  – Improve the performance of query-proc program
• Index
  – Maintain the advantages
    • Flat file based
    • Low requirement on programming

Challenges & Approaches
• Various indexing algorithms for various biological data
  – User defined indexing functions
  – Standard function interfaces
• Flat file data
  – Values parsed implicitly and ready to be indexed
  – Byte offset as pointer
• Metadata about indices
  – Layout descriptor

System Revisited
[Diagram. Labels: query; Source/target names; Dataset descriptors; Metadata collection; Query parser; Descriptor parser; Schema & Layout information; mappings; Application analyzer; Query analysis; Query execution; Source data files; Index file; QUERYINFOR; DataReader; DataWriter; Synchronizer; Index functions; Target data file]

Language Enhancement
• Describe indices
  – Indexing is a property of dataset
  – Extend layout descriptors
      DATASET "name" {
        …
        INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
               [, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}
      }
  – Maintain query format
      AUTOWRAP GNAMES
      FROM CHIPDATA, YEASTGENOME
      BY CHIPDATA.GENE = YEASTGENOME.ID
      WHERE …
    New meaning of "=":
      If index available, use index retrieving function
      Else, compare values directly

System Enhancement
• Metadata Descriptor Parser
  + parse index information
• Application Analyzer
  + index information: index look-up table
  + test condition: compare_field_indexing

Microarray Gene Information Look-up
• Goal: gather information about genes (120)
• Query: microarray output join genome database
• Index: gene names in genome
[Bar chart: Performance (sec) for query analysis, index generation, query with indices, query w/o indices. Data labels surviving in the extracted text: 0.01, 0.72, 20.89, 81.59]

BLAST-ENHANCE Query
• Goal: Add extra information to BLAST output
• Query: BLAST output join SwissProt database
• Index: protein ID in Swiss-Prot
[Bar chart: Performance (sec), axis 0 to 1200, for index generation, query w/ indices, query w/o indices. Series labeled 3, 5, 12]

OMIM-PLUS Query
• Goal: add SwissProt link to OMIM
• Query: OMIM join Swiss-Prot
• Index: protein ID in Swiss-Prot
[Bar chart: Performance (sec), logarithmic axis 1 to 10000000, for index generation, query w/ indices, query w/o indices]

Homology Search Query
• Goal: find similar sequences
• Query: query sequence list * sequence database
• Indexing algorithm
  – Sequence-based
  – Transformation of sub-string composition
  – Indexing n-D numerical values

Homology Search (1)
• Index (Singh's algorithm)
  – Data: yeast genome
  – wavelet coefficients
  – minimum bounding rectangles
[Chart: Performance (sec), axis 0 to 350, against Database size (9.8MB), 1 to 5. Series: Index generation, 10, 20, 40]

Homology Search (2)
• Index (Ferhatosmanoglu's algorithm)
  – Data: GenBank
  – Wavelet coefficients
  – Scalar quantization
  – R-tree
[Chart: performance (sec), axis 0 to 30, against Database size (250MB), 1 to 5. Series: 10, 20, 40]

Conclusions
• A framework and a set of tools for on-the-fly flat file data integration
  – New data source understood semi-automatically by data mining tools
  – New data processed automatically by generated programs
  – Support for indexing incorporated flexibly
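
The "byte offset as pointer" idea from Challenges & Approaches, together with the index-aware meaning of "=" from Language Enhancement, can be illustrated with a small Python sketch. This is not the generated code the slides describe. It is a minimal illustration that assumes a FASTA-style flat file in which each record starts with a ">" header line whose first token is the ID. The function names (build_offset_index, read_record, scan_for_id, lookup) are hypothetical and do not come from the original system.

```python
import io


def _header_id(line):
    """Return the ID on a b'>'-prefixed header line, or None if absent."""
    parts = line[1:].split()
    return parts[0].decode("ascii") if parts else None


def build_offset_index(f):
    """Index generation: map each record ID to the byte offset of its header."""
    index = {}
    f.seek(0)
    while True:
        offset = f.tell()          # offset of the line about to be read
        line = f.readline()
        if not line:               # end of file
            break
        if line.startswith(b">"):
            rec_id = _header_id(line)
            if rec_id is not None:
                index[rec_id] = offset
    return index


def read_record(f, offset):
    """Index retrieval: seek to a byte offset and parse one record."""
    f.seek(offset)
    header = f.readline().decode("ascii").rstrip("\r\n")
    fields = header[1:].split(None, 1)
    rec_id = fields[0] if fields else ""
    description = fields[1] if len(fields) > 1 else ""
    seq_lines = []
    while True:
        line = f.readline()
        if not line or line.startswith(b">"):   # next record or end of file
            break
        seq_lines.append(line.decode("ascii").strip())
    return rec_id, description, "".join(seq_lines)


def scan_for_id(f, wanted):
    """No index: compare IDs directly in a linear pass over the file."""
    f.seek(0)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            return None
        if line.startswith(b">") and _header_id(line) == wanted:
            return read_record(f, offset)


def lookup(f, wanted, index=None):
    """Assumed reading of '=': use the index if available, else compare values."""
    if index is not None:
        offset = index.get(wanted)
        return None if offset is None else read_record(f, offset)
    return scan_for_id(f, wanted)


if __name__ == "__main__":
    # Made-up demo records, not real sequence data.
    demo = io.BytesIO(b">g1 first demo record\nACGT\nTTGA\n>g2 second demo record\nGGCC\n")
    idx = build_offset_index(demo)
    print(lookup(demo, "g2", idx))   # indexed path: one seek
    print(lookup(demo, "g2"))        # fallback path: linear scan
```

Both paths return the same record. The indexed path replaces a scan of the whole file with a dictionary look-up and a single seek, which is the kind of saving the "query w/ indices" versus "query w/o indices" comparisons are measuring.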
