Supporting High-Performance Data Processing on Flat Files
Xuan Zhang
Gagan Agrawal
Ohio State University
Motivation
• Challenges of bioinformatics integration
– Data volume: overwhelming
• DNA sequence: 100 gigabases (August, 2005)
– Data growth: exponential
Figure provided by PDB
Existing Solutions
– (Relational) Databases
• Support for indexing and high-level queries
• Not suitable for biological data
– Flat Files with Scripts
• Compact, Perl Scripts available
• Lack indexing and high-level query processing
– Web-services
• Significant overhead
Our Approach
• Enhance information integration systems on
– Functionality
• On-the-fly data incorporation
• Flat-file data processing
– Usability
• Declarative interface
• Low programming requirement
– Performance
• Incorporate indexing support
Approach Summary
• Metadata
– Declarative description of data
– Data mining algorithms for semi-automatic writing
– Reusable by different requests on same data
• Code generation
– Request analysis and execution separated
– General modules with plug-in data module
System Overview
[Figure: system overview diagram. Labels: Understand Data (Layout Miner, Schema Miner); Data File; User Request; Metadata Description (Layout Descriptors, Schema Descriptors); Process Data (Code Generation, Request Processor); Information Integration System; Answer.]
Advantages
• Simple interface
– At metadata level, declarative
• General data model
– Semi-structured data
– Flat file data
• Low human involvement
– Semi-automatic data incorporation
– Low maintenance cost
• Acceptable performance
– Linear scaling guaranteed
– Can be improved by using indexing
System Components
• Understand data
– Layout mining
– Schema mining
• Process data
– Wrapper generation
– Query processing
– Query processing with indices
Data Processing Overview
• Automatic code generation approach
• Input
– Metadata about datasets involved
– Optional:
• Implicit data transformation task
• Request by users
• Indexing functions
• Output
– Executable programs
• General modules
• Task-specific data module
Metadata Description
• Two aspects of data in flat files
– Logical view of the data
– Physical data organization
• Two components of every data descriptor
– Schema description
– Layout description
• Design goals
– Powerful
– Easy to write and interpret
Schema Descriptors
• Follow XML DTD standard for semi-structured data
<?xml version='1.0' encoding='UTF-8'?>
<!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT DESCRIPTION (#PCDATA)>
<!ELEMENT SEQ (#PCDATA)>
• Simple attribute list for relational data
[FASTA]                 //Schema Name
ID = string             //Data type definitions
DESCRIPTION = string
SEQ = string
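The attribute-list form above is simple enough to parse in a few lines. The following is a minimal sketch, assuming that `//` starts a comment, that a bracketed line names the schema, and that every other line is `attribute = type`. The function name `parse_schema` is hypothetical and is not part of the system described in these slides.

```python
def parse_schema(text):
    """Parse an attribute-list schema descriptor into (name, {attribute: type})."""
    name = None
    attrs = {}
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing // comments (assumed syntax)
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()          # bracketed line gives the schema name
        elif "=" in line:
            attr, typ = (part.strip() for part in line.split("=", 1))
            attrs[attr] = typ
        else:
            raise ValueError(f"unrecognized schema line: {raw!r}")
    if name is None:
        raise ValueError("schema name missing")
    return name, attrs


FASTA_SCHEMA = """\
[FASTA]                 //Schema Name
ID = string             //Data type definitions
DESCRIPTION = string
SEQ = string
"""

schema_name, schema_attrs = parse_schema(FASTA_SCHEMA)
print(schema_name, schema_attrs)
```

Because dictionaries keep insertion order, the attribute order of the descriptor is preserved, which matters if the order is meant to mirror the field order in the file.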
Layout Descriptors
• Overall structure (FASTA example)
DATASET “FASTAData” {            //Dataset name
  DATATYPE {FASTA}               //Schema name
  DATASPACE LINESIZE=80 {
    // ---- File layout details go here ----
  }
  DATA {osu/fasta}               //File location
}
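To make the link between a layout and the FASTA schema concrete, here is a hedged sketch of the kind of reader that a layout description might drive. It is hand-written, not generated code, and it assumes the common FASTA convention that a record starts with a `>` header line whose first whitespace-delimited token is the ID and whose remainder is the description. The names `read_fasta` and `_record` are hypothetical.

```python
import io


def _record(header, seq_lines):
    """Split a header into ID and DESCRIPTION and join the sequence lines."""
    ident, _, desc = header.partition(" ")
    return {"ID": ident, "DESCRIPTION": desc, "SEQ": "".join(seq_lines)}


def read_fasta(lines):
    """Yield one dict per FASTA record with the schema fields ID, DESCRIPTION, SEQ."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield _record(header, seq)   # previous record is complete
            header, seq = line[1:], []
        elif header is not None and line:
            seq.append(line)                 # sequence may span several lines
    if header is not None:
        yield _record(header, seq)


sample = io.StringIO(">sp1 first protein\nACGT\nTTGA\n>sp2\nGGCC\n")
records = list(read_fasta(sample))
print(records)
```

The reader streams line by line, so memory use depends on the largest record rather than on the file size.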
Wrapper Generation System Overview
[Figure: wrapper generation system diagram. Labels: Layout Descriptor, Schema Descriptors, Layout Parser, Mapping Generator, Data Entry Representation, Mapping File, Mapping Parser, Schema Mapping, Application Analyzer, WRAPINFO, Source Dataset, Target Dataset; wrapper components: DataReader, DataWriter, Synchronizer.]
Query With Indices: Motivation
• Goal
– Improve the performance of the query-processing program
• Index
– Maintain the advantages
• Flat file based
• Low requirement on programming
Challenges & Approaches
• Various indexing algorithms for various biological data
– User defined indexing functions
– Standard function interfaces
• Flat file data
– Values parsed implicitly and ready to be indexed
– Byte offset as pointer
• Metadata about indices
– Layout descriptor
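The "byte offset as pointer" idea can be sketched as a pair of functions, one that generates an index and one that retrieves through it. This is an illustrative sketch only: the function names and signatures are assumptions, not the standard interfaces the slides refer to, and the data is an in-memory FASTA-like stream rather than a real file.

```python
import io


def generate_index(stream):
    """Index generation: map each record ID to the byte offset of its header line."""
    index = {}
    pos = 0
    for raw in stream:                      # binary iteration yields lines with newlines
        if raw.startswith(b">"):
            fields = raw[1:].split(None, 1)
            if fields:
                index[fields[0].decode("ascii")] = pos
        pos += len(raw)                     # track the offset of the next line
    return index


def retrieve(stream, index, ident):
    """Index retrieval: seek to the stored offset and read one record."""
    offset = index.get(ident)
    if offset is None:
        return None
    stream.seek(offset)
    header = stream.readline().decode("ascii").rstrip("\r\n")
    seq = []
    for raw in iter(stream.readline, b""):
        if raw.startswith(b">"):            # next record begins, stop reading
            break
        seq.append(raw.decode("ascii").strip())
    return header[1:], "".join(seq)


data = io.BytesIO(b">g1 alpha\nACGT\nAC\n>g2 beta\nTTTT\n")
offsets = generate_index(data)
hit = retrieve(data, offsets, "g2")
print(offsets, hit)
```

Storing offsets instead of copies of the records keeps the index small and leaves the flat file as the single source of data.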
System Revisited
[Figure: system diagram extended with index support. Stages: Query analysis, Query execution. Labels: query, Source/target names, Dataset descriptors, Metadata collection, Query parser, Descriptor parser, Schema & Layout information, mappings, Application analyzer, QUERYINFOR, Source data files, Index file, DataReader, DataWriter, Synchronizer, Index functions, Target data file.]
Language Enhancement
• Describe indices
– Indexing is a property of dataset
– Extend layout descriptors
DATASET “name”{
…
INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
[, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}
}
– Maintain query format
AUTOWRAP GNAMES
FROM CHIPDATA, YEASTGENOME
BY CHIPDATA.GENE = YEASTGENOME.ID
WHERE …
New meaning of “=”:
– If an index is available, use the index retrieving function
– Else, compare values directly
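The two meanings of “=” can be illustrated with a small join routine that takes an optional retrieval function. This is a sketch under stated assumptions: the rows are plain dictionaries, the index is an in-memory dictionary, and the names `equality_join` and `retrieve_by_id` as well as the sample values are hypothetical.

```python
def equality_join(left_rows, left_attr, right_rows, right_attr, retrieve=None):
    """Yield (left, right) pairs where left[left_attr] equals right[right_attr]."""
    for left in left_rows:
        value = left[left_attr]
        if retrieve is not None:
            matches = retrieve(value)       # index available: use the retrieving function
        else:
            # no index: compare values directly by scanning the right-hand rows
            matches = [row for row in right_rows if row[right_attr] == value]
        for right in matches:
            yield left, right


chipdata = [{"GENE": "g2", "LEVEL": 1.5}, {"GENE": "g9", "LEVEL": 0.2}]
yeastgenome = [{"ID": "g1", "NAME": "alpha"}, {"ID": "g2", "NAME": "beta"}]

# Hypothetical in-memory index on YEASTGENOME.ID
id_index = {}
for row in yeastgenome:
    id_index.setdefault(row["ID"], []).append(row)


def retrieve_by_id(value):
    return id_index.get(value, [])


with_index = list(equality_join(chipdata, "GENE", yeastgenome, "ID", retrieve_by_id))
without_index = list(equality_join(chipdata, "GENE", yeastgenome, "ID"))
print(with_index == without_index)
```

Both paths return the same pairs. Only the cost differs, since the indexed path avoids scanning the right-hand dataset once per left-hand row.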
System Enhancement
• Metadata Descriptor Parser
+ parse index information
• Application Analyzer
+ index information: index look-up table
+ test condition: compare_field_indexing
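As a rough illustration of how index information from a descriptor could become an index look-up table, the sketch below parses an `INDEX {...}` clause in the five-field format shown earlier. It assumes fields never contain `:` or `,`, which may not hold for real paths. The class `IndexEntry`, the function `parse_index_clause`, and all sample values are hypothetical.

```python
import re
from typing import NamedTuple


class IndexEntry(NamedTuple):
    attribute: str
    index_file_loc: str
    index_gen_fun: str
    index_retr_fun: str
    fun_loc: str


def parse_index_clause(clause):
    """Return {attribute: IndexEntry} for an INDEX {...} clause."""
    match = re.fullmatch(r"\s*INDEX\s*\{(.*)\}\s*", clause, flags=re.DOTALL)
    if match is None:
        raise ValueError("not an INDEX clause")
    table = {}
    for item in match.group(1).split(","):      # entries are comma-separated
        fields = [f.strip() for f in item.split(":")]
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {item!r}")
        entry = IndexEntry(*fields)
        table[entry.attribute] = entry
    return table


# Hypothetical clause; the file and function names are made up for illustration.
clause = "INDEX {ID:idx/id.idx:gen_id:get_id:lib/idfuns, NAME:idx/name.idx:gen_name:get_name:lib/namefuns}"
lookup = parse_index_clause(clause)
print(lookup["ID"].index_retr_fun)
```

An analyzer could then consult such a table when it meets an equality condition: if the compared attribute is a key in the table, the indexed path applies.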
Microarray Gene Information Look-up
• Goal: gather information about genes (120)
• Query: microarray output join genome database
• Index: gene names in genome
[Figure: bar chart of performance (sec), y-axis 0 to 90, for query analysis (0.01), index generation (0.72), query with indices, and query w/o indices; the remaining data labels are 20.89 and 81.59.]
BLAST-ENHANCE Query
• Goal: add extra information to BLAST output
• Query: BLAST output join Swiss-Prot database
• Index: protein ID in Swiss-Prot
[Figure: bar chart of performance (sec), y-axis 0 to 1200, for index generation, query w/ indices, and query w/o indices; additional labels 3, 5, and 12.]
OMIM-PLUS Query
• Goal: add Swiss-Prot link to OMIM
• Query: OMIM join Swiss-Prot
• Index: protein ID in Swiss-Prot
[Figure: bar chart of performance (sec), logarithmic y-axis from 1 to 10,000,000, for index generation, query w/ indices, and query w/o indices.]
Homology Search Query
• Goal: find similar sequences
• Query: query sequence list * sequence database
• Indexing algorithm
– Sequence-based
– Transformation of sub-string composition
– Indexing n-D numerical values
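The idea of turning sub-string composition into n-D numerical values can be illustrated with a deliberately simplified sketch. It maps fixed-size windows of a sequence to symbol-frequency vectors and encloses them in a minimum bounding rectangle. This is not Singh's or Ferhatosmanoglu's algorithm and it omits the wavelet step those slides mention; the window size, step, and all function names are assumptions made for illustration.

```python
def composition_vector(window, alphabet="ACGT"):
    """Fraction of each alphabet symbol in the window: one n-D point, n = len(alphabet)."""
    return tuple(window.count(ch) / len(window) for ch in alphabet)


def window_vectors(sequence, size, step):
    """Composition vectors for windows of the given size taken every `step` positions."""
    return [composition_vector(sequence[i:i + size])
            for i in range(0, len(sequence) - size + 1, step)]


def bounding_rectangle(points):
    """Minimum bounding rectangle as (per-dimension minima, per-dimension maxima)."""
    lows = tuple(min(coords) for coords in zip(*points))
    highs = tuple(max(coords) for coords in zip(*points))
    return lows, highs


points = window_vectors("AAAACCCC", size=4, step=4)
mbr = bounding_rectangle(points)
print(points, mbr)
```

Rectangles of this kind are the unit a multidimensional index stores, so a search can discard a whole group of windows when the query point lies far from the rectangle.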
Homology Search (1)
• Index (Singh’s algorithm)
– Data: yeast genome
– Wavelet coefficients
– Minimum bounding rectangles
[Figure: index generation performance (sec), y-axis 0 to 350, versus database size (9.8MB), x-axis 1 to 5; series labeled 10, 20, and 40.]
Homology Search (2)
• Index (Ferhatosmanoglu’s algorithm)
– Data: GenBank
– Wavelet coefficients
– Scalar quantization
– R-tree
[Figure: performance (sec), y-axis 0 to 30, versus database size (250MB), x-axis 1 to 5; series labeled 10, 20, and 40.]
Conclusions
• A framework and a set of tools for on-the-fly flat-file data integration
– New data sources understood semi-automatically by data mining tools
– New data processed automatically by generated programs
– Support for indexing incorporated flexibly