Download BIBE07_Presentation_SNPMiner

SNPMiner: A Domain-Specific Deep Web Mining Tool
Fan Wang, Gagan Agrawal, Ruoming Jin and Helen Piontkivska
The Ohio State University
Kent State University
Presenter: Fan Wang
The Ohio State University
Outline
• Motivation
• SNPMiner design and implementation
• Performance of SNPMiner
• Conclusion
Motivation
• Large volume of biological data available for web access
• Large number of web-based biological databases accessible
– A Google search for “SNP related databases” finds 45 online databases
– How many if we look at the entire biological domain?
• How to integrate these databases?
Motivation
• Example query:
– Gene ERCC6
– SNPs and nonsynonymous SNPs
– Amino acid (AA) occurring at the corresponding position
– In the orthologous genes of non-human mammals
• No single database can provide all of the above information
• A query plan must be generated according to database dependencies
Motivation
[Figure: database dependencies for the ERCC6 example query. Labels: ERCC6; Entrez Gene; dbSNP; AA Positions for Nonsynonymous SNP; Encoded Protein; Alignment Database; Protein Sequence; Encoded Orthologous Protein; Sequence Database]
Motivation
• Manually?
– Requires familiarity with all related databases
– Time consuming
– Error prone
• Google?
– Results sit behind dynamic web pages
• So we need a new tool!
Outline
• Motivation
• SNPMiner design and implementation
• Performance of SNPMiner
• Conclusion
SNPMiner Design
• Features
– Integrating biological databases easily
– Unified user interface
– Generating query plans according to user queries (Dynamic Query Planning)
– Efficient and correct query plans
– Robustness
• SNPMiner
– A query-oriented, mediator-based biological data integration and querying tool
SNPMiner Design
• 8 databases integrated
– dbSNP
– Entrez Gene
– Entrez Protein
– Entrez BLAST
– SNP500Cancer
– SeattleSNP
– BIND
– SIFT
System Overview
• Web parser extracts data from retrieved HTML files
• Dynamic query planner schedules query plan
• Web accessible, 40 SNP related terms
Sample Input and Output
[Figure: sample input and output, with the labels “Query Key Terms” and “Query Target Terms”]
System Implementation
• Dynamic Query Planner
– Production Rule System
– Rule Representation
– Rule Selection
– State Update and Termination Condition
– Algorithm
• Web Page Parser
Production Rule System
• Model of computation for implementing search algorithm
• Our problem fits into this case
[Figure: knowledge cycle. Labels: Knowledge Base; Using Current Knowledge; Gain New Knowledge; Gained Knowledge]
Production Rule System
• Working Memory: Data extracted or retrieved
• Production Rules: Query schemas of online databases
• Goal State: Set of user requested terms
Rule Representation
• Each database query schema corresponds to a rule
• QS_i = (ID, I_i, D_i, O_i, C_i)
– ID : Unique identifier of the rule
– I_i : The input set of the database
– D_i : Unique identifier of the database
– O_i : The output set of the database
– C_i : Additional conditions imposed on I_i
Rule Representation
• Examples:
(9, {SNPID}, {SIFT}, {SIFT_Info}, NonSyn(SNPID))
(9, {0}, {6}, {36}, NonSyn(0))
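The rule tuple above maps naturally onto a small record type. The following is a minimal Python sketch, not SNPMiner's actual data structure: the class name `QuerySchema`, its field names, and the choice to store the condition as plain text are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuerySchema:
    """One production rule QS_i = (ID, I_i, D_i, O_i, C_i) (hypothetical layout)."""
    rule_id: int            # ID: unique identifier of the rule
    inputs: frozenset       # I_i: terms the database needs as input
    database: str           # D_i: identifier of the database
    outputs: frozenset      # O_i: terms the database returns
    condition: str = ""     # C_i: extra condition on the inputs, kept as text here


# The SIFT example from the slide: a nonsynonymous SNP ID yields SIFT_Info.
sift_rule = QuerySchema(
    rule_id=9,
    inputs=frozenset({"SNPID"}),
    database="SIFT",
    outputs=frozenset({"SIFT_Info"}),
    condition="NonSyn(SNPID)",
)
```

Using frozen sets keeps each rule immutable and hashable, so rules can be stored in sets such as a "visited" collection.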
Rule Selection
• Candidate Rule Set
– Find rules which can be fired, i.e. I ⊆ CS
– Test the availability of databases
• Compute benefit score for each candidate rule
– Based on data coverage
– Select the rule with the highest score
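The selection step can be sketched in a few lines of Python. The slides say only that the benefit score is based on data coverage, so the score used here (the number of still-missing goal terms a rule would supply) is an assumption, as are the function name `select_rule`, the dict layout of a rule, the `is_available` callback standing in for the database availability test, and the example rules themselves.

```python
def select_rule(rules, current_state, goal_state, is_available):
    """Return the fireable, available rule with the highest score, or None."""
    candidates = []
    for rule in rules:
        fireable = rule["inputs"] <= current_state          # I ⊆ CS
        if fireable and is_available(rule["database"]):
            # Assumed score: goal terms this rule would newly add.
            score = len((rule["outputs"] & goal_state) - current_state)
            candidates.append((score, rule))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]


# Hypothetical rules, for illustration only.
rules = [
    {"id": 1, "inputs": {"Gene"}, "database": "Entrez Gene", "outputs": {"SNPID"}},
    {"id": 2, "inputs": {"Gene"}, "database": "dbSNP", "outputs": {"SNPID", "AA_Position"}},
    {"id": 9, "inputs": {"SNPID"}, "database": "SIFT", "outputs": {"SIFT_Info"}},
]
best = select_rule(rules, {"Gene"}, {"SNPID", "AA_Position"}, lambda db: True)
print(best["id"])  # 2
```

Rule 9 is skipped because its input `SNPID` is not yet in the current state, and rule 2 wins because it covers two goal terms where rule 1 covers one.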
State Update and Termination
• Output elements of a selected rule are added to the working memory
• Terminate when:
– Goal state is fully covered by the working memory
– Not fully covered, but no more rules can be fired
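The state update and the two termination conditions above can be expressed with plain set operations. This is a minimal sketch; the function names and the sample terms are illustrative assumptions.

```python
def update_state(current_state, rule_outputs):
    """Add the outputs of the selected rule to the working memory."""
    return current_state | rule_outputs


def should_terminate(current_state, goal_state, fireable_rules):
    """Stop when the goal is covered, or when no rule can be fired."""
    goal_covered = goal_state <= current_state
    return goal_covered or not fireable_rules


cs = update_state({"Gene"}, {"SNPID"})
print(should_terminate(cs, {"SNPID"}, fireable_rules=[]))          # True: goal covered
print(should_terminate(cs, {"SIFT_Info"}, fireable_rules=["r9"]))  # False: keep going
print(should_terminate(cs, {"SIFT_Info"}, fireable_rules=[]))      # True: nothing left to fire
```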
Algorithm
Query Planner (Key_Term, Target_Terms)
  Initialize Current State CS, Goal State GS
  Initialize Production Rule Set PR
  Initialize Query Chain QC to be empty
  while (∃x ∈ GS and x ∉ CS) and (∃y ∈ unvisitedPR and ∃z ∈ output(y) and z ∈ GS and z ∉ CS)
    Initialize an empty set CR for candidate rules
    foreach p ∈ unvisitedPR
      if (available(p))
        compute the benefit score of p, bs(p)
        add p to CR
    Select a rule bp from CR with the highest benefit score
    Add output(bp) to CS
    Delete bp from unvisitedPR and add it to visitedPR
    Add the production rule bp to QC
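A runnable Python sketch of this greedy loop is given below. It is an approximation under stated assumptions, not the SNPMiner implementation: rules are plain dicts, `is_available` stands in for the database availability test, the benefit score (new goal terms first, then new terms overall) is assumed because the slides only mention data coverage, and the loop condition is simplified to "goal not covered and some rule can still be fired". The three example rules are hypothetical.

```python
def plan_query(key_terms, target_terms, rules, is_available=lambda db: True):
    """Greedy planner: returns (query_chain, uncovered_goal_terms)."""
    cs = set(key_terms)        # current state (working memory)
    gs = set(target_terms)     # goal state
    unvisited = list(rules)
    chain = []                 # query chain QC, as a list of rule IDs

    while gs - cs:
        # Candidate rules: unvisited, inputs already known, database reachable.
        candidates = [r for r in unvisited
                      if r["inputs"] <= cs and is_available(r["database"])]
        if not candidates:
            break              # assumption: stop when nothing can be fired

        def benefit(rule):
            # Assumed score: new goal terms, ties broken by new terms overall.
            new_terms = rule["outputs"] - cs
            return (len(new_terms & gs), len(new_terms))

        best = max(candidates, key=benefit)
        cs |= best["outputs"]
        unvisited.remove(best)
        chain.append(best["id"])
    return chain, gs - cs


# Hypothetical rules, for illustration only.
rules = [
    {"id": 2, "inputs": {"Gene"}, "database": "dbSNP", "outputs": {"SNPID"}},
    {"id": 5, "inputs": {"Protein_ID"}, "database": "Entrez Protein", "outputs": {"Sequence"}},
    {"id": 9, "inputs": {"SNPID"}, "database": "SIFT", "outputs": {"SIFT_Info"}},
]
print(plan_query({"Gene"}, {"SIFT_Info"}, rules))  # ([2, 9], set())
```

Rule 5 never fires because `Protein_ID` is never produced. If SIFT is reported unavailable, the same call returns the partial chain `[2]` together with the uncovered term `SIFT_Info`.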
Web Page Parsing
• Using HTML labels and tags to parse
• Disadvantage
– Impacted by changes in web page layout
• Automatic or semi-automatic web page layout learning method in the future
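Tag-based extraction of the kind described above can be illustrated with Python's standard `html.parser` module. The class name `CellExtractor` and the sample page are hypothetical; real database result pages have different and more complex layouts, and the slides do not say which parsing library SNPMiner uses.

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect the text of every <td> cell, relying only on tag names."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data.strip()


# Hypothetical result page, for illustration only.
page = "<table><tr><td>SNP ID</td><td>rs2228528</td></tr></table>"
parser = CellExtractor()
parser.feed(page)
print(parser.cells)  # ['SNP ID', 'rs2228528']
```

The sketch also shows the disadvantage noted above: if the site switched from table cells to another element, the extractor would return nothing until it was rewritten.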
Outline
• Motivation
• SNPMiner design and implementation
• Performance of SNPMiner
• Conclusion
Performance
• The length of each generated query plan is recorded
• A and B are easy queries
• C and D are harder ones; E is the hardest
Performance
• For easier queries, we obtain the optimal results
• For harder ones, the generated plans are no more than 40% longer than the optimal
Performance
• ERCC6, rs2228528
• Query planning time around 3.7s, 0.17% of the total time
• 99% of the query planning time spent on database availability test
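Taking the two percentages above at face value, the implied totals follow from simple arithmetic. The sketch below is only a back-of-the-envelope check; the variable names are illustrative, and the derived figures are not reported in the slides.

```python
planning_time_s = 3.7        # reported query planning time
planning_share = 0.0017      # 0.17% of the total time
availability_share = 0.99    # share of planning time spent on availability tests

total_time_s = planning_time_s / planning_share
availability_time_s = planning_time_s * availability_share

print(f"{total_time_s:.0f} s total")            # 2176 s total
print(f"{availability_time_s:.2f} s checking")  # 3.66 s checking
```

Under this reading, planning accounts for a few seconds of a run of roughly 36 minutes, and nearly all of the planning time goes to testing database availability.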
Limitations
• Only considering data coverage
• User preferences
• Hidden rules
• Only for attribute searching
Our Current Work
• Using ontology to obtain preference
• Constructing a dependency graph model
• Using bidirectional keyword searching algorithm
• Being able to discover relationships between any biological terms
Conclusion
• SNPMiner is a useful biological data integration and querying tool
• Dynamic query planner schedules query plans efficiently according to the user query
• Generated query plans are no more than 40% longer than the optimal
• On average, the system needs 2 seconds to extract a term from the deep web