SNPMiner: A Domain-Specific Deep Web Mining Tool
Fan Wang, Gagan Agrawal, Ruoming Jin and Helen Piontkivska
The Ohio State University
The Kent State University
Presenter: Fan Wang, The Ohio State University

Outline
• Motivation
• SNPMiner design and implementation
• Performance of SNPMiner
• Conclusion

Motivation
• Large volume of biological data being available for web-access
• Large number of web-based biological databases being accessible
  – Google search for “SNP related databases” finding 45 online databases
  – How many if we look for the entire biological domain?
• How to integrate these databases?

Motivation
• Example query:
  – ERCC6
  – SNP and Nonsynonymous SNP
  – AA occurring in the corresponding position
  – In orthologous gene of non-human mammals
• No single database can provide all of the above information
• Generating query plan according to database dependencies

Motivation
[Figure: database dependencies for the ERCC6 example query. Labels, in order: ERCC6; Entrez Gene; dbSNP; AA Positions for Nonsynonymous SNP; Encoded Protein; Alignment Database; Protein Sequence; Encoded Orthologous Protein; Sequence Database]

Motivation
• Manually?
  – Familiar with all related databases
  – Time consuming
  – Error prone
• Google?
  – Dynamic web page
• So we need a new tool!

Outline
• Motivation
• SNPMiner design and implementation
• Performance of SNPMiner
• Conclusion

SNPMiner Design
• Features
  – Integrating biological databases easily
  – Unified user interface
  – Generating query plans according to user queries (Dynamic Query Planning)
  – Efficient and correct query plans
  – Robustness
• SNPMiner
  – A query-oriented, mediator-based biological data integration and querying tool

SNPMiner Design
• 8 databases integrated
  – dbSNP
  – Entrez Gene
  – Entrez Protein
  – Entrez BLAST
  – SNP500Cancer
  – SeattleSNP
  – BIND
  – SIFT

System Overview
• Web parser extracts data from retrieved HTML files
• Dynamic query planner schedules query plan
• Web accessible, 40 SNP related terms

Sample Input and Output
[Figure: sample input and output. Labels: Query Key Terms; Query Target Terms]

System Implementation
• Dynamic Query Planner
  – Production Rule System
  – Rule Representation
  – Rule Selection
  – State Update and Termination Condition
  – Algorithm
• Web Page Parser

Production Rule System
• Model of computation for implementing search algorithm
• Our problem fits into this case
[Figure labels: Using Current Knowledge; Gained Knowledge; Gain New Knowledge; Knowledge Base]

Production Rule System
• Working Memory: Data extracted or retrieved
• Production Rules: Query schemas of online databases
• Goal State: Set of user requested terms

Rule Representation
• Each database query schema corresponding to a rule
• QS_i = (ID, I_i, D_i, O_i, C_i)
  – ID: Unique identifier of the rule
  – I_i: The input set of the database
  – D_i: Unique identifier of the database
  – O_i: The output set of the database
  – C_i: Additional conditions imposed on I_i

Rule Representation
• Examples:
  (9, {SNPID}, {SIFT}, {SIFT_Info}, NonSyn(SNPID))
  (9, {0}, {6}, {36}, NonSyn{0})

Rule Selection
• Candidate Rule Set
  – Find rules which can be fired, i.e. I_i ⊆ CS
  – Test the availability of databases
• Compute benefit score for candidate rule
  – Data coverage
  – Select the rule with the highest score

State Update and Termination
• Output elements of a selected rule added to the working memory
• Terminate when:
  – Goal state is fully covered by the working memory
  – Not fully covered, but no more rules can be fired

Algorithm
Query Planner (Key_Term, Target_Terms)
  Initialize Current State CS, Goal State GS
  Initialize Production Rule Set PR
  Initialize Query Chain QC to be empty
  while (∃x ∈ GS and x ∉ CS) and (∃y ∈ unvisitedPR and ∃z ∈ output(y) and z ∈ GS and z ∉ CS)
    Initialize an empty set CR for candidate rules
    foreach p ∈ unvisitedPR
      if (available(p))
        compute the benefit score of p, bs(p)
        add p to CR
    Select a rule bp from CR with the highest benefit score
    Add output(bp) to CS
    Delete bp from unvisitedPR and add it to visitedPR
    Add the production rule bp to QC

Web Page Parsing
• Using HTML labels and tags to parse
• Disadvantage
  – Impacted by the change of web page layout
• Automatic or semi-automatic web page layout learning method in the future

Outline
• Motivation
• SNPMiner design and implementation
• Performance of SNPMiner
• Conclusion

Performance
• Recording the query plan length
• A and B easy queries
• C and D harder ones; E the hardest

Performance
• For easier queries, we obtain the optimal results
• For harder ones, the generated plans are no more than 40% longer than the optimal

Performance
• ERCC6, rs2228528
• Query planning time around 3.7s, 0.17% of the total time
• 99% of the query planning time spent on database availability test

Limitations
• Only considering data coverage
• User preferences
• Hidden rules
• Only for attribute searching

Our Current Work
• Using ontology to obtain preference
• Constructing a dependency graph model
• Using bidirectional keyword searching algorithm
• Being able to discover relationships between any biological terms

Conclusion
• SNPMiner is a useful biological data integration and querying tool
• Dynamic query planner schedules query plan according to user query efficiently
• The length of generated query plan is no more than 40% longer than the optimal
• On average, the system needs 2 seconds to extract a term from the deep web
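
Sketch: Dynamic Query Planner

The greedy planning loop from the Algorithm slide can be illustrated with a short Python sketch. This is not the SNPMiner implementation. It assumes that the benefit score is simply the number of still-missing goal terms a rule would add (the slides say only "data coverage"), that candidates must contribute at least one new term, and that the condition C_i is ignored. The names `Rule`, `coverage_score`, `plan_query` and `is_available` are hypothetical, and so are the example rules, apart from the SIFT rule (9), which follows the Rule Representation slide.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Rule:
    """One query schema QS_i = (ID, I_i, D_i, O_i); C_i is left out of this sketch."""
    rule_id: int
    inputs: FrozenSet[str]
    database: str
    outputs: FrozenSet[str]


def coverage_score(rule: Rule, current: set, goal: set) -> Tuple[int, int]:
    # Assumed benefit score: missing goal terms covered first, then any new terms.
    new_terms = rule.outputs - current
    return (len(new_terms & goal), len(new_terms))


def plan_query(
    key_terms: Iterable[str],
    target_terms: Iterable[str],
    rules: Iterable[Rule],
    is_available: Callable[[Rule], bool] = lambda rule: True,
) -> Tuple[List[Rule], bool]:
    current = set(key_terms)      # working memory (current state CS)
    goal = set(target_terms)      # goal state GS
    unvisited = list(rules)       # unvisited production rules
    chain: List[Rule] = []        # query chain QC

    while not goal <= current:
        # A rule can fire when its inputs are known, its database answers,
        # and it would add at least one term not yet in working memory.
        candidates = [
            r for r in unvisited
            if r.inputs <= current and is_available(r) and r.outputs - current
        ]
        if not candidates:
            break                 # goal not covered, but no rule can be fired
        best = max(candidates, key=lambda r: coverage_score(r, current, goal))
        current |= best.outputs   # state update
        unvisited.remove(best)
        chain.append(best)

    return chain, goal <= current


# Hypothetical rules for illustration only; term names other than SNPID and
# SIFT_Info are made up and do not describe the real database interfaces.
RULES = [
    Rule(1, frozenset({"GeneName"}), "Entrez Gene", frozenset({"GeneID"})),
    Rule(2, frozenset({"GeneID"}), "dbSNP", frozenset({"SNPID"})),
    Rule(3, frozenset({"SNPID"}), "SNP500Cancer", frozenset({"Allele"})),
    Rule(9, frozenset({"SNPID"}), "SIFT", frozenset({"SIFT_Info"})),
]

chain, covered = plan_query({"GeneName"}, {"SIFT_Info"}, RULES)
print([r.rule_id for r in chain], covered)  # [1, 2, 9] True
```

Passing an `is_available` callback that rejects a database mimics the availability test: the planner then stops with the goal uncovered, which corresponds to the second termination condition on the State Update and Termination slide.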