Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
The BINGO! System for Information Portal Generation and Expert Web Search Sergej Sizov, Michael Biwer, Jens Graupmann, Stefan Siersdorfer, Martin Theobald, Gerhard Weikum, Patrick Zimmer University of the Saarland, Saarbrücken, Germany [email protected] http://www-dbs.cs.uni-sb.de Outline: • Problem & research challenge • Our work in progress • Future research avenues 1 (In-) Effectivity of Web Search Engines query = „ARIES recovery But openthere source download“ is hope: Google (and Yahoo) top 3: • exploit structure CpSc 862 Assignment 6 • explore neighborhood ... You can open it by double clicking the Mars ... (logfunc ... and any deviations from•the standard Aries recovery ... use topic directory www.cs.clemson.edu/~jzwang/cpsc862/hw07/cpsc862hw07.htm Tutorial: Application Servers and Associated Technologies ... an OIA for his inventions (ARIES, ARIES/IM, Commit_LSN ... www.almaden.ibm.com/u/mohan/ AppServersTutorial_SIGMOD2002_Slid Recovery from Spiritual Abuse ... with you no matter how long your recovery takes ... powerful form of spiritual abuse and a source of spiritual ... their denial in hopes that ... www.nacronline.com/dox/library/ biblestudies/sp_abuse.pdf Sourceforge: Topic –> Database –> Database Engines/Servers BZFlag - OpenSource OpenGL Multiplayer Multiplatform battlezone ... JBoss.org - The JBoss/Server is the leading Open Source ... Dev-C++ - Dev-C++ is an full-featured Integrated ... libraries ... 2 From Observations to Research Avenues Key observation: yes, there are ways to find what you are searching, but intellectual time is expensive ! requires „intelligent“ automation Research Avenues: Structure and annotate information: XML Organize documents „semantically“: ontologies Leverage machine learning: automatic classification More computer time for better result: focused crawling Goal: should be able to find results for advanced info request in one day with < 5 min intellectual effort that the best human experts can find with infinite time 3 Challenge: Expert Web Queries Where can I download an open source implementation of the ARIES recovery algorithm? Find the text and notes of the western song Raw Hide. What are Chernoff-Hoeffding bounds? Find Fermat‘s last / Wiles‘ theorem in MathML format. Are there any theorems isomorphic to my new conjecture? Find related theorems. Which professors from D are teaching DBS and have research projects on XML? D DBS XML Which Shakespeare drama has a scene where a woman talks a Scottish nobleman into murder? Who was the Italian woman that I met at the PC meeting where Moshe Vardi was PC Chair? 4 Challenge: (Meta-) Portal Building Who are the top researchers in the database system community? Who is working on using machine learning techniques for searching XML data? What are the most important results in large deviation theory? Find information about public subsidies for plumbers. Find new EU regulations that affect an electrician‘s business. Which gene expression data from Barrett tissue in the esophagus exhibit high levels of gene A01g ? Are there metabolic models for acid reflux that could be related to the gene expression data? 5 Crawler Classifier DB / Cache / Queue Mgr Portal Explorer Deep Web Focus Mgr Web Service Generator (UDDI, WSDL) HTML2XML Auto Annotation BINGO! ... Filter Ontologies (XML ...) Search Engine Web Applications Chamber of Crafts Portal Materialography Portal XXL Query Processor Index Mgr Ontology Service Intranet Applications Our Research Agenda Ontology Mgr Semantic Web 6 Focused Crawling automatially build personal topic directory (Soumen Chakrabarti et al. 1999) WWW ...... ..... ...... ..... seeds Crawler training Classifier Link Analysis Root DB Core Technology Semistrutured Data Web Retrieval Data Mining XML critical issues: • classifier accuracy • feature selection • quality of training data 7 The BINGO! Focused Crawler WWW ...... ..... ...... ..... seeds Crawler training Classifier Link Analysis high SVM high authority confidence score re-training Root DB Core Technology Semistrutured Data Web Retrieval Data Mining XML topic-specific archetypes 8 Classification using Support Vector Machines (SVM) w x b 0 x2 ? C C n training vectors with components (x1, ..., xm, C) with C = +1 oder -1 x1 Training: Compute separating hyperplane w x b 0 that maximizes the min. distance of all positive and negative samples to the hyperplane solve quadratic programming problem Decision: m Test new vector y for membership in C: ( w y b ) wi yi b 0 i 1 y Distance of from hyperplane yields classification confidence 9 BINGO! Adaptive Re-training and Focus Control for each topic V do { archetypes(V) := top docs of SVM confidence ranking top authorities of V ; remove from archetypes(V) all docs d that do not satisfy confidence(d) mean confidence(V) ; recompute feature selection based on archetypes(V) ; recompute SVM model for V with archetypes(V) as training data } combine re-training with two-phase crawling strategy: • learning phase: aims to find archetypes (high precision) hard focus for crawling vicinity of training docs • harvesting phase: aims to find results (high recall) soft focus for long-range exploration with tunnelling 10 Feature Space Construction & Meta Strategies possible strategies: • single term frequencies or tf*idf with top n cross-entropy terms • term pairs within proximity window (e.g., support vector, match about world championship) • text terms from hyperlink neighbors • anchor text terms from neighbors (e.g., <a href=...> click here for soccer results </a>) meta strategies (over m feature spaces for class k): m • unanimous decision: Ck ( d j ) 1 if Ck( ) m 1 m • weighted average: Ck ( d j ) 1 if ~ pk( )Ck( ) 1 ~p ( ) with estimator k (Joachims‘00) strategy with best ratio of for precision of model for class k estimated precision to based on leave-one-out training runtime cost 11 Experiment on Information Portal Generation (1) for single-topic portal on „Database Research“ start with only 2 initial seeds: homepages of DeWitt and Gray goal: automatic gathering of DBLP author homepages (with DBLP excluded from crawl) learning phase for improved feature selection and classification: depth-first crawl limited to domains of seeds followed by archetype selection and retraining (for high precision) harvesting phase for building a rich portal: prioritized breadth-first crawl (for high recall) 12 Experiment on Information Portal Generation (2) result after 12 hours (on commodity PC): • 3 mio. URLs crawled on 30 000 hosts, 1 mio. pages analyzed, • 0.5 mio. pages positively classified • found 7000 homepages out of 30 000 DBLP authors, 712 authors of the top 1000 DBLP authors with 267 among the 1000 best crawl results + postprocessing for querying and analysis: • ranking by SVM confidence, authority score, etc. • clustering, relevance feedback, etc. 13 Ongoing and Future Work Deep Web exploration with auto-generated queries Exploiting ontological knowledge e.g.: search for a „woman talking someone into murder“ Construct richer feature spaces Exploiting linguistic analysis methods e.g.: „... cut his throat ...“ act: killing subject:... object:... Generalized links & semantic joins, e.g. named entities Identifying semantically coherent units Combining focused crawling with XML search auto-annotation of HTML, Latex, PDF, etc. docs cross-document querying à la XXL User guidance & portal admin methodology Towards “Semantically Coherent” Units Homepage of ... Teaching: DBS IR ... Homepage of ... Teaching Research vs. Address: ... Research: XML Auto-tuning ... My work: J2EE XML WSDL ... My hobbies: Poisonous frogs South Amerian stamps Canoeing ... Ongoing and Future Work Deep Web exploration with auto-generated queries Exploiting ontological knowledge e.g.: search for a „woman talking someone into murder“ Construct richer feature spaces Exploiting linguistic analysis methods e.g.: „... cut his throat ...“ act: killing subject:... object:... Generalized links & semantic joins, e.g. named entities Identifying semantically coherent units Combining focused crawling with XML search auto-annotation of HTML, Latex, PDF, etc. docs cross-document querying à la XXL User guidance & portal admin methodology Concluding Remarks long-term goal: exploit the Web‘s potential for being the world‘s largest knowledge base XML and Semantic Web are key assets, but by themselves not sufficient; we need to cope with diversity, incompleteness, and uncertainty absolute need for ranked retrieval view information organization and information search as dual views of the same problem combine techniques from DBS, IR, CL, AI, and ML need better theory about quality/efficiency tradeoffs as well as large-scale experiments 17