The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking

Anja Theobald and Gerhard Weikum
University of the Saarland, Saarbrücken, Germany
[email protected]
http://www-dbs.cs.uni-sb.de

Conclusion

Problem:
• diversity of Web / Intranet data: despite XML, a global schema is a myth
• users are swamped with results or are looking for needles in haystacks

Our contribution:
• combine XML querying with relevance ranking
• demonstrate efficiency and search result quality with the XXL search engine prototype

Outline
• Adding relevance to XML
• The XXL search engine: index-based query processing
• Experiments

XML Data Graph

[Figure: XML documents of several universities (ETH Zürich, Uni Stuttgart, Uni Saarland), some with German element names (<Uni>, <Fak>, <FR>, <Lehre>, <Hauptstudium>, <Vorlesung>, <Dozent>, <Inhalt>, <Lit>) and some with English ones (<Uni>, <School>, <Dept>, <Teaching>, <GradStudies>, <Course>, <Lecturer>, <Content>, <Lit>). Courses include Leistungsanalyse / Performance analysis (content: Warteschlangen / Queueing models) and Sprachverarbeitung / Speech processing (content: Markovketten / Markov chains). <Lit href=springer/nelson.xml> links point to a Book element (Title: Stochastic ..., Author: R. Nelson, Review: Chapter on ... Markov chains).]
Semistructured data: elements, attributes, links organized as labeled graph

XML Querying

Regular expressions over path labels + logical conditions over element contents

[Figure: data graph for www.allunis.de/unis.xml with nodes such as Uni: Uni Stuttgart, Uni: Uni Saarland, Uni: Uni Augsburg, School: CS, Dept: CS, Teaching, GradStudies, Course: Mobile comm. (Prerequisites: ... Markov processes), Course: Speech processing (Content: ... Markov chains), Course: Performance analysis (Content: ..., Lit), Curriculum: E Commerce, Weekend: Data Mining (Outline: ... statistical methods for classification ...), and a linked Book (Title: Stochastic ..., Author: R. Nelson, Review: Chapter on ... Markov chains)]

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.#.School?.#.(Inst | Dept)+ As D
And D Like "%CS%"
And D.#.Course As C
And C.# Like "%Markov chain%"

Boolean vs. Ranked Retrieval

There is no global schema for Intranets or the Web.
Relevance ranking of results is absolutely crucial!

Ranked Retrieval with XXL

[Figure: the same data graph for www.allunis.de/unis.xml]

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.# As D
And D ~~ "CS"
And D.#.~Course As C
And C.# ~~ "Markov chain"

Ranked Retrieval with XXL

Result ranking of XML data based on semantic similarity

[Figure: the same data graph for www.allunis.de/unis.xml]
Select U, C
From www.allunis.de/unis.xml
Where Uni As U
And U.# As D
And D ~~ "Computer Science"
And D.#.~Course As C
And C.# ~~ "Markov chain"

Outline
• Adding relevance to XML
• The XXL search engine: index-based query processing
• Experiments

XXL: Flexible XML Search Language

Extensible, simple core language
Where clause: conjunction of regular path expressions with binding of variables

Elementary conditions on element/attribute names and contents:

Select F, D, S
From www.allunis.de/unis.xml
Where Uni.#.School?.#.(Inst|Dept) As F
And F.#.Lecturer As D
And F.#.Student As S
And D.Name = S.Name
And D.Area Like "%XML%"

Semantic similarity conditions on names and contents:

... F.#.~Lecturer As D And D.~Area ~~ "XML"

Based on
• tf*idf similarity of contents
• ontological similarity of names
• probabilistic combination of conditions

XXL Result Ranking

Query:
Where Uni.#.School?.#.(Inst|Dept)+ As D
And D.#.~Lecturer As D
And D.~Area ~~ "XML"

[Figure: data graph (Uni: UniSaarland; Dept: CS; Dept: Math; Prof: GW; Teaching; Project: IR for semistruct. data; Project: Digital libraries; Course: IR; Seminar: XML) next to the result graph (Uni: UniSaarland; Dept: CS; Prof: GW; Project: IR for semistruct. data), annotated with the local scores 1.0, 1.0, 0.9, 0.8, and 0.6]

Relevance score: 0.432 = 1.0 * 1.0 * 0.9 * 0.8 * 0.6

XXL Search Engine

[Figure: architecture with WWW, XXL applet, XXL servlets, Path indexer, Query processor, Content indexer, Ontology]

Select ...
Where Uni.#.(Inst|Dept) As F
And F ~~ "Computer Science"
And F.#.~Course.# ~~ "Markov Chains"

• Query decomposition into index-supported subexpressions
• wide range of optimizations

Subexpressions:
Uni.#.(Inst|Dept) As F
F ~~ "Computer Science"
F.#.~Course.# ~~ "Markov Chains"
F.#.~Seminar.# ~~ "Markov Chains"

Index Structures

Element Path Index: materializes all (parent, child) element name pairs and dynamically checks transitive connectivity
  Uni, {id1, {<School, {id13, id14}>, <Prof, {id111, id117, id119}>}, id2, {<Prof, {id15}>}}
  School, {id13, {<Dean, {id27}>, <Dept, {id31, id32, id33}>}, id14, { ... }}

Element Content Index: precomputes all term occurrences in element contents, with frequency statistics
  Engineering, idf=..., {<id79, tf=...>, <id85, tf=...>}
  XML, idf=..., {<id46, tf=...>, <id49, tf=...>, <id53, tf=...>}

Element Ontology Index: contains synonyms, hypernyms, and hyponyms of element names, and "semantic" distances
  Course, {<Seminar, 0.9>, <Project, 0.7>}, {<Teaching, 0.9>}, {<Telecourse, 0.9>, <Video lecture, 0.7>, <Meditation, 0.1>}

Query Decomposition & Evaluation

• decompose query into subqueries
• choose global evaluation order of subqueries
• represent subquery as NFSA
• for each subquery choose local evaluation strategy (top-down or bottom-up)
• evaluate subexpressions using indexes
• compute subquery result paths with relevance scores
• combine result paths into result graph

Example query:
Uni.#.(Inst|Dept)+ As F
And F ~~ "Computer Science"
And F.#.~Course.# ~~ "Markov Chains"

Example of subquery NFSA:
[Figure: NFSA with transitions labeled Uni, %, Inst, Dept]

The Role of Ontologies

Observation: WWW / Intranet information becomes better searchable when it is more explicitly structured and canonically annotated

[Figure: concept graph with the nodes University, Dept, Institute, Prof, Conference, Publication, Journal, Research, Project, Teaching, Course, Curriculum, Seminar; an annotated XML fragment (<Uni> Univ. Saarland, <School> Engineering, <Dept> Computer Science, <Faculty> Prof. Dr. GW, <Project> Semistructured Data ... XML); and a logical rule over Course(c), Dept(s), Inst(s), and Curriculum(c,s)]

"Poor man's ontology":
• graph of concepts capturing hypernym/hyponym relationships (e.g., from WordNet)
• quantitative reasoning ("semantic similarity" measures)
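The ontology lookup and the score combination from the XXL Result Ranking slide can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the dictionary is a hypothetical in-memory stand-in for the Element Ontology Index (its entries for Course are copied from the Index Structures slide and flattened into a single map, ignoring the synonym/hypernym/hyponym grouping), the function names name_similarity and path_relevance are invented for this sketch, and the local scores of a result path are assumed to be multiplied, as the 0.432 example suggests.

```python
from math import prod

# Hypothetical stand-in for the Element Ontology Index.
# Values for "Course" are taken from the Index Structures slide.
ONTOLOGY_INDEX = {
    "Course": {
        "Seminar": 0.9,
        "Project": 0.7,
        "Teaching": 0.9,
        "Telecourse": 0.9,
        "Video lecture": 0.7,
        "Meditation": 0.1,
    },
}


def name_similarity(query_name, element_name, index=ONTOLOGY_INDEX):
    """Local score for matching a ~query_name condition against an element name.

    Exact matches score 1.0; related names score their index entry;
    names that are not in the index score 0.0 (an assumption of this sketch).
    """
    if query_name == element_name:
        return 1.0
    return index.get(query_name, {}).get(element_name, 0.0)


def path_relevance(local_scores):
    """Combine the local scores along one result path by multiplication."""
    return prod(local_scores)


if __name__ == "__main__":
    # Local scores from the XXL Result Ranking slide.
    print(path_relevance([1.0, 1.0, 0.9, 0.8, 0.6]))
    # ~Course matched against a Seminar element.
    print(name_similarity("Course", "Seminar"))
```

With the local scores from the Result Ranking slide, path_relevance returns approximately 0.432. Multiplying scores means that a single weak match (for example Meditation at 0.1) pulls the whole path far down the ranking.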
Outline
• Adding relevance to XML
• The XXL search engine: index-based query processing
• Experiments

Example Data

Example Query

SELECT *
FROM INDEX
WHERE ~drama.#.scene AS C
AND C.speech AS S
AND (S.speaker ~ "Woman")
AND S.line AS L
AND (L.CONTENT ~ "leader")
AND C.speech AS M
AND (M.speaker = "MACBETH")

Example Ontology

thane -- (a feudal lord or baron in Scotland)
  => lord, noble, nobleman -- (a titled peer of the realm)
    => male aristocrat -- (a man who is an aristocrat)
      => leader -- (a person who rules or guides or inspires others)

Example Ontology

woman, adult female -- (an adult female person)
  => amazon, virago -- (a large strong and aggressive woman)
  => donna -- (an Italian woman of rank)
  => geisha, geisha girl -- (...)
  => lady -- (a polite name for any woman)
  ...
  => wife -- (a married woman, a man's partner in marriage)
  => witch -- (a being, usually female, imagined to have special powers derived from the devil)

Example Results

Relevance = 0.0070400005

<scene>
  <speech>
    <speaker> Second Witch </speaker>
    <line> All hail, Macbeth, hail to thee, thane of Cawdor! </line>
  </speech>
  <speech>
    <speaker> MACBETH </speaker>
    <line> ... </line>
  </speech>
</scene>

XXL Runtime Measurements

Test data: 100 XML documents with a total of 240 000 elements (ot.xml, nt.xml, ..., hamlet.xml, macbeth.xml, ..., SigmodRecord.xml)

Q1: Select * From Index
    Where #.publication AS A       (1)
    And A.~headline ~~ "XML"       (2)
    And A.author% AS B             (3)

Q2: Select * From Index
    Where #.play AS A              (1)
    And A.#.personae AS B          (2)
    And B.~figure ~~ "King"        (3)
    And B.title AS C               (4)

      #results   top-down   bottom-up   w/ optimization              evaluation order
Q1    131        14.3 sec   694 sec     2.68 sec (incl. 0.37 sec)    2bu 1bu 3td
Q2    58         8.5 sec    3.7 sec     4.64 sec (incl. 0.33 sec)    1bu 2td 3td 4td

Conclusion

Research avenue: explore and leverage synergies between XML (querying), (relevance-ranking) IR, (domain-specific or personal) ontologies, and machine learning (for classification, annotation, etc.)
Goal: for every search, the system should be able to find, in one day of computer time and with < 1 min of intellectual effort, the results that the best human experts could find with infinite time

Pursued in the CLASSIX project (joint DFG project with Norbert Fuhr's group in Dortmund)