Download Markov Chains

The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking
Anja Theobald and Gerhard Weikum
University of the Saarland, Saarbrücken, Germany
http://www-dbs.cs.uni-sb.de

Conclusion
Problem:
• diversity of Web / Intranet data
  => despite XML, global schema is a myth
  => users are swamped with results or are looking for needles in haystacks
Our contribution:
• combine XML querying with relevance ranking
• demonstrate efficiency and search result quality with XXL search engine prototype

Outline
• Adding relevance to XML
• The XXL search engine: index-based query processing
• Experiments

XML Data Graph
[Figure: university XML documents shown as markup and as a labeled graph. The markup appears in a German-tagged variant (<Uni>, <Fak>, <FR>, <Lehre>, <Hauptstudium>, <Vorlesung>, <Dozent>, <Inhalt>, <Lit href=springer/nelson.xml>) and an English-tagged variant (<Uni>, <School>, <Dept>, <Teaching>, <GradStudies>, <Course>, <Lecturer>, <Content>). The graph has Uni nodes for Uni Stuttgart, Uni Saarland, and Uni Augsburg; further labels include Curriculum, Weekend, Prerequisites, and Outline; contents include CS, Mobile comm., Speech processing, Performance analysis, Queueing models, Markov chains, Markov processes, E Commerce, and Data Mining. Lit links point to a Book node (Title: Stochastic ..., Author: R. Nelson, Review: Chapter on ... Markov chains).]
Semistructured data: elements, attributes, links organized as labeled graph

XML Querying
Regular expressions over path labels + logical conditions over element contents
[Figure: XML data graph as above, rooted at www.allunis.de/unis.xml]
    Select U, C
    From www.allunis.de/unis.xml
    Where Uni As U
    And U.#.School?.#.(Inst | Dept)+ As D
    And D Like "%CS%"
    And D.#.Course As C
    And C.# Like "%Markov chain%"

Boolean vs. Ranked Retrieval
There is no global schema for Intranets or the Web
=> Relevance ranking of results is absolutely crucial!

Ranked Retrieval with XXL
Result ranking of XML data based on semantic similarity
[Figure: XML data graph as above, rooted at www.allunis.de/unis.xml]
    Select U, C
    From www.allunis.de/unis.xml
    Where Uni As U
    And U.# As D
    And D ~~ "CS"
    And D.#.~Course As C
    And C.# ~~ "Markov chain"

Ranked Retrieval with XXL
Result ranking of XML data based on semantic similarity
[Figure: XML data graph as above, rooted at www.allunis.de/unis.xml]
    Select U, C
    From www.allunis.de/unis.xml
    Where Uni As U
    And U.# As D
    And D ~~ "Computer Science"
    And D.#.~Course As C
    And C.# ~~ "Markov chain"

Outline
✓ Adding relevance to XML
• The XXL search engine: index-based query processing
• Experiments

XXL: Flexible XML Search Language
• Extensible, simple core language
• Where clause: conjunction of regular path expressions with binding of variables
• Elementary conditions on element/attribute names and contents
    Select F, D, S
    From www.allunis.de/unis.xml
    Where Uni.#.School?.#.(Inst|Dept) As F
    And F.#.Lecturer As D
    And F.#.Student As S
    And D.Name = S.Name
    And D.Area Like "%XML%"
• Semantic similarity conditions on names and contents
    ... F.#.~Lecturer As D
    And D.~Area ~~ "XML"
• Based on tf*idf similarity of contents, ontological similarity of names, probabilistic combination of conditions

XXL Result Ranking
Query:
    Where Uni.#.School?.#.(Inst|Dept)+ As D
    And D.#.~Lecturer As D
    And D.~Area ~~ "XML"
[Figure: data graph and result graph over the nodes Uni: UniSaarland, Dept: CS, Dept: Math, Prof: GW, Teaching, Project: IR for semistruct. data, Project: Digital libraries, Course: IR, Seminar: XML; the result graph is annotated with the local scores 1.0, 1.0, 0.9, 0.8, 0.6]
Relevance score: 0.432 = 1.0 * 1.0 * 0.9 * 0.8 * 0.6

XXL Search Engine
[Figure: architecture with the components WWW, XXL applet, XXL servlets, Path indexer, Query processor, Content indexer, Ontology]
    Select ...
    Where Uni.#.(Inst|Dept) As F
    And F ~~ "Computer Science"
    And F.#.~Course.# ~~ "Markov Chains"
• Query decomposition into index-supported subexpressions
• Wide range of optimizations
    Uni.#.(Inst|Dept) As F
    F ~~ "Computer Science"
    F.#.~Course.# ~~ "Markov Chains"
    F.#.~Seminar.# ~~ "Markov Chains"

Index Structures
Element Path Index: materializes all (parent, child) element name pairs and dynamically checks transitive connectivity
    Uni, {id1, {<School, {id13, id14}>, <Prof, {id111, id117, id119}>}, id2, {<Prof, {id15}>} }
    School, {id13, {<Dean, {id27}>, <Dept, {id31, id32, id33}>}, id14, { ... } }
Element Content Index: precomputes all term occurrences in element contents, with frequency statistics
    Engineering, idf=..., {<id79, tf=...>, <id85, tf=...>}
    XML, idf=..., {<id46, tf=...>, <id49, tf=...>, <id53, tf=...>}
Element Ontology Index: contains synonyms, hypernyms, and hyponyms of element names, and "semantic" distances
    Course, {<Seminar, 0.9>, <Project, 0.7>}, {<Teaching, 0.9>}, {<Telecourse, 0.9>, <Video lecture, 0.7>, <Meditation, 0.1>}

Query Decomposition & Evaluation
• decompose query into subqueries
• choose global evaluation order of subqueries
• represent subquery as NFSA
• for each subquery choose local evaluation strategy (top-down or bottom-up)
• evaluate subexpressions using indexes
• compute subquery result paths with relevance scores
• combine result paths into result graph
Example query:
    Uni.#.(Inst|Dept)+ As F
    And F ~~ "Computer Science"
    And F.#.~Course.# ~~ "Markov Chains"
Example of subquery NFSA:
[Figure: NFSA with the transition labels Uni, %, Inst, Dept]

The Role of Ontologies
Observation: WWW / Intranet information becomes better searchable when it is more explicitly structured and canonically annotated
[Figure: XML fragment (<Uni> Univ. Saarland, <School> Engineering, <Dept> Computer Science, <Faculty> Prof. Dr. GW, <Project> Semistructured Data ... XML) next to a concept graph with the nodes University, Dept, Institute, Prof, Conference, Publication, Journal, Research, Teaching, Course, Curriculum, Project, Seminar, and a logical constraint over Course(c), Dept(s), Inst(s), and Curriculum(c,s) whose quantifiers and connectives did not survive extraction]
"Poor man's ontology": graph of concepts capturing hypernym/hyponym relationships (e.g., from WordNet)
=> quantitative reasoning ("semantic similarity" measures)

Outline
✓ Adding relevance to XML
✓ The XXL search engine: index-based query processing
• Experiments

Example Data
[Figure: Example Data]

Example Query
    SELECT *
    FROM INDEX
    WHERE ~drama.#.scene AS C
    AND C.speech AS S
    AND (S.speaker ~ "Woman")
    AND S.line AS L
    AND (L.CONTENT ~ "leader")
    AND C.speech AS M
    AND (M.speaker = "MACBETH")

Example Ontology
thane -- (a feudal lord or baron in Scotland)
    => lord, noble, nobleman -- (a titled peer of the realm)
    => male aristocrat -- (a man who is an aristocrat)
    => leader -- (a person who rules or guides or inspires others)

Example Ontology
woman, adult female -- (an adult female person)
    => amazon, virago -- (a large strong and aggressive woman)
    => donna -- (an Italian woman of rank)
    => geisha, geisha girl -- (...)
    => lady -- (a polite name for any woman)
    ...
    => wife -- (a married woman, a man's partner in marriage)
witch -- (a being, usually female, imagined to have special powers derived from the devil)

Example Results
Relevance = 0.0070400005
    <scene>
      <speech>
        <speaker> Second Witch </speaker>
        <line> All hail, Macbeth, hail to thee, thane of Cawdor! </line>
      </speech>
      <speech>
        <speaker> MACBETH </speaker>
        <line> ... </line>
      </speech>
    </scene>

XXL Runtime Measurements
Test data: 100 XML documents with a total of 240 000 elements (ot.xml, nt.xml, ..., hamlet.xml, macbeth.xml, ..., SigmodRecord.xml)
Q1:
    Select * From Index
    Where #.publication AS A        (1)
    And A.~headline ~~ "XML"        (2)
    And A.author% AS B              (3)
Q2:
    Select * From Index
    Where #.play AS A               (1)
    And A.#.personae AS B           (2)
    And B.~figure ~~ "King"         (3)
    And B.title AS C                (4)

    Query   #results   top-down   bottom-up   w/ optimization              plan
    Q1      131        14.3 sec   694 sec     2.68 sec (incl. 0.37 sec)    2bu 1bu 3td
    Q2      58         8.5 sec    3.7 sec     4.64 sec (incl. 0.33 sec)    1bu 2td 3td 4td

Conclusion
Research avenue: explore and leverage synergies between XML (querying), (relevance-ranking) IR, (domain-specific or personal) ontologies, and machine learning (for classification, annotation, etc.)
Goal: should be able to find results for every search in one day (computer time) with < 1 min intellectual effort that the best human experts can find with infinite time
=> pursued in CLASSIX project (joint DFG project with Norbert Fuhr's group in Dortmund)
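
The arithmetic on the XXL Result Ranking slide (0.432 = 1.0 * 1.0 * 0.9 * 0.8 * 0.6) and the similarity values on the Index Structures slide can be tied together in a few lines of Python. This is only a sketch under stated assumptions, not the engine's actual code: the function names `name_similarity` and `path_relevance` and the flat dictionary layout are hypothetical, an exact name match is assumed to score 1.0, and a name missing from the ontology is assumed to score 0.0. Multiplying the local scores corresponds to treating the conditions as independent.

```python
from math import isclose, prod

# Names similar to "Course", with the similarity values shown on the
# Index Structures slide. The flat dict layout is an assumption.
ONTOLOGY_INDEX = {
    "Course": {
        "Seminar": 0.9, "Project": 0.7, "Teaching": 0.9,
        "Telecourse": 0.9, "Video lecture": 0.7, "Meditation": 0.1,
    },
}

def name_similarity(query_name, element_name):
    """Score a ~name condition (hypothetical helper).

    Assumptions: exact match scores 1.0, unknown names score 0.0.
    """
    if query_name == element_name:
        return 1.0
    return ONTOLOGY_INDEX.get(query_name, {}).get(element_name, 0.0)

def path_relevance(local_scores):
    """Combine local scores in [0, 1] by multiplication (independence assumed)."""
    scores = list(local_scores)
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("local scores must lie in [0, 1]")
    return prod(scores)

# Local scores taken from the XXL Result Ranking slide.
slide_score = path_relevance([1.0, 1.0, 0.9, 0.8, 0.6])
print(round(slide_score, 3))                 # 0.432
print(name_similarity("Course", "Seminar"))  # 0.9
```

Because every local score is at most 1.0, each additional approximate match can only lower the overall relevance, which is consistent with the small relevance value (0.0070400005) reported on the Example Results slide.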
