The Index-based XXL Search Engine
for Querying XML Data
with Relevance Ranking
Anja Theobald and Gerhard Weikum
University of the Saarland
Saarbrücken, Germany
[email protected]
http://www-dbs.cs.uni-sb.de
Conclusion
Problem:
• diversity of Web / Intranet data
⇒ despite XML, global schema is a myth
⇒ users are swamped with results or
are looking for needles in haystacks
Our contribution:
• combine XML querying with relevance ranking
• demonstrate efficiency and search result quality
with XXL search engine prototype
Outline
• Adding relevance to XML
• The XXL search engine:
index-based query processing
• Experiments
XML Data Graph
[Figure: XML data graph. University documents for Uni Saarland, Uni Stuttgart, Uni Augsburg, and ETH Zürich, shown with German element names (Uni, Fak, FR, Lehre, Hauptstudium, Vorlesung, Dozent, Inhalt, Lit) and with English element names (Uni, School, Dept, Teaching, GradStudies, Course, Lecturer, Content, Lit). Courses include Performance analysis (content: Queueing models) and Speech processing (content: Markov chains); Lit elements carry href attributes such as href=springer/nelson.xml. A separate Book node has Title: Stochastic ..., Author: R. Nelson, Review: ..., and a Chapter on Markov chains.]
Semistructured data:
elements, attributes, links
organized as labeled graph
XML Querying
Regular expressions over path labels
+ Logical conditions over element contents
[Figure: XML data graph under www.allunis.de/unis.xml with the nodes Uni: Uni Stuttgart, Uni: Uni Augsburg, and Uni: Uni Saarland, their School, Dept, Teaching, GradStudies, Curriculum, Course, Content, and Lit descendants, and the Book node with a Chapter on Markov chains]
Select U, C From www.allunis.de/unis.xml Where Uni As U
And U.#.School?.#.(Inst | Dept)+ As D And D Like „%CS%“
And D.#.Course As C And C.# Like „%Markov chain%“
Boolean vs. Ranked Retrieval
There is no global schema
for Intranets or the Web
⇒ Relevance ranking of results
is absolutely crucial!
Ranked Retrieval with XXL
[Figure: XML data graph under www.allunis.de/unis.xml (Uni Stuttgart, Uni Augsburg, Uni Saarland, and the Book node with a Chapter on Markov chains)]
Select U, C From www.allunis.de/unis.xml Where Uni As U
And U.# As D And D ~~ „CS“
And D.#.~Course As C And C.# ~~ „Markov chain“
Ranked Retrieval with XXL
Result ranking based on semantic similarity of XML data
[Figure: XML data graph under www.allunis.de/unis.xml (Uni Stuttgart, Uni Augsburg, Uni Saarland, and the Book node with a Chapter on Markov chains)]
Select U, C From www.allunis.de/unis.xml Where Uni As U
And U.# As D And D ~~ „Computer Science“
And D.#.~Course As C And C.# ~~ „Markov chain“
Outline
✓ Adding relevance to XML
• The XXL search engine:
index-based query processing
• Experiments
XXL: Flexible XML Search Language
Extensible, simple core language
Where clause: conjunction of regular path expressions
with binding of variables
Elementary conditions on element/attribute names and contents
Select F, D, S From www.allunis.de/unis.xml
Where Uni.#.School?.#.(Inst|Dept) As F
And F.#.Lecturer As D And F.#.Student As S
And D.Name = S.Name And D.Area Like „%XML%“
Semantic similarity conditions on names and contents
... F.#.~Lecturer As D And D.~Area ~~ „XML“
Based on tf*idf similarity of contents,
ontological similarity of names,
and probabilistic combination of conditions
XXL Result Ranking
Query: Where Uni.#.School?.#.(Inst|Dept)+ As D And
D.#.~Lecturer As D And D.~Area ~~ „XML“
Data graph:
Uni: UniSaarland
Dept: CS
Dept: Math
Prof: GW
Teaching
Project: IR for semistruct. data
Project: Digital libraries
Course: IR Seminar: XML
Result graph:
1.0 Uni: UniSaarland
1.0 Dept: CS
0.9 Prof: GW
0.8 Project:
0.6 IR for semistruct. data
Relevance score: 0.432
= 1.0 * 1.0 * 0.9 * 0.8 * 0.6
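The multiplication above can be reproduced in a few lines. This is only a sketch of the arithmetic on the slide; the function name `relevance` is made up for illustration, and treating the local scores as independent probabilities is one reading of "probabilistic combination of conditions".

```python
import math

def relevance(local_scores):
    """Combine the local scores along a result path by multiplying them."""
    return math.prod(local_scores)

# Local scores from the result graph on the slide:
# Uni, Dept, Prof, Project (element name), project content
scores = [1.0, 1.0, 0.9, 0.8, 0.6]
print(round(relevance(scores), 3))  # 0.432
```

A product has the property that a single weak match (here the 0.6 content score) lowers the score of the whole result path.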
[Figure: architecture with the components WWW, XXL applet, XXL servlets, path indexer, content indexer, ontology, and query processor]
Select ... Where
Uni.#.(Inst|Dept) As F
And F ~~ „Computer Science“
And F.#.~Course.#
~~ „Markov Chains“
• Query decomposition into
index-supported subexpressions
• wide range of optimizations
Uni.#.(Inst|Dept) As F
F ~~ „Computer Science“
F.#.~Course.#
~~ „Markov Chains“
F.#.~Seminar.#
~~ „Markov Chains“
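The subexpressions above list a Seminar variant next to the Course variant, which suggests that a similarity condition on an element name is expanded into related names. A minimal sketch of such an expansion follows, assuming a small ontology table; the helper names (`expand_step`, `expand_path`), the `threshold` parameter, and the choice to drop the `~` in the expanded names are illustrative assumptions, not the XXL implementation. The weights for Course are the ones shown on the "Index Structures" slide.

```python
# Hypothetical ontology table: element name -> related names with similarity.
ONTOLOGY = {
    "Course": {"Seminar": 0.9, "Project": 0.7},
}

def expand_step(step, threshold=0.0):
    """Return (name, weight) alternatives for one path step.

    A step written ~Name yields Name itself (weight 1.0) plus all related
    names whose similarity is above the threshold.
    """
    if not step.startswith("~"):
        return [(step, 1.0)]
    name = step[1:]
    alternatives = [(name, 1.0)]
    alternatives += [(n, w) for n, w in ONTOLOGY.get(name, {}).items() if w > threshold]
    return alternatives

def expand_path(path, threshold=0.0):
    """Expand every ~step of a dotted path expression into weighted variants."""
    variants = [("", 1.0)]
    for step in path.split("."):
        variants = [
            (f"{prefix}.{name}" if prefix else name, score * weight)
            for prefix, score in variants
            for name, weight in expand_step(step, threshold)
        ]
    return variants

for expression, weight in expand_path("F.#.~Course.#", threshold=0.8):
    print(expression, weight)
```

With a threshold of 0.8 the Project alternative (0.7) is dropped, leaving the Course and Seminar variants.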
Index Structures
Element Path Index:
materializes all (parent, child) element name pairs
and dynamically checks transitive connectivity
Uni, {id1, {<School, {id13, id14}>, <Prof, {id111, id117, id119}>},
id2, {<Prof, {id15}>} }
School, {id13, {<Dean, {id27}>, <Dept, {id31, id32, id33}>},
id14, { ... } }
Element Content Index:
precomputes all term occurrences in element contents,
with frequency statistics
Engineering, idf=..., {<id79, tf=...>, <id85, tf=...>}
XML, idf=..., {<id46, tf=...>, <id49, tf=...>, <id53, tf=...>}
Element Ontology Index:
contains synonyms, hypernyms,
and hyponyms of element names,
and „semantic“ distances
Course, {<Seminar, 0.9>, <Project, 0.7>},
{<Teaching, 0.9>}
{<Telecourse, 0.9>, <Video lecture, 0.7>, <Meditation, 0.1>}
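To make the path and content index entries concrete, here is a rough in-memory sketch using nested dictionaries. The layout and the function names are assumptions made for illustration; the slide elides all idf and tf values, so the numbers in `CONTENT_INDEX` are invented, and scoring an element as tf * idf is just one common variant.

```python
# Element path index: element name -> element id -> child name -> child ids.
# Entries copied from the slide; the children of id14 are elided there.
PATH_INDEX = {
    "Uni": {
        "id1": {"School": ["id13", "id14"], "Prof": ["id111", "id117", "id119"]},
        "id2": {"Prof": ["id15"]},
    },
    "School": {
        "id13": {"Dean": ["id27"], "Dept": ["id31", "id32", "id33"]},
    },
}

# Element content index: term -> (idf, {element id: tf}).
# The idf and tf numbers are made up; the slide shows only "idf=..." and "tf=...".
CONTENT_INDEX = {
    "XML": (2.0, {"id46": 3, "id49": 1, "id53": 2}),
}

def children(name, element_id, child_name):
    """Look up the ids of the children with a given element name."""
    return PATH_INDEX.get(name, {}).get(element_id, {}).get(child_name, [])

def reachable(name, element_id, target_name):
    """Check transitive connectivity by following (parent, child) pairs."""
    stack = [(name, element_id)]
    seen = set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        entry = PATH_INDEX.get(current[0], {}).get(current[1], {})
        for child_name, child_ids in entry.items():
            if child_name == target_name:
                return True
            stack.extend((child_name, child_id) for child_id in child_ids)
    return False

def tf_idf(term):
    """Score every element containing the term as tf * idf."""
    idf, postings = CONTENT_INDEX.get(term, (0.0, {}))
    return {element_id: tf * idf for element_id, tf in postings.items()}

print(children("Uni", "id1", "School"))  # ['id13', 'id14']
print(reachable("Uni", "id1", "Dept"))   # True
print(reachable("Uni", "id2", "Dept"))   # False
```

Storing only direct (parent, child) pairs keeps the index small; a wildcard path is then answered by a traversal at query time, as in `reachable`.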
Query Decomposition & Evaluation
• decompose query into subqueries
• choose global evaluation order of subqueries
• represent subquery as NFSA
• for each subquery choose local evaluation strategy
(top-down or bottom-up)
⇒ evaluate subexpressions using indexes
⇒ compute subquery result paths
with relevance scores
⇒ combine result paths into result graph
Example query:
Uni.#.(Inst|Dept)+ As F
And F ~~ „Computer Science“
And F.#.~Course.# ~~ „Markov Chains“
Example of subquery NFSA:
[Figure: NFSA with states labeled Uni, %, Inst, and Dept]
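As an illustration of how a path expression can be run as a nondeterministic finite state automaton, the sketch below hand-codes an automaton for `Uni.#.(Inst|Dept)+` and simulates it over a sequence of element names. It assumes that `#` matches any sequence of zero or more element names, modeled as a `%` self-loop; the state numbering and the function name `accepts` are illustrative and not taken from the XXL engine.

```python
ANY = "%"  # wildcard label: matches every element name

# Transitions: state -> list of (label, next state).
# State 1 carries the % self-loop for "#"; state 2 loops for "(Inst|Dept)+".
NFSA = {
    0: [("Uni", 1)],
    1: [(ANY, 1), ("Inst", 2), ("Dept", 2)],
    2: [("Inst", 2), ("Dept", 2)],
}
ACCEPTING = {2}

def accepts(path):
    """Simulate the NFSA on a list of element names using a set of active states."""
    states = {0}
    for label in path:
        states = {
            next_state
            for state in states
            for transition_label, next_state in NFSA.get(state, [])
            if transition_label == ANY or transition_label == label
        }
        if not states:
            return False
    return bool(states & ACCEPTING)

print(accepts(["Uni", "School", "Dept"]))  # True
print(accepts(["Uni", "Dept", "Course"]))  # False
```

Tracking a set of active states avoids backtracking: after reading Dept the automaton is in states 1 and 2 at once, so both continuations stay open.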
The Role of Ontologies
Observation:
Information becomes
better searchable when it is
more explicitly structured
and canonically annotated
[Figure: WWW / Intranet data such as <Uni> Univ. Saarland, <School> Engineering, <Dept> Computer Science, <Faculty> Prof. Dr. GW, <Project> Semistructured Data ... XML </>, annotated with ontology concepts: University, Dept, Institute, Prof, Conference, Publication, Journal, Research, Teaching, Project, Course, Seminar, Curriculum]
$\forall c\, (\mathrm{Course}(c) \Rightarrow \exists s\, ((\mathrm{Dept}(s) \lor \mathrm{Inst}(s)) \land \mathrm{Curriculum}(c,s)))$
„Poor man’s ontology“:
Graph of concepts capturing
hypernym/hyponym relationships (e.g., from WordNet)
⇒ quantitative reasoning („semantic similarity“ measures)
Outline
✓ Adding relevance to XML
✓ The XXL search engine:
index-based query processing
• Experiments
Example Data
Example Query
SELECT *
FROM INDEX
WHERE ~drama.#.scene AS C
AND C.speech AS S
AND (S.speaker ~ "Woman")
AND S.line AS L
AND (L.CONTENT ~ "leader")
AND C.speech AS M
AND (M.speaker = "MACBETH")
Example Ontology
thane – (a feudal lord or baron in Scotland)
=> lord, noble, nobleman – (a titled peer of the realm)
=> male aristocrat – (a man who is an aristocrat)
=> leader – (a person who rules
or guides or inspires others)
Example Ontology
woman, adult female – (an adult female person)
=> amazon, virago – (a large strong and aggressive woman)
=> donna – (an Italian woman of rank)
=> geisha, geisha girl – (...)
=> lady – (a polite name for any woman)
...
=> wife – (a married woman, a man’s partner in marriage)
=> witch – (a being, usually female, imagined to
have special powers derived from the devil)
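The two ontology excerpts can be turned into a toy similarity measure. The sketch below reads the thane excerpt as a chain of successively more general terms and the woman excerpt as a list of more specific terms; the measure itself (a fixed decay factor per hypernym step) and all function names are assumptions for illustration, since the slides do not state which measure XXL uses.

```python
# Hypernym edges (term -> more general term), read off the two excerpts.
HYPERNYM = {
    "thane": "lord",
    "lord": "male aristocrat",
    "male aristocrat": "leader",
    "witch": "woman",
    "wife": "woman",
    "lady": "woman",
}

def hypernym_distance(term, concept):
    """Count the hypernym steps from term up to concept, or None if unreachable."""
    steps = 0
    current = term
    while current != concept:
        if current not in HYPERNYM:
            return None
        current = HYPERNYM[current]
        steps += 1
    return steps

def similarity(term, concept, decay=0.8):
    """Assumed measure: decay ** (number of hypernym steps), 0.0 if unrelated."""
    distance = hypernym_distance(term, concept)
    return 0.0 if distance is None else decay ** distance

print(hypernym_distance("thane", "leader"))  # 3
print(similarity("witch", "woman"))          # 0.8
```

Under this reading, a speaker named "Second Witch" can match the condition on "Woman" and a line containing "thane" can match the condition on "leader", each with a score below 1.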
Example Results
Relevance = 0.0070400005
<scene>
<speech>
<speaker> Second Witch </speaker>
<line> All hail, Macbeth, hail to thee,
thane of Cawdor!
</line>
</speech>
<speech>
<speaker> MACBETH </speaker>
<line> ... </line>
</speech>
</scene>
XXL Runtime Measurements
Test data:
100 XML documents with a total of 240 000 elements
(ot.xml, nt.xml, ..., hamlet.xml, macbeth.xml, ..., SigmodRecord.xml)
Q1:
Select * From Index
Where #.publication AS A      (1)
And A.~headline ~~ „XML“      (2)
And A.author% AS B            (3)
Q2:
Select * From Index
Where #.play AS A             (1)
And A.#.personae AS B         (2)
And B.~figure ~~ „King“       (3)
And B.title AS C              (4)

Query  #results  top-down  bottom-up  w/ optimization
Q1     131       14.3 sec  694 sec    2.68 sec (incl. 0.37 sec), 2bu 1bu 3td
Q2     58        8.5 sec   3.7 sec    4.64 sec (incl. 0.33 sec), 1bu 2td 3td 4td
Conclusion
Research avenue:
explore and leverage synergies between
XML (querying), (relevance-ranking) IR,
(domain-specific or personal) ontologies,
and machine learning (for classification, annotation, etc.)
Goal:
for every search, a system should find, within
one day of computer time and with < 1 min of intellectual effort,
the results that the best human experts could find with infinite time
⇒ pursued in the CLASSIX project (joint DFG project
with Norbert Fuhr’s group in Dortmund)