INEX – a broadly accepted data set for XML database processing?
Pavel Loupal, Michal Valenta

Presentation Content
1. INEX initiative
2. INEX data set
3. Utilization framework
4. Example – approximate XML tree embedding
Valenta, Loupal: INEX – a broadly accepted data set for XML database processing?
INEX Initiative 1/3

2001 – reference data set for information retrieval
  University of Duisburg-Essen – Norbert Fuhr, Saadia Malik
  Queen Mary University of London – Mounia Lalmas
2003 – 69 participants (mainly universities)
2 workshops (2002, 2003)
Open discussion about the current state of the project
INEX Initiative 2/3

Stage 1 – data collection (by IEEE)
Stage 2 – evaluation of reference queries
  30 Content Only (CO)
  36 Content and Structure (CAS)
Stage 3 – manual relevance assessment of query results
continues…
INEX Initiative 3/3

Stage 3 – our point of entry into INEX:
  Assessment of queries 83 and 84 – 1000 documents each
  2-dimensional scale (exhaustivity, specificity)
  Relevance assessment on XML elements (parent-child dependencies)
  Finished in February 2004
Stage 4 (current)
  Study of researchers' behaviour
  Heterogeneous resources / distributed systems
INEX Initiative - Assessment
INEX Data Set Structure 1/3

Current version 1.4 – 536 MB
6 IEEE Transactions, 12 journals (1995-2002)
12107 articles – XML text only (without pictures)
Organized in a file-system manner
On average, each article has
  1532 nodes, 45 kB
  average depth: 6.9
INEX Data Set Structure 2/3
/inex-1.4
  /dtd
    ...
    xmlarticle.dtd
  /xml
    /an
      /1995
        ...
        a1019.xml
        a1032.xml
        a1034.xml
        ...
      /...
      /2002
    /...
    /ts
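
The layout above can be traversed with the standard Java file API. The following is a minimal sketch, not part of the original framework: the class name InexFileLister and the method articleFiles are hypothetical, and it simply assumes that every file ending in .xml below the given root should be listed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists the XML files found under a directory such as /inex-1.4/xml.
final class InexFileLister {

    /** Returns all regular files ending in ".xml" below xmlRoot, sorted by path. */
    static List<Path> articleFiles(Path xmlRoot) {
        // Files.walk descends through the journal and year directories recursively.
        try (Stream<Path> paths = Files.walk(xmlRoot)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(".xml"))
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling InexFileLister.articleFiles with the path of the /xml directory would return the article files of all journals and years in one list; the filter may need adjusting if the directories also hold XML files that are not articles.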
INEX Data Set Structure 3/3
<article>
  <fm>
    ...
    <ti>IEEE Transactions on ...</ti>
    <atl>Construction of ...</atl>
    <au>
      <fnm>John</fnm>
      <snm>Smith</snm>
      <aff>University of ...</aff>
    </au>
    <au>...</au>
    ...
  </fm>
  <bdy>
    <sec>
      <st>Introduction</st>
      <p>...</p>
      ...
    </sec>
    <sec>
      <st>...</st>
      ...
      <ss1>...</ss1>
      <ss1>...</ss1>
      ...
    </sec>
    ...
  </bdy>
  <bm>
    <bib>
      <bb>
        <au>...</au><ti>...</ti>
        ...
      </bb>
      ...
    </bib>
  </bm>
</article>
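
To illustrate how this structure can be addressed, here is a minimal sketch that evaluates an XPath expression against a single article using only the standard Java XML API (not Xindice). The class ArticleXPathSketch and its select method are hypothetical names; the element names (fm, au, snm, bdy, sec, st) are the ones shown in the skeleton above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: returns the text of every node selected by an XPath expression.
final class ArticleXPathSketch {

    static List<String> select(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Parse the XML text and evaluate the expression as a node set.
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot evaluate " + expression, e);
        }
    }
}
```

For an article shaped like the skeleton, an expression such as /article/fm/au/snm selects the surnames of the article's authors, and /article/bdy/sec/st selects the section titles.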
Data Set Utilization – Framework 1/2

Native XML storage (Apache Xindice)
Key features:
  Inner structure: collections & documents
  Standard API (XML:DB or XML-RPC)
  XPath expressions over collections & documents
  Metadata
Data Set Utilization – Framework 2/2


Web interface – Java Server Pages (JSPs)
Usage of XML:DB Java API:
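// imports: org.xmldb.api.DatabaseManager, org.xmldb.api.base.Collection, org.xmldb.api.base.Resource
// (the Xindice driver must already be registered with DatabaseManager)
String url = "xmldb:xindice://localhost:8080/inex/mu/2001";
Collection col = DatabaseManager.getCollection(url);
Resource doc = col.getResource("a1019.xml");
System.out.println(doc.getContent());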
Approximate Tree Embedding 1/4

Aim: approximately embed one XML tree (query) into another (data)
Algorithm history:
  Kilpeläinen – NP-complete problem
  Schlieder – polynomial in practical examples
  Vana – further improvements
Approximate Tree Embedding 2/4
Approximate Tree Embedding 3/4
Query:
<article>
  <yr>2001</yr>
  <au>
    <snm>Smith</snm>
  </au>
</article>
Data:
<articles>
  …
  <article yr="2001">
    <authors>
      <au>
        <fnm>John</fnm><snm>Smith</snm>
      </au>
      <au>
        <fnm>Mark</fnm><snm>Knopfler</snm>
      </au>
    </authors>
  </article>
  …
</articles>
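
The query/data pair above can be checked with a strongly simplified embedding test. The sketch below is not the algorithm of Kilpeläinen, Schlieder or Vana; it is an illustration under stated assumptions: attributes and text are normalized to child nodes (so that <yr>2001</yr> can match yr="2001"), a parent-child edge in the query may map to an ancestor-descendant path in the data (so <au> is found below <authors>), and neither injectivity nor embedding costs are considered. All class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical, simplified embedding check; not the algorithm from the cited work.
final class EmbeddingSketch {

    /** Labeled tree node; attributes and text values become child nodes. */
    static final class LabeledNode {
        final String label;
        final List<LabeledNode> children = new ArrayList<>();
        LabeledNode(String label) { this.label = label; }
    }

    static LabeledNode parse(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            return convert(root);
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    static LabeledNode convert(Element e) {
        LabeledNode n = new LabeledNode(e.getTagName());
        // An attribute name="value" is treated like a child element <name>value</name>.
        NamedNodeMap attrs = e.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            LabeledNode a = new LabeledNode(attrs.item(i).getNodeName());
            a.children.add(new LabeledNode("#" + attrs.item(i).getNodeValue()));
            n.children.add(a);
        }
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            org.w3c.dom.Node k = kids.item(i);
            if (k instanceof Element) {
                n.children.add(convert((Element) k));
            } else if (k.getNodeType() == org.w3c.dom.Node.TEXT_NODE
                    && !k.getNodeValue().trim().isEmpty()) {
                // Non-blank text becomes a leaf whose label starts with '#'.
                n.children.add(new LabeledNode("#" + k.getNodeValue().trim()));
            }
        }
        return n;
    }

    /** True if q maps onto d itself and every child of q embeds somewhere below d. */
    static boolean embedsAt(LabeledNode q, LabeledNode d) {
        if (!q.label.equals(d.label)) return false;
        for (LabeledNode qc : q.children) {
            boolean found = false;
            for (LabeledNode dc : d.children) {
                if (embedsBelow(qc, dc)) { found = true; break; }
            }
            if (!found) return false;
        }
        return true;
    }

    /** True if q embeds at d or at any descendant of d. */
    static boolean embedsBelow(LabeledNode q, LabeledNode d) {
        if (embedsAt(q, d)) return true;
        for (LabeledNode dc : d.children) {
            if (embedsBelow(q, dc)) return true;
        }
        return false;
    }
}
```

Under these assumptions, EmbeddingSketch.embedsBelow(parse(query), parse(data)) accepts the query/data pair shown above and rejects the same query with a surname that does not occur in the data. A real approximate embedding would additionally rank matches by the cost of the transformations involved.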
Approximate Tree Embedding 4/4
Conclusion

INEX initiative overview
INEX data set + our testing framework = suitable for testing algorithms & approaches
Further discussion