INEX – a broadly accepted data set for XML database processing?
Pavel Loupal, Michal Valenta

Presentation Content
  1. INEX initiative
  2. INEX data set
  3. Utilization framework
  4. Example – approximate XML tree embedding

INEX Initiative 1/3
  2001 – reference data set for information retrieval
    University of Duisburg-Essen – Norbert Fuhr, Saadia Malik
    Queen Mary University of London – Mounia Lalmas
  2003 – 69 participants (mainly universities)
  2 workshops (2002, 2003)
  Open discussion about the current state of the project

INEX Initiative 2/3
  Stage 1 – data collection (by IEEE)
  Stage 2 – evaluation of reference queries
    30 Content Only (CO)
    36 Content and Structure (CAS)
  Stage 3 – manual relevance assessment of query results
  continues…

INEX Initiative 3/3
  Stage 3 – our point of involvement in INEX:
    Assessment of queries 83 and 84 – 1000 documents each
    2-dimensional scale (exhaustivity, specificity)
    Relevance assessment on XML elements (parent-child dependencies)
    Finished in February 2004
  Stage 4 (current)
    Study of researchers' behaviour
    Heterogeneous resources / distributed systems

INEX Initiative - Assessment

INEX Data Set Structure 1/3
  Current version 1.4 – 536 MB
  6 IEEE Transactions, 12 journals (1995-2002)
  12107 articles – XML text only (without pictures)
  Organized as a file system hierarchy
  On average, each article has 1532 nodes and is 45 kB in size
  Average depth: 6.9

INEX Data Set Structure 2/3
  /inex-1.4
    /dtd
      ...
      xmlarticle.dtd
    /xml
      /an
        /1995
          ...
          a1019.xml
          a1032.xml
          a1034.xml
          ...
        /...
        /2002
      /...
      /ts

INEX Data Set Structure 3/3
  <article>
    <fm>
      ...
      <ti>IEEE Transactions on ...</ti>
      <atl>Construction of ...</atl>
      <au>
        <fnm>John</fnm>
        <snm>Smith</snm>
        <aff>University of ...</aff>
      </au>
      <au>...</au>
      ...
    </fm>
    <bdy>
      <sec>
        <st>Introduction</st>
        <p>...</p>
        ...
      </sec>
      <sec>
        <st>...</st>
        ...
        <ss1>...</ss1>
        <ss1>...</ss1>
        ...
      </sec>
      ...
    </bdy>
    <bm>
      <bib>
        <bb>
          <au>...</au><ti>...</ti>
          ...
        </bb>
        ...
      </bib>
    </bm>
  </article>

Data Set Utilization – Framework 1/2
  Native XML storage (Apache Xindice)
  Key features:
    Inner structure: collections & documents
    Standard API (XML:DB or XML-RPC)
    XPath expressions over collections & documents
    Metadata

Data Set Utilization – Framework 2/2
  Web interface – JavaServer Pages (JSPs)
  Usage of the XML:DB Java API:

    import org.xmldb.api.DatabaseManager;
    import org.xmldb.api.base.Collection;
    import org.xmldb.api.base.Resource;

    // Assumes the Xindice XML:DB driver is already registered with DatabaseManager.
    String url = "xmldb:xindice://localhost:8080/inex/mu/2001";
    Collection col = DatabaseManager.getCollection(url);  // open the collection
    Resource doc = col.getResource("a1019.xml");          // fetch one article
    System.out.println(doc.getContent());                 // print its XML content

Approximate Tree Embedding 1/4
  Aim: approximately embed one XML tree (query) into another (data)
  Algorithm history:
    Kilpeläinen – NP-complete problem
    Schlieder – polynomial in practical examples
    Vana – further improvements

Approximate Tree Embedding 2/4

Approximate Tree Embedding 3/4
  Query:
    <article>
      <yr>2001</yr>
      <au>
        <snm>Smith</snm>
      </au>
    </article>
  Data:
    <articles>
      …
      <article yr="2001">
        <authors>
          <au>
            <fnm>John</fnm><snm>Smith</snm>
          </au>
          <au>
            <fnm>Mark</fnm><snm>Knopfler</snm>
          </au>
        </authors>
      </article>
      …
    </articles>

Approximate Tree Embedding 4/4

Conclusion
  INEX initiative overview
  INEX data set + our testing framework = suitable for testing algorithms & approaches
  Further discussion
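
The query/data pair from the Approximate Tree Embedding slides can be checked with a small sketch. This is not the algorithm of Kilpeläinen, Schlieder, or Vana. It is a deliberately simplified, non-injective relaxation with no cost model: a query node matches a data node with the same label, and each query child only has to match somewhere strictly below that data node. The class `EmbeddingSketch`, its `Node` type, and all method names are hypothetical, and the `yr` attribute is assumed to be modeled as an ordinary child node so that it can match the query element `<yr>`.

```java
import java.util.Arrays;
import java.util.List;

class EmbeddingSketch {
    /** Labeled tree node; element names, attribute names and text values are all plain labels here. */
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    /** True if q matches d by label and every child of q embeds strictly below d. */
    static boolean embedsAt(Node q, Node d) {
        if (!q.label.equals(d.label)) {
            return false;
        }
        for (Node qc : q.children) {
            if (!embedsBelow(qc, d)) {
                return false;
            }
        }
        return true;
    }

    /** True if q embeds at some proper descendant of d (intermediate nodes may be skipped). */
    static boolean embedsBelow(Node q, Node d) {
        for (Node dc : d.children) {
            if (embedsAt(q, dc) || embedsBelow(q, dc)) {
                return true;
            }
        }
        return false;
    }

    /** True if q embeds at d itself or anywhere below it. */
    static boolean embedsAnywhere(Node q, Node d) {
        return embedsAt(q, d) || embedsBelow(q, d);
    }

    /** The query tree from the slide. */
    static Node query() {
        return new Node("article",
            new Node("yr", new Node("2001")),
            new Node("au", new Node("snm", new Node("Smith"))));
    }

    /** The data tree from the slide, with the yr attribute modeled as a child node (assumption). */
    static Node data() {
        Node article = new Node("article",
            new Node("yr", new Node("2001")),
            new Node("authors",
                new Node("au", new Node("fnm", new Node("John")), new Node("snm", new Node("Smith"))),
                new Node("au", new Node("fnm", new Node("Mark")), new Node("snm", new Node("Knopfler")))));
        return new Node("articles", article);
    }

    public static void main(String[] args) {
        // The query embeds even though <authors> sits between <article> and <au> in the data.
        System.out.println(embedsAnywhere(query(), data()));
    }
}
```

Under these assumptions the program prints `true`: the extra `<authors>` level and the attribute-versus-element difference do not prevent a match, which is the kind of tolerance an approximate embedding is meant to provide. A realistic implementation would additionally assign costs to skipped or renamed nodes and rank the results.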