Download On Efficient Part-match Querying of XML Data

www.fei.vsb.cz On Efficient Part-match Querying of XML Data Michal Krátký, [email protected] Marek Andrt, [email protected] Department of Computer Science VŠB–Technical University of Ostrava Czech Republic DATESO 2004 Contents      Introduction – XML, query languages, indexing XML data, part-match querying. Multi-dimensional approach to indexing XML data. Extension of the multi-dimensional approach for keyword-based querying. Index data structures. Preliminary experimental results. 2/21 Introduction     Native XML database. Set of documents is a database, DTD (XML Schema) is its database schema. XML query languages (XPath, XQL, XQuery,…). A common feature is a possibility to formulate paths in the XML graph (regular path expressions, XPath axes and so on). Approaches based on: relational decomposition, trie, multi-dimensional, signatures and so on. 3/21 Part-match querying XML data    Some approaches for keyword or phrase based searching were published: XQuery-IR (WebDb’02), XKeyword (ICDE’03) and so on. Knowledges from IR are applied. Query languages contain operators for matching term occurrence. For example contains(), ~=. 4/21 Multi-dimensional approach to indexing XML data A graph is a set of the paths. XML document is decomposed to paths and labelled paths.  labelled path: lp ∈ XLP: s0,s1,...,slPN  path: p ∈ XP: idU(u0),idU(u1),...,idU(ulLP),s idU(ui) – unique number of a node ui  5/21 Indexes    Term index – a storage of strings si of an XML document and their idT(si). Labelled path index – a storage of points representing labelled paths. Path index – a storage of points representing paths. 6/21 Example labelled path index, path index   books,book,id; books,book,title and books,book,author. Points (0,1,2); (0,1,4) and (0,1,6) are created using idT of element and attribute names, idLP = 0, 1 and 2. For example, the path to value The Two Towers. The labelled path books,book,title with idLP 1 belongs. Vector (1,0,1,3,5) is created using idLP, unique numbers idU of elements, and idT of the term. 7/21 Query for values of elements and attributes   XPath query: books/book[author=“Joseph Heller”] 3 phases of a query processing, finding: ● idT of terms from the term index, ● idLP 2 of labelled path books,book,author from the labelled path index: point query (0,1,6), ● points from the path index: range query (2,0,0,0,12)×(2,max,max,max,12). 8/21 Enhanced querying   XPath axes are processed by a range query or sequence of range queries. For example axis descendent: (0,idU(u0),…,idU(ul-1), idU(u),0,…, 0):(maxD,idU(u0),…,idU(ul-1), idU(u), maxD,…,maxD). Regular path expression. For example //title[name=‘Chaudhri’] is processed by a complex range query. The query is possible to process in one run in the multidimensional data structure. 9/21 Comparison of approaches    Mainline approaches (XISS, XPath Accelerator) index single element (attribute). For example query /e1[e2=‘dog’] is processed by joining single results. Result formatting. For example a result of the query //name is all matched subtree. Operation Update and Insert are simple possible. 10/21 Keyword-based searching  Motivation: /PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE    Path-Labelled Path-Term (PLT) index is added. The index indexes an 3-dimensional space: (idP, idLP, idT). idP is added into the point representing path: (idP,idLP,idU0,idU1,…,idUl,s). 11/21 Path-Labelled Path-Term index Example 12/21 Query processing plan Example 13/21 Index data structures Paged and balanced multi-dimensional data structures – (B)UB-trees, variants of Rtrees.  Problems: ● indexing points with different dimensions. ● narrow range query – the signature is applied for efficient processing – Signature R-tree.  Efficient processing of the complex range query.  14/21 Efficient processing the complex range query   Complex range query = sequence of range queries: qb1,qb2,…,qbn. The query is possible to process in one run in the multi-dimensional data structure. 15/21 Experimental results  Protein Sequence Database XML document: ● the document size is 683MB, ● number of elements: 21,305,818, ● number of attributes:1,290,647. ● maximal length of path: 7.  BUB-forest, R*-forest, Signature BUB-tree and R*-tree. Index structures: trees indexing spaces of dimension n=7 and n=9. 16/21 Experimental results Queries: ProteinDatabase/ProteinEntry/[reference/refinfo/ authors/author='Smith, E.L.'] 17/21 Experimental results Regular path expression     Query: //uid='89071748' , 5 labelled paths were matched. Naive processing the complex range query: DAC: 368 Efficient processing the complex range query: DAC: 139 Time: 0.03s, Improvement: 2.5x 18/21 Preliminary experimental results Keyword-based searching  othello.xml: ● document size is 250kB, ● maximal length of the path: 6 ● number of paths: 4,967 ● number of labelled paths: 13 ● number of terms: 8,744 ● PLT index: 27,127 19/21 Preliminary experimental results Keyword-based searching  Query: /PLAY[PERSONAE/PERSONA~=OTHELLO]/TITLE     Labelled path index: result size: 1, DAC: 3 PLT index: result size: 1, DAC: 3 Path index: result size: 1, DAC: 13 Path index: result size: 1, DAC: 4 20/21 Conclusion      Θ(m × log n), Θ(c × m × log n) vs. Θ(m1 × m2), m1 ,m2 ≥ m. Efficient processing a query with AND condition. Signature is applied. Multi-dimensional approach for term searching may be applied (e.g. *comp*). The update operation of XML documents. Comparison with another approaches for test collections (INEX, XMark, …). http://www.cs.vsb.cz/arg 21/21 References    M. Krátký, J. Pokorný, V. Snášel: Implementation of XPath Axes in the Multi-dimensional Approach to Indexing XML Data. Accepted at International Workshop on Database Technologies for Handling XML information on the Web, DataX, Int'l Conference on EDBT, Heraklion - Crete, Greece, 2004. M. Krátký, J. Pokorný, T. Skopal, V. Snášel: The Geometric Framework for Exact and Similarity Querying XML data. In Proceedings of EurAsia-ICT 2002. Shiraz, Iran, Springer Verlag, LNCS 2510. M. Krátký, T. Skopal, and V. Snášel: Multidimensional Term Indexing for Efficient Processing of Complex Queries. Kybernetika, Journal of the Academy of Sciences of the Czech Republic, 2004, accepted. Paths, labelled paths  Paths 0,1,2,’003-04212’; 0,5,6,’001-00863’ and 0,9,10,’045-00012’ belong to the labelled path books,book,id, ...  Paths 0,1,4,’J.R.R. Tolkien’; 0,5,8,’J.R.R. Tolkien’ and 0,9,12,’Joseph Heller’ belong to the labelled path books,book,author. Complex queries  Query for values and XPath axis processing, e.g. books/book[author='Joseph Heller']/title ● Combination of above described techniques: query for value, XPath axis processing.  Regular path expression queries for example: books//author ● A sequence of range queries processes this query in the path and labelled path index: books, author - books,*,author - books,*,…,*,author. (B)UB-tree, R-tree UB-tree B-tree Z-address Narrow range query – signature multi-dimensional ds    Regions intersecting a query hyper box are searched, O(NI × logc n). Ratio cR of relevant NR and intersect NI regions ≪ 1 with an increasing dimension. Signatures are applied to better filtration of irrelevant regions – signature md structures. Signature R-tree Experimental results Queries: ProteinDatabase/ProteinEntry/[reference/refinfo/ authors/author='Smith, E.L.'] Experimental results

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download On Efficient Part-match Querying of XML Data