Download SemistructuredData - Tufts Computer Science

Data Warehousing/Mining Comp 150 DW Semistructured Data Instructor: Dan Hebert Data Warehousing/Mining 1 Semistructured Data  Everything that has no rigid schema – Schema is contained within the data (self-describing), OR – No separate schema, OR – Schema exists but places only loose constraints on data  Emerged as an important topic for a variety of reasons – Many data sources like WWW which we would like to treat as databases but cannot for the lack of schema – Desirable to have an extremely flexible format for data exchange between disparate databases – May want to view structured data as semistructured data for the purpose of browsing Data Warehousing/Mining 2 Motivation  Some data really is unstructured/semistructured – World Wide Web, – Data exchange formats – Some exotic database management systems, e.g., ACeDB, popular with biologists Data integration  Browsing  Data Warehousing/Mining 3 Motivation - World Wide Web  Why do we want to treat the Web as a database? – To maintain integrity – To query based on structure (as opposed to content) – To introduce some “organization”.  But the Web has no structure. The best we can say is that it is an enormous graph. Data Warehousing/Mining 4 Motivation - Data Formats Much (probably most) of the world’s data is in data formats  These are formats defined for the interchange and archiving of data  Data formats vary in generality. ASN.1 and XDR are quite general  Scientific data formats tend to be “fixed schemas”  The textual representation given by data formats is sometimes not immediately translatable into a standard relational/objectoriented representation  Data Warehousing/Mining 5 Motivation - Data Integration  Goal is to integrate all types of information, including unstructured information – Irregular, missing information, structure not fully known, dynamic schema evolution, etc.  Traditional data models and languages not well suited – Cannot accommodate heterogeneous data sets (different types and structures), etc. – Difficult to build software that will easily convert between two disparate models  OEM (Object Exchange Model) – Semistructured data model from TSIMMIS project at Stanford – Internal data structure for exchange of data between DBMSs – Used by other systems: e.g., Windows 95 registry, Lotus Notes Data Warehousing/Mining 6 Motivation - Browsing To query a database one needs to understand the schema.  However schemas have opaque terminology and the user may want to start by querying the data with little or no knowledge of the schema.  – Where in the database is the string “Casablanca” to be found? – Are there integers in the database greater than 216 ? – What objects in the database have an attribute name that starts with “act”?  While extensions to relational query languages have been proposed for such queries, there is no generic technique for interpreting them. Data Warehousing/Mining 7 The Model  Represent data as some kind of graph-like or treelike model – Cycles are allowed but usually refer to them as trees – Several different approaches with minor differences (easy to convert)   Data on labels or edges, nodes carry information or not Straightforward to encode relational and objectoriented databases – Issue: object identity Data Warehousing/Mining 8 Querying Semistructured Data  There are (at least) three approaches to this problem – Add arbitrary features to SQL or to your favorite query language – Find some principled approach to programs that are based on the type of the data – Represent the graph (or whatever the structure is) as appropriate predicates and use some variety of datalog on that structure Data Warehousing/Mining 9 The “Extend SQL” Approach In fact it is an attempt to extend the philosophy of OQL and comprehension syntax to these new structures  It is the approach taken in the design of UnQL and also of Lorel  Looks very similar to OQL (path expressions)  Data Warehousing/Mining 10 Example select from where Data Warehousing/Mining Entry.Movie.Title DB Entry.Movie.Director... 11 Syntax Issues Need (path) variables to tie paths and edges together  Paths of arbitrary length  – “Find all strings in db” – “Find whether “Allen” acted in “Casablanca” – Need regular expresions to constrain paths  Rich set of overloadings for operators to deal with comparisons of objects with values and of values with sets Data Warehousing/Mining 12 Underlying Computational Strategy  Model graph as a relational database and use relational query language. – Database large relation (node-id, label, node-id) – Used by Stanford group in LORE/LOREL  Complications – Labels are from heterogeneous set of types, need more than one relation – Additional relations if info to be stored in nodes – Various navigation issues Data Warehousing/Mining 13 Semistructured Data - Case Study Object Exchange Model Data Warehousing/Mining 14 OEM Features • Common model for heterogeneous information exchange, self-describing • Each object: OID  Label Type Value OID = unique identifier or NULL  Label = character string descriptor  Type = atomic data type or set  Value = atomic value or set of object references • “Help pages” for labels • Query language OEM-QL Data Warehousing/Mining 15 15 Representing Semistructured Data Using OEM Label Memory Addresses <collection, {b1, a1, ...}> Set Value b1: <book, {t, a}> t: <title, “Database and ...”> Atomic Value a: <author, {n, p}> n: <name, “Jeff Ullman”> p: <picture, “/gifs/ullman.gif”> a1: <article, {v, w, x}> v: <author, “Gio Wiederhold”> w: <title, “Mediators in the …”> x: <journal, “IEEE Computer”> ... Data Warehousing/Mining 16 16 An OEM Query Language: OEM-QL • Logic-based language for OEM – Match object patterns, generate variable bindings, construct new OEM objects from existing ones • Get articles published in “IEEE Computer” P :P:<articles {<journal “IEEE Computer”>}> • Get titles of books by “Jeff Ullman” <answer_title T> :<book {<author “Jeff Ullman”> <title T>}> Data Warehousing/Mining 17 17 Semistructured Data - Case Study WWW Extraction Data Warehousing/Mining 18 Problem  Lots of valuable information on the Web – irregular structure – highly dynamic Embedded in HTML  Limited query facilities  Data Warehousing/Mining 19 Data Extraction Tool Flexible, easy to use  Accommodate virtually any HTML source  Interface with existing system, e.g., data warehouse, user interface for querying  Query World Wide Web Extractor WH Integrator Data Warehouse Specification Data Warehousing/Mining 20 Approach  Extract Web data into OEM format – Query using OEM-QL Python-based, configurable parser  Declarative description of HTML source  – location of data on page – how to package data into OEM “Regular expression”-like syntax  Human intelligence rather than A.I.  Data Warehousing/Mining 21 Extractor Specification Consists of commands of the form: [ “variable(s)”, “source”, “pattern” ] Data Warehousing/Mining 22 HTML Source File <HTML> <HEAD> .. . <TABLE> <TR> <TH><I> header 1 </I></TH> <TH><I> header 2 </I></TH> <TH><I> header 3 </I></TH> </TR> <TR> <TD> text 1 </TD> <TD><A HREF=http://www.stuff/> text 2 </A></TD> <TD> text 3 </TD> </TR> .. . </TABLE> .. . </BODY> </HTML> Data Warehousing/Mining 23 [ Specification File [“root”, “get('http://www.example.test/')”, “#” ], [“__tempvar1”, “root”, “*<table>#</table>*” ], [“__tempvar2”, “split (__tempvar1,’</tr>’)”, “#” ], [“rows”, “__tempvar2[1:-1]”, “#” ], [“header1,header2_url,header2,header3”, “rows”, “*<td>#</td>*<a*href=#>#</a>*<td>#</td>*”] ] Data Warehousing/Mining 24 Result OEM Object ... <root complex { <rows complex { <header1 string “text 1”> <header2_url string “http://www.stuff”> <header2 string “text 2” <header3 string “text 3”> }> <rows complex { ... }> }> Data Warehousing/Mining 25 Basic Syntax:Variable  variable(l:p:t) – optional parameters for specification of corresponding OEM object l: label name  t: type  p: parent object   _variable – temporary data structure, does not appear as OEM object Data Warehousing/Mining 26 Basic Syntax: Source  split(variable,token) – creates a list with multiple elements using token as the element separator  get(URL) – obtain contents of HTML file at address URL Data Warehousing/Mining 27 Basic Syntax: Patterns  token1 # token2 – match and store current input (between tokens)  token1 * token2 – match, don’t store current input (between tokens) Data Warehousing/Mining 28 Syntactic Sugar  Functions for extracting commonly used HTML constructs – extract_table(variable),pattern – split_table_row(variable) – split_table_column(variable) – extract_list(variable),pattern – split_list(variables) Data Warehousing/Mining 29 Advanced Features  Customization of output – structure, label names, data type, ... Extraction across multiple HTML pages  Graceful recovery from parse errors  – resume parsing using next input from source  Multiple patterns in single command – follow different parse tree depending on structure in source Data Warehousing/Mining 30 ... Sample Extraction Scenario Data Warehousing/Mining 31 Extracted OEM Data root OEM-QL}query: complex { complex { temperature city_temp complex { country string city_url url string city weather_today high_today low_today weather_tom high_tomorrow low_tomorrow } city_temp complex { country string city_url url string city … } … } “Austria” http://www… “Vienna” string “snow” string “-2” string “-7” string “snow” string “-2” string “-7” “Belgium” http://www… “Brussles” <city C {<high H> < low L>}> :<temperature {<city_temp {<country “Germany”> <city C> <high_today H> <low_today L>}>}> Data Warehousing/Mining 32 Evaluation  Better than – writing programs – YACC, PERL, etc. – A.I.  Can do better – GUI tool to simplify the generation of extractor specification – Machine learning or data mining techniques to automatically infer structure... Data Warehousing/Mining 33

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download SemistructuredData - Tufts Computer Science