Managing Semi-Structured Data
Is the web a database?

Rules: What Rules?
“The web changed the digital information rules.”
• Easy to create web information
• Cannot all be stored in relational databases
• Cannot be queried in traditional ways

Semi-structured Data
• Fully structured data
  – Databases
  – Hidden web
• Fully unstructured data: ordinary text
• Semi-structured data: the grey area in between
  – No “good solutions;” no good “software, tools, or methodologies to manipulate [semi-structured data]”
  – “[Researchers] don’t even agree on the shape of the problem—much less, good approaches to solving it.”

Nature of the Problem
• Information embedded in text
  – Keyword search insufficient to answer queries
  – Natural language processing also insufficient
• Lack of agreement of vocabularies and schemas
  – “Reaching schema agreements among different communities is one of the most expensive steps in software design.”
  – “We need to be able to process information without requiring … a priori schema and vocabulary agreements among participants.”

Example: eBay
• “Impossible for … developers to define an a priori schema for the information.”
• “Information stored in raw text and searched using only keywords, significantly limiting its usability.”
• “Some standard entities (e.g., buyer, date, ask, bid …), but the meat of the information—the item descriptions—has a rich and evolving structure that isn’t captured.”

Why Schemas?
• “Schemas assign meaning to the data and … allow automatic data search, comparison, and processing.”
• Hierarchy of meaning
  – Raw text: strings (values)
  – Data: attribute-value pairs
  – Information: data in a conceptual framework
  – Knowledge: information with a degree of certainty or community agreement
  – Meaning: knowledge that is relevant or activates
• “We have to learn to use and exploit schemas as helpers, but not rely on their existence or allow them to be constraining factors.”

Schema-Agnostic Tools

Possible Places to Start
• Information retrieval (sophisticated search engines?)
  – Find (maybe?) but not answer
  – No DB-like query logic, updates, transactions
• XML
  – XML data can exist w/wo schemas; schemas can be defined before or after
  – Mixed text/data content
  – Languages for query (XQuery) and transformation (XSLT)
• OWL & RDF
  – RDF: subject-predicate-object triples
  – OWL: ontological descriptions usually over RDF triples
  – Classification & inferencing
  – Semantic annotation and tagging

Are We Stuck?

What’s Next?
• Better information-authoring tools (annotation assistance)
• Information extraction (automatic annotation)
• Creation and reuse of standard schemas and vocabularies (ontology generation)
• Mapping schemas to each other (schema mapping)
• Automatic data linking (data linking & merging)
• Automatic processing of semi-structured data (free-form queries)
– Florescu (Embley)

What’s beyond a database system?

Dataspace System
• Supports data and applications in a wide variety of formats all within a dataspace.
• Offers an integrated means of searching, querying, updating, and administering the dataspace.
• Has varying levels of service (e.g. “best-effort” or approximate answers)
• Includes tools to create tighter integration of the data, as necessary.
– Franklin, Halevy, Maier

“We are still at day one.”
“We need to find a compromise to the tension between the advantages of having schemas, in terms of better understanding and automatically processing the data, and disadvantages imposed by schemas, in terms of inflexibility and lack of evolution.”
– Florescu
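
As a rough illustration of the XML and RDF starting points listed under "Possible Places to Start", the sketch below takes a schema-less XML listing with mixed text/data content (loosely modeled on the eBay example), flattens its leaf elements into subject-predicate-object triples, and answers a simple pattern query. Everything in it is an assumption made for illustration: the listing, the tag names, the "item:1" identifier, and the helper functions extract_triples and match are hypothetical and are not taken from the slides or from any real auction system.

```python
import xml.etree.ElementTree as ET

# Hypothetical auction listing: a few agreed-on fields plus a free-form
# description that mixes prose with tagged values. No schema is declared.
LISTING = """
<item>
  <seller>alice</seller>
  <ask currency="USD">120</ask>
  <description>Vintage <brand>Olivetti</brand> typewriter from
    <year>1969</year>, recently serviced.</description>
</item>
"""


def extract_triples(xml_text, subject):
    """Flatten every non-empty leaf element into a (subject, predicate, object) triple."""
    root = ET.fromstring(xml_text)
    triples = []
    for elem in root.iter():
        # A leaf element has no child elements; its tag becomes the predicate.
        if len(elem) == 0 and elem.text and elem.text.strip():
            triples.append((subject, elem.tag, elem.text.strip()))
    return triples


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


triples = extract_triples(LISTING, "item:1")

# Query by predicate without any schema having been agreed in advance.
print(match(triples, p="brand"))

# The surrounding prose is still recoverable for keyword-style search.
description = ET.fromstring(LISTING).find("description")
full_text = " ".join("".join(description.itertext()).split())
print(full_text)
```

The point of the sketch is only that tagged values inside free text can be queried like data while the text itself stays searchable; it does not attempt classification, inferencing, or any of the open problems listed under "What's Next?".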