Databases and Information Retrieval: Rethinking the Great Divide
SIGMOD Panel, 14 Jun 2005
Jayavel Shanmugasundaram, Cornell University

10,000-Foot View of Data Management
[Figure: Database Systems and Information Retrieval Systems placed on two axes, Data (Structured, Unstructured) and Queries (Complex and Structured, Ranked Keyword Search), separated by "The Great Data Divide" and "The Great Query Divide".]

Bridging the Great Divide
• Option 1: Tie together existing DB and IR systems
  – Example: Approaches based on SQL/MM
• Option 2: Extend existing DB systems with IR functionality, or vice versa
  – Example: Add searching and ranking to RDBMSs
• Option 3: Design a new data management system from the ground up
  – Example: Quark data management system

Why Option 1 Won't Work
[Figure: the same diagram of Database Systems and Information Retrieval Systems on the Data (Structured, Unstructured) and Queries (Complex and Structured, Ranked Keyword Search) axes.]

Bridging the Great Divide
• Option 1: Tie together existing DB and IR systems
  – Example: Approaches based on SQL/MM
  – Drawback: Not powerful enough
• Option 2: Extend existing DB systems with IR functionality, or vice versa
  – Example: Add searching and ranking to RDBMSs
• Option 3: Design a new data management system from the ground up
  – Example: Quark data management system

Example document and query
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction"> Searching on structured text is becoming more important with XML … </section>
      …
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …
Query: Find relevant elements in important workshops between the years 1999 and 2001 that are about 'Ricardo' and 'XML'

Why Extending (R)DBMSs Won't Work
• Violates many assumptions "hardwired" into current database systems
• Structured queries over structured fields, keyword search queries over text fields
  – Is author name a structured or text field?
• Operators have precise, well-defined semantics
  – Even the query result is not well-defined: do we return a paper or a workshop?
• Scoring is tacked on as a relational attribute
  – How can this scoring generalize IR scoring?

Why Extending IR Systems Won't Work
• IR systems provide little support for structured data
• No support for complex operators
  – How can complex queries be evaluated?
• Scoring does not take structure into account
  – How can scoring capture both structured and unstructured data?

Bridging the Great Divide
• Option 1: Tie together existing DB and IR systems
  – Example: Approaches based on SQL/MM
  – Drawback: Not powerful enough
• Option 2: Extend existing DB systems with IR functionality, or vice versa
  – Example: Add searching and ranking to RDBMSs
  – Drawback: Shoehorns alien functionality into already complex systems
• Option 3: Design a new data management system from the ground up
  – Example: Quark data management system

Why Option 3 Will Work
• Designed from the ground up with three principles
• Structural data independence
  – Users can issue any query (complex and keyword) over any data (structured and unstructured)
• Generalized scoring
  – Scoring works over any mix of structured and unstructured data (e.g., XRank over HTML and XML)
• Flexible query language
  – Allows for arbitrary return results and scores (e.g., TeXQuery, precursor to XQuery Full-Text, NEXI)

Bridging the Great Divide
• Option 1: Tie together existing DB and IR systems
  – Example: Approaches based on SQL/MM
  – Drawback: Not powerful enough
• Option 2: Extend existing DB systems with IR functionality, or vice versa
  – Example: Add searching and ranking to RDBMSs
  – Drawback: Shoehorns alien functionality into already complex systems
• Option 3: Design a new data management system from the ground up
  – Example: Quark data management system
  – Most promising alternative!
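
Illustrative sketch (not from the talk): the example query from the slides, "relevant elements in workshops between 1999 and 2001 about 'Ricardo' and 'XML'", mixes a structured predicate (the year range) with a ranked keyword search. The Python below is a minimal sketch of that mix under loud assumptions: the sample data, the `year` attribute, the `search` and `term_count` helpers, the raw term-count score, and the rule of returning the most specific matching element are all hypothetical choices made for illustration. It is not how Quark, XRank, or TeXQuery work.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, loosely modelled on the workshop document in the
# slides. The "year" attribute and the second workshop are assumptions.
SAMPLE = """
<workshops>
  <workshop year="2000">
    <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
    <paper id="1">
      <title>XQL and Proximal Nodes</title>
      <author>Ricardo Baeza-Yates</author>
      <author>Gonzalo Navarro</author>
      <section name="Introduction">Searching on structured text is becoming more important with XML</section>
    </paper>
  </workshop>
  <workshop year="2003">
    <title>A Later Workshop</title>
    <paper id="2">
      <title>Ricardo on XML</title>
    </paper>
  </workshop>
</workshops>
"""


def term_count(elem, term):
    """Count case-insensitive occurrences of term in all text under elem."""
    text = " ".join(elem.itertext()).lower()
    return text.count(term.lower())


def contains_all(elem, keywords):
    """True if every keyword occurs somewhere in the text under elem."""
    return all(term_count(elem, k) > 0 for k in keywords)


def search(root, keywords, year_from, year_to):
    """Return (score, tag, id) tuples, highest score first."""
    results = []
    for workshop in root.iter("workshop"):
        # Structured part of the query: an exact range predicate on the year.
        if not (year_from <= int(workshop.get("year")) <= year_to):
            continue
        # Keyword part: consider the workshop and every element below it.
        for elem in workshop.iter():
            if not contains_all(elem, keywords):
                continue
            # Assumed rule: skip an element if one of its children already
            # contains all keywords, so only the most specific match is kept.
            if any(contains_all(child, keywords) for child in elem):
                continue
            score = sum(term_count(elem, k) for k in keywords)
            results.append((score, elem.tag, elem.get("id")))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(search(root, ["Ricardo", "XML"], 1999, 2001))
```

Even this toy version has to answer the question the slides raise, namely whether the result is a paper or a workshop. Here the answer is hard-coded as "the most specific element", which is exactly the kind of decision a flexible query language would leave to the user.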