4/29: Issues in Bridging DB & IR

Announcements:
- Next class: interactive review (come prepared)
- Homework III solutions are online
- Demos tomorrow (instructions will be mailed by the end of the class)

(First, some discussion of BibFinder: how queries are mapped, etc.)

CEAS Online Evaluations
"Please be both enthusiastic and adamant about this in your classes" - Associate Dean for Academic Affairs
- You can do them at https://intraweb.eas.asu.edu/eval; available until the end of the day May 5th
- Instructors get to see the evaluations only after the grades have been given, so your grade is unfettered by what you might say, and you don't need to feel compelled to be particularly nice
- Your feedback would be appreciated, especially the written comments. Last semester I got 2,196 words of comments; let us see if we can break the record ;-)

The popularity of the Web brings two broad challenges to databases:
1. Supporting heterogeneous data (combining DB and IR). This can be tackled in the presence of a single database.
2. Integration of autonomous data sources (data/information integration). Technically this also has to handle heterogeneous data, but we will mostly assume the sources are "quasi-relational".

For combining DB and IR, the issues are:
- How to do effective querying in the presence of structured and text data
- How to support IR-style querying on databases (e.g. the Stuff I Have Seen project), because users seem to know IR/keyword-style querying better (notice the irony here: we said structure is good because it supports structured querying)
- How to support imprecise queries

DB vs. IR
- Databases allow structured querying; queries and results (tuples) are different kinds of objects; soundness and completeness are expected; the user is expected to know what she is doing.
- IR supports only unstructured querying; queries and results are both documents; high precision and recall are hoped for; the user is expected to be a dunderhead.

Some specific problems:
1. How to handle textual attributes?
2. How to support keyword-based querying?
3. How to handle imprecise queries? (Ullas Nambiar's work)

1. Handling text fields in data tuples
Often you have database relations some of whose fields are "textual". E.g. a movie database which has, in addition to Year, Director, etc., a column called Review holding unstructured text. Normal DB operations ignore this unstructured content (you can't join over it). SQL sometimes supports a "Contains" constraint (e.g. give me movies whose Review contains "Rotten").

Soft joins: WHIRL [Cohen]
We can extend the notion of joins to "similarity joins", where similarity is measured in terms of vector similarity over the text attributes. The join tuples are then output in ranked form, with rank proportional to the similarity. A neat idea, but it has implementation difficulties: most tuples in the cross-product will have non-zero similarity, so we need query processing that somehow produces only the highly ranked tuples (see the sketch below).
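To make the soft-join idea concrete, here is a minimal sketch. It is not Cohen's actual WHIRL system (whose query processing exists precisely to avoid enumerating the cross-product); it naively scores every pair of tuples using TF-IDF cosine similarity over a text attribute and returns the pairs ranked by score. The relations, attribute names, and helper functions are all invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Build unit-length TF-IDF vectors for a list of strings."""
    bags = [Counter(t.lower().split()) for t in texts]
    n = len(texts)
    df = Counter(term for bag in bags for term in bag)  # document frequency
    vecs = []
    for bag in bags:
        v = {t: tf * math.log(n / df[t]) for t, tf in bag.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def cosine(u, v):
    """Dot product of two sparse unit vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def soft_join(r, s, key, top_k=3):
    """Naive similarity join on text attribute `key`: score every pair in
    the cross-product, return the top-k pairs ranked by similarity.
    (This full enumeration is exactly the implementation difficulty noted
    above.)"""
    vecs = tfidf_vectors([t[key] for t in r] + [t[key] for t in s])
    rv, sv = vecs[:len(r)], vecs[len(r):]
    pairs = [(cosine(u, v), rt, st)
             for u, rt in zip(rv, r) for v, st in zip(sv, s)]
    return sorted(pairs, key=lambda p: -p[0])[:top_k]

# Hypothetical relations joined softly on their text fields.
movies = [{"title": "Psycho", "text": "a rotten remake of a classic"},
          {"title": "Vertigo", "text": "a dizzying classic thriller"}]
blurbs = [{"critic": "A", "text": "rotten remake"},
          {"critic": "B", "text": "classic thriller"}]
for score, m, b in soft_join(movies, blurbs, "text"):
    print(f"{score:.2f}  {m['title']}  ~  critic {b['critic']}")
```

On this toy data the best-matching pairs come out on top, but note that three of the four cross-product pairs already get non-zero scores; at database scale that is why ranked retrieval, not exhaustive scoring, is needed.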
2. Supporting keyword search on databases
How do we answer a query like "Soumen Sunita"? Issues:
- The schema is normalized (not everything is in one table)
- How do we rank multiple tuples which contain the keywords?

What BANKS does [BANKS: Keyword Search in DB]:
- The whole database is seen as a directed graph (edges correspond to foreign keys)
- Answers are subgraphs, ranked by edge weights

3. Supporting imprecise queries
There is an increasing number of Web-accessible databases (bibliographies, reservation systems, department catalogs, etc.), and extracting the desired information from them is difficult:
- Support for precise queries only: exactly matching tuples
- Limited query capabilities provided by form-based query interfaces
- Lack of schema/domain information
- Increasing complexity of the types of data (e.g. hypertext, images, etc.)
Often the user wants "about the same" instead of "exact": want cars priced "around" $7000; bibliography search, i.e. find similar publications.
Solution: provide answers closely matching the query constraints.

Relaxing queries
It is obvious how to relax certain types of attribute values: price=7000 is approximately the same as price=7020. But how do we relax categorical attributes? How should we relax Make=Honda? Two possible approaches:
- Assume that domain-specific information about the similarity of values is available (difficult to satisfy in practice)
- Attempt to derive the similarity between attribute values directly from the data

Finding similarities between attribute values
Qn: How do we compute the similarity between Make=Honda and Make=Chevrolet?
Idea: Compare the set of all tuples where Make=Honda to the set of all tuples where Make=Chevrolet. Consider each set of tuples as a vector of bags (where the bags correspond to the individual attributes) and use IR similarity techniques to compare the vectors.

5/4: Challenges in answering imprecise queries

Challenges:
- Extracting additional tuples with minimal domain knowledge
- Estimating similarity with minimal user input

We introduce IQE (Imprecise Query Engine), which:
- Uses the query workload to identify other precise queries
- Extracts additional tuples satisfying a query by issuing similar precise queries
- Measures the distance between queries using answerset similarity

Answerset similarity
The answerset A(Q) is the set of all answer tuples of query Q given by relation R. Query similarity is defined over answersets: Sim(Q1, Q2) := Sim(A(Q1), A(Q2)). Example answersets (ellipses as in the slides):
- A(Q: Author=Widom) = { ⟨Widom, Stream…, VLDB, 2002⟩, ⟨Widom, Optimize…, ICDE, 1998⟩ }
- A(Q: Author=Ullman) = { ⟨Ullman, Optimize…, PODS, 1998⟩, ⟨Ullman, Mining…, 2000⟩ }

Measuring answerset similarity:
- Relational model: exact match between tuples captures only complete overlap
- Vector space model: matching keywords also detects partial overlaps

Problem: a vector space model representation for answersets. Answer: the SuperTuple, one bag of keywords per attribute. For example:

ST(Q: Author=Widom)
  Co-author:  R. Motwani:3, Molina:6, …
  Title:      warehouse:5, optimizing:2, streams:6, …
  Conference: SIGMOD:3, VLDB:4, …
  Year:       2000:6, 1999:5, …

ST(Q: Author=Ullman)
  Co-author:  C. Li:5, R. Motwani:7, …
  Title:      Data-mining:3, optimizing:5, …
  Conference: SIGMOD:5, VLDB:5, …
  Year:       2000:6, 1999:5, …

Similarity measures
Jaccard similarity metric with bag semantics: SimJ(Q1, Q2) = |Q1 ∩ Q2| / |Q1 ∪ Q2|
- Doc-Doc similarity: equal importance to all attributes; the supertuple is considered as a single bag of keywords. Simdoc-doc(Q1, Q2) = SimJ(STQ1, STQ2)
- Weighted-attribute similarity: weights assigned to the attributes signify their importance to the user. Simwatr(Q1, Q2) = Σi wi × SimJ(STQ1(Ai), STQ2(Ai))
(A sketch of both measures follows below.)
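A minimal sketch of the machinery above, under the stated definitions: build a supertuple (one bag per attribute) from an answerset, then compute bag-semantics Jaccard either over a single merged bag (doc-doc) or attribute-by-attribute with weights (weighted-attribute). The toy answersets loosely echo the Widom/Ullman example; the function names and weight values are mine.

```python
from collections import Counter

def supertuple(answerset):
    """Summarize an answerset (a list of dicts) as one keyword bag per attribute."""
    st = {}
    for tup in answerset:
        for attr, val in tup.items():
            st.setdefault(attr, Counter()).update(str(val).lower().split())
    return st

def sim_j(b1, b2):
    """Jaccard similarity with bag semantics: |b1 ∩ b2| / |b1 ∪ b2|."""
    union = sum((b1 | b2).values())
    return sum((b1 & b2).values()) / union if union else 0.0

def sim_doc_doc(st1, st2):
    """Doc-doc: all attribute bags merged into one bag of keywords."""
    merge = lambda st: sum(st.values(), Counter())
    return sim_j(merge(st1), merge(st2))

def sim_watr(st1, st2, weights):
    """Weighted-attribute: per-attribute Jaccard, weighted by user importance."""
    return sum(w * sim_j(st1.get(a, Counter()), st2.get(a, Counter()))
               for a, w in weights.items())

widom = supertuple([
    {"Coauthor": "R. Motwani", "Title": "optimizing streams",
     "Conference": "VLDB", "Year": 2002},
    {"Coauthor": "Molina", "Title": "warehouse streams",
     "Conference": "ICDE", "Year": 1998}])
ullman = supertuple([
    {"Coauthor": "R. Motwani", "Title": "optimizing queries",
     "Conference": "PODS", "Year": 1998},
    {"Coauthor": "C. Li", "Title": "mining",
     "Conference": "SIGMOD", "Year": 2000}])

print(sim_doc_doc(widom, ullman))                      # one merged bag
print(sim_watr(widom, ullman, {"Coauthor": 0.4, "Title": 0.4,
                               "Conference": 0.1, "Year": 0.1}))
```

The bag semantics (Counter's & and | give min and max counts) matter here: a co-author who appears six times in an answerset counts more toward the overlap than one who appears once.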
Empirical evaluation
Goal: evaluate the efficiency and effectiveness of our approach.
Setup:
- A database system extending the bibliography mediator BibFinder, projecting the relation Publications(Author, Title, Conference, Journal, Year)
- Query log consisting of 10K precise queries
- User study: 3 graduate students; 90 test queries, 30 chosen by each student
- Platform: Java 2 on a Linux server (Intel Celeron 2.2 GHz, 512 MB)

                          Time      Size
  Supertuple generation   126 sec   21 MB
  Similarity estimation   10 hrs    6 MB

Answering an imprecise query
- Estimating query similarity: for each q ∈ Qlog, compute Sim(q, q') for all q' ∈ Qlog, using Simdoc-doc(q, q') = SimJ(STq, STq') and Simwatr(q, q') = Σi wi × SimJ(STq(Ai), STq'(Ai))
- Extracting similar answers: given a query Q, map Q to a query q ∈ Qlog, identify the k queries most similar to q, and execute those k queries

Some results

  Imprecise query                     Top relevant queries
  1. Title="web-based learning"       Title="e learning"; Title="web technology"; Conference="WISE"
  2. Title="Information Extraction"   Title="Information filtering"; Title="text mining"; Title="relevance feedback"
  3. Author="Abiteboul"               Author="vianu"; Author="Dan Suciu"

Are the results precise? [Figure: relevance estimation error (0 to 1) per query, doc-doc vs. weighted-attribute similarity.] The average error in relevance estimation is around 25%.

User study summary
- Precision for the top-10 related queries is above 75%
- The doc-doc similarity measure dominates weighted-attribute similarity [Figure: relevance of the top-10 answers per user, doc-doc vs. weighted-attribute]
- Lessons: queries with popular keywords are difficult; efficiently and effectively capturing user interest is difficult; a solution requiring less user input is more acceptable

What's next?
Open issues:
- The most similar query may not be present in the workload
- Answers to a similar query will have varying similarity depending on the affected attributes
Solution: given an imprecise query, generate the most similar query; use attribute importance and value-value similarity to order the tuples. Challenges: estimating attribute importance, and estimating value-value similarity.

Learning the semantics of the data
Estimate for value-value similarity, i.e. the similarity between values of a categorical attribute A:
  Sim(v11, v12) = Σi wi × Sim(Co-related_value(Ai, v11), Co-related_value(Ai, v12)), where Ai ∈ Attributes(R) and Ai ≠ A
Use Euclidean distance for numerical attributes. (A sketch follows below.)
Use the model of the database (AFDs, keys, value correlations) to identify an implicit structure for the tuple, and show other tuples that least break that structure. E.g. for CarDB(Make, Model, Year, Price, Mileage, Location, Color):
- Approximate keys: {Model, Mileage, Location} uniquely decides 90% of the cars in the DB; {Model, Mileage, Color} uniquely decides 84%
- Approximate functional dependencies (AFDs): Model -> Make; Year -> Price; Mileage -> Year
These support query relaxation and finding similarities between attribute values.
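To ground the value-value similarity formula, here is a minimal sketch over a toy CarDB. For each value of the categorical attribute (e.g. Make=Honda) it collects the bag of co-related values on every other attribute, then combines per-attribute scores with weights, following the shape of Sim(v11, v12) above. Two simplifying assumptions are mine: the inner Sim over co-related values is taken to be bag Jaccard, and Year is treated categorically (the notes prescribe Euclidean distance for numeric attributes). The rows and weights are invented.

```python
from collections import Counter

def corelated_bags(tuples, attr, value):
    """One bag per other attribute: values co-occurring with attr=value."""
    bags = {}
    for t in tuples:
        if t[attr] != value:
            continue
        for a, v in t.items():
            if a != attr:
                bags.setdefault(a, Counter())[v] += 1
    return bags

def sim_j(b1, b2):
    """Bag-semantics Jaccard (assumed here as the inner Sim)."""
    union = sum((b1 | b2).values())
    return sum((b1 & b2).values()) / union if union else 0.0

def value_sim(tuples, attr, v1, v2, weights):
    """Sim(v1, v2) = Σi wi × SimJ(Co-related(Ai, v1), Co-related(Ai, v2))."""
    b1 = corelated_bags(tuples, attr, v1)
    b2 = corelated_bags(tuples, attr, v2)
    return sum(w * sim_j(b1.get(a, Counter()), b2.get(a, Counter()))
               for a, w in weights.items())

cardb = [  # toy rows from CarDB(Make, Model, Year, ..., Color)
    {"Make": "Honda", "Model": "Accord", "Year": 2001, "Color": "white"},
    {"Make": "Honda", "Model": "Civic", "Year": 2002, "Color": "black"},
    {"Make": "Chevrolet", "Model": "Impala", "Year": 2001, "Color": "white"},
    {"Make": "Toyota", "Model": "Camry", "Year": 1995, "Color": "red"}]

w = {"Model": 0.5, "Year": 0.3, "Color": 0.2}  # invented attribute weights
print(value_sim(cardb, "Make", "Honda", "Chevrolet", w))  # 0.25: shared Year, Color
print(value_sim(cardb, "Make", "Honda", "Toyota", w))     # 0.0: nothing shared
```

With scores like these, an imprecise query with Make=Honda can be relaxed toward Make=Chevrolet before Make=Toyota, which is exactly the kind of tuple ordering the query-relaxation step needs.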