Download Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik Presented by: Kiran Karnam  Introduction & Motivation  Problem Definition  Architecture  Ranking Function  Implementation  Experiments  Conclusions & Limitations    Many-answers problem Two alternative solutions: Query reformulation Automatic ranking Apply probabilistic model in IR to DB tuple ranking  Many answers problem SELECT * FROM REALTOR_DB WHERE CITY=‘SEATTLE’ ;  Query reformulation  Automatic ranking  Specified Attributes city  Unspecified Attributes View School District Boat Dock  Global Score: Global score which captures the global importance of unspecified attribute values. Eg: VIEW=‘WATERFRONT’  Conditional Score: which captures the strengths of dependencies (or correlations) between specified and unspecified attribute values. Eg: If CITY=‘SEATTLE’ and VIEW=‘WATERFRONT’  Important Rules and Theorem required  Bayes’ Rule: p(a/b) = [ p(b/a) p(a) ] / [p(b)]  Product Rule: p(a,b/c) = p(a/c) * p(b/a,c)  Bayes theorem shows the relation between two conditional probabilities which are the reverse of each other  The probability of an event A given an event B depends not only on the relationship between events A and B but on the marginal probability (or "simple probability") of occurrence of each event  Document (Tuple) t, Query Q R: Relevant Documents R = D - R: Irrelevant Documents  Tuple t is considered as a document  Partition t into t(X) and t(Y)  t(X) and t(Y) are written as X and Y  Derive from initial scoring function until final ranking function is obtained  Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed  If Many Queries Specify Set X of Conditions then there is Preference Correlation between Attributes in X.  Global: E.g., If Many Queries ask for Waterfront then p(Waterfront=TRUE) is high.  Conditional: E.g., If Many Queries ask for 4-Bedroom Houses in Good School Districts, then p(Bedrooms=4 | SchoolDistrict=`good’), p(SchoolDistrict=`good’ | Bedrooms=4) are high.  Final Ranking Formula is Where: p(y|W) = Relative frequency of unspecified attribute ‘y’ given workload ‘W’ p(y|D)= Relative frequency of unspecified attribute ‘y’ given data base ‘D’ p(x|y,W)=Frequency of correlation between x and y in W P(x|y,D)=Frequency of correlation between x and y in D   Pre processing ◦ Atomic probability module ◦ Index module  Intermediate Knowledge Reference layer  Query processing ◦ Scan algorithm ◦ List merge algorithm  Computation of modules: p(y | W), p(y | D), p(x | y, W), and p(x | y, D) for all distinct values of x and y.  Storing these atomic probabilities as database tables in intermediate knowledge representation layer with appropriate indexes.  Computation of index module resulting in conditional and global lists table.  CONDITIONAL LISTS Cx: Contains <TID, CondScore> in descending order  GLOBAL LISTS Gx: Contains <TID,GlobScore> in descending order  Select Tuples that Satisfy the Query  Scan and Compute Score for Each Result-Tuple  Return Top-K Tuples Scan algorithm is Inefficient Many tuples in the answer set  Another approach Pre-compute top-K tuples for all possible queries Still infeasible in practice  Trade-off solution Pre-compute ranked lists of tuples for all possible atomic queries At query time, merge ranked lists to get top-K tuples  Databases Used ◦ MSN Home Advisor database (http://houseandhome.msn.com/) ◦ Internet Movie Database  Software and Hardware: • Microsoft SQL Server2000 RDBMS • P4 2.8-GHz PC, 1 GB RAM • C#, Connected to RDBMS through DAO  Quality Experiments  Performance Experiments Query: select * from SeattleHomes where City=‘Seattle’ and Bedroom=1; Conditional ranked condos with garages the highest  Global failed to recognize importance of the unspecified attribute Garage=‘Y’   User preference of rankings   5 new queries Users were given the top-5 results  Compare 2 algorithms ◦ Scan algorithm ◦ List Merge algorithm  Execution time of performance algorithms  Completely Automated Approach for the Many-Answers Problem which Leverages Data and Workload Statistics and Correlations LIMITATION: Existence of correlations between text and non-text data. Future Work  Empty-Answer Problem  Handle Plain Text Attributes     Surajit Chaudhuri, Gautam Das, Vagelis Hristidis, Gerhard Weikum, Probabilistic Ranking of Database Query Results, VLDB 2004. users.cs.fiu.edu/~vagelis/presentations/ProbRanking.ppt http://crystal.uta.edu/~cse6339/Fall08DBIR.htm http://crystal.uta.edu/~cse6339/Fall09DBIR.htm

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Probabilistic Ranking of Database Query Results