* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Towards Keyword-Driven Analytical Processing
Survey
Document related concepts
Transcript
Towards Keyword-Driven Analytical Processing Ping Wu, Yannis Sismanis, Berthold Reinwald Presented By – Amit Goyal ([email protected]) Date: 21st Nov 2007 Outline Background Motivation Solution Framework Differentiate Explore Experiments Conclusions Motivation Problem: Given: multiple data sources Find: patterns (such as?) What were the sales volumes by region and product category for the last year? Which orders should we fill to maximize revenues? Will a 10% discount increase sales volume sufficiently? Motivation Users don’t know how to specify what they want, but they know it when they see it Keyword queries can have different semantics based on the intent of user Background Decision Support Data Warehouse Decision support systems are a class of computer-based information systems including knowledge based systems that support decision making activities Main repository of an organization's historical data for analysis purposes It contains the raw material for management's decision support system OLAP (online analytical processing) Interactive process of creating, managing, analyzing and reporting of data Facts and Dimensions Database is a set of facts (points) in a multidimensional space Fact: Sales, budget, profit, inventory Example Fact Table: Transactions (timekey, storekey, ckey, units, price) Dimension: Example facts: Measures performance of a business Example dimensions: Specifies a fact Product, customer data, sales person, store Example dimension table: Customer (ckey, firstname, lastname, address, dateOfBirth, occupation, …) Example Order Product Order No ProductNO Order Date ProdName Customer Customer No Customer Name Customer Address City Salesperson SalespersonID SalespersonName City Quota Fact Table ProdDescr OrderNO Category SalespersonID CategoryDescription CustomerNO UnitPrice ProdNo Date DateKey DateKey CityName Date Quantity Total Price City CityName State Country Notations Dataspace DS: entire multi-dimensional dataspace in an OLAP database Subspace DS’: subset of dataspace Star Join: set of joins when a fact table is joined to two or more dimension tables. Different interpretations of the keywords reflect different star join expressions Solution Framework Phase Differentiate Generation of candidate subspaces using user keywords Phase Explore The system first calculates for the subspace the aggregated values for some predefined measures Then dynamically finds for each dimension the top-k interesting group-by attributes to partition the sub-dataspace Interestingness Application specific Focus in the paper: Surprising aggregates Correlated aggregates Differentiate Phase First generate candidate subspace Then, organize them effectively by ranking them Then, ask user to select one of candidate subspace Proceed to Explore Phase Ambiguity of Keyword Queries In large and complex OLAP dataspaces, a keyword almost always matches different attribute domains in different dimensions Creates large number of possible query interpretations Example: Consider query “Columbus LCD” Columbus – holiday or city? LCD – projectors or TV or monitor? Ambiguity of Keyword Queries: WHY? Correctness: Different interpretation of keywords may result in completely different subspaces. Thus, correct interpretation may eliminate error propagation to subsequent phases in the system Performance: Computation of all possible interpretations may be expensive Users know exactly the semantic meaning of their keywords. Put them in query processing loop Candidate Interpretation Generation Problem Stmt: Given a keyword query q={k1, k2, ...kn}, generate candidate interpretations CI ={C1,C2, ...,Cm} of q. For each keyword ki, the CI generator first probes the full-text index to obtain the Hit Set Hi. Hi = {hi1, hi2, .. , him} Each hit hij represents a triplet of relation name, attribute name and the attribute instance value {hij.R, hij.Attr, hij.Val}. Within a hit set, hits can be organized in Hit Groups if they have same relation name and attribute name. Hit Group : Example Consider a query “Columbus LCD” The hit set for keyword “Columbus” has 3 hits: Loc/City/Columbus, Holiday/Event/(“Columbus Day”), Holiday/Event/(“Columbus Week”). Thus, hit group for the keyword “Columbus” is {Holiday/Event/(“Columbus Day” OR “Columbus Week”)}. Star Seed and Star Net Star Seed (SS): For query q, SS is defined as a set of n hit groups, each of which is drawn from a different hit set E.g. For query “Columbus LCD”, one candidate SS could be {{Holiday/Event/(“Columbus Day” OR “Columbus Week”)}, PGROUP/Group Name/”LCD Projectors”} Star Net (SN): For SS, SN is defined as a join that connects all the hit groups from the SS Note: A single SS could correspond to multiple SN Candidate Star Net Generation Ranking the Candidate Star Nets Number of all interpretations may be large. Thus, ranking is necessary SCORE(SN,q) = Sim :- string matching similarity function between the query q and attribute value of hit hij |HGk| is the number of hits in the hit group ∑ over hits in a Hit Group divided by |HGk| is the average hit similarity value Avg hit similarity value is further normalized to penalize hit groups with many matched attribute instances. “California” – state or a large of distinct address on “California Street” Finally, summation of hit group scores are divided by |SN|2 to prioritize smaller Star Nets. Score(Star Nets with “San Jose{city}”) > Score(Star Nets with “San Antonio{city}” and “Jose{Customer First Name}”) Handling of Phrase Queries Consider Query: “San Jose” Output hit set: “San Jose”, “San Antonio” etc Score function does not take into consideration the fact that “San Jose” perfectly matches two keywords, thus should be ranked higher. Solution: Merge the two hit groups from two hit sets, if Both groups are from same attribute domain The intersection between two groups is not empty Can be generalized to phrases containing more than 2 keywords Explore Phase Till now, a unique sub-dataspace DS’ has been identified by the user And the system computes the group-by aggregates over the measure from all qualified fact points in DS’ Rank the group-by attributes Categorical Attributes Numerical Attributes Organizing Attribute Instances Categorical Attributes Numerical Attributes Automatic Facet Construction For SubDataspaces After a unique sub-dataspace DS’ has been identified by the user, the system dynamically constructs a multi-faceted search (MFS) interface for the user to explore detailed level aggregation in DS’. In real world databases, the number of dimensions may be large. And each dimension may have many attributes. So, need to rank group-by attributes dynamically based on interestingness of the resulting partitions. Ranking Group-by Attributes Rank the group-by attributes based on interestingness of their resulting partitions Roll-up Partitioning (RUP) Roll-up Partitioning By looking at the sub-dataspace alone, it is impossible to define interestingness of a certain partition in a robust way Dimensions are hierarchical, lets use it !! Roll-up along some dimensions, compare the two partitions. The more similar the two partitions are, the less the candidate group-by attribute is considered as surprising Example Determine whether the attribute zipcode of dimension store is an interesting group-by attribute for the subspace associated with Product Television Roll-up to bigger space along the Product dimension to Home Entertainment Electronics If distribution deviates -> surprising If correlated -> bellwethers Roll-up Partitioning SCORE(attrij, DS’) = - E((X-µx)(Y-µy))/(σx σy) Where, X = aggregation values on partition PAR(DS’, attrij) Y = aggregation values on partition PAR(RUP(DS’), attrij) E = expected value µ = mean σ = standard deviation Ranking both Categorical and Numerical Attributes Categorical: Easy. Previous score function can be directly applied. Numerical: Correlation depends on how the numerical domain is bucketized First split the domains of the candidate attributes into “sufficiently” many buckets or basic intervals Tuples in same bucket are aggregated together to produce new attribute values Intuition behind splitting is that the correlation value of two distributions can be preserved as the bucket number becomes large and the interval range becomes small Organizing Attribute Instances Till Now, we have ranked attributes How to organize the values within each attribute domain? Categorical Attribute: SCORE(attrij.catp, DS’) = G is the aggregate function Organizing Numerical Attribute Instances Given ‘m’ basic intervals, merge adjacent intervals into ‘k’ numeric categories 3 objectives: Number of resulting intervals should not be large (suitable for navigation) Number of merged intervals should not be skewed. Number of intervals in largest range should not exceed L times the number of intervals in smallest range The merged partition should preserve the original interestingness value, i.e. correlation value from basic intervals Organizing Numerical Attribute Instances Critique: • What is neighbor? • How to generate it? • SCORE function is not defined Experiments AdventureWorks data warehouse Divided in two separate databases: AW_ONLINE – 5 dimensions, 3 hierarchical, 10 tables AW_RESELLER – 7 dimensions, 4 hierarchical, 13 tables Qualitative Sample Results Keyword Query: “California Mountain Bike” Phase1: System returns a list of star nets Analyst selects the first Star Net Evaluation of Subspace Ranking Algorithm Manually written 50 keywords queries X-axis: Rank of the results Y-axis: %age of the queries satisfied 4 ranking methods Relevance is checked manually Note that group size normalization is not significant Effects of Bucket Number in Group-By Attribute Ranking Test the assumption that with a “sufficiently” large number of basic intervals, the actual correlation value can be captured AW_Online database Numerical Attributes: Yearly-Income from Customer table Dealer-Price from Product table Roll-up operations: StateProvinces to Countries Subcategory to Product Category Effects of Bucket Number in Group-By Attribute Ranking AW_Reseller Database Numerical Attributes: Roll-up operations: AnnualSales, AnnualRevenue from Reseller table NumberOfEmployees Product Subcategories to Categories Error percentage is computed by the deviation from the ideal case (where each distinct value has its own bucket) Study of Numerical Partitioning Methods Contribution/Conclusion Integrate keyword search with the efficient aggregation power of OLAP Provides an efficient and easy-to-use solution for business analyst Ambiguity problem has not been addressed by previous research Current research on keyword search over RDBMS uses indexes on a tuple level instead of an attribute level Critique Poorly written paper Typo mistakes. In eqn1, it is not clear what is |HGk|, |SN| In eqn2, it is mentioned G is an aggregate function, but didn’t specify it In Algorithm 2, the “SCORE” function used is not defined In Algorithm 2, the notion of “neighbor” is not defined First, they say score function (section 4.3) does not take into account that “San Jose” matches two keywords and therefore should be assigned much higher score than “San Antonio”; then in the next section, they claim that normalization factor |SN|2 takes this problem into consideration. In section 6.3, para 1, it should be “Section 4.3” instead of “Section 3.4” In section 6.5 para 1, it should be “Figure 7(a)” instead of “Figure 8(a)” Questions??