Towards Keyword-Driven Analytical Processing
Ping Wu, Yannis Sismanis, Berthold Reinwald
Presented By – Amit Goyal
Date: 21st Nov 2007
Outline
• Background
• Motivation
• Solution Framework
  • Differentiate
  • Explore
• Experiments
• Conclusions
Motivation
• Problem:
  • Given: multiple data sources
  • Find: patterns, such as:
    • What were the sales volumes by region and product category for the last year?
    • Which orders should we fill to maximize revenues?
    • Will a 10% discount increase sales volume sufficiently?
Motivation
• Users don’t know how to specify what they want, but they know it when they see it
• Keyword queries can have different semantics depending on the intent of the user
Background
• Decision Support
  • Decision support systems are a class of computer-based information systems, including knowledge-based systems, that support decision-making activities
• Data Warehouse
  • Main repository of an organization's historical data for analysis purposes
  • It contains the raw material for management's decision support system
• OLAP (online analytical processing)
  • Interactive process of creating, managing, analyzing and reporting data
Facts and Dimensions
A database is a set of facts (points) in a multidimensional space
• Fact:
  • Measures performance of a business
  • Example facts: Sales, budget, profit, inventory
  • Example fact table: Transactions (timekey, storekey, ckey, units, price)
• Dimension:
  • Specifies a fact
  • Example dimensions: Product, customer data, sales person, store
  • Example dimension table: Customer (ckey, firstname, lastname, address, dateOfBirth, occupation, …)
Example
• Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
• Order: Order No, Order Date
• Product: ProductNO, ProdName, ProdDescr, Category, CategoryDescription, UnitPrice
• Customer: Customer No, Customer Name, Customer Address, City
• Salesperson: SalespersonID, SalespersonName, City, Quota
• Date: DateKey, Date
• City: CityName, State, Country
Notations
• Dataspace DS: the entire multidimensional dataspace in an OLAP database
• Subspace DS’: a subset of the dataspace
• Star Join: the set of joins formed when a fact table is joined to two or more dimension tables. Different interpretations of the keywords correspond to different star join expressions
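As a rough illustration of a star join, the sketch below joins a toy fact table to two dimension tables with Python's built-in `sqlite3` module. The table names, column names and rows are made up for this example (loosely modeled on the schema slide); they are not taken from the paper.

```python
import sqlite3

# Hypothetical miniature schema: one fact table and two dimension tables.
# All rows are invented sample data for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (prod_no INTEGER PRIMARY KEY, prod_name TEXT, category TEXT);
    CREATE TABLE city (city_name TEXT PRIMARY KEY, state TEXT);
    CREATE TABLE fact (order_no INTEGER, prod_no INTEGER, city_name TEXT, quantity INTEGER);
    INSERT INTO product VALUES (1, 'LCD Projector X', 'Projectors'), (2, 'LCD TV Y', 'TV');
    INSERT INTO city VALUES ('Columbus', 'Ohio'), ('San Jose', 'California');
    INSERT INTO fact VALUES (100, 1, 'Columbus', 3), (101, 2, 'Columbus', 5), (102, 1, 'San Jose', 2);
""")

def star_join_total(category, city_name):
    """Sum quantity over the fact rows selected by one star join expression."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(f.quantity), 0)
        FROM fact f
        JOIN product p ON p.prod_no = f.prod_no
        JOIN city c ON c.city_name = f.city_name
        WHERE p.category = ? AND c.city_name = ?
        """,
        (category, city_name),
    ).fetchone()
    return row[0]

total = star_join_total("Projectors", "Columbus")
```

Each different pair of dimension predicates selects a different subspace DS’, which is why one keyword query can map to several star join expressions.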
Solution Framework
• Phase Differentiate
  • Generation of candidate subspaces using the user's keywords
• Phase Explore
  • The system first calculates, for the subspace, the aggregated values of some predefined measures
  • It then dynamically finds, for each dimension, the top-k interesting group-by attributes with which to partition the sub-dataspace
Interestingness
• Application specific
• Focus in the paper:
  • Surprising aggregates
  • Correlated aggregates
Differentiate Phase
• First, generate candidate subspaces
• Then, organize them effectively by ranking them
• Then, ask the user to select one of the candidate subspaces
• Proceed to the Explore Phase
Ambiguity of Keyword Queries
• In large and complex OLAP dataspaces, a keyword almost always matches different attribute domains in different dimensions
  • This creates a large number of possible query interpretations
• Example: Consider the query “Columbus LCD”
  • Columbus – holiday or city?
  • LCD – projectors, TVs or monitors?
Ambiguity of Keyword Queries: WHY?
• Correctness: Different interpretations of the keywords may result in completely different subspaces. Thus, fixing the correct interpretation early keeps errors from propagating to subsequent phases of the system
• Performance: Computing all possible interpretations may be expensive
• Users know the exact semantic meaning of their keywords, so put them in the query processing loop
Candidate Interpretation Generation
• Problem statement: Given a keyword query q = {k1, k2, ..., kn}, generate candidate interpretations CI = {C1, C2, ..., Cm} of q.
• For each keyword ki, the CI generator first probes the full-text index to obtain the hit set Hi = {hi1, hi2, ..., him}.
• Each hit hij is a triplet of relation name, attribute name and attribute instance value: {hij.R, hij.Attr, hij.Val}.
• Within a hit set, hits can be organized into hit groups if they have the same relation name and attribute name.
Hit Group: Example
• Consider the query “Columbus LCD”
• The hit set for the keyword “Columbus” has 3 hits: Loc/City/Columbus, Holiday/Event/(“Columbus Day”), Holiday/Event/(“Columbus Week”).
• The two Holiday/Event hits share a relation name and attribute name, so they form one hit group for “Columbus”: {Holiday/Event/(“Columbus Day” OR “Columbus Week”)}. The Loc/City hit forms a group of its own.
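The grouping rule above can be sketched in a few lines of Python. The `Hit` type and the `group_hits` helper are hypothetical names chosen for this illustration; only the three “Columbus” hits come from the slide.

```python
from collections import defaultdict
from typing import NamedTuple

class Hit(NamedTuple):
    """A hit as a (relation, attribute, value) triplet; field names are illustrative."""
    relation: str
    attr: str
    val: str

def group_hits(hit_set):
    """Group hits that share the same relation name and attribute name."""
    groups = defaultdict(list)
    for hit in hit_set:
        groups[(hit.relation, hit.attr)].append(hit.val)
    return dict(groups)

# The hit set for the keyword "Columbus" from the slide.
columbus_hits = [
    Hit("Loc", "City", "Columbus"),
    Hit("Holiday", "Event", "Columbus Day"),
    Hit("Holiday", "Event", "Columbus Week"),
]
hit_groups = group_hits(columbus_hits)
```

Here `hit_groups` has two entries: one for Loc/City and one for Holiday/Event holding both holiday values.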
Star Seed and Star Net
• Star Seed (SS): For a query q with n keywords, an SS is a set of n hit groups, each drawn from a different hit set
  • E.g., for the query “Columbus LCD”, one candidate SS is {Holiday/Event/(“Columbus Day” OR “Columbus Week”), PGROUP/Group Name/“LCD Projectors”}
• Star Net (SN): For an SS, an SN is a join that connects all the hit groups of the SS
  • Note: A single SS can correspond to multiple SNs
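Since a star seed picks one hit group from each keyword's hit set, enumerating the candidate star seeds amounts to a Cartesian product. The sketch below shows this with `itertools.product`; the function name is hypothetical, and the second “LCD” hit group is invented for the example.

```python
from itertools import product

def star_seeds(hit_groups_per_keyword):
    """Return every star seed: one hit group chosen from each keyword's hit set."""
    return [tuple(choice) for choice in product(*hit_groups_per_keyword)]

# Hit groups are identified here by (relation, attribute) only.
columbus_groups = [("Loc", "City"), ("Holiday", "Event")]
# ("Product", "ProdName") is a made-up second group for the keyword "LCD".
lcd_groups = [("PGROUP", "Group Name"), ("Product", "ProdName")]

seeds = star_seeds([columbus_groups, lcd_groups])
```

The number of seeds is the product of the hit-group counts per keyword, which is one reason the candidates need ranking.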
Candidate Star Net Generation
Ranking the Candidate Star Nets
• The number of interpretations may be large, so ranking is necessary
SCORE(SN,q) =
• Sim: string-matching similarity function between the query q and the attribute value of hit hij
• |HGk| is the number of hits in the hit group HGk
• The sum of Sim over the hits in a hit group, divided by |HGk|, is the average hit similarity value
• The average hit similarity value is further normalized to penalize hit groups with many matched attribute instances
  • E.g., “California” – a state, or a large number of distinct addresses on “California Street”
• Finally, the sum of the hit group scores is divided by |SN|^2 to prioritize smaller Star Nets
  • Score(Star Nets with “San Jose{city}”) > Score(Star Nets with “San Antonio{city}” and “Jose{Customer First Name}”)
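The ingredients listed above can be combined into a toy scoring function. This is only a sketch under stated assumptions, not the paper's formula: `Sim` is replaced by `difflib.SequenceMatcher`, the group-size penalty is assumed to be a division by log2(1 + |HGk|), and |SN| is passed in as a plain integer because the slides do not say how it is measured.

```python
import math
from difflib import SequenceMatcher

def sim(query, value):
    """Stand-in string similarity in [0, 1]; the slides do not specify Sim."""
    return SequenceMatcher(None, query.lower(), value.lower()).ratio()

def hit_group_score(query, values):
    """Average hit similarity, with an ASSUMED log penalty for large hit groups."""
    avg = sum(sim(query, v) for v in values) / len(values)
    return avg / math.log2(1 + len(values))

def star_net_score(query, hit_groups, star_net_size):
    """Sum of hit group scores divided by |SN|^2 (|SN| supplied by the caller)."""
    total = sum(hit_group_score(query, values) for values in hit_groups)
    return total / star_net_size ** 2

state_group = ["California"]
street_group = ["California Street 1", "California Street 2"]  # invented addresses
small_net = star_net_score("california", [state_group], 1)
large_net = star_net_score("california", [state_group], 2)
```

With these assumptions, an exact single-value match scores highest, a hit group with many partial matches is penalized, and the same seed placed in a larger Star Net scores lower.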
Handling of Phrase Queries
• Consider the query “San Jose”
• Output hit sets include “San Jose”, “San Antonio”, etc.
• The score function does not take into account that “San Jose” matches both keywords perfectly and should therefore be ranked higher
• Solution: merge the two hit groups from the two hit sets if
  • both groups are from the same attribute domain
  • the intersection of the two groups is not empty
• This can be generalized to phrases containing more than 2 keywords
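The merge condition can be sketched as follows. The representation of a hit group as a `(domain, set_of_values)` pair and the function name are hypothetical, and the sketch assumes the merged group keeps only the values common to both groups, which the slide does not state explicitly.

```python
def merge_phrase_groups(group_a, group_b):
    """Merge two hit groups if they share an attribute domain and overlap.

    Each group is ((relation, attribute), set_of_values). Returns the merged
    group, or None when the two conditions from the slide are not met.
    ASSUMPTION: the merged group is the intersection of the two value sets.
    """
    domain_a, values_a = group_a
    domain_b, values_b = group_b
    if domain_a != domain_b:
        return None
    common = values_a & values_b
    if not common:
        return None
    return (domain_a, common)

# Hit groups for the keywords "San" and "Jose" (values are illustrative).
san_city = (("Loc", "City"), {"San Jose", "San Antonio"})
jose_city = (("Loc", "City"), {"San Jose"})
jose_name = (("Customer", "FirstName"), {"Jose"})

merged = merge_phrase_groups(san_city, jose_city)
not_merged = merge_phrase_groups(san_city, jose_name)
```

For longer phrases, the same function could be applied repeatedly, folding each further keyword's hit group into the merged result.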
Explore Phase
• By now, the user has identified a unique sub-dataspace DS’
• The system computes the group-by aggregates over the measure from all qualifying fact points in DS’
• Rank the group-by attributes
  • Categorical Attributes
  • Numerical Attributes
• Organizing Attribute Instances
  • Categorical Attributes
  • Numerical Attributes
Automatic Facet Construction For Sub-Dataspaces
• After the user has identified a unique sub-dataspace DS’, the system dynamically constructs a multi-faceted search (MFS) interface for exploring detailed aggregations in DS’
• In real-world databases, the number of dimensions may be large, and each dimension may have many attributes
  • So group-by attributes need to be ranked dynamically, based on the interestingness of the resulting partitions
Ranking Group-by Attributes
• Rank the group-by attributes based on the interestingness of their resulting partitions
• Roll-up Partitioning (RUP)
Roll-up Partitioning
• By looking at the sub-dataspace alone, it is impossible to define the interestingness of a partition in a robust way
• Dimensions are hierarchical, so let’s use that!
• Roll up along some dimensions, then compare the two partitions
• The more similar the two partitions are, the less surprising the candidate group-by attribute is considered
Example
• Determine whether the attribute zipcode of the dimension Store is an interesting group-by attribute for the subspace associated with the product Television
• Roll up to a bigger space along the Product dimension, to Home Entertainment Electronics
  • If the distribution deviates -> surprising
  • If it is correlated -> bellwethers
Roll-up Partitioning
SCORE(attrij, DS’) = - E((X - µx)(Y - µy)) / (σx σy)
where
• X = aggregation values on the partition PAR(DS’, attrij)
• Y = aggregation values on the partition PAR(RUP(DS’), attrij)
• E = expected value
• µx, µy = means of X and Y
• σx, σy = standard deviations of X and Y
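The score above is the negated Pearson correlation between the two aggregate vectors. A minimal sketch, assuming X and Y are given as equal-length lists aligned on the values of the group-by attribute (the function name and the sample numbers are made up):

```python
import math

def rup_score(x, y):
    """Negated Pearson correlation of two aligned aggregate vectors.

    Uses population statistics (division by n). Undefined (division by zero)
    if either vector is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return -cov / (std_x * std_y)

# Invented aggregates per group-by value, for the subspace and its roll-up.
same_shape = rup_score([1, 2, 3], [2, 4, 6])   # distributions move together
opposite = rup_score([1, 2, 3], [6, 4, 2])     # distributions move in opposite directions
```

A partition that mirrors its roll-up scores near -1 (unsurprising), while one that deviates from it scores higher.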
Ranking both Categorical and Numerical Attributes
• Categorical: Easy. The previous score function can be applied directly.
• Numerical: The correlation depends on how the numerical domain is bucketized
  • First split the domains of the candidate attributes into “sufficiently” many buckets, or basic intervals
  • Tuples in the same bucket are aggregated together to produce new attribute values
  • The intuition behind the splitting is that the correlation value of two distributions is preserved as the number of buckets becomes large and the interval range becomes small
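The bucketization step might look like the sketch below. It assumes equal-width buckets and SUM as the aggregate; the slides specify neither, and the function name and sample rows are hypothetical.

```python
def bucketize(rows, num_buckets):
    """Aggregate (numeric_value, measure) rows into equal-width buckets.

    ASSUMPTIONS: buckets are equal-width over [min, max] and the aggregate
    is SUM. Returns a list with the summed measure per bucket.
    """
    values = [v for v, _ in rows]
    lo, hi = min(values), max(values)
    # Fall back to width 1.0 when all values are identical.
    width = (hi - lo) / num_buckets or 1.0
    totals = [0.0] * num_buckets
    for v, measure in rows:
        # The maximum value falls on the upper edge; clamp it into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        totals[idx] += measure
    return totals

# Invented (attribute value, measure) pairs.
sample_rows = [(0, 1), (5, 2), (10, 3), (10, 4)]
per_bucket = bucketize(sample_rows, 2)
```

The resulting per-bucket totals play the role of the aggregation values when a numerical attribute is scored like a categorical one.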
Organizing Attribute Instances
• So far, we have ranked the attributes
• How should the values within each attribute domain be organized?
• Categorical Attribute:
SCORE(attrij.catp, DS’) =
  • G is the aggregate function
Organizing Numerical Attribute Instances
• Given ‘m’ basic intervals, merge adjacent intervals into ‘k’ numeric categories
• 3 objectives:
  • The number of resulting intervals should not be large (suitable for navigation)
  • The number of merged intervals should not be skewed: the number of intervals in the largest range should not exceed L times the number of intervals in the smallest range
  • The merged partition should preserve the original interestingness value, i.e. the correlation value from the basic intervals
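The first two objectives are simple constraints that can be checked directly. The sketch below is only a feasibility check for a proposed merge, not the paper's merging algorithm; the function and parameter names are hypothetical, and the third objective (preserving the correlation value) is not covered.

```python
def is_valid_merge(sizes, max_categories, skew_limit):
    """Check objectives 1 and 2 for a candidate merge.

    sizes[i] is the number of basic intervals merged into category i.
    max_categories caps the number of categories (objective 1);
    skew_limit is the factor L from the slide (objective 2).
    """
    if not sizes or len(sizes) > max_categories:
        return False
    return max(sizes) <= skew_limit * min(sizes)

balanced = is_valid_merge([2, 3, 4], max_categories=5, skew_limit=2)
skewed = is_valid_merge([1, 5], max_categories=5, skew_limit=2)
too_many = is_valid_merge([1, 1, 1, 1], max_categories=3, skew_limit=2)
```

A search over candidate merges could use a check like this to discard infeasible partitions before comparing how well the remaining ones preserve the correlation value.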
Organizing Numerical Attribute Instances
Critique:
• What is a neighbor?
• How is it generated?
• The SCORE function is not defined
Experiments
• AdventureWorks data warehouse
• Divided into two separate databases:
  • AW_ONLINE – 5 dimensions, 3 hierarchical, 10 tables
  • AW_RESELLER – 7 dimensions, 4 hierarchical, 13 tables
Qualitative Sample Results
• Keyword Query: “California Mountain Bike”
• Phase 1: The system returns a list of star nets
• The analyst selects the first Star Net
Evaluation of Subspace Ranking Algorithm
• 50 manually written keyword queries
• X-axis: rank of the results
• Y-axis: percentage of the queries satisfied
• 4 ranking methods
• Relevance is checked manually
• Note that group size normalization does not have a significant effect
Effects of Bucket Number in Group-By Attribute Ranking
Test the assumption that with a “sufficiently” large number of basic intervals, the actual correlation value can be captured
• AW_Online database
• Numerical Attributes:
  • Yearly-Income from the Customer table
  • Dealer-Price from the Product table
• Roll-up operations:
  • StateProvinces to Countries
  • Subcategory to Product Category
Effects of Bucket Number in Group-By Attribute Ranking
• AW_Reseller database
• Numerical Attributes:
  • AnnualSales, AnnualRevenue from the Reseller table
  • NumberOfEmployees
• Roll-up operations:
  • Product Subcategories to Categories
• Error percentage is computed as the deviation from the ideal case (where each distinct value has its own bucket)
Study of Numerical Partitioning Methods
Contribution/Conclusion
• Integrates keyword search with the efficient aggregation power of OLAP
• Provides an efficient and easy-to-use solution for business analysts
• The ambiguity problem has not been addressed by previous research
• Current research on keyword search over RDBMSs uses indexes at the tuple level instead of the attribute level
Critique
• Poorly written paper
  • Typos:
    • In section 6.3, para 1, it should be “Section 4.3” instead of “Section 3.4”
    • In section 6.5, para 1, it should be “Figure 7(a)” instead of “Figure 8(a)”
  • In eqn 1, it is not clear what |HGk| and |SN| are
  • In eqn 2, G is said to be an aggregate function, but it is not specified
  • In Algorithm 2, the “SCORE” function used is not defined
  • In Algorithm 2, the notion of “neighbor” is not defined
  • First, they say the score function (section 4.3) does not take into account that “San Jose” matches two keywords and should therefore be assigned a much higher score than “San Antonio”; then, in the next section, they claim that the normalization factor |SN|^2 takes this problem into consideration.
Questions??