Data Mining in Data Warehouses
Elena Baralis*    Rosa Meo*    Giuseppe Psaila**
* Politecnico di Torino, Dipartimento di Automatica e Informatica, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy
** Dipartimento di Elettronica e Informazione, Politecnico di Milano, P.za Leonardo da Vinci 32, 20133 Milano, Italy
baralis/[email protected], [email protected]

This work has been supported by the Interdata MURST grant.
Abstract
Data warehouses provide an integrated environment where huge amounts of data extracted from operational sources are available for various kinds of decision support analysis. Hence, in order to allow the user to improve the quality of the performed analysis, it is becoming of fundamental importance to effectively integrate mining capabilities and data warehousing technology.
This paper describes AMORE-DW, an integrated environment for the specification of data mining requests and the extraction of association rules from a data warehouse. The adopted architecture is characterized by a tight coupling of data mining with the relational OLAP (ROLAP) server on the data warehouse, which provides efficient access to the data to be analyzed. The main issues faced during the design are presented, and the trade-off between flexible data analysis and system performance is discussed.
1 Introduction
The availability of an efficient and reliable database technology allows the massive and systematic gathering of huge amounts of operational data concerning every kind of human activity, such as business transactions or scientific research. Before being analyzed, the raw data, possibly extracted from heterogeneous sources, needs to be properly integrated and carefully cleaned, in order to allow the extraction of reliable information. Furthermore, data analysis algorithms perform complex (and expensive) operations on data, which are best performed in a separate environment to avoid hampering daily operational data processing. The data warehousing technology [6], which experienced an explosive growth in the past few years, is able to provide an integrated environment where data extracted from operational sources are available for different types of decision support analysis. Hence, data warehouses are expected to naturally become a major platform for data mining [7].
To extract useful information that may be exploited, e.g., in business decision making, the data stored in the data warehouse must be analyzed with appropriate techniques. While OLAP (On-Line Analytical Processing) analysis is devoted to the computation of complex aggregations on the data, data mining is focused on the extraction of the most frequent regularities that characterize the data. Such regularities are described by means of specific models, which give a more abstract description of the data. An important class of data mining problems is represented by means of association rules. Association rules describe the most common links among data items in a large amount of collected data. The "classical" example application of association rule discovery [1] is the analysis of data recording customer purchases at supermarket checkouts. In this context, association rules describe which products are likely to be bought together by a statistically relevant number of customers. The discovered information can then be used by the store management to support strategic decisions, e.g., for planning and marketing.
In general, an association rule is characterized by the structure

X ⇒ Y
where X, the rule body, and Y, the rule head, are two sets of values drawn from the mined attribute, i.e., the attribute whose behavior is observed (e.g., bought items in the above example). To perform rule extraction, data are grouped by some attribute (e.g., customer transactions); rules describe regularities of the mined attribute with respect to the groups. The relevance of a rule is expressed in terms of the frequency with which the rule is observed in the analyzed data. Thus, two probability measures, called support and confidence, are associated to a rule. The support is the joint probability of finding both X and Y in the same group. The confidence is the conditional probability of finding Y in a group, having found X. Two minimum thresholds for support and confidence are defined with the purpose of discarding the less frequent associations in the database.
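In symbols, writing g for a generic group (our notation; the paper states these definitions only in prose), the two measures read:

support(X ⇒ Y) = Pr[ X ∪ Y ⊆ g ]
confidence(X ⇒ Y) = Pr[ Y ⊆ g | X ⊆ g ] = support(X ⇒ Y) / Pr[ X ⊆ g ]

where Pr is taken over the groups of the analyzed data.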
The integration of data mining techniques with data warehousing technology will enhance both the data analysis capabilities provided by current data warehouse products and the expressive power and flexibility of the analysis performed by current data mining tools. In fact, the current commercial ROLAP (Relational OLAP) servers provide both the powerful data retrieval services of relational DBMS servers and ad-hoc OLAP optimization techniques. This in turn allows the data analysts to specify complex (more refined than current data mining tools would allow) search criteria in order to extract more useful knowledge from the raw data stored in the warehouse.
These considerations inspired the AMORE-DW (Advanced Mining On Relational Environments - Data Warehousing) project. The main goal is the development of a mining tool tightly integrated with the data warehouse and its ROLAP server, such that the source data are constituted by the data collected in the data warehouse, and the extracted rules are represented as database relations. In this context, the description of mining requests is performed by means of an SQL-like language that allows a flexible specification of mining statements and extends the semantics of other languages [8]; the language also provides specific constructs to deal with the data schema typical of data warehouses (known as star schema), in order to simplify the specification of the mining request and to allow the data mining tool to perform appropriate optimizations to reduce the cost of the analysis.
In this paper we illustrate the main features of the AMORE-DW environment and we discuss the design issues and the problems encountered in integrating data mining and data warehousing technology. In particular, in Section 2 we describe a declarative language for the specification of mining requests, based on the SQL-like operator MINE RULE, while in Section 3 we present the architecture of the AMORE-DW prototype. Section 4 discusses the trade-off between flexible data analysis and system performance when defining which tasks of the rule extraction process are to be performed by the OLAP server or by special-purpose extraction algorithms. Section 5 draws conclusions.
2 Specification Language for Association Rule Extraction
Several algorithms have been proposed to extract specific types of association rules (e.g., "classical" association rules [1], generalized association rules [12], ...), which operate on fixed-format data and are tailored to the specific type of association rules to be searched. We propose a general-purpose, SQL-like language to declaratively specify the features of the association rules to be extracted from the data warehouse.
Since this language is not bound to any specific data schema, it allows the user to freely search through the whole schema of the data warehouse. Furthermore, it allows the specification of complex extraction criteria, which are not available in traditional association rule extraction prototypes. Thus, the user is able to restrict the search space by progressively refining the characteristics of the association rules to be extracted.
An initial description of MINE RULE, the main operator of the language, can be found in [9]; it is extended here to cope with the specific features of data warehouses. The operator is introduced by means of an example, which specifies several complex extraction criteria. The example is based on the data warehouse schema, describing sales in a supermarket, which is represented in Figure 1.
[Figure 1: Star schema of the supermarket data warehouse; bold attribute names represent the primary key of each table. The tables are:
Customers(cust-id, name, address, birth-year, city, city-id, region, region-id, ...)
Sales(cust-id, product-id, day-id, price, qty)
Products(product-id, product, subcategory-id, subcategory, category-id, category, ...)
Time(day-id, day, month, year, ...)]
(a)
cust-id  product-id  day-id  price  qty
c-1      p-1         d-1     140    1
c-1      p-2         d-1     180    1
c-2      p-3         d-2      25    2
c-2      p-4         d-2     150    1
c-2      p-5         d-2     300    1
c-1      p-4         d-2     300    1
c-2      p-3         d-3      25    3
c-2      p-5         d-3     300    2

(b)
cust-id  day-id  product-id  price  qty
c-1      d-1     p-1         140    1
                 p-2         180    1
         d-2     p-4         300    1
c-2      d-2     p-3          25    2
                 p-4         150    1
                 p-5         300    1
         d-3     p-3          25    3
                 p-5         300    2

Figure 2: The Sales fact table: (a) simplified instance, (b) grouped by cust-id and clustered by day-id.
Since data warehouse schemas usually have a star topology, they are called star schemas. In this example, the star center, called fact table, describes sales figures for a supermarket. In particular, each purchase is performed by a specific customer (supposing the store uses customer cards to identify its customers) to buy a specific product (identified, e.g., by means of the bar code) on a specific date. The above information constitutes the main dimensions along which each sale event is characterized. Further information on each dimension can be obtained from the appropriate dimension table, describing in more detail the features of customers (e.g., demographic information, complete address, ...), products (e.g., merchandise hierarchy, given by product sub-category and category, and other attributes, such as package type, and many more), and time (e.g., time hierarchy, given by day, week or month, year, and other attributes useful for tracking sales, such as holiday and special event indications, ...). Each sale fact in the Sales table is further characterized by two measures: the price and the quantity of the sold product.
We finally observe that non-numerical attributes (e.g., category in table Products) are also included in the dimension tables in encoded format¹. This format is usually used in data processing instead of the original one for efficiency reasons.

¹ The attributes are encoded during the loading process and the periodical refresh of the data warehouse. Note that dimensions are seldom updated during the periodical refresh of the data warehouse.

For the sake of simplicity, to illustrate the expressive power of the MINE RULE operator, we consider the reduced instance of the Sales table presented in Figure 2(a). Suppose we want to extract association rules with the following features:

a) Rules describe the behavior of customers in terms of the sets of products most frequently purchased by them.

b) Only customers born after 1970 who purchased at least two products are considered.
c) Products appearing in the body must be purchased on the same date, after October. Products in the head are purchased on the same date, but after the products in the corresponding body.

d) Products in the body have a price less than or equal to 200$, whereas products in the head cost less than products in the body.

e) Rules are interesting only if their support is at least 20% and their confidence is at least 30%².

² These figures are meaningful only for the example data warehouse; in real applications they would be significantly lower.
The following statement allows the extraction of association rules corresponding to the above specification.
MINE RULE YoungCustomers AS
SELECT DISTINCT 1..n product-id->product AS BODY, 1..n product-id->product AS HEAD,
SUPPORT, CONFIDENCE
WHERE BODY.price <= 200 AND HEAD.price < BODY.price
FROM Sales
GROUP BY cust-id
HAVING COUNT(*) >= 2 AND cust-id->birth-year > 1970
CLUSTER BY day-id
HAVING BODY.day-id->day < HEAD.day-id->day AND BODY.day-id->month > 10
EXTRACTING RULES WITH SUPPORT: 0.2, CONFIDENCE: 0.3
The association rules are extracted by performing the following steps:

Data Source. The FROM clause specifies the source data to analyze. Only the fact table needs to be specified. To reference attributes in the dimension tables, the referencing "->" operator is used to follow the foreign key constraint from the fact table to the dimension table. For example, the attribute birth-year of the Customers dimension is reached through the attribute cust-id. Observe that the analysis always involves the data contained in the fact table; the dimensions may be referenced as well, but never in the absence of the fact table.
Group computation. The GROUP BY clause specifies that the source relation Sales is logically partitioned into groups of tuples having the same value for the grouping attribute cust-id (corresponding to feature (a) above).

Group filtering. The (optional) HAVING clause associated to the GROUP BY clause discards, before rule extraction, all groups with less than two tuples or whose customer is born before 1970 (corresponding to feature (b) above).
Cluster identification. The (optional) CLUSTER BY clause further partitions each group into sub-groups called clusters, such that tuples in a cluster have the same value for the clustering attribute day-id. The result of both grouping and clustering of the data instance in Figure 2(a) is represented in Figure 2(b). When clustering is specified, the body of a rule is extracted from (smaller) clusters instead of entire groups, and analogously for rule heads. Thus, elements in the body (and head) share the same value of the clustering attribute (corresponding to feature (c) above).
Cluster coupling. To compose rules, every pair of clusters (one for the body and one for the head) inside the same
group is considered. Furthermore, the optional HAVING clause of the CLUSTER BY clause selects the cluster pairs
that should be considered for extracting rules. In this case, a pair of clusters is considered only if the day of the
left hand cluster, the body cluster, precedes the day of the right hand cluster, the head cluster (corresponding
to feature (c) above).
Rule extraction. From each group, the SELECT clause extracts all possible associations of an unlimited set of
products (clause 1..n product-id->product AS BODY), representing the body of rules, with another set of
products (clause 1..n product-id->product AS HEAD), representing the head of rules (corresponding to feature (a) above).
Mining condition. The (optional) WHERE clause following the SELECT clause forces rule extraction to consider only pairs of tuples, the first one (called body tuple) coming from the body cluster and the second one (called head tuple) coming from the head cluster, such that the value of attribute price in the body tuple is less than or equal to 200, and it is higher than the value of attribute price in the head tuple (corresponding to feature (d) above).
Support and confidence evaluation. The support of a rule is the number of groups from which the rule is extracted divided by the total number of groups generated by the GROUP BY clause. The confidence is the number of groups from which the rule is extracted divided by the number of groups that contain the body in some cluster. When support or confidence are lower than the respective minimum thresholds (20/100 = 0.2 for support and 30/100 = 0.3 for confidence in our sample statement, see feature (e)), the rule is discarded.
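As a concrete check of these definitions, the following minimal Python sketch (ours, not part of the AMORE-DW prototype; the clustering and mining conditions of the example are deliberately ignored) computes group-based support and confidence on the instance of Figure 2(a):

# Minimal sketch: group-based support and confidence (clusters ignored).
sales = [  # (cust-id, product-id) pairs from the Figure 2(a) instance
    ("c-1", "p-1"), ("c-1", "p-2"), ("c-1", "p-4"),
    ("c-2", "p-3"), ("c-2", "p-4"), ("c-2", "p-5"),
]

def support_confidence(rows, body, head):
    groups = {}
    for cust, prod in rows:                  # GROUP BY cust-id
        groups.setdefault(cust, set()).add(prod)
    total = len(groups)
    with_body = sum(body <= items for items in groups.values())
    with_rule = sum((body | head) <= items for items in groups.values())
    return with_rule / total, with_rule / with_body

print(support_confidence(sales, {"p-3"}, {"p-5"}))   # -> (0.5, 1.0)

For the rule {p-3} ⇒ {p-5} this yields support 0.5 and confidence 1, matching the first rule of the result table below.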
This is the rule set extracted from the Sales table instance presented in Figure 2(a):

Rule                                     Support   Confidence
{brown boots} ⇒ {col shirts}             0.5       1
{jackets} ⇒ {col shirts}                 0.5       0.5
{brown boots, jackets} ⇒ {col shirts}    0.5       1

where brown boots corresponds to product identifier p-3, jackets to p-4, and col shirts to p-5. Since both rule bodies and heads are sets, they can be represented as relational set attributes, similarly to the forthcoming SQL3 standard.
The statement described above specifies very complex extraction criteria and exploits most of the powerful features provided by the MINE RULE operator. Simpler extraction criteria are normally specified in the initial phase of the search process; mining requests are then progressively refined by adding more complex filtering conditions. Rules can be divided into several (orthogonal) classes, depending on the extraction criteria that guide the extraction process:
Simple Association Rules. Only the basic (mandatory) extraction criteria are specified (source data, grouping attribute, mined attribute, body and head cardinality, minimum support and confidence). This class corresponds to classical association rule extraction [1].

Filtered Association Rules. A group filtering condition is specified: rules are extracted only from a selected subset of groups.

Mining-Constrained Association Rules. A mining condition expresses a complex correlation between rule body and rule head.

Clustering Association Rules. The clustering attributes, and possibly the cluster selection predicate, are specified. Bodies and heads of the rules are extracted from clusters instead of entire groups.

These classes are not mutually exclusive. Hence, mining problems can be expressed as a combination of the described extraction criteria.
[Figure 3: Mining Server architecture. The mining statement is received by the Translator, which drives the Mining Kernel; the Mining Kernel exchanges data with the ROLAP SQL Server, where the extracted rules are stored.]
3 AMORE-DW Architecture
The AMORE-DW prototype is based on a client-server architecture:

On the client side, by means of a suite of user-friendly interface tools, the user specifies mining requests, which are then automatically mapped to statements in the language for the extraction of association rules described in Section 2. Then, the mining statement is submitted to the mining server, which performs the rule extraction. Finally, on the client side, the user may browse the result of the extraction process.

The mining server, which is in charge of the actual rule extraction, is tightly coupled with the ROLAP server, in order to exploit its powerful data manipulation language and its data storage and access facilities. Typically, both servers (the mining server and the ROLAP server) should be resident on the same system, which is usually devoted to running computationally expensive analysis jobs.

In the following, we focus on the description of the architecture of the mining server, presenting in more detail all its components and their interaction.
3.1 Mining Server Architecture
The architecture of the Mining Server is depicted in Figure 3. It encompasses the following components:

The Translator interprets the MINE RULE mining statement: it checks the correctness of the request and generates the processing directives for the Mining Kernel.

The Mining Kernel is the specialized component for data analysis and association rule extraction.

The ROLAP Server prepares the data for the analysis according to the Mining Kernel's directives, provides efficient access to it, and stores the analysis results.

When a mining request is received by the Mining Server, the following processing steps take place (see Figure 3, where solid arrows indicate the information flow between the components of the system):

1. The Translator performs the syntactic and semantic verification of the statement's correctness. It checks the definition of table and attribute names referenced in the statement against the ROLAP server Data Dictionary. This information flow is represented in Figure 3 by the edge connecting the ROLAP Server to the Translator.

2. The Translator extracts the features characterizing the submitted statement and generates processing directives for the Mining Kernel. In the figure, the processing directives sent to the Mining Kernel are represented by the edge from the Translator to the Mining Kernel.
3. The Mining Kernel is activated upon receiving the processing directives from the Translator. It extracts association rules by performing the following tasks:

(a) It instructs the ROLAP Server to preprocess the data (see Section 3.3.1).
(b) It reads the data and composes association rules (see Section 3.3.2).
(c) It stores the extracted rules into the data warehouse.

In the figure, the bidirectional edge between the Mining Kernel and the ROLAP Server denotes the exchange of information in both directions.

At this point, the extraction process is completed and the user can browse the obtained association rules on the ROLAP Server.
An important issue in the implementation of the Mining Server is the identification of the border between typical data processing tasks, to be executed by the ROLAP server, and mining processing, performed by specialized algorithms in the Mining Kernel. At one extreme, the ROLAP server could be used only to retrieve raw data, as in the traditional approach. In this case, the mining algorithm would be overloaded by tasks such as the evaluation of complex SQL predicates on raw data. At the opposite extreme, the entire rule extraction could be performed by means of SQL programs executed by the ROLAP server. This solution is not efficient, since an excessively frequent context switching between the SQL context and the mining application context would occur [3]. In particular, to decide which tasks are better performed by each server, we considered the following issues:

The presence of the reference operator -> in MINE RULE statements requires some preprocessing to join the data referenced by means of the -> operator. This task is efficiently performed by the ROLAP Server.

SQL predicates allowed by the MINE RULE operator are effectively evaluated by the ROLAP Server without overloading the mining algorithm.

The actual association rule computation is an iterative process [1, 4, 11] that is better performed in main memory, with suitable data structures, by a specialized algorithm.

Hence, many of the extraction features that characterize the MINE RULE operator can be efficiently delegated to the ROLAP server. The pool of operations performed by the ROLAP server is embedded into an SQL package called the preprocessor (presented in Section 3.3.1). The actual rule discovery process is carried out by a specialized algorithm in the Mining Kernel, which is called the core operator (presented in Section 3.3.2).
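The following Python sketch illustrates this division of labor (an illustration of ours, with sqlite3 standing in for the ROLAP server and the toy data of Figure 2): predicate evaluation and duplicate elimination are pushed into SQL, while the iterative counting of itemsets runs in main memory.

import sqlite3
from collections import Counter
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sales(cust_id TEXT, product_id TEXT, day_id TEXT,
                       price INTEGER, qty INTEGER);
    INSERT INTO Sales VALUES
      ('c-1','p-1','d-1',140,1), ('c-1','p-2','d-1',180,1),
      ('c-2','p-3','d-2', 25,2), ('c-2','p-4','d-2',150,1),
      ('c-2','p-5','d-2',300,1), ('c-1','p-4','d-2',300,1),
      ('c-2','p-3','d-3', 25,3), ('c-2','p-5','d-3',300,2);
""")

# Data-processing side: SQL evaluates predicates and removes duplicates.
rows = conn.execute(
    "SELECT DISTINCT cust_id, product_id FROM Sales WHERE price <= 300"
).fetchall()

# Mining side: in-memory counting of 2-itemsets per group.
groups = {}
for cust, prod in rows:
    groups.setdefault(cust, set()).add(prod)
counts = Counter(pair for items in groups.values()
                      for pair in combinations(sorted(items), 2))
for pair, n in counts.items():
    if n / len(groups) >= 0.5:            # minimum support threshold
        print(pair, n / len(groups))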
3.2 Translator
The translator is in charge of the following tasks:

It executes all lexical, syntactic and semantic checks on the mining statement. Semantic checks range from the simple verification of the correctness of the statement (e.g., the source specification should reference tables existing in the database dictionary) to the enforcement of constraints on the mining statement (e.g., the set of attributes in the GROUP BY clause should be disjoint from the set of attributes in the other clauses, such as the CLUSTER BY or the SELECT clause).

It maps all non-numerical attributes (e.g., strings) referenced by the mining statement in non-conditional clauses into the corresponding encoded version³ (a toy sketch of such an encoding is given at the end of this subsection). Non-numerical attribute encoding allows a significant efficiency improvement for both the preprocessor and the core operator.
³ The correspondence is permanently stored in appropriate meta-data tables. Recall that the values of both the original attribute and its encoded version are stored in the data warehouse.
It analyzes the syntactic clauses in the MINE RULE statement, determining the class of mining statements to which the current statement belongs, in order to generate the processing directives for the Mining Kernel. MINE RULE statements are classified according to the following categories:

Basic statements: all the mandatory clauses are included (MINE RULE, SELECT, FROM, GROUP BY, EXTRACTING RULES WITH). Optionally, the selection of source data (WHERE predicate in the FROM clause) and the selection of relevant groups (HAVING predicate following the GROUP BY clause) may be specified. An arbitrary cardinality is allowed both in body and head.

Complex statements: optional features, such as the mining condition, clustering attribute(s) (CLUSTER BY clause) and cluster selection predicates (HAVING predicate following the CLUSTER BY clause), are specified.
Finally, the translator calls the Mining Kernel, passing to it all the information required for further processing.
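As a toy illustration of the encoding performed here (hypothetical code of ours; in the prototype the correspondence lives in the meta-data tables mentioned in footnote 3):

class AttributeEncoder:
    # Dictionary encoding: attribute value -> integer code and back.
    def __init__(self):
        self.code = {}     # value -> code (conceptually, the meta-data table)
        self.value = {}    # code -> value, used to decode extracted rules

    def encode(self, v):
        if v not in self.code:
            self.code[v] = len(self.code)
            self.value[self.code[v]] = v
        return self.code[v]

enc = AttributeEncoder()
print([enc.encode(p) for p in ["brown boots", "jackets", "brown boots"]])  # [0, 1, 0]
print(enc.value[1])                                                        # jackets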
3.3 Mining Kernel
The Mining Kernel performs the extraction of association rules. Its structure can be divided into two main blocks:

The Preprocessor, which performs raw data preprocessing and yields data in a suitable format for the rule extraction algorithms in the core operator.

The Core Operator, which performs the actual rule extraction.
3.3.1 Preprocessor
Depending on the class of mining statements specified by the user, the preprocessor executes distinct SQL programs whose execution provides the core operator with the appropriate set of input data. These programs include database instructions that retrieve source data, evaluate selection predicates and association conditions, and prepare data in the appropriate format for the core operator.
This component heavily exploits the services of the ROLAP Server and the specific data warehouse operating context. In particular, for example, it takes advantage of the available mechanisms to disable continuous logging of database operations, which is costly and useless in this operating context.
The structure of the preprocessor is depicted in Figure 4. In the figure, ovals indicate SQL programs, while rectangles denote views or temporary tables. Directed edges show the processing flow: arcs entering a program denote its input tables, arcs exiting a program denote its output tables. Labeled arcs denote that the corresponding operation is executed only if the associated feature appears in the mining statement (a legend is reported in Figure 4); the vertical bar (|) separating two labels denotes their disjunction.

[Figure 4: Preprocessor architecture. Programs: Q0 creates Source from the base tables; Q1 counts the total number of groups; Q2 creates ValidGroups; Q3 creates CoreSource; Q4 computes aggregate functions into Clusters; Q5 creates the elementary rules in AllRules; Q6 computes the large rules; Q7 selects the large rules into InputRules. Legend: G = presence of the group filtering condition; M = presence of the mining condition; H = different attribute schema for body and head; C = clustered statement; F = aggregate functions in the cluster filtering condition.]
Depending on the class of mining statements, the Preprocessor executes the following operations:
Basic statements. The preprocessor retrieves source data by means of program Q0; its result, named Source, may be a view or a temporary table containing the result of joining the fact and the dimension tables. Then, program Q1 counts the total number of groups in Source (needed by the Core Operator to compute rule support). If a group filtering condition is specified, program Q2, which selects the set of groups that satisfy the condition, is also executed. Finally, program Q3 generates the input for the Core Operator, named CoreSource. While in the simplest case CoreSource is a view, it may be temporarily materialized when the Core Operator needs to scan it several times. CoreSource includes only the tuples that must be inspected in the extraction process.
Complex statements. In this case, since the Preprocessor is in charge of the evaluation of all complex SQL predicates on data, it generates the so-called elementary rules, i.e., basic associations of one item for the body and one item for the head. Hence, some further processing steps are required with respect to the former case. In particular, if either a mining condition is specified, or clustering attributes and possibly a cluster condition are specified, program Q5 reads view CoreSource and creates the elementary rules, storing them into the AllRules temporary table, together with the list of groups containing each rule. In the presence of aggregate functions in the cluster condition, the execution of program Q5 is preceded by the execution of program Q4, which produces table Clusters, where each cluster is associated to the corresponding value of the aggregate functions. This table is then used to compose the elementary rules in program Q5.
Finally, the set of elementary rules having support higher than or equal to the minimum support threshold (the large rules) is computed and stored in temporary table InputRules, which is the input of the Core Operator. In particular, program Q6 counts the support of each rule and detects the large rules, while program Q7 prepares table InputRules.
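The following sketch (our illustration, with hypothetical table and column names and sqlite3 in place of the ROLAP server) mimics the role of program Q5: a self-join of CoreSource pairs a body tuple with a head tuple of the same group, evaluating cluster and mining conditions directly in SQL.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CoreSource(grp TEXT, cluster TEXT, item TEXT, price INTEGER);
    INSERT INTO CoreSource VALUES
      ('c-2','d-2','p-3', 25), ('c-2','d-2','p-4',150), ('c-2','d-2','p-5',300),
      ('c-2','d-3','p-3', 25), ('c-2','d-3','p-5',300);
""")
elementary = conn.execute("""
    SELECT DISTINCT b.grp, b.item AS body, h.item AS head
    FROM CoreSource b JOIN CoreSource h
      ON b.grp = h.grp AND b.cluster < h.cluster  -- cluster condition: body before head
    WHERE b.price <= 200 AND h.price < b.price    -- mining condition
""").fetchall()
print(elementary)  # [('c-2', 'p-4', 'p-3')]: pairs from which AllRules is built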
Summarizing, the preprocessor produces two types of input data for the core operator:
The Source Data. The core operator performs its analysis by reading only the tuples in the CoreSource table,
i.e., the actual tuples to be considered in the extraction process. It is unaware of the actual origin of the source
data.
The Large Elementary Rules. These are the basic associations of one item for the body and one item for the
head having support higher than or equal to the minimum support threshold, which are stored in temporary
table InputRules. They are produced by the evaluation of the association conditions in the mining statement
(e.g., the cluster or the mining condition).
3.3.2 Core Operator
The data processing tasks performed by the preprocessor allow the core operator to be independent of the selection conditions specified in the MINE RULE statement, e.g., HAVING conditions on clusters. This independence is essential for the simplicity and efficiency of the implementation of the extraction algorithms. The features provided by our operator extend the semantics of the association rules with respect to other SQL-like operators proposed in the literature [8]. This extended semantics requires that the core operator be implemented with original solutions, which need several adaptations of the well-known algorithms proposed in the literature [1, 4, 11]. The Mining Kernel includes several algorithms, each one specifically tailored to the categories of MINE RULE statements outlined in Section 3.3.1.
Basic statements. In this case, the extraction process corresponds to the traditional rule extraction performed by the algorithms described in [1, 4, 11, 10]. Rule discovery is performed by initially building sets of items with sufficient support (called large itemsets), and then creating rules from the large itemsets. All these algorithms are based on the observation that the number of itemsets grows exponentially with the number of items. Since a counter in main memory is needed for each itemset whose support is computed, these itemsets are carefully selected and their number is kept as low as possible.
The current version of the core operator for basic statements is based on the algorithm presented in [11]. However, the modularity of our architecture allows us to easily replace this algorithm with any of the algorithms proposed in the literature (see [1, 2, 4, 11], etc.). The algorithm operates in two passes. In the first pass, the source data is divided into partitions of equal size, chosen so that the counters for the itemsets can be stored in main memory. The search for large itemsets is performed separately in each partition. This step is based on
the observation that sufficient support in at least one partition is a necessary condition for an itemset to have sufficient support also with respect to the whole source data. The itemsets that have sufficient support within a partition are saved on disk. The second pass computes the effective support of the saved itemsets, by counting the number of groups in the entire database that contain each itemset. The actual support is given by the ratio between this number and the total number of groups (computed by the preprocessor).
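A compact Python rendering of this two-pass partition scheme (a simplification of the algorithm of [11]; groups are assumed to be already in memory as sets of items, whereas the real implementation reads partitions of CoreSource from the database):

from collections import Counter
from itertools import combinations

def local_large(part, min_count):
    # Pass 1 helper: itemsets with sufficient support inside one partition.
    counts = Counter()
    for items in part:
        for k in range(1, len(items) + 1):
            counts.update(combinations(sorted(items), k))
    return {s for s, n in counts.items() if n >= min_count}

def partition_algorithm(groups, minsup, n_parts):
    size = max(1, len(groups) // n_parts)
    parts = [groups[i:i + size] for i in range(0, len(groups), size)]
    # Pass 1: a globally large itemset must be large in at least one partition.
    candidates = set()
    for p in parts:
        candidates |= local_large(p, max(1, int(minsup * len(p))))
    # Pass 2: count the effective support of each candidate over all groups.
    support = {c: sum(set(c) <= g for g in groups) / len(groups)
               for c in candidates}
    return {c: s for c, s in support.items() if s >= minsup}

groups = [{"p-1", "p-2"}, {"p-1", "p-2", "p-3"}, {"p-2", "p-3"}, {"p-1", "p-3"}]
print(partition_algorithm(groups, minsup=0.5, n_parts=2))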
The actual rule discovery process considers each large itemset previously generated and extracts subsets of items. Indicating with L a large itemset and with H ⊂ L a subset, the rule (L − H) ⇒ H is formed. Its support is defined as the itemset support supp(L); its confidence is immediately computed as supp(L) / supp(L − H).
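In code, this rule-generation step can be sketched as follows (our illustration; the supports of all large itemsets are assumed to have been computed already, e.g., by the two-pass search above):

from itertools import combinations

def rules_from_itemsets(supports, minconf):
    # supports: {sorted_item_tuple: support}; assumed closed under subsets.
    out = []
    for L, supp_L in supports.items():
        for k in range(1, len(L)):
            for H in combinations(L, k):
                body = tuple(sorted(set(L) - set(H)))
                conf = supp_L / supports[body]   # supp(L) / supp(L - H)
                if conf >= minconf:
                    out.append((body, H, supp_L, conf))
    return out

supports = {("p-1",): 0.75, ("p-2",): 0.75, ("p-1", "p-2"): 0.5}
print(rules_from_itemsets(supports, minconf=0.6))
# both {p-1} => {p-2} and {p-2} => {p-1}, support 0.5, confidence 0.67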
Complex statements. In this case the core operator receives from the ROLAP server the elementary rules⁴ from which rules can be extracted, instead of computing them itself. The algorithm discovers rules starting from formerly generated rules, as opposed to considering whole itemsets as in the previous algorithm. Rule discovery proceeds by steps, progressively increasing at each step the cardinality of body and head, and computing rule support and confidence. At each step, new rules are created by combining two rules found in the previous step. At first, two rules with matching heads are combined: the new rule has the same head, and its body is obtained as the union of the two bodies. In this way the cardinality of the rule body is increased. Afterwards, new rules are created by combining the heads, thus increasing head cardinality. Finally, the algorithm scans the CoreSource table to compute the support of bodies, in order to calculate the confidence of the extracted rules.

⁴ Elementary rules represent large itemsets of cardinality two, i.e., the rules with the lowest cardinality for body and head.
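A minimal sketch of the first combination step (two rules with matching heads yield a rule with the merged body), under our simplifying assumption that keeping, for each rule, the set of groups containing it is enough to count support:

def grow_bodies(rules, total_groups, minsup):
    # rules: {(body_tuple, head_tuple): set of group ids containing the rule}
    new = {}
    items = list(rules.items())
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            (b1, h1), g1 = items[i]
            (b2, h2), g2 = items[j]
            if h1 == h2 and b1 != b2:          # matching heads: merge bodies
                key = (tuple(sorted(set(b1) | set(b2))), h1)
                grps = g1 & g2                 # the rule holds where both held
                if len(grps) / total_groups >= minsup:
                    new[key] = grps
    return new

elementary = {(("p-3",), ("p-5",)): {"c-2"},
              (("p-4",), ("p-5",)): {"c-2"}}
print(grow_bodies(elementary, total_groups=2, minsup=0.2))
# {(('p-3', 'p-4'), ('p-5',)): {'c-2'}}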
4 Experimental Results
The experimental results in this section are aimed at the exploration of the trade-off between system performance and flexible data analysis that we faced in the development of our prototype. For the experiments, we generated two synthetic databases, db1 and db2, using the publicly available datagen generator [4]. Database db1 contains 100000 groups and each group has an average of 5 tuples (i.e., each customer's purchase averages 5 products); this yields a total database size of 494617 tuples. The chosen data distribution yields rules with an average length of 2 elements. Database db2 again has 100000 groups, but the average number of tuples in each group is 10. Hence, the size of db2 is twice the size of db1. Furthermore, in this case, the selected data distribution yields an average rule length of 4 elements. The experiments have been performed on a (non-dedicated) Digital Alpha Server with 256MB of RAM and the Oracle 8.0 ROLAP server.
The first experiment, whose result is presented in Figure 5(a), is devoted to the comparison of the performance of our architecture with respect to the traditional flat file approach. We extracted simple association rules with minimum support 0.75% from both the data stored in binary files and the data stored in the ROLAP server. The flat file approach, which is specifically designed for the discovery of association rules with a fixed structure and with very simple and unchangeable extraction criteria, yields clearly better performance (roughly faster by a factor of 8 for both data distributions). This result is obtained without performing specific optimizations of the database I/O (e.g., data is read tuple by tuple and not using arrays), to show the worst-case difference between the two approaches. Optimized database access operations would allow closer performance results. Unfortunately, the superior performance of the flat file approach is obtained at the price of completely losing the possibility of flexibly varying the rule extraction criteria.
[Figure 5: Experimental results (execution times; each AMORE-DW bar is split into the preprocessing step on the DW and the algorithm step on the DW): (a) comparison of the flat file approach with the AMORE-DW prototype — db1, simple statement: flat file 33 vs. 283 on the DW (preprocessing 0.1), 54 rules; db2, simple statement: flat file 167 vs. 1305 on the DW (preprocessing 0.1), 200 rules; (b) comparison of the execution of a simple statement (283, 54 rules) with a statement with mining condition (9.8, of which 7.4 preprocessing; 7 rules) on db1.]

The second experiment shows the effect of adding a mining condition (namely, BODY.price < HEAD.price) to the previous extraction criteria. In this case, we compared both performance and selectivity of the above complex
statement with respect to the simple statement considered in the previous experiment. For the experiment, whose result is presented in Figure 5(b), we considered database db1. The addition of a more selective extraction criterion (the mining condition reduces the allowed product pairs) significantly reduces the number of obtained rules (from 54 for the simple statement to 7 for the complex statement), thereby improving the significance of the extracted information. Furthermore, the preprocessing step performed by the ROLAP server allows a significant reduction of the time spent by the algorithm in building the rules; hence, the obtained performance improves significantly (by a factor larger than 25). In particular, while in the former experiment the work performed by the preprocessor was very small and its effect on the overall performance was negligible, in this case most of the time is spent in the preprocessing step, which prepares the large elementary rules for the extraction algorithm. Hence, the preprocessing step, although taking a longer time than before, can contribute significantly to the improvement of the overall performance of the architecture. Finally, note that the overall processing time for the complex statement is also lower (roughly by a factor of 3) than the time required by flat file processing of the simple statement.
5 Conclusions
In this paper we described the architecture of the AMORE-DW system. AMORE-DW provides an environment for the specification of mining requests by means of a powerful SQL-like language for expressing rule extraction criteria. The MINE RULE operator has already been used for the specification of several mining problems applied to heterogeneous domains, e.g., the analysis of telephone data.

The architecture of the AMORE-DW environment is characterized by a tight coupling with a ROLAP server, which manages the access to the warehouse data from which rules are extracted. The flexibility gained by accessing data stored in relational format is counterbalanced by the reduced performance of the system with respect to the traditional flat file approach. Nevertheless, when more selective extraction conditions are specified, a significant improvement of the overall processing time is obtained, owing to the effective preprocessing of data performed by the ROLAP server.

Finally, we observe that a tight coupling of mining and warehousing technology yields several opportunities that we are currently exploring:
Selected relevant association rule sets can be precomputed and stored in the warehouse to guarantee optimal response time. These precomputed rule sets can be incrementally updated upon periodical update of the warehouse data.

Extraction statements can be progressively refined, e.g., by providing more restrictive selection conditions; the system can store and exploit the result of previous processing steps to simplify the successive extraction of rules selected by the refined statements.

A collection of different extraction algorithms (e.g., [4, 10]), each one appropriate for a different type of data distribution, can be easily incorporated in the system and used for different warehouse data distributions.
References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD Conference on Management of Data, pages 207-216, Washington, D.C., May 1993.

[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.

[3] R. Agrawal and K. Shim. Developing tightly-coupled data mining applications on a relational database system. In Proc. KDD-96, pages 287-290, 1996.

[4] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th VLDB Conference, Santiago, Chile, September 1994.

[5] R. Agrawal and R. Srikant. Mining sequential patterns. In International Conference on Data Engineering, Taipei, Taiwan, March 1995.

[6] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1):65-74, March 1997.

[7] J. Han, S. H. Chee, and J. Y. Chiang. Issues for on-line analytical mining of data warehouses. In Proceedings of the SIGMOD-98 Workshop on Research Issues on Data Mining and Knowledge Discovery, Seattle, Washington, USA, June 1998.

[8] J. Han, Y. Fu, W. Wang, K. Koperski, and O. Zaiane. DMQL: A data mining query language for relational databases. In Proceedings of the SIGMOD-96 Workshop on Research Issues on Data Mining and Knowledge Discovery, 1996.

[9] R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. In Proceedings of the 22nd VLDB Conference, Bombay, India, September 1996.

[10] J. S. Park, M.-S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. In Proceedings of the ACM SIGMOD International Conference on Management of Data, San Jose, California, May 1995.

[11] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995.

[12] R. Srikant and R. Agrawal. Mining generalized association rules. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland, September 1995.