Integrating Data Mining with SQL Databases: OLE DB for Data Mining
Amir Netz, Surajit Chaudhuri, Usama Fayyad*, Jeff Bernhardt
Microsoft Corporation
* Author's current affiliation: digiMine, Inc.
Abstract
The integration of data mining with traditional database
systems is key to making it convenient, easy to deploy in
real applications, and to growing its user base. In this
paper we describe the new API for data mining proposed
by Microsoft as extensions to OLE DB standard. We
illustrate the basic notions that motivated the API's design
and describe the key components of an OLE DB for Data
Mining provider. We also include examples of its usage
and treat the problems of data representation and
integration with the SQL framework. We believe this new
API will go a long way in enabling deployment of data
mining in enterprise data-warehouses. A reference
implementation of a provider is available with the recent
release of Microsoft SQL Server 2000 database system.
1. Introduction
Progress in database technology has provided the
foundation that made massive data warehouses of
business data ubiquitous. Increasingly, such data
warehouses are built using relational database technology.
Recently, there has been a tremendous commercial
interest in mining the information in these data
warehouses. However, building mining applications on
data over relational databases is nontrivial and requires
significant work on the part of application builders.
Therefore, what is needed is strong system support to
ease the burden of developing mining applications against
relational databases. In this paper, we introduce the
“OLE DB for Data Mining” (henceforth abbreviated as
OLE DB DM in this paper) Application Programming
Interface (API). As we will demonstrate in the rest of this
paper, if a data source supports OLE DB DM (i.e., if
the data source is an OLE DB DM “provider”), then the
task of building mining applications is simplified because
of the native support for DM primitives in OLE DB DM.
The case for supporting such an API is best understood
by reviewing the current state of the art. Traditionally,
data mining tasks are performed assuming that the data is
in memory. In the last decade, there has been an increased
focus on scalability but research work in the community
has focused on scaling analysis over data that is stored in
the file system. This tradition of performing data mining
outside the DBMS has led to a situation where starting a
data mining project requires its own separate
environment. Data is dumped or sampled out of the
database, and then a series of Perl, Awk, and special
purpose programs are used for data preparation. This
typically results in the familiar large trail of droppings in
the file system, creating an entire new data management
problem outside the database. In particular, such “export”
creates nightmares of data consistency with data in
warehouses. This is ironic because database systems were
created to solve exactly such data management problems.
Once the data is adequately prepared, the mining
algorithms are used to generate data mining models.
However, past work in data mining has barely addressed
the problem of how to deploy models, e.g., once a model
is generated, how to store, maintain, and refresh it as data
in the warehouse is updated, how to programmatically use
the model to do predictions on other data sets, and how to
browse models. Over the life cycle of an enterprise
application, such deployment and management of models
remains one of the most important tasks. The objective of
proposing OLE DB DM is to alleviate problems in
deployment of models and to ease the data preparation by
working directly on relational data.
Another motivation to introduce OLE DB DM is to
enable enterprise application developers to participate in
building data mining solutions. Easing the ability to
develop integrated solutions is critical for the growth of
data mining technology in the enterprise space. In
enterprises, data mining needs to be perceived as a value-added component along with traditional decision support
techniques, e.g., traditional SQL as well as OLAP
querying environments [3]. In order to ensure success of
data mining in this regard, it is important to recognize that
a typical enterprise application developer is hardly an
expert in statistics and pattern recognition. Rather, he or
she is comfortable using traditional database APIs/tools
based on SQL, OLE DB, and other well-known standards
and protocols. Therefore, it is extremely important that
the infrastructure for supporting data mining solutions is aligned with the traditional database development environment and with APIs for database access.
Consequently, we decided to exploit the well-known OLE
DB API to ensure that data mining models and operations
gain the status of first-class objects in the mainstream
database development environment.
Our proposal builds upon OLE DB [7], an application programming interface that has gained popular use for universal data access, not only from relational systems but from any data source that can be viewed as a “set of tables” (rowsets). It is important to point out
that OLE DB for DM is independent of any particular
provider or software and is meant to establish a uniform
API. Another feature of our proposal is that it is not
specialized to any specific mining model but is structured
to cater to all well-known mining models. Any party
interested in using this interface is encouraged to do so by
building its own provider, as with OLE DB (and ODBC).
However, we should also add that one of the strengths of
our proposal is that we have implemented this API as part
of the recent release of Microsoft SQL Server 2000
(Analysis Server component). The proposal also
incorporates feedback from a multitude of vendors in the
data mining space who participated in preview meetings
and calls for reviews over the past 18 months.
Our proposal is not focused on any particular KDD
algorithms per se since our intent is not to propose new
algorithms, but to suggest a system infrastructure that
makes it possible to “plug in” any algorithm and enable it
to participate in the entire life-cycle of data mining. This
paper is our attempt to summarize the basic philosophy
and design decisions leading to the current specification
of OLE DB DM. A complete specification may be
obtained from [9].
The rest of the paper is organized as follows. We first
introduce the overall architecture and design philosophy
of OLE DB for Data Mining (OLE DB DM) in Section 2.
In Section 3, we discuss, with illustrative examples of
their usage, the primitives and basic building blocks in the
API. Section 4 presents related work and we conclude in
Section 5.
2. Overview and Design Philosophy
OLE DB DM provides a set of extensions to OLE DB
to make it “mining-enabled.” OLE DB is an object-oriented specification for a set of data access interfaces
designed for record-oriented data stores. In particular, by
virtue of supporting command strings, OLE DB can
provide full support to SQL. In this paper, we will
primarily introduce extensions that leverage this aspect of
OLE DB API to make sure that support for mining
concepts is built in.
Figure 1 shows how the Microsoft SQL Server database
management system exposes OLE DB DM. In this
specific implementation, the “core” relational engine
exposes the traditional OLE DB interface. However, the
analysis server component exposes an OLE DB DM
interface (along with OLE DB for OLAP) so that data
mining solutions may be built on top of the analysis
server. Of course, this is only one example of how OLE
DB DM may be exposed. For example, in the future, data
mining support may be deeply integrated in the “core”
relational database system.
Our approach in defining OLE DB DM is to maintain
the SQL metaphor and to reuse many of the existing
notions whenever possible. Due to the widespread use of
SQL in the developer community, we decided to expose
data mining interfaces in a language-based API. This
decision reflects a central design principle: building data
mining solutions (applications) should be easy. Thus, our
design philosophy requires relational database
management systems to invest resources to provide the
“drivers” to support OLE DB DM so that data mining
solutions may be developed with ease by data mining
application developers. Since there are likely to be many
more application developers than DBMS vendors, such a
design philosophy inflicts the least pain.
The key challenge is to be able to support the basic
operations of data mining without introducing much
change in the programming model and environment that a
database developer is used to. Fundamentally, this
requires supporting operations on data mining models.
These key operations to support on a data mining model
are the ability to:
1. Define a mining model, i.e., identify the set of attributes of data to be predicted, the set of attributes to be used for prediction, and the algorithm used to build the mining model.
2. Populate a mining model from training data using the algorithm specified in (1).
3. Predict attributes for new data using a mining model that has been populated.
4. Browse a mining model for reporting and visualization applications.
In developing OLE DB DM, we have decided to
represent a data mining model (henceforth model or
DMM) as analogous to a table in SQL. (Although viewed conceptually and programmatically as tables, models have important differences from tables: a data mining model is populated by consuming a rowset, but its internal structure can be more abstract, e.g., a decision tree is a tree-like structure, much more compact than the training data set used to create it.) Thus, it can be defined via the CREATE statement (used to create a table
in SQL). It can be populated, possibly repeatedly via the
INSERT INTO statement (used to add rows in a SQL
table), and it can be emptied (reset) via the DELETE
statement. A mining model may be used to make
predictions on datasets. This is modeled through
prediction join between a mining model and a data set,
just as the traditional join operation is used to pull
together information from multiple tables. Finally, the
content of a mining model can be browsed via SELECT
(used to identify a subset of data in SQL). Since the
mining model is a native named object at the server, the
mining model can participate in interaction with other
objects using the primitives listed above (see also Section
3). Note that this achieves the desirable goal of avoiding
excessive data movement, extraction, and copying, thus resulting in better performance and manageability
compared to the traditional “mining outside the database
server” framework. Finally, once the DMM is created and
optimized, deployment within the enterprise becomes as
easy as writing SQL queries. This significantly minimizes
the cost of deployment by eliminating the need for
developing a proprietary infrastructure and interface to
deploy the data mining solution. Model deployment and
maintenance costs have received little attention in the
literature, though they are an expensive and difficult task
in real applications. In the next section, we review the
details of the language extensions that provide an
extensible framework for incorporating support for data
mining.
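Before turning to those details, the life cycle just outlined can be sketched schematically using the statement forms described above. This is only an outline; the model name and algorithm are illustrative, and the full syntax is developed in Section 3:

CREATE MINING MODEL [Age Prediction] (...)   %define columns and algorithm
INSERT INTO [Age Prediction] ...             %train (populate) from a caseset
SELECT ... FROM [Age Prediction]
  PREDICTION JOIN ...                        %predict over new data
SELECT * FROM [Age Prediction].CONTENT       %browse the learned content
DELETE FROM [Age Prediction]                 %empty (reset) the model
DROP MINING MODEL [Age Prediction]           %remove the model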
3. Basic Components of OLE DB DM
OLE DB DM has only two notions beyond the traditional
OLE DB definition: cases and models.
• Input data is in the form of a set of cases (caseset). A
case captures the traditional view of an “observation”
by machine learning algorithms as consisting of all
information known about a basic entity being
analyzed for mining. Of course, structurally, a set of
cases is no different from a table (though cases are often more naturally represented as hierarchical data, i.e., as nested tables). Cases are
explained further in Section 3.1.
• A data mining model (DMM) is treated as if it were a special type of “table”: (a) We associate a caseset
with a DMM and additional meta-information while
creating (defining) a DMM. (b) When data (in the
form of cases) is inserted into the data mining model,
a mining algorithm processes it and the resulting
abstraction (or DMM) is saved instead of the data
itself. Once a DMM is populated, it can be used for
prediction, or its content can be browsed for reporting and visualization purposes. Fundamental operations
on models include CREATE, INSERT INTO,
PREDICTION JOIN, SELECT, DELETE FROM,
and DROP.
It should be noted that OLE DB DM has all the
traditional OLE DB interfaces. In particular, schema
rowsets specify the capabilities of an OLE DB DM
provider. This is the standard mechanism in OLE DB
whereby a provider describes information about itself to
potential consumers. This information describes the
supported capabilities (e.g. prediction, segmentation,
sequence analysis, etc.), types of data distributions
supported, limitations of the provider (size or time
limitations), support for incremental model maintenance,
etc. Other schema rowsets provide metadata on the
columns of a mining model, on its contents, and the
supported services. The details of these schema rowsets
are beyond the scope of this paper.
The rest of this section is organized as follows. In
Section 3.1, we describe how input data is modeled as a
caseset. In Section 3.2, we describe how data mining
models may be defined (created) in a relational
framework and the kind of meta-information on casesets
that mining needs to be aware of. Section 3.3 describes
operations on mining models, notably insertion
(populating the model), prediction, and browsing.
3.1. Data as Cases
To ensure that a data mining model is accurate and
meaningful, data mining algorithms require that all
information related to the entity to be mined be
consolidated before invoking the algorithms.
Traditionally, mining algorithms also view the input as a
single table that represents such consolidated information
where each row represents an instance of an entity. Data
mining algorithms are designed so that they consume an
entity instance at a time. Having all the information
related to a single entity instance together in a rowset has two important benefits. First, traditional data mining algorithms can be leveraged with relative ease. Second, it increases scalability, as it eliminates the need for data mining algorithms to do considerable bookkeeping.
In a relational database, data is often normalized and
hence the information related to an entity is scattered across
multiple tables. Therefore, in order to use data mining, a
key step is to be able to pull the information related to an
entity into a single rowset using views. Furthermore, as is
well recognized, data is often hierarchical in nature and so
ideally we require the view definition language to be
powerful enough to be able to produce a hierarchical
rowset. This collection of data pertaining to a single entity
is called a case, and the set of all relevant cases is referred
to as a caseset. As mentioned above, a caseset may need to be hierarchical. In the rest of this section, we elaborate on these two issues through examples and further discussion.

Table 1: A nested table representation of the 3-way join result

  Customer ID: 1   Gender: Male   Hair Color: Black   Age: 35 (Age Prob.: 100%)

  Product Purchases:                     Car Ownership:
    Name   Quantity   Type                 Car     Prob.
    TV     1          Electronic           Truck   100%
    VCR    1          Electronic           Van     50%
    Ham    2          Food
    Beer   6          Beverage
Consider a schema consisting of 3 tables representing
customers. The Customers table has some customer
properties. The Product Purchases table stores for each
customer what products they bought. The Car Ownership
table lists the cars owned by each customer. (The Age Prob. and Car Ownership Prob. columns in Table 1 are unusual in that they represent degrees of certainty.) Consider a customer, say Customer ID 1, who happens to be male, has black hair, and is believed to be 35 years old with 100% certainty. Assume this customer has bought a TV, a VCR, Beer (quantity 6), and Ham (quantity 2). Further assume we know this customer owns a truck (100% certainty) and we believe he also has a van (50% certainty). Suppose one
were to issue a join query over the 3 tables to get all the
information we know about the set of customers. Readers
familiar with SQL will quickly realize that this join query
will return a table of 12 rows. Furthermore, these 12 rows
contain considerable replication. More importantly, since the
information about an entity instance is scattered among
multiple rows, the quality of output from data mining
algorithms is negatively impacted by such flattened
representation.
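For concreteness, the flattened result could be produced by a query along the following lines (a sketch only; the table and column names are assumptions made for illustration):

SELECT c.[Customer ID], c.[Gender], c.[Hair Color], c.[Age],
       p.[Product Name], p.[Quantity], p.[Product Type],
       o.[Car], o.[Prob]
FROM Customers c, [Product Purchases] p, [Car Ownership] o
WHERE c.[Customer ID] = p.[Customer ID]
  AND c.[Customer ID] = o.[Customer ID]

Every purchase row of a customer is paired with every car row of that customer, so single-valued properties such as gender, hair color, and age are replicated across the cross product.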
Table 1 shows how a nested table can succinctly
describe the above join result. As illustrated in Table 1, a
customer case can include not only simple columns but
also multiple tables (nested tables as columns). Each of
these tables inside the case can have a variable number of
rows and a different number of columns. Some of the
columns in the example have a direct one-to-one
relationship with the case (such as “Gender” and “Age”),
while others have a one-to-many relationship with the
case and therefore exist in nested tables. The reader can
easily identify the two tables (nested) contained in the
sample case in Table 1.
• Product Purchases table containing the columns
Product Name, Product Quantity, and Product
Type.
• Car Ownership table containing the columns Cars
and Car Probability.
The nested table is a familiar notion in the relational database world, though it has not been adopted universally in commercial systems. In OLE DB DM, we have chosen to
use the Data Shaping Service (part of the Microsoft Data Access Components, MDAC) to generate a hierarchical rowset. It should be emphasized that this is a logical
definition and does not have any implications for data
storage. In other words, it is not necessary for the storage
subsystem to support nested records. Cases are only
instantiated as nested rowsets prior to training/predicting
data mining models. Note that the same physical data may
be used to generate different casesets for different
analysis purposes. For example, if one chooses to mine
models over all products, each product then becomes a
single case and customers would need to be represented
as columns of the case.
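For example, the customer caseset of Table 1 might be assembled with a shape command along the following lines (a sketch; the source tables Customers, Sales, and Cars and their columns are assumptions):

SHAPE
  {SELECT [Customer ID], [Gender], [Hair Color], [Age]
   FROM Customers ORDER BY [Customer ID]}
APPEND
  ({SELECT [CustID], [Product Name], [Quantity], [Product Type]
    FROM Sales ORDER BY [CustID]}
   RELATE [Customer ID] TO [CustID]) AS [Product Purchases],
  ({SELECT [CustID], [Car], [Prob]
    FROM Cars ORDER BY [CustID]}
   RELATE [Customer ID] TO [CustID]) AS [Car Ownership]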
Once a hierarchical caseset has been prepared using
the traditional view definition mechanism and the Data
Shaping Service in OLE DB, we can use this caseset to
populate a data mining model or we can predict values of
unknown columns using a pre-existing model. In order to
do that, we need a way to represent data mining models.
We illustrate this in the next section.
3.2. Creating/Defining Data Mining Models
In OLE DB DM, a data mining model (DMM) is
treated as a first class object of the provider, much like a
table. In this section, our focus is on definition (creation)
of data mining models. In particular, the DMM definition
includes a description of the columns of data over which
the model will be created, including meta-information on
relationships between columns. It is assumed that a data
set is represented as a nested table as discussed in Section
3.1. In Section 3.3, we will describe how other operations
on data mining models may be supported in OLE DB
DM.
As discussed earlier, a data mining model is trained
over a single caseset and is used to predict columns of a
caseset. Therefore, defining (creating) a data mining
model requires us to specify:
• The name of the model.
• The algorithm (and parameters) used to build the model.
• The algorithm for prediction using the model.
• The columns the caseset must have and the relationships among these columns.
• The columns to be used as “source” columns and the columns to be filled in by the mining model (“prediction columns”).
The following example illustrates the syntax:
CREATE MINING MODEL [Age Prediction] (           %Name of model
  [Customer ID]       LONG   KEY,
  [Gender]            TEXT   DISCRETE,
  [Age]               DOUBLE DISCRETIZED PREDICT,   %prediction column
  [Product Purchases] TABLE (
    [Product Name]    TEXT   KEY,
    [Quantity]        DOUBLE NORMAL CONTINUOUS,
    [Product Type]    TEXT   DISCRETE RELATED TO [Product Name]
  )
)
USING [Decision_Trees_101]                       %Mining algorithm used
The annotations with the example identify the source
columns used for prediction, the column to be predicted
(age) and the mining algorithm used. Specifically, the
DMM is named [Age Prediction] (the square brackets [ ]
are name delimiters by convention). The PREDICT
keyword indicates that the Age column is to be predicted.
The [Product Purchases] is a nested table, with [Product
Name] being its key. The USING clause specifies the
mining algorithm to use. Parameters to the algorithm
could be included as part of this clause. When we contrast
this example with the CREATE TABLE syntax in SQL, a
key difference is the substantially richer meta-information that adorns each column reference above.
The model columns describe all of the information
about a specific case. For example, assume that each case
in the DMM represents a customer. The columns of the
DMM will include all known and desired information
about the customer (e.g. Table 1). Several kinds of
information may be specified for columns:
• Information about the role of a column (Section
3.2.1): Is it a key? Does it contain an attribute?
How does it relate to other columns or nested
tables? Is it a qualifier on other columns?
• For columns that represent attributes, what type of
attribute is it? Attribute types are described in
Section 3.2.2.
• Additional information about data distributions
that can be provided as hints to the DMM (Section
3.2.3).
The above details are discussed in the rest of this
subsection and we conclude this section with a
comprehensive example. Finally, note that despite the
similarities with CREATE TABLE with respect to
programmability, mining models are conceptually
different as they represent only summarized information
for a data set. Also, unlike traditional data tables, they are
capable of prediction (see Section 3.3).
3.2.1 Content Types of Columns – Column Specifiers
Each column in a case can represent one of the
following content types:
KEY: The columns that identify a row. For example, “Customer ID” uniquely identifies customer cases, and “Product Name” uniquely identifies a row in the “Product Purchases” table.
ATTRIBUTE: A direct attribute of the case. This type
of column represents some value for the case. Example
attributes are: the age, gender, or hair-color of the
customer or the quantity of a specific product the
customer purchased.
RELATION: Information used to classify attributes,
other relations, or key columns. For example, “Product
Type” classifies “Product Name.” A given relation value
must always be consistent for all of the instance values of
the other columns it describes—for example, the product
“Ham” must always be shown as “Food” for all cases. In
the CREATE MINING MODEL command syntax,
relations are identified in the column definition by using a
RELATED TO clause to indicate the column being
classified.
QUALIFIER: A special value associated with an
attribute that has a predefined meaning for the provider.
Take for example the probability that the attribute value is
correct. These qualifiers are all optional and apply only if
the data has uncertainties attached to it or if the output of
previous predictions is being chained as input to a
subsequent DMM training step. In the CREATE MINING
MODEL command syntax, modifiers are identified by
using an OF clause to indicate the attribute column they
modify (an example declaration appears after this list). There are many qualifiers. We give a few
examples here:
(a) PROBABILITY: the probability [0,1] of the
associated value.
(b) VARIANCE: A number that describes the
variance of the value of an attribute.
(c) SUPPORT: A float that represents a weight
(case replication factor) to be associated with the
value.
(d) PROBABILITY_VARIANCE: The variance
associated with the probability estimator used for
PROBABILITY.
(e) ORDER: Specifies the order of a column. (See
ORDERED below.)
TABLE: A nested table in the case is a special column with the data type TABLE. For any given case row, the value of a TABLE type column contains the entire contents of the associated nested table; the value is itself a table containing all of the columns of the nested table. In the CREATE MINING MODEL command syntax, nested tables are described by a set of columns, all of which are contained within the definition of a named TABLE type column.
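As a sketch of how qualifiers are declared with the OF clause, consider adding degree-of-certainty columns to the running example (the qualifier columns here are assumptions introduced for illustration):

CREATE MINING MODEL [Age Prediction With Qualifiers] (
  [Customer ID]   LONG   KEY,
  [Age]           DOUBLE DISCRETIZED PREDICT,
  [Age Prob]      DOUBLE PROBABILITY OF [Age],    %qualifier on Age
  [Car Ownership] TABLE (
    [Car]         TEXT   KEY,
    [Car Prob]    DOUBLE PROBABILITY OF [Car] ))  %qualifier inside a nested table
USING [Decision_Trees_101]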
3.2.2 Content Types of Columns – Attribute Types
For a column that is specified as an attribute, it is
possible to further indicate the type of attribute. The
following keywords specify the attribute type to the DM
provider.
• DISCRETE: The attribute values are discrete
(categorical) with no ordering implied by the values
(often called states). “Area Code” is a good example.
• ORDERED: Columns that define an ordered set of
values with a total ordering but no implied distance or magnitude semantics. A ranking of skill
level (say one through five) is an ordered set, but a
skill level of five isn't necessarily five times better
than a skill level of one.
• CYCLICAL: A set of values that have cyclical
ordering. Day of the week is a good example, since
day number one follows day number seven.
• CONTINUOUS: Attributes with numeric values (integer or float). Values are naturally ordered and have implicit distance and magnitude semantics. Salary is a typical example.
• DISCRETIZED: The data that will be inserted into
the model is continuous, but it should be transformed
into and modeled as a number of ORDERED states
by the provider. Some data mining algorithms cannot
accept CONTINUOUS attributes as input, or they
may not be able to predict CONTINUOUS values.
• SEQUENCE_TIME: A time measurement range. The
format is not restricted, e.g., a period number is
acceptable. This is typically used to associate a
sequence time with individual attribute values such as
purchase time.
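As a sketch, an attribute type appears directly after the data type in a column declaration; the columns below are illustrative and not part of the running example:

[Skill Level]    LONG   ORDERED,        %ranked, but no magnitude semantics
[Day Of Week]    LONG   CYCLICAL,       %day one follows day seven
[Salary]         DOUBLE CONTINUOUS,
[Age]            DOUBLE DISCRETIZED,    %continuous input, modeled as ordered states
[Purchase Time]  DOUBLE SEQUENCE_TIME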
3.2.3 Distribution Hints
An attribute, continuous or discrete, may have an
associated distribution. These distributions are used as
hints to the DMM and can specify prior knowledge about
the data. For example, a continuous attribute can be
normal (Gaussian), log normal, or uniform. A discrete attribute can be binomial, multinomial, Poisson, or a mixture. Other hints include: Not_Null, which indicates there should never be a null value, and model existence only, which means the information of interest is not in the value of an attribute but in the fact that a value is present. Hence Age could be modeled as binary, where the significant information is whether the age is known or not.
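As a sketch of how such hints adorn column declarations (the column names are illustrative; NORMAL, NOT NULL, and MODEL_EXISTENCE_ONLY follow the keywords of the specification [9]):

[Salary]     DOUBLE NORMAL CONTINUOUS,      %distribution hint: Gaussian
[Hair Color] TEXT   DISCRETE NOT NULL,      %hint: a null value should never occur
[Age]        DOUBLE MODEL_EXISTENCE_ONLY    %hint: model only whether age is present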
3.2.4 Prediction Columns
Output columns are columns in a dataset that the
DMM is capable of predicting. Because these columns
will be predicted over cases in which their values are
missing or unspecified, additional statistical information
may be associated with such predictions. We discuss this
additional information here; Section 3.3 presents the syntax for extracting such information in prediction queries.
Attribute or Table type columns can be input columns,
output columns, or both. The data mining provider builds
a DMM capable of predicting or explaining output
column values based on the values of the input columns.
Predictions may convey not only simple information such
as “estimated age is 21,” but they may also convey
additional statistical information associated with the
prediction such as confidence level and standard
deviation. Further, the prediction may actually be a
collection of predictions, such as “the set of products that
the customer is likely to buy.” Each of the predictions in
the collection may also include a set of statistics.
A prediction can be expressed as a histogram. A
histogram provides multiple possible prediction values,
each accompanied by a probability and other statistics.
When histogram information is required, each prediction
(which by itself can be part of a collection of predictions)
may have a collection of possible values that constitutes a
histogram.
Since there is a wealth of information we can associate
with predictions, it is often necessary to extract only a
portion of the predicted information. For example, a user
may want to see only the “best estimate,” “top 3
estimates,” or “estimates with probability greater than
55%.” Not every provider or DMM can support all of the possible requests. Therefore, it is necessary for the output column to define what information may be extracted from it.
OLE DB DM defines a set of standard transformation
functions on output columns. These functions are
discussed in detail in the OLE DB DM specification
document [9]. The basic mechanism we used to
accommodate the flexibility in output is to draw on the
familiar notion of user-defined functions (UDF) used in
OLAP [10,15]. Each provider ships a set of functions that
can be referenced in the prediction query. Some UDFs are scalar-valued, such as probability or support; others are table-valued, such as histogram, and hence return nested tables when invoked.
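As a sketch, such functions appear in the projection list of a prediction query; PredictProbability and PredictHistogram below are representative function names from the specification [9]:

SELECT t.[Customer ID],
       [Age Prediction].[Age],                      %best estimate
       PredictProbability([Age Prediction].[Age]),  %scalar-valued UDF
       PredictHistogram([Age Prediction].[Age])     %table-valued UDF: a nested table
FROM [Age Prediction] PREDICTION JOIN ...           %as illustrated in Section 3.3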
3.2.5 Detailed Example of DMM Column Specification
Now that we have described what the column models
are, it may be useful to provide an example. Column
descriptions allow the provider to better model the
training data it is given. We now return to our running
example and indicate how content types of some of the
columns in the example can now be classified as shown
below (content type after the colon):
(a) Gender: discrete attribute
(b) Age: continuous attribute
(c) Age Probability: probability modifier of age
(d) Customer Loyalty: ordered attribute
(e) Product Name: Key (for Product Purchases
table)
(f) Product Type: Relation of Product Name
The example above includes a few additional columns
to illustrate a few more content types.
3.3. Operations on Data Mining Models
Populating a Mining Model: Insert Into
Once a mining model is defined, the next step is to
populate it by consuming a caseset that
satisfies the specification in the Create Mining Model
statement. In OLE DB DM, we use INSERT to model
instantiation of the mining model. Unlike insertion into a traditional table, insertion here does not add the tuple to a rowset. Rather, the insertion corresponds to
consuming the observation represented by a case using
the data mining model. The following example illustrates
the syntax. Note the use of the SHAPE statement to create
the nested table from the input data:
INSERT INTO [Age Prediction]
  ([Customer ID], [Gender], [Age],
   [Product Purchases]([Product Name], [Quantity], [Product Type]))
SHAPE
  {SELECT [Customer ID], [Gender], [Age]
   FROM Customers ORDER BY [Customer ID]}
APPEND
  ({SELECT [CustID], [Product Name], [Quantity], [Product Type]
    FROM Sales ORDER BY [CustID]}
   RELATE [Customer ID] TO [CustID])
  AS [Product Purchases]
Using a Data Mining Model to Predict: Prediction Join
We now explain how predictions can be obtained from a populated model in OLE DB DM. The basic operation of obtaining predictions on a dataset D using a DMM M is modeled as a “prediction join” between D and M. Of course, the caseset needs to match the schema of the DMM. Since the model does not actually contain the data details, the semantics of the prediction join differ from those of an equi-join on tables. The following example illustrates how a prediction join may be used. In this example, the value for “Age” is not known and is being predicted using the prediction join between a data mining model and the given (test) data set. Joining the cases from the test data set with the DMM and selecting the “Age” column from the DMM returns a predicted “Age” for each of the test cases.
We need to use “prediction join” since the process of
combining actual cases with all possible model cases is
not as simple as the semantics of a normal SQL equi-join.
For the instance when the schema of the actual case table
matches the schema of the model, NATURAL
PREDICTION JOIN can be used, obviating the need for
the ON clause of the join. Columns from the source query
will be matched to columns from the DMM based on the
names of the columns.
SELECT t.[Customer ID], [Age Prediction].[Age]
FROM [Age Prediction]
PREDICTION JOIN
  (SHAPE
     {SELECT [Customer ID], [Gender]
      FROM Customers ORDER BY [Customer ID]}
   APPEND
     ({SELECT [CustID], [Product Name], [Quantity]
       FROM Sales ORDER BY [CustID]}
      RELATE [Customer ID] TO [CustID])
     AS [Product Purchases]) AS t
ON [Age Prediction].[Gender] = t.[Gender] AND
   [Age Prediction].[Product Purchases].[Product Name]
     = t.[Product Purchases].[Product Name] AND
   [Age Prediction].[Product Purchases].[Quantity]
     = t.[Product Purchases].[Quantity]
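If the source query is written so that its column names match those of the DMM exactly, the same prediction can be expressed with the natural form described above, dropping the ON clause (a sketch under that naming assumption):

SELECT t.[Customer ID], [Age Prediction].[Age]
FROM [Age Prediction]
NATURAL PREDICTION JOIN
  (SHAPE
     {SELECT [Customer ID], [Gender]
      FROM Customers ORDER BY [Customer ID]}
   APPEND
     ({SELECT [CustID], [Product Name], [Quantity]
       FROM Sales ORDER BY [CustID]}
      RELATE [Customer ID] TO [CustID])
     AS [Product Purchases]) AS t
%No ON clause: columns from the source query are
%matched to model columns by name.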
The DMM content can be thought of as a “truth table”
containing a row for every possible combination of the
distinct values for each column in the DMM. In other
words, it contains every possible case along with every
possible associated set of output columns. With this view
in mind, a DMM can be used to look up predicted values
and statistics by using the attribute values of a case as a
key for the join. It is important to point out that this
analogy is not quite accurate as the set of all possible
cases (combinations of attribute values) is huge, and
possibly infinite when one considers continuous
attributes. However, this logical view allows us to map
prediction into a familiar basic operation in the relational
world.
When a model is joined with a table, predicted values
are obtained for each case that “matches” the model (i.e.
cases for which the model is capable of making a
prediction). A SELECT statement associated with the
PREDICTION JOIN specifies which predicted values
should be extracted and returned in the result set of the
query. User-defined functions (UDF) in the projection
columns are used to further enhance the set of predicted
values with additional statistical information. We have
discussed this aspect in Section 3.2.4.
Browsing Model Content: Select
In addition to listing the column structure of a DMM, a
different type of browsing is to navigate the content of the
model viewed as a graph. Using a set of input cases, the
content of a DMM is learned by the data mining
algorithm. The content of a DMM is the set of rules,
formulas, classifications, distributions, nodes, or any other
information that was derived from a specific set of data
using a data mining technique. Content type varies
according to the specific data mining technique used to
create the DMM. The DMM content of a decision tree–
based classification will differ from a segmentation
model, which, in turn, differs from a multi-regression
DMM.
The most popular way to express DMM content is by
viewing it as a directed graph (that is, a set of nodes with
connecting edges). Note that decision trees fit such a view
nicely. Each node in the tree may have relationships to
other nodes. A node may have one or more parent nodes
and zero or more child nodes. The depth of the graph may
vary depending on the specific node.
The tree navigation mechanism we adopted is similar to the mechanism already defined in the OLE DB for OLAP specification [10]. On querying (SELECT * FROM <mining model>.CONTENT), the content of a mining model is exposed through the MINING_MODEL_CONTENT schema rowset. Unfortunately, a detailed discussion of the MINING_MODEL_CONTENT schema rowset is beyond the scope of this paper (see Appendix A of [9]).
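As a sketch, node-level information can be projected from the content rowset; NODE_CAPTION, NODE_TYPE, and NODE_SUPPORT below are representative columns of that rowset (see Appendix A of [9]):

SELECT NODE_CAPTION, NODE_TYPE, NODE_SUPPORT
FROM [Age Prediction].CONTENT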
4. Related Work
The focus of OLE DB DM is not about new KDD
algorithms. Therefore, we do not discuss here the wide
body of literature on KDD algorithms and scalable
techniques for such algorithms.
Compared to the wide body of work on KDD algorithms, as well as the more recent work on scalable algorithms, there has been relatively little work on the problem of integrating data mining with relational databases.
The work by Sarawagi et al. [13] addresses the problem of how to implement the derivation of association rules efficiently using SQL. Although significant for its performance implications, this direction of work is
orthogonal to the issue of how to enable, deploy, and
expose data mining models and algorithms as first class
objects in the database application API. A similar effort
relating to generalized association rules and sequences is
also focused on implementing particular algorithms in
SQL [14].
Research projects such as Quest (http://www.almaden.ibm.com/cs/quest) [1] and DBMiner [6] (http://db.cs.sfu.ca), as well as commercial mining tools, provide application interfaces and UIs to support mining and allow access to data in the warehouse. However, such tools provide neither the ability to deal with arbitrary mining models nor integration of the mining application interfaces with SQL and relational data access APIs (ODBC, OLE DB).
The paper by Meo et al. [8] is an example of
specialized extensions to SQL to provide support for
association rules. Such an approach is narrowly focused
and does not provide an extensible framework for
supporting models derived from other mining techniques,
e.g., clustering and classification. Moreover, no support is
provided for management of mining models and for using
such models subsequently in other contexts.
More recently, CRISP-DM [4] (http://www.crisp-dm.org/) has been proposed as a standard for the data mining process model. This initiative is complementary to OLE
DB DM. CRISP-DM provides a set of methodologies,
best practices, and attempts to define the various activities
involved in the KDD process. It does not address issues of
integration with the database system.
A related effort, called Predictive Model Markup
Language (PMML), provides an open standard for how
models should be persisted in XML. The basic idea of
PMML is to enable portability and model sharing [5].
PMML does not address the database integration issue but specifies a persistence format for models. We are currently working with the PMML group to use the PMML format as an open persistence format. In fact, in the model browsing methods briefly discussed in Section 3.3, we use PMML-inspired XML strings in exposing the content of data mining model nodes.
As mentioned earlier, OLE DB DM is an extension to
OLE DB, a specification of data access interfaces. OLE
DB generalized the widely used ODBC connectivity
standard by enabling access to record-oriented data stores
that are not necessarily SQL data sources.
5. Concluding Remarks
We presented the highlights of OLE DB DM, a new
API for integrating data mining into SQL databases. A
reference OLE DB DM provider is scheduled to ship with
the next release of a commercial database system:
Microsoft SQL Server 2000.
The target user base of OLE DB DM is the community
of enterprise developers who are familiar with SQL and
database connectivity APIs such as ODBC and OLE DB.
For this reason, a key design goal was to keep the set of
extensions as close to SQL’s look and feel as possible.
OLE DB DM introduces data mining objects and entities
by drawing straightforward analogies to familiar
relational objects. We have only introduced two new
concepts: Data Mining Models (DMM) and Cases.
We hope that OLE DB DM will help steer our
community to establish a standard in the field. The
advantages of such a standard include integration with the
database so that data mining no longer needs to involve
taking data outside the database, addressing issues of
model deployment once models are created, and
providing an infrastructure for managing data and models
while leveraging the existing SQL infrastructure.
Establishing standards also reduces the costs, increases
the ability to interoperate and partner, and reduces the
risks for both providers and consumers. Consumers are
protected since they are no longer dependent on a
proprietary interface so other vendors can step in to
provide replacements and enhancements. Providers are
protected from the need to build infrastructure from
scratch as partners can step in to build different
complementary components. To move towards a standard,
an open forum, such as a standards committee elected by the
community will be needed. We hope that this proposal
will encourage further discussions along this direction.
Acknowledgement: We are grateful to Venkatesh
Ganti for his thoughtful comments on the draft.
6. References
[1] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant: The Quest data mining system. Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, August 1996.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A. I.
Verkamo: Fast discovery of association rules. In Advances
in knowledge discovery and data mining, pp. 307-328,
AAAI Press, Menlo Park, CA, 1996.
[3] S. Chaudhuri and U. Dayal: An overview of data
warehousing and OLAP technology. SIGMOD Record
26(1), pp. 65-74, 1997.
[4] P. Chapman, R. Kerber, J. Clinton, T. Khabaza, T.
Reinartz, R. Wirth: The CRISP-DM process model,
Technical Report. http://www.crisp-dm.org/ 1999.
[5] DMG Organization: PMML 1.0 - Predictive Model Markup Language, http://www.dmg.org/public/techreports/pmml-1_0.html, 1999.
[6] J. Han, Y. Fu, W. Wang, J. Chiang, W. Gong, K. Koperski,
D. Li, Y. Lu, A. Rajan, N. Stefanovic, B. Xia, O. R.
Zaiane: DBMiner: A system for mining Knowledge in
Large Relational Databases. Proc. of int'l conf. on data
mining and knowledge discovery (KDD'96), Portland,
Oregon, pp. 250-255, August 1996.
[7] P. Hipson: OLE DB and ADO Developer's Guide, McGraw-Hill, 1999.
[8] R. Meo, G. Psaila, S. Ceri: An extension to SQL for mining association rules. Data Mining and Knowledge Discovery, Vol. 2, No. 2, pp. 195-224, 1998.
[9] Introduction to OLE DB for Data Mining, Microsoft
Corporation, http://www.microsoft.com/data/oledb
[10] OLE DB for OLAP, Microsoft Corporation, http://www.microsoft.com/data/oledb
[11] R. Ramakrishnan and J. Gehrke. Principles of Database
Management (2nd Edition), 1999.
[12] J. Shanmugasundaram, U. M. Fayyad and P. S. Bradley: Compressed data cubes for OLAP aggregate query approximation on continuous dimensions. In Proc. 5th Int'l Conf. on Knowledge Discovery and Data Mining, pp. 223-232, 1999.
[13] Sunita Sarawagi, Shiby Thomas, Rakesh Agrawal:
Integrating mining with relational database systems:
alternatives and implications. Proc. of SIGMOD
Conference, pp. 343-354, 1998.
[14] S. Thomas, S. Sarawagi: Mining generalized association rules and sequential patterns using SQL queries. Proc. of the 4th Int'l Conference on Knowledge Discovery in Databases and Data Mining, New York, August 1998.
[15] E. Thomsen, G. Spofford, D. Chase: Microsoft OLAP Solutions, John Wiley & Sons, 1999.