Integrating Data Mining with SQL Databases: OLE DB for Data Mining
Amir Netz, Surajit Chaudhuri, Usama Fayyad*, Jeff Bernhardt
Microsoft Corporation
* Author's current affiliation: digiMine, Inc.
Abstract
The integration of data mining with traditional database
systems is key to making it convenient, easy to deploy in
real applications, and to growing its user base. In this
paper we describe the new API for data mining proposed
by Microsoft as extensions to OLE DB standard. We
illustrate the basic notions that motivated the API's design
and describe the key components of an OLE DB for Data
Mining provider. We also include examples of its usage
and treat the problems of data representation and
integration with the SQL framework. We believe this new
API will go a long way in enabling deployment of data
mining in enterprise data-warehouses. A reference
implementation of a provider is available with the recent
release of Microsoft SQL Server 2000 database system.
1. Introduction
Progress in database technology has provided the
foundation that made massive data warehouses of
business data ubiquitous. Increasingly, such data
warehouses are built using relational database technology.
Recently, there has been a tremendous commercial
interest in mining the information in these data
warehouses. However, building mining applications on
data over relational databases is nontrivial and requires
significant work on the part of application builders.
Therefore, what is needed is strong system support to
ease the burden of developing mining applications against
relational databases. In this paper, we introduce the
“OLE DB for Data Mining” (henceforth abbreviated as
OLE DB DM in this paper) Application Programming
Interface (API). As we will demonstrate in the rest of this
paper, if a data source supports OLE DB DM (i.e., if
the data source is an OLE DB DM “provider”), then the
task of building mining applications is simplified because
of the native support for DM primitives in OLE DB DM.
The case for supporting such an API is best understood
by reviewing the current state of the art. Traditionally,
data mining tasks are performed assuming that the data is
in memory. In the last decade, there has been an increased
focus on scalability but research work in the community
has focused on scaling analysis over data that is stored in
the file system. This tradition of performing data mining
outside the DBMS has led to a situation where starting a
data mining project requires its own separate
environment. Data is dumped or sampled out of the
database, and then a series of Perl, Awk, and special
purpose programs are used for data preparation. This
typically results in the familiar large trail of droppings in
the file system, creating an entire new data management
problem outside the database. In particular, such “export”
creates nightmares of data consistency with data in
warehouses. This is ironic because database systems were
created to solve exactly such data management problems.
Once the data is adequately prepared, the mining
algorithms are used to generate data mining models.
However, past work in data mining has barely addressed
the problem of how to deploy models, e.g., once a model
is generated, how to store, maintain, and refresh it as data
in the warehouse is updated, how to programmatically use
the model to do predictions on other data sets, and how to
browse models. Over the life cycle of an enterprise
application, such deployment and management of models
remains one of the most important tasks. The objective of
proposing OLE DB DM is to alleviate problems in
deployment of models and to ease the data preparation by
working directly on relational data.
Another motivation to introduce OLE DB DM is to
enable enterprise application developers to participate in
building data mining solutions. Easing the ability to
develop integrated solutions is critical for the growth of
data mining technology in the enterprise space. In
enterprises, data mining needs to be perceived as a value-added component along with traditional decision support
techniques, e.g., traditional SQL as well as OLAP
querying environments [3]. In order to ensure success of
data mining in this regard, it is important to recognize that
a typical enterprise application developer is hardly an
expert in statistics and pattern recognition. Rather, he or
she is comfortable using traditional database APIs/tools
based on SQL, OLE DB, and other well-known standards
and protocols. Therefore, it is extremely important that
the infrastructure for supporting data mining solutions is aligned with the traditional database development environment and with APIs for database access.
Consequently, we decided to exploit the well-known OLE
DB API to ensure that data mining models and operations
gain the status of first-class objects in the mainstream
database development environment.
Our proposal builds upon OLE DB [7], an application programming interface that has gained popular use for universal data access, not only from relational systems but from any data source that can be viewed as a “set of tables” (rowsets). It is important to point out
that OLE DB for DM is independent of any particular
provider or software and is meant to establish a uniform
API. Another feature of our proposal is that it is not
specialized to any specific mining model but is structured
to cater to all well-known mining models. Any party
interested in using this interface is encouraged to do so by
building its own provider, as with OLE DB (and ODBC).
However, we should also add that one of the strengths of
our proposal is that we have implemented this API as part
of the recent release of Microsoft SQL Server 2000
(Analysis Server component). The proposal also
incorporates feedback from a multitude of vendors in the
data mining space who participated in preview meetings
and calls for reviews over the past 18 months.
Our proposal is not focused on any particular KDD
algorithms per se since our intent is not to propose new
algorithms, but to suggest a system infrastructure that
makes it possible to “plug in” any algorithm and enable it
to participate in the entire life-cycle of data mining. This
paper is our attempt to summarize the basic philosophy
and design decisions leading to the current specification
of OLE DB DM. A complete specification may be
obtained from [9].
The rest of the paper is organized as follows. We first
introduce the overall architecture and design philosophy
of OLE DB for Data Mining (OLE DB DM) in Section 2.
In Section 3, we discuss, with illustrative examples of
their usage, the primitives and basic building blocks in the
API. Section 4 presents related work and we conclude in
Section 5.
2. Overview and Design Philosophy
OLE DB DM provides a set of extensions to OLE DB
to make it “mining-enabled.” OLE DB is an object-oriented specification for a set of data access interfaces
designed for record-oriented data stores. In particular, by
virtue of supporting command strings, OLE DB can
provide full support to SQL. In this paper, we will
primarily introduce extensions that leverage this aspect of
OLE DB API to make sure that support for mining
concepts is built in.
Figure 1 shows how the Microsoft SQL Server database
management system exposes OLE DB DM. In this
specific implementation, the “core” relational engine
exposes the traditional OLE DB interface. However, the
analysis server component exposes an OLE DB DM
interface (along with OLE DB for OLAP) so that data
mining solutions may be built on top of the analysis
server. Of course, this is only one example of how OLE
DB DM may be exposed. For example, in the future, data
mining support may be deeply integrated in the “core”
relational database system.
Our approach in defining OLE DB DM is to maintain
the SQL metaphor and to reuse many of the existing
notions whenever possible. Due to the widespread use of
SQL in the developer community, we decided to expose
data mining interfaces in a language-based API. This
decision reflects a central design principle: building data
mining solutions (applications) should be easy. Thus, our
design philosophy requires relational database
management systems to invest resources to provide the
“drivers” to support OLE DB DM so that data mining
solutions may be developed with ease by data mining
application developers. Since there are likely to be many
more application developers than DBMS vendors, such a
design philosophy inflicts the least pain.
The key challenge is to be able to support the basic
operations of data mining without introducing much
change in the programming model and environment that a
database developer is used to. Fundamentally, this
requires supporting operations on data mining models.
These key operations to support on a data mining model
are the ability to:
1. Define a mining model, i.e., identify the set of attributes of data to be predicted, the set of attributes to be used for prediction, and the algorithm used to build the mining model.
2. Populate a mining model from training data using the algorithm specified in (1).
3. Predict attributes for new data using a mining model that has been populated.
4. Browse a mining model for reporting and visualization applications.
In developing OLE DB DM, we have decided to
represent a data mining model (henceforth model or
DMM) as analogous to a table in SQL. (Although viewed conceptually and programmatically as tables, models have important differences from tables: a data mining model is populated by consuming a rowset, but its internal structure can be more abstract, e.g., a decision tree is a tree-like structure, much more compact than the training data set used to create it.) Thus, it can be defined via the CREATE statement (used to create a table
in SQL). It can be populated, possibly repeatedly via the
INSERT INTO statement (used to add rows in a SQL
table), and it can be emptied (reset) via the DELETE
statement. A mining model may be used to make
predictions on datasets. This is modeled through
prediction join between a mining model and a data set,
just as the traditional join operation is used to pull
together information from multiple tables. Finally, the
content of a mining model can be browsed via SELECT
(used to identify a subset of data in SQL). Since the
mining model is a native named object at the server, the
mining model can participate in interaction with other
objects using the primitives listed above (see also Section
3). Note that this achieves the desirable goal of avoiding
excessive data movement, extraction, and copying, thus resulting in better performance and manageability
compared to the traditional “mining outside the database
server” framework. Finally, once the DMM is created and
optimized, deployment within the enterprise becomes as
easy as writing SQL queries. This significantly minimizes
the cost of deployment by eliminating the need for
developing a proprietary infrastructure and interface to
deploy the data mining solution. Model deployment and
maintenance costs have received little attention in the
literature, though they are an expensive and difficult task
in real applications. In the next section, we review the
details of the language extensions that provide an
extensible framework for incorporating support for data
mining.
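Before turning to those details, the life cycle just outlined can be sketched schematically using the statement forms described above. This is only an outline; the model name and algorithm are illustrative, and the full syntax is developed in Section 3:

CREATE MINING MODEL [Age Prediction] (...)   %define columns and algorithm
INSERT INTO [Age Prediction] ...             %train (populate) from a caseset
SELECT ... FROM [Age Prediction]
  PREDICTION JOIN ...                        %predict over new data
SELECT * FROM [Age Prediction].CONTENT       %browse the learned content
DELETE FROM [Age Prediction]                 %empty (reset) the model
DROP MINING MODEL [Age Prediction]           %remove the model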
3. Basic Components of OLE DB DM
OLE DB DM has only two notions beyond the traditional
OLE DB definition: cases and models.
• Input data is in the form of a set of cases (caseset). A
case captures the traditional view of an “observation”
by machine learning algorithms as consisting of all
information known about a basic entity being
analyzed for mining. Of course, structurally, a set of
cases is no different from a table (though cases are often more naturally represented as hierarchical data, i.e., as nested tables). Cases are
explained further in Section 3.1.
• A data mining model (DMM) is treated as if it were a special type of “table”: (a) We associate a caseset
with a DMM and additional meta-information while
creating (defining) a DMM. (b) When data (in the
form of cases) is inserted into the data mining model,
a mining algorithm processes it and the resulting
abstraction (or DMM) is saved instead of the data
itself. Once a DMM is populated, it can be used for
prediction, or its content can be browsed for reporting and visualization purposes. Fundamental operations
on models include CREATE, INSERT INTO,
PREDICTION JOIN, SELECT, DELETE FROM,
and DROP.
It should be noted that OLE DB DM has all the
traditional OLE DB interfaces. In particular, schema
rowsets specify the capabilities of an OLE DB DM
provider. This is the standard mechanism in OLE DB
whereby a provider describes information about itself to
potential consumers. This information describes the
supported capabilities (e.g. prediction, segmentation,
sequence analysis, etc.), types of data distributions
supported, limitations of the provider (size or time
limitations), support for incremental model maintenance,
etc. Other schema rowsets provide metadata on the
columns of a mining model, on its contents, and the
supported services. The details of these schema rowsets
are beyond the scope of this paper.
The rest of this section is organized as follows. In
Section 3.1, we describe how input data is modeled as a
caseset. In Section 3.2, we describe how data mining
models may be defined (created) in a relational
framework and the kind of meta-information on casesets
that mining needs to be aware of. Section 3.3 describes
operations on mining models, notably insertion
(populating the model), prediction, and browsing.
3.1. Data as Cases
To ensure that a data mining model is accurate and
meaningful, data mining algorithms require that all
information related to the entity to be mined be
consolidated before invoking the algorithms.
Traditionally, mining algorithms also view the input as a
single table that represents such consolidated information
where each row represents an instance of an entity. Data
mining algorithms are designed so that they consume an
entity instance at a time. Having all the information
related to a single entity instance together in a rowset has two important benefits. First, traditional data mining algorithms can be leveraged with relative ease. Second, it increases scalability, as it eliminates the need for data mining algorithms to do considerable bookkeeping.
In a relational database, data is often normalized and
hence the information related to an entity is scattered across
multiple tables. Therefore, in order to use data mining, a
key step is to be able to pull the information related to an
entity into a single rowset using views. Furthermore, as is
well recognized, data is often hierarchical in nature and so
ideally we require the view definition language to be
powerful enough to be able to produce a hierarchical
rowset. This collection of data pertaining to a single entity
is called a case, and the set of all relevant cases is referred
to as a caseset. As mentioned above, a caseset may need to be hierarchical. In the rest of this section, we elaborate on these two issues through examples and further discussion.

Table 1: A nested table representation of the 3-way join result

  Customer ID: 1   Gender: Male   Hair Color: Black   Age: 35 (Age Prob.: 100%)

  Product Purchases:                     Car Ownership:
    Name   Quantity   Type                 Car     Prob.
    TV     1          Electronic           Truck   100%
    VCR    1          Electronic           Van     50%
    Ham    2          Food
    Beer   6          Beverage
Consider a schema consisting of 3 tables representing
customers. The Customers table has some customer
properties. The Product Purchases table stores for each
customer what products they bought. The Car Ownership
table lists the cars owned by each customer. (The Age Prob. and Car Ownership Prob. columns in Table 1 are unusual in that they represent degrees of certainty.) Consider a customer, say Customer ID 1, who happens to be male, has black hair, and is believed to be 35 years old with 100% certainty. Assume this customer has bought a TV, a VCR, Beer (quantity 6), and Ham (quantity 2). Further assume we know this customer owns a truck (100% certainty) and we believe he also has a van (50% certainty). Suppose one
were to issue a join query over the 3 tables to get all the
information we know about the set of customers. Readers
familiar with SQL will quickly realize that this join query
will return a table of 12 rows. Furthermore, these 12 rows
contain considerable replication. More importantly, since the
information about an entity instance is scattered among
multiple rows, the quality of output from data mining
algorithms is negatively impacted by such flattened
representation.
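For concreteness, the flattened result could be produced by a query along the following lines (a sketch only; the table and column names are assumptions made for illustration):

SELECT c.[Customer ID], c.[Gender], c.[Hair Color], c.[Age],
       p.[Product Name], p.[Quantity], p.[Product Type],
       o.[Car], o.[Prob]
FROM Customers c, [Product Purchases] p, [Car Ownership] o
WHERE c.[Customer ID] = p.[Customer ID]
  AND c.[Customer ID] = o.[Customer ID]

Every purchase row of a customer is paired with every car row of that customer, so single-valued properties such as gender, hair color, and age are replicated across the cross product.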
Table 1 shows how a nested table can succinctly
describe the above join result. As illustrated in Table 1, a
customer case can include not only simple columns but
also multiple tables (nested tables as columns). Each of
these tables inside the case can have a variable number of
rows and a different number of columns. Some of the
columns in the example have a direct one-to-one
relationship with the case (such as “Gender” and “Age”),
while others have a one-to-many relationship with the
case and therefore exist in nested tables. The reader can
easily identify the two tables (nested) contained in the
sample case in Table 1.
• Product Purchases table containing the columns
Product Name, Product Quantity, and Product
Type.
• Car Ownership table containing the columns Cars
and Car Probability.
The nested table is a familiar notion in the relational database world, though it has not been adopted universally in commercial systems. In OLE DB DM, we have chosen to
use the Data Shaping Service (part of the Microsoft Data Access Components, MDAC) to generate a hierarchical rowset. It should be emphasized that this is a logical
definition and does not have any implications for data
storage. In other words, it is not necessary for the storage
subsystem to support nested records. Cases are only
instantiated as nested rowsets prior to training/predicting
data mining models. Note that the same physical data may
be used to generate different casesets for different
analysis purposes. For example, if one chooses to mine
models over all products, each product then becomes a
single case and customers would need to be represented
as columns of the case.
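For example, the customer caseset of Table 1 might be assembled with a shape command along the following lines (a sketch; the source tables Customers, Sales, and Cars and their columns are assumptions):

SHAPE
  {SELECT [Customer ID], [Gender], [Hair Color], [Age]
   FROM Customers ORDER BY [Customer ID]}
APPEND
  ({SELECT [CustID], [Product Name], [Quantity], [Product Type]
    FROM Sales ORDER BY [CustID]}
   RELATE [Customer ID] TO [CustID]) AS [Product Purchases],
  ({SELECT [CustID], [Car], [Prob]
    FROM Cars ORDER BY [CustID]}
   RELATE [Customer ID] TO [CustID]) AS [Car Ownership]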
Once a hierarchical caseset has been prepared using
the traditional view definition mechanism and the Data
Shaping Service in OLE DB, we can use this caseset to
populate a data mining model or we can predict values of
unknown columns using a pre-existing model. In order to
do that, we need a way to represent data mining models.
We illustrate this in the next section.
3.2. Creating/Defining Data Mining Models
In OLE DB DM, a data mining model (DMM) is
treated as a first class object of the provider, much like a
table. In this section, our focus is on definition (creation)
of data mining models. In particular, the DMM definition
includes a description of the columns of data over which
the model will be created, including meta-information on
relationships between columns. It is assumed that a data
set is represented as a nested table as discussed in Section
3.1. In Section 3.3, we will describe how other operations
on data mining models may be supported in OLE DB
DM.
As discussed earlier, a data mining model is trained
over a single caseset and is used to predict columns of a
caseset. Therefore, defining (creating) a data mining
model requires us to specify:
• The name of the model.
• The algorithm (and parameters) used to build the model.
• The algorithm for prediction using the model.
• The columns the caseset must have and the relationships among these columns.
• The columns to be used as “source” columns and the columns to be filled in by the mining model (“prediction columns”).
The following example illustrates the syntax:
CREATE MINING MODEL [Age Prediction] (           %Name of model
  [Customer ID]       LONG   KEY,
  [Gender]            TEXT   DISCRETE,
  [Age]               DOUBLE DISCRETIZED PREDICT,   %prediction column
  [Product Purchases] TABLE (
    [Product Name]    TEXT   KEY,
    [Quantity]        DOUBLE NORMAL CONTINUOUS,
    [Product Type]    TEXT   DISCRETE RELATED TO [Product Name]
  )
)
USING [Decision_Trees_101]                       %Mining algorithm used
The annotations with the example identify the source
columns used for prediction, the column to be predicted
(age) and the mining algorithm used. Specifically, the
DMM is named [Age Prediction] (the square brackets [ ]
are name delimiters by convention). The PREDICT
keyword indicates that the Age column is to be predicted.
The [Product Purchases] is a nested table, with [Product
Name] being its key. The USING clause specifies the
mining algorithm to use. Parameters to the algorithm
could be included as part of this clause. When we contrast
this example with the CREATE TABLE syntax in SQL, a
key difference is the substantially richer meta-information that adorns each column reference above.
The model columns describe all of the information
about a specific case. For example, assume that each case
in the DMM represents a customer. The columns of the
DMM will include all known and desired information
about the customer (e.g. Table 1). Several kinds of
information may be specified for columns:
• Information about the role of a column (Section
3.2.1): Is it a key? Does it contain an attribute?
How does it relate to other columns or nested
tables? Is it a qualifier on other columns?
• For columns that represent attributes, what type of
attribute is it? Attribute types are described in
Section 3.2.2.
• Additional information about data distributions
that can be provided as hints to the DMM (Section
3.2.3).
The above details are discussed in the rest of this
subsection and we conclude this section with a
comprehensive example. Finally, note that despite the
similarities with CREATE TABLE with respect to
programmability, mining models are conceptually
different as they represent only summarized information
for a data set. Also, unlike traditional data tables, they are
capable of prediction (see Section 3.3).
3.2.1 Content Types of Columns – Column Specifiers
Each column in a case can represent one of the
following content types:
KEY: The columns that identify a row. For example, “Customer ID” uniquely identifies customer cases, and “Product Name” uniquely identifies a row in the “Product Purchases” table.
ATTRIBUTE: A direct attribute of the case. This type
of column represents some value for the case. Example
attributes are: the age, gender, or hair-color of the
customer or the quantity of a specific product the
customer purchased.
RELATION: Information used to classify attributes,
other relations, or key columns. For example, “Product
Type” classifies “Product Name.” A given relation value
must always be consistent for all of the instance values of
the other columns it describes—for example, the product
“Ham” must always be shown as “Food” for all cases. In
the CREATE MINING MODEL command syntax,
relations are identified in the column definition by using a
RELATED TO clause to indicate the column being
classified.
QUALIFIER: A special value associated with an
attribute that has a predefined meaning for the provider.
Take for example the probability that the attribute value is
correct. These qualifiers are all optional and apply only if
the data has uncertainties attached to it or if the output of
previous predictions is being chained as input to a
subsequent DMM training step. In the CREATE MINING
MODEL command syntax, modifiers are identified by
using an OF clause to indicate the attribute column they
modify (an example declaration appears after this list). There are many qualifiers. We give a few
examples here:
(a) PROBABILITY: the probability [0,1] of the
associated value.
(b) VARIANCE: A number that describes the
variance of the value of an attribute.
(c) SUPPORT: A float that represents a weight
(case replication factor) to be associated with the
value.
(d) PROBABILITY_VARIANCE: The variance
associated with the probability estimator used for
PROBABILITY.
(e) ORDER: Specifies the order of a column. (See
ORDERED below.)
TABLE: A nested table in the case is a special column with the data type TABLE. For any given case row, the value of a TABLE type column contains the entire contents of the associated nested table; the value is itself a table containing all of the columns of the nested table. In the CREATE MINING MODEL command syntax, nested tables are described by a set of columns, all of which are contained within the definition of a named TABLE type column.
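As a sketch of how qualifiers are declared with the OF clause, consider adding degree-of-certainty columns to the running example (the qualifier columns here are assumptions introduced for illustration):

CREATE MINING MODEL [Age Prediction With Qualifiers] (
  [Customer ID]   LONG   KEY,
  [Age]           DOUBLE DISCRETIZED PREDICT,
  [Age Prob]      DOUBLE PROBABILITY OF [Age],    %qualifier on Age
  [Car Ownership] TABLE (
    [Car]         TEXT   KEY,
    [Car Prob]    DOUBLE PROBABILITY OF [Car] ))  %qualifier inside a nested table
USING [Decision_Trees_101]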
3.2.2 Content Types of Columns – Attribute Types
For a column that is specified as an attribute, it is
possible to further indicate the type of attribute. The
following keywords specify the attribute type to the DM
provider.
• DISCRETE: The attribute values are discrete
(categorical) with no ordering implied by the values
(often called states). “Area Code” is a good example.
• ORDERED: Columns that define an ordered set of
values with a total ordering but no implied distance or magnitude semantics. A ranking of skill
level (say one through five) is an ordered set, but a
skill level of five isn't necessarily five times better
than a skill level of one.
• CYCLICAL: A set of values that have cyclical
ordering. Day of the week is a good example, since
day number one follows day number seven.
• CONTINUOUS: Attributes with numeric values (integer or float). Values are naturally ordered and have implicit distance and magnitude semantics. Salary is a typical example.
• DISCRETIZED: The data that will be inserted into
the model is continuous, but it should be transformed
into and modeled as a number of ORDERED states
by the provider. Some data mining algorithms cannot
accept CONTINUOUS attributes as input, or they
may not be able to predict CONTINUOUS values.
• SEQUENCE_TIME: A time measurement range. The
format is not restricted, e.g., a period number is
acceptable. This is typically used to associate a
sequence time with individual attribute values such as
purchase time.
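As a sketch, an attribute type appears directly after the data type in a column declaration; the columns below are illustrative and not part of the running example:

[Skill Level]    LONG   ORDERED,        %ranked, but no magnitude semantics
[Day Of Week]    LONG   CYCLICAL,       %day one follows day seven
[Salary]         DOUBLE CONTINUOUS,
[Age]            DOUBLE DISCRETIZED,    %continuous input, modeled as ordered states
[Purchase Time]  DOUBLE SEQUENCE_TIME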
3.2.3 Distribution Hints
An attribute, continuous or discrete, may have an
associated distribution. These distributions are used as
hints to the DMM and can specify prior knowledge about
the data. For example, a continuous attribute can be
normal (Gaussian), log normal, or uniform. A discrete attribute can be binomial, multinomial, Poisson, or a mixture. Other hints include: Not_Null, which indicates there should never be a null value, and model existence only, which means the information of interest is not in the value of an attribute but in the fact that a value is present. Hence Age could be modeled as binary, where the significant information is whether the age is known or not.
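As a sketch of how such hints adorn column declarations (the column names are illustrative; NORMAL, NOT NULL, and MODEL_EXISTENCE_ONLY follow the keywords of the specification [9]):

[Salary]     DOUBLE NORMAL CONTINUOUS,      %distribution hint: Gaussian
[Hair Color] TEXT   DISCRETE NOT NULL,      %hint: a null value should never occur
[Age]        DOUBLE MODEL_EXISTENCE_ONLY    %hint: model only whether age is present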
3.2.4 Prediction Columns
Output columns are columns in a dataset that the
DMM is capable of predicting. Because these columns
will be predicted over cases in which their values are
missing or unspecified, additional statistical information
may be associated with such predictions. We discuss this
additional information here; Section 3.3 presents the syntax for extracting such information in prediction queries.
Attribute or Table type columns can be input columns,
output columns, or both. The data mining provider builds
a DMM capable of predicting or explaining output
column values based on the values of the input columns.
Predictions may convey not only simple information such
as “estimated age is 21,” but they may also convey
additional statistical information associated with the
prediction such as confidence level and standard
deviation. Further, the prediction may actually be a
collection of predictions, such as “the set of products that
the customer is likely to buy.” Each of the predictions in
the collection may also include a set of statistics.
A prediction can be expressed as a histogram. A
histogram provides multiple possible prediction values,
each accompanied by a probability and other statistics.
When histogram information is required, each prediction
(which by itself can be part of a collection of predictions)
may have a collection of possible values that constitutes a
histogram.
Since there is a wealth of information we can associate
with predictions, it is often necessary to extract only a
portion of the predicted information. For example, a user
may want to see only the “best estimate,” “top 3
estimates,” or “estimates with probability greater than
55%.” Not every provider or DMM can support all of the possible requests. Therefore, it is necessary for the output column to define what information may be extracted from it.
OLE DB DM defines a set of standard transformation
functions on output columns. These functions are
discussed in detail in the OLE DB DM specification
document [9]. The basic mechanism we used to
accommodate the flexibility in output is to draw on the
familiar notion of user-defined functions (UDF) used in
OLAP [10,15]. Each provider ships a set of functions that
can be referenced in the prediction query. Some UDFs are scalar-valued, such as probability or support; others are table-valued, such as histogram, and hence return nested tables when invoked.
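As a sketch, such functions appear in the projection list of a prediction query; PredictProbability and PredictHistogram below are representative function names from the specification [9]:

SELECT t.[Customer ID],
       [Age Prediction].[Age],                      %best estimate
       PredictProbability([Age Prediction].[Age]),  %scalar-valued UDF
       PredictHistogram([Age Prediction].[Age])     %table-valued UDF: a nested table
FROM [Age Prediction] PREDICTION JOIN ...           %as illustrated in Section 3.3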
3.2.5 Detailed Example of DMM Column Specification
Now that we have described what the column models
are, it may be useful to provide an example. Column
descriptions allow the provider to better model the
training data it is given. We now return to our running
example and indicate how content types of some of the
columns in the example can now be classified as shown
below (content type after the colon):
(a) Gender: discrete attribute
(b) Age: continuous attribute
(c) Age Probability: probability modifier of age
(d) Customer Loyalty: ordered attribute
(e) Product Name: Key (for Product Purchases
table)
(f) Product Type: Relation of Product Name
The example above includes a few additional columns
to illustrate a few more content types.
3.3. Operations on Data Mining Models
Populating a Mining Model: Insert Into
Once a mining model is defined, the next step is to
populate it by consuming a caseset that
satisfies the specification in the Create Mining Model
statement. In OLE DB DM, we use INSERT to model
instantiation of the mining model. Unlike insertion into a traditional table, insertion here does not add the tuple to a rowset. Rather, the insertion corresponds to
consuming the observation represented by a case using
the data mining model. The following example illustrates
the syntax. Note the use of the SHAPE statement to create
the nested table from the input data:
INSERT INTO [Age Prediction]
  ([Customer ID], [Gender], [Age],
   [Product Purchases]([Product Name], [Quantity], [Product Type]))
SHAPE
  {SELECT [Customer ID], [Gender], [Age]
   FROM Customers ORDER BY [Customer ID]}
APPEND
  ({SELECT [CustID], [Product Name], [Quantity], [Product Type]
    FROM Sales ORDER BY [CustID]}
   RELATE [Customer ID] TO [CustID])
  AS [Product Purchases]
Using a Data Mining Model to Predict: Prediction Join
We now explain how predictions can be obtained from a populated model in OLE DB DM. The basic operation of obtaining predictions on a dataset D using a DMM M is modeled as a “prediction join” between D and M. Of course, the caseset needs to match the schema of the DMM. Since the model does not actually contain the data details, the semantics of the prediction join differ from those of an equi-join on tables. The following example illustrates how a prediction join may be used. In this example, the value for “Age” is not known and is being predicted using the prediction join between a data mining model and the given (test) data set. Joining the cases from the test data set with the DMM and selecting the “Age” column from the DMM returns a predicted “Age” for each of the test cases.
We need to use “prediction join” since the process of
combining actual cases with all possible model cases is
not as simple as the semantics of a normal SQL equi-join.
For the instance when the schema of the actual case table
matches the schema of the model, NATURAL
PREDICTION JOIN can be used, obviating the need for
the ON clause of the join. Columns from the source query
will be matched to columns from the DMM based on the
names of the columns.
SELECT t.[Customer ID], [Age Prediction].[Age]
FROM [Age Prediction]
PREDICTION JOIN
  (SHAPE
     {SELECT [Customer ID], [Gender]
      FROM Customers ORDER BY [Customer ID]}
   APPEND
     ({SELECT [CustID], [Product Name], [Quantity]
       FROM Sales ORDER BY [CustID]}
      RELATE [Customer ID] TO [CustID])
     AS [Product Purchases]) AS t
ON [Age Prediction].[Gender] = t.[Gender] AND
   [Age Prediction].[Product Purchases].[Product Name]
     = t.[Product Purchases].[Product Name] AND
   [Age Prediction].[Product Purchases].[Quantity]
     = t.[Product Purchases].[Quantity]
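If the source query is written so that its column names match those of the DMM exactly, the same prediction can be expressed with the natural form described above, dropping the ON clause (a sketch under that naming assumption):

SELECT t.[Customer ID], [Age Prediction].[Age]
FROM [Age Prediction]
NATURAL PREDICTION JOIN
  (SHAPE
     {SELECT [Customer ID], [Gender]
      FROM Customers ORDER BY [Customer ID]}
   APPEND
     ({SELECT [CustID], [Product Name], [Quantity]
       FROM Sales ORDER BY [CustID]}
      RELATE [Customer ID] TO [CustID])
     AS [Product Purchases]) AS t
%No ON clause: columns from the source query are
%matched to model columns by name.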
The DMM content can be thought of as a “truth table”
containing a row for every possible combination of the
distinct values for each column in the DMM. In other
words, it contains every possible case along with every
possible associated set of output columns. With this view
in mind, a DMM can be used to look up predicted values
and statistics by using the attribute values of a case as a
key for the join. It is important to point out that this
analogy is not quite accurate as the set of all possible
cases (combinations of attribute values) is huge, and
possibly infinite when one considers continuous
attributes. However, this logical view allows us to map
prediction into a familiar basic operation in the relational
world.
When a model is joined with a table, predicted values
are obtained for each case that “matches” the model (i.e.
cases for which the model is capable of making a
prediction). A SELECT statement associated with the
PREDICTION JOIN specifies which predicted values
should be extracted and returned in the result set of the
query. User-defined functions (UDF) in the projection
columns are used to further enhance the set of predicted
values with additional statistical information. We have
discussed this aspect in Section 3.2.4.
Browsing Model Content: Select
In addition to listing the column structure of a DMM, a
different type of browsing is to navigate the content of the
model viewed as a graph. Using a set of input cases, the
content of a DMM is learned by the data mining
algorithm. The content of a DMM is the set of rules,
formulas, classifications, distributions, nodes, or any other
information that was derived from a specific set of data
using a data mining technique. Content type varies
according to the specific data mining technique used to
create the DMM. The DMM content of a decision tree–
based classification will differ from a segmentation
model, which, in turn, differs from a multi-regression
DMM.
The most popular way to express DMM content is by
viewing it as a directed graph (that is, a set of nodes with
connecting edges). Note that decision trees fit such a view
nicely. Each node in the tree may have relationships to
other nodes. A node may have one or more parent nodes
and zero or more child nodes. The depth of the graph may
vary depending on the specific node.
The tree navigation mechanism we adopted is similar to the mechanism already defined in the OLE DB for OLAP specification [10]. On querying (SELECT * FROM <mining model>.CONTENT), the content of a mining model is exposed through the MINING_MODEL_CONTENT schema rowset. Unfortunately, a detailed discussion of the MINING_MODEL_CONTENT schema rowset is beyond the scope of this paper (see Appendix A of [9]).
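As a sketch, node-level information can be projected from the content rowset; NODE_CAPTION, NODE_TYPE, and NODE_SUPPORT below are representative columns of that rowset (see Appendix A of [9]):

SELECT NODE_CAPTION, NODE_TYPE, NODE_SUPPORT
FROM [Age Prediction].CONTENT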
4. Related Work
The focus of OLE DB DM is not about new KDD
algorithms. Therefore, we do not discuss here the wide
body of literature on KDD algorithms and scalable
techniques for such algorithms.
Compared to the wide body of work on KDD algorithms, as well as the more recent work on scalable algorithms, there has been relatively little work on the problem of integrating data mining with relational databases.
The work by Sarawagi et al. [13] addresses the problem of how to implement the derivation of association rules efficiently using SQL. Although significant for its performance implications, this direction of work is
orthogonal to the issue of how to enable, deploy, and
expose data mining models and algorithms as first class
objects in the database application API. A similar effort
relating to generalized association rules and sequences is
also focused on implementing particular algorithms in
SQL [14].
Research projects such as Quest (http://www.almaden.ibm.com/cs/quest) [1] and DBMiner [6] (http://db.cs.sfu.ca), as well as commercial mining tools, provide application interfaces and UIs to support mining and allow access to data in the warehouse. However, such tools provide neither the ability to deal with arbitrary mining models nor integration of the mining application interfaces with SQL and relational data access APIs (ODBC, OLE DB).
The paper by Meo et al. [8] is an example of
specialized extensions to SQL to provide support for
association rules. Such an approach is narrowly focused
and does not provide an extensible framework for
supporting models derived from other mining techniques,
e.g., clustering and classification. Moreover, no support is
provided for management of mining models and for using
such models subsequently in other contexts.
More recently, CRISP-DM [4] (http://www.crisp-dm.org/) has been proposed as a standard for the data mining process model. This initiative is complementary to OLE
DB DM. CRISP-DM provides a set of methodologies,
best practices, and attempts to define the various activities
involved in the KDD process. It does not address issues of
integration with the database system.
A related effort, called Predictive Model Markup
Language (PMML), provides an open standard for how
models should be persisted in XML. The basic idea of
PMML is to enable portability and model sharing [5].
PMML does not address the database integration issue but specifies a persistence format for models. We are currently working with the PMML group to use the PMML format as an open persistence format. In fact, in the model browsing methods briefly discussed in Section 3.3, we use PMML-inspired XML strings in exposing the content of data mining model nodes.
As mentioned earlier, OLE DB DM is an extension to
OLE DB, a specification of data access interfaces. OLE
DB generalized the widely used ODBC connectivity
standard by enabling access to record-oriented data stores
that are not necessarily SQL data sources.
5. Concluding Remarks
We presented the highlights of OLE DB DM, a new
API for integrating data mining into SQL databases. A
reference OLE DB DM provider is scheduled to ship with
the next release of a commercial database system:
Microsoft SQL Server 2000.
The target user base of OLE DB DM is the community
of enterprise developers who are familiar with SQL and
database connectivity APIs such as ODBC and OLE DB.
For this reason, a key design goal was to keep the set of
extensions as close to SQL’s look and feel as possible.
OLE DB DM introduces data mining objects and entities
by drawing straightforward analogies to familiar
relational objects. We have only introduced two new
concepts: Data Mining Models (DMM) and Cases.
We hope that OLE DB DM will help steer our
community to establish a standard in the field. The
advantages of such a standard include integration with the
database so that data mining no longer needs to involve
taking data outside the database, addressing issues of
model deployment once models are created, and
providing an infrastructure for managing data and models
while leveraging the existing SQL infrastructure.
Establishing standards also reduces the costs, increases
the ability to interoperate and partner, and reduces the
risks for both providers and consumers. Consumers are
protected since they are no longer dependent on a
proprietary interface so other vendors can step in to
provide replacements and enhancements. Providers are
protected from the need to build infrastructure from
scratch as partners can step in to build different
complementary components. To move towards a standard,
an open forum, such as a standards committee elected by the
community will be needed. We hope that this proposal
will encourage further discussions along this direction.
Acknowledgement: We are grateful to Venkatesh
Ganti for his thoughtful comments on the draft.
6. References
[1] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant: The Quest data mining system. Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, August 1996.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A. I.
Verkamo: Fast discovery of association rules. In Advances
in knowledge discovery and data mining, pp. 307-328,
AAAI Press, Menlo Park, CA, 1996.
[3] S. Chaudhuri and U. Dayal: An overview of data
warehousing and OLAP technology. SIGMOD Record
26(1), pp. 65-74, 1997.
[4] P. Chapman, R. Kerber, J. Clinton, T. Khabaza, T.
Reinartz, R. Wirth: The CRISP-DM process model,
Technical Report. http://www.crisp-dm.org/ 1999.
[5] DMG Organization: PMML 1.0 - Predictive Model Markup Language, http://www.dmg.org/public/techreports/pmml-1_0.html, 1999.
[6] J. Han, Y. Fu, W. Wang, J. Chiang, W. Gong, K. Koperski,
D. Li, Y. Lu, A. Rajan, N. Stefanovic, B. Xia, O. R.
Zaiane: DBMiner: A system for mining Knowledge in
Large Relational Databases. Proc. of int'l conf. on data
mining and knowledge discovery (KDD'96), Portland,
Oregon, pp. 250-255, August 1996.
[7] P. Hipson: OLE DB and ADO Developer's Guide, McGraw-Hill, 1999.
[8] R. Meo, G. Psaila, S. Ceri: An extension to SQL for mining association rules. Data Mining and Knowledge Discovery, Vol. 2, No. 2, pp. 195-224, 1998.
[9] Introduction to OLE DB for Data Mining, Microsoft
Corporation, http://www.microsoft.com/data/oledb
[10] OLE DB for OLAP, Microsoft Corporation, http://www.microsoft.com/data/oledb
[11] R. Ramakrishnan and J. Gehrke. Principles of Database
Management (2nd Edition), 1999.
[12] J. Shanmugasundaram, U. M. Fayyad and P. S. Bradley: Compressed data cubes for OLAP aggregate query approximation on continuous dimensions. In Proc. 5th Int'l Conf. on Knowledge Discovery and Data Mining, pp. 223-232, 1999.
[13] Sunita Sarawagi, Shiby Thomas, Rakesh Agrawal:
Integrating mining with relational database systems:
alternatives and implications. Proc. of SIGMOD
Conference, pp. 343-354, 1998.
[14] S. Thomas, S. Sarawagi: Mining generalized association rules and sequential patterns using SQL queries. Proc. of the 4th Int'l Conference on Knowledge Discovery in Databases and Data Mining, New York, August 1998.
[15] E. Thomsen, G. Spofford, D. Chase: Microsoft OLAP Solutions, John Wiley & Sons, 1999.