OLE DB for Data Mining
Specification
Version 1.0
Microsoft Corporation
July 2000
Contents
1 Introduction to OLE DB for Data Mining (DM) ............................................................................. 5
1.1 Goals of Data Mining ............................................................................................................................. 5
1.2 Data Mining Tasks ................................................................................................................................. 6
1.2.1 Predictive Modeling (Classification) ............................................................................................ 6
1.2.2 Segmentation (Clustering) ............................................................................................................ 8
1.2.3 Association (Data Summarization) ............................................................................................... 9
1.2.4 Sequence and Deviation Analysis ............................................................................................... 11
1.2.5 Dependency Modeling ................................................................................................................ 12
1.3 The OLE DB for DM Specification ..................................................................................................... 12
1.4 The Columns Structure of a Data Mining Model (DMM) .................................................................. 15
1.4.1 Model Columns ........................................................................................................................... 15
1.4.2 Prediction Columns ..................................................................................................................... 20
2 OLE DB for DM Programmer's Guide ........................................................................................ 21
2.1 Connecting to a Data Mining Provider ................................................................................................ 21
2.2 Creating New Mining Models.............................................................................................................. 22
2.2.1 Detecting the Capabilities of the Provider .................................................................................. 22
2.2.2 Defining a New Mining Model ................................................................................................... 27
2.2.3 Copying a Mining Model ............................................................................................................ 29
2.2.4 Creating a Mining Model from Predictive Model Markup Language (PMML) ......................... 29
2.3 Finding Existing Mining Models ......................................................................................................... 30
2.4 Browsing Model Column Definition ................................................................................................... 31
2.4.1 Input Columns ............................................................................................................................. 31
2.4.2 Prediction Columns ..................................................................................................................... 33
2.5 Populating the Mining Model .............................................................................................................. 34
2.5.1 Inserting Cases ............................................................................................................................ 35
2.5.2 Populating the Column Values.................................................................................................... 35
2.6 Source Data .......................................................................................................................................... 36
2.6.1 SINGLETON CONSTANT as Source Data ............................................................................... 36
2.6.2 SINGLETON SELECT as Source Data ...................................................................................... 37
2.6.3 OPENROWSET as Source Data ................................................................................................. 38
2.6.4 SELECT as Source Data ............................................................................................................. 38
2.6.5 SHAPE as Source Data ............................................................................................................... 38
2.7 Browsing Mining Model Content ........................................................................................................ 40
2.8 Browsing All Possible Cases and Distinct Column Values ................................................................. 41
2.9 Querying—Applying Mining Models on New Data ............................................................................ 46
2.9.1 Components of a Prediction Query ............................................................................................. 46
2.9.2 An Example ................................................................................................................................ 48
2.9.3 Prediction Details ........................................................................................................................ 49
2.9.4 Flattening Nested Tables ............................................................................................................. 61
2.10 Deleting Existing Mining Models ...................................................................................................... 62
2.11 Refining Mining Models .................................................................................................................... 63
3 Appendix A: Schema Rowsets ................................................................................................... 65
3.1 MINING_MODELS Schema Rowset .................................................................................................. 65
3.2 MINING_COLUMNS Schema Rowset ............................................................................................... 67
3.3 MINING_MODEL_CONTENT Schema Rowset ............................................................................... 75
3.4 Layout of DISTRIBUTION Chapter in MINING_CONTENT Schema Rowset ................................. 78
3.5 MINING_SERVICES Schema Rowset ............................................................................................... 79
3.6 SERVICE_PARAMETERS Schema Rowset ...................................................................................... 85
3.7 MODEL_CONTENT_PMML Schema Rowset................................................................................... 86
4 Appendix B: OLE DB for DM Grammar ...................................................................................... 87
4.1 Statements ............................................................................................................................................ 87
4.1.1 CREATE MINING MODEL ...................................................................................................... 87
4.1.2 INSERT INTO ............................................................................................................................ 90
4.1.3 SELECT ...................................................................................................................................... 90
4.1.4 DELETE ..................................................................................................................................... 92
4.1.5 DROP .......................................................................................................................................... 93
4.2 A Sample BNF ..................................................................................................................................... 93
4.2.1 CREATE ..................................................................................................................................... 93
4.2.2 INSERT ...................................................................................................................................... 94
4.2.3 SELECT ...................................................................................................................................... 95
4.2.4 DELETE/DROP .......................................................................................................................... 97
4.2.5 RENAME .................................................................................................................................... 97
4.2.6 MISCELLANEOUS ................................................................................................................... 97
5 Appendix C: Functions ............................................................................................................... 99
5.1 Predict .................................................................................................................................................. 99
5.2 PredictSupport ................................................................................................................................... 100
5.3 PredictVariance .................................................................................................................................. 100
5.4 PredictStdev ....................................................................................................................................... 101
5.5 PredictProbability .............................................................................................................................. 101
5.6 PredictProbabilityVariance ................................................................................................................ 102
5.7 PredictProbabilityStdev ..................................................................................................................... 102
5.8 Cluster ................................................................................................................................................ 103
5.9 ClusterDistance .................................................................................................................................. 103
5.10 ClusterProbability ............................................................................................................................ 104
5.11 PredictHistogram ............................................................................................................................. 104
5.12 TopCount ......................................................................................................................................... 105
5.13 TopSum............................................................................................................................................ 106
5.14 TopPercent ....................................................................................................................................... 107
5.15 Sub-SELECT ................................................................................................................................... 108
5.16 RangeMid......................................................................................................................................... 108
5.17 RangeMin......................................................................................................................................... 109
5.18 RangeMax ........................................................................................................................................ 109
5.19 PredictScore ..................................................................................................................................... 109
5.20 PredictNodeId .................................................................................................................................. 110
6 Appendix D: XML Format for Data Mining Models.................................................................. 111
6.1 DTD for the DMM Extended PMML ................................................................................................ 112
6.2 Example: Tree Model to Predict Credit Risk ..................................................................................... 122
7 Appendix E: Provider Support for SHAPE Syntax .................................................................. 127
8 Appendix F: Provider Support for OPENROWSET Syntax .................................................... 129
9 Appendix G: Support for Other Data Mining Algorithms ....................................................... 131
9.1 Support for Association Algorithm .................................................................................................... 131
9.2 Support for Regression Algorithm ..................................................................................................... 132
Copyright....................................................................................................................................... 133
1 Introduction to OLE DB for Data Mining (DM)
The OLE DB for Data Mining (hereafter referred to as OLE DB for DM) draft specification assumes that the reader has a working knowledge of the following technologies and languages:

- OLE DB
- SQL (Structured Query Language)
- Microsoft® Visual C++®
- Data mining theory and practice
1.1 Goals of Data Mining
Data mining is about finding interesting structures in data, which may be interpreted as
knowledge about the data or may be used to predict events related to the data. These
structures take the form of patterns, which are concise descriptions of the data set. Data
mining makes the exploration and exploitation of large databases easy, convenient, and
practical for those who have data but not years of training in statistics or data analysis.
The "knowledge" extracted by a data mining algorithm can have many forms and many uses.
It can be in the form of a set of rules, a decision tree, a regression model, or a set of
associations, among many other possibilities. It may be used to produce summaries of data or
to get insight into previously unknown correlations. It also may be used to predict events
related to the data—for example, missing values, records for which some information is not
known, and so forth. There are many different data mining techniques, most of them
originating from the fields of machine learning, statistics, and database programming.
Note Machine learning, as defined here, refers to the computer's ability to improve data mining algorithms automatically through experience. Training, an important term used throughout this specification, refers to the process in which the data mining algorithm analyzes the input data and finds hidden patterns. The patterns discovered during training form the model, which can then be applied to new data.
1.2 Data Mining Tasks
Data mining can be applied for a number of different tasks. The major ones are predictive
modeling (classification), segmentation (clustering), association, sequence and deviation
analysis, and dependency modeling. This section presents a brief description of each of these
tasks.
1.2.1 Predictive Modeling (Classification)
Predictive modeling targets predicting one or more fields in the data by using the rest of the
fields. When the variable being predicted is categorical (to approve or reject a loan, for
example), the problem is called classification. When the variable is continuous (such as
expected profit or loss), the problem is referred to as regression. Classification is a
traditionally well-studied problem. Methods popular in data mining include decision trees,
rules, neural networks (nonlinear regression), radial basis functions, and many others.
For example, based on debt level, income level and employment type, you can use predictive
modeling to predict the credit risk of a given customer. The classification algorithm
determines the relationship of these attributes to the risk class in a training data set where the
risk is known. Decision trees are a common and useful technique for predictive modeling.
Figure 1 shows a set of training data that will be used to predict credit risk. Historical information was collected on customers, including their debt level, income level, employment type, and whether they turned out to be a good or bad credit risk. Figure 2 shows a decision tree that might be created from this data.
Customer ID | Debt level | Income level | Employment type | Credit risk
1 | High | High | Self-employed | Bad
2 | High | High | Salaried | Bad
3 | High | Low | Salaried | Bad
4 | Low | Low | Salaried | Good
5 | Low | Low | Self-employed | Bad
6 | Low | High | Self-employed | Good
7 | Low | High | Salaried | Good

Figure 1. Sample data
Credit Risk counts per node:

All: Good 3, Bad 4
|-- Debt = High: Good 0, Bad 3
|-- Debt = Low: Good 3, Bad 1
    |-- Employment Type = Self-employed: Good 0, Bad 1
    |-- Employment Type = Salaried: Good 3, Bad 0

Figure 2. A decision tree
In this trivial example, a decision tree algorithm might decide that the most significant
attribute for predicting credit risk is debt level. The first split in the decision tree is therefore
made on debt level. One of the two new nodes (debt level = high) is a leaf node, having three
bad credit risks and no good credit risks. In this example, a high debt level is a perfect
predictor of a bad credit risk. The other node (debt level = low) is still mixed, having three
good credit risks and one bad. The decision tree algorithm then chooses employment type as
the next most significant predictor of credit risk. The split on employment type gives two leaf
nodes. It turns out that self-employed people are a bad credit risk. This is, of course, a
completely imaginary and trivial example, but it illustrates how the decision tree can use
known attributes of the credit applicants to predict credit risk. In reality, there would be far more attributes for each credit applicant, and the number of applicants would be very large. When the scale of the problem expands like this, it is very difficult for a person to extract the rules that identify good and bad credit risks. The classification algorithm, on the other hand, can
consider hundreds of attributes and millions of records to come up with the decision tree that
describes rules for credit risk prediction.
1.2.2 Segmentation (Clustering)
Segmentation is finding the groups (clusters) in the data that consist of similar subsets of
records. Unlike in predictive modeling, there is no target variable that appears as an attribute
in the data. The clustering algorithm determines this new "hidden" attribute (the cluster ID to
which each example belongs) by examining the data. Examples include segmenting a
customer database into clusters of similar customers, which enables the design of a separate
marketing strategy for each segment. There are many methods for clustering data. Popular approaches include the K-Means algorithm, hierarchical agglomerative methods, and mixture modeling using the Expectation-Maximization (EM) algorithm for fitting probabilistic mixture models to data. It is possible for a data record to belong to different clusters with different degrees of membership.
Consider an employee database in which each employee has three attributes—age, salary, and
vested amount in a company pension plan. A user may want to issue a query that provides a
cross-tabulation of the average ages of employees having pension plans in the ranges 100K–
200K, 200K–400K, and 400K–1000K and having salaries in the ranges 50K–100K, 100K–
200K, and 200K–300K. For traditional approaches, the problem is that the ranges specified by the user can be arbitrary. In other words, the query hierarchy is dynamic and not pre-discretized along each dimension.
Multidimensional data records can be viewed as points in a multidimensional space. For example, the records of the schema (age, salary) could be viewed as points in a two-dimensional space, with the dimensions of age and salary. Figure 3a shows some data conforming to the above example schema. Figure 3b shows its representation as points in a two-dimensional space.
Figure 3. Clustering sample
Now suppose one is to give a short representation of this simple data set. One could provide the average age and the average salary (and their standard deviations). This would represent the average employee as having a salary of $85.5K (±$35.5K) and an average age of 40 (±15.5) years. However, imagine inspecting the data further and realizing that there are two groups of employees. The summary of the data would then be as shown in Figure 4.
Group | Age (Average) | Age (Std Dev.) | Income (Average) | Income (Std Dev.)
Segment 1 | 26 years | 1.0 | $54.3K | $4K
Segment 2 | 54 years | 3.6 | $116.6K | $15.2K

Figure 4. Clustering result
As Figure 4 illustrates, the data has not only been identified as comprising two distinct segments, but the average values are also much more meaningful within each segment, as evidenced by the much smaller standard deviation associated with each segment.
How does one identify the presence of such segments? This is what a clustering algorithm
does. While it may be obvious what these segments should be in two dimensions (as shown in
the preceding simple two-dimensional example), finding segments in higher dimensions (for
example, four or higher) is much more difficult for humans because simply plotting the data
may no longer help. Also, plotting data becomes extremely inconvenient with many data
points. However, clustering algorithms automatically find such segments in data. Each
segment is represented by its own distribution. The normal distribution was used in this
example, but categorical dimensions, such as gender or job description, can also be admitted
and can be represented by using the multinomial distribution. A clustering algorithm can deal
with both types of attributes and can produce useful groupings for summaries.
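For reference (not part of the original example), the mixture-model view mentioned above is usually written as a weighted sum of component distributions; a minimal sketch in standard notation, where each component p_k would be a normal distribution for continuous attributes or a multinomial distribution for categorical ones:

% Mixture density with K components (e.g., K = 2 segments above).
% \pi_k are the mixing weights; EM estimates the \pi_k together with
% each component's parameters.
p(x) \;=\; \sum_{k=1}^{K} \pi_k \, p_k(x), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0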
1.2.3 Association (Data Summarization)
Association (data summarization) describes a class of methods that target producing
summaries of parts of the data—for example, discovering correlations between variables over
substantial subsets of the data or deriving an association between some items and other items.
The most common technique in this category of methods is the use of association rules.
Sometimes referred to as market basket analysis, the process of finding association rules
depends on identifying frequent item sets in transactional data. Frequent item sets consist of
sets of items (for example, products) that frequently occur together in the same transaction.
Frequent item sets can be used to summarize the sets of products customers tend to buy
together in a supermarket basket. (For another example, to understand how a Web site is used
by its visitors, frequent item sets can also be used to find a set of Web pages that will be
visited during a Web-browsing session.) Therefore, retailers can use association techniques to
do cross-selling by stocking related products together. For example, consider a set of
transactions representing checkout baskets in a grocery store. Given a minimum support level
(supplied by the analyst), the data mining algorithm can find items in the store that are bought
together. Suppose one has a set of baskets shown in the Transaction table in Figure 5a. The
Frequent item sets table in Figure 5b shows the respective support levels for the frequent
item sets derived from the Transaction table.
(a) Transaction table

Basket ID | Item ID
1 | Milk
1 | Butter
2 | Milk
2 | Honey
2 | Butter
3 | Milk
3 | Bread
3 | Butter
4 | Milk
4 | Bread
4 | Honey

(b) Frequent item sets

Support | Item sets found
4 | {Milk}
3 | {Milk}, {Butter}, {Milk, Butter}
2 | {Milk}, {Butter}, {Milk, Butter}, {Honey}, {Bread}, {Honey, Bread}, {Honey, Milk}, {Honey, Butter}, {Bread, Milk}, {Bread, Butter}

Figure 5. Association
Note that as the support level decreases, the number of frequent item sets grows monotonically. In general, in real databases—whether storing market baskets, tracking Web-browsing behavior, or monitoring customer uses of a service (for example, a phone service)—the number of item sets having a high support value tends to be very small, and the number of item sets tends to grow exponentially as the support level is decreased.
Once the frequent item sets are derived, they can be used to produce association rules.
Association rules are derived by selecting one of the items in a frequent item set as the item to
be predicted and then evaluating the remaining items as the conditions of a rule for predicting
that item. For example, in the Frequent item sets table in Figure 5b, one may use the set "{Milk, Butter} with support 3" to derive the following association rule:
If a customer buys Milk, that customer also buys Butter.
However, studying the example data set, one also determines that this rule has an accuracy rate of only 75%, because the transaction indicated by Basket ID number 4 does not obey this rule even though it satisfies the rule's condition (the customer bought Milk).
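This accuracy figure (commonly called the rule's confidence) follows directly from the support counts in Figure 5b:

\text{confidence}(\text{Milk} \Rightarrow \text{Butter})
  \;=\; \frac{\text{support}(\{\text{Milk, Butter}\})}{\text{support}(\{\text{Milk}\})}
  \;=\; \frac{3}{4} \;=\; 75\%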
1.2.4 Sequence and Deviation Analysis
Sequence and deviation analysis accounts for sequence information and anomalies in the data.
In the preceding three categories of data mining techniques—predictive modeling,
segmentation, and association—the sequence in which events occurred was ignored and was
treated simply as part of one record (the case). For example, on a data set consisting of people
visiting a Web site, suppose user U774 first visits the home page (page 0), then page 13, then
page 2, and then page 17 on the Web site. This case could simply be flattened into the
following statement:
Case: User U774: visited {page 0, page 2, page 13, page 17}
On the other hand, it might be preferable to preserve the sequence information. This means
that another user who visited the same pages, but in a different order, will be distinct from
U774.
Algorithms in this category focus on one of the following objectives:
1. Summarizing frequent sequences or episodes in data
2. Detecting changes in data over time
3. Detecting changes in knowledge (models or patterns) over time
As an example of the first kind of task, summarizing, suppose it is discovered that users visit a
particular Web site as follows:
Figure 6. Sequence and deviation analysis
The sequences found in the data may indicate that on a given Web site, 90% of users visit
page 0 and 2% enter at page 10. The sequences also may indicate that from page 0, 60% go to
page 15, and so forth. The graph in Figure 6 summarizes ordering relationships and gives an
idea of the flow. There may be infrequently visited pages between pages 15 and 17, but only
the frequent visits are reported.
Deviation analysis focuses on finding anomalies in the data. For example, if a user usually visits only pages 0, 1, and 15 and then one day visits page 17, the deviation analysis algorithm highlights this particular event. Deviation analysis is a common technique in fraud detection.
1.2.5 Dependency Modeling
Dependency modeling, or "density estimation," refers to the estimation of the underlying joint probability distribution or density of the data. If you know the joint probability distribution, you can answer any question of interest about the data. Dependency modeling can be used to identify (sometimes novel) dependencies among attributes of cases. Identifying dependencies is one way to gain insight into your data.
An often-used density estimate for a small number of attributes is the histogram. Unfortunately, this technique is not useful when there are many attributes. A simple form of density estimation that can handle a large number of attributes uses the Naïve Bayes model. In this model, it is assumed that all attributes are independent within a class or a cluster. Note that the model does not assume that attributes are globally independent. Another simple example of density estimation is to fit a multivariate-normal distribution to data.
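For reference, the within-class independence assumption just described can be stated compactly (C denotes the class or cluster, and x_1 through x_n are the attributes of a case):

% Naive Bayes: attributes are conditionally independent given C,
% so the joint density factors per class/cluster.
p(x_1, x_2, \ldots, x_n \mid C) \;=\; \prod_{i=1}^{n} p(x_i \mid C)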
More complex (and more accurate) models for density estimation include mixture models and
graphical models. In the mixture-model approach, one fits several distributions to a data set.
For example, one may decide a population of users is composed of three distinct
subpopulations, each having its own multivariate-normal distribution. Graphical models
useful for density estimation include Bayesian networks and dependency networks.
1.3 The OLE DB for DM Specification
OLE DB for DM is an OLE DB extension that supports data mining operations over OLE DB
data providers. The goal of this specification is to provide an industry standard for data
mining so that different data mining algorithms from various data mining ISVs can be easily
plugged into user applications. In this documentation, software packages that provide data
mining algorithms are called data mining providers and those applications that use data
mining features are called data mining consumers. OLE DB for DM specifies the API
between data mining consumers and data mining providers.
OLE DB for DM introduces one new virtual object, referred to as the data mining model
(DMM), as well as several new commands for manipulating the DMM. In its characteristics
and use, the DMM is very similar to a table and is created with a CREATE statement very
similar to the SQL CREATE TABLE statement. It is populated using the INSERT INTO
statement, just as a table would be populated. The client uses a SELECT statement to make
predictions and explore the DMM.
OLE DB for DM treats a DMM as if it were a special type of table. When you insert the data
into the table, it is processed by a DM algorithm and the resulting abstraction (or data mining
model) is saved instead of the data itself. Subsequently, the DMM can be browsed, refined, or
used to derive predictions.
Data to be mined is represented logically as a collection of tables in a relational database. For
instance, a customer database might record customers, demographic data about customers,
orders, and order items. A join of the customer orders and order items tables may have many
records for one customer (one per order item). This collection of data pertaining to a single
entity is often called a case, and the set of all relevant cases is referred to as a case set. To
represent these relationships, OLE DB for DM uses nested tables as defined by the Data
Shaping Service, which is included with the Microsoft Data Access Components (MDAC)
products. Note that the same physical data may be used to generate different case sets for
different analysis purposes. For example, if one chooses to mine models or patterns over
specific products, each product then becomes a single case and customers become attributes
of the case.
The content of a DMM can be thought of as a "truth table" containing a row for every
possible combination of the distinct values for each column in the DMM. In other words, it
contains every possible case. With this view in mind, a DMM can be used to look up learned
values and statistics.
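As a small illustration of this view (a sketch only; browsing possible cases and distinct column values is specified in section 2.8, and [Age Prediction] is the model created in the walkthrough below), selecting columns directly from a model browses its learned content rather than stored training rows:

SELECT [Gender], [Age]
FROM [Age Prediction]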
A fundamental operation in OLE DB for DM is the training of a data mining model, followed
by use of the model to derive predictions. The following is an outline of the process.
The INSERT statement invokes the DM algorithm on the provider to create an abstraction of
the data in the form of a DMM. This abstraction represents the patterns the algorithm found in
the data; the patterns are saved rather than the training data. Selecting from a PREDICTION
JOIN allows new data to be processed through the model to produce predictions.
1. Create an OLE DB data source object and obtain an OLE DB session object. This is
the standard mechanism of connecting to data stores via OLE DB.
2. Create the data mining model object. Using an OLE DB command object, the client
executes a CREATE statement that is similar to a CREATE TABLE statement.
CREATE MINING MODEL [Age Prediction]
(
    [Customer ID]        LONG    KEY,
    [Gender]             TEXT    DISCRETE,
    [Age]                DOUBLE  DISCRETIZED() PREDICT,
    [Product Purchases]  TABLE
    (
        [Product Name]   TEXT    KEY,
        [Quantity]       DOUBLE  NORMAL CONTINUOUS,
        [Product Type]   TEXT    DISCRETE RELATED TO [Product Name]
    )
)
USING [Decision Trees]
3. Insert training data into the model. In a manner similar to populating an ordinary table,
the client uses a form of the INSERT INTO statement. Note the use of the SHAPE
statement to create the nested table.
INSERT INTO [Age Prediction]
(
    [Customer ID], [Gender], [Age],
    [Product Purchases](SKIP, [Product Name], [Quantity], [Product Type])
)
SHAPE
{
    SELECT [Customer ID], [Gender], [Age] FROM Customers ORDER BY [Customer ID]
}
APPEND
(
    {SELECT [CustID], [Product Name], [Quantity], [Product Type] FROM Sales ORDER BY [CustID]}
    RELATE [Customer ID] TO [CustID]
)
AS [Product Purchases]
4. Use the data mining model to make some predictions. Predictions are made with a
SELECT statement that joins the model's set of all possible cases with another set of
actual cases. The actual cases can be incomplete. In this example, the value for "Age" is
not known. Joining these incomplete cases to the model and selecting the "Age" column
from the model will return a predicted "age" for each of the actual cases.
SELECT t.[Customer ID], [Age Prediction].[Age]
FROM [Age Prediction]
PREDICTION JOIN
(
    SHAPE
    {
        SELECT [Customer ID], [Gender] FROM Customers ORDER BY [Customer ID]
    }
    APPEND
    (
        {SELECT [CustID], [Product Name], [Quantity] FROM Sales ORDER BY [CustID]}
        RELATE [Customer ID] TO [CustID]
    )
    AS [Product Purchases]
) AS t
ON [Age Prediction].[Gender] = t.[Gender] AND
   [Age Prediction].[Product Purchases].[Product Name] = t.[Product Purchases].[Product Name] AND
   [Age Prediction].[Product Purchases].[Quantity] = t.[Product Purchases].[Quantity]
Note Because the process of combining actual cases with all possible model cases is not as simple as the semantics of a normal SQL JOIN, a new type of join, the PREDICTION JOIN, is introduced in OLE DB for DM. When the schema of the actual case table matches the schema of the model, NATURAL PREDICTION JOIN can be used, obviating the need for the ON clause of the join; columns from the source query are matched to columns from the DMM based on the names of the columns.
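For illustration, a sketch of the same query in its natural form (this query is not an example from the original text; the authoritative grammar is given in Appendix B). Because matching is by name, the source query's column names must line up with the model's:

SELECT t.[Customer ID], [Age Prediction].[Age]
FROM [Age Prediction]
NATURAL PREDICTION JOIN
(
    SHAPE
    {
        SELECT [Customer ID], [Gender] FROM Customers ORDER BY [Customer ID]
    }
    APPEND
    (
        {SELECT [CustID], [Product Name], [Quantity] FROM Sales ORDER BY [CustID]}
        RELATE [Customer ID] TO [CustID]
    )
    AS [Product Purchases]
) AS t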
Part 2 of this document describes the language for creating and manipulating a DMM in more detail. The complete details of the language and of the schema rowsets used when working with a data mining provider (DMP) are given in the appendixes.
1.4 The Columns Structure of a Data Mining Model (DMM)
In usage, the DMM is very similar to a SQL table. The SELECT statement returns columns
from the input data, columns from the model, and predictions produced by the model. The
DMM definition includes a definition of the columns of data over which the model will be
created, including detailed information about the nature of the data and relationships between
columns.
1.4.1 Model Columns
The model columns describe all of the information about a specific case. For example, assume
that each case in the DMM represents a customer. The columns of the DMM will include all
known and desired information about the customer.
The following table illustrates a customer case.
Customer ID | Gender | Hair Color | Age | Age Probability | Product Name | Product Quantity | Product Type | Cars Owned | Car Probability
1 | Male | Black | 35 | 100% | TV | 1 | Electronic | Truck | 100%
  |      |       |    |      | VCR | 1 | Electronic | Van | 50%
  |      |       |    |      | Ham | 2 | Food | |
  |      |       |    |      | Beer | 6 | Beverage | |
As the table indicates, a customer case is not easily describable using simple relational tables.
Each case can include not only simple columns but also multiple tables. Each of these tables
inside the case can have a variable number of rows and a different number of columns. The
meaning of the information contained in the columns can also greatly differ.
Note The ability of a case to contain multiple tables of data is a key requirement for most
of the data mining algorithms. Although most of the relational data stores today cannot
support such table structures, the theoretical notion of nested tables (also known as table
columns) already exists in the relational world and is also supported by MDAC. This
specification will rely on these data structures with some anticipation of a wider adoption
in the relational world in the future.
Some of the columns in the example have a direct one-to-one relationship with the case (such
as "Gender" and "Age"), while others have a one-to-many relationship with the case and
therefore exist in tables. As noted above, the nested tables are a key element in the basic data
structure of the case and therefore have an explicit representation in the case definition. You
can easily identify the following two tables contained in the sample case:

- The "Product Purchases" table, containing the columns "Product Name," "Product Quantity," and "Product Type"
- The "Car Ownership" table, containing the columns "Cars Owned" and "Car Probability"
The main row of the case is the case row. Columns in the case row describe the entity of the
case. For example, in the case illustrated in the preceding table, the "Age" column contains
the age of the customer whose Customer ID is 1. Rows inside nested tables are referred to as
nested rows. Columns in nested rows describe the entity of the nested row as it relates to the
case row. For example, the "Product Quantity" column represents the quantity of the product
indicated in the "Product Name" column; therefore, 2 is the quantity of "Ham" purchased by
customer 1.
As the preceding example indicates, each column can represent the following content types:

- KEY: The columns that identify a row. For example, "Customer ID" uniquely identifies customer cases, and "Product Name" uniquely identifies a row in the "Product Purchases" table. In the CREATE MINING MODEL command syntax, specifying the type flag KEY in the column definition identifies key columns.

- ATTRIBUTE: A direct attribute of the case. This type of column represents some value for the case—for example, the age, gender, or hair color of the customer, or the quantity of a specific product the customer purchased.

- RELATION: Information used to classify attributes, other relations, or key columns. For example, "Product Type" classifies "Product Name." A given relation value must always be consistent for all of the instance values of the other columns it describes—for example, the product "Ham" must always be shown as "Food" for all cases. In the CREATE MINING MODEL command syntax, relations are identified in the column definition by using a RELATED TO clause to indicate the column being classified.

- QUALIFIER: A special value associated with an attribute that has a predefined meaning for the provider—for example, the probability that the attribute value is correct. These qualifiers are all optional and apply only if the data has uncertainties attached to it or if the output of previous predictions is being chained as input to a subsequent DMM training step. Following are examples of qualifiers.

  Note In the CREATE MINING MODEL command syntax, qualifiers are identified by using an OF clause to indicate the attribute column they modify.

  - PROBABILITY: A number between zero and one that describes the probability of the associated value.
  - VARIANCE (or Stdev): A number that describes the variance (or standard deviation) of the value of an attribute.
  - SUPPORT: A float that represents a weight (case replication factor) to be associated with the value.
  - PROBABILITY_VARIANCE (or Stdev): The variance (or standard deviation) associated with the probability estimator used for PROBABILITY.
  - ORDER: Specifies the order of a column. (See ORDERED below.)

- TABLE: A nested table is represented in the case as a special column with the data type TABLE. For any given case row, the value of a TABLE type column contains the entire contents of the associated nested table. The value of a TABLE type column is in itself a table containing all of the columns for the nested table. In the CREATE MINING MODEL command syntax, nested tables are described by a set of columns, all of which are contained within the definition of a named TABLE type column.

- DISCRETE: The attribute values are discrete. This is the simplest form of attribute. Gender is a typical example, where the values describe categories. Even if the values are numeric, no ordering is implied by the values. ("Area Code" is a good example.) The values of a discrete attribute are often called its states.

- ORDERED: Columns that define an ordered set of values. Although there is a total ordering, no distance or magnitude semantics are implied. A ranking of skill level (say, one through five) is an ordered set, but a skill level of five isn't necessarily five times better than a skill level of one. Attributes with a type flag of ORDERED are also considered to be discrete. There may be an associated "Order Of" column with numeric values that gives the ordering for this attribute type column. The order of column values can be defined before the model training. (See the section "Populating the Column Values.")

- CYCLICAL: A set of values that have a cyclical ordering. Day of the week is a good example, since day number one follows day number seven. Attributes with a type flag of CYCLICAL are also considered to be ordered and discrete.

- CONTINUOUS: Attributes with values that form a continuous curve. Values are naturally ordered and have implicit distance and magnitude semantics. Salary is a typical example.

- DISCRETIZED: The data that will be inserted into the model is continuous, but it should be transformed into and modeled as a number of ORDERED states by the provider. Some data mining algorithms cannot accept CONTINUOUS attributes as input, or they may not be able to predict CONTINUOUS values. For these cases, columns with continuous domains should be made into DISCRETIZED attributes. In the CREATE MINING MODEL command syntax, the DISCRETIZED type flag can take arguments to override default discretization behavior.

- SEQUENCE_TIME: A column containing time measurement units. A time column does not have to contain a data type of any particular format; a period number is acceptable. This is typically used to associate a sequence time with individual attribute values, such as purchase time.
A CONTINUOUS attribute's domain may also have a distribution associated with it. This is a
hint given to the data mining provider describing the expected distribution of the column
values that will be inserted into the model when trained. Specific values may be known to
have typical distributions. For some algorithms, it is particularly beneficial to know the
distribution ahead of time. If the distribution isn't known or isn't given, the provider may
assume whatever distribution it finds convenient. Following are examples of distributions:

- NORMAL: A histogram of the continuous values forms a normal Gaussian distribution. Household income values may form this curve.

- LOG_NORMAL: A histogram of the continuous values forms a Gaussian distribution with all values greater than 0, with an elongated upper tail, and with a skew toward the low end of the curve. The quantity associated with a product purchase may form this curve if a value of 0 is not explicitly recorded and if most consumers tend to buy smaller quantities of the product.

- UNIFORM: The likely occurrence of all values is equal.

There are a number of other distribution models, such as BINOMIAL, MULTINOMIAL, POISSON, T-DISTRIBUTION, and so on. A data mining provider may support a subset of these distributions. A sketch of how such a hint appears in a column definition follows.
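For illustration only (this line does not appear in the sample model above), a distribution hint occupies the same position in a column definition as NORMAL does in the earlier CREATE MINING MODEL example:

[Quantity]    DOUBLE    LOG_NORMAL CONTINUOUS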
All of the preceding column descriptions allow the provider to make some sense of the
training data it is given with the INSERT command. Returning to the example, the columns
can now be classified as shown in the following table.
Containing Table | Column | Content Type | Model Hints | Comments
 | Customer ID | Key | | Special column that serves as the case identifier (key)
 | Gender | Discrete Attribute | |
 | Hair Color | Discrete Attribute | |
 | Age | Continuous Attribute | |
 | Age Probability | Probability Modifier of Age | |
 | Customer Loyalty | Ordered Attribute | | Doesn't exist in the sample case. Added for additional illustration.
 | Product Purchases | Table | |
Product Purchases | Product Name | Key | | Each distinct key represents the purchase of a product with a "Quantity" attribute.
Product Purchases | Product Quantity | Continuous Attribute | Log Normal |
Product Purchases | Product Type | Relation of Product Name | |
Product Purchases | Month Purchased | Cyclical Attribute | | Doesn't exist in the sample case. Added for additional illustration.
Car Ownership | Cars Owned | Key | | Has an implicit "Exists" attribute for each distinct key.
Car Ownership | Car Probability | Probability Modifier of Implicit "Exists" Attribute | |
Other hints can be given to the data mining provider to help it build good models of the training data. These modeling flags are provider-specific, but following are two examples:

- MODEL_EXISTENCE_ONLY: The actual values for an attribute are not nearly as important as the simple existence of the attribute. For example, assume the existence of some general demographic data for a selected group of people, along with a nested table of the television programs and the viewing duration for all of the programs that each person watched. For modeling purposes, the fact that the person watched a particular program may be more important than how long they watched it. In this case, the Duration attribute should be marked as MODEL_EXISTENCE_ONLY (see the sketch after this list).

- NOT NULL: The attribute can never contain a null value, and encountering one while training should generate an error.
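For illustration only: a hypothetical column declaration carrying such a flag. The [Duration] column does not appear in the sample model above, and the exact placement of modeling flags within a column definition is governed by the BNF in Appendix B; treat this as a sketch.

[Duration]    DOUBLE    CONTINUOUS MODEL_EXISTENCE_ONLY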
1.4.2 Prediction Columns
Attribute or Table type columns can be input columns, output columns, or both. The data
mining provider will build a data mining model capable of predicting or explaining output
column values based on the values of the input columns.
Predictions may convey not only simple information such as "estimated age is 21", but they
may also convey additional statistical information such as confidence level and standard
deviation. Further, the prediction may actually be a collection of predictions, such as "the set
of products that the customer is likely to buy." Each of the predictions in the collection may
also include a set of statistics.
A prediction can be expressed as a histogram. A histogram provides multiple possible
prediction values, each accompanied by a probability and other statistics. When histogram
information is required, each prediction (which by itself can be part of a collection of
predictions) may have a collection of possible values that constitutes a histogram.
Since the prediction information may be very rich, it is often necessary to extract only a portion of the predictions. For example, you may want to see only the "best estimate," the "top 3 estimates," or the "estimates with probability greater than 55%." Not every provider or every DMM can support all of the possible requests. Therefore, it is necessary for the output column to define what information may be extracted from it.
OLE DB for DM defines a set of standard transformation functions on output columns. These functions are discussed in detail in section 2.9, "Querying—Applying Mining Models on New Data," and in Appendix C.
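As a preview of those functions (a sketch only; it reuses the [Age Prediction] model from section 1.3, the singleton SELECT source described in section 2.6.2, and the PredictHistogram, TopCount, and $Probability constructs defined in Appendix C), a query asking for the three most probable ages for a single hypothetical case might look like the following:

SELECT TopCount(PredictHistogram([Age]), $Probability, 3)
FROM [Age Prediction]
NATURAL PREDICTION JOIN
(SELECT 'Male' AS [Gender]) AS t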
2 OLE DB for DM Programmer's Guide
This section of the specification illustrates how data mining consumers and providers work
together. The section will walk you through the following operations:

- Connecting to a DMP
- Creating a new DMM
- Enumerating and exploring existing data mining models
- Executing queries and deriving predictions with a DMM
- Housekeeping activities

This section is not a formal representation of the interfaces and does not attempt to describe every option and variation that the API enables. Instead, all of the interfaces are formally detailed in the appendixes. You should consider this section a tutorial that describes the principles of working with a DMP and introduces application programmers to the new world of DM client development.
2.1 Connecting to a Data Mining Provider
The process of connecting to a DMP is the same as connecting to any other OLE DB provider
(whether relational, multidimensional, or any other type). The connection sequence to an OLE
DB provider is described in the OLE DB Programmer's Reference.
As with all other OLE DB providers, a DMP supports the data source, session, command, and
rowset objects.
Although during the connection sequence a DMP behaves just like any other OLE DB
provider, it is still very useful to be able to determine whether a specific provider supports the
OLE DB for DM specification. To this end, the constant
DBSOURCETYPE_DATASOURCE_DMP is defined and can be used when enumerating
providers to locate a provider capable of performing data mining. A single provider may
support many data store types; for example, a provider may support both relational and data mining operations concurrently. Bit operations on the SOURCE_TYPE value can detect whether a provider supports a specific data store type.
Once a session object has been instantiated, the client application can query the provider for
information and execute various commands.
2.2 Creating New Mining Models
A new DMM is created with the CREATE MINING MODEL command. This command
correlates closely to the common relational database operation CREATE TABLE, which
defines a table object structure. As will be shown in following sections, creating and
populating a DMM follows the approach taken by relational databases for the management of
tables.
The similarities between DMMs and tables are not coincidental. It is widely expected that
data mining capabilities will be fully integrated with relational databases in the future.
Therefore, the present approach looks at the DMM as a future standard object of an RDBMS,
just like a table or a view, and the DMM is indeed represented and accessed to a large degree as if it were a special type of table.
However, unlike a table, a DMM must announce a predefined goal and analysis technique. Each provider may support many different analysis techniques. It is therefore necessary to be able to identify the provider's capabilities.
2.2.1 Detecting the Capabilities of the Provider
The different mining services (or algorithms as they are also known) are exposed through a
new schema rowset—the mining services schema rowset. This schema rowset exposes the
different algorithms supported by a provider and the way to specify goals for the algorithm.
Many algorithms require a goal—for example, "predict whether the customer's transactions
look fraudulent," "predict the sales amount for the customer," "predict the profit for a
product," and "predict the sales of each store for next year" all have targeted goals. The
algorithm will try to predict something about the case, usually one of the attributes of the
case. Most of the algorithms will need to get a training set of cases where the attributes to be
predicted are already known, and they will then create a DMM capable of predicting these
attributes for cases in which the attribute is unknown.
Different algorithms will be capable of predicting different things. They may also differ in the type of data they are capable of processing. The list of algorithms (or services), their possible goals, their limitations, and their capabilities are all exposed in the mining services schema rowset. This information will be used when defining a new model.
The mining services schema rowset is described in detail in Appendix A. The following describes some of the important columns found in the mining services schema rowset; the OLE DB type indicator for each column is shown in parentheses.
SERVICE_NAME (DBTYPE_WSTR): The name of the algorithm. Provider-specific. Used with the CREATE MINING MODEL command to specify the algorithm.

SERVICE_TYPE_ID (DBTYPE_UI4): A bitmask that describes mining service types. The list includes known popular mining services, such as the following:
- DM_SERVICETYPE_CLASSIFICATION (0x0000001)
- DM_SERVICETYPE_CLUSTERING (0x0000002)
- DM_SERVICETYPE_ASSOCIATION (0x0000004)
- DM_SERVICETYPE_DENSITY_ESTIMATE (0x0000008)
- DM_SERVICETYPE_SEQUENCE (0x0000010)

PREDICTED_CONTENT (DBTYPE_WSTR): The attribute types that can be predicted. This is a comma-delimited list of content types.

PREDICTION_LIMIT (DBTYPE_UI4): The maximum number of predictions the model and algorithm can provide; 0 means no limit.

SUPPORTED_DISTRIBUTION_FLAGS (DBTYPE_WSTR): A comma-delimited list of one or more of the following: NORMAL, LOG_NORMAL, UNIFORM, BINOMIAL, MULTINOMIAL, POISSON, T-DISTRIBUTION. Provider-specific flags may also be defined.

SUPPORTED_INPUT_CONTENT_TYPES (DBTYPE_WSTR): A comma-delimited list of one or more of the following: KEY, DISCRETE, CONTINUOUS, DISCRETIZED, ORDERED, SEQUENCE_TIME, CYCLICAL, PROBABILITY, VARIANCE, STDEV, SUPPORT, PROBABILITY_VARIANCE, PROBABILITY_STDEV, ORDER, SEQUENCE, TABLE. Provider-specific flags may also be defined.

SUPPORTED_PREDICTION_CONTENT_TYPES (DBTYPE_WSTR): A comma-delimited list of one or more of the following: DISCRETE, CONTINUOUS, DISCRETIZED, ORDERED, SEQUENCE_TIME, CYCLICAL, PROBABILITY, VARIANCE, STDEV, SUPPORT, PROBABILITY_VARIANCE, PROBABILITY_STDEV, ORDER, TABLE. Provider-specific flags may also be defined.

SUPPORTED_MODELING_FLAGS (DBTYPE_WSTR): A comma-delimited list of one or more of the following: MODEL_EXISTENCE_ONLY, NOT NULL. Provider-specific flags may also be defined.

TRAINING_COMPLEXITY (DBTYPE_I4): Indication of expected time for training:
- DM_TRAINING_COMPLEXITY_LOW—Running time is proportional to input and is relatively short.
- DM_TRAINING_COMPLEXITY_MEDIUM—Running time may be long but is generally proportional to input.
- DM_TRAINING_COMPLEXITY_HIGH—Running time is long and may grow exponentially in relationship to input.

PREDICTION_COMPLEXITY (DBTYPE_I4): Indication of expected time for prediction:
- DM_PREDICTION_COMPLEXITY_LOW—Running time is proportional to input and is relatively short.
- DM_PREDICTION_COMPLEXITY_MEDIUM—Running time may be long but is generally proportional to input.
- DM_PREDICTION_COMPLEXITY_HIGH—Running time is long and may grow exponentially in relationship to input.

EXPECTED_QUALITY (DBTYPE_I4): Indication of the expected quality of the model produced with this algorithm:
- DM_EXPECTED_QUALITY_LOW
- DM_EXPECTED_QUALITY_MEDIUM
- DM_EXPECTED_QUALITY_HIGH

ALLOW_INCREMENTAL_INSERT (DBTYPE_BOOL): TRUE if additional INSERT INTO statements are allowed after the initial training.

ALLOW_DUPLICATE_KEY (DBTYPE_BOOL): TRUE if cases may have duplicate keys.
2.2.2 Defining a New Mining Model
Defining a new model is done using a CREATE MINING MODEL statement. Similar to the
CREATE TABLE statement, the creation of a DMM defines only its structure and properties.
It does not define the specific content (the learned graphical structure), which will be created
only when the DMM is populated. (See below.)
The CREATE MINING MODEL statement will define the following:
1. The DMM columns
2. The specific algorithm to be used in the DMM
The syntax used to define the DMM columns is similar to the syntax used to define the
columns in a table object, as follows:
CREATE MINING MODEL <mining model name> (<Column definitions>) USING <Service>[(<service
arguments>)]
However, because the columns of a DMM require specialized information, several
extensions were added to the standard SQL syntax. Following is a statement example that
applies to the case structure illustrated in Section 1.3:
CREATE MINING MODEL [Age Prediction]
(
    [Customer ID]        LONG    KEY,
    [Gender]             TEXT    DISCRETE,
    [Hair Color]         TEXT    DISCRETE,
    [Age]                DOUBLE  DISCRETIZED() PREDICT,
    [Age Probability]    DOUBLE  PROBABILITY OF [Age],
    [Product Purchases]  TABLE
    (
        [Product Name]   TEXT    KEY,
        [Quantity]       DOUBLE  NORMAL CONTINUOUS,
        [Product Type]   TEXT    RELATED TO [Product Name]
    ),
    [Car Ownership]      TABLE
    (
        [Car Name]       TEXT    KEY,
        [Probability]    DOUBLE  PROBABILITY OF [Car Name]
    )
)
USING [Microsoft_Decision_Trees]
As the example shows, the definition includes the following information for each column:

- Name (mandatory)
- Data type (mandatory); a special data type exists for tables contained in a case (TABLE)
- List of column type flags and modeling flags
- Relationship to an attribute column (mandatory only if it applies), indicated by the
  RELATED TO or OF clauses
- Prediction request (that is, an indication to the algorithm to predict this column),
  indicated by the PREDICT or PREDICT_ONLY keyword
While a complete BNF for this grammar is given in Appendix B, following are a few
interesting points:

The syntax allows for explicit definition of "Table Columns." "Product Purchases" and "Car
Ownership" are both columns that each contain a full table.

A potential list of supported data types is as follows: LONG, DOUBLE, TEXT, DATE,
BOOL, and TABLE. For a list of the data types supported by the provider, see the
PROVIDER_TYPES schema rowset in Appendix B of the OLE DB Programmer's Reference.

The Discretized function cuts the value range of a continuous variable into a number of
buckets. The syntax for the Discretized attribute type is as follows:
Discretized([method[,n]]). Both arguments are optional, but parentheses are always required,
and a value must be given for "method" in order to supply a value for "n". The "n" argument
is the recommended number of buckets that the discretization method should try to find to
divide up the values of the column. Each provider will have a reasonable default. The
"method" argument describes the algorithm that the provider should use to find the buckets.
All providers should support the method DEFAULT as the default. Other possible
provider-specific algorithms could be AUTOMATIC, EQUAL_AREAS, THRESHOLDS,
CLUSTERS, and so forth.
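For illustration only, the following is a minimal sketch of two column definitions using this
syntax; the [Income] column, the EQUAL_AREAS method, and the bucket count of 10 are
hypothetical values that a particular provider might or might not accept:

[Age]    DOUBLE DISCRETIZED() PREDICT,
[Income] DOUBLE DISCRETIZED(EQUAL_AREAS, 10)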
A column may have missing values. There are different ways to deal with missing values:
the easy way is to ignore them, but sometimes missing values can be informative, and thus
it is often beneficial to model the missing state. Users can specify how to deal with missing
values in the column definition statement. For example, Gender TEXT DISCRETE IGNORE NULL
means to ignore the missing state in the Gender column. The following is a list of possible
ways to specify missing value treatment:

- NOT NULL: The column should not contain missing values; otherwise, an error is returned
  during the model training stage.
- IGNORE NULL: Ignore the missing value.
- NULL INFORMATIVE: The data mining algorithm will model the missing state.
The default option is NULL INFORMATIVE. After the column definition, the statement
indicates the type of algorithm to be used. Only one of the services listed by the provider in
the services schema rowset can be used.

The USING clause can be followed by a PARAMETERS clause containing provider-specific
pairs of parameter-value settings. The SERVICE_PARAMETERS schema rowset contains a
list of parameters supported by the provider. A full description of this schema rowset is
provided in Appendix A. Algorithm providers define the names of their parameters. However,
we suggest the following list of parameters, which may be used by many algorithms (a brief
example appears after the list):
- HOLDOUT_PERCENTAGE: The percentage of data that is held out during the training
  stage. This data may be used in a validation or test phase.
- HOLDOUT_SEED: The seed used to hold out data.
- SAMPLE_PERCENTAGE: The percentage of data that is selected after sampling.
- SAMPLE_SEED: The seed used in sampling data.
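For illustration only, a minimal sketch of passing two of the suggested parameters; whether
these parameters are accepted, and the exact parameter syntax, is provider-specific, and the
values 30 and 1 are arbitrary:

CREATE MINING MODEL [Age Prediction]
(<column definitions as shown earlier>)
USING [Microsoft_Decision_Trees] (HOLDOUT_PERCENTAGE = 30, HOLDOUT_SEED = 1)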
When a CREATE MINING MODEL statement is executed, the model is created and will
appear in the schema rowsets of the provider. However, because data has not yet been
inserted into the model, the model cannot be used for any kind of useful analysis. The client
can use the MODEL_STATE column in the mining models schema rowset to get this
indication.
2.2.3 Copying a Mining Model
Sometimes you may want to run multiple algorithms against the same source data and model
column structure. The OLE DB for DM specification provides a mechanism that allows you
to easily create a new model from an existing model.
SELECT * INTO <new model> USING <model type> [( <parameter list> )] FROM <model>
The new model will contain all information from the existing model that is not specific to the
actual algorithm. Executing this statement will cause the new model to be trained using the
same training query as the existing model. If the existing model is not trained, only the
structure of the model will be copied.
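For example, the following hedged sketch copies the column structure of the model defined
earlier and retrains it with a different algorithm; the target model name and the
[Microsoft_Clustering] service are illustrative, and any service listed in the services
schema rowset could be used instead:

SELECT * INTO [Age Prediction Clusters]
USING [Microsoft_Clustering]
FROM [Age Prediction]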
2.2.4 Creating a Mining Model from Predictive Model
Markup Language (PMML)
Because all of the structure and content of a DMM may be expressed as an XML string in the
Predictive Model Markup Language (PMML) format (see Appendix D), it is conceivable that
the expert user can use such a string as the basis for the creation of a model. This string could
be a modified version of the string retrieved from another model. (See the MODEL_PMML
column of the MODEL_CONTENT_PMML schema rowset.) Changes to the XML string will
typically involve manipulation of the content nodes; for example, pruning the tree, adding
other nodes, or changing the rules described in the nodes.
A provider does not have to support initialization based on a PMML document. To discover
whether the provider supports this capability, the services schema rowset offers the
ALLOW_PMML_INITIALIZATION column.
To create a new model from PMML, use a modified version of the CREATE MINING
MODEL statement, as follows:
CREATE MINING MODEL <mining model name> FROM PMML <xml string>
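For illustration only, a minimal sketch of this form; the PMML document shown is a
placeholder, not a complete or valid model description:

CREATE MINING MODEL [Age Prediction From PMML]
FROM PMML '<?xml version="1.0"?><PMML> ... </PMML>'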
2.3 Finding Existing Mining Models
Data mining models are exposed in the mining models schema rowset. This rowset can be
viewed as an enhanced version of the TABLES schema rowset because it contains all of the
same types of information. In addition, several DMM-specific columns have been added to
the rowset. A complete description of the MINING_MODELS schema rowset can be found in
Appendix A; the following table describes some of the interesting columns.
MODEL_NAME (DBTYPE_WSTR)
    Model name. This column cannot contain NULL.

SERVICE_TYPE_ID (DBTYPE_UI4)
    A bitmask that describes mining service types. The list includes known popular
    mining services, such as the following:
    - DM_SERVICETYPE_CLASSIFICATION (0x0000001)
    - DM_SERVICETYPE_CLUSTERING (0x0000002)
    - DM_SERVICETYPE_ASSOCIATION (0x0000004)
    - DM_SERVICETYPE_DENSITY_ESTIMATE (0x0000008)
    - DM_SERVICETYPE_SEQUENCE (0x0000010)

SERVICE_NAME (DBTYPE_WSTR)
    A provider-specific name that describes the algorithm used to generate the model.

CREATION_STATEMENT (DBTYPE_WSTR)
    Optional. The statement used to create the original data mining model.

PREDICTION_ENTITY (DBTYPE_WSTR)
    A comma-delimited list indicating which columns the model can predict.

IS_POPULATED (DBTYPE_BOOL)
    VARIANT_TRUE if the model is populated; VARIANT_FALSE if the model is not
    populated. An empty model has a defined structure but has not been "trained"
    with data.
2.4 Browsing Model Column Definition
Once an interesting DMM has been identified, you may want to explore its structure. The
structure of a DMM is similar to the structure of a table that is represented as a set of
columns. Like columns of a table, the structure represents the kind of inputs and outputs that
the DMM can provide. Like a table, the structure is independent of the specific data instances
that were or will be input into it. In fact, the structure of a DMM is described using a schema
rowset that is derived from the COLUMNS schema rowset (see the Appendix B of the OLE
DB Programmer's Reference), with new columns added to support data mining operations.
2.4.1 Input Columns
The structure of the DMM is described by the inputs that are used to describe a case and by
the set of possible predictions that can be selected from the model. This structure is described
in the MINING_COLUMNS schema rowset. Data mining providers must support all
mandatory columns, as defined by the OLE DB for DM specification.
The section on The Columns Structure of a DMM in part one of this document describes the
data types, content types, and other interesting flags that describe the columns of a DMM.
Several columns in the MINING_COLUMNS schema rowset (the complete description can
be found in Appendix A) describe these properties of a model column. The following table
describes some interesting columns from that rowset.
COLUMN_NAME (DBTYPE_WSTR)
    The name of the column; this might not be unique. If this cannot be determined,
    a NULL is returned.

DATA_TYPE (DBTYPE_UI2)
    The indicator of the column's data type, for example:
    "TABLE"  = DBTYPE_HCHAPTER
    "TEXT"   = DBTYPE_WCHAR
    "LONG"   = DBTYPE_I8
    "DOUBLE" = DBTYPE_R8
    "DATE"   = DBTYPE_DATE
DISTRIBUTION_FLAG (DBTYPE_WSTR)
    One of the following:
    - NORMAL
    - LOG_NORMAL
    - UNIFORM
    - BINOMIAL
    - MULTINOMIAL
    - POISSON
    - T-DISTRIBUTION
    Provider-specific flags may also be defined.

CONTENT_TYPE (DBTYPE_WSTR)
    One of the following:
    - KEY
    - DISCRETE
    - CONTINUOUS
    - DISCRETIZED([args])
    - ORDERED
    - SEQUENCE_TIME
    - CYCLICAL
    - PROBABILITY
    - VARIANCE
    - STDEV
    - SUPPORT
    - PROBABILITY_VARIANCE
    - PROBABILITY_STDEV
    - ORDER
    - SEQUENCE
    Provider-specific flags may also be defined.
MODELING_FLAG (DBTYPE_WSTR)
    A comma-delimited list of flags. The defined flags are:
    - MODEL_EXISTENCE_ONLY
    - NOT NULL
    Provider-specific flags may also be defined.

RELATED_ATTRIBUTE (DBTYPE_WSTR)
    The name of the target column that the current column either relates to or is a
    special property of.

CONTAINING_COLUMN (DBTYPE_WSTR)
    Name of the TABLE column containing this column; NULL if the column is not
    contained in any TABLE column.
2.4.2 Prediction Columns
ATTRIBUTE or TABLE type columns can be input columns, output columns, or both. The
data mining provider will build a DMM capable of predicting or explaining output column
values based on the values of the input columns. In the CREATE MINING MODEL
command syntax, output columns are identified with the PREDICT or the PREDICT_ONLY
keyword. Marking a column for prediction (or not) has various implications for usage in the
model, as described in the following table.
PREDICT_ONLY (input: no; output: yes)
    Input column values will be used to predict this column's values. This column's
    values will not be used to predict other columns.

PREDICT (input: yes; output: yes)
    Input column values will be used to predict this column's values. This column's
    values will be used to predict predictable columns.

(None mentioned) (input: yes; output: no)
    This column's values will be used to predict predictable columns.
The following table lists two additional columns in the MINING_COLUMNS schema rowset
that describe the input/output state of a column.
IS_INPUT (DBTYPE_BOOL)
    VARIANT_TRUE if this is an input column.

IS_PREDICTABLE (DBTYPE_BOOL)
    VARIANT_TRUE if this is an output column.
Any TABLE column containing a predictable column will itself become predictable.
The MINING_COLUMNS schema rowset has additional columns that indicate the kind of
additional information that can be found in the prediction of a predictable column and what
extraction functions on the predictable column are supported. These additional columns apply
only to output columns (that is, when IS_PREDICTABLE is set to TRUE).
PREDICTION_SCALAR_FUNCTIONS (DBTYPE_WSTR)
    A comma-delimited list of scalar functions that may be performed on the column.

PREDICTION_TABLE_FUNCTIONS (DBTYPE_WSTR)
    A comma-delimited list of functions that may be applied to the column, returning
    a table. The list has the following format:
        <function name>(<column1> [, <column2>], ...)
    The format allows the client to determine which columns will be present in the
    table returned by any given function.
2.5 Populating the Mining Model
After the structure of the DMM is defined, you can use the INSERT INTO command to
populate the model with training data. This command corresponds closely to the common
relational database operation INSERT, which populates a table with data.
The model population stage will run the training data through the data mining algorithm and
will generate a predictive model (referred to in this document as the DMM content).
Notice that although massive quantities of data are fed into the DMM, the DMM usually will
not store any of the data and will retain only the DMM content and distinct column values
after the process is done.
The population step may involve intensive processing of the data, and you should expect it
to take some time. A notification mechanism is available for following the progress of the
algorithm, and the OLE DB asynchronous execution cancellation interfaces are also available.
Specifically, for commands that do not return a rowset, the DM provider's command object
should return an object that supports the following interfaces: IDBAsynchStatus and
IConnectionPointContainer (allowing users to get a connection point for the
IDBAsynchNotify interface).
2.5.1 Inserting Cases
The command syntax for populating the DMM with data is identical to the population of a
relational table with data in SQL. The basic syntax has the form:
INSERT [INTO] <mining model name>
[ <mapped model columns> ]
<source data query>
As is described in the following sections, various syntaxes can be used to specify the <source
data query>. Regardless of which syntax is used, the column binding between the target
DMM and the source query is done by column order, as is the standard with the INSERT
INTO statement, or the command may specify an explicit mapping from source data columns
into DMM columns using the <mapped model columns> clause. Because not every <source
data query> syntax (for example, the SHAPE syntax) allows complete control over the set of
columns that is returned, using the keyword SKIP in the INTO clause indicates columns that
must be present in the source data query but have no meaning to the DMM. Once the DMM is
populated, the client application can browse its content and perform queries to predict new
data points.
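For illustration only, a minimal sketch of the mapped-columns form; the source query and its
[Region] column are hypothetical, and SKIP marks a source column that has no meaning to the
DMM:

INSERT INTO [Age Prediction]
    ([Customer ID], [Gender], SKIP, [Age])
OPENROWSET('SQLOLEDB', '…',
    'SELECT [Customer ID], [Gender], [Region], [Age] FROM [Customers]')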
2.5.2 Populating the Column Values
In general, the DMM will learn the available set of distinct column values while training.
However, there are instances when it is preferable or necessary to explicitly train these values
independently of the model.

- ORDERED or CYCLICAL attributes: The model may depend on the maintenance of a
  certain order among discrete attributes (for example, Monday < Tuesday), and that order
  cannot be guaranteed to appear in the training data.
- Value hierarchies: Related columns introduce value hierarchies that would otherwise have
  to be described every time the attribute is used. For example, it is not necessary to tell
  the DMM that "Beer" is of type "Beverage" each time it appears in the training data.
To train a column, OLE DB for DM specifies the following syntax:
INSERT INTO <model>.COLUMN_VALUES(<mapped model columns>)
<source_data_query>
Unlike the model itself, the column values are incrementally trainable. Individual columns
can be trained separately and repeatedly to add more values. However, if there are
relationships between columns through the RELATED TO clause in the CREATE MINING
MODEL statement, these columns must be trained together, as in the following example:
INSERT INTO [Age Prediction].COLUMN_VALUES(Gender)
OPENROWSET('SQLOLEDB', '…', 'SELECT DISTINCT Gender FROM Customers')
INSERT INTO [Age Prediction].COLUMN_VALUES([Product Purchases].[Product Name],
[Product Purchases].[Product Type])
OPENROWSET('SQLOLEDB', '…', 'SELECT DISTINCT [Product Name], [Product Type] FROM Sales')
INSERT INTO [Age Prediction].COLUMN_VALUES( SKIP, [Month])
OPENROWSET('SQLOLEDB', '…', 'SELECT MonthID, Month FROM Months ORDER BY MonthID')
When the column values have been trained, the client application can browse those values but
cannot yet perform queries or browse model content. Also, because all column-value
relationships are now known, all RELATED TO columns can be omitted from the
model-training query.
2.6 Source Data
The <source data query> part of the INSERT (See "Populating the Mining Model") and
SELECT FROM PREDICTION JOIN (See "Querying—Applying Mining Models on New
Data") commands can be any of the sources described by the
SUPPORTED_SOURCE_QUERY column from the MINING_SERVICES schema rowset
described in Appendix A. The possible values for this column are as follows:

- SINGLETON CONSTANT
- SINGLETON SELECT
- OPENROWSET
- SELECT
- SHAPE
The meaning of each of these constants is described in more detail in the following sections.
If the data-mining provider is embedded in a relational provider that supports nested tables
(also known as table columns), the entire population process could occur under the aegis of a
single provider. However, it is expected that at first the DM providers will be separated from
the relational providers and that the relational providers usually will not natively support
nested tables.
This specification offers suggested ways to overcome these issues. Data mining providers are
strongly encouraged to support at least one of the methods discussed in the following sections
and must publish which methods they support in the MINING_SERVICES schema rowset.
2.6.1 SINGLETON CONSTANT as Source Data
If the provider supports SINGLETON CONSTANT as a SUPPORTED_SOURCE_QUERY
value from the MINING_SERVICES schema rowset, a syntax allowing specification of cases
as a set of constant values is supported in place of the <source data query> for the INSERT
and SELECT FROM PREDICTION JOIN commands.
<singleton constant> ::= (<value or set of values> [,<value or set of values>] )
<value or set of values> ::= <value> | (<set of values>)
For example, the following could be a valid syntax to supply a set of values:
('1', 'Male', (('TV', 1), ('VCR', 2)), (('Van'), ('Truck')))
Although the syntax is identical, the (<singleton constant list>) used by the INSERT
INTO VALUES command syntax is not the same as replacing <source data query> with a
singleton constant data source object. (The only syntax difference is the word "VALUES."
However, inserting a constant row by using the word VALUES is standard SQL, and
accepting a constant list as a general replacement for a table is not.)
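For illustration only, a minimal sketch that feeds the constant case above into the
[Age Prediction] model using the INSERT INTO ... VALUES form; the column mapping shown is an
assumption about how a provider might expect the nested values to be listed:

INSERT INTO [Age Prediction]
    ([Customer ID], [Gender],
     [Product Purchases]([Product Name], [Quantity]),
     [Car Ownership]([Car Name]))
VALUES ('1', 'Male', (('TV', 1), ('VCR', 2)), (('Van'), ('Truck')))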
2.6.2 SINGLETON SELECT as Source Data
If the provider supports SINGLETON SELECT as a SUPPORTED_SOURCE_QUERY value
from the MINING_SERVICES schema rowset, a syntax allowing specification of cases as a
selection of constant values is supported in place of the <source data query> for the
INSERT and SELECT FROM PREDICTION JOIN commands.
The syntax has the following form:
<singleton select> ::= <compound constant select> as <alias>
<compound constant select> ::= <constant select> |
<compound constant select> UNION <compound constant select>
<constant select> ::= (SELECT <alias constant list>)
<alias constant list> ::= <alias constant element> |
<alias constant list>, <alias constant element>
<alias constant element> ::= <CONSTANT> |
<CONSTANT> as <alias> |
<singleton select>
For example, the following could be valid syntaxes to supply a set of values:
(SELECT 21 as Age, 'Male' as Gender) as Case
(SELECT 21 as Age, 'Male' as Gender,
((SELECT 'ham' as Product, 10 as Qty) UNION (SELECT 'beer' as Product, 1 as Qty)) as
Purchases)
as Case
2.6.3 OPENROWSET as Source Data
If the provider supports OPENROWSET as a SUPPORTED_SOURCE_QUERY value from
the MINING_SERVICES schema rowset, a syntax allowing cases to result from an
OPENROWSET of an external command is supported in place of the <source data query>
for the INSERT and SELECT FROM PREDICTION JOIN commands.
Since many of the DM providers will not be embedded within the RDBMS containing the
source data, the <source data query> will most likely need to read data from another data
source. The OPENROWSET function supports this functionality and has the following basic
syntax:
OPENROWSET('provider_name','provider_string','query_syntax')
The 'provider_name' is an OLE DB provider name, the 'provider_string' is the OLE DB
connection string for that provider, and the 'query_syntax' is a query syntax that returns a
rowset (either simple or using SHAPE). The DM provider will establish connection to the
data source object using the 'provider_name' and 'provider_string' and will execute the query
specified in 'query_syntax' to retrieve the source data rowset.
The complete syntax for OPENROWSET is described in Appendix F.
2.6.4 SELECT as Source Data
If the provider supports SELECT as a SUPPORTED_SOURCE_QUERY value from the
MINING_SERVICES schema rowset, the standard SQL SELECT command is supported in
place of the <source data query> for the INSERT and SELECT FROM PREDICTION JOIN
commands.
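For illustration only, a minimal sketch assuming a provider that can resolve the [Customers]
table itself (for example, a DM provider embedded in a relational provider):

INSERT INTO [Age Prediction]
    ([Customer ID], [Gender], [Age])
SELECT [Customer ID], [Gender], [Age] FROM [Customers]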
2.6.5 SHAPE as Source Data
If the provider supports SHAPE as a SUPPORTED_SOURCE_QUERY value from the
MINING_SERVICES schema rowset, a syntax allowing specification of cases as a SHAPE of
related queries is supported in place of the <source data query> for the INSERT and
SELECT FROM PREDICTION JOIN commands.
A single query to most popular relational providers cannot return the nested tables shaped
result set that is needed for the population of many DMMs. Therefore, multiple queries must
be executed in the data source to retrieve all of the data that a case represents. The queries
must be shaped into a nested table form to feed them into the DMM.
OLE DB for DM provides a number of alternatives for performing this operation, including
the following:

- Use of the MDAC Data Shaping Service. The Data Shaping Service is an OLE DB provider
  that can be layered on top of other providers. In OLE DB for DM, it can be invoked via
  OPENROWSET as follows:
INSERT INTO [Age Prediction]
(
    [Customer ID], [Gender], [Age], [Age Probability],
    [Product Purchases] (SKIP, [Product Name], [Product Type], [Quantity]),
    [Car Ownership] (SKIP, [Car Name], [Probability])
)
OPENROWSET('MSDataShape', 'Data Provider=SQLOLEDB',
    'SHAPE
     {
        SELECT [Customer ID], [Gender], [Age], [Age Probability]
        FROM [Customers]
     }
     APPEND ( {SELECT [CustID], [Product Name], [Product Type], [Quantity]
               FROM [Customer Product Sales] }
              RELATE [Customer ID] TO [CustID]
            ) AS [Product Purchases],
            ( {SELECT [CustID], [Car Name], [Probability]
               FROM [Customer Cars] }
              RELATE [Customer ID] TO [CustID]
            ) AS [Car Ownership] '
)
Note Of course, OPENROWSET can be used to direct the query to any provider so that
any syntax can be used as long as the relevant provider supports it. At this time, there is
no standard SQL syntax to query a nested table. Until such a standard is established, it is
likely that different relational database vendors will create unique and incompatible
syntaxes.

- Integrated support for the SHAPE syntax. Some DM providers may choose to adopt the
  SHAPE command syntax and provide integrated support for it within the data mining
  provider. With these providers, the SHAPE command does not need to be executed within
  the context of an OPENROWSET command:
INSERT INTO [Age Prediction]
(
    [Customer ID], [Gender], [Age], [Age Probability],
    [Product Purchases] (SKIP, [Product Name], [Product Type], [Quantity]),
    [Car Ownership] (SKIP, [Car Name], [Probability])
)
SHAPE
{
    OPENROWSET ('SQLOLEDB', 'catalog=Sales',
        'SELECT [Customer ID], [Gender], [Age], [Age Probability]
         FROM [Customers] ORDER BY [Customer ID]')
}
APPEND ( { OPENROWSET ('SQLOLEDB', 'catalog=Sales',
               'SELECT [CustID], [Product Name], [Product Type], [Quantity]
                FROM [Customer Product Sales] ORDER BY [CustID]')
         }
         RELATE [Customer ID] TO [CustID]
       ) AS [Product Purchases],
       ( { OPENROWSET ('SQLOLEDB', 'catalog=Sales',
               'SELECT [CustID], [Car Name], [Probability]
                FROM [Customer Cars] ORDER BY [CustID]')
         }
         RELATE [Customer ID] TO [CustID]
       ) AS [Car Ownership]
Note Appendix E contains more detail on the SHAPE command syntax. Provider
support of the SHAPE command will likely depend on the explicit ordering of the
input data.

- Native support for nested tables. In time, data mining providers may become integrated
  with relational providers capable of fully supporting nested tables. Such providers might
  adopt their own syntax for specifying nested tables. OLE DB for DM does not preclude
  support for such syntax.
2.7 Browsing Mining Model Content
In addition to listing the column structure of a DMM, a very different type of browsing is to
navigate the graphical content of the model. Using a set of input cases, the content of a DMM
is learned by the data mining algorithm. The content of a DMM is the set of rules, formulas,
classifications, distributions, nodes, or any other information that was derived from a specific
set of data using a data mining technique.
Depending on the specific data mining technique used in the creation of the DMM, the
content type may differ from one model to another. The DMM content of a decision tree–based
classification will differ from that of a segmentation model, which, in turn, is very different
from a multiregression DMM.
Browsing the content can provide important insight into the data. In many cases it allows you
to understand the patterns and rules that can be used to predict new data points. You must be
aware, however, that some DMMs do not support a way to express DMM content.
One of the ways to browse the content of the DMM is to extract an XML description of it.
The XML description of the contents can be found in the TABLES schema rowset. The
format of the XML string is provided in Appendix D. The XML string provides an easy way
to get, store, manipulate, and re-create all of the DMM information. However, this format
requires significant expertise from the client application to navigate the content.
The most popular way to express DMM content is by using a directed graph (that is, a tree of
nodes). A decision tree is the classic example. Each node in the tree may have relationships to
other nodes. A node may have one or more parent nodes and zero or more child nodes. The
depth of the graph may vary depending on the specific node.
Tree navigation is already defined in the OLE DB for OLAP specification, and a similar
navigation mechanism is adopted for traversing DMM nodes. The
MINING_MODEL_CONTENT schema rowset described in Appendix A provides a rich
functional set of navigation operations.
Querying the model directly will also return the MINING_MODEL_CONTENT rowset. The
following query provides a result table with the exact structure of the
MINING_MODEL_CONTENT schema rowset:
SELECT * FROM <mining model>.CONTENT
This allows the relational database to expose the set of DMM nodes without requiring custom
OLE DB coding.
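For example, assuming the [Age Prediction] model defined earlier has been populated, its
content nodes can be retrieved with:

SELECT * FROM [Age Prediction].CONTENT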
2.8 Browsing All Possible Cases and Distinct
Column Values
When a mining model is trained, it will encounter in the set of training cases a distinct set of
possible values or "states" that the attributes of the model can take on.
For example, consider a DMM with the following columns: Gender, Age and HairColor.
After this DMM has been trained, the Gender column should end up knowing about the states
"Male," "Female," and "Missing." (For completeness, assume that all attributes, even those
with continuous domains, can take on the "Missing" state. This is true even when NULL or
missing values are not encountered in the training data.) For HairColor, the DMM sees and
remembers the values "Black," "Gray," and "Missing." Although the DMM has seen all of the
values for the continuous attribute column Age, it does not remember every distinct value for
the column. Instead, it learns the minimum, mean, and maximum values for the column.
If the example model was built to predict the HairColor column from a set of 100 people,
browsing the contents of the DMM might show a decision tree over the Gender and Age columns.
The set of all possible cases contained in a DMM has one entry for every possible
combination of the distinct values for each attribute. For discrete attributes, this is a list of the
distinct values seen in the column (plus the "Missing" state). For continuous attributes, the
"Minimum," "Maximum," "Mean," and "Missing" states are reported. For Discretized
attributes, the buckets found during discretization are listed. The return value is the midpoint
between the upper and lower bounds of the bucket. Use of the SELECT command on the DMM
reports these possible cases. Along with each possible case, the DMM can report statistics
learned for the attributes that it has been built to predict.
In the example, the following command and results (shown in the following table) are
possible:
SELECT *, PredictProbability(HairColor) FROM HairColorPredictDMM
Gender    Age     HairColor    P(HairColor)
Male      2       Black        .667
Male      2       Gray         .267
Male      2       NULL         .067
Male      91      Black        .300
Male      91      Gray         .625
Male      91      NULL         .075
Male      45      Black        .667
Male      45      Gray         .267
Male      45      NULL         .067
Male      NULL    Black        .600
Male      NULL    Gray         .350
Male      NULL    NULL         .05
Female    2       Black        .933
Female    2       Gray         .067
Female    2       NULL         .000
Female    91      Black        .300
Female    91      Gray         .625
Female    91      NULL         .075
Female    45      Black        .933
Female    45      Gray         .067
Female    45      NULL         .000
Female    NULL    Black        .600
Female    NULL    Gray         .350
Female    NULL    NULL         .05
NULL      2       Black        .800
NULL      2       Gray         .167
NULL      2       NULL         .033
NULL      91      Black        .300
NULL      91      Gray         .625
NULL      91      NULL         .075
NULL      45      Black        .800
NULL      45      Gray         .167
NULL      45      NULL         .033
NULL      NULL    Black        .600
NULL      NULL    Gray         .350
NULL      NULL    NULL         .05
Providers may support a WHERE clause on this command to filter the resulting set of all
possible cases, as shown in the following example and results table:
SELECT *, PredictProbability(HairColor) FROM HairColorPredictDMM WHERE Gender = 'Male' AND
HairColor = 'Black'
Gender    Age     HairColor    P(HairColor)
Male      2       Black        .667
Male      91      Black        .300
Male      45      Black        .667
Male      NULL    Black        .600
2.8.1 Finding Distinct Column Values
To find the list of possible values against which a column from a DMM can be compared, use
a command with the SELECT DISTINCT syntax from SQL, as in the following example:
SELECT DISTINCT HairColor FROM HairColorPredictDMM
HairColor
Black
Gray
NULL
As expected, selecting distinct combinations of columns will report rows for only the possible
combinations of the selected columns values.
SELECT DISTINCT HairColor, Gender FROM HairColorPredictDMM
Gender    HairColor
Male      Black
Male      Gray
Male      NULL
Female    Black
Female    Gray
Female    NULL
NULL      Black
NULL      Gray
NULL      NULL
In theory, you could select TABLE type columns from a DMM that contains nested tables.
However, in practice, such an operation would be impractical. This is because the set of
possible values for a table-valued column is all of the conceivable tables having every
possible combination of the keys for that nested table. Although this is the conceptual "truth
table" content of the DMM, no provider should be expected to manifest this set of records.
However, selecting distinct column values from a set of all possible nested table cases is often
a useful task. Consider the larger example from Section 1.3 that contained a nested table of
product purchases. The following command produces a list of the distinct product names that
a customer may purchase:
SELECT DISTINCT [Product Purchases].[Product Name] FROM [Age Prediction]
Note that this syntax uses the "." operator to refer to a column from the scope of a nested
table.
Furthermore, you can determine relationships between trained column values with a WHERE
clause. In the larger example, product names were classified by product type. To find the
products of a certain type, consider the following command:
SELECT DISTINCT [Product Purchases].[Product Name] FROM [Age Prediction]
WHERE [Product Purchases].[Product Type] = 'Electronic'
This will return a list of all Product Names with which the model was trained that have a
corresponding type of "Electronic."
2.9 Querying—Applying Mining Models on
New Data
Prediction queries on a DMM allow you to predict attributes that may be missing from new
cases. To perform a query, you need a populated DMM (that is, already trained) and a set of
new cases to predict (generally not the cases upon which the DMM was trained).
2.9.1 Components of a Prediction Query
Prediction queries are retrieved from a DMM with a SELECT command. (The complete
syntax for the OLE DB for DM–compliant SELECT statement is presented in Appendix B.)
SELECT [FLATTENED] <SELECT-expressions>
FROM <mining model name> PREDICTION JOIN <source data query> ON <join condition>
[WHERE <WHERE-expression>]
2.9.1.1 Source Data Query
The <source data query> clause identifies the set of new cases that will have attributes
predicted by combining this set with the learned knowledge in the DMM. For information on
source data queries, please see the section "Source Data."
2.9.1.2 PREDICTION JOIN
When retrieving predictions from a DMM, the actual cases from <source data query> are
matched up with the set of all possible cases from the model (<mining model name>) via a
PREDICTION JOIN operation. See "Browsing All Possible Cases and Distinct Column
Values" for an explanation of the possible cases contained in a DMM. For the following
simple reasons, the matching of source cases to all possible cases with a PREDICTION JOIN
does not follow the semantics of a standard relational JOIN:

- The DMM cases do not represent every possible value of a continuous column, but a
PREDICTION JOIN must match an exact continuous value from the source case to some
learned distribution in the DMM. Using the simple example set of all possible cases
defined earlier, the following command returns no records because the possible cases for
the DMM contains the Age column values for only the "Minimum," "Mean,"
"Maximum," and "Missing" ages (2, 45, 91, "Missing"):
SELECT * FROM GenderPredictDMM WHERE Gender = 'Male' AND Age = 30
However, a PREDICTION JOIN using the decision tree described for this model finds a
distribution on HairColor for a 30-year-old Male of (Black = .667; Grey = .267; Missing =
.067).
- The DMM cases represent all possible states for a column being predicted, while a user
selecting a prediction for a column often expects to get the single "Best" predicted state.
Use of the same simple example model produces the following results:
SELECT * FROM GenderPredictDMM WHERE Gender = 'Male' AND Age = 45
Gender    Age    HairColor
Male      45     Black
Male      45     Gray
Male      45     NULL
However, selecting HairColor from this model using PREDICTION JOIN to a case for a
45-year-old male would simply report "Black" as the single value for HairColor.

- The PREDICTION JOIN may need to make some aggregations and assumptions when
confronted with missing values in the source case. To continue the example, a
PREDICTION JOIN between the simple model and a case where the person's age is 30
but the gender is unknown would report a hair color of "Black" with a probability of 80%.
(As the sample tree indicates, this is a probability which is independent of Gender.)
In general, PREDICTION JOIN will take one case from the input set, and using the
conditions in the ON clause, it will find a matching set of cases from the DMM. This set
of matching DMM cases is then "collapsed" by the algorithm (in an algorithm-specific
way) into one aggregate case that contains the best predictions for all predictable columns
in the model. This collapsed case may have prediction-describing statistics that are not
directly observable in the set of all possible DMM cases because the statistics are the
result of the collapsing process.
2.9.1.3 SELECT Expressions
The <SELECT-expressions> clause is a set of comma-separated expressions, each of which
can be just a simple column reference or a general expression containing prediction functions
that may be connected with various types of operators. (See "Prediction Details.") Columns
can be referenced from the DMM or from the source data query. When a name conflict occurs
between the DMM and source, the column reference must be prefixed with the model name or
the source query's alias.
To validate the accuracy of the learned model, make a prediction on a set of new source cases
where the predicted column value is known (a set of cases reserved from the set upon which
the model was trained). Use SELECT to find the predicted value of the column from the
model and the actual value from the source query.
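For illustration only, a minimal sketch of such a validation query; the [Holdout Customers]
source table is hypothetical, and only the Gender column is bound in the join condition so
that the model's Age prediction can be compared against the actual Age from the source:

SELECT T.[Age] AS [Actual Age], M.[Age] AS [Predicted Age]
FROM [Age Prediction] AS M PREDICTION JOIN
     OPENROWSET('SQLOLEDB', '…',
         'SELECT [Customer ID], [Gender], [Age] FROM [Holdout Customers]') AS T
ON M.[Gender] = T.[Gender]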
2.9.1.4 ON and the Join Condition
The existence of key columns on the case row is really for bookkeeping and consistency
reasons; the key values from a set of training data may not be used by the DMM, and the
DMM does not retain the set of distinct values for these columns. However, because each row
from the DMM's set of all possible cases is unique, it can be matched to rows from the source
query of actual cases through the <join condition> clause of the ON keyword. The join
condition matches columns from the DMM to columns from the source query. The join
condition has one "=" expression for each set of columns to be matched, and the expressions
are joined with the AND keyword. Column references in the join condition can be simple
column names, they can be prefixed with a model or alias name to scope namespaces and
resolve name conflicts, and they can have many scope levels to identify columns which are in
turn members of table type columns. Consider the following examples:
SELECT … ON GenderPredictDMM.Gender = T2.Gender AND GenderPredictDMM.Age = T2.Age
Notice that even though the model has a column for HairColor, the source query may not have
this column. In fact, if the SELECT command is predicting the "best" HairColor, the DMM's
HairColor column should not be bound to a source column.
SELECT … ON M1.Gender = T2.Sex AND
    M1.[Product Purchases].[product name] = T2.[Product Purchases].[product name]
The DMM [Age Prediction] has been aliased in the FROM clause as M1, and the source
query has been renamed to T2. For both tables, the [product name] column exists in a nested
table-valued column called [product purchases].
For the situation where the schema of the DMM matches the schema of the input query, the
keywords NATURAL PREDICTION JOIN can be used, and the ON clause must be omitted.
Columns from the source query will be matched to columns from the DMM based on the
names of the columns.
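For illustration only, a minimal sketch that combines NATURAL PREDICTION JOIN with a
singleton SELECT source; it assumes the HairColorPredictDMM model described earlier and a
provider that supports the SINGLETON SELECT source, and the constant values are arbitrary:

SELECT [HairColor]
FROM HairColorPredictDMM NATURAL PREDICTION JOIN
     (SELECT 'Male' AS [Gender], 30 AS [Age]) AS T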
2.9.1.5 WHERE Clause
The <WHERE-expression> supports a simplified form of the SQL WHERE clause semantics
that can limit the cases returned from a prediction query. Column references in the WHERE
expression have the same semantics of column references in the <SELECT-expressions>.
2.9.2 An Example
The following sample query will return the predicted age for a set of new customers where the
prediction is more than 80% likely:
SELECT
T1.[Customer ID], T1.[Gender], M1.[Age]
FROM
[Age Prediction] as M1 PREDICTION JOIN
OPENROWSET('MSDataShape',
'data provider=Microsoft.Jet.OLEDB.4.0;data source=D:\customer.mdb',
'SHAPE { SELECT [Customer ID], [Gender]
FROM [Customers] ORDER BY [Customer ID]}
APPEND ( {SELECT [CustID], [Product Name], [Quantity]
FROM [Customer Product Sales] ORDER BY [CustID] }
RELATE [Customer ID] TO [CustID]) AS [Product Purchases],
( {SELECT [CustID], [Car Name]
FROM [Customer Cars] ORDER BY [CustID] }
RELATE [Customer ID] TO [CustID]) AS [Car Ownership]') as T1
ON M1.Gender = T1.Gender AND
M1.[Product Purchases].[Product Name] = T1.[Product Purchases].[Product Name] AND
M1.[Product Purchases].Quantity = T1.[Product Purchases].Quantity AND
M1.[Car Ownership].[Car Name] = T1.[Car Ownership].[Car Name]
WHERE PredictProbability(M1.Age) > .8
2.9.3 Prediction Details
Along with the "best" predicted values, prediction queries on DMMs can convey additional
information and statistics learned from the training data set. There are no explicit columns in
the DMM dedicated to holding these additional bits of information; instead, they can be selected
from the DMM by calling the appropriate functions (often a function taking the predicted
column as an argument).
Some of these functions report simple scalar values that relay measures of the confidence in a
prediction or give fine-grained control over how a prediction is made. Other functions can
expand a prediction into a table of details that better explain the prediction.
Also, the value predicted for a nested table (a column of type TABLE that is predictable) will
in theory produce a nested table with one row for every distinct value for the key of the nested
table. Various functions can operate on this nested table and limit, expand, or reorder the
records. These functions are often a shorthand form of a nested SELECT clause. (A SELECT
statement operating on the nested table can produce a new version of the nested table. A
nested SELECT can be used as an entry in the <SELECT-expressions> list to generate a
nested table.)
These functions will be described briefly in the following sections and are fully enumerated in
Appendix C.
2.9.3.1 Scalar Functions
Directly selecting a predictable column from a DMM is a shortcut for using the default
behavior of the Predict function on the column. It will return the "best" predicted value for
the column (that is, the one with highest probability or whatever the provider decides is most
appropriate). When a non-TABLE type column is given to the Predict function, the result is a
scalar value.
All attributes of a DMM implicitly consider "Missing" as one of the possible values or states
that they should model. In general, it is assumed that "Missing" or NULL values should not
be returned as predictions, even if they are the most likely states. However, for some domains,
a prediction of "Missing" could be informative. For example, consider a data set for the result
of a survey that asked for Age, Gender, and Weight. If you are trying to predict Weight when
given Age and Gender, for example, you might learn that for a certain segment of the
population the average Weight is 135 lbs, but the most likely response to the question is
"Missing" (that is, "none of your business!"). An (optional) argument to the Predict function
can be the value INCLUDE_NULL, which is used to force the Predict function to return
"Missing" as one of the potential prediction values.
Along with the predicted value, other functions can give statistics that describe the prediction.
PredictSupport(MyColumn) will return the number of cases in support of the prediction, and
PredictProbability will give the likelihood of the returned value amongst the set of possible
values for the column.
SELECT [Customer ID], Predict(Age), PredictProbability([Age]) as P …
Customer ID    Age    P
10001          43     .667
10203          43     .400
In the preceding example, [Age] is the predicted attribute and it is a Discretized attribute, so
the predicted value for age will be the midpoint of one of the "buckets" that were found for
age values. To get a better description for the range of a predicted bucket, the RangeMin,
RangeMax, and RangeMid functions can be called on the prediction for the Discretized
column.
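For illustration only, a minimal sketch of retrieving the bucket boundaries along with the
prediction (the rest of the query is elided, following the convention of the other examples
in this section):

SELECT [Customer ID], Predict([Age]),
       RangeMin([Age]) AS [Age Low], RangeMax([Age]) AS [Age High] …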
However, if instead of Discretized, this model was created with [Age] as a continuous
attribute, the reported prediction for Age would be a continuous value (in the domain of Age).
This predicted age may be the mean of some local distribution—for example, the average age
of people who buy the same products as those purchased by a person in the source case. Using
this predicted value alone may be sufficient, but additional pieces of information might also
be available. For example, the standard deviation will usually accompany a continuous
attribute prediction, as follows:
SELECT [Customer ID], [Age], PredictStdev([Age]) as S …
Customer ID    Age    S
10001          45     5.2
10203          15     2.1
[Age] will return the mean value of prediction of age for the input case. The PredictStdev
function will return the standard deviation for the predicted [Age] column. Notice that, unlike
the SQL STD function, which is an aggregation function, PredictStdev is a scalar
function that may provide different results for each returned row.
If the DMM supports finding a clustering of records, the cluster membership information for a
given input case can be obtained with the Cluster function. It returns the cluster identifier that
the given input case most likely belongs to. Details about the input case's fit into its cluster are
retrieved with the ClusterDistance and ClusterProbability functions.
SELECT [Customer ID], [Gender], Cluster() as C, ClusterProbability() as CP …

Customer ID    Gender    C    CP
10001          Male      2    .21
10203          Female    7    .32
The list of available functions for each of the prediction columns is found in the
MINING_COLUMNS schema rowset of the DMM. Many of the common functions were
standardized in this specification and are available in Appendix C. The following table
provides a short description of these functions.
Predict(<scalar column reference>, options, …)
    Returns: <column reference>
    General prediction function to modify the behavior of prediction for scalar values,
    such as including a missing state. Returns the "best" value, given the options, for
    the specified scalar column.

PredictSupport(<column reference>)
    Returns: scalar value
    Count of cases in support of the predicted value.

PredictVariance(<column reference>)
    Returns: scalar value
    Variance describing the distribution for which the value of Predict is the mean
    (generally for continuous attributes).

PredictStdev(<column reference>)
    Returns: scalar value
    Square root of PredictVariance.

PredictProbability(<column reference>)
    Returns: scalar value
    Likelihood that Predict is the correct value.

PredictProbabilityVariance(<column reference>)
    Returns: scalar value
    Expresses certainty in the value of PredictVariance.

PredictProbabilityStdev(<column reference>)
    Returns: scalar value
    Square root of PredictProbabilityVariance.

Cluster()
    Returns: scalar value or <cluster column reference>
    Cluster identifier that the input case belongs to with the highest probability. It
    can also be used as a <cluster column reference> for a PredictHistogram function.

ClusterDistance([ClusterID_expr])
    Returns: scalar value
    Distance from the center of the cluster identified by ClusterID_expr, or of the
    highest probability cluster.

ClusterProbability([ClusterID_expr])
    Returns: scalar value
    Probability that the input case belongs to the cluster identified by ClusterID_expr,
    or to the highest probability cluster.

RangeMid(<column reference>)
    Returns: scalar value
    Gives the midpoint of the predicted bucket for a discretized column.

RangeMin(<column reference>)
    Returns: scalar value
    Gives the low end of the predicted bucket for a discretized column.

RangeMax(<column reference>)
    Returns: scalar value
    Gives the upper end of the predicted bucket for a discretized column.
2.9.3.2 Expanding Scalar Predictions with PredictHistogram
The additional information on a prediction need not be a simple scalar. For example, when
predicting a discrete attribute (such as Gender), a histogram is one possible way to provide the
predictions. The histogram will have one entry for each of the possible values that could have
been returned for the column. Along with each value are some statistics that describe its
likelihood. (The exact format of a histogram is presented in Appendix C.) This histogram is a
table, and the PredictHistogram function returns this table as a column with the data type of
TABLE (that is, a table column). The nested table has a predefined set of
information-containing columns. These columns are $Support, $Variance, $Stdev (standard deviation),
$Probability, $ProbabilityVariance, and $ProbabilityStdev.
SELECT [Customer ID], PredictHistogram([Gender]) AS GH …
Customer ID    GH
10001          Gender    $Support    $Probability
               Male      621         .621
               Female    379         .379
10203          Gender    $Support    $Probability
               Male      446         .446
               Female    554         .554
…              …
Note For simplicity, only a few of the automatic information columns are shown in the
preceding example.
The Predict function selects its return value from the table returned by PredictHistogram.
From this table, the record with the highest value for $Probability is found, and the value
for the appropriate column is returned.
Depending on the capabilities of the underlying DMM, the distribution for a continuous
column may have more than one mode. (That is, the distribution graph shows more than one
peak.) In this case, users can obtain the statistics (mean, standard deviation, and so on) of each
mode by using the PredictHistogram function against a continuous column.
SELECT [Customer ID], PredictHistogram([Age]) AS AH …
Customer ID    AH
10001          Age     $StdDev    $Probability
               32.1    17.2       .621
               65.2    6.4        .379
…
If the DMM supports finding a clustering of records, the Cluster function returns the most
likely cluster membership for a given input case. However, the input case may exist with
various degrees of probability in many or all of the clusters. Using the
PredictHistogram(Cluster()) function will expand the cluster prediction out to a table
describing the full cluster membership of the input case.
SELECT [Customer ID], PredictHistogram(Cluster()) AS CH …
Customer ID    CH
10001          Cluster()    $Support    $Probability
               1            724         .55
               2            1025        .05
               3            20          .40
…
By default, the PredictHistogram function will not include "Missing" as one of the reported
states. To force the function to return statistics for the attribute's missing state, the argument
passed into PredictHistogram should be a call to Predict on the attribute, with the argument
to include "Missing" specified, as shown in the following example:
SELECT [Customer ID], PredictHistogram(Predict([Gender], INCLUDE_NULL)) AS GH …
If a column supports the PredictHistogram function, it will be found in the
MINING_COLUMNS schema rowset of the DMM. A full description of PredictHistogram
can be found in Appendix C. The following table provides a short description:
PredictHistogram(<scalar column reference>)
    Returns: <table>
    Generates a histogram that contains details of the predictions for the column. The
    input column reference can be a column returning a function such as Predict or Cluster.
2.9.3.3 Predictions on Table Columns
TABLE type columns may be predicted. The result of selecting such a TABLE type column
from a DMM in a PREDICTION JOIN query is a nested table with one row for every distinct
value learned for the key of the nested table. Along with each row of the generated nested
table will be the "best" predicted value for any predictable columns from the nested table.
Directly selecting a TABLE type column by name is a shortcut for using the default behavior
of the Predict function on the column. Also, because the column is in itself a table, a nested
SELECT statement can be used to return the rows. Using the example schema, where the
Gender, Product Purchases, and Quantity columns are predictable, the following three queries
are equivalent and will return the same results:
SELECT [Customer ID], [Gender], [Product Purchases] …
SELECT [Customer ID], [Gender], Predict([Product Purchases]) …
SELECT [Customer ID], [Gender], (SELECT * FROM [Product Purchases]) …
Customer ID    Gender    Product Purchases
10001          Male      Product Name    Quantity    Product Type
                         TV              1           Electronic
                         Ham             2           Food
                         Beer            6           Beverage
10203          Female    Product Name    Quantity    Product Type
                         TV              2           Electronic
                         Ham             1           Food
                         Beer            0           Beverage
The input table of actual cases may or may not contain a nested table that matches the nested
table being predicted. If not, the interpretation of Predict on the table column is quite natural.
Predict the membership of this table based on the other factors given for the case. If, however,
the input case has a matching nested table, three possible behaviors may be desired. Consider
the following example model:
1. A prediction simply could be the complete list of products the store offers, with associated
predictions for quantities.
2. The prediction might show what other products a customer is likely to buy based on the
products the customer has already bought. The reported list should not include the product
from the input case.
3. The prediction might be just the predicted "Quantity" value associated with the products
from the input case, or perhaps just the likelihood of each product in the input case. No
other products should appear in the nested output table.
To express these three different cases, users can specify, respectively, one of the following
options in the Predict function:

- INCLUSIVE, which produces behavior number 1.
- EXCLUSIVE (the default option), which produces behavior number 2.
- INPUT_ONLY, which ensures that the predicted table contains only the rows supplied by
  the input (behavior number 3).
Each entry in the predicted nested table has some probabilistic measurements for inclusion or
ranking in the list. This is different from the probabilities and statistics associated with
individual predictable columns within the nested table. Instead, these are statistics that
describe what was learned about the mere existence of the record in the nested table. For
instance, a model may show an 80% chance that a certain customer will buy beer but only a
40% chance that the beer will be purchased on sale, or a 70% chance that the number of units
purchased will be 12. Another value for the option argument of the Predict function appends
statistics-containing columns to the returned nested table (similar to the way the
PredictHistogram function creates statistics columns in the nested table it produces). Using
the INCLUDE_STATISTICS value adds a $Support and a $Probability column to the
resulting nested table, as illustrated in the following example:
SELECT [Customer ID], [Gender], Predict([Product Purchases], INCLUDE_STATISTICS, INPUT_ONLY)
…
Customer ID    Gender    Product Purchases
10001          Male      Product Name    Quantity    Product Type    $Support    $Probability
                         Ham             2           Food            725         .267
10203          Female    Product Name    Quantity    Product Type    $Support    $Probability
                         Ham             1           Food            30          .34
                         Beer            0           Beverage        56          .83
Note In the preceding example, the customer 10001 input case contained a Product
Purchases subrow only for Ham, and the customer 10203 case contained subrows for Ham
and Beer. Because the INPUT_ONLY option was used, only these rows show up in the
prediction.
The $Probability column for a nested table contains the probability of existence for the
particular subtable entry. No assumptions can be made about the relationships among the sets
of probabilities returned for nested table membership. As they may be derived from
independent parts of the DMM, they cannot be added together to make anything meaningful.
One of the more complex forms of a returned prediction results from requesting a histogram
for a value column inside a predicted table column. In this case, the prediction may include a
histogram for the different statistics of each of the values. The following query will provide
such a structure. (For simplicity, only a few of the automatic info columns are shown in this
example.)
SELECT [Customer ID], [Gender],
(SELECT [Product Name], PredictHistogram([Quantity]) AS [Quantity Histogram]
FROM Predict([Product Purchases], INCLUDE_STATISTICS)) …
Customer ID   Gender    Product Purchases
                        Product Name    Quantity Histogram                       $Probability
                                        Quantity    $Variance    $Probability
10001         Male      TV              1           1.3          0.60            0.23
                                        2           1.8          0.10
                                        3           3.2          0.30
                        Ham             1           0.5          0.25            0.267
                                        2           0.7          0.55
                                        3           3.7          0.20
                        Beer            1           1.1          0.15            0.832
                                        2           0.7          0.15
                                        3           0.2          0.70
If a TABLE column supports the Predict function, it will be found in the
MINING_COLUMNS schema rowset of the DMM. A full description of Predict can be
found in Appendix C. The following table provides a short description.
Function: Predict(<TABLE column reference>, options, …)
Return value: <table column reference>
Description: General prediction function used to modify the default behavior of prediction—for
example, including missing records, appending statistics, inclusive/exclusive/input-only
membership, and so on.
2.9.3.4 Operating on Nested Tables
If a nested table returned as a prediction contains a great number of records (as would be the
case if a store sold many, many different items), slogging through the results of the nested
table to pick out interesting predictions would be an onerous task for both the provider and the
consumer. Even if the nested table contains a relatively small number of records, finding good
predictions from the set would be inconvenient. To solve this problem, OLE DB for DM
introduces the TopX and BottomX family of functions, which operate on nested tables
(including those resulting from PredictHistogram, a nested SELECT, or any other
table-returning expression). These functions order the records of the nested table by a specified
column's value and then truncate the sorted list to a specified length.
For example, using the TopCount function, the following syntax retrieves the three most
probable hair colors (from the learned set of 8 possible) for an input case:
SELECT [Customer ID], TopCount(PredictHistogram([HairColor]), $Probability, 3)…
Or, to get the 10 products (out of the 10,000) that a customer is predicted to buy in
the largest quantity, the TopCount function could be used as follows:
SELECT [Customer ID], TopCount([Product Purchases], [Quantity], 10) …
If a nested table contains a large number of columns and only a few are interesting to the
prediction, or if using a function that produces information columns (such as
PredictHistogram or Predict) and some of the automatic columns are not needed, a nested
SELECT can be used on the nested table or function to project out the desired columns.
Following are two examples using a nested SELECT:
SELECT [Customer ID], (SELECT [Product Name], Quantity FROM [Product Purchases]) …
or
SELECT [Customer ID], (SELECT HairColor, $Support AS Sup FROM
TopCount(PredictHistogram([HairColor]), $Probability, 3)) AS PH …
Customer ID   PH
              HairColor    Sup
200           Red          100
              Brown        57
              Black        13
220           Grey         675
              Black        453
              Green        2
Suppose you wanted to get a list of predicted records from a TABLE type column and, along
with each nested table record, you wanted additional statistics on a predictable column in the
nested table. An earlier example in this document provided this information (and more). This
earlier example generated a prediction of product purchases and, along with each prediction, a
detailed histogram explaining the prediction for the quantity column. Navigating such a
nested rowset may be a bit cumbersome and is also unnecessary if the only information
needed is the best prediction of quantity and some other measure of the prediction's strength
that is returned from the prediction histogram. The following example shows how to get this
result:
SELECT [Customer ID], Gender,
(SELECT [Product Name], [Quantity] AS [Best Quantity],
PredictStdev(Quantity) AS [Quantity Deviation],
$Probability
FROM Predict([Product Purchases], INCLUDE_STATISTICS)), …
Customer ID   Gender   Product Purchases
                       Product Name   Best Quantity   Quantity Deviation   $Probability
10001         Male     TV             1               1.3                  0.23
                       Ham            2               0.7                  0.267
                       Beer           3               0.2                  0.832
The sub-SELECT in the preceding example extracts the desired columns from the nested table
generated by Predict([Product Purchases], INCLUDE_STATISTICS). Note that $Probability
is one of the columns that the Predict function automatically creates and is the probability of
the record existing in the set, not the probability on the quantity.
A nested SELECT with a WHERE clause can be used to pull out certain records from a
nested table. For example, if instead of always getting the "best" prediction for gender a query
wanted to get the probability that each customer was "Female," this syntax would work as
shown in the following example:
SELECT [Customer ID],
(SELECT $Probability FROM PredictHistogram([Gender]) WHERE Gender = 'Female')
AS [Female Probability] …
Customer ID   Female Probability
10001         .379
10203         .554
Another similar use of the WHERE clause is to limit the records in the prediction on a
TABLE type column to some specific entries or set of entries. The following example shows
how to get only predictions for the purchase of "Beer" for any customer:
SELECT [Customer ID], (SELECT * FROM [Product Purchases] WHERE [Product Name] = 'Beer') …
Customer ID   Product Purchases
              Product Name   Quantity   Product Type
10001         Beer           6          Beverage
10203         Beer           0          Beverage
The same idea applies to limiting the scope of nested table predictions to a set of related records
as defined by another column that is related to the key of the subtable, as illustrated by the
following example:
SELECT [Customer ID], (SELECT * FROM [Product Purchases] WHERE [Product Type] = 'Beverage') …
The list of available functions for a predictable TABLE type column is found in the
MINING_COLUMNS schema rowset of the DMM. Many of the common functions were
standardized in this specification and are available in Appendix C. The following table
provides a short description of these common functions.
Function: TopCount(<table expr>, <rank expr>, <n-items>)
Return value: <table expr>
Description: Returns the first <n-items> rows in a decreasing order of <rank expr>.

Function: TopSum(<table expr>, <rank expr>, <sum>)
Return value: <table expr>
Description: Returns the first N rows in a decreasing order of <rank expr> such that the sum of
the <rank expr> values is at least <sum>.

Function: TopPercent(<table expr>, <rank expr>, <percent>)
Return value: <table expr>
Description: Returns the first N rows in a decreasing order of <rank expr> such that the sum of
the <rank expr> values is at least the given percentage of the total sum of <rank expr> values.

Function: Sub-SELECT: (SELECT <SELECT-expressions> FROM <table expr> [WHERE <WHERE clause>])
Return value: <table expr>
Description: Applies a SELECT against <table expr>. <table expr> can be either a table column
reference or any table-returning function except a sub-SELECT.
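The BottomX counterparts take the same arguments and are expected to return the rows with
the smallest values of the rank expression. As a sketch only, using the same model as the
earlier examples, the three least probable hair colors could be retrieved with:

SELECT [Customer ID], BottomCount(PredictHistogram([HairColor]), $Probability, 3) …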
2.9.3.5 Singleton Queries
In some cases, you may want to make a prediction for a case that is not contained in a table.
For example, during a Web site visit, the Web server needs to make a prediction about the
visitor's preferences based on the activities recorded so far. The current activities may not yet
be recorded in the RDBMS, and it may be very inefficient to generate a record (or a set of
records in multiple tables) solely for the purpose of prediction.
To solve this problem, the provider can support a syntax allowing sets of constant values in
place of the <source data query> for the SELECT FROM PREDICTION JOIN syntax. See
the section "Source Data" for examples of singleton data sources.
2.9.4 Flattening Nested Tables
The nested table is a very useful form of data representation that is well suited to the needs of
data mining algorithms. Unfortunately, however, there is currently no widespread support in
relational databases for this form of data representation. The way to convert flat relational
views to a nested table was discussed earlier, and the SHAPE statement is introduced in
Appendix E. This mechanism helps to feed data into the DM provider.
Some data mining clients will not be able to accept result sets in hierarchical format from a
DM provider. This may be because the client lacks the ability to handle hierarchy or because
the client application needs to store the results in a single relational table. To convert the data
from nested tables to flattened tables, it is necessary to request that the query results be
flattened. For this, the SELECT syntax provides the FLATTENED option, as in the following
example:
SELECT FLATTENED <SELECT-expressions> FROM …
The FLATTENED option turns the SELECT result from a hierarchical table into a
flattened table form. The result set will contain one row for each predicted value, simplifying
the processing of the prediction results. If the columns in the <SELECT-expressions> clause
come from various levels of a hierarchy of table nesting, the resulting flattened table will not
put the prediction results on the same record; doing so would imply a connection between the
predictions, and no such connection is assumed to exist. For example, a FLATTENED prediction
on [Products Purchases] might give the result set shown in the following table.
Customer ID   Product Name   Quantity   Probability
1             TV             1          .25
1             TV             2          .1
1             TV             3          .02
1             Ham            2          .2
1             Ham            1          .05
1             Ham            3          .03
In this result set, each row contains a single prediction of products and the possible quantities.
If the columns in the <SELECT-expressions> clause include columns from more than one
table column, the results will return the hierarchical shape in a flattened result set. Each row
again contains a single prediction, but different rows might contain different types of
predictions. For example, if a prediction is made for Gender and Product Purchases, the
flattened result set might look like the following table.
Customer ID   Gender   Gender Probability   Product Name   Quantity   Product Quantity Probability
1             Female   .43                  Null           Null       Null
1             Male     .57                  Null           Null       Null
1             Null     Null                 TV             1          .25
1             Null     Null                 TV             2          .1
1             Null     Null                 TV             3          .02
1             Null     Null                 Ham            2          .2
1             Null     Null                 Ham            1          .05
1             Null     Null                 Ham            3          .03
Each row contains a single prediction; some rows contain a prediction for Gender while
others have a prediction on Product Purchases.
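A query of the following general form could request such a flattened result; this is only a
sketch, and the model name and source data placeholder are illustrative:

SELECT FLATTENED [Customer ID], PredictHistogram([Gender]),
(SELECT [Product Name], PredictHistogram([Quantity]) FROM [Product Purchases])
FROM [Purchase Model] PREDICTION JOIN <source data query> …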
2.10 Deleting Existing Mining Models
Following are two ways to perform deletion operations:
1. Delete the DMM object—Remove the object from the system, with both its structure and
its content.
2. Clear the DMM content—Clear the object of its content, but leave its structure intact.
These two operations are similar to the operations of dropping a table from the database or
clearing all of the table content by using the following statements:
- DROP MINING MODEL <model name>: Deletes the DMM from the database. The model will disappear from the namespace.
- DELETE FROM <model name>: Deletes the content and the column values of the mining model but leaves the object structure intact. You may now repopulate the DMM with a new set of training data (using the INSERT INTO statement) without having to re-create the DMM structure.
- DELETE FROM <model name>.CONTENT: Deletes the content of the mining model but leaves the structure and learned column values intact.
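For example, with a hypothetical model named [Purchase Model], the three statements take the
following forms; the first removes the model entirely, while the latter two leave its structure
in place so it can be retrained:

DROP MINING MODEL [Purchase Model]
DELETE FROM [Purchase Model]
DELETE FROM [Purchase Model].CONTENT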
2.11 Refining Mining Models
Existing DMMs may also be refined. Refinement refers to modifying the content, or set of
rules, by inserting a new set of training cases.
Refining a DMM based on additional cases is limited to algorithms that can be
updated on an incremental basis. The ALLOW_INCREMENTAL_INSERT column in the
MINING_SERVICES schema rowset indicates whether the provider supports this capability
for a given algorithm. If the capability is supported, the DMM can be refined by simply
executing another INSERT INTO statement with the additional cases.
If the capability is not supported, all of the DMM content will have to be deleted and the
DMM must be retrained using the full set of cases (both the old ones and the new ones).
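For example, if the provider reports ALLOW_INCREMENTAL_INSERT for the algorithm
used by a model, a refinement pass is simply another training statement over the new cases.
The model name and the source placeholder in this sketch are illustrative:

INSERT INTO [Purchase Model]
([Customer ID], [Gender], [Product Purchases](SKIP, [Product Name], [Quantity]))
<source data query containing only the new cases>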
3 Appendix A: Schema Rowsets
Schema information in OLE DB is retrieved using predefined schema rowsets; this appendix
lists the contents of each schema rowset. Providers can add columns to these standard schema
rowsets. We recommend that the names of any provider-specific columns be prefixed with the
provider name.
3.1 MINING_MODELS Schema Rowset
Number of restriction columns: 6
Restriction columns: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME,
MODEL_TYPE, SERVICE_NAME, SERVICE_TYPE_ID
Default sort order: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME
Description: Data mining models are exposed in the MINING_MODELS schema rowset.
This schema rowset can be viewed as an enhanced form of the TABLES schema rowset for
data mining models.
Columns (column name, type indicator, description):
1. MODEL_CATALOG (DBTYPE_WSTR): Catalog name. NULL if the provider does not support catalogs.
2. MODEL_SCHEMA (DBTYPE_WSTR): Unqualified schema name. NULL if the provider does not support schemas.
3. MODEL_NAME (DBTYPE_WSTR): Model name. This column cannot contain NULL.
4. MODEL_TYPE (DBTYPE_WSTR): Model type, a provider-specific string—can be NULL.
5. MODEL_GUID (DBTYPE_GUID): GUID that uniquely identifies the model. Providers that do not use GUIDs to identify tables should return NULL in this column.
6. DESCRIPTION (DBTYPE_WSTR): Human-readable description of the model. NULL if there is no description associated with the model.
7. MODEL_PROPID (DBTYPE_UI4): Property ID of the model. Providers that do not use PROPIDs should return NULL in this column.
8. DATE_CREATED (DBTYPE_DATE): Date when the model was created, or NULL if the provider does not have this information. Note: 1.x providers do not return this column.
9. DATE_MODIFIED (DBTYPE_DATE): Date when the model definition was last modified, or NULL if the provider does not have this information.
10. SERVICE_TYPE_ID (DBTYPE_UI4): A bitmask that describes mining service types. The following list includes known popular mining service values: DM_SERVICETYPE_CLASSIFICATION (0x0000001), DM_SERVICETYPE_CLUSTERING (0x0000002), DM_SERVICETYPE_ASSOCIATION (0x0000004), DM_SERVICETYPE_DENSITY_ESTIMATE (0x0000008), DM_SERVICETYPE_SEQUENCE (0x0000010).
11. SERVICE_NAME (DBTYPE_WSTR): A provider-specific name that describes the algorithm used to generate the model.
12. CREATION_STATEMENT (DBTYPE_WSTR): Optional. The statement used to create the original data mining model.
13. PREDICTION_ENTITY (DBTYPE_WSTR): A comma-delimited list indicating which columns the model can predict.
14. IS_POPULATED (DBTYPE_BOOL): VARIANT_TRUE if the model is populated; VARIANT_FALSE if the model is not populated. An empty model has a defined structure but has not been trained with data.
3.2 MINING_COLUMNS Schema Rowset
Number of restriction columns: 4
Restriction columns: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME,
COLUMN_NAME
Default sort order: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME,
COLUMN_NAME
Description: The MINING_COLUMNS schema rowset describes the individual columns of
all defined data mining models known to the provider. This schema rowset can be viewed as
an enhanced form of the COLUMNS rowset for data mining models. Many of the entries are
derived from the COLUMNS schema rowset and are optional.
Columns (column name, type indicator, description):
1. MODEL_CATALOG (DBTYPE_WSTR): Catalog name. NULL if the provider does not support catalogs.
2. MODEL_SCHEMA (DBTYPE_WSTR): Unqualified schema name. NULL if the provider does not support schemas.
3. MODEL_NAME (DBTYPE_WSTR): Model name. This column cannot contain a NULL.
4. COLUMN_NAME (DBTYPE_WSTR): The name of the column; this might not be unique. If this cannot be determined, a NULL is returned. This column, together with the COLUMN_GUID and COLUMN_PROPID columns, forms the column ID. One or more of these columns will be NULL, depending on which elements of the DBID structure the provider uses. If possible, the resulting column ID should be persistent. However, some providers do not support persistent identifiers for columns.
5. COLUMN_GUID (DBTYPE_GUID): Column GUID. Providers that do not use GUIDs to identify columns should return NULL in this column.
6. COLUMN_PROPID (DBTYPE_UI4): Column property ID. Providers that do not associate PROPIDs with columns should return NULL in this column.
7. ORDINAL_POSITION (DBTYPE_UI4): The ordinal of the column. Columns are numbered starting from one. NULL if there is no stable ordinal value for the column.
8. COLUMN_HASDEFAULT (DBTYPE_BOOL): VARIANT_TRUE—the column has a default value. VARIANT_FALSE—the column does not have a default value, or it is unknown whether the column has a default value.
9. COLUMN_DEFAULT (DBTYPE_WSTR): Default value of the column. A provider may expose DBCOLUMN_DEFAULTVALUE but not DBCOLUMN_HASDEFAULT (for SQL-92 tables) in the rowset returned by IColumnsRowset::GetColumnsRowset. If the default value is the NULL value, COLUMN_HASDEFAULT is VARIANT_TRUE and the COLUMN_DEFAULT column is a NULL value.
10. COLUMN_FLAGS (DBTYPE_UI4): A bitmask that describes column characteristics. The DBCOLUMNFLAGS enumerated type specifies the bits in the bitmask. This column cannot contain a NULL value.
11. IS_NULLABLE (DBTYPE_BOOL): VARIANT_TRUE—the column might be nullable. VARIANT_FALSE—the column is known not to be nullable.
12. DATA_TYPE (DBTYPE_UI2): The indicator of the column's data type—for example: "TABLE" = DBTYPE_HCHAPTER, "TEXT" = DBTYPE_WCHAR, "LONG" = DBTYPE_I8, "DOUBLE" = DBTYPE_R8, "DATE" = DBTYPE_DATE.
13. TYPE_GUID (DBTYPE_GUID): The GUID of the column's data type. Providers that do not use GUIDs to identify data types should return NULL in this column.
14. CHARACTER_MAXIMUM_LENGTH (DBTYPE_UI4): The maximum possible length of a value in the column. For character, binary, or bit columns, this is one of the following: the maximum length of the column in characters, bytes, or bits, respectively, if the length is defined (for example, a CHAR(5) column in an SQL table has a maximum length of 5); the maximum length of the data type in characters, bytes, or bits, respectively, if the column does not have a defined length; or zero (0) if neither the column nor the data type has a defined maximum length. NULL for all other types of columns.
15. CHARACTER_OCTET_LENGTH (DBTYPE_UI4): Maximum length in octets (bytes) of the column, if the type of the column is character or binary. A value of zero means the column has no maximum length. NULL for all other types of columns.
16. NUMERIC_PRECISION (DBTYPE_UI2): If the column's data type is a numeric data type other than VARNUMERIC, this is the maximum precision of the column. The precision of columns with a data type of DBTYPE_DECIMAL or DBTYPE_NUMERIC depends on the definition of the column. If the column's data type is not numeric or is VARNUMERIC, this is NULL.
17. NUMERIC_SCALE (DBTYPE_I2): If the column's type indicator is DBTYPE_DECIMAL, DBTYPE_NUMERIC, or DBTYPE_VARNUMERIC, this is the number of digits to the right of the decimal point. Otherwise, this is NULL.
18. DATETIME_PRECISION (DBTYPE_UI4): Datetime precision (number of digits in the fractional seconds portion) of the column if the column is a datetime or interval type. If the column's data type is not datetime, this is NULL.
19. CHARACTER_SET_CATALOG (DBTYPE_WSTR): Catalog name in which the character set is defined. NULL if the provider does not support catalogs or different character sets.
20. CHARACTER_SET_SCHEMA (DBTYPE_WSTR): Unqualified schema name in which the character set is defined. NULL if the provider does not support schemas or different character sets.
21. CHARACTER_SET_NAME (DBTYPE_WSTR): Character set name. NULL if the provider does not support different character sets.
22. COLLATION_CATALOG (DBTYPE_WSTR): Catalog name in which the collation is defined. NULL if the provider does not support catalogs or different collations.
23. COLLATION_SCHEMA (DBTYPE_WSTR): Unqualified schema name in which the collation is defined. NULL if the provider does not support schemas or different collations.
24. COLLATION_NAME (DBTYPE_WSTR): Collation name. NULL if the provider does not support different collations.
25. DOMAIN_CATALOG (DBTYPE_WSTR): Catalog name in which the domain is defined. NULL if the provider does not support catalogs or domains.
26. DOMAIN_SCHEMA (DBTYPE_WSTR): Unqualified schema name in which the domain is defined. NULL if the provider does not support schemas or domains.
27. DOMAIN_NAME (DBTYPE_WSTR): Domain name. NULL if the provider does not support domains.
28. DESCRIPTION (DBTYPE_WSTR): Human-readable description of the column. For example, the description for a column named Name in the Employee table might be "Employee name." NULL if there is no description associated with the column.
29. DISTRIBUTION_FLAG (DBTYPE_WSTR): One of the following: "NORMAL", "LOG_NORMAL", "UNIFORM", "BINOMIAL", "MULTINOMIAL", "POISSON", "HEAVYTAIL", "MIXTURE". Provider-specific flags may also be defined.
30. CONTENT_TYPE (DBTYPE_WSTR): One of the following: "KEY", "DISCRETE", "CONTINUOUS", "DISCRETIZED([args])", "ORDERED", "SEQUENCE_TIME", "CYCLICAL", "PROBABILITY", "VARIANCE", "STDEV", "SUPPORT", "PROBABILITY_VARIANCE", "PROBABILITY_STDEV", "ORDER", "SEQUENCE". Provider-specific flags may also be defined.
31. MODELING_FLAG (DBTYPE_WSTR): A comma-delimited list of flags. The defined flags are "MODEL_EXISTENCE_ONLY" and "NOT NULL". Provider-specific flags may also be defined.
32. IS_RELATED_TO_KEY (DBTYPE_BOOL): VARIANT_TRUE if this column is related to the key. If the key is a single column, the RELATED_ATTRIBUTE field optionally may contain its column name.
33. RELATED_ATTRIBUTE (DBTYPE_WSTR): The name of the target column that the current column either relates to or is a special property of.
34. IS_INPUT (DBTYPE_BOOL): VARIANT_TRUE if this is an input column.
35. IS_PREDICTABLE (DBTYPE_BOOL): VARIANT_TRUE if the column is predictable.
36. CONTAINING_COLUMN (DBTYPE_WSTR): Name of the TABLE column containing this column. NULL if the column is not contained in a TABLE column.
37. PREDICTION_SCALAR_FUNCTIONS (DBTYPE_WSTR): A comma-delimited list of scalar functions that may be performed on the column.
38. PREDICTION_TABLE_FUNCTIONS (DBTYPE_WSTR): A comma-delimited list of functions that may be applied to the column, returning a table. The list has the following format: <function name>(<column1> [, <column2>], ...). The format allows the client to determine which columns will be present in the table returned by any given function.
39. IS_POPULATED (DBTYPE_BOOL): VARIANT_TRUE if the column has learned a set of possible values; VARIANT_FALSE if the column is not populated.
40. PREDICTION_SCORE (DBTYPE_R8): The score of the model on the predicting column. Score is used to measure the accuracy of a model.
3.3 MINING_MODEL_CONTENT Schema Rowset
Number of restriction columns: 10
Restriction columns: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME,
ATTRIBUTE_NAME, NODE_NAME, NODE_UNIQUE_NAME, NODE_TYPE,
NODE_GUID, and NODE_CAPTION
Note A tenth restriction, called the tree operation, is not on any particular column of the
MINING_MODEL_CONTENT rowset; rather, it specifies a tree operator. The idea is that
the consumer specifies a NODE_UNIQUE_NAME restriction and the tree operator
(ANCESTORS, CHILDREN, SIBLINGS, PARENT, DESCENDANTS, SELF) to obtain
the desired set of members. The SELF operator includes the row for the node itself in the
list of returned rows. The following constants are defined:
DMTREEOP_ANCESTORS      0x00000020
DMTREEOP_CHILDREN       0x00000001
DMTREEOP_SIBLINGS       0x00000002
DMTREEOP_PARENT         0x00000004
DMTREEOP_SELF           0x00000008
DMTREEOP_DESCENDANTS    0x00000010
(These designations comprise a bit mask and may be combined.)
Default sort order: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME,
ATTRIBUTE_NAME
Description: The MINING_MODEL_CONTENT schema rowset allows browsing of the
content of a data mining model. The user can employ special tree-operation restrictions to
navigate the content as a directed acyclic graph.
Columns (column name, type indicator, description):
1. MODEL_CATALOG (DBTYPE_WSTR): The name of the catalog to which this model belongs. NULL if the provider does not support catalogs.
2. MODEL_SCHEMA (DBTYPE_WSTR): The name of the schema to which this model belongs. NULL if the provider does not support schemas.
3. MODEL_NAME (DBTYPE_WSTR): Name of the model.
4. ATTRIBUTE_NAME (DBTYPE_WSTR): Name(s) of the attribute(s) corresponding to this node. For a model node, this would be a list of predictable attributes. For a leaf distribution node, this would be a single attribute that the distribution corresponds to.
5. NODE_NAME (DBTYPE_WSTR): Name of the node.
6. NODE_UNIQUE_NAME (DBTYPE_WSTR): Unique name of the node. For providers that generate unique names by qualification, each component of this name is delimited.
7. NODE_TYPE (DBTYPE_I4): The type of the node. Can be one of the following values: DM_NODE_TYPE_MODEL, DM_NODE_TYPE_TREE, DM_NODE_TYPE_INTERIOR, DM_NODE_TYPE_DISTRIBUTION, DM_NODE_TYPE_CLUSTER, DM_NODE_TYPE_UNKNOWN.
8. NODE_GUID (DBTYPE_GUID): Node GUID. NULL if no GUID.
9. NODE_CAPTION (DBTYPE_WSTR): A label or a caption associated with the node. Used primarily for display purposes. If a caption does not exist, NODE_NAME is returned.
10. CHILDREN_CARDINALITY (DBTYPE_UI4): Number of children that the node has. This can be an estimate of the number of children. Consumers should not rely on this being the exact count. Providers should return as good an estimate as possible.
11. PARENT_UNIQUE_NAME (DBTYPE_WSTR): Unique name of the node's parent. NULL is returned for any nodes at the root level. For providers that generate unique names by qualification, each component of this name is delimited.
12. NODE_DESCRIPTION (DBTYPE_WSTR): A human-readable description of the node.
13. NODE_RULE (DBTYPE_WSTR): An XML description of the rule embedded in the node. The format of the XML string is based on the PMML standard.
14. MARGINAL_RULE (DBTYPE_WSTR): An XML description of the rule moving to the node from the parent node.
15. NODE_PROBABILITY (DBTYPE_R8): The probability for reaching the node.
16. MARGINAL_PROBABILITY (DBTYPE_R8): The probability of reaching the node from the parent node.
17. NODE_DISTRIBUTION (DBTYPE_HCHAPTER): A table containing the probability histogram of the node.
18. NODE_SUPPORT (DBTYPE_R8): Number of cases in support of this node.
3.4 Layout of DISTRIBUTION Chapter in MINING_CONTENT Schema Rowset
Number of restriction columns: Not applicable.
Restriction columns: Not applicable.
Default sort order: None.
Description: The DISTRIBUTION column in the MINING_CONTENT schema rowset is a
nested table (which is represented in OLE DB as a chapter column). It provides statistical
distribution information for the attributes corresponding to the node that the parent row
represents. Each attribute will have multiple rows in this table.
Columns (column name, type indicator, description):
1. ATTRIBUTE_NAME (DBTYPE_WSTR): Name of the attribute.
2. ATTRIBUTE_VALUE (DBTYPE_VARIANT): The attribute value represented as a variant.
3. SUPPORT (DBTYPE_R8): The number of cases that support this attribute value.
4. PROBABILITY (DBTYPE_R8): Probability of occurrence of this attribute value.
5. VARIANCE (DBTYPE_R8): Variance of this attribute value.
6. VALUETYPE (DBTYPE_I4): The value type of the attribute. Can be one of the following values: VALUETYPE_MISSING = 1, VALUETYPE_EXISTING = 2, VALUETYPE_CONTINUOUS = 3, VALUETYPE_DISCRETE = 4, VALUETYPE_DISCRETIZED = 5, VALUETYPE_BOOLEAN = 6.
3.5 MINING_SERVICES Schema Rowset
Number of restriction columns: 2
Restriction columns: SERVICE_NAME, SERVICE_TYPE_ID
Default sort order: SERVICE_NAME
Description: The MINING_SERVICES schema rowset exposes the data mining algorithms
available from the provider. It can be used to determine the prediction capabilities,
complexity, and similar information about the algorithm.
Columns (column name, type indicator, description):
1. SERVICE_NAME (DBTYPE_WSTR): The name of the algorithm. Provider-specific. This will be used as the service identifier in the language. (It is not localizable.)
2. SERVICE_TYPE_ID (DBTYPE_UI4): A bitmask that describes mining service types. The following list includes known popular mining service values: DM_SERVICETYPE_CLASSIFICATION (0x0000001), DM_SERVICETYPE_CLUSTERING (0x0000002), DM_SERVICETYPE_ASSOCIATION (0x0000004), DM_SERVICETYPE_DENSITY_ESTIMATE (0x0000008), DM_SERVICETYPE_SEQUENCE (0x0000010).
3. SERVICE_DISPLAY_NAME (DBTYPE_WSTR): The localizable display name of the algorithm. Provider-specific.
4. SERVICE_GUID (DBTYPE_GUID): GUID for the algorithm. NULL if no GUID.
5. DESCRIPTION (DBTYPE_WSTR): Description of the algorithm.
6. PREDICTION_LIMIT (DBTYPE_UI4): The maximum number of predictions the model and algorithm can provide; 0 means no limit.
7. SUPPORTED_DISTRIBUTION_FLAGS (DBTYPE_WSTR): A comma-delimited list of one or more of the following: "NORMAL", "LOG_NORMAL", "UNIFORM". Provider-specific flags may also be defined.
8. SUPPORTED_INPUT_CONTENT_TYPES (DBTYPE_WSTR): A comma-delimited list of one or more of the following: "KEY", "DISCRETE", "CONTINUOUS", "DISCRETIZED", "ORDERED", "SEQUENCE_TIME", "CYCLICAL", "PROBABILITY", "VARIANCE", "STDEV", "SUPPORT", "PROBABILITY_VARIANCE", "PROBABILITY_STDEV", "ORDER", "SEQUENCE". Provider-specific flags may also be defined.
9. SUPPORTED_PREDICTION_CONTENT_TYPES (DBTYPE_WSTR): A comma-delimited list of one or more of the following: "DISCRETE", "CONTINUOUS", "DISCRETIZED", "ORDERED", "SEQUENCE_TIME", "CYCLICAL", "PROBABILITY", "VARIANCE", "STDEV", "SUPPORT", "PROBABILITY_VARIANCE", "PROBABILITY_STDEV". Provider-specific flags may also be defined.
10. SUPPORTED_MODELING_FLAGS (DBTYPE_WSTR): A comma-delimited list of one or more of the following: "MODEL_EXISTENCE_ONLY", "NOT NULL". Provider-specific flags may also be defined.
11. SUPPORTED_SOURCE_QUERY (DBTYPE_WSTR): The <source data query> types that the provider supports. This is a comma-delimited list of one or more of the following syntax descriptions that can be used as the source of data for INSERT INTO or that can be PREDICTION JOINed to a DMM for SELECT: "SINGLETON_CONSTANT", "SINGLETON_SELECT", "OPENROWSET", "SELECT", "SHAPE".
12. TRAINING_COMPLEXITY (DBTYPE_I4): Indication of expected time for training: DM_TRAINING_COMPLEXITY_LOW—running time is proportional to input and is relatively short; DM_TRAINING_COMPLEXITY_MEDIUM—running time may be long but is generally proportional to input; DM_TRAINING_COMPLEXITY_HIGH—running time is long and may grow exponentially in relationship to input.
13. PREDICTION_COMPLEXITY (DBTYPE_I4): Indication of expected time for prediction: DM_PREDICTION_COMPLEXITY_LOW—running time is proportional to input and is relatively short; DM_PREDICTION_COMPLEXITY_MEDIUM—running time may be long but is generally proportional to input; DM_PREDICTION_COMPLEXITY_HIGH—running time is long and may grow exponentially in relationship to input.
14. EXPECTED_QUALITY (DBTYPE_I4): Indication of expected quality of the model produced with this algorithm: DM_EXPECTED_QUALITY_LOW, DM_EXPECTED_QUALITY_MEDIUM, DM_EXPECTED_QUALITY_HIGH.
15. SCALING (DBTYPE_I4): Indication of the scalability of the algorithm: DM_SCALING_LOW, DM_SCALING_MEDIUM, DM_SCALING_HIGH.
16. ALLOW_INCREMENTAL_INSERT (DBTYPE_BOOL): VARIANT_TRUE if additional INSERT INTO statements are allowed after the initial training.
17. ALLOW_PMML_INITIALIZATION (DBTYPE_BOOL): VARIANT_TRUE if the creation of a DMM (including both structure and content) based on an XML string is allowed.
18. CONTROL (DBTYPE_I4): One of the following: DM_CONTROL_NONE, DM_CONTROL_CANCEL, DM_CONTROL_SUSPENDRESUME, DM_CONTROL_SUSPENDWITHRESULT.
19. ALLOW_DUPLICATE_KEY (DBTYPE_BOOL): TRUE if cases may have duplicate keys.
3.6 SERVICE_PARAMETERS Schema Rowset
Number of restriction columns: 2
Restriction columns: SERVICE_NAME, PARAMETER_NAME
Default sort order: SERVICE_NAME, PARAMETER_NAME
Description: The SERVICE_PARAMETERS schema rowset provides a list of parameters
that can be supplied when generating a mining model via the CREATE MINING MODEL
statement. The client will generally restrict by SERVICE_NAME to obtain the parameters
supported by the provider and applicable to the type of mining model being generated.
Columns (column name, type indicator, description):
1. SERVICE_NAME (DBTYPE_WSTR): The name of the algorithm. Provider-specific.
2. PARAMETER_NAME (DBTYPE_WSTR): The name of the parameter.
3. PARAMETER_TYPE (DBTYPE_WSTR): Data type of parameter (DBTYPE).
4. IS_REQUIRED (DBTYPE_BOOL): If true, the parameter is required.
5. PARAMETER_FLAGS (DBTYPE_UI4): A bitmask that describes parameter characteristics. The following values (or a combination thereof) may be used: DM_PARAMETER_TRAINING (0x0000001)—for training; DM_PARAMETER_PREDICTION (0x00000002)—for prediction.
6. DESCRIPTION (DBTYPE_WSTR): Text describing the purpose and format of the parameter.
3.7 MODEL_CONTENT_PMML Schema Rowset
Number of restriction columns: 4
Restriction columns: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME, MODEL_TYPE
Default sort order: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME
Description: The MODEL_CONTENT_PMML schema rowset stores the XML representation of
the content of each model. The format of the XML string follows the PMML standard.
Columns (column name, type indicator, description):
1. MODEL_CATALOG (DBTYPE_WSTR): Catalog name. NULL if the provider does not support catalogs.
2. MODEL_SCHEMA (DBTYPE_WSTR): Unqualified schema name. NULL if the provider does not support schemas.
3. MODEL_NAME (DBTYPE_WSTR): Model name. This column cannot contain NULL.
4. MODEL_TYPE (DBTYPE_WSTR): Model type, a provider-specific string—can be NULL.
5. MODEL_GUID (DBTYPE_GUID): GUID that uniquely identifies the model. Providers that do not use GUIDs to identify tables should return NULL in this column.
6. MODEL_PMML (DBTYPE_WSTR): An XML representation of the model's content in PMML format.
7. SIZE (DBTYPE_UI4): Size of the XML string, in bytes.
8. LOCATION (DBTYPE_WSTR): The location of the XML file. NULL if the file is stored in the default directory.
4 Appendix B: OLE DB for DM Grammar
4.1 Statements
4.1.1 CREATE MINING MODEL
CREATE MINING MODEL <model>
(
<column definition list>
)
USING <algorithm> [(<parameter list>)]
CREATE MINING MODEL <model> FROM PMML <xml string>
Parameters
<model>
A unique name for the model.
<column definition list>
A comma-separated list of column definitions.
<algorithm>
The provider-defined name of a data mining algorithm.
<parameter list>
(Optional) A comma-separated list of provider-defined
parameters for the algorithm.
<xml string>
An XML-encoded model (for advanced use only).
Remarks
The CREATE MINING MODEL statement creates a new mining model based on the column
definition list. A column definition takes one of the following forms:
<column name> <type> [<content flags>] [<column relation>] [<prediction flag>]
<column name> TABLE [<prediction flag>] ( <non-table column definition list> )
<column name>
Any valid column identifier.
<type>
Any valid SQL type, including LONG, DOUBLE, DATE, TEXT,
and TABLE.
<content flags>
Content flags are "hints" to the data mining algorithm that provide
additional information. Flags appear in the order of the grouping
shown here, and flags within the same group cannot appear on the
same column.
Distribution Flags
NORMAL
The values of the column appear in a normal distribution.
LOG NORMAL
The values of the column appear in a log normal distribution
UNIFORM
The values of the column appear in a uniform distribution.
Type Flags
KEY
The column is discrete and is a key. Key columns will not have any
other flags except in the case of a nested table with no attribute
columns.
CONTINUOUS
The column contains values in a continuous range, such as Age or
Salary.
DISCRETE
The column contains a discrete set of values, such as Gender.
DISCRETIZED
The column contains a continuous set of values that should be
converted to buckets.
ORDERED
The column contains a discrete set of values that are ordered, such
as Salary Level.
CYCLICAL
The column contains an ordered discrete set of values that are
cyclical, such as Day of Week, or Month.
SEQUENCE TIME The column contains time measurement units.
SEQUENCE
The column contains the sorting key of the related columns.
Modeling Flags
MODEL_EXISTENCE_ONLY
The column should be modeled as having two states, missing and
nonmissing, regardless of the values in the column. This is
particularly useful for columns in a nested table, where values are
sparse across cases.
NOT NULL
The column cannot accept NULL values.
Special Property Flags
These flags indicate a property of another column and will not appear with any other content
flags or prediction flags.
PROBABILITY
The value in this column is the probability (0–1) of the associated value.
VARIANCE
The value in this column is the variance of the associated value.
STDEV
The value in this column is the standard deviation of the associated value.
PROBABILITY_VARIANCE
The value in this column is the variance of the probability associated with the associated value.
PROBABILITY_STDEV
The value in this column is the standard deviation of the probability associated with the
associated value.
SUPPORT
The value in this column is the weight (case replication factor) of the associated value.
<column relation>
The column relation appears in two forms: OF <column name> and
RELATED TO <column name>.
OF
This form is restricted to use for columns with Special Property
content flags—for example, ProbGender Double PROBABILITY
OF Gender.
RELATED TO
This form indicates a value hierarchy. The target of a related to
column can be a key column in a nested table, a discretely valued
column on the case row, or another column with a RELATED TO
clause (indicating a deeper hierarchy). A special target "KEY" is
reserved for nested tables with multiple keys and indicates a relation
between the value in this column and the composite of all the key
columns.
<prediction flags>
These flags indicate that the column can be predicted by the model and
can have one of two values.
PREDICT
This column can be predicted by the model and it can be supplied in
input cases to predict the value of other predictable columns.
PREDICT_ONLY
This column can be predicted by the model, but its values cannot be
used in input cases to predict the value of other predictable columns.
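As an illustration only—the model and column names are hypothetical, and the algorithm name
is taken from the sample BNF in section 4.2—a statement of this form declares a model with a
nested predictable table:

CREATE MINING MODEL [Purchase Model]
(
    [Customer ID]        LONG    KEY,
    [Gender]             TEXT    DISCRETE PREDICT,
    [Product Purchases]  TABLE   PREDICT
    (
        [Product Name]   TEXT    KEY,
        [Quantity]       DOUBLE  NORMAL CONTINUOUS,
        [Product Type]   TEXT    DISCRETE RELATED TO [Product Name]
    )
)
USING MICROSOFT_DECISION_TREES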
4.1.2 INSERT INTO
INSERT INTO <model> (<mapped model columns>) <source data query>
INSERT INTO <model> (<mapped model columns>) VALUES <constant list>
INSERT INTO <model>.COLUMN_VALUES(<mapped model columns>) <source data query>
Parameters
<model>
A model identifier.
<mapped model columns>
A comma-separated list of column identifiers and nested
identifiers.
<source data query>
The source query in the provider-defined format.
Remarks
The INSERT INTO statement inserts training data into the model. The columns from the
query are mapped to model columns through the <mapped model columns> section. The
keyword SKIP is used to instruct the model to ignore columns that appear in the source data
query that are not used in the model.
The INSERT INTO <model>.COLUMN_VALUES form inserts data directly into the model's
columns without training the model's algorithm. This allows you to provide column data to
the model in a concise, ordered manner that is useful when dealing with data sets containing
hierarchies or ordered columns. The "." operator is used to specify columns that are part of a
nested table. When using this form, columns that are part of a relation (either through
RELATED TO or by being a KEY in a nested table) cannot be inserted individually and must
be inserted together with all the columns in the relation.
The <mapped model columns> section has the following form:
<column identifier> | <table identifier>(<column identifier> | SKIP), …
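For example, the following sketch maps a SHAPE query onto a model like the one sketched
above; all names and the inner queries are placeholders:

INSERT INTO [Purchase Model]
    ([Customer ID], [Gender],
     [Product Purchases](SKIP, [Product Name], [Quantity], [Product Type]))
SHAPE { <case-level query> }
APPEND ( { <purchase-level query> } RELATE [Customer ID] TO [Customer ID] )
    AS [Product Purchases]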
4.1.3 SELECT
4.1.3.1 SELECT INTO
SELECT * INTO <new model>
USING <algorithm> [(<parameter list>)]
FROM <existing model>
Parameters
<new model>
A unique name for the new model being created.
<algorithm>
The provider-defined name of a data mining algorithm.
<parameter list>
(Optional) A comma-separated list of provider-defined parameters
for the algorithm.
<existing model>
The name of the existing model to be copied.
Remarks
The SELECT INTO statement creates a new mining model by copying schema and other
information from an existing mining model. If the existing model is trained, the new model
will automatically be trained with the same query; otherwise, the new model will be empty.
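For example (both model names are illustrative; MICROSOFT_CLUSTERING appears in the
sample BNF later in this appendix):

SELECT * INTO [Purchase Model Clusters]
USING MICROSOFT_CLUSTERING
FROM [Purchase Model]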
4.1.3.2 SELECT FROM CONTENT
SELECT * FROM <model>.CONTENT
Parameters
<model>
A name of the model.
Remarks
The SELECT FROM CONTENT statement returns the mining model content schema rowset for
the specified model. See Appendix A for a description of the MINING_MODEL_CONTENT
schema rowset.
4.1.3.3 SELECT FROM <MODEL>
SELECT [DISTINCT] <expr list> FROM <model> [ WHERE < condition list > ]
Parameters
<model>
A model identifier.
<expr list>
A comma-separated list of related column identifiers or expressions.
<condition list>
(Optional) Conditions to restrict the values returned from the
column list.
Remarks
The SELECT FROM <model> statement allows you to directly browse the values on which
the columns have been trained.
4.1.3.4 SELECT FROM PREDICTION JOIN
SELECT <select expression list> FROM <model> [NATURAL] PREDICTION JOIN
<source data query> [ON <join mapping list>]
[ WHERE <condition expression> ]
Parameters
<select expression list>
A comma-separated list of column identifiers and other
expressions to describe the columns in the results of the
query.
<model>
A model identifier.
<source data query>
The source query in the provider-defined format.
<join mapping list>
A logical expression comparing columns from the model to
columns from the source query.
<condition expression>
(Optional) A condition to restrict the values returned from the
column list.
Remarks
The SELECT FROM PREDICTION JOIN syntax allows you to predict columns based on the
input data supplied by the <source data query>. You can specify the OLE DB for DM
feature-rich prediction functions, including prediction histograms, prediction probability,
sub-SELECT, and so forth, in <select expression list> and <condition expression>. Only the
rows that satisfy the condition in the WHERE clause will be included in the result.
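For example, the following sketch predicts the nested purchases table for each input case; the
model name and the source placeholder are illustrative:

SELECT [Customer ID], Predict([Product Purchases], INCLUDE_STATISTICS)
FROM [Purchase Model] PREDICTION JOIN <source data query> AS t
ON [Purchase Model].[Gender] = t.[Gender]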
4.1.4 DELETE
DELETE * FROM <model>[.CONTENT]
Parameters
<model>
A model identifier.
Remarks
Deletes all training data from the model. If CONTENT is specified, only the algorithm
training is discarded and the column values are retained.
4.1.5 DROP
DROP MINING MODEL <model>
Parameters
<model>
A model identifier.
Remarks
Removes the model and all associated information from the database.
4.2 A Sample BNF
This example BNF is from Microsoft's implementation of an OLE-DB for DM provider and
does not represent the entire breadth of grammar described by this document.
<statement>
-> <create>
|<insert>
|<select>
|<delete>
|<rename>
4.2.1 CREATE
<create>
-> <dm_create>
|<select_into>
|<pmml_create>
<dm_create>
-> CREATE MINING MODEL <identifier> ( <col_def_list> ) USING <algorithm>
[(<algo_param_list>)]
<pmml_create>
-> CREATE MINING MODEL <identifier> FROM PMML <string>
<select_into>
-> SELECT * INTO <identifier> USING <algorithm> FROM <identifier>
<col_def_list>    -> <col_def>
                   | <col_def_list> , <col_def>
<col_def>         -> <col_def_reg> | <col_def_tbl>
<col_def_reg>     -> <identifier> <col_type> [<col_distribution>] [<col_binary>]
                     [<col_content>] [<col_content_qual>] [<col_qualif>] [<col_prediction>] [<relation_clause>]
<col_def_tbl>     -> <identifier> TABLE <col_prediction> ( <col_def_list> )
<algorithm>       -> MICROSOFT_DECISION_TREES | MICROSOFT_CLUSTERING
<algo_param>      -> <identifier> = <value>
<algo_param_list> -> <algo_param>
                   | <algo_param>, <algo_param_list>
<col_type>
-> LONG
| BOOLEAN
| TEXT
| DOUBLE
| DATE
<col_distribution>-> NORMAL
| UNIFORM
<col_binary>
-> MODEL_EXISTENCE_ONLY
| NOT NULL
<col_content>
-> DISCRETE
| CONTINUOUS
| DISCRETIZED( [<disc_method> [, <numeric_const>]] )
| SEQUENCE_TIME
<disc_method>
-> AUTOMATIC
| EQUAL_AREAS
| THRESHOLDS
| CLUSTERS
<col_content_qual>-> ORDERED
| CYCLICAL
<col_qualif>
-> KEY
| PROBABILITY
| VARIANCE
| STDEV
| STDDEV
| PROBABILITY_VARIANCE
| PROBABILITY_STDEV
| PROBABILITY_STDDEV
| SUPPORT
<col_prediction> -> PREDICT
| PREDICT_ONLY
<relation_clause> -> <related_to_clause>
| <of_clause>
<related_to_clause>-> RELATED TO <identifier>
| RELATED TO KEY
<of_clause>
-> OF <identifier>
| OF KEY
4.2.2 INSERT
<insert>          -> <insert_att>
                   | <insert_reg>
<insert_att>      -> INSERT [INTO] <identifier>.COLUMN_VALUES ( <column_ref_list> ) <query>
<insert_reg>      -> INSERT [INTO] <identifier> ( <column_ref_list> ) <query>
<query>           -> <external_query>
                   | <shape>
<external_query>  -> OPENROWSET ( <string>, {<string>|<string>;<string>;<string>}, <string> )
<shape>           -> SHAPE { <query> } APPEND <append_list>
<append_list>     -> <append>
                   | <append_list> , <append_list>
<append>          -> ( { <query> } RELATE <relate_list> ) AS <identifier>
<relate_list>     -> <relate>
                   | <relate_list> , <relate>
<relate>          -> <column_ref> TO <column_ref>
4.2.3 SELECT
<column_ref_list>  -> <column_ref>
                    | <column_ref_list> , <column_ref>
<column_ref>       -> <identifier>
                    | <identifier>.<column_ref>
                    | <column_ref> ( <column_ref_list> )
                    | SKIP
                    | CLUSTER()
                    | $SUPPORT
                    | $VARIANCE
                    | $STDEV
                    | $STDDEV
                    | $PROBABILITY
                    | $PROBABILITY_VARIANCE
                    | $PROBABILITY_STDEV
                    | $PROBABILITY_STDDEV
                    | $DISTANCE
                    | PREDICT ( <column_ref> [, <pred_option_list>] )
                    | <column_ref> AS <identifier>
<pred_option_list> -> <pred_option>
                    | <pred_option_list> , <pred_option>
<pred_option>      -> EXCLUDE_NULL
                    | INCLUDE_NULL
                    | INPUT_ONLY
                    | EXCLUSIVE
                    | INCLUSIVE
                    | INCLUDE_STATISTICS
<select>           -> <pred_select>
                    | <model_select>
<pred_select>      -> SELECT [FLATTENED] <expression_list> FROM <identifier> [NATURAL]
                      PREDICTION JOIN <query> AS <identifier> [ON <on_list>] [<where_clause>]
                    | SELECT [FLATTENED] <expression_list> FROM <identifier> [NATURAL]
                      PREDICTION JOIN <expression> AS <identifier> [ON <on_list>] [<where_clause>]
<model_select>     -> SELECT [DISTINCT] <expression_list> FROM <identifier> [<where_clause>]
                    | SELECT [DISTINCT] <expression_list> FROM <identifier>.PMML
                    | SELECT [DISTINCT] <expression_list> FROM <identifier>.CONTENT [<where_clause>]
<expression_list> -> <expression>
| <expression_list> , <expression>
<expression>
-> <value>
| <column_ref>
| *
| <expression> + <expression>
| <expression> - <expression>
| <expression> * <expression>
| <expression> / <expression>
| -<expression>
| +<expression>
| ( <expression> )
| <expression> OR <expression>
| <expression> AND <expression>
| NOT <expression>
| <expression> = <expression>
| <expression> <> <expression>
| <expression> < <expression>
| <expression> <= <expression>
| <expression> > <expression>
| <expression> >= <expression>
| PREDICTSTDEV ( <column_ref> )
| PREDICTSTDDEV ( <column_ref> )
| PREDICTVARIANCE ( <column_ref> )
| PREDICTSUPPORT ( <column_ref> )
| PREDICTPROBABILITY ( <column_ref> )
| PREDICTPROBABILITYSTDEV ( <column_ref> )
| PREDICTPROBABILITYSTDDEV ( <column_ref> )
| PREDICTPROBABILITYVARIANCE ( <column_ref> )
| CLUSTERDISTANCE ( [<expression>] )
| CLUSTERPROBABILITY ( [<expression>] )
| PREDICTHISTOGRAM ( <column_ref> )
| TOPCOUNT ( <expression>, <column_ref>, <expression> )
| TOPSUM ( <expression>, <column_ref>, <expression> )
| TOPPERCENT ( <expression>, <column_ref>, <expression> )
| BOTTOMCOUNT ( <expression>, <column_ref>, <expression> )
| BOTTOMSUM ( <expression>, <column_ref>, <expression> )
| BOTTOMPERCENT ( <expression>, <column_ref>, <expression> )
| ( SELECT <expression_list> FROM <expression> <where_clause> )
| ( <singleton_list> )
| <expression> AS <identifier>
<singleton_list> -> <singleton>
| <singleton_list> UNION <singleton>
<singleton>
-> SELECT <expression_list>
<where_clause>
-> WHERE <expression>
<delete>
-> <delete_reg>
| <delete_content>
4.2.4 DELETE/DROP
<delete_reg>      -> DELETE * FROM <identifier>
<delete_content>  -> DELETE * FROM <identifier>.CONTENT
<drop>            -> DROP MINING MODEL <identifier>
4.2.5 RENAME
<rename>
-> RENAME MINING MODEL <identifier> TO <identifier>
4.2.6 MISCELLANEOUS
<value>       -> <numeric_const>
               | <string>
<identifier>  -> [([^\]]|(\]\]))*]
               | [a-zA-Z_][a-zA-Z_0-9]*
5 Appendix C: Functions
5.1 Predict
Syntax:
Predict(<scalar column reference>, option1, option2, …)
Predict(<table column reference>, option1, option2, …)
Applies To:
Either a scalar column or table column reference.
Return Type:
<scalar column reference> or <table column reference>, depending on which type of column
this function is applied to.
Description:
This is a general form of prediction function that modifies the behavior of a prediction (for
example, missing value control, association control, and so on). Possible options include
EXCLUDE_NULL (default), INCLUDE_NULL, INCLUSIVE, EXCLUSIVE (default),
INPUT_ONLY, and INCLUDE_STATISTICS.
Note INCLUSIVE, EXCLUSIVE, INPUT_ONLY, and INCLUDE_STATISTICS are
applicable only to a table column reference; EXCLUDE_NULL and INCLUDE_NULL
apply only to scalar-valued columns.
In most cases, the following shorthand will be used:
- [Gender] is shorthand for Predict([Gender], EXCLUDE_NULL).
- [Products Purchases] is shorthand for Predict([Products Purchases], EXCLUDE_NULL, EXCLUSIVE_ASSOCIATION).
Note The return type of this function is itself regarded as a column reference. This
means that this function can be used as an argument in other functions that take a
column reference as an argument (except the Predict function itself).
Passing INCLUDE_STATISTICS to a prediction on a TABLE-valued column will add the
metacolumns $Probability and $Support to the resulting table. These columns describe the
likelihood of existence for the associated nested table record.
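For example, the following sketch (column names follow the example model used throughout
this document) applies the function to a scalar column and to a TABLE column with different
options:

SELECT [Customer ID], Predict([Gender], INCLUDE_NULL),
Predict([Product Purchases], INCLUDE_STATISTICS, EXCLUSIVE) …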
5.2 PredictSupport
Syntax:
PredictSupport(<scalar column reference>)
Applies to:
Scalar column
Return Type:
Scalar value
Description:
This function returns the support value for the histogram entry that has the highest probability
(the top row in the histogram obtained by PredictHistogram(<column reference>)).
5.3 PredictVariance
Syntax:
PredictVariance(<scalar column reference>)
Applies to:
Scalar column
Return Type:
Scalar value
Description:
This function returns the variance value for the histogram entry that has the highest
probability (the top row in the histogram obtained by PredictHistogram(<column
reference>)).
5.4 PredictStdev
Syntax:
PredictStdev(<scalar column reference>)
Applies to:
Scalar column
Return Type:
Scalar value
Description:
This function returns the standard deviation for the histogram entry that has the highest
probability (the top row in the histogram obtained by PredictHistogram(<column
reference>)).
5.5 PredictProbability
Syntax:
PredictProbability(<scalar column reference>)
Applies to:
Scalar column
Return Type:
Scalar value
Description:
This function returns the probability for the histogram entry that has the highest probability
(the top row in the histogram obtained by PredictHistogram(<column reference>)).
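For example, the following sketch pairs the predicted value with the confidence of that
prediction (column names are from the example model used earlier):

SELECT [Customer ID], [Gender], PredictProbability([Gender]) AS [Gender Probability] …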
5.6 PredictProbabilityVariance
Syntax:
PredictProbabilityVariance(<scalar column reference>)
Applies to:
Scalar column
Return Type:
Scalar value
Description:
This function returns the variance of the probability for the histogram entry that has the
highest probability (the top row in the histogram obtained by PredictHistogram(<column
reference>)).
5.7 PredictProbabilityStdev
Syntax:
PredictProbabilityStdev(<scalar column reference>)
Applies to:
Scalar column
Return Type:
Scalar value
Description:
This function returns the standard deviation of the probability for the histogram entry that has
the highest probability (the top row in the histogram obtained by PredictHistogram(<column
reference>)).
5.8 Cluster
Syntax:
Cluster
Applies to:
This function does not require any parameter, but it can be used only when the underlying
DMM supports clustering.
Return Type:
This function returns a scalar value of cluster identifier. However, if this function is used as
an argument of other functions, it must be regarded as a <cluster column reference>.
Description:
This function returns the identifier of the cluster to which the input case most probably
belongs. It also can be used as a <cluster column reference> for the PredictHistogram
function.
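As an illustrative sketch, assuming a clustering model named [Customer Segments] (a hypothetical name) and an input mapping on demographic columns:

SELECT t.[Customer ID],
       Cluster AS [Segment ID]
FROM [Customer Segments]
PREDICTION JOIN <source data query> AS t
ON [Customer Segments].[Gender] = t.[Gender] AND
   [Customer Segments].[Age] = t.[Age]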
5.9 ClusterDistance
Syntax:
ClusterDistance([<ClusterID expression>])
Applies to:
This function can be used only when the underlying DMM supports clustering.
Return Type:
Scalar value.
Description:
This function returns the distance between the input case and the center of the cluster that has
the highest probability. If <ClusterID expression> is given, the cluster is identified by the
evaluation of the expression.
5.10 ClusterProbability
Syntax:
ClusterProbability([<ClusterID expression>])
Applies to:
This function can be used only when the underlying DMM supports clustering.
Return Type:
Scalar value.
Description:
This function returns the probability that the input case belongs to the cluster that has the
highest probability. If <ClusterID expression> is given, the cluster is identified by the
evaluation of the expression.
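The following sketch combines the cluster functions. The no-argument forms refer to the most probable cluster; an explicit <ClusterID expression> could be supplied instead. The model name and join mapping are assumptions:

SELECT Cluster AS [Segment ID],
       ClusterProbability() AS [Segment Probability],
       ClusterDistance() AS [Distance To Center]
FROM [Customer Segments]
PREDICTION JOIN <source data query> AS t
ON [Customer Segments].[Gender] = t.[Gender] AND
   [Customer Segments].[Age] = t.[Age]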
5.11 PredictHistogram
Syntax:
PredictHistogram(<scalar column reference>)
PredictHistogram(<cluster column reference>)
Applies to:
A scalar or cluster column reference.
Return Type:
<table expression>
Description:
This function returns a table representing a histogram for prediction of the given column.
A histogram adds statistics columns to the prediction. For a <scalar column reference>, the
histogram consists of the following seven columns:
- The column being predicted
- $Support
- $Variance
- $Stdev (standard deviation)
- $Probability
- $ProbabilityVariance
- $ProbabilityStdev

A histogram for a <cluster column reference> consists of the following columns:
- Cluster, representing the cluster identifier
- $Distance
- $Probability
- $Support
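For example, the histogram can be returned as a nested table column of the prediction result (the model name, column, and join mapping are assumptions):

SELECT t.[Customer ID],
       PredictHistogram([Age]) AS [Age Histogram]
FROM [Age Prediction]
PREDICTION JOIN <source data query> AS t
ON [Age Prediction].[Gender] = t.[Gender]

Because the result is a <table expression>, it can be further restricted with a sub-SELECT or with the TopCount, TopSum, and TopPercent functions described in the following sections.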
5.12 TopCount
Syntax:
TopCount(<table expression>, <rank expression>, <n-items>)
Applies to:
A table-returning expression that includes <table column reference> and functions that return
a table.
Return Type:
<table expr>
Description:
This function returns the first <n-items> rows in a decreasing order of <rank expression>.
As an example, a table expression (for example, a sub-SELECT) may contain the following
table:

(SELECT [Product Name], $Probability AS [Probability] FROM Predict([Products Purchases],
INCLUDE_STATISTICS))

Product Name        Probability
Apples              0.4
Kiwi                0.1
Oranges             0.5
Lemons              0.2

If so, the function TopCount((SELECT …), [Probability], 2) returns the following table:

Product Name        Probability
Oranges             0.5
Apples              0.4
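The following sketch shows how TopCount might be embedded in a complete prediction query to return the two most likely products per case; the model name, column names, and join mapping are assumptions:

SELECT t.[Customer ID],
       TopCount((SELECT [Product Name], $Probability AS [Probability]
                 FROM Predict([Products Purchases], INCLUDE_STATISTICS)),
                [Probability], 2) AS [Top Recommendations]
FROM [Age Prediction]
PREDICTION JOIN <source data query> AS t
ON [Age Prediction].[Gender] = t.[Gender]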
5.13 TopSum
Syntax:
TopSum(<table expression>, <rank expression>, <sum>)
Applies to:
A table-returning expression that includes <table column reference> and functions that
return a table.
Return Type:
<table expr>
Description:
This function returns the first N rows in a decreasing order of <rank expression>,
such that the sum of the <rank expression> values is at least <sum>. TopSum returns the
smallest number of elements possible while still meeting that criterion. For example, a table
column named [Products] might contain the following table:

Product Name        Unit Sales
Apples              1200
Kiwi                500
Oranges             1500
Lemons              750

If so, TopSum([Products], [Unit Sales], 2500) would return the following table:

Product Name        Unit Sales
Oranges             1500
Apples              1200
5.14 TopPercent
Syntax:
TopPercent(<table expression>, <rank expression>, <percent>)
Applies to:
A table-returning expression that includes <table column reference> and functions that return
a table.
Return Type:
<table expr>
Description:
This function returns the first N rows in a decreasing order of <rank expression>, such that
the sum of the <rank expression> values is at least the given percentage of the total
sum of <rank expression> values. TopPercent returns the smallest number of
elements possible while still meeting that criterion.
Using a table column named [Products], as shown here:

Product Name        Unit Sales
Apples              30
Kiwi                10
Oranges             40
Lemons              20

the function TopPercent([Products], [Unit Sales], 60) would return the following table:

Product Name        Unit Sales
Oranges             40
Apples              30

Note that Apples were selected instead of Lemons.
5.15 Sub-SELECT
Syntax:
(SELECT <SELECT-expressions> FROM <table expression> [WHERE <WHERE-clause>])
Applies to:
A table-returning expression that includes <table column reference> and functions that
return a table.
Return Type:
<table expr>
Description:
A sub-SELECT selects columns (generally speaking, expressions containing columns) from
the given table-returning expression. Users also can specify a WHERE clause to filter out
undesired rows.
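For example, the following sketch keeps only the more likely rows of a predicted nested table. This is a sketch only; the model and column names are assumed, and the use of $Probability in the WHERE clause presumes the provider exposes that metacolumn to the filter:

SELECT t.[Customer ID],
       (SELECT [Product Name], $Probability AS [Probability]
        FROM Predict([Products Purchases], INCLUDE_STATISTICS)
        WHERE $Probability > 0.3) AS [Likely Purchases]
FROM [Age Prediction]
PREDICTION JOIN <source data query> AS t
ON [Age Prediction].[Gender] = t.[Gender]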
5.16 RangeMid
Syntax:
RangeMid(<scalar column reference>)
Applies to:
Discretized scalar columns
Return Type:
Scalar value
Description:
This function returns the midpoint of the predicted bucket that was discovered for a
discretized column.
5.17 RangeMin
Syntax:
RangeMin(<scalar column reference>)
Applies To:
Discretized scalar columns
Return Type:
Scalar value
Description:
This function returns the lower end of the predicted bucket that was discovered for a
discretized column.
5.18 RangeMax
Syntax:
RangeMax(<scalar column reference>)
Applies To:
Discretized scalar columns
Return Type:
Scalar value
Description:
This function returns the upper end of the predicted bucket that was discovered for a
discretized column.
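The three range functions are often used together. The following sketch assumes a hypothetical model [Income Model] with a discretized predictable column [Income]; none of these names come from the specification:

SELECT Predict([Income]) AS [Income Bucket],
       RangeMin([Income]) AS [Bucket Lower Bound],
       RangeMid([Income]) AS [Bucket Midpoint],
       RangeMax([Income]) AS [Bucket Upper Bound]
FROM [Income Model]
PREDICTION JOIN <source data query> AS t
ON [Income Model].[Age] = t.[Age]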
5.19 PredictScore
Syntax:
PredictScore(<scalar column reference>)
PredictScore(<table column reference>)
Applies To:
Predictable columns
Return Type:
Scalar value
Description:
This function returns the prediction score of the specified column.
5.20 PredictNodeId
Syntax:
PredictNodeId(<scalar column reference>)
Applies To:
Predictable columns (except table columns or predictable columns in nested table).
Return Type:
Scalar value
Description:
This function returns the node id of the tree leaf node in which the case is classified.
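For example, against a decision tree model such as CreditTree1 (defined in Appendix D), the leaf node and score for each case might be retrieved as follows; the join mapping and <source data query> are assumptions:

SELECT Predict([Credit]) AS [Predicted Credit],
       PredictScore([Credit]) AS [Score],
       PredictNodeId([Credit]) AS [Leaf Node]
FROM CreditTree1
PREDICTION JOIN <source data query> AS t
ON CreditTree1.[Education] = t.[Education] AND
   CreditTree1.[Age] = t.[Age] AND
   CreditTree1.[Pay] = t.[Pay]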
6 Appendix D: XML Format for Data
Mining Models
DMMs are represented in XML using a variation of the Predictive Model Markup Language
(PMML) version 1.0. The following are a few of the additions to PMML 1.0:
- Support for the nested table nature of a DMM through nested data dictionaries.
- The concepts of discretized, ordered, and cyclical model variables, beyond the simple categorical and continuous types.
- Support for key columns in nested dictionaries that list instances as categories.
- Support for Relation type columns as "hierarchy parents."
- All model variables can have a missing state described, even those with a continuous domain.
- The data dictionary is no longer a complete list of all attributes; rather, it is an "attribute factory." Any attribute reference outside the data dictionary must "instantiate" a model variable by locating it in the data dictionary hierarchy.
- Because of the previous point, it is no longer sufficient to reference a model variable (called an attribute) as an attribute (in XML terms) of a tag. Instead, such references must be properties (nested tags) that describe the variable instance.
- Statistics on the global distribution of the model variables have been separated out into a new section.
It is expected that most of these changes will simply become part of PMML version 1.1.
6.1 DTD for the DMM Extended PMML
<?xml encoding="UTF-8"?>
<!ENTITY % predicates
"(predicate | compound-predicate | true | false)"
>
<!ENTITY % NUMBER "NMTOKEN">
<!-- =================================================================
Overall structure
=================================================================
-->
<!-- Extended Feature: Allow a pmml document to contain segment models and global
statistics -->
<!ELEMENT pmml (head?, statements?, data-dictionary, global-statistics?, (tree-model |
segment-model | regression-model)+)>
<!ATTLIST pmml
version CDATA #REQUIRED
name CDATA #IMPLIED
GUID CDATA #IMPLIED
Modified-time CDATA #IMPLIED
Creation-time CDATA #IMPLIED
>
<!-- =================================================================
Header Information
=================================================================
-->
<!-- Extended Feature: Allow a head to contain a datasrc -->
<!ELEMENT head (application?, annotation*, timestamp?, datasrc?)>
<!ATTLIST head
copyright CDATA #REQUIRED
description CDATA #IMPLIED
>
<!-- a timestamp in the format YYYY-MM-DD hh:mm:ss GMT +/- xx:xx -->
<!ELEMENT timestamp (#PCDATA)>
<!-- describes the software application that generated the PMML-->
<!ELEMENT application EMPTY>
<!ATTLIST application
name CDATA #REQUIRED
version CDATA #IMPLIED
>
<!ELEMENT annotation (#PCDATA)>
<!-- Extended Feature: Define the datasrc ELEMENT. -->
<!ELEMENT datasrc EMPTY>
<!ATTLIST datasrc
src CDATA #REQUIRED
query CDATA #REQUIRED
>
<!-- =================================================================
Statements
=================================================================
-->
<!-- Extended Feature:
1. Allows models to save the creation or other statements.
-->
<!ELEMENT statements (statement+)>
<!ELEMENT statement EMPTY>
<!ATTLIST statement
type CDATA #REQUIRED
value CDATA #REQUIRED
>
<!-- =================================================================
Data Dictionary
=================================================================
-->
<!-- Extended Feature:
1. Allow a data-dictionary to contain another data-dictionary to support nested tables.
2. Allow a data-dictionary to contain the set of keys.
3. Allow a data-dictionary to contain a column of hierarchy parents to support hierarchical
attributes.
4. Allow a data-dictionary to contain a compound-category that enumerates all of the
possible keys combinations in the event that there are multiple keys for the table.
5. Added a new model variable type 'categorical-continuous' to represent pre-discretized
continuous data.
-->
<!ELEMENT data-dictionary (compound-categories? , (categorical | ordinal | continuous |
categorical-continuous | data-dictionary | key | hierarchy-parent)+)>
<!-- Extended Feature: Allow a data-dictionary to have a name. -->
<!ATTLIST data-dictionary
name CDATA #IMPLIED
>
<!-- Extended Feature: Define the key ELEMENT -->
<!ELEMENT key (category+)>
<!ATTLIST key
name CDATA #REQUIRED
ispredict ( true | false ) "false"
isinput ( true | false ) "false"
datatype CDATA #IMPLIED
>
<!-- Extended Feature: Define the hierarchy-parent ELEMENT -->
<!ELEMENT hierarchy-parent ((relates-to | category)+)>
<!ATTLIST hierarchy-parent
name CDATA #REQUIRED
ispredict ( true | false ) "false"
isinput ( true | false ) "false"
datatype CDATA #IMPLIED
>
<!-- Extended Feature: Define the relates-to ELEMENT. Tells you which PROPERTY of the data-dictionary this hierarchical parent relates to. -->
<!ELEMENT relates-to EMPTY>
<!ATTLIST relates-to
name CDATA #REQUIRED
>
<!-- Extended Feature: Allow for the additional ATTRIBUTEs of the categorical ELEMENT. -->
<!ELEMENT categorical (category+)>
<!ATTLIST categorical
name CDATA #REQUIRED
ispredict ( true | false ) "false"
isinput ( true | false ) "false"
datatype CDATA #IMPLIED
>
<!-- Extended Feature:
1. Allow the category ELEMENT to contain the parent ELEMENT, which specifies the
hierarchical parent(s).
2. Relax the value ATTRIBUTE to be an optional ATTRIBUTE. A missing state does not
need a value.
3. Added "uninformative" to the possible states of the missing ATTRIBUTE. A value can be
present, missing at random, or missing informative. missing = "true" is equivalent to
missing informative.
-->
<!ELEMENT category (parent*)>
<!ATTLIST category
value CDATA #IMPLIED
display-value CDATA #IMPLIED
proportion CDATA #IMPLIED
missing (true | false | uninformative) "false"
>
<!-- Extended Feature: Define the parent ELEMENT to specify the hierarchical parent of a
state. -->
<!ELEMENT parent EMPTY>
<!ATTLIST parent
name CDATA #REQUIRED
value CDATA #REQUIRED
>
<!-- Extended Feature: Allow for additional ATTRIBUTES cyclical and timesequence for an
ordinal attribute. -->
<!ELEMENT ordinal (order+)>
<!ATTLIST ordinal
name CDATA #REQUIRED
cyclical ( true | false ) "false"
timesequence ( true | false ) "false"
>
<!-- Extended Feature:
1. Relax the value ATTRIBUTE to be an optional ATTRIBUTE. A missing state does not
need a value.
2. Added "uninformative" to the possible states of the missing ATTRIBUTE. A value can be
present, missing at random, or missing informative. missing = "true" is equivalent to
missing informative.
3. Relax the rank ATTRIBUTE to be an optional ATTRIBUTE. The states are implied to be
ordered if rank is not specified for any of them.
-->
<!ELEMENT order EMPTY>
<!ATTLIST order
value CDATA #IMPLIED
display-value CDATA #IMPLIED
rank CDATA #IMPLIED
proportion CDATA #IMPLIED
missing (true | false | uninformative) "false"
>
<!-- The predicates indicate the values that represent missing values -->
<!-- Extended Feature: Allow for a missing = true or uninformative category state for a
continuous attribute. -->
<!ELEMENT continuous (category?, (%predicates;)*)>
<!ATTLIST continuous
name CDATA #REQUIRED
minimum CDATA #IMPLIED
maximum CDATA #IMPLIED
mean CDATA #IMPLIED
median CDATA #IMPLIED
standard-deviation CDATA #IMPLIED
inter-quartile-range CDATA #IMPLIED
ispredict ( true | false ) "false"
isinput ( true | false ) "false"
datatype CDATA #IMPLIED
>
<!-- Extended Feature: new type for prediscretized data. -->
<!ELEMENT categorical-continuous (category?, (%predicates;)*)>
<!ATTLIST categorical-continuous
name CDATA #REQUIRED
ispredict ( true | false ) "false"
isinput ( true | false ) "false"
datatype CDATA #IMPLIED
>
<!-- Extended Feature:
1. Define the compound-categories ELEMENT, which contains a set of compound-category
ELEMENTs that list all the valid combinations of multiple keys.
2. Define the compound-category ELEMENT, which contains a combination of valid keys and its
hierarchical parents.
3. Define the categoryref ELEMENT, which refers to an existing key.
-->
<!ELEMENT compound-categories (compound-category+)>
<!ELEMENT compound-category ( categoryref | parent )+>
<!ELEMENT categoryref EMPTY>
<!ATTLIST categoryref
name CDATA #REQUIRED
value CDATA #REQUIRED
>
<!-- Extended Feature: Attribute References
1. Define the simple-attribute ELEMENT that refers to a simple attribute, not nested in a
data-dictionary.
2. Define the compound-attribute ELEMENT that refers to an attribute formed by taking
a Cartesian product of a key (or multiple keys) and a value attribute within a nested data-dictionary.
3. Define the derived-attribute ELEMENT, which specifies a list of existing attributes. It is
referred to by index.
4. Define the key-val ELEMENT that "instantiates" the compound attributes.
-->
<!ELEMENT simple-attribute EMPTY>
<!ATTLIST simple-attribute
name CDATA #REQUIRED
>
<!-- The name is the name of the nesting data dictionary. -->
<!ELEMENT compound-attribute (key-val+ , (simple-attribute | compound-attribute)?)>
<!ATTLIST compound-attribute
name CDATA #REQUIRED
>
<!ELEMENT derived-attribute ((simple-attribute | compound-attribute)+)>
<!ATTLIST derived-attribute
index CDATA #REQUIRED
>
<!ELEMENT key-val EMPTY>
<!ATTLIST key-val
name CDATA #REQUIRED
value CDATA #REQUIRED
>
<!-- Extended Feature: Define the %attribute ENTITY -->
<!ENTITY % attribute
"(simple-attribute | compound-attribute | derived-attribute)"
>
<!-- Extended Feature:
1. Define the global-statistics ELEMENT, which contains a list of data-distribution
ELEMENTs.
2. Define the data-distribution ELEMENT, which contains the sufficient statistics for a given
attribute.
3. Define the state ELEMENT that specifies the statistics of a given state.
-->
<!-- =================================================================
Global Statistics
=================================================================
-->
<!ELEMENT global-statistics (data-distribution+)>
<!ELEMENT data-distribution (%attribute;, state+)>
<!ELEMENT state EMPTY>
<!ATTLIST state
value CDATA #IMPLIED
missing ( true | false | uninformative) "false"
minimum CDATA #IMPLIED
maximum CDATA #IMPLIED
mean CDATA #IMPLIED
median CDATA #IMPLIED
standard-deviation CDATA #IMPLIED
inter-quartile-range CDATA #IMPLIED
support CDATA #IMPLIED
proportion CDATA #IMPLIED
>
<!-- =================================================================
General Tree Model
=================================================================
-->
<!-- Extended Feature:
1. Allow the tree-model to contain more than one tree.
2. Relax the criteria so that a tree does not need a model-ID.
-->
<!ELEMENT tree-model (node+)>
<!ATTLIST tree-model
model-id CDATA #IMPLIED
>
<!-- =================================================================
The root node of a model should contain a true predicate.
=================================================================
-->
<!-- Extended Feature:
1. Allows the node to contain the targets ELEMENT that specifies the target of the
prediction tree.
2. The root node does not need any arriving predicates and contains all of the pertinent
information for that tree.
-->
<!ELEMENT node (targets?, (%predicates;)?, info*, node*, score-distribution*, data-distribution*)>
<!ATTLIST node
score CDATA #IMPLIED
>
<!-- Extended Feature:
Define the targets ELEMENT.
-->
<!ELEMENT targets ((%attribute;)+)>
<!ELEMENT score-distribution EMPTY>
<!ATTLIST score-distribution
label CDATA #REQUIRED
value CDATA #REQUIRED
>
<!ELEMENT info EMPTY>
<!ATTLIST info
name CDATA #REQUIRED
value CDATA #REQUIRED
>
<!ELEMENT compound-predicate (%predicates;, (%predicates;)+)>
<!ATTLIST compound-predicate
bool-op (or | and | xor | cascade) #REQUIRED
>
<!-- Extended Feature: Allow specification of the attribute using the attribute elements
instead of the flat name. -->
<!ELEMENT predicate (%attribute;)>
<!ATTLIST predicate
attribute CDATA #IMPLIED
op (eq | ne | lt | le | gt | ge) #REQUIRED
value CDATA #REQUIRED
>
<!ELEMENT true EMPTY>
<!ELEMENT false EMPTY>
<!-- Extended Feature:
1. Define the segment-model ELEMENT.
2. The segment-model contains a list of nodes, which are the cluster points.
3. The cluster points contain a list of data-distribution for all of the attributes.
-->
<!-- =================================================================
Segment Model
=================================================================
-->
<!ELEMENT segment-model (info*, node+)>
<!-- =================================================================
Regression Model
=================================================================
-->
<!ELEMENT regression-model (factor-list?, covariate-list?,
predictor-to-parameter-correlation-matrix?,
parameter-table)>
<!ATTLIST regression-model
model-id CDATA #REQUIRED
response-variable-name CDATA #REQUIRED
number-parameters %NUMBER; #REQUIRED
model-type (regression | general-linear | log-linear | multinomial-logistic) #REQUIRED
verbose-model-specification CDATA #IMPLIED
>
<!ELEMENT factor-list (var-name+)>
<!ELEMENT covariate-list (var-name+)>
<!ELEMENT var-name (#PCDATA)>
<!ELEMENT predictor-to-parameter-correlation-matrix (predictor-to-parameter-cell+)>
<!ELEMENT predictor-to-parameter-cell (#PCDATA)>
<!ATTLIST predictor-to-parameter-cell
predictor-name CDATA #REQUIRED
parameter-name CDATA #REQUIRED
>
<!ELEMENT parameter-table (parameter-cell+)>
<!ELEMENT parameter-cell EMPTY>
<!ATTLIST parameter-cell
target-category CDATA #REQUIRED
parameter-name CDATA #REQUIRED
beta %NUMBER; #REQUIRED
std-error %NUMBER; #IMPLIED
df %NUMBER; #IMPLIED
>
6.2 Example: Tree Model to Predict Credit Risk
<?xml version="1.0"?>
<pmml>
<statements>
<statement type = "CREATE" value = "Create Mining Model CreditTree1
( ID long key,
Credit text discrete predict,
Education text discrete,
Age text discrete,
Pay text discrete
) using microsoft_decision_trees
"/>
<statement type = "TRAIN" value = "Insert Into CreditTree1
( ID, Credit, Education, Age, Pay)
OPENROWSET("Microsoft.Jet.OLEDB.4.0",
"data source=w:\test\demozero\credit.mdb",
"SELECT ID, Credit, Education, Age , Pay FROM CreditTraining"
)
"/>
</statements>
<data-dictionary name = "CreditTree1" GUID = "{707D31A7-D42A-11D3-8AEF-00C04F68DDCA}">
<key name = "ID" datatype = "LONG"/>
<categorical name = "Credit" isinput = "true" ispredict = "true" datatype = "TEXT">
<category missing = "true"/>
<category value = "Bad"/>
<category value = "Good"/>
</categorical>
<categorical name = "Education" isinput = "true" datatype = "TEXT">
<category missing = "true"/>
<category value = "Bachelor"/>
<category value = "High School"/>
<category value = "Graduate"/>
<category value = "Partial College"/>
<category value = "Partial High School"/>
</categorical>
<categorical name = "Age" isinput = "true" datatype = "TEXT">
<category missing = "true"/>
<category value = "Middle Age"/>
<category value = "Young"/>
<category value = "Old"/>
</categorical>
<categorical name = "Pay" isinput = "true" datatype = "TEXT">
<category missing = "true"/>
<category value = "Weekly pay"/>
<category value = "Monthly salary"/>
</categorical>
</data-dictionary>
<global-statistics>
<data-distribution>
<simple-attribute name = "Credit"/>
<state missing = "true" support = "0."/>
<state value = "Bad" support = "114."/>
<state value = "Good" support = "109."/>
</data-distribution>
<data-distribution>
<simple-attribute name = "Education"/>
<state missing = "true" support = "0."/>
<state value = "Bachelor" support = "109."/>
<state value = "High School" support = "24."/>
<state value = "Graduate" support = "28."/>
<state value = "Partial College" support = "34."/>
<state value = "Partial High School" support = "28."/>
</data-distribution>
<data-distribution>
<simple-attribute name = "Age"/>
<state missing = "true" support = "0."/>
<state value = "Middle Age" support = "55."/>
<state value = "Young" support = "126."/>
<state value = "Old" support = "42."/>
</data-distribution>
<data-distribution>
<simple-attribute name = "Pay"/>
<state missing = "true" support = "0."/>
<state value = "Weekly pay" support = "114."/>
<state value = "Monthly salary" support = "109."/>
</data-distribution>
</global-statistics>
<tree-model>
<info name = "Scorer" value = "4"/>
<info name = "Splitter" value = "1"/>
<info name = "Minimum Leaf Cases" value = "10"/>
<info name = "Number of ESS" value = "16"/>
<info name = "Complexity Penalty" value = "0.80000000000000004"/>
<node>
<targets>
<target>
<simple-attribute name = "Credit"/>
</target>
</targets>
<node missing = "false">
<predicate op = "eq" value = "Weekly pay">
<simple-attribute name = "Pay"/>
</predicate>
<node missing = "false">
<predicate op = "eq" value = "Young">
<simple-attribute name = "Age"/>
</predicate>
<node missing = "false">
<predicate op = "eq" value = "High School">
<simple-attribute name = "Education"/>
</predicate>
<data-distribution>
<simple-attribute name = "Credit"/>
<state missing = "true" support = "0."/>
<state value = "Bad" support = "24."/>
<state value = "Good" support = "0."/>
</data-distribution>
</node>
<node missing = "false">
<predicate op = "ne" value = "High School">
<simple-attribute name = "Education"/>
</predicate>
<data-distribution>
<simple-attribute name = "Credit"/>
<state missing = "true" support = "0."/>
<state value = "Bad" support = "60."/>
<state value = "Good" support = "9."/>
</data-distribution>
</node>
</node>
<node missing = "false">
<predicate op = "ne" value = "Young">
<simple-attribute name = "Age"/>
</predicate>
<data-distribution>
<simple-attribute name = "Credit"/>
<state missing = "true" support = "0."/>
<state value = "Bad" support = "13."/>
<state value = "Good" support = "8."/>
</data-distribution>
</node>
</node>
<node missing = "false">
<predicate op = "ne" value = "Weekly pay">
<simple-attribute name = "Pay"/>
</predicate>
<node missing = "false">
<predicate op = "eq" value = "Young">
<simple-attribute name = "Age"/>
</predicate>
<data-distribution>
<simple-attribute name = "Credit"/>
<state missing = "true" support = "0."/>
<state value = "Bad" support = "16."/>
<state value = "Good" support = "17."/>
</data-distribution>
</node>
<node missing = "false">
<predicate op = "ne" value = "Young">
<simple-attribute name = "Age"/>
</predicate>
<node missing = "false">
<predicate op = "eq" value = "Bachelor">
<simple-attribute name = "Education"/>
</predicate>
<data-distribution>
<simple-attribute name = "Credit"/>
<state missing = "true" support = "0."/>
<state value = "Bad" support = "1."/>
<state value = "Good" support = "52."/>
</data-distribution>
</node>
<node missing = "false">
<predicate op = "ne" value = "Bachelor">
<simple-attribute name = "Education"/>
</predicate>
<data-distribution>
<simple-attribute name = "Credit"/>
<state missing = "true" support = "0."/>
<state value = "Bad" support = "0."/>
<state value = "Good" support = "23."/>
</data-distribution>
</node>
</node>
</node>
</node>
</tree-model>
</pmml>
7 Appendix E: Provider Support for
SHAPE Syntax
The complete syntax of the SHAPE command is documented in the Microsoft Data Access
Component SDK. This appendix describes the subset of that syntax needed to shape multiple
result sets into a single nested table. Data mining providers should provide support for this
subset, at a minimum. Following is the basic syntax:
SHAPE {<master query>}
APPEND ({ <child table query> }
RELATE <master column> TO <child column>)
AS < column table name>
[
APPEND ({ <child table query> }
RELATE <master column> TO <child column>)
AS < column table name>
…
]
The SHAPE statement allows the addition of table columns to a master query by specifying
the child table rows and the way to match each row in the <master query> to its child
rows in the <child table query>.
Using this syntax, you can read all of the data needed for the cases from multiple queries
and shape them into a single table that is fed into the DMM.
The following example illustrates how this is done:
INSERT INTO [Age Prediction]
(
[Customer ID], [Gender], [Age], [Age Probability],
[Product Purchases] (SKIP, [Product Name], [Product Type], [Quantity]),
[Car Ownership] (SKIP, [Car Name], [Car Probability])
)
SHAPE { select [Customer ID], [Gender], [Age], [Age Probability]
from [Customers] order by [Customer ID]}
APPEND ( {select [CustID], [Product Name], [Product Type], [Quantity]
from [Customer Product Sales] order by [CustID] }
RELATE [Customer ID] TO [CustID])
AS [Product Purchases],
( {select [CustID], [Car Name], [Probability]
from [Customer Cars] order by [CustID] }
RELATE [Customer ID] TO [CustID])
AS [Car Ownership]
Following are important notes:
- The SHAPE statement has a rich syntax, and DM providers are encouraged to support as much of it as possible. At a minimum, DM providers should support the syntax described in this appendix.
- The column binding between the target DMM and the source query is done by column order, as is standard with the INSERT INTO statement.
- Table columns ("Product Purchases" and "Car Ownership") are listed in the source columns, although they are mapped to whole tables and not to single columns.
- The columns in the child query used for the relation (in the RELATE clause) are skipped by using the SKIP keyword in the column map and are not mapped into any of the columns contained in the target table column.
- A DM provider may (and usually will) mandate that the relation columns in the child queries be ordered the same as the key column in the master query.
8 Appendix F: Provider Support for
OPENROWSET Syntax
The complete documentation of the OPENROWSET command is found in the Microsoft SQL
Server® Programmer's Toolkit. This appendix provides an abbreviated version of that documentation. Data
mining providers should provide support for OPENROWSET to be used for the <source data
query> in INSERT INTO and PREDICT commands.
OPENROWSET('provider_name'
{
'datasource';'user_id';'password'
| 'provider_string'
},
{
'query'
})
'provider_name'
A character string that represents the friendly name of the OLE DB provider as
specified in the registry. provider_name has no default value.
'datasource'
A string constant that corresponds to a particular OLE DB data source object.
datasource is the DBPROP_INIT_DATASOURCE property passed to the provider's
IDBProperties interface to initialize the provider. Typically, this string includes the
name of the database file, the name of a database server, or a name that the provider
understands to locate the database(s).
'user_id'
A string constant that is the user name passed to the specified OLE DB provider.
user_id specifies the security context for the connection and is passed in as the
DBPROP_AUTH_USERID property to initialize the provider.
'password'
A string constant that is the user password passed to the OLE DB provider. password
is passed in as the DBPROP_AUTH_PASSWORD property when initializing the
provider.
'provider_string'
A provider-specific connection string that is passed in as the
DBPROP_INIT_PROVIDERSTRING property to initialize the OLE DB provider.
provider_string typically encapsulates all the connection information needed to
initialize the provider.
'query'
A string constant that is sent to and executed by the provider. For more information,
see SQL Server OLE DB Programmer's Reference.
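For illustration, the following sketch uses the 'datasource';'user_id';'password' form to populate the CreditTree1 model from Appendix D. The provider name, server, login, and table names are placeholders, not part of this specification:

INSERT INTO CreditTree1
( ID, Credit, Education, Age, Pay )
OPENROWSET('SQLOLEDB',
           'MyServer';'MyUserId';'MyPassword',
           'SELECT ID, Credit, Education, Age, Pay FROM CreditTraining')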
9 Appendix G: Support for Other
Data Mining Algorithms
Although most examples in this document are based on decision tree and clustering
algorithms, the purpose of the OLE DB for Data Mining specification is to provide a data
mining standard that supports all data mining algorithms. PMML is adopted to present the
content of the different algorithms; this information is stored in the content schema rowset
after the model is trained. In this appendix, support for the association and regression
algorithms is illustrated, based on the syntax defined in this document.
9.1 Support for Association Algorithm
Association is one of the most popular data mining algorithms. It can be applied to market
basket analysis, cross-selling, Web site mining, and so forth. The typical problem the
association algorithm solves is: given a transaction table of products that customers have
bought, which items does a customer tend to buy together?
Suppose there are two tables: Transaction and Purchase. The Transaction table stores
information about a transaction, such as transaction ID, time, store, and so on. The Purchase
table stores the purchased products for each transaction.
The following statement creates a data mining model, based on an association algorithm, that
finds products that sell together. The model is interested only in rules with at least
five items.
Create Mining Model MyAssociationModel (
Transaction_id long key,
[Product purchases] table predict (
[Product Name] text key
)
)
Using [My Association Algorithm] (Minimum_size = 5)
Training an association model is exactly the same as training a tree model or a clustering
model. The results of the training are stored in the MINING_MODEL_CONTENT schema
rowset. In the content schema rowset, there is a column called Rule, which stores the PMML
representation of an association rule.
To get all the association rules discovered by the algorithm, run the following statement:
Select * from MyAssociationModel.content
This returns the content schema rowset that contains all the rules. It is also possible to search
for some particular rules—for example, all the products associated with "Milk."
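Once the model is trained, a prediction query might recommend additional products for a new transaction. The following is only a sketch; the shaped <source data query>, its nested alias [Purchase], and the column mappings are assumptions:

SELECT t.[Transaction_id],
       TopCount((SELECT [Product Name], $Probability AS [Probability]
                 FROM Predict([Product purchases], INCLUDE_STATISTICS)),
                [Probability], 3) AS [Suggested Products]
FROM MyAssociationModel
PREDICTION JOIN <source data query> AS t
ON MyAssociationModel.[Product purchases].[Product Name] = t.[Purchase].[Product Name]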
9.2 Support for Regression Algorithm
Regression is another popular data mining algorithm. It is used to find the relationship
between a response variable and several possible predictor variables through a mathematical
formula. There are several regression methods, such as linear regression, logistic
regression, and nonlinear regression.
A linear regression equation is usually written as follows:
Y = a + bX + e
where
Y is the dependent variable
a is the intercept
b is the slope or regression coefficient
X is the independent variable
e is the error term
Suppose there is a loan table containing customer demographic information and the level of
risk of each loan. By using a regression algorithm, the following mining model predicts loan
risk level based on age, income, homeowner, and marital status.
Create Mining Model MyRegressionModel (
Customer_id long key,
Age long continuous,
Homeowner boolean discrete,
Marital_status boolean discrete,
Loan_risk_level continuous predict
)
Using [My Regression Algorithm]
Training a regression model is exactly the same as training a tree model or a clustering model.
The values of the intercept, the regression coefficients, and the error term are stored in the
MINING_MODEL_CONTENT schema rowset, in the Rule column, in PMML format.
The following statement returns all the coefficients of regression:
Select * from MyRegressionModel.content
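A prediction query against the regression model follows the same pattern as for the other models. The following is a sketch; the <source data query> and the join mapping are assumptions:

SELECT t.[Customer_id],
       Predict([Loan_risk_level]) AS [Predicted Risk],
       PredictStdev([Loan_risk_level]) AS [Risk Stdev]
FROM MyRegressionModel
PREDICTION JOIN <source data query> AS t
ON MyRegressionModel.Age = t.Age AND
   MyRegressionModel.Homeowner = t.Homeowner AND
   MyRegressionModel.Marital_status = t.Marital_status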
Copyright
This is a preliminary document and may be changed substantially prior to final commercial release. This document is provided for
informational purposes only and Microsoft makes no warranties, either express or implied, in this document. Information in this document,
including URL and other Internet Web site references, is subject to change without notice. The entire risk of the use or the results of the use
of this document remains with the user. Unless otherwise noted, the example companies, organizations, products, people and events
depicted herein are fictitious and no association with any real company, organization, product, person or event is intended or should be
inferred. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part
of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic,
mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this
document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you
any license to these patents, trademarks, copyrights, or other intellectual property.
© 2000 Microsoft Corporation. All rights reserved.
Microsoft, MS-DOS, Windows, Windows NT, SQL Server, and Visual C++ are either registered trademarks or trademarks of Microsoft
Corporation in the United States and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective owners.