OLE DB for Data Mining
Specification
Version 1.0
Microsoft Corporation
July 2000
Contents
1 Introduction to OLE DB for Data Mining (DM) ............................................................................. 5
1.1 Goals of Data Mining ............................................................................................................................. 5
1.2 Data Mining Tasks ................................................................................................................................. 6
1.2.1 Predictive Modeling (Classification) ............................................................................................ 6
1.2.2 Segmentation (Clustering) ............................................................................................................ 8
1.2.3 Association (Data Summarization) ............................................................................................... 9
1.2.4 Sequence and Deviation Analysis ............................................................................................... 11
1.2.5 Dependency Modeling ................................................................................................................ 12
1.3 The OLE DB for DM Specification ..................................................................................................... 12
1.4 The Columns Structure of a Data Mining Model (DMM) .................................................................. 15
1.4.1 Model Columns ........................................................................................................................... 15
1.4.2 Prediction Columns ..................................................................................................................... 20
2 OLE DB for DM Programmer's Guide ........................................................................................ 21
2.1 Connecting to a Data Mining Provider ................................................................................................ 21
2.2 Creating New Mining Models.............................................................................................................. 22
2.2.1 Detecting the Capabilities of the Provider .................................................................................. 22
2.2.2 Defining a New Mining Model ................................................................................................... 27
2.2.3 Copying a Mining Model ............................................................................................................ 29
2.2.4 Creating a Mining Model from Predictive Model Markup Language (PMML) ......................... 29
2.3 Finding Existing Mining Models ......................................................................................................... 30
2.4 Browsing Model Column Definition ................................................................................................... 31
2.4.1 Input Columns ............................................................................................................................. 31
2.4.2 Prediction Columns ..................................................................................................................... 33
2.5 Populating the Mining Model .............................................................................................................. 34
2.5.1 Inserting Cases ............................................................................................................................ 35
2.5.2 Populating the Column Values.................................................................................................... 35
2.6 Source Data .......................................................................................................................................... 36
2.6.1 SINGLETON CONSTANT as Source Data ............................................................................... 36
2.6.2 SINGLETON SELECT as Source Data ...................................................................................... 37
2.6.3 OPENROWSET as Source Data ................................................................................................. 38
2.6.4 SELECT as Source Data ............................................................................................................. 38
2.6.5 SHAPE as Source Data ............................................................................................................... 38
2.7 Browsing Mining Model Content ........................................................................................................ 40
2.8 Browsing All Possible Cases and Distinct Column Values ................................................................. 41
2.9 Querying—Applying Mining Models on New Data ............................................................................ 46
2.9.1 Components of a Prediction Query ............................................................................................. 46
2.9.2 An Example ................................................................................................................................ 48
2.9.3 Prediction Details ........................................................................................................................ 49
2.9.4 Flattening Nested Tables ............................................................................................................. 61
2.10 Deleting Existing Mining Models ...................................................................................................... 62
2.11 Refining Mining Models .................................................................................................................... 63
3 Appendix A: Schema Rowsets ................................................................................................... 65
3.1 MINING_MODELS Schema Rowset .................................................................................................. 65
3.2 MINING_COLUMNS Schema Rowset ............................................................................................... 67
3.3 MINING_MODEL_CONTENT Schema Rowset ............................................................................... 75
3.4 Layout of DISTRIBUTION Chapter in MINING_CONTENT Schema Rowset ................................. 78
3.5 MINING_SERVICES Schema Rowset ............................................................................................... 79
3.6 SERVICE_PARAMETERS Schema Rowset ...................................................................................... 85
3.7 MODEL_CONTENT_PMML Schema Rowset................................................................................... 86
4 Appendix B: OLE DB for DM Grammar ...................................................................................... 87
4.1 Statements ............................................................................................................................................ 87
4.1.1 CREATE MINING MODEL ...................................................................................................... 87
4.1.2 INSERT INTO ............................................................................................................................ 90
4.1.3 SELECT ...................................................................................................................................... 90
4.1.4 DELETE ..................................................................................................................................... 92
4.1.5 DROP .......................................................................................................................................... 93
4.2 A Sample BNF ..................................................................................................................................... 93
4.2.1 CREATE ..................................................................................................................................... 93
4.2.2 INSERT ...................................................................................................................................... 94
4.2.3 SELECT ...................................................................................................................................... 95
4.2.4 DELETE/DROP .......................................................................................................................... 97
4.2.5 RENAME .................................................................................................................................... 97
4.2.6 MISCELLANEOUS ................................................................................................................... 97
5 Appendix C: Functions ............................................................................................................... 99
5.1 Predict .................................................................................................................................................. 99
5.2 PredictSupport ................................................................................................................................... 100
5.3 PredictVariance .................................................................................................................................. 100
5.4 PredictStdev ....................................................................................................................................... 101
5.5 PredictProbability .............................................................................................................................. 101
5.6 PredictProbabilityVariance ................................................................................................................ 102
5.7 PredictProbabilityStdev ..................................................................................................................... 102
5.8 Cluster ................................................................................................................................................ 103
5.9 ClusterDistance .................................................................................................................................. 103
5.10 ClusterProbability ............................................................................................................................ 104
5.11 PredictHistogram ............................................................................................................................. 104
5.12 TopCount ......................................................................................................................................... 105
5.13 TopSum............................................................................................................................................ 106
5.14 TopPercent ....................................................................................................................................... 107
5.15 Sub-SELECT ................................................................................................................................... 108
5.16 RangeMid......................................................................................................................................... 108
5.17 RangeMin......................................................................................................................................... 109
5.18 RangeMax ........................................................................................................................................ 109
5.19 PredictScore ..................................................................................................................................... 109
5.20 PredictNodeId .................................................................................................................................. 110
6 Appendix D: XML Format for Data Mining Models.................................................................. 111
6.1 DTD for the DMM Extended PMML ................................................................................................ 112
6.2 Example: Tree Model to Predict Credit Risk ..................................................................................... 122
7 Appendix E: Provider Support for SHAPE Syntax .................................................................. 127
8 Appendix F: Provider Support for OPENROWSET Syntax .................................................... 129
9 Appendix G: Support for Other Data Mining Algorithms ....................................................... 131
9.1 Support for Association Algorithm .................................................................................................... 131
9.2 Support for Regression Algorithm ..................................................................................................... 132
Copyright....................................................................................................................................... 133
1 Introduction to OLE DB for Data Mining (DM)
The OLE DB for Data Mining (hereafter referred to as OLE DB for DM) draft specification assumes that the reader has a working knowledge of the following technologies and languages:

- OLE DB
- SQL (Structured Query Language)
- Microsoft® Visual C++®
- Data mining theory and practice
1.1 Goals of Data Mining
Data mining is about finding interesting structures in data, which may be interpreted as
knowledge about the data or may be used to predict events related to the data. These
structures take the form of patterns, which are concise descriptions of the data set. Data
mining makes the exploration and exploitation of large databases easy, convenient, and
practical for those who have data but not years of training in statistics or data analysis.
The "knowledge" extracted by a data mining algorithm can have many forms and many uses.
It can be in the form of a set of rules, a decision tree, a regression model, or a set of
associations, among many other possibilities. It may be used to produce summaries of data or
to get insight into previously unknown correlations. It also may be used to predict events
related to the data—for example, missing values, records for which some information is not
known, and so forth. There are many different data mining techniques, most of them
originating from the fields of machine learning, statistics, and database programming.
Note Machine learning, as defined here, refers to the computer's ability to improve data mining algorithms automatically through experience. Training, an important term used throughout this specification, refers to the process in which the data mining algorithm analyzes the input data and finds hidden patterns. The patterns discovered during training form the model, which can then be applied to new data.
1.2 Data Mining Tasks
Data mining can be applied for a number of different tasks. The major ones are predictive
modeling (classification), segmentation (clustering), association, sequence and deviation
analysis, and dependency modeling. This section presents a brief description of each of these
tasks.
1.2.1 Predictive Modeling (Classification)
Predictive modeling targets predicting one or more fields in the data by using the rest of the
fields. When the variable being predicted is categorical (to approve or reject a loan, for
example), the problem is called classification. When the variable is continuous (such as
expected profit or loss), the problem is referred to as regression. Classification is a
traditionally well-studied problem. Methods popular in data mining include decision trees,
rules, neural networks (nonlinear regression), radial basis functions, and many others.
For example, based on debt level, income level and employment type, you can use predictive
modeling to predict the credit risk of a given customer. The classification algorithm
determines the relationship of these attributes to the risk class in a training data set where the
risk is known. Decision trees are a common and useful technique for predictive modeling.
Figure 1 shows a set of training data that will be used to predict credit risk. Historical information was collected on customers, including their debt level, income level, employment type, and whether they turned out to be a good or bad credit risk. Figure 2 shows a decision tree that might be created from this data.
Customer ID | Debt level | Income level | Employment type | Credit risk
1 | High | High | Self-employed | Bad
2 | High | High | Salaried | Bad
3 | High | Low | Salaried | Bad
4 | Low | Low | Salaried | Good
5 | Low | Low | Self-employed | Bad
6 | Low | High | Self-employed | Good
7 | Low | High | Salaried | Good

Figure 1. Sample data
Credit Risk counts per node:

All: Good 3, Bad 4
|-- Debt = High: Good 0, Bad 3
|-- Debt = Low: Good 3, Bad 1
    |-- Employment Type = Self-employed: Good 0, Bad 1
    |-- Employment Type = Salaried: Good 3, Bad 0

Figure 2. A decision tree
In this trivial example, a decision tree algorithm might decide that the most significant
attribute for predicting credit risk is debt level. The first split in the decision tree is therefore
made on debt level. One of the two new nodes (debt level = high) is a leaf node, having three
bad credit risks and no good credit risks. In this example, a high debt level is a perfect
predictor of a bad credit risk. The other node (debt level = low) is still mixed, having three
good credit risks and one bad. The decision tree algorithm then chooses employment type as
the next most significant predictor of credit risk. The split on employment type gives two leaf
nodes. It turns out that self-employed people are a bad credit risk. This is, of course, a
completely imaginary and trivial example, but it illustrates how the decision tree can use
known attributes of the credit applicants to predict credit risk. In reality, there would be far more attributes for each credit applicant, and the number of applicants would be very large. When the scale of the problem expands like this, it is very difficult for a person to extract the rules that identify good and bad credit risks. The classification algorithm, on the other hand, can
consider hundreds of attributes and millions of records to come up with the decision tree that
describes rules for credit risk prediction.
1.2.2 Segmentation (Clustering)
Segmentation is finding the groups (clusters) in the data that consist of similar subsets of
records. Unlike in predictive modeling, there is no target variable that appears as an attribute
in the data. The clustering algorithm determines this new "hidden" attribute (the cluster ID to
which each example belongs) by examining the data. Examples include segmenting a
customer database into clusters of similar customers, which enables the design of a separate
marketing strategy for each segment. There are many methods for clustering data. Popular approaches include the K-Means algorithm, hierarchical agglomerative methods, and mixture modeling using the Expectation-Maximization (EM) algorithm for fitting probabilistic mixture models to data. It is possible for a data record to belong to different clusters with different degrees of membership.
Consider an employee database in which each employee has three attributes—age, salary, and
vested amount in a company pension plan. A user may want to issue a query that provides a
cross-tabulation of the average ages of employees having pension plans in the ranges 100K–
200K, 200K–400K, and 400K–1000K and having salaries in the ranges 50K–100K, 100K–
200K, and 200K–300K. For traditional approaches, the problem is that the ranges specified by the user can be arbitrary. In other words, the query hierarchy is dynamic and not pre-discretized along each dimension.
Multidimensional data records can be viewed as points in a multidimensional space. For example, the records of the schema (age, salary) could be viewed as points in a two-dimensional space, with the dimensions of age and salary. Figure 3a shows some data conforming to the above example schema. Figure 3b shows its representation as points in a two-dimensional space.
Figure 3. Clustering sample
Now suppose one is to give a short representation of this simple data set. One could provide the average age and the average salary (and their standard deviations). This would represent the average employee as having a salary of $85.5K (±$35.5K) and an average age of 40 (±15.5) years. However, imagine inspecting the data further and realizing that there are two groups of employees. The summary of the data would then be as shown in Figure 4.
Group | Age (Average) | Age (Std Dev.) | Income (Average) | Income (Std Dev.)
Segment 1 | 26 years | 1.0 | $54.3K | $4K
Segment 2 | 54 years | 3.6 | $116.6K | $15.2K

Figure 4. Clustering result
As Figure 4 illustrates, the data has not only been identified as comprising two distinct segments, but the average values are also much more meaningful within each segment, as evidenced by the much smaller standard deviation associated with each segment.
How does one identify the presence of such segments? This is what a clustering algorithm
does. While it may be obvious what these segments should be in two dimensions (as shown in
the preceding simple two-dimensional example), finding segments in higher dimensions (for
example, four or higher) is much more difficult for humans because simply plotting the data
may no longer help. Also, plotting data becomes extremely inconvenient with many data
points. However, clustering algorithms automatically find such segments in data. Each
segment is represented by its own distribution. The normal distribution was used in this
example, but categorical dimensions, such as gender or job description, can also be admitted
and can be represented by using the multinomial distribution. A clustering algorithm can deal
with both types of attributes and can produce useful groupings for summaries.
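For reference (not part of the original example), the mixture-model view mentioned above is usually written as a weighted sum of component distributions; a minimal sketch in standard notation, where each component p_k would be a normal distribution for continuous attributes or a multinomial distribution for categorical ones:

% Mixture density with K components (e.g., K = 2 segments above).
% \pi_k are the mixing weights; EM estimates the \pi_k together with
% each component's parameters.
p(x) \;=\; \sum_{k=1}^{K} \pi_k \, p_k(x), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0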
1.2.3 Association (Data Summarization)
Association (data summarization) describes a class of methods that target producing
summaries of parts of the data—for example, discovering correlations between variables over
substantial subsets of the data or deriving an association between some items and other items.
The most common technique in this category of methods is the use of association rules.
Sometimes referred to as market basket analysis, the process of finding association rules
depends on identifying frequent item sets in transactional data. Frequent item sets consist of
sets of items (for example, products) that frequently occur together in the same transaction.
Frequent item sets can be used to summarize the sets of products customers tend to buy
together in a supermarket basket. (For another example, to understand how a Web site is used
by its visitors, frequent item sets can also be used to find a set of Web pages that will be
visited during a Web-browsing session.) Therefore, retailers can use association techniques to
do cross-selling by stocking related products together. For example, consider a set of
transactions representing checkout baskets in a grocery store. Given a minimum support level
(supplied by the analyst), the data mining algorithm can find items in the store that are bought
together. Suppose one has a set of baskets shown in the Transaction table in Figure 5a. The
Frequent item sets table in Figure 5b shows the respective support levels for the frequent
item sets derived from the Transaction table.
(a) Transaction table

Basket ID | Item ID
1 | Milk
1 | Butter
2 | Milk
2 | Honey
2 | Butter
3 | Milk
3 | Bread
3 | Butter
4 | Milk
4 | Bread
4 | Honey

(b) Frequent item sets

Support | Item sets found
4 | {Milk}
3 | {Milk}, {Butter}, {Milk, Butter}
2 | {Milk}, {Butter}, {Milk, Butter}, {Honey}, {Bread}, {Honey, Bread}, {Honey, Milk}, {Honey, Butter}, {Bread, Milk}, {Bread, Butter}

Figure 5. Association
Note that as the support level decreases, the number of frequent item sets grows monotonically. In general, in real databases—whether storing market baskets, tracking Web-browsing behavior, or monitoring customer uses of a service (for example, a phone service)—the number of item sets having a high support value tends to be very small, and the number of item sets tends to grow exponentially as the support level is decreased.
Once the frequent item sets are derived, they can be used to produce association rules.
Association rules are derived by selecting one of the items in a frequent item set as the item to
be predicted and then evaluating the remaining items as the conditions of a rule for predicting
that item. For example, in the Frequent item sets table in Figure 5b, one may use the set "{Milk, Butter} with support 3" to derive the following association rule:
If a customer buys Milk, that customer also buys Butter.
However, studying the example data set, one also determines that this rule has an accuracy rate of only 75%, because the transaction indicated by Basket ID number 4 does not obey this rule even though it satisfies the rule's condition (the customer bought Milk).
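This accuracy figure (commonly called the rule's confidence) follows directly from the support counts in Figure 5b:

\text{confidence}(\text{Milk} \Rightarrow \text{Butter})
  \;=\; \frac{\text{support}(\{\text{Milk, Butter}\})}{\text{support}(\{\text{Milk}\})}
  \;=\; \frac{3}{4} \;=\; 75\%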
1.2.4 Sequence and Deviation Analysis
Sequence and deviation analysis accounts for sequence information and anomalies in the data.
In the preceding three categories of data mining techniques—predictive modeling,
segmentation, and association—the sequence in which events occurred was ignored and was
treated simply as part of one record (the case). For example, on a data set consisting of people
visiting a Web site, suppose user U774 first visits the home page (page 0), then page 13, then
page 2, and then page 17 on the Web site. This case could simply be flattened into the
following statement:
Case: User U774: visited {page 0, page 2, page 13, page 17}
On the other hand, it might be preferable to preserve the sequence information. This means
that another user who visited the same pages, but in a different order, will be distinct from
U774.
Algorithms in this category focus on one of the following objectives:
1. Summarizing frequent sequences or episodes in data
2. Detecting changes in data over time
3. Detecting changes in knowledge (models or patterns) over time
As an example of the first kind of task, summarizing, suppose it is discovered that users visit a
particular Web site as follows:
Figure 6. Sequence and deviation analysis
The sequences found in the data may indicate that on a given Web site, 90% of users visit
page 0 and 2% enter at page 10. The sequences also may indicate that from page 0, 60% go to
page 15, and so forth. The graph in Figure 6 summarizes ordering relationships and gives an
idea of the flow. There may be infrequently visited pages between pages 15 and 17, but only
the frequent visits are reported.
Deviation analysis focuses on finding anomalies in the data. For example, if a user usually visits only pages 0, 1, and 15 and then one day visits page 17, the deviation analysis algorithm highlights this particular event. Deviation analysis is a common technique in fraud detection.
1.2.5 Dependency Modeling
Dependency modeling, or "density estimation," refers to the estimation of the underlying joint probability distribution or density of the data. If you know the joint probability distribution, you can answer any question of interest about the data. Dependency modeling can be used to identify (sometimes novel) dependencies among attributes of cases. Identifying dependencies is one way to gain insight into your data.
An often-used density estimate for a small number of attributes is the histogram. Unfortunately, this technique is not useful when there are many attributes. A simple form of density estimation that can handle a large number of attributes uses the Naïve Bayes model. In this model, it is assumed that all attributes are independent within a class or a cluster. Note that the model does not assume that attributes are globally independent. Another simple example of density estimation is to fit a multivariate-normal distribution to data.
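For reference, the within-class independence assumption just described can be stated compactly (C denotes the class or cluster, and x_1 through x_n are the attributes of a case):

% Naive Bayes: attributes are conditionally independent given C,
% so the joint density factors per class/cluster.
p(x_1, x_2, \ldots, x_n \mid C) \;=\; \prod_{i=1}^{n} p(x_i \mid C)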
More complex (and more accurate) models for density estimation include mixture models and
graphical models. In the mixture-model approach, one fits several distributions to a data set.
For example, one may decide a population of users is composed of three distinct
subpopulations, each having its own multivariate-normal distribution. Graphical models
useful for density estimation include Bayesian networks and dependency networks.
1.3 The OLE DB for DM Specification
OLE DB for DM is an OLE DB extension that supports data mining operations over OLE DB
data providers. The goal of this specification is to provide an industry standard for data
mining so that different data mining algorithms from various data mining ISVs can be easily
plugged into user applications. In this documentation, software packages that provide data
mining algorithms are called data mining providers and those applications that use data
mining features are called data mining consumers. OLE DB for DM specifies the API
between data mining consumers and data mining providers.
OLE DB for DM introduces one new virtual object, referred to as the data mining model
(DMM), as well as several new commands for manipulating the DMM. In its characteristics
and use, the DMM is very similar to a table and is created with a CREATE statement very
similar to the SQL CREATE TABLE statement. It is populated using the INSERT INTO
statement, just as a table would be populated. The client uses a SELECT statement to make
predictions and explore the DMM.
OLE DB for DM treats a DMM as if it were a special type of table. When you insert the data
into the table, it is processed by a DM algorithm and the resulting abstraction (or data mining
model) is saved instead of the data itself. Subsequently, the DMM can be browsed, refined, or
used to derive predictions.
Data to be mined is represented logically as a collection of tables in a relational database. For
instance, a customer database might record customers, demographic data about customers,
orders, and order items. A join of the customer orders and order items tables may have many
records for one customer (one per order item). This collection of data pertaining to a single
entity is often called a case, and the set of all relevant cases is referred to as a case set. To
represent these relationships, OLE DB for DM uses nested tables as defined by the Data
Shaping Service, which is included with the Microsoft Data Access Components (MDAC)
products. Note that the same physical data may be used to generate different case sets for
different analysis purposes. For example, if one chooses to mine models or patterns over
specific products, each product then becomes a single case and customers become attributes
of the case.
The content of a DMM can be thought of as a "truth table" containing a row for every
possible combination of the distinct values for each column in the DMM. In other words, it
contains every possible case. With this view in mind, a DMM can be used to look up learned
values and statistics.
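As a small illustration of this view (a sketch only; browsing possible cases and distinct column values is specified in section 2.8, and [Age Prediction] is the model created in the walkthrough below), selecting columns directly from a model browses its learned content rather than stored training rows:

SELECT [Gender], [Age]
FROM [Age Prediction]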
A fundamental operation in OLE DB for DM is the training of a data mining model, followed
by use of the model to derive predictions. The following is an outline of the process.
The INSERT statement invokes the DM algorithm on the provider to create an abstraction of
the data in the form of a DMM. This abstraction represents the patterns the algorithm found in
the data; the patterns are saved rather than the training data. Selecting from a PREDICTION
JOIN allows new data to be processed through the model to produce predictions.
1. Create an OLE DB data source object and obtain an OLE DB session object. This is
the standard mechanism of connecting to data stores via OLE DB.
2. Create the data mining model object. Using an OLE DB command object, the client
executes a CREATE statement that is similar to a CREATE TABLE statement.
CREATE MINING MODEL [Age Prediction]
(
    [Customer ID]        LONG    KEY,
    [Gender]             TEXT    DISCRETE,
    [Age]                DOUBLE  DISCRETIZED() PREDICT,
    [Product Purchases]  TABLE
    (
        [Product Name]   TEXT    KEY,
        [Quantity]       DOUBLE  NORMAL CONTINUOUS,
        [Product Type]   TEXT    DISCRETE RELATED TO [Product Name]
    )
)
USING [Decision Trees]
3. Insert training data into the model. In a manner similar to populating an ordinary table,
the client uses a form of the INSERT INTO statement. Note the use of the SHAPE
statement to create the nested table.
INSERT INTO [Age Prediction]
(
    [Customer ID], [Gender], [Age],
    [Product Purchases](SKIP, [Product Name], [Quantity], [Product Type])
)
SHAPE
{
    SELECT [Customer ID], [Gender], [Age] FROM Customers ORDER BY [Customer ID]
}
APPEND
(
    {SELECT [CustID], [Product Name], [Quantity], [Product Type] FROM Sales ORDER BY [CustID]}
    RELATE [Customer ID] TO [CustID]
)
AS [Product Purchases]
4. Use the data mining model to make some predictions. Predictions are made with a
SELECT statement that joins the model's set of all possible cases with another set of
actual cases. The actual cases can be incomplete. In this example, the value for "Age" is
not known. Joining these incomplete cases to the model and selecting the "Age" column
from the model will return a predicted "age" for each of the actual cases.
SELECT t.[Customer ID], [Age Prediction].[Age]
FROM [Age Prediction]
PREDICTION JOIN
(
    SHAPE
    {
        SELECT [Customer ID], [Gender] FROM Customers ORDER BY [Customer ID]
    }
    APPEND
    (
        {SELECT [CustID], [Product Name], [Quantity] FROM Sales ORDER BY [CustID]}
        RELATE [Customer ID] TO [CustID]
    )
    AS [Product Purchases]
) AS t
ON [Age Prediction].[Gender] = t.[Gender] AND
   [Age Prediction].[Product Purchases].[Product Name] = t.[Product Purchases].[Product Name] AND
   [Age Prediction].[Product Purchases].[Quantity] = t.[Product Purchases].[Quantity]
Note Because the process of combining actual cases with all possible model cases is not as simple as the semantics of a normal SQL JOIN, a new type of join, the PREDICTION JOIN, is introduced in OLE DB for DM. When the schema of the actual case table matches the schema of the model, NATURAL PREDICTION JOIN can be used, obviating the need for the ON clause of the join; columns from the source query are matched to columns from the DMM based on the names of the columns.
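For illustration, a sketch of the same query in its natural form (this query is not an example from the original text; the authoritative grammar is given in Appendix B). Because matching is by name, the source query's column names must line up with the model's:

SELECT t.[Customer ID], [Age Prediction].[Age]
FROM [Age Prediction]
NATURAL PREDICTION JOIN
(
    SHAPE
    {
        SELECT [Customer ID], [Gender] FROM Customers ORDER BY [Customer ID]
    }
    APPEND
    (
        {SELECT [CustID], [Product Name], [Quantity] FROM Sales ORDER BY [CustID]}
        RELATE [Customer ID] TO [CustID]
    )
    AS [Product Purchases]
) AS t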
Part 2 of this document describes the language for creating and manipulating a DMM in more detail. The complete details of the language and of the schema rowsets used when working with a data mining provider (DMP) are given in the appendixes.
1.4 The Columns Structure of a Data Mining Model (DMM)
In usage, the DMM is very similar to a SQL table. The SELECT statement returns columns
from the input data, columns from the model, and predictions produced by the model. The
DMM definition includes a definition of the columns of data over which the model will be
created, including detailed information about the nature of the data and relationships between
columns.
1.4.1 Model Columns
The model columns describe all of the information about a specific case. For example, assume
that each case in the DMM represents a customer. The columns of the DMM will include all
known and desired information about the customer.
The following table illustrates a customer case.
Customer ID | Gender | Hair Color | Age | Age Probability | Product Name | Product Quantity | Product Type | Cars Owned | Car Probability
1 | Male | Black | 35 | 100% | TV | 1 | Electronic | Truck | 100%
  |      |       |    |      | VCR | 1 | Electronic | Van | 50%
  |      |       |    |      | Ham | 2 | Food | |
  |      |       |    |      | Beer | 6 | Beverage | |
As the table indicates, a customer case is not easily describable using simple relational tables.
Each case can include not only simple columns but also multiple tables. Each of these tables
inside the case can have a variable number of rows and a different number of columns. The
meaning of the information contained in the columns can also greatly differ.
Note The ability of a case to contain multiple tables of data is a key requirement for most
of the data mining algorithms. Although most of the relational data stores today cannot
support such table structures, the theoretical notion of nested tables (also known as table
columns) already exists in the relational world and is also supported by MDAC. This
specification will rely on these data structures with some anticipation of a wider adoption
in the relational world in the future.
Some of the columns in the example have a direct one-to-one relationship with the case (such
as "Gender" and "Age"), while others have a one-to-many relationship with the case and
therefore exist in tables. As noted above, the nested tables are a key element in the basic data
structure of the case and therefore have an explicit representation in the case definition. You
can easily identify the following two tables contained in the sample case:

- The "Product Purchases" table, containing the columns "Product Name," "Product Quantity," and "Product Type"
- The "Car Ownership" table, containing the columns "Cars Owned" and "Car Probability"
The main row of the case is the case row. Columns in the case row describe the entity of the
case. For example, in the case illustrated in the preceding table, the "Age" column contains
the age of the customer whose Customer ID is 1. Rows inside nested tables are referred to as
nested rows. Columns in nested rows describe the entity of the nested row as it relates to the
case row. For example, the "Product Quantity" column represents the quantity of the product
indicated in the "Product Name" column; therefore, 2 is the quantity of "Ham" purchased by
customer 1.
As the preceding example indicates, each column can represent the following content types:

- KEY: The columns that identify a row. For example, "Customer ID" uniquely identifies customer cases, and "Product Name" uniquely identifies a row in the "Product Purchases" table. In the CREATE MINING MODEL command syntax, specifying the type flag KEY in the column definition identifies key columns.

- ATTRIBUTE: A direct attribute of the case. This type of column represents some value for the case—for example, the age, gender, or hair color of the customer, or the quantity of a specific product the customer purchased.

- RELATION: Information used to classify attributes, other relations, or key columns. For example, "Product Type" classifies "Product Name." A given relation value must always be consistent for all of the instance values of the other columns it describes—for example, the product "Ham" must always be shown as "Food" for all cases. In the CREATE MINING MODEL command syntax, relations are identified in the column definition by using a RELATED TO clause to indicate the column being classified.

- QUALIFIER: A special value associated with an attribute that has a predefined meaning for the provider—for example, the probability that the attribute value is correct. These qualifiers are all optional and apply only if the data has uncertainties attached to it or if the output of previous predictions is being chained as input to a subsequent DMM training step. Following are examples of qualifiers.

  Note In the CREATE MINING MODEL command syntax, qualifiers are identified by using an OF clause to indicate the attribute column they modify.

  - PROBABILITY: A number between zero and one that describes the probability of the associated value.
  - VARIANCE (or Stdev): A number that describes the variance (or standard deviation) of the value of an attribute.
  - SUPPORT: A float that represents a weight (case replication factor) to be associated with the value.
  - PROBABILITY_VARIANCE (or Stdev): The variance (or standard deviation) associated with the probability estimator used for PROBABILITY.
  - ORDER: Specifies the order of a column. (See ORDERED below.)

- TABLE: A nested table is represented in the case as a special column with the data type TABLE. For any given case row, the value of a TABLE type column contains the entire contents of the associated nested table. The value of a TABLE type column is in itself a table containing all of the columns for the nested table. In the CREATE MINING MODEL command syntax, nested tables are described by a set of columns, all of which are contained within the definition of a named TABLE type column.

- DISCRETE: The attribute values are discrete. This is the simplest form of attribute. Gender is a typical example, where the values describe categories. Even if the values are numeric, no ordering is implied by the values. ("Area Code" is a good example.) The values of a discrete attribute are often called its states.

- ORDERED: Columns that define an ordered set of values. Although there is a total ordering, no distance or magnitude semantics are implied. A ranking of skill level (say, one through five) is an ordered set, but a skill level of five isn't necessarily five times better than a skill level of one. Attributes with a type flag of ORDERED are also considered to be discrete. There may be an associated "Order Of" column with numeric values that gives the ordering for this attribute type column. The order of column values can be defined before the model training. (See the section "Populating the Column Values.")

- CYCLICAL: A set of values that have a cyclical ordering. Day of the week is a good example, since day number one follows day number seven. Attributes with a type flag of CYCLICAL are also considered to be ordered and discrete.

- CONTINUOUS: Attributes with values that form a continuous curve. Values are naturally ordered and have implicit distance and magnitude semantics. Salary is a typical example.

- DISCRETIZED: The data that will be inserted into the model is continuous, but it should be transformed into and modeled as a number of ORDERED states by the provider. Some data mining algorithms cannot accept CONTINUOUS attributes as input, or they may not be able to predict CONTINUOUS values. For these cases, columns with continuous domains should be made into DISCRETIZED attributes. In the CREATE MINING MODEL command syntax, the DISCRETIZED type flag can take arguments to override default discretization behavior.

- SEQUENCE_TIME: A column containing time measurement units. A time column does not have to contain a data type of any particular format; a period number is acceptable. This is typically used to associate a sequence time with individual attribute values, such as purchase time.
A CONTINUOUS attribute's domain may also have a distribution associated with it. This is a
hint given to the data mining provider describing the expected distribution of the column
values that will be inserted into the model when trained. Specific values may be known to
have typical distributions. For some algorithms, it is particularly beneficial to know the
distribution ahead of time. If the distribution isn't known or isn't given, the provider may
assume whatever distribution it finds convenient. Following are examples of distributions:

- NORMAL: A histogram of the continuous values forms a normal Gaussian distribution. Household income values may form this curve.

- LOG_NORMAL: A histogram of the continuous values forms a Gaussian distribution with all values greater than 0, with an elongated upper tail, and with a skew toward the low end of the curve. The quantity associated with a product purchase may form this curve if a value of 0 is not explicitly recorded and if most consumers tend to buy smaller quantities of the product.

- UNIFORM: The likely occurrence of all values is equal.

There are a number of other distribution models, such as BINOMIAL, MULTINOMIAL, POISSON, T-DISTRIBUTION, and so on. A data mining provider may support a subset of these distributions. A sketch of how such a hint appears in a column definition follows.
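For illustration only (this line does not appear in the sample model above), a distribution hint occupies the same position in a column definition as NORMAL does in the earlier CREATE MINING MODEL example:

[Quantity]    DOUBLE    LOG_NORMAL CONTINUOUS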
All of the preceding column descriptions allow the provider to make some sense of the
training data it is given with the INSERT command. Returning to the example, the columns
can now be classified as shown in the following table.
Containing Table | Column | Content Type | Model Hints | Comments
 | Customer ID | Key | | Special column that serves as the case identifier (key)
 | Gender | Discrete Attribute | |
 | Hair Color | Discrete Attribute | |
 | Age | Continuous Attribute | |
 | Age Probability | Probability Modifier of Age | |
 | Customer Loyalty | Ordered Attribute | | Doesn't exist in the sample case. Added for additional illustration.
 | Product Purchases | Table | |
Product Purchases | Product Name | Key | | Each distinct key represents the purchase of a product with a "Quantity" attribute.
Product Purchases | Product Quantity | Continuous Attribute | Log Normal |
Product Purchases | Product Type | Relation of Product Name | |
Product Purchases | Month Purchased | Cyclical Attribute | | Doesn't exist in the sample case. Added for additional illustration.
Car Ownership | Cars Owned | Key | | Has an implicit "Exists" attribute for each distinct key.
Car Ownership | Car Probability | Probability Modifier of Implicit "Exists" Attribute | |
Other hints can be given to the data mining provider to help it build good models of the training data. These modeling flags are provider-specific, but following are two examples:

- MODEL_EXISTENCE_ONLY: The actual values for an attribute are not nearly as important as the simple existence of the attribute. For example, assume the existence of some general demographic data for a selected group of people, along with a nested table of the television programs and the viewing duration for all of the programs that each person watched. For modeling purposes, the fact that the person watched a particular program may be more important than how long they watched it. In this case, the Duration attribute should be marked as MODEL_EXISTENCE_ONLY (see the sketch after this list).

- NOT NULL: The attribute can never contain a null value, and encountering one while training should generate an error.
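For illustration only: a hypothetical column declaration carrying such a flag. The [Duration] column does not appear in the sample model above, and the exact placement of modeling flags within a column definition is governed by the BNF in Appendix B; treat this as a sketch.

[Duration]    DOUBLE    CONTINUOUS MODEL_EXISTENCE_ONLY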
1.4.2 Prediction Columns
Attribute or Table type columns can be input columns, output columns, or both. The data
mining provider will build a data mining model capable of predicting or explaining output
column values based on the values of the input columns.
Predictions may convey not only simple information such as "estimated age is 21", but they
may also convey additional statistical information such as confidence level and standard
deviation. Further, the prediction may actually be a collection of predictions, such as "the set
of products that the customer is likely to buy." Each of the predictions in the collection may
also include a set of statistics.
A prediction can be expressed as a histogram. A histogram provides multiple possible
prediction values, each accompanied by a probability and other statistics. When histogram
information is required, each prediction (which by itself can be part of a collection of
predictions) may have a collection of possible values that constitutes a histogram.
Since the prediction information may be very rich, it is often necessary to extract only a portion of the predictions. For example, you may want to see only the "best estimate," the "top 3 estimates," or the "estimates with probability greater than 55%." Not every provider or every DMM can support all of the possible requests. Therefore, it is necessary for the output column to define what information may be extracted from it.
OLE DB for DM defines a set of standard transformation functions on output columns. These functions are discussed in detail in section 2.9, "Querying—Applying Mining Models on New Data," and in Appendix C.
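As a preview of those functions (a sketch only; it reuses the [Age Prediction] model from section 1.3, the singleton SELECT source described in section 2.6.2, and the PredictHistogram, TopCount, and $Probability constructs defined in Appendix C), a query asking for the three most probable ages for a single hypothetical case might look like the following:

SELECT TopCount(PredictHistogram([Age]), $Probability, 3)
FROM [Age Prediction]
NATURAL PREDICTION JOIN
(SELECT 'Male' AS [Gender]) AS t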
2 OLE DB for DM Programmer's Guide
This section of the specification illustrates how data mining consumers and providers work
together. The section will walk you through the following operations:

- Connecting to a DMP
- Creating a new DMM
- Enumerating and exploring existing data mining models
- Executing queries and deriving predictions with a DMM
- Housekeeping activities

This section is not a formal representation of the interfaces and does not attempt to describe every option and variation that the API enables. Instead, all of the interfaces are formally detailed in the appendixes. You should consider this section a tutorial that describes the principles of working with a DMP and introduces application programmers to the new world of DM client development.
2.1 Connecting to a Data Mining Provider
The process of connecting to a DMP is the same as connecting to any other OLE DB provider
(whether relational, multidimensional, or any other type). The connection sequence to an OLE
DB provider is described in the OLE DB Programmer's Reference.
As with all other OLE DB providers, a DMP supports the data source, session, command, and
rowset objects.
Although during the connection sequence a DMP behaves just like any other OLE DB
provider, it is still very useful to be able to determine whether a specific provider supports the
OLE DB for DM specification. To this end, the constant
DBSOURCETYPE_DATASOURCE_DMP is defined and can be used when enumerating
providers to locate a provider capable of performing data mining. A single provider may
support many data store types; for example, a provider may support both relational and data mining operations concurrently. Bit operations on the SOURCE_TYPE value can detect whether a provider supports a specific data store type.
Once a session object has been instantiated, the client application can query the provider for
information and execute various commands.
2.2 Creating New Mining Models
A new DMM is created with the CREATE MINING MODEL command. This command
correlates closely to the common relational database operation CREATE TABLE, which
defines a table object structure. As will be shown in following sections, creating and
populating a DMM follows the approach taken by relational databases for the management of
tables.
The similarities between DMMs and tables are not coincidental. It is widely expected that
data mining capabilities will be fully integrated with relational databases in the future.
Therefore, the present approach looks at the DMM as a future standard object of an RDBMS,
just like a table or a view, and the DMM is indeed represented and accessed to a large degree as if it were a special type of table.
However, unlike a table, a DMM must announce a predefined goal and analysis technique. Each provider may support many different analysis techniques. It is therefore necessary to be able to identify the provider's capabilities.
2.2.1 Detecting the Capabilities of the Provider
The different mining services (or algorithms as they are also known) are exposed through a
new schema rowset—the mining services schema rowset. This schema rowset exposes the
different algorithms supported by a provider and the way to specify goals for the algorithm.
Many algorithms require a goal—for example, "predict whether the customer's transactions
look fraudulent," "predict the sales amount for the customer," "predict the profit for a
product," and "predict the sales of each store for next year" all have targeted goals. The
algorithm will try to predict something about the case, usually one of the attributes of the
case. Most of the algorithms will need to get a training set of cases where the attributes to be
predicted are already known, and they will then create a DMM capable of predicting these
attributes for cases in which the attribute is unknown.
Different algorithms will be capable of predicting different things. They may also differ in the type of data they are capable of processing. The list of algorithms (or services), their possible goals, their limitations, and their capabilities are all exposed in the mining services schema rowset. This information will be used when defining a new model.
The mining services schema rowset is described in detail in Appendix A. The following describes some of the important columns found in the mining services schema rowset; the OLE DB type indicator for each column is shown in parentheses.
SERVICE_NAME (DBTYPE_WSTR): The name of the algorithm. Provider-specific. Used with the CREATE MINING MODEL command to specify the algorithm.

SERVICE_TYPE_ID (DBTYPE_UI4): A bitmask that describes mining service types. The list includes known popular mining services, such as the following:
- DM_SERVICETYPE_CLASSIFICATION (0x0000001)
- DM_SERVICETYPE_CLUSTERING (0x0000002)
- DM_SERVICETYPE_ASSOCIATION (0x0000004)
- DM_SERVICETYPE_DENSITY_ESTIMATE (0x0000008)
- DM_SERVICETYPE_SEQUENCE (0x0000010)

PREDICTED_CONTENT (DBTYPE_WSTR): The attribute types that can be predicted. This is a comma-delimited list of content types.

PREDICTION_LIMIT (DBTYPE_UI4): The maximum number of predictions the model and algorithm can provide; 0 means no limit.

SUPPORTED_DISTRIBUTION_FLAGS (DBTYPE_WSTR): A comma-delimited list of one or more of the following: NORMAL, LOG_NORMAL, UNIFORM, BINOMIAL, MULTINOMIAL, POISSON, T-DISTRIBUTION. Provider-specific flags may also be defined.

SUPPORTED_INPUT_CONTENT_TYPES (DBTYPE_WSTR): A comma-delimited list of one or more of the following: KEY, DISCRETE, CONTINUOUS, DISCRETIZED, ORDERED, SEQUENCE_TIME, CYCLICAL, PROBABILITY, VARIANCE, STDEV, SUPPORT, PROBABILITY_VARIANCE, PROBABILITY_STDEV, ORDER, SEQUENCE, TABLE. Provider-specific flags may also be defined.

SUPPORTED_PREDICTION_CONTENT_TYPES (DBTYPE_WSTR): A comma-delimited list of one or more of the following: DISCRETE, CONTINUOUS, DISCRETIZED, ORDERED, SEQUENCE_TIME, CYCLICAL, PROBABILITY, VARIANCE, STDEV, SUPPORT, PROBABILITY_VARIANCE, PROBABILITY_STDEV, ORDER, TABLE. Provider-specific flags may also be defined.

SUPPORTED_MODELING_FLAGS (DBTYPE_WSTR): A comma-delimited list of one or more of the following: MODEL_EXISTENCE_ONLY, NOT NULL. Provider-specific flags may also be defined.

TRAINING_COMPLEXITY (DBTYPE_I4): Indication of expected time for training:
- DM_TRAINING_COMPLEXITY_LOW—Running time is proportional to input and is relatively short.
- DM_TRAINING_COMPLEXITY_MEDIUM—Running time may be long but is generally proportional to input.
- DM_TRAINING_COMPLEXITY_HIGH—Running time is long and may grow exponentially in relationship to input.

PREDICTION_COMPLEXITY (DBTYPE_I4): Indication of expected time for prediction:
- DM_PREDICTION_COMPLEXITY_LOW—Running time is proportional to input and is relatively short.
- DM_PREDICTION_COMPLEXITY_MEDIUM—Running time may be long but is generally proportional to input.
- DM_PREDICTION_COMPLEXITY_HIGH—Running time is long and may grow exponentially in relationship to input.

EXPECTED_QUALITY (DBTYPE_I4): Indication of the expected quality of the model produced with this algorithm:
- DM_EXPECTED_QUALITY_LOW
- DM_EXPECTED_QUALITY_MEDIUM
- DM_EXPECTED_QUALITY_HIGH

ALLOW_INCREMENTAL_INSERT (DBTYPE_BOOL): TRUE if additional INSERT INTO statements are allowed after the initial training.

ALLOW_DUPLICATE_KEY (DBTYPE_BOOL): TRUE if cases may have duplicate keys.
2.2.2 Defining a New Mining Model
Defining a new model is done using a CREATE MINING MODEL statement. Similar to the
CREATE TABLE statement, the creation of a DMM defines only its structure and properties.
It does not define the specific content (the learned graphical structure), which will be created
only when the DMM is populated. (See below.)
The CREATE MINING MODEL statement will define the following:
1. The DMM columns
2. The specific algorithm to be used in the DMM
The syntax used to define the DMM columns is similar to the syntax used to define the
columns in a table object, as follows:
CREATE MINING MODEL <mining model name> (<Column definitions>) USING <Service>[(<service
arguments>)]
However, because the columns of a DMM require specialized information, several
extensions were added to the standard SQL syntax. Following is a statement example that
applies to the case structure illustrated in Section 1.3:
CREATE MINING MODEL [Age Prediction]
(
    [Customer ID]        LONG    KEY,
    [Gender]             TEXT    DISCRETE,
    [Hair Color]         TEXT    DISCRETE,
    [Age]                DOUBLE  DISCRETIZED() PREDICT,
    [Age Probability]    DOUBLE  PROBABILITY OF [Age],
    [Product Purchases]  TABLE
    (
        [Product Name]   TEXT    KEY,
        [Quantity]       DOUBLE  NORMAL CONTINUOUS,
        [Product Type]   TEXT    RELATED TO [Product Name]
    ),
    [Car Ownership]      TABLE
    (
        [Car Name]       TEXT    KEY,
        [Probability]    DOUBLE  PROBABILITY OF [Car Name]
    )
)
USING [Microsoft_Decision_Trees]
As the example shows, the definition includes the following information for each column:

- Name (mandatory)
- Data type (mandatory); a special data type exists for tables contained in a case (TABLE)
- List of column type flags and modeling flags
- Relationship to an attribute column (mandatory only if it applies), indicated by the
  RELATED TO or OF clauses
- Prediction request (that is, an indication to the algorithm to predict this column),
  indicated by the PREDICT or PREDICT_ONLY keyword
While a complete BNF for this grammar is given in Appendix B, following are a few
interesting points:

The syntax allows for explicit definition of "Table Columns." "Product Purchases" and "Car
Ownership" are both columns that each contain a full table.

A potential list of supported data types is as follows: LONG, DOUBLE, TEXT, DATE,
BOOL, and TABLE. For a list of the data types supported by the provider, see the
PROVIDER_TYPES schema rowset in Appendix B of the OLE DB Programmer's Reference.

The Discretized function cuts the value range of a continuous variable into a number of
buckets. The syntax for the Discretized attribute type is as follows:
Discretized([method[,n]]). Both arguments are optional, but parentheses are always required,
and a value must be given for "method" in order to supply a value for "n". The "n" argument
is the recommended number of buckets that the discretization method should try to find to
divide up the values of the column. Each provider will have a reasonable default. The
"method" argument describes the algorithm that the provider should use to find the buckets.
All providers should support the method DEFAULT as the default. Other possible
provider-specific algorithms could be AUTOMATIC, EQUAL_AREAS, THRESHOLDS,
CLUSTERS, and so forth.
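For illustration only, the following is a minimal sketch of two column definitions using this
syntax; the [Income] column, the EQUAL_AREAS method, and the bucket count of 10 are
hypothetical values that a particular provider might or might not accept:

[Age]    DOUBLE DISCRETIZED() PREDICT,
[Income] DOUBLE DISCRETIZED(EQUAL_AREAS, 10)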
A column may have missing values. There are different ways to deal with missing values:
the easy way is to ignore them, but sometimes missing values can be informative, and thus
it is often beneficial to model the missing state. Users can specify how to deal with missing
values in the column definition statement. For example, Gender TEXT DISCRETE IGNORE NULL
means to ignore the missing state in the Gender column. The following is a list of possible
ways to specify missing value treatment:

- NOT NULL: The column should not contain missing values; otherwise, an error is returned
  during the model training stage.
- IGNORE NULL: Ignore the missing value.
- NULL INFORMATIVE: The data mining algorithm will model the missing state.
The default option is NULL INFORMATIVE. After the column definition, the statement
indicates the type of algorithm to be used. Only one of the services listed by the provider in
the services schema rowset can be used.

The USING clause can be followed by a PARAMETERS clause containing provider-specific
pairs of parameter-value settings. The SERVICE_PARAMETERS schema rowset contains a
list of parameters supported by the provider. A full description of this schema rowset is
provided in Appendix A. Algorithm providers define the names of their parameters. However,
we suggest the following list of parameters, which may be used by many algorithms (a brief
example appears after the list):
- HOLDOUT_PERCENTAGE: The percentage of data that is held out during the training
  stage. This data may be used in a validation or test phase.
- HOLDOUT_SEED: The seed used to hold out data.
- SAMPLE_PERCENTAGE: The percentage of data that is selected after sampling.
- SAMPLE_SEED: The seed used in sampling data.
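For illustration only, a minimal sketch of passing two of the suggested parameters; whether
these parameters are accepted, and the exact parameter syntax, is provider-specific, and the
values 30 and 1 are arbitrary:

CREATE MINING MODEL [Age Prediction]
(<column definitions as shown earlier>)
USING [Microsoft_Decision_Trees] (HOLDOUT_PERCENTAGE = 30, HOLDOUT_SEED = 1)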
When a CREATE MINING MODEL statement is executed, the model is created and will
appear in the schema rowsets of the provider. However, because data has not yet been
inserted into the model, the model cannot be used for any kind of useful analysis. The client
can use the MODEL_STATE column in the mining models schema rowset to get this
indication.
2.2.3 Copying a Mining Model
Sometimes you may want to run multiple algorithms against the same source data and model
column structure. The OLE DB for DM specification provides a mechanism that allows you
to easily create a new model from an existing model.
SELECT * INTO <new model> USING <model type> [( <parameter list> )] FROM <model>
The new model will contain all information from the existing model that is not specific to the
actual algorithm. Executing this statement will cause the new model to be trained using the
same training query as the existing model. If the existing model is not trained, only the
structure of the model will be copied.
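For example, the following hedged sketch copies the column structure of the model defined
earlier and retrains it with a different algorithm; the target model name and the
[Microsoft_Clustering] service are illustrative, and any service listed in the services
schema rowset could be used instead:

SELECT * INTO [Age Prediction Clusters]
USING [Microsoft_Clustering]
FROM [Age Prediction]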
2.2.4 Creating a Mining Model from Predictive Model
Markup Language (PMML)
Because all of the structure and content of a DMM may be expressed as an XML string in the
Predictive Model Markup Language (PMML) format (see Appendix D), it is conceivable that
the expert user can use such a string as the basis for the creation of a model. This string could
be a modified version of the string retrieved from another model. (See the MODEL_PMML
column of the MODEL_CONTENT_PMML schema rowset.) Changes to the XML string will
typically involve manipulation of the content nodes; for example, pruning the tree, adding
other nodes, or changing the rules described in the nodes.
A provider does not have to support initialization based on a PMML document. To discover
whether the provider supports this capability, the services schema rowset offers the
ALLOW_PMML_INITIALIZATION column.
To create a new model from PMML, use a modified version of the CREATE MINING
MODEL statement, as follows:
CREATE MINING MODEL <mining model name> FROM PMML <xml string>
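For illustration only, a minimal sketch of this form; the PMML document shown is a
placeholder, not a complete or valid model description:

CREATE MINING MODEL [Age Prediction From PMML]
FROM PMML '<?xml version="1.0"?><PMML> ... </PMML>'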
2.3 Finding Existing Mining Models
Data mining models are exposed in the mining models schema rowset. This rowset can be
viewed as an enhanced version of the TABLES schema rowset because it contains all of the
same types of information. In addition, several DMM-specific columns have been added to
the rowset. A complete description of the MINING_MODELS schema rowset can be found in
Appendix A; the following table describes some of the interesting columns.
MODEL_NAME (DBTYPE_WSTR)
    Model name. This column cannot contain NULL.

SERVICE_TYPE_ID (DBTYPE_UI4)
    A bitmask that describes mining service types. The list includes known popular
    mining services, such as the following:
    - DM_SERVICETYPE_CLASSIFICATION (0x0000001)
    - DM_SERVICETYPE_CLUSTERING (0x0000002)
    - DM_SERVICETYPE_ASSOCIATION (0x0000004)
    - DM_SERVICETYPE_DENSITY_ESTIMATE (0x0000008)
    - DM_SERVICETYPE_SEQUENCE (0x0000010)

SERVICE_NAME (DBTYPE_WSTR)
    A provider-specific name that describes the algorithm used to generate the model.

CREATION_STATEMENT (DBTYPE_WSTR)
    Optional. The statement used to create the original data mining model.

PREDICTION_ENTITY (DBTYPE_WSTR)
    A comma-delimited list indicating which columns the model can predict.

IS_POPULATED (DBTYPE_BOOL)
    VARIANT_TRUE if the model is populated; VARIANT_FALSE if the model is not
    populated. An empty model has a defined structure but has not been "trained"
    with data.
2.4 Browsing Model Column Definition
Once an interesting DMM has been identified, you may want to explore its structure. The
structure of a DMM is similar to the structure of a table that is represented as a set of
columns. Like columns of a table, the structure represents the kind of inputs and outputs that
the DMM can provide. Like a table, the structure is independent of the specific data instances
that were or will be input into it. In fact, the structure of a DMM is described using a schema
rowset that is derived from the COLUMNS schema rowset (see the Appendix B of the OLE
DB Programmer's Reference), with new columns added to support data mining operations.
2.4.1 Input Columns
The structure of the DMM is described by the inputs that are used to describe a case and by
the set of possible predictions that can be selected from the model. This structure is described
in the MINING_COLUMNS schema rowset. Data mining providers must support all
mandatory columns, as defined by the OLE DB for DM specification.
The section on The Columns Structure of a DMM in part one of this document describes the
data types, content types, and other interesting flags that describe the columns of a DMM.
Several columns in the MINING_COLUMNS schema rowset (the complete description can
be found in Appendix A) describe these properties of a model column. The following table
describes some interesting columns from that rowset.
COLUMN_NAME (DBTYPE_WSTR)
    The name of the column; this might not be unique. If this cannot be determined,
    a NULL is returned.

DATA_TYPE (DBTYPE_UI2)
    The indicator of the column's data type, for example:
    "TABLE"  = DBTYPE_HCHAPTER
    "TEXT"   = DBTYPE_WCHAR
    "LONG"   = DBTYPE_I8
    "DOUBLE" = DBTYPE_R8
    "DATE"   = DBTYPE_DATE
DISTRIBUTION_FLAG (DBTYPE_WSTR)
    One of the following:
    - NORMAL
    - LOG_NORMAL
    - UNIFORM
    - BINOMIAL
    - MULTINOMIAL
    - POISSON
    - T-DISTRIBUTION
    Provider-specific flags may also be defined.

CONTENT_TYPE (DBTYPE_WSTR)
    One of the following:
    - KEY
    - DISCRETE
    - CONTINUOUS
    - DISCRETIZED([args])
    - ORDERED
    - SEQUENCE_TIME
    - CYCLICAL
    - PROBABILITY
    - VARIANCE
    - STDEV
    - SUPPORT
    - PROBABILITY_VARIANCE
    - PROBABILITY_STDEV
    - ORDER
    - SEQUENCE
    Provider-specific flags may also be defined.
MODELING_FLAG (DBTYPE_WSTR)
    A comma-delimited list of flags. The defined flags are:
    - MODEL_EXISTENCE_ONLY
    - NOT NULL
    Provider-specific flags may also be defined.

RELATED_ATTRIBUTE (DBTYPE_WSTR)
    The name of the target column that the current column either relates to or is a
    special property of.

CONTAINING_COLUMN (DBTYPE_WSTR)
    Name of the TABLE column containing this column; NULL if the column is not
    contained in any TABLE column.
2.4.2 Prediction Columns
ATTRIBUTE or TABLE type columns can be input columns, output columns, or both. The
data mining provider will build a DMM capable of predicting or explaining output column
values based on the values of the input columns. In the CREATE MINING MODEL
command syntax, output columns are identified with the PREDICT or the PREDICT_ONLY
keyword. Marking a column for prediction (or not) has various implications for usage in the
model, as described in the following table.
PREDICT_ONLY (input: no; output: yes)
    Input column values will be used to predict this column's values. This column's
    values will not be used to predict other columns.

PREDICT (input: yes; output: yes)
    Input column values will be used to predict this column's values. This column's
    values will be used to predict predictable columns.

(None mentioned) (input: yes; output: no)
    This column's values will be used to predict predictable columns.
The following table lists two additional columns in the MINING_COLUMNS schema rowset
that describe the input/output state of a column.
IS_INPUT (DBTYPE_BOOL)
    VARIANT_TRUE if this is an input column.

IS_PREDICTABLE (DBTYPE_BOOL)
    VARIANT_TRUE if this is an output column.
Any TABLE column containing a predictable column will itself become predictable.
The MINING_COLUMNS schema rowset has additional columns that indicate the kind of
additional information that can be found in the prediction of a predictable column and what
extraction functions on the predictable column are supported. These additional columns apply
only to output columns (that is, when IS_PREDICTABLE is set to TRUE).
PREDICTION_SCALAR_FUNCTIONS (DBTYPE_WSTR)
    A comma-delimited list of scalar functions that may be performed on the column.

PREDICTION_TABLE_FUNCTIONS (DBTYPE_WSTR)
    A comma-delimited list of functions that may be applied to the column, returning
    a table. The list has the following format:
        <function name>(<column1> [, <column2>], ...)
    The format allows the client to determine which columns will be present in the
    table returned by any given function.
2.5 Populating the Mining Model
After the structure of the DMM is defined, you can use the INSERT INTO command to
populate the model with training data. This command corresponds closely to the common
relational database operation INSERT, which populates a table with data.
The model population stage will run the training data through the data mining algorithm and
will generate a predictive model (referred to in this document as the DMM content).
Notice that although massive quantities of data are fed into the DMM, the DMM usually will
not store any of the data and will retain only the DMM content and distinct column values
after the process is done.
The population step may involve intensive processing of the data, and you should expect it
to take some time. A notification mechanism is available for following the progress of the
algorithm, and the OLE DB asynchronous execution cancellation interfaces are also available.
Specifically, for commands that do not return a rowset, the DM provider's command object
should return an object that supports the following interfaces: IDBAsynchStatus and
IConnectionPointContainer (allowing users to get a connection point for the
IDBAsynchNotify interface).
2.5.1 Inserting Cases
The command syntax for populating the DMM with data is identical to the population of a
relational table with data in SQL. The basic syntax has the form:
INSERT [INTO] <mining model name>
[ <mapped model columns> ]
<source data query>
As is described in the following sections, various syntaxes can be used to specify the <source
data query>. Regardless of which syntax is used, the column binding between the target
DMM and the source query is done by column order, as is the standard with the INSERT
INTO statement, or the command may specify an explicit mapping from source data columns
into DMM columns using the <mapped model columns> clause. Because not every <source
data query> syntax (for example, the SHAPE syntax) allows complete control over the set of
columns that is returned, using the keyword SKIP in the INTO clause indicates columns that
must be present in the source data query but have no meaning to the DMM. Once the DMM is
populated, the client application can browse its content and perform queries to predict new
data points.
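For illustration only, a minimal sketch of the mapped-columns form; the source query and its
[Region] column are hypothetical, and SKIP marks a source column that has no meaning to the
DMM:

INSERT INTO [Age Prediction]
    ([Customer ID], [Gender], SKIP, [Age])
OPENROWSET('SQLOLEDB', '…',
    'SELECT [Customer ID], [Gender], [Region], [Age] FROM [Customers]')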
2.5.2 Populating the Column Values
In general, the DMM will learn the available set of distinct column values while training.
However, there are instances when it is preferable or necessary to explicitly train these values
independently of the model.

- ORDERED or CYCLICAL attributes: The model may depend on the maintenance of a
  certain order among discrete attributes (for example, Monday < Tuesday), and that order
  cannot be guaranteed to appear in the training data.
- Value hierarchies: Related columns introduce value hierarchies that would otherwise have
  to be described every time the attribute is used. For example, it is not necessary to tell
  the DMM that "Beer" is of type "Beverage" each time it appears in the training data.
To train a column, OLE DB for DM specifies the following syntax:
INSERT INTO <model>.COLUMN_VALUES(<mapped model columns>)
<source_data_query>
Unlike the model itself, the column values are incrementally trainable. Individual columns
can be trained separately and repeatedly to add more values. However, if there are
relationships between columns through the RELATED TO clause in the CREATE MINING
MODEL statement, these columns must be trained together, as in the following example:
INSERT INTO [Age Prediction].COLUMN_VALUES(Gender)
OPENROWSET('SQLOLEDB', '…', 'SELECT DISTINCT Gender FROM Customers')
INSERT INTO [Age Prediction].COLUMN_VALUES([Product Purchases].[Product Name],
[Product Purchases].[Product Type])
OPENROWSET('SQLOLEDB', '…', 'SELECT DISTINCT [Product Name], [Product Type] FROM Sales')
INSERT INTO [Age Prediction].COLUMN_VALUES( SKIP, [Month])
OPENROWSET('SQLOLEDB', '…', 'SELECT MonthID, Month FROM Months ORDER BY MonthID')
When the column values have been trained, the client application can browse those values but
cannot yet perform queries or browse model content. Also, because all column-value
relationships are now known, all RELATED TO columns can be omitted from the
model-training query.
2.6 Source Data
The <source data query> part of the INSERT (See "Populating the Mining Model") and
SELECT FROM PREDICTION JOIN (See "Querying—Applying Mining Models on New
Data") commands can be any of the sources described by the
SUPPORTED_SOURCE_QUERY column from the MINING_SERVICES schema rowset
described in Appendix A. The possible values for this column are as follows:

- SINGLETON CONSTANT
- SINGLETON SELECT
- OPENROWSET
- SELECT
- SHAPE
The meaning of each of these constants is described in more detail in the following sections.
If the data-mining provider is embedded in a relational provider that supports nested tables
(also known as table columns), the entire population process could occur under the aegis of a
single provider. However, it is expected that at first the DM providers will be separated from
the relational providers and that the relational providers usually will not natively support
nested tables.
This specification offers suggested ways to overcome these issues. Data mining providers are
strongly encouraged to support at least one of the methods discussed in the following sections
and must publish which methods they support in the MINING_SERVICES schema rowset.
2.6.1 SINGLETON CONSTANT as Source Data
If the provider supports SINGLETON CONSTANT as a SUPPORTED_SOURCE_QUERY
value from the MINING_SERVICES schema rowset, a syntax allowing specification of cases
as a set of constant values is supported in place of the <source data query> for the INSERT
and SELECT FROM PREDICTION JOIN commands.
<singleton constant> ::= (<value or set of values> [,<value or set of values>] )
<value or set of values> ::= <value> | (<set of values>)
For example, the following could be a valid syntax to supply a set of values:
('1', 'Male', (('TV', 1), ('VCR', 2)), (('Van'), ('Truck')))
Although the syntax is identical, the (<singleton constant list>) used by the INSERT
INTO VALUES command syntax is not the same as replacing <source data query> with a
singleton constant data source object. (The only syntax difference is the word "VALUES."
However, inserting a constant row by using the word VALUES is standard SQL, and
accepting a constant list as a general replacement for a table is not.)
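For illustration only, a minimal sketch that feeds the constant case above into the
[Age Prediction] model using the INSERT INTO ... VALUES form; the column mapping shown is an
assumption about how a provider might expect the nested values to be listed:

INSERT INTO [Age Prediction]
    ([Customer ID], [Gender],
     [Product Purchases]([Product Name], [Quantity]),
     [Car Ownership]([Car Name]))
VALUES ('1', 'Male', (('TV', 1), ('VCR', 2)), (('Van'), ('Truck')))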
2.6.2 SINGLETON SELECT as Source Data
If the provider supports SINGLETON SELECT as a SUPPORTED_SOURCE_QUERY value
from the MINING_SERVICES schema rowset, a syntax allowing specification of cases as a
selection of constant values is supported in place of the <source data query> for the
INSERT and SELECT FROM PREDICTION JOIN commands.
The syntax has the following form:
<singleton select> ::= <compound constant select> as <alias>
<compound constant select> ::= <constant select> |
<compound constant select> UNION <compound constant select>
<constant select> ::= (SELECT <alias constant list>)
<alias constant list> ::= <alias constant element> |
<alias constant list>, <alias constant element>
<alias constant element> ::= <CONSTANT> |
<CONSTANT> as <alias> |
<singleton select>
For example, the following could be valid syntaxes to supply a set of values:
(SELECT 21 as Age, 'Male' as Gender) as Case
(SELECT 21 as Age, 'Male' as Gender,
((SELECT 'ham' as Product, 10 as Qty) UNION (SELECT 'beer' as Product, 1 as Qty)) as
Purchases)
as Case
2.6.3 OPENROWSET as Source Data
If the provider supports OPENROWSET as a SUPPORTED_SOURCE_QUERY value from
the MINING_SERVICES schema rowset, a syntax allowing cases to result from an
OPENROWSET of an external command is supported in place of the <source data query>
for the INSERT and SELECT FROM PREDICTION JOIN commands.
Since many of the DM providers will not be embedded within the RDBMS containing the
source data, the <source data query> will most likely need to read data from another data
source. The OPENROWSET function supports this functionality and has the following basic
syntax:
OPENROWSET('provider_name','provider_string','query_syntax')
The 'provider_name' is an OLE DB provider name, the 'provider_string' is the OLE DB
connection string for that provider, and the 'query_syntax' is a query syntax that returns a
rowset (either simple or using SHAPE). The DM provider will establish connection to the
data source object using the 'provider_name' and 'provider_string' and will execute the query
specified in 'query_syntax' to retrieve the source data rowset.
The complete syntax for OPENROWSET is described in Appendix F.
2.6.4 SELECT as Source Data
If the provider supports SELECT as a SUPPORTED_SOURCE_QUERY value from the
MINING_SERVICES schema rowset, the standard SQL SELECT command is supported in
place of the <source data query> for the INSERT and SELECT FROM PREDICTION JOIN
commands.
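For illustration only, a minimal sketch assuming a provider that can resolve the [Customers]
table itself (for example, a DM provider embedded in a relational provider):

INSERT INTO [Age Prediction]
    ([Customer ID], [Gender], [Age])
SELECT [Customer ID], [Gender], [Age] FROM [Customers]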
2.6.5 SHAPE as Source Data
If the provider supports SHAPE as a SUPPORTED_SOURCE_QUERY value from the
MINING_SERVICES schema rowset, a syntax allowing specification of cases as a SHAPE of
related queries is supported in place of the <source data query> for the INSERT and
SELECT FROM PREDICTION JOIN commands.
A single query to most popular relational providers cannot return the nested tables shaped
result set that is needed for the population of many DMMs. Therefore, multiple queries must
be executed in the data source to retrieve all of the data that a case represents. The queries
must be shaped into a nested table form to feed them into the DMM.
OLE DB for DM provides a number of alternatives for performing this operation, including
the following:

- Use of the MDAC Data Shaping Service. The Data Shaping Service is an OLE DB provider
  that can be layered on top of other providers. In OLE DB for DM, it can be invoked via
  OPENROWSET as follows:
INSERT INTO [Age Prediction]
(
    [Customer ID], [Gender], [Age], [Age Probability],
    [Product Purchases] (SKIP, [Product Name], [Product Type], [Quantity]),
    [Car Ownership] (SKIP, [Car Name], [Probability])
)
OPENROWSET('MSDataShape', 'Data Provider=SQLOLEDB',
    'SHAPE
     {
        SELECT [Customer ID], [Gender], [Age], [Age Probability]
        FROM [Customers]
     }
     APPEND ( {SELECT [CustID], [Product Name], [Product Type], [Quantity]
               FROM [Customer Product Sales] }
              RELATE [Customer ID] TO [CustID]
            ) AS [Product Purchases],
            ( {SELECT [CustID], [Car Name], [Probability]
               FROM [Customer Cars] }
              RELATE [Customer ID] TO [CustID]
            ) AS [Car Ownership] '
)
Note Of course, OPENROWSET can be used to direct the query to any provider so that
any syntax can be used as long as the relevant provider supports it. At this time, there is
no standard SQL syntax to query a nested table. Until such a standard is established, it is
likely that different relational database vendors will create unique and incompatible
syntaxes.

- Integrated support for the SHAPE syntax. Some DM providers may choose to adopt the
  SHAPE command syntax and provide integrated support for it within the data mining
  provider. With these providers, the SHAPE command does not need to be executed within
  the context of an OPENROWSET command:
INSERT INTO [Age Prediction]
(
    [Customer ID], [Gender], [Age], [Age Probability],
    [Product Purchases] (SKIP, [Product Name], [Product Type], [Quantity]),
    [Car Ownership] (SKIP, [Car Name], [Probability])
)
SHAPE
{
    OPENROWSET ('SQLOLEDB', 'catalog=Sales',
        'SELECT [Customer ID], [Gender], [Age], [Age Probability]
         FROM [Customers] ORDER BY [Customer ID]')
}
APPEND ( { OPENROWSET ('SQLOLEDB', 'catalog=Sales',
               'SELECT [CustID], [Product Name], [Product Type], [Quantity]
                FROM [Customer Product Sales] ORDER BY [CustID]')
         }
         RELATE [Customer ID] TO [CustID]
       ) AS [Product Purchases],
       ( { OPENROWSET ('SQLOLEDB', 'catalog=Sales',
               'SELECT [CustID], [Car Name], [Probability]
                FROM [Customer Cars] ORDER BY [CustID]')
         }
         RELATE [Customer ID] TO [CustID]
       ) AS [Car Ownership]
Note Appendix E contains more detail on the SHAPE command syntax. Provider
support of the SHAPE command will likely depend on the explicit ordering of the
input data.

- Native support for nested tables. In time, data mining providers may become integrated
  with relational providers capable of fully supporting nested tables. Such providers might
  adopt their own syntax for specifying nested tables. OLE DB for DM does not preclude
  support for such syntax.
2.7 Browsing Mining Model Content
In addition to listing the column structure of a DMM, a very different type of browsing is to
navigate the graphical content of the model. Using a set of input cases, the content of a DMM
is learned by the data mining algorithm. The content of a DMM is the set of rules, formulas,
classifications, distributions, nodes, or any other information that was derived from a specific
set of data using a data mining technique.
Depending on the specific data mining technique used in the creation of the DMM, the
content type may differ from one model to another. The DMM content of a decision tree–based
classification will differ from that of a segmentation model, which, in turn, is very different
from a multiregression DMM.
Browsing the content can provide important insight into the data. In many cases it allows you
to understand the patterns and rules that can be used to predict new data points. You must be
aware, however, that some DMMs do not support a way to express DMM content.
One of the ways to browse the content of the DMM is to extract an XML description of it.
The XML description of the contents can be found in the TABLES schema rowset. The
format of the XML string is provided in Appendix D. The XML string provides an easy way
to get, store, manipulate, and re-create all of the DMM information. However, this format
requires significant expertise from the client application to navigate the content.
The most popular way to express DMM content is by using a directed graph (that is, a tree of
nodes). A decision tree is the classic example. Each node in the tree may have relationships to
other nodes. A node may have one or more parent nodes and zero or more child nodes. The
depth of the graph may vary depending on the specific node.
Tree navigation is already defined in the OLE DB for OLAP specification, and a similar
navigation mechanism is adopted for traversing DMM nodes. The
MINING_MODEL_CONTENT schema rowset described in Appendix A provides a rich
functional set of navigation operations.
Querying the model directly will also return the MINING_MODEL_CONTENT rowset. The
following query provides a result table with the exact structure of the
MINING_MODEL_CONTENT schema rowset:
SELECT * FROM <mining model>.CONTENT
This allows the relational database to expose the set of DMM nodes without requiring custom
OLE DB coding.
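For example, assuming the [Age Prediction] model defined earlier has been populated, its
content nodes can be retrieved with:

SELECT * FROM [Age Prediction].CONTENT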
2.8 Browsing All Possible Cases and Distinct
Column Values
When a mining model is trained, it will encounter in the set of training cases a distinct set of
possible values or "states" that the attributes of the model can take on.
For example, consider a DMM with the following columns: Gender, Age and HairColor.
After this DMM has been trained, the Gender column should end up knowing about the states
"Male," "Female," and "Missing." (For completeness, assume that all attributes, even those
with continuous domains, can take on the "Missing" state. This is true even when NULL or
missing values are not encountered in the training data.) For HairColor, the DMM sees and
remembers the values "Black," "Gray," and "Missing." Although the DMM has seen all of the
values for the continuous attribute column Age, it does not remember every distinct value for
the column. Instead, it learns the minimum, mean, and maximum values for the column.
If the example model was built to predict the HairColor column from a set of 100 people,
browsing the contents of the DMM might show a decision tree over the Gender and Age columns.
The set of all possible cases contained in a DMM has one entry for every possible
combination of the distinct values for each attribute. For discrete attributes, this is a list of the
distinct values seen in the column (plus the "Missing" state). For continuous attributes, the
"Minimum," "Maximum," "Mean," and "Missing" states are reported. For Discretized
attributes, the buckets found during discretization are listed. The return value is the midpoint
between the upper and lower bounds of the bucket. Use of the SELECT command on the DMM
reports these possible cases. Along with each possible case, the DMM can report statistics
learned for the attributes that it has been built to predict.
In the example, the following command and results (shown in the following table) are
possible:
SELECT *, PredictProbability(HairColor) FROM HairColorPredictDMM
Gender    Age     HairColor    P(HairColor)
Male      2       Black        .667
Male      2       Gray         .267
Male      2       NULL         .067
Male      91      Black        .300
Male      91      Gray         .625
Male      91      NULL         .075
Male      45      Black        .667
Male      45      Gray         .267
Male      45      NULL         .067
Male      NULL    Black        .600
Male      NULL    Gray         .350
Male      NULL    NULL         .05
Female    2       Black        .933
Female    2       Gray         .067
Female    2       NULL         .000
Female    91      Black        .300
Female    91      Gray         .625
Female    91      NULL         .075
Female    45      Black        .933
Female    45      Gray         .067
Female    45      NULL         .000
Female    NULL    Black        .600
Female    NULL    Gray         .350
Female    NULL    NULL         .05
NULL      2       Black        .800
NULL      2       Gray         .167
NULL      2       NULL         .033
NULL      91      Black        .300
NULL      91      Gray         .625
NULL      91      NULL         .075
NULL      45      Black        .800
NULL      45      Gray         .167
NULL      45      NULL         .033
NULL      NULL    Black        .600
NULL      NULL    Gray         .350
NULL      NULL    NULL         .05
Providers may support a WHERE clause on this command to filter the resulting set of all
possible cases, as shown in the following example and results table:
SELECT *, PredictProbability(HairColor) FROM HairColorPredictDMM WHERE Gender = 'Male' AND
HairColor = 'Black'
Gender    Age     HairColor    P(HairColor)
Male      2       Black        .667
Male      91      Black        .300
Male      45      Black        .667
Male      NULL    Black        .600
2.8.1 Finding Distinct Column Values
To find the list of possible values against which a column from a DMM can be compared, use
a command with the SELECT DISTINCT syntax from SQL, as in the following example:
SELECT DISTINCT HairColor FROM HairColorPredictDMM
HairColor
Black
Gray
NULL
As expected, selecting distinct combinations of columns will report rows for only the possible
combinations of the selected columns values.
SELECT DISTINCT HairColor, Gender FROM HairColorPredictDMM
Gender    HairColor
Male      Black
Male      Gray
Male      NULL
Female    Black
Female    Gray
Female    NULL
NULL      Black
NULL      Gray
NULL      NULL
In theory, you could select TABLE type columns from a DMM that contains nested tables.
However, in practice, such an operation would be impractical. This is because the set of
possible values for a table-valued column is all of the conceivable tables having every
possible combination of the keys for that nested table. Although this is the conceptual "truth
table" content of the DMM, no provider should be expected to manifest this set of records.
However, selecting distinct column values from a set of all possible nested table cases is often
a useful task. Consider the larger example from Section 1.3 that contained a nested table of
product purchases. The following command produces a list of the distinct product names that
a customer may purchase:
SELECT DISTINCT [Product Purchases].[Product Name] FROM [Age Prediction]
Note that this syntax uses the "." operator to refer to a column from the scope of a nested
table.
Furthermore, you can determine relationships between trained column values with a WHERE
clause. In the larger example, product names were classified by product type. To find the
products of a certain type, consider the following command:
SELECT DISTINCT [Product Purchases].[Product Name] FROM [Age Prediction]
WHERE [Product Purchases].[Product Type] = 'Electronic'
This will return a list of all Product Names with which the model was trained that have a
corresponding type of "Electronic."
2.9 Querying—Applying Mining Models on
New Data
Prediction queries on a DMM allow you to predict attributes that may be missing from new
cases. To perform a query, you need a populated DMM (that is, already trained) and a set of
new cases to predict (generally not the cases upon which the DMM was trained).
2.9.1 Components of a Prediction Query
Prediction queries are retrieved from a DMM with a SELECT command. (The complete
syntax for the OLE DB for DM–compliant SELECT statement is presented in Appendix B.)
SELECT [FLATTENED] <SELECT-expressions>
FROM <mining model name> PREDICTION JOIN <source data query> ON <join condition>
[WHERE <WHERE-expression>]
2.9.1.1 Source Data Query
The <source data query> clause identifies the set of new cases that will have attributes
predicted by combining this set with the learned knowledge in the DMM. For information on
source data queries, please see the section "Source Data."
2.9.1.2 PREDICTION JOIN
When retrieving predictions from a DMM, the actual cases from <source data query> are
matched up with the set of all possible cases from the model (<mining model name>) via a
PREDICTION JOIN operation. See "Browsing All Possible Cases and Distinct Column
Values" for an explanation of the possible cases contained in a DMM. For the following
simple reasons, the matching of source cases to all possible cases with a PREDICTION JOIN
does not follow the semantics of a standard relational JOIN:

- The DMM cases do not represent every possible value of a continuous column, but a
PREDICTION JOIN must match an exact continuous value from the source case to some
learned distribution in the DMM. Using the simple example set of all possible cases
defined earlier, the following command returns no records because the possible cases for
the DMM contains the Age column values for only the "Minimum," "Mean,"
"Maximum," and "Missing" ages (2, 45, 91, "Missing"):
SELECT * FROM GenderPredictDMM WHERE Gender = 'Male' AND Age = 30
However, a PREDICTION JOIN using the decision tree described for this model finds a
distribution on HairColor for a 30-year-old Male of (Black = .667; Grey = .267; Missing =
.067).
- The DMM cases represent all possible states for a column being predicted, while a user
selecting a prediction for a column often expects to get the single "Best" predicted state.
Use of the same simple example model produces the following results:
SELECT * FROM GenderPredictDMM WHERE Gender = 'Male' AND Age = 45
Gender    Age    HairColor
Male      45     Black
Male      45     Gray
Male      45     NULL
However, selecting HairColor from this model using PREDICTION JOIN to a case for a
45-year-old male would simply report "Black" as the single value for HairColor.

- The PREDICTION JOIN may need to make some aggregations and assumptions when
confronted with missing values in the source case. To continue the example, a
PREDICTION JOIN between the simple model and a case where the person's age is 30
but the gender is unknown would report a hair color of "Black" with a probability of 80%.
(As the sample tree indicates, this is a probability which is independent of Gender.)
In general, PREDICTION JOIN will take one case from the input set, and using the
conditions in the ON clause, it will find a matching set of cases from the DMM. This set
of matching DMM cases is then "collapsed" by the algorithm (in an algorithm-specific
way) into one aggregate case that contains the best predictions for all predictable columns
in the model. This collapsed case may have prediction-describing statistics that are not
directly observable in the set of all possible DMM cases because the statistics are the
result of the collapsing process.
2.9.1.3 SELECT Expressions
The <SELECT-expressions> clause is a set of comma-separated expressions, each of which
can be just a simple column reference or a general expression containing prediction functions
that may be connected with various types of operators. (See "Prediction Details.") Columns
can be referenced from the DMM or from the source data query. When a name conflict occurs
between the DMM and source, the column reference must be prefixed with the model name or
the source query's alias.
To validate the accuracy of the learned model, make a prediction on a set of new source cases
where the predicted column value is known (a set of cases reserved from the set upon which
the model was trained). Use SELECT to find the predicted value of the column from the
model and the actual value from the source query.
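For illustration only, a minimal sketch of such a validation query; the [Holdout Customers]
source table is hypothetical, and only the Gender column is bound in the join condition so
that the model's Age prediction can be compared against the actual Age from the source:

SELECT T.[Age] AS [Actual Age], M.[Age] AS [Predicted Age]
FROM [Age Prediction] AS M PREDICTION JOIN
     OPENROWSET('SQLOLEDB', '…',
         'SELECT [Customer ID], [Gender], [Age] FROM [Holdout Customers]') AS T
ON M.[Gender] = T.[Gender]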
2.9.1.4 ON and the Join Condition
The existence of key columns on the case row is really for bookkeeping and consistency
reasons; the key values from a set of training data may not be used by the DMM, and the
DMM does not retain the set of distinct values for these columns. However, because each row
from the DMM's set of all possible cases is unique, it can be matched to rows from the source
query of actual cases through the <join condition> clause of the ON keyword. The join
condition matches columns from the DMM to columns from the source query. The join
condition has one "=" expression for each set of columns to be matched, and the expressions
are joined with the AND keyword. Column references in the join condition can be simple
column names, they can be prefixed with a model or alias name to scope namespaces and
resolve name conflicts, and they can have many scope levels to identify columns which are in
turn members of table type columns. Consider the following examples:
SELECT … ON GenderPredictDMM.Gender = T2.Gender AND GenderPredictDMM.Age = T2.Age
Notice that even though the model has a column for HairColor, the source query may not have
this column. In fact, if the SELECT command is predicting the "best" HairColor, the DMM's
HairColor column should not be bound to a source column.
SELECT … ON M1.Gender = T2.Sex AND
    M1.[Product Purchases].[product name] = T2.[Product Purchases].[product name]
The DMM [Age Prediction] has been aliased in the FROM clause as M1, and the source
query has been renamed to T2. For both tables, the [product name] column exists in a nested
table-valued column called [product purchases].
For the situation where the schema of the DMM matches the schema of the input query, the
keywords NATURAL PREDICTION JOIN can be used, and the ON clause must be omitted.
Columns from the source query will be matched to columns from the DMM based on the
names of the columns.
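For illustration only, a minimal sketch that combines NATURAL PREDICTION JOIN with a
singleton SELECT source; it assumes the HairColorPredictDMM model described earlier and a
provider that supports the SINGLETON SELECT source, and the constant values are arbitrary:

SELECT [HairColor]
FROM HairColorPredictDMM NATURAL PREDICTION JOIN
     (SELECT 'Male' AS [Gender], 30 AS [Age]) AS T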
2.9.1.5 WHERE Clause
The <WHERE-expression> supports a simplified form of the SQL WHERE clause semantics
that can limit the cases returned from a prediction query. Column references in the WHERE
expression have the same semantics of column references in the <SELECT-expressions>.
2.9.2 An Example
The following sample query will return the predicted age for a set of new customers where the
prediction is more than 80% likely:
SELECT
T1.[Customer ID], T1.[Gender], M1.[Age]
FROM
[Age Prediction] as M1 PREDICTION JOIN
OPENROWSET('MSDataShape',
'data provider=Microsoft.Jet.OLEDB.4.0;data source=D:\customer.mdb',
'SHAPE { SELECT [Customer ID], [Gender]
FROM [Customers] ORDER BY [Customer ID]}
APPEND ( {SELECT [CustID], [Product Name], [Quantity]
FROM [Customer Product Sales] ORDER BY [CustID] }
RELATE [Customer ID] TO [CustID]) AS [Product Purchases],
( {SELECT [CustID], [Car Name]
FROM [Customer Cars] ORDER BY [CustID] }
RELATE [Customer ID] TO [CustID]) AS [Car Ownership]') as T1
ON M1.Gender = T1.Gender AND
M1.[Product Purchases].[Product Name] = T1.[Product Purchases].[Product Name] AND
M1.[Product Purchases].Quantity = T1.[Product Purchases].Quantity AND
M1.[Car Ownership].[Car Name] = T1.[Car Ownership].[Car Name]
WHERE PredictProbability(M1.Age) > .8
2.9.3 Prediction Details
Along with the "best" predicted values, prediction queries on DMMs can convey additional
information and statistics learned from the training data set. There are no explicit columns in
the DMM dedicated to holding these additional bits of information; instead, they can be selected
from the DMM by calling the appropriate functions (often a function taking the predicted
column as an argument).
Some of these functions report simple scalar values that relay measures of the confidence in a
prediction or give fine-grained control over how a prediction is made. Other functions can
expand a prediction into a table of details that better explain the prediction.
Also, the value predicted for a nested table (a column of type TABLE that is predictable) will
in theory produce a nested table with one row for every distinct value for the key of the nested
table. Various functions can operate on this nested table and limit, expand, or reorder the
records. These functions are often a shorthand form of a nested SELECT clause. (A SELECT
statement operating on the nested table can produce a new version of the nested table. A
nested SELECT can be used as an entry in the <SELECT-expressions> list to generate a
nested table.)
These functions will be described briefly in the following sections and are fully enumerated in
Appendix C.
2.9.3.1 Scalar Functions
Directly selecting a predictable column from a DMM is a shortcut for using the default
behavior of the Predict function on the column. It will return the "best" predicted value for
the column (that is, the one with highest probability or whatever the provider decides is most
appropriate). When a non-TABLE type column is given to the Predict function, the result is a
scalar value.
All attributes of a DMM implicitly consider "Missing" as one of the possible values or states
that they should model. In general, it is assumed that "Missing" or NULL values should not
be returned as predictions, even if they are the most likely states. However, for some domains,
a prediction of "Missing" could be informative. For example, consider a data set for the result
of a survey that asked for Age, Gender, and Weight. If you are trying to predict Weight when
given Age and Gender, for example, you might learn that for a certain segment of the
population the average Weight is 135 lbs, but the most likely response to the question is
"Missing" (that is, "none of your business!"). An (optional) argument to the Predict function
can be the value INCLUDE_NULL, which is used to force the Predict function to return
"Missing" as one of the potential prediction values.
Along with the predicted value, other functions can give statistics that describe the prediction.
PredictSupport(MyColumn) will return the number of cases in support of the prediction, and
PredictProbability will give the likelihood of the returned value amongst the set of possible
values for the column.
SELECT [Customer ID], Predict(Age), PredictProbability([Age]) as P …
Customer ID    Age    P
10001          43     .667
10203          43     .400
In the preceding example, [Age] is the predicted attribute and it is a Discretized attribute, so
the predicted value for age will be the midpoint of one of the "buckets" that were found for
age values. To get a better description for the range of a predicted bucket, the RangeMin,
RangeMax, and RangeMid functions can be called on the prediction for the Discretized
column.
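For illustration only, a minimal sketch of retrieving the bucket boundaries along with the
prediction (the rest of the query is elided, following the convention of the other examples
in this section):

SELECT [Customer ID], Predict([Age]),
       RangeMin([Age]) AS [Age Low], RangeMax([Age]) AS [Age High] …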
However, if instead of Discretized, this model was created with [Age] as a continuous
attribute, the reported prediction for Age would be a continuous value (in the domain of Age).
This predicted age may be the mean of some local distribution—for example, the average age
of people who buy the same products as those purchased by a person in the source case. Using
this predicted value alone may be sufficient, but additional pieces of information might also
be available. For example, the standard deviation will usually accompany a continuous
attribute prediction, as follows:
SELECT [Customer ID], [Age], PredictStdev([Age]) as S …
Customer ID    Age    S
10001          45     5.2
10203          15     2.1
[Age] will return the mean value of prediction of age for the input case. The PredictStdev
function will return the standard deviation for the predicted [Age] column. Notice that, unlike
the SQL STD function, which is an aggregation function, PredictStdev is a scalar
function that may provide different results for each returned row.
If the DMM supports finding a clustering of records, the cluster membership information for a
given input case can be obtained with the Cluster function. It returns the cluster identifier that
the given input case most likely belongs to. Details about the input case's fit into its cluster are
retrieved with the ClusterDistance and ClusterProbability functions.
SELECT [Customer ID], [Gender], Cluster() as C, ClusterProbability() as CP …

Customer ID    Gender    C    CP
10001          Male      2    .21
10203          Female    7    .32
The list of available functions for each of the prediction columns is found in the
MINING_COLUMNS schema rowset of the DMM. Many of the common functions were
standardized in this specification and are available in Appendix C. The following table
provides a short description of these functions.
Predict(<scalar column reference>, options, …)
    Returns: <column reference>
    General prediction function to modify the behavior of prediction for scalar values,
    such as including a missing state. Returns the "best" value, given the options, for
    the specified scalar column.

PredictSupport(<column reference>)
    Returns: scalar value
    Count of cases in support of the predicted value.

PredictVariance(<column reference>)
    Returns: scalar value
    Variance describing the distribution for which the value of Predict is the mean
    (generally for continuous attributes).

PredictStdev(<column reference>)
    Returns: scalar value
    Square root of PredictVariance.

PredictProbability(<column reference>)
    Returns: scalar value
    Likelihood that Predict is the correct value.

PredictProbabilityVariance(<column reference>)
    Returns: scalar value
    Expresses certainty in the value of PredictVariance.

PredictProbabilityStdev(<column reference>)
    Returns: scalar value
    Square root of PredictProbabilityVariance.

Cluster()
    Returns: scalar value or <cluster column reference>
    Cluster identifier that the input case belongs to with the highest probability. It
    can also be used as a <cluster column reference> for a PredictHistogram function.

ClusterDistance([ClusterID_expr])
    Returns: scalar value
    Distance from the center of the cluster identified by ClusterID_expr, or of the
    highest probability cluster.

ClusterProbability([ClusterID_expr])
    Returns: scalar value
    Probability that the input case belongs to the cluster identified by ClusterID_expr,
    or to the highest probability cluster.

RangeMid(<column reference>)
    Returns: scalar value
    Gives the midpoint of the predicted bucket for a discretized column.

RangeMin(<column reference>)
    Returns: scalar value
    Gives the low end of the predicted bucket for a discretized column.

RangeMax(<column reference>)
    Returns: scalar value
    Gives the upper end of the predicted bucket for a discretized column.
2.9.3.2 Expanding Scalar Predictions with PredictHistogram
The additional information on a prediction need not be a simple scalar. For example, when
predicting a discrete attribute (such as Gender), a histogram is one possible way to provide the
predictions. The histogram will have one entry for each of the possible values that could have
been returned for the column. Along with each value are some statistics that describe its
likelihood. (The exact format of a histogram is presented in Appendix C.) This histogram is a
table, and the PredictHistogram function returns this table as a column with the data type of
TABLE (that is, a table column). The nested table has a predefined set of
information-containing columns. These columns are $Support, $Variance, $Stdev (standard deviation),
$Probability, $ProbabilityVariance, and $ProbabilityStdev.
SELECT [Customer ID], PredictHistogram([Gender]) AS GH …
Customer ID    GH
10001          Gender    $Support    $Probability
               Male      621         .621
               Female    379         .379
10203          Gender    $Support    $Probability
               Male      446         .446
               Female    554         .554
…              …
Note For simplicity, only a few of the automatic information columns are shown in the
preceding example.
The Predict function selects its return value from the table returned by PredictHistogram.
From this table, the record with the highest value for $Probability is found, and the value
for the appropriate column is returned.
Depending on the capabilities of the underlying DMM, the distribution for a continuous
column may have more than one mode. (That is, the distribution graph shows more than one
peak.) In this case, users can obtain the statistics (mean, standard deviation, and so on) of each
mode by using the PredictHistogram function against a continuous column.
SELECT [Customer ID], PredictHistogram([Age]) AS AH …
Customer ID    AH
10001          Age     $StdDev    $Probability
               32.1    17.2       .621
               65.2    6.4        .379
…
If the DMM supports finding a clustering of records, the Cluster function returns the most
likely cluster membership for a given input case. However, the input case may exist with
various degrees of probability in many or all of the clusters. Using the
PredictHistogram(Cluster()) function will expand the cluster prediction out to a table
describing the full cluster membership of the input case.
SELECT [Customer ID], PredictHistogram(Cluster()) AS CH …
Customer ID    CH
10001          Cluster()    $Support    $Probability
               1            724         .55
               2            1025        .05
               3            20          .40
…
By default, the PredictHistogram function will not include "Missing" as one of the reported
states. To force the function to return statistics for the attribute's missing state, the argument
passed into PredictHistogram should be a call to Predict on the attribute, with the argument
to include "Missing" specified, as shown in the following example:
SELECT [Customer ID], PredictHistogram(Predict([Gender], INCLUDE_NULL)) AS GH …
If a column supports the PredictHistogram function, it will be found in the
MINING_COLUMNS schema rowset of the DMM. A full description of PredictHistogram
can be found in Appendix C. The following table provides a short description:
PredictHistogram(<scalar column reference>)
    Returns: <table>
    Generates a histogram that contains details of the predictions for the column. The
    input column reference can be a column returning a function such as Predict or Cluster.
2.9.3.3 Predictions on Table Columns
TABLE type columns may be predicted. The result of selecting such a TABLE type column
from a DMM in a PREDICTION JOIN query is a nested table with one row for every distinct
value learned for the key of the nested table. Along with each row of the generated nested
table will be the "best" predicted value for any predictable columns from the nested table.
Directly selecting a TABLE type column by name is a shortcut for using the default behavior
of the Predict function on the column. Also, because the column is in itself a table, a nested
SELECT statement can be used to return the rows. Using the example schema, where the
Gender, Product Purchases, and Quantity columns are predictable, the following three queries
are equivalent and will return the same results:
SELECT [Customer ID], [Gender], [Product Purchases] …
SELECT [Customer ID], [Gender], Predict([Product Purchases]) …
SELECT [Customer ID], [Gender], (SELECT * FROM [Product Purchases]) …
Customer ID    Gender    Product Purchases
10001          Male      Product Name    Quantity    Product Type
                         TV              1           Electronic
                         Ham             2           Food
                         Beer            6           Beverage
10203          Female    Product Name    Quantity    Product Type
                         TV              2           Electronic
                         Ham             1           Food
                         Beer            0           Beverage
The input table of actual cases may or may not contain a nested table that matches the nested
table being predicted. If not, the interpretation of Predict on the table column is quite natural.
Predict the membership of this table based on the other factors given for the case. If, however,
the input case has a matching nested table, three possible behaviors may be desired. Consider
the following example model:
1. A prediction simply could be the complete list of products the store offers, with associated
predictions for quantities.
2. The prediction might show what other products a customer is likely to buy based on the
products the customer has already bought. The reported list should not include the product
from the input case.
3. The prediction might be just the predicted "Quantity" value associated with the products
from the input case, or perhaps just the likelihood of each product in the input case. No
other products should appear in the nested output table.
To express these three different cases, users can specify, respectively, one of the following
options in the Predict function:

- INCLUSIVE, which produces behavior number 1.
- EXCLUSIVE (the default option), which produces behavior number 2.
- INPUT_ONLY, which ensures that the predicted table contains only the rows supplied by
  the input (behavior number 3).
Each entry in the predicted nested table has some probabilistic measurements for inclusion or
ranking in the list. This is different from the probabilities and statistics associated with
individual predictable columns within the nested table. Instead, these are statistics that
describe what was learned about the mere existence of the record in the nested table. For
instance, a model may show an 80% chance that a certain customer will buy beer but only a
40% chance that the beer will be purchased on sale, or a 70% chance that the number of units
purchased will be 12. Another value for the option argument of the Predict function appends
statistics-containing columns to the returned nested table (similar to the way the
PredictHistogram function creates statistics columns in the nested table it produces). Using
the INCLUDE_STATISTICS value adds a $Support and a $Probability column to the
resulting nested table, as illustrated in the following example:
SELECT [Customer ID], [Gender], Predict([Product Purchases], INCLUDE_STATISTICS, INPUT_ONLY)
…
Customer ID    Gender    Product Purchases
10001          Male      Product Name    Quantity    Product Type    $Support    $Probability
                         Ham             2           Food            725         .267
10203          Female    Product Name    Quantity    Product Type    $Support    $Probability
                         Ham             1           Food            30          .34
                         Beer            0           Beverage        56          .83
Note In the preceding example, the customer 10001 input case contained a Product
Purchases subrow only for Ham, and the customer 10203 case contained subrows for Ham
and Beer. Because the INPUT_ONLY option was used, only these rows show up in the
prediction.
The $Probability column for a nested table contains the probability of existence for the
particular subtable entry. No assumptions can be made about the relationships among the sets
of probabilities returned for nested table membership. As they may be derived from
independent parts of the DMM, they cannot be added together to make anything meaningful.
One of the more complex forms of a returned prediction results from requesting a histogram
for a value column inside a predicted table column. In this case, the prediction may include a
histogram for the different statistics of each of the values. The following query will provide
such a structure. (For simplicity, only a few of the automatic info columns are shown in this
example.)
SELECT [Customer ID], [Gender],
(SELECT [Product Name], PredictHistogram([Quantity]) AS [Quantity Histogram]
FROM Predict([Product Purchases], INCLUDE_STATISTICS)) …
Customer ID   Gender    Product Purchases
                        Product Name    Quantity Histogram                       $Probability
                                        Quantity    $Variance    $Probability
10001         Male      TV              1           1.3          0.60            0.23
                                        2           1.8          0.10
                                        3           3.2          0.30
                        Ham             1           0.5          0.25            0.267
                                        2           0.7          0.55
                                        3           3.7          0.20
                        Beer            1           1.1          0.15            0.832
                                        2           0.7          0.15
                                        3           0.2          0.70
If a TABLE column supports the Predict function, it will be found in the
MINING_COLUMNS schema rowset of the DMM. A full description of Predict can be
found in Appendix C. The following table provides a short description.
Function: Predict(<TABLE column reference>, options, …)
Return value: <table column reference>
Description: General prediction function used to modify the default behavior of prediction—for
example, including missing records, appending statistics, inclusive/exclusive/input-only
membership, and so on.
2.9.3.4 Operating on Nested Tables
If a nested table returned as a prediction contains a great number of records (as would be the
case if a store sold many, many different items), slogging through the results of the nested
table to pick out interesting predictions would be an onerous task for both the provider and the
consumer. Even if the nested table contains a relatively small number of records, finding good
predictions from the set would be inconvenient. To solve this problem, OLE DB for DM
introduces the TopX and BottomX family of functions, which operate on nested tables
(including those resulting from PredictHistogram, a nested SELECT, or any other
table-returning expression). These functions order the records of the nested table by a specified
column's value and then truncate the sorted list to a specified length.
For example, using the TopCount function, the following syntax retrieves the three most
probable hair colors (from the learned set of 8 possible) for an input case:
SELECT [Customer ID], TopCount(PredictHistogram([HairColor]), $Probability, 3)…
Or, to get the 10 products (out of the 10,000) that a customer is predicted to buy in
the largest quantity, the TopCount function could be used as follows:
SELECT [Customer ID], TopCount([Product Purchases], [Quantity], 10) …
If a nested table contains a large number of columns and only a few are interesting to the
prediction, or if using a function that produces information columns (such as
PredictHistogram or Predict) and some of the automatic columns are not needed, a nested
SELECT can be used on the nested table or function to project out the desired columns.
Following are two examples using a nested SELECT:
SELECT [Customer ID], (SELECT [Product Name], Quantity FROM [Product Purchases]) …
or
SELECT [Customer ID], (SELECT HairColor, $Support AS Sup FROM
TopCount(PredictHistogram([HairColor]), $Probability, 3)) AS PH …
Customer ID   PH
              HairColor    Sup
200           Red          100
              Brown        57
              Black        13
220           Grey         675
              Black        453
              Green        2
Suppose you wanted to get a list of predicted records from a TABLE type column and, along
with each nested table record, you wanted additional statistics on a predictable column in the
nested table. An earlier example in this document provided this information (and more). This
earlier example generated a prediction of product purchases and, along with each prediction, a
detailed histogram explaining the prediction for the quantity column. Navigating such a
nested rowset may be a bit cumbersome and is also unnecessary if the only information
needed is the best prediction of quantity and some other measure of the prediction's strength
that is returned from the prediction histogram. The following example shows how to get this
result:
SELECT [Customer ID], Gender,
(SELECT [Product Name], [Quantity] AS [Best Quantity],
PredictStdev(Quantity) AS [Quantity Deviation],
$Probability
FROM Predict([Product Purchases], INCLUDE_STATISTICS)), …
Customer ID   Gender   Product Purchases
                       Product Name   Best Quantity   Quantity Deviation   $Probability
10001         Male     TV             1               1.3                  0.23
                       Ham            2               0.7                  0.267
                       Beer           3               0.2                  0.832
The sub-SELECT in the preceding example extracts the desired columns from the nested table
generated by Predict([Product Purchases], INCLUDE_STATISTICS). Note that $Probability
is one of the columns that the Predict function automatically creates and is the probability of
the record existing in the set, not the probability on the quantity.
A nested SELECT with a WHERE clause can be used to pull out certain records from a
nested table. For example, if instead of always getting the "best" prediction for gender a query
wanted to get the probability that each customer was "Female," this syntax would work as
shown in the following example:
SELECT [Customer ID],
(SELECT $Probability FROM PredictHistogram([Gender]) WHERE Gender = 'Female')
AS [Female Probability] …
Customer ID   Female Probability
10001         .379
10203         .554
Another similar use of the WHERE clause is to limit the records in the prediction on a
TABLE type column to some specific entries or set of entries. The following example shows
how to get only predictions for the purchase of "Beer" for any customer:
SELECT [Customer ID], (SELECT * FROM [Product Purchases] WHERE [Product Name] = 'Beer') …
Customer ID   Product Purchases
              Product Name   Quantity   Product Type
10001         Beer           6          Beverage
10203         Beer           0          Beverage
The same idea applies to limiting the scope of nested table predictions to a set of related records
as defined by another column that is related to the key of the subtable, as illustrated by the
following example:
SELECT [Customer ID], (SELECT * FROM [Product Purchases] WHERE [Product Type] = 'Beverage') …
The list of available functions for a predictable TABLE type column is found in the
MINING_COLUMNS schema rowset of the DMM. Many of the common functions were
standardized in this specification and are available in Appendix C. The following table
provides a short description of these common functions.
Function: TopCount(<table expr>, <rank expr>, <n-items>)
Return value: <table expr>
Description: Returns the first <n-items> rows in a decreasing order of <rank expr>.

Function: TopSum(<table expr>, <rank expr>, <sum>)
Return value: <table expr>
Description: Returns the first N rows in a decreasing order of <rank expr> such that the sum of
the <rank expr> values is at least <sum>.

Function: TopPercent(<table expr>, <rank expr>, <percent>)
Return value: <table expr>
Description: Returns the first N rows in a decreasing order of <rank expr> such that the sum of
the <rank expr> values is at least the given percentage of the total sum of <rank expr> values.

Function: Sub-SELECT: (SELECT <SELECT-expressions> FROM <table expr> [WHERE <WHERE clause>])
Return value: <table expr>
Description: Applies a SELECT against <table expr>. <table expr> can be either a table column
reference or any table-returning function except a sub-SELECT.
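The BottomX counterparts take the same arguments and are expected to return the rows with
the smallest values of the rank expression. As a sketch only, using the same model as the
earlier examples, the three least probable hair colors could be retrieved with:

SELECT [Customer ID], BottomCount(PredictHistogram([HairColor]), $Probability, 3) …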
2.9.3.5 Singleton Queries
In some cases, you may want to make a prediction for a case that is not contained in a table.
For example, during a Web site visit, the Web server needs to make a prediction about the
visitor's preferences based on the activities recorded so far. The current activities may not yet
be recorded in the RDBMS, and it may be very inefficient to generate a record (or a set of
records in multiple tables) solely for the purpose of prediction.
To solve this problem, the provider can support a syntax allowing sets of constant values in
place of the <source data query> for the SELECT FROM PREDICTION JOIN syntax. See
the section "Source Data" for examples of singleton data sources.
2.9.4 Flattening Nested Tables
The nested table is a very useful form of data representation that is well suited to the needs of
data mining algorithms. Unfortunately, however, there is currently no widespread support in
relational databases for this form of data representation. The way to convert flat relational
views to a nested table was discussed earlier, and the SHAPE statement is introduced in
Appendix E. This mechanism helps to feed data into the DM provider.
Some data mining clients will not be able to accept result sets in hierarchical format from a
DM provider. This may be because the client lacks the ability to handle hierarchy or because
the client application needs to store the results in a single relational table. To convert the data
from nested tables to flattened tables, it is necessary to request that the query results be
flattened. For this, the SELECT syntax provides the FLATTENED option, as in the following
example:
SELECT FLATTENED <SELECT-expressions> FROM …
The FLATTENED option turns the SELECT result from a hierarchical table into a
flattened table form. The result set will contain one row for each predicted value, simplifying
the processing of the prediction results. If the columns in the <SELECT-expressions> clause
come from various levels of a hierarchy of table nesting, the resulting flattened table will not
put the prediction results on the same record; doing so would imply a connection between the
predictions, and no such connection is assumed to exist. For example, a FLATTENED prediction
on [Products Purchases] might give the result set shown in the following table.
Customer ID   Product Name   Quantity   Probability
1             TV             1          .25
1             TV             2          .1
1             TV             3          .02
1             Ham            2          .2
1             Ham            1          .05
1             Ham            3          .03
In this result set, each row contains a single prediction of products and the possible quantities.
If the columns in the <SELECT-expressions> clause include columns from more than one
table column, the results will return the hierarchical shape in a flattened result set. Each row
again contains a single prediction, but different rows might contain different types of
predictions. For example, if a prediction is made for Gender and Product Purchases, the
flattened result set might look like the following table.
Customer ID   Gender   Gender Probability   Product Name   Quantity   Product Quantity Probability
1             Female   .43                  Null           Null       Null
1             Male     .57                  Null           Null       Null
1             Null     Null                 TV             1          .25
1             Null     Null                 TV             2          .1
1             Null     Null                 TV             3          .02
1             Null     Null                 Ham            2          .2
1             Null     Null                 Ham            1          .05
1             Null     Null                 Ham            3          .03
Each row contains a single prediction; some rows contain a prediction for Gender while
others have a prediction on Product Purchases.
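A query of the following general form could request such a flattened result; this is only a
sketch, and the model name and source data placeholder are illustrative:

SELECT FLATTENED [Customer ID], PredictHistogram([Gender]),
(SELECT [Product Name], PredictHistogram([Quantity]) FROM [Product Purchases])
FROM [Purchase Model] PREDICTION JOIN <source data query> …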
2.10 Deleting Existing Mining Models
Following are two ways to perform deletion operations:
1. Delete the DMM object—Remove the object from the system, with both its structure and
its content.
2. Clear the DMM content—Clear the object of its content, but leave its structure intact.
These two operations are similar to the operations of dropping a table from the database or
clearing all of the table content by using the following statements:
- DROP MINING MODEL <model name>: Deletes the DMM from the database. The model will disappear from the namespace.
- DELETE FROM <model name>: Deletes the content and the column values of the mining model but leaves the object structure intact. You may now repopulate the DMM with a new set of training data (using the INSERT INTO statement) without having to re-create the DMM structure.
- DELETE FROM <model name>.CONTENT: Deletes the content of the mining model but leaves the structure and learned column values intact.
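For example, with a hypothetical model named [Purchase Model], the three statements take the
following forms; the first removes the model entirely, while the latter two leave its structure
in place so it can be retrained:

DROP MINING MODEL [Purchase Model]
DELETE FROM [Purchase Model]
DELETE FROM [Purchase Model].CONTENT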
2.11 Refining Mining Models
Existing DMMs may also be refined. Refinement refers to modifying the content, or set of
rules, by inserting a new set of training cases.
Refining a DMM based on additional cases is limited to algorithms that can be
updated on an incremental basis. The ALLOW_INCREMENTAL_INSERT column in the
MINING_SERVICES schema rowset indicates whether the provider supports this capability
for a given algorithm. If the capability is supported, the DMM can be refined by simply
executing another INSERT INTO statement with the additional cases.
If the capability is not supported, all of the DMM content will have to be deleted and the
DMM must be retrained using the full set of cases (both the old ones and the new ones).
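For example, if the provider reports ALLOW_INCREMENTAL_INSERT for the algorithm
used by a model, a refinement pass is simply another training statement over the new cases.
The model name and the source placeholder in this sketch are illustrative:

INSERT INTO [Purchase Model]
([Customer ID], [Gender], [Product Purchases](SKIP, [Product Name], [Quantity]))
<source data query containing only the new cases>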
3 Appendix A: Schema Rowsets
Schema information in OLE DB is retrieved using predefined schema rowsets; this appendix
lists the contents of each schema rowset. Providers can add columns to these standard schema
rowsets. We recommend that the names of any provider-specific columns be prefixed with the
provider name.
3.1 MINING_MODELS Schema Rowset
Number of restriction columns: 6
Restriction columns: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME,
MODEL_TYPE, SERVICE_NAME, SERVICE_TYPE_ID
Default sort order: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME
Description: Data mining models are exposed in the MINING_MODELS schema rowset.
This schema rowset can be viewed as an enhanced form of the TABLES schema rowset for
data mining models.
Columns (column name, type indicator, description):
1. MODEL_CATALOG (DBTYPE_WSTR): Catalog name. NULL if the provider does not support catalogs.
2. MODEL_SCHEMA (DBTYPE_WSTR): Unqualified schema name. NULL if the provider does not support schemas.
3. MODEL_NAME (DBTYPE_WSTR): Model name. This column cannot contain NULL.
4. MODEL_TYPE (DBTYPE_WSTR): Model type, a provider-specific string—can be NULL.
5. MODEL_GUID (DBTYPE_GUID): GUID that uniquely identifies the model. Providers that do not use GUIDs to identify tables should return NULL in this column.
6. DESCRIPTION (DBTYPE_WSTR): Human-readable description of the model. NULL if there is no description associated with the model.
7. MODEL_PROPID (DBTYPE_UI4): Property ID of the model. Providers that do not use PROPIDs should return NULL in this column.
8. DATE_CREATED (DBTYPE_DATE): Date when the model was created, or NULL if the provider does not have this information. Note: 1.x providers do not return this column.
9. DATE_MODIFIED (DBTYPE_DATE): Date when the model definition was last modified, or NULL if the provider does not have this information.
10. SERVICE_TYPE_ID (DBTYPE_UI4): A bitmask that describes mining service types. The following list includes known popular mining service values: DM_SERVICETYPE_CLASSIFICATION (0x0000001), DM_SERVICETYPE_CLUSTERING (0x0000002), DM_SERVICETYPE_ASSOCIATION (0x0000004), DM_SERVICETYPE_DENSITY_ESTIMATE (0x0000008), DM_SERVICETYPE_SEQUENCE (0x0000010).
11. SERVICE_NAME (DBTYPE_WSTR): A provider-specific name that describes the algorithm used to generate the model.
12. CREATION_STATEMENT (DBTYPE_WSTR): Optional. The statement used to create the original data mining model.
13. PREDICTION_ENTITY (DBTYPE_WSTR): A comma-delimited list indicating which columns the model can predict.
14. IS_POPULATED (DBTYPE_BOOL): VARIANT_TRUE if the model is populated; VARIANT_FALSE if the model is not populated. An empty model has a defined structure but has not been trained with data.
3.2 MINING_COLUMNS Schema Rowset
Number of restriction columns: 4
Restriction columns: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME,
COLUMN_NAME
Default sort order: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME,
COLUMN_NAME
Description: The MINING_COLUMNS schema rowset describes the individual columns of
all defined data mining models known to the provider. This schema rowset can be viewed as
an enhanced form of the COLUMNS rowset for data mining models. Many of the entries are
derived from the COLUMNS schema rowset and are optional.
Columns (column name, type indicator, description):
1. MODEL_CATALOG (DBTYPE_WSTR): Catalog name. NULL if the provider does not support catalogs.
2. MODEL_SCHEMA (DBTYPE_WSTR): Unqualified schema name. NULL if the provider does not support schemas.
3. MODEL_NAME (DBTYPE_WSTR): Model name. This column cannot contain a NULL.
4. COLUMN_NAME (DBTYPE_WSTR): The name of the column; this might not be unique. If this cannot be determined, a NULL is returned. This column, together with the COLUMN_GUID and COLUMN_PROPID columns, forms the column ID. One or more of these columns will be NULL, depending on which elements of the DBID structure the provider uses. If possible, the resulting column ID should be persistent. However, some providers do not support persistent identifiers for columns.
5. COLUMN_GUID (DBTYPE_GUID): Column GUID. Providers that do not use GUIDs to identify columns should return NULL in this column.
6. COLUMN_PROPID (DBTYPE_UI4): Column property ID. Providers that do not associate PROPIDs with columns should return NULL in this column.
7. ORDINAL_POSITION (DBTYPE_UI4): The ordinal of the column. Columns are numbered starting from one. NULL if there is no stable ordinal value for the column.
8. COLUMN_HASDEFAULT (DBTYPE_BOOL): VARIANT_TRUE—the column has a default value. VARIANT_FALSE—the column does not have a default value, or it is unknown whether the column has a default value.
9. COLUMN_DEFAULT (DBTYPE_WSTR): Default value of the column. A provider may expose DBCOLUMN_DEFAULTVALUE but not DBCOLUMN_HASDEFAULT (for SQL-92 tables) in the rowset returned by IColumnsRowset::GetColumnsRowset. If the default value is the NULL value, COLUMN_HASDEFAULT is VARIANT_TRUE and the COLUMN_DEFAULT column is a NULL value.
10. COLUMN_FLAGS (DBTYPE_UI4): A bitmask that describes column characteristics. The DBCOLUMNFLAGS enumerated type specifies the bits in the bitmask. This column cannot contain a NULL value.
11. IS_NULLABLE (DBTYPE_BOOL): VARIANT_TRUE—the column might be nullable. VARIANT_FALSE—the column is known not to be nullable.
12. DATA_TYPE (DBTYPE_UI2): The indicator of the column's data type—for example: "TABLE" = DBTYPE_HCHAPTER, "TEXT" = DBTYPE_WCHAR, "LONG" = DBTYPE_I8, "DOUBLE" = DBTYPE_R8, "DATE" = DBTYPE_DATE.
13. TYPE_GUID (DBTYPE_GUID): The GUID of the column's data type. Providers that do not use GUIDs to identify data types should return NULL in this column.
14. CHARACTER_MAXIMUM_LENGTH (DBTYPE_UI4): The maximum possible length of a value in the column. For character, binary, or bit columns, this is one of the following: the maximum length of the column in characters, bytes, or bits, respectively, if the length is defined (for example, a CHAR(5) column in an SQL table has a maximum length of 5); the maximum length of the data type in characters, bytes, or bits, respectively, if the column does not have a defined length; or zero (0) if neither the column nor the data type has a defined maximum length. NULL for all other types of columns.
15. CHARACTER_OCTET_LENGTH (DBTYPE_UI4): Maximum length in octets (bytes) of the column, if the type of the column is character or binary. A value of zero means the column has no maximum length. NULL for all other types of columns.
16. NUMERIC_PRECISION (DBTYPE_UI2): If the column's data type is a numeric data type other than VARNUMERIC, this is the maximum precision of the column. The precision of columns with a data type of DBTYPE_DECIMAL or DBTYPE_NUMERIC depends on the definition of the column. If the column's data type is not numeric or is VARNUMERIC, this is NULL.
17. NUMERIC_SCALE (DBTYPE_I2): If the column's type indicator is DBTYPE_DECIMAL, DBTYPE_NUMERIC, or DBTYPE_VARNUMERIC, this is the number of digits to the right of the decimal point. Otherwise, this is NULL.
18. DATETIME_PRECISION (DBTYPE_UI4): Datetime precision (number of digits in the fractional seconds portion) of the column if the column is a datetime or interval type. If the column's data type is not datetime, this is NULL.
19. CHARACTER_SET_CATALOG (DBTYPE_WSTR): Catalog name in which the character set is defined. NULL if the provider does not support catalogs or different character sets.
20. CHARACTER_SET_SCHEMA (DBTYPE_WSTR): Unqualified schema name in which the character set is defined. NULL if the provider does not support schemas or different character sets.
21. CHARACTER_SET_NAME (DBTYPE_WSTR): Character set name. NULL if the provider does not support different character sets.
22. COLLATION_CATALOG (DBTYPE_WSTR): Catalog name in which the collation is defined. NULL if the provider does not support catalogs or different collations.
23. COLLATION_SCHEMA (DBTYPE_WSTR): Unqualified schema name in which the collation is defined. NULL if the provider does not support schemas or different collations.
24. COLLATION_NAME (DBTYPE_WSTR): Collation name. NULL if the provider does not support different collations.
25. DOMAIN_CATALOG (DBTYPE_WSTR): Catalog name in which the domain is defined. NULL if the provider does not support catalogs or domains.
26. DOMAIN_SCHEMA (DBTYPE_WSTR): Unqualified schema name in which the domain is defined. NULL if the provider does not support schemas or domains.
27. DOMAIN_NAME (DBTYPE_WSTR): Domain name. NULL if the provider does not support domains.
28. DESCRIPTION (DBTYPE_WSTR): Human-readable description of the column. For example, the description for a column named Name in the Employee table might be "Employee name." NULL if there is no description associated with the column.
29. DISTRIBUTION_FLAG (DBTYPE_WSTR): One of the following: "NORMAL", "LOG_NORMAL", "UNIFORM", "BINOMIAL", "MULTINOMIAL", "POISSON", "HEAVYTAIL", "MIXTURE". Provider-specific flags may also be defined.
30. CONTENT_TYPE (DBTYPE_WSTR): One of the following: "KEY", "DISCRETE", "CONTINUOUS", "DISCRETIZED([args])", "ORDERED", "SEQUENCE_TIME", "CYCLICAL", "PROBABILITY", "VARIANCE", "STDEV", "SUPPORT", "PROBABILITY_VARIANCE", "PROBABILITY_STDEV", "ORDER", "SEQUENCE". Provider-specific flags may also be defined.
31. MODELING_FLAG (DBTYPE_WSTR): A comma-delimited list of flags. The defined flags are "MODEL_EXISTENCE_ONLY" and "NOT NULL". Provider-specific flags may also be defined.
32. IS_RELATED_TO_KEY (DBTYPE_BOOL): VARIANT_TRUE if this column is related to the key. If the key is a single column, the RELATED_ATTRIBUTE field optionally may contain its column name.
33. RELATED_ATTRIBUTE (DBTYPE_WSTR): The name of the target column that the current column either relates to or is a special property of.
34. IS_INPUT (DBTYPE_BOOL): VARIANT_TRUE if this is an input column.
35. IS_PREDICTABLE (DBTYPE_BOOL): VARIANT_TRUE if the column is predictable.
36. CONTAINING_COLUMN (DBTYPE_WSTR): Name of the TABLE column containing this column. NULL if the column is not contained in a TABLE column.
37. PREDICTION_SCALAR_FUNCTIONS (DBTYPE_WSTR): A comma-delimited list of scalar functions that may be performed on the column.
38. PREDICTION_TABLE_FUNCTIONS (DBTYPE_WSTR): A comma-delimited list of functions that may be applied to the column, returning a table. The list has the following format: <function name>(<column1> [, <column2>], ...). The format allows the client to determine which columns will be present in the table returned by any given function.
39. IS_POPULATED (DBTYPE_BOOL): VARIANT_TRUE if the column has learned a set of possible values; VARIANT_FALSE if the column is not populated.
40. PREDICTION_SCORE (DBTYPE_R8): The score of the model on the predicting column. Score is used to measure the accuracy of a model.
3.3 MINING_MODEL_CONTENT Schema Rowset
Number of restriction columns: 10
Restriction columns: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME,
ATTRIBUTE_NAME, NODE_NAME, NODE_UNIQUE_NAME, NODE_TYPE,
NODE_GUID, and NODE_CAPTION
Note A tenth restriction, called the tree operation, is not on any particular column of the
MINING_MODEL_CONTENT rowset; rather, it specifies a tree operator. The idea is that
the consumer specifies a NODE_UNIQUE_NAME restriction and the tree operator
(ANCESTORS, CHILDREN, SIBLINGS, PARENT, DESCENDANTS, SELF) to obtain
the desired set of members. The SELF operator includes the row for the node itself in the
list of returned rows. The following constants are defined:
DMTREEOP_ANCESTORS      0x00000020
DMTREEOP_CHILDREN       0x00000001
DMTREEOP_SIBLINGS       0x00000002
DMTREEOP_PARENT         0x00000004
DMTREEOP_SELF           0x00000008
DMTREEOP_DESCENDANTS    0x00000010
(These designations comprise a bit mask and may be combined.)
Default sort order: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME,
ATTRIBUTE_NAME
Description: The MINING_MODEL_CONTENT schema rowset allows browsing of the
content of a data mining model. The user can employ special tree-operation restrictions to
navigate the content as a directed acyclic graph.
Columns (column name, type indicator, description):
1. MODEL_CATALOG (DBTYPE_WSTR): The name of the catalog to which this model belongs. NULL if the provider does not support catalogs.
2. MODEL_SCHEMA (DBTYPE_WSTR): The name of the schema to which this model belongs. NULL if the provider does not support schemas.
3. MODEL_NAME (DBTYPE_WSTR): Name of the model.
4. ATTRIBUTE_NAME (DBTYPE_WSTR): Name(s) of the attribute(s) corresponding to this node. For a model node, this would be a list of predictable attributes. For a leaf distribution node, this would be a single attribute that the distribution corresponds to.
5. NODE_NAME (DBTYPE_WSTR): Name of the node.
6. NODE_UNIQUE_NAME (DBTYPE_WSTR): Unique name of the node. For providers that generate unique names by qualification, each component of this name is delimited.
7. NODE_TYPE (DBTYPE_I4): The type of the node. Can be one of the following values: DM_NODE_TYPE_MODEL, DM_NODE_TYPE_TREE, DM_NODE_TYPE_INTERIOR, DM_NODE_TYPE_DISTRIBUTION, DM_NODE_TYPE_CLUSTER, DM_NODE_TYPE_UNKNOWN.
8. NODE_GUID (DBTYPE_GUID): Node GUID. NULL if no GUID.
9. NODE_CAPTION (DBTYPE_WSTR): A label or a caption associated with the node. Used primarily for display purposes. If a caption does not exist, NODE_NAME is returned.
10. CHILDREN_CARDINALITY (DBTYPE_UI4): Number of children that the node has. This can be an estimate of the number of children. Consumers should not rely on this being the exact count. Providers should return as good an estimate as possible.
11. PARENT_UNIQUE_NAME (DBTYPE_WSTR): Unique name of the node's parent. NULL is returned for any nodes at the root level. For providers that generate unique names by qualification, each component of this name is delimited.
12. NODE_DESCRIPTION (DBTYPE_WSTR): A human-readable description of the node.
13. NODE_RULE (DBTYPE_WSTR): An XML description of the rule embedded in the node. The format of the XML string is based on the PMML standard.
14. MARGINAL_RULE (DBTYPE_WSTR): An XML description of the rule moving to the node from the parent node.
15. NODE_PROBABILITY (DBTYPE_R8): The probability for reaching the node.
16. MARGINAL_PROBABILITY (DBTYPE_R8): The probability of reaching the node from the parent node.
17. NODE_DISTRIBUTION (DBTYPE_HCHAPTER): A table containing the probability histogram of the node.
18. NODE_SUPPORT (DBTYPE_R8): Number of cases in support of this node.
3.4 Layout of DISTRIBUTION Chapter in MINING_CONTENT Schema Rowset
Number of restriction columns: Not applicable.
Restriction columns: Not applicable.
Default sort order: None.
Description: The DISTRIBUTION column in the MINING_CONTENT schema rowset is a
nested table (which is represented in OLE DB as a chapter column). It provides statistical
distribution information for the attributes corresponding to the node that the parent row
represents. Each attribute will have multiple rows in this table.
Columns (column name, type indicator, description):
1. ATTRIBUTE_NAME (DBTYPE_WSTR): Name of the attribute.
2. ATTRIBUTE_VALUE (DBTYPE_VARIANT): The attribute value represented as a variant.
3. SUPPORT (DBTYPE_R8): The number of cases that support this attribute value.
4. PROBABILITY (DBTYPE_R8): Probability of occurrence of this attribute value.
5. VARIANCE (DBTYPE_R8): Variance of this attribute value.
6. VALUETYPE (DBTYPE_I4): The value type of the attribute. Can be one of the following values: VALUETYPE_MISSING = 1, VALUETYPE_EXISTING = 2, VALUETYPE_CONTINUOUS = 3, VALUETYPE_DISCRETE = 4, VALUETYPE_DISCRETIZED = 5, VALUETYPE_BOOLEAN = 6.
3.5 MINING_SERVICES Schema Rowset
Number of restriction columns: 2
Restriction columns: SERVICE_NAME, SERVICE_TYPE_ID
Default sort order: SERVICE_NAME
Description: The MINING_SERVICES schema rowset exposes the data mining algorithms
available from the provider. It can be used to determine the prediction capabilities,
complexity, and similar information about the algorithm.
Columns (column name, type indicator, description):
1. SERVICE_NAME (DBTYPE_WSTR): The name of the algorithm. Provider-specific. This will be used as the service identifier in the language. (It is not localizable.)
2. SERVICE_TYPE_ID (DBTYPE_UI4): A bitmask that describes mining service types. The following list includes known popular mining service values: DM_SERVICETYPE_CLASSIFICATION (0x0000001), DM_SERVICETYPE_CLUSTERING (0x0000002), DM_SERVICETYPE_ASSOCIATION (0x0000004), DM_SERVICETYPE_DENSITY_ESTIMATE (0x0000008), DM_SERVICETYPE_SEQUENCE (0x0000010).
3. SERVICE_DISPLAY_NAME (DBTYPE_WSTR): The localizable display name of the algorithm. Provider-specific.
4. SERVICE_GUID (DBTYPE_GUID): GUID for the algorithm. NULL if no GUID.
5. DESCRIPTION (DBTYPE_WSTR): Description of the algorithm.
6. PREDICTION_LIMIT (DBTYPE_UI4): The maximum number of predictions the model and algorithm can provide; 0 means no limit.
7. SUPPORTED_DISTRIBUTION_FLAGS (DBTYPE_WSTR): A comma-delimited list of one or more of the following: "NORMAL", "LOG_NORMAL", "UNIFORM". Provider-specific flags may also be defined.
8. SUPPORTED_INPUT_CONTENT_TYPES (DBTYPE_WSTR): A comma-delimited list of one or more of the following: "KEY", "DISCRETE", "CONTINUOUS", "DISCRETIZED", "ORDERED", "SEQUENCE_TIME", "CYCLICAL", "PROBABILITY", "VARIANCE", "STDEV", "SUPPORT", "PROBABILITY_VARIANCE", "PROBABILITY_STDEV", "ORDER", "SEQUENCE". Provider-specific flags may also be defined.
9. SUPPORTED_PREDICTION_CONTENT_TYPES (DBTYPE_WSTR): A comma-delimited list of one or more of the following: "DISCRETE", "CONTINUOUS", "DISCRETIZED", "ORDERED", "SEQUENCE_TIME", "CYCLICAL", "PROBABILITY", "VARIANCE", "STDEV", "SUPPORT", "PROBABILITY_VARIANCE", "PROBABILITY_STDEV". Provider-specific flags may also be defined.
10. SUPPORTED_MODELING_FLAGS (DBTYPE_WSTR): A comma-delimited list of one or more of the following: "MODEL_EXISTENCE_ONLY", "NOT NULL". Provider-specific flags may also be defined.
11. SUPPORTED_SOURCE_QUERY (DBTYPE_WSTR): The <source data query> types that the provider supports. This is a comma-delimited list of one or more of the following syntax descriptions that can be used as the source of data for INSERT INTO or that can be PREDICTION JOINed to a DMM for SELECT: "SINGLETON_CONSTANT", "SINGLETON_SELECT", "OPENROWSET", "SELECT", "SHAPE".
12. TRAINING_COMPLEXITY (DBTYPE_I4): Indication of expected time for training: DM_TRAINING_COMPLEXITY_LOW—running time is proportional to input and is relatively short; DM_TRAINING_COMPLEXITY_MEDIUM—running time may be long but is generally proportional to input; DM_TRAINING_COMPLEXITY_HIGH—running time is long and may grow exponentially in relationship to input.
13. PREDICTION_COMPLEXITY (DBTYPE_I4): Indication of expected time for prediction: DM_PREDICTION_COMPLEXITY_LOW—running time is proportional to input and is relatively short; DM_PREDICTION_COMPLEXITY_MEDIUM—running time may be long but is generally proportional to input; DM_PREDICTION_COMPLEXITY_HIGH—running time is long and may grow exponentially in relationship to input.
14. EXPECTED_QUALITY (DBTYPE_I4): Indication of expected quality of the model produced with this algorithm: DM_EXPECTED_QUALITY_LOW, DM_EXPECTED_QUALITY_MEDIUM, DM_EXPECTED_QUALITY_HIGH.
15. SCALING (DBTYPE_I4): Indication of the scalability of the algorithm: DM_SCALING_LOW, DM_SCALING_MEDIUM, DM_SCALING_HIGH.
16. ALLOW_INCREMENTAL_INSERT (DBTYPE_BOOL): VARIANT_TRUE if additional INSERT INTO statements are allowed after the initial training.
17. ALLOW_PMML_INITIALIZATION (DBTYPE_BOOL): VARIANT_TRUE if the creation of a DMM (including both structure and content) based on an XML string is allowed.
18. CONTROL (DBTYPE_I4): One of the following: DM_CONTROL_NONE, DM_CONTROL_CANCEL, DM_CONTROL_SUSPENDRESUME, DM_CONTROL_SUSPENDWITHRESULT.
19. ALLOW_DUPLICATE_KEY (DBTYPE_BOOL): TRUE if cases may have duplicate keys.
3.6 SERVICE_PARAMETERS Schema Rowset
Number of restriction columns: 2
Restriction columns: SERVICE_NAME, PARAMETER_NAME
Default sort order: SERVICE_NAME, PARAMETER_NAME
Description: The SERVICE_PARAMETERS schema rowset provides a list of parameters
that can be supplied when generating a mining model via the CREATE MINING MODEL
statement. The client will generally restrict by SERVICE_NAME to obtain the parameters
supported by the provider and applicable to the type of mining model being generated.
Columns (column name, type indicator, description):
1. SERVICE_NAME (DBTYPE_WSTR): The name of the algorithm. Provider-specific.
2. PARAMETER_NAME (DBTYPE_WSTR): The name of the parameter.
3. PARAMETER_TYPE (DBTYPE_WSTR): Data type of parameter (DBTYPE).
4. IS_REQUIRED (DBTYPE_BOOL): If true, the parameter is required.
5. PARAMETER_FLAGS (DBTYPE_UI4): A bitmask that describes parameter characteristics. The following values (or a combination thereof) may be used: DM_PARAMETER_TRAINING (0x0000001)—for training; DM_PARAMETER_PREDICTION (0x00000002)—for prediction.
6. DESCRIPTION (DBTYPE_WSTR): Text describing the purpose and format of the parameter.
3.7 MODEL_CONTENT_PMML Schema Rowset
Number of restriction columns: 4
Restriction columns: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME, MODEL_TYPE
Default sort order: MODEL_CATALOG, MODEL_SCHEMA, MODEL_NAME
Description: The MODEL_CONTENT_PMML schema rowset stores the XML representation of
the content of each model. The format of the XML string follows the PMML standard.
Columns (column name, type indicator, description):
1. MODEL_CATALOG (DBTYPE_WSTR): Catalog name. NULL if the provider does not support catalogs.
2. MODEL_SCHEMA (DBTYPE_WSTR): Unqualified schema name. NULL if the provider does not support schemas.
3. MODEL_NAME (DBTYPE_WSTR): Model name. This column cannot contain NULL.
4. MODEL_TYPE (DBTYPE_WSTR): Model type, a provider-specific string—can be NULL.
5. MODEL_GUID (DBTYPE_GUID): GUID that uniquely identifies the model. Providers that do not use GUIDs to identify tables should return NULL in this column.
6. MODEL_PMML (DBTYPE_WSTR): An XML representation of the model's content in PMML format.
7. SIZE (DBTYPE_UI4): Size of the XML string, in bytes.
8. LOCATION (DBTYPE_WSTR): The location of the XML file. NULL if the file is stored in the default directory.
4 Appendix B: OLE DB for DM Grammar
4.1 Statements
4.1.1 CREATE MINING MODEL
CREATE MINING MODEL <model>
(
<column definition list>
)
USING <algorithm> [(<parameter list>)]
CREATE MINING MODEL <model> FROM PMML <xml string>
Parameters
<model>
A unique name for the model.
<column definition list>
A comma-separated list of column definitions.
<algorithm>
The provider-defined name of a data mining algorithm.
<parameter list>
(Optional) A comma-separated list of provider-defined
parameters for the algorithm.
<xml string>
An XML-encoded model (for advanced use only).
Remarks
The CREATE MINING MODEL statement creates a new mining model based on the column
definition list. A column definition takes one of the following forms:
<column name> <type> [<content flags>] [<column relation>] [<prediction flag>]
<column name> TABLE [<prediction flag>] ( <non-table column definition list> )
<column name>
Any valid column identifier.
<type>
Any valid SQL type, including LONG, DOUBLE, DATE, TEXT,
and TABLE.
<content flags>
Content flags are "hints" to the data mining algorithm that provide
additional information. Flags appear in the order of the grouping
shown here, and flags within the same group cannot appear on the
same column.
Distribution Flags
NORMAL
The values of the column appear in a normal distribution.
LOG NORMAL
The values of the column appear in a log normal distribution
UNIFORM
The values of the column appear in a uniform distribution.
Type Flags
KEY
The column is discrete and is a key. Key columns will not have any
other flags except in the case of a nested table with no attribute
columns.
CONTINUOUS
The column contains values in a continuous range, such as Age or
Salary.
DISCRETE
The column contains a discrete set of values, such as Gender.
DISCRETIZED
The column contains a continuous set of values that should be
converted to buckets.
ORDERED
The column contains a discrete set of values that are ordered, such
as Salary Level.
CYCLICAL
The column contains an ordered discrete set of values that are
cyclical, such as Day of Week, or Month.
SEQUENCE TIME The column contains time measurement units.
SEQUENCE
The column contains the sorting key of the related columns.
Modeling Flags
MODEL_EXISTENCE_ONLY
The column should be modeled as having two states, missing and
nonmissing, regardless of the values in the column. This is
particularly useful for columns in a nested table, where values are
sparse across cases.
NOT NULL
The column cannot accept NULL values.
Special Property Flags
These flags indicate a property of another column and will not appear with any other content
flags or prediction flags.
PROBABILITY
The value in this column is the probability (0–1) of the associated value.
VARIANCE
The value in this column is the variance of the associated value.
STDEV
The value in this column is the standard deviation of the associated value.
PROBABILITY_VARIANCE
The value in this column is the variance of the probability associated with the associated value.
PROBABILITY_STDEV
The value in this column is the standard deviation of the probability associated with the
associated value.
SUPPORT
The value in this column is the weight (case replication factor) of the associated value.
<column relation>
The column relation appears in two forms: OF <column name> and
RELATED TO <column name>.
OF
This form is restricted to use for columns with Special Property
content flags—for example, ProbGender Double PROBABILITY
OF Gender.
RELATED TO
This form indicates a value hierarchy. The target of a related to
column can be a key column in a nested table, a discretely valued
column on the case row, or another column with a RELATED TO
clause (indicating a deeper hierarchy). A special target "KEY" is
reserved for nested tables with multiple keys and indicates a relation
between the value in this column and the composite of all the key
columns.
<prediction flags>
These flags indicate that the column can be predicted by the model and
can have one of two values.
PREDICT
This column can be predicted by the model and it can be supplied in
input cases to predict the value of other predictable columns.
PREDICT_ONLY
This column can be predicted by the model, but its values cannot be
used in input cases to predict the value of other predictable columns.
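As an illustration only—the model and column names are hypothetical, and the algorithm name
is taken from the sample BNF in section 4.2—a statement of this form declares a model with a
nested predictable table:

CREATE MINING MODEL [Purchase Model]
(
    [Customer ID]        LONG    KEY,
    [Gender]             TEXT    DISCRETE PREDICT,
    [Product Purchases]  TABLE   PREDICT
    (
        [Product Name]   TEXT    KEY,
        [Quantity]       DOUBLE  NORMAL CONTINUOUS,
        [Product Type]   TEXT    DISCRETE RELATED TO [Product Name]
    )
)
USING MICROSOFT_DECISION_TREES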
4.1.2 INSERT INTO
INSERT INTO <model> (<mapped model columns>) <source data query>
INSERT INTO <model> (<mapped model columns>) VALUES <constant list>
INSERT INTO <model>.COLUMN_VALUES(<mapped model columns>) <source data query>
Parameters
<model>
A model identifier.
<mapped model columns>
A comma-separated list of column identifiers and nested
identifiers.
<source data query>
The source query in the provider-defined format.
Remarks
The INSERT INTO statement inserts training data into the model. The columns from the
query are mapped to model columns through the <mapped model columns> section. The
keyword SKIP is used to instruct the model to ignore columns that appear in the source data
query that are not used in the model.
The INSERT INTO <model>.COLUMN_VALUES form inserts data directly into the model's
columns without training the model's algorithm. This allows you to provide column data to
the model in a concise, ordered manner that is useful when dealing with data sets containing
hierarchies or ordered columns. The "." operator is used to specify columns that are part of a
nested table. When using this form, columns that are part of a relation (either through
RELATED TO or by being a KEY in a nested table) cannot be inserted individually and must
be inserted together with all the columns in the relation.
The <mapped model columns> section has the following form:
<column identifier> | <table identifier>(<column identifier> | SKIP), …
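For example, the following sketch maps a SHAPE query onto a model like the one sketched
above; all names and the inner queries are placeholders:

INSERT INTO [Purchase Model]
    ([Customer ID], [Gender],
     [Product Purchases](SKIP, [Product Name], [Quantity], [Product Type]))
SHAPE { <case-level query> }
APPEND ( { <purchase-level query> } RELATE [Customer ID] TO [Customer ID] )
    AS [Product Purchases]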
4.1.3 SELECT
4.1.3.1 SELECT INTO
SELECT * INTO <new model>
USING <algorithm> [(<parameter list>)]
FROM <existing model>
Parameters
<new model>
A unique name for the new model being created.
<algorithm>
The provider-defined name of a data mining algorithm.
<parameter list>
(Optional) A comma-separated list of provider-defined parameters
for the algorithm.
<existing model>
The name of the existing model to be copied.
Remarks
The SELECT INTO statement creates a new mining model by copying schema and other
information from an existing mining model. If the existing model is trained, the new model
will automatically be trained with the same query; otherwise, the new model will be empty.
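For example (both model names are illustrative; MICROSOFT_CLUSTERING appears in the
sample BNF later in this appendix):

SELECT * INTO [Purchase Model Clusters]
USING MICROSOFT_CLUSTERING
FROM [Purchase Model]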
4.1.3.2 SELECT FROM CONTENT
SELECT * FROM <model>.CONTENT
Parameters
<model>
A name of the model.
Remarks
The SELECT FROM CONTENT statement returns the mining model content schema rowset for
the specified model. See Appendix A for a description of the MINING_MODEL_CONTENT
schema rowset.
4.1.3.3 SELECT FROM <MODEL>
SELECT [DISTINCT] <expr list> FROM <model> [ WHERE < condition list > ]
Parameters
<model>
A model identifier.
<expr list>
A comma-separated list of related column identifiers or expressions.
<condition list>
(Optional) Conditions to restrict the values returned from the
column list.
Remarks
The SELECT FROM <model> statement allows you to directly browse the values on which
the columns have been trained.
4.1.3.4 SELECT FROM PREDICTION JOIN
SELECT <select expression list> FROM <model> [NATURAL] PREDICTION JOIN
<source data query> [ON <join mapping list>]
[ WHERE <condition expression> ]
Parameters
<select expression list>
A comma-separated list of column identifiers and other
expressions to describe the columns in the results of the
query.
<model>
A model identifier.
<source data query>
The source query in the provider-defined format.
<join mapping list>
A logical expression comparing columns from the model to
columns from the source query.
<condition expression>
(Optional) A condition to restrict the values returned from the
column list.
Remarks
The SELECT FROM PREDICTION JOIN syntax allows you to predict columns based on the
input data supplied by the <source data query>. You can specify the OLE DB for DM
feature-rich prediction functions, including prediction histograms, prediction probability,
sub-SELECT, and so forth, in <select expression list> and <condition expression>. Only the
rows that satisfy the condition in the WHERE clause will be included in the result.
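For example, the following sketch predicts the nested purchases table for each input case; the
model name and the source placeholder are illustrative:

SELECT [Customer ID], Predict([Product Purchases], INCLUDE_STATISTICS)
FROM [Purchase Model] PREDICTION JOIN <source data query> AS t
ON [Purchase Model].[Gender] = t.[Gender]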
4.1.4 DELETE
DELETE * FROM <model>[.CONTENT]
Parameters
<model>
A model identifier.
Remarks
Deletes all training data from the model. If CONTENT is specified, only the algorithm
training is discarded and the column values are retained.
4.1.5 DROP
DROP MINING MODEL <model>
Parameters
<model>
A model identifier.
Remarks
Removes the model and all associated information from the database.
4.2 A Sample BNF
This example BNF is from Microsoft's implementation of an OLE-DB for DM provider and
does not represent the entire breadth of grammar described by this document.
<statement>
-> <create>
|<insert>
|<select>
|<delete>
|<rename>
4.2.1 CREATE
<create>
-> <dm_create>
|<select_into>
|<pmml_create>
<dm_create>
-> CREATE MINING MODEL <identifier> ( <col_def_list> ) USING <algorithm>
[(<algo_param_list>)]
<pmml_create>
-> CREATE MINING MODEL <identifier> FROM PMML <string>
<select_into>
-> SELECT * INTO <identifier> USING <algorithm> FROM <identifier>
<col_def_list>    -> <col_def>
                   | <col_def_list> , <col_def>
<col_def>         -> <col_def_reg> | <col_def_tbl>
<col_def_reg>     -> <identifier> <col_type> [<col_distribution>] [<col_binary>]
                     [<col_content>] [<col_content_qual>] [<col_qualif>] [<col_prediction>] [<relation_clause>]
<col_def_tbl>     -> <identifier> TABLE <col_prediction> ( <col_def_list> )
<algorithm>       -> MICROSOFT_DECISION_TREES | MICROSOFT_CLUSTERING
<algo_param>      -> <identifier> = <value>
<algo_param_list> -> <algo_param>
                   | <algo_param>, <algo_param_list>
<col_type>
-> LONG
| BOOLEAN
| TEXT
| DOUBLE
| DATE
<col_distribution>-> NORMAL
| UNIFORM
<col_binary>
-> MODEL_EXISTENCE_ONLY
| NOT NULL
<col_content>
-> DISCRETE
| CONTINUOUS
| DISCRETIZED( [<disc_method> [, <numeric_const>]] )
| SEQUENCE_TIME
<disc_method>
-> AUTOMATIC
| EQUAL_AREAS
| THRESHOLDS
| CLUSTERS
<col_content_qual>-> ORDERED
| CYCLICAL
<col_qualif>
-> KEY
| PROBABILITY
| VARIANCE
| STDEV
| STDDEV
| PROBABILITY_VARIANCE
| PROBABILITY_STDEV
| PROBABILITY_STDDEV
| SUPPORT
<col_prediction> -> PREDICT
| PREDICT_ONLY
<relation_clause> -> <related_to_clause>
| <of_clause>
<related_to_clause>-> RELATED TO <identifier>
| RELATED TO KEY
<of_clause>
-> OF <identifier>
| OF KEY
4.2.2 INSERT
<insert>          -> <insert_att>
                   | <insert_reg>
<insert_att>      -> INSERT [INTO] <identifier>.COLUMN_VALUES ( <column_ref_list> ) <query>
<insert_reg>      -> INSERT [INTO] <identifier> ( <column_ref_list> ) <query>
<query>           -> <external_query>
                   | <shape>
<external_query>  -> OPENROWSET ( <string>, {<string>|<string>;<string>;<string>}, <string> )
<shape>           -> SHAPE { <query> } APPEND <append_list>
<append_list>     -> <append>
                   | <append_list> , <append_list>
<append>          -> ( { <query> } RELATE <relate_list> ) AS <identifier>
<relate_list>     -> <relate>
                   | <relate_list> , <relate>
<relate>          -> <column_ref> TO <column_ref>
4.2.3 SELECT
<column_ref_list>  -> <column_ref>
                    | <column_ref_list> , <column_ref>
<column_ref>       -> <identifier>
                    | <identifier>.<column_ref>
                    | <column_ref> ( <column_ref_list> )
                    | SKIP
                    | CLUSTER()
                    | $SUPPORT
                    | $VARIANCE
                    | $STDEV
                    | $STDDEV
                    | $PROBABILITY
                    | $PROBABILITY_VARIANCE
                    | $PROBABILITY_STDEV
                    | $PROBABILITY_STDDEV
                    | $DISTANCE
                    | PREDICT ( <column_ref> [, <pred_option_list>] )
                    | <column_ref> AS <identifier>
<pred_option_list> -> <pred_option>
                    | <pred_option_list> , <pred_option>
<pred_option>      -> EXCLUDE_NULL
                    | INCLUDE_NULL
                    | INPUT_ONLY
                    | EXCLUSIVE
                    | INCLUSIVE
                    | INCLUDE_STATISTICS
<select>           -> <pred_select>
                    | <model_select>
<pred_select>      -> SELECT [FLATTENED] <expression_list> FROM <identifier> [NATURAL]
                      PREDICTION JOIN <query> AS <identifier> [ON <on_list>] [<where_clause>]
                    | SELECT [FLATTENED] <expression_list> FROM <identifier> [NATURAL]
                      PREDICTION JOIN <expression> AS <identifier> [ON <on_list>] [<where_clause>]
<model_select>     -> SELECT [DISTINCT] <expression_list> FROM <identifier> [<where_clause>]
                    | SELECT [DISTINCT] <expression_list> FROM <identifier>.PMML
                    | SELECT [DISTINCT] <expression_list> FROM <identifier>.CONTENT [<where_clause>]
<expression_list> -> <expression>
| <expression_list> , <expression>
<expression>
-> <value>
| <column_ref>
| *
| <expression> + <expression>
| <expression> - <expression>
| <expression> * <expression>
| <expression> / <expression>
| -<expression>
| +<expression>
| ( <expression> )
| <expression> OR <expression>
| <expression> AND <expression>
| NOT <expression>
| <expression> = <expression>
| <expression> <> <expression>
| <expression> < <expression>
| <expression> <= <expression>
| <expression> > <expression>
| <expression> >= <expression>
| PREDICTSTDEV ( <column_ref> )
| PREDICTSTDDEV ( <column_ref> )
| PREDICTVARIANCE ( <column_ref> )
| PREDICTSUPPORT ( <column_ref> )
| PREDICTPROBABILITY ( <column_ref> )
| PREDICTPROBABILITYSTDEV ( <column_ref> )
| PREDICTPROBABILITYSTDDEV ( <column_ref> )
| PREDICTPROBABILITYVARIANCE ( <column_ref> )
| CLUSTERDISTANCE ( [<expression>] )
| CLUSTERPROBABILITY ( [<expression>] )
| PREDICTHISTOGRAM ( <column_ref> )
| TOPCOUNT ( <expression>, <column_ref>, <expression> )
| TOPSUM ( <expression>, <column_ref>, <expression> )
| TOPPERCENT ( <expression>, <column_ref>, <expression> )
| BOTTOMCOUNT ( <expression>, <column_ref>, <expression> )
| BOTTOMSUM ( <expression>, <column_ref>, <expression> )
| BOTTOMPERCENT ( <expression>, <column_ref>, <expression> )
| ( SELECT <expression_list> FROM <expression> <where_clause> )
| ( <singleton_list> )
| <expression> AS <identifier>
<singleton_list> -> <singleton>
| <singleton_list> UNION <singleton>
<singleton>
-> SELECT <expression_list>
<where_clause>
-> WHERE <expression>
<delete>
-> <delete_reg>
| <delete_content>
4.2.4 DELETE/DROP
<delete_reg>      -> DELETE * FROM <identifier>
<delete_content>  -> DELETE * FROM <identifier>.CONTENT
<drop>            -> DROP MINING MODEL <identifier>
4.2.5 RENAME
<rename>
-> RENAME MINING MODEL <identifier> TO <identifier>
4.2.6 MISCELLANEOUS
<value>       -> <numeric_const>
               | <string>
<identifier>  -> [([^\]]|(\]\]))*]
               | [a-zA-Z_][a-zA-Z_0-9]*
5 Appendix C: Functions
5.1 Predict
Syntax:
Predict(<scalar column reference>, option1, option2, …)
Predict(<table column reference>, option1, option2, …)
Applies To:
Either a scalar column or table column reference.
Return Type:
<scalar column reference> or <table column reference>, depending on which type of column
this function is applied to.
Description:
This is a general form of prediction function that modifies the behavior of a prediction (for
example, missing value control, association control, and so on). Possible options include
EXCLUDE_NULL (default), INCLUDE_NULL, INCLUSIVE, EXCLUSIVE (default),
INPUT_ONLY, and INCLUDE_STATISTICS.
Note INCLUSIVE, EXCLUSIVE, INPUT_ONLY, and INCLUDE_STATISTICS are
applicable only to a table column reference; EXCLUDE_NULL and INCLUDE_NULL
apply only to scalar-valued columns.
In most cases, the following shorthand will be used:
- [Gender] is shorthand for Predict([Gender], EXCLUDE_NULL).
- [Products Purchases] is shorthand for Predict([Products Purchases], EXCLUDE_NULL, EXCLUSIVE_ASSOCIATION).
Note The return type of this function is itself regarded as a column reference. This
means that this function can be used as an argument in other functions that take a
column reference as an argument (except the Predict function itself).
Passing INCLUDE_STATISTICS to a prediction on a TABLE-valued column will add the
metacolumns $Probability and $Support to the resulting table. These columns describe the
likelihood of existence for the associated nested table record.
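For example, the following sketch (column names follow the example model used throughout
this document) applies the function to a scalar column and to a TABLE column with different
options:

SELECT [Customer ID], Predict([Gender], INCLUDE_NULL),
Predict([Product Purchases], INCLUDE_STATISTICS, EXCLUSIVE) …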
5.2 PredictSupport
Syntax:
PredictSupport(<scalar column reference>)
Applies to:
Scalar column
Return Type:
Scalar value
Description:
This function returns the support value for the histogram entry that has the highest probability
(the top row in the histogram obtained by PredictHistogram(<column reference>)).
5.3 PredictVariance
Syntax:
PredictVariance(<scalar column reference>)
Applies to:
Scalar column
Return Type:
Scalar value
Description:
This function returns the variance value for the histogram entry that has the highest
probability (the top row in the histogram obtained by PredictHistogram(<column
reference>)).
5.4 PredictStdev
Syntax:
PredictStdev(<scalar column reference>)
Applies to:
Scalar column
Return Type:
Scalar value
Description:
This function returns the standard deviation for the histogram entry that has the highest
probability (the top row in the histogram obtained by PredictHistogram(<column
reference>)).
5.5 PredictProbability
Syntax:
PredictProbability(<scalar column reference>)
Applies to:
Scalar column
Return Type:
Scalar value
Description:
This function returns the probability for the histogram entry that has the highest probability
(the top row in the histogram obtained by PredictHistogram(<column reference>)).
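For example, the following sketch pairs the predicted value with the confidence of that
prediction (column names are from the example model used earlier):

SELECT [Customer ID], [Gender], PredictProbability([Gender]) AS [Gender Probability] …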
5.6 PredictProbabilityVariance
Syntax:
PredictProbabilityVariance(<scalar column reference>)
Applies to:
Scalar column
Return Type:
Scalar value
Description:
This function returns the variance of the probability for the histogram entry that has the
highest probability (the top row in the histogram obtained by PredictHistogram(<column
reference>)).
5.7 PredictProbabilityStdev
Syntax:
PredictProbabilityStdev(<scalar column reference>)
Applies to:
Scalar column
Return Type:
Scalar value
Description:
This function returns the standard deviation of the probability for the histogram entry that has
the highest probability (the top row in the histogram obtained by PredictHistogram(<column
reference>)).
5.8 Cluster
Syntax:
Cluster
Applies to:
This function does not require any parameter, but it can be used only when the underlying
DMM supports clustering.
Return Type:
This function returns a scalar value of cluster identifier. However, if this function is used as
an argument of other functions, it must be regarded as a <cluster column reference>.
Description:
This function returns the identifier of the cluster to which the input case most probably
belongs. It also can be used as a <cluster column reference> for the PredictHistogram
function.
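As an illustrative sketch, assuming a clustering model named [Customer Segments] (a hypothetical name) and an input mapping on demographic columns:

SELECT t.[Customer ID],
       Cluster AS [Segment ID]
FROM [Customer Segments]
PREDICTION JOIN <source data query> AS t
ON [Customer Segments].[Gender] = t.[Gender] AND
   [Customer Segments].[Age] = t.[Age]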
5.9 ClusterDistance
Syntax:
ClusterDistance([<ClusterID expression>])
Applies to:
This function can be used only when the underlying DMM supports clustering.
Return Type:
Scalar value.
Description:
This function returns the distance between the input case and the center of the cluster that has
the highest probability. If <ClusterID expression> is given, the cluster is identified by the
evaluation of the expression.
5.10 ClusterProbability
Syntax:
ClusterProbability([<ClusterID expression>])
Applies to:
This function can be used only when the underlying DMM supports clustering.
Return Type:
Scalar value.
Description:
This function returns the probability that the input case belongs to the cluster that has the
highest probability. If <ClusterID expression> is given, the cluster is identified by the
evaluation of the expression.
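The following sketch combines the cluster functions. The no-argument forms refer to the most probable cluster; an explicit <ClusterID expression> could be supplied instead. The model name and join mapping are assumptions:

SELECT Cluster AS [Segment ID],
       ClusterProbability() AS [Segment Probability],
       ClusterDistance() AS [Distance To Center]
FROM [Customer Segments]
PREDICTION JOIN <source data query> AS t
ON [Customer Segments].[Gender] = t.[Gender] AND
   [Customer Segments].[Age] = t.[Age]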
5.11 PredictHistogram
Syntax:
PredictHistogram(<scalar column reference>)
PredictHistogram(<cluster column reference>)
Applies to:
A scalar or cluster column reference.
Return Type:
<table expression>
Description:
This function returns a table representing a histogram for prediction of the given column.
A histogram adds statistics columns to the prediction. For a <scalar column reference>, the
histogram consists of the following seven columns:
- The column being predicted
- $Support
- $Variance
- $Stdev (standard deviation)
- $Probability
- $ProbabilityVariance
- $ProbabilityStdev

A histogram for a <cluster column reference> consists of the following columns:
- Cluster, representing the cluster identifier
- $Distance
- $Probability
- $Support
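For example, the histogram can be returned as a nested table column of the prediction result (the model name, column, and join mapping are assumptions):

SELECT t.[Customer ID],
       PredictHistogram([Age]) AS [Age Histogram]
FROM [Age Prediction]
PREDICTION JOIN <source data query> AS t
ON [Age Prediction].[Gender] = t.[Gender]

Because the result is a <table expression>, it can be further restricted with a sub-SELECT or with the TopCount, TopSum, and TopPercent functions described in the following sections.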
5.12 TopCount
Syntax:
TopCount(<table expression>, <rank expression>, <n-items>)
Applies to:
A table-returning expression that includes <table column reference> and functions that return
a table.
Return Type:
<table expr>
Description:
This function returns the first <n-items> rows in a decreasing order of <rank expression>.
As an example, a table expression (for example, a sub-SELECT) may contain the following
table:

(SELECT [Product Name], $Probability AS [Probability] FROM Predict([Products Purchases],
INCLUDE_STATISTICS))

Product Name        Probability
Apples              0.4
Kiwi                0.1
Oranges             0.5
Lemons              0.2

If so, the function TopCount((SELECT …), [Probability], 2) returns the following table:

Product Name        Probability
Oranges             0.5
Apples              0.4
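The following sketch shows how TopCount might be embedded in a complete prediction query to return the two most likely products per case; the model name, column names, and join mapping are assumptions:

SELECT t.[Customer ID],
       TopCount((SELECT [Product Name], $Probability AS [Probability]
                 FROM Predict([Products Purchases], INCLUDE_STATISTICS)),
                [Probability], 2) AS [Top Recommendations]
FROM [Age Prediction]
PREDICTION JOIN <source data query> AS t
ON [Age Prediction].[Gender] = t.[Gender]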
5.13 TopSum
Syntax:
TopSum(<table expression>, <rank expression>, <sum>)
Applies to:
A table-returning expression that includes <table column reference> and functions that
return a table.
Return Type:
<table expr>
Description:
This function returns the first N rows in a decreasing order of <rank expression>,
such that the sum of the <rank expression> values is at least <sum>. TopSum returns the
smallest number of elements possible while still meeting that criterion. For example, a table
column named [Products] might contain the following table:

Product Name        Unit Sales
Apples              1200
Kiwi                500
Oranges             1500
Lemons              750

If so, TopSum([Products], [Unit Sales], 2500) would return the following table:

Product Name        Unit Sales
Oranges             1500
Apples              1200
5.14 TopPercent
Syntax:
TopPercent(<table expression>, <rank expression>, <percent>)
Applies to:
A table-returning expression that includes <table column reference> and functions that return
a table.
Return Type:
<table expr>
Description:
This function returns the first N rows in a decreasing order of <rank expression>, such that
the sum of the <rank expression> values is at least the given percentage of the total
sum of <rank expression> values. TopPercent returns the smallest number of
elements possible while still meeting that criterion.
Using a table column named [Products], as shown here:

Product Name        Unit Sales
Apples              30
Kiwi                10
Oranges             40
Lemons              20

the function TopPercent([Products], [Unit Sales], 60) would return the following table:

Product Name        Unit Sales
Oranges             40
Apples              30

Note that Apples were selected instead of Lemons.
5.15 Sub-SELECT
Syntax:
(SELECT <SELECT-expressions> FROM <table expression> [WHERE <WHERE-clause>])
Applies to:
A table-returning expression that includes <table column reference> and functions that
return a table.
Return Type:
<table expr>
Description:
A sub-SELECT selects columns (generally speaking, expressions containing columns) from
the given table-returning expression. Users also can specify a WHERE clause to filter out
undesired rows.
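For example, the following sketch keeps only the more likely rows of a predicted nested table. This is a sketch only; the model and column names are assumed, and the use of $Probability in the WHERE clause presumes the provider exposes that metacolumn to the filter:

SELECT t.[Customer ID],
       (SELECT [Product Name], $Probability AS [Probability]
        FROM Predict([Products Purchases], INCLUDE_STATISTICS)
        WHERE $Probability > 0.3) AS [Likely Purchases]
FROM [Age Prediction]
PREDICTION JOIN <source data query> AS t
ON [Age Prediction].[Gender] = t.[Gender]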
5.16 RangeMid
Syntax:
RangeMid(<scalar column reference>)
Applies to:
Discretized scalar columns
Return Type:
Scalar value
Description:
This function returns the midpoint of the predicted bucket that was discovered for a
discretized column.
5.17 RangeMin
Syntax:
RangeMin(<scalar column reference>)
Applies To:
Discretized scalar columns
Return Type:
Scalar value
Description:
This function returns the lower end of the predicted bucket that was discovered for a
discretized column.
5.18 RangeMax
Syntax:
RangeMax(<scalar column reference>)
Applies To:
Discretized scalar columns
Return Type:
Scalar value
Description:
This function returns the upper end of the predicted bucket that was discovered for a
discretized column.
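The three range functions are often used together. The following sketch assumes a hypothetical model [Income Model] with a discretized predictable column [Income]; none of these names come from the specification:

SELECT Predict([Income]) AS [Income Bucket],
       RangeMin([Income]) AS [Bucket Lower Bound],
       RangeMid([Income]) AS [Bucket Midpoint],
       RangeMax([Income]) AS [Bucket Upper Bound]
FROM [Income Model]
PREDICTION JOIN <source data query> AS t
ON [Income Model].[Age] = t.[Age]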
5.19 PredictScore
Syntax:
PredictScore(<scalar column reference>)
PredictScore(<table column reference>)
Applies To:
Predictable columns
Return Type:
Scalar value
Description:
This function returns the prediction score of the specified column.
5.20 PredictNodeId
Syntax:
PredictNodeId(<scalar column reference>)
Applies To:
Predictable columns (except table columns or predictable columns in nested table).
Return Type:
Scalar value
Description:
This function returns the node id of the tree leaf node in which the case is classified.
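For example, against a decision tree model such as CreditTree1 (defined in Appendix D), the leaf node and score for each case might be retrieved as follows; the join mapping and <source data query> are assumptions:

SELECT Predict([Credit]) AS [Predicted Credit],
       PredictScore([Credit]) AS [Score],
       PredictNodeId([Credit]) AS [Leaf Node]
FROM CreditTree1
PREDICTION JOIN <source data query> AS t
ON CreditTree1.[Education] = t.[Education] AND
   CreditTree1.[Age] = t.[Age] AND
   CreditTree1.[Pay] = t.[Pay]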
6 Appendix D: XML Format for Data
Mining Models
DMMs are represented in XML using a variation of the Predictive Model Markup Language
(PMML) version 1.0. The following are a few of the additions to PMML 1.0:
- Support for the nested table nature of a DMM through nested data dictionaries.
- The concepts of discretized, ordered, and cyclical model variables, beyond the simple categorical and continuous types.
- Support for key columns in nested dictionaries that list instances as categories.
- Support for Relation type columns as "hierarchy parents."
- All model variables can have a missing state described, even those with a continuous domain.
- The data dictionary is no longer a complete list of all attributes; rather, it is an "attribute factory." Any attribute reference outside the data dictionary must "instantiate" a model variable by locating it in the data dictionary hierarchy.
- Because of the previous point, it is no longer sufficient to reference a model variable (called an attribute) as an attribute (in XML terms) of a tag. Instead, such references must be properties (nested tags) that describe the variable instance.
- Statistics on the global distribution of the model variables have been separated out into a new section.
It is expected that most of these changes will simply become part of PMML version 1.1.
6.1 DTD for the DMM Extended PMML
<?xml encoding="UTF-8"?>
<!ENTITY % predicates
"(predicate | compound-predicate | true | false)"
>
<!ENTITY % NUMBER "NMTOKEN">
<!-- =================================================================
Overall structure
=================================================================
-->
<!-- Extended Feature: Allow a pmml document to contain segment models and global
statistics -->
<!ELEMENT pmml (head?, statements?, data-dictionary, global-statistics?, (tree-model |
segment-model | regression-model)+)>
<!ATTLIST pmml
version CDATA #REQUIRED
name CDATA #IMPLIED
GUID CDATA #IMPLIED
Modified-time CDATA #IMPLIED
Creation-time CDATA #IMPLIED
>
<!-- =================================================================
Header Information
=================================================================
-->
<!-- Extended Feature: Allow a head to contain a datasrc -->
<!ELEMENT head (application?, annotation*, timestamp?, datasrc?)>
<!ATTLIST head
copyright CDATA #REQUIRED
description CDATA #IMPLIED
>
<!-- a timestamp in the format YYYY-MM-DD hh:mm:ss GMT +/- xx:xx -->
<!ELEMENT timestamp (#PCDATA)>
<!-- describes the software application that generated the PMML-->
<!ELEMENT application EMPTY>
<!ATTLIST application
name CDATA #REQUIRED
version CDATA #IMPLIED
>
<!ELEMENT annotation (#PCDATA)>
<!-- Extended Feature: Define the datasrc ELEMENT. -->
<!ELEMENT datasrc EMPTY>
<!ATTLIST datasrc
src CDATA #REQUIRED
query CDATA #REQUIRED
>
<!-- =================================================================
Statements
=================================================================
-->
<!-- Extended Feature:
1. Allows models to save the creation or other statements.
-->
<!ELEMENT statements (statement+)>
<!ELEMENT statement EMPTY>
<!ATTLIST statement
type CDATA #REQUIRED
value CDATA #REQUIRED
>
<!-- =================================================================
Data Dictionary
=================================================================
-->
<!-- Extended Feature:
1. Allow a data-dictionary to contain another data-dictionary to support nested tables.
2. Allow a data-dictionary to contain the set of keys.
3. Allow a data-dictionary to contain a column of hierarchy parents to support hierarchical
attributes.
4. Allow a data-dictionary to contain a compound-category that enumerates all of the
possible keys combinations in the event that there are multiple keys for the table.
5. Added a new model variable type 'categorical-continuous' to represent pre-discretized
continuous data.
-->
<!ELEMENT data-dictionary (compound-categories? , (categorical | ordinal | continuous |
categorical-continuous | data-dictionary | key | hierarchy-parent)+)>
<!-- Extended Feature: Allow a data-dictionary to have a name. -->
<!ATTLIST data-dictionary
name CDATA #IMPLIED
>
<!-- Extended Feature: Define the key ELEMENT -->
<!ELEMENT key (category+)>
<!ATTLIST key
name CDATA #REQUIRED
ispredict ( true | false ) "false"
isinput ( true | false ) "false"
datatype CDATA #IMPLIED
>
<!-- Extended Feature: Define the hierarchy-parent ELEMENT -->
<!ELEMENT hierarchy-parent ((relates-to | category)+)>
<!ATTLIST hierarchy-parent
name CDATA #REQUIRED
ispredict ( true | false ) "false"
isinput ( true | false ) "false"
datatype CDATA #IMPLIED
>
<!-- Extended Feature: Define the relates-to ELEMENT. Tells you which PROPERTY of the data-dictionary this hierarchical parent relates to. -->
<!ELEMENT relates-to EMPTY>
<!ATTLIST relates-to
name CDATA #REQUIRED
>
<!-- Extended Feature: Allow for the additional ATTRIBUTEs of the categorical ELEMENT. -->
<!ELEMENT categorical (category+)>
<!ATTLIST categorical
name CDATA #REQUIRED
ispredict ( true | false ) "false"
isinput ( true | false ) "false"
datatype CDATA #IMPLIED
>
<!-- Extended Feature:
1. Allow the category ELEMENT to contain the parent ELEMENT, which specifies the
hierarchical parent(s).
2. Relax the value ATTRIBUTE to be an optional ATTRIBUTE. A missing state does not
need a value.
3. Added "uninformative" to the possible states of the missing ATTRIBUTE. A value can be
present, missing at random, or missing informative. missing = "true" is equivalent to
missing informative.
-->
<!ELEMENT category (parent*)>
<!ATTLIST category
value CDATA #IMPLIED
display-value CDATA #IMPLIED
proportion CDATA #IMPLIED
missing (true | false | uninformative) "false"
>
<!-- Extended Feature: Define the parent ELEMENT to specify the hierarchical parent of a
state. -->
<!ELEMENT parent EMPTY>
<!ATTLIST parent
name CDATA #REQUIRED
value CDATA #REQUIRED
>
<!-- Extended Feature: Allow for additional ATTRIBUTES cyclical and timesequence for an
ordinal attribute. -->
<!ELEMENT ordinal (order+)>
<!ATTLIST ordinal
name CDATA #REQUIRED
cyclical ( true | false ) "false"
timesequence ( true | false ) "false"
>
<!-- Extended Feature:
1. Relax the value ATTRIBUTE to be an optional ATTRIBUTE. A missing state does not
need a value.
2. Added "uninformative" to the possible states of the missing ATTRIBUTE. A value can be
present, missing at random, or missing informative. missing = "true" is equivalent to
missing informative.
3. Relax the rank ATTRIBUTE to be an optional ATTRIBUTE. The states are implied to be
ordered if rank is not specified for any of them.
-->
<!ELEMENT order EMPTY>
<!ATTLIST order
value CDATA #IMPLIED
display-value CDATA #IMPLIED
rank CDATA #IMPLIED
proportion CDATA #IMPLIED
missing (true | false | uninformative) "false"
>
<!-- The predicates indicate the values that represent missing values -->
<!-- Extended Feature: Allow for a missing = true or uninformative category state for a
continuous attribute. -->
<!ELEMENT continuous (category?, (%predicates;)*)>
<!ATTLIST continuous
name CDATA #REQUIRED
minimum CDATA #IMPLIED
maximum CDATA #IMPLIED
mean CDATA #IMPLIED
median CDATA #IMPLIED
standard-deviation CDATA #IMPLIED
inter-quartile-range CDATA #IMPLIED
ispredict ( true | false ) "false"
isinput ( true | false ) "false"
datatype CDATA #IMPLIED
>
<!-- Extended Feature: new type for prediscretized data. -->
<!ELEMENT categorical-continuous (category?, (%predicates;)*)>
<!ATTLIST categorical-continuous
name CDATA #REQUIRED
ispredict ( true | false ) "false"
isinput ( true | false ) "false"
datatype CDATA #IMPLIED
>
<!-- Extended Feature:
1. Define the compound-categories ELEMENT, which contains a set of compound-category
ELEMENTs that list all the valid combinations of multiple keys.
2. Define the compound-category ELEMENT, which contains a combination of valid keys and its
hierarchical parents.
3. Define the categoryref ELEMENT, which refers to an existing key.
-->
<!ELEMENT compound-categories (compound-category+)>
<!ELEMENT compound-category ( categoryref | parent )+>
<!ELEMENT categoryref EMPTY>
<!ATTLIST categoryref
name CDATA #REQUIRED
value CDATA #REQUIRED
>
<!-- Extended Feature: Attribute References
1. Define the simple-attribute ELEMENT that refers to a simple attribute, not nested in a
data-dictionary.
2. Define the compound-attribute ELEMENT that refers to an attribute formed by taking
a Cartesian product of a key (or multiple keys) and a value attribute within a nested data-dictionary.
3. Define the derived-attribute ELEMENT, which specifies a list of existing attributes. It is
referred to by index.
4. Define the key-val ELEMENT that "instantiates" the compound attributes.
-->
<!ELEMENT simple-attribute EMPTY>
<!ATTLIST simple-attribute
name CDATA #REQUIRED
>
<!-- The name is the name of the nesting data dictionary. -->
<!ELEMENT compound-attribute (key-val+ , (simple-attribute | compound-attribute)?)>
<!ATTLIST compound-attribute
name CDATA #REQUIRED
>
<!ELEMENT derived-attribute ((simple-attribute | compound-attribute)+)>
<!ATTLIST derived-attribute
index CDATA #REQUIRED
>
<!ELEMENT key-val EMPTY>
<!ATTLIST key-val
name CDATA #REQUIRED
value CDATA #REQUIRED
>
<!-- Extended Feature: Define the %attribute ENTITY -->
<!ENTITY % attribute
"(simple-attribute | compound-attribute | derived-attribute)"
>
<!-- Extended Feature:
1. Define the global-statistics ELEMENT, which contains a list of data-distribution
ELEMENTs.
2. Define the data-distribution ELEMENT, which contains the sufficient statistics for a given
attribute.
3. Define the state ELEMENT that specifies the statistics of a given state.
-->
<!-- =================================================================
Global Statistics
=================================================================
-->
<!ELEMENT global-statistics (data-distribution+)>
<!ELEMENT data-distribution (%attribute;, state+)>
<!ELEMENT state EMPTY>
<!ATTLIST state
value CDATA #IMPLIED
missing ( true | false | uninformative) "false"
minimum CDATA #IMPLIED
maximum CDATA #IMPLIED
mean CDATA #IMPLIED
median CDATA #IMPLIED
standard-deviation CDATA #IMPLIED
inter-quartile-range CDATA #IMPLIED
support CDATA #IMPLIED
proportion CDATA #IMPLIED
>
<!-- =================================================================
General Tree Model
=================================================================
-->
<!-- Extended Feature:
1. Allow the tree-model to contain more than one tree.
2. Relax the criteria so that a tree does not need a model-ID.
-->
<!ELEMENT tree-model (node+)>
<!ATTLIST tree-model
model-id CDATA #IMPLIED
>
<!-- =================================================================
The root node of a model should contain a true predicate.
=================================================================
-->
<!-- Extended Feature:
1. Allows the node to contain the targets ELEMENT that specifies the target of the
prediction tree.
2. The root node does not need any arriving predicates and contains all of the pertinent
information for that tree.
-->
<!ELEMENT node (targets?, (%predicates;)?, info*, node*, score-distribution*, data-distribution*)>
<!ATTLIST node
score CDATA #IMPLIED
>
<!-- Extended Feature:
Define the targets ELEMENT.
-->
<!ELEMENT targets ((%attribute;)+)>
<!ELEMENT score-distribution EMPTY>
<!ATTLIST score-distribution
label CDATA #REQUIRED
value CDATA #REQUIRED
>
<!ELEMENT info EMPTY>
<!ATTLIST info
name CDATA #REQUIRED
value CDATA #REQUIRED
>
<!ELEMENT compound-predicate (%predicates;, (%predicates;)+)>
<!ATTLIST compound-predicate
bool-op (or | and | xor | cascade) #REQUIRED
>
<!-- Extended Feature: Allow specification of the attribute using the attribute elements
instead of the flat name. -->
<!ELEMENT predicate (%attribute;)>
<!ATTLIST predicate
attribute CDATA #IMPLIED
op (eq | ne | lt | le | gt | ge) #REQUIRED
value CDATA #REQUIRED
>
<!ELEMENT true EMPTY>
<!ELEMENT false EMPTY>
<!-- Extended Feature:
1. Define the segment-model ELEMENT.
2. The segment-model contains a list of nodes, which are the cluster points.
3. The cluster points contain a list of data-distribution for all of the attributes.
-->
<!-- =================================================================
Segment Model
=================================================================
-->
<!ELEMENT segment-model (info*, node+)>
<!-- =================================================================
Regression Model
=================================================================
-->
<!ELEMENT regression-model (factor-list?, covariate-list?,
predictor-to-parameter-correlation-matrix?,
parameter-table)>
<!ATTLIST regression-model
model-id CDATA #REQUIRED
response-variable-name CDATA #REQUIRED
number-parameters %NUMBER; #REQUIRED
model-type (regression | general-linear | log-linear | multinomial-logistic) #REQUIRED
verbose-model-specification CDATA #IMPLIED
>
<!ELEMENT factor-list (var-name+)>
<!ELEMENT covariate-list (var-name+)>
<!ELEMENT var-name (#PCDATA)>
<!ELEMENT predictor-to-parameter-correlation-matrix (predictor-to-parameter-cell+)>
<!ELEMENT predictor-to-parameter-cell (#PCDATA)>
<!ATTLIST predictor-to-parameter-cell
predictor-name CDATA #REQUIRED
parameter-name CDATA #REQUIRED
>
<!ELEMENT parameter-table (parameter-cell+)>
<!ELEMENT parameter-cell EMPTY>
<!ATTLIST parameter-cell
target-category CDATA #REQUIRED
parameter-name CDATA #REQUIRED
beta %NUMBER; #REQUIRED
std-error %NUMBER; #IMPLIED
df %NUMBER; #IMPLIED
>
6.2 Example: Tree Model to Predict Credit Risk
<?xml version="1.0"?>
<pmml>
<statements>
<statement type = "CREATE" value = "Create Mining Model CreditTree1
( ID long key,
Credit text discrete predict,
Education text discrete,
Age text discrete,
Pay text discrete
) using microsoft_decision_trees
"/>
<statement type = "TRAIN" value = "Insert Into CreditTree1
( ID, Credit, Education, Age, Pay)
OPENROWSET("Microsoft.Jet.OLEDB.4.0",
"data source=w:\test\demozero\credit.mdb",
"SELECT ID, Credit, Education, Age , Pay FROM CreditTraining"
)
"/>
</statements>
<data-dictionary name = "CreditTree1" GUID = "{707D31A7-D42A-11D3-8AEF-00C04F68DDCA}">
<key name = "ID" datatype = "LONG"/>
<categorical name = "Credit" isinput = "true" ispredict = "true" datatype = "TEXT">
<category missing = "true"/>
<category value = "Bad"/>
<category value = "Good"/>
</categorical>
<categorical name = "Education" isinput = "true" datatype = "TEXT">
<category missing = "true"/>
<category value = "Bachelor"/>
<category value = "High School"/>
<category value = "Graduate"/>
<category value = "Partial College"/>
<category value = "Partial High School"/>
</categorical>
<categorical name = "Age" isinput = "true" datatype = "TEXT">
<category missing = "true"/>
<category value = "Middle Age"/>
<category value = "Young"/>
<category value = "Old"/>
</categorical>
<categorical name = "Pay" isinput = "true" datatype = "TEXT">
<category missing = "true"/>
<category value = "Weekly pay"/>
<category value = "Monthly salary"/>
</categorical>
</data-dictionary>
<global-statistics>
<data-distribution>
<simple-attribute name = "Credit"/>
<state missing = "true" support = "0."/>
<state value = "Bad" support = "114."/>
<state value = "Good" support = "109."/>
</data-distribution>
<data-distribution>
<simple-attribute name = "Education"/>
<state missing = "true" support = "0."/>
<state value = "Bachelor" support = "109."/>
<state value = "High School" support = "24."/>
<state value = "Graduate" support = "28."/>
<state value = "Partial College" support = "34."/>
<state value = "Partial High School" support = "28."/>
</data-distribution>
<data-distribution>
<simple-attribute name = "Age"/>
<state missing = "true" support = "0."/>
<state value = "Middle Age" support = "55."/>
<state value = "Young" support = "126."/>
<state value = "Old" support = "42."/>
</data-distribution>
<data-distribution>
<simple-attribute name = "Pay"/>
<state missing = "true" support = "0."/>
<state value = "Weekly pay" support = "114."/>
<state value = "Monthly salary" support = "109."/>
</data-distribution>
</global-statistics>
<tree-model>
<info name = "Scorer" value = "4"/>
<info name = "Splitter" value = "1"/>
<info name = "Minimum Leaf Cases" value = "10"/>
<info name = "Number of ESS" value = "16"/>
<info name = "Complexity Penalty" value = "0.80000000000000004"/>
<node>
<targets>
<target>
<simple-attribute name = "Credit"/>
</target>
</targets>
<node missing = "false">
<predicate op = "eq" value = "Weekly pay">
<simple-attribute name = "Pay"/>
</predicate>
<node missing = "false">
<predicate op = "eq" value = "Young">
<simple-attribute name = "Age"/>
</predicate>
<node missing = "false">
<predicate op = "eq" value = "High School">
<simple-attribute name = "Education"/>
</predicate>
<data-distribution>
<simple-attribute name = "Credit"/>
<state missing = "true" support = "0."/>
<state value = "Bad" support = "24."/>
<state value = "Good" support = "0."/>
</data-distribution>
</node>
<node missing = "false">
<predicate op = "ne" value = "High School">
<simple-attribute name = "Education"/>
</predicate>
<data-distribution>
<simple-attribute name = "Credit"/>
<state missing = "true" support = "0."/>
<state value = "Bad" support = "60."/>
<state value = "Good" support = "9."/>
</data-distribution>
</node>
</node>
<node missing = "false">
<predicate op = "ne" value = "Young">
<simple-attribute name = "Age"/>
</predicate>
<data-distribution>
<simple-attribute name = "Credit"/>
<state missing = "true" support = "0."/>
<state value = "Bad" support = "13."/>
<state value = "Good" support = "8."/>
</data-distribution>
</node>
</node>
<node missing = "false">
<predicate op = "ne" value = "Weekly pay">
<simple-attribute name = "Pay"/>
</predicate>
<node missing = "false">
<predicate op = "eq" value = "Young">
<simple-attribute name = "Age"/>
</predicate>
<data-distribution>
<simple-attribute name = "Credit"/>
<state missing = "true" support = "0."/>
<state value = "Bad" support = "16."/>
<state value = "Good" support = "17."/>
</data-distribution>
</node>
<node missing = "false">
<predicate op = "ne" value = "Young">
<simple-attribute name = "Age"/>
</predicate>
<node missing = "false">
<predicate op = "eq" value = "Bachelor">
<simple-attribute name = "Education"/>
</predicate>
<data-distribution>
<simple-attribute name = "Credit"/>
<state missing = "true" support = "0."/>
<state value = "Bad" support = "1."/>
<state value = "Good" support = "52."/>
</data-distribution>
</node>
<node missing = "false">
<predicate op = "ne" value = "Bachelor">
<simple-attribute name = "Education"/>
</predicate>
<data-distribution>
<simple-attribute name = "Credit"/>
<state missing = "true" support = "0."/>
<state value = "Bad" support = "0."/>
<state value = "Good" support = "23."/>
</data-distribution>
</node>
</node>
</node>
</node>
</tree-model>
</pmml>
7 Appendix E: Provider Support for
SHAPE Syntax
The complete syntax of the SHAPE command is documented in the Microsoft Data Access
Component SDK. This appendix describes the subset of that syntax needed to shape multiple
result sets into a single nested table. Data mining providers should provide support for this
subset, at a minimum. Following is the basic syntax:
SHAPE {<master query>}
APPEND ({ <child table query> }
RELATE <master column> TO <child column>)
AS < column table name>
[
APPEND ({ <child table query> }
RELATE <master column> TO <child column>)
AS < column table name>
…
]
The SHAPE statement allows the addition of table columns to a master query by specifying
the child table rows and the way to match each row in the <master query> to its child
rows in the <child table query>.
Using this syntax, you can read all of the data needed for the cases from multiple queries
and shape them into a single table that is fed into the DMM.
The following example illustrates how this is done:
INSERT INTO [Age Prediction]
(
[Customer ID], [Gender], [Age], [Age Probability],
[Product Purchases] (SKIP, [Product Name], [Product Type], [Quantity]),
[Car Ownership] (SKIP, [Car Name], [Car Probability])
)
SHAPE { select [Customer ID], [Gender], [Age], [Age Probability]
from [Customers] order by [Customer ID]}
APPEND ( {select [CustID], [Product Name], [Product Type], [Quantity]
from [Customer Product Sales] order by [CustID] }
RELATE [Customer ID] TO [CustID])
AS [Product Purchases],
( {select [CustID], [Car Name], [Probability]
from [Customer Cars] order by [CustID] }
RELATE [Customer ID] TO [CustID])
AS [Car Ownership]
Following are important notes:
- The SHAPE statement has a rich syntax, and DM providers are encouraged to support as much of it as possible. At a minimum, DM providers should support the syntax described in this appendix.
- The column binding between the target DMM and the source query is done by column order, as is standard with the INSERT INTO statement.
- Table columns ("Product Purchases" and "Car Ownership") are listed in the source columns, although they are mapped to whole tables and not to single columns.
- The columns in the child query used for the relation (in the RELATE clause) are skipped by using the SKIP keyword in the column map and are not mapped into any of the columns contained in the target table column.
- A DM provider may (and usually will) mandate that the relation columns in the child queries be ordered the same as the key column in the master query.
8 Appendix F: Provider Support for
OPENROWSET Syntax
The complete documentation of the OPENROWSET command is found in the Microsoft SQL
Server® Programmer's Toolkit. This appendix provides an abbreviated version of that documentation. Data
mining providers should provide support for OPENROWSET to be used for the <source data
query> in INSERT INTO and PREDICT commands.
OPENROWSET('provider_name'
{
'datasource';'user_id';'password'
| 'provider_string'
},
{
'query'
})
'provider_name'
A character string that represents the friendly name of the OLE DB provider as
specified in the registry. provider_name has no default value.
'datasource'
A string constant that corresponds to a particular OLE DB data source object.
datasource is the DBPROP_INIT_DATASOURCE property passed to the provider's
IDBProperties interface to initialize the provider. Typically, this string includes the
name of the database file, the name of a database server, or a name that the provider
understands to locate the database(s).
'user_id'
A string constant that is the user name passed to the specified OLE DB provider.
user_id specifies the security context for the connection and is passed in as the
DBPROP_AUTH_USERID property to initialize the provider.
'password'
A string constant that is the user password passed to the OLE DB provider. password
is passed in as the DBPROP_AUTH_PASSWORD property when initializing the
provider.
'provider_string'
A provider-specific connection string that is passed in as the
DBPROP_INIT_PROVIDERSTRING property to initialize the OLE DB provider.
provider_string typically encapsulates all the connection information needed to
initialize the provider.
'query'
A string constant that is sent to and executed by the provider. For more information,
see SQL Server OLE DB Programmer's Reference.
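For illustration, the following sketch uses the 'datasource';'user_id';'password' form to populate the CreditTree1 model from Appendix D. The provider name, server, login, and table names are placeholders, not part of this specification:

INSERT INTO CreditTree1
( ID, Credit, Education, Age, Pay )
OPENROWSET('SQLOLEDB',
           'MyServer';'MyUserId';'MyPassword',
           'SELECT ID, Credit, Education, Age, Pay FROM CreditTraining')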
9 Appendix G: Support for Other
Data Mining Algorithms
Although most examples in this document are based on decision tree and clustering
algorithms, the purpose of the OLE DB for Data Mining specification is to provide a data
mining standard that supports all data mining algorithms. PMML is adopted to present the
content of the different algorithms; this information is stored in the content schema rowset
after the model is trained. In this appendix, support for the association and regression
algorithms is illustrated, based on the syntax defined in this document.
9.1 Support for Association Algorithm
Association is one of the most popular data mining algorithms. It can be applied to market
basket analysis, cross-selling, Web site mining, and so forth. The typical problem the
association algorithm solves is: given a transaction table of products that customers have
bought, which items does a customer tend to buy together?
Suppose there are two tables: Transaction and Purchase. The Transaction table stores
information about a transaction, such as transaction ID, time, store, and so on. The Purchase
table stores the purchased products for each transaction.
The following statement creates a data mining model, based on an association algorithm, that
finds products that sell together. The model is interested only in rules with at least
five items.
Create Mining Model MyAssociationModel (
Transaction_id long key,
[Product purchases] table predict (
[Product Name] text key
)
)
Using [My Association Algorithm] (Minimum_size = 5)
Training an association model is exactly the same as training a tree model or a clustering
model. The results of the training are stored in the MINING_MODEL_CONTENT schema
rowset. In the content schema rowset, there is a column called Rule, which stores the PMML
representation of an association rule.
To get all the association rules discovered by the algorithm, run the following statement:
Select * from MyAssociationModel.content
This returns the content schema rowset that contains all the rules. It is also possible to search
for some particular rules—for example, all the products associated with "Milk."
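Once the model is trained, a prediction query might recommend additional products for a new transaction. The following is only a sketch; the shaped <source data query>, its nested alias [Purchase], and the column mappings are assumptions:

SELECT t.[Transaction_id],
       TopCount((SELECT [Product Name], $Probability AS [Probability]
                 FROM Predict([Product purchases], INCLUDE_STATISTICS)),
                [Probability], 3) AS [Suggested Products]
FROM MyAssociationModel
PREDICTION JOIN <source data query> AS t
ON MyAssociationModel.[Product purchases].[Product Name] = t.[Purchase].[Product Name]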
9.2 Support for Regression Algorithm
Regression is another popular data mining algorithm. It is used to find the relationship
between a response variable and several possible predictor variables through a mathematical
formula. There are several regression methods, such as linear regression, logistic
regression, and nonlinear regression.
A linear regression equation is usually written as follows:
Y = a + bX + e
where
Y is the dependent variable
a is the intercept
b is the slope or regression coefficient
X is the independent variable
e is the error term
Suppose there is a loan table containing customer demographic information and the level of
risk of each loan. By using a regression algorithm, the following mining model predicts loan
risk level based on age, income, homeowner, and marital status.
Create Mining Model MyRegressionModel (
Customer_id long key,
Age long continuous,
Homeowner boolean discrete,
Marital_status boolean discrete,
Loan_risk_level continuous predict
)
Using [My Regression Algorithm]
Training a regression model is exactly the same as training a tree model or a clustering model.
The values of the intercept, the regression coefficients, and the error term are stored in the
MINING_MODEL_CONTENT schema rowset, in the Rule column, in PMML format.
The following statement returns all the coefficients of regression:
Select * from MyRegressionModel.content
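A prediction query against the regression model follows the same pattern as for the other models. The following is a sketch; the <source data query> and the join mapping are assumptions:

SELECT t.[Customer_id],
       Predict([Loan_risk_level]) AS [Predicted Risk],
       PredictStdev([Loan_risk_level]) AS [Risk Stdev]
FROM MyRegressionModel
PREDICTION JOIN <source data query> AS t
ON MyRegressionModel.Age = t.Age AND
   MyRegressionModel.Homeowner = t.Homeowner AND
   MyRegressionModel.Marital_status = t.Marital_status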
Copyright
This is a preliminary document and may be changed substantially prior to final commercial release. This document is provided for
informational purposes only and Microsoft makes no warranties, either express or implied, in this document. Information in this document,
including URL and other Internet Web site references, is subject to change without notice. The entire risk of the use or the results of the use
of this document remains with the user. Unless otherwise noted, the example companies, organizations, products, people and events
depicted herein are fictitious and no association with any real company, organization, product, person or event is intended or should be
inferred. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part
of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic,
mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this
document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you
any license to these patents, trademarks, copyrights, or other intellectual property.
© 2000 Microsoft Corporation. All rights reserved.
Microsoft, MS-DOS, Windows, Windows NT, SQL Server, and Visual C++ are either registered trademarks or trademarks of Microsoft
Corporation in the United States and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective owners.