Download Data Source View Guidelines

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

K-means clustering wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
11
Data Mining
Microsoft Confidential. © 2006 Microsoft Corporation. All rights reserved. These materials are
confidential to and maintained as a trade secret by Microsoft Corporation. Information in these
materials is restricted to Microsoft authorized recipients only. Any use, distribution or public
discussion of, and any feedback to, these materials is subject to the terms of the attached
license. By providing any feedback on these materials to Microsoft, you agree to the terms of that
license.
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
Microsoft Corporation Technical Documentation License Agreement (Standard)
READ THIS! THIS IS A LEGAL AGREEMENT BETWEEN MICROSOFT CORPORATION ("MICROSOFT") AND
THE RECIPIENT OF THESE MATERIALS, WHETHER AN INDIVIDUAL OR AN ENTITY ("YOU"). IF YOU HAVE
ACCESSED THIS AGREEMENT IN THE PROCESS OF DOWNLOADING MATERIALS ("MATERIALS") FROM A
MICROSOFT WEB SITE, BY CLICKING "I ACCEPT", DOWNLOADING, USING OR PROVIDING FEEDBACK ON
THE MATERIALS, YOU AGREE TO THESE TERMS. IF THIS AGREEMENT IS ATTACHED TO MATERIALS, BY
ACCESSING, USING OR PROVIDING FEEDBACK ON THE ATTACHED MATERIALS, YOU AGREE TO THESE
TERMS.
1. For good and valuable consideration, the receipt and sufficiency of which are acknowledged, You and
Microsoft agree as follows:
(a) If You are an authorized representative of the corporation or other entity designated below
("Company"), and such Company has executed a Microsoft Corporation Non-Disclosure Agreement that is
not limited to a specific subject matter or event ("Microsoft NDA"), You represent that You have
authority to act on behalf of Company and agree that the Confidential Information, as defined in the
Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that Company will treat the
Confidential Information accordingly;
(b) If You are an individual, and have executed a Microsoft NDA, You agree that the Confidential
Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA
and that You will treat the Confidential Information accordingly; or
(c)If a Microsoft NDA has not been executed, You (if You are an individual), or Company (if You are an
authorized representative of Company), as applicable, agrees: (a) to refrain from disclosing or distributing
the Confidential Information to any third party for five (5) years from the date of disclosure of the
Confidential Information by Microsoft to Company/You; (b) to refrain from reproducing or summarizing the
Confidential Information; and (c) to take reasonable security precautions, at least as great as the
precautions it takes to protect its own confidential information, but no less than reasonable care, to keep
confidential the Confidential Information. You/Company, however, may disclose Confidential Information
in accordance with a judicial or other governmental order, provided You/Company either (i) gives
Microsoft reasonable notice prior to such disclosure and to allow Microsoft a reasonable opportunity to
seek a protective order or equivalent, or (ii) obtains written assurance from the applicable judicial or
governmental entity that it will afford the Confidential Information the highest level of protection afforded
under applicable law or regulation. Confidential Information shall not include any information, however
designated, that: (i) is or subsequently becomes publicly available without Your/Company’s breach of any
obligation owed to Microsoft; (ii) became known to You/Company prior to Microsoft’s disclosure of such
information to You/Company pursuant to the terms of this Agreement; (iii) became known to
You/Company from a source other than Microsoft other than by the breach of an obligation of
confidentiality owed to Microsoft; or (iv) is independently developed by You/Company. For purposes of this
paragraph, "Confidential Information" means nonpublic information that Microsoft designates as being
confidential or which, under the circumstances surrounding disclosure ought to be treated as confidential
by Recipient. "Confidential Information" includes, without limitation, information in tangible or intangible
form relating to and/or including released or unreleased Microsoft software or hardware products, the
marketing or promotion of any Microsoft product, Microsoft's business policies or practices, and
information received from others that Microsoft is obligated to treat as confidential.
2. You may review these Materials only (a) as a reference to assist You in planning and designing Your
product, service or technology ("Product") to interface with a Microsoft Product as described in these
Materials; and (b) to provide feedback on these Materials to Microsoft. All other rights are retained by
Microsoft; this agreement does not give You rights under any Microsoft patents. You may not (i) duplicate
any part of these Materials, (ii) remove this agreement or any notices from these Materials, or (iii) give
any part of these Materials, or assign or otherwise provide Your rights under this agreement, to anyone
else.
3. These Materials may contain preliminary information or inaccuracies, and may not correctly represent
any associated Microsoft Product as commercially released. All Materials are provided entirely "AS IS." To
the extent permitted by law, MICROSOFT MAKES NO WARRANTY OF ANY KIND, DISCLAIMS ALL EXPRESS,
IMPLIED AND STATUTORY WARRANTIES, AND ASSUMES NO LIABILITY TO YOU FOR ANY DAMAGES OF
ANY TYPE IN CONNECTION WITH THESE MATERIALS OR ANY INTELLECTUAL PROPERTY IN THEM.
4. If You are an entity and (a) merge into another entity or (b) a controlling ownership interest in You
changes, Your right to use these Materials automatically terminates and You must destroy them.
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
5. You have no obligation to give Microsoft any suggestions, comments or other feedback ("Feedback")
relating to these Materials. However, any Feedback you voluntarily provide may be used in Microsoft
Products and related specifications or other documentation (collectively, "Microsoft Offerings") which in
turn may be relied upon by other third parties to develop their own Products. Accordingly, if You do give
Microsoft Feedback on any version of these Materials or the Microsoft Offerings to which they apply, You
agree: (a) Microsoft may freely use, reproduce, license, distribute, and otherwise commercialize Your
Feedback in any Microsoft Offering; (b) You also grant third parties, without charge, only those patent
rights necessary to enable other Products to use or interface with any specific parts of a Microsoft Product
that incorporate Your Feedback; and (c) You will not give Microsoft any Feedback (i) that You have reason
to believe is subject to any patent, copyright or other intellectual property claim or right of any third
party; or (ii) subject to license terms which seek to require any Microsoft Offering incorporating or derived
from such Feedback, or other Microsoft intellectual property, to be licensed to or otherwise shared with
any third party.
6. Microsoft has no obligation to maintain confidentiality of any Microsoft Offering, but otherwise the
confidentiality of Your Feedback, including Your identity as the source of such Feedback, is governed by
Your NDA.
7. This agreement is governed by the laws of the State of Washington. Any dispute involving it must be
brought in the federal or state superior courts located in King County, Washington, and You waive any
defenses allowing the dispute to be litigated elsewhere. If there is litigation, the losing party must pay the
other party’s reasonable attorneys’ fees, costs and other expenses. If any part of this agreement is
unenforceable, it will be considered modified to the extent necessary to make it enforceable, and the
remainder shall continue in effect. This agreement is the entire agreement between You and Microsoft
concerning these Materials; it may be changed only by a written document signed
by both You and Microsoft.
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
Overview
Data Mining deals with the proactive process of searching and discovering the patterns in the
data and present it in a more predictable and useful format. It enables making business
decisions, increasing revenues, reducing costs and boost customer confidence.
Data Mining discovers the hidden knowledge with in the information which is not possible with
traditional relational or OLAP technologies.
Need for Data Mining
Identifying the patterns in the business data is a required step for the analysts to take right
decisions at the right time for business portfolio. The following sections give more insight on the
need for data mining for specific business needs.
Relational and OLAP tools are good for finding and reporting information that is already existing
and visible within the data. Hidden knowledge that represents patterns and regularities in data
can’t be easily found by traditional reporting. Scenarios where the relational and OLAP reporting
are sufficient include the following:





Customers who used a Visa card to pay for gas bills
Employees over age 50 who took more than 10 sick days last year
Patients over 40 with blood pressure over 140 who have had at least one heart attack
The total Sales Dollars by region by month for the past three years
The year over year growth in discount percent by customer type
In all these cases the selection and grouping criteria are known in advance. Data Mining provides
solution when the specific selection and grouping criteria are not known in advance, but are
derived from the data values. Following are scenarios for which data mining can be highly
effective:







Predicting the seasonally adjusted sales for software products in order to prepare for
customer support requirements
Identifying demographic groups that purchase different makes of cars to plan import
allotments and focus marketing efforts
Grocery shop owners identifying products that customers typically buy together so the
product physical arrangement can be optimized
Targeting a mailing campaign to a customer base that reflects specific behavioral
patterns. For example, mailing baby product discount coupons to family that have
recently made similar purchases
Focusing retention efforts on most effective employees based on revenue contribution,
alignment with company policies, and consistency in performance
Preventing potential fraud by comparing a transaction to previous purchasing patterns for
a customer
Identifying matching customers even when data entry errors include misspellings of the
customer’s name or address.
Factors that would encourage considering data mining include the following:
 Data availability in source Systems: Detailed data is available from source systems,
preferably on a near real-time basis. Having detailed data would be a good candidate for
accurate and predictable results.
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.



Huge data volume: Large data sets that can be difficult to analyze effectively using other
tools lend themselves to data mining solutions. Also, the statistical functions in data mining
require a large sample set in order to produce meaningful results.
Complexity to identify trends: Having multiple factors enter into to forecasting or discovery
analysis lends itself to data mining, particularly when the appropriate grouping structures are
not known in advance.
Automating with minimum user interaction: Because data mining is driven by data
values, the same solution can be implemented at different customer locations, achieving
customized behavior with no changes to the application.
Data mining is not
Data warehouse: Data warehouse, relational or OLAP can be used for mining process but data
mining itself is not data warehouse store for storing warehouse objects such as facts, dimensions.
Reporting store: Data mining is not a report store. It provides a method for analyzing data and
making decisions. It does not provide any reports other than the analyzed data
OLAP: Online Analytical processing stores the data warehouse data in multi dimensional store
and also does aggregates accordingly. Data Mining does not require the data to be in multi
dimensional or aggregations. It cannot be treated as a replacement of OLAP store.
Data Visualization: DMX queries are to be issues against the data mining models to get the
results of algorithms in mining models. There is a limited interface for viewing the data mining
predictions in Business Intelligence Development Studio, additional visualization tools must be
used to interface with external clients.
Though OLAP is a data store in a multi-dimensional structure that includes aggregations, it
cannot replace data mining models, as OLAP requires pre-defined grouping buckets. OLAP data
can be used as an alternative to detailed relational data as a source for data mining models. The
data mining models then provide further analysis and additional insight on the patterns and
predictions. The following table shows what is possible/not possible with both OLAP and Data
Mining models.
OLAP
Data Mining
Typically focuses on historical facts
Typically focuses on future outcomes or trends
Aggregates data using pre-defined
Requires detail data
groupings
Verification driven/Factual results
Discovery driven
Ad hoc queries and reports
Statistical and machine learning techniques
Limited ability to include reliability
Data models available for predicting, discovering
estimates with predictions
patterns, estimating and producing accurate
results for trend analysis and forecasting
OLAP can be used as a data source
Data mining results can also be used in OLAP
for Data Mining models
applications by incorporating new predictive
variables or scores as dimensions or attributes in
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
your OLAP tool
Data Mining Methodology
Data mining is frequently described as "the process of extracting valid, authentic, and actionable
information from large databases." In other words, data mining derives patterns and trends that
exist in data. These patterns and trends can be collected together and defined as a mining model.
Building a data mining model is part of a larger process that includes everything from defining the
basic problem the model will solve, to deploying it into a working environment. This process can
be divided into the following six steps:
1.
2.
3.
4.
5.
6.
7.
Defining the problem
Data Preparation
Exploring data
Building the model
Exploring and validating the model
Deploying and updating the models
Accessing the models
Defining the problem:
This involves defining the expectations from data mining, including business analysts’
requirements, marketing strategies, portfolio, forecasting, and decision support requirements. For
example, a grocery store owner needs to find out the cross-sell of the customers in order to
increase the customer base by providing what they need, together. For a grocery store manager,
projected sales information may not be as important as the cross-sell requirement. The next
steps of the data mining act on the data collected for satisfying the business requirement. More
of the business scenarios and modeling techniques are explained in the later sections of the
chapter.
Data Preparation:
The Data Extraction chapter details the type of data sources and extraction methods for data
collections effectively. Typically the source transaction system consolidates all the discrete
channels of data at one place and applies transactions, but it is not necessarily be one source of
information. Data Mining involves analyzing the source data at a broader level that includes both
internal and external to the system. Data collection is completely based on the business
requirements not necessarily extracting all the data. Following are some examples:
Internal Data Sources: Company activities, Customer records, Webs sites, Mail campaigns,
purchasing transactions, Inventory
External Data Sources: Partners and Supplies that contains external credit agencies, market
surveyors, customer feedbacks
Data mining modals can extract data either from relational structures or from OLAP store,
explained as below in detail. Data mining algorithms implementation work same for both
relational and OLAP models. The only difference is the source structure and format.
Relational Data Mining Models:
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
Relational mining models can be built from data that is stored in relational database systems.
This type of mining model is based on data that is retrieved from any OLE DB data source. A
data source may contain both single case and multiple case tables. Case is considered to be a
look up or master table in relational transactional systems and as dimension in relational
dimensional models. Transactional tables or fact tables where there can be more than one record
exist per one case record, treated as Nested for data mining models. Data mining models can
extract data from these relational models for modeling and processing. Mined data (that contains
the results from models in the form of predictions, patterns) can be extracted using queries (also
called DMX) to interface and provide interactions to the analysts.
OLAP Data Mining Models
OLAP cubes contain complex number of members and dimensions in a multi dimensional
structure. Unlike relational data mining models, OLAP mining models do not preserve the
granularity of the data. This is because the patterns and trends of the data might be lost in the
process of creating aggregations. This makes it difficult to manually find the hidden patterns
within these members and dimensions. To overcome this difficulty, you can use the OLAP data
mining models that are based on OLAP data sources. These models allow you to find the
complex patterns and apply them to business decisions. Similar to the relational data mining
models, the case key and case level columns from an OLAP data source need to be specified
while designing data mining model. While specifying the nested tables, only the measure groups
that apply to the dimension are displayed. Select a measure group that contains the foreign key
of the case dimension, as nested for data mining model.
More information at http://msdn2.microsoft.com/en-us/ms175645.aspx
Source data systems collect data of static in nature, transactional and seldom changed
information either from interface systems or extracted from external/internal systems (explained
above). Based on the nature of source data, it is segregated into following types,
Demographic data: This generally describes personal information
 Personal details such as gender, age, martial status, income, status, department,
geographical location
 Strengths – very stable, good for predictive modeling. This data is not subjected to change
frequently compared to the data that is transactional in nature (such as exchange rate,
product cost, bank balance, favorites).
 Weakness – difficult to get accurate information accurately on an individual basis. Having the
inaccurate demographic data leads to incorrect results from data mining. For example, if the
analysis shows that women bought more mopeds than men, it may be because the gender is
incorrectly stored.
Behavior data: It is measurement of an action or behavior typically transactional in nature
 Typically the most predictive type of data but it can change rapidly.
 Depends on the type of industry
o Sales amount, types and dates of purchase, payment dates and invoice amounts,
customer service attributes, insurance claims
o Web site activity – click stream behavior or exact path of each visitor
 Predicts the business patterns
 Strengths: Accuracy is higher as it is happened data (transactions) than collected data
(demographic). Good candidate for data mining predictable columns
 Weakness: Rapid change results in higher volume
Psychographic or attitudinal data: This is characterized by opinion polls, blogs, surveys,
lifestyle characteristics or personal values
 Traditionally associated with market research – surveys, opinion polls, customer feedbacks,
marketing, focus groups
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.



It brings added dimension to predictive modeling when companies have squeezed the
predictive power out of the demographic and behavior data
Strengths: Overall customer feedback and useful for analysis at a broader range to check if
the products are liked by customers
Weakness: Not accurate, not based on any particular individual. Different surveys will give
different results
Data quality and transformations is explained in the other chapters. In brief, even If your sources
are clean, integrated, and validated, they may contain data about the real world that is simply not
true. This noise can, for example, be caused by errors in user input or just plain mistakes of
customers filling in questionnaires. If it does not occur too often, data mining tools are able to
ignore the noise and still find the overall patterns that exist in your data
Data preparation and cleaning is an often neglected but extremely important step in the data
mining process. It is not required to cleanse the data before the extracted by data mining process
if the intention is to find out data quality problems using data mining models.
Data can be polluted in a number of ways such as user interface, application problems, data
collection mechanisms, heterogeneous systems, data transformation procedures. Often, the
method by which the data where gathered was not tightly controlled, and so the data may contain
out-of-range values (e.g., Income: -100), incorrect data combinations (e.g., Having New York in
Japan). If the intention is not to find out data quality problems using data mining models, consider
cleansing the data before using it in the models. Data quality problems and techniques are
explained in Data Transformation chapter.
Data mining can handle either numeric or text based data.
Numeric Mining:
Here the input columns may have descriptive (text) content, but the prediction columns contain
numeric data. Simple operations on numeric data as Greater, Less, Percentage can be applied
to deduce more meaningful patterns and forecast. This chapter talks about data mining models
that deal with numeric data for predictions.
Text Mining:
While data mining is typically concerned with the detection of patterns in numeric data, very often
information that is critical to the business is stored in the form of text. Unlike numeric data, text is
often difficult to deal with. Text mining generally consists of the analysis of (multiple) text
documents by extracting key phrases, parsing, concepts, etc. and the preparation of the text
processed in that manner for further analyses with numeric data mining techniques (e.g., to
determine co-occurrences of concepts, key phrases, names, addresses, product names, etc.).
Exploring data:
This step involves checking if the required data contains the expected information. For example,
if cross-sell information is to be analyzed as part of data mining models then individual customer
transactions and products purchased must be captured. The term data reduction in the context of
data mining usually applies to the goal of aggregating the information from large datasets into
manageable information chunks. Data reduction methods include simple tabulation or
aggregation, or more sophisticated techniques like clustering and principal components analysis.
Building Model:
A model typically contains input columns, an identifying column, and a predictable column.
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
Data type for the columns can be defined in a mining structure based on which algorithms
process the data. The following basic terms would be useful to understand about the column
types, enable for further studying the rest of the sections
Continuous Column: This column contains numeric measurements typically the product cost,
salary, account balance, shipping date, invoice date having no upper bound.
Discrete Column: These are finite unrelated values such as Gender, location, age, telephone
area codes. They do not need to be numeric in nature, and typically do not have a fractional
component.
Discretized Column: This is a continuous column converted to be discrete. For example,
grouping salaries into predefined bands.
Key: The column which uniquely identifies the row, similar to the primary key. This is sometimes
called the Case attribute.
In brief, the available algorithms while using Business Intelligence Development Studio are
Microsoft Decision Trees Algorithm: This algorithm uses the values, or states, of the
designated “input columns” to predict the states of the column that was designated as
“predictable”. It identifies the attribute tree that best predicts the result. This algorithm allows for
interplay between attributes and provides a hierarchy of attribute definitions that can be used to
take a decision.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms175312.aspx
Microsoft Clustering Algorithm: The algorithm does grouping of the cases in a dataset into
clusters that contain similar characteristics. Identifies how the data forms subgroups and how
these subgroups are different from each other. This algorithm finds patterns without a specific
target result.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms174879.aspx
Microsoft Naive Bayes Algorithm: Identifies the attribute that is most likely to predict the result.
This algorithm is less computationally intense than other Microsoft algorithms, and therefore is
useful for quickly generating a mining model to discover relationships between input columns and
predictable columns. You can use this algorithm to do initial explorations of data, and then later
apply the results to create additional mining models with other algorithms that are more
computationally intense and more sophisticated.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms174806.aspx
Microsoft Association Algorithm: Association models are built on datasets that contain
identifiers both for individual cases and item set that the cases contain. An association model is
made up of a series of item sets and the rules that describe how those items are grouped
together within the cases. The rules that the algorithm identifies can be used to predict a
customer's likely future purchases, based on the items that already exist in the customer's
shopping cart. It basically identifies the subgroup of data that participates in a specific
transaction.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms174916.aspx
Microsoft Sequence Clustering Algorithm: Identifies the event that is likely to happen next.
The algorithm takes a sequence of events as input parameter and is well suited for click stream.
This algorithm is similar to the Microsoft Clustering Algorithm. However, instead of finding clusters
of cases that contain similar attributes, this algorithm finds clusters of cases that contain similar
paths in a sequence.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms175462.aspx
Microsoft Time Series Algorithm: This algorithm is used for predicting continuous columns
such as product sales. While other Microsoft algorithms create models, time series model is
based only on the trends that the algorithm derives from the original dataset to create a forecast
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
model. It basically identifies the trends that are happening and predicting future from the current
data.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms174923.aspx
Microsoft Neural Network Algorithm: Similar to the Microsoft Decision Trees algorithm, this
algorithm also Identifies attribute tree that best predicts the result, but involves more than 2
attributes analyzed at a time. This algorithm calculates probabilities for each possible state of the
input attribute when given each state of the predictable attribute.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms174941.aspx
Microsoft Logistic Regression Algorithm: The Microsoft Logistic Regression algorithm is a
variation of the Microsoft Neural Network algorithm, where the HIDDEN_NODE_RATIO
parameter is set to 0. This setting will create a neural network model that does not contain a
hidden layer, and that therefore is equivalent to logistic regression.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms174828.aspx
Microsoft Linear Regression Algorithm: The Microsoft Linear Regression algorithm is a
variation of the Microsoft Decision Trees algorithm, where the MINIMUM_LEAF_CASES
parameter is set to be greater than or equal to the total number of cases in the dataset that the
algorithm uses to train.
More information can be found at http://msdn2.microsoft.com/en-us/library/ms174824.aspx
The output of data mining model can provide you with the analyzed and forecast data that can be
readily used by the business analysts. For example, if you have a budget to mail information to
1000 people about a new product, relational or OLAP queries will not produce the optimal set of
1000 people. By enhancing your data by creating a data mining attribute that you can use in your
query or OLAP analysis, data mining enables you to find the 1000 people most likely to respond.
This example also shows that data mining does not replace OLAP, but enhances it.
SQL Server 2005 Business Intelligence Development Studio is an integrated environment that
includes several data mining algorithms and tools for building a comprehensive data mining
solution. For example, you can use the Data Mining Designer tool in Business Intelligence
Development Studio to create, modify, and compare the data mining models. Different Data
mining models are explained in detail in the later sections of this chapter. A data mining model
applies a mining model algorithm to the data that is represented by a mining structure. Model
Parameters and boundary values can be defined on the data mining algorithms and usage
parameters on data mining model column. You can define columns to be input columns, key
columns, or predictable columns.
Evaluating and validating the model:
Different models for a given business problem could be used for analyzing various business
scenarios, identifying the analytical requirements, tuning the parameters and evaluating the
results of the models to make a business decision.
This section explains the details about considering various models and choosing the best one
based on their predictive performance (i.e., explaining the variability in question and producing
stable results across samples). This may sound like a simple operation, but in fact, it sometimes
involves a very elaborate process. There are a variety of techniques developed to achieve that
goal – typically applying different models to the same data set and then comparing their
performance to choose the best
Data analysis life cycle represent the maturity model of the analysis.
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
Trend
analysis
Operational
analysis
Adhoc
analysis
Predictive
analysis
Data analysis Cycle
Operational analysis is nothing but business transaction Reports (closing bank balances, who
was admitted into the hospital today, how many support calls are closed today etc)
Trend analysis understands the growth of the historical data over a period of time.
Ad hoc analysis is business context analysis (Products sales by region) or it can also be used for
finding the root cause such as sudden decrease in sales of a product due floods or natural
calamity
Predictive analysis is predicting the patterns for the future (also called forecasting)
Deploying and updating the models
Once the right data mining model is chosen and trained with the source data, you deploy it on the
server for DMX queries and APIs to access. These queries act directly on the deployed models
and feed the result to the interface for customer interaction and decision making. The processed
models may need to be updated as the business requirement changes. For example, a change in
business requirements may require you to consider choosing a different mining model.
Alternatively, the input data may change or you may have new values you need to predict. Data
mining models can be trained (processed) and deployed by using SSIS tasks and
transformations. This is helpful when you want the model to respond to source data changes on a
near-real-time basis. Also, mining results can be accessed by using SSIS tasks and
transformations for feeding OLAP or relational structures. This is useful when end users access
and analyze the data using existing OLAP or relational reports.
Accessing the model:
Once the model is built and deployed, the next step is to access the mined information from the
front-end interface for further analysis. The query language DMX is typically used for accessing
data mining models. DMX is similar to the MDX query language for OLAP queries and to the SQL
query language for relational queries.
Following tasks can be performed using DMX query language, they are
1. Creating mining structures and mining models
2. Processing mining structures and mining models
3. Deleting or dropping mining structures or mining models
4. Copying mining models
5. Browsing mining models
6. Predicting against mining models
For more information on different types of DDL and DML statements offered by DMX please refer
to http://msdn2.microsoft.com/en-us/library/ms132025.aspx
For more information on the functions features in DMX please refer to
http://msdn2.microsoft.com/en-us/library/ms131996.aspx
For more information on operators supported in DMX query statements please refer to
http://msdn2.microsoft.com/en-us/library/ms132095.aspx
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
Data Mining Strategies
There are two main kinds of models in data mining, they are Predictive and Descriptive.
Predictive models can be used to forecast explicit values, based on patterns determined from
known results. For example, from a database of customers who have already responded to a
particular offer, a model can be built that predicts which prospects are likeliest to respond to the
same offer.
Descriptive models describe patterns in existing data, and are generally used to create
meaningful subgroups such as demographic clusters.
The following picture represents the hierarchical view of data mining models and segregations of
algorithms into these two models.







Classification algorithms predict one or more discrete variables, based on the other attributes
in the dataset. An example of a classification algorithm is the Microsoft Decision Trees
Algorithm.
Regression algorithms predict one or more continuous variables, such as profit or loss, based
on other attributes in the dataset. An example of a regression algorithm is the Microsoft
Regression Algorithm.
Time Series algorithms forecast the patterns based on the current set of continuous
predictable attributes. The data that is to be taken as a base for future patterns predictions
can be configured. Microsoft Time Series algorithms would be a best fit to solve time series
related business requirement such as forecasting
Prediction is the estimation of future outcomes, such as predicting which customers will be
loyal, predicting which customers will respond to a promotion, works on continuous attribute
set. Microsoft Time Series and Decision Trees Algorithms would be better examples to
implement these scenarios.
Segmentation algorithms divide data into groups, or clusters, of items that have similar
properties. An example of a segmentation algorithm is the Microsoft Clustering Algorithm.
Summarization algorithms are similar to clustering algorithm but instead of grouping the data,
it would quantify the members of the group, such as group 1 has more number of line items
available and it has most probability of occurring. Microsoft Clustering Algorithm would give
this information apart from clustering the selected data set.
Association algorithms find correlations between different attributes in a dataset. The most
common application of this kind of algorithm is for creating association rules, which can be
used in a market basket analysis. An example of an association algorithm is the Microsoft
Association Algorithm.
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.

Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a
Web path flow. An example of a sequence analysis algorithm is the Microsoft Sequence
Clustering Algorithm.
Choosing the right algorithm to use for a specific business task can be a challenge. While you
can use different algorithms to perform the same business task, each algorithm produces a
different result, and some algorithms can produce more than one type of result. For example, you
can use the Microsoft Decision Trees algorithm not only for prediction, but also as a way to
reduce the number of columns in a dataset, because the decision tree can identify columns that
do not affect the final mining model.
You can use different algorithms to perform the same business task and each algorithm produces
a different result. Lift charts would be useful to check the accuracy of the data mining models
once built on the input data.
Use more than one algorithm to produce results and analyze the results for choosing the right
one. Different algorithms produce different results. The choosing of the algorithms is based on
the accuracy and on the business need
Use algorithms together – use some algorithms to explore data, and then use other algorithms to
predict a specific outcome based on that data. For example, you can use a clustering algorithm,
which recognizes patterns, to break data into groups that are more or less homogeneous, and
then use the results to create a better decision tree model.
Use multiple algorithms within one solution to perform separate tasks. For example, regression
tree algorithm can be used to obtain financial forecasting information, and a rule-based algorithm
to perform a market basket analysis.
Drill down analysis on processed mining models would be useful to check the granular content
behavior over the time to denote the interactive exploration of data, in particular of large
databases. The process of drill-down analyses begins by considering some simple break-downs
of the data by a few variables of interest (e.g., Gender, geographic region, etc.).
If the individual attributes you have are transaction amounts, you should model them as
continuous rather than discrete.
Below lists out some of the popular business scenarios for which data mining models are sought
and right model chosen
Real Customer Scenarios:
PROBLEM
The marketing department of a car company
needs to identify the characteristics of existing
customers to determine whether they are likely
to buy a product in the future
The marketing department of a car company
needs to predict monthly car sales for the
coming year. It also needs to identify whether
the sales report of one model can be used to
predict the sales of another model
The car company is redesigning its web site to
SOLUTION
By using the Microsoft Decision Trees
Algorithm, the marketing department can
predict whether a particular customer will
purchase a product. The Microsoft Decision
Trees algorithm can make a prediction based
on the customer information, such as
demographics or past buying patterns.
By using the Microsoft time Series Algorithm on
the past three year historical data, a data
mining model that forecasts future car sales
can be produced. Using this model, you can
also make cross predictions to determine the
relationship between sales trends of individual
car models
By using the Microsoft Association Rules
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
favor the sale of products
Predicting when the customer balance would
become zero, a typical banking requirement so
that alerting the customer or automatic moving
the funds from another account
In price comparison system which matches the
best prices for a purchase. There could be a
number of major items in an order and each
major item could have multiple related sub
items. The other variables that affect the price
include trade-ins if any, sales going on at the
time of order, number of units etc.
A Retail Store selling video game consoles
wants to introduce the new XBOX 360 during
the holiday season. Before the store introduces
the product the manager wants to predict which
of the existing customers are more likely to buy
the new product
Algorithm on the company records of each sale
in the transactional database, the company can
identify the various car models and accessories
that tend to be purchased together. The
company can then predict additional items that
a customer might be interested in.
Making use of Decision Trees and Neural
Networks. Check the recent transaction of the
customers when it hit zero, come out with input
attributes such as premium repay, summer
bike renting, gender, and age then prepare the
model. This can also be predicted with
average monthly balance, average weekly
balance of all the customer and predict when
current week balance is less than average
weekly balance and could reach zero if
continues.
Making use of decision trees, neural nets, or
logistic regression would be a better option,
consider the parameters such as product type,
weight, cost, location, date of the year as input
parameters and predict the cost of the product
across different variables
Microsoft Decision trees would be a good
model for this scenario to study the customer
parameters who would mostly buy the Xbox
360, Ex: Kids of age less than 20 would most
probably buy product. The input columns for
this case consideration would d be Customer
ID, Age, Gender,
Marital Status, Total
children, Occupation, Location, Yearly Income
based on who can afford for purchasing and
having time at home to play
Usage based matrix of each data mining algorithm:
Task
Predicting a discrete attribute. For example, to
predict whether the recipient of a targeted mailing
campaign will buy a product.
Predicting a continuous attribute. For example, to
forecast next year's sales.
Predicting a sequence. For example, to perform a
clickstream analysis of a company's Web site.
Finding groups of common items in transactions.
For example, to use market basket analysis to
suggest additional products to a customer for
purchase.
Finding groups of similar items. For example, to
segment demographic data into groups to better
understand the relationships between attributes.
Microsoft algorithms to use
Microsoft Decision Trees Algorithm
Microsoft Naive Bayes Algorithm
Microsoft Clustering Algorithm
Microsoft Neural Network Algorithm (SSAS)
Microsoft Decision Trees Algorithm
Microsoft Time Series Algorithm
Microsoft Sequence Clustering Algorithm
Microsoft Association Algorithm
Microsoft Decision Trees Algorithm
Microsoft Clustering Algorithm
Microsoft Sequence Clustering Algorithm
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
Data Mining Guidelines
Following are the list of suggestions/guidelines related to implementation and modeling the Data
Mining algorithms.
Considering updating data mining model when volume of data
increases
Data Mining accuracy depends not only on the model adopted but the details of the data. The
more the detailed the data is that much is the accuracy, but having more data being analyzed by
data mining model would be time consuming. So choosing the right sampling of data is required.
Example: Customer employer information is not necessary for predicting the cross-sell. Also
forecasting for next year does not data which was 5 years old. When the incoming data volume
changes then update the mining model for accurate results
Consider slicing the Source Cube if OLAP is source or Filter out
if Relational storage is used as Mining source
Consider filtering the data either by slicing the cube or using “where” condition before the data is
being processed by data mining, this would eliminate the unnecessary data to mining models to
process.
Cleanse the incoming data is modeling is not for data quality
identification
Data mining models can be used for predicting the data quality problems such as: Percentage of
male gender people have taken pregnancy leave. This is purely a data entry problem, can be
identified by data mining models. But if the data mining model intention is not for finding the data
quality problems, consider cleansing the incoming data. Data cleansing is explained in Data
Transformation chapter.
Consider data mining models built from same data source
All models built from the same structure must be from the same data source. However, the
models can differ as to which columns of the structure are used, how the columns are used, the
type of algorithm that is used to create each model, and the parameter settings for each
algorithm. Usage of different data sources may result in heterogeneous data sampling for
different models whose lift chart is studied for selecting the appropriate modal.
Data mining models sourcing data from granular stores
Data mining models can extract data from either relational or OLAP data structures but the only
requirement is the granularity. OLAP structure typically has aggregated data and relational store
has granular data. Sourcing the granular data will increase the accuracy of the resulted
predictive model and better patterns discovery. Example: In order to identify the patterns of
customer Age and motor bike, it needs transactional data granulated to each customer purchase.
Use lift chart for finding the accuracy and deciding the modal
The lift chart is important because it helps distinguish between models in a structure that are
almost the same, to help you determine which model provides the best predictions. Similarly, the
lift chart shows which type of algorithm performs the best predictions for a particular situation
Considering splitting the source data into a training set and a
testing set
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
In order to test the predictive capability of a mining model, you can randomly divide the available
source data in half, and use one half to train the model and the other half to test the predictive
capability of the model when using a lift chart.
Data mining models predict but decision is yours
Data mining models analyze the given data to find patterns and predict, but taking decision of
whether the result is accurate for the business scenario and giving the correct sample data to the
mining for processing is your decision. Suggested to not to make decisions before finishing
studying the results of data modals.
Consider having a statistical analyst for studying the models
and aligning with business requirements
Statistical analyst would be a good fit for analyzing results from data mining models and applying
it to decision making process. Statistical analysts would typically have experience on predictions,
forecasting and patterns identification that is exactly what data mining models do to help him for
more accurate business decisions.
Consider denormalized tables/views as a source for data mining
models
Data mining looks at data as case tables and nested tables. Case tables maps the key to the
nested tables in one to many relationship fashion. Nested tables generally have the input and
predictable columns which are to be analyzed by data mining models. Having one denormalized
table/view would have keys and input columns which can directly map to transactional (nested)
tables would minimize the complexity of data model.
Consider using integration services for modeling, maintaining
and reprocessing mining models
Make use of SSIS (Integration Services) for identifying the incoming data changes, requirement
changes and modify the data mining models accordingly. This would automate reprocessing
schedules of the models and keep it accurate every time when the underlying data changes.
Requirement changes would need to change the model but at only one place in the integration
services package to have the corresponding parameters set or adding yet another model.
Consider aggregating to the required level before building time
series model
If daily sales forecast is required then first prepare the data by aggregating sales figures per item
to that level. Having multiple records of the same key at the level the forecast is sought, would
given incorrect results and may error out
Consider the detailed transaction table as nested and its master
table as a case table
This is true for every model that is built for data mining needs to identify the case table and
nested table, but this example would give an idea on what is to be selected as case and nested
table.
The more common approach for market basket analysis is to model each transaction (containing
multiple items) as a case so your transaction table would have a composite key comprising the
transaction id and item id, with multiple rows for each transaction id, comprising all items that
were bought together as part of that single transaction. From the point of view of SQL Data
Mining, you would use the same table as the "case table" as well as the "nested table" containing
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
the associated items for each transaction. As a natural modeling concept for one-to-many
relationships - examples: products purchased by a customer, movies watched by a person.
As a powerful and compact way of representing variable-length, sparse cases where individual
cases contain only a small subset of the potential attributes for a case. For instance, a store might
sell a thousand products. If you wanted to model each potential product purchase as an attribute,
without nested tables, you would need more than a thousand columns in your model , with the
attribute states being "missing" or "existing" to indicate whether the customer (the "case") bought
the product or not. Instead of this, the nested table allows to pivot the attributes and specify only
the ones that are actually present for each case.
Challenges/Limitations
Data mining systems face a lot of challenges particularly related to the incoming data and its
quality. A data mining system may work perfect for consistent data and perform significant worse
when a little noise exists to the incoming data. In this section we take a look at what we mean are
the most prominent problems and challenges of data mining systems today.
Noisy Data:
In a large database, many of the attribute values will be inconsistent and/or incorrect. This may
be due to erroneous instruments, human error during data entry, migration problems, missing
values or incorrect transactions at the source system.
Complex incoming data:
Sometimes the incoming data is complex enough that data modelers cannot uniquely define the
requirements
Non-Representative Data: If the data in the training set is not representative of overall patterns,
data mining will not produce accurate results. For example, if you are creating a model for
employee wages, and the training set includes only salaried people, the model will not accurately
represent hourly employees. Even when using large training data sets, you must be careful that
the training data is representative and that it belongs to the same domain as the target data set.
No Boundary Cases: To find the real differences between two classes, boundary—or unusual—
cases should be present. For example, if a data mining system is classifying animals, and nonflying birds such as penguins are not included in the training set, the model may define a bird as
an animal that has wings and can fly.
Databases are Dynamic:
If the source database structure changes frequently, it is a challenge for data mining, as this
requires changing and re-training the models. Ideally, the trained model should reflect the content
of the source data at all times, in order to make the best possible classification. An important
challenge for data mining systems is to accomplish this, by changing its rules as the source data
changes.
Databases may be huge:
Most data mining algorithms have been created for handling very large data sets but what
happens if you need to perform data mining on terabytes of source data? On the one hand, a
huge input data source is advantageous, since that will often result in more accurate modeling,
but the number of possible attribute values within a large dataset can be enormous. This means
that processing the models will take enormous time and query performance can be too slow.
One solution is to utilize less complex mining algorithms to minimize processing time. Other
options include randomly select data from the source, or possibly summarizing to the minimum
level of detail needed for the mining.
Not a complete automated solution
Data Mining is not automatic process to identify the trends, tools need guidance
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.
Summary:
This chapter details the need of data mining in the end to end scenario of business intelligence
applications implementation. The requirement of having data mining models come from the
business analysts who wants to dig out more in day to day transactions to discover the hidden
patterns and use it for business forecast, increase customer satisfaction, profits by taking right
decisions with the help of mining models. Different data mining models are available in Business
Intelligence studio. Same business scenario can be implemented by more than one model.
Techniques for selecting the right model and comparing with other similar models would increase
the accuracy of expected result. Guidelines and challenges are provided on the considerations
while implementing the data mining solutions.
Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing
feedback on these materials, you agree to the attached license agreement. Please provide
feedback at BI Feedback Alias.