HP NonStop SQL/MX
Data Mining Guide
Abstract
This manual presents a nine-step knowledge-discovery process, which was developed
over a series of data mining investigations. This manual describes the data structures
and operations of the NonStop™ SQL/MX approach and implementation.
Product Version
NonStop SQL/MX Release 2.0
Supported Release Version Updates (RVUs)
This publication supports G06.23 and all subsequent G-series releases until otherwise
indicated by its replacement publication.
Part Number: 523737-001
Published: April 2004

Document History

Part Number    Product Version              Published
424397-001     NonStop SQL/MX Release 1.0   February 2001
523737-001     NonStop SQL/MX Release 2.0   April 2004
HP NonStop SQL/MX Data Mining Guide

Contents

What's New in This Manual iii
  Manual Information iii
  New and Changed Information iii

About This Manual v
  Audience v
  Organization v
  Related Documentation vi
  Notation Conventions viii

1. Introduction
  The Traditional Approach 1-1
  The SQL/MX Approach 1-2
  Data-Intensive Computations Performed in the DBMS 1-2
  Use of Built-In DBMS Data Structures and Operations 1-2
  The Knowledge Discovery Process 1-3
  Defining the Business Opportunity 1-4
  Preparing the Data 1-7
  Creating the Mining View 1-10
  Mining the Data 1-10
  Knowledge Deployment and Monitoring 1-11

2. Preparing the Data
  Loading the Data 2-2
  Creating the Database 2-2
  Importing Data Into the Database 2-2
  Profiling the Data 2-2
  Cardinalities and Metrics 2-3
  Transposition 2-3
  Quick Profiling 2-5
  Defining Events 2-6
  Aligning the Data 2-6
  Deriving Attributes 2-9
  Moving Metrics 2-9
  Rankings 2-10

3. Creating the Data Mining View
  Creating the Single Table 3-2
  Pivoting the Data 3-3

4. Mining the Data
  Building the Model 4-2
  Building Decision Trees 4-2
  Checking the Model 4-9
  Applying the Model to the Mining Table 4-10
  Applying the Model to the Database 4-10
  Deploying the Model 4-10
  Monitoring Model Performance 4-11

A. Creating the Data Mining Database

B. Inserting Into the Data Mining Database

C. Importing Into the Data Mining Database
  Importing Customers Data C-1
  Customers Format File C-1
  Customers Data File C-1
  Importing Account History Data C-2
  Account History Format File C-3
  Account History Data File C-3

Index

Figures
  Figure 4-1. Initial Branches of Decision Tree 4-4
  Figure 4-2. Decision Tree for Divorced Branch 4-5
  Figure 4-3. Decision Tree for Single Branch 4-6
  Figure 4-4. Final Decision Tree 4-9

Tables
  Table i. Manual Organization v
What’s New in This Manual
Manual Information
HP NonStop SQL/MX Data Mining Guide
Part Number: 523737-001
Published: April 2004
New and Changed Information

This publication has been updated to reflect new product names:

• Since product names are changing over time, this publication might contain both HP and Compaq product names.

• Product names in graphic representations are consistent with the current product interface.

• The technical content of this guide has been updated and reflects the state of the product at the G06.23 RVU.

• Previous versions of the guide used the Object Relational Data Mining (ORDM) approach and architecture. ORDM advocates performing data mining and other parts of the knowledge discovery process against data in the SQL/MX database. This technique has been updated. Readers are encouraged to perform the data preparation steps in SQL/MX but reserve the mining or model building for UNIX or Microsoft Windows platforms.

• All sections of the manual have been updated to reflect the impact of major changes in SQL/MX Release 2.0 (for example, the introduction of SQL/MX tables).

• Introductions to the data preparation steps have been revised and rewritten.

• The DDL statements in Appendix A, B, and C have been updated to use SQL/MX DDL syntax.

• Appendix A syntax has been removed. Readers can consult the SQL/MX Reference Manual for the most current syntax and examples.

• Index entries have been added, updated, and corrected.
About This Manual
This manual presents a nine-step knowledge discovery process, which was developed
over a series of data mining investigations. This manual describes the data structures
and operations of the NonStop SQL/MX approach and implementation.
Audience
This manual is intended for database administrators and application programmers who
are using NonStop SQL/MX to solve data mining problems, either through the SQL
conversational interface or through embedded SQL programs.
Organization
The sections listed in Table i describe the knowledge discovery process (or the data
mining process) and present examples that carry out the process.
The appendixes listed in Table i provide the syntax for the data mining features of
NonStop SQL/MX and the SQL scripts that create the data mining database used in
the examples.
Table i. Manual Organization

Section 1, Introduction
  Presents an overview of the knowledge discovery process and the SQL/MX approach to this process. Defines the example business opportunity used in this manual.

Section 2, Preparing the Data
  Describes the data preparation steps of the knowledge discovery process.

Section 3, Creating the Data Mining View
  Describes how to create the mining view.

Section 4, Mining the Data
  Describes the data mining steps of the knowledge discovery process.

Appendix A, Creating the Data Mining Database
  Contains DDL statement scripts that you can use to create the data mining database used in the examples in this manual.

Appendix B, Inserting Into the Data Mining Database
  Contains INSERT statement scripts that you can use to populate the data mining database used in this manual.

Appendix C, Importing Into the Data Mining Database
  Contains IMPORT statement scripts that you can use to create the data mining database used in this manual.
Related Documentation
This manual is part of the SQL/MX library of manuals, which includes:
Introductory Guides

SQL/MX Comparison Guide for SQL/MP Users
  Describes SQL differences between SQL/MP and SQL/MX.

SQL/MX Quick Start
  Describes basic techniques for using SQL in the SQL/MX conversational interface (MXCI). Includes information about installing the sample database.

Reference Manuals

SQL/MX Reference Manual
  Describes the syntax of SQL/MX statements, MXCI commands, functions, and other SQL/MX language elements.

SQL/MX Connectivity Service Command Reference
  Describes the SQL/MX administrative command library (MACL) available with the SQL/MX conversational interface (MXCI).

DataLoader/MX Reference Manual
  Describes the features and functions of the DataLoader/MX product, a tool to load SQL/MX databases.

SQL/MX Messages Manual
  Describes SQL/MX messages.

SQL/MX Glossary
  Defines SQL/MX terminology.

Programming Manuals

SQL/MX Programming Manual for C and COBOL
  Describes how to embed SQL/MX statements in ANSI C and COBOL programs.

SQL/MX Programming Manual for Java
  Describes how to embed SQL/MX statements in Java programs according to the SQLJ standard.

SQL/MX Guide to Stored Procedures in Java
  Describes how to use stored procedures that are written in Java within SQL/MX.

Specialized Guides

SQL/MX Installation and Management Guide
  Describes how to plan, install, create, and manage an SQL/MX database. Explains how to use installation and management commands and utilities.

SQL/MX Query Guide
  Describes how to understand query execution plans and write optimal queries for an SQL/MX database.

SQL/MX Data Mining Guide
  Describes the SQL/MX data structures and operations to carry out the knowledge-discovery process.

SQL/MX Queuing and Publish/Subscribe Services
  Describes how SQL/MX integrates transactional queuing and publish/subscribe services into its database infrastructure.

SQL/MX Report Writer Guide
  Describes how to produce formatted reports using data from a NonStop SQL/MX database.

SQL/MX Connectivity Service Manual
  Describes how to install and manage the SQL/MX Connectivity Service (MXCS), which enables applications developed for the Microsoft Open Database Connectivity (ODBC) application programming interface (API) and other connectivity APIs to use SQL/MX.

Online Help

The SQL/MX Online Help consists of:

Reference Help
  Overview and reference entries from the SQL/MX Reference Manual.

Messages Help
  Individual messages grouped by source from the SQL/MX Messages Manual.

Glossary Help
  Terms and definitions from the SQL/MX Glossary.

NSM/web Help
  Context-sensitive help topics that describe how to use the NSM/web management tool.

The following manuals are part of the SQL/MP library of manuals and are essential references for information about SQL/MP Data Definition Language (DDL) and SQL/MP installation and management:

Related SQL/MP Manuals

SQL/MP Reference Manual
  Describes the SQL/MP language elements, expressions, predicates, functions, and statements.

SQL/MP Installation and Management Guide
  Describes how to plan, install, create, and manage an SQL/MP database. Describes installation and management commands and SQL/MP catalogs and files.
This figure shows the manuals in the SQL/MX library:

[Figure VST001.vsd: the SQL/MX library, grouped into Introductory Guides, Reference Manuals, Programming Manuals, Specialized Guides, and the SQL/MX Online Help.]
Notation Conventions
Hypertext Links
Blue underline is used to indicate a hypertext link within text. By clicking a passage of
text with a blue underline, you are taken to the location described. For example:
This requirement is described under Backup DAM Volumes and Physical Disk
Drives on page 3-2.
General Syntax Notation
This list summarizes the notation conventions for syntax presentation in this manual.
UPPERCASE LETTERS. Uppercase letters indicate keywords and reserved words. Type
these items exactly as shown. Items not enclosed in brackets are required. For
example:
MAXATTACH
lowercase italic letters. Lowercase italic letters indicate variable items that you supply.
Items not enclosed in brackets are required. For example:
file-name
computer type. Computer type letters within text indicate C and Open System Services
(OSS) keywords and reserved words. Type these items exactly as shown. Items not
enclosed in brackets are required. For example:
myfile.c
italic computer type. Italic computer type letters within text indicate C and Open
System Services (OSS) variable items that you supply. Items not enclosed in brackets
are required. For example:
pathname
[ ] Brackets. Brackets enclose optional syntax items. For example:
TERM [\system-name.]$terminal-name
INT[ERRUPTS]
A group of items enclosed in brackets is a list from which you can choose one item or
none. The items in the list can be arranged either vertically, with aligned brackets on
each side of the list, or horizontally, enclosed in a pair of brackets and separated by
vertical lines. For example:
FC [ num ]
[ -num ]
[ text ]
K [ X | D ] address
{ } Braces. A group of items enclosed in braces is a list from which you are required to
choose one item. The items in the list can be arranged either vertically, with aligned
braces on each side of the list, or horizontally, enclosed in a pair of braces and
separated by vertical lines. For example:
LISTOPENS PROCESS { $appl-mgr-name }
{ $process-name }
ALLOWSU { ON | OFF }
| Vertical Line. A vertical line separates alternatives in a horizontal list that is enclosed in
brackets or braces. For example:
INSPECT { OFF | ON | SAVEABEND }
… Ellipsis. An ellipsis immediately following a pair of brackets or braces indicates that you
can repeat the enclosed sequence of syntax items any number of times. For example:
M address [ , new-value ]…
[ - ] {0|1|2|3|4|5|6|7|8|9}…
An ellipsis immediately following a single syntax item indicates that you can repeat that
syntax item any number of times. For example:
"s-char…"
Punctuation. Parentheses, commas, semicolons, and other symbols not previously
described must be typed as shown. For example:
error := NEXTFILENAME ( file-name ) ;
LISTOPENS SU $process-name.#su-name
Quotation marks around a symbol such as a bracket or brace indicate the symbol is a
required character that you must type as shown. For example:
"[" repetition-constant-list "]"
Item Spacing. Spaces shown between items are required unless one of the items is a
punctuation symbol such as a parenthesis or a comma. For example:
CALL STEPMOM ( process-id ) ;
If there is no space between two items, spaces are not permitted. In this example, no
spaces are permitted between the period and any other items:
$process-name.#su-name
Line Spacing. If the syntax of a command is too long to fit on a single line, each
continuation line is indented three spaces and is separated from the preceding line by
a blank line. This spacing distinguishes items in a continuation line from items in a
vertical list of selections. For example:
ALTER [ / OUT file-spec / ] LINE
[ , attribute-spec ]…
!i and !o. In procedure calls, the !i notation follows an input parameter (one that passes data
to the called procedure); the !o notation follows an output parameter (one that returns
data to the calling program). For example:

CALL CHECKRESIZESEGMENT (  segment-id            !i
                         , error ) ;             !o

!i,o. In procedure calls, the !i,o notation follows an input/output parameter (one that both
passes data to the called procedure and returns data to the calling program). For
example:

error := COMPRESSEDIT ( filenum ) ;              !i,o

!i:i. In procedure calls, the !i:i notation follows an input string parameter that has a
corresponding parameter specifying the length of the string in bytes. For example:

error := FILENAME_COMPARE_ (  filename1:length       !i:i
                            , filename2:length ) ;   !i:i

!o:i. In procedure calls, the !o:i notation follows an output buffer parameter that has a
corresponding input parameter specifying the maximum length of the output buffer in
bytes. For example:

error := FILE_GETINFO_ (  filenum                    !i
                        , [ filename:maxlen ] ) ;    !o:i
Notation for Messages
This list summarizes the notation conventions for the presentation of displayed
messages in this manual.
Bold Text. Bold text in an example indicates user input typed at the terminal. For example:
ENTER RUN CODE
?123
CODE RECEIVED:
123.00
The user must press the Return key after typing the input.
Nonitalic text. Nonitalic letters, numbers, and punctuation indicate text that is displayed or
returned exactly as shown. For example:
Backup Up.
lowercase italic letters. Lowercase italic letters indicate variable items whose values are
displayed or returned. For example:
p-register
process-name
[ ] Brackets. Brackets enclose items that are sometimes, but not always, displayed. For
example:
Event number = number [ Subject = first-subject-value ]
A group of items enclosed in brackets is a list of all possible items that can be
displayed, of which one or none might actually be displayed. The items in the list can
be arranged either vertically, with aligned brackets on each side of the list, or
horizontally, enclosed in a pair of brackets and separated by vertical lines. For
example:
proc-name trapped [ in SQL | in SQL file system ]
{ } Braces. A group of items enclosed in braces is a list of all possible items that can be
displayed, of which one is actually displayed. The items in the list can be arranged
either vertically, with aligned braces on each side of the list, or horizontally, enclosed in
a pair of braces and separated by vertical lines. For example:
obj-type obj-name state changed to state, caused by
{ Object | Operator | Service }
process-name State changed from old-objstate to objstate
{ Operator Request. }
{ Unknown.
}
| Vertical Line. A vertical line separates alternatives in a horizontal list that is enclosed in
brackets or braces. For example:
Transfer status: { OK | Failed }
% Percent Sign. A percent sign precedes a number that is not in decimal notation. The
% notation precedes an octal number. The %B notation precedes a binary number.
The %H notation precedes a hexadecimal number. For example:
%005400
%B101111
%H2F
P=%p-register E=%e-register
Notation for Management Programming Interfaces
This list summarizes the notation conventions used in the boxed descriptions of
programmatic commands, event messages, and error lists in this manual.
UPPERCASE LETTERS. Uppercase letters indicate names from definition files. Type these
names exactly as shown. For example:
ZCOM-TKN-SUBJ-SERV
lowercase letters. Words in lowercase letters are words that are part of the notation,
including Data Definition Language (DDL) keywords. For example:
token-type
!r. The !r notation following a token or field name indicates that the token or field is
required. For example:

ZCOM-TKN-OBJNAME    token-type ZSPI-TYP-STRING.    !r

!o. The !o notation following a token or field name indicates that the token or field is
optional. For example:

ZSPI-TKN-MANAGER    token-type ZSPI-TYP-FNAME32.   !o
1. Introduction
Knowledge discovery is an iterative process involving many query-intensive steps. The
challenges of data management in supporting this process efficiently are significant
and continue to grow as knowledge discovery becomes more widely used.
Data mining identifies and characterizes interrelationships among multiple variables
without requiring a data analyst to formulate specific questions. Software tools look for
trends and patterns and flag unusual or potentially interesting ones. Because data
mining reveals previously unknown information and patterns, rather than proving or
disproving a hypothesis, mining enables knowledge discovery rather than just
knowledge verification.
This section discusses these approaches to data mining:

• The Traditional Approach

  Today, most data mining is performed outside the database by using client tools. This
  approach is limited because important information might be omitted from the data
  extract.

• The SQL/MX Approach

  The SQL/MX approach to knowledge discovery enables you to perform many
  data-intensive tasks in the database itself, rather than on extracts. Examples include
  statistical sampling, statistical functions, temporal reasoning through sequence
  functions, cross-table generation, database profiling, and moving-window
  aggregations.

• The Knowledge Discovery Process

  In the SQL/MX approach, fundamental data structures and operations are built into
  the database management system (DBMS) to support a wide range of knowledge
  discovery tasks and algorithms. The knowledge discovery process is described as
  a series of steps that starts with the selection and definition of a business
  opportunity, continues through data preparation and modeling, and ends with the
  deployment of the new knowledge.
The Traditional Approach
Today’s traditional knowledge discovery systems consist of an application program on
top of a data source. The main emphasis in these systems is data mining—inventing
new techniques and algorithms, proving their statistical soundness, and validating their
effectiveness given a suitable problem.
Data is assumed to be available in a convenient form, typically a flat file, extracted from an
appropriate data source. The knowledge discovery system consists of specific
algorithms that load the entire data set into memory and perform necessary
computations.
The extract approach has two major limitations:

• It does not scale to large data sets because the entire data set is required to fit in
  memory. Statistical sampling can be used to avoid this limitation. However,
  sampling is inappropriate in many situations because sampling might cause
  patterns to be missed, such as those in small groups or those between records.

• It cannot conveniently manage multiple versions of data across numerous
  iterations of a typical knowledge discovery investigation. For example, each
  iteration might require extracting additional data, performing incremental updates,
  deriving new attributes, and so on.
The SQL/MX Approach
In most enterprise organizations today, database systems are crucial for conducting
business. These database systems serve as the transaction processing systems for daily
operations and manage data warehouses containing huge amounts of historical
information. The validated data in these warehouses is already being used for online
analysis and is a natural starting point for knowledge discovery.
The SQL/MX approach identifies fundamental data structures and operations that are
common across a wide range of knowledge discovery tasks and builds such structures
and operations into the DBMS. The primary advantages of the SQL/MX technology
over traditional data mining techniques include:
• The ability to mine much larger data sets, not only data in flat-file extracts
• Simplified data management
• More complete results
• Better performance and reduced cycle times
The main features of the SQL/MX approach are summarized next.
Data-Intensive Computations Performed in the DBMS
Tools and applications perform data-intensive data-preparation tasks in the DBMS by
using an SQL interface. As a result, you can access the powerful and parallel DBMS
data manipulation capabilities in the data preparation stage of the knowledge discovery
process.
Use of Built-In DBMS Data Structures and Operations
Fundamental data structures and operations are built into the DBMS to support a wide
range of knowledge discovery tasks and algorithms in an efficient and scalable
manner.
Building these data structures and operations into the DBMS allows mining tasks to be
moved into the SQL engine for tighter integration of data and mining operations and for
improved performance and scalability.
Adding new primitives, such as moving-window aggregate functions, simplifies queries
needed by knowledge discovery tools and applications. This type of query
simplification often results in significant improvements in performance.
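For example, a moving-window aggregate over the account history can be written directly with SQL/MX sequence functions. The following query is only an illustrative sketch; it assumes the Account History table described later in this section and uses the MOVINGAVG sequence function and the SEQUENCE BY clause, both described in the SQL/MX Reference Manual:

SELECT account, year_month,
       MOVINGAVG(balance, 3)
FROM acct_history
SEQUENCE BY account, year_month;

Each output row then carries the average balance over the current month and the two preceding months (ignoring, in this sketch, the boundary between one account and the next), without any self-joins or client-side post-processing.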
The Knowledge Discovery Process
The knowledge discovery process is a nine-step process that starts with the selection
and definition of a business opportunity, continues through several data preparation
steps and a modeling step, and ends with the deployment of the new knowledge. This
subsection outlines each step of that process.
1. Identify and define a business opportunity.
The process begins with the identification and precise specification of a business
opportunity.
See Defining the Business Opportunity on page 1-4.
2. Preprocess and load the data for the business opportunity.
Real-world data is often inconsistent and incomplete. The first preparation step is
to address these problems by preprocessing the data in various ways—for
example, verifying and mapping the data. Then load the data into your database
system.
See Preparing the Data on page 1-7.
3. Profile and understand the relevant data.
Generate a variety of statistics such as column unique entry counts, value ranges,
number of missing values, mean, variance, and so on.
See Profiling the Data on page 1-7.
4. Define events relevant to the business opportunity being explored.
Events are used to align related data in a single set of columns for mining.
Example events are life changes, such as getting married or switching jobs, or
customer actions, such as opening an account or requesting a credit limit increase.
See Defining Events on page 1-8.
5. Derive attributes.
For example, customer age can be derived from birth date. Account summary
statistics, such as maximum and minimum balances, can be derived from monthly
status information.
See Preparing the Data on page 1-7.
6. Create the data mining view.
Transform the data into a mining view, a form in which all attributes about the
primary mining entity occur in a single record.
See Creating the Mining View on page 1-10.
7. Mine the data and build models.
Core knowledge discovery techniques are applied to gain insight, learn patterns, or
verify hypotheses. The main tasks are either predictive or descriptive in nature.
Predictive tasks involve trying to determine what will happen in the future, based
upon historical data. Descriptive tasks involve finding patterns describing the data.
See Mining the Data on page 1-10.
8. Deploy models.
Deployment can take many different forms. For example, deployment might be as
simple as documenting and reporting the results, or deployment might be
embedding the model in an operational system to achieve predictive results.
9. Monitor model performance.
Performance of the model must be monitored for accuracy. When accuracy begins
to decline, the model must be updated to fit the current situation.
See Knowledge Deployment and Monitoring on page 1-11.
In Step 1, a business opportunity is identified and defined. In Steps 2 through 6, data
mining data is gathered, preprocessed, and organized in a form that is suitable for
mining. These steps require the most time in the process. For example, selecting the
data is an important step in the process and typically requires the assistance of a data
mining expert or subject matter expert who has knowledge of the data to be mined.
In Step 7, models are built. In Steps 8 and 9, the models are deployed and monitored.
This latter part of the knowledge discovery process focuses on analyzing the data
mining view prepared in Steps 2 through 6.
Defining the Business Opportunity
The process begins with the identification and precise specification of a business
opportunity. Several factors must be considered when evaluating potential
opportunities:
• Quantification of the return on investment

  What is the answer worth? How much money can be saved? How much of a
  competitive advantage does it offer?

• Usability of the results

  Merely identifying patterns is not enough. The opportunity and analysis must be
  structured so that any interpretation of results obtained develops into deployable
  business strategy.

• Political and organizational reaction

  In assessing probabilities for organizational resistance, it is helpful to examine
  similar past efforts and understand why these efforts succeeded or failed.

• Availability of business analysts and data mining experts and technology

  Are data, domain, and mining experts available to participate in the process? Is
  sufficient technology, both hardware and software, available?

• Data availability

  Does preclassified data exist or can it be derived? Do sufficiently large amounts of
  data exist? Both internal and external data sources should be considered.

• Logistics

  How difficult is it to collect, extract, and transport the relevant data? Is
  confidentiality an issue?
Careful consideration of these factors helps to ensure that the opportunity selected is
both amenable to data mining and likely to provide significant value.
After an opportunity is selected, the next task is to specify it precisely. In the scenario of
building a model to predict credit card account attrition, the goal is to build a model that
will predict, as early as possible, whether a credit card customer will close their
account.
To specify this opportunity precisely, decide on an explicit definition of attrition, such as
when a customer calls and closes their account. Another option is implicit—when a
customer stops using their card. For simplicity, define attrition as a customer closing
their account or maintaining a zero balance for three months.
Another aspect of specifying the opportunity is defining what it means to predict as
early as possible when an account will be closed. For this example, choose three
months as the prediction window. This window should be long enough to allow the card
issuer to take some action to try to retain customers likely to leave, but short enough to
capture attrition-related patterns.
The goal is to build a model that will predict, as early as possible, customer attrition.
Example Business Opportunity
The precise specification of our example opportunity is to build a model that will predict
at any point in time, based on such things as current account status, account activity,
and demographics, whether a credit card customer will close their account in the
future. Note that the precise specification of the opportunity might be modified or
refined later in the knowledge discovery process as more information becomes
available.
This manual uses this opportunity scenario to describe the knowledge discovery
process and how to implement it. The data set used to illustrate techniques and
SQL/MX features consists of two tables: one containing customer information and the
other containing account history information. This data set is presented in Appendix A
through C of this manual.
A subset of this data set is shown in these tables:
Customers Table

Account   Name           Marital Status   Home   Income
1234567   Jones, Mary    Single           Own    65,000
2500000   Abbas, Ali     Divorced         Rent   32,000
4098124   Kano, Tomoko   Divorced         Own    44,000
2400000   Lund, Erika    Widow            Own    28,000
...

Account History Table

Account   Month   Status   Limit    Balance   Payment   Fin. Chrg
1234567   01/03   Open     10,000   1232.50   1232.50      0.00
2500000   07/02   Open      5,000    566.00     32.00      8.00
4098124   10/00   Open      6,000   3200.00   3200.00      0.00
1234567   02/03   Open     10,000   3000.00   3000.00      0.00
2500000   08/02   Open      5,000    600.00     40.00      9.23
...
The first table, the Customers table, contains one row for each credit card account and
consists of customer demographic information such as marital status, income, and so
on. For a large financial institution, a customers table such as this one might contain
approximately 10 million rows and 100 columns.
The second table, the Account History table, contains monthly status records, one for
each account for each month the account was open over a given time period, and
consists of about 200 columns. For this example, suppose the time period is three
years. The history table would then contain about 360 million rows, assuming 10
million customers.
Given these parameters, the size of the first table is about 5 GB (10 million rows, 500
bytes in each row), and the size of the second table is about 360 GB (360 million rows,
1000 bytes in each row).
For the example business opportunity, the Status and Balance fields of the Account
History table are used to determine if a customer will close their account. If the Status
changes from Open to Closed or if the Balance is zero for three consecutive months,
then a customer is defined as having left—that is, no longer holds a credit card
account.
Preparing the Data
After a business opportunity has been identified and defined, the next task is to
prepare a data set for mining. This is done in Steps 2 through 6 of the knowledge
discovery process. See The Knowledge Discovery Process on page 1-3.
The first two steps are preprocessing the mining data to make it consistent and then
loading the data into a database system. For further information, see Loading the Data
on page 2-2.
The next step is to generate a variety of statistics—for example, column unique entry
counts, value ranges, number of missing values, mean, variance, and so on. This type
of data profile is helpful in gaining an understanding of the data, and this profile also
serves as a valuable reference throughout the knowledge discovery process.
Profiling the Data
A profile of the database helps to solve the data mining problem in these ways:
• To better understand the data
• To decide which columns to use for analysis
• To decide whether to treat attributes as discrete or continuous
Types of Information
The type of information used to create a profile of the data mining view comes from the
following elements:
• Tables in the database
• Table attributes (or columns to be used in the analysis)
• Data types of the table attributes
• Relationships between tables
• Cardinalities of discrete attributes
• Statistics about continuous attributes
• Derived table attributes (or derived columns to be used in the analysis)
Determining the derived columns to be constructed requires knowledge of the table
attributes and how these attributes relate to the data mining problem. See Preparing
the Data on page 1-7 for a full discussion of these elements.
SQL/MX provides the TRANSPOSE clause of the SELECT statement to display the
cardinalities of discrete attributes. See Transposition on page 2-3 and the
TRANSPOSE Clause entry in the SQL/MX Reference Manual for details.
Example of Finding Cardinality of Discrete Attributes
The customers table in your data set has Age and Number_Children columns. Both of
these attributes are discrete, and you can compute the cardinality of each attribute.
You obtain the cardinality of an attribute, which is the count of the number of unique
values for the attribute, by using a COUNT DISTINCT query. For example:
SELECT COUNT(DISTINCT Age)
FROM Customers;
or
SELECT COUNT(DISTINCT Number_Children)
FROM Customers;
Instead of having to submit a query for each attribute, you can obtain counts for
multiple attributes of a table by using the TRANSPOSE clause. For example:
SET NAMETYPE ANSI;
SET SCHEMA dmcat.whse;
SELECT ColumnIndex, COUNT(DISTINCT ColumnValue)
FROM Customers
TRANSPOSE Age, Number_Children AS ColumnValue
KEY BY ColumnIndex
GROUP BY ColumnIndex;
COLUMNINDEX  (EXPR)
-----------  --------------------
          1                    17
          2                     4

--- 2 row(s) selected.
The first row of the result table of the TRANSPOSE clause contains the distinct count
for the column Age, and the second row contains the distinct count for the column
Number_Children. You can treat the Age values as categories, consisting of age
ranges. Similarly, if Number_Children is greater than five, you can put the count into
the category for Number_Children equal to five.
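As a rough sketch of that kind of banding (the range boundaries and result column names below are illustrative assumptions, not values taken from this manual), a CASE expression can map the raw values into categories:

SELECT account,
       CASE WHEN age < 30 THEN '< 30'
            WHEN age < 50 THEN '30-49'
            ELSE '50+'
       END AS age_range,
       CASE WHEN number_children >= 5 THEN 5
            ELSE number_children
       END AS children_capped
FROM customers;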
The number of attributes in a TRANSPOSE clause is unlimited.
Note. The data types of attributes to be transformed into a single column must be
compatible. The data type of the result column is the union compatible data type of the
attributes.
For further information, see Profiling the Data on page 2-2.
Defining Events
In the scenario considered in this manual, the relevant event is the account holder
leaving. This event occurs at different points in time for customers that leave and not at
all for customers that stay.
This event must be defined so that account status and activity in the months leading up
to a customer leaving can be located and aligned in columns. For example, suppose
you create three derived attributes that describe the account balance for each of the
three months before a customer leaves, because these attributes are predictors of
attrition.
For the customers that do leave, the months leading up to leaving occur at various
points in time. For customers that do not leave, these months are chosen to be any
three consecutive months in which the account is open.
The information about these months should be aligned for all accounts in a single set
of columns, one for each of the three months. Most mining algorithms require a single
logical attribute, such as the balance one month before leaving, to be stored in one
column in all records, rather than in different columns in different records.
For example, consider this data in a table that contains monthly account balances for
each month in the three-year history period:
Account   ...   Bal 08/03   Bal 09/03   Bal 10/03   Bal 11/03   ...   Left
1234567   ...   7800.00     3000.00     2870.00     1200.00     ...   Yes (closed)
2500000   ...   0.00        0.00        0.00        0.00        ...   Yes (0 bal)

Account   ...   Bal 07/02   Bal 08/02   Bal 09/02   Bal 10/02   ...   Left
4098124   ...   4817.94     4596.10     4347.63     4069.34     ...   Yes (closed)
The balances prior to the event (of the customer leaving) are in different date columns
for these accounts, and therefore algorithms that build predictive models are not able
to consider this information.
A table organization that allows this information to be considered:
Account   ...   Bal-3     Bal-2     Bal-1     Date Left   ...   Left
1234567   ...   3000.00   2870.00   1200.00   12/03       ...   Yes (closed)
2500000   ...   0.00      0.00      0.00      11/03       ...   Yes (0 bal)
4098124   ...   4817.94   4596.10   4347.63   10/03       ...   Yes (closed)
In this table, columns Bal-1 through Bal-3 contain account balances one through three
months prior to a customer leaving. Consequently, this information is aligned within a
single set of columns and can be considered during model creation.
For further information, see Defining Events on page 2-6.
Deriving Attributes
The next task is to derive attributes that are not relative to events. For example,
customer age can be derived from birth date. Part of the challenge of effective data
mining is identifying a set of derived attributes that capture key indicators relevant to
the business opportunity being explored.
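For instance, a derived age attribute might be computed with a simple expression. This is only a sketch: the birth_date column is hypothetical (it does not appear in the sample Customers data), and the calculation ignores whether the birthday has already occurred in the current year:

SELECT account,
       EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM birth_date) AS age
FROM customers;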
For further information, see Deriving Attributes on page 2-9.
Creating the Mining View
The final data preparation step is to transform the data set into a mining view, a form in
which all attributes about the main mining entity appear in a single record. The mining
entity used in this manual is a credit card account. The data mining challenge is to
determine predictors for when a customer will close a credit card account.
Transforming the data set to a single record for each mining entity often involves a
pivot operation, in which attributes in multiple rows are collapsed and put into a single
row. For example, in the credit card example, the set of history records associated with
each account is collapsed to a single record and then appended to the corresponding
customer record.
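One common way to express such a pivot is a grouped query with CASE expressions, as in the sketch below. It assumes the Close_Temp table created in Section 2 (which records each account's close month) and DATE-typed month columns; Section 3 describes the approach actually used in this manual:

SELECT h.account,
       MAX(CASE WHEN h.year_month = c.close_month - INTERVAL '1' MONTH
                THEN h.balance END) AS bal_1,
       MAX(CASE WHEN h.year_month = c.close_month - INTERVAL '2' MONTH
                THEN h.balance END) AS bal_2,
       MAX(CASE WHEN h.year_month = c.close_month - INTERVAL '3' MONTH
                THEN h.balance END) AS bal_3
FROM acct_history h, close_temp c
WHERE h.account = c.account
GROUP BY h.account;

Each account then contributes exactly one row, with the three pre-event balances side by side, ready to be joined to the Customers table.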
For further information, see Section 3, Creating the Data Mining View.
The resulting table looks similar to this:
Mining View

Account   Mar Status   Income   Bal-3     Bal-2     Bal-1     Date Left   Left
1234567   Single       65,000   3000.00   2870.00   1200.00   12/99       Yes
2500000   Divorced     32,000   0.00      0.00      0.00      11/99       Yes
4098124   Divorced     44,000   4817.94   4596.10   4347.63   10/98       Yes
5200000   Married      32,000   –         –         –         –           No
This table contains demographic information from the Customers table, such as marital
status and income, and also pivoted columns from the Account History table, such as
balances prior to leaving. You use example data set in the data mining step, the next
step in the knowledge discovery process.
Mining the Data
In the data mining step, core knowledge discovery techniques are applied to gain
insight, learn patterns, or verify hypotheses. The main tasks performed in this step are
either predictive or descriptive in nature. Predictive tasks involve trying to determine
what will happen in the future, based upon historical data. Descriptive tasks involve
finding patterns describing the data.
The task used in this customer scenario is predictive: to build a model to predict
attrition of credit card customers based on historical information, such as
demographics and account activity.
The most common predictive tasks are:
• Classification—Classify a case (or record) into one of several predefined classes.
• Regression—Map a case (or record) into a numerical prediction value.
Descriptive tasks involve finding patterns describing the data. The most common are:
• Database segmentation (clustering)—Map a case into one of several clusters.
• Summarization—Provide a compact description of the data, often in visual form.
• Link analysis—Determine relationships between attributes in a case.
• Sequence analysis—Determine trends over time.
You use a variety of algorithms, and the models they produce, to perform these
predictive and descriptive tasks.
For example, classification can be done by building a decision tree model, where each
branch of the tree is represented by a predicate involving attributes in the mining data
set and where each branch is homogeneous with respect to whether the predicate is
true or false. The main task in classification is to determine which predicates form the
decision tree that predicts the goal. The most common algorithms for classification
come from the field of machine learning in computer science.
Typically, the model building step involves the use of client-mining tools that require the
interactive participation of the user to guide the investigation. A description of these
special-purpose tools is beyond the scope of this manual.
For further information, see Section 4, Mining the Data.
Knowledge Deployment and Monitoring
The last two steps of the knowledge discovery process involve deploying and
monitoring discovered knowledge. Deployment can take many different forms. For
example, deployment might be as simple as documenting and reporting the results, or
deployment might be embedding the model in an operational system to achieve
predictive results.
Most data mining tools support model deployment either by applying a model to data
within the tool or by exporting a model as executable code, which can then be
embedded and used in applications. In the credit card attrition example, one form of
model deployment is to periodically use the model to identify profitable customers that
are likely to leave, and then to take some action, such as lowering interest rates or
waiving fees, to try to retain these customers.
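As a purely illustrative sketch (the rule, the threshold, and the mining_view table and column names are assumptions made for this example, not a model built elsewhere in this manual), one deployed rule could be run periodically as an SQL query against the prepared mining data:

SELECT account
FROM mining_view
WHERE marital_status = 'Divorced'
  AND bal_1 < 0.5 * bal_3;

The result is the list of accounts the retention campaign would target in that period.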
2. Preparing the Data
Section 1, Introduction identifies and defines a business opportunity, the first step in
the knowledge discovery process supported by SQL/MX. This section describes Steps
2 through 5.
1. Identify and define a business opportunity.
2. Preprocess and load the data for the business opportunity.
The first preparation step is to address these problems by preprocessing the data
in various ways—for example, verifying and mapping the data. Then load the data
into your database system.
See Loading the Data on page 2-2.
3. Profile and understand the relevant data.
Generate a variety of statistics, such as column unique entry counts, value ranges,
number of missing values, mean, variance, and so on.
See Profiling the Data on page 2-2.
4. Define events relevant to the business opportunity being explored.
Events are used to align related data in a single set of columns for mining.
Example events are life changes, such as getting married or switching jobs, or
customer actions, such as opening an account or requesting a credit limit increase.
See Defining Events on page 2-6.
5. Derive attributes.
For example, customer age can be derived from birth date. Account summary
statistics, such as maximum and minimum balances, can be derived from monthly
status information.
See Deriving Attributes on page 2-9.
6. Create the data mining view.
7. Mine the data and build models.
8. Deploy models.
9. Monitor model performance.
Loading the Data
The first step in preparing a data set for mining is loading the data into database tables.
Suppose the credit card organization has a customers data warehouse. The customer
data and the account history data are stored in this warehouse. In a typical real-world
scenario, the warehouse could have millions of records representing millions of
customers dating back many years.
Creating the Database
Suppose a data mining database is created consisting of the Customers table and the
Account History table described in the previous section.
You can use the DDL scripts included with this manual to create a database to run the
examples in this manual. To create the database:
1. Open the .pdf file for this manual.
2. Navigate to Appendix A, Creating the Data Mining Database of this manual, which
contains the DDL script that creates the database.
3. On the tool bar, select the Table/Formatted Text Select Tool.
4. Copy and paste from the DDL script, one page at a time, into an OSS text file.
5. Within MXCI (the SQL/MX conversational interface), obey the OSS file you have
created.
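For orientation only, a minimal sketch of what such DDL might look like for the Customers table follows. The column names, lengths, and data types are assumptions based on the sample data shown in Section 1; Appendix A contains the actual statements:

CREATE TABLE customers
  ( account          NUMERIC (7) UNSIGNED NO DEFAULT NOT NULL
  , name             CHAR (30)
  , gender           CHAR (1)
  , marital_status   CHAR (10)
  , home             CHAR (4)
  , number_children  NUMERIC (2)
  , income           NUMERIC (9,2)
  , PRIMARY KEY (account) );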
Importing Data Into the Database
After the data mining database is created, the warehouse data is imported into the
database. In a typical real-world scenario, you would import the data by using some
type of database utility—for example, you can use the DataLoader/MP utility to import
a large quantity of data into an SQL/MP database. For further information, see the
DataLoader/MX Reference Manual and the SQL/MX Reference Manual for discussions
of the Import Utility.
Alternatively, you can also use INSERT statements to insert values into the data mining
database. The INSERT statements for the example in this manual are included in
Appendix B, Inserting Into the Data Mining Database.
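As a small sketch of the INSERT alternative (the column list is assumed from the sample Customers data in Section 1; Appendix B contains the actual statements used for the examples):

INSERT INTO customers (account, name, marital_status, home, income)
  VALUES (1234567, 'Jones, Mary', 'Single', 'Own', 65000);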
Profiling the Data
Profiling often begins with the computation of basic information about each attribute.
For discrete attributes, this basic information is typically a table of the unique values
and a count of how many times each value occurs. However, as cardinality increases,
these frequencies become less and less meaningful. For continuous attributes, the
approach is to use metrics such as minimum, maximum, mean, and variance.
Cardinalities and Metrics
For any attribute, one approach to profiling is to run a separate query for each attribute.
As an example, consider the following queries, which profile the discrete attribute
Marital Status from the Customers table and the continuous attribute Balance from the
Account History table.
Example of Discrete Attribute
This query finds the number of discrete values of the Marital Status column of the
Customers table:
SELECT marital_status, COUNT(*)
FROM customers
GROUP BY marital_status;
Example of Continuous Attribute
This query computes statistical information about the continuous attribute Balance in
the Account History table:
SELECT MIN(balance), MAX(balance),
AVG(balance), VARIANCE(balance)
FROM acct_history;
Transposition
Apart from the computation of a few metrics, both of the previous queries require a
complete scan of the data. Profiled one attribute at a time, a table with N attributes requires N queries,
resulting in the same number of complete scans. For a wide mining table, this
procedure can result in thousands of queries and scans of the data.
Using transposition, SQL/MX can perform the above profiling operations by using a
total of only two queries, regardless of the number of attributes to be profiled. Through
the TRANSPOSE clause of the SELECT statement, different columns of a source table
can be treated as a single output column, enabling similar computations to be
performed on all such source columns.
TRANSPOSE takes each row in the source table and converts each expression listed
in the transpose set to an individual output row. Used in this way, TRANSPOSE can
compute frequency counts for all discrete attributes in a table in a single query.
See the TRANSPOSE Clause entry in the SQL/MX Reference Manual for more
information.
Example of Computing Counts for Character Discrete Attributes
This query computes the frequency counts for the discrete attributes Gender, Marital
Status, and Home, which are all type character:
SET NAMETYPE ANSI;
SET SCHEMA mining.whse;
SELECT attr, c1, COUNT(*) FROM customers
TRANSPOSE ('GENDER', gender),
('HOME', home),
('MARITAL_STATUS', marital_status)
AS (attr, c1)
GROUP BY attr, c1
ORDER BY attr, c1;
ATTR            C1        (EXPR)
--------------  --------  --------------------
GENDER          F               20
GENDER          M               22
HOME            Own             33
HOME            Rent             9
MARITAL_STATUS  Divorced        12
MARITAL_STATUS  Married          9
MARITAL_STATUS  Single          15
MARITAL_STATUS  Widow            6

--- 8 row(s) selected.
Because this query produces counts for three different attributes, use the ATTR
column to distinguish from which attribute the values are drawn. The C1 column
contains the values for these character attributes.
Example of Computing Counts for Character and Numeric Discrete
Attributes
This query also uses the TRANSPOSE clause and illustrates how profiling can be
achieved. The column C2 has been added to the statement because Number_Children
has a numeric data type.
SELECT attr, c1, c2, COUNT(*) FROM customers
TRANSPOSE ('GENDER', gender, null),
('HOME', home, null),
('MARITAL_STATUS', marital_status, null),
('NUMBER_CHILDREN', null, number_children)
AS (attr, c1, c2)
GROUP BY attr, c1, c2
ORDER BY attr, c1, c2;
ATTR             C1        C2      (EXPR)
---------------  --------  ------  --------------------
GENDER           F         ?             20
GENDER           M         ?             22
HOME             Own       ?             33
HOME             Rent      ?              9
MARITAL_STATUS   Divorced  ?             12
MARITAL_STATUS   Married   ?              9
MARITAL_STATUS   Single    ?             15
MARITAL_STATUS   Widow     ?              6
NUMBER_CHILDREN  ?         0             25
NUMBER_CHILDREN  ?         1              4
NUMBER_CHILDREN  ?         2             10
NUMBER_CHILDREN  ?         3              3

--- 12 row(s) selected.
Because this query produces counts for four different attributes, use the ATTR column
to distinguish from which attribute the values are drawn. The C1 column contains the
values for the character attributes, and the C2 column contains the values for the
numeric attribute.
Example of Computing Statistics for Continuous Attributes
Similarly, a single query using TRANSPOSE can compute the necessary statistics for
all continuous attributes. The next query computes the minimum, maximum, mean, and
variance for the continuous attributes Customer Credit Limit and Balance, which are
both numeric:
SELECT attr, MIN(c1), MAX(c1), AVG(c1), VARIANCE(c1)
FROM acct_history
TRANSPOSE (1,cust_limit), (2,balance) AS (attr, c1)
GROUP BY attr
ORDER BY attr;
Sample results are:

ATTR  MIN(C1)  MAX(C1)   AVG(C1)   VARIANCE(C1)
   1  5000.00  40000.00  18225.81    2.01E+008
   2      .00  32000.00   2539.12    1.46E+007

ATTR  MIN(C1)  MAX(C1)   AVG(C1)   VARIANCE(C1)
   1  5000.00  40000.00  20139.86    2.35E+008
   2      .00  32000.00   2444.17   1.584E+007
By using TRANSPOSE to compute attribute profiles, you gain performance and
scalability advantages. Performance is improved because the data set is scanned only
once. In addition, the number of queries is reduced to two: one for discrete attributes
and one for continuous attributes. Scalability is enhanced because the amount of data
accessed grows linearly with the number of attributes actually profiled.
Quick Profiling
The profiling step is highly iterative, because many different data sources are
inspected and evaluated for possible analysis. Getting a quick impression of an
attribute before proceeding to a more detailed profile is often necessary. For example,
by quickly estimating cardinality, you can determine whether to treat a column as
discrete or continuous. You can make this determination accurately without scanning
every single data element.
Use the SQL/MX sampling feature to:

• Randomly sample source data
• Improve computing efficiency for a profile using a selected sampling percentage
• Reduce both the I/O costs and the CPU costs associated with computing a profile

See the SAMPLE Clause of SELECT in the SQL/MX Reference Manual.
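As a rough sketch only (the 10 percent figure is arbitrary, and the exact placement and options of the clause should be confirmed against the SAMPLE Clause entry in the SQL/MX Reference Manual), a sampled cardinality estimate might look like this:

SELECT COUNT(DISTINCT age)
FROM customers SAMPLE RANDOM 10 PERCENT;

Because only a fraction of the rows are read, the estimate is returned much faster than an exact count over the full table.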
Defining Events
Events are used to align related data in a single set of columns for mining. Example
events are life changes, such as getting married or switching jobs, or customer actions,
such as opening an account or requesting a credit limit increase.
The critical event to be defined for the business opportunity described in this manual is
the month the customer left—either by closing their account or by maintaining a zero
balance for three months. The problem is to align the data so that this event can be
derived as an attribute of the mining view.
Aligning the Data
Most mining algorithms and tools require that the input data be arranged so that all the
information pertaining to a given entity is contained in a single record. However, in
typical raw mining data, observations about a given entity can be stored in separate
rows and tables.
For example, the Account History table contains one record per customer per month,
summarizing the account status for that customer. The related Customers table
contains static information in the form of one row per customer. For this example, the
account status information must be reduced to a single row of information for each
customer. This data is paired with the static customer information to form the mining
view.
Two methods exist for mapping time-dependent data in the mining view. One method is
to take a value from a particular month and include that value in the mining view. For
example, the checking account balance for January 1998 can be included in the mining
view for each customer because the balance is a single value.
Alternatively, a value can be aggregated over a time period to compute a single value
for the mining view. An example is the average checking account balance for January
1998 through June 1998.
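A sketch of that kind of aggregation, using the Account History table from this manual's example rather than a checking account (the date range is illustrative):

SELECT account, AVG(balance)
FROM acct_history
WHERE year_month BETWEEN DATE '1998-01-01' AND DATE '1998-06-01'
GROUP BY account;

The single averaged value per account can then be carried into the mining view as one column.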
Absolute and relative methods exist for aligning time-dependent data in the mining
view. Specifying an event relative to a customer is often more meaningful than
specifying an absolute event, such as a given year and month.
The account balance one month prior to closing an account or the average account
balance for six months prior to closing an account are both examples of relative
events. In this type of relative time specification, the actual months selected depend on
an event that is different for each customer. Aligning the data by using relative events
is crucial for building models to predict events that occur at different times for each
customer.
Example of Aligning Data
This statement creates an SQL/MX table named Close_Temp that contains the
account number, the month the account is considered closed (if not closed, an arbitrary
month), and an indicator of whether or not the customer left:
SET SCHEMA mining.whse;
CREATE TABLE Close_Temp
( account       NUMERIC (7) UNSIGNED NO DEFAULT NOT NULL
                HEADING 'Account Number'
 ,close_month   DATE NO DEFAULT NOT NULL
                HEADING 'Close Month'
 ,cust_left     CHAR(1) NO DEFAULT
 ,PRIMARY KEY (account) );
In this query, the source data for the column named Close_Month is defined to be the
month the customer left—either by closing their account or by maintaining a zero
balance for three months. If the customer did not leave, the month is arbitrarily defined
to be a month in the middle of their account history.
INSERT INTO close_temp
(SELECT p.account,
CASE
WHEN p.close_month2 IS NOT NULL THEN p.close_month2
WHEN p.close_month1 IS NOT NULL THEN p.close_month1
ELSE p.open_month + ((DATE '1999-12-01' - p.open_month)/2)
- INTERVAL '16' DAY
END,
CASE
WHEN p.close_month2 IS NOT NULL THEN 'Y'
WHEN p.close_month1 IS NOT NULL THEN 'Y'
ELSE 'N'
END
FROM
(SELECT t.account, MAX(t.close_month1),
MAX(t.close_month2), MIN(t.year_month)
FROM
(SELECT m.account ,m.year_month,
CASE WHEN m.status = 'Closed'
AND OFFSET(m.status,1) = 'Open'
AND account = OFFSET(account,1)
THEN m.year_month
END,
CASE WHEN ROWS SINCE INCLUSIVE(balance <> 0) = 3.0
AND account = OFFSET(account,2)
THEN m.year_month
END
FROM acct_history m
SEQUENCE BY m.account, m.year_month)
t (account, year_month, close_month1, close_month2)
GROUP BY t.account)
p (account, close_month1, close_month2, open_month));
The derived attribute Close_Month1 contains the month when a customer explicitly
closed their account (the Account Status is marked Closed). The first CASE expression
in the inner query uses the OFFSET sequence function to determine the month when
an account is closed explicitly.
The derived attribute Close_Month2 contains the month when a customer implicitly
closed their account (maintained a zero balance for three months). The second CASE
expression in the inner query uses the OFFSET sequence function and the ROWS
SINCE INCLUSIVE sequence function to determine the month when an account has a
zero balance for three months.
The derived attribute Open_Month is the month when the account was opened. In the
CASE expression of the outer query, this month is adjusted to be the month in the
middle of the account history. The account history interval is defined to start with the
first month the account is open up to the date 1999-12-01.
The derived attribute Close_Month in the Close_Temp table is set to either
Close_Month1 (when a customer explicitly closed their account), Close_Month2 (when
a customer maintained a zero balance for three months), or the month in the middle of
the Account History interval (when an account is open).
The derived attribute Cust_Left is set to Y if a customer has a zero balance for three
months or if the Account Status is marked Closed.
In queries that use sequence functions, note the use of the SEQUENCE BY clause.
See SEQUENCE BY Clause and Sequence Functions in the SQL/MX Reference
Manual for more information.
Here are the contents of the Close_Temp table after the preceding row insertion:
Account Number   Close Month   Cust_Left
--------------   -----------   ---------
1000000          1999-03-01    N
1234567          1999-12-01    Y
2300000          1999-11-01    Y
2400000          1998-12-01    Y
2500000          1999-10-01    Y
2900000          1999-06-01    N
3200000          1999-05-01    Y
3900000          1998-10-01    Y
4098124          1998-10-01    Y
4300000          1999-06-01    N
4400000          1999-07-01    Y
4500000          1998-09-01    Y
4600000          1999-12-01    Y
4700000          1999-06-01    N
Deriving Attributes
In the preceding Example of Aligning Data on page 2-7, the derived attributes in the
Close_Temp table are Close_Month and Cust_Left. These attributes are critical for the
task of building a model that will predict at any point in time, based on such things as
current account status, account activity, and customer demographics, whether a credit
card customer will leave three months in the future.
To produce good models, the source mining data typically needs to be supplemented
with appropriate derived attributes. Typical derived attributes include computing ratios
between key quantities, mapping postal codes to average demographics, computing
metrics, and computing rankings, percentiles, or quartiles.
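For example, a simple ratio attribute can be derived directly with a query. This sketch
computes a payment-to-balance ratio from the Account History table, guarding against
division by zero with a CASE expression:
SELECT account, year_month,
       CASE WHEN balance <> 0 THEN payment / balance END
FROM acct_history;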
Moving Metrics
Moving metrics measure dynamic behavior in terms of rates of events or trends in a
state or condition. In the data mining environment, moving metrics are good predictors
for many modeling tasks involving historical or time series data. For example, the
moving average of an account balance produces attributes that could be included in
the mining view for each customer.
SQL/MX supports a number of sequence functions that you can use to simplify queries
and to execute queries more efficiently.
Example Using MOVINGAVG and ROWS SINCE
This query uses the sequence functions MOVINGAVG and ROWS SINCE:
SELECT account, year_month, MOVINGAVG (balance,
ROWS SINCE INCLUSIVE (account <> OFFSET (account,1)) +1,
RUNNINGCOUNT(*))
FROM acct_history
SEQUENCE BY account, year_month;
ACCOUNT     YEAR_MONTH   (EXPR)
----------  ----------   ---------------------
1000000     1998-07-01   3678.67
1000000     1998-08-01   5229.33
1000000     1998-09-01   4253.15
1000000     1998-10-01   5189.86
1000000     1998-11-01   5221.06
1000000     1998-12-01   5134.22
1000000     1999-01-01   4572.19
...         ...          ...

--- 186 row(s) selected.
In this query, the ROWS SINCE INCLUSIVE sequence function is used to limit the
moving average window to records for the current customer. The third argument of
MOVINGAVG is RUNNINGCOUNT(*), which ensures MOVINGAVG does not include
rows before the beginning row.
In practice, similar queries can be used to compute several metrics at the same time,
and the results, which conceptually are new columns in the Account History table, can
be realized in an auxiliary table. This auxiliary table can then be referenced when
computing the mining view.
By using sequence functions, you eliminate the dependency on the number and
location of moving averages computed for each customer. Even if customers have
different numbers of history records, sequence functions allow the computation of a
metric for each customer.
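As a sketch of this approach, the moving-average query shown earlier can be wrapped in an
INSERT statement that realizes the metric in an auxiliary table. The table acct_metrics
and its layout are illustrative assumptions; they are not part of the database created in
the appendixes.
-- Hypothetical auxiliary table holding one moving metric per account and month
CREATE TABLE acct_metrics
( account       NUMERIC (7) UNSIGNED NO DEFAULT NOT NULL
 ,year_month    DATE NO DEFAULT NOT NULL
 ,avg_bal       NUMERIC (9,2)
 ,PRIMARY KEY (account, year_month) );

-- Materialize the per-customer running average of the balance
INSERT INTO acct_metrics
  (SELECT account, year_month,
          MOVINGAVG (balance,
                     ROWS SINCE INCLUSIVE (account <> OFFSET (account,1)) + 1,
                     RUNNINGCOUNT(*))
   FROM acct_history
   SEQUENCE BY account, year_month);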
Rankings
Simple rankings provide good predictors for many modeling tasks. An example is the
rank of a customer’s average account balance relative to all other customers. This
query computes the absolute rank of the average account balance for each customer:
SELECT cid, RUNNINGCOUNT(*), avg_bal
FROM
(SELECT account, AVG(balance)
FROM acct_history
GROUP BY account)
AS t(cid, avg_bal)
SEQUENCE BY avg_bal DESC;
CID         (EXPR)   AVG_BAL
----------  -------  ---------------------
4300000     1        6203.33
2900000     2        5184.04
4098124     3        4920.02
2300000     4        4610.28
1000000     5        4067.44
1234567     6        2807.20
...         ...      ...

--- 14 row(s) selected.
In practice, the results of this type of query are realized in an auxiliary table that can be
thought of as an extension to the Customers table. Percentiles and quartiles can also
be computed easily with similar queries.
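As a sketch, a percentile can be computed in the same style by dividing the running rank
by the total number of customers. This example assumes that the SEQUENCE BY clause can be
applied to the join of the two derived tables:
SELECT cid, (100 * RUNNINGCOUNT(*)) / total, avg_bal
FROM
  (SELECT account, AVG(balance)
   FROM acct_history
   GROUP BY account) AS t(cid, avg_bal),
  (SELECT COUNT(*)
   FROM (SELECT DISTINCT account FROM acct_history) AS d(account)) AS c(total)
SEQUENCE BY avg_bal DESC;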
See the Sequence Functions entry in the SQL/MX Reference Manual for more
information.
3
Creating the Data Mining View
Because data mining often involves executing a series of similar queries before getting
satisfying results, it can be helpful to use materialized results of previous queries when
answering a new one. Creating a data mining view allows you to access intentionally
gathered and permanently stored results of a data mining query.
Creating a data mining view is Step 6 of the knowledge discovery process.
1. Identify and define a business opportunity.
2. Preprocess and load the data for the business opportunity.
3. Profile and understand the relevant data.
4. Define events relevant to the business opportunity being explored.
5. Derive attributes.
6. Create the data mining view.
Transform the data into a mining view, a form in which all attributes about the
primary mining entity occur in a single record. This transformation involves:
• Creating the Single Table
• Pivoting the Data
7. Mine the data and build models.
8. Deploy models.
9. Monitor model performance.
Creating the Single Table
After computing derived attributes and storing these attributes in auxiliary tables, you
create the mining view by combining all the information into a single table with one row
for each entity. Continuing with the credit card example, the mining view contains the
information in the Customers table along with the auxiliary customer data. In addition,
information in the Account History and related tables is also used.
Typically, after the mining view is computed and inserted into a single database table,
the data is extracted and loaded into a mining tool for the model building step. The
mining data can be extracted through ODBC/MX or the Genus Mining Integrator for
NonStop SQL.
Example of Creating the View
The derived attributes consisting of the three balances for the three months prior to a
customer leaving are specified in the following SQL/MX CREATE TABLE statement.
This view aligns the data around the month of a particular event—account attrition.
SET SCHEMA mining.whse;
CREATE TABLE miningview
( account          NUMERIC (7) UNSIGNED NO DEFAULT NOT NULL
                   HEADING 'Account Number'
 ,marital_status   CHARACTER (8) DEFAULT NULL
                   HEADING 'Marital Status'
 ,home             CHARACTER (4) DEFAULT NULL
                   HEADING 'Home'
 ,income           NUMERIC (8, 2) UNSIGNED DEFAULT NULL
                   HEADING 'Income'
 ,gender           CHAR(1) DEFAULT NULL
 ,age              NUMERIC (3) DEFAULT NULL
                   HEADING 'Age'
 ,number_children  NUMERIC (2) DEFAULT NULL
                   HEADING 'Number of Children'
 ,year_month       DATE NO DEFAULT NOT NULL
 ,close_month      DATE NO DEFAULT NOT NULL
 ,balance_close_1  NUMERIC (9,2) NO DEFAULT NOT NULL
 ,balance_close_2  NUMERIC (9,2) NO DEFAULT NOT NULL
 ,balance_close_3  NUMERIC (9,2) NO DEFAULT NOT NULL
 ,cust_left        CHAR(1) NO DEFAULT
 ,PRIMARY KEY (account) );
Pivoting the Data
All the data in the Customers, Account History, and auxiliary tables must be collapsed to
a single row for each customer. Collapsing the data is accomplished by pivoting the
data. Data is moved from separate rows for each customer into different columns of a
single customer row. For example, the balance one month prior to account closure can
be placed in column BALANCE_CLOSE_1, the balance two months prior to account
closure in column BALANCE_CLOSE_2, and the balance three months prior to
account closure in column BALANCE_CLOSE_3.
To accomplish this pivoting operation, use the OFFSET sequence function to collect
data from various months and place the results in a single row.
Example Using OFFSET Sequence Function
This query populates the mining view:
INSERT INTO miningview
(SELECT t.account
,c.marital_status
,c.home
,c.income
,c.gender
,c.age
,c.number_children
,t.year_month
,t.close_month
,t.balance_close_1
,t.balance_close_2
,t.balance_close_3
,t.cust_left
FROM
(SELECT account
,year_month
,close_month
,CASE WHEN year_month = close_month THEN balance
END AS balance_close_1
,CASE WHEN year_month = close_month
AND account = OFFSET(account,1)
THEN OFFSET(balance, 1)
END AS balance_close_2
,CASE WHEN year_month = close_month
AND account = OFFSET(account,2)
THEN OFFSET(balance,2)
END AS balance_close_3
,cust_left
FROM acct_history a NATURAL JOIN close_temp m
SEQUENCE BY account, year_month) AS t, customers c
WHERE t.balance_close_1 IS NOT NULL AND
t.balance_close_2 IS NOT NULL AND
t.balance_close_3 IS NOT NULL AND
c.account = t.account);
Sequence functions are used in the preceding query to create a derived table with the
various balances for each customer. This derived table has one row per customer that
consists of a single copy of the relevant data.
Here are the contents of the Miningview table after the preceding row insertion:
Account Number  Marital Status  Home  Income     Gender  Age  Number Children
1000000         Married         Own   175500.00  M       45   3
1234567         Single          Own    65000.00  F       34   0
2300000         Divorced        Own   137000.00  M       42   2
2400000         Widow           Own    28000.00  F       65   0
2500000         Divorced        Rent   32000.00  M       23   0
2900000         Divorced        Rent  136000.00  F       50   0
3200000         Divorced        Rent  138000.00  M       40   1
3900000         Divorced        Own    75000.00  M       40   2
4098124         Divorced        Own    44000.00  M       44   2
4300000         Married         Own   300000.00  F       29   2
4400000         Single          Own   300000.00  F       29   0
4500000         Married         Own   300000.00  F       29   1
4600000         Single          Own   300000.00  M       48   0
4700000         Widow           Own   300000.00  M       68   0
...
This table continues the mining view table.
Account Number  ...  Year_Month  Close Month  Balance Close_1  Balance Close_2  Balance Close_3  Cust Left
1000000              1999-03-01  1999-03-01   5500.00          3500.00          1200.00          N
1234567              1999-12-01  1999-12-01    500.00          1200.00          2870.00          Y
2300000              1999-11-01  1999-11-01       .00              .00              .00          Y
2400000              1998-12-01  1998-12-01       .00              .00              .00          Y
2500000              1999-10-01  1999-10-01       .00              .00              .00          Y
2900000              1999-06-01  1999-06-01   2356.80          1134.00          9432.78          N
3200000              1999-05-01  1999-05-01       .00              .00              .00          Y
3900000              1998-10-01  1998-10-01       .00              .00              .00          Y
4098124              1998-10-01  1998-10-01   4069.34          4347.63          4596.10          Y
4300000              1999-06-01  1999-06-01   9000.00          4354.00          9876.00          N
4400000              1999-07-01  1999-07-01       .00              .00              .00          Y
4500000              1998-09-01  1998-09-01       .00           100.00            50.00          Y
4600000              1999-12-01  1999-12-01   1000.00            50.00            80.00          Y
4700000              1999-06-01  1999-06-01    330.00           330.00           330.00          N
4
Mining the Data
This section describes the next three steps of the process, Steps 7 through 9.
1. Identify and define a business opportunity.
2. Preprocess and load the data for the business opportunity.
3. Profile and understand the relevant data.
4. Define events relevant to the business opportunity being explored.
5. Derive attributes.
6. Create the data mining view.
7. Mine the data and build models.
Model building can be done by extracting the mining data into a special mining
tool, such as Enterprise Miner from the SAS Institute. A detailed discussion of the
use of this tool is beyond the scope of this manual.
However, this manual does include building a decision tree as an example of a
technique that could be used by a data mining tool for building a model. See
Building the Model on page 4-2.
8. Deploy models.
Deployment can take many different forms. For example, deployment might be as
simple as documenting and reporting the results, or deployment might be
embedding the model in an operational system to achieve predictive results.
See Deploying the Model on page 4-10.
9. Monitor model performance.
Performance of the model must be monitored for accuracy. When accuracy begins
to decline, the model must be updated to fit the current situation.
See Monitoring Model Performance on page 4-11.
Building the Model
Typically, after the mining view is computed and inserted into a single database table,
the data is extracted and loaded into a mining tool for the model building step.
Regardless of the type of analysis to be performed, the mining data can be stored and
retrieved by using the SQL/MX approach. This subsection describes how a decision
tree can be used for data analysis.
Building Decision Trees
Decision trees are built by recursively partitioning the data in an increasingly selective
manner, based on the attributes that most strongly determine the outcome. This
classification is determined by computing the best splits at each node in the tree. The
key operation on the data is computing the frequency of various combinations of attributes
for a given subset of the data. This result is called a cross table.
The first step in building a decision tree is to generate cross tables for all the attributes
compared to the goal attribute. Building a decision tree can require the computation of
tens of thousands of cross tables. The computation of each cross table requires
scanning the data, applying specified predicates, grouping, and computing counts.
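For example, a single cross table for one attribute can be computed with a simple grouped
query against the mining view:
SELECT gender, cust_left, COUNT(*)
FROM miningview
GROUP BY gender, cust_left;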
Computing Cross Tables
The first set of cross tables needed for building a decision tree consists of each
independent variable (a potential predictor) paired with the dependent variable (the
goal). In the same way that the profiling queries are combined by using TRANSPOSE,
these separate cross-table queries can be combined into a single query for each node
in the decision tree.
Computing Cross Tables to Determine the Initial Branch
This query computes the cross tables for Gender, Marital Status, and
Number_Children, with Cust_Left as the dependent variable (the goal):
SET SCHEMA mining.whse;
SELECT Independent_Variable, IV1, IV2, cust_left, COUNT(*)
FROM miningview
TRANSPOSE ('GENDER', gender, NULL),
('MARITAL STATUS', marital_status, NULL),
('NUMBER_CHILDREN', NULL, number_children)
AS (Independent_Variable, IV1, IV2)
GROUP BY Independent_Variable, IV1, IV2, cust_left
ORDER BY Independent_Variable, IV1, IV2, cust_left ;
INDEPENDENT_VARIABLE  IV1       IV2     CUST_LEFT  (EXPR)
--------------------  --------  ------  ---------  --------
GENDER                F         ?       N          2
GENDER                F         ?       Y          4
GENDER                M         ?       N          2
GENDER                M         ?       Y          6
MARITAL STATUS        Divorced  ?       N          1
MARITAL STATUS        Divorced  ?       Y          5
MARITAL STATUS        Married   ?       N          2
MARITAL STATUS        Married   ?       Y          1
MARITAL STATUS        Single    ?       Y          3
MARITAL STATUS        Widow     ?       N          1
MARITAL STATUS        Widow     ?       Y          1
NUMBER_CHILDREN       ?         0       N          2
NUMBER_CHILDREN       ?         0       Y          5
NUMBER_CHILDREN       ?         1       Y          2
NUMBER_CHILDREN       ?         2       N          1
NUMBER_CHILDREN       ?         2       Y          3
NUMBER_CHILDREN       ?         3       N          1

--- 17 row(s) selected.
Determining Which Attribute Best Predicts the Goal
Consider the results of the preceding query. You are ready to determine which of the
independent variables best predicts the dependent variable (the goal).
Examine the rows for each independent variable in the query. If most of the rows for a
particular value of an independent variable correlate with Cust_Left equal to Y, that
independent variable is a good predictor of the goal. This type of analysis is typically
performed by client-mining tools.
Independent Variable  Predictor?  Reason
GENDER                Yes         When Cust_Left is equal to Y, the Gender is
                                  predominantly equal to M. The number of Males is 6,
                                  and the number of Females is 4.
MARITAL STATUS        Yes         When Cust_Left is equal to Y, the Marital Status is
                                  predominantly equal to Divorced and Single. The
                                  number of Divorced is 5, the number of Married is 1,
                                  the number of Single is 3, and the number of Widow
                                  is 1.
NUMBER CHILDREN       No          When Cust_Left is equal to Y, the Number_Children is
                                  0, 1, and 2. The number with Children=0 is 5, the
                                  number with Children=1 is 2, and the number with
                                  Children=2 is 3. The values do not show a pattern and
                                  do not predict Cust_Left equal to Y.
Both Gender and Marital Status are reasonable choices as the best predictor of the
goal. To carry out the remaining cross-table generations, this scenario uses Marital
Status as the best predictor for the initial branch of the decision tree.
Typically, the best discriminator of the goal is determined by a statistical analysis of the
cross tables. The exact nature of this analysis varies from tool to tool.
Initial Decision Tree
Figure 4-1 shows the initial decision tree for the business opportunity. Marital Status is
chosen as the best predictor of the goal with four initial branches—Divorced, Single,
Married, and Widow.
Figure 4-1. Initial Branches of Decision Tree

Marital Status
  Divorced   No: 1  Yes: 5
  Single     No: 0  Yes: 3
  Married    No: 2  Yes: 1
  Widow      No: 1  Yes: 1
The model is built to characterize the customers that have left—that is, the model will
find the rows where Cust_Left is Y.
The results for Divorced and Single are the most promising for further development of
the decision tree. For Divorced, the number of records is 5 for Cust_Left equal to Y,
and for Single, the number of records is 3 for Cust_Left equal to Y. In both cases, the
results of the cross table show the best homogeneous split with respect to the goal.
Initial Branches of the Decision Tree
The two initial branches that seem most promising are defined by two conditions:
marital_status = 'Divorced'
marital_status = 'Single'
Computing Cross Tables When Marital Status Equal to Divorced
This query generates cross tables for all attributes, except Marital Status, compared to
the goal when Marital Status is equal to Divorced:
SELECT Independent_Variable, IV1, IV2, cust_left, COUNT(*)
FROM miningview
WHERE marital_status = 'Divorced'
TRANSPOSE ('GENDER', gender, NULL),
('NUMBER_CHILDREN', NULL, number_children)
AS (Independent_Variable, IV1, IV2)
GROUP BY Independent_Variable, IV1, IV2, cust_left
ORDER BY Independent_Variable, IV1, IV2, cust_left;
INDEPENDENT_VARIABLE  IV1  IV2     CUST_LEFT  (EXPR)
--------------------  ---  ------  ---------  --------
GENDER                F    ?       N          1
GENDER                M    ?       Y          5
NUMBER_CHILDREN       ?    0       N          1
NUMBER_CHILDREN       ?    0       Y          1
NUMBER_CHILDREN       ?    1       Y          1
NUMBER_CHILDREN       ?    2       Y          3

--- 6 row(s) selected.
The preceding query shows Gender is Male in all cases where Cust_Left is equal to Y,
and therefore Gender is a good predictor where Marital Status is Divorced. The
Number_Children is equal to 0, 1, and 2, and therefore Number_Children is not a good
predictor.
Decision Tree for Divorced Branch
Figure 4-2 shows the results of the preceding query for the example business
opportunity.
Figure 4-2. Decision Tree for Divorced Branch

Marital Status
  Divorced    No: 1  Yes: 5
    Male      No: 0  Yes: 5
    Female    No: 1  Yes: 0
  Single      No: 0  Yes: 3
  Married     No: 2  Yes: 1
  Widow       No: 1  Yes: 1
For Divorced, when Cust_Left is equal to Y, the number of records is 5 for Gender
equal to Male. Gender best discriminates the goal when Marital Status is equal to
Divorced.
Computing Cross Tables When Marital Status Equal to Single
This query generates cross tables for all attributes, except Marital Status, compared to
the goal when Marital Status is equal to Single:
SELECT Independent_Variable, IV1, IV2, cust_left, COUNT(*)
FROM miningview
WHERE marital_status = 'Single'
TRANSPOSE ('GENDER', gender, NULL),
('NUMBER_CHILDREN', NULL, number_children)
AS (Independent_Variable, IV1, IV2)
GROUP BY Independent_Variable, IV1, IV2, cust_left
ORDER BY Independent_Variable, IV1, IV2, cust_left;
INDEPENDENT_VARIABLE  IV1  IV2     CUST_LEFT  (EXPR)
--------------------  ---  ------  ---------  --------
GENDER                F    ?       Y          2
GENDER                M    ?       Y          1
NUMBER_CHILDREN       ?    0       Y          3

--- 3 row(s) selected.
The preceding query shows split results for Gender when Cust_Left is equal to Y, and
therefore Gender is not a good predictor when Marital Status is equal to Single. However,
the query also shows Number_Children equal to 0 when Cust_Left is equal to Y, and
therefore Number_Children is a good predictor.
Decision Tree for Single Branch
Figure 4-3 shows the results of the preceding query for the example business
opportunity.
Figure 4-3. Decision Tree for Single Branch

Marital Status
  Divorced      No: 1  Yes: 5
  Single        No: 0  Yes: 3
    Chldrn=0    No: 0  Yes: 3
    Chldrn>0    No: 0  Yes: 0
  Married       No: 2  Yes: 1
  Widow         No: 1  Yes: 1
For Single, when Cust_Left is equal to Y, the number of records is 3 for Number_Children
equal to 0. Number_Children best discriminates the goal when Marital Status is equal
to Single.
Conditions Defining the Decision Tree
The model developed so far seems to characterize the customers that have left—that
is, the model finds the rows where Cust_Left is equal to Y. The model is now defined by
two conditions:
(marital_status = 'Divorced' AND gender = 'M')
(marital_status = 'Single'
AND number_children = 0)
For Divorced and Male, the number of records is 5 for Cust_Left equal to Y, and the
number of records is 0 for Cust_Left equal to N. For Single and Number_Children equal to 0, the
number of records is 3 for Cust_Left equal to Y, and the number of records is 0 for
Cust_Left equal to N.
Showing the Homogeneous Branches
For each of the preceding conditions, these queries show that the branches in the
decision tree are homogeneous with respect to the goal attribute Cust_Left.
Computing Cross Table When Marital Status is Divorced and Gender is
Male
This query generates cross tables for the Gender attribute compared to the goal when
Marital Status is Divorced and Gender is Male:
SELECT Independent_Variable, IV1, cust_left, COUNT(*)
FROM miningview
WHERE marital_status = 'Divorced' AND gender = 'M'
TRANSPOSE ('GENDER', gender)
AS (Independent_Variable, IV1)
GROUP BY Independent_Variable, IV1, cust_left
ORDER BY Independent_Variable, IV1, cust_left;
INDEPENDENT_VARIABLE  IV1  CUST_LEFT  (EXPR)
--------------------  ---  ---------  --------
GENDER                M    Y          5
--- 1 row(s) selected.
This group of records is homogeneous with respect to Cust_Left—that is, Cust_Left is
equal to Y in all cases.
Computing Cross Table When Marital Status is Divorced and Gender is
Female
This query generates cross tables for the Gender attribute compared to the goal when
Marital Status is Divorced and Gender is Female:
SELECT Independent_Variable, IV1, cust_left, COUNT(*)
FROM miningview
WHERE marital_status = 'Divorced' AND gender = 'F'
TRANSPOSE ('GENDER', gender)
AS (Independent_Variable, IV1)
GROUP BY Independent_Variable, IV1, cust_left
ORDER BY Independent_Variable, IV1, cust_left;
INDEPENDENT_VARIABLE  IV1  CUST_LEFT  (EXPR)
--------------------  ---  ---------  --------
GENDER                F    N          1
--- 1 row(s) selected.
This group of records is homogeneous with respect to Cust_Left—that is, Cust_Left is
equal to N in all cases.
Computing Cross Table When Marital Status is Single and Children is Zero
This query generates cross tables for the Number_Children attribute compared to the
goal when Marital Status is Single and Number_Children is 0:
SELECT Independent_Variable, IV2, cust_left, COUNT(*)
FROM miningview
WHERE marital_status = 'Single' AND number_children = 0
TRANSPOSE ('NUMBER_CHILDREN', number_children)
AS (Independent_Variable, IV2)
GROUP BY Independent_Variable, IV2, cust_left
ORDER BY Independent_Variable, IV2, cust_left;
INDEPENDENT_VARIABLE  IV2     CUST_LEFT  (EXPR)
--------------------  ------  ---------  --------
NUMBER_CHILDREN       0       Y          3
--- 1 row(s) selected.
This group of records is homogeneous with respect to Cust_Left—that is, Cust_Left is
equal to Y in all cases.
Computing Cross Table When Marital Status is Single and Children > Zero
This query generates cross tables for the Number_Children attribute compared to the
goal when Marital Status is Single and Number of Children is greater than 0:
SELECT Independent_Variable, IV2, cust_left, COUNT(*)
FROM miningview
WHERE marital_status = 'Single' AND number_children > 0
TRANSPOSE ('NUMBER_CHILDREN', number_children)
AS (Independent_Variable, IV2)
GROUP BY Independent_Variable, IV2, cust_left
ORDER BY Independent_Variable, IV2, cust_left;
--- 0 row(s) selected.
This group of records is homogeneous with respect to Cust_Left equal to N.
Final Decision Tree
You have now finished developing the decision tree. Each branch of the tree is
homogeneous with respect to the value of Cust_Left. In practice, this process is highly
iterative. Expanding each node might require several iterations, and you might need to
back up to a previous node to consider another alternative.
Figure 4-4 shows the final decision tree for the example business opportunity.
Figure 4-4. Final Decision Tree

Marital Status
  Divorced      No: 1  Yes: 5
    Male        No: 0  Yes: 5
    Female      No: 1  Yes: 0
  Single        No: 0  Yes: 3
    Chldrn=0    No: 0  Yes: 3
    Chldrn>0    No: 0  Yes: 0
  Married       No: 2  Yes: 1
  Widow         No: 1  Yes: 1

Note: Prune the tree at the Married and Widow branches because the remaining
branches do not yield a pattern.
Changing the Process
Classification trees are used to predict or explain responses to categorical dependent
variables. If you had not been able to develop a classification tree with homogeneous
branches with respect to Cust_Left, you could now do any of the following:
• Redefine the statement of the business opportunity.
  The data analysis process might indicate new directions that offer more interesting
  results.
• Redefine the goal.
  The goal is equal to Y if the customer had a zero balance for a period of 3 months.
  This definition might need adjustment.
• Add or remove columns in the mining view.
  Some columns that do not contribute to the goal can be removed. Also, the initial
  analysis might give new insight into columns that could be added.
• Change the definition of derived columns.
  For example, the average balance for the period of 3 months might be a better
  choice than a zero balance for 3 months.
• Change the mappings on the encoded columns.
Each iteration of the data mining process gives new insight into the changes you might
make for the next iteration.
Checking the Model
After you develop a model, you can check the model against the mining data.
Applying the Model to the Mining Table
You must check your model against the mining table.
Finding the Rows Where the Customer Left
This query finds most of the rows where Cust_Left is equal to Y:
SELECT account, cust_left FROM miningview
WHERE (marital_status = 'Divorced' AND gender = 'M')
OR (marital_status = 'Single'
AND number_children = 0);
Account Number  CUST_LEFT
--------------  ---------
1234567         Y
2300000         Y
2500000         Y
3200000         Y
3900000         Y
4098124         Y
4400000         Y
4600000         Y
--- 8 row(s) selected.
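The overall fit of the model can also be summarized by comparing the predicted value with
the actual value of Cust_Left for every row in the mining table. This is only a sketch of
such a check; the CASE expression simply restates the two conditions of the model:
SELECT predicted, cust_left, COUNT(*)
FROM
  (SELECT account, cust_left,
          CASE WHEN (marital_status = 'Divorced' AND gender = 'M')
                 OR (marital_status = 'Single' AND number_children = 0)
               THEN 'Y' ELSE 'N' END
   FROM miningview) AS t(account, cust_left, predicted)
GROUP BY predicted, cust_left;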
Applying the Model to the Database
Now, check your model against the database. Before applying the model to the
database, you can remove the tables and attributes that are not used in the analysis.
You must remove any mappings you created between the values in the database and
the values in the mining table.
Deploying the Model
After a model has been built and tested, the results are deployed into the business
environment. In many cases, deployment means exporting the model back to the
database to be used to evaluate new cases. Depending on its complexity, a model can
be evaluated directly in the database by using standard SQL or user-defined functions.
Simple models like decision trees can usually be represented in standard SQL by
using a complex CASE statement. Many mining tools have the ability to export a CASE
statement representing a decision tree.
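As a sketch, the decision tree developed in this section could be deployed as a CASE
expression that scores customers directly in the database:
SELECT account,
       CASE
         WHEN marital_status = 'Divorced' AND gender = 'M' THEN 'Y'
         WHEN marital_status = 'Single' AND number_children = 0 THEN 'Y'
         ELSE 'N'
       END
FROM customers;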
However, many times, models cannot be evaluated directly by using SQL. In this case,
user-defined functions are needed. Most mining tools have the ability to export a C
function that evaluates a model. The function code can be compiled and then executed
in the DBMS as a user-defined function. Object relational enhancements to SQL/MX
include such user-defined functions, which are accessible through standard SQL and
executed directly in the database.
Monitoring Model Performance
When measuring a model, consider these questions:
• How accurate is the model?
  The accuracy of the model can be measured as a whole. For example, you can
  determine the percentage of records that are classified correctly. The accuracy of
  the parts of a model can also be measured. For example, in a decision tree, each
  branch of the tree has an associated error rate (a sketch of such a per-branch check
  follows this list).
• To what degree does the model describe the observed data?
  The model should be sufficiently descriptive with respect to the observed data to
  make clear why a particular prediction was made.
• What is the level of confidence in the model’s predictions?
  Confidence is a measure of how often the model predicts the goal in the training
  data set.
• Is the model easily understood?
  A predictive model that consists of a few simple rules is preferable to a model that
  contains many rules, even if the latter is more accurate.
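The following query is a sketch of a per-branch check against the mining view; each
branch of the model is labeled, and the counts of Y and N records for each branch show
its error rate:
SELECT branch, cust_left, COUNT(*)
FROM
  (SELECT account, cust_left,
          CASE
            WHEN marital_status = 'Divorced' AND gender = 'M' THEN 'Divorced/Male'
            WHEN marital_status = 'Single' AND number_children = 0 THEN 'Single/NoChildren'
            ELSE 'Other'
          END
   FROM miningview) AS t(account, cust_left, branch)
GROUP BY branch, cust_left;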
However, in the end, the only true measure of a business model is its return on
investment. In a marketing application, measuring a model requires setting aside
control groups and carefully tracking customer responses to various models.
A
Creating the Data Mining Database
The examples presented in this manual use tables created by the execution of
SQL/MX CREATE TABLE statements. These SQL/MX DDL statements enable you to
create the data mining database so that you can use the SQL/MX features shown in
this manual.
-------------------------------------------------------------
-- Data mining database in catalog dmcat and schema whse
-- Run this script in MXCI
-------------------------------------------------------------
CREATE CATALOG dmcat;
CREATE SCHEMA dmcat.whse;
SET SCHEMA dmcat.whse;
-- Create tables CUSTOMERS and ACCT_HISTORY in WHSE schema
-- Create CUSTOMERS table in WHSE schema
DROP TABLE customers;
CREATE TABLE customers
( account          NUMERIC (7) UNSIGNED NO DEFAULT
                   NOT NULL NOT DROPPABLE
                   HEADING 'Account Number'
 ,first_name       CHARACTER (15) DEFAULT ' '
                   NOT NULL NOT DROPPABLE
                   HEADING 'First Name'
 ,last_name        CHARACTER (20) DEFAULT ' '
                   NOT NULL NOT DROPPABLE
                   HEADING 'Last Name'
 ,marital_status   CHARACTER (8) DEFAULT NULL
                   HEADING 'Marital Status'
 ,home             CHARACTER (4) DEFAULT NULL
                   HEADING 'Home'
 ,income           NUMERIC (8, 2) UNSIGNED DEFAULT NULL
                   HEADING 'Income'
 ,gender           CHAR(1) DEFAULT NULL
 ,age              NUMERIC (3) DEFAULT NULL
                   HEADING 'Age'
 ,number_children  NUMERIC (2) DEFAULT NULL
                   HEADING 'Number of Children'
 ,PRIMARY KEY (account) NOT DROPPABLE
)
LOCATION $P2
PARTITION (ADD FIRST KEY 3000000 LOCATION $VOLUME,
           ADD FIRST KEY 5000000 LOCATION $P1);
-- Set constraint on home column; must be Rent or Own or NULL
ALTER TABLE customers
ADD CONSTRAINT home_constraint
CHECK (home = 'Own' OR home = 'Rent' OR home IS NULL);
-- Set constraint on marital status column; must be Divorced,
-- Married, Widow, Single or NULL
ALTER TABLE customers
ADD CONSTRAINT ms_constraint
CHECK (marital_status = 'Divorced' OR
marital_status = 'Married' OR
marital_status = 'Single' OR
marital_status = 'Widow' OR
marital_status IS NULL);
-- Set constraint on gender column; must be F, M or NULL
ALTER TABLE customers
ADD CONSTRAINT gender_constraint
CHECK (gender = 'F' OR gender = 'M' OR gender IS NULL);
-- Create the ACCT_HISTORY table in WHSE schema
DROP TABLE acct_history;
CREATE TABLE acct_history
( account
NUMERIC (7) UNSIGNED
NO DEFAULT
NOT NULL NOT DROPPABLE
,year_month
DATE
NO DEFAULT
NOT NULL NOT DROPPABLE
,status
CHAR (10)
NO DEFAULT
NOT NULL NOT DROPPABLE
,cust_limit
NUMERIC (9,2)
NO DEFAULT
NOT NULL NOT DROPPABLE
,balance
NUMERIC (9,2)
NO DEFAULT
NOT NULL NOT DROPPABLE
,payment
NUMERIC (9,2)
NO DEFAULT
NOT NULL NOT DROPPABLE
,finance_charge
NUMERIC (9,2)
NO DEFAULT
NOT NULL NOT DROPPABLE
,PRIMARY KEY (account, year_month)
)
LOCATION $P2
PARTITION (ADD FIRST KEY 3000000 LOCATION $VOLUME,
ADD FIRST KEY 5000000 LOCATION $P1);
-- Set constraint on status column; must be Open,
-- Delinquent,or Closed
ALTER TABLE acct_history
ADD CONSTRAINT status_constraint
CHECK (status = 'Open' OR
status = 'Delinquent' OR
status = 'Closed');
-------------------------------------------------------------
B
Inserting Into the Data Mining Database
The following INSERT statements enable you to populate the data mining database.
Use the following script to populate the CUSTOMERS table and the ACCT_HISTORY
table:
-------------------------------------------------------------
-- Data mining database in catalog dmcat and schema whse
-- Run this script in MXCI
-------------------------------------------------------------
-- POPULATE THE DATA MINING DATABASE TABLES
SET SCHEMA dmcat.whse;
INSERT INTO customers VALUES
(1234567,'MARY',   'JONES',   'Single',  'Own',  65000,'F',34,0),
(2500000,'ALI',    'ABBAS',   'Divorced','Rent', 32000,'M',23,0),
(4098124,'TOMOKO', 'KANO',    'Divorced','Own',  44000,'M',44,2),
(2400000,'ERIKA',  'LUND',    'Widow',   'Own',  28000,'F',65,0),
(1000000,'ROGER',  'GREEN',   'Married', 'Own', 175500,'M',45,3),
(2300000,'JERRY',  'HOWARD',  'Divorced','Own', 137000,'M',42,2),
(2900000,'JANE',   'RAYMOND', 'Divorced','Rent',136000,'F',50,0),
(3200000,'THOMAS', 'RUDLOFF', 'Divorced','Rent',138000,'M',40,1),
(3900000,'KLAUS',  'SAFFERT', 'Divorced','Own',  75000,'M',40,2),
(4300000,'DEBBIE', 'DUNN',    'Married', 'Own', 300000,'F',29,2),
(4400000,'HANNAH', 'ROSE',    'Single',  'Own', 300000,'F',29,0),
(4500000,'LIZ',    'STONE',   'Married', 'Own', 300000,'F',29,1),
(4600000,'HANS',   'NOBLE',   'Single',  'Own', 300000,'M',48,0),
(4700000,'SEAN',   'FREDRICK','Widow',   'Own', 300000,'M',68,0),
(5000000,'CYNTHIA','TREBLE',  'Single',  'Own',  65000,'F',34,0),
(5200000,'FRANK',  'KIRBY',   'Married', 'Rent', 32000,'M',23,0),
(5300000,'ROBERT', 'HOLDER',  'Divorced','Own',  44000,'M',44,2),
(6000000,'VALERIE','RECORD',  'Widow',   'Own',  28000,'F',65,0),
(7000000,'KARL',   'SMITH',   'Married', 'Own', 175500,'M',45,3),
(7100000,'BRADLEY','RAY',     'Widow',   'Own', 137000,'M',42,2),
(7200000,'KIRSTEN','HOWARD',  'Married', 'Rent',136000,'F',50,0),
(7300000,'GERALD', 'CACHMAN', 'Divorced','Rent',138000,'M',40,1),
(7400000,'MILES',  'KOCH',    'Divorced','Own',  75000,'M',40,2),
(7500000,'SYDNEY', 'NICOLE',  'Single',  'Own', 200000,'F',25,0),
(7600000,'ERIN',   'MCDONALD','Single',  'Own',  65000,'F',34,0),
(7700000,'MATT',   'STEVENS', 'Married', 'Rent', 32000,'M',23,0),
(7800000,'SANDY',  'MILLER',  'Divorced','Own',  44000,'M',44,2),
(7900000,'LAUREN', 'LITTLE',  'Widow',   'Own',  28000,'F',65,0),
(8000000,'BRENT',  'BLACK',   'Married', 'Own', 175500,'M',45,3),
(8100000,'STEVEN', 'HUFF',    'Widow',   'Own', 137000,'M',42,2),
(8200000,'ELLIE',  'RAYMOND', 'Married', 'Rent',136000,'F',50,0),
(8300000,'PATRICK','ZORO',    'Divorced','Rent',138000,'M',40,1),
(8400000,'SHAWN',  'JONES',   'Divorced','Own',  75000,'M',40,2),
(8500000,'ABBIE',  'LAUREN',  'Single',  'Own', 200000,'F',19,0),
(8600000,'ELSIE',  'VANDER',  'Single',  'Own', 200000,'F',30,0),
(8700000,'SARAH',  'PETERS',  'Single',  'Own', 200000,'F',19,0),
(8800000,'ALLIE',  'BOWERS',  'Single',  'Own', 200000,'F',40,0),
(8900000,'KELSEY', 'SMITH',   'Single',  'Own', 200000,'F',28,0),
(9000000,'KIM',    'TENNEL',  'Single',  'Own', 200000,'F',56,0),
(9100000,'TJ',     'CASWELL', 'Single',  'Own', 200000,'M',25,0),
(9200000,'HELEN',  'SPOTS',   'Single',  'Own', 200000,'F',29,0),
(9300000,'JOHN',   'MOORE',   'Single',  'Own', 200000,'M',43,0);
-- Insert into ACCT_HISTORY 12 to 36 records per account
INSERT INTO acct_history VALUES
(1234567,DATE '2003-01-01','Open',10000,1232.50,1232.50,0.00),
(1234567,DATE '2003-02-01','Open',10000,3000.00,3000.00,0.00),
(1234567,DATE '2003-03-01','Open',10000,1034.00,1034.00,0.00),
(1234567,DATE '2003-04-01','Open',10000,2500.00,2500.00,0.00),
(1234567,DATE '2003-05-01','Open',10000,1050.00,1050.00,0.00),
(1234567,DATE '2003-06-01','Open',10000,6500.00,6500.00,0.00),
(1234567,DATE '2003-07-01','Open',10000,3000.00,3000.00,0.00),
(1234567,DATE '2003-08-01','Open',10000,7800.00,7800.00,0.00),
(1234567,DATE '2003-09-01','Open',10000,3000.00,3000.00,0.00),
(1234567,DATE '2003-10-01','Open',10000,2870.00,2870.00,0.00),
(1234567,DATE '2003-11-01','Open',10000,1200.00,1200.00,0.00),
(1234567,DATE '2003-12-01','Closed',10000,500.00,500.00,0.00);
INSERT INTO acct_history VALUES
(2500000,DATE '2002-07-01','Open', 5000, 566.00,  32.00,  8.00),
(2500000,DATE '2002-08-01','Open', 5000, 600.00,  40.00,  9.23),
(2500000,DATE '2002-09-01','Open', 5000, 632.00,  32.00,  8.00),
(2500000,DATE '2002-10-01','Open', 5000, 615.00,  31.00,  8.00),
(2500000,DATE '2002-11-01','Open', 5000, 670.00,  42.00, 10.40),
(2500000,DATE '2002-12-01','Open', 5000, 650.00,  37.00, 10.00),
(2500000,DATE '2003-01-01','Open', 5000, 703.00,  50.00, 13.00),
(2500000,DATE '2003-02-01','Open', 5000, 723.00,  23.00, 14.00),
(2500000,DATE '2003-03-01','Open', 5000, 700.00,  20.00, 13.75),
(2500000,DATE '2003-04-01','Open', 5000, 745.00,  22.00, 13.60),
(2500000,DATE '2003-05-01','Open', 5000, 745.00,      0, 89.40),
(2500000,DATE '2003-06-01','Open', 5000, 834.40,     75,100.28),
(2500000,DATE '2003-07-01','Open', 5000, 834.40, 834.40,     0),
(2500000,DATE '2003-08-01','Open', 5000,      0,      0,     0),
(2500000,DATE '2003-09-01','Open', 5000,      0,      0,     0),
(2500000,DATE '2003-10-01','Open', 5000,      0,      0,     0),
(2500000,DATE '2003-11-01','Open', 5000,      0,      0,     0),
(2500000,DATE '2003-12-01','Open', 5000,      0,      0,     0);
INSERT INTO acct_history VALUES
(4098124,DATE '2000-10-01','Open', 6000,32000.00,3200.00,0.00),
(4098124,DATE '2000-11-01','Open', 6000, 2300.00,2300.00,0.00),
(4098124,DATE '2000-12-01','Open', 6000,    0.00,    0.00,0.00),
(4098124,DATE '2001-01-01','Open', 6000,    0.00,    0.00,0.00),
(4098124,DATE '2001-02-01','Open', 6000, 4000.00,4000.00,0.00),
(4098124,DATE '2001-03-01','Open', 6000, 1200.00,1200.00,0.00),
(4098124,DATE '2001-04-01','Open', 6000, 320.00, 320.00,0.00),
(4098124,DATE '2001-05-01','Open', 6000, 1000.00,1000.00,0.00),
(4098124,DATE '2001-06-01','Open', 6000, 2300.00,2300.00,0.00),
(4098124,DATE '2001-07-01','Open', 6000, 1200.00,1200.00,0.00),
(4098124,DATE '2001-08-01','Open', 6000, 5400.00,400.00,500.00),
(4098124,DATE '2001-09-01','Open', 6000, 5300.00,300.00,550.00),
(4098124,DATE '2001-10-01','Open', 6000, 6000.00,800.00,720.00),
(4098124,DATE '2001-11-01','Open', 6000, 5920.00,800.00,710.40),
(4098124,DATE '2001-12-01','Open', 6000, 5830.40,800.00,699.65),
(4098124,DATE '2002-01-01','Open', 6000, 5730.04, 800, 687.60),
(4098124,DATE '2002-02-01','Open', 6000, 5617.65, 800, 674.11),
(4098124,DATE '2002-03-01','Open', 6000, 5491.77, 800, 659.01),
(4098124,DATE '2002-04-01','Open', 6000, 5350.78, 800, 642.09),
(4098124,DATE '2002-05-01','Open', 6000, 5192.87, 800, 623.14),
(4098124,DATE '2002-06-01','Open', 6000, 5016.02, 800, 601.92),
(4098124,DATE '2002-07-01','Open', 6000, 4817.94, 800, 578.15),
(4098124,DATE '2002-08-01','Open', 6000, 4596.10, 800, 551.53),
(4098124,DATE '2002-09-01','Open', 6000, 4347.63, 800, 521.71),
(4098124,DATE '2002-10-01','Closed',6000,4069.34, 800, 488.32);
INSERT INTO acct_history VALUES
(2400000,DATE '2002-01-01','Open', 5000,  50.00,  50.00, 0.00),
(2400000,DATE '2002-02-01','Open', 5000, 100.00,  50.00, 0.75),
(2400000,DATE '2002-03-01','Open', 5000,  50.00,  50.00, 0.00),
(2400000,DATE '2002-04-01','Open', 5000, 380.00,  50.00, 4.95),
(2400000,DATE '2002-05-01','Open', 5000, 330.00,  60.00, 4.05),
(2400000,DATE '2002-06-01','Open', 5000, 430.45,  55.00, 5.63),
(2400000,DATE '2002-07-01','Open', 5000, 470.34,  55.00, 6.23),
(2400000,DATE '2002-08-01','Open', 5000, 545.00,  60.00, 7.27),
(2400000,DATE '2002-09-01','Open', 5000, 490.67, 490.67,    0),
(2400000,DATE '2002-10-01','Open', 5000,   0.00,   0.00, 0.00),
(2400000,DATE '2002-11-01','Open', 5000,   0.00,   0.00, 0.00),
(2400000,DATE '2002-12-01','Closed',5000,  0.00,   0.00, 0.00);
INSERT INTO acct_history VALUES
(1000000,DATE '2002-07-01','Open',20000,3678.67,3678.67, 0.00),
(1000000,DATE '2002-08-01','Open',20000,6780.00,6780.00, 0.00),
(1000000,DATE '2002-09-01','Open',20000,2300.78,2300.78, 0.00),
(1000000,DATE '2002-10-01','Open',20000,8000.00,8000.00, 0.00),
(1000000,DATE '2002-11-01','Open',20000,5345.89,5345.89, 0.00),
(1000000,DATE '2002-12-01','Open',20000,4700.00,4700.00, 0.00),
(1000000,DATE '2003-01-01','Open',20000,1200.00,1200.00, 0.00),
(1000000,DATE '2003-02-01','Delinquent',20000,3500.00,0,51.75),
(1000000,DATE '2003-03-01','Open',20000,5500.00,5500.00, 0.00),
(1000000,DATE '2003-04-01','Open',20000,   0.00,   0.00, 0.00),
(1000000,DATE '2003-05-01','Open',20000,6500.00,6500.00, 0.00),
(1000000,DATE '2003-06-01','Open',20000,4590.00,4590.00, 0.00),
(1000000,DATE '2003-07-01','Open',20000,3200.00,3200.00, 0.00),
(1000000,DATE '2003-08-01','Open',20000, 167.89, 167.89, 0.00),
(1000000,DATE '2003-09-01','Open',20000,9800.00,9800.00, 0.00),
(1000000,DATE '2003-10-01','Open',20000, 50.00, 50.00, 0.00),
(1000000,DATE '2003-11-01','Open',20000,2300.78,2300.78, 0.00),
(1000000,DATE '2003-12-01','Open',20000,5600.00,5600.00, 0.00);
INSERT INTO acct_history VALUES
(2300000,DATE '2002-11-01','Open',15000, 0,0,0),
(2300000,DATE '2002-12-01','Open',15000,10000.00,1500.00,127.5),
(2300000,DATE '2003-01-01','Open',15000,9500.00,1500.00,120.00),
(2300000,DATE '2003-02-01','Open',15000,8120.00,1500.00, 99.30),
(2300000,DATE '2003-03-01','Open',15000,12000.00,4000.00,120),
(2300000,DATE '2003-04-01','Open',15000,8120.00,4000.00,61.80),
(2300000,DATE '2003-05-01','Open',15000,5004.00,1500.00,52.56),
(2300000,DATE '2003-06-01','Open',15000,3500.00,1500.00,30.00),
(2300000,DATE '2003-07-01','Open',15000,4500.00, 800.00,55.50),
(2300000,DATE '2003-08-01','Open',15000,3800.00,1500.00,34.50),
(2300000,DATE '2003-09-01','Open',15000, 0, 0, 0),
(2300000,DATE '2003-10-01','Open',15000, 0, 0, 0),
(2300000,DATE '2003-11-01','Open',15000, 0, 0, 0),
(2300000,DATE '2003-12-01','Open',15000, 0, 0, 0);
INSERT INTO acct_history VALUES
(2900000,DATE '2003-01-01','Open',15000,10000.00,10000.00, 0),
(2900000,DATE '2003-02-01','Open',15000, 3456.00, 3456.00, 0),
(2900000,DATE '2003-03-01','Open',15000, 2300.90, 2300.90, 0),
(2900000,DATE '2003-04-01','Open',15000, 9432.78, 9432.78, 0),
(2900000,DATE '2003-05-01','Open',15000, 1134.00, 1134.00, 0),
(2900000,DATE '2003-06-01','Open',15000, 2356.80, 2356.80, 0),
(2900000,DATE '2003-07-01','Open',15000, 9870.00, 9870.00, 0),
(2900000,DATE '2003-08-01','Open',15000, 8765.00, 8765.00, 0),
(2900000,DATE '2003-09-01','Open',15000, 2460.00, 2460.00, 0),
(2900000,DATE '2003-10-01','Open',15000, 4543.00, 4543.00, 0),
(2900000,DATE '2003-11-01','Open',15000, 2000.00, 2000.00, 0),
(2900000,DATE '2003-12-01','Open',15000, 5890.00, 5890.00, 0);
INSERT INTO acct_history VALUES
(3200000,DATE '2002-07-01','Open',10000, 2345.00, 2345.00, 0),
(3200000,DATE '2002-08-01','Open',10000,       0,       0, 0),
(3200000,DATE '2002-09-01','Open',10000,  150.00,  150.00, 0),
(3200000,DATE '2002-10-01','Open',10000, 5678.00, 5678.00, 0),
(3200000,DATE '2002-11-01','Open',10000, 2000.00, 2000.00, 0),
(3200000,DATE '2002-12-01','Open',10000,   50.00,   50.00, 0),
(3200000,DATE '2003-01-01','Open',10000,       0,       0, 0),
(3200000,DATE '2003-02-01','Open',10000,  800.00,  800.00, 0),
(3200000,DATE '2003-03-01','Open',10000,       0,       0, 0),
(3200000,DATE '2003-04-01','Open',10000,       0,       0, 0),
(3200000,DATE '2003-05-01','Open',10000,       0,       0, 0),
(3200000,DATE '2003-06-01','Open',10000,       0,       0, 0);
INSERT INTO acct_history VALUES
(3900000,DATE '2001-12-01','Open', 5000,  800.00,  800.00, 0),
(3900000,DATE '2002-01-01','Open', 5000,  300.00,  300.00, 0),
(3900000,DATE '2002-02-01','Open', 5000,  230.00,  230.00, 0),
(3900000,DATE '2002-03-01','Open', 5000,  789.00,  789.00, 0),
(3900000,DATE '2002-04-01','Open', 5000,  600.00,  600.00, 0),
(3900000,DATE '2002-05-01','Open', 5000,  500.00,  500.00, 0),
(3900000,DATE '2002-06-01','Open', 5000, 1800.00, 1800.00, 0),
(3900000,DATE '2002-07-01','Open', 5000, 4800.00, 4800.00, 0),
(3900000,DATE '2002-08-01','Open', 5000,       0,       0, 0),
(3900000,DATE '2002-09-01','Open', 5000,       0,       0, 0),
(3900000,DATE '2002-10-01','Open', 5000,       0,       0, 0),
(3900000,DATE '2002-11-01','Open', 5000,       0,       0, 0);
INSERT INTO acct_history VALUES
(4300000,DATE '2003-01-01','Open',40000,        0,        0, 0),
(4300000,DATE '2003-02-01','Open',40000, 18000.00, 18000.00, 0),
(4300000,DATE '2003-03-01','Open',40000,   459.99,   459.99, 0),
(4300000,DATE '2003-04-01','Open',40000,  9876.00,  9876.00, 0),
(4300000,DATE '2003-05-01','Open',40000,  4354.00,  4354.00, 0),
(4300000,DATE '2003-06-01','Open',40000,  9000.00,  9000.00, 0),
(4300000,DATE '2003-07-01','Open',40000,        0,        0, 0),
(4300000,DATE '2003-08-01','Open',40000,  6700.00,  6700.00, 0),
(4300000,DATE '2003-09-01','Open',40000,  7800.00,  7800.00, 0),
(4300000,DATE '2003-10-01','Open',40000,  1200.00,  1200.00, 0),
(4300000,DATE '2003-11-01','Open',40000,  8000.00,  8000.00, 0),
(4300000,DATE '2003-12-01','Open',40000,  9050.00,  9050.00, 0);
INSERT INTO acct_history VALUES
(4400000,DATE '2003-01-01','Open',40000,   0.00,  00.00, 0),
(4400000,DATE '2003-02-01','Open',40000, 100.00, 100.00, 0),
(4400000,DATE '2003-03-01','Open',40000,  50.00,  50.00, 0),
(4400000,DATE '2003-04-01','Open',40000,  90.00,  90.00, 0),
(4400000,DATE '2003-05-01','Open',40000,   0.00,   0.00, 0),
(4400000,DATE '2003-06-01','Open',40000,   0.00,   0.00, 0),
(4400000,DATE '2003-07-01','Open',40000,   0.00,   0.00, 0),
(4400000,DATE '2003-08-01','Open',40000,   0.00,   0.00, 0),
(4400000,DATE '2003-09-01','Open',40000,   0.00,   0.00, 0),
(4400000,DATE '2003-10-01','Open',40000,   0.00,   0.00, 0),
(4400000,DATE '2003-11-01','Open',40000,   0.00,   0.00, 0),
(4400000,DATE '2003-12-01','Open',40000,   0.00,   0.00, 0);
INSERT INTO acct_history VALUES
(4500000,DATE '2002-07-01','Open',  40000,  50.00,  50.00, 0),
(4500000,DATE '2002-08-01','Open',  40000, 100.00, 100.00, 0),
(4500000,DATE '2002-09-01','Closed',40000,   0.00,   0.00, 0);
INSERT INTO acct_history VALUES
(4600000,DATE '2003-01-01','Open',40000,   30.00,   30.00, 0),
(4600000,DATE '2003-02-01','Open',40000,   30.00,   30.00, 0),
(4600000,DATE '2003-03-01','Open',40000,   30.00,   30.00, 0),
(4600000,DATE '2003-04-01','Open',40000,   30.00,   30.00, 0),
(4600000,DATE '2003-05-01','Open',40000,   30.00,   30.00, 0),
(4600000,DATE '2003-06-01','Open',40000,   30.00,   30.00, 0),
(4600000,DATE '2003-07-01','Open',40000,   30.00,   30.00, 0),
(4600000,DATE '2003-08-01','Open',40000,   60.00,   60.00, 0),
(4600000,DATE '2003-09-01','Open',40000,  700.00,  700.00, 0),
(4600000,DATE '2003-10-01','Open',40000,   80.00,   80.00, 0),
(4600000,DATE '2003-11-01','Open',40000,   50.00,   50.00, 0),
(4600000,DATE '2003-12-01','Closed',40000,1000.00, 1000.00, 0);
INSERT INTO acct_history VALUES
(4700000,DATE '2003-01-01','Open',40000,  330.00,  330.00, 0),
(4700000,DATE '2003-02-01','Open',40000,  330.00,  330.00, 0),
(4700000,DATE '2003-03-01','Open',40000,  330.00,  330.00, 0),
(4700000,DATE '2003-04-01','Open',40000,  330.00,  330.00, 0),
(4700000,DATE '2003-05-01','Open',40000,  330.00,  330.00, 0),
(4700000,DATE '2003-06-01','Open',40000,  330.00,  330.00, 0),
(4700000,DATE '2003-07-01','Open',40000,  330.00,  330.00, 0),
(4700000,DATE '2003-08-01','Open',40000,  650.00,  650.00, 0),
(4700000,DATE '2003-09-01','Open',40000,  710.00,  710.00, 0),
(4700000,DATE '2003-10-01','Open',40000,  807.00,  807.00, 0),
(4700000,DATE '2003-11-01','Open',40000,  509.00,  509.00, 0),
(4700000,DATE '2003-12-01','Open',40000, 1000.00, 1000.00, 0);
C
Importing Into the Data Mining Database
The format file, data file, and the import command provided in this appendix enable
you to populate the data mining database. You cannot execute the import utility
command through MXCI or in programs. You must run import at the command prompt.
For further information, see the import Utility entry in the NonStop SQL/MX Reference
Manual.
Importing Customers Data
The import command for the Customers table looks like this:
IMPORT dmcat.whse.customers
-I importdatac.txt -U importfmtc.txt
Customers Format File
This file is named importfmtc.txt and is the format file specified in the preceding
IMPORT command:
[DATE FORMAT]
DecimalSymbol=.
[COLUMN FORMAT]
col=account,N
col=first_name,N
col=last_name,N
col=marital_status,N
col=home,N
col=income,N
col=gender,N
col=age,N
col=number_children,N
[DELIMITED FORMAT]
FieldDelimiter=,
Customers Data File
This file is named importdatac.txt and is the data file specified in the preceding
import command:
1234567,MARY,JONES,Single,Own,65000,F,34,0
2500000,ALI,ABBAS,Divorced,Rent,32000,M,23,0
4098124,TOMOKO,KANO,Divorced,Own,44000,M,44,2
2400000,ERIKA,LUND,Widow,Own,28000,F,65,0
1000000,ROGER,GREEN,Married,Own,175500,M,45,3
2300000,JERRY,HOWARD,Divorced,Own,137000,M,42,2
2900000,JANE,RAYMOND,Divorced,Rent,136000,F,50,0
3200000,THOMAS,RUDLOFF,Divorced,Rent,138000,M,40,1
3900000,KLAUS,SAFFERT,Divorced,Own,75000,M,40,2
4300000,DEBBIE,DUNN,Married,Own,300000,F,29,2
4400000,HANNAH,ROSE,Single,Own,300000,F,29,0
4500000,LIZ,STONE,Married,Own,300000,F,29,1
4600000,HANS,NOBLE,Single,Own,300000,M,48,0
4700000,SEAN,FREDRICK,Widow,Own,300000,M,68,0
5000000,CYNTHIA,TREBLE,Single,Own,65000,F,34,0
5200000,FRANK,KIRBY,Married,Rent,32000,M,23,0
5300000,ROBERT,HOLDER,Divorced,Own,44000,M,44,2
6000000,VALERIE,RECORD,Widow,Own,28000,F,65,0
7000000,KARL,SMITH,Married,Own,175500,M,45,3
7100000,BRADLEY,RAY,Widow,Own,137000,M,42,2
7200000,KIRSTEN,HOWARD,Married,Rent,136000,F,50,0
7300000,GERALD,CACHMAN,Divorced,Rent,138000,M,40,1
7400000,MILES,KOCH,Divorced,Own,75000,M,40,2
7500000,SYDNEY,NICOLE,Single,Own,200000,F,25,0
7600000,ERIN,MCDONALD,Single,Own,65000,F,34,0
7700000,MATT,STEVENS,Married,Rent,32000,M,23,0
7800000,SANDY,MILLER,Divorced,Own,44000,M,44,2
7900000,LAUREN,LITTLE,Widow,Own,28000,F,65,0
8000000,BRENT,BLACK,Married,Own,175500,M,45,3
8100000,STEVEN,HUFF,Widow,Own,137000,M,42,2
8200000,ELLIE,RAYMOND,Married,Rent,136000,F,50,0
8300000,PATRICK,ZORO,Divorced,Rent,138000,M,40,1
8400000,SHAWN,JONES,Divorced,Own,75000,M,40,2
8500000,ABBIE,LAUREN,Single,Own,200000,F,19,0
8600000,ELSIE,VANDER,Single,Own,200000,F,30,0
8700000,SARAH,PETERS,Single,Own,200000,F,19,0
8800000,ALLIE,BOWERS,Single,Own,200000,F,40,0
8900000,KELSEY,SMITH,Single,Own,200000,F,28,0
9000000,KIM,TENNEL,Single,Own,200000,F,56,0
9100000,TJ,CASWELL,Single,Own,200000,M,25,0
9200000,HELEN,SPOTS,Single,Own,200000,F,29,0
9300000,JOHN,MOORE,Single,Own,200000,M,43,0
Importing Account History Data
The import command for the Account History table looks like this:
import dmcat.whse.acct_history
-I importdataa.txt -U importfmta.txt
Account History Format File
This file is named importfmta.txt and is the format file specified in the preceding
import command:
[DATE FORMAT]
DateOrder=YMD
DateDelimiter=-
[COLUMN FORMAT]
col=account,N
col=year_month,N
col=status,N
col=cust_limit,N
col=balance,N
col=payment,N
col=finance_charge,N
[DELIMITED FORMAT]
FieldDelimiter=,
Account History Data File
This file is named importdataa.txt and is the data file specified in the preceding
import command:
1234567,2003-01-01,Open,10000,1232.50,1232.50,0.00
1234567,2003-02-01,Open,10000,3000.00,3000.00,0.00
1234567,2003-03-01,Open,10000,1034.00,1034.00,0.00
1234567,2003-04-01,Open,10000,2500.00,2500.00,0.00
1234567,2003-05-01,Open,10000,1050.00,1050.00,0.00
1234567,2003-06-01,Open,10000,6500.00,6500.00,0.00
1234567,2003-07-01,Open,10000,3000.00,3000.00,0.00
1234567,2003-08-01,Open,10000,7800.00,7800.00,0.00
1234567,2003-09-01,Open,10000,3000.00,3000.00,0.00
1234567,2003-10-01,Open,10000,2870.00,2870.00,0.00
1234567,2003-11-01,Open,10000,1200.00,1200.00,0.00
1234567,2003-12-01,Closed,10000,500.00,500.00,0.00
2500000,2002-07-01,Open, 5000,566.00,32.00,8.00
2500000,2002-08-01,Open,5000,600.00,40.00,9.23
2500000,2002-09-01,Open,5000,632.00,32.00,8.00
2500000,2002-10-01,Open,5000,615.00,31.00,8.00
2500000,2002-11-01,Open,5000,670.00,42.00,10.40
2500000,2002-12-01,Open,5000,650.00,37.00,10.00
2500000,2003-01-01,Open,5000,703.00,50.00,13.00
2500000,2003-02-01,Open,5000,723.00,23.00,14.00
2500000,2003-03-01,Open,5000,700.00,20.00,13.75
2500000,2003-04-01,Open,5000,745.00,22.00,13.60
2500000,2003-05-01,Open,5000,745.00,0,89.40
2500000,2003-06-01,Open,5000,834.40,75,100.28
2500000,2003-07-01,Open,5000,834.40,834.40,0
2500000,2003-08-01,Open,5000,0,0,0
2500000,2003-09-01,Open,5000,0,0,0
2500000,2003-10-01,Open,5000,0,0,0
2500000,2003-11-01,Open,5000,0,0,0
2500000,2003-12-01,Open,5000,0,0,0
4098124,2000-10-01,Open,6000,32000.00,3200.00,0.00
4098124,2000-11-01,Open,6000,2300.00,2300.00,0.00
4098124,2000-12-01,Open,6000,0.00,0.00,0.00
4098124,2001-01-01,Open,6000,0.00,0.00,0.00
4098124,2001-02-01,Open,6000,4000.00,4000.00,0.00
4098124,2001-03-01,Open,6000,1200.00,1200.00,0.00
4098124,2001-04-01,Open,6000,320.00,320.00,0.00
4098124,2001-05-01,Open,6000,1000.00,1000.00,0.00
4098124,2001-06-01,Open,6000,2300.00,2300.00,0.00
4098124,2001-07-01,Open,6000,1200.00,1200.00,0.00
4098124,2001-08-01,Open,6000,5400.00,400.00,500.00
4098124,2001-09-01,Open,6000,5300.00,300.00,550.00
4098124,2001-10-01,Open,6000,6000.00,800.00,720.00
4098124,2001-11-01,Open,6000,5920.00,800.00,710.40
4098124,2001-12-01,Open,6000,5830.40,800.00,699.65
4098124,2002-01-01,Open,6000,5730.04,800,687.60
4098124,2002-02-01,Open,6000,5617.65,800,674.11
4098124,2002-03-01,Open,6000,5491.77,800,659.01
4098124,2002-04-01,Open,6000,5350.78,800,642.09
4098124,2002-05-01,Open,6000,5192.87,800,623.14
4098124,2002-06-01,Open,6000,5016.02,800,601.92
4098124,2002-07-01,Open,6000,4817.94,800,578.15
4098124,2002-08-01,Open,6000,4596.10,800,551.53
4098124,2002-09-01,Open,6000,4347.63,800,521.71
4098124,2002-10-01,Closed,6000,4069.34,800,488.32
2400000,2002-01-01,Open,5000,50.00,50.00,0.00
2400000,2002-02-01,Open,5000,100.00,50.00,0.75
2400000,2002-03-01,Open,5000,50.00,50.00,0.00
2400000,2002-04-01,Open,5000,380.00,50.00,4.95
2400000,2002-05-01,Open,5000,330.00,60.00,4.05
2400000,2002-06-01,Open,5000,430.45,55.00,5.63
2400000,2002-07-01,Open,5000,470.34,55.00,6.23
2400000,2002-08-01,Open,5000,545.00,60.00,7.27
2400000,2002-09-01,Open,5000,490.67,490.67,0
2400000,2002-10-01,Open,5000,0.00,0.00,0.00
2400000,2002-11-01,Open,5000,0.00,0.00,0.00
2400000,2002-12-01,Closed,5000,0.00,0.00,0.00
1000000,2002-07-01,Open,20000,3678.67,3678.67,0.00
1000000,2002-08-01,Open,20000,6780.00,6780.00,0.00
1000000,2002-09-01,Open,20000,2300.78,2300.78,0.00
1000000,2002-10-01,Open,20000,8000.00,8000.00,0.00
1000000,2002-11-01,Open,20000,5345.89,5345.89,0.00
1000000,2002-12-01,Open,20000,4700.00,4700.00,0.00
1000000,2003-01-01,Open,20000,1200.00,1200.00,0.00
1000000,2003-02-01,Delinquent,20000,3500.00,0.00,51.75
1000000,2003-03-01,Open,20000,5500.00,5500.00,0.00
1000000,2003-04-01,Open,20000,0.00,0.00,0.00
1000000,2003-05-01,Open,20000,6500.00,6500.00,0.00
1000000,2003-06-01,Open,20000,4590.00,4590.00,0.00
1000000,2003-07-01,Open,20000,3200.00,3200.00,0.00
1000000,2003-08-01,Open,20000,167.89,167.89,0.00
1000000,2003-09-01,Open,20000,9800.00,9800.00,0.00
1000000,2003-10-01,Open,20000,50.00,50.00,0.00
1000000,2003-11-01,Open,20000,2300.78,2300.78,0.00
1000000,2003-12-01,Open,20000,5600.00,5600.00,0.00
2300000,2002-11-01,Open,15000,0,0,0
2300000,2002-12-01,Open,15000,10000.00,1500.00,127.5
2300000,2003-01-01,Open,15000,9500.00,1500.00,120.00
2300000,2003-02-01,Open,15000,8120.00,1500.00,99.30
2300000,2003-03-01,Open,15000,12000.00,4000.00,120.00
2300000,2003-04-01,Open,15000,8120.00,4000.00,61.80
2300000,2003-05-01,Open,15000,5004.00,1500.00,52.56
2300000,2003-06-01,Open,15000,3500.00,1500.00,30.00
2300000,2003-07-01,Open,15000,4500.00,800.00,55.50
2300000,2003-08-01,Open,15000,3800.00,1500.00,34.50
2300000,2003-09-01,Open,15000,0,0,0
2300000,2003-10-01,Open,15000,0,0,0
2300000,2003-11-01,Open,15000,0,0,0
2300000,2003-12-01,Open,15000,0,0,0
2900000,2003-01-01,Open,15000,10000.00,10000.00,0
2900000,2003-02-01,Open,15000,3456.00,3456.00,0
2900000,2003-03-01,Open,15000,2300.90,2300.90,0
2900000,2003-04-01,Open,15000,9432.78,9432.78,0
2900000,2003-05-01,Open,15000,1134.00,1134.00,0
2900000,2003-06-01,Open,15000,2356.80,2356.80,0
2900000,2003-07-01,Open,15000,9870.00,9870.00,0
2900000,2003-08-01,Open,15000,8765.00,8765.00,0
2900000,2003-09-01,Open,15000,2460.00,2460.00,0
2900000,2003-10-01,Open,15000,4543.00,4543.00,0
2900000,2003-11-01,Open,15000,2000.00,2000.00,0
2900000,2003-12-01,Open,15000,5890.00,5890.00,0
3200000,2002-07-01,Open,10000,2345.00,2345.00,0
3200000,2002-08-01,Open,10000,0,0,0
3200000,2002-09-01,Open,10000,150.00,150.00,0
3200000,2002-10-01,Open,10000,5678.00,5678.00,0
3200000,2002-11-01,Open,10000,2000.00,2000.00,0
3200000,2002-12-01,Open,10000,50.00,50.00,0
3200000,2003-01-01,Open,10000,0,0,0
3200000,2003-02-01,Open,10000,800.00,800.00,0
3200000,2003-03-01,Open,10000,0,0,0
3200000,2003-04-01,Open,10000,0,0,0
3200000,2003-05-01,Open,10000,0,0,0
3200000,2003-06-01,Open,10000,0,0,0
3900000,2001-12-01,Open,5000,800.00,800.00,0
3900000,2002-01-01,Open,5000,300.00,300.00,0
3900000,2002-02-01,Open,5000,230.00,230.00,0
3900000,2002-03-01,Open,5000,789.00,789.00,0
3900000,2002-04-01,Open,5000,600.00,600.00,0
3900000,2002-05-01,Open,5000,500.00,500.00,0
3900000,2002-06-01,Open,5000,1800.00,1800.00,0
3900000,2002-07-01,Open,5000,4800.00,4800.00,0
3900000,2002-08-01,Open,5000,0,0,0
3900000,2002-09-01,Open,5000,0,0,0
3900000,2002-10-01,Open,5000,0,0,0
3900000,2002-11-01,Open,5000,0,0,0
4300000,2003-01-01,Open,40000,0,0,0
4300000,2003-02-01,Open,40000,18000.00,18000.00,0
4300000,2003-03-01,Open,40000,459.99,459.99,0
4300000,2003-04-01,Open,40000,9876.00,9876.00,0
4300000,2003-05-01,Open,40000,4354.00,4354.00,0
4300000,2003-06-01,Open,40000,9000.00,9000.00,0
4300000,2003-07-01,Open,40000,0,0,0
4300000,2003-08-01,Open,40000,6700.00,6700.00,0
4300000,2003-09-01,Open,40000,7800.00,7800.00,0
4300000,2003-10-01,Open,40000,1200.00,1200.00,0
4300000,2003-11-01,Open,40000,8000.00,8000.00,0
4300000,2003-12-01,Open,40000,9050.00,9050.00,0
4400000,2003-01-01,Open,40000,0.00,0.00,0
4400000,2003-02-01,Open,40000,100.00,100.00,0
4400000,2003-03-01,Open,40000,50.00,50.00,0
4400000,2003-04-01,Open,40000,90.00,90.00,0
4400000,2003-05-01,Open,40000,0.00,0.00,0
4400000,2003-06-01,Open,40000,0.00,0.00,0
4400000,2003-07-01,Open,40000,0.00,0.00,0
4400000,2003-08-01,Open,40000,0.00,0.00,0
4400000,2003-09-01,Open,40000,0.00,0.00,0
4400000,2003-10-01,Open,40000,0.00,0.00,0
4400000,2003-11-01,Open,40000,0.00,0.00,0
4400000,2003-12-01,Open,40000,0.00,0.00,0
4500000,2002-07-01,Open,40000,50.00,50.00,0
4500000,2002-08-01,Open,40000,100.00,100.00,0
4500000,2002-09-01,Closed,40000,0.00,0.00,0
4600000,2003-01-01,Open,40000,30.00,30.00,0
4600000,2003-02-01,Open,40000,30.00,30.00,0
4600000,2003-03-01,Open,40000,30.00,30.00,0
4600000,2003-04-01,Open,40000,30.00,30.00,0
4600000,2003-05-01,Open,40000,30.00,30.00,0
4600000,2003-06-01,Open,40000,30.00,30.00,0
4600000,2003-07-01,Open,40000,30.00,30.00,0
4600000,2003-08-01,Open,40000,60.00,60.00,0
4600000,2003-09-01,Open,40000,700.00,700.00,0
4600000,2003-10-01,Open,40000,80.00,80.00,0
4600000,2003-11-01,Open,40000,50.00,50.00,0
4600000,2003-12-01,Closed,40000,1000.00,1000.00,0
4700000,2003-01-01,Open,40000,330.00,330.00,0
4700000,2003-02-01,Open,40000,330.00,330.00,0
4700000,2003-03-01,Open,40000,330.00,330.00,0
4700000,2003-04-01,Open,40000,330.00,330.00,0
4700000,2003-05-01,Open,40000,330.00,330.00,0
4700000,2003-06-01,Open,40000,330.00,330.00,0
4700000,2003-07-01,Open,40000,330.00,330.00,0
4700000,2003-08-01,Open,40000,650.00,650.00,0
4700000,2003-09-01,Open,40000,710.00,710.00,0
4700000,2003-10-01,Open,40000,807.00,807.00,0
4700000,2003-11-01,Open,40000,509.00,509.00,0
4700000,2003-12-01,Open,40000,1000.00,1000.00,0
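After the import completes, a simple query verifies that the rows and dates were loaded and parsed as expected. This is a minimal check only; the count and date range it returns depend on how many of the sample rows above were actually loaded, and the column aliases are illustrative:
SELECT COUNT(*) AS row_count,
       MIN(year_month) AS earliest_month,
       MAX(year_month) AS latest_month
FROM dmcat.whse.acct_history;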
Index
A
Aligning data 1-9, 2-6
Attributes
  cardinality of 1-7, 2-3
  continuous 2-3, 2-5
  deriving 2-9
  discrete 1-7, 2-3
  discrete numeric 2-4
  statistics 2-5
B
Business model
  building 4-2
  checking against database 4-9
  deploying 4-10
  monitoring 4-11
  summarizing results 4-8
Business opportunity
  defining attrition 1-5
  prediction window 1-5
C
COUNT DISTINCT query 1-8
Creating mining view 1-10
D
Data mining database
  creating 2-2, A-1
  importing into 2-2, C-1
  populating B-1
Decision trees
  cross tables 4-2
  dependent variable 4-2
  description of 4-2
  first branch 4-2
  goal definition 4-6
  goal prediction 4-3
  independent variables 4-2
Defining events 1-8, 2-6
Deploying model 1-11, 4-10
K
Knowledge discovery process 1-3
L
Loading data 2-2
M
Metrics, moving 2-9
Mining data set
  Account History table 1-6
  Customers table 1-6
Mining view
  checking model 4-10
  creating 3-2
Monitoring model 4-11
MOVINGAVG function 2-9
O
OFFSET function 2-7, 3-3
P
Pivoting data 3-3
Preparing data 1-7
Profiling data 2-2, 2-5
R
Rankings 2-10
ROWS SINCE function 2-7, 2-9
RUNNINGCOUNT function 2-10
S
SEQUENCE BY clause 2-8, 2-9, 2-10, 3-4
SQL/MX approach, advantages of 1-2
T
THIS function 2-9
TRANSPOSE clause 1-8, 2-4, 2-5, 4-2, 4-4
Transposition 2-3
V
VARIANCE set function 2-3, 2-5