HP NonStop SQL/MX Data Mining Guide

Abstract

This manual presents a nine-step knowledge-discovery process, which was developed over a series of data mining investigations. This manual describes the data structures and operations of the NonStop SQL/MX approach and implementation.

Product Version: NonStop SQL/MX Release 2.0
Supported Release Version Updates (RVUs): This publication supports G06.23 and all subsequent G-series releases until otherwise indicated by its replacement publication.
Part Number: 523737-001    Published: April 2004

Document History

Part Number   Product Version               Published
424397-001    NonStop SQL/MX Release 1.0    February 2001
523737-001    NonStop SQL/MX Release 2.0    April 2004

Contents

What's New in This Manual
About This Manual: Audience; Organization; Related Documentation; Notation Conventions
1. Introduction: The Traditional Approach; The SQL/MX Approach; The Knowledge Discovery Process; Defining the Business Opportunity; Preparing the Data; Creating the Mining View; Mining the Data; Knowledge Deployment and Monitoring
2. Preparing the Data: Loading the Data; Creating the Database; Importing Data Into the Database; Profiling the Data; Cardinalities and Metrics; Transposition; Quick Profiling; Defining Events; Aligning the Data; Deriving Attributes; Moving Metrics; Rankings
3. Creating the Data Mining View: Creating the Single Table; Pivoting the Data
4. Mining the Data: Building the Model; Building Decision Trees; Checking the Model; Applying the Model to the Mining Table; Applying the Model to the Database; Deploying the Model; Monitoring Model Performance
A. Creating the Data Mining Database
B. Inserting Into the Data Mining Database
C. Importing Into the Data Mining Database: Importing Customers Data; Customers Format File; Customers Data File; Importing Account History Data; Account History Format File; Account History Data File

Figures

Figure 4-1. Initial Branches of Decision Tree
Figure 4-2. Decision Tree for Divorced Branch
Figure 4-3. Decision Tree for Single Branch
Figure 4-4. Final Decision Tree

Tables

Table i. Manual Organization

What's New in This Manual

New and Changed Information

This publication has been updated to reflect new product names. Since product names are changing over time, this publication might contain both HP and Compaq product names. Product names in graphic representations are consistent with the current product interface. The technical content of this guide has been updated and reflects the state of the product at the G06.23 RVU.
Previous versions of the guide used the Object Relational Data Mining (ORDM) approach and architecture. ORDM advocates performing data mining and other parts of the knowledge discovery process against data in the SQL/MX database. This technique has been updated: readers are encouraged to perform the data preparation steps in SQL/MX but reserve the mining or model building for UNIX or Microsoft Windows platforms.

Other changes in this edition:

• All sections of the manual have been updated to reflect the impact of major changes in SQL/MX Release 2.0 (for example, the introduction of SQL/MX tables).
• Introductions to the data preparation steps have been revised and rewritten.
• The DDL statements in Appendixes A, B, and C have been updated to use SQL/MX DDL syntax.
• Appendix A syntax has been removed. Readers can consult the SQL/MX Reference Manual for the most current syntax and examples.
• Index entries have been added, updated, and corrected.

About This Manual

This manual presents a nine-step knowledge discovery process, which was developed over a series of data mining investigations. It describes the data structures and operations of the NonStop SQL/MX approach and implementation.

Audience

This manual is intended for database administrators and application programmers who are using NonStop SQL/MX to solve data mining problems, either through the SQL conversational interface or through embedded SQL programs.

Organization

The sections listed in Table i describe the knowledge discovery process (or the data mining process) and present examples that carry out the process. The appendixes listed in Table i provide the syntax for the data mining features of NonStop SQL/MX and the SQL scripts that create the data mining database used in the examples.
Table i. Manual Organization

Section 1, Introduction: Presents an overview of the knowledge discovery process and the SQL/MX approach to this process. Defines the example business opportunity used in this manual.
Section 2, Preparing the Data: Describes the data preparation steps of the knowledge discovery process.
Section 3, Creating the Data Mining View: Describes how to create the mining view.
Section 4, Mining the Data: Describes the data mining steps of the knowledge discovery process.
Appendix A, Creating the Data Mining Database: Contains DDL statement scripts that you can use to create the data mining database used in the examples in this manual.
Appendix B, Inserting Into the Data Mining Database: Contains INSERT statement scripts that you can use to populate the data mining database used in this manual.
Appendix C, Importing Into the Data Mining Database: Contains IMPORT statement scripts that you can use to create the data mining database used in this manual.

Related Documentation

This manual is part of the SQL/MX library of manuals, which includes:

Introductory Guides

SQL/MX Comparison Guide for SQL/MP Users: Describes SQL differences between SQL/MP and SQL/MX.
SQL/MX Quick Start: Describes basic techniques for using SQL in the SQL/MX conversational interface (MXCI). Includes information about installing the sample database.

Reference Manuals

SQL/MX Reference Manual: Describes the syntax of SQL/MX statements, MXCI commands, functions, and other SQL/MX language elements.
SQL/MX Connectivity Service Command Reference: Describes the SQL/MX administrative command library (MACL) available with the SQL/MX conversational interface (MXCI).
DataLoader/MX Reference Manual: Describes the features and functions of the DataLoader/MX product, a tool to load SQL/MX databases.
SQL/MX Messages Manual: Describes SQL/MX messages.
SQL/MX Glossary: Defines SQL/MX terminology.
Programming Manuals

SQL/MX Programming Manual for C and COBOL: Describes how to embed SQL/MX statements in ANSI C and COBOL programs.
SQL/MX Programming Manual for Java: Describes how to embed SQL/MX statements in Java programs according to the SQLJ standard.
SQL/MX Guide to Stored Procedures in Java: Describes how to use stored procedures that are written in Java within SQL/MX.

Specialized Guides

SQL/MX Installation and Management Guide: Describes how to plan, install, create, and manage an SQL/MX database. Explains how to use installation and management commands and utilities.
SQL/MX Query Guide: Describes how to understand query execution plans and write optimal queries for an SQL/MX database.
SQL/MX Data Mining Guide: Describes the SQL/MX data structures and operations to carry out the knowledge-discovery process.
SQL/MX Queuing and Publish/Subscribe Services: Describes how SQL/MX integrates transactional queuing and publish/subscribe services into its database infrastructure.
SQL/MX Report Writer Guide: Describes how to produce formatted reports using data from a NonStop SQL/MX database.
SQL/MX Connectivity Service Manual: Describes how to install and manage the SQL/MX Connectivity Service (MXCS), which enables applications developed for the Microsoft Open Database Connectivity (ODBC) application programming interface (API) and other connectivity APIs to use SQL/MX.

Online Help

The SQL/MX Online Help consists of:

Reference Help: Overview and reference entries from the SQL/MX Reference Manual.
Messages Help: Individual messages grouped by source from the SQL/MX Messages Manual.
Glossary Help: Terms and definitions from the SQL/MX Glossary.
NSM/web Help: Context-sensitive help topics that describe how to use the NSM/web management tool.
The following manuals are part of the SQL/MP library of manuals and are essential references for information about SQL/MP Data Definition Language (DDL) and SQL/MP installation and management:

Related SQL/MP Manuals

SQL/MP Reference Manual: Describes the SQL/MP language elements, expressions, predicates, functions, and statements.
SQL/MP Installation and Management Guide: Describes how to plan, install, create, and manage an SQL/MP database. Describes installation and management commands and SQL/MP catalogs and files.

[Figure VST001.vsd: the manuals in the SQL/MX library, grouped as Introductory Guides, Reference Manuals, Programming Manuals, Specialized Guides, and Online Help.]

Notation Conventions

Hypertext Links

Blue underline is used to indicate a hypertext link within text. By clicking a passage of text with a blue underline, you are taken to the location described. For example:

This requirement is described under Backup DAM Volumes and Physical Disk Drives on page 3-2.

General Syntax Notation

This list summarizes the notation conventions for syntax presentation in this manual.

UPPERCASE LETTERS. Uppercase letters indicate keywords and reserved words. Type these items exactly as shown.
Items not enclosed in brackets are required. For example:

MAXATTACH

lowercase italic letters. Lowercase italic letters indicate variable items that you supply. Items not enclosed in brackets are required. For example:

file-name

computer type. Computer type letters within text indicate C and Open System Services (OSS) keywords and reserved words. Type these items exactly as shown. Items not enclosed in brackets are required. For example:

myfile.c

italic computer type. Italic computer type letters within text indicate C and Open System Services (OSS) variable items that you supply. Items not enclosed in brackets are required. For example:

pathname

[ ] Brackets. Brackets enclose optional syntax items. For example:

TERM [\system-name.]$terminal-name
INT[ERRUPTS]

A group of items enclosed in brackets is a list from which you can choose one item or none. The items in the list can be arranged either vertically, with aligned brackets on each side of the list, or horizontally, enclosed in a pair of brackets and separated by vertical lines. For example:

FC [ num ]
   [ -num ]
   [ text ]
K [ X | D ] address

{ } Braces. A group of items enclosed in braces is a list from which you are required to choose one item. The items in the list can be arranged either vertically, with aligned braces on each side of the list, or horizontally, enclosed in a pair of braces and separated by vertical lines. For example:

LISTOPENS PROCESS { $appl-mgr-name }
                  { $process-name }
ALLOWSU { ON | OFF }

| Vertical Line. A vertical line separates alternatives in a horizontal list that is enclosed in brackets or braces. For example:

INSPECT { OFF | ON | SAVEABEND }

… Ellipsis. An ellipsis immediately following a pair of brackets or braces indicates that you can repeat the enclosed sequence of syntax items any number of times.
For example:

M address [ , new-value ]…
[ - ] {0|1|2|3|4|5|6|7|8|9}…

An ellipsis immediately following a single syntax item indicates that you can repeat that syntax item any number of times. For example:

"s-char…"

Punctuation. Parentheses, commas, semicolons, and other symbols not previously described must be typed as shown. For example:

error := NEXTFILENAME ( file-name ) ;
LISTOPENS SU $process-name.#su-name

Quotation marks around a symbol such as a bracket or brace indicate the symbol is a required character that you must type as shown. For example:

"[" repetition-constant-list "]"

Item Spacing. Spaces shown between items are required unless one of the items is a punctuation symbol such as a parenthesis or a comma. For example:

CALL STEPMOM ( process-id ) ;

If there is no space between two items, spaces are not permitted. In this example, no spaces are permitted between the period and any other items:

$process-name.#su-name

Line Spacing. If the syntax of a command is too long to fit on a single line, each continuation line is indented three spaces and is separated from the preceding line by a blank line. This spacing distinguishes items in a continuation line from items in a vertical list of selections. For example:

ALTER [ / OUT file-spec / ] LINE

   [ , attribute-spec ]…

!i and !o. In procedure calls, the !i notation follows an input parameter (one that passes data to the called procedure); the !o notation follows an output parameter (one that returns data to the calling program). For example:

CALL CHECKRESIZESEGMENT ( segment-id , error ) ;   !i !o

!i,o. In procedure calls, the !i,o notation follows an input/output parameter (one that both passes data to the called procedure and returns data to the calling program). For example:

error := COMPRESSEDIT ( filenum ) ;   !i,o
!i:i. In procedure calls, the !i:i notation follows an input string parameter that has a corresponding parameter specifying the length of the string in bytes. For example:

error := FILENAME_COMPARE_ ( filename1:length , filename2:length ) ;   !i:i !i:i

!o:i. In procedure calls, the !o:i notation follows an output buffer parameter that has a corresponding input parameter specifying the maximum length of the output buffer in bytes. For example:

error := FILE_GETINFO_ ( filenum , [ filename:maxlen ] ) ;   !i !o:i

Notation for Messages

This list summarizes the notation conventions for the presentation of displayed messages in this manual.

Bold Text. Bold text in an example indicates user input typed at the terminal. For example:

ENTER RUN CODE
?123
CODE RECEIVED: 123.00

The user must press the Return key after typing the input.

Nonitalic text. Nonitalic letters, numbers, and punctuation indicate text that is displayed or returned exactly as shown. For example:

Backup Up.

lowercase italic letters. Lowercase italic letters indicate variable items whose values are displayed or returned. For example:

p-register
process-name

[ ] Brackets. Brackets enclose items that are sometimes, but not always, displayed. For example:

Event number = number [ Subject = first-subject-value ]

A group of items enclosed in brackets is a list of all possible items that can be displayed, of which one or none might actually be displayed. The items in the list can be arranged either vertically, with aligned brackets on each side of the list, or horizontally, enclosed in a pair of brackets and separated by vertical lines. For example:

proc-name trapped [ in SQL | in SQL file system ]

{ } Braces. A group of items enclosed in braces is a list of all possible items that can be displayed, of which one is actually displayed.
The items in the list can be arranged either vertically, with aligned braces on each side of the list, or horizontally, enclosed in a pair of braces and separated by vertical lines. For example:

obj-type obj-name state changed to state, caused by
{ Object | Operator | Service }
process-name State changed from old-objstate to objstate
{ Operator Request. }
{ Unknown. }

| Vertical Line. A vertical line separates alternatives in a horizontal list that is enclosed in brackets or braces. For example:

Transfer status: { OK | Failed }

% Percent Sign. A percent sign precedes a number that is not in decimal notation. The % notation precedes an octal number. The %B notation precedes a binary number. The %H notation precedes a hexadecimal number. For example:

%005400
%B101111
%H2F
P=%p-register E=%e-register

Notation for Management Programming Interfaces

This list summarizes the notation conventions used in the boxed descriptions of programmatic commands, event messages, and error lists in this manual.

UPPERCASE LETTERS. Uppercase letters indicate names from definition files. Type these names exactly as shown. For example:

ZCOM-TKN-SUBJ-SERV

lowercase letters. Words in lowercase letters are words that are part of the notation, including Data Definition Language (DDL) keywords. For example:

token-type

!r. The !r notation following a token or field name indicates that the token or field is required. For example:

ZCOM-TKN-OBJNAME   token-type ZSPI-TYP-STRING.   !r

!o. The !o notation following a token or field name indicates that the token or field is optional. For example:

ZSPI-TKN-MANAGER   token-type ZSPI-TYP-FNAME32.   !o
1 Introduction

Knowledge discovery is an iterative process involving many query-intensive steps. The challenges of data management in supporting this process efficiently are significant and continue to grow as knowledge discovery becomes more widely used.

Data mining identifies and characterizes interrelationships among multiple variables without requiring a data analyst to formulate specific questions. Software tools look for trends and patterns and flag unusual or potentially interesting ones. Because data mining reveals previously unknown information and patterns, rather than proving or disproving a hypothesis, mining enables knowledge discovery rather than just knowledge verification.

This section discusses these approaches to data mining:

• The Traditional Approach. Today, most data mining is performed outside the database, by client tools operating on data extracts. This approach is limited because important information might be omitted from the extract.

• The SQL/MX Approach. The SQL/MX approach to knowledge discovery enables you to perform many data-intensive tasks in the database itself, rather than on extracts. Examples include statistical sampling, statistical functions, temporal reasoning through sequence functions, cross-table generation, database profiling, and moving-window aggregations.

• The Knowledge Discovery Process. In the SQL/MX approach, fundamental data structures and operations are built into the database management system (DBMS) to support a wide range of knowledge discovery tasks and algorithms.
The knowledge discovery process is described as a series of steps that starts with the selection and definition of a business opportunity, continues through data preparation and modeling, and ends with the deployment of the new knowledge.

The Traditional Approach

Today's traditional knowledge discovery systems consist of an application program on top of a data source. The main emphasis in these systems is data mining: inventing new techniques and algorithms, proving their statistical soundness, and validating their effectiveness given a suitable problem. Data is assumed to be available in a convenient form, typically a flat file extracted from an appropriate data source. The knowledge discovery system consists of specific algorithms that load the entire data set into memory and perform the necessary computations.

The extract approach has two major limitations:

• It does not scale to large data sets, because the entire data set must fit in memory. Statistical sampling can be used to avoid this limitation; however, sampling is inappropriate in many situations because it might cause patterns to be missed, such as those in small groups or those between records.

• It cannot conveniently manage multiple versions of data across the numerous iterations of a typical knowledge discovery investigation. For example, each iteration might require extracting additional data, performing incremental updates, deriving new attributes, and so on.

The SQL/MX Approach

In most enterprise organizations today, database systems are crucial for conducting business. DBMS systems serve as the transaction processing systems for daily operations and manage data warehouses containing huge amounts of historical information. The validated data in these warehouses is already being used for online analysis and is a natural starting point for knowledge discovery.
The SQL/MX approach identifies fundamental data structures and operations that are common across a wide range of knowledge discovery tasks and builds those structures and operations into the DBMS. The primary advantages of the SQL/MX technology over traditional data mining techniques include:

• The ability to mine much larger data sets, not only data in flat-file extracts
• Simplified data management
• More complete results
• Better performance and reduced cycle times

The main features of the SQL/MX approach are summarized next.

Data-Intensive Computations Performed in the DBMS

Tools and applications perform data-intensive data-preparation tasks in the DBMS by using an SQL interface. As a result, you can access the powerful and parallel DBMS data manipulation capabilities in the data preparation stage of the knowledge discovery process.

Use of Built-In DBMS Data Structures and Operations

Fundamental data structures and operations are built into the DBMS to support a wide range of knowledge discovery tasks and algorithms in an efficient and scalable manner. Building these data structures and operations into the DBMS allows mining tasks to be moved into the SQL engine for tighter integration of data and mining operations and for improved performance and scalability. Adding new primitives, such as moving-window aggregate functions, simplifies the queries needed by knowledge discovery tools and applications. This type of query simplification often results in significant improvements in performance.

The Knowledge Discovery Process

The knowledge discovery process is a nine-step process that starts with the selection and definition of a business opportunity, continues through several data preparation steps and a modeling step, and ends with the deployment of the new knowledge. This subsection summarizes the steps of that process.

1. Identify and define a business opportunity.
The process begins with the identification and precise specification of a business opportunity. See Defining the Business Opportunity on page 1-4.

2. Preprocess and load the data for the business opportunity.

Real-world data is often inconsistent and incomplete. The first preparation step is to address these problems by preprocessing the data in various ways, for example, verifying and mapping the data. Then load the data into your database system. See Preparing the Data on page 1-7.

3. Profile and understand the relevant data.

Generate a variety of statistics such as column unique entry counts, value ranges, number of missing values, mean, variance, and so on. See Profiling the Data on page 1-7.

4. Define events relevant to the business opportunity being explored.

Events are used to align related data in a single set of columns for mining. Example events are life changes, such as getting married or switching jobs, or customer actions, such as opening an account or requesting a credit limit increase. See Defining Events on page 1-8.

5. Derive attributes.

For example, customer age can be derived from birth date. Account summary statistics, such as maximum and minimum balances, can be derived from monthly status information. See Preparing the Data on page 1-7.

6. Create the data mining view.

Transform the data into a mining view, a form in which all attributes about the primary mining entity occur in a single record. See Creating the Mining View on page 1-10.

7. Mine the data and build models.

Core knowledge discovery techniques are applied to gain insight, learn patterns, or verify hypotheses. The main tasks are either predictive or descriptive in nature. Predictive tasks involve trying to determine what will happen in the future, based upon historical data. Descriptive tasks involve finding patterns describing the data. See Mining the Data on page 1-10.

8. Deploy models.
Deployment can take many different forms. For example, deployment might be as simple as documenting and reporting the results, or it might involve embedding the model in an operational system to achieve predictive results.

9. Monitor model performance.

Performance of the model must be monitored for accuracy. When accuracy begins to decline, the model must be updated to fit the current situation. See Knowledge Deployment and Monitoring on page 1-11.

In Step 1, a business opportunity is identified and defined. In Steps 2 through 6, data mining data is gathered, preprocessed, and organized in a form that is suitable for mining. These steps require the most time in the process. For example, selecting the data is an important step and typically requires the assistance of a data mining expert or a subject matter expert who has knowledge of the data to be mined. In Step 7, models are built. In Steps 8 and 9, the models are deployed and monitored. This latter part of the knowledge discovery process focuses on analyzing the data mining view prepared in Steps 2 through 6.

Defining the Business Opportunity

The process begins with the identification and precise specification of a business opportunity. Several factors must be considered when evaluating potential opportunities:

• Quantification of the return on investment. What is the answer worth? How much money can be saved? How much of a competitive advantage does it offer?

• Usability of the results. Merely identifying patterns is not enough. The opportunity and analysis must be structured so that any interpretation of results obtained develops into deployable business strategy.

• Political and organizational reaction. In assessing probabilities for organizational resistance, it is helpful to examine similar past efforts and understand why these efforts succeeded or failed.
• Availability of business analysts, data mining experts, and technology. Are data, domain, and mining experts available to participate in the process? Is sufficient technology, both hardware and software, available?

• Data availability. Does preclassified data exist, or can it be derived? Do sufficiently large amounts of data exist? Both internal and external data sources should be considered.

• Logistics. How difficult is it to collect, extract, and transport the relevant data? Is confidentiality an issue?

Careful consideration of these factors helps to ensure that the opportunity selected is both amenable to data mining and likely to provide significant value. After an opportunity is selected, the next task is to specify it precisely.

In the scenario of building a model to predict credit card account attrition, the goal is to build a model that will predict, as early as possible, whether a credit card customer will close their account. To specify this opportunity precisely, decide on an explicit definition of attrition, such as when a customer calls and closes their account. Another option is implicit: when a customer stops using their card. For simplicity, define attrition as a customer closing their account or maintaining a zero balance for three months.

Another aspect of specifying the opportunity is defining what it means to predict as early as possible when an account will be closed. For this example, choose three months as the prediction window. This window should be long enough to allow the card issuer to take some action to try to retain customers likely to leave, but short enough to capture attrition-related patterns.
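The working definition of attrition above can be expressed as a small labeling routine. The following Python sketch is illustrative only; the `(status, balance)` record layout is a hypothetical stand-in for the account history data, not the manual's schema. An account is labeled attrited if its status ever changes to Closed, or if its balance stays at zero for three consecutive months.

```python
def is_attrited(monthly_records):
    """Return True if the account meets the example attrition
    definition: it was closed, or its balance stayed at zero for
    three consecutive months. monthly_records is a list of
    (status, balance) tuples ordered by month; this layout is a
    hypothetical simplification of the Account History table."""
    zero_run = 0
    for status, balance in monthly_records:
        if status == "Closed":
            return True
        # Count consecutive zero-balance months; reset on activity.
        zero_run = zero_run + 1 if balance == 0 else 0
        if zero_run >= 3:
            return True
    return False
```

In the database itself, the same rule would be applied over the monthly status records for each account; the function above just makes the two attrition conditions explicit.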
Example Business Opportunity

The precise specification of our example opportunity is to build a model that will predict at any point in time, based on such things as current account status, account activity, and demographics, whether a credit card customer will close their account in the future. Note that the precise specification of the opportunity might be modified or refined later in the knowledge discovery process as more information becomes available.

This manual uses this opportunity scenario to describe the knowledge discovery process and how to implement it. The data set used to illustrate techniques and SQL/MX features consists of two tables: one containing customer information and the other containing account history information. This data set is presented in Appendixes A through C of this manual. A subset of this data set is shown in these tables:

Customers Table

Account   Name           Marital Status   Home   Income
1234567   Jones, Mary    Single           Own    65,000
2500000   Abbas, Ali     Divorced         Rent   32,000
4098124   Kano, Tomoko   Divorced         Own    44,000
2400000   Lund, Erika    Widow            Own    28,000
...

Account History Table

Account   Month   Status   Limit    Balance   Payment   Fin. Chrg
1234567   01/03   Open     10,000   1232.50   1232.50   0.00
2500000   07/02   Open      5,000    566.00     32.00   8.00
4098124   10/00   Open      6,000   3200.00   3200.00   0.00
1234567   02/03   Open     10,000   3000.00   3000.00   0.00
2500000   08/02   Open      5,000    600.00     40.00   9.23
...

The first table, the Customers table, contains one row for each credit card account and consists of customer demographic information such as marital status, income, and so on. For a large financial institution, a customers table such as this one might contain approximately 10 million rows and 100 columns. The second table, the Account History table, contains monthly status records, one for each account for each month the account was open over a given time period, and consists of about 200 columns.
For this example, suppose the time period is three years. The history table would then contain about 360 million rows, assuming 10 million customers. Given these parameters, the size of the first table is about 5 GB (10 million rows, 500 bytes in each row), and the size of the second table is about 360 GB (360 million rows, 1000 bytes in each row).

For the example business opportunity, the Status and Balance fields of the Account History table are used to determine whether a customer will close their account. If the Status changes from Open to Closed, or if the Balance is zero for three consecutive months, then the customer is defined as having left; that is, the customer no longer holds a credit card account.

Preparing the Data

After a business opportunity has been identified and defined, the next task is to prepare a data set for mining. This is done in Steps 2 through 6 of the knowledge discovery process. See The Knowledge Discovery Process on page 1-3. The first two steps are preprocessing the mining data to make it consistent and then loading the data into a database system. For further information, see Loading the Data on page 2-2. The next step is to generate a variety of statistics, for example, column unique entry counts, value ranges, number of missing values, mean, variance, and so on. This type of data profile is helpful in gaining an understanding of the data, and it also serves as a valuable reference throughout the knowledge discovery process.
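Of the profile statistics just listed, the count of missing values is not shown in the later profiling examples. It can be obtained with standard aggregates, because COUNT of a column ignores nulls while COUNT(*) counts all rows. A minimal sketch against the example Customers table (column names as used elsewhere in this manual):

```sql
-- Missing-value counts per column:
-- COUNT(*) counts every row; COUNT(col) counts only non-null values.
SELECT COUNT(*) - COUNT(age)             AS missing_age,
       COUNT(*) - COUNT(number_children) AS missing_children
FROM customers;
```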
Profiling the Data

A profile of the database helps to solve the data mining problem in these ways:

• To better understand the data
• To decide which columns to use for analysis
• To decide whether to treat attributes as discrete or continuous

Types of Information

The type of information used to create a profile of the data mining view comes from the following elements:

• Tables in the database
• Table attributes (or columns to be used in the analysis)
• Data types of the table attributes
• Relationships between tables
• Cardinalities of discrete attributes
• Statistics about continuous attributes
• Derived table attributes (or derived columns to be used in the analysis)

Determining the derived columns to be constructed requires knowledge of the table attributes and how these attributes relate to the data mining problem. See Preparing the Data on page 1-7 for a full discussion of these elements.

SQL/MX provides the TRANSPOSE clause of the SELECT statement to display the cardinalities of discrete attributes. See Transposition on page 2-3 and the TRANSPOSE Clause entry in the SQL/MX Reference Manual for details.

Example of Finding Cardinality of Discrete Attributes

The Customers table in your data set has Age and Number_Children columns. Both of these attributes are discrete, and you can compute the cardinality of each attribute. You obtain the cardinality of an attribute, which is the count of the number of unique values for the attribute, by using a COUNT DISTINCT query. For example:

SELECT COUNT(DISTINCT Age) FROM Customers;

or

SELECT COUNT(DISTINCT Number_Children) FROM Customers;

Instead of having to submit a query for each attribute, you can obtain counts for multiple attributes of a table by using the TRANSPOSE clause.
For example:

SET NAMETYPE ANSI;
SET SCHEMA dmcat.whse;

SELECT ColumnIndex, COUNT(DISTINCT ColumnValue)
FROM Customers
TRANSPOSE Age, Number_Children AS ColumnValue
KEY BY ColumnIndex
GROUP BY ColumnIndex;

COLUMNINDEX  (EXPR)
-----------  --------------------
          1                    17
          2                     4

--- 2 row(s) selected.

The first row of the result table of the TRANSPOSE clause contains the distinct count for the column Age, and the second row contains the distinct count for the column Number_Children. You can treat the Age values as categories consisting of age ranges. Similarly, if Number_Children is greater than five, you can place those counts in the category for Number_Children equal to five. The number of attributes in a TRANSPOSE clause is unlimited.

Note. The data types of attributes to be transformed into a single column must be compatible. The data type of the result column is the union-compatible data type of the attributes. For further information, see Profiling the Data on page 2-2.

Defining Events

In the scenario considered in this manual, the relevant event is the account holder leaving. This event occurs at different points in time for customers that leave and not at all for customers that stay. This event must be defined so that account status and activity in the months leading up to a customer leaving can be located and aligned in columns. For example, suppose you create three derived attributes that describe the account balance for each of the three months before a customer leaves, because these attributes are predictors of attrition. For the customers that do leave, the months leading up to leaving occur at various points in time. For customers that do not leave, these months are chosen to be any three consecutive months in which the account is open. The information about these months should be aligned for all accounts in a single set of columns, one for each of the three months.
Most mining algorithms require a single logical attribute, such as the balance one month before leaving, to be stored in one column in all records, rather than in different columns in different records. For example, consider this data in a table that contains monthly account balances for each month in the three-year history period:

Account  Bal 08/03  Bal 09/03  Bal 10/03  Bal 11/03  ...  Left
1234567    7800.00    3000.00    2870.00    1200.00  ...  Yes (closed)
2500000       0.00       0.00       0.00       0.00  ...  Yes (0 bal)

Account  Bal 07/02  Bal 08/02  Bal 09/02  Bal 10/02  ...  Left
4098124    4817.94    4596.10    4347.63    4069.34  ...  Yes (closed)

The balances prior to the event (of the customer leaving) are in different date columns for these accounts, and therefore algorithms that build predictive models are not able to consider this information. This table organization allows the information to be considered:

Account  ...  Bal-3    Bal-2    Bal-1    Date Left  Left
1234567  ...  3000.00  2870.00  1200.00  12/03      Yes (closed)
2500000  ...     0.00     0.00     0.00  11/03      Yes (0 bal)
4098124  ...  4817.94  4596.10  4347.63  10/03      Yes (closed)

In this table, columns Bal-1 through Bal-3 contain account balances one through three months prior to a customer leaving. Consequently, this information is aligned within a single set of columns and can be considered during model creation. For further information, see Defining Events on page 2-6.

Deriving Attributes

The next task is to derive attributes that are not relative to events. For example, customer age can be derived from birth date. Part of the challenge of effective data mining is identifying a set of derived attributes that capture key indicators relevant to the business opportunity being explored. For further information, see Deriving Attributes on page 2-9.
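The age derivation mentioned above can be written with standard date functions. A sketch, assuming a Birth_Date column of type DATE (such a column is not shown in the sample Customers table, so treat it as hypothetical); note that this simple year subtraction ignores whether the customer's birthday has occurred yet in the current year:

```sql
-- Approximate customer age in years from an assumed birth_date column.
SELECT account,
       EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM birth_date) AS age
FROM customers;
```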
Creating the Mining View

The final data preparation step is to transform the data set into a mining view, a form in which all attributes about the main mining entity appear in a single record. The mining entity used in this manual is a credit card account. The data mining challenge is to determine predictors for when a customer will close a credit card account.

Transforming the data set to a single record for each mining entity often involves a pivot operation, in which attributes in multiple rows are collapsed and put into a single row. For example, in the credit card example, the set of history records associated with each account is collapsed to a single record and then appended to the corresponding customer record. For further information, see Section 3, Creating the Data Mining View. The resulting table looks similar to this:

Mining View

Account  Mar Status  Income  Bal-3    Bal-2    Bal-1    Date Left  Left
1234567  Single      65,000  3000.00  2870.00  1200.00  12/99      Yes
2500000  Divorced    32,000     0.00     0.00     0.00  11/99      Yes
4098124  Divorced    44,000  4817.94  4596.10  4347.63  10/98      Yes
5200000  Married     32,000  –        –        –        –          No

This table contains demographic information from the Customers table, such as marital status and income, and also pivoted columns from the Account History table, such as balances prior to leaving. You use this example data set in the data mining step, the next step in the knowledge discovery process.

Mining the Data

In the data mining step, core knowledge discovery techniques are applied to gain insight, learn patterns, or verify hypotheses. The main tasks performed in this step are either predictive or descriptive in nature. Predictive tasks involve trying to determine what will happen in the future, based upon historical data. Descriptive tasks involve finding patterns describing the data.
The task used in this customer scenario is predictive: to build a model to predict attrition of credit card customers based on historical information, such as demographics and account activity. The most common predictive tasks are:

• Classification—Classify a case (or record) into one of several predefined classes.
• Regression—Map a case (or record) into a numerical prediction value.

Descriptive tasks involve finding patterns describing the data. The most common are:

• Database segmentation (clustering)—Map a case into one of several clusters.
• Summarization—Provide a compact description of the data, often in visual form.
• Link analysis—Determine relationships between attributes in a case.
• Sequence analysis—Determine trends over time.

You use a variety of algorithms, and the models they produce, to perform these predictive and descriptive tasks. For example, classification can be done by building a decision tree model, where each branch of the tree is represented by a predicate involving attributes in the mining data set and where each branch is homogeneous with respect to whether the predicate is true or false. The main task in classification is to determine which predicates form the decision tree that predicts the goal. The most common algorithms for classification come from the field of machine learning in computer science.

Typically, the model building step involves the use of client mining tools that require the interactive participation of the user to guide the investigation. A description of these special-purpose tools is beyond the scope of this manual. For further information, see Section 4, Mining the Data.

Knowledge Deployment and Monitoring

The last two steps of the knowledge discovery process involve deploying and monitoring discovered knowledge. Deployment can take many different forms.
For example, deployment might be as simple as documenting and reporting the results, or deployment might be embedding the model in an operational system to achieve predictive results. Most data mining tools support model deployment either by applying a model to data within the tool or by exporting a model as executable code, which can then be embedded and used in applications. In the credit card attrition example, one form of model deployment is to periodically use the model to identify profitable customers that are likely to leave, and then to take some action, such as lowering interest rates or waiving fees, to try to retain these customers.

2 Preparing the Data

Section 1, Introduction, identifies and defines a business opportunity, the first step in the knowledge discovery process supported by SQL/MX. This section describes Steps 2 through 5.

1. Identify and define a business opportunity.
2. Preprocess and load the data for the business opportunity. The first preparation step is to address data quality problems by preprocessing the data in various ways, for example, by verifying and mapping the data. Then load the data into your database system. See Loading the Data on page 2-2.
3. Profile and understand the relevant data. Generate a variety of statistics, such as column unique entry counts, value ranges, number of missing values, mean, variance, and so on. See Profiling the Data on page 2-2.
4. Define events relevant to the business opportunity being explored. Events are used to align related data in a single set of columns for mining. Example events are life changes, such as getting married or switching jobs, or customer actions, such as opening an account or requesting a credit limit increase. See Defining Events on page 2-6.
5. Derive attributes. For example, customer age can be derived from birth date.
Account summary statistics, such as maximum and minimum balances, can be derived from monthly status information. See Deriving Attributes on page 2-9.

6. Create the data mining view.
7. Mine the data and build models.
8. Deploy models.
9. Monitor model performance.

Loading the Data

The first step in preparing a data set for mining is loading the data into database tables. Suppose the credit card organization has a customers data warehouse. The customer data and the account history data are stored in this warehouse. In a typical real-world scenario, the warehouse could have millions of records representing millions of customers dating back many years.

Creating the Database

Suppose a data mining database is created consisting of the Customers table and the Account History table described in the previous section. You can use the DDL scripts included with this manual to create a database to run the examples in this manual. To create the database:

1. Open the .pdf file for this manual.
2. Navigate to Appendix A, Creating the Data Mining Database, which contains the DDL script that creates the database.
3. On the toolbar, select the Table/Formatted Text Select Tool.
4. Copy and paste the DDL script, one page at a time, into an OSS text file.
5. Within MXCI (the SQL/MX conversational interface), obey the OSS file you have created.

Importing Data Into the Database

After the data mining database is created, the warehouse data is imported into the database. In a typical real-world scenario, you would import the data by using some type of database utility—for example, you can use the DataLoader/MP utility to import a large quantity of data into an SQL/MP database. For further information, see the DataLoader/MX Reference Manual and the SQL/MX Reference Manual for discussions of the import utility.
Alternatively, you can use INSERT statements to insert values into the data mining database. The INSERT statements for the example in this manual are included in Appendix B, Inserting Into the Data Mining Database.

Profiling the Data

Profiling often begins with the computation of basic information about each attribute. For discrete attributes, this basic information is typically a table of the unique values and a count of how many times each value occurs. However, as cardinality increases, these frequencies become less and less meaningful. For continuous attributes, the approach is to use metrics such as minimum, maximum, mean, and variance.

Cardinalities and Metrics

For any attribute, one approach to profiling is to run a separate query for each attribute. As an example, consider the following queries, which profile the discrete attribute Marital Status from the Customers table and the continuous attribute Balance from the Account History table.

Example of Discrete Attribute

This query counts the occurrences of each discrete value of the Marital Status column of the Customers table:

SELECT marital_status, COUNT(*)
FROM customers
GROUP BY marital_status;

Example of Continuous Attribute

This query computes statistical information about the continuous attribute Balance in the Account History table:

SELECT MIN(balance), MAX(balance), AVG(balance), VARIANCE(balance)
FROM acct_history;

Transposition

Other than the computation of a few metrics, both of the previous queries require a complete scan of the data. In this way, a table with N attributes requires N queries, resulting in N complete scans. For a wide mining table, this procedure can result in thousands of queries and scans of the data. Using transposition, SQL/MX can perform the above profiling operations with a total of only two queries, regardless of the number of attributes to be profiled.
Through the TRANSPOSE clause of the SELECT statement, different columns of a source table can be treated as a single output column, enabling similar computations to be performed on all such source columns. TRANSPOSE takes each row in the source table and converts each expression listed in the transpose set to an individual output row. Used in this way, TRANSPOSE can compute frequency counts for all discrete attributes in a table in a single query. See the TRANSPOSE Clause entry in the SQL/MX Reference Manual for more information.

Example of Computing Counts for Character Discrete Attributes

This query computes the frequency counts for the discrete attributes Gender, Marital Status, and Home, which are all of character type:

SET NAMETYPE ANSI;
SET SCHEMA mining.whse;

SELECT attr, c1, COUNT(*)
FROM customers
TRANSPOSE ('GENDER', gender),
          ('HOME', home),
          ('MARITAL_STATUS', marital_status)
          AS (attr, c1)
GROUP BY attr, c1
ORDER BY attr, c1;

ATTR            C1        (EXPR)
--------------  --------  --------------------
GENDER          F                           20
GENDER          M                           22
HOME            Own                         33
HOME            Rent                         9
MARITAL_STATUS  Divorced                    12
MARITAL_STATUS  Married                      9
MARITAL_STATUS  Single                      15
MARITAL_STATUS  Widow                        6

--- 8 row(s) selected.

Because this query produces counts for three different attributes, use the ATTR column to distinguish from which attribute the values are drawn. The C1 column contains the values for these character attributes.

Example of Computing Counts for Character and Numeric Discrete Attributes

This query also shows the TRANSPOSE clause and illustrates how profiling can be achieved. The column C2 has been added to the statement because Number_Children has a numeric data type.
SELECT attr, c1, c2, COUNT(*)
FROM customers
TRANSPOSE ('GENDER', gender, null),
          ('HOME', home, null),
          ('MARITAL_STATUS', marital_status, null),
          ('NUMBER_CHILDREN', null, number_children)
          AS (attr, c1, c2)
GROUP BY attr, c1, c2
ORDER BY attr, c1, c2;

ATTR             C1        C2      (EXPR)
---------------  --------  ------  --------------------
GENDER           F         ?                         20
GENDER           M         ?                         22
HOME             Own       ?                         33
HOME             Rent      ?                          9
MARITAL_STATUS   Divorced  ?                         12
MARITAL_STATUS   Married   ?                          9
MARITAL_STATUS   Single    ?                         15
MARITAL_STATUS   Widow     ?                          6
NUMBER_CHILDREN  ?         0                         25
NUMBER_CHILDREN  ?         1                          4
NUMBER_CHILDREN  ?         2                         10
NUMBER_CHILDREN  ?         3                          3

--- 12 row(s) selected.

Because this query produces counts for four different attributes, use the ATTR column to distinguish from which attribute the values are drawn. The C1 column contains the values for the character attributes, and the C2 column contains the values for the numeric attribute.

Example of Computing Statistics for Continuous Attributes

Similarly, a single query using TRANSPOSE can compute the necessary statistics for all continuous attributes. The next query computes the minimum, maximum, mean, and variance for the continuous attributes Customer Credit Limit and Balance, which are both numeric:

SELECT attr, MIN(c1), MAX(c1), AVG(c1), VARIANCE(c1)
FROM acct_history
TRANSPOSE (1, cust_limit),
          (2, balance)
          AS (attr, c1)
GROUP BY attr
ORDER BY attr;

Sample results are:

ATTR  MIN(C1)  MAX(C1)   AVG(C1)   VARIANCE(C1)
   1  5000.00  40000.00  18225.81     2.01E+008
   2      .00  32000.00   2539.12     1.46E+007

ATTR  MIN(C1)  MAX(C1)   AVG(C1)   VARIANCE(C1)
   1  5000.00  40000.00  20139.86     2.35E+008
   2      .00  32000.00   2444.17    1.584E+007

By using TRANSPOSE to compute attribute profiles, you gain performance and scalability advantages. Performance is improved because the data set is scanned only once. In addition, the number of queries is reduced to two: one for discrete attributes and one for continuous attributes.
Scalability is enhanced because the amount of data accessed grows linearly with the number of attributes actually profiled.

Quick Profiling

The profiling step is highly iterative, because many different data sources are inspected and evaluated for possible analysis. Getting a quick impression of an attribute before proceeding to a more detailed profile is often necessary. For example, by quickly estimating cardinality, you can determine whether to treat a column as discrete or continuous. You can make this determination accurately without a scan of every single data element. Use the SQL/MX sampling feature to:

• Randomly sample source data
• Improve computing efficiency for a profile by using a selected sampling percentage
• Reduce both the I/O costs and the CPU costs associated with computing a profile

See the SAMPLE Clause of SELECT in the SQL/MX Reference Manual.

Defining Events

Events are used to align related data in a single set of columns for mining. Example events are life changes, such as getting married or switching jobs, or customer actions, such as opening an account or requesting a credit limit increase. The critical event to be defined for the business opportunity described in this manual is the month the customer left—either by closing their account or by maintaining a zero balance for three months. The problem is to align the data so that this event can be derived as an attribute of the mining view.

Aligning the Data

Most mining algorithms and tools require that the input data be arranged so that all the information pertaining to a given entity is contained in a single record. However, in typical raw mining data, observations about a given entity can be stored in separate rows and tables. For example, the Account History table contains one record per customer per month, summarizing the account status for that customer.
The related Customers table contains static information in the form of one row per customer. For this example, the account status information must be reduced to a single row of information for each customer. This data is paired with the static customer information to form the mining view.

Two methods exist for mapping time-dependent data into the mining view. One method is to take a value from a particular month and include that value in the mining view. For example, the checking account balance for January 1998 can be included in the mining view for each customer because the balance is a single value. Alternatively, a value can be aggregated over a time period to compute a single value for the mining view. An example is the average checking account balance for January 1998 through June 1998.

Absolute and relative methods exist for aligning time-dependent data in the mining view. Specifying an event relative to a customer is often more meaningful than specifying an absolute event, such as a given year and month. The account balance one month prior to closing an account and the average account balance for six months prior to closing an account are both examples of relative events. In this type of relative time specification, the actual months selected depend on an event that is different for each customer. Aligning the data by using relative events is crucial for building models to predict events that occur at different times for each customer.
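The two absolute methods just described, taking a single month's value and aggregating over a fixed period, can each be expressed as an ordinary query. A sketch against the example Account History table (the date literals are illustrative):

```sql
-- Absolute single value: the balance for January 1998, one row per account.
SELECT account, balance
FROM acct_history
WHERE year_month = DATE '1998-01-01';

-- Absolute aggregate: the average balance for January through June 1998.
SELECT account, AVG(balance)
FROM acct_history
WHERE year_month BETWEEN DATE '1998-01-01' AND DATE '1998-06-01'
GROUP BY account;
```

The relative method, which depends on a per-customer event month, is shown in the Example of Aligning Data that follows.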
Example of Aligning Data

This statement creates an SQL/MX table named Close_Temp that contains the account number, the month the account is considered closed (if not closed, an arbitrary month), and an indicator of whether or not the customer left:

SET SCHEMA mining.whse;

CREATE TABLE Close_Temp
( account      NUMERIC (7) UNSIGNED NO DEFAULT NOT NULL
               HEADING 'Account Number'
 ,close_month  DATE NO DEFAULT NOT NULL
               HEADING 'Close Month'
 ,cust_left    CHAR(1) NO DEFAULT
 ,PRIMARY KEY (account)
);

In this query, the source data for the column named Close_Month is defined to be the month the customer left—either by closing their account or by maintaining a zero balance for three months. If the customer did not leave, the month is arbitrarily defined to be a month in the middle of their account history.

INSERT INTO close_temp
  (SELECT p.account,
     CASE
       WHEN p.close_month2 IS NOT NULL THEN p.close_month2
       WHEN p.close_month1 IS NOT NULL THEN p.close_month1
       ELSE p.open_month
            + ((DATE '1999-12-01' - p.open_month)/2)
            - INTERVAL '16' DAY
     END,
     CASE
       WHEN p.close_month2 IS NOT NULL THEN 'Y'
       WHEN p.close_month1 IS NOT NULL THEN 'Y'
       ELSE 'N'
     END
   FROM
     (SELECT t.account,
             MAX(t.close_month1),
             MAX(t.close_month2),
             MIN(t.year_month)
      FROM
        (SELECT m.account,
                m.year_month,
                CASE
                  WHEN m.status = 'Closed'
                   AND OFFSET(m.status,1) = 'Open'
                   AND account = OFFSET(account,1)
                  THEN m.year_month
                END,
                CASE
                  WHEN ROWS SINCE INCLUSIVE (balance <> 0) = 3.0
                   AND account = OFFSET(account,2)
                  THEN m.year_month
                END
         FROM acct_history m
         SEQUENCE BY m.account, m.year_month)
        t (account, year_month, close_month1, close_month2)
      GROUP BY t.account)
     p (account, close_month1, close_month2, open_month));

The derived attribute Close_Month1 contains the month when a customer explicitly closed their account (the Account Status is marked Closed).
The first CASE expression in the inner query uses the OFFSET sequence function to determine the month when an account is closed explicitly. The derived attribute Close_Month2 contains the month when a customer implicitly closed their account (maintained a zero balance for three months). The second CASE expression in the inner query uses the OFFSET sequence function and the ROWS SINCE INCLUSIVE sequence function to determine the month when an account has a zero balance for three months. The derived attribute Open_Month is the month when the account was opened. In the CASE expression of the outer query, this month is adjusted to be the month in the middle of the account history. The account history interval is defined to start with the first month the account is open up to the date 1999-12-01. The derived attribute Close_Month in the Close_Temp table is set to either Close_Month1 (when a customer explicitly closed their account), Close_Month2 (when a customer maintained a zero balance for three months), or the month in the middle of the Account History interval (when an account is open). The derived attribute Cust_Left is set to Y if a customer has a zero balance for three months or if the Account Status is marked Closed. In queries that use sequence functions, note the use of the SEQUENCE BY clause. See SEQUENCE BY Clause and Sequence Functions in the SQL/MX Reference Manual for more information. 
Here are the contents of the Close_Temp table after the preceding row insertion:

Account Number  Close Month  Cust_Left
1000000         1999-03-01   N
1234567         1999-12-01   Y
2300000         1999-11-01   Y
2400000         1998-12-01   Y
2500000         1999-10-01   Y
2900000         1999-06-01   N
3200000         1999-05-01   Y
3900000         1998-10-01   Y
4098124         1998-10-01   Y
4300000         1999-06-01   N
4400000         1999-07-01   Y
4500000         1998-09-01   Y
4600000         1999-12-01   Y
4700000         1999-06-01   N

Deriving Attributes

In the preceding Example of Aligning Data on page 2-7, the derived attributes in the Close_Temp table are Close_Month and Cust_Left. These attributes are critical for the task of building a model that will predict at any point in time, based on such things as current account status, account activity, and customer demographics, whether a credit card customer will leave three months in the future. To produce good models, the source mining data typically needs to be supplemented with appropriate derived attributes. Typical derived attributes include ratios between key quantities, postal codes mapped to average demographics, moving metrics, and rankings, percentiles, or quartiles.

Moving Metrics

Moving metrics measure dynamic behavior, such as the rate of events or the trend of a condition over a window of time. In the data mining environment, moving metrics are good predictors for many modeling tasks involving historical or time series data. For example, the moving average of an account balance produces attributes that could be included in the mining view for each customer. SQL/MX supports a number of sequence functions that you can use to simplify such queries and to execute them more efficiently.
Example Using MOVINGAVG and ROWS SINCE

This query uses the sequence functions MOVINGAVG and ROWS SINCE:

SELECT account, year_month,
       MOVINGAVG (balance,
                  ROWS SINCE INCLUSIVE (account <> OFFSET (account,1)) + 1,
                  RUNNINGCOUNT(*))
FROM acct_history
SEQUENCE BY account, year_month;

ACCOUNT     YEAR_MONTH  (EXPR)
----------  ----------  ---------------------
1000000     1998-07-01  3678.67
1000000     1998-08-01  5229.33
1000000     1998-09-01  4253.15
1000000     1998-10-01  5189.86
1000000     1998-11-01  5221.06
1000000     1998-12-01  5134.22
1000000     1999-01-01  4572.19
...         ...         ...

--- 186 row(s) selected.

In this query, the ROWS SINCE INCLUSIVE sequence function is used to limit the moving average window to records for the current customer. The third argument of MOVINGAVG is RUNNINGCOUNT(*), which ensures that MOVINGAVG does not include rows before the beginning row. In practice, similar queries can be used to compute several metrics at the same time, and the results, which conceptually are new columns in the Account History table, can be realized in an auxiliary table. This auxiliary table can then be referenced when computing the mining view.

By using sequence functions, you eliminate the dependency on the number and location of moving averages computed for each customer. Even if customers have different numbers of history records, sequence functions allow the computation of a metric for each customer.

Rankings

Simple rankings provide good predictors for many modeling tasks. An example is the rank of a customer's average account balance relative to all other customers. This query computes the absolute rank of the average account balance for each customer:

SELECT cid, RUNNINGCOUNT(*), avg_bal
FROM (SELECT account, AVG(balance)
      FROM acct_history
      GROUP BY account) AS t(cid, avg_bal)
SEQUENCE BY avg_bal DESC;

CID         (EXPR)                AVG_BAL
----------  --------------------  ---------------------
4300000     1                     6203.33
2900000     2                     5184.04
4098124     3                     4920.02
2300000     4                     4610.28
1000000     5                     4067.44
1234567     6                     2807.20
...         ...                   ...

--- 14 row(s) selected.

In practice, the results of this type of query are realized in an auxiliary table that can be thought of as an extension to the Customers table. Percentiles and quartiles can also be computed easily with similar queries. See the Sequence Functions entry in the SQL/MX Reference Manual for more information.

3 Creating the Data Mining View

Because data mining often involves executing a series of similar queries before getting satisfying results, it can be helpful to use materialized results of previous queries when answering a new one. Creating a data mining view allows you to access intentionally gathered and permanently stored results of a data mining query. Creating the data mining view is Step 6 of the knowledge discovery process.

1. Identify and define a business opportunity.
2. Preprocess and load the data for the business opportunity.
3. Profile and understand the relevant data.
4. Define events relevant to the business opportunity being explored.
5. Derive attributes.
6. Create the data mining view. Transform the data into a mining view, a form in which all attributes about the primary mining entity occur in a single record. This transformation involves:
   • Creating the Single Table
   • Pivoting the Data
7. Mine the data and build models.
8. Deploy models.
9. Monitor model performance.

Creating the Single Table

After computing derived attributes and storing these attributes in auxiliary tables, you create the mining view by combining all the information into a single table with one row for each entity. Continuing with the credit card example, the mining view contains the information in the Customers table along with the auxiliary customer data.
In addition, information in the Account History and related tables is also used. Typically, after the mining view is computed and inserted into a single database table, the data is extracted and loaded into a mining tool for the model-building step. The mining data can be extracted through ODBC/MX or the Genus Mining Integrator for NonStop SQL.

Example of Creating the View

The derived attributes, consisting of the three balances for the three months prior to a customer leaving, are specified in the following SQL/MX CREATE TABLE statement. This view aligns the data around the month of a particular event—account attrition.

SET SCHEMA mining.whse;

CREATE TABLE miningview
( account          NUMERIC (7) UNSIGNED NO DEFAULT NOT NULL
                   HEADING 'Account Number'
 ,marital_status   CHARACTER (8) DEFAULT NULL HEADING 'Marital Status'
 ,home             CHARACTER (4) DEFAULT NULL HEADING 'Home'
 ,income           NUMERIC (8, 2) UNSIGNED DEFAULT NULL HEADING 'Income'
 ,gender           CHAR(1) DEFAULT NULL
 ,age              NUMERIC (3) DEFAULT NULL HEADING 'Age'
 ,number_children  NUMERIC (2) DEFAULT NULL HEADING 'Number of Children'
 ,year_month       DATE NO DEFAULT NOT NULL
 ,close_month      DATE NO DEFAULT NOT NULL
 ,balance_close_1  NUMERIC (9,2) NO DEFAULT NOT NULL
 ,balance_close_2  NUMERIC (9,2) NO DEFAULT NOT NULL
 ,balance_close_3  NUMERIC (9,2) NO DEFAULT NOT NULL
 ,cust_left        CHAR(1) NO DEFAULT
 ,PRIMARY KEY (account)
);

Pivoting the Data

All the data in the Customer, Account History, and auxiliary tables must be collapsed to a single row for each customer. Collapsing the data is accomplished by pivoting the data: data is moved from separate rows for each customer into different columns of a single customer row.
For example, the balance one month prior to account closure can be placed in column BALANCE_CLOSE_1, the balance two months prior to account closure in column BALANCE_CLOSE_2, and the balance three months prior to account closure in column BALANCE_CLOSE_3. To accomplish this pivoting operation, use the OFFSET sequence function to collect data from various months and place the results in a single row.

Example Using OFFSET Sequence Function

This query populates the mining view:

INSERT INTO miningview
 (SELECT t.account
        ,c.marital_status
        ,c.home
        ,c.income
        ,c.gender
        ,c.age
        ,c.number_children
        ,t.year_month
        ,t.close_month
        ,t.balance_close_1
        ,t.balance_close_2
        ,t.balance_close_3
        ,t.cust_left
  FROM (SELECT account
              ,year_month
              ,close_month
              ,CASE WHEN year_month = close_month
                    THEN balance
               END AS balance_close_1
              ,CASE WHEN year_month = close_month
                         AND account = OFFSET(account,1)
                    THEN OFFSET(balance,1)
               END AS balance_close_2
              ,CASE WHEN year_month = close_month
                         AND account = OFFSET(account,2)
                    THEN OFFSET(balance,2)
               END AS balance_close_3
              ,cust_left
        FROM acct_history a NATURAL JOIN close_temp m
        SEQUENCE BY account, year_month) AS t,
       customers c
  WHERE t.balance_close_1 IS NOT NULL
    AND t.balance_close_2 IS NOT NULL
    AND t.balance_close_3 IS NOT NULL
    AND c.account = t.account);

Sequence functions are used in the preceding query to create a derived table with the various balances for each customer. This derived table has one row per customer that consists of a single copy of the relevant data.
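The effect of the OFFSET pivot can be seen in a small procedural sketch (Python, illustrative only; the function and its argument layout are ours, not part of SQL/MX or any mining tool):

```python
# Illustrative sketch (not SQL/MX): pivot one account's month-by-month
# history into the three balance_close_* columns used by the mining view.
def pivot_close_balances(history, close_month):
    """history: (year_month, balance) pairs for one account, sorted by
    month. Returns (balance_close_1, balance_close_2, balance_close_3),
    or None when fewer than three months of history reach the close month."""
    months = [m for m, _ in history]
    i = months.index(close_month)          # row for the close month
    if i < 2:
        return None                        # not enough prior history
    balances = [b for _, b in history]
    # _1 = close month, _2 = one month earlier, _3 = two months earlier,
    # which is what OFFSET(balance, 1) and OFFSET(balance, 2) select.
    return balances[i], balances[i - 1], balances[i - 2]

hist = [('1998-08-01', 4596.10), ('1998-09-01', 4347.63),
        ('1998-10-01', 4069.34)]
print(pivot_close_balances(hist, '1998-10-01'))
```

For account 4098124 in the sample data this yields 4069.34, 4347.63, and 4596.10, matching that account's row in the mining view.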
Here are the contents of the Miningview table after the preceding row insertion:

Account Number  Marital Status  Home  Income     Gender  Age  Number Children
1000000         Married         Own   175500.00  M       45   3
1234567         Single          Own    65000.00  F       34   0
2300000         Divorced        Own   137000.00  M       42   2
2400000         Widow           Own    28000.00  F       65   0
2500000         Divorced        Rent   32000.00  M       23   0
2900000         Divorced        Rent  136000.00  F       50   0
3200000         Divorced        Rent  138000.00  M       40   1
3900000         Divorced        Own    75000.00  M       40   2
4098124         Divorced        Own    44000.00  M       44   2
4300000         Married         Own   300000.00  F       29   2
4400000         Single          Own   300000.00  F       29   0
4500000         Married         Own   300000.00  F       29   1
4600000         Single          Own   300000.00  M       48   0
4700000         Widow           Own   300000.00  M       68   0
...

This table shows the remaining columns of the mining view:

Account Number  Year_Month  Close_Month  Balance_Close_1  Balance_Close_2  Balance_Close_3  Cust_Left
1000000         1999-03-01  1999-03-01   5500.00          3500.00          1200.00          N
1234567         1999-12-01  1999-12-01    500.00          1200.00          2870.00          Y
2300000         1999-11-01  1999-11-01       .00              .00              .00          Y
2400000         1998-12-01  1998-12-01       .00              .00              .00          Y
2500000         1999-10-01  1999-10-01       .00              .00              .00          Y
2900000         1999-06-01  1999-06-01   2356.80          1134.00          9432.78          N
3200000         1999-05-01  1999-05-01       .00              .00              .00          Y
3900000         1998-10-01  1998-10-01       .00              .00              .00          Y
4098124         1998-10-01  1998-10-01   4069.34          4347.63          4596.10          Y
4300000         1999-06-01  1999-06-01   9000.00          4354.00          9876.00          N
4400000         1999-07-01  1999-07-01       .00              .00              .00          Y
4500000         1998-09-01  1998-09-01       .00           100.00            50.00          Y
4600000         1999-12-01  1999-12-01   1000.00            50.00            80.00          Y
4700000         1999-06-01  1999-06-01    330.00           330.00           330.00          N

4 Mining the Data

This section describes the next three steps of the process, Steps 7 through 9.
1. Identify and define a business opportunity.
2. Preprocess and load the data for the business opportunity.
3. Profile and understand the relevant data.
4. Define events relevant to the business opportunity being explored.
5. Derive attributes.
6. Create the data mining view.
7. Mine the data and build models. Model building can be done by extracting the mining data into a special mining tool, such as Enterprise Miner from the SAS Institute. A detailed discussion of the use of this tool is beyond the scope of this manual. However, this manual does include building a decision tree as an example of a technique that could be used by a data mining tool for building a model. See Building the Model on page 4-2.
8. Deploy models. Deployment can take many different forms. For example, deployment might be as simple as documenting and reporting the results, or deployment might be embedding the model in an operational system to achieve predictive results. See Deploying the Model on page 4-10.
9. Monitor model performance. Performance of the model must be monitored for accuracy. When accuracy begins to decline, the model must be updated to fit the current situation. See Monitoring Model Performance on page 4-11.

Building the Model

Typically, after the mining view is computed and inserted into a single database table, the data is extracted and loaded into a mining tool for the model-building step. Regardless of the type of analysis to be performed, the mining data can be stored and retrieved by using the SQL/MX approach. This subsection describes how a decision tree can be used for data analysis.

Building Decision Trees

Decision trees are built by recursively partitioning the data in an increasingly selective manner, based on the attributes that most strongly determine the outcome. This classification is determined by computing the best splits at each node in the tree.
The key operation on the data is computing the frequency of various combinations of attribute values for a given subset of the data. This result is called a cross table. The first step in building a decision tree is to generate cross tables for all the attributes compared to the goal attribute. Building a decision tree can require the computation of tens of thousands of cross tables. The computation of each cross table requires scanning the data, applying specified predicates, grouping, and computing counts.

Computing Cross Tables

The first set of cross tables needed for building a decision tree consists of each independent variable (a potential predictor) paired with the dependent variable (the goal). In the same way that the profiling queries are combined by using TRANSPOSE, these separate cross-table queries can be combined into a single query for each node in the decision tree.

Computing Cross Tables to Determine the Initial Branch

This query computes the cross tables for Gender, Marital Status, and Number_Children, with Cust_Left as the dependent variable (the goal):

SET SCHEMA mining.whse;

SELECT Independent_Variable, IV1, IV2, cust_left, COUNT(*)
FROM miningview
TRANSPOSE ('GENDER', gender, NULL),
          ('MARITAL STATUS', marital_status, NULL),
          ('NUMBER_CHILDREN', NULL, number_children)
       AS (Independent_Variable, IV1, IV2)
GROUP BY Independent_Variable, IV1, IV2, cust_left
ORDER BY Independent_Variable, IV1, IV2, cust_left;

INDEPENDENT_VARIABLE  IV1       IV2     CUST_LEFT  (EXPR)
--------------------  --------  ------  ---------  --------
GENDER                F         ?       N          2
GENDER                F         ?       Y          4
GENDER                M         ?       N          2
GENDER                M         ?       Y          6
MARITAL STATUS        Divorced  ?       N          1
MARITAL STATUS        Divorced  ?       Y          5
MARITAL STATUS        Married   ?       N          2
MARITAL STATUS        Married   ?       Y          1
MARITAL STATUS        Single    ?       Y          3
MARITAL STATUS        Widow     ?       N          1
MARITAL STATUS        Widow     ?       Y          1
NUMBER_CHILDREN       ?         0       N          2
NUMBER_CHILDREN       ?         0       Y          5
NUMBER_CHILDREN       ?         1       Y          2
NUMBER_CHILDREN       ?         2       N          1
NUMBER_CHILDREN       ?         2       Y          3
NUMBER_CHILDREN       ?         3       N          1

--- 17 row(s) selected.

Determining Which Attribute Best Predicts the Goal

Consider the results of the preceding query. You are ready to determine which of the independent variables best predicts the dependent variable (the goal). Examine the rows for each independent variable in the query. If most of the rows for a particular value of an independent variable correlate with Cust_Left equal to Y, that independent variable is a good predictor of the goal. This type of analysis is typically performed by client mining tools.

GENDER (predictor: Yes)
  When Cust_Left is equal to Y, Gender is predominantly equal to M. The number of males is 6, and the number of females is 4.

MARITAL STATUS (predictor: Yes)
  When Cust_Left is equal to Y, Marital Status is predominantly equal to Divorced and Single. The number of Divorced is 5, the number of Married is 1, the number of Single is 3, and the number of Widow is 1.

NUMBER_CHILDREN (predictor: No)
  When Cust_Left is equal to Y, Number_Children is 0, 1, and 2. The number with Children=0 is 5, the number with Children=1 is 2, and the number with Children=2 is 3. The values do not show a pattern and do not predict Cust_Left equal to Y.

Both Gender and Marital Status are reasonable choices as the best predictor of the goal. To carry out the remaining cross-table generations, this scenario uses Marital Status as the best predictor for the initial branch of the decision tree. Typically, the best discriminator of the goal is determined by a statistical analysis of the cross tables. The exact nature of this analysis varies from tool to tool.

Initial Decision Tree

Figure 4-1 shows the initial decision tree for the business opportunity. Marital Status is chosen as the best predictor of the goal with four initial branches—Divorced, Single, Married, and Widow.

Figure 4-1.
Initial Branches of Decision Tree

  Marital Status
    Divorced: No = 1, Yes = 5
    Single:   No = 0, Yes = 3
    Married:  No = 2, Yes = 1
    Widow:    No = 1, Yes = 1

The model is built to characterize the customers that have left—that is, the model will find the rows where Cust_Left is Y. The results for Divorced and Single are the most promising for further development of the decision tree. For Divorced, the number of records is 5 for Cust_Left equal to Y, and for Single, the number of records is 3 for Cust_Left equal to Y. In both cases, the results of the cross table show the best homogeneous split with respect to the goal.

Initial Branches of the Decision Tree

The two initial branches that seem most promising are defined by two conditions:

marital_status = 'Divorced'
marital_status = 'Single'

Computing Cross Tables When Marital Status Equal to Divorced

This query generates cross tables for all attributes, except Marital Status, compared to the goal when Marital Status is equal to Divorced:

SELECT Independent_Variable, IV1, IV2, cust_left, COUNT(*)
FROM miningview
WHERE marital_status = 'Divorced'
TRANSPOSE ('GENDER', gender, NULL),
          ('NUMBER_CHILDREN', NULL, number_children)
       AS (Independent_Variable, IV1, IV2)
GROUP BY Independent_Variable, IV1, IV2, cust_left
ORDER BY Independent_Variable, IV1, IV2, cust_left;

INDEPENDENT_VARIABLE  IV1  IV2  CUST_LEFT  (EXPR)
--------------------  ---  ---  ---------  ------
GENDER                F    ?    N          1
GENDER                M    ?    Y          5
NUMBER_CHILDREN       ?    0    N          1
NUMBER_CHILDREN       ?    0    Y          1
NUMBER_CHILDREN       ?    1    Y          1
NUMBER_CHILDREN       ?    2    Y          3

--- 6 row(s) selected.

The preceding query shows that Gender is Male in all cases where Cust_Left is equal to Y, and therefore Gender is a good predictor where Marital Status is Divorced. Number_Children is equal to 0, 1, and 2, and therefore Number_Children is not a good predictor.

Decision Tree for Divorced Branch

Figure 4-2 shows the results of the preceding query for the example business opportunity.

Figure 4-2. Decision Tree for Divorced Branch

  Marital Status
    Divorced: No = 1, Yes = 5
      Male:   No = 0, Yes = 5
      Female: No = 1, Yes = 0
    Single:   No = 0, Yes = 3
    Married:  No = 2, Yes = 1
    Widow:    No = 1, Yes = 1

For Divorced, when Cust_Left is equal to Y, the number of records is 5 for Gender equal to Male. Gender best discriminates the goal when Marital Status is equal to Divorced.

Computing Cross Tables When Marital Status Equal to Single

This query generates cross tables for all attributes, except Marital Status, compared to the goal when Marital Status is equal to Single:

SELECT Independent_Variable, IV1, IV2, cust_left, COUNT(*)
FROM miningview
WHERE marital_status = 'Single'
TRANSPOSE ('GENDER', gender, NULL),
          ('NUMBER_CHILDREN', NULL, number_children)
       AS (Independent_Variable, IV1, IV2)
GROUP BY Independent_Variable, IV1, IV2, cust_left
ORDER BY Independent_Variable, IV1, IV2, cust_left;

INDEPENDENT_VARIABLE  IV1  IV2  CUST_LEFT  (EXPR)
--------------------  ---  ---  ---------  ------
GENDER                F    ?    Y          2
GENDER                M    ?    Y          1
NUMBER_CHILDREN       ?    0    Y          3

--- 3 row(s) selected.

The preceding query shows split results for Gender when Cust_Left is equal to Y, and therefore Gender is not a good predictor when Marital Status is equal to Single. However, the query also shows Number_Children equal to 0 when Cust_Left is equal to Y, and therefore Number_Children is a good predictor.

Decision Tree for Single Branch

Figure 4-3 shows the results of the preceding query for the example business opportunity.

Figure 4-3. Decision Tree for Single Branch

  Marital Status
    Divorced:   No = 1, Yes = 5
    Single:     No = 0, Yes = 3
      Chldrn=0: No = 0, Yes = 3
      Chldrn>0: No = 0, Yes = 0
    Married:    No = 2, Yes = 1
    Widow:      No = 1, Yes = 1

For Single, when Cust_Left is equal to Y, the number of records is 3 for Number_Children equal to 0. Number_Children best discriminates the goal when Marital Status is equal to Single.

Conditions Defining the Decision Tree

The model developed so far seems to characterize the customers that have left—that is, the model finds the rows where Cust_Left is equal to Y. The model is now defined by two conditions:

(marital_status = 'Divorced' AND gender = 'M')
(marital_status = 'Single' AND number_children = 0)

For Divorced and Male, the number of records is 5 for Cust_Left equal to Y, and the number of records is 0 for Cust_Left equal to N. For Single and Number_Children equal to 0, the number of records is 3 for Cust_Left equal to Y, and the number of records is 0 for Cust_Left equal to N.

Showing the Homogeneous Branches

For each of the preceding conditions, these queries show that the branches in the decision tree are homogeneous with respect to the goal attribute Cust_Left.

Computing Cross Table When Marital Status is Divorced and Gender is Male

This query generates a cross table for the Gender attribute compared to the goal when Marital Status is Divorced and Gender is Male:

SELECT Independent_Variable, IV1, cust_left, COUNT(*)
FROM miningview
WHERE marital_status = 'Divorced' AND gender = 'M'
TRANSPOSE ('GENDER', gender)
       AS (Independent_Variable, IV1)
GROUP BY Independent_Variable, IV1, cust_left
ORDER BY Independent_Variable, IV1, cust_left;

INDEPENDENT_VARIABLE  IV1  CUST_LEFT  (EXPR)
--------------------  ---  ---------  ------
GENDER                M    Y          5

--- 1 row(s) selected.

This group of records is homogeneous with respect to Cust_Left—that is, Cust_Left is equal to Y in all cases.
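The cross-table counting and homogeneity reasoning used above can be sketched procedurally (Python, illustrative only; real client mining tools use statistical criteria such as chi-square or information gain rather than this simple purity score, and the sample rows here are hypothetical):

```python
# Illustrative sketch: a cross table (attribute value x goal value -> count)
# and a crude homogeneity score for picking the next split attribute.
from collections import Counter

def cross_table(rows, attr, goal):
    """Count each (attribute value, goal value) pair, as the TRANSPOSE /
    GROUP BY queries in this section do."""
    return Counter((r[attr], r[goal]) for r in rows)

def purity(rows, attr, goal):
    """Fraction of rows that fall in the majority goal class of their
    branch; 1.0 means every branch is homogeneous."""
    majority = Counter()
    for (value, _), n in cross_table(rows, attr, goal).items():
        majority[value] = max(majority[value], n)
    return sum(majority.values()) / len(rows)

rows = [
    {'marital_status': 'Divorced', 'gender': 'M', 'cust_left': 'Y'},
    {'marital_status': 'Divorced', 'gender': 'F', 'cust_left': 'N'},
    {'marital_status': 'Single',   'gender': 'F', 'cust_left': 'Y'},
    {'marital_status': 'Married',  'gender': 'M', 'cust_left': 'N'},
]
print(cross_table(rows, 'marital_status', 'cust_left'))
print(purity(rows, 'marital_status', 'cust_left'))
```

Comparing the purity score of each candidate attribute at a node corresponds to the "which attribute best predicts the goal" step; the attribute with the most homogeneous branches wins.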
Computing Cross Table When Marital Status is Divorced and Gender is Female

This query generates a cross table for the Gender attribute compared to the goal when Marital Status is Divorced and Gender is Female:

SELECT Independent_Variable, IV1, cust_left, COUNT(*)
FROM miningview
WHERE marital_status = 'Divorced' AND gender = 'F'
TRANSPOSE ('GENDER', gender)
       AS (Independent_Variable, IV1)
GROUP BY Independent_Variable, IV1, cust_left
ORDER BY Independent_Variable, IV1, cust_left;

INDEPENDENT_VARIABLE  IV1  CUST_LEFT  (EXPR)
--------------------  ---  ---------  ------
GENDER                F    N          1

--- 1 row(s) selected.

This group of records is homogeneous with respect to Cust_Left—that is, Cust_Left is equal to N in all cases.

Computing Cross Table When Marital Status is Single and Children is Zero

This query generates a cross table for the Number_Children attribute compared to the goal when Marital Status is Single and Number_Children is 0:

SELECT Independent_Variable, IV2, cust_left, COUNT(*)
FROM miningview
WHERE marital_status = 'Single' AND number_children = 0
TRANSPOSE ('NUMBER_CHILDREN', number_children)
       AS (Independent_Variable, IV2)
GROUP BY Independent_Variable, IV2, cust_left
ORDER BY Independent_Variable, IV2, cust_left;

INDEPENDENT_VARIABLE  IV2  CUST_LEFT  (EXPR)
--------------------  ---  ---------  ------
NUMBER_CHILDREN       0    Y          3

--- 1 row(s) selected.

This group of records is homogeneous with respect to Cust_Left—that is, Cust_Left is equal to Y in all cases.
Computing Cross Table When Marital Status is Single and Children > Zero

This query generates a cross table for the Number_Children attribute compared to the goal when Marital Status is Single and Number of Children is greater than 0:

SELECT Independent_Variable, IV2, cust_left, COUNT(*)
FROM miningview
WHERE marital_status = 'Single' AND number_children > 0
TRANSPOSE ('NUMBER_CHILDREN', number_children)
       AS (Independent_Variable, IV2)
GROUP BY Independent_Variable, IV2, cust_left
ORDER BY Independent_Variable, IV2, cust_left;

--- 0 row(s) selected.

No records satisfy this condition, so this (empty) group is trivially homogeneous with respect to Cust_Left equal to N.

Final Decision Tree

You have now finished developing the decision tree. Each branch of the tree is homogeneous with respect to the value of Cust_Left. In practice, this process is highly iterative. Expanding each node might require several iterations, and you might need to back up to a previous node to consider another alternative.

Figure 4-4 shows the final decision tree for the example business opportunity.

Figure 4-4. Final Decision Tree

  Marital Status
    Divorced:   No = 1, Yes = 5
      Male:     No = 0, Yes = 5
      Female:   No = 1, Yes = 0
    Single:     No = 0, Yes = 3
      Chldrn=0: No = 0, Yes = 3
      Chldrn>0: No = 0, Yes = 0
    Married:    No = 2, Yes = 1   (pruned)
    Widow:      No = 1, Yes = 1   (pruned)

The tree is pruned at the Married and Widow branches because the remaining branches do not yield a pattern.

Changing the Process

Classification trees are used to predict or explain responses to categorical dependent variables. If you had not been able to develop a classification tree with homogeneous branches with respect to Cust_Left, you could now do any of the following:

• Redefine the statement of the business opportunity. The data analysis process might indicate new directions that offer more interesting results.
• Redefine the goal. The goal is equal to Y if the customer had a zero balance for a period of 3 months. This definition might need adjustment.
• Add or remove columns in the mining view. Some columns that do not contribute to the goal can be removed. Also, the initial analysis might give new insight into columns that could be added.
• Change the definition of derived columns. For example, the average balance for the period of 3 months might be a better choice than a zero balance for 3 months.
• Change the mappings on the encoded columns.

Each iteration of the data mining process gives new insight into the changes you might make for the next iteration.

Checking the Model

After you develop a model, you can check the model against the mining data.

Applying the Model to the Mining Table

You must check your model against the mining table.

Finding the Rows Where the Customer Left

This query finds most of the rows where Cust_Left is equal to Y:

SELECT account, cust_left
FROM miningview
WHERE (marital_status = 'Divorced' AND gender = 'M')
   OR (marital_status = 'Single' AND number_children = 0);

Account Number  CUST_LEFT
--------------  ---------
1234567         Y
2300000         Y
2500000         Y
3200000         Y
3900000         Y
4098124         Y
4400000         Y
4600000         Y

--- 8 row(s) selected.

Applying the Model to the Database

Now, check your model against the database. Before applying the model to the database, you can remove the tables and attributes that are not used in the analysis. You must remove any mappings you created between the values in the database and the values in the mining table.

Deploying the Model

After a model has been built and tested, the results are deployed into the business environment. In many cases, deployment means exporting the model back to the database to be used to evaluate new cases. Depending on its complexity, a model can be evaluated directly in the database by using standard SQL or user-defined functions. Simple models like decision trees can usually be represented in standard SQL by using a complex CASE statement.
Many mining tools have the ability to export a CASE statement representing a decision tree. However, many models cannot be evaluated directly by using SQL. In this case, user-defined functions are needed. Most mining tools have the ability to export a C function that evaluates a model. The function code can be compiled and then executed in the DBMS as a user-defined function. Object-relational enhancements to SQL/MX include such user-defined functions, which are accessible through standard SQL and executed directly in the database.

Monitoring Model Performance

When measuring a model, consider these questions:

• How accurate is the model? The accuracy of the model can be measured as a whole. For example, you can determine the percentage of records that are classified correctly. The accuracy of the parts of a model can also be measured. For example, in a decision tree, each branch of the tree has an associated error rate.
• To what degree does the model describe the observed data? The model should be sufficiently descriptive with respect to the observed data to make clear why a particular prediction was made.
• What is the level of confidence in the model's predictions? Confidence is a measure of how often the model predicts the goal in the training data set.
• Is the model easily understood? A predictive model that consists of a few simple rules is preferable to a model that contains many rules, even if the latter is more accurate.

However, in the end, the only true measure of a business model is its return on investment. In a marketing application, measuring a model requires setting aside control groups and carefully tracking customer responses to various models.
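As a sketch of the CASE-statement style of deployment described above (Python, illustrative only; the rule format and helper function are hypothetical, not the export format of any particular mining tool), the two-rule tree developed in this section could be rendered like this:

```python
# Illustrative sketch: render decision-tree rules as a SQL CASE expression
# of the kind a mining tool might export for in-database scoring.
def rules_to_case(rules, default="'N'"):
    """rules: (condition SQL, predicted value SQL) pairs, in priority order."""
    branches = "\n".join(f"  WHEN {cond} THEN {value}"
                         for cond, value in rules)
    return f"CASE\n{branches}\n  ELSE {default}\nEND"

rules = [
    ("marital_status = 'Divorced' AND gender = 'M'", "'Y'"),
    ("marital_status = 'Single' AND number_children = 0", "'Y'"),
]
print(rules_to_case(rules))
```

The resulting expression can be embedded in an ordinary SELECT so that new customer rows are scored directly in the database, which is the simple end of the deployment spectrum; more complex models fall back to user-defined functions as described above.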
A Creating the Data Mining Database

The examples presented in this manual use tables created by the execution of SQL/MX CREATE TABLE statements. These SQL/MX DDL statements enable you to create the data mining database so that you can use the SQL/MX features shown in this manual.

--------------------------------------------------------------
-- Data mining database in catalog dmcat and schema whse
-- Run this script in MXCI
--------------------------------------------------------------
CREATE CATALOG dmcat;
CREATE SCHEMA dmcat.whse;
SET SCHEMA dmcat.whse;

-- Create tables CUSTOMERS and ACCT_HISTORY in WHSE schema

-- Create CUSTOMERS table in WHSE schema
DROP TABLE customers;
CREATE TABLE customers
( account          NUMERIC (7) UNSIGNED NO DEFAULT
                   NOT NULL NOT DROPPABLE HEADING 'Account Number'
 ,first_name       CHARACTER (15) DEFAULT ' '
                   NOT NULL NOT DROPPABLE HEADING 'First Name'
 ,last_name        CHARACTER (20) DEFAULT ' '
                   NOT NULL NOT DROPPABLE HEADING 'Last Name'
 ,marital_status   CHARACTER (8) DEFAULT NULL HEADING 'Marital Status'
 ,home             CHARACTER (4) DEFAULT NULL HEADING 'Home'
 ,income           NUMERIC (8, 2) UNSIGNED DEFAULT NULL HEADING 'Income'
 ,gender           CHAR(1) DEFAULT NULL
 ,age              NUMERIC (3) DEFAULT NULL HEADING 'Age'
 ,number_children  NUMERIC (2) DEFAULT NULL HEADING 'Number of Children'
 ,PRIMARY KEY (account) NOT DROPPABLE
)
LOCATION $P2
PARTITION (ADD FIRST KEY 3000000 LOCATION $VOLUME,
           ADD FIRST KEY 5000000 LOCATION $P1);

-- Set constraint on home column; must be Rent or Own or NULL
ALTER TABLE customers ADD CONSTRAINT home_constraint
CHECK (home = 'Own' OR home = 'Rent' OR home IS NULL);

-- Set constraint on marital status column; must be Divorced,
-- Married, Widow, Single or NULL
ALTER TABLE customers ADD CONSTRAINT ms_constraint
CHECK (marital_status =
'Divorced' OR
       marital_status = 'Married' OR
       marital_status = 'Single' OR
       marital_status = 'Widow' OR
       marital_status IS NULL);

-- Set constraint on gender column; must be F, M or NULL
ALTER TABLE customers ADD CONSTRAINT gender_constraint
CHECK (gender = 'F' OR gender = 'M' OR gender IS NULL);

-- Create the ACCT_HISTORY table in WHSE schema
DROP TABLE acct_history;
CREATE TABLE acct_history
( account         NUMERIC (7) UNSIGNED NO DEFAULT NOT NULL NOT DROPPABLE
 ,year_month      DATE NO DEFAULT NOT NULL NOT DROPPABLE
 ,status          CHAR (10) NO DEFAULT NOT NULL NOT DROPPABLE
 ,cust_limit      NUMERIC (9,2) NO DEFAULT NOT NULL NOT DROPPABLE
 ,balance         NUMERIC (9,2) NO DEFAULT NOT NULL NOT DROPPABLE
 ,payment         NUMERIC (9,2) NO DEFAULT NOT NULL NOT DROPPABLE
 ,finance_charge  NUMERIC (9,2) NO DEFAULT NOT NULL NOT DROPPABLE
 ,PRIMARY KEY (account, year_month)
)
LOCATION $P2
PARTITION (ADD FIRST KEY 3000000 LOCATION $VOLUME,
           ADD FIRST KEY 5000000 LOCATION $P1);

-- Set constraint on status column; must be Open,
-- Delinquent, or Closed
ALTER TABLE acct_history ADD CONSTRAINT status_constraint
CHECK (status = 'Open' OR status = 'Delinquent' OR status = 'Closed');

--------------------------------------------------------------

B Inserting Into the Data Mining Database

The following INSERT statements enable you to populate the data mining database.
Use the following script to populate the CUSTOMERS table and the ACCT_HISTORY table:

--------------------------------------------------------------
-- Data mining database in catalog dmcat and schema whse
-- Run this script in MXCI
--------------------------------------------------------------
-- POPULATE THE DATA MINING DATABASE TABLES
SET SCHEMA dmcat.whse;

INSERT INTO customers VALUES
(1234567,'MARY',   'JONES',   'Single',  'Own',  65000,'F',34,0),
(2500000,'ALI',    'ABBAS',   'Divorced','Rent', 32000,'M',23,0),
(4098124,'TOMOKO', 'KANO',    'Divorced','Own',  44000,'M',44,2),
(2400000,'ERIKA',  'LUND',    'Widow',   'Own',  28000,'F',65,0),
(1000000,'ROGER',  'GREEN',   'Married', 'Own', 175500,'M',45,3),
(2300000,'JERRY',  'HOWARD',  'Divorced','Own', 137000,'M',42,2),
(2900000,'JANE',   'RAYMOND', 'Divorced','Rent',136000,'F',50,0),
(3200000,'THOMAS', 'RUDLOFF', 'Divorced','Rent',138000,'M',40,1),
(3900000,'KLAUS',  'SAFFERT', 'Divorced','Own',  75000,'M',40,2),
(4300000,'DEBBIE', 'DUNN',    'Married', 'Own', 300000,'F',29,2),
(4400000,'HANNAH', 'ROSE',    'Single',  'Own', 300000,'F',29,0),
(4500000,'LIZ',    'STONE',   'Married', 'Own', 300000,'F',29,1),
(4600000,'HANS',   'NOBLE',   'Single',  'Own', 300000,'M',48,0),
(4700000,'SEAN',   'FREDRICK','Widow',   'Own', 300000,'M',68,0),
(5000000,'CYNTHIA','TREBLE',  'Single',  'Own',  65000,'F',34,0),
(5200000,'FRANK',  'KIRBY',   'Married', 'Rent', 32000,'M',23,0),
(5300000,'ROBERT', 'HOLDER',  'Divorced','Own',  44000,'M',44,2),
(6000000,'VALERIE','RECORD',  'Widow',   'Own',  28000,'F',65,0),
(7000000,'KARL',   'SMITH',   'Married', 'Own', 175500,'M',45,3),
(7100000,'BRADLEY','RAY',     'Widow',   'Own', 137000,'M',42,2),
(7200000,'KIRSTEN','HOWARD',  'Married', 'Rent',136000,'F',50,0),
(7300000,'GERALD', 'CACHMAN', 'Divorced','Rent',138000,'M',40,1),
(7400000,'MILES',  'KOCH',    'Divorced','Own',  75000,'M',40,2),
(7500000,'SYDNEY', 'NICOLE',  'Single',  'Own', 200000,'F',25,0),
(7600000,'ERIN',   'MCDONALD','Single',  'Own',  65000,'F',34,0),
(7700000,'MATT',   'STEVENS', 'Married', 'Rent', 32000,'M',23,0),
(7800000,'SANDY',  'MILLER',  'Divorced','Own',  44000,'M',44,2),
(7900000,'LAUREN', 'LITTLE',  'Widow',   'Own',  28000,'F',65,0),
(8000000,'BRENT',  'BLACK',   'Married', 'Own', 175500,'M',45,3),
(8100000,'STEVEN', 'HUFF',    'Widow',   'Own', 137000,'M',42,2),
(8200000,'ELLIE',  'RAYMOND', 'Married', 'Rent',136000,'F',50,0),
(8300000,'PATRICK','ZORO',    'Divorced','Rent',138000,'M',40,1),
(8400000,'SHAWN',  'JONES',   'Divorced','Own',  75000,'M',40,2),
(8500000,'ABBIE',  'LAUREN',  'Single',  'Own', 200000,'F',19,0),
(8600000,'ELSIE',  'VANDER',  'Single',  'Own', 200000,'F',30,0),
(8700000,'SARAH',  'PETERS',  'Single',  'Own', 200000,'F',19,0),
(8800000,'ALLIE',  'BOWERS',  'Single',  'Own', 200000,'F',40,0),
(8900000,'KELSEY', 'SMITH',   'Single',  'Own', 200000,'F',28,0),
(9000000,'KIM',    'TENNEL',  'Single',  'Own', 200000,'F',56,0),
(9100000,'TJ',     'CASWELL', 'Single',  'Own', 200000,'M',25,0),
(9200000,'HELEN',  'SPOTS',   'Single',  'Own', 200000,'F',29,0),
(9300000,'JOHN',   'MOORE',   'Single',  'Own', 200000,'M',43,0);

-- Insert into ACCT_HISTORY 12 to 36 records per account
INSERT INTO acct_history VALUES
(1234567,DATE '2003-01-01','Open',  10000,1232.50,1232.50,0.00),
(1234567,DATE '2003-02-01','Open',  10000,3000.00,3000.00,0.00),
(1234567,DATE '2003-03-01','Open',  10000,1034.00,1034.00,0.00),
(1234567,DATE '2003-04-01','Open',  10000,2500.00,2500.00,0.00),
(1234567,DATE '2003-05-01','Open',  10000,1050.00,1050.00,0.00),
(1234567,DATE '2003-06-01','Open',  10000,6500.00,6500.00,0.00),
(1234567,DATE '2003-07-01','Open',  10000,3000.00,3000.00,0.00),
(1234567,DATE '2003-08-01','Open',  10000,7800.00,7800.00,0.00),
(1234567,DATE '2003-09-01','Open',  10000,3000.00,3000.00,0.00),
(1234567,DATE '2003-10-01','Open',  10000,2870.00,2870.00,0.00),
(1234567,DATE '2003-11-01','Open',  10000,1200.00,1200.00,0.00),
(1234567,DATE '2003-12-01','Closed',10000, 500.00, 500.00,0.00);

INSERT INTO acct_history VALUES
(2500000,DATE '2002-07-01','Open',5000, 566.00,  32.00,   8.00),
(2500000,DATE '2002-08-01','Open',5000, 600.00,  40.00,   9.23),
(2500000,DATE '2002-09-01','Open',5000, 632.00,  32.00,   8.00),
(2500000,DATE '2002-10-01','Open',5000, 615.00,  31.00,   8.00),
(2500000,DATE '2002-11-01','Open',5000, 670.00,  42.00,  10.40),
(2500000,DATE '2002-12-01','Open',5000, 650.00,  37.00,  10.00),
(2500000,DATE '2003-01-01','Open',5000, 703.00,  50.00,  13.00),
(2500000,DATE '2003-02-01','Open',5000, 723.00,  23.00,  14.00),
(2500000,DATE '2003-03-01','Open',5000, 700.00,  20.00,  13.75),
(2500000,DATE '2003-04-01','Open',5000, 745.00,  22.00,  13.60),
(2500000,DATE '2003-05-01','Open',5000, 745.00,   0,     89.40),
(2500000,DATE '2003-06-01','Open',5000, 834.40,  75,    100.28),
(2500000,DATE '2003-07-01','Open',5000, 834.40, 834.40,   0),
(2500000,DATE '2003-08-01','Open',5000,   0,      0,      0),
(2500000,DATE '2003-09-01','Open',5000,   0,      0,      0),
(2500000,DATE '2003-10-01','Open',5000,   0,      0,      0),
(2500000,DATE '2003-11-01','Open',5000,   0,      0,      0),
(2500000,DATE '2003-12-01','Open',5000,   0,      0,      0);

INSERT INTO acct_history VALUES
(4098124,DATE '2000-10-01','Open',  6000,32000.00,3200.00,  0.00),
(4098124,DATE '2000-11-01','Open',  6000, 2300.00,2300.00,  0.00),
(4098124,DATE '2000-12-01','Open',  6000,    0.00,   0.00,  0.00),
(4098124,DATE '2001-01-01','Open',  6000,    0.00,   0.00,  0.00),
(4098124,DATE '2001-02-01','Open',  6000, 4000.00,4000.00,  0.00),
(4098124,DATE '2001-03-01','Open',  6000, 1200.00,1200.00,  0.00),
(4098124,DATE '2001-04-01','Open',  6000,  320.00, 320.00,  0.00),
(4098124,DATE '2001-05-01','Open',  6000, 1000.00,1000.00,  0.00),
(4098124,DATE '2001-06-01','Open',  6000, 2300.00,2300.00,  0.00),
(4098124,DATE '2001-07-01','Open',  6000, 1200.00,1200.00,  0.00),
(4098124,DATE '2001-08-01','Open',  6000, 5400.00, 400.00,500.00),
(4098124,DATE '2001-09-01','Open',  6000, 5300.00, 300.00,550.00),
(4098124,DATE '2001-10-01','Open',  6000, 6000.00, 800.00,720.00),
(4098124,DATE '2001-11-01','Open',  6000, 5920.00, 800.00,710.40),
(4098124,DATE '2001-12-01','Open',  6000, 5830.40, 800.00,699.65),
(4098124,DATE '2002-01-01','Open',  6000, 5730.04, 800,   687.60),
(4098124,DATE '2002-02-01','Open',  6000, 5617.65, 800,   674.11),
(4098124,DATE '2002-03-01','Open',  6000, 5491.77, 800,   659.01),
(4098124,DATE '2002-04-01','Open',  6000, 5350.78, 800,   642.09),
(4098124,DATE '2002-05-01','Open',  6000, 5192.87, 800,   623.14),
(4098124,DATE '2002-06-01','Open',  6000, 5016.02, 800,   601.92),
(4098124,DATE '2002-07-01','Open',  6000, 4817.94, 800,   578.15),
(4098124,DATE '2002-08-01','Open',  6000, 4596.10, 800,   551.53),
(4098124,DATE '2002-09-01','Open',  6000, 4347.63, 800,   521.71),
(4098124,DATE '2002-10-01','Closed',6000, 4069.34, 800,   488.32);

INSERT INTO acct_history VALUES
(2400000,DATE '2002-01-01','Open',  5000,  50.00,  50.00, 0.00),
(2400000,DATE '2002-02-01','Open',  5000, 100.00,  50.00, 0.75),
(2400000,DATE '2002-03-01','Open',  5000,  50.00,  50.00, 0.00),
(2400000,DATE '2002-04-01','Open',  5000, 380.00,  50.00, 4.95),
(2400000,DATE '2002-05-01','Open',  5000, 330.00,  60.00, 4.05),
(2400000,DATE '2002-06-01','Open',  5000, 430.45,  55.00, 5.63),
(2400000,DATE '2002-07-01','Open',  5000, 470.34,  55.00, 6.23),
(2400000,DATE '2002-08-01','Open',  5000, 545.00,  60.00, 7.27),
(2400000,DATE '2002-09-01','Open',  5000, 490.67, 490.67, 0),
(2400000,DATE '2002-10-01','Open',  5000,   0.00,   0.00, 0.00),
(2400000,DATE '2002-11-01','Open',  5000,   0.00,   0.00, 0.00),
(2400000,DATE '2002-12-01','Closed',5000,   0.00,   0.00, 0.00);

INSERT INTO acct_history VALUES
(1000000,DATE '2002-07-01','Open',20000,3678.67,3678.67, 0.00),
(1000000,DATE '2002-08-01','Open',20000,6780.00,6780.00, 0.00),
(1000000,DATE '2002-09-01','Open',20000,2300.78,2300.78, 0.00),
(1000000,DATE '2002-10-01','Open',20000,8000.00,8000.00, 0.00),
(1000000,DATE '2002-11-01','Open',20000,5345.89,5345.89, 0.00),
(1000000,DATE '2002-12-01','Open',20000,4700.00,4700.00, 0.00),
(1000000,DATE '2003-01-01','Open',20000,1200.00,1200.00, 0.00),
(1000000,DATE '2003-02-01','Delinquent',20000,3500.00,0,51.75),
(1000000,DATE '2003-03-01','Open',20000,5500.00,5500.00, 0.00),
(1000000,DATE '2003-04-01','Open',20000,   0.00,   0.00, 0.00),
(1000000,DATE '2003-05-01','Open',20000,6500.00,6500.00, 0.00),
(1000000,DATE '2003-06-01','Open',20000,4590.00,4590.00, 0.00),
(1000000,DATE '2003-07-01','Open',20000,3200.00,3200.00, 0.00),
(1000000,DATE '2003-08-01','Open',20000, 167.89, 167.89, 0.00),
(1000000,DATE '2003-09-01','Open',20000,9800.00,9800.00, 0.00),
(1000000,DATE '2003-10-01','Open',20000,  50.00,  50.00, 0.00),
(1000000,DATE '2003-11-01','Open',20000,2300.78,2300.78, 0.00),
(1000000,DATE '2003-12-01','Open',20000,5600.00,5600.00, 0.00);

INSERT INTO acct_history VALUES
(2300000,DATE '2002-11-01','Open',15000,    0,      0,      0),
(2300000,DATE '2002-12-01','Open',15000,10000.00,1500.00,127.5),
(2300000,DATE '2003-01-01','Open',15000, 9500.00,1500.00,120.00),
(2300000,DATE '2003-02-01','Open',15000, 8120.00,1500.00, 99.30),
(2300000,DATE '2003-03-01','Open',15000,12000.00,4000.00,120),
(2300000,DATE '2003-04-01','Open',15000, 8120.00,4000.00, 61.80),
(2300000,DATE '2003-05-01','Open',15000, 5004.00,1500.00, 52.56),
(2300000,DATE '2003-06-01','Open',15000, 3500.00,1500.00, 30.00),
(2300000,DATE '2003-07-01','Open',15000, 4500.00, 800.00, 55.50),
(2300000,DATE '2003-08-01','Open',15000, 3800.00,1500.00, 34.50),
(2300000,DATE '2003-09-01','Open',15000,    0,      0,      0),
(2300000,DATE '2003-10-01','Open',15000,    0,      0,      0),
(2300000,DATE '2003-11-01','Open',15000,    0,      0,      0),
(2300000,DATE '2003-12-01','Open',15000,    0,      0,      0);

INSERT INTO acct_history VALUES
(2900000,DATE '2003-01-01','Open',15000,10000.00,10000.00,
(2900000,DATE '2003-02-01','Open',15000, 3456.00, 3456.00,
(2900000,DATE '2003-03-01','Open',15000, 2300.90, 2300.90,
(2900000,DATE '2003-04-01','Open',15000, 9432.78, 9432.78,
(2900000,DATE '2003-05-01','Open',15000,
1134.00, 1134.00, (2900000,DATE '2003-06-01','Open',15000, 2356.80, 2356.80, (2900000,DATE '2003-07-01','Open',15000, 9870.00, 9870.00, (2900000,DATE '2003-08-01','Open',15000, 8765.00, 8765.00, (2900000,DATE '2003-09-01','Open',15000, 2460.00, 2460.00, (2900000,DATE '2003-10-01','Open',15000, 4543.00, 4543.00, HP NonStop SQL/MX Data Mining Guide—523737-001 B- 4 0), 0), 0), 0), 0), 0), 0), 0), 0), 0), Inserting Into the Data Mining Database (2900000,DATE '2003-11-01','Open',15000, 2000.00, 2000.00, 0), (2900000,DATE '2003-12-01','Open',15000, 5890.00, 5890.00, 0); INSERT INTO acct_history VALUES (3200000,DATE '2002-07-01','Open', (3200000,DATE '2002-08-01','Open', (3200000,DATE '2002-09-01','Open', (3200000,DATE '2002-10-01','Open', (3200000,DATE '2002-11-01','Open', (3200000,DATE '2002-12-01','Open', (3200000,DATE '2003-01-01','Open', (3200000,DATE '2003-02-01','Open', (3200000,DATE '2003-03-01','Open', (3200000,DATE '2003-04-01','Open', (3200000,DATE '2003-05-01','Open', (3200000,DATE '2003-06-01','Open', 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, INSERT INTO acct_history VALUES (3900000,DATE '2001-12-01','Open', (3900000,DATE '2002-01-01','Open', (3900000,DATE '2002-02-01','Open', (3900000,DATE '2002-03-01','Open', (3900000,DATE '2002-04-01','Open', (3900000,DATE '2002-05-01','Open', (3900000,DATE '2002-06-01','Open', (3900000,DATE '2002-07-01','Open', (3900000,DATE '2002-08-01','Open', (3900000,DATE '2002-09-01','Open', (3900000,DATE '2002-10-01','Open', (3900000,DATE '2002-11-01','Open', 5000, 800.00, 800.00, 5000, 300.00, 300.00, 5000, 230.00, 230.00, 5000, 789.00, 789.00, 5000, 600.00, 600.00, 5000, 500.00, 500.00, 5000,1800.00,1800.00, 5000,4800.00,4800.00, 5000, 0, 0, 0), 5000, 0, 0, 0), 5000, 0, 0, 0), 5000, 0, 0, 0); 2345.00, 2345.00, 0), 0, 0, 0), 150.00, 150.00, 0), 5678.00, 5678.00, 0), 2000.00, 2000.00, 0), 50.00, 50.00, 0), 0, 0, 0), 800.00, 800.00, 0), 0, 0, 0), 0, 0, 0), 0, 0, 0), 0, 0, 0 ); 0), 0), 0), 0), 
0), 0), 0), 0), INSERT INTO acct_history VALUES (4300000,DATE '2003-01-01','Open',40000, 0, 0, 0), (4300000,DATE '2003-02-01','Open',40000,18000.00,18000.00, (4300000,DATE '2003-03-01','Open',40000, 459.99, 459.99, (4300000,DATE '2003-04-01','Open',40000, 9876.00, 9876.00, (4300000,DATE '2003-05-01','Open',40000, 4354.00, 4354.00, (4300000,DATE '2003-06-01','Open',40000, 9000.00, 9000.00, (4300000,DATE '2003-07-01','Open',40000, 0, 0, 0), (4300000,DATE '2003-08-01','Open',40000, 6700.00, 6700.00, (4300000,DATE '2003-09-01','Open',40000, 7800.00, 7800.00, (4300000,DATE '2003-10-01','Open',40000, 1200.00, 1200.00, (4300000,DATE '2003-11-01','Open',40000, 8000.00, 8000.00, (4300000,DATE '2003-12-01','Open',40000, 9050.00, 9050.00, INSERT INTO acct_history VALUES (4400000,DATE '2003-01-01','Open',40000, 0.00, HP NonStop SQL/MX Data Mining Guide—523737-001 B- 5 0), 0), 0), 0), 0), 0), 0), 0), 0), 0); 00.00, 0), Inserting Into the Data Mining Database (4400000,DATE (4400000,DATE (4400000,DATE (4400000,DATE (4400000,DATE (4400000,DATE (4400000,DATE (4400000,DATE (4400000,DATE (4400000,DATE (4400000,DATE '2003-02-01','Open',40000, '2003-03-01','Open',40000, '2003-04-01','Open',40000, '2003-05-01','Open',40000, '2003-06-01','Open',40000, '2003-07-01','Open',40000, '2003-08-01','Open',40000, '2003-09-01','Open',40000, '2003-10-01','Open',40000, '2003-11-01','Open',40000, '2003-12-01','Open',40000, 100.00, 50.00, 90.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, INSERT INTO acct_history VALUES (4500000,DATE '2002-07-01','Open',40000, 50.00, (4500000,DATE '2002-08-01','Open',40000, 100.00, (4500000,DATE '2002-09-01','Closed',40000, 0.00, 100.00, 50.00, 90.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0), 0), 0), 0), 0), 0), 0), 0), 0), 0), 0); 50.00, 0), 100.00, 0), 0.00, 0); INSERT INTO acct_history VALUES (4600000,DATE '2003-01-01','Open',40000, 30.00, 30.00, (4600000,DATE '2003-02-01','Open',40000, 30.00, 30.00, (4600000,DATE '2003-03-01','Open',40000, 30.00, 
30.00, (4600000,DATE '2003-04-01','Open',40000, 30.00, 30.00, (4600000,DATE '2003-05-01','Open',40000, 30.00, 30.00, (4600000,DATE '2003-06-01','Open',40000, 30.00, 30.00, (4600000,DATE '2003-07-01','Open',40000, 30.00, 30.00, (4600000,DATE '2003-08-01','Open',40000, 60.00, 60.00, (4600000,DATE '2003-09-01','Open',40000, 700.00, 700.00, (4600000,DATE '2003-10-01','Open',40000, 80.00, 80.00, (4600000,DATE '2003-11-01','Open',40000, 50.00, 50.00, (4600000,DATE '2003-12-01','Closed',40000,1000.00,1000.00, INSERT INTO acct_history VALUES (4700000,DATE '2003-01-01','Open',40000, 330.00, (4700000,DATE '2003-02-01','Open',40000, 330.00, (4700000,DATE '2003-03-01','Open',40000, 330.00, (4700000,DATE '2003-04-01','Open',40000, 330.00, (4700000,DATE '2003-05-01','Open',40000, 330.00, (4700000,DATE '2003-06-01','Open',40000, 330.00, (4700000,DATE '2003-07-01','Open',40000, 330.00, (4700000,DATE '2003-08-01','Open',40000, 650.00, (4700000,DATE '2003-09-01','Open',40000, 710.00, (4700000,DATE '2003-10-01','Open',40000, 807.00, (4700000,DATE '2003-11-01','Open',40000, 509.00, (4700000,DATE '2003-12-01','Open',40000, 1000.00, HP NonStop SQL/MX Data Mining Guide—523737-001 B- 6 330.00, 330.00, 330.00, 330.00, 330.00, 330.00, 330.00, 650.00, 710.00, 807.00, 509.00, 1000.00, 0), 0), 0), 0), 0), 0), 0), 0), 0), 0), 0), 0); 0), 0), 0), 0), 0), 0), 0), 0), 0), 0), 0), 0); C Importing Into the Data Mining Database The format file, data file, and the import command provided in this appendix enable you to populate the data mining database. You cannot execute the import utility command through MXCI or in programs. You must run import at the command prompt. For further information, see the import Utility entry in the NonStop SQL/MX Reference Manual. 
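Before running import, it can be useful to confirm that every record in the data file has exactly as many fields as the format file declares, since a malformed row will otherwise surface only during the load. The following is a minimal sketch in Python; the checker and its helper names are illustrative aids, not part of SQL/MX or the import utility:

```python
# Illustrative pre-import check (not part of SQL/MX): compare each data-file
# record's field count against the col=... declarations in the format file.

def count_columns(format_text):
    """Count the col=...,N declarations in an import format file."""
    return sum(1 for line in format_text.splitlines()
               if line.strip().startswith("col="))

def bad_records(data_text, ncols, delimiter=","):
    """Return (line_number, field_count) for rows with the wrong field count."""
    problems = []
    for i, line in enumerate(data_text.splitlines(), start=1):
        if line.strip() and len(line.split(delimiter)) != ncols:
            problems.append((i, len(line.split(delimiter))))
    return problems

# Sample fragments from the Customers format and data files in this appendix.
fmt = """[COLUMN FORMAT]
col=account,N
col=first_name,N
col=last_name,N
col=marital_status,N
col=home,N
col=income,N
col=gender,N
col=age,N
col=number_children,N"""

data = ("1234567,MARY,JONES,Single,Own,65000,F,34,0\n"
        "2500000,ALI,ABBAS,Divorced,Rent,32000,M,23,0")

print(count_columns(fmt))        # 9 declared columns
print(bad_records(data, count_columns(fmt)))  # [] -- every row matches
```

An empty result from bad_records means the file's shape matches the format file; any tuples returned identify rows to fix before invoking import.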
Importing Customers Data

The import command for the Customers table looks like this:

IMPORT dmcat.whse.customers -I importdatac.txt -U importfmtc.txt

Customers Format File

This file is named importfmtc.txt and is the format file specified in the preceding IMPORT command:

[DATE FORMAT]
DecimalSymbol=.
[COLUMN FORMAT]
col=account,N
col=first_name,N
col=last_name,N
col=marital_status,N
col=home,N
col=income,N
col=gender,N
col=age,N
col=number_children,N
[DELIMITED FORMAT]
FieldDelimiter=,

Customers Data File

This file is named importdatac.txt and is the data file specified in the preceding import command:

1234567,MARY,JONES,Single,Own,65000,F,34,0
2500000,ALI,ABBAS,Divorced,Rent,32000,M,23,0
4098124,TOMOKO,KANO,Divorced,Own,44000,M,44,2
2400000,ERIKA,LUND,Widow,Own,28000,F,65,0
1000000,ROGER,GREEN,Married,Own,175500,M,45,3
2300000,JERRY,HOWARD,Divorced,Own,137000,M,42,2
2900000,JANE,RAYMOND,Divorced,Rent,136000,F,50,0
3200000,THOMAS,RUDLOFF,Divorced,Rent,138000,M,40,1
3900000,KLAUS,SAFFERT,Divorced,Own,75000,M,40,2
4300000,DEBBIE,DUNN,Married,Own,300000,F,29,2
4400000,HANNAH,ROSE,Single,Own,300000,F,29,0
4500000,LIZ,STONE,Married,Own,300000,F,29,1
4600000,HANS,NOBLE,Single,Own,300000,M,48,0
4700000,SEAN,FREDRICK,Widow,Own,300000,M,68,0
5000000,CYNTHIA,TREBLE,Single,Own,65000,F,34,0
5200000,FRANK,KIRBY,Married,Rent,32000,M,23,0
5300000,ROBERT,HOLDER,Divorced,Own,44000,M,44,2
6000000,VALERIE,RECORD,Widow,Own,28000,F,65,0
7000000,KARL,SMITH,Married,Own,175500,M,45,3
7100000,BRADLEY,RAY,Widow,Own,137000,M,42,2
7200000,KIRSTEN,HOWARD,Married,Rent,136000,F,50,0
7300000,GERALD,CACHMAN,Divorced,Rent,138000,M,40,1
7400000,MILES,KOCH,Divorced,Own,75000,M,40,2
7500000,SYDNEY,NICOLE,Single,Own,200000,F,25,0
7600000,ERIN,MCDONALD,Single,Own,65000,F,34,0
7700000,MATT,STEVENS,Married,Rent,32000,M,23,0
7800000,SANDY,MILLER,Divorced,Own,44000,M,44,2
7900000,LAUREN,LITTLE,Widow,Own,28000,F,65,0
8000000,BRENT,BLACK,Married,Own,175500,M,45,3
8100000,STEVEN,HUFF,Widow,Own,137000,M,42,2
8200000,ELLIE,RAYMOND,Married,Rent,136000,F,50,0
8300000,PATRICK,ZORO,Divorced,Rent,138000,M,40,1
8400000,SHAWN,JONES,Divorced,Own,75000,M,40,2
8500000,ABBIE,LAUREN,Single,Own,200000,F,19,0
8600000,ELSIE,VANDER,Single,Own,200000,F,30,0
8700000,SARAH,PETERS,Single,Own,200000,F,19,0
8800000,ALLIE,BOWERS,Single,Own,200000,F,40,0
8900000,KELSEY,SMITH,Single,Own,200000,F,28,0
9000000,KIM,TENNEL,Single,Own,200000,F,56,0
9100000,TJ,CASWELL,Single,Own,200000,M,25,0
9200000,HELEN,SPOTS,Single,Own,200000,F,29,0
9300000,JOHN,MOORE,Single,Own,200000,M,43,0

Importing Account History Data

The import command for the Account History table looks like this:

import dmcat.whse.acct_history -I importdataa.txt -U importfmta.txt

Account History Format File

This file is named importfmta.txt and is the format file specified in the preceding import command:

[DATE FORMAT]
DateOrder=YMD
DateDelimiter=-
[COLUMN FORMAT]
col=account,N
col=year_month,N
col=status,N
col=cust_limit,N
col=balance,N
col=payment,N
col=finance_charge,N
[DELIMITED FORMAT]
FieldDelimiter=,

Account History Data File

This file is named importdataa.txt and is the data file specified in the preceding import command:

1234567,2003-01-01,Open,10000,1232.50,1232.50,0.00
1234567,2003-02-01,Open,10000,3000.00,3000.00,0.00
1234567,2003-03-01,Open,10000,1034.00,1034.00,0.00
1234567,2003-04-01,Open,10000,2500.00,2500.00,0.00
1234567,2003-05-01,Open,10000,1050.00,1050.00,0.00
1234567,2003-06-01,Open,10000,6500.00,6500.00,0.00
1234567,2003-07-01,Open,10000,3000.00,3000.00,0.00
1234567,2003-08-01,Open,10000,7800.00,7800.00,0.00
1234567,2003-09-01,Open,10000,3000.00,3000.00,0.00
1234567,2003-10-01,Open,10000,2870.00,2870.00,0.00
1234567,2003-11-01,Open,10000,1200.00,1200.00,0.00
1234567,2003-12-01,Closed,10000,500.00,500.00,0.00
2500000,2002-07-01,Open,5000,566.00,32.00,8.00
2500000,2002-08-01,Open,5000,600.00,40.00,9.23
2500000,2002-09-01,Open,5000,632.00,32.00,8.00
2500000,2002-10-01,Open,5000,615.00,31.00,8.00
2500000,2002-11-01,Open,5000,670.00,42.00,10.40
2500000,2002-12-01,Open,5000,650.00,37.00,10.00
2500000,2003-01-01,Open,5000,703.00,50.00,13.00
2500000,2003-02-01,Open,5000,723.00,23.00,14.00
2500000,2003-03-01,Open,5000,700.00,20.00,13.75
2500000,2003-04-01,Open,5000,745.00,22.00,13.60
2500000,2003-05-01,Open,5000,745.00,0,89.40
2500000,2003-06-01,Open,5000,834.40,75,100.28
2500000,2003-07-01,Open,5000,834.40,834.40,0
2500000,2003-08-01,Open,5000,0,0,0
2500000,2003-09-01,Open,5000,0,0,0
2500000,2003-10-01,Open,5000,0,0,0
2500000,2003-11-01,Open,5000,0,0,0
2500000,2003-12-01,Open,5000,0,0,0
4098124,2000-10-01,Open,6000,32000.00,3200.00,0.00
4098124,2000-11-01,Open,6000,2300.00,2300.00,0.00
4098124,2000-12-01,Open,6000,0.00,0.00,0.00
4098124,2001-01-01,Open,6000,0.00,0.00,0.00
4098124,2001-02-01,Open,6000,4000.00,4000.00,0.00
4098124,2001-03-01,Open,6000,1200.00,1200.00,0.00
4098124,2001-04-01,Open,6000,320.00,320.00,0.00
4098124,2001-05-01,Open,6000,1000.00,1000.00,0.00
4098124,2001-06-01,Open,6000,2300.00,2300.00,0.00
4098124,2001-07-01,Open,6000,1200.00,1200.00,0.00
4098124,2001-08-01,Open,6000,5400.00,400.00,500.00
4098124,2001-09-01,Open,6000,5300.00,300.00,550.00
4098124,2001-10-01,Open,6000,6000.00,800.00,720.00
4098124,2001-11-01,Open,6000,5920.00,800.00,710.40
4098124,2001-12-01,Open,6000,5830.40,800.00,699.65
4098124,2002-01-01,Open,6000,5730.04,800,687.60
4098124,2002-02-01,Open,6000,5617.65,800,674.11
4098124,2002-03-01,Open,6000,5491.77,800,659.01
4098124,2002-04-01,Open,6000,5350.78,800,642.09
4098124,2002-05-01,Open,6000,5192.87,800,623.14
4098124,2002-06-01,Open,6000,5016.02,800,601.92
4098124,2002-07-01,Open,6000,4817.94,800,578.15
4098124,2002-08-01,Open,6000,4596.10,800,551.53
4098124,2002-09-01,Open,6000,4347.63,800,521.71
4098124,2002-10-01,Closed,6000,4069.34,800,488.32
2400000,2002-01-01,Open,5000,50.00,50.00,0.00
2400000,2002-02-01,Open,5000,100.00,50.00,0.75
2400000,2002-03-01,Open,5000,50.00,50.00,0.00
2400000,2002-04-01,Open,5000,380.00,50.00,4.95
2400000,2002-05-01,Open,5000,330.00,60.00,4.05
2400000,2002-06-01,Open,5000,430.45,55.00,5.63
2400000,2002-07-01,Open,5000,470.34,55.00,6.23
2400000,2002-08-01,Open,5000,545.00,60.00,7.27
2400000,2002-09-01,Open,5000,490.67,490.67,0
2400000,2002-10-01,Open,5000,0.00,0.00,0.00
2400000,2002-11-01,Open,5000,0.00,0.00,0.00
2400000,2002-12-01,Closed,5000,0.00,0.00,0.00
1000000,2002-07-01,Open,20000,3678.67,3678.67,0.00
1000000,2002-08-01,Open,20000,6780.00,6780.00,0.00
1000000,2002-09-01,Open,20000,2300.78,2300.78,0.00
1000000,2002-10-01,Open,20000,8000.00,8000.00,0.00
1000000,2002-11-01,Open,20000,5345.89,5345.89,0.00
1000000,2002-12-01,Open,20000,4700.00,4700.00,0.00
1000000,2003-01-01,Open,20000,1200.00,1200.00,0.00
1000000,2003-02-01,Delinquent,20000,3500.00,0.00,51.75
1000000,2003-03-01,Open,20000,5500.00,5500.00,0.00
1000000,2003-04-01,Open,20000,0.00,0.00,0.00
1000000,2003-05-01,Open,20000,6500.00,6500.00,0.00
1000000,2003-06-01,Open,20000,4590.00,4590.00,0.00
1000000,2003-07-01,Open,20000,3200.00,3200.00,0.00
1000000,2003-08-01,Open,20000,167.89,167.89,0.00
1000000,2003-09-01,Open,20000,9800.00,9800.00,0.00
1000000,2003-10-01,Open,20000,50.00,50.00,0.00
1000000,2003-11-01,Open,20000,2300.78,2300.78,0.00
1000000,2003-12-01,Open,20000,5600.00,5600.00,0.00
2300000,2002-11-01,Open,15000,0,0,0
2300000,2002-12-01,Open,15000,10000.00,1500.00,127.5
2300000,2003-01-01,Open,15000,9500.00,1500.00,120.00
2300000,2003-02-01,Open,15000,8120.00,1500.00,99.30
2300000,2003-03-01,Open,15000,12000.00,4000.00,120.00
2300000,2003-04-01,Open,15000,8120.00,4000.00,61.80
2300000,2003-05-01,Open,15000,5004.00,1500.00,52.56
2300000,2003-06-01,Open,15000,3500.00,1500.00,30.00
2300000,2003-07-01,Open,15000,4500.00,800.00,55.50
2300000,2003-08-01,Open,15000,3800.00,1500.00,34.50
2300000,2003-09-01,Open,15000,0,0,0
2300000,2003-10-01,Open,15000,0,0,0
2300000,2003-11-01,Open,15000,0,0,0
2300000,2003-12-01,Open,15000,0,0,0
2900000,2003-01-01,Open,15000,10000.00,10000.00,0
2900000,2003-02-01,Open,15000,3456.00,3456.00,0
2900000,2003-03-01,Open,15000,2300.90,2300.90,0
2900000,2003-04-01,Open,15000,9432.78,9432.78,0
2900000,2003-05-01,Open,15000,1134.00,1134.00,0
2900000,2003-06-01,Open,15000,2356.80,2356.80,0
2900000,2003-07-01,Open,15000,9870.00,9870.00,0
2900000,2003-08-01,Open,15000,8765.00,8765.00,0
2900000,2003-09-01,Open,15000,2460.00,2460.00,0
2900000,2003-10-01,Open,15000,4543.00,4543.00,0
2900000,2003-11-01,Open,15000,2000.00,2000.00,0
2900000,2003-12-01,Open,15000,5890.00,5890.00,0
3200000,2002-07-01,Open,10000,2345.00,2345.00,0
3200000,2002-08-01,Open,10000,0,0,0
3200000,2002-09-01,Open,10000,150.00,150.00,0
3200000,2002-10-01,Open,10000,5678.00,5678.00,0
3200000,2002-11-01,Open,10000,2000.00,2000.00,0
3200000,2002-12-01,Open,10000,50.00,50.00,0
3200000,2003-01-01,Open,10000,0,0,0
3200000,2003-02-01,Open,10000,800.00,800.00,0
3200000,2003-03-01,Open,10000,0,0,0
3200000,2003-04-01,Open,10000,0,0,0
3200000,2003-05-01,Open,10000,0,0,0
3200000,2003-06-01,Open,10000,0,0,0
3900000,2001-12-01,Open,5000,800.00,800.00,0
3900000,2002-01-01,Open,5000,300.00,300.00,0
3900000,2002-02-01,Open,5000,230.00,230.00,0
3900000,2002-03-01,Open,5000,789.00,789.00,0
3900000,2002-04-01,Open,5000,600.00,600.00,0
3900000,2002-05-01,Open,5000,500.00,500.00,0
3900000,2002-06-01,Open,5000,1800.00,1800.00,0
3900000,2002-07-01,Open,5000,4800.00,4800.00,0
3900000,2002-08-01,Open,5000,0,0,0
3900000,2002-09-01,Open,5000,0,0,0
3900000,2002-10-01,Open,5000,0,0,0
3900000,2002-11-01,Open,5000,0,0,0
4300000,2003-01-01,Open,40000,0,0,0
4300000,2003-02-01,Open,40000,18000.00,18000.00,0
4300000,2003-03-01,Open,40000,459.99,459.99,0
4300000,2003-04-01,Open,40000,9876.00,9876.00,0
4300000,2003-05-01,Open,40000,4354.00,4354.00,0
4300000,2003-06-01,Open,40000,9000.00,9000.00,0
4300000,2003-07-01,Open,40000,0,0,0
4300000,2003-08-01,Open,40000,6700.00,6700.00,0
4300000,2003-09-01,Open,40000,7800.00,7800.00,0
4300000,2003-10-01,Open,40000,1200.00,1200.00,0
4300000,2003-11-01,Open,40000,8000.00,8000.00,0
4300000,2003-12-01,Open,40000,9050.00,9050.00,0
4400000,2003-01-01,Open,40000,0.00,0.00,0
4400000,2003-02-01,Open,40000,100.00,100.00,0
4400000,2003-03-01,Open,40000,50.00,50.00,0
4400000,2003-04-01,Open,40000,90.00,90.00,0
4400000,2003-05-01,Open,40000,0.00,0.00,0
4400000,2003-06-01,Open,40000,0.00,0.00,0
4400000,2003-07-01,Open,40000,0.00,0.00,0
4400000,2003-08-01,Open,40000,0.00,0.00,0
4400000,2003-09-01,Open,40000,0.00,0.00,0
4400000,2003-10-01,Open,40000,0.00,0.00,0
4400000,2003-11-01,Open,40000,0.00,0.00,0
4400000,2003-12-01,Open,40000,0.00,0.00,0
4500000,2002-07-01,Open,40000,50.00,50.00,0
4500000,2002-08-01,Open,40000,100.00,100.00,0
4500000,2002-09-01,Closed,40000,0.00,0.00,0
4600000,2003-01-01,Open,40000,30.00,30.00,0
4600000,2003-02-01,Open,40000,30.00,30.00,0
4600000,2003-03-01,Open,40000,30.00,30.00,0
4600000,2003-04-01,Open,40000,30.00,30.00,0
4600000,2003-05-01,Open,40000,30.00,30.00,0
4600000,2003-06-01,Open,40000,30.00,30.00,0
4600000,2003-07-01,Open,40000,30.00,30.00,0
4600000,2003-08-01,Open,40000,60.00,60.00,0
4600000,2003-09-01,Open,40000,700.00,700.00,0
4600000,2003-10-01,Open,40000,80.00,80.00,0
4600000,2003-11-01,Open,40000,50.00,50.00,0
4600000,2003-12-01,Closed,40000,1000.00,1000.00,0
4700000,2003-01-01,Open,40000,330.00,330.00,0
4700000,2003-02-01,Open,40000,330.00,330.00,0
4700000,2003-03-01,Open,40000,330.00,330.00,0
4700000,2003-04-01,Open,40000,330.00,330.00,0
4700000,2003-05-01,Open,40000,330.00,330.00,0
4700000,2003-06-01,Open,40000,330.00,330.00,0
4700000,2003-07-01,Open,40000,330.00,330.00,0
4700000,2003-08-01,Open,40000,650.00,650.00,0
4700000,2003-09-01,Open,40000,710.00,710.00,0
4700000,2003-10-01,Open,40000,807.00,807.00,0
4700000,2003-11-01,Open,40000,509.00,509.00,0
4700000,2003-12-01,Open,40000,1000.00,1000.00,0

Index

A
Aligning data 1-9, 2-6
Attributes
  cardinality of 1-7, 2-3
  continuous 2-3, 2-5
  deriving 2-9
  discrete 1-7, 2-3
  discrete numeric 2-4
  statistics 2-5

B
Business model
  building 4-2
  checking against database 4-9
  deploying 4-10
  monitoring 4-11
  summarizing results 4-8
Business opportunity
  defining attrition 1-5
  prediction window 1-5

C
COUNT DISTINCT query 1-8
Creating mining view 1-10

D
Data mining database
  creating 2-2, A-1
  importing into 2-2, C-1
  populating B-1
Decision trees
  cross tables 4-2
  dependent variable 4-2
  description of 4-2
  first branch 4-2
  goal definition 4-6
  goal prediction 4-3
  independent variables 4-2
Defining events 1-8, 2-6
Deploying model 1-11, 4-10

K
Knowledge discovery process 1-3

L
Loading data 2-2

M
Metrics, moving 2-9
Mining data set
  Account History table 1-6
  Customers table 1-6
Mining view
  checking model 4-10
  creating 3-2
Monitoring model 4-11
MOVINGAVG function 2-9

O
OFFSET function 2-7, 3-3

P
Pivoting data 3-3
Preparing data 1-7
Profiling data 2-2, 2-5

R
Rankings 2-10
ROWS SINCE function 2-7, 2-9
RUNNINGCOUNT function 2-10

S
SEQUENCE BY clause 2-8, 2-9, 2-10, 3-4
SQL/MX approach, advantages of 1-2

T
THIS function 2-9
TRANSPOSE clause 1-8, 2-4, 2-5, 4-2, 4-4
Transposition 2-3

V
VARIANCE set function 2-3, 2-5