Teradata Warehouse Miner User Guide - Volume 3
Analytic Functions
Release 5.4.2
B035-2302-106A
October 2016

The product or products described in this book are licensed products of Teradata Corporation or its affiliates.

Teradata, BYNET, DBC/1012, DecisionCast, DecisionFlow, DecisionPoint, Eye logo design, InfoWise, Meta Warehouse, MyCommerce, SeeChain, SeeCommerce, SeeRisk, Teradata Warehouse Miner, Teradata Source Experts, WebAnalyst, and You’ve Never Seen Your Business Like This Before are trademarks or registered trademarks of Teradata Corporation or its affiliates. Adaptec and SCSISelect are trademarks or registered trademarks of Adaptec, Inc. AMD Opteron and Opteron are trademarks of Advanced Micro Devices, Inc. BakBone and NetVault are trademarks or registered trademarks of BakBone Software, Inc. Cloudera and the Cloudera logo are trademarks of Cloudera, Inc. This software contains material under license from DUNDAS SOFTWARE LTD., which is ©1994-1999 DUNDAS SOFTWARE LTD., all rights reserved. EMC, PowerPath, SRDF, and Symmetrix are registered trademarks of EMC Corporation. GoldenGate is a trademark of GoldenGate Software, Inc. Hewlett-Packard and HP are registered trademarks of Hewlett-Packard Company. Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries. Intel, Pentium, and XEON are registered trademarks of Intel Corporation. IBM, CICS, DB2, MVS, RACF, Tivoli, and VM are registered trademarks of International Business Machines Corporation. Linux is a registered trademark of Linus Torvalds. LSI and Engenio are registered trademarks of LSI Corporation. MapR, MapR Heatmap, Direct Access NFS, Distributed NameNode HA, Direct Shuffle and Lockless Storage Services are all trademarks of MapR Technologies, Inc. Microsoft, Active Directory, Windows, Windows NT, Windows Server, Windows Vista, Visual Studio and Excel are either registered trademarks or trademarks of Microsoft Corporation in the United States or other countries. MongoDB, Mongo, and the leaf logo are registered trademarks of MongoDB, Inc. Novell and SUSE are registered trademarks of Novell, Inc., in the United States and other countries. QLogic and SANbox are trademarks or registered trademarks of QLogic Corporation. SAS, SAS/C and Enterprise Miner are trademarks or registered trademarks of SAS Institute Inc. SPSS is a registered trademark of SPSS Inc. STATISTICA and StatSoft are trademarks or registered trademarks of StatSoft, Inc. SPARC is a registered trademark of SPARC International, Inc. Sun Microsystems, Solaris, Sun, and Sun Java are trademarks or registered trademarks of Sun Microsystems, Inc., in the United States and other countries. Symantec, NetBackup, and VERITAS are trademarks or registered trademarks of Symantec Corporation or its affiliates in the United States and other countries. Unicode is a collective membership mark and a service mark of Unicode, Inc. UNIX is a registered trademark of The Open Group in the United States and other countries. Other product and company names mentioned herein may be the trademarks of their respective owners.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN “AS-IS” BASIS, WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OF IMPLIED WARRANTIES, SO THE ABOVE EXCLUSION MAY NOT APPLY TO YOU.
IN NO EVENT WILL TERADATA CORPORATION BE LIABLE FOR ANY INDIRECT, DIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS OR LOST SAVINGS, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

The information contained in this document may contain references or cross-references to features, functions, products, or services that are not announced or available in your country. Such references do not imply that Teradata Corporation intends to announce such features, functions, products, or services in your country. Please consult your local Teradata Corporation representative for those features, functions, products, or services available in your country.

Information contained in this document may contain technical inaccuracies or typographical errors. Information may be changed or updated without notice. Teradata Corporation may also make improvements or changes in the products or services described in this information at any time without notice.

To maintain the quality of our products and services, we would like your comments on the accuracy, clarity, organization, and value of this document. Please e-mail: [email protected]

Any comments or materials (collectively referred to as “Feedback”) sent to Teradata Corporation will be deemed non-confidential. Teradata Corporation will have no obligation of any kind with respect to Feedback and will be free to use, reproduce, disclose, exhibit, display, transform, create derivative works of, and distribute the Feedback and derivative works thereof without limitation on a royalty-free basis. Further, Teradata Corporation will be free to use any ideas, concepts, know-how, or techniques contained in such Feedback for any purpose whatsoever, including developing, manufacturing, or marketing products or services incorporating Feedback.

Copyright © 1999-2016 by Teradata Corporation. All Rights Reserved.

Preface

Purpose

This volume describes how to use the modeling, scoring and statistical test features of the Teradata Warehouse Miner product. Teradata Warehouse Miner is a set of Microsoft .NET interfaces and a multi-tier user interface that together help you understand the quality of data residing in a Teradata database, create analytic data sets, and build and score analytic models directly in the Teradata database.

Audience

This manual is written for users of Teradata Warehouse Miner, who should be familiar with Teradata SQL, the operation and administration of the Teradata RDBMS system, and statistical techniques. They should also be familiar with the Microsoft Windows operating environment and standard Microsoft Windows operating techniques. This manual applies only to Teradata Warehouse Miner when operating on a Teradata database.
Revision Record

The following table lists a history of releases where this guide has been revised:

Release     Date       Description
TWM 5.4.2   10/31/16   Maintenance Release
TWM 5.4.1   01/08/16   Maintenance Release
TWM 5.4.0   07/31/15   Feature Release
TWM 5.3.5   06/19/14   Maintenance Release
TWM 5.3.4   09/10/13   Maintenance Release
TWM 5.3.3   06/30/12   Maintenance Release
TWM 5.3.2   06/01/11   Maintenance Release
TWM 5.3.1   06/30/10   Maintenance Release
TWM 5.3.0   10/30/09   Feature Release
TWM 5.2.2   02/05/09   Maintenance Release
TWM 5.2.1   12/15/08   Maintenance Release
TWM 5.2.0   05/31/08   Feature Release
TWM 5.1.1   01/23/08   Maintenance Release
TWM 5.1.0   07/12/07   Feature Release
TWM 5.0.1   11/16/06   Maintenance Release
TWM 5.0.0   09/22/06   Major Release

How This Manual Is Organized

This manual is organized and presents information as follows:
• Chapter 1: “Analytic Algorithms” — describes how to use the Teradata Warehouse Miner Multivariate Statistics and Machine Learning algorithms, including Linear Regression, Logistic Regression, Factor Analysis, Decision Trees, Clustering and Association Rules.
• Chapter 2: “Scoring” — describes how to use the Teradata Warehouse Miner scoring analyses for the Multivariate Statistics and Machine Learning algorithms. Scoring is available for Linear Regression, Logistic Regression, Factor Analysis, Decision Trees and Clustering.
• Chapter 3: “Statistical Tests” — describes how to use the Teradata Warehouse Miner Statistical Tests, including Binomial, Kolmogorov-Smirnov, Parametric, Rank, and Contingency Table-based tests.

Conventions Used In This Manual

The following typographical conventions are used in this guide:

Convention   Description
Italic       Titles (especially screen names/titles); new terms introduced for emphasis
Monospace    Code samples; output
ALL CAPS     Acronyms
Bold         Important terms or concepts
GUI Item     A screen item, especially something you click or highlight while following a procedure

This document provides information for operations on both Teradata and Aster systems. In some cases, certain information applies only to a Teradata or to an Aster system. “Teradata Only” and “Aster Only” markers are distributed throughout this document to identify Teradata-specific and Aster-specific content, respectively: an opening “Teradata Only” marker denotes information that applies only to a Teradata system, and a closing marker signals the conclusion of that content; “Aster Only” markers work the same way for Aster-specific content.

Related Documents

Related Teradata documentation and other sources of information are available from:
http://www.info.teradata.com

Additional technical information on data warehousing and other topics is available from:
http://www.teradata.com/t/resources

Support Information

Services, support and training information is available from:
http://www.teradata.com/services-support

Table of Contents

Preface ..... v
    Purpose ..... v
    Audience ..... v
    Revision Record ..... v
    How This Manual Is Organized ..... vi
    Conventions Used In This Manual ..... vi
    Related Documents ..... vii
    Support Information ..... vii

Chapter 1: Analytic Algorithms ..... 1
    Overview ..... 1
    Association Rules ..... 2
        Overview ..... 2
        Initiate an Association Analysis ..... 8
        Association - INPUT - Data Selection ..... 9
        Association - INPUT - Analysis Parameters ..... 10
        Association - INPUT - Expert Options ..... 12
        Association - OUTPUT ..... 14
        Run the Association Analysis ..... 14
        Results - Association Analysis ..... 15
        Tutorial - Association Analysis ..... 19
    Cluster Analysis ..... 20
        Overview ..... 20
        Options - Cluster Analysis ..... 23
        Using the TWM Cluster Analysis ..... 25
        Success Analysis - Cluster Analysis ..... 26
        Optimizing Performance of Clustering ..... 26
        Initiate a Cluster Analysis ..... 27
        Cluster - INPUT - Data Selection ..... 27
        Cluster - INPUT - Analysis Parameters ..... 28
        Cluster - INPUT - Expert Options ..... 29
        Cluster - OUTPUT ..... 30
        Run the Cluster Analysis ..... 32
        Results - Cluster Analysis ..... 32
        Tutorial - Cluster Analysis ..... 36
    Decision Trees ..... 39
        Overview ..... 39
        Initiate a Decision Tree Analysis ..... 44
        Decision Tree - INPUT - Data Selection ..... 45
        Decision Tree - INPUT - Analysis Parameters ..... 46
        Decision Tree - INPUT - Expert Options ..... 48
        Run the Decision Tree Analysis ..... 48
        Results - Decision Tree ..... 49
        Tutorial - Decision Tree ..... 58
    Factor Analysis ..... 62
        Overview ..... 62
        Initiate a Factor Analysis ..... 71
        Factor - INPUT - Data Selection ..... 72
        Factor - INPUT - Analysis Parameters ..... 73
        Factor Analysis - OUTPUT ..... 76
        Run the Factor Analysis ..... 77
        Results - Factor Analysis ..... 77
        Tutorial - Factor Analysis ..... 87
    Linear Regression ..... 92
        Overview ..... 92
        Initiate a Linear Regression Function ..... 101
        Linear Regression - INPUT - Data Selection ..... 102
        Linear Regression - INPUT - Analysis Parameters ..... 103
        Linear Regression - OUTPUT ..... 105
        Run the Linear Regression ..... 106
        Results - Linear Regression ..... 107
        Tutorial - Linear Regression ..... 113
    Logistic Regression ..... 120
        Overview ..... 120
        Initiate a Logistic Regression Function ..... 126
        Logistic Regression - INPUT - Data Selection ..... 127
        Logistic Regression - INPUT - Analysis Parameters ..... 128
        Logistic Regression - INPUT - Expert Options ..... 131
        Logistic Regression - OUTPUT ..... 132
        Run the Logistic Regression ..... 133
        Results - Logistic Regression ..... 134
        Tutorial - Logistic Regression ..... 141

Chapter 2: Scoring ..... 149
    Overview ..... 149
    Cluster Scoring ..... 149
        Initiate Cluster Scoring ..... 150
        Cluster Scoring - INPUT - Data Selection ..... 151
        Cluster Scoring - INPUT - Analysis Parameters ..... 152
        Cluster Scoring - OUTPUT ..... 153
        Run the Cluster Scoring Analysis ..... 153
        Results - Cluster Scoring ..... 154
        Tutorial - Cluster Scoring ..... 156
    Tree Scoring ..... 157
        Initiate Tree Scoring ..... 157
        Tree Scoring - INPUT - Data Selection ..... 158
        Tree Scoring - INPUT - Analysis Parameters ..... 159
        Tree Scoring - OUTPUT ..... 161
        Run the Tree Scoring Analysis ..... 162
        Results - Tree Scoring ..... 162
        Tutorial - Tree Scoring ..... 166
    Factor Scoring ..... 168
        Initiate Factor Scoring ..... 169
        Factor Scoring - INPUT - Data Selection ..... 170
        Factor Scoring - INPUT - Analysis Parameters ..... 171
        Factor Scoring - OUTPUT ..... 171
        Run the Factor Scoring Analysis ..... 172
        Results - Factor Scoring ..... 172
        Tutorial - Factor Scoring ..... 174
    Linear Scoring ..... 176
        Linear Regression Model Evaluation ..... 176
        Initiate Linear Scoring ..... 177
        Linear Scoring - INPUT - Data Selection ..... 178
        Linear Scoring - INPUT - Analysis Parameters ..... 179
        Linear Scoring - OUTPUT ..... 179
        Run the Linear Scoring Analysis ..... 180
        Results - Linear Scoring ..... 180
        Tutorial - Linear Scoring ..... 182
    Logistic Scoring ..... 184
        Logistic Regression Model Evaluation ..... 184
        Prediction Success Table ..... 184
        Multi-Threshold Success Table ..... 185
        Cumulative Lift Table ..... 186
        Initiate Logistic Scoring ..... 187
        Logistic Scoring - INPUT - Data Selection ..... 188
        Logistic Scoring - INPUT - Analysis Parameters ..... 189
        Logistic Scoring - OUTPUT ..... 190
        Run the Logistic Scoring Analysis ..... 191
        Results - Logistic Scoring ..... 191
        Tutorial - Logistic Scoring ..... 194

Chapter 3: Statistical Tests ..... 199
    Overview ..... 199
        Summary of Tests ..... 200
        Data Requirements ..... 201
    Parametric Tests ..... 204
        Two Sample T-Test for Equal Means ..... 204
        F-Test - N-Way ..... 211
        F-Test/Analysis of Variance - Two Way Unequal Sample Size ..... 221
    Binomial Tests ..... 228
        Binomial/Ztest ..... 229
        Binomial Sign Test ..... 235
    Kolmogorov-Smirnov Tests ..... 241
        Kolmogorov-Smirnov Test (One Sample) ..... 241
        Lilliefors Test ..... 247
        Shapiro-Wilk Test ..... 253
        D'Agostino and Pearson Test ..... 259
        Smirnov Test ..... 264
    Tests Based on Contingency Tables ..... 270
        Chi Square Test ..... 270
        Median Test ..... 277
    Rank Tests ..... 283
        Mann-Whitney/Kruskal-Wallis Test ..... 283
        Wilcoxon Signed Ranks Test ..... 292
        Friedman Test with Kendall's Coefficient of Concordance & Spearman's Rho ..... 299

Appendix A: References ..... 307

List of Figures

Figure 1: Add New Analysis from toolbar ..... 8
Figure 2: Add New Analysis dialog ..... 9
Figure 3: Association > Input > Data Selection ..... 9
Figure 4: Association > Input > Analysis Parameters ..... 10
Figure 5: Association: X to X ..... 10
Figure 6: Association Combinations pane ..... 11
Figure 7: Association > Input > Expert Options ..... 12
Figure 8: Association > Output ..... 14
Figure 9: Association > Results > SQL ..... 15
Figure 10: Association > Results > Data ..... 15
Figure 11: Association > Results > Graph ..... 16
Figure 12: Association Graph Selector ..... 17
Figure 13: Association Graph ..... 18
Figure 14: Association Graph: Tutorial ..... 20
Figure 15: Add New Analysis from toolbar ..... 27
Figure 16: Add New Analysis dialog ..... 27
Figure 17: Clustering > Input > Data Selection ..... 28
Figure 18: Clustering > Input > Analysis Parameters ..... 28
Figure 19: Clustering > Input > Expert Options ..... 30
Figure 20: Cluster > OUTPUT ..... 30
Figure 21: Clustering > Results > Reports ..... 33
Figure 22: Clustering > Results > Sizes Graph ..... 34
Figure 23: Clustering > Results > Similarity Graph ..... 35
Figure 24: Clustering Analysis Tutorial: Sizes Graph ..... 38
Figure 25: Clustering Analysis Tutorial: Similarity Graph ..... 39
Figure 26: Add New Analysis from toolbar ..... 44
Figure 27: Add New Analysis dialog ..... 45
Figure 28: Decision Tree > Input > Data Selection ..... 45
Figure 29: Decision Tree > Input > Analysis Parameters ..... 46
Figure 30: Decision Tree > Input > Expert Options ..... 48
Figure 31: Tree Browser ..... 51
Figure 32: Tree Browser menu: Small Navigation Tree ..... 51
Figure 33: Tree Browser menu: Zoom Tree ..... 52
Figure 34: Tree Browser menu: Print ..... 53
Figure 35: Text Tree ..... 54
Figure 36: Rules List ..... 55
Figure 37: Counts and Distributions ..... 55
Figure 38: Tree Pruning menu ..... 56
Figure 39: Tree Pruning Menu > Prune Selected Branch ..... 56
Figure 40: Tree Pruning menu (All Options Enabled) ..... 56
Figure 41: Decision Tree Graph: Previously Pruned Tree ..... 57
Figure 42: Decision Tree Graph: Predicate ..... 57
Figure 43: Decision Tree Graph: Lift ..... 58
Figure 44: Decision Tree Graph Tutorial: Browser ..... 61
Figure 45: Decision Tree Graph Tutorial: Lift ..... 61
Figure 46: Decision Tree Graph Tutorial: Browser ..... 62
Figure 47: Add New Analysis from toolbar ..... 72
Figure 48: Add New Analysis dialog ..... 72
Figure 49: Factor Analysis > Input > Data Selection ..... 72
Figure 50: Factor Analysis > Input > Analysis Parameters ..... 74
Figure 51: Factor Analysis > Output ..... 76
Figure 52: Factor Analysis > Results > Reports ..... 77
Figure 53: Factor Analysis > Results > Pattern Graph ..... 86
Figure 54: Factor Analysis > Results > Scree Plot ..... 87
Figure 55: Factor Analysis Tutorial: Scree Plot ..... 92
Figure 56: Add New Analysis from toolbar ..... 101
Figure 57: Add New Analysis dialog ..... 102
Figure 58: Linear Regression > Input > Data Selection ..... 102
Figure 59: Linear Regression > Input > Analysis Parameters ..... 103
Figure 60: Linear Regression > OUTPUT ..... 105
Figure 61: Linear Regression Tutorial: Linear Weights Graph ..... 118
Figure 62: Linear Regression Tutorial: Scatter Plot (2d) ..... 119
Figure 63: Linear Regression Tutorial: Scatter Plot (3d) ..... 119
Figure 64: Add New Analysis from toolbar ..... 127
Figure 65: Add New Analysis dialog ..... 127
Figure 66: Logistic Regression > Input > Data Selection ..... 128
Figure 67: Logistic Regression > Input > Analysis Parameters ..... 129
Figure 68: Logistic Regression > Input > Expert Options ..... 131
Figure 69: Logistic Regression > OUTPUT ..... 132
Figure 70: Logistic Regression Tutorial: Logistic Weights Graph ..... 147
Figure 71: Logistic Regression Tutorial: Lift Chart ..... 148
Figure 72: Add New Analysis from toolbar ..... 150
Figure 73: Add New Analysis > Scoring > Cluster Scoring ..... 151
Figure 74: Add New Analysis > Input > Data Selection ..... 151
Figure 75: Add New Analysis > Input > Analysis Parameters ..... 152
Figure 76: Cluster Scoring > Output ..... 153
Figure 77: Cluster Scoring > Results > Reports ..... 154
Figure 78: Cluster Scoring > Results > Data ..... 154
Figure 79: Cluster Scoring > Results > SQL ..... 156
Figure 80: Add New Analysis from toolbar ..... 158
Figure 81: Add New Analysis > Scoring > Tree Scoring ..... 158
Figure 82: Tree Scoring > Input > Data Selection ..... 159
Figure 83: Tree Scoring > Input > Analysis Parameters ..... 160
Figure 84: Tree Scoring > Output ..... 161
Figure 85: Tree Scoring > Results > Reports ..... 162
Figure 86: Tree Scoring > Results > Data ..... 164
Figure 87: Tree Scoring > Results > Lift Graph ..... 166
Figure 88: Tree Scoring > Results > SQL ..... 166
Figure 89: Add New Analysis from toolbar ..... 169
Figure 90: Add New Analysis > Scoring > Factor Scoring ..... 170
Figure 91: Factor Scoring > Input > Data Selection ..... 170
Figure 92: Factor Scoring > Input > Analysis Parameters ..... 171
Figure 93: Factor Scoring > Output ..... 172
Figure 94: Factor Scoring > Results > Reports ..... 173
Figure 95: Factor Scoring > Results > Data ..... 173
Figure 96: Factor Scoring > Results > SQL ..... 174
Figure 97: Add New Analysis from toolbar ..... 177
Figure 98: Add New Analysis > Scoring > Linear Scoring ..... 177
Figure 99: Linear Scoring > Input > Data Selection ..... 178
Figure 100: Linear Scoring > Input > Analysis Parameters ..... 179
Figure 101: Linear Scoring > Output ..... 179
Figure 102: Linear Scoring > Results > Reports ..... 180
Figure 103: Linear Scoring > Results > Data ..... 181
Figure 104: Linear Scoring > Results > SQL ..... 182
Figure 105: Add New Analysis from toolbar ..... 187
Figure 106: Add New Analysis > Scoring > Logistic Scoring ..... 188
Figure 107: Logistic Scoring > Input > Data Selection ..... 188
Figure 108: Logistic Scoring > Input > Analysis Parameters ..... 189
Figure 109: Logistic Scoring > Output ..... 190
Figure 110: Logistic Scoring > Results > Reports ..... 192
Figure 111: Logistic Scoring > Results > Data ..... 192
Figure 112: Logistic Scoring > Results > Lift Graph ..... 193
Figure 113: Logistic Scoring > Results > SQL ..... 193
Figure 114: Logistic Scoring Tutorial: Lift Graph ..... 197
Figure 115: Add New Analysis from toolbar ..... 205
Figure 116: Add New Analysis > Statistical Tests > Parametric Tests ..... 205
Figure 117: T-Test > Input > Data Selection ..... 206
Figure 118: T-Test > Input > Analysis Parameters ..... 207
Figure 119: T-Test > Output ..... 207
Figure 120: T-Test > Results > SQL ..... 209
Figure 121: T-Test > Results > Data ..... 209
Figure 122: Add New Analysis from toolbar ..... 212
Figure 123: Add New Analysis > Statistical Tests > Parametric Tests ..... 212
Figure 124: F-Test > Input > Data Selection ..... 213
Figure 125: F-Test > Input > Analysis Parameters ..... 214
Figure 126: F-Test > Output ..... 214
Figure 127: F-Test > Results > SQL ..... 215
Figure 128: F-Test > Results > Data ..... 216
Figure 129: Add New Analysis from toolbar ..... 222
Figure 130: Add New Analysis > Statistical Tests > Parametric Tests ..... 223
Figure 131: F-Test > Input > Data Selection ..... 223
Figure 132: F-Test > Input > Analysis Parameters ..... 224
Figure 133: F-Test > Output ..... 225
Figure 134: F-Test > Results > SQL ..... 226
Figure 135: F-Test > Results > Data ..... 226
Figure 136: Add New Analysis from toolbar ..... 229
Figure 137: Add New Analysis > Statistical Tests > Binomial Tests ..... 230
Figure 138: Binomial Tests > Input > Data Selection ..... 230
Figure 139: Binomial Tests > Input > Analysis Parameters ..... 231
Figure 140: Binomial Tests > Output ..... 232
Figure 141: Binomial Tests > Results > SQL ..... 233
Figure 142: Binomial Tests > Results > Data ..... 233
Figure 143: Add New Analysis from toolbar ..... 235
Figure 144: Add New Analysis > Statistical Tests > Binomial Tests ..... 236
Figure 145: Binomial Sign Test > Input > Data Selection ..... 236
Figure 146: Binomial Sign Test > Input > Analysis Parameters ..... 237
Figure 147: Binomial Sign Test > Output ..... 238
Figure 148: Binomial Sign Test > Results > SQL ..... 239
Figure 149: Binomial Sign Test > Results > Data ..... 239
Figure 150: Add New Analysis from toolbar ..... 241
Figure 151: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests ..... 242
Figure 152: Kolmogorov-Smirnov Test > Input > Data Selection ..... 242
Figure 153: Kolmogorov-Smirnov Test > Input > Analysis Parameters ..... 243
Figure 154: Kolmogorov-Smirnov Test > Output ..... 244
Figure 155: Kolmogorov-Smirnov Test > Results > SQL ..... 245
Figure 156: Kolmogorov-Smirnov Test > Results > Data ..... 245
Figure 157: Add New Analysis from toolbar ..... 247
Figure 158: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests ..... 248
Figure 159: Lilliefors Test > Input > Data Selection ..... 248
Figure 160: Lilliefors Test > Input > Analysis Parameters ..... 249
Figure 161: Lilliefors Test > Output ..... 250
Figure 162: Lilliefors Test > Results > SQL ..... 251
Figure 163: Lilliefors Test > Results > Data ..... 251
Figure 164: Add New Analysis from toolbar ..... 253
Figure 165: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests ..... 254
Figure 166: Shapiro-Wilk Test > Input > Data Selection ..... 254
Figure 167: Shapiro-Wilk Test > Input > Analysis Parameters ..... 255
Figure 168: Shapiro-Wilk Test > Output ..... 256
Figure 169: Shapiro-Wilk Test > Results > SQL ..... 257
Figure 170: Shapiro-Wilk Test > Results > Data ..... 257
Figure 171: Add New Analysis from toolbar ..... 259
Figure 172: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests ..... 259
Figure 173: D'Agostino and Pearson Test > Input > Data Selection ..... 260
Figure 174: D'Agostino and Pearson Test > Input > Analysis Parameters ..... 261
Figure 175: D'Agostino and Pearson Test > Output ..... 261
Figure 176: D'Agostino and Pearson Test > Results > SQL ..... 263
Figure 177: D'Agostino and Pearson Test > Results > Data ..... 263
Figure 178: Add New Analysis from toolbar ..... 265
Figure 179: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests ..... 265
Figure 180: Smirnov Test > Input > Data Selection ..... 266
Figure 181: Smirnov Test > Input > Analysis Parameters ..... 267
Figure 182: Smirnov Test > Output ..... 267
Figure 183: Smirnov Test > Results > SQL ..... 268
Figure 184: Smirnov Test > Results > Data ..... 269
Figure 185: Add New Analysis from toolbar ..... 271
Figure 186: Add New Analysis > Statistical Tests > Tests Based on Contingency Tables ..... 272
Figure 187: Chi Square Test > Input > Data Selection ..... 272
Figure 188: Chi Square Test > Input > Analysis Parameters ..... 273
Figure 189: Chi Square Test > Output ..... 274
Figure 190: Chi Square Test > Results > SQL ..... 275
Figure 191: Chi Square Test > Results > Data ..... 275
Figure 192: Add New Analysis from toolbar ..... 278
Figure 193: Add New Analysis > Statistical Tests > Tests Based On Contingency Tables ..... 278
Figure 194: Median Test > Input > Data Selection ..... 279
Figure 195: Median Test > Input > Analysis Parameters ..... 280
Figure 196: Median Test > Output ..... 280
Figure 197: Median Test > Results > SQL ..... 281
Figure 198: Median Test > Results > Data ..... 282
Figure 199: Add New Analysis from toolbar ..... 285
Figure 200: Add New Analysis > Statistical Tests > Rank Tests ..... 285
Figure 201: Mann-Whitney/Kruskal-Wallis Test > Input > Data Selection ..... 286
Figure 202: Mann-Whitney/Kruskal-Wallis Test > Input > Analysis Parameters ..... 287
Figure 203: Mann-Whitney/Kruskal-Wallis Test > Output ..... 287
Figure 204: Mann-Whitney/Kruskal-Wallis Test > Results > SQL ..... 289
Figure 205: Mann-Whitney/Kruskal-Wallis Test > Results > Data ..... 289
Figure 206: Add New Analysis from toolbar ..... 293
Figure 207: Add New Analysis > Statistical Tests > Rank Tests ..... 294
Figure 208: Wilcoxon Signed Ranks Test > Input > Data Selection ..... 294
Figure 209: Wilcoxon Signed Ranks Test > Input > Analysis Parameters ..... 295
Figure 210: Wilcoxon Signed Ranks Test > Output ..... 296
Figure 211: Wilcoxon Signed Ranks Test > Results > SQL ..... 297
Figure 212: Wilcoxon Signed Ranks Test > Results > Data ..... 297
Figure 213: Add New Analysis from toolbar ..... 300
Figure 214: Add New Analysis > Statistical Tests > Rank Tests ..... 300
Figure 215: Friedman Test > Input > Data Selection ..... 301
Figure 216: Friedman Test > Input > Analysis Parameters ..... 302
Figure 217: Friedman Test > Output ..... 302
Figure 218: Friedman Test > Results > SQL ..... 304
Figure 219: Friedman Test > Results > Data ..... 304

List of Tables

Table 1: Three-Level Hierarchy Table ..... 13
Table 2: Association Combinations output table ..... 15
Table 3: Tutorial - Association Analysis Data ..... 19
Table 4: test_ClusterResults ..... 31
Table 5: test_ClusterColumns ..... 31
Table 6: Progress ..... 36
Table 7: Solution ..... 37
Table 8: Confusion Matrix Format ..... 49
Table 9: Decision Tree Report ..... 59
Table 10: Variables: Dependent ..... 60
Table 11: Variables: Independent ..... 60
Table 12: Confusion Matrix ..... 60
Table 13: Cumulative Lift Table ..... 60
Table 14: Prime Factor Loadings report (Example) ..... 69
Table 15: Prime Factor Variables report (Example) ..... 70
Table 16: ..... 71
Table 17: my_factor_reports_ tables ..... 76
Table 18: Factor Analysis Report ..... 88
Table 19: Execution Summary ..... 89
Table 20: Eigenvalues ..... 89
Table 21: Principal Component Loadings ..... 90
Table 22: Factor Variance to Total Variance Ratio ..... 90
Table 23: Variance Explained By Factors ..... 90
Table 24: Difference ..... 91
Table 25: Prime Factor Variables ..... 91
Table 26: Eigenvalues of Unit Scaled X'X ..... 98
Table 27: Condition Indices ..... 98
Table 28: ..... 99
Table 29: Near Dependency report (example) ..... 99
Table 30: ..... 106
Table 31: Linear Regression Report ..... 114
Table 32: Regression vs. Residual ..... 114
Table 33: Execution Status ..... 114
Table 34: Variables ..... 115
Table 35: Out ..... 115
Table 36: Model Assessment ..... 116
Table 37: Columns In (Part 1) ..... 116
Table 38: Columns In (Part 2) ..... 117
Table 39: Columns In (Part 3) ..... 117
Table 40: Columns Out ..... 117
Table 41: Logistic Regression - OUTPUT ..... 132
Table 42: Logistic Regression Report ..... 142
Table 43: Execution Summary ..... 143
Table 44: Variables ..... 144
Table 45: Columns Out ..... 144
Table 46: Variables ..... 145
Table 47: Columns Out ..... 145
Table 48: Prediction Success Table ..... 146
Table 49: Multi-Threshold Success Table ..... 146
Table 50: Cumulative Lift Table ..... 147
Table 51: Output Database (Built by the Cluster Scoring analysis) ..... 155
Table 52: Clustering Progress ..... 156
Table 53: Data ..... 156
Table 54: Confusion Matrix ..... 162
Table 55: Output Database table (Built by the Decision Tree Scoring analysis) ..... 164
Table 56: Output Database table (Built by the Decision Tree Scoring analysis) - Create profiling tables option selected (“_1” appended) ..... 165
Table 57: Output Database table (Built by the Decision Tree Scoring analysis) - Create profiling tables option selected (“_2” appended) ..... 165
Table 58: Decision Tree Model Scoring Report ..... 167
Table 59: Confusion Matrix ..... 167
Table 60: Cumulative Lift Table ..... 167
Table 61: Data ..... 167
Table 62: Output Database table (Built by Factor Scoring) ..... 174
Table 63: Factor Analysis Score Report ..... 175
Table 64: Evaluation ..... 175
Table 65: Data ..... 176
Table 66: Output Database table (Built by Linear Regression scoring) ..... 182
Table 67: Linear Regression Reports ..... 183
Table 68: Evaluation ..... 183
Table 69: Data ..... 183
Table 70: Prediction Success Table ..... 184
Table 71: Logistic Regression Multi-Threshold Success table ..... 185
Table 72: Logistic Regression Cumulative Lift Table ..... 186
Table 73: Output Database table (Built by Logistic Regression scoring) ..... 192
Table 74: Logistic Regression Model Scoring Report ..... 194
Table 75: Prediction Success Table ..... 194
Table 76: Multi-Threshold Success Table
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 Table 77: Cumulative Lift Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 Table 78: Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 Table 79: Statistical Test functions handling of input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 Table 80: Two sample t tests for unpaired data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204 Table 81: Output Database table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 Table 82: T-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 Table 83: Output Columns - 1-Way F-Test Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 Table 84: Output Columns - 2-Way F-Test Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 Table 85: Output Columns - 3-Way F-Test Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 Table 86: F-Test (one-way) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 Table 87: Output Columns - 2-Way F-Test Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 Table 88: F-Test (Two-way Unequal Cell Count) (Part 1). . . . . . . . . . . . . . . . . . . . . . . . . . 228 Table 89: F-Test (Two-way Unequal Cell Count) (Part 2). . . . . . . . . . . . . . . . . . . . . . . . . . 228 Table 90: F-Test (Two-way Unequal Cell Count) (Part 3). . . . . . . . . . . . . . . . . . . . . . . . . . 228 Table 91: Output Database table (Built by the Binomial Analysis) . . . . . . . . . . . . . . . . . . . 234 Table 92: Binomial Test Analysis (Table 1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 Table 93: Binomial Test Analysis (Table 2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 Table 94: Binomial Sign Analysis: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239 Table 95: Tutorial - Binomial Sign Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 Table 96: Output Database table (Built by the Kolmogorov-Smirnov test analysis) . . . . . . 245 Table 97: Kolmogorov-Smirnov Test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246 Table 98: Lilliefors Test Analysis: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251 Table 99: Lilliefors Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252 Table 100: Shapiro-Wilk Test Analysis: Output Columns. . . . . . . . . . . . . . . . . . . . . . . . . . 257 Table 101: Shapiro-Wilk Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258 Table 102: D'Agostino and Pearson Test Analysis: Output Columns . . . . . . . . . . . . . . . . . 263 Teradata Warehouse Miner User Guide - Volume 3 xxiii List of Tables Table 103: D'Agostino and Pearson Test: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . 264 Table 104: Smirnov Test Analysis: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269 Table 105: Smirnov Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
Table 106: Chi Square Test Analysis: Output Columns ... 276
Table 107: Chi Square Test (Part 1) ... 276
Table 108: Chi Square Test (Part 2) ... 277
Table 109: Median Test Analysis: Output Columns ... 282
Table 110: Median Test ... 283
Table 111: Table for Mann-Whitney (if two groups) ... 289
Table 112: Table for Kruskal-Wallis (if more than two groups) ... 290
Table 113: Mann-Whitney Test ... 290
Table 114: Kruskal-Wallis Test ... 291
Table 115: Mann-Whitney Test ... 292
Table 116: Wilcoxon Signed Ranks Test Analysis: Output Columns ... 298
Table 117: Wilcoxon Test ... 298
Table 118: Friedman Test Analysis: Output Columns ... 304
Table 119: Friedman Test ... 306

CHAPTER 1 Analytic Algorithms

What's In This Chapter

This chapter applies only to an instance of Teradata Warehouse Miner operating on a Teradata database. For more information, see these subtopics:
1 "Overview" on page 1
2 "Association Rules" on page 2
3 "Cluster Analysis" on page 20
4 "Decision Trees" on page 39
5 "Factor Analysis" on page 62
6 "Linear Regression" on page 92
7 "Logistic Regression" on page 120

Overview

Teradata Warehouse Miner contains several analytic algorithms from both the traditional statistics and machine learning disciplines. These algorithms pertain to the exploratory data analysis (EDA) and model-building phases of the data mining process. Along with these algorithms, Teradata Warehouse Miner contains corresponding model scoring and evaluation functions that pertain to the model evaluation and deployment phases of the data mining process. A brief summary of the algorithms offered may be given as follows:
• Linear Regression — Linear regression can be used to predict or estimate the value of a continuous numeric data element based upon a linear combination of other numeric data elements present for each observation.
• Logistic Regression — Logistic regression can be used to predict or estimate a two-valued variable based upon other numeric data elements present for each observation.
• Factor Analysis — Factor analysis is a collective term for a family of techniques. In general, factor analysis can be used to identify, quantify, and re-specify the common and unique sources of variability in a set of numeric variables. One of its many applications allows an analytical modeler to reduce the number of numeric variables needed to describe a collection of observations by creating new variables, called factors, as linear combinations of the original variables.
• Decision Trees — Decision trees, or rule induction, can be used to predict or estimate the value of a multi-valued variable based upon other categorical and continuous numeric data elements by building decision rules and presenting them graphically in the shape of a tree, based upon splits on specific data values.
• Clustering — Cluster analysis can be used to form multiple groups of observations, such that each group contains observations that are very similar to one another, based upon values of multiple numeric data elements.
• Association Rules — Generate association rules and various measures of frequency, relationship and statistical significance associated with these rules. These rules can be general, or have a dimension of time association with them.

Association Rules

Overview

Association Rules are measurements on groups of observations or transactions that contain items of some kind. These measurements seek to describe the relationships between the items in the groups, such as the frequency of occurrence of items together in a group or the probability that items occur in a group given that other specific items are in that group. The nature of items and groups in association analysis and the meaning of the relationships between items in a group will depend on the nature of the data being studied. For example, the items may be products purchased and the groups the market baskets in which they were purchased. (This is generally called market basket analysis). Another example is that items may be accounts opened and the groups the customers that opened the accounts. This type of association analysis is useful in a cross-sell application to determine what products and services to sell with other products and services. Obviously the possibilities are endless when it comes to the assignment of meaning to items and groups in business and scientific transactions or observations.

Rules

What does an association analysis produce and what types of measurements does it include? An association analysis produces association rules and various measures of frequency, relationship and statistical significance associated with these rules. Association rules are of the form:

X1, X2, ..., Xn → Y1, Y2, ..., Ym

where X1, X2, ..., Xn is a set of n items that appear in a group along with a set of m items Y1, Y2, ..., Ym in the same group. For example, if checking, saving and credit card accounts are owned by a customer, then the customer will also own a certificate of deposit (CD) with a certain frequency. Relationship means that, for example, owning a specific account or set of accounts (the antecedent) is associated with ownership of one or more other specific accounts (the consequent). Association rules, in and of themselves, do not warrant inferences of causality, however they may point to relationships among items or events that could be studied further using other analytical techniques which are more appropriate for determining the structure and nature of causalities that may exist.

Measures

The four measurements made for association rules are support, confidence, lift and Z score.

Support

Support is a measure of the generality of an association rule, and is literally the percentage (a value between 0 and 1) of groups that contain all of the items referenced in the rule.
More formally, in the association rule defined as L → R, L represents the items given to occur together (the Left side or antecedent), and R represents the items that occur with them as a result (the Right side or consequent). Support can actually be applied to a single item or a single side of an association rule, as well as to an entire rule. The support of an item is simply the percentage of groups containing that item. Given the previous example of banking product ownership, let L be defined as the number of customers who own the set of products on the left side and let R be defined as the number of customers who own the set of products on the right side. Further, let LR be the number of customers who own all products in the association rule (note that this notation does not mean L times R), and let N be defined as the total number of customers under consideration. The support of L, R and the association rule are given by:

Sup(L) = L / N

Sup(R) = R / N

Sup(L → R) = LR / N

Let's say for example that out of 10 customers, 6 of them have a checking account, 5 have a savings account, and 4 have both. If L is (checking) and R is (savings), then Sup(L) is .6, Sup(R) is .5 and Sup(L → R) is .4.

Confidence

Confidence is the probability of R occurring in an item group given that L is in the item group. The equation to calculate the probability of R occurring in an item group given that L is in the item group is given by:

Conf(L → R) = Sup(L → R) / Sup(L)

Another way of expressing the measure confidence is as the percentage of groups containing L that also contain R. This gives the following equivalent calculation for confidence:

Conf(L → R) = LR / L

Using the previous example of banking product ownership once again, the confidence that checking account ownership implies savings account ownership is 4/6. The expected value of an association rule is the number of customers that are expected to have both L and R if there is no relationship between L and R. (To say that there is no relationship between L and R means that customers who have L are neither more likely nor less likely to have R than are customers who do not have L). The equation for the expected value of the association rule is:

E_LR = (L * R) / N

An equivalent formula for the expected value of the association rule is:

E_LR = Sup(L) * Sup(R) * N

Again using the previous example, the expected value of the number of customers with checking and savings is calculated as 6 * 5 / 10 or 3. The expected confidence of a rule is the confidence that would result if there were no relationship between L and R. This simply equals the percentage of customers that own R, since if owning L has no effect on owning R, then it would be expected that the percentage of L's that own R would be the same as the percentage of the entire population that own R. The following equation computes expected confidence:

E_Conf = R / N = Sup(R)

From the previous example, the expected confidence that checking implies savings is given by 5/10.
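The figures in the example above can be reproduced with a few lines of arithmetic. The following Python sketch is purely illustrative (it is not code generated by Teradata Warehouse Miner) and recomputes support, confidence and the expected values for the 10-customer checking and savings example.

# Illustrative only: support, confidence and expected values for the
# 10-customer checking/savings example described above.
N = 10    # total customers (groups)
L = 6     # customers owning the left-side item (checking)
R = 5     # customers owning the right-side item (savings)
LR = 4    # customers owning both

sup_L = L / N                # 0.6
sup_R = R / N                # 0.5
sup_rule = LR / N            # 0.4
conf = sup_rule / sup_L      # 4/6, about 0.667
e_lr = sup_L * sup_R * N     # 3.0 expected co-owners if L and R are unrelated
e_conf = sup_R               # 0.5 expected confidence

print(sup_L, sup_R, sup_rule, round(conf, 3), e_lr, e_conf)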
Lift

Lift measures how much the probability of R is increased by the presence of L in an item group. A lift of 1 indicates there are exactly as many occurrences of R as expected; thus, the presence of L neither increases nor decreases the likelihood of R occurring. A lift of 5 indicates that the presence of L implies that it is 5 times more likely for R to occur than would otherwise be expected. A lift of 0.5 indicates that when L occurs, it is one half as likely that R will occur. Lift can be calculated as follows:

Lift(L → R) = LR / E_LR

From another viewpoint, lift measures the ratio of the actual confidence to the expected confidence, and can be calculated equivalently as either of the following:

Lift(L → R) = Conf(L → R) / E_Conf

Lift(L → R) = Conf(L → R) / Sup(R)

The lift associated with the previous example of "checking implies savings" is 4/3.

Z score

Z score measures how statistically different the actual result is from the expected result. A Z score of zero corresponds to the situation where the actual number equals the expected. A Z score of 1 means that the actual number is 1 standard deviation greater than expected. A Z score of -3.0 means that the actual number is 3 standard deviations less than expected. As a rule of thumb, a Z score greater than 3 (or less than -3) indicates a statistically significant result, which means that a difference that large between the actual result and the expected is very unlikely to be due to chance. A Z score attempts to help answer the question of how confident you can be about the observed relationship between L and R, but does not directly indicate the magnitude of the relationship. It is interesting to note that a negative Z score indicates a negative association. These are rules L → R where ownership of L decreases the likelihood of owning R. The following equation calculates a measure of the difference between the expected number of customers that have both L and R, if there is no relationship between L and R, and the actual number of customers that have both L and R. (It can be derived starting with either the formula for the standard deviation of the sampling distribution of proportions or the formula for the standard deviation of a binomial variable).

Zscore(L → R) = (LR - E_LR) / SQRT(E_LR * (1 - E_LR / N))

or equivalently:

Zscore(L → R) = (N * Sup(L → R) - N * Sup(L) * Sup(R)) / SQRT(N * Sup(L) * Sup(R) * (1 - Sup(L) * Sup(R)))

The mean value is E_LR, and the actual value is LR. The standard deviation is calculated with SQRT(E_LR * (1 - E_LR/N)). From the previous example, the expected value is 6 * 5 / 10, so the mean value is 3. The actual value is calculated knowing that savings and checking accounts are owned by 4 out of 10 customers. The standard deviation is SQRT(3 * (1 - 3/10)) or 1.449. The Z score is therefore (4 - 3) / 1.449 = .690.
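Continuing the same example, the lift and Z score reported above (4/3 and .690) can be checked with the formulas just given. This Python sketch is illustrative only.

import math

# Illustrative only: lift and Z score for the checking/savings example above.
N, L, R, LR = 10, 6, 5, 4

e_lr = L * R / N                            # expected co-owners = 3.0
lift = LR / e_lr                            # 4/3, about 1.333
std_dev = math.sqrt(e_lr * (1 - e_lr / N))  # about 1.449
zscore = (LR - e_lr) / std_dev              # about 0.690

print(round(lift, 3), round(std_dev, 3), round(zscore, 3))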
Interpreting Measures

None of the measures described above are "best"; they all measure slightly different things. In the discussion below, product ownership association analysis is used as an example for purposes of illustration. First look at confidence, which measures the strength of an association: what percent of L customers also own R? Many people will sort associations by confidence and consider the highest confidence rules to be the best. However, there are several other factors to consider. One factor to consider is that a rule may apply to very few customers, and so is not very useful. This is what support measures: the generality of the rule, or how often it applies. Thus a rule L → R might have a confidence of 70%, but if that is just 7 out of 100 customers, it has very low support and is not very useful. Another shortcoming of confidence is that by itself it does not tell you whether owning L "changes" the likelihood of owning R, which is probably the more important piece of information. For example, if 20% of the customers own R, then a rule L → R (20% of those with L also own R) may have high confidence but is really providing no information, because customers that own L have the same rate of ownership of R as the entire population does. What is probably really wanted is to find the products L for which the confidence of L → R is significantly greater than 20%. This is what lift measures, the difference between the actual confidence and the expected confidence. However, lift, like confidence, is much less meaningful when very small numbers are involved; that is, when the support is low. If the expected number is 2 and there are actually 8 customers with product R, then the lift is an impressive 4.0. But because of the small numbers involved, the association rule is likely of limited use, and might even have occurred by chance. This is where the Z score comes in. For a rule L → R, confidence indicates the likelihood that R is owned given that L is owned. Lift indicates how much owning L increases or decreases the probability of the ownership of R, and Z score measures how trustworthy the observed difference between the actual and expected ownership is relative to what could be observed due to chance alone. For example, for a rule L → R, if it is expected to have 10,000 customers with both L and R, and there are actually 11,000, the lift would be only 1.1, but the Z score would be very high, because such a large difference could not be due to chance. Thus, a large Z score and small lift means there definitely is an effect, but it is small. A large lift and small Z means there appears to be a large effect, but it might not be real. A possible strategy then is given here as an illustration, but the exact strategy and threshold values will depend on the nature of each business problem addressed with association analysis. The full set of rules produced by an association analysis is often too large to examine in detail. First, prune out rules that have low Z scores. Try throwing out rules with a Z score of less than 2, if not 3, 4 or 5. However, there is little reason to focus in on rules with extremely high Z scores. Next, filter according to support and lift. Setting a limit on the Z score will not remove rules with low support or with low lift that involve common products. Where to set the support threshold depends on what products are of interest and performance considerations. Where to set the lift threshold is not really a technical question, but a question of preference as to how large a lift is useful from a business perspective. A lift of 1.5 for L → R means that customers that own L are 50% more likely to own R than among the overall population. If a value of 1.5 does not yield interesting results, then set the threshold higher.
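As an illustration of the pruning strategy just described, the following Python sketch filters a list of rules first by Z score and then by support and lift. The rule list, field order and threshold values are hypothetical and are not produced by Teradata Warehouse Miner.

# Hypothetical rules: (antecedent, consequent, support, confidence, lift, zscore)
rules = [
    ("checking", "savings", 0.40, 0.667, 1.33, 0.69),
    ("checking", "cd",      0.02, 0.250, 4.00, 5.10),
    ("savings",  "cd",      0.01, 0.200, 1.05, 0.30),
]

MIN_ZSCORE, MIN_SUPPORT, MIN_LIFT = 3.0, 0.02, 1.5

kept = [r for r in rules
        if abs(r[5]) >= MIN_ZSCORE   # prune statistically weak rules first
        and r[2] >= MIN_SUPPORT      # then require enough generality
        and r[4] >= MIN_LIFT]        # and a lift that matters to the business

for antecedent, consequent, support, confidence, lift, zscore in kept:
    print(antecedent, "->", consequent, support, lift, zscore)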
Sequence Analysis

Sequence analysis is a form of association analysis where the items in an association rule are considered to have a time ordering associated with them. By default, when sequence analysis is requested, left side items are assumed to have "occurred" before right side items, and in fact the items on each side of an association rule, left or right, are also time ordered within themselves. If we use in a sequence analysis the more full notation for an association rule L → R, namely X1, X2, ..., Xm → Y1, Y2, ..., Yn, then we are asserting that not only do the X items precede the Y items, but X1 precedes X2, and so on through Xm, which precedes Y1, which precedes Y2, and so on through Yn. It is important to note here that if a strict ordering of items in a sequence analysis is either not desired or not possible for some reason (such as multiple purchases on the same day), an option is provided to relax the strict ordering. With relaxed sequence analysis, all items on the left must still precede all items on the right of a sequence rule, but the items on the left and the items on the right are not time ordered amongst themselves. (When the rules are presented, the items in each rule are ordered by name for convenience). Lift and Z score are calculated differently for sequence analysis than for association analysis. Recall that the expected value of the association rule, E_LR, is given by Sup(L) * Sup(R) * N for a non-sequence association analysis. For example, if L occurs half the time and R occurs half the time, then if L and R are independent of each other it can be expected that L and R will occur together one-fourth of the time. But this does not take into account the fact that with sequence analysis, the correct ordering can only be expected to happen some percentage of the time if L and R are truly independent of each other. Interestingly, this expected percentage of independent occurrence of correct ordering is calculated the same for strictly ordered and relaxed ordered sequence analysis. With m items on the left and n on the right, the probability of correct ordering is given by m!n! / (m + n)!. Note that this is the inverse of the combinatorial analysis formula for the number of permutations of m + n objects grouped such that m are alike and n are alike. In the case of strictly ordered sequence analysis, the applicability of the formula just given for the probability of correct ordering can be explained as follows. There are clearly m + n objects in the rule, and saying that m are alike and n are alike corresponds to restricting the permutations to those that preserve the ordering of the m items on the left side and the n items on the right side of the rule. That is, all of the orderings of the items on a side other than the correct ordering fall out as being the same permutation. The logic of the formula given for the probability of correct ordering is perhaps easier to see in the case of relaxed ordering. Since there are m + n items in the rule, there are (m + n)! possible orderings of the items. Out of these, there are m! ways the left items can be ordered and n! ways the right items can be ordered while ensuring that the m items on the left precede the n items on the right, so there are m!n! valid orderings out of the (m + n)! possible. The "probability of correct ordering" factor described above has a direct effect on the calculation of lift and Z score. Lift is effectively divided by this factor, such that a factor of one half results in doubling the lift and increasing the Z score as well.
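A small illustration of the ordering factor follows. The code is a sketch only; the base lift value is hypothetical.

import math

# Illustrative only: the "probability of correct ordering" factor m!n!/(m+n)!
# and its effect on lift in a sequence analysis.
def ordering_probability(m: int, n: int) -> float:
    return math.factorial(m) * math.factorial(n) / math.factorial(m + n)

p = ordering_probability(1, 1)   # 0.5 for a 1-to-1 sequence rule
base_lift = 1.2                  # hypothetical lift before the adjustment
sequence_lift = base_lift / p    # lift is effectively divided by the factor

print(p, sequence_lift)          # 0.5  2.4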
The resulting lift and Z score for sequence analysis must be interpreted cautiously however since the assumptions made in calculating the independent probability of correct ordering are quite broad. For example, it is assumed that all combinations of ordering are equally likely to occur, and the amount of time between occurrences is completely ignored. To give the user more control over the calculation of lift and Z score for a sequence analysis, an option is provided to set the “probability of correct ordering” factor to a constant value if desired. Setting it to 1 for example effectively ignores this factor in the calculation of E_LR and therefore in lift and Z score. Initiate an Association Analysis Use the following procedure to initiate a new Association analysis in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 1: Add New Analysis from toolbar 2 8 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Association: Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Association Rules Figure 2: Add New Analysis dialog 3 This will bring up the Association dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Association - INPUT - Data Selection On the Association dialog click on INPUT and then click on data selection: Figure 3: Association > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). 2 Select Columns From a Single Table Teradata Warehouse Miner User Guide - Volume 3 9 Chapter 1: Analytic Algorithms Association Rules • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for the Association analysis. • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can either insert columns as Group, Item, or Sequence columns. Make sure you have the correct portion of the window highlighted. • Group Column — The column that specifies the group for the Association analysis. This column should specify observations or transactions that contain items of some kind. • Item Column — The column that specifies the items to be analyzed in the Association analysis. 
The relationship of these items within the group will be described by the Association analysis. • Sequence Column — The column that specifies the sequence of items in the Association analysis. This column should have a time ordering relationship with the item associated with them. Association - INPUT - Analysis Parameters On the Association dialog click on INPUT and then click on analysis parameters: Figure 4: Association > Input > Analysis Parameters On this screen select: • Association Combinations — In this window specify one or more association combinations in the format of “X TO Y” where the sum of X and Y must not exceed a total of 10. First select an “X TO Y” combination from the drop-down lists: Figure 5: Association: X to X 10 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Association Rules Then click the Add button to add this combination to the window. Repeat for as many combinations as needed: Figure 6: Association Combinations pane If needed, remove a combination by highlighting it in the window and then clicking on the Remove button. • Processing Options • Perform All Steps — Execute the entire Association/Sequence Analysis, regardless of result sets generated from a previous execution. • Perform Support Calculation Only — In order to determine the minimum support value to use, the user may choose to only build the single-item support table by using this option, making it possible to stop and examine the table before proceeding. • Recalculate Final Affinities Only — Rebuild just the final association tables using support tables from a previous run provided that intermediate work tables were not dropped (see Drop All Support Tables After Execution option below). • Auto-Calculate group count — By default, the algorithm automatically determines the actual input count. • Force Group Count To — If the Auto-Calculate group count is disabled, this option can be used to fix the number of groups, overriding the actual input count. This is useful in conjunction with the Reduced Input Options, to set the group count to the group count in the original data set, rather than the reduced input data set. • Drop All Support Tables After Execution — Normally, the Association analysis temporarily builds the support tables, dropping them prior to termination. If for performance reasons, it is desired to use the Recalculate Final Affinities Only option, this option can be disabled so that this clean-up of support tables does not happen. • Minimum Support — The minimum Support value that the association must have in order to be reported. Using this option reduces the input data - this can be saved for further processing using the Reduced Input Options. Using this option also invokes list-wise deletion, automatically removing from processing (and from the reduced input data) all rows containing a null Group, Item or Sequence column. • Minimum Confidence — The minimum Confidence value that the association must have in order to be reported. • Minimum Lift — The minimum Lift value that the association must have in order to be reported. • Minimum Z-Score — The minimum absolute Z-Score value that the association must have in order to be reported. Teradata Warehouse Miner User Guide - Volume 3 11 Chapter 1: Analytic Algorithms Association Rules • Sequence Options — If a column is specified with the Sequence Column option, then the following two Sequence Options are enabled. 
Note that Sequence Analysis is not available when Hierarchy Information is specified:
• Use Relaxed Ordering — With this option, the items on each side of the association rule may be in any sequence provided all the left items (antecedents) precede all the right items (consequents).
• Auto-Calculate Ordering Probability — Sequence analysis option to let the algorithm calculate the "probability of correct ordering" according to the principles described in "Sequence Analysis" on page 7. (Note that the following option to set "Ordering Probability" to a chosen value is only available if this option is unchecked).
• Ordering Probability — Sequence analysis option to set the probability of correct ordering to a non-zero constant value between 0 and 1. Setting it to 1 effectively ignores this principle in calculating lift and Z-score.

Association - INPUT - Expert Options

On the Association dialog click on INPUT and then click on expert options:

Figure 7: Association > Input > Expert Options

On this screen select:
• Where Conditions — An SQL WHERE clause may be specified here to provide further input filtering for only those groups or items that you are interested in. This works exactly like the Expert Options for the Descriptive Statistics, Transformation and Data Reorganization functions - only the condition itself is entered here. Using this option reduces the input data set - this can be saved for further processing using the Reduced Input Options. Using this option also invokes list-wise deletion, automatically removing from processing (and from the reduced input data) all rows containing a null Group, Item or Sequence column.
• Include Hierarchy Table — A hierarchy lookup table may be specified to convert input items on both the left and right sides of the association rule to a higher level in a hierarchy if desired. Note that the column in the hierarchy table corresponding to the items in the input table must not contain repeated values, so effectively the items in the input table must match the lowest level in the hierarchy table. The following is an example of a three-level hierarchy table compatible with Association analysis, provided the input table matches up with the column ITEM1.

Table 1: Three-Level Hierarchy Table

ITEM1  ITEM2  ITEM3  DESC1     DESC2       DESC3
A      P      Y      Savings   Passbook    Deposit
B      P      Y      Checking  Passbook    Deposit
C      W      Z      Atm       Electronic  Access
D      S      X      Charge    Short       Credit
E      T      Y      CD        Term        Deposit
F      T      Y      IRA       Term        Deposit
G      L      X      Mortgage  Long        Credit
H      L      X      Equity    Long        Credit
I      S      X      Auto      Short       Credit
J      W      Z      Internet  Electronic  Access

Using this option reduces the input data set - this can be saved for further processing using the Reduced Input Options. Using this option also invokes list-wise deletion, automatically removing from processing (and from the reduced input data) all rows containing a null Group, Item or Sequence column. The following columns in the hierarchy table must be specified with this option.
• Item Column — The name of the column that can be joined to the column specified by the Item Column option on the Select Column tab to look up the associated Hierarchy.
• Hierarchy Column — The name of the column with the Hierarchy values.
• Include Description Table — For reporting purposes, a descriptive name or label can be given to the items processed during the Association/Sequence Analysis.
• Item ID Column — The name of the column that can be joined to the column specified by the Item Column option on the Select Column tab (or Hierarchy Column option on the Hierarchies tab if hierarchy information is also specified) to look up the description. • Item Description Column — The name of the column with the descriptive values. • Include Left Side Lookup Table — A focus products table may be specified to process only those items that are of interest on the left side of the association. • Left Side Identifier Column — The name of the column where the Focus Products values exist for the left side of the association. • Include Right Side Lookup Table — A focus products table may be specified to process only those items that are of interest on the right side of the association. • Right Side Identifier Column — The name of the column where the Focus Products values exist for the right side of the association. Teradata Warehouse Miner User Guide - Volume 3 13 Chapter 1: Analytic Algorithms Association Rules Association - OUTPUT On the Association dialog click on OUTPUT: Figure 8: Association > Output On this screen select: • Output Tables • Database Name — The database where the Association analysis build temporary and permanent tables during the analysis. This defaults to the Result Database. • Table Names — Assign a table name for each displayed combination. • Advertise Output — The Advertise Output option “advertises” each output table (including the Reduced Input Table, if saved) by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Reduced Input Options — A reduced input set, based upon the minimum support value specified, a product hierarchy or input filtering via a WHERE clause, can be saved and used as input to a subsequent Association/Sequence analysis as follows: • Save Reduced Input Table — Check box to specify to the analysis that the reduced input table should be saved. • Database Name — The database name where the reduced input table will be saved. • Table Name — The table name that the reduced input table will be saved under. • Generate SQL, but do not Execute it — Generate the Association or Sequence Analysis SQL, but do not execute it - the set of queries are returned with the analysis results. Run the Association Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard 14 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Association Rules Results - Association Analysis The results of running the Association analysis include a table for each association pair requested, as well as the SQL to perform the association or sequence analysis. All of these results are outlined below. Association - RESULTS - SQL On the Association dialog click on RESULTS and then click on SQL: Figure 9: Association > Results > SQL The series of SQL statements that comprise the Association/Sequence Analysis are displayed here. 
Association - RESULTS - data

On the Association dialog click on RESULTS and then click on data:

Figure 10: Association > Results > Data

Results data, if any, is displayed in a data grid. An output table is generated for each item pair specified in the Association Combinations option. Each table generated has the form specified below:

Table 2: Association Combinations output table
• ITEMXOFY — Type: user defined (default is the data type of the Item Column). Two or more columns will be generated, depending upon the number of Association Combinations. Together, these form the UPI of the result table. The value for X in the column name is 1 through the number of item pairs specified. The value for Y in the column name is the sum of the number of items specified. For example, specifying Left and Right Association Combinations of <1, 1> will produce two columns: ITEM1OF2, ITEM2OF2. Specifying <1, 2> will result in three columns: ITEM1OF3, ITEM2OF3 and ITEM3OF3. The data type is the same as the Item Column.
• LSUPPORT — Type: DECIMAL(18,5). The Support of the left-side item or antecedent only.
• RSUPPORT — Type: DECIMAL(18,5). The Support of the right-side item or consequent only.
• SUPPORT — Type: DECIMAL(18,5). The Support of the association (i.e., antecedent and consequent together).
• CONFIDENCE — Type: DECIMAL(18,5). The Confidence of the association.
• LIFT — Type: DECIMAL(15,5). The Lift of the association.
• ZSCORE — Type: DECIMAL(15,5). The Z-Score of the association.

Association - RESULTS - graph

On the Association dialog click on RESULTS and then click on graph:

Figure 11: Association > Results > Graph

For 1-to-1 Associations, a tile map is available as described below. (No graph is available for combinations other than 1-to-1).
• Graph Options — Two selectors with a Reference Table display underneath are used to make association selections to graph. For example, the following selections produced the graph below.

Figure 12: Association Graph Selector

The Graph Options display has the following selectors:
a Select item 1 of 2 from this table, then click button. The first step is to select the left-side or antecedent items to graph associations for by clicking or dragging the mouse just to the left of the row numbers displayed. Note that the accumulated minimum and maximum values of the measures checked just above the display are given in this table. (The third column, "Item2of2 count", is a count of the number of associations that are found in the result table for this left-side item). Once the selections are made, click the big button between the selectors.
b Select from these tables to populate graph. The second step is to select once again the desired left-side or antecedent items by clicking or dragging the mouse just to the left of the row numbers displayed under the general header "Item 1 of 2" in the left-hand portion of selector 2. Note that as "Item 1 of 2" items are selected, "Item 2 of 2" right-side or consequent items are automatically selected in the right-hand portion of selector 2. Here the accumulated minimum and maximum values of the measures checked just above this display are given in the trailing columns of the table.
(The third column “Item1of2 count” is a count of the number of associations that are found in the result table for this right-side item when limited to associations involving the left-side items selected in step 1). The corresponding associations are automatically highlighted in the Reference Table below. An alternative second step is to directly select one or more “Item 2 of 2” items in the right-hand portion of selector 2. The corresponding associations (again, limited to the left-side items selected in the first step) are then highlighted in the Reference Table below. Teradata Warehouse Miner User Guide - Volume 3 17 Chapter 1: Analytic Algorithms Association Rules • Reference Table — This table displays the rows from the result table that correspond to the selections made above in step 1, highlighting the rows corresponding to the selections made in step 2. • (Row Number) — A sequential numbering of the rows in this display. • Item 1 of 2 — Left item or antecedent in the association rule. • Item 2 of 2 — Right item or consequent in the association rule. • LSupport — The left-hand item Support, calculated as the percentage (a value between 0 and 1) of groups that contain the left-hand item referenced in the association rule. • RSupport — The right-hand item Support, calculated as the percentage (a value between 0 and 1) of groups that contain the right-hand item referenced in the association rule. • Support — The Support, which is a measure of the generality of an association rule. Calculated as the percentage (a value between 0 and 1) of groups that contain all of the items referenced in the rule • Confidence — The Confidence defined as the probability of the right-hand item occurring in an item group given that the left-hand item is in the item group. • Lift — The Lift which measures how much the probability of the existence of the right-hand item is increased by the presence of the left hand item in a group. • ZScore — The Z score value, a measure of how statistically different the actual result is from the expected result. • Show Graph — A tile map is displayed when the “show graph” tab is selected, provided that valid “graph options” selections have been made. The example below corresponds to the graph options selected in the example above. Figure 13: Association Graph The tiles are color coded in the gradient specified on the right-hand side. Clicking on any tile, brings up all statistics associated with that association, and highlights the two items in 18 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Association Rules the association. Radio buttons above the upper right hand corner of the tile map can be used to select the measure to color code in the tiles, that is either Support, Lift or Zscore. Tutorial - Association Analysis In this example, an Association analysis is performed on the fictitious banking data to analyze channel usage. Parameterize an Association analysis as follows: • Available Tables — twm_credit_tran • Group Column — cust_id • Item Column — channel • Association Combinations • Left — 1 • Right — 1 • Processing Options • Perform All Steps — Enabled • Minimum Support — 0 • Minimum Confidence — 0.1 • Minimum Lift — 1 • Minimum Z-Score — 1 • Where Clause Text — channel <> ‘ ‘ (i.e., channel is not equal to a single blank) • Output Tables • 1 to 1 Table Name — twm_tutorials_assoc Run the analysis, and click on Results when it completes. For this example, the Association analysis generated the following pages. 
The SQL is not shown for brevity.

Table 3: Tutorial - Association Analysis Data

ITEM1OF2  ITEM2OF2  LSUPPORT  RSUPPORT  SUPPORT  CONFIDENCE  LIFT     ZSCORE
A         E         0.85777   0.91685   0.80744  0.94132     1.02669  1.09511
B         K         0.49672   0.35667   0.21007  0.42291     1.18572  1.84235
B         V         0.49672   0.36324   0.22538  0.45374     1.24915  2.49894
C         K         0.67177   0.35667   0.26477  0.39414     1.10506  1.26059
C         V         0.67177   0.36324   0.27133  0.4039      1.11194  1.35961
E         A         0.91685   0.85777   0.80744  0.88067     1.0267   1.09511
K         B         0.35667   0.49672   0.21007  0.58898     1.18574  1.84235
K         C         0.35667   0.67177   0.26477  0.74234     1.10505  1.26059
K         V         0.35667   0.36324   0.1663   0.46626     1.28361  2.33902
V         B         0.36324   0.49672   0.22538  0.62047     1.24913  2.49894
V         C         0.36324   0.67177   0.27133  0.74697     1.11194  1.35961
V         K         0.36324   0.35667   0.1663   0.45782     1.2836   2.33902

Click on Graph Options and perform the following steps:
1 Select all data in selector 1 under the "Item 1 of 2" heading.
2 Click on the large button between selectors 1 and 2.
3 Select all data in selector 2 under the "Item 1 of 2" heading.
4 Click on the show graph tab.

When the tile map displays, perform the following additional steps:
a Click on the bottom most tile. (Hovering over this tile will display the item names K and V).
b Try selecting different measures at the top right of the tile map. (Zscore will initially be selected).

Figure 14: Association Graph: Tutorial
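The relationships among the reported columns can be verified by hand. For example, for the K to V row in Table 3, confidence and lift follow directly from the three support columns, as the short illustrative Python check below shows (values copied from the table above).

# Illustrative check of the K -> V row in Table 3 above.
lsupport, rsupport, support = 0.35667, 0.36324, 0.1663

confidence = support / lsupport          # about 0.46626, as reported
lift = support / (lsupport * rsupport)   # about 1.2836, as reported

print(round(confidence, 5), round(lift, 4))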
While this section primarily introduces Gaussian Mixture Model clustering, variations of this technique are described in the next section. In particular, the Fast K-Means clustering option uses a quite different technique: a stored procedure and a table operator that process the data more directly in the database for a considerable performance boost. Preprocessing - Cluster Analysis Some preprocessing of the input data by the user may be necessary. Any categorical data to be clustered must first be converted to design-coded numeric variables. Since null data values may bias or invalidate the analysis, they may be replaced, or the listwise deletion option selected to exclude rows with any null values in the preprocessing phase. Teradata Warehouse Miner automatically builds a single input table from the requested columns of the requested input table. If the user requests more than 30 input columns, the data is unpivoted with additional rows added for the column values. Through this mechanism, any number of columns within a table may be analyzed, and the SQL optimized for a particular Teradata server capability. Expectation Maximization Algorithm The clustering algorithm requires specification of the desired number of clusters. After preprocessing, an initialization step determines seed values for the clusters, and clustering is then performed based on conditional probability and maximum likelihood principles using the EM algorithm to converge on cluster assignments that yield the maximum likelihood value. In a Gaussian Mixture (GM) model, it is assumed that the variables being modeled are members of a normal (Gaussian) probability distribution. For each cluster, a maximum likelihood equation can be constructed indicating the probability that a randomly selected Teradata Warehouse Miner User Guide - Volume 3 21 Chapter 1: Analytic Algorithms Cluster Analysis observation from that cluster would look like a particular observation. A maximum likelihood rule for classification would assign this observation to the cluster with the highest likelihood value. In the computation of these probabilities, conditional probabilities use the relative size of clusters and prior probabilities, to compute a probability of membership of each row to each cluster. Rows are reassigned to clusters with probabilistic weighting, after units of distance have been transformed to units of standard deviation of the standard normal distribution via the Gaussian distance function: p mo = 2 –n 2 R –1 2 2 d mo exp – ------------- 2 Where: • p is dimensioned 1 by 1 and is the probability of membership of a point to a cluster • d is dimensioned 1 by 1 and is the Mahalanobis Distance • n is dimensioned 1 by 1 and is the number of variables • R is dimensioned n by n and is the cluster variance/covariance matrix The Gaussian Distance Function translates distance into a probability of membership under this probabilistic model. Intermediate results are saved in Teradata tables after each iteration, so the algorithm may be stopped at any point and the latest results viewed, or a new clustering process begun at this point. These results consist of cluster means, variances and prior probabilities. Expectation Step Means, variances and frequencies of rows assigned by cluster are first calculated. A covariance inverse matrix is then constructed using these variances, with all non-diagonals assumed to be zero. This simplification is tantamount to the assumption that the variables are independent. 
Expectation Step

Means, variances and frequencies of rows assigned by cluster are first calculated. A covariance inverse matrix is then constructed using these variances, with all non-diagonals assumed to be zero. This simplification is tantamount to the assumption that the variables are independent. Performance is improved thereby, allowing the number of calculations to be proportional to the number of variables, rather than its square. Row distances to the mean of each cluster are calculated using a Mahalanobis Distance (MD) metric:

d_mo^2 = SUM(i=1..n) SUM(j=1..n) (x_mi - c_oi) * (R_o^-1)_ij * (x_mj - c_oj)

Where:
• m is the number of rows
• n is the number of variables
• o is the number of clusters
• d is dimensioned m by o and is the Mahalanobis Distance from a row to a cluster
• x is dimensioned m by n and is the data
• c is dimensioned 1 by n and are the cluster centroids
• R is dimensioned n by n and is the cluster variance/covariance matrix

Mahalanobis Distance is a rescaled unitless data form used to identify outlying data points. Independent variables may be thought of as defining a multidimensional space in which each observation can be plotted. Means ("centroids") for each independent variable may also be plotted. Mahalanobis distance is the distance of each observation from its centroid, defined by variables that may be dependent. In the special case where variables are independent or uncorrelated, it is equivalent to the simple Euclidean distance. In the default GM model, separate covariance matrices are maintained, conforming to the specifications of a pure maximum likelihood rule model. The EM algorithm works by performing the expectation and maximization steps iteratively until the log-likelihood value converges (i.e., changes less than a default or specified epsilon value), or until a maximum specified number of iterations has been performed. The log-likelihood value is the sum over all rows of the natural log of the probabilities associated with each cluster assignment. Although the EM algorithm is guaranteed to converge, it is possible it may converge slowly for comparatively random data, or it may converge to a local maximum rather than a global one.

Maximization Step

The row is assigned to the nearest cluster with a probabilistic weighting for the GM model, or with certainty for the K-Means model.

Options - Cluster Analysis

K-Means Option

With the K-Means option, rows are reassigned to clusters by associating each to the closest cluster centroid using the shortest distance. Data points are assumed to belong to only one cluster, and the determination is considered a 'hard assignment'. After the distances are computed from a given point to each cluster centroid, the point is assigned to the cluster whose center is nearest to the point. On the next iteration, the point's value is used to redefine that cluster's mean and variance. This is in contrast to the default Gaussian option, wherein rows are reassigned to clusters with probabilistic weighting, after units of distance have been transformed to units of standard deviation via the Gaussian distance function. Also with the K-means option, the variables' distances to cluster centroids are calculated by summing, without any consideration of the variances, resulting effectively in the use of unnormalized Euclidean distances. This implies that variables with large variances will have a greater influence over the cluster definition than those with small variances. Therefore, a typical preparatory step to conducting a K-means cluster analysis is to standardize all of the numeric data to be clustered using the Z-score transformation function in Teradata Warehouse Miner.
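The following Python sketch shows the kind of Z-score standardization recommended above. It is illustrative only and is not the product's Z-score transformation function; the column values are hypothetical.

import statistics

# Illustrative only: Z-score standardization of numeric columns prior to
# K-means clustering, so that no single variable dominates the distances.
def zscore(values):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)   # population standard deviation
    return [(v - mean) / stdev for v in values]

# Hypothetical columns with very different scales.
income = [52000, 61000, 47500, 83000]
age = [34, 45, 29, 52]

print(zscore(income))
print(zscore(age))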
K-means analyses of data that are not standardized typically produce results that: (a) are dominated by variables with large variances, and (b) virtually or totally ignore variables with small variances during cluster formation. Alternatively, the Rescale function could be used to normalize all numeric data, with a lower boundary of zero and an upper boundary of one. Normalizing the data prior to clustering gives all the variables equal weight. Teradata Warehouse Miner User Guide - Volume 3 23 Chapter 1: Analytic Algorithms Cluster Analysis Fast K-Means Option The Fast K-Means option provides a dramatic performance improvement over the K-Means option. When selected, the options on the analysis parameters tab are altered and the options on the expert options tab are not available. With Fast K-Means, the options include the following: • The Number of Clusters, Convergence Criterion and Maximum Iterations are provided as before. • The option to remove null values using list-wise deletion is not offered, it is automatically done. • The Variable Importance Evaluation Reports are not offered. • The Cluster Definitions Database and Table names are supplied by you. This table stores the model and the scoring module processes it. It can also be used to continue execution starting with the cluster definitions in this table, rather than using random starting clusters. • An Advertise Option option is provided for the Cluster Definitions table. The Fast K-Means algorithm creates an output table structured differently than other clustering algorithms. The output table is converted into the style used by other algorithms (that is, viewed as a report and graphed in the usual manner). If the conversion is not possible, you can view the cluster definitions in the new style as data, along with the progress report. Note: Install the td_analyze external stored procedure and the tda_kmeans table operator called by the stored procedure in the database where the TWM metadata tables reside. You can use the Install or Uninstall UDF’s option under the Teradata Warehouse Miner start program item, selecting the option to Install TD_Analyze UDFs. Poisson Option The Poisson option is designed to be applied to data containing mixtures of Poisson-distributed variables. The data is first normalized so all variables have the same means and variances, allowing the calculation of the distance metric without biasing the result in favor of larger-magnitude variables. The EM algorithm is then applied with a probability metric based on the likelihood function of the Poisson distribution function. As in the Gaussian Mixture Model option, rows are assigned to the nearest cluster with a probabilistic weighting. At the end of the EM iteration, the data is unnormalized and saved as a potential result, until or unless replaced by the next iteration. Average Mode - Minimum Generalized Distance Within the GM model, a special “average mode” option is provided, using the minimum generalized distance rule. With this option, a single covariance matrix is used for all clusters, rather than using an individual covariance matrix for each cluster. A weighted average of the covariance matrices is constructed for use in the succeeding iteration. Automatic Scaling of Likelihood Values When a large number of variables are input to the cluster analysis module, likelihood values can become prohibitively small. The algorithm automatically scales these values to avoid loss of precision, without invalidating the results in any way. 
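The sketch below illustrates why rescaling very small likelihood values does not invalidate the results: working with log-likelihoods and shifting them all by the same amount before exponentiating leaves the membership proportions unchanged. This is a generic numerical illustration in Python/NumPy, not the product's actual scaling implementation.

import numpy as np

def responsibilities(log_likelihoods):
    """Convert per-cluster log-likelihoods for one row into membership weights.

    Subtracting the maximum log-likelihood before exponentiating is equivalent
    to multiplying every likelihood by the same scale factor, so the resulting
    weights are unchanged while underflow is avoided.
    """
    shifted = log_likelihoods - np.max(log_likelihoods)
    weights = np.exp(shifted)
    return weights / weights.sum()

# With many variables the raw likelihoods may underflow to zero ...
log_p = np.array([-2100.0, -2105.0, -2120.0])
print(np.exp(log_p))            # [0. 0. 0.] -- unusable directly
# ... but the scaled computation still recovers the correct proportions.
print(responsibilities(log_p))  # approximately [0.9933, 0.0067, 0.0000]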
The expert option ‘Scale Factor Exponent (s)’ may be used to bypass this feature by supplying a specific value s, so that the probabilities are multiplied by 10^s.

Continue Option

The Continue Option allows clustering to be resumed where it left off by starting with the cluster centroid, variance and probability values of the last complete iteration saved in the metadata tables or output tables as requested on the Output Panel. Specifically, if the Continue Option is selected and output tables are specified and exist, the information in the output tables is used to restart processing. If output tables do not exist, then the model in metadata is used to restart processing.

Note: If requested, the output tables are updated for each iteration of the algorithm and can, therefore, provide a degree of recovery.

In the case of the Fast K-Means algorithm, however, the Continue Option depends on locating the Cluster Definition table named on the analysis parameters tab, which is effectively the model for this algorithm variation. The Cluster Definition table is also updated for each iteration of the algorithm and can, therefore, provide a degree of recovery. There is a special case of the Continue Option in which a Fast K-Means analysis is run first and, if it completes successfully, a Gaussian Mixture Model clustering continues from its results. To do this, request output tables on the Fast K-Means analysis and request the same tables as output tables on a Gaussian Mixture Model analysis with the Continue Option also selected.

Note: With Fast K-Means, the output tables are built only at the end of processing and not after each iteration of the algorithm.

Using the TWM Cluster Analysis

This section recommends parameter settings and techniques that apply primarily to the Gaussian Mixture Model.

Sampling Large Database Tables as a Starting Method

It may be most effective to use the sample parameter to begin the analysis of extremely large databases. The execution times are much faster, and an approximate result is obtained that can be used as a starting point, as described above. Results may be compared using the log-likelihood value, where the largest value indicates the best clustering fit in terms of maximum likelihood. Because local maxima may result from a particular EM clustering analysis, multiple executions from different samples may produce a seed that ultimately yields the best log-likelihood value.

Clustering and Data Problems

Common data problems for cluster analysis include insufficient rows provided for the number of clusters requested, and constants in the data resulting in singular covariance matrices. When these problems occur, warning messages and recommendations are provided. An option for dealing with null values during processing is described below.

Additionally, Teradata errors may occur for non-normalized data having more than 15 digits of significance. In this case, a preprocessing step of either multiplying (for small numbers) or dividing (for large numbers) by a constant value may rectify overflow and underflow conditions. The clusters will remain the same, since this merely changes the unit of measure.
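As a sketch of the sampling-based starting method described above, the following outline runs clustering on several random samples and keeps the result with the largest log-likelihood. The run_clustering callable is hypothetical; it stands in for whatever clustering routine is applied to each sample and is assumed to return a model and its log-likelihood.

import numpy as np

def best_of_runs(data, run_clustering, n_runs=5, sample_fraction=0.1, seed=0):
    """Run clustering on several random samples and keep the best fit.

    run_clustering is a hypothetical callable that accepts a sample of rows
    and returns (model, log_likelihood); the largest log-likelihood value
    indicates the best clustering fit in the maximum-likelihood sense.
    """
    rng = np.random.default_rng(seed)
    best_model, best_ll = None, -np.inf
    for _ in range(n_runs):
        mask = rng.random(len(data)) < sample_fraction   # approximate sample
        model, log_likelihood = run_clustering(data[mask])
        if log_likelihood > best_ll:
            best_model, best_ll = model, log_likelihood
    return best_model, best_ll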
Clustering and Constants in Data

When one or more of the variables included in the clustering analysis have only a few values, these values may be singled out and included in particular clusters as constants. This is most likely when the number of clusters sought is large. When this happens, the covariance matrix becomes singular and cannot be inverted, since some of the variances are zero. A feature is provided in the cluster algorithm to improve the chance of success under these conditions, by limiting how close to zero the variance may be set, e.g. 10^-3. The default value is 10^-10. If the log-likelihood values increase for a number of iterations and then start decreasing, it is likely due to the clustering algorithm having found clusters where selected variables are all the same value (a constant), so the cluster variance is zero. Changing the minimum variance exponent value to a larger value may reduce the effect of these constants, allowing the other variables to converge to a higher log-likelihood value.

Clustering and Null Values

The presence of null values in the data may result in clusters that differ from those that would have resulted from zero or numeric values. Since null data values may bias or invalidate the analysis, they should be replaced or the column eliminated. Alternatively, the listwise deletion option can be selected to exclude rows with any null values in the preprocessing phase.

Stop Execution of a Clustering or Cluster Scoring Analysis

Analyses can be terminated prior to normal completion by highlighting the name and clicking Stop on the Toolbar, or by right-clicking the analysis name and selecting the Stop option. Typically, this results in a Cancelled status and a Cancelled message during execution. However, it can result in a Failed status and an error message, such as “The transaction was aborted by the user,” particularly when using the Fast K-Means algorithm.

Success Analysis - Cluster Analysis

If the log-likelihood value converges and the requested number of clusters is obtained with significant probabilities, then the clustering analysis can be considered successful. If the log-likelihood value declines, indicating convergence is complete, the iterations stop. Occasionally, warning messages can indicate constants within one or more clusters.

Optimizing Performance of Clustering

Parallel execution of SQL is an important feature of the cluster analysis algorithm in Teradata Warehouse Miner as well as Teradata. The number of variables to cluster in parallel is determined by the ‘width’ parameter. The optimum value of width will depend on the size of the Teradata system, its memory size, and so forth. Experience has shown that when a large number of variables are clustered on, the optimum value of width ranges from 20 to 25. The width value is dynamically set to the lesser of the specified Width option (default = 25) and the number of columns, but can never exceed 118. If SQL errors indicate insufficient memory, reducing the width parameter may alleviate the problem.
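Following the “Clustering and Constants in Data” discussion above, this minimal sketch shows the effect of a minimum variance floor of 10^v: variances of zero (from constant variables within a cluster) are clipped so that the covariance matrix remains invertible. The function name and sample values are illustrative assumptions, not the product's internal code.

import numpy as np

def floor_variances(variances, min_variance_exponent=-10):
    """Keep cluster variances away from zero so the covariance matrix stays invertible.

    A variable that is constant within a cluster has zero variance; clipping it
    to 10**v (the minimum variance exponent) mirrors the safeguard described
    above. The default exponent of -10 corresponds to a floor of 1e-10.
    """
    floor = 10.0 ** min_variance_exponent
    return np.maximum(variances, floor)

# A constant variable within a cluster would otherwise make the matrix singular.
print(floor_variances(np.array([2.5, 0.0, 13.7])))        # default floor of 1e-10
print(floor_variances(np.array([2.5, 0.0, 13.7]), -3))    # larger floor of 1e-3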
Initiate a Cluster Analysis Use the following procedure to initiate a new Cluster analysis in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 15: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Clustering: Figure 16: Add New Analysis dialog 3 This will bring up the Clustering dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Cluster - INPUT - Data Selection On the Clustering dialog click on INPUT and then click on data selection: Teradata Warehouse Miner User Guide - Volume 3 27 Chapter 1: Analytic Algorithms Cluster Analysis Figure 17: Clustering > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). 2 Select Columns From a Single Table • Available Databases (or Analyses) — All the databases (or analyses) that are available for the Clustering analysis. • Available Tables — All the tables within the Source Database that are available for the Clustering analysis. • Available Columns — Within the selected table or matrix, all columns which are available for the Clustering analysis. • Selected Columns — Columns must be of numeric type. For Fast K-Means, selected columns may not contain leading or trailing spaces and may not contain a separator character '|' if scoring of the model is ever published. Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Cluster - INPUT - Analysis Parameters On the Clustering dialog click on INPUT and then click on analysis parameters: Figure 18: Clustering > Input > Analysis Parameters 28 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Cluster Analysis On this screen select: • Clustering Algorithm • Gaussian — Cluster the data using a Gaussian Mixture Model as described above. This is the default Algorithm. • K-Means — Cluster the data using the K-Means Model as described above. • Fast K-Means — Cluster the data using a high-performing version of the K-Means Model. • Poisson — Cluster the data using a Poisson Mixture Model as described above. • Number of clusters — Enter the number of clusters before executing the cluster analysis. • Convergence Criterion — For the Gaussian and Poisson Mixture Models, clustering stops when the log-likelihood increases less than this amount. The default value is 0.001. Fast K-Means uses this field as a threshold for cluster changes based on a different formula. 
Generic K-Means, on the other hand, does not use this criterion as clustering stops when the distances of all points to each cluster have not changed from the previous iteration. In other words, when the assignment of rows to clusters has not changed from the previous iteration, clustering has converged. • Maximum Iterations — Clustering is stopped after this maximum number of iterations has occurred. The default value is 50. • Remove Null Values (using Listwise deletion) — This option eliminates all rows from processing that contain any null input columns. The default is enabled. Fast K-Means always performs Listwise deletion — it is not an option. • Include Variable Importance Evaluation reports — Report shows resultant log-likelihood when each variable is successively dropped out of the clustering calculations. The most important variable will be listed next to the most negative log-likelihood value; the least important variable will be listed with the least negative value. Fast K-Means does not offer this option. • Cluster Definitions Database and Table — Applies only to the Fast K-Means algorithm. This table holds the model information and is used when continuing a previous run or when scoring. An option is also provided to Advertise Output with an optional Advertise Note. • Generate SQL Only — Applies only to the Fast K-Means algorithm. This option, if checked, generates the SQL call statement of the external stored procedure td_analyze but does not execute it. The SQL can be viewed on the Results > SQL tab. • Continue Execution (instead of starting over) — Previous execution results are used as seed values for starting clustering. Cluster - INPUT - Expert Options This screen does not apply to the Fast K-Means algorithm. On the Clustering dialog click on INPUT and then click on expert options: Teradata Warehouse Miner User Guide - Volume 3 29 Chapter 1: Analytic Algorithms Cluster Analysis Figure 19: Clustering > Input > Expert Options On this screen select: • Width — Number of variables to process in parallel (dependent on system limits) • Input Sample Fraction — Fraction of input dataset to cluster on. • Scale Factor Exponent — If nonzero “s” is entered, this option overrides automatic scaling, scaling by 10s. • Minimum Probability Exponent — If “e” is entered, the Clustering analysis uses 10e as smallest nonzero number in SQL calculations. • Minimum Variance Exponent — If “v” is entered, the Clustering analysis uses 10v as the minimum variance in SQL calculations. • Use single cluster covariance — Simplified model that uses the same covariance table for all clusters. • Use Random Seeding — When enabled (default) this option seeds the initial clustering answer matrix by randomly selecting a row for each cluster as the seed. This method is the most commonly used type of seeding for all other clustering systems, according to the literature. The byproduct of using this new method is that slightly different solutions will be provided by successive clustering runs, and convergence may be quicker because fewer iterations may be required. • Seed Sample Percentage — If Use Random Seeding is disabled, the previous seeding method of Teradata Warehouse Miner Clustering, where every row is assigned to one of the clusters, and then averages used as the seeds. Enter a percentage (1-100) of the input dataset to use as the starting seed. Cluster - OUTPUT This screen does not apply to the Fast K-Means algorithm. 
On the Clustering dialog, click on OUTPUT: Figure 20: Cluster > OUTPUT On this screen select: 30 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Cluster Analysis • Store the variables table of this analysis in the database — Check this box to store the variables table of this analysis in two tables in the database, one for cluster columns and one for cluster results. • Database Name — The name of the database to create the output tables in. • Output Table Prefix — The prefix of the output tables. (For example, if test is entered here, tables test_ClusterColumns and test_ClusterResults will be created). • Advertise Output — The Advertise Output option “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. By way of an example, the tutorial example with prefix test yields table test_ ClusterResults: Table 4: test_ClusterResults column_ix cluster_id priors m v 1 1 0.0692162138434691 -2231.95933518596 7306685.95957656 1 2 0.403625379654599 -947.132576882845 846532.221977884 1 3 0.527158406501931 -231.599917701351 105775.923364194 2 1 0.0692162138434691 3733.31923440023 18669805.3968291 2 2 0.403625379654599 1293.34863525092 1440668.11504453 2 3 0.527158406501931 231.817911577847 102307.594966697 3 1 0.0692162138434691 3725.87257974281 18930649.6488828 3 2 0.403625379654599 632.603945909026 499736.882919713 3 3 0.527158406501931 163.869611182736 57426.9984808451 and test_ClusterColumns: Table 5: test_ClusterColumns table_name column_name column_alias column_order index_flag variable_type twm_ customer_ analysis avg_cc_bal avg_cc_bal 1 0 1 twm_ customer_ analysis avg_ck_bal avg_ck_bal 2 0 1 Teradata Warehouse Miner User Guide - Volume 3 31 Chapter 1: Analytic Algorithms Cluster Analysis Table 5: test_ClusterColumns table_name column_name column_alias column_order index_flag variable_type twm_ customer_ analysis avg_sv_bal avg_sv_bal 3 0 1 If Database Name is twm_results and Output Table Prefix is test, these tables are defined respectively as: CREATE SET TABLE twm_results.test_ClusterResults ( column_ix INTEGER, cluster_id INTEGER, priors FLOAT, m FLOAT, v FLOAT) UNIQUE PRIMARY INDEX ( column_ix ,cluster_id ); CREATE SET TABLE twm_results.test_ClusterColumns ( table_name VARCHAR(30) CHARACTER SET UNICODE NOT CASESPECIFIC, column_name VARCHAR(30) CHARACTER SET UNICODE NOT CASESPECIFIC, column_alias VARCHAR(100) CHARACTER SET UNICODE NOT CASESPECIFIC, column_order SMALLINT, index_flag SMALLINT, variable_type INTEGER) UNIQUE PRIMARY INDEX ( table_name ,column_name ); Run the Cluster Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Cluster Analysis The results of running the Cluster analysis include a variety of statistical reports, a similarity/ dissimilarity graph, as well as a cluster size and distance measure graph. All of these results are outlined below. 
Cluster - RESULTS - reports On the Clustering dialog click on RESULTS and then click on reports: 32 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Cluster Analysis Figure 21: Clustering > Results > Reports Clustering Progress • Iteration — This represents the number of the step in the Expectation Maximization clustering algorithm as it seeks to converge on a solution maximizing the log likelihood function. • Log Likelihood — This is the log likelihood value calculated at the end of this step in the Expectation Maximization clustering algorithm. It does not appear when the K-Means option is used. • Diff — This is simply the difference in the log likelihood value between this and the previous step in the modeling process, starting with 0 at the end of the first step. It does not appear when the K-Means option is used. • Timestamp — This is the day, date, hour, minute and second marking the end of this step in processing. Clustering Progress for Fast K-Means The Clustering Progress report for Fast K-Means contains the Timestamp and Message columns. The Message column contains information such as processing phase and iterations. When using the Fast K-Means algorithm, the Clustering Solution report is derived from the Cluster Definitions table. If the Clustering Solution report cannot be successfully derived, the Cluster Definitions table can be viewed on the Results-Data tab as an alternative (although it does not include all of the same information, such as variance). The Importance of Variables report does not apply to Fast K-Means. Importance of Variables This report is available when the Include Variable Importance Evaluation Report option is enabled on the Expert Options tab. • Col — The column number in the order the input columns were requested. • Name — Name of the column being clustered. • Log Likelihood — This is the log likelihood value calculated if this variable was removed from the clustering solution. Clustering Solution • Col — This is the column number in the order the input columns were requested. • Table_Name — The name of the table associated with this input column. • Column_Name — The name of the input column used in performing the cluster analysis. Teradata Warehouse Miner User Guide - Volume 3 33 Chapter 1: Analytic Algorithms Cluster Analysis • Cluster_Id — The cluster number that this data applies to, from 1 to the number of clusters requested. • Weight — This is the so-called “prior probability” that an observation would belong to this cluster, based on the percentage of observations belonging to this cluster at this stage. • Mean — When the Gaussian Mixture Model algorithm is selected, Mean is the weighted average of this column or variable amongst all the observations, where the weight used is the probability of inclusion in this cluster. When the K-Means algorithm is selected, Mean is the average value of this column or variable amongst the observations assigned to this cluster at this iteration of the algorithm. • Variance — When the Gaussian Mixture Model algorithm is selected, Variance is the weighted variance of this variable amongst all the observations, where the weight used is the probability of inclusion in this cluster. When the K-Means algorithm is selected, Variance is the variance of this variable amongst the observations assigned to this cluster at this iteration. (Variance is the square of a variable’s standard deviation, measuring in some sense how its value varies from one observation to the next). 
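As an illustration of how the Weight, Mean and Variance columns of the Clustering Solution report could be derived for the Gaussian option, the sketch below computes membership-weighted means and variances from per-row cluster probabilities. It is a Python/NumPy approximation under assumed inputs, not the report generator itself.

import numpy as np

def weighted_cluster_stats(data, memberships):
    """Weighted mean and variance per cluster, in the spirit of the Clustering Solution report.

    data        : m-by-n array of rows by variables
    memberships : m-by-k array; row i holds the probability of row i belonging
                  to each of the k clusters (the Gaussian option's soft weights)
    Returns (weights, means, variances) with shapes (k,), (k, n) and (k, n).
    """
    totals = memberships.sum(axis=0)                      # effective rows per cluster
    weights = totals / memberships.shape[0]               # "prior probability" per cluster
    means = (memberships.T @ data) / totals[:, None]      # weighted averages
    sq = (memberships.T @ data**2) / totals[:, None]      # weighted average of squares
    variances = sq - means**2                             # weighted variance
    return weights, means, variances

rng = np.random.default_rng(1)
data = rng.normal(size=(6, 3))
memberships = rng.dirichlet(np.ones(2), size=6)           # 2 clusters, each row sums to 1
print(weighted_cluster_stats(data, memberships))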
Cluster - RESULTS - sizes graph On the Clustering dialog click on RESULTS and then click on sizes graph: Figure 22: Clustering > Results > Sizes Graph The Sizes (and Distances) graph plots the mean values of a pair of variables at a time, indicating the clusters by color and number label, and the standard deviations (square root of the variance) by the size of the ellipse surrounding the mean point, using the same colorcoding. Roughly speaking, this graph depicts the separation of the clusters with respect to pairs of model variables. The following options are available: • Non-Normalized — The default value to show the clusters without any normalization. • Normalized — With the Normalized option, cluster means are divided by the largest absolute mean and the size of the circle based on the variance is divided by the largest absolute variance. • Variables • Available — The variables that were input into the Clustering Analysis. • Selected — The variables that will be shown on the Size and Distances graph. Two variables are required to be entered here. • Clusters 34 • Available — A list of clusters generated in the clustering solution. • Selected — The clusters that are shown on the Size and Distances graph. Up to twelve clusters can be selected to be shown on the Size and Distances graph. Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Cluster Analysis • Zoom In — While holding down the left mouse button on the Size and Distances graph, drag a lasso around the area that you desire to magnify. Release the mouse button for the zoom to take place. This can be repeated until the desired level of magnification is achieved. • Zoom Out — Hit the “Z” key, or toggle the Graph Options tab to go back to the original magnification level. Cluster - RESULTS - data When clustering with the Fast K-Means algorithm, select this tab to display the cluster means and solution progress reports. The cluster means display contains a subset of the information shown on the Solution report and is intended as a backup in case the Solution report cannot be produced. If the project is saved and reopened, the data is not displayed as it is with the other tabs. Note: The other clustering algorithms do not use this display. Cluster - RESULTS - SQL When clustering with the Fast K-Means algorithm, select this tab to display the SQL generated by the algorithm, consisting of the call to the td_analyze external stored procedure. Note: The other clustering algorithms do not use this display. Cluster - RESULTS - similarity graph On the Clustering dialog click on RESULTS and then click on similarity graph: Figure 23: Clustering > Results > Similarity Graph The Similarity graph allows plotting the means and variances of up to twelve clusters and twelve variables at one time. The cluster means (i.e., the mean values of the variables for the data points assigned to the cluster) are displayed with values varying along the x-axis. A different line parallel to the x-axis is used for each variable. The normalized variances are displayed for each variable by color-coding, and the clusters are identified by number next to the point graphed. Roughly speaking, the more spread out the points on the graph, the more differentiated the clusters are. The following options are available: • Non-Normalized — The default value to show the clusters without any normalization. • Normalized — With the Normalized option, the cluster mean is divided by the largest absolute mean. 
• Variables • Available — The variables that were input into the Clustering Analysis. Teradata Warehouse Miner User Guide - Volume 3 35 Chapter 1: Analytic Algorithms Cluster Analysis • Selected — The variables that will be shown on the Similarity graph. Up to twelve variables can be entered here. selected to be shown on the Similarity graph • Clusters • Available — A list of clusters generated in the clustering solution. • Selected — The clusters that will be shown on the Similarity graph. Up to twelve clusters can be selected to be shown on the Similarity graph. Tutorial - Cluster Analysis In this example, Gaussian Mixture Model cluster analysis is performed on 3 variables giving the average credit, checking and savings balances of customers, yielding a requested 3 clusters. Note that since Clustering in Teradata Warehouse Miner is non-deterministic, the results may vary from these, or from execution to execution. Parameterize a Cluster analysis as follows: • Selected Tables and Columns • twm_customer_analysis.avg_cc_bal • twm_customer_analysis.avg_ck_bal • twm_customer_analysis.avg_sv_bal • Number of Clusters — 3 • Algorithm — Gaussian Mixture Model • Convergence Criterion — 0.1 • Use Listwise deletion to eliminate null values — Enabled Run the analysis and click on Results when it completes. For this example, the Clustering Analysis generated the following pages. Note that since Clustering is non-deterministic, results may vary. A single click on each page name populates the page with the item. Table 6: Progress 36 Iteration Log Likelihood Diff Timestamp 1 -25.63 0 3:05 PM 2 -25.17 .46 3:05 PM 3 -24.89 .27 3:05 PM 4 -24.67 .21 3:05 PM 5 -24.42 .24 3:05 PM 6 -24.33 .09 3:06 PM Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Cluster Analysis Table 7: Solution Col Table_Name Column_Name Cluster_Id Weight Mean Variance 1 twm_customer_analysis avg_cc_bal 1 .175 -1935.576 3535133.504 2 twm_customer_analysis avg_ck_bal 1 .175 2196.395 9698027.496 3 twm_customer_analysis avg_sv_bal 1 .175 674.72 825983.51 1 twm_customer_analysis avg_cc_bal 2 .125 -746.095 770621.296 2 twm_customer_analysis avg_ck_bal 2 .125 948.943 1984536.299 3 twm_customer_analysis avg_sv_bal 2 .125 2793.892 11219857.457 1 twm_customer_analysis avg_cc_bal 3 .699 -323.418 175890.376 2 twm_customer_analysis avg_ck_bal 3 .699 570.259 661100.56 3 twm_customer_analysis avg_sv_bal 3 .699 187.507 63863.503 Sizes Graph By default, the following graph will be displayed. This parameterization includes: • Non-Normalized — Enabled • Variables Selected • avg_cc_bal • avg_ck_bal • Clusters Selected • Cluster 1 • Cluster 2 • Cluster 3 Teradata Warehouse Miner User Guide - Volume 3 37 Chapter 1: Analytic Algorithms Cluster Analysis Figure 24: Clustering Analysis Tutorial: Sizes Graph Similarity Graph By default, the following graph will be displayed. This parameterization includes: • Non-Normalized — Enabled • Variables Selected • avg_cc_bal • avg_ck_bal • avg_sv_bal • Clusters Selected 38 • Cluster 1 • Cluster 2 • Cluster 3 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees Figure 25: Clustering Analysis Tutorial: Similarity Graph Decision Trees Overview Decision tree models are most commonly used for classification. What is a classification model or classifier? It is simply a model for predicting a categorical variable, that is a variable that assumes one of a predetermined set of values. 
These values can be either nominal or ordinal, though ordinal variables are typically treated the same as nominal ones in these models. (An example of a nominal variable is single, married and divorced marital status, while an example of an ordinal or ordered variable is low, medium and high temperature). It is the ability of decision trees to not only predict the value of a categorical variable, but to directly use categorical variables as input or predictor variables that is perhaps their principal advantage. Decision trees are by their very nature also well suited to deal with large numbers of input variables, handle a mixture of data types and handle data that is not homogeneous (i.e., the variables do not have the same interrelationships throughout the data space). They also provide insight into the structure of the data space and the meaning of a model, a result at times as important as the accuracy of a model. It should be noted that a variation of decision trees called regression trees can be used to build regression models rather than classification models, enjoying the same benefits just described. Most of the upcoming discussion is geared toward classification trees with regression trees described separately. What are Decision Trees? What does a decision tree model look like? It first of all has a root node, which is associated with all of the data in the training set used to build the tree. Each node in the tree is either a decision node or a leaf node, which has no further connected nodes. A decision node Teradata Warehouse Miner User Guide - Volume 3 39 Chapter 1: Analytic Algorithms Decision Trees represents a split in the data based on the values of a single input or predictor variable. A leaf node represents a subset of the data that has a particular value of the predicted variable (i.e., the resulting class of the predicted variable). A measure of accuracy is also associated with the leaf nodes of the tree. The first issue in building a tree is the decision as to how data should be split at each decision node in the tree. The second issue is when to stop splitting each decision node and make it a leaf. And finally, what class should be assigned to each leaf node. In practice, researchers have found that it is usually best to let a tree grow as big as it needs to and then prune it back at the end to reduce its complexity and increase its interpretability. Once a decision tree model is built it can be used to score or classify new data. If the new data includes the values of the predicted variable it can be used to measure the effectiveness of the model. Typically though scoring is performed in order to create a new table containing key fields and the predicted value or class identifier. Decision Trees in Teradata Warehouse Miner Teradata Warehouse Miner provides decision trees for classification models and regression models. They are built largely on the techniques described in [Breiman, Friedman, Olshen and Stone] and [Quinlan]. As such, splits using the Gini diversity index, regression or information gain ratio are provided. Pruning is also provided, using either the Gini diversity index or gain ratio technique. In addition to a summary report, a graphical tree browser is provided when a model is built, displaying the model either as a tree or a set of rules. Finally, a scoring function is provided to score and/or evaluate a decision tree model. The scoring function can also be used to simply generate the scoring SQL for later use. 
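The following sketch illustrates scoring by tree traversal: a row is passed down decision nodes until a leaf is reached, which supplies the predicted class and a confidence factor. The dictionary-based tree, column names and thresholds are hypothetical; the product represents trees internally and can emit scoring SQL instead.

# Hypothetical in-memory representation of a small decision tree; this is only
# a sketch of the traversal idea, not the product's internal structure.
tree = {
    "split": ("age", 50),                  # decision node: age < 50 ?
    "left": {"leaf": ("Buy", 0.90)},       # leaf: predicted class and confidence factor
    "right": {
        "split": ("income", 40000),        # decision node: income < 40000 ?
        "left": {"leaf": ("Do Not Buy", 0.75)},
        "right": {"leaf": ("Buy", 0.60)},
    },
}

def score(node, row):
    """Walk from the root to a leaf and return (predicted class, confidence)."""
    while "leaf" not in node:
        column, threshold = node["split"]
        node = node["left"] if row[column] < threshold else node["right"]
    return node["leaf"]

print(score(tree, {"age": 35, "income": 55000}))   # ('Buy', 0.9)
print(score(tree, {"age": 62, "income": 30000}))   # ('Do Not Buy', 0.75)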
A number of additional options are provided when building or scoring a decision tree model. One of these options is whether or not to bin numeric variables during the tree building process. Another involves including recalculated confidence measures at each leaf node in a tree based on a validation table, supplementing confidence measures based on the training data used to build the tree. Finally, at the time of scoring, a table profiling the leaf nodes in the tree can be requested, and at the same time each scored row is linked with a leaf node and corresponding rule set.

Decision Tree SQL Generation

A key part of the design of the Teradata Warehouse Miner Decision Trees is SQL generation. In order to avoid having to extract all of the data from the RDBMS, the product generates SQL statements to return sufficient statistics. Before the model building begins, SQL is generated to give a better understanding of the attributes and the predicted variable. For each attribute, the algorithm must determine its cardinality and get all possible values of the predicted variable and the counts associated with them from all of the observations. This information helps to initialize some structures in memory for later use in the building process.

The driving SQL behind the entire building process is a SQL statement that makes it possible to build a contingency table from the data. A contingency table is an m x n matrix that has m rows corresponding to the distinct values of an attribute by n columns that correspond to the predicted variable's distinct values. The Teradata Warehouse Miner Decision Tree algorithms can quickly generate the contingency table on massive amounts of data rows and columns. This contingency table query allows the program to gather the sufficient statistics needed for the algorithms to do their calculations. Since this consists of the counts of the N distinct values of the dependent variable, a WHERE clause is simply added to this SQL when building a contingency table on a subset of the data instead of the data in the whole table. The WHERE clause expression in the statement helps define the subset of data, which is the path down the tree that defines which node is a candidate to be split.

Each type of decision tree uses a different method to compute which attribute is the best choice to split a given subset of data upon. Each type of decision tree is considered in turn in what follows. In the course of describing each algorithm, the following notation is used:

1 t denotes a node
2 j denotes the learning classes
3 J denotes the number of classes
4 s denotes a split
5 N(t) denotes the number of cases within a node t
6 p(j|t) is the proportion of class j learning samples in node t
7 An impurity function \phi is a symmetric function with its maximum value at (J^{-1}, J^{-1}, \ldots, J^{-1}) and with \phi(1, 0, \ldots, 0) = \phi(0, 1, \ldots, 0) = \ldots = \phi(0, 0, \ldots, 1) = 0
8 t_i denotes a subnode i of t
9 i(t) denotes a node impurity measure
10 t_L and t_R are the left and right split nodes of t

Splitting on Information Gain Ratio

Information theory is the basic underlying idea in this type of decision tree. Splits on categorical variables are made on each individual value. Splits on continuous variables are made at one point in an ordered list of the actual values, that is, a binary split is introduced right on a particular value.

• Define the "info" at node t as the entropy:

\mathrm{info}(t) = -\sum_{j} p(j|t) \log_{2} p(j|t)

• Suppose t is split into subnodes t_1, t_2, \ldots by predictor X.
Define:

\mathrm{info}_X(t) = \sum_{i} \frac{N(t_i)}{N(t)} \, \mathrm{info}(t_i)

\mathrm{Gain}(X) = \mathrm{info}(t) - \mathrm{info}_X(t)

\mathrm{Split\ info}(X) = -\sum_{i} \frac{N(t_i)}{N(t)} \log_{2} \frac{N(t_i)}{N(t)}

\mathrm{Gain\ ratio}(X) = \frac{\mathrm{Gain}(X)}{\mathrm{Split\ info}(X)}

Once the gain ratios have been computed, the attribute with the highest gain ratio is used to split the data. Then each subset goes through this process until the observations are all of one class or a stopping criterion is met, such as each node must contain at least 2 observations. For a detailed description of this type of decision tree see [Quinlan].

Splitting on Gini Diversity Index

Node impurity is the idea behind the Gini diversity index split selection. To measure node impurity, the impurity function is applied to the class proportions in the node:

i(t) = \phi(p(1|t), \ldots, p(J|t))

Maximum impurity arises when there is an equal distribution of the class that is to be predicted. As in the heads and tails example, impurity is highest if half the total is heads and the other half is tails. On the other hand, if there were only tails in a certain sample the impurity would be 0. The Gini index uses the following formula for its calculation of impurity:

i(t) = 1 - \sum_{j} p(j|t)^{2}

For a determination of the goodness of a split, the following formula is used:

\Delta i(s, t) = i(t) - p_L \, i(t_L) - p_R \, i(t_R)

where t_L and t_R are the left and right sub nodes of t and p_L and p_R are the probabilities of being in those sub nodes. For a detailed description of this type of tree see [Breiman, Friedman, Olshen and Stone].

Regression Trees

Teradata Warehouse Miner provides regression tree models that are built largely on the techniques described in [Breiman, Friedman, Olshen and Stone]. Like classification trees, regression trees utilize SQL in order to extract only the necessary information from the RDBMS instead of extracting all the data from the table. An m x 3 table is returned from the database that has m rows corresponding to the distinct values of an attribute, followed by the SUM and SQUARED SUM of the predicted variable and the total number of rows having that attribute value. Using the formula:

\sum_{n} (y_n - \mathrm{avg}(y))^{2}

the sum of squares for any particular node, starting with the root node of all the data, is calculated first. The regression tree is built by iteratively splitting nodes and picking the split for that node which will maximize a decrease in the within node sum of squares of the tree. Splitting stops if the minimum number of observations in a node is reached or if all of the predicted variable values are the same. The value to predict for a leaf node is simply the average of all the predicted values that fall into that leaf during model building.

Chaid Trees

CHAID trees utilize the chi squared significance test as a means of partitioning data. Independent variables are tested by looping through the values and merging categories that have the least significant difference from one another and also are still below the merging significance level parameter (default .05). Once all independent variables have been optimally merged the one with the highest significance is chosen for the split, the data is subdivided, and the process is repeated on the subsets of the data. The splitting stops when the significance goes above the splitting significance level (default .05). For a detailed description of this type of tree see [Kass].
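To make the split-selection calculations above concrete before moving on to pruning, the sketch below computes the Gini diversity index and the goodness of a candidate split from contingency-table style class counts. It is a minimal Python/NumPy illustration of the formulas, not the SQL-based implementation used by the product.

import numpy as np

def gini(counts):
    """Gini diversity index i(t) = 1 - sum_j p(j|t)^2 for one node's class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def goodness_of_split(left_counts, right_counts):
    """Decrease in impurity: delta_i(s, t) = i(t) - pL*i(tL) - pR*i(tR)."""
    left_counts = np.asarray(left_counts, dtype=float)
    right_counts = np.asarray(right_counts, dtype=float)
    parent = left_counts + right_counts
    n_left, n_right, n = left_counts.sum(), right_counts.sum(), parent.sum()
    return gini(parent) - (n_left / n) * gini(left_counts) - (n_right / n) * gini(right_counts)

# Contingency-table style counts of the predicted variable in each candidate branch.
print(gini([50, 50]))                        # 0.5  -- maximum impurity for two classes
print(gini([100, 0]))                        # 0.0  -- pure node
print(goodness_of_split([40, 10], [10, 40])) # 0.18 -- impurity decrease for this split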
Decision Tree Pruning Many times with algorithms such as those described above, a model over fits the data. One of the ways of correcting this is to prune the model from the leaves up. In situations where the error rate of leaves doesn’t increase when combined then they are joined into a new leaf. A simple example may be given as follows. If there is nothing but random data for the attributes and the class is set to predict “heads” 75% of the time and “tails” 25% of the time, the result will be an over fit model that doesn’t predict the outcome well. Just by looking it can be seen that instead of a built up model with many leaves, the model could just predict “heads” and it would be correct 75% of the time, whereas over fitting usually does much worse in such a case. Teradata Warehouse Miner provides pruning according to the gain ratio and Gini diversity index pruning techniques. It is possible to combine different splitting and pruning techniques, however when pruning a regression tree the Gini diversity index technique must be used. Teradata Warehouse Miner User Guide - Volume 3 43 Chapter 1: Analytic Algorithms Decision Trees Decision Trees and NULL Values NULL values are handled by listwise deletion. This means that if there are NULL values in any variables (independent and dependent) then that row where a NULL exists will be removed from the model building process. NULL values in scoring, however, are handled differently. Unlike in tree building where listwise deletion is used, scoring can sometimes handle rows that have NULL values in some of the independent variables. The only time a row will not get scored is if a decision node that the row is being tested on has a NULL value for that decision. For instance, if the first split in a tree is “age < 50,” only rows that don’t have a NULL value for age will pass down further in the tree. This row could have a NULL value in the income variable. But since this decision is on age, the NULL will have no impact at this split and the row will continue down the branches until a leaf is reached or it has a NULL value in a variable used in another decision node. Initiate a Decision Tree Analysis Use the following procedure to initiate a new Decision Tree analysis in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 26: Add New Analysis from toolbar 2 44 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Decision Tree: Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees Figure 27: Add New Analysis dialog 3 This will bring up the Decision Tree dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Decision Tree - INPUT - Data Selection On the Decision Tree dialog click on INPUT and then click on data selection: Figure 28: Decision Tree > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). 
In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). 2 Select Columns From a Single Table Teradata Warehouse Miner User Guide - Volume 3 45 Chapter 1: Analytic Algorithms Decision Trees • Available Databases (or Analyses) — All the databases (or analyses) that are available for the Decision Tree analysis. • Available Tables — All the tables that are available for the Decision Tree analysis. • Available Columns — Within the selected table or matrix, all columns that are available for the Decision Tree analysis. • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can either insert columns as Dependent or Independent columns. Make sure you have the correct portion of the window highlighted. • Independent — These may be of numeric or character type. • Dependent — The dependent variable column is the column whose value is being predicted. It is selected from the Available Variables in the selected table. When Gain Ratio or Gini Index are selected as the Tree Splitting criteria, this is treated as a categorical variable with distinct values, in keeping with the nature of classification trees. Note that in this case an error will occur if the Dependent Variable has more than 50 distinct values. When Regression Trees is selected as the Tree Splitting criteria, this is treated as a continuous variable. In this case it must contain only numeric values. Decision Tree - INPUT - Analysis Parameters On the Decision Tree dialog click on INPUT and then click on analysis parameters: Figure 29: Decision Tree > Input > Analysis Parameters On this screen select: • Splitting Options • 46 Splitting Method • Gain Ratio — Option to use the Gain Ratio splitting criteria. • Gini Index — Option to use the Gini Index splitting criteria. • Chaid — Option to use the Chaid splitting criteria. When using this option you are also given the opportunity to change the merging or splitting Chaid Significance Levels. • Regression Trees — Option to use the Regression splitting criteria as outlined above. • Gain Ratio Extreme — Option to use the Gain Ratio splitting criteria using a stored Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees procedure and table operator that process the data more directly in the database for better resource utilization. Note: When using this option, confirm that the td_analyze external stored procedure and the tda_dt_calc table operator are installed in the database where the TWM metadata tables reside. This can be performed using the Install or Uninstall UDF's option under the Teradata Warehouse Miner start program item, selecting the option to Install TD_Analyze UDFs. • Minimum Split Count — This option determines how far the splitting of the decision tree will go. Unless a node is pure (meaning it has only observations with the same dependent value) it will split if each branch that can come off this node will contain at least this many observations. The default is a minimum of 2 cases for each branch. 
• Maximum Nodes — (This option is not available when using the Gain Ratio Extreme splitting method.) If the nodes in the tree are equal to or exceed this value while splitting a certain level of the tree, the algorithm stops the tree growing after completing this level and returns the tree built so far. The default is 10000 nodes. • Maximum Depth — Another method of stopping the tree is to specify the maximum depth the tree may grow to. This option will stop the algorithm if the tree being built has this many levels. The default is 100 levels. • Chaid Significance Levels — (These options are only available when using the Chaid splitting method.) • Merging — Independent variables are tested by looping through the values and merging categories that have the least significant difference from one another and also are still below this merging significance level parameter (default .05). • Splitting — Once all independent variables have been optimally merged the one with the highest significance is chosen for the split, the data is subdivided, and the process is repeated on the subsets of the data. The splitting stops when the significance goes above this splitting significance level parameter (default .05). • Bin Numeric Variables — Option to automatically Bincode the continuous independent variables. Continuous data is separated into one hundred bins when this option is selected. If the variable has less than one hundred distinct values, this option is ignored. • Include Validation Table — (This option is not available when using the Gain Ratio Extreme splitting method.) A supplementary table may be utilized in the modeling process to validate the effectiveness of the model on a separate set of observations. If specified, this table is used to calculate a second set of confidence or targeted confidence factors. These recalculated confidence factors are viewed in the tree browser and/or added to the scored table when scoring the resultant model. When Include Validation Table is selected, a separate validation table is required. • Database — The name of the database to look in for the validation table - by default, this is the source database. • Table — The name of the validation table to use for recalculating confidence or targeted confidence factors. Teradata Warehouse Miner User Guide - Volume 3 47 Chapter 1: Analytic Algorithms Decision Trees • Include Lift Table — (This option is not available when using the Gain Ratio Extreme splitting method.) Option to generate a Cumulative Lift Table in the report to demonstrate how effective the model is in estimating the dependent variable. Valid for binary dependent variables only. • Response Value — An optional response value can be specified for the dependent variable that will represent the response value. Note that all other dependent variable values will be considered a non-response value. Values — Bring up the Decision Tree values wizard to help in specifying the response value. • Pruning Options • • Pruning Method — Pull-down list with the following values: • Gain Ratio — Option to use the Gain Ratio pruning criteria as outlined above. • Gini Index — (This option is not available when using the Gain Ratio Extreme splitting method.) Option to use the Gini Index pruning criteria as outlined above. • None — Option to not prune the resultant decision tree. Gini Test Table — (This option does not apply when using the Gain Ratio Extreme splitting method.) When Gini Index pruning is selected as the pruning method, a separate Test table is required. 
• Database — The name of the database to look for the Test table - by default, this is the source database. • Table — The name of the table to use for test purposes during the Gini Pruning process. Decision Tree - INPUT - Expert Options On the Decision Tree dialog click on INPUT and then click on expert options: Figure 30: Decision Tree > Input > Expert Options • Performance • Maximum amount of data for in-memory processing — (This option does not apply when using the Gain Ratio Extreme splitting method.) By default, 2 MB of data can be processed in memory for the tree. This can be increased here. For smaller data sets, this option may be preferable over the SQL version of the decision tree. Run the Decision Tree Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: 48 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Decision Tree The results of running the Decision Tree analysis include a variety of statistical reports as well as a Graphic and Textual Tree browser. All of these results are outlined below. Note: Not all Results features are provided when using the Gain Ratio Extreme splitting method, including a lift table and graph, a validation matrix, and some of the information displayed by decision tree graph nodes. Decision Tree Reports • Total observations — This is the number of observations in the training data set used to build the tree. More precisely, this is the number of rows in the input table after any rows have been excluded for containing a null value in a column selected as an independent or dependent variable. • Nodes before pruning — This is the number of nodes in the tree, including the root node, before it is pruned back in the second stage of the tree-building process. • Nodes after pruning — This is the number of nodes in the tree, including the root node, after it is pruned back in the second stage of the tree-building process. • Total nodes — This is the number of nodes in the tree, including the root node, when either pruning is not requested or doesn’t remove any nodes. • Model Accuracy — This is the percentage of observations in the training data set that the tree accurately predicts the value of the dependent variable for. Variables • Independent Variables — A list of all the independent variables that made it into the decision tree model. • Dependent Variable — The dependent variable that the tree was built to predict. 
Confusion Matrix A N x (N+2) (for N outcomes of the dependent variable) confusion matrix is given with the following format: Table 8: Confusion Matrix Format Actual ‘0’ Actual ‘1’ … Actual ‘N’ Correct Incorrect Predicted ‘0’ # correct ‘0’ Predictions # incorrect‘1’ Predictions … # incorrect ‘N’ Predictions Total Correct ‘0’ Predictions Total Incorrect ‘0’ Predictions Predicted ‘1’ # incorrect‘0’ Predictions # correct ‘1’ Predictions … # incorrect ‘N’ Predictions Total Correct ‘1’ Predictions Total Incorrect ‘1’ Predictions … … … … … … … Teradata Warehouse Miner User Guide - Volume 3 49 Chapter 1: Analytic Algorithms Decision Trees Table 8: Confusion Matrix Format Predicted ‘N’ Actual ‘0’ Actual ‘1’ … Actual ‘N’ Correct Incorrect # incorrect‘0’ Predictions # incorrect ‘1’ Predictions … # correct ‘N’ Predictions Total Correct ‘N’ Predictions Total Incorrect ‘N’ Predictions Validation Matrix When the Include validation table option is selected, a validation matrix similar to the confusion matrix is produced based on the data in the validation table rather than the input table. Cumulative Lift Table The Cumulative Lift Table demonstrates how effective the model is in estimating the dependent variable. It is produced using deciles based on the probability values. Note that the deciles are labeled such that 1 is the highest decile and 10 is the lowest, based on the probability values calculated by logistic regression. The information in this report however is best viewed in the Lift Chart produced as a graph. Note that this is only valid for binary dependent variables. • Decile — The deciles in the report are based on the probability values predicted by the model. Note that 1 is the highest decile and 10 is the lowest. That is, decile 1 contains data on the 10% of the observations with the highest estimated probabilities that the dependent variable is 1. • Count — This column contains the count of observations in the decile. • Response — This column contains the count of observations in the decile where the actual value of the dependent variable is 1. • Pct Response — This column contains the percentage of observations in the decile where the actual value of the dependent variable is 1. • Pct Captured Response — This column contains the percentage of responses in the decile over all the responses in any decile. • Lift — The lift value is the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations. For example, if 10% of the observations overall have a dependent variable with value 1, and 20% of the observations in decile 1 have a dependent variable with value 1, then the lift value within decile 1 is 2.0, meaning that the model gives a “lift” that is better than chance alone by a factor of two in predicting response values of 1 within this decile. • Cumulative Response — This is a cumulative measure of Response, from decile 1 to this decile. • Cumulative Pct Response — This is a cumulative measure of Pct Response, from decile 1 to this decile. • Cumulative Pct Captured Response — This is a cumulative measure of Pct Captured Response, from decile 1 to this decile. • Cumulative Lift — This is a cumulative measure of Lift, from decile 1 to this decile. 
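A minimal sketch of how a cumulative lift table of this kind can be computed from scored probabilities and actual responses is shown below. The Python/NumPy code and the simulated data are assumptions for illustration, and the Pct Captured Response columns are omitted for brevity; the product produces this report itself.

import numpy as np

def lift_table(probabilities, responses, n_bins=10):
    """Decile lift report: decile 1 holds the rows with the highest scores.

    probabilities : predicted probability that the dependent variable is 1
    responses     : actual 0/1 values of the dependent variable
    """
    order = np.argsort(-np.asarray(probabilities))      # highest scores first
    responses = np.asarray(responses)[order]
    overall_rate = responses.mean()                     # expected response over all rows
    rows, cum_count, cum_resp = [], 0, 0
    for d, chunk in enumerate(np.array_split(responses, n_bins), start=1):
        count, resp = len(chunk), chunk.sum()
        cum_count += count
        cum_resp += resp
        rows.append({
            "decile": d,
            "count": count,
            "response": int(resp),
            "pct_response": resp / count,
            "lift": (resp / count) / overall_rate,
            "cum_lift": (cum_resp / cum_count) / overall_rate,
        })
    return rows

rng = np.random.default_rng(0)
p = rng.random(1000)
y = (rng.random(1000) < p).astype(int)     # responses are more likely at high scores
for row in lift_table(p, y)[:3]:
    print(row)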
50 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees Decision Tree Graphs The Decision Tree Analysis can display either a graphical and textual representation of the decision tree model, as well as a lift chart. Options are available to display decisions for any node in the graphical or textual tree, as well as the counts and distribution of the dependent variable. Additionally, manual pruning of the decision tree model is supported. Tree Browser Figure 31: Tree Browser When Tree Browser is selected, two frames are shown: the upper frame gives a condensed view to aid in navigating through the detailed tree in the lower frame. Set options by rightclicking on either frame to select from the following menu: • Small Navigation Tree — Under Small Navigation Tree, the options are: Figure 32: Tree Browser menu: Small Navigation Tree • Zoom — This option allows you to scale down the navigation tree so that more of it will appear within the window. A slider bar is provided so you can select from a range of new sizes while previewing the effect on the navigation tree. The slider bar can also be used to bring the navigation tree back up to a larger dimension after it has been reduced in size: Teradata Warehouse Miner User Guide - Volume 3 51 Chapter 1: Analytic Algorithms Decision Trees Figure 33: Tree Browser menu: Zoom Tree • Show Extents Box/Hide Extents Box — With this option a box is drawn around the nodes in the upper frame corresponding to the nodes displayed in the lower frame. The box can be dragged and dropped over segments of the small tree, automatically positioning the identical area in the detailed tree within the lower frame. Once set, the option changes to allow hiding the box. • Hide Navigation Tree/Show Navigation Tree — With this option the upper frame is made to disappear (or reappear) in order to give more room to the lower frame that contains the details of the tree. • Show Confidence Factors/Show Targeted Confidence — The Confidence Factor is a measure of how “confident” the model is that it can predict the correct score for a record that falls into a particular leaf node based on the training data the model was built from. For example, if a leaf node contained 10 observations and 9 of them predict Buy and the other record predicts Do Not Buy, then the model built will have a confidence factor of .9, or 90% sure of predicting the right value for a record that falls into that leaf node of the model. Models built with a predicted variable that has only 2 outcomes can display a Targeted Confidence value rather than a confidence factor. If the outcomes were 9 Buys and 1 Do Not Buy at a particular node and if the target value was set to Buy, .9 is the targeted confidence. However if it is desired to target the Do Not Buy outcome by setting the value to Do Not Buy, then any record falling into this leaf of the tree would get a targeted confidence of .1 or 10%. This option also controls whether Recalculated Confidence Factors or Recalculated Targeted Confidence factors are displayed in the case when the Include validation table option is selected. • Node Detail — The Node Detail feature can be used to copy the entire rule set for a particular node to the Windows Clipboard for use in other applications. • Print 52 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees Figure 34: Tree Browser menu: Print • Large Tree — Allows you to print the entire tree diagram. 
This will be printed in pages, with the total number of pages reported before they are printed. (A page will also be printed showing how the tree was mapped into individual pages). If All Pages is selected the entire tree will be printed, across multiple pages if necessary. If Current Browser Page is selected then only that portion of the tree which is viewable will be printed in WYSIWYG fashion. • Small Tree — The entire navigation tree, showing the overall structure of the tree diagram without node labels or statistics, can be printed in pages. (The fewest possible pages will be printed if the navigation tree is reduced as small as possible before printing the small tree). The total number of pages needed to print the smaller tree will be reported before they are sent to the printer). • Save — Currently, the Tree Browser only supports the creation of Bitmaps. If Tree Text is currently selected, the entire tree will be saved. If Tree Browser is selected, only the portion of the tree that is viewable will be saved in WYSIWYG fashion. The lower frame shows the details of the decision tree in a graphical manner. The graphical representation of the tree consists of the following objects: • Root Node — The box at the top of the tree shows the total number of observations or rows used in building the tree after any rows have been removed for containing null values. • Intermediate Node — The boxes representing intermediate nodes in the tree contain the following information. • Decision — Condition under which data passes through this node. • N — Count of number of observations or rows passing through this node. • % — Percentage of observations or rows passing through this node. • Leaf Node — The boxes representing leaf nodes in the tree contain the following information. • Decision — Condition under which data passes to this node. • N — Count of number of observations or rows passing to this node. • % — Percentage of observations or rows passing to this node. • CF — Confidence factor • TF — Targeted confidence factor, alternative to CF display • RCF — Recalculated confidence factor based on validation table (if requested) • RTF — Recalculated targeted confidence factor based on validation table (if requested) Teradata Warehouse Miner User Guide - Volume 3 53 Chapter 1: Analytic Algorithms Decision Trees Text Tree When Tree Text is selected, the diagram represents the decisions made by the tree as a hierarchical structure of rules as follows: Figure 35: Text Tree The first rule corresponds to the root node of the tree. The rules corresponding to leaves in the tree are distinguished by an arrow drawn as ‘-->’, followed by a predicted value of the dependent variable. Rules List On both the Tree Browser and Text Tree, passing the mouse over a node or rule results in a hyperlink indication. When Rules List is enabled, clicking on the hyperlink results in a popup displaying all rules leading to that node or decision as follows: 54 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees Figure 36: Rules List Note that the Node Detail, as described above, can be used to copy the Rules List to the Windows Clipboard. Counts and Distributions On both the Tree Browser and Text Tree, passing the mouse over a node or rule results in a hyperlink indication. When Counts and Distributions is enabled, clicking on the hyperlink results in a pop-up displaying the Count/Distribution of the dependent variable at that node as follows. 
Note that the Counts and Distributions option is only enabled when the dependent variable is multinomial. It does not apply to regression trees, and for binary trees the counts and distribution are shown directly on the node or rule.
Figure 37: Counts and Distributions
Note that the Node Detail option, as described above, can be used to copy the Counts and Distribution list to the Windows Clipboard.
Tree Pruning
On both the Tree Browser and Text Tree, passing the mouse over a node or rule results in a hyperlink indication. When Tree Pruning is enabled, the following menu appears:
Figure 38: Tree Pruning menu
Clicking on a node or rule highlights the node and all subnodes, indicating which portion of the tree will be pruned. Additionally, the Prune Selected Branch option becomes enabled as follows:
Figure 39: Tree Pruning Menu > Prune Selected Branch
Clicking on Prune Selected Branch converts the highlighted node to a leaf node, and all subnodes disappear. When this is done, the other two Tree Pruning options become enabled:
Figure 40: Tree Pruning menu (All Options Enabled)
Click on Undo Last Prune to revert to the original tree, or to the previously pruned tree if Prune Selected Branch was performed multiple times. Click on Save Pruned Tree to save the tree to XML. This will be saved in metadata and can be rescored in a future release. After a tree is manually pruned and saved to metadata using the Save Pruned Tree option, it can be reopened and viewed in the Tree Browser and, if desired, pruned further. (All additional prunes must be re-saved to metadata). A previously pruned tree will be labeled to distinguish it from a tree that has not been manually pruned:
Figure 41: Decision Tree Graph: Previously Pruned Tree
“More >>”
On both the Tree Browser and Text Tree, if Gini Index has been selected for Tree Splitting, large surrogate splits may occur. If a surrogate split is preceded by “more >>”, the entire surrogate split can be displayed in a separate pop-up screen by clicking on the node and/or rule as follows:
Figure 42: Decision Tree Graph: Predicate
Lift Chart
This graph displays the statistics in the Cumulative Lift Table, with the following options:
• Non-Cumulative
• % Response — This column contains the percentage of observations in the decile where the actual value of the dependent variable is 1.
• % Captured Response — This column contains the percentage of responses in the decile over all the responses in any decile.
• Lift — The lift value is the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations. For example, if 10% of the observations overall have a dependent variable with value 1, and 20% of the observations in decile 1 have a dependent variable with value 1, then the lift value within decile 1 is 2.0, meaning that the model gives a “lift” that is better than chance alone by a factor of two in predicting response values of 1 within this decile.
• Cumulative
• % Response — This is a cumulative measure of the percentage of observations in the decile where the actual value of the dependent variable is 1, from decile 1 to this decile.
• % Captured Response — This is a cumulative measure of the percentage of responses in the decile over all the responses in any decile, from decile 1 to this decile. • Cumulative Lift — This is a cumulative measure of the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations, from decile 1 to this decile. Any combination of options can be displayed as follows: Figure 43: Decision Tree Graph: Lift Tutorial - Decision Tree In this example a standard Gain Ratio tree was built to predict credit card ownership ccacct based on 20 numeric and categorical input variables. Notice that the tree initially built contained 100 nodes but was pruned back to only 11, counting the root node. This yielded not only a relatively simple tree structure, but also Model Accuracy of 95.72% on this training data. Parameterize a Decision Tree as follows: • Available Tables — twm_customer_analysis • Dependent Variable — ccacct • Independent Variables 58 • income • age • years_with_bank Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees • nbr_children • gender • marital_status • city_name • state_code • female • single • married • separated • ckacct • svacct • avg_ck_bal • avg_sv_bal • avg_ck_tran_amt • avg_ck_tran_cnt • avg_sv_tran_amt • avg_sv_tran_cnt • Tree Splitting — Gain Ratio • Minimum Split Count — 2 • Maximum Nodes — 1000 • Maximum Depth — 10 • Bin Numeric Variables — Disabled • Pruning Method — Gain Ratio • Include Lift Table — Enabled • Response Value — 1 Run the analysis and click on Results when it completes. For this example, the Decision Tree Analysis generated the following pages. A single click on each page name populates the page with the item. Table 9: Decision Tree Report Total observations 747 Nodes before pruning 33 Nodes after pruning 11 Model Accuracy 95.72% Teradata Warehouse Miner User Guide - Volume 3 59 Chapter 1: Analytic Algorithms Decision Trees Table 10: Variables: Dependent Dependent Variable ccacct Table 11: Variables: Independent Independent Variables income ckacct avg_sv_bal avg_sv_tran_cnt Table 12: Confusion Matrix Actual Non-Response Actual Response Correct Incorrect Predicted 0 340 / 45.52% 0 / 0.00% 340 / 45.52% 0 / 0.00% Predicted 1 32 / 4.28% 375 / 50.20% 375 / 50.20% 32 / 4.28% Table 13: Cumulative Lift Table Cumulative Response Cumulative Response (%) Cumulative Captured Response (%) Cumulative Lift Decile Count Response Response (%) Captured Response (%) 1 5.00 5.00 100.00 1.33 1.99 5.00 100.00 1.33 1.99 2 0.00 0.00 0.00 0.00 0.00 5.00 100.00 1.33 1.99 3 0.00 0.00 0.00 0.00 0.00 5.00 100.00 1.33 1.99 4 0.00 0.00 0.00 0.00 0.00 5.00 100.00 1.33 1.99 5 0.00 0.00 0.00 0.00 0.00 5.00 100.00 1.33 1.99 6 402.00 370.00 92.04 98.67 1.83 375.00 92.14 100.00 1.84 7 0.00 0.00 0.00 0.00 0.00 375.00 92.14 100.00 1.84 8 0.00 0.00 0.00 0.00 0.00 375.00 92.14 100.00 1.84 9 0.00 0.00 0.00 0.00 0.00 375.00 92.14 100.00 1.84 10 340.00 0.00 0.00 0.00 0.00 375.00 50.20 100.00 1.00 60 Lift Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees Graphs Tree Browser is displayed as follows: Figure 44: Decision Tree Graph Tutorial: Browser Select the Text Tree radio to view the rules in textual format: Figure 45: Decision Tree Graph Tutorial: Lift Additionally, you can click on Lift Chart to view the contents of the Lift Table graphically. 
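For readers who want to experiment outside the product, the following sketch builds a roughly analogous tree in Python with scikit-learn. It is not the Gain Ratio algorithm used in the tutorial (scikit-learn offers only entropy and Gini splitting, so entropy is used as a stand-in), the CSV export and the reduced column list are assumptions, and the resulting tree and accuracy will not match Table 9.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumes the tutorial table has been exported to a CSV file; the file
# name and the numeric column subset below are hypothetical.
df = pd.read_csv("twm_customer_analysis.csv")

independent = ["income", "age", "years_with_bank", "nbr_children",
               "ckacct", "svacct", "avg_ck_bal", "avg_sv_bal"]
dependent = "ccacct"

# Drop rows with nulls in the selected columns, as the algorithm does.
train = df.dropna(subset=independent + [dependent])

# Entropy splitting is only a rough stand-in for Gain Ratio; max_depth
# and min_samples_split echo the tutorial's Maximum Depth and Minimum
# Split Count settings.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=10,
                              min_samples_split=2, random_state=0)
tree.fit(train[independent], train[dependent])

print("Training accuracy: %.2f%%"
      % (100 * tree.score(train[independent], train[dependent])))
print(export_text(tree, feature_names=independent))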
Teradata Warehouse Miner User Guide - Volume 3 61 Chapter 1: Analytic Algorithms Factor Analysis Figure 46: Decision Tree Graph Tutorial: Browser Factor Analysis Overview Consider a data set with a number of correlated numeric variables that is to be used in some type of analysis, such as linear regression or cluster analysis. Or perhaps it is desired to understand customer behavior in a fundamental way, by discovering hidden structure and meaning in data. Factor analysis can be used to reduce a number of correlated numeric variables into a lesser number of variables called factors. These new variables or factors should hopefully be conceptually meaningful if the second goal just mentioned is to be achieved. Meaningful factors not only give insight into the dynamics of a business, but they also make any models built using these factors more explainable, which is generally a requirement for a useful analytic model. There are two fundamental types of factor analysis, principal components and common factors. Teradata Warehouse Miner offers principal components, maximum likelihood common factors and principal axis factors, which is a restricted form of common factor analysis. The product also offers factor rotations, both orthogonal and oblique, as postprocessing for any of these three types of models. Finally, as with all other models, automatic factor model scoring is offered via dynamically generated SQL. Before using the Teradata Warehouse Miner Factor Analysis module, the user must first build a data reduction matrix using the Build Matrix function. The matrix must include all of the input variables to be used in the factor analysis. The user can base the analysis on either a covariance or correlation matrix, thus working with either centered and unscaled data, or centered and normalized data (i.e., unit variance). Teradata Warehouse Miner automatically converts the extended cross-products matrix stored in metadata results tables by the Build 62 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis Matrix function into the desired covariance or correlation matrix. The choice will affect the scaling of resulting factor measures and factor scores. The primary source of information and formulae in this section is [Harman]. Principal Components Analysis The goal of principal components analysis (PCA) is to account for the maximum amount of the original data’s variance in the principal components created. Each of the original variables can be expressed as a linear combination of the new principal components. Each principal component in its turn, from the first to the last, accounts for a maximum amount of the remaining sum of the variances of the original variables. This allows some of the later components to be discarded and only the reduced set of components accounting for the desired amount of total variance to be retained. If all the components were to be retained, then all of the variance would be explained. A principal components solution has many desirable properties. First, the new components are independent of each other, that is, uncorrelated in statistical terminology or orthogonal in the terminology of linear algebra. Further, the principal components can be calculated directly, yielding a unique solution. This is true also of principal component scores, which can be calculated directly from the solution and are also inherently orthogonal or independent of each other. 
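The following numpy sketch illustrates the principal components computation just described: eigen-decomposing a correlation matrix, ordering the eigenvalues from largest to smallest, and scaling the eigenvectors into component loadings. The synthetic data and the eigenvalue cutoff of 1.0 are assumptions for the example; the rationale for each step is discussed in more detail under “Principal Components” below.

import numpy as np

# Illustrative data matrix: rows are observations, columns are the
# numeric variables to be factored (values are made up).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X[:, 1] += 0.8 * X[:, 0]          # induce some correlation

R = np.corrcoef(X, rowvar=False)  # correlation matrix of the variables

# Eigen-decompose and order components by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Component loadings: eigenvectors scaled by the square roots of their
# eigenvalues; each eigenvalue is the variance explained by its component.
loadings = eigvecs * np.sqrt(eigvals)

# Retain components whose eigenvalue exceeds a threshold such as 1.0,
# one of the two retention options the product offers.
keep = eigvals > 1.0
print("variance explained:", eigvals / eigvals.sum())
print("retained loadings:\n", loadings[:, keep])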
Principal Axis Factors The next step toward the full factor analysis model is a technique known as principal axis factors (PAF), or sometimes also called iterated principal axis factors, or just principal factors. The principal factors model is a blend of the principal components model described earlier and the full common factor model. In the common factor model, each of the original variables is described in terms of certain underlying or common factors, as well as a unique factor for that variable. In principal axis factors however, each variable is described in terms of common factors without a unique factor. Unlike a principal components model for which there is a unique solution, a principal axis factor model consists of estimated factors and scores. As with principal components, the derived factors are orthogonal or independent of each other. The same is not necessarily true of the scores however. (Refer to “Factor Scores” on page 65 for more information). Maximum Likelihood Common Factors The goal of common factors or classical factor analysis is to account in the new factors for the maximum amount of covariance or correlation in the original input variables. In the common factor model, each of the original input variables is expressed in terms of hypothetical common factors plus a unique factor accounting for the remaining variance in that variable. The user must specify the desired number of common factors to look for in the model. This type of model represents factor analysis in the fullest sense. Teradata Warehouse Miner offers maximum likelihood factors (MLF) for estimating common factors, using expectation maximization or EM as the method to determine the maximum likelihood solution. A potential benefit of common factor analysis is that it may reduce the original set of variables into fewer factors than would principal components analysis. It may also produce Teradata Warehouse Miner User Guide - Volume 3 63 Chapter 1: Analytic Algorithms Factor Analysis new variables that have more fundamental meaning. A drawback is that factors can only be estimated using iterative techniques requiring more computation, as there is no unique solution to the common factor analysis model. This is true also of common factor scores, which must likewise be estimated. As with principal components and principal axis factors, the derived factors are orthogonal or independent of each other, but in this case by design (Teradata Warehouse Miner utilizes a technique to insure this). The same is not necessarily true of the factor scores however. (Refer to “Factor Scores” on page 65 for more information). These three types of factor analysis then give the data analyst the choice of modeling the original variables in their entirety (principal components), modeling them with hypothetical common factors alone (principal axis factors), or modeling them with both common factors and unique factors (maximum likelihood common factors). Factor Rotations Whatever technique is chosen to compute principal components or common factors, the new components or factors may not have recognizable meaning. Correlations will be calculated between the new factors and the original input variables, which presumably have business meaning to the data analyst. But factor-variable correlations may not possess the subjective quality of simple structure. 
The idea behind simple structure is to express each component or factor in terms of fewer variables that are highly correlated with the factor (or vice versa), with the remaining variables largely uncorrelated with the factor. This makes it easier to understand the meaning of the components or factors in terms of the variables. Factor rotations of various types are offered to allow the data analyst to attempt to find simple structure and hence meaning in the new components or factors. Orthogonal rotations maintain the independence of the components or factors while aligning them differently with the data to achieve a particular simple structure goal. Oblique rotations relax the requirement for factor independence while more aggressively seeking better data alignment. Teradata Warehouse Miner offers several options for both orthogonal and oblique rotations. Factor Loadings The term factor loadings is sometimes used to refer to the coefficients of the linear combinations of factors that make up the original variables in a factor analysis model. The appropriate term for this however is the factor pattern. A factor loadings matrix is sometimes also assumed to indicate the correlations between the factors and the original variables, for which the appropriate term is factor structure. The good news is that whenever factors are mutually orthogonal or independent of each other, the factor pattern P and the factor structure S are the same. They are related by the equation S = PQ where Q is the matrix of correlations between factors. In the case of principal components analysis, factor loadings are labeled as component loadings and represent both factor pattern and structure. For other types of analysis, loadings are labeled as factor pattern but indicate structure also, unless a separate structure matrix is also given (as is the case after oblique rotations, described later). Keeping the above caveats in mind, the component loadings, pattern or structure matrix is interpreted for its structure properties in order to understand the meaning of each new factor 64 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis variable. When the analysis is based on a correlation matrix, the loadings, pattern or structure can be interpreted as a correlation matrix with the columns corresponding to the factors and the rows corresponding to the original variables. Like all correlations, the values range in absolute value from 0 to 1 with the higher values representing a stronger correlation or relationship between the variables and factors. By looking at these values, the user gets an idea of the meaning represented by each factor. Teradata Warehouse Miner stores these so called factor loadings and other related values in metadata result tables to make them available for scoring. Factor Scores In order to use a factor as a variable, it must be assigned a value called a factor score for each row or observation in the data. A factor score is actually a linear combination of the original input variables (without a constant term), and the coefficients associated with the original variables are called factor weights. Teradata Warehouse Miner provides a scoring function that calculates these weights and creates a table of new factor score variables using dynamically generated SQL. 
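As an illustration of what a factor score is, the sketch below computes scores for a small synthetic data set. Teradata Warehouse Miner derives the weights internally and applies them through generated SQL; this example instead uses the common regression method (weights W equal to the inverse of the correlation matrix times the loadings, for orthogonal factors), which is an assumption made purely for illustration.

import numpy as np

# Assume: Z is the standardized data (n rows x v variables), R its
# correlation matrix, and L a v x f loading matrix from some factor
# method.  All names and values here are illustrative.
rng = np.random.default_rng(2)
Z = rng.normal(size=(200, 4))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
R = np.corrcoef(Z, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1][:2]             # keep 2 factors
L = eigvecs[:, order] * np.sqrt(eigvals[order])   # loadings

# Regression-method factor weights for orthogonal factors: solve R W = L.
W = np.linalg.solve(R, L)          # factor weights
scores = Z @ W                     # one factor score column per factor,
                                   # a linear combination with no constant term
print(scores[:5].round(3))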
The ability to automatically generate factor scores, regardless of the factor analysis or rotation options used, is one of the most powerful features of the Teradata Warehouse Miner factor analysis module.
Principal Components
As mentioned earlier in the introduction, the goal of principal components analysis (PCA) is to account for the maximum amount of the original data’s variance in the independent principal components created. It was also stated that each of the original variables is expressed as a linear combination of the new principal components, and that each principal component in its turn, from the first to the last, accounts for a maximum amount of the remaining sum of the variances of the original variables. These results are achieved by first finding the eigenvalues and eigenvectors of the covariance or correlation matrix of the input variables to be modeled. Although not ordinarily thought of in this way, when analyzing v numeric columns in a table in a relational database, one is in some sense working in a v-dimensional vector space corresponding to these columns. Back at the beginning of the previous century when principal components analysis was developed, this was no small task. Today, however, math library routines are available to perform these computations very efficiently. Although no attempt will be made here to derive the mathematical solution to finding principal components, it might be helpful to state the following definition: a square matrix A has an eigenvalue λ and an eigenvector x if Ax = λx. Further, a v x v square symmetric matrix A has v pairs of eigenvalues and eigenvectors, (λ1, e1), (λ2, e2), …, (λv, ev). It is further true that eigenvectors can be found so that they have unit length and are mutually orthogonal (i.e., independent or uncorrelated), making them unique. To return to the point at hand, the principal component loadings that are being sought are actually the covariance or correlation matrix eigenvectors just described multiplied by the square root of their respective eigenvalues. The step left out up to now however is the reduction of these principal component loadings to a number fewer than the variables present at the start. This can be achieved by first ordering the eigenvalues, and their corresponding eigenvectors, in descending order, and then by throwing away those eigenvalues below a minimum threshold value, such as 1.0. An alternative technique is to retain a desired number of the largest components regardless of the magnitude of the eigenvalues. Teradata Warehouse Miner provides both of these options to the user. The user may further optionally request that the signs of the principal component loadings be inverted if there are more minus signs than positive ones. This is purely cosmetic and does not affect the solution in a substantive way. However, if signs are reversed, this must be kept in mind when attempting to interpret or assign conceptual meaning to the factors. A final point worth noting is that the eigenvalues themselves turn out to be the variance accounted for by each principal component, allowing the computation of several variance related measures and some indication of the effectiveness of the principal components model.
Principal Axis Factors
In order to talk about principal axis factors (PAF) the term communality must first be introduced.
In the common factor model, each original variable x is thought of as a combination of common factors and a unique factor. The variance of x can then also be thought of as being composed of a common portion and a unique portion, that is, Var(x) = c² + u². It is the common portion of the variance of x that is called the communality of x, that is, the variance that the variable has in common, through the common factors, with all the other variables. In the algorithm for principal axis factors described below it is of interest both to make an initial estimate of the communality of each variable, and to calculate the actual communality for the variables in a factor model with uncorrelated factors. One method of making an initial estimate of the communality of each variable is to take the largest correlation of that variable with respect to the other variables. The preferred method however is to calculate its squared multiple correlation coefficient with respect to all of the other variables taken as a whole. This is the technique used by Teradata Warehouse Miner. The multiple correlation coefficient is a measure of the overall linear association of one variable with several other variables, that is, the correlation between a variable and the best-fitting linear combination of the other variables. The square of this value has the useful property of being a lower bound for the communality. Once a factor model is built, the actual communality of a variable is simply the sum of the squares of its factor loadings, i.e., h_j² = f_j1² + f_j2² + … + f_jr², where f_jk is the loading of variable j on factor k and r is the number of factors. With the idea of communality thus in place it is straightforward to describe the principal axis factors algorithm. Begin by estimating the communality of each variable and replacing this value in the appropriate position in the diagonal of the correlation or covariance matrix being factored. Then a principal components solution is found in the usual manner, as described earlier. As before, the user has the option of specifying either a fixed number of desired factors or a minimum eigenvalue by which to reduce the number of factors in the solution. Finally, the new communalities are calculated as the sum of the squared factor loadings, and these values are substituted into the correlation or covariance matrix. This process is repeated until the communalities change by only a small amount. Through its use of communality estimates, the principal axis factor method attempts to find independent common factors that account for the covariance or correlation between the original variables in the model, while ignoring the effect of unique factors. It is then possible to use the factor loadings matrix to reproduce the correlation or covariance matrix and compare this to the original as a way of assessing the effectiveness of the model. The reproduced correlation or covariance matrix is simply the factor loadings matrix times its transpose (i.e., CC^T). The user may optionally request that the signs of the factor loadings be inverted if there are more minus signs than positive ones. This is purely cosmetic and does not affect the solution in a substantive way. However, if signs are reversed, this must be kept in mind when attempting to interpret or assign meaning to the factors.
Maximum Likelihood Factors
As mentioned earlier, the common factor model attempts to find both common and unique factors explaining the covariance or correlations amongst a set of variables.
That is, an attempt is made to find a factor pattern C and a uniqueness matrix R such that a covariance or correlation matrix S can be modeled as S = CC^T + R. To do this, it is necessary to utilize the principle of maximum likelihood based on the assumption that the data comes from a multivariate normal distribution. Because the distribution function of the elements of a covariance matrix is involved, it is necessary to use the Wishart distribution in order to derive the likelihood equation. The optimization technique used to maximize the likelihood of a solution for C and R is the Expectation Maximization or EM technique. This technique, often used in the replacement of missing data, is the same basic technique used in Teradata Warehouse Miner’s cluster analysis algorithm. Some key points regarding this technique are described below. Beginning with a correlation or covariance matrix S as with the other factor techniques, a principal components solution is first derived as an initial estimate for the factor pattern matrix C, with the initial estimate for the uniqueness matrix R taken simply as S - CC^T. Then the maximum likelihood solution is iteratively found, yielding a best estimate of C and R. In order then to assess the effectiveness of the model, the correlation or covariance matrix S is compared to the reproduced matrix CC^T + R. It should be pointed out that when using the maximum likelihood solution the user must first specify the number of common factors f to produce in the model. The software will not automatically determine what this value should be or determine it based on a threshold value. Also, an internal adjustment is made to the final factor pattern matrix C to make the factors orthogonal, something that is automatically true of the other factor solutions. Finally, the user may optionally request that the signs of a factor in the matrix C be inverted if there are more minus signs than positive ones. This is purely cosmetic and does not affect the solution in a substantive way. However, if signs are reversed, this must be kept in mind when attempting to interpret or assign meaning to the factors.
Factor Rotations
Teradata Warehouse Miner offers a number of techniques for rotating factors in order to find the elusive quality of simple structure described earlier. These may optionally be used in combination with any of the factor techniques offered in the product. When a rotation is performed, both the rotated matrix and the rotation matrix are reported, as well as the reproduced correlation or covariance matrix after rotation. As before with the factor solutions themselves, the user may optionally request that the signs of a factor in the rotated factor or components matrix be inverted if there are more minus signs than positive ones. This is purely cosmetic and does not affect the solution in a substantive way.
Orthogonal rotations
First consider orthogonal rotations, that is, rotations of a factor matrix A that result in a rotated factor matrix B by way of an orthogonal transformation matrix T (i.e., B = AT). Remember that the nice thing about orthogonal rotations on a factor matrix is that the resulting factor scores are uncorrelated, a desirable property when the factors are going to be used in subsequent regression, cluster or other types of analysis. But how is simple structure obtained?
As described earlier, the idea behind simple structure is to express each component or factor in terms of fewer variables that are highly correlated with the factor, with the remaining variables not so correlated with the factor. The two most famous mathematical criteria for simple factor structure are the quartimax and varimax criteria. Simply put, the varimax criterion seeks to simplify the structure of columns or factors in the factor loading matrix, whereas the quartimax criterion seeks to simplify the structure of the rows or variables in the factor loading matrix. Less simply put, the varimax criterion seeks to maximize the variance of the squared loadings across the variables for all factors. The quartimax criterion seeks to maximize the variance of the squared loadings across the factors for all variables. The solution to either optimization problem is mathematically quite involved, though in principle it is based on fundamental techniques of linear algebra, differential calculus, and the use of the popular Newton-Raphson iterative technique for finding the roots of equations. Regardless of the criterion used, rotations are performed on normalized loadings, that is prior to rotating, the rows of the factor loading matrix are set to unit length by dividing each element by the square root of the communality for that variable. The rows are unnormalized back to the original length after the rotation is performed. This has been found to improve results, particularly for the varimax method. Fortunately both the quartimax and varimax criteria can be expressed in terms of the same equation containing a constant value that is 0 for quartimax and 1 for varimax. The orthomax criterion is then obtained simply by setting this constant, call it gamma, to any desired value, equamax corresponds to setting this constant to half the number of factors, and parsimax is given by setting the value of gamma to v(f-1) / (v+f+2) where v is the number of variables and f is the number of factors. Oblique rotations As mentioned earlier, oblique rotations relax the requirement for factor independence that exists with orthogonal rotations, while more aggressively seeking better data alignment. Teradata Warehouse Miner uses a technique known as the indirect oblimin method. As with orthogonal rotations, there is a common equation for the oblique simple structure criterion that contains a constant that can be set for various effects. A value of 0 for this constant, call it gamma, yields the quartimin solution, which is the most oblique solution of those offered. A value of 1 yields the covarimin solution, the least oblique case. And a value of 0.5 yields the 68 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis biquartimin solution, a compromise between the two. A solution known as orthomin can be achieved by setting the value of gamma to any desired positive value. One of the distinctions of a factor solution that incorporates an oblique rotation is that the factor loadings must be thought of in terms of two different matrices, the factor pattern P matrix and the factor structure matrix S. These are related by the equation S = PQ where Q is the matrix of correlations between factors. Obviously if the factors are not correlated, as in an unrotated solution or after an orthogonal rotation, then Q is the identity matrix and the structure and pattern matrix are the same. 
The result of an oblique rotation must include both the pattern matrix that describes the common factors and the structure matrix of correlations between the factors and original variables. As with orthogonal rotations, oblique rotations are performed on normalized loadings that are restored to their original size after rotation. A unique characteristic of the indirect oblimin method of rotation is that it is performed on a reference structure based on the normals of the original factor space. There is no inherent value in this, but is in fact just a side effect of the technique. It means however that an oblique rotation results in a reference factor pattern, structure and rotation matrix that is then converted back into the original factor space as the final primary factor pattern, structure and rotation matrix. Data Quality Reports The same data quality reports optionally available for linear regression are also available when performing Factor Analysis. Prime Factor Reports Prime Factor Loadings This report provides a specially sorted presentation of the factor loadings. Like the standard report of factor loadings, the rows represent the variables and the columns represent the factors. In this case, however, each variable is associated with the factor for which it has the largest loading as an absolute value. The variables having factor 1 as the prime factor are listed first, in descending order of the loading with factor 1. Then the variables having factor 2 as the prime factor are listed, continuing on until all the variables are listed. It is possible that not all factors will appear in the Prime Factor column, but all the variables will be listed once and only once with all their factor loadings. Note that in the special case after an oblique rotation has been performed in the factor analysis, the report is based on the factor structure matrix and not the factor pattern matrix, since the structure matrix values represent the correlations between the variables and the factors. The following is an example of a Prime Factor Loadings report. Table 14: Prime Factor Loadings report (Example) Variable Prime Factor Factor 1 Factor 2 Factor 3 income Factor 1 .8229 -1.1675E-02 .1353 revenue Factor 1 .8171 .4475 2.3336E-02 Teradata Warehouse Miner User Guide - Volume 3 69 Chapter 1: Analytic Algorithms Factor Analysis Table 14: Prime Factor Loadings report (Example) Variable Prime Factor Factor 1 Factor 2 Factor 3 single Factor 1 -.7705 .4332 .1554 age Factor 1 .7348 -4.5584E-02 1.0212E-02 cust_years Factor 2 .5158 .6284 .1577 purchases Factor 2 .5433 -.5505 -.254 female Factor 3 -4.1177E-02 .3366 -.9349 Prime Factor Variables The Prime Factor Variables report is closely related to the Prime Factor Loadings report. It associates variables with their prime factors and possibly other factors if a threshold percent or loading value is specified. It provides a simple presentation, without numbers, of the relationships between factors and the variables that contribute to them. If a threshold percent of 1.0 is used, only prime factor relationships are reported. A threshold percentage of less than 1.0 indicates that if the loading for a particular factor is equal to or above this percentage of the loading for the variable's prime factor, then an association is made between the variable and this factor as well. When the variable is associated with a factor other than its prime factor, the variable name is given in parentheses. 
A threshold loading value may alternately be used to determine the associations between variables and factors. In this case, it is possible that a variable may not appear in the report, depending on the threshold value and the loading values. However, if the option to reverse signs was enabled, positive values may actually represent inverse relationships between factors and original variables. Deselecting this option in a second run and examining factor loading results will provide the true nature (directions) of relationships among variables and factors. The following is an example of a Prime Factor Variables report. Table 15: Prime Factor Variables report (Example) 70 Factor 1 Factor 2 Factor 3 income cust_years female revenue purchases * single * * age * * (purchases) * * Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis Prime Factor Variables with Loadings The Prime Factor Variables with Loadings is functionally the same as the Prime Factor Variables report except that the actual loading values determining the associations between the variables and factors are also given. The magnitude of the loading gives some idea of the relative strength of the relationship and the sign indicates whether or not it is an inverse relationship. A negative sign indicates an inverse relationship in the values (i.e., a negative correlation). The following is an example of a Prime Factor Variables with Loadings report. Table 16: Factor Variable Loading Factor 1 income .8229 Factor 1 revenue .8171 Factor 1 single -.7705 Factor 1 age .7348 Factor 1 (purchases) .5433 Factor 2 cust_years .6284 Factor 2 purchases -.5505 Factor 3 female -.9349 Missing Data Null values for columns in a factor analysis can adversely affect results. It is recommended that the listwise deletion option be used when building the SSCP matrix with the Build Matrix function. This ensures that any row for which one of the columns is null will be left out of the matrix computations completely. Additionally, the Recode transformation function can be used to build a new column, substituting a fixed known value for null. Initiate a Factor Analysis Use the following procedure to initiate a new Factor Analysis in Teradata Warehouse Miner: Teradata Warehouse Miner User Guide - Volume 3 71 Chapter 1: Analytic Algorithms Factor Analysis 1 Click on the Add New Analysis icon in the toolbar: Figure 47: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Factor Analysis: Figure 48: Add New Analysis dialog 3 This will bring up the Factor Analysis dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Factor - INPUT - Data Selection On the Factor Analysis dialog click on INPUT and then click on data selection: Figure 49: Factor Analysis > Input > Data Selection 72 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis On this screen select: 1 Select Input Source Users may select between different sources of input, Table, Matrix or Analysis. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. (In this case a matrix will be dynamically built and discarded when the algorithm completes execution). By selecting the Input Source Matrix the user may can select from available matrices created by the Build Matrix function. 
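The association logic just described can be sketched in a few lines of Python. The loadings below loosely echo Table 14, while the threshold percent of 0.8 and the use of absolute loading values are assumptions made for the example; the report itself is produced by the product.

import pandas as pd

# Illustrative loadings keyed by variable (values roughly echo Table 14).
loadings = pd.DataFrame(
    {"Factor 1": [0.8229, 0.8171, -0.7705, 0.5433],
     "Factor 2": [-0.0117, 0.4475, 0.4332, -0.5505],
     "Factor 3": [0.1353, 0.0233, 0.1554, -0.2540]},
    index=["income", "revenue", "single", "purchases"])

threshold_pct = 0.8   # hypothetical threshold percent (< 1.0)

for var, row in loadings.iterrows():
    prime = row.abs().idxmax()                 # prime factor = largest |loading|
    cutoff = threshold_pct * abs(row[prime])
    for factor, value in row.items():
        if factor == prime:
            print(f"{factor}: {var}")          # prime association
        elif abs(value) >= cutoff:
            print(f"{factor}: ({var})")        # secondary association, shown in parentheses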
This has the advantage that the matrix selected for input is available for further analysis after completion of the algorithm, perhaps selecting a different subset of columns from the matrix. By selecting the Input Source Analysis the user can select directly from the output of another analysis of qualifying type in the current project. (In this case a matrix will be dynamically built and discarded when the algorithm completes execution). Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). 2 Select Columns From One Table • Available Databases (only for Input Source equal to Table) — All the databases that are available for the Factor Analysis. • Available Matrices (only for Input Source equal to Matrix) — When the Input Source is Matrix, a matrix must first be built by the user with the Build Matrix function before Factor Analysis can be performed. Select the matrix that summarizes the data to be analyzed. (The matrix must have been built with more rows than columns selected or the Factor Analysis will produce a singular matrix, causing a failure). • Available Analyses (only for Input Source equal to Analysis) — All the analyses that are available for the Factor Analysis. • Available Tables (only for Input Source equal to Table or Analysis) — All the tables that are available for the Factor Analysis. • Available Columns — All the columns that are available for the Factor Analysis. • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. The algorithm requires that the selected columns be of numeric type (or contain numbers in character format). Factor - INPUT - Analysis Parameters On the Factor Analysis dialog click on INPUT and then click on analysis parameters: Teradata Warehouse Miner User Guide - Volume 3 73 Chapter 1: Analytic Algorithms Factor Analysis Figure 50: Factor Analysis > Input > Analysis Parameters On this screen select: • General Options • • Analysis method • Principal Components (PCA) — As described above. This is the default method. • Principal Axis Factors (PAF) — As described above. • Maximum Likelihood Factors (MLF) — As described above. Convergence Method • Minimum Eigenvalue PCA — minimum eigenvalue to include in principal components (default 1.0) PAF — minimum eigenvalue to include in factor loadings (default 0.0) MLF — option does not apply (N/A) • • • • 74 Number of Factors — The user may request a specific number of factors as an alternative to using the minimum eigenvalue option for PCA and PAF. Number of factors is however required for MLF. The number of factors requested must not exceed the number of requested variables. 
Convergence Criterion • PCA — convergence criterion does not apply • PAF — iteration continues until maximum communality change does not exceed convergence criterion • MLF — iteration continues until maximum change in the square root of uniqueness values does not exceed convergence criterion Maximum Iterations • PCA — maximum iterations does not apply (N/A) • PAF — the algorithm stops if the maximum iterations is exceeded (default 100) • MLF — the algorithm stops if the maximum iterations is exceeded (default 1000) Matrix Type — The product automatically converts the extended cross-products matrix stored in metadata results tables by the Build Matrix function into the desired covariance or correlation matrix. The choice will affect the scaling of resulting factor measures and factor scores. • Correlation — Build a correlation matrix as input to Factor Analysis. This is the default option. • Covariance — Build a covariance matrix as input to Factor Analysis. • Invert signs if majority of matrix values are negative (checkbox) — You may Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis optionally request that the signs of factor loadings and related values be changed if there are more minus signs than positive ones. This is purely cosmetic and does not affect the solution in a substantive way. Default is enabled. • Rotation Options • Rotation Method • None — No factor rotation is performed. This is the default option. • Varimax — Gamma in rotation equation fixed at 1.0. The varimax criterion seeks to simplify the structure of columns or factors in the factor loading matrix • Quartimax — Gamma in rotation equation fixed at 0.0. the quartimax criterion seeks to simplify the structure of the rows or variables in the factor loading matrix • Equamax — Gamma in rotation equation fixed at f / 2. • Parsimax — Gamma in rotation equation fixed at v(f-1) / (v+f+2). • Orthomax — Gamma in rotation equation set by user. • Quartimin — Gamma in rotation equation fixed at 0.0. Provides the most oblique rotation. • Biquartimin — Gamma in rotation equation fixed at 0.5. • Covarimin — Gamma in rotation equation fixed at 1.0. Provides the least oblique rotation. • Orthomin — Gamma in rotation equation set by user. • Report Options • Variable Statistics — This report gives the mean value and standard deviation of each variable in the model based on the derived SSCP matrix. • Near Dependency — This report lists collinear variables or near dependencies in the data based on the derived SSCP matrix. • Condition Index Threshold — Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The one that involves this parameter is the occurrence of a large condition index value associated with a specially constructed principal factor. If a factor has a condition index greater than this parameter’s value, it is a candidate for the Near Dependency report. A default value of 30 is used as a rule of thumb. • Variance Proportion Threshold — Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The one that involves this parameter is when two or more variables have a variance proportion greater than this threshold value for a factor with a high condition index. Another way of saying this is that a ‘suspect’ factor accounts for a high proportion of the variance of two or more variables. This parameter defines what a high proportion of variance is. A default value of 0.5 is used as a rule of thumb. 
• Collinearity Diagnostics Report — This report provides the details behind the Near Dependency report, consisting of the “Eigenvalues of Unit Scaled X’X”, “Condition Indices” and “Variance Proportions” tables. • Factor Loading Reports • Factor Variables Report Teradata Warehouse Miner User Guide - Volume 3 75 Chapter 1: Analytic Algorithms Factor Analysis • Factor Variables with Loadings Report • Display Variables Using • Threshold percent • Threshold loading — A threshold percentage of less than 1.0 indicates that if the loading for a particular factor is equal or above this percentage of the loading for the variable's prime factor, then an association is made between the variable and this factor as well. A threshold loading value may alternatively be used. Factor Analysis - OUTPUT On the Factor Analysis dialog, click on OUTPUT: Figure 51: Factor Analysis > Output On this screen select: • Store the Factor Loadings/Weights/Statistics reports as tables in the database — Check this box to store the following reports, if selected, as tables in the database. • Factor Loadings • Factor Variables With Loadings • Factor Weights • Variable Statistics • Database Name — The name of the database to create the output tables in. • Output Table Prefix — The prefix to each of the output table names. For example, if my_factor_reports_ is entered here, the tables produced will be named as follows: Table 17: my_factor_reports_ tables 76 Report Filename Factor Loadings my_factor_reports_FL Factor Variables With Loadings my_factor_reports_FV Factor Weights my_factor_reports_FW Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis Table 17: my_factor_reports_ tables Report Filename Variable Statistics my_factor_reports_VS The contents of the tables will match the contents of the reports except that there will be no fixed ordering of the rows (unless an ORDER BY clause is used when selecting from them). • Advertise Output — The Advertise Output option “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. Note that this check box will be disabled if the Always Advertise option is selected on the Connection Properties dialog, because in this case advertising will be automatic. Advertise Output information may be viewed using the Advertise Maintenance dialog available from the Tools menu, from where the definition and contents of these tables may also be viewed. For more information, refer to Advertise Output. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. Run the Factor Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Factor Analysis The results of running the Factor Analysis include a factor patterns graph, a scree plot (unless MLF was specified), and a variety of statistic reports. All of these results are outlined below. 
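For readers unfamiliar with a scree plot, the following matplotlib sketch shows the kind of display meant: eigenvalues plotted in descending order, with the common minimum-eigenvalue cutoff of 1.0 drawn as a reference line. The eigenvalues here are made up for illustration and do not come from any product run.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical eigenvalues in descending order, as produced by a
# principal components or principal axis factors run.
eigenvalues = np.array([3.1, 1.4, 0.9, 0.4, 0.2])

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.axhline(1.0, linestyle="--")            # common minimum-eigenvalue cutoff
plt.xlabel("Factor number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()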
Factor Analysis - RESULTS - Reports On the Factor Analysis dialog, click on RESULTS and then click on reports (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 52: Factor Analysis > Results > Reports Teradata Warehouse Miner User Guide - Volume 3 77 Chapter 1: Analytic Algorithms Factor Analysis Data Quality Reports • Variable Statistics — If selected on the Results Options tab, this report gives the mean value and standard deviation of each variable in the model based on the SSCP matrix provided as input. • Near Dependency — If selected on the Results Options tab, this report lists collinear variables or near dependencies in the data based on the SSCP matrix provided as input. Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The first is the occurrence of a large condition index value associated with a specially constructed principal factor. If a factor has a condition index greater than the parameter specified on the Results Option tab, it is a candidate for the Near Dependency report. The other is when two or more variables have a variance proportion greater than a threshold value for a factor with a high condition index. Another way of saying this is that a ‘suspect’ factor accounts for a high proportion of the variance of two or more variables. The parameter to defines what a high proportion of variance is also set on the Results Option tab. A default value of 0.5. • Detailed Collinearity Diagnostics — If selected on the Results Options tab, this report provides the details behind the Near Dependency report, consisting of the following tables. • Eigenvalues of Unit Scaled X'X — Report of the eigenvalues of all variables scaled so that each variable adds up to 1 when summed over all the observations or rows. In order to calculate the singular values of X (the rows of X are the observations), the mathematically equivalent square root of the eigenvalues of XTX are computed instead for practical reasons • Condition Indices — The condition index of each eigenvalue, calculated as the square root of the ratio of the largest eigenvalue to the given eigenvalue, a value always 1 or greater. • Variance Proportions — The variance decomposition of these eigenvalues is computed using the eigenvalues together with the eigenvectors associated with them. The result is a matrix giving, for each variable, the proportion of variance associated with each eigenvalue. Principal Component Analysis report • Number of Variables — This is the number of variables to be factored, taken from the matrix that is input to the algorithm. Note that there are no dependent or independent variables in a factor analysis model. • Minimum Eigenvalue — The minimum value of a factor’s associated eigenvalue, determining whether or not to include the factor in the final model. This field is not displayed if the Number of Factors option is used to determine the number of factors retained. • Number of Factors — This value reflects the number of factors retained in the final factor analysis model. If the Number of Factors option is explicitly set by the user to determine the number of factors, then this reported value reflects the value set by the user. Otherwise, it reflects the number of factors resulting from applying the Minimum Eigenvalue option. 
78 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis • Matrix Type (cor/cov) — This value reflects the type of input matrix requested by the user, either correlation (cor) or covariance (cov). • Rotation (none/orthogonal/oblique) — This value reflects the type of rotation, if any, requested by the user, either none, orthogonal, or oblique. • Gamma — This value is a coefficient in the rotation equation that reflects the type of rotation requested, if any, and in some cases is explicitly set by the user. Gamma is determined as follows. • Orthogonal rotations: • Varimax — (gamma in rotation equation fixed at 1.0) • Quartimax — (gamma in rotation equation fixed at 0.0) • Equamax — (gamma in rotation equation fixed at f / 2)* • Parsimax — (gamma in rotation equation fixed at v(f-1) / (v+f+2))* • Orthomax — (gamma in rotation equation set by user) * where v is the number of variables and f is the number of factors • Oblique rotations • Quartimin — (gamma in rotation equation fixed at 0.0) • Biquartimin — (gamma in rotation equation fixed at 0.5) • Covarimin — (gamma in rotation equation fixed at 1.0) • Orthomin — (gamma in rotation equation set by user) Principal Axis Factors report • Number of Variables — This is the number of variables to be factored, taken from the matrix that is input to the algorithm. Note that there are no dependent or independent variables in a factor analysis model. • Minimum Eigenvalue — The minimum value of a factor’s associated eigenvalue, determining whether or not to include the factor in the final model. This field is not displayed if the Number of Factors option is used to determine the number of factors retained. • Number of Factors — This value reflects the number of factors retained in the final factor analysis model. If the Number of Factors option is explicitly set by the user to determine the number of factors, then this reported value reflects the value set by the user. Otherwise, it reflects the number of factors resulting from applying the Minimum Eigenvalue option. • Maximum Iterations — This is the maximum number of iterations requested by the user. • Convergence Criterion — This is the value requested by the user as the convergence criterion such that iteration continues until the maximum change in the square root of uniqueness values does not exceed this value. • Rotation (none/orthogonal/oblique) — This value reflects the type of rotation, if any, requested by the user, either none, orthogonal, or oblique. • Gamma — This value is a coefficient in the rotation equation that reflects the type of rotation requested, if any, and in some cases is explicitly set by the user. Gamma is determined as follows. 
Teradata Warehouse Miner User Guide - Volume 3 79 Chapter 1: Analytic Algorithms Factor Analysis • Orthogonal rotations • Varimax — (gamma in rotation equation fixed at 1.0) • Quartimax — (gamma in rotation equation fixed at 0.0) • Equamax — (gamma in rotation equation fixed at f / 2)* • Parsimax — (gamma in rotation equation fixed at v(f-1) / (v+f+2))* • Orthomax — (gamma in rotation equation set by user) * where v is the number of variables and f is the number of factors • Oblique rotations • Quartimin — (gamma in rotation equation fixed at 0.0) • Biquartimin — (gamma in rotation equation fixed at 0.5) • Covarimin — (gamma in rotation equation fixed at 1.0) • Orthomin — (gamma in rotation equation set by user) Maximum Likelihood (EM) Factor Analysis report • Number of Variables — This is the number of variables to be factored, taken from the matrix that is input to the algorithm. Note that there are no dependent or independent variables in a factor analysis model. • Number of Observations — This is the number of observations in the data used to build the matrix that is input to the algorithm. • Number of Factors — This reflects the number of factors requested by the user for the factor analysis model. • Maximum Iterations — This is the maximum number of iterations requested by the user. (The actual number of iterations used is reflected in the Total Number of Iterations field further down in the report). • Convergence Criterion — This is the value requested by the user as the convergence criterion such that iteration continues until the maximum change in the square root of uniqueness values does not exceed this value. (It should be noted that convergence is based on uniqueness values rather than maximum likelihood values, something that is done strictly for practical reasons based on experimentation). • Matrix Type (cor/cov) — This value reflects the type of input matrix requested by the user, either correlation (cor) or covariance (cov). • Rotation (none/orthogonal/oblique) — This value reflects the type of rotation, if any, requested by the user, either none, orthogonal, or oblique. • Gamma — This value is a coefficient in the rotation equation that reflects the type of rotation requested, if any, and in some cases is explicitly set by the user. Gamma is determined as follows. • Orthogonal rotations 80 • Varimax — (gamma in rotation equation fixed at 1.0) • Quartimax — (gamma in rotation equation fixed at 0.0) • Equamax — (gamma in rotation equation fixed at f / 2)* • Parsimax — (gamma in rotation equation fixed at v(f-1) / (v+f+2))* Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis • Orthomax — (gamma in rotation equation set by user) * where v is the number of variables and f is the number of factors • Oblique rotations • Quartimin — (gamma in rotation equation fixed at 0.0) • Biquartimin — (gamma in rotation equation fixed at 0.5) • Covarimin — (gamma in rotation equation fixed at 1.0) • Orthomin — (gamma in rotation equation set by user) • Total Number of Iterations — This value is the number of iterations that the algorithm performed to converge on a maximum likelihood solution. • Final Average Likelihood — This is the final value of the average likelihood over all the observations represented in the input matrix. • Change in Avg Likelihood — This is the final change, from the previous to the final iteration, in value of the average likelihood over all the observations represented in the input matrix. 
• Maximum Change in Sqrt (uniqueness) — The algorithm calculates a uniqueness value for each factor each time it iterates, and keeps track of how much the positive square root of each of these values changes from one iteration to the next. The maximum change in this value is given here, and it is of interest because it is used to determine convergence of the model. (Refer to “Final Uniqueness Values” on page 83 for an explanation of these values in the common factor model). Max Change in Sqrt (Communality) For Each Iteration This report, printed for Principal Axis Factors only, and only if the user requests the Report Output option Long, shows the progress of the algorithm in converging on a solution. It does this by showing, at each iteration, the maximum change in the positive square root of the communality of each of the variables. The communality of a variable is that portion of its variance that can be attributed to the common factors. Simply put, when the communality values for all of the variables stop changing sufficiently, the algorithm stops. Matrix to be Factored The correlation or covariance matrix to be factored is printed out only if the user requests the Report Output option Long. Only the lower triangular portion of this symmetric matrix is reported and output is limited to at most 100 rows for expediency. (If it is necessary to view the entire matrix, the Get Matrix function with the Export to File option is recommended). Initial Communality Estimates This report is produced only for Principal Axis Factors and Maximum Likelihood Factors. The communality of a variable is that portion of its variance that can be attributed to the common factors, excluding uniqueness. The initial communality estimates for each variable are made by calculating the squared multiple correlation coefficient of each variable with respect to the other variables taken together. Final Communality Estimates This report is produced only for Principal Axis Factors and Maximum Likelihood Factors. The communality of a variable is that portion of its variance that can be attributed to the common factors, excluding uniqueness. The final communality estimates for each variable are computed as $h_j^2 = \sum_{k=1}^{r} f_{jk}^2$ (i.e., as the sum of the squares of the factor loadings for each variable). Eigenvalues These are the resulting eigenvalues of the principal component or principal axis factor solution, in descending order. At this stage, there are as many eigenvalues as input variables since the number of factors has not been reduced yet. Eigenvectors These are the resulting eigenvectors of the principal components or principal axis factor solution, in descending order. At this stage, there are as many eigenvectors as input variables since the number of factors has not been reduced yet. Eigenvectors are printed out only if the user requests the Report Output option Long. Principal Component Loadings (Principal Components) This matrix of values, which is variables by factors in size, represents both the factor pattern and factor structure, i.e., the linear combination of factors for each variable and the correlations between factors and variables (provided Matrix Type is Correlation). The number of factors has been reduced to meet the minimum eigenvalue or number of factors requested, but the output does not reflect any factor rotations that may have been requested.
This output table contains the raw data used in the Prime Factor Reports, which are probably better to use for interpreting results. If the user requested a Matrix Type of Correlation, the principal component loadings can be interpreted as the correlations between the original variables and the newly created factors. An absolute value approaching 1 indicates that a variable is contributing strongly to a particular factor. Factor Pattern (Principal Axis Factors) This matrix of values, which is variables by factors in size, represents both the factor pattern and factor structure, i.e., the linear combination of factors for each variable and the correlations between factors and variables (provided Matrix Type is Correlation). The number of factors has been reduced to meet the minimum eigenvalue or number of factors requested, but the output does not reflect any factor rotations that may have been requested. This output table contains the raw data used in the Prime Factor Reports, which are probably better to use for interpreting results. If the user requested a Matrix Type of Correlation, the factor pattern can be interpreted as the correlations between the original variables and the newly created factors. An absolute value approaching 1 indicates that a variable is contributing strongly to a particular factor. 82 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis Factor Pattern (Maximum Likelihood Factors) This matrix of values, which is variables by factors in size, represents both the factor pattern and factor structure, i.e., the linear combination of factors for each variable and the correlations between factors and variables (provided Matrix Type is Correlation). The number of factors has been fixed at the number of factors requested. The output at this stage does not reflect any factor rotations that may have been requested. This output table contains the raw data used in the Prime Factor Reports, which are probably better to use for interpreting results. If the user requested a Matrix Type of Correlation, the factor pattern can be interpreted as the correlations between the original variables and the newly created factors. An absolute value approaching 1 indicates that a variable is contributing strongly to a particular factor. Variance Explained by Factors This report provides the amount of variance in all of the original variables taken together that is accounted for by each factor. For Principal Components and Principal Axis Factor solutions, the variance is the same as the eigenvalues calculated for the solution. In general however, and for Maximum Likelihood Factor solutions in particular, the variance is the sum of the squared loadings for each factor. (After an oblique rotation, if the factors are correlated, there is an interaction term that must also be added in based on the loadings and the correlations between factors. A separate report entitled Contributions of Rotated Factors To Variance is provided if an oblique rotation is performed). • Factor Variance — This column shows the actual amount of variance in the original variables accounted for by each factor. • Percent of Total Variance — This column shows the percentage of the total variance in the original variables accounted for by each factor. • Cumulative Percent — This column shows the cumulative percentage of the total variance in the original variables accounted for by Factor 1 through each subsequent factor in turn. 
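To make the relationship among these three columns concrete, the following is a minimal illustrative sketch (not the product's implementation; the loadings values shown are hypothetical). It computes each factor's variance as the sum of its squared loadings and, assuming a correlation-matrix input where the total variance equals the number of variables, derives the percent and cumulative percent columns.

import numpy as np

# Hypothetical variables-by-factors loadings matrix (illustration only).
loadings = np.array([[0.84, 0.12],
                     [0.79, -0.20],
                     [0.15, 0.91],
                     [0.08, 0.88]])

factor_variance = (loadings ** 2).sum(axis=0)       # sum of squared loadings per factor
total_variance = loadings.shape[0]                  # correlation input: total variance = number of variables
percent_of_total = 100.0 * factor_variance / total_variance
cumulative_percent = np.cumsum(percent_of_total)

for i, (v, p, c) in enumerate(zip(factor_variance, percent_of_total, cumulative_percent), start=1):
    print(f"Factor {i}: Factor Variance {v:.4f}, Percent of Total {p:.4f}, Cumulative {c:.4f}")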
Factor Variance to Total Variance Ratio This is simply the ratio of the variance explained by all the factors to the total variance in the original data. Condition Indices of Components The condition index of a principal component or principal factor is the square root of the ratio of the largest eigenvalue to the eigenvalue associated with that component or factor. This report is provided for Principal Components and Principal Axis Factors only. Final Uniqueness Values The common factor model seeks to find a factor pattern C and a uniqueness matrix R such that a covariance or correlation matrix S can be modeled as $S = CC^T + R$. The uniqueness matrix is a diagonal matrix, so there is a single uniqueness value for each variable in the model. The theory behind the uniqueness value of a variable is that the variance of each variable can be expressed as the sum of its communality and uniqueness, that is, the variance of the jth variable is given by $s_j^2 = h_j^2 + u_j^2$. This report is provided for Maximum Likelihood Factors only. Reproduced Matrix Based on Loadings The results of a factor analysis can be used to reproduce or approximate the original correlation or covariance matrix used to build the factor analysis model. This is done to evaluate the effectiveness of the model in accounting for the variance in the original data. For Principal Components and Principal Axis Factors the reproduced matrix is simply the loadings matrix times its transpose. For Maximum Likelihood Factors it is the loadings matrix times its transpose plus the uniqueness matrix. This report is provided only when Long is selected as the Output Option. Difference Between Original and Reproduced cor/cov Matrix This report gives the differences between the original correlation or covariance matrix values used in the factor analysis and the Reproduced Matrix Based on Loadings. (In the case of Principal Axis Factors, the reproduced matrix is compared to the original matrix with the initial communality estimates placed in the diagonal of the matrix). This report is provided only when Long is selected as the Output Option. Absolute Difference This report summarizes the absolute value of the differences between the original correlation or covariance matrix values used in the factor analysis and the Reproduced Matrix Based on Loadings. • Mean — This is the average absolute difference in correlation or covariance over the entire matrix. • Standard Deviation — This is the standard deviation of the absolute differences in correlation or covariance over the entire matrix. • Minimum — This is the minimum absolute difference in correlation or covariance over the entire matrix. • Maximum — This is the maximum absolute difference in correlation or covariance over the entire matrix. Rotated Loading Matrix This report of the factor loadings (pattern) after rotation is given only after orthogonal rotations. Rotated Structure This report of the factor structure after rotation is given only after oblique rotations. Note that after an oblique rotation the rotated structure matrix is usually different from the rotated pattern matrix. Rotated Pattern This report of the factor pattern after rotation is given after both orthogonal and oblique rotations.
Note that after an oblique rotation the rotated pattern matrix is usually different from the rotated structure matrix. Rotation Matrix After rotating the factor pattern matrix P to get the rotated matrix $P_R$, the rotation matrix T is also produced such that $P_R = PT$. However, after an oblique rotation the rotation matrix obeys the following equation: $P_R = P(T^T)^{-1}$. This report is provided only when Long is selected as the Output Option. Variance Explained by Rotated Factors This is the same report as Variance Explained by Factors except that it is based on the rotated factor loadings. Comparison of the two reports can show the effects of rotation on the effectiveness of the model. After an oblique rotation, another report is produced called the Contributions of Rotated Factors to Variance to show both the contributions of individual factors and the contributions of factor interactions to the explanation of the variance in the original variables analyzed. Rotated Factor Variance to Total Variance Ratio This is the same report as Factor Variance to Total Variance Ratio except that it is based on the rotated factor loadings. Comparison of the two reports can show the effects of rotation on the effectiveness of the model. Correlations Among Rotated Factors After an oblique rotation the factors are generally no longer orthogonal or uncorrelated with each other. This report is a standard Pearson product-moment correlation matrix treating the rotated factors as new variables. Values range from 0 to -1 or +1 indicating no correlation to maximum correlation respectively (a negative correlation indicates that two factors vary in opposite directions with respect to each other). This report is provided only after an oblique rotation is performed. Contributions of Rotated Factors to Variance In general, the variance of the original variables explained by a factor is the sum of the squared loadings for the factor. But after an oblique rotation the factors may be correlated, so additional interaction terms between the factors must be considered in computing the explained variance reported in the Variance Explained by Rotated Factors report. The contributions of factors to variance may be characterized as direct contributions $V_p = \sum_{j=1}^{n} b_{jp}^2$ and joint contributions $V_{pq} = 2 r_{T_p T_q} \sum_{j=1}^{n} b_{jp} b_{jq}$, where p and q vary by factors with p < q, j varies by variables, and r is the correlation between factors. The Contributions of Rotated Factors to Variance report displays direct contributions along the diagonal and joint contributions off the diagonal. This report is provided only after an oblique rotation is performed. Factor Weights A report of Factor Weights may be selected on the analysis parameters tab. Factor weights are the coefficients that are multiplied by the variables in the factor model to determine the value of each factor as a linear combination of input variables when scoring. (Using the Factor Scoring analysis with Scoring Method equal to Score and output option Generate the SQL for this analysis but do not execute it checked, it may be seen that the Factor Weights report displays the same coefficients that are used when scoring a factor model).
Whereas factor loadings generally indicate the correlation between factors and model variables (i.e., in the absence of an oblique rotation), factor weights can give an indication of the relative contribution of each model variable to each new variable (factor). Factor Analysis - RESULTS - Pattern Graph On the Factor Analysis dialog, click on RESULTS and then click on pattern graph (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 53: Factor Analysis > Results > Pattern Graph The Factor Analysis Pattern Graph plots the final factor pattern values for up to twelve variables, two factors at a time. These factor pattern values are the coefficients in the linear combination of factors that comprise each variable. When the Analysis Type is Principal Components, these pattern values are referred to as factor loadings. When the Matrix Type is Correlation, the values of these coefficients are standardized to be between -1 and 1 (if Covariance, they are not). Unless an oblique rotation has been performed, these values also represent the factor structure (i.e., the correlation between a factor and a variable). The following options are available: • Variables • Available — A list of all variables that were input to the Factor Analysis. • Selected — A list of the variables (up to 12) that will be displayed on the Factor Patterns graph. • Factors • Available — A list of all factors generated by the Factor Analysis. • Selected — The selected two factors that will be displayed on the Factor Patterns graph. Factor Analysis - RESULTS - Scree Plot Unless MLF was specified, a scree plot is generated. On the Factor Analysis dialog, click on RESULTS and then click on scree plot (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 54: Factor Analysis > Results > Scree Plot A definition of the word scree is a heap of stones or rocky debris, such as at the bottom of a hill. So in a scree plot the object is to find where the plotted points flatten out, in order to determine how many Principal Component or Principal Axis factors should be retained in the factor analysis model (the scree plot does not apply to Maximum Likelihood factor analysis). The plot shows the eigenvalues of each factor in descending order from left to right. Since the eigenvalues represent the amount of variance in the original variables that is explained by the factors, when the eigenvalues flatten out in the plot, the factors they represent add less and less to the effectiveness of the model. Tutorial - Factor Analysis In this example, principal components analysis is performed on a correlation matrix for 21 numeric variables. This reduces the variables to 7 factors using a minimum eigenvalue of 1. The Scree Plot supports limiting the number of factors to 7 by showing how the eigenvalues (and thus the explained variance) level off at 7 or above.
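As a small aside before the tutorial steps, the retention rule just described can be sketched in a few lines (illustrative only, not the product's implementation). It applies the Minimum Eigenvalue setting of 1 to the eigenvalues that appear later in Table 20 of this tutorial and recovers the seven retained factors.

import numpy as np

# Eigenvalues as reported in Table 20 of the tutorial below.
eigenvalues = np.array([4.292, 2.497, 1.844, 1.598, 1.446, 1.254, 1.041,
                        0.971, 0.926, 0.871, 0.741, 0.693, 0.601, 0.504,
                        0.437, 0.347, 0.340, 0.253, 0.151, 0.123, 0.0701])

minimum_eigenvalue = 1.0
retained = int((eigenvalues > minimum_eigenvalue).sum())   # factors kept by the Minimum Eigenvalue rule
print(f"Factors retained: {retained}")                     # prints 7, matching the tutorial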
Parameterize a Factor Analysis as follows: • Available Matrices — Customer_Analysis_Matrix • Selected Variables • income Teradata Warehouse Miner User Guide - Volume 3 87 Chapter 1: Analytic Algorithms Factor Analysis • age • years_with_bank • nbr_children • female • single • married • separated • ccacct • ckacct • svacct • avg_cc_bal • avg_ck_bal • avg_sv_bal • avg_cc_tran_amt • avg_cc_tran_cnt • avg_ck_tran_amt • avg_ck_tran_cnt • avg_sv_tran_amt • avg_sv_tran_cnt • cc_rev • Analysis Method — Principal Components • Matrix Type — Correlation • Minimum Eigenvalue — 1 • Invert signs if majority of matrix values are negative — Enabled • Rotation Options — None • Factor Variables — Enabled • Threshold Percent — 1 • Long Report — Not enabled Run the analysis, and click on Results when it completes. For this example, the Factor Analysis generated the following pages. A single click on each page name populates the Results page with the item. Table 18: Factor Analysis Report 88 Number of Variables 21 Minimum Eigenvalue 1 Number of Factors 7 Matrix Type Correlation Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis Table 18: Factor Analysis Report Rotation None Table 19: Execution Summary 6/20/2004 1:55:02 PM Getting Matrix 6/20/2004 1:55:02 PM Principal Components Analysis Running...x 6/20/2004 1:55:02 PM Creating Report Table 20: Eigenvalues Factor 1 4.292 Factor 2 2.497 Factor 3 1.844 Factor 4 1.598 Factor 5 1.446 Factor 6 1.254 Factor 7 1.041 (Factor 8) .971 (Factor 9) .926 (Factor 10) .871 (Factor 11) .741 (Factor 12) .693 (Factor 13) .601 (Factor 14) .504 (Factor 15) .437 (Factor 16) .347 (Factor 17) .34 (Factor 18) .253 (Factor 19) .151 (Factor 20) .123 (Factor 21) 7.01E-02 Teradata Warehouse Miner User Guide - Volume 3 89 Chapter 1: Analytic Algorithms Factor Analysis Table 21: Principal Component Loadings Variable Name Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 Factor 6 Factor 7 age 0.2876 -0.4711 0.1979 0.2615 0.2975 0.3233 -0.2463 avg_cc_bal -0.7621 0.0131 0.1628 -0.1438 0.3508 -0.1550 -0.0300 avg_cc_tran_amt 0.3716 -0.0318 -0.1360 0.0543 -0.1975 0.0100 0.0971 avg_cc_tran_cnt 0.4704 0.0873 -0.4312 0.5592 -0.0241 0.0133 0.0782 avg_ck_bal 0.5778 0.0527 -0.0981 -0.4598 0.0735 -0.0123 -0.0542 avg_ck_tran_amt 0.7698 0.0386 -0.0929 -0.4535 0.2489 0.0585 0.0190 avg_ck_tran_cnt 0.3127 0.1180 -0.1619 -0.1114 0.5435 0.1845 0.0884 avg_sv_bal 0.3785 0.3084 0.4893 0.0186 -0.0768 -0.0630 0.0517 avg_sv_tran_amt 0.4800 0.4351 0.5966 0.1456 -0.0155 0.0272 0.1281 avg_sv_tran_cnt 0.2042 0.3873 0.4931 0.1144 0.2420 0.0884 -0.0646 cc_rev 0.8377 -0.0624 -0.1534 0.0691 -0.3800 0.1036 0.0081 ccacct 0.2025 0.5213 0.4007 0.3021 0.0499 -0.1988 0.1733 ckacct 0.4007 0.1496 -0.4215 0.5497 0.1127 -0.0818 -0.0086 female -0.0209 0.1165 -0.1357 0.3119 0.1887 -0.2228 -0.3438 income 0.6992 -0.2888 0.1353 -0.2987 -0.2684 0.0733 0.0310 married 0.0595 -0.7702 0.2674 0.2434 0.1945 0.0873 0.2768 nbr_children 0.2560 -0.4477 0.1238 -0.0895 -0.0739 -0.5642 0.0898 separated 0.3030 0.0692 0.0545 -0.0666 -0.0796 -0.5089 -0.6425 single -0.2902 0.7648 -0.3004 -0.2010 -0.2120 0.2527 0.0360 svacct 0.4365 0.1616 -0.2592 -0.1705 0.6336 -0.1071 0.0318 years_with_bank 0.0362 -0.0966 0.2120 0.0543 -0.0668 0.5507 -0.5299 Variance Table 22: Factor Variance to Total Variance Ratio .665 Table 23: Variance Explained By Factors Factor Variance Percent of Total Variance Cumulative Percent Condition Indices Factor 1 4.2920 20.4383 20.4383 1.0000 90 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: 
Analytic Algorithms Factor Analysis Table 23: Variance Explained By Factors Factor Variance Percent of Total Variance Cumulative Percent Condition Indices Factor 2 2.4972 11.8914 32.3297 1.3110 Factor 3 1.8438 8.7800 41.1097 1.5257 Factor 4 1.5977 7.6082 48.7179 1.6390 Factor 5 1.4462 6.8869 55.6048 1.7227 Factor 6 1.2544 5.9735 61.5782 1.8497 Factor 7 1.0413 4.9586 66.5369 2.0302 Table 24: Difference Mean Standard Deviation Minimum Maximum 0.0570 0.0866 0.0000 0.7909 Table 25: Prime Factor Variables Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 Factor 6 Factor 7 cc_rev married avg_sv_tran_amt avg_cc_tran_cnt svacct nbr_children separated avg_ck_tran_amt single avg_sv_tran_cnt ckacct avg_ck_tran_cnt years_with_bank female avg_cc_bal ccacct avg_sv_bal * * * * income age * * * * * avg_ck_bal * * * * * * avg_cc_tran_amt * * * * * * Pattern Graph By default, the first twelve variables input to the Factor Analysis, and the first two factors generated, are displayed on the Factor Patterns graph: Scree Plot On the scree plot, all possible factors are shown. In this case, only factors with an eigenvalue greater than 1 were generated by the Factor Analysis: Teradata Warehouse Miner User Guide - Volume 3 91 Chapter 1: Analytic Algorithms Linear Regression Figure 55: Factor Analysis Tutorial: Scree Plot Linear Regression Overview Linear regression is one of the oldest and most fundamental types of analysis in statistics. The British scientist Sir Francis Galton originally developed it in the latter part of the 19th century. The term “regression” derives from the nature of his original study in which he found that the children of both tall and short parents tend to “revert” or “regress” toward average heights. [Neter] It has also been associated with the work of Gauss and Legendre who used linear models in working with astronomical data. Linear regression is thought of today as a special case of generalized linear models, which also includes models such as logit models (logistic regression), log-linear models and multinomial response models. [McCullagh] Why build a linear regression model? It is after all one of the simplest types of models that can be built. Why not start out with a more sophisticated model such as a decision tree? One reason is that if a simpler model will suffice, it is better than an unnecessarily complex model. Another reason is to learn about the relationships between a set of observed variables. Is there in fact a linear relationship between each of the observed variables and the variable to predict? Which variables help in predicting the target dependent variable? If a linear relationship does not exist, is there another type of relationship that does? By transforming a variable, say by taking its exponent or log or perhaps squaring it, and then building a linear regression model, these relationships can hopefully be seen. In some cases, it may even be possible to create an essentially non-linear model using linear regression by transforming the data first. In fact, one of the many sophisticated forms of regression, called piecewise linear 92 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression regression, was designed specifically to build nonlinear models of nonlinear phenomena. Finally, in spite of being a relatively simple type of model, there is a rich set of statistics available to explore the nature of any linear regression model built. 
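To make the point about variable transformations concrete, here is a minimal illustrative sketch (the data and the choice of a log transform are hypothetical, and this is not how Teradata Warehouse Miner itself operates). A relationship that is non-linear in x becomes linear in log(x), so an ordinary least-squares fit on the transformed variable recovers it.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 100.0, size=200)
y = 3.0 * np.log(x) + rng.normal(scale=0.2, size=200)     # underlying relationship is logarithmic

# Fit y on the transformed variable log(x), with a constant term, by least squares.
X = np.column_stack([np.ones_like(x), np.log(x)])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"b0 = {b[0]:.3f}, b1 = {b[1]:.3f}")                 # b1 should come out close to 3.0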
Multiple Linear Regression Multiple linear regression analysis attempts to predict, or estimate, the value of a dependent variable as a linear combination of independent variables, usually with a constant term included. That is, it attempts to find the b-coefficients in the following equation in order to best predict the value of the dependent variable y based on the independent variables x1 to xn: $\hat{y} = b_0 + b_1 x_1 + \cdots + b_n x_n$. The best values of the coefficients are defined to be the values that minimize the sum of squared error values $\sum (y - \hat{y})^2$ over all the observations. Note that this requires that the actual value of y be known for each observation, in order to contrast it with the predicted value $\hat{y}$. This technique is called “least-squared errors.” It turns out that the b-coefficient values to minimize the sum of squared errors can be solved using a little calculus and linear algebra. It is worth spending just a little more effort in describing this technique in order to explain how Teradata Warehouse Miner performs linear regression analysis. It also introduces the concept of a cross-products matrix and its relatives the covariance matrix and the correlation matrix that are so important in multivariate statistical analysis. In order to minimize the sum of squared errors, the equation for the sum of squared errors is expanded using the equation for the estimated y value, and then the partial derivatives of this equation with respect to each b-coefficient are derived and set equal to 0. (This is done in order to find the minimum with respect to all of the coefficient values). This leads to n simultaneous equations in n unknowns, which are commonly referred to as the normal equations. For example: $\sum 1 \cdot 1 \, b_0 + \sum 1 \cdot x_1 \, b_1 + \sum 1 \cdot x_2 \, b_2 = \sum 1 \cdot y$; $\sum x_1 \cdot 1 \, b_0 + \sum x_1^2 \, b_1 + \sum x_1 x_2 \, b_2 = \sum x_1 y$; $\sum x_2 \cdot 1 \, b_0 + \sum x_2 x_1 \, b_1 + \sum x_2^2 \, b_2 = \sum x_2 y$. The equations above have been presented in a way that gives a hint to how they can be solved using matrix algebra (i.e., by first computing the extended Sum-of-Squares-and-Cross-Products (SSCP) matrix for the constant 1 and the variables x1, x2 and y). By doing this one gets all of the terms in the equation. Teradata Warehouse Miner offers the Build Matrix function to build the SSCP matrix directly in the Teradata database using generated SQL. The linear regression module then reads this matrix from metadata results tables and performs the necessary calculations to solve for the least-squares b-coefficients. Therefore, that part of constructing a linear regression algorithm that requires access to the detail data is simply the building of the extended SSCP matrix (i.e., include the constant 1 as the first variable), and the rest is calculated on the client machine. There is however much more to linear regression analysis than building a model (i.e., calculating the least-squares values of the b-coefficients). Other aspects such as model diagnostics, stepwise model selection and scoring are described below. Model Diagnostics One of the advantages in using a statistical modeling technique such as linear regression (as opposed to a machine learning technique, for example) is the ability to compute rigorous, well-understood measurements of the effectiveness of the model. Most of these measurements are based upon a huge body of work in the areas of probability and probability theory.
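Before turning to the individual diagnostics, the normal-equations solution described above can be sketched as follows (illustrative only; the product builds the extended SSCP matrix in the database with generated SQL, whereas this sketch uses hypothetical in-memory data). The key point is that once the extended SSCP matrix over the constant 1, the independent variables and y is available, the least-squares b-coefficients follow from its sub-blocks without touching the detail data again.

import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 0.5 * x1 - 1.5 * x2 + rng.normal(scale=0.1, size=n)

# Extended SSCP matrix over (1, x1, x2, y): constant first, dependent variable last.
Z = np.column_stack([np.ones(n), x1, x2, y])
sscp = Z.T @ Z

# Normal equations (X'X) b = X'y, where X'X and X'y are sub-blocks of the SSCP matrix.
xtx = sscp[:3, :3]
xty = sscp[:3, 3]
b = np.linalg.solve(xtx, xty)
print(b)   # approximately [2.0, 0.5, -1.5]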
Goodness of fit Several model diagnostics are provided to give an assessment of the effectiveness of the overall model. One of these is called the residual sums of squares or sum of squared errors RSS, which is simply the sum of the squared differences between the dependent variable y estimated by the model and the actual value of y, over all of the rows: $RSS = \sum (y - \hat{y})^2$. Now suppose a similar measure was created based on a naive estimate of y, namely the mean value $\bar{y}$: $TSS = \sum (y - \bar{y})^2$, often called the total sums of squares about the mean. Then, a measure of the improvement of the fit given by the linear regression model is given by: $R^2 = \frac{TSS - RSS}{TSS}$. This is called the squared multiple correlation coefficient R2, which has a value between 0 and 1, with 1 indicating the maximum improvement in fit over estimating y naively with the mean value of y. The multiple correlation coefficient R is actually the correlation between the real y values and the values predicted based on the independent x variables, sometimes written $R_{y \cdot x_1 x_2 \cdots x_n}$, which is calculated here simply as the positive square root of the R2 value. A variation of this measure adjusted for the number of observations and independent variables in the model is given by the adjusted R2 value: $R_{adj}^2 = 1 - \frac{n-1}{n-p-1}\left(1 - R^2\right)$ where n is the number of observations and p is the number of independent variables (substitute n-p in the denominator if there is no constant term). The numerator in the equation for R2, namely TSS - RSS, is sometimes called the due-to-regression sums of squares or DRS. Another way of looking at this is that the total unexplained variation about the mean TSS is equal to the variation due to regression DRS plus the unexplained residual variation RSS. This leads to an equation sometimes known as the fundamental equation of regression analysis: $\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$, which is the same as saying that TSS = DRS + RSS. From these values a statistical test called an F-test can be made to determine if all the x variables taken together explain a significant amount of variation in y. This test is carried out on the F-ratio given by: $F = \frac{meanDRS}{meanRSS}$. The values meanDRS and meanRSS are calculated by dividing DRS and RSS by their respective degrees of freedom (p for DRS and n-p-1 for RSS). Standard errors and confidence intervals Measurements are made of the standard deviation of the sampling distribution of each b-coefficient value, and from this, estimates of a confidence interval for each of the coefficients are made. For example, if one of the coefficients has a value of 6, and a 95% confidence interval of 5 to 7, it can be said that the true population coefficient is contained in this interval, with a confidence coefficient of 95%. In other words, if repeated samples were taken of the same size from the population, then 95% of the intervals like the one constructed here would contain the true value for the population coefficient. Another set of useful statistics is calculated as the ratio of each b-coefficient value to its standard error. This statistic is sometimes called a T-statistic or Wald statistic. Along with its associated t-distribution probability value, it can be used to assess the statistical significance of this term in the model.
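A minimal sketch of the goodness-of-fit quantities defined above (illustrative only; the function name and arguments are hypothetical): given actual values y, fitted values y_hat and the number of independent variables p, it returns R2, adjusted R2 and the F-ratio.

import numpy as np

def goodness_of_fit(y, y_hat, p):
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)                   # residual sums of squares
    tss = np.sum((y - y.mean()) ** 2)                # total sums of squares about the mean
    drs = tss - rss                                  # due-to-regression sums of squares
    r_squared = drs / tss
    adj_r_squared = 1.0 - (n - 1) / (n - p - 1) * (1.0 - r_squared)
    f_ratio = (drs / p) / (rss / (n - p - 1))        # meanDRS / meanRSS
    return r_squared, adj_r_squared, f_ratio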
Standardized coefficients The least-squares estimates of the b-coefficients are converted to so-called beta-coefficients or standardized coefficients to give a model in terms of the z-scores of the independent variables. That is, the entire model is recast to use standardized values of the variables and the coefficients are recomputed accordingly. Standardized values cast each variable into units measuring the number of standard deviations away from the mean value for that variable. The advantage of doing this is that the values of the coefficients are scaled equivalently so that their relative importance in the model can be more easily seen. Otherwise the coefficient for a variable such as income would be difficult to compare to a variable such as age or the number of years an account has been open. Incremental R-squared It is possible to calculate the value R2 incrementally by considering the cumulative contributions of x variables added to the model one at a time, namely $R_{y \cdot x_1}$, $R_{y \cdot x_1 x_2}$, ..., $R_{y \cdot x_1 x_2 \cdots x_n}$. These are called incremental R2 values, and they give a measure of how much the addition of each x variable contributes to explaining the variation in y in the observations. This points out the fact that the order in which the independent x variables are specified in creating the model is important. Multiple Correlation Coefficients Another measure that can be computed for each independent variable in the model is the squared multiple correlation coefficient with respect to the other independent variables in the model taken together. These values range from 0 to 1, with 0 indicating a lack of correlation and 1 indicating the maximum correlation. Multiple correlation coefficients are sometimes presented in related forms such as variance inflation factors or tolerances. A variance inflation factor is given by the formula $V_k = \frac{1}{1 - R_k^2}$, where $V_k$ is the variance inflation factor and $R_k^2$ is the squared multiple correlation coefficient for the kth independent variable. Tolerance is given by the formula $T_k = 1 - R_k^2$, where $T_k$ is the tolerance of the kth independent variable and $R_k^2$ is as before. These values may be of limited value as indicators of possible collinearity or near dependencies among variables in the case of high correlation values, but the absence of high correlation values does not necessarily indicate the absence of collinearity problems. Further, multiple correlation coefficients are unable to distinguish between several near dependencies should they exist. The reader is referred to [Belsley, Kuh and Welsch] for more information on collinearity diagnostics, as well as to the upcoming section on the subject. Data Quality Reports A variety of data quality reports are available with the Teradata Warehouse Miner Linear Regression algorithm. Reports include: 1 Constant Variables 2 Variable Statistics 3 Detailed Collinearity Diagnostics (• Eigenvalues of Unit Scaled X'X • Condition Indices • Variance Proportions) 4 Near Dependency Constant Variables Before attempting to build a model the algorithm checks to see if any variables in the model have a constant value. This check is based on the standard deviation values derived from the SSCP matrix input to the algorithm. If a variable with a constant value (i.e., a standard deviation of zero) is detected, the algorithm stops and notifies the user while producing a Constant Variables Table report.
After reading this report, the user may then remove the variables in the report from the model and execute the algorithm again. It is possible that a variable may appear in the Constant Variables Table report that does not actually have a constant value in the data. This can happen when a column has extremely large values that are close together in value. In this case the standard deviation will appear to be zero due to precision loss and will be rejected as a constant column. The remedy for this is to re-scale the values in the column prior to building a matrix or doing the analysis. The ZScore or the Rescale transformation functions may be used for this purpose. Variable Statistics The user may optionally request that a Variables Statistics Report be provided, giving the mean value and standard deviation of each variable in the model based on the SSCP matrix provided as input. Detailed Collinearity Diagnostics One of the conditions that can lead to a poor linear regression model is when the independent variables in the model are not independent of each other, that is, when they are collinear (highly correlated) with one another. Collinearity can be loosely defined as a condition where one variable is nearly a linear combination of one or more other variables, sometimes also called a near dependency. This leads to an ill conditioned matrix of variables. Teradata Warehouse Miner provides an optional Detailed Collinearity Diagnostics report using a specialized technique described in [Belsley, Kuh and Welsch]. This technique involves performing a singular value decomposition of the independent x variables in the model in order to measure collinearity. The analysis proceeds roughly as follows. In order to put all variables on an equal footing, the data is scaled so that each variable adds up to 1 when summed over all the observations or rows. In order to calculate the singular values of X (the rows of X are the observations), the Teradata Warehouse Miner User Guide - Volume 3 97 Chapter 1: Analytic Algorithms Linear Regression mathematically equivalent square root of the eigenvalues of XTX are computed instead for practical reasons. The condition index of each eigenvalue is calculated as the square root of the ratio of the largest eigenvalue to the given eigenvalue, a value always 1 or greater. The variance decomposition of these eigenvalues is computed using the eigenvalues together with the eigenvectors associated with them. The result is a matrix giving, for each variable, the proportion of variance associated with each eigenvalue. Large condition indices indicate a probable near dependency. A value of 10 may indicate a weak dependency, values of 15 to 30 may be considered a borderline dependency, above 30 worth investigating further, and above 100, a potentially damaging collinearity. As a rule of thumb, an eigenvalue with a condition index greater than 30 and an associated variance proportion of greater than 50% with two or more model variables implies that a collinearity problem exists. (The somewhat subjective conclusions described here and the experiments they are based on are described in detail in [Belsley, Kuh and Welsch]). An example of the Detailed Collinearity Diagnostics report is given below. 
Table 26: Eigenvalues of Unit Scaled X'X Factor 1 5.2029 Factor 2 .8393 Factor 3 .5754 Factor 4 .3764 Factor 5 4.1612E-03 Factor 6 1.8793E-03 Factor 7 2.3118E-08 Table 27: Condition Indices 98 Factor 1 1 Factor 2 2.4898 Factor 3 3.007 Factor 4 3.718 Factor 5 35.3599 Factor 6 52.6169 Factor 7 15001.8594 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression Table 28: Variable Name Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 Factor 6 Factor 7 CONSTANT 1.3353E-09 1.0295E-08 1.3781E-09 1.6797E-08 1.1363E-11 2.1981E-07 1 cust_id 1.3354E-09 1.0296E-08 1.3782E-09 1.6799E-08 1.1666E-11 2.2068E-07 1 income 2.3079E-04 1.8209E-03 1.6879E-03 1.1292E-03 .9951 4.4773E-06 1.2957E-05 age 1.0691E-04 1.9339E-04 9.321E-05 1.7896E-03 1.56E-05 .9963 1.4515E-03 children 2.9943E-03 4.4958E-02 .2361 1.6499E-03 3.6043E-04 .713 9.1708E-04 combo1 2.3088E-04 1.8703E-03 1.6658E-03 1.1339E-03 .995 1.0973E-04 2.3525E-05 combo2 1.4002E-04 3.1477E-05 4.4942E-05 5.0407E-03 4.7784E-06 .9935 1.2583E-03 Near Dependency In addition to or in place of the Detailed Collinearity Diagnostics report, the user may optionally request a Near Dependency report based on the automated application of the specialized criteria used in the aforementioned report. Requesting the Near Dependency report greatly simplifies the search for collinear variables or near dependencies in the data. The user may specify the threshold value for the condition index (by default 30) and the variance proportion (by default 0.5) such that a near dependency is reported. That is, if two or more variables have a variance proportion greater than the variance proportion threshold, for a condition index with value greater than the condition index threshold, the variables involved in the near dependency are reported along with their variance proportions, their means and their standard deviations. Near dependencies are reported in descending order based on their condition index value, and variables contributing to a near dependency are reported in descending order based on their variance proportion. The following is an example of a Near Dependency report. Table 29: Near Dependency report (example) Variable Name Factor Condition Index Variance Proportion Mean Standard Deviation CONSTANT 7 15001.8594 1 * * cust_id 7 15001.8594 1 1362987.891 293.5012 age 6 52.6169 .9963 33.744 22.3731 combo2 6 52.6169 .9935 25.733 23.4274 children 6 52.6169 .713 .534 1.0029 income 5 35.3599 .9951 16978.026 21586.8442 combo1 5 35.3599 .995 33654.602 43110.862 Teradata Warehouse Miner User Guide - Volume 3 99 Chapter 1: Analytic Algorithms Linear Regression Stepwise Linear Regression Automated stepwise regression analysis is a technique to aid in regression model selection. That is, it helps in deciding which independent variables to include in a regression model. If there are only two or three independent variables under consideration, one could try all possible models. But since there are 2k - 1 models that can be built from k variables, this quickly becomes impractical as the number of variables increases (32 variables yield more than 4 billion models!). The automated stepwise procedures described below can provide insight into the variables that should be included in a regression model. It is not recommended that stepwise procedures be the sole deciding factor in the makeup of a model. For one thing, these techniques are not guaranteed to produce the best results. 
And sometimes, variables should be included because of certain descriptive or intuitive qualities, or excluded for subjective reasons. Therefore an element of human decision-making is recommended to produce a model with useful business application. Forward-Only Stepwise Linear Regression The forward only procedure consists solely of forward steps as described below, starting without any independent x variables in the model. Forward steps are continued until no variables can be added to the model. Forward Stepwise Linear Regression The forward stepwise procedure is a combination of the forward and backward steps described below, starting without any independent x variables in the model. One forward step is followed by one backward step, and these single forward and backward steps are alternated until no variables can be added or removed. Backward-Only Stepwise Linear Regression The backward only procedure consists solely of backward steps as described below, starting with all of the independent x variables in the model. Backward steps are continued until no variables can be removed from the model. Backward Stepwise Linear Regression The backward stepwise procedure is a combination of the backward and forward steps as described below, starting with all of the independent x variables in the model. One backward step is followed by one forward step, and these single backward and forward steps are alternated until no variables can be added or removed. Stepwise Linear Regression - Forward Step Each forward step seeks to add the independent variable x that will best contribute to explaining the variance in the dependent variable y. In order to do this a quantity called the partial F statistic must be computed for each xi variable that can be added to the model. A quantity called the extra sums of squares is first calculated as follows: ESS = “DRS with xi” − “DRS w/o xi”, where DRS is the regression sums of squares or “due-to-regression sums of squares”. Then, the partial F statistic is given by f(xi) = ESS(xi) / meanRSS(xi) where meanRSS is the Residual Mean Square. Each forward step then consists of adding the variable with the largest partial F statistic providing it is greater than the criterion to enter value. An equivalent alternative to using the partial F statistic is to use the probability or P-value associated with the T-statistic mentioned earlier under model diagnostics. The t statistic is the ratio of the b-coefficient to its standard error. Teradata Warehouse Miner offers both alternatives as an option. When the P-value is used, a forward step consists of adding the variable with the smallest P-value providing it is less than the criterion to enter. In this case, if more than one variable has a P-value of 0, the variable with the largest F statistic is entered. Stepwise Linear Regression - Backward Step Each backward step seeks to remove the independent variable xi that least contributes to explaining the variance in the dependent variable y. The partial F statistic is calculated for each independent x variable in the model. If the smallest value is less than the criterion to remove, it is removed. As with forward steps, an option is provided to use the probability or P-value associated with the T-statistic, that is, the ratio of the b-coefficient to its standard error.
In this case all the probabilities or P-values are calculated for the variables currently in the model at one time, and the one with the largest P-value is removed if it is greater than the criterion to remove. Linear Regression and Missing Data Null values for columns in a linear regression analysis can adversely affect results. It is recommended that the listwise deletion option be used when building the input matrix with the Build Matrix function. This ensures that any row for which one of the columns is null will be left out of the matrix computations completely. Another strategy is to use the Recoding transformation function to build a new column, substituting a fixed known value for null values. Yet another option is to use one of the analytic algorithms in Teradata Warehouse Miner to estimate replacement values for null values. This technique is often called missing value imputation. Initiate a Linear Regression Function Use the following procedure to initiate a new Linear Regression analysis in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 56: Add New Analysis from toolbar Teradata Warehouse Miner User Guide - Volume 3 101 Chapter 1: Analytic Algorithms Linear Regression 2 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Linear Regression: Figure 57: Add New Analysis dialog 3 This will bring up the Linear Regression dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Linear Regression - INPUT - Data Selection On the Linear Regression dialog click on INPUT and then click on data selection: Figure 58: Linear Regression > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input, Table, Matrix or Analysis. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. (In this case a matrix will be dynamically built and discarded when the algorithm completes execution). By selecting the Input Source Matrix the user may can select from available matrices created by the Build Matrix function. This has the advantage that the matrix selected for input is available for further analysis after completion of the algorithm, perhaps selecting a different subset of columns from the matrix. 102 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression By selecting the Input Source Analysis the user can select directly from the output of another analysis of qualifying type in the current project. (In this case a matrix will be dynamically built and discarded when the algorithm completes execution). Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). 2 Select Columns From One Table • Available Databases (only for Input Source equal to Table) — All the databases which are available for the Linear Regression analysis. 
• Available Matrices (only for Input Source equal to Matrix) — When the Input Source is Matrix, a matrix must first be built with the Build Matrix function before linear regression can be performed. Select the matrix that summarizes the data to be analyzed. (The matrix must have been built with more rows than selected columns or the Linear Regression analysis will produce a singular matrix, causing a failure). • Available Analyses (only for Input Source equal to Analysis) — All the analyses that are available for the Linear Regression analysis. • Available Tables (only for Input Source equal to Table or Analysis) — All the tables that are available for the Linear Regression analysis. • Available Columns — All the columns that are available for the Linear Regression analysis. • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can either insert columns as Dependent or Independent columns. Make sure you have the correct portion of the window highlighted. The Dependent variable column is the column whose value is being predicted by the linear regression model. The algorithm requires that the Dependent and Independent columns must be of numeric type (or contain numbers in character format). Linear Regression - INPUT - Analysis Parameters On the Linear Regression dialog click on INPUT and then click on analysis parameters: Figure 59: Linear Regression > Input > Analysis Parameters On this screen select: • Regression Options • Include Constant — This option specifies that the linear regression model should include a constant term. With a constant, the linear equation can be thought of as: $\hat{y} = b_0 + b_1 x_1 + \cdots + b_n x_n$. Without a constant, the equation changes to: $\hat{y} = b_1 x_1 + \cdots + b_n x_n$. • Stepwise Options — The Linear Regression analysis can use the stepwise technique to automatically determine a variable’s importance (or lack thereof) to a particular model. If selected, the algorithm is performed repeatedly with various combinations of independent variable columns to attempt to arrive at a final “best” model. The stepwise options are: Step Direction — (Selecting “None” turns off the Stepwise option). • Forward Only — Option to add qualifying independent variables one at a time. • Forward — Option for independent variables being added one at a time to an empty model, possibly removing a variable after a variable is added. • Backward Only — Option to remove independent variables one at a time. • Backward — Option for variables being removed from an initial model containing all of the independent variables, possibly adding a variable after a variable is removed. Step Method • F Statistic — Option to choose the partial F test statistic (F statistic) as the basis for adding or removing model variables. • P-value — Option to choose the probability associated with the T-statistic (P-value) as the basis for adding or removing model variables. • Criterion to Enter • Criterion to Remove — If the step method is to use the F statistic, then an independent variable is only added to the model if the F statistic is greater than the criterion to enter and removed if it is less than the criterion to remove. When the F statistic is used, the default for each is 3.84.
If the step method is to use the P-value, then an independent variable is added to the model if the P-value is less than the criterion to enter and removed if it is greater than the criterion to remove. When the P-value is used, the default for each is 0.05. The default F statistic criteria of 3.84 corresponds to a P-value of 0.05. These default values are provided with the assumption that the input variables are somewhat correlated. If this is not the case, a lower F statistic or higher P-value criteria can be used. Also, a higher F statistic or lower P value can be specified if more stringent criteria are desired for including variables in a model. • 104 Report Options — Statistical diagnostics can be taken on each variable during the execution of the Linear Regression Analysis. These diagnostics include: Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression • Variable Statistics — This report gives the mean value and standard deviation of each variable in the model based on the SSCP matrix provided as input. • Near Dependency — This report lists collinear variables or near dependencies in the data based on the SSCP matrix provided as input. Condition Index Threshold — Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The one that involves this parameter is the occurrence of a large condition index value associated with a specially constructed principal factor. If a factor has a condition index greater than this parameter’s value, it is a candidate for the Near Dependency report. A default value of 30 is used as a rule of thumb. Variance Proportion Threshold — Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The one that involves this parameter is when two or more variables have a variance proportion greater than this threshold value for a factor with a high condition index. Another way of saying this is that a ‘suspect’ factor accounts for a high proportion of the variance of two or more variables. This parameter defines what a high proportion of variance is. A default value of 0.5 is used as a rule of thumb. • Detailed Collinearity Diagnostics — This report provides the details behind the Near Dependency report, consisting of the “Eigenvalues of Unit Scaled X’X”, “Condition Indices” and “Variance Proportions” tables. Linear Regression - OUTPUT On the Linear Regression dialog click on OUTPUT: Figure 60: Linear Regression > OUTPUT On this screen select: • Store the variables table of this analysis in the database — Check this box to store the model variables table of this analysis in the database. • Database Name — The name of the database to create the output table in. • Output Table Name — The name of the output table. • Advertise Output — The Advertise Output option “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. 
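Although Teradata Warehouse Miner provides its own facilities for scoring data with a model, it can help to see that the stored variables table is simply a table of coefficients that could, in principle, be applied by hand. The sketch below is illustrative only and is not the product's scoring function: it assumes the tutorial output table twm_results.test2 shown in the example that follows, and the input table and column names (twm_source.customer_analysis, cust_id, avg_cc_bal, income) are hypothetical stand-ins for whatever data you would actually score.

-- Hand-rolled illustration only: apply the stored B Coefficients to reproduce part of
-- the fitted linear equation y-hat = b0 + b1*x1 + ... + bn*xn.
-- Table and column names other than twm_results.test2 are hypothetical.
SELECT c.cust_id
     , con."B Coefficient"                      /* b0, the (Constant) row            */
     + bal."B Coefficient" * c.avg_cc_bal       /* b1 * x1                           */
     + inc."B Coefficient" * c.income           /* b2 * x2                           */
       AS predicted_cc_rev
FROM twm_source.customer_analysis c
CROSS JOIN twm_results.test2 con
CROSS JOIN twm_results.test2 bal
CROSS JOIN twm_results.test2 inc
WHERE con."Column Name" = '(Constant)'
  AND bal."Column Name" = 'avg_cc_bal'
  AND inc."Column Name" = 'income';

Only the constant and two model terms are shown to keep the sketch short; a complete query would add one term for each remaining row of the variables table (in the tutorial model: ckacct, married, avg_sv_tran_cnt, nbr_children and years_with_bank).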
By way of an example, the tutorial example creates the following output table:
Table 30: Linear Regression output table (tutorial example)
Column Name | B Coefficient | Standard Error | T Statistic | P-Value | Lower | Upper | Standard Coefficient | Incremental R-Squared | SqMultiCorrCoef(1-Tolerance)
(Constant) | -6.4640 | 0.9749 | -6.6301 | 0.0000 | -8.3780 | -4.5500 | 0.0000 | 0.0000 | N/A
avg_cc_bal | -0.0174 | 0.0004 | -41.3942 | 0.0000 | -0.0182 | -0.0166 | -0.6382 | 0.7556 | 0.3135
income | 0.0005 | 0.0000 | 24.5414 | 0.0000 | 0.0005 | 0.0005 | 0.3777 | 0.8462 | 0.3110
ckacct | 10.2793 | 0.8162 | 12.5947 | 0.0000 | 8.6770 | 11.8815 | 0.1703 | 0.8732 | 0.1073
married | -4.3056 | 0.8039 | -5.3558 | 0.0000 | -5.8838 | -2.7273 | -0.0718 | 0.8766 | 0.0933
avg_sv_tran_cnt | -0.7746 | 0.2777 | -2.7887 | 0.0054 | -1.3198 | -0.2293 | -0.0360 | 0.8779 | 0.0207
nbr_children | 0.8994 | 0.3718 | 2.4187 | 0.0158 | 0.1694 | 1.6294 | 0.0331 | 0.8787 | 0.1312
years_with_bank | 0.2941 | 0.1441 | 2.0404 | 0.0417 | 0.0111 | 0.5771 | 0.0263 | 0.8794 | 0.0168
If Database Name is twm_results and Output Table Name is test2, the output table is defined as:
CREATE SET TABLE twm_results.test2 (
"Column Name" VARCHAR(30) CHARACTER SET UNICODE NOT CASESPECIFIC,
"B Coefficient" FLOAT,
"Standard Error" FLOAT,
"T Statistic" FLOAT,
"P-Value" FLOAT,
"Lower" FLOAT,
"Upper" FLOAT,
"Standard Coefficient" FLOAT,
"Incremental R-Squared" FLOAT,
"SqMultiCorrCoef(1-Tolerance)" FLOAT)
UNIQUE PRIMARY INDEX ( "Column Name" );
Run the Linear Regression
After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Linear Regression
The results of running the Teradata Warehouse Miner Linear Regression analysis include a variety of statistical reports on the individual variables and the generated model, as well as bar charts displaying coefficients and T-statistics. All of these results are outlined below.
Linear Regression - RESULTS
On the Linear Regression dialog, click on RESULTS (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed) to view results. Result options are as follows:
Linear Regression Reports
Data Quality Reports
• Variable Statistics — If selected on the Results Options tab, this report gives the mean value and standard deviation of each variable in the model based on the SSCP matrix provided as input.
• Near Dependency — If selected on the Results Options tab, this report lists collinear variables or near dependencies in the data based on the SSCP matrix provided as input. Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The first is the occurrence of a large condition index value associated with a specially constructed principal factor. If a factor has a condition index greater than the parameter specified on the Results Options tab, it is a candidate for the Near Dependency report. The other is when two or more variables have a variance proportion greater than a threshold value for a factor with a high condition index. Another way of saying this is that a 'suspect' factor accounts for a high proportion of the variance of two or more variables. The parameter that defines what a high proportion of variance is, is also set on the Results Options tab; its default value is 0.5.
• Detailed Collinearity Diagnostics — If selected on the Results Options tab, this report provides the details behind the Near Dependency report, consisting of the following tables. • Eigenvalues of Unit Scaled X'X — Report of the eigenvalues of all variables scaled so that each variable adds up to 1 when summed over all the observations or rows. In order to calculate the singular values of X (the rows of X are the observations), the mathematically equivalent square root of the eigenvalues of XTX are computed instead for practical reasons. • Condition Indices — The condition index of each eigenvalue, calculated as the square root of the ratio of the largest eigenvalue to the given eigenvalue, a value always 1 or greater. • Variance Proportions — The variance decomposition of these eigenvalues is computed using the eigenvalues together with the eigenvectors associated with them. The result is a matrix giving, for each variable, the proportion of variance associated with each eigenvalue. Linear Regression Step N (Stepwise-only) • Linear Regression Model Assessment Teradata Warehouse Miner User Guide - Volume 3 107 Chapter 1: Analytic Algorithms Linear Regression • Squared Multiple Correlation Coefficient (R-squared) — This is the same value calculated for the Linear Regression report, but it is calculated here for the model as it stands at this step. The closer to 1 its value is, the more effective the model. • Standard Error of Estimate — This is the same value calculated for the Linear Regression report, but it is calculated here for the model as it stands at this step. • In Report — This report contains the same fields as the Variables in Model report (described below) with the addition of the following field. • F Stat — F Stat is the partial F statistic for this variable in the model, which may be used to decide its inclusion in the model. A quantity called the extra sums of squares is first calculated as follows: ESS = “DRS with x” - “DRS w/o”, where DRS is the Regression Sums of squares or “due-to-regression sums of squares”. Then the partial F statistic is given by F(xi) = ESS(xi) / meanRSS(xi), where meanRSS is the Residual Mean Square. • Out Report • Independent Variable — This is an independent variable not included in the model at this step. • P-Value — This is the probability associated with the T-statistic associated with each variable not in, or excluded from, the model, as described for the Variables in Model report as T Stat and P-Value. (Note that it is not the P-Value associated with F Stat). When the P-Value is used for step decisions, a forward step consists of adding the variable with the smallest P-value providing it is less than the criterion to enter. For backward steps, all the probabilities or P-values are calculated for the variables currently in the model at one time, and the one with the largest P-value is removed if it is greater than the criterion to remove. • F Stat — F Stat is the partial F statistic for this variable in the model, which may be used to decide its inclusion in the model. A quantity called the extra sums of squares is first calculated as follows: ESS = “DRS with xi” - “DRS w/o xi”, where DRS is the Regression Sums of squares or “due-to-regression sums of squares”. Then the partial F statistic is given by F(xi) = ESS(xi) / meanRSS(xi), where meanRSS is the Residual Mean Square. 
• Partial Correlation — The partial correlation coefficient for a variable not in the model is based on the square root of a measure called the coefficient of partial determination, which represents the marginal contribution of the variable to a model that doesn't include the variable. (Here, contribution to the model means reduction in the unexplained variation of the dependent variable). The formula for the partial correlation of the ith independent variable in the linear regression model built from all the independent variables is given by:

R_i = \sqrt{\frac{DRS - NDRS}{RSS}}

where DRS is the Regression Sums of squares for the model including those variables currently in the model, NDRS is the Regression Sums of squares for the current model without the ith variable, and RSS is the Residual Sums of squares for the current model.
Linear Regression Model
• Total Observations — This is the number of rows originally summarized in the SSCP matrix that the linear regression analysis is based on. The number of observations reflects the row count after any rows were eliminated by listwise deletion (recommended) when the matrix was built.
• Total Sums of squares — The so-called Total Sums of squares is given by the equation

TSS = \sum (y - \bar{y})^2

where y is the dependent variable that is being predicted and \bar{y} is its mean value. The Total Sums of squares is sometimes also called the total sums of squares about the mean. Of particular interest is its relation to the “due-to-regression sums of squares” and the “residual sums of squares” given by TSS = DRS + RSS. This is a shorthand form of what is sometimes known as the fundamental equation of regression analysis:

\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2

where y is the dependent variable, \bar{y} is its mean value and \hat{y} is its predicted value.
• Multiple Correlation Coefficient (R) — The multiple correlation coefficient R is the correlation between the real dependent variable y values and the values predicted based on the independent x variables, sometimes written R_{y \cdot x_1 x_2 \cdots x_n}, which is calculated in Teradata Warehouse Miner simply as the positive square root of the Squared Multiple Correlation Coefficient (R2) value.
• Squared Multiple Correlation Coefficient (R-squared) — The squared multiple correlation coefficient R2 is a measure of the improvement of the fit given by the linear regression model over estimating the dependent variable y naïvely with the mean value of y. It is given by:

R^2 = \frac{TSS - RSS}{TSS}

where TSS is the Total Sums of squares and RSS is the Residual Sums of squares. It has a value between 0 and 1, with 1 indicating the maximum improvement in fit over estimating y naïvely with the mean value of y.
• Adjusted R-squared — The adjusted R2 value is a variation of the Squared Multiple Correlation Coefficient (R2) that has been adjusted for the number of observations and independent variables in the model. Its formula is given by:

R_a^2 = 1 - \frac{n - 1}{n - p - 1}\,(1 - R^2)

where n is the number of observations and p is the number of independent variables (substitute n-p in the denominator if there is no constant term).
• Standard Error of Estimate — The standard error of estimate is calculated as the square root of the average squared residual value over all the observations, i.e.

\sqrt{\frac{\sum (y - \hat{y})^2}{n - p - 1}}

where y is the actual value of the dependent variable, \hat{y} is its predicted value, n is the number of observations, and p is the number of independent variables (substitute n-p in the denominator if there is no constant term).
• Regression Sums of squares — This is the “due-to-regression sums of squares” or DRS referred to in the description of the Total Sums of squares, where it is pointed out that TSS = DRS + RSS. It is also the middle term in what is sometimes known as the fundamental equation of regression analysis:

\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2

where y is the dependent variable, \bar{y} is its mean value and \hat{y} is its predicted value.
• Regression Degrees of Freedom — The Regression Degrees of Freedom is equal to the number of independent variables in the linear regression model. It is used in the calculation of the Regression Mean-Square.
• Regression Mean-Square — The Regression Mean-Square is simply the Regression Sums of squares divided by the Regression Degrees of Freedom. This value is also the numerator in the calculation of the Regression F Ratio.
• Regression F Ratio — A statistical test called an F-test is made to determine if all the independent x variables taken together explain a statistically significant amount of variation in the dependent variable y. This test is carried out on the F-ratio given by

F = \frac{meanDRS}{meanRSS}

where meanDRS is the Regression Mean-Square and meanRSS is the Residual Mean-Square. A large value of the F Ratio means that the model as a whole is statistically significant. (The easiest way to assess the significance of this term in the model is to check if the associated Regression P-Value is less than 0.05. However, the critical value of the F Ratio could be looked up in an F distribution table. This value is very roughly in the range of 1 to 3, depending on the number of observations and variables).
• Regression P-value — This is the probability or P-value associated with the statistical test on the Regression F Ratio. This statistical F-test is made to determine if all the independent x variables taken together explain a statistically significant amount of variation in the dependent variable y. A value close to 0 indicates that they do. The hypothesis being tested or null hypothesis is that the coefficients in the model are all zero except the constant term (i.e., all the corresponding independent variables together contribute nothing to the model). The P-value in this case is the probability, if the null hypothesis is true, of obtaining an F statistic as large as the one observed or larger. A right tail test on the F distribution is performed with a 5% significance level used by convention. If the P-value is less than the significance level (i.e., less than 0.05), the null hypothesis should be rejected (i.e., the coefficients taken together are significant and not all 0).
• Residual Sums of squares — The residual sums of squares or sum of squared errors RSS is simply the sum of the squared differences between the dependent variable estimated by the model and the actual value of y, over all of the rows: RSS = y – ŷ 2 • Residual Degrees of Freedom — The Residual Degrees of Freedom is given by n-p-1 where n is the number of observations and p is the number of independent variables (or np if there is no constant term). It is used in the calculation of the Residual Mean-Square. • Residual Mean-Square — The Residual Mean-Square is simply the Residual Sums of squares divided by the Residual Degrees of Freedom. This value is also the denominator in the calculation of the Regression F Ratio. Linear Regression Variables in Model Report • Dependent Variable — The dependent variable is the variable being predicted by the linear regression model. • Independent Variable — Each independent variable in the model is listed along with accompanying measures. Unless the user deselects the option Include Constant on the Regression Options tab of the input dialog, the first independent variable listed is CONSTANT, a fixed value representing the constant term in the linear regression model. • B Coefficient — Linear regression attempts to find the b-coefficients in the equation ŷ = b 0 + b 1 x 1 + b n x n in order to best predict the value of the dependent variable y based on the independent variables x1 to xn. The best values of the coefficients are defined to be the values that minimize the sum of squared error values y – ŷ 2 over all the observations. • Standard Error — This is the standard error of the B Coefficient term of the linear regression model, a measure of how accurate the B Coefficient term is over all the observations used to build the model. It is the basis for estimating a confidence interval for the B Coefficient value. • T Statistic — The T-statistic is the ratio of a B Coefficient value to its standard error (Std Error). Along with the associated t-distribution probability value or P-value, it can be used to assess the statistical significance of this term in the linear model. (The easiest way to assess the significance of this term in the model is to check if the Pvalue is less than 0.05. However, one could look up the critical T Stat value in a two-tailed T distribution table with probability .95 and degrees of freedom roughly the number of observations minus the number of variables. This would show that for all practical Teradata Warehouse Miner User Guide - Volume 3 111 Chapter 1: Analytic Algorithms Linear Regression purposes, if the absolute value of T Stat is greater than 2 the model term is statistically significant). • P-value — This is the t-distribution probability value associated with the T-statistic (T Stat), that is, the ratio of the b-coefficient value to its standard error (Std Error). It can be used to assess the statistical significance of this term in the linear model. A value close to 0 implies statistical significance and means this term in the model is important. The hypothesis being tested or null hypothesis is that the coefficient in the model is actually zero (i.e., the corresponding independent variable contributes nothing to the model). The P-value in this case is the probability that the null hypothesis is true and the given T-statistic has the absolute value it has or smaller. A two-tailed test on the tdistribution is performed with a 5% significance level used by convention. 
If the P-value is less than the significance level (i.e., less than 0.05), the null hypothesis should be rejected (i.e., the coefficient is statistically significant and not 0). • Squared Multiple Correlation Coefficient (R-squared) — The Squared Multiple Correlation Coefficient (Rk2) is a measure of the correlation of this, the kth variable with respect to the other independent variables in the model taken together. (This measure should not be confused with the R2 measure of the same name that applies to the model taken as a whole). The value ranges from 0 to 1 with 0 indicating a lack of correlation and 1 indicating the maximum correlation. It is not calculated for the constant term in the model. Multiple correlation coefficients are sometimes presented in related forms such as variance inflation factors or tolerances. The variance inflation factor is given by the formula: 1 V k = --------------21 – Rk where Vk is the variance inflation factor and Rk2 is the squared multiple correlation coefficient for the kth independent variable. Tolerance is given by the 2 formula T k = 1 – R k where Tk is the tolerance of the kth independent variable and Rk2 is as before. (Refer to the section Multiple Correlation Coefficients for details on the limitations of using this measure to detect collinearity problems in the data). • Lower — Lower is the lower value in the confidence interval for this coefficient and is based on its standard error value. For example, if the coefficient has a value of 6 and a confidence interval of 5 to 7, it means that according to the normal error distribution assumptions of the model, there is a 95% probability that the true population value of the coefficient is actually between 5 and 7. • Upper — Upper is the upper value in the confidence interval for this coefficient based on its standard error value. For example, if the coefficient has a value of 6 and a confidence interval of 5 to 7, it means that according to the normal error distribution assumptions of the model, there is a 95% probability that the true population value of the coefficient is actually between 5 and 7. 112 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression • Standard Coefficient — Standardized coefficients, sometimes called beta-coefficients, express the linear model in terms of the z-scores or standardized values of the independent variables. Standardized values cast each variable into units measuring the number of standard deviations away from the mean value for that variable. The advantage of examining standardized coefficients is that they are scaled equivalently, so that their relative importance in the model can be more easily seen. • Incremental R-squared — It is possible to calculate the model’s Squared Multiple Correlation value incrementally by considering the cumulative contributions of x variables added to the model one at a time, namely R y x R y x x R y x x x . 1 1 2 1 2 n These are called Incremental R2 values, and they give a measure of how much the addition of each x variable contributes to explaining the variation in y in the observations. Linear Regression Graphs The Linear Regression Analysis can display the coefficients and/or T-statistics of the resultant model. Weights Graph This graph displays the relative magnitudes of the standardized coefficients and/or the Tstatistic associated with each standardized coefficient in the linear regression model. The sign, positive or negative, is portrayed by the colors red or blue respectively. 
The user may scroll to the left or right to see all the variables in the model. The T-statistic is the ratio of the coefficient value to its standard error, so the larger its value the more reliable the value of the coefficient is. The following options are available on the Graphics Options tab on the Linear Weights graph: • Graph Type — The following can be graphed by the Linear Weights Graph • T Statistic — Display the T Statistics on the bar chart. • Standardized Coefficient — Display the Standardized Coefficients on the bar chart. • Vertical Axis — The user may request multiple vertical axes in order to display separate coefficient values that are orders of magnitude different from the rest of the values. If the coefficients are of roughly the same magnitude, this option is grayed out. • Single — Display the Standardized Coefficients or T Statistics on single axis on the bar chart. • Multiple — Display the Standardized Coefficients or T Statistics on dual axes on the bar chart. Tutorial - Linear Regression Parameterize a Linear Regression Analysis as follows: • Available Matrices — Customer_Analysis_Matrix • Dependent Variable — cc_rev • Independent Variables • income — age • years_with_bank — nbr_children Teradata Warehouse Miner User Guide - Volume 3 113 Chapter 1: Analytic Algorithms Linear Regression • female — single • married — separated • ccacct — ckacct • svacct — avg_cc_bal • avg_ck_bal — avg_sv_bal • avg_cc_tran_amt — avg_cc_tran_cnt • avg_ck_tran_amt — avg_ck_tran_cnt • avg_sv_tran_amt — avg_sv_tran_cnt • Include Constant — Enabled • Step Direction — Forward • Step Method — F Statistic • Criterion to Enter — 3.84 • Criterion to Remove — 3.84 Run the analysis, and click on Results when it completes. For this example, the Linear Regression Analysis generated the following pages. A single click on each page name populates Results with the item. Table 31: Linear Regression Report Total Observations: 747 Total Sum of Squares: 6.69E5 Multiple Correlation Coefficient (R): 0.9378 Squared Multiple Correlation Coefficient (1-Tolerance): 0.8794 Adjusted R-Squared: 0.8783 Standard Error of Estimate: 1.04E1 Table 32: Regression vs. Residual Sum of Squares Degrees of Freedom Mean-Square F Ratio P-value Regression 5.88E5 7 8.40E4 769.8872 0.0000 Residual 8.06E4 739 1.09E2 N/A N/A Table 33: Execution Status 114 6/20/2004 2:07:28 PM Getting Matrix 6/20/2004 2:07:28 PM Stepwise Regression Running... 
6/20/2004 2:07:28 PM Step 0 Complete Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression Table 33: Execution Status 6/20/2004 2:07:28 PM Step 1 Complete 6/20/2004 2:07:28 PM Step 2 Complete 6/20/2004 2:07:28 PM Step 3 Complete 6/20/2004 2:07:28 PM Step 4 Complete 6/20/2004 2:07:28 PM Step 5 Complete 6/20/2004 2:07:28 PM Step 6 Complete 6/20/2004 2:07:28 PM Step 7 Complete 6/20/2004 2:07:29 PM Creating Report Table 34: Variables Column Name B Standard Coefficient Error T Statistic P-value Lower Upper Standard Incremental Coefficient R Squared Multiple Correlation Coefficient (1Tolerance) (Constant) -6.4640 0.9749 -6.6301 0.0000 -8.3780 -4.5500 0.0000 0.0000 N/A avg_cc_bal -0.0174 0.0004 -41.3942 0.0000 -0.0182 -0.0166 -0.6382 0.7556 0.3135 income 0.0005 0.0000 24.5414 0.0000 0.0005 0.0005 0.3777 0.8462 0.3110 ckacct 10.2793 0.8162 12.5947 0.0000 8.6770 11.8815 0.1703 0.8732 0.1073 married -4.3056 0.8039 -5.3558 0.0000 -5.8838 -2.7273 -0.0718 0.8766 0.0933 avg_sv_ tran_cnt -0.7746 0.2777 -2.7887 0.0054 -1.3198 -0.2293 -0.0360 0.8779 0.0207 nbr_ children 0.8994 0.3718 2.4187 0.0158 0.1694 1.6294 0.0331 0.8787 0.1312 years_with_ 0.2941 bank 0.1441 2.0404 0.0417 0.0111 0.5771 0.0263 0.8794 0.0168 Step 0 Table 35: Out Independent Variable P-value F Stat age 0.0000 19.7680 avg_cc_bal 0.0000 2302.7983 avg_cc_tran_amt 0.0000 69.5480 Teradata Warehouse Miner User Guide - Volume 3 115 Chapter 1: Analytic Algorithms Linear Regression Table 35: Out Independent Variable P-value F Stat avg_cc_tran_cnt 0.0000 185.3197 avg_ck_bal 0.0000 116.5094 avg_ck_tran_amt 0.0000 271.3578 avg_ck_tran_cnt 0.0002 13.9152 avg_sv_bal 0.0000 37.8598 avg_sv_tran_amt 0.0000 76.1104 avg_sv_tran_cnt 0.7169 0.1316 ccacct 0.1754 1.8399 ckacct 0.0000 105.5843 female 0.5404 0.3751 income 0.0000 647.3239 married 0.8937 0.0179 nbr_children 0.0000 30.2315 separated 0.0000 28.7618 single 0.0000 17.1850 svacct 0.0001 15.7289 years_with_bank 0.1279 2.3235 Step 1 Table 36: Model Assessment Squared Multiple Correlation Coefficient (1-Tolerance) 0.7556 Standard Error of Estimate 14.8111 Table 37: Columns In (Part 1) Independent Variable B Coefficient Standard Error T Statistic P-value avg_cc_bal -0.0237 0.0005 -47.9875 0.0000 116 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression Table 38: Columns In (Part 2) Independent Variable B Coefficient Lower Upper F Stat avg_cc_bal -0.0237 -0.0247 -0.0227 2302.7983 Incremental R2 0.7556 Table 39: Columns In (Part 3) Independent Variable B Coefficient Standard Coefficient Squared Multiple Correlation Coefficient (1-Tolerance) avg_cc_bal -0.0237 -0.8692 0.0000 Table 40: Columns Out Independent Variable P-value F Stat Partial Correlation age 0.0539 3.7287 0.0708 avg_cc_tran_amt 0.0000 27.4695 0.1921 avg_cc_tran_cnt 0.2346 1.4153 0.0436 avg_ck_bal 0.0000 17.1826 0.1520 avg_ck_tran_amt 0.0000 94.9295 0.3572 avg_ck_tran_cnt 0.4712 0.5198 0.0264 avg_sv_bal 0.0083 6.9952 0.0970 avg_sv_tran_amt 0.0164 5.7848 0.0882 avg_sv_tran_cnt 0.1314 2.2807 0.0554 ccacct 0.8211 0.0512 0.0083 ckacct 0.0000 41.3084 0.2356 female 0.3547 0.8575 0.0340 income 0.0000 438.7799 0.7680 married 0.4812 0.4967 0.0258 nbr_children 0.0000 30.4645 0.2024 separated 0.0004 12.8680 0.1315 single 0.0024 9.3169 0.1119 svacct 0.0862 2.9523 0.0630 years_with_bank 0.3407 0.9090 0.0350 Teradata Warehouse Miner User Guide - Volume 3 117 Chapter 1: Analytic Algorithms Linear Regression Linear Weights Graph By default, the Linear Weights graph displays the 
relative magnitudes of the T-statistic associated with each coefficient in the linear regression model: Figure 61: Linear Regression Tutorial: Linear Weights Graph Select the Graphics Options tab and change the Graph Type to Standardized Coefficient to view the standardized coefficient values. Although not generated automatically, a Scatter Plot is useful for analyzing the model built with the Linear Regression analysis. As an example, a scatter plot is brought up to look at the dependent variable (“cc_rev”), with the first two independent variables that made it into the model (“avg_cc_bal,” “income”). Create a new Scatter Plot analysis, and pick these three variables in the Selected Tables and Columns option. The results are shown first in two dimensions (avg_cc_bal and cc_rev), and then with all three: 118 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression Figure 62: Linear Regression Tutorial: Scatter Plot (2d) Figure 63: Linear Regression Tutorial: Scatter Plot (3d) Teradata Warehouse Miner User Guide - Volume 3 119 Chapter 1: Analytic Algorithms Logistic Regression Logistic Regression Overview In many types of regression problems, the response variable or dependent variable to be predicted has only two possible outcomes. For example, will the customer buy the product in response to the promotion or not? Is the transaction fraudulent or not? Will the customer close their account or not? There are many examples of business problems with only two possible outcomes. Unfortunately the linear regression model comes up short in finding solutions to this type of problem. It is worth trying to understand what these shortcomings are and how the logistic regression model is an improvement when predicting a two-valued response variable. When the response variable y has only two possible values, which may be coded as a 0 and 1, the expected value of yi, E(yi), is actually the probability that the value will be 1. The error term for a linear regression model for a two-valued response function also has only two possible values, so it doesn't have a normal distribution or constant variance over the values of the independent variables. Finally, the regression model can produce a value that doesn't fall within the necessary constraint of 0 to 1. What would be better would be to compute a continuous probability function between 0 and 1. In order to achieve this continuous probability function, the usual linear regression expression b0 + b1x1 + ... + bnxn is transformed using a function called a logit transformation function. This function is an example of a sigmoid function, so named because it looks like a sigma or 's' when plotted. It is of course the logit transformation function that gives rise to the term logistic regression. The type of logistic regression model that Teradata Warehouse Miner supports is one with a two-valued dependent variable, referred to as a binary logit model. However, Teradata Warehouse Miner is capable of coding values for the dependent variable so that the user is not required to code their dependent variable to two distinct values. The user can choose which values to represent as the response value (i.e., 1 or TRUE) and all other will be treated as nonresponse values (i.e., 0 or FALSE). Even though values other than 1 and 0 are supported in the dependent variable, throughout this section the dependent variable response value is represented as 1 and the non-response value as 0 for ease of reading. 
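To make the shape of this transformation concrete, the following stand-alone query evaluates the logistic (inverse logit) transformation of a made-up linear term. The intercept and slope values here are purely illustrative assumptions, not part of any tutorial model; any linear combination b0 + b1x1 + ... + bnxn could be substituted.

-- Illustrative only: the logistic transformation of b0 + b1*x with b0 = -2 and b1 = 0.8,
-- evaluated at x = -10, 0 and 10. The result always falls strictly between 0 and 1.
SELECT EXP(-2.0 + 0.8 * (-10.0)) / (1.0 + EXP(-2.0 + 0.8 * (-10.0))) AS p_at_x_minus_10
     , EXP(-2.0 + 0.8 * 0.0)     / (1.0 + EXP(-2.0 + 0.8 * 0.0))     AS p_at_x_0
     , EXP(-2.0 + 0.8 * 10.0)    / (1.0 + EXP(-2.0 + 0.8 * 10.0))    AS p_at_x_10;

The three probabilities returned are roughly 0.00005, 0.12 and 0.998, tracing out the 's'-shaped curve described above while never leaving the interval from 0 to 1.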
The primary sources of information and formulae in this section are [Hosmer] and [Neter].
Logit model
The logit transformation function is chosen because of its mathematical power and simplicity, and because it lends an intuitive understanding to the coefficients eventually created in the model. The following equations describe the logistic regression model, with \pi(x) being the probability that the dependent variable is 1, and g(x) being the logit transformation:

\pi(x) = \frac{e^{\,b_0 + b_1 x_1 + \cdots + b_n x_n}}{1 + e^{\,b_0 + b_1 x_1 + \cdots + b_n x_n}}

g(x) = \ln\!\left(\frac{\pi(x)}{1 - \pi(x)}\right) = b_0 + b_1 x_1 + \cdots + b_n x_n

Notice that the logit transformation g(x) has linear parameters (b-values) and may be continuous with unrestricted range. Using these functions, a binomial error distribution is found with y = \pi(x) + \varepsilon. The solution to a logistic regression model is to find the b-values that “best” predict the dichotomous y variable based on the values of the numeric x variables.
Maximum likelihood
In linear regression analysis it is possible to use a least-squares approach to finding the best b-values in the linear regression equation. The least-squared error approach leads to a set of n normal equations in n unknowns that can be solved directly. But that approach does not work for logistic regression. Suppose instead that a set of b-values is selected and the question is asked: what is the likelihood that they match the logistic distribution defined above, using statistical principles and the assumption that errors have a normal probability distribution? This technique of picking the most likely b-values that match the observed data is known as a maximum likelihood solution. In the case of linear regression, a maximum likelihood solution turns out to be mathematically equivalent to a least squares solution. But here maximum likelihood must be used directly. For convenience, compute the natural logarithm of the likelihood function so that it is possible to convert the product of likelihoods into a sum, which is easier to work with. The log likelihood equation for a given vector B of b-values with v x-variables is given by:

\ln L(b_0, \ldots, b_v) = \sum_{i=1}^{n} y_i\, B'X_i \;-\; \sum_{i=1}^{n} \ln\!\left(1 + \exp(B'X_i)\right)

where B'X = b0 + b1x1 + ... + bvxv. By differentiating this equation with respect to the constant term b0 and with respect to the variable terms bi, the likelihood equations are derived:

\sum_{i=1}^{n} \left[\, y_i - \pi(x_i) \,\right] = 0

and

\sum_{i=1}^{n} x_i \left[\, y_i - \pi(x_i) \,\right] = 0

where

\pi(x_i) = \frac{\exp(B'X_i)}{1 + \exp(B'X_i)}

The log likelihood equation is not linear in the unknown b-value parameters, so it must be solved using the non-linear optimization techniques described below.
Computational technique
Unlike with linear regression, logistic regression calculations cannot be based on an SSCP matrix. Teradata Warehouse Miner therefore dynamically generates SQL to perform the calculations required to solve the model, produce model diagnostics, produce success tables, and score new data with a model once it is built. However, to enhance performance with small data sets, Teradata Warehouse Miner provides an optional in-memory calculation feature (that is also helpful when one of the stepwise options is used). This feature selects the data into the client system’s memory if it will fit into a user-specified maximum memory amount.
The maximum amount of memory in megabytes to use is specified on the expert options tab of the analysis input screen. The user can adjust this value according to their workstation and network requirements. Setting this amount to zero disables the feature.
Teradata Warehouse Miner offers two optimization techniques for logistic regression: the default method of iteratively reweighted least squares (RLS), equivalent to the Gauss-Newton technique, and the quasi-Newton method of Broyden-Fletcher-Goldfarb-Shanno (BFGS). The RLS method is considerably faster than the BFGS method unless there are a large number of columns (RLS grows in complexity roughly as the square of the number of columns). Having a choice between techniques can be useful for more than performance reasons, however, since there may be cases where one or the other technique has better convergence properties. You may specify your choice of technique, or allow Teradata Warehouse Miner to select it for you automatically. With the automatic option the program selects RLS if there are fewer than 35 independent variable columns; otherwise it selects BFGS.
Logistic Regression Model Diagnostics
Logistic regression has counterparts to many of the same model diagnostics available with linear regression. In a similar manner to linear regression, these diagnostics provide a mathematically sound way to evaluate a model built with logistic regression.
Standard errors and statistics
As is the case with linear regression, measurements are made of the standard error associated with each b-coefficient value. Similarly, the T-statistic, or Wald statistic as it is also called, is calculated for each b-coefficient as the ratio of the b-coefficient value to its standard error. Along with its associated t-distribution probability value, it can be used to assess the statistical significance of this term in the model.
The computation of the standard errors of the coefficients is based on a matrix called the information matrix or Hessian matrix. This matrix is the matrix of second order partial derivatives of the log likelihood function with respect to all possible pairs of the coefficient values. The formula for the “j, k” element of the information matrix is:

\frac{\partial^2 L(B)}{\partial B_j \,\partial B_k} = -\sum_{i=1}^{n} x_{ij}\, x_{ik}\, \pi_i \,(1 - \pi_i)

where

\pi_i = \frac{\exp(B'X_i)}{1 + \exp(B'X_i)}

Unlike the case with linear regression, confidence intervals are not computed directly on the standard error values, but on something called the odds ratios, described below.
Odds ratios and confidence intervals
In linear regression, the meaning of each b-coefficient in the model can be thought of as the amount the dependent y variable changes when the corresponding independent x variable changes by 1. Because of the logit transformation, however, the meaning of each b-coefficient in a logistic regression model is not so clear. In a logistic regression model, the increase of an x variable by 1 implies a change in the odds that the outcome y variable will be 1 rather than 0.
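As a concrete preview of the algebra worked through below, consider the ckacct term in the tutorial output shown later in this chapter: its coefficient is roughly 0.4657 with a standard error of about 0.2365, so its odds ratio is exp(0.4657), or about 1.59. In other words, holding the other variables fixed, having a checking account multiplies the estimated odds of response by about 1.59, and the reported 95% confidence interval of roughly 1.00 to 2.53 is simply exp(0.4657 plus or minus 1.96 times 0.2365).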
Looking back at the formula for the logit response function:

g(x) = \ln\!\left(\frac{\pi(x)}{1 - \pi(x)}\right) = b_0 + \cdots + b_n x_n

it is evident that the response function is actually the log of the odds that the response is 1, where \pi(x) is the probability that the response is 1 and 1 - \pi(x) is the probability that the response is 0. Now suppose that one of the x variables, say xj, varies by 1. Then the response function will vary by bj. This can be written as g(x0...xj + 1...xn) - g(x0...xj...xn) = bj. But it could also be written as:

\ln(odds_{j+1}) - \ln(odds_j) = \ln\!\left(\frac{odds_{j+1}}{odds_j}\right) = b_j

Therefore

\frac{odds_{j+1}}{odds_j} = \exp(b_j)

which is the formula for the odds ratio of the coefficient bj. By taking the exponent of a b-coefficient, one gets the odds ratio, that is, the factor by which the odds change due to a unit increase in xj. Because this odds ratio is the value that has more meaning, confidence intervals are calculated on odds ratios for each of the coefficients rather than on the coefficients themselves. The confidence interval is computed based on a 95% confidence level and a two-tailed normal distribution.
Logistic Regression Goodness of fit
In linear regression one of the key measures associated with goodness of fit is the residual sums of squares RSS. An analogous measure for logistic regression is a statistic sometimes called the deviance. Its value is based on the ratio of the likelihood of a given model to the likelihood of a perfectly fitted or saturated model and is given by D = -2ln(ModelLH / SatModelLH). This can be rewritten as D = -2LM + 2LS in terms of the model log likelihood and the saturated model log likelihood. Looking at the data as a set of n independent Bernoulli observations, LS is actually 0, so that D = -2LM.
Two models can be contrasted by taking the difference between their deviance values, which leads to a statistic G = D1 - D2 = -2(L1 - L2). This is similar to the numerator in the partial F test in linear regression, the extra sums of squares or ESS mentioned in the section on linear regression. In order to get an assessment of the utility of the independent model terms taken as a whole, the deviance difference statistic is calculated for the model with a constant term only versus the model with all variables fitted. This statistic is then G = -2(L0 - LM). LM is calculated using the log likelihood formula given earlier. L0, the log likelihood of the constant-only model with n observations, is given by:

L_0 = y \ln(y) + (n - y)\ln(n - y) - n \ln(n)

where y is the total number of responses (rows with the dependent variable equal to 1). G follows a chi-square distribution with “variables minus one” degrees of freedom, and as such provides a probability value to test whether all the x-term coefficients should in fact be zero.
Finally, there are a number of pseudo R-squared values that have been suggested in the literature. These are not, strictly speaking, goodness of fit measures, but can nevertheless be useful in assessing the model. Teradata Warehouse Miner provides one such measure, suggested by McFadden, calculated as (L0 - LM) / L0. [Agresti]
Logistic Regression Data Quality Reports
The same data quality reports optionally available for linear regression are also available when performing logistic regression.
Since an SSCP matrix is not used in the logistic regression algorithm, additional internal processing is needed to produce data quality reports, especially for the Near Dependency report and the Detailed Collinearity Diagnostics report. Stepwise Logistic Regression Automated stepwise regression procedures are available for logistic regression to aid in model selection just as they are for linear regression. The procedures are in fact very similar to those described for linear regression. As such an attempt will be made to highlight the similarities and differences in the descriptions below. As is the case with stepwise linear regression, the automated stepwise procedures described below can provide insight into the variables that should be included in a logistic regression model. An element of human decision-making however is recommended in order to produce a model with useful business application. Forward-Only Stepwise Logistic Regression The forward only procedure consists solely of forward steps as described below, starting without any independent x variables in the model. Forward steps are continued until no variables can be added to the model. Forward Stepwise Logistic Regression The forward stepwise procedure is a combination of the forward and backward steps always done in pairs, as described below, starting without any independent x variables in the model. One forward step is always followed by one backward step, and these single forward and backward steps are alternated until no variables can be added or removed. Additional checks are made after each step to see if the same variables exist in the model as existed after a previous step in the same direction. When this condition is detected in both the forward and backward directions the algorithm will also terminate. Backward-Only Stepwise Logistic Regression The backward only procedure consists solely of backward steps as described below, starting with all of the independent x variables in the model. Backward steps are continued until no variables can be removed from the model. Backward Stepwise Logistic Regression The backward stepwise procedure is a combination of the backward and forward steps always done in pairs, as described below, starting with all of the independent x variables in the model. One backward step is followed by one forward step, and these single backward and forward steps are alternated until no variables can be added or removed. Additional checks are made after each step to see if the same variables exist in the model as existed after a Teradata Warehouse Miner User Guide - Volume 3 125 Chapter 1: Analytic Algorithms Logistic Regression previous step in the same direction. When this condition is detected in both the backward and forward directions the algorithm will also terminate. Stepwise Logistic Regression - Forward step In stepwise linear regression the partial F statistic, or the analogous T-statistic probability value, is computed separately for each variable outside the model, adding each of them into the model one at a time. The analogous procedure for logistic regression would consist of computing the likelihood ratio statistic G, described in the Goodness of Fit section, for each variable outside the model, selecting the variable that results in the largest G value when added to the model. 
In the case of logistic regression however this becomes an expensive proposition because the solution of the model for each variable requires another iterative maximum likelihood solution, contrasted to the more rapidly achieved closed form solution available in linear regression. What is needed is a statistic that can be calculated without requiring an additional maximum likelihood solution. Teradata Warehouse Miner uses such a statistic proposed by Peduzzi, Hardy and Holford that they call a W statistic. This statistic is comparatively inexpensive to compute for each variable outside the model and is therefore expedient to use as a criterion for selecting a variable to add to the model. The W statistic is assumed to follow a chi square distribution with one degree of freedom due to its similarity to other statistics, and it gives evidence of behaving similarly to the likelihood ratio statistic. Therefore, the variable with the smallest chi square probability or P-value associated with its W statistic is added to the model in a forward step if the P-value is less than the criterion to enter. If more than one variable has a P-value of 0, then the variable with the largest W statistic is entered. For more information, refer to [Peduzzi, Hardy and Holford]. Stepwise Logistic Regression - Backward step Each backward step seeks to remove those variables that have statistical significance below a certain level. This is done by first fitting the model with the currently selected variables, including the calculation of the probability or P-value associated with the T-statistic for each variable, which is the ratio of the b-coefficient to its standard error. The variable with the largest P-value is removed if it is greater than the criterion to remove. Logistic Regression and Missing Data Null values for columns in a logistic regression analysis can adversely affect results, so Teradata Warehouse Miner ensures that listwise deletion is effectively performed with logistic regression. This ensures that any row for which one of the independent or dependent variable columns is null will be left out of computations completely. Additionally, the Recode transformation function can be used to build a new column, substituting a fixed known value for null. Initiate a Logistic Regression Function Use the following procedure to initiate a new Logistic Regression analysis in Teradata Warehouse Miner: 126 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Logistic Regression 1 Click on the Add New Analysis icon in the toolbar: Figure 64: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Logistic Regression: Figure 65: Add New Analysis dialog 3 This will bring up the Logistic Regression dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Logistic Regression - INPUT - Data Selection On the Logistic Regression dialog click on INPUT and then click on data selection: Teradata Warehouse Miner User Guide - Volume 3 127 Chapter 1: Analytic Algorithms Logistic Regression Figure 66: Logistic Regression > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. 
By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view).
2 Select Columns From a Single Table
• Available Databases (or Analyses) — All the databases (or analyses) that are available for the Logistic Regression analysis.
• Available Tables — All the tables that are available for the Logistic Regression analysis.
• Available Columns — Within the selected table or matrix, all columns which are available for the Logistic Regression analysis.
• Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can either insert columns as Dependent or Independent columns. Make sure you have the correct portion of the window highlighted.
The Dependent variable column is the column whose value is being predicted by the logistic regression model. The algorithm requires that the Independent columns must be of numeric type (or contain numbers in character format). The Dependent column may be of any type.
Logistic Regression - INPUT - Analysis Parameters
On the Logistic Regression dialog click on INPUT and then click on analysis parameters:
Figure 67: Logistic Regression > Input > Analysis Parameters
On this screen select:
• Regression Options
• Convergence Criterion — The algorithm continues to repeatedly estimate the model coefficient values until either the difference in the log likelihood function from one iteration to the next is less than or equal to the convergence criterion or the maximum iterations is reached. The default value is 0.001.
• Maximum Iterations — The algorithm stops iterating if the maximum iterations is reached. The default value is 100.
• Response Value — The value of the dependent variable that will represent the response value. All other dependent variable values will be considered a non-response value.
• Include Constant Term (checkbox) — This option specifies that the logistic regression model should include a constant term. With a constant, the logistic equation can be thought of as:

\pi(x) = \frac{e^{\,b_0 + b_1 x_1 + \cdots + b_n x_n}}{1 + e^{\,b_0 + b_1 x_1 + \cdots + b_n x_n}}

g(x) = \ln\!\left(\frac{\pi(x)}{1 - \pi(x)}\right) = b_0 + b_1 x_1 + \cdots + b_n x_n

Without a constant, the equation changes to:

\pi(x) = \frac{e^{\,b_1 x_1 + \cdots + b_n x_n}}{1 + e^{\,b_1 x_1 + \cdots + b_n x_n}}

g(x) = \ln\!\left(\frac{\pi(x)}{1 - \pi(x)}\right) = b_1 x_1 + \cdots + b_n x_n

The default value is to include the constant term.
• Stepwise Options — If selected, the algorithm is performed repeatedly with various combinations of independent variable columns to attempt to arrive at a final “best” model. The default is to not use Stepwise Regression.
• Step Direction — (Selecting “None” turns off the Stepwise option).
• Forward — Option for independent variables being added one at a time to an empty model, possibly removing a variable after a variable is added.
• Forward Only — Option for qualifying independent variables being added one at a time.
• Backward — Option for removing variables from an initial model containing all of the independent variables, possibly adding a variable after a variable is removed.
• Backward Only — Option for independent variables being removed one at a time.
• Criterion to Enter — An independent variable is only added to the model if its W statistic chi-square P-value is less than the specified criterion to enter. The default value is 0.05.
• Criterion to Remove — An independent variable is only removed if its T-statistic P-value is greater than the specified criterion to remove. The default value is 0.05.
• Report Options
• Prediction Success Table — Creates a prediction success table using sums of probabilities rather than estimates based on a threshold value. The default is to generate the Prediction Success table.
• Multi-Threshold Success Table — This table provides values similar to those in the prediction success table, but based on a range of threshold values, thus allowing the user to compare success scenarios using different threshold values. The default is to generate the Multi-Threshold Success table.
• Threshold Begin, Threshold End, Threshold Increment — Specify the threshold values to be used in the multi-threshold success table. If the computed probability is greater than or equal to a threshold value, that observation is assigned a 1 rather than a 0. Default values are 0, 1 and .05 respectively.
• Cumulative Lift Table — Produce a cumulative lift table for deciles based on probability values. The default is to generate the Cumulative Lift table.
• (Data Quality Reports) — These are the same data quality reports provided for Linear Regression and Factor Analysis. However, in the case of Logistic Regression, the “Sums of squares and Cross Products” or SSCP matrix is not readily available since it is not input to the algorithm, so it is derived dynamically by the algorithm. If there are a large number of independent variables in the model it may be more efficient to use the Build Matrix function to build and save the matrix and the Linear Regression function to produce the Data Quality Reports listed below.
• Variable Statistics — This report gives the mean value and standard deviation of each variable in the model based on the derived SSCP matrix.
• Near Dependency — This report lists collinear variables or near dependencies in the data based on the derived SSCP matrix.
• Condition Index Threshold — Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The one that involves this parameter is the occurrence of a large condition index value associated with a specially constructed principal factor. If a factor has a condition index greater than this parameter’s value, it is a candidate for the Near Dependency report. A default value of 30 is used as a rule of thumb.
• Variance Proportion Threshold — Entries in the Near Dependency report are triggered by two conditions occurring simultaneously.
The one that involves this parameter is when two or more variables have a variance proportion greater than this threshold value for a factor with a high condition index. Another way of saying this is that a ‘suspect’ factor accounts for a high proportion of the variance of two or more variables. This parameter defines what a high proportion of variance is. A default value of 0.5 is used as a rule of thumb. Detailed Collinearity Diagnostics — This report provides the details behind the Near Dependency report, consisting of the “Eigenvalues of Unit Scaled X’X”, “Condition Indices” and “Variance Proportions” tables. Logistic Regression - INPUT - Expert Options On the Logistic Regression dialog click on INPUT and then click on expert options: Figure 68: Logistic Regression > Input > Expert Options On this screen select: • Optimization Method • Automatic — The program selects Reweighted Least Squares (RLS) unless there are 35 or more independent variable columns, in which case Quasi-Newton BFGS is selected instead. This is the default option. • Quasi-Newton (BFGS) — The user may explicitly request this optimization technique attributed to Broyden-Fletcher-Goldfarb-Shanno. Quasi-Newton methods do not require a Hessian matrix of second partial derivatives of the objective function to be calculated explicitly, saving time in some situations. • Reweighted Least Squares (RLS) — The user may explicitly request this optimization technique equivalent to the Gauss-Newton method. It involves computing a matrix Teradata Warehouse Miner User Guide - Volume 3 131 Chapter 1: Analytic Algorithms Logistic Regression very similar to a Hessian matrix but is typically the fastest technique for logistic regression. • Performance • Maximum amount of data for in-memory processing — Enter a number of megabytes. • Use multiple threads when applicable — This flag indicates that multiple SQL statements may be executed simultaneously, up to 5 simultaneous executions as needed. It only applies when not processing in memory, and only to certain processing performed in SQL. Where and when multi-threading is used is dependent on the number of columns and the Optimization Method selected (but both RLS and BFGS can potentially make some use of multi-threading). Logistic Regression - OUTPUT On the Logistic Regression dialog click on OUTPUT: Figure 69: Logistic Regression > OUTPUT On this screen select: • Store the variables table of this analysis in the database — Store the model variables table of this analysis in the database. • Database Name — Name of the database to create the output table in. • Output Table Name — Name of the output table. • Advertise Output — “Advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — Specify when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. 
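Once the variables table has been stored, it is an ordinary Teradata table and can be queried directly. The following is a minimal sketch only; it assumes the Database Name twm_results and Output Table Name test used in the example that follows, and the 0.05 significance cutoff is chosen purely for illustration:

SELECT "Column Name", "B Coefficient", "Odds Ratio", "P-Value"
FROM twm_results.test
WHERE "P-Value" < 0.05   /* illustrative cutoff only */
ORDER BY "P-Value";

A query of this kind simply lists the model terms whose coefficients are statistically significant at the chosen level.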
By way of example, the tutorial produces the following output table:
Table 41: Logistic Regression - OUTPUT. The output table contains one row for each term in the model (years_with_bank, avg_sv_tran_cnt, avg_sv_tran_amt, ckacct, avg_ck_tran_cnt, married, (Constant) and avg_sv_bal), with the columns Column Name, B Coefficient, Standard Error, Wald Statistic, T Statistic, P-Value, Odds Ratio, Lower, Upper, Partial R and Standardized Coefficient. (The same coefficient values, rounded to four decimal places, appear in the Variables report in the tutorial later in this chapter.)
If Database Name is twm_results and Output Table Name is test, the output table is defined as:
CREATE SET TABLE twm_results.test (
"Column Name" VARCHAR(30) CHARACTER SET UNICODE NOT CASESPECIFIC,
"B Coefficient" FLOAT,
"Standard Error" FLOAT,
"Wald Statistic" FLOAT,
"T Statistic" FLOAT,
"P-Value" FLOAT,
"Odds Ratio" FLOAT,
"Lower" FLOAT,
"Upper" FLOAT,
"Partial R" FLOAT,
"Standardized Coefficient" FLOAT)
UNIQUE PRIMARY INDEX ( "Column Name" );
Run the Logistic Regression
After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Logistic Regression
The results of running the Teradata Warehouse Miner Logistic Regression analysis include a variety of statistical reports on the individual variables and generated model as well as bar charts displaying coefficients and T-statistics. All of these results are outlined below. The title of this report is preceded by the name of the technique that was used to build the model, either Reweighted Least Squares Logistic Regression or Quasi-Newton (BFGS) Logistic Regression. On the Logistic Regression dialog, click on RESULTS (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed) to view results. Result options are as follows:
• Data Quality Reports
• Variable Statistics — If selected on the Results Options tab, this report gives the mean value and standard deviation of each variable in the model based on the SSCP matrix provided as input.
• Near Dependency — If selected on the Results Options tab, this report lists collinear variables or near dependencies in the data based on the SSCP matrix provided as input. Entries in the Near Dependency report are triggered by two conditions occurring simultaneously.
The first is the occurrence of a large condition index value associated with a specially constructed principal factor. If a factor has a condition index greater than the parameter specified on the Results Options tab, it is a candidate for the Near Dependency report. The other is when two or more variables have a variance proportion greater than a threshold value for a factor with a high condition index. Another way of saying this is that a ‘suspect’ factor accounts for a high proportion of the variance of two or more variables. The parameter defining what counts as a high proportion of variance is also set on the Results Options tab, with a default value of 0.5.
• Detailed Collinearity Diagnostics — If selected on the Results Options tab, this report provides the details behind the Near Dependency report, consisting of the following tables.
• Eigenvalues of Unit Scaled X'X — Report of the eigenvalues of all variables scaled so that each variable adds up to 1 when summed over all the observations or rows. In order to calculate the singular values of X (the rows of X are the observations), the mathematically equivalent square roots of the eigenvalues of X'X are computed instead for practical reasons.
• Condition Indices — The condition index of each eigenvalue, calculated as the square root of the ratio of the largest eigenvalue to the given eigenvalue, a value always 1 or greater.
• Variance Proportions — The variance decomposition of these eigenvalues is computed using the eigenvalues together with the eigenvectors associated with them. The result is a matrix giving, for each variable, the proportion of variance associated with each eigenvalue.
• Logistic Regression Step N (Stepwise-only)
• In Report — This report is the same as the Variables in Model report, but it is provided for each step during stepwise logistic regression based on the variables currently in the model at each step.
• Out Report
• Column Name — The independent variable excluded from the model.
• W Statistic — The W Statistic is a specialized statistic designed to determine the best variable to add to a model without calculating a maximum likelihood solution for each variable outside the model. The W statistic is assumed to follow a chi-square distribution with one degree of freedom due to its similarity to other statistics, and it gives evidence of behaving similarly to the likelihood ratio statistic. For more information, refer to [Peduzzi, Hardy and Holford].
• Chi Sqr P-value — The W statistic is assumed to follow a chi-square distribution on one degree of freedom due to its similarity to other statistics, and it gives evidence of behaving similarly to the likelihood ratio statistic. Therefore, the variable with the smallest chi-square probability or P-value associated with its W statistic is added to the model in a forward step if the P-value is less than the criterion to enter.
• Logistic Regression Model
• Total Observations — This is the number of rows in the table that the logistic regression analysis is based on. The number of observations reflects the row count after any rows were eliminated by listwise deletion (due to one of the variables being null).
• Total Iterations — The number of iterations used by the non-linear optimization algorithm in maximizing the log likelihood function.
• Initial Log Likelihood — The initial log likelihood is the log likelihood of the constant-only model and is given only when the constant is included in the model. The formula for the initial log likelihood is:

$$L_0 = y \ln y + (n - y)\ln(n - y) - n \ln n$$

where n is the number of observations and y is the number of observations with the response value.
• Final Log Likelihood — This is the value of the log likelihood function after the last iteration.
• Likelihood Ratio Test G Statistic — Deviance, given by D = -2L_M, where L_M is the log likelihood of the logistic regression model, is a measure analogous to the residual sum of squares RSS in a linear regression model. In order to assess the utility of the independent terms taken as a whole in the logistic regression model, the deviance difference statistic G is calculated for the model with a constant term only versus the model with all variables fitted. This statistic is then G = -2(L_0 - L_M), where L_0 is the log likelihood of a model containing only a constant. The G statistic, like the deviance D, is an example of a likelihood ratio test statistic.
• Chi-Square Degrees of Freedom — The G Statistic follows a chi-square distribution with “variables minus one” degrees of freedom. This field then is the degrees of freedom for the G Statistic’s chi-square test.
• Chi-Square Value — This is the chi-square random variable value for the Likelihood Ratio Test G Statistic. This can be used to test whether all the independent variable coefficients should be 0. Examining the field Chi-Square Probability is however the easiest way to assess this test.
• Chi-Square Probability — This is the chi-square probability value for the Likelihood Ratio Test G Statistic. It can be used to test whether all the independent variable coefficients should be 0. That is, the probability that a chi-square distributed variable would have the value G or greater is the probability associated with having all 0 coefficients. The null hypothesis that all the terms should be 0 can be rejected if this probability is sufficiently small, say less than 0.05.
• McFadden's Pseudo R-Squared — To mimic the Squared Multiple Correlation Coefficient (R²) in a linear regression model, the researcher McFadden suggested this measure, given by (L_0 - L_M) / L_0, where L_0 is the log likelihood of a model containing only a constant and L_M is the log likelihood of the logistic regression model. Although it is not, strictly speaking, a goodness of fit measure, it can be useful in assessing a logistic regression model. (Experience shows that the value of this statistic tends to be less than the R² value it mimics. In fact, values between 0.20 and 0.40 are quite satisfactory).
• Dependent Variable Name — Column chosen as the dependent variable.
• Dependent Variable Response Value — The response value chosen for the dependent variable on the Regression Options tab.
• Dependent Variable Distinct Values — The number of distinct values that the dependent variable takes on.
• Logistic Regression Variables in Model report
• Column Name — This is the name of the independent variable in the model or CONSTANT for the constant term.
• B Coefficient — The b-coefficient is the coefficient in the logistic regression model for this variable.
The following equations describe the logistic regression model, with π(x) being the probability that the dependent variable is 1, and g(x) being the logit transformation:

$$\pi(x) = \frac{e^{b_0 + b_1 x_1 + \cdots + b_n x_n}}{1 + e^{b_0 + b_1 x_1 + \cdots + b_n x_n}}, \qquad g(x) = \ln\frac{\pi(x)}{1 - \pi(x)} = b_0 + b_1 x_1 + \cdots + b_n x_n$$

• Standard Error — The standard error of a b-coefficient in the logistic regression model is a measure of its expected accuracy. It is analogous to the standard error of a coefficient in a linear regression model.
• Wald Statistic — The Wald statistic is calculated as the square of the T-statistic (T Stat) described below. The T-statistic is calculated for each b-coefficient as the ratio of the b-coefficient value to its standard error.
• T Statistic — In a manner analogous to linear regression, the T-statistic is calculated for each b-coefficient as the ratio of the b-coefficient value to its standard error. Along with its associated t-distribution probability value, it can be used to assess the statistical significance of this term in the model.
• P-value — This is the t-distribution probability value associated with the T-statistic (T Stat), that is, the ratio of the b-coefficient value (B Coef) to its standard error (Std Error). It can be used to assess the statistical significance of this term in the logistic regression model. A value close to 0 implies statistical significance and means this term in the model is important. The P-value represents the probability that the null hypothesis is true, that is, that the observed estimate of the coefficient value is a chance occurrence (i.e., the null hypothesis is that the coefficient equals zero). The smaller the P-value, the stronger the evidence for rejecting the null hypothesis that the coefficient is actually equal to zero. In other words, the smaller the P-value, the stronger the evidence that the coefficient is different from zero.
• Odds Ratio — The odds ratio for an independent variable in the model is calculated by taking the exponent of the b-coefficient. The odds ratio is the factor by which the odds of the dependent variable being 1 change due to a unit increase in this independent variable.
• Lower — Because of the intuitive meaning of the odds ratio, confidence intervals for coefficients in the model are calculated on odds ratios rather than on the coefficients themselves. The confidence interval is computed based on a 95% confidence level and a two-tailed normal distribution. “Lower” is the lower range of this confidence interval.
• Upper — Because of the intuitive meaning of the odds ratio, confidence intervals for coefficients in the model are calculated on odds ratios rather than on the coefficients themselves. The confidence interval is computed based on a 95% confidence level and a two-tailed normal distribution. “Upper” is the upper range of this confidence interval.
• Partial R — The Partial R statistic is calculated for each b-coefficient value as:

$$\text{Partial } R = \operatorname{Sign}(b_i)\sqrt{\frac{w_i - 2}{-2L_0}}$$

where b_i is the b-coefficient and w_i is the Wald Statistic of the ith independent variable, while L_0 is the initial log likelihood of the model. (Note that if w_i <= 2 then Partial R is set to 0). This statistic provides a measure of the relative importance of each variable in the model.
It is calculated only when the constant term is included in the model. [SPSS]
• Standardized Coefficient — The estimated standardized coefficient is calculated for each b-coefficient value as:

$$b_i^{\,std} = \frac{b_i\,\sigma_i}{\pi/\sqrt{3}}$$

where b_i is the b-coefficient, σ_i is the standard deviation of the ith independent variable, and π/√3 is the standard deviation of the standard logistic distribution. This calculation only provides an estimate of the standardized coefficients since it uses a constant value for the logistic distribution without regard to the actual distribution of the dependent variable in the model. [Menard]
• Prediction Success Table — The prediction success table is computed using only probabilities and not estimates based on a threshold value. Using an input table that contains known values for the dependent variable, the sums of the probability values π(x) and 1 - π(x), which correspond to the probability that the predicted value is 1 or 0 respectively, are calculated separately for rows with an actual value of 1 and 0. Refer to the Model Evaluation section for more information.
• Estimate Response — The entries in the “Estimate Response” column are the sums of the probabilities π(x) that the outcome is 1, summed separately over the observations where the actual outcome is 1 and 0 and then totaled. (Note that this is independent of the threshold value that is used in scoring to determine which probabilities correspond to an estimate of 1 and 0 respectively).
• Estimate Non-Response — The entries in the “Estimate Non-Response” column are the sums of the probabilities 1 - π(x) that the outcome is 0, summed separately over the observations where the actual outcome is 1 and 0 and then totaled. (Note that this is independent of the threshold value that is used in scoring to determine which probabilities correspond to an estimate of 1 and 0 respectively).
• Actual Total — The entries in this column are the sums of the entries in the Estimate Response and Estimate Non-Response columns, across the rows in the Prediction Success Table. But in fact this turns out to be the number of actual 0’s and 1’s and total observations in the training data.
• Actual Response — The entries in the “Actual Response” row correspond to the observations in the data where the actual value of the dependent variable is 1.
• Actual Non-Response — The entries in the “Actual Non-Response” row correspond to the observations in the data where the actual value of the dependent variable is 0.
• Estimated Total — The entries in this row are the sums of the entries in the Actual Response and Actual Non-Response rows, down the columns in the Prediction Success Table. This turns out to be the sum of the probabilities of estimated 0’s and 1’s and total observations in the model.
• Multi-Threshold Success Table — This table provides values similar to those in the prediction success table, but instead of summing probabilities, the estimated values based on a threshold value are summed instead. Rather than just one threshold however, several thresholds ranging from a user-specified low to high value are displayed in user-specified increments. This allows the user to compare several success scenarios using different threshold values, to aid in the choice of an ideal threshold. Refer to the Model Evaluation section for more information.
• Threshold Probability — This column gives various incremental values of the probability at or above which an observation is to have an estimated value of 1 for the dependent variable. For example, at a threshold of 0.5, a response value of 1 is estimated if the probability predicted by the logistic regression model is greater than or equal to 0.5. The user may request the starting, ending and increment values for these thresholds. • Actual Response, Estimate Response — This column corresponds to the number of observations for which the model estimated a value of 1 for the dependent variable and the actual value of the dependent variable is 1. • Actual Response, Estimate Non-Response — This column corresponds to the number of observations for which the model estimated a value of 0 for the dependent variable but the actual value of the dependent variable is 1, a “false negative” error case for the model. • Actual Non-Response, Estimate Response — This column corresponds to the number of observations for which the model estimated a value of 1 for the dependent variable but the actual value of the dependent variable is 0, a “false positive” error case for the model. • Actual Non-Response, Estimate Non-Response — This column corresponds to the number of observations for which the model estimated a value of 0 for the dependent variable and the actual value of the dependent variable is 0. • Cumulative Lift Table — The Cumulative Lift Table demonstrates how effective the model is in estimating the dependent variable. It is produced using deciles based on the probability values. Note that the deciles are labeled such that 1 is the highest decile and 10 is the lowest, based on the probability values calculated by logistic regression. The information in this report however is best viewed in the Lift Chart produced as a graph under a logistic regression analysis. • Decile — The deciles in the report are based on the probability values predicted by the model. Note that 1 is the highest decile and 10 is the lowest. That is, decile 1 contains data on the 10% of the observations with the highest estimated probabilities that the dependent variable is 1. • Count — This column contains the count of observations in the decile. • Response — This column contains the count of observations in the decile where the actual value of the dependent variable is 1. Teradata Warehouse Miner User Guide - Volume 3 139 Chapter 1: Analytic Algorithms Logistic Regression • Response (%) — This column contains the percentage of observations in the decile where the actual value of the dependent variable is 1. • Captured Response (%) — This column contains the percentage of responses in the decile over all the responses in any decile. • Lift — The lift value is the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations. For example, if 10% of the observations overall have a dependent variable with value 1, and 20% of the observations in decile 1 have a dependent variable with value 1, then the lift value within decile 1 is 2.0, meaning that the model gives a “lift” that is better than chance alone by a factor of two in predicting response values of 1 within this decile. • Cumulative Response — This is a cumulative measure of Response, from decile 1 to this decile. • Cumulative Response (%) — This is a cumulative measure of Pct Response, from decile 1 to this decile. 
• Cumulative Captured Response (%) — This is a cumulative measure of Pct Captured Response, from decile 1 to this decile.
• Cumulative Lift — This is a cumulative measure of Lift, from decile 1 to this decile.
Logistic Regression Graphs
The Logistic Regression Analysis can display bar charts for the T-statistics, Wald Statistics, Log Odds Ratios, Partial R and Estimated Standardized Coefficients of the resultant model. In addition, a Lift Chart in deciles is generated.
Logistic Weights Graph
This graph displays the relative magnitudes of the T-statistics, Wald Statistics, Log Odds Ratios, Partial R and Estimated Standardized Coefficients associated with each variable in the logistic regression model. The sign, positive or negative, is portrayed by the colors red or blue respectively. The user may scroll to the left or right to see the statistics associated with all the variables in the model. The following options are available on the Graphics Options tab on the Logistic Weights graph:
• Graph Type — Selects which statistic is graphed by the Logistic Weights Graph (T Statistic, Wald Statistic, Log Odds Ratio, Partial R or Estimated Standardized Coefficient).
• Vertical Axis — The user may request multiple vertical axes in order to display separate coefficient values that are orders of magnitude different from the rest of the values. If the coefficients are of roughly the same magnitude, this option is grayed out.
• Single — Display the selected statistics on a single axis on the bar chart.
• Multiple — Display the selected statistics on dual axes on the bar chart.
Lift Chart
This graph displays the statistics in the Cumulative Lift Table, with the following options:
• Non-Cumulative
• % Response — This column contains the percentage of observations in the decile where the actual value of the dependent variable is 1.
• % Captured Response — This column contains the percentage of responses in the decile over all the responses in any decile.
• Lift — The lift value is the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations. For example, if 10% of the observations overall have a dependent variable with value 1, and 20% of the observations in decile 1 have a dependent variable with value 1, then the lift value within decile 1 is 2.0, meaning that the model gives a “lift” that is better than chance alone by a factor of two in predicting response values of 1 within this decile.
• Cumulative
• % Response — This is a cumulative measure of the percentage of observations in the decile where the actual value of the dependent variable is 1, from decile 1 to this decile.
• % Captured Response — This is a cumulative measure of the percentage of responses in the decile over all the responses in any decile, from decile 1 to this decile.
• Cumulative Lift — This is a cumulative measure of the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations, from decile 1 to this decile.
Tutorial - Logistic Regression
The following is an example of using the stepwise feature of the Logistic Regression analysis. The stepwise feature adds extra processing steps to the analysis; that is, normal Logistic Regression processing is a subset of the output shown below.
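As a worked check of the model-level statistics defined earlier, the likelihood ratio test G statistic and McFadden's pseudo R-squared reported in Table 42 below can be reproduced from the initial and final log likelihoods shown in that same report (the small differences from the reported figures are due only to rounding):

$$G = -2(L_0 - L_M) = -2\,(-517.7749 + 244.4929) \approx 546.564$$

$$R^2_{McFadden} = \frac{L_0 - L_M}{L_0} = \frac{-517.7749 + 244.4929}{-517.7749} \approx 0.5278$$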
In this example, ccacct (has credit card, 0 or 1) is being predicted in terms of 16 independent variables, from income to avg_sv_tran_cnt. The forward stepwise process determines that only 7 out of the original 16 input variables should be used in the model. These include avg_sv_tran_amt (average amount of savings transactions), avg_sv_tran_cnt (average number of savings transactions per month), avg_sv_bal (average savings account balance), married, years_with_bank, avg_ck_ tran_cnt (average number of checking transactions per month), and ckacct (has checking account, 0 or 1). Step 0 shows that all of the original 16 independent variables are excluded from the model, the starting point for forward stepwise regression. In Step 1, the Model Assessment report shows that the variable avg_sv_tran_amt added to the model, along with the constant term, with all other variables still excluded from the model. For the sake of brevity, Steps 2 through 6 are not shown. Then in Step 7, the variable ckacct is the last variable added to the model. At this point the stepwise algorithm stops because there are no more variables qualifying to be added or removed from the model, and the Reweighted Least Squares Logistic Regression and Variables in Model reports are given, just as they would be if these variables were analyzed without stepwise requested. Finally the Prediction Success Table, Multi-Threshold Success Table, and Cumulative Lift Table are given, as requested, to complete the analysis. Teradata Warehouse Miner User Guide - Volume 3 141 Chapter 1: Analytic Algorithms Logistic Regression Parameterize a Logistic Regression Analysis as follows: • Available Table — twm_customer_analysis • Dependent Variable — cc_acct • Independent Variables • income — age • years_with_bank — nbr_children • female — single • married — separated • ckacct — svacct • avg_ck_bal — avg_sv_bal • avg_ck_tran_amt — avg_ck_tran_cnt • avg_sv_tran_amt — avg_sv_tran_cnt • Convergence Criterion — 0.001 • Maximum Iterations — 100 • Response Value — 1 • Include Constant — Enabled • Prediction Success Table — Enabled • Multi-Threshold Success Table — Enabled • Threshold Begin — 0 • Threshold End — 1 • Threshold Increment — 0.05 • Cumulative Lift Table — Enabled • Use Stepwise Regression — Enabled • Criterion to Enter — 0.05 • Criterion to Remove — 0.05 • Direction — Forward • Optimization Type — Automatic Run the analysis, and click on Results when it completes. For this example, the Logistic Regression Analysis generated the following pages. A single click on each page name populates Results with the item. Table 42: Logistic Regression Report 142 Total Observations: 747 Total Iterations: 9 Initial Log Likelihood: -517.7749 Final Log Likelihood: -244.4929 Likelihood Ratio Test G Statistic: 546.5641 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Logistic Regression Table 42: Logistic Regression Report Chi-Square Degrees of Freedom: 7.0000 Chi-Square Value: 14.0671 Chi-Square Probability: 0.0000 McFadden's Pseudo R-Squared: 0.5278 Dependent Variable: ccacct Dependent Response Value: 1 Total Distinct Values: 2 Table 43: Execution Summary 6/20/2004 2:19:02 PM Stepwise Logistic Regression Running. 
6/20/2004 2:19:03 PM Step 0 Complete 6/20/2004 2:19:03 PM Step 1 Complete 6/20/2004 2:19:03 PM Step 2 Complete 6/20/2004 2:19:03 PM Step 3 Complete 6/20/2004 2:19:03 PM Step 4 Complete 6/20/2004 2:19:04 PM Step 5 Complete 6/20/2004 2:19:04 PM Step 6 Complete 6/20/2004 2:19:04 PM Step 7 Complete 6/20/2004 2:19:04 PM Log Likelihood: -517.78094387828 6/20/2004 2:19:04 PM Log Likelihood: -354.38456690558 6/20/2004 2:19:04 PM Log Likelihood: -287.159936852895 6/20/2004 2:19:04 PM Log Likelihood: -258.834546711159 6/20/2004 2:19:04 PM Log Likelihood: -247.445356552554 6/20/2004 2:19:04 PM Log Likelihood: -244.727173470081 6/20/2004 2:19:04 PM Log Likelihood: -244.49467692232 6/20/2004 2:19:04 PM Log Likelihood: -244.492882024522 6/20/2004 2:19:04 PM Log Likelihood: -244.492881920691 6/20/2004 2:19:04 PM Computing Multi-Threshold Success Table 6/20/2004 2:19:06 PM Computing Prediction Success Table 6/20/2004 2:19:06 PM Computing Cumulative Lift Table 6/20/2004 2:19:07 PM Creating Report Teradata Warehouse Miner User Guide - Volume 3 143 Chapter 1: Analytic Algorithms Logistic Regression Table 44: Variables Column Name B Standard Coefficient Error Wald Statistic T Statistic P-Value Odds Ratio Lower Upper Partial R Standardized Coefficient (Constant) -1.1864 0.2733 18.8462 -4.3412 0.0000 N/A N/A N/A N/A N/A avg_sv_ tran_amt 0.0308 0.0038 64.7039 8.0439 0.0000 1.0312 1.0235 1.0390 0.2461 2.0618 avg_sv_ tran_cnt -1.1921 0.2133 31.2295 -5.5883 0.0000 0.3036 0.1999 0.4612 -0.1680 -0.9144 avg_sv_bal 0.0031 0.0006 31.1687 5.5829 0.0000 1.0031 1.0020 1.0042 0.1678 2.6259 married -0.6225 0.2334 7.1152 -2.6674 0.0078 0.5366 0.3396 0.8478 -0.0703 -0.1715 years_with_ -0.0981 bank 0.0443 4.9149 -2.2170 0.0269 0.9066 0.8312 0.9887 -0.0531 -0.1447 avg_ck_ tran_cnt -0.0228 0.0096 5.6088 -2.3683 0.0181 0.9775 0.9592 0.9961 -0.0590 -0.1792 ckacct 0.4657 0.2365 3.8760 1.9688 0.0494 1.5931 1.0021 2.5326 0.0426 0.1273 Step 0 Table 45: Columns Out 144 Column Name W Statistic Chi-Square P-Value age 1.9521 0.1624 avg_ck_bal 0.5569 0.4555 avg_ck_tran_amt 1.6023 0.2056 avg_ck_tran_cnt 0.0844 0.7714 avg_sv_bal 85.5070 0.0000 avg_sv_tran_amt 233.7979 0.0000 avg_sv_tran_cnt 44.0510 0.0000 ckacct 21.8407 0.0000 female 3.2131 0.0730 income 1.9877 0.1586 married 19.6058 0.0000 nbr_children 5.1128 0.0238 separated 5.5631 0.0183 single 6.9958 0.0082 svacct 7.4642 0.0063 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Logistic Regression Table 45: Columns Out Column Name W Statistic Chi-Square P-Value years_with_bank 3.0069 0.0829 Step 1 Table 46: Variables Column Name B Standard Coefficient Error Wald Statistic T Statistic P-Value Odds Ratio avg_sv_ tran_amt 0.0201 193.2455 13.9013 0.0000 1.0203 1.0174 1.0232 0.4297 0.0014 Lower Upper Partial R Standardized Coefficient 1.3445 Table 47: Columns Out Column Name W Statistic Chi-Square P-Value age 3.4554 0.0630 avg_ck_bal 0.4025 0.5258 avg_ck_tran_amt 0.3811 0.5370 avg_ck_tran_cnt 11.3612 0.0007 avg_sv_bal 46.6770 0.0000 avg_sv_tran_cnt 134.8091 0.0000 ckacct 7.8238 0.0052 female 2.4111 0.1205 income 5.2143 0.0224 married 7.7743 0.0053 nbr_children 2.6647 0.1026 separated 3.9342 0.0473 single 2.7417 0.0978 svacct 2.0405 0.1532 years_with_bank 13.2617 0.0003 Teradata Warehouse Miner User Guide - Volume 3 145 Chapter 1: Analytic Algorithms Logistic Regression Step 2-7 Table 48: Prediction Success Table Estimate Response Estimate Non-Response Actual Total Actual Response 304.5868 70.4132 375.0000 Actual Non-Response 70.4133 301.5867 372.0000 Actual Total 
375.0000 372.0000 747.0000 Table 49: Multi-Threshold Success Table Threshold Probability Actual Response, Estimate Response Actual Response, Estimate Non-Response Actual Non-Response, Estimate Response Actual Non-Response, Estimate Non-Response 0 375 0 372 0 .05 375 0 353 19 .1 374 1 251 121 .15 373 2 152 220 .2 369 6 90 282 .25 361 14 58 314 .3 351 24 37 335 .35 344 31 29 343 .4 329 46 29 343 .45 318 57 28 344 .5 313 62 24 348 .55 305 70 23 349 .6 291 84 23 349 .65 286 89 21 351 .7 276 99 20 352 .75 265 110 20 352 .8 253 122 20 352 .85 243 132 16 356 .9 229 146 13 359 .95 191 184 11 361 146 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Logistic Regression Table 50: Cumulative Lift Table Cumulative Response (%) Cumulative Captured Response (%) Cumulative Lift Decile Count Response Response (%) Captured Response (%) 1 74.0000 73.0000 98.6486 19.4667 1.9651 73.0000 98.6486 19.4667 1.9651 2 75.0000 69.0000 92.0000 18.4000 1.8326 142.0000 95.3020 37.8667 1.8984 3 75.0000 71.0000 94.6667 18.9333 1.8858 213.0000 95.0893 56.8000 1.8942 4 74.0000 65.0000 87.8378 17.3333 1.7497 278.0000 93.2886 74.1333 1.8583 5 75.0000 66.0000 88.0000 17.6000 1.7530 344.0000 92.2252 91.7333 1.8371 6 75.0000 24.0000 32.0000 6.4000 0.6374 368.0000 82.1429 98.1333 1.6363 7 74.0000 4.0000 5.4054 1.0667 0.1077 372.0000 71.2644 99.2000 1.4196 8 73.0000 2.0000 2.7397 0.5333 0.0546 374.0000 62.8571 99.7333 1.2521 9 69.0000 1.0000 1.4493 0.2667 0.0289 375.0000 56.4759 100.0000 1.1250 10 83.0000 0.0000 0.0000 0.0000 0.0000 375.0000 50.2008 100.0000 1.0000 Lift Cumulative Response Logistic Weights Graph By default, the Logistic Weights graph displays the relative magnitudes of the T-statistic associated with each coefficient in the logistic regression model: Figure 70: Logistic Regression Tutorial: Logistic Weights Graph Teradata Warehouse Miner User Guide - Volume 3 147 Chapter 1: Analytic Algorithms Logistic Regression Select the Graphics Options tab and change the Graph Type to Wald Statistic, Log Odds Ratio, Partial R or Estimated Standardized Coefficient to view those statistical measures respectively Lift Chart By default, the Lift Chart displays the cumulative measure of the percentage of observations in the decile where the actual value of the dependent variable is 1, from decile 1 to this decile (Cumulative, %Response): Figure 71: Logistic Regression Tutorial: Lift Chart 148 Teradata Warehouse Miner User Guide - Volume 3 CHAPTER 2 Scoring What’s In This Chapter This chapter applies only to an instance of Teradata Warehouse Miner operating on a Teradata database. For more information, see these subtopics: 1 “Overview” on page 149 2 “Cluster Scoring” on page 149 3 “Tree Scoring” on page 157 4 “Factor Scoring” on page 168 5 “Linear Scoring” on page 176 6 “Logistic Scoring” on page 184 Overview Model scoring in Teradata Warehouse Miner is performed entirely through generated SQL, executed in the database (although PMML based scoring generally requires that certain supplied User Defined Functions be installed beforehand). A scoring analysis is provided for every Teradata Warehouse Miner algorithm that produces a predictive model (thus excluding the Association Rules algorithm). Scoring applies a predictive model to a data set that has the same columns as those used in building the model, with the exception that the scoring input table need not always include the predicted or dependent variable column for those models that utilize one. 
In fact, the dependent variable column is required only when model evaluation is requested in the Tree Scoring, Linear Scoring and Logistic Scoring analyses.
Cluster Scoring
Scoring a table is the assignment of each row to a cluster. In the Gaussian Mixture model, the “maximum probability rule” is used to assign the row to the cluster for which its conditional probability is the largest. The model also assigns relative probabilities of each cluster to the row, so the soft assignment of a row to more than one cluster can be obtained. When scoring is requested, the selected table is scored against centroids/variances from the selected Clustering analysis. After a single iteration, each row is assigned to one of the previously defined clusters, together with the probability of membership. The row-to-cluster assignment is based on the largest probability. The Cluster Scoring analysis scores an input table that contains the same columns that were used to perform the selected Clustering Analysis. The implicit assumption in doing this is that the underlying population distributions are the same. When scoring is requested, the specified table is scored against the centroids and variances obtained in the selected Clustering analysis. Only a single iteration is required before the new scored table is produced. After clusters have been identified by their centroids and variances, the scoring engine identifies to which cluster each row belongs. The Gaussian Mixture model permits multiple cluster memberships, with scoring showing the probability of membership to each cluster. In addition, the highest probability is used to assign the row absolutely to a cluster. The resulting score table consists of the index (key) columns, followed by probabilities for each cluster membership, followed by the assigned cluster number (the cluster with the highest probability of membership).
Initiate Cluster Scoring
After generating a Cluster analysis (as described in “Cluster Analysis” on page 20), use the following procedure to initiate Cluster Scoring:
1 Click on the Add New Analysis icon in the toolbar:
Figure 72: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Scoring under Categories and then under Analyses double-click on Cluster Scoring:
Figure 73: Add New Analysis > Scoring > Cluster Scoring
3 This will bring up the Cluster Scoring dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
Cluster Scoring - INPUT - Data Selection
On the Cluster Scoring dialog click on INPUT and then click on data selection:
Figure 74: Add New Analysis > Input > Data Selection
On this screen select:
1 Select Input Source
Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh).
In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). 2 Select Columns From a Single Table Teradata Warehouse Miner User Guide - Volume 3 151 Chapter 2: Scoring Cluster Scoring 3 • Available Databases — All available source databases that have been added on the Connection Properties dialog. • Available Tables — The tables available for scoring are listed in this window, though all may not strictly qualify; the input table to be scored must contain the same column names used in the original analysis. • Available Columns — The columns available for scoring are listed in this window. • Selected Columns — The Selected Columns window is a split window for specifying Index and/or Retain columns. • Index Columns — If a table is specified as input, the primary index of the table is defaulted here but can be changed. If a view is specified as input, an index must be provided. When scoring a Fast K-Means model, any columns used to determine clusters in the analysis being scored are not necessarily specified as Index columns when scoring. A duplicate definition error can occur. • Retain Columns — Other columns within the table being scored can be appended to the scored table by specifying them here. Columns specified in Index Columns are not necessarily specified here. None of the columns involved in Fast K-Means clustering can contain leading or trailing spaces or, if publishing, a separator character ' | '. Select Model Analysis — Select from the list an existing Cluster analysis on which to run the scoring. The Cluster analysis must exist in the same project as the Cluster Scoring analysis. Cluster Scoring - INPUT - Analysis Parameters On the Cluster Scoring dialog click on INPUT and then click on analysis parameters: Figure 75: Add New Analysis > Input > Analysis Parameters On this screen select: • Score Options • Include Cluster Membership — The name of the column in the output score table representing the cluster number to which an observation or row belongs can be set by the user. For Fast K-Means, this option must be checked and a column name specified. For other model types, this column may be excluded by clearing the selection box, but if this is done the cluster probability scores must be included. • 152 Column Name — Name of the column that will be populated with the cluster numbers, but it cannot have the same name as any of the columns in the table being scored Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Cluster Scoring • Include Cluster Probability Scores — Specify the prefix of the name of the columns in the output score table representing the probabilities that an observation or row belongs to each cluster. A column is created for each possible cluster, adding the cluster number to this prefix (for example, p1, p2, p3). To exclude these columns, clear the selection box, but you must include the cluster membership number. • Column Prefix — Specify a prefix for each column generated (one per cluster) that will be populated with the probability scores. The prefix includes sequential numbers, beginning with 1 and incrementing for each cluster appended to it. 
If the resultant column conflicts with a column in the table to be scored, an error occurs. Cluster Scoring - OUTPUT On the Cluster Scoring dialog click on OUTPUT: Figure 76: Cluster Scoring > Output On this screen select: • Output Table • Database name — The name of the database. • Table name — The name of the scored output table to be created-required only if a scoring option is selected. • Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected • Create output table using the MULTISET keyword — This option is not enabled for scored output tables; the MULTISET keyword is not used. • Advertise Output — The Advertise Output option “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate the SQL for this analysis, but do not execute it — If this option is selected the analysis will only generate SQL, returning it and terminating immediately. Note: To create a stored procedure to score this model, use the Refresh analysis. Run the Cluster Scoring Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: Teradata Warehouse Miner User Guide - Volume 3 153 Chapter 2: Scoring Cluster Scoring • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Cluster Scoring The results of running the Teradata Warehouse Miner Cluster Scoring Analysis include a variety of statistical reports on the scored model. All of these results are outlined below. Cluster Scoring - RESULTS - reports On the Cluster Scoring dialog, click RESULTS and then click reports (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 77: Cluster Scoring > Results > Reports • Clustering Scoring Report • Iteration — When scoring, the algorithm performs only one iteration, so this value is always 1. • Log Likelihood — This is the log likelihood value calculated using the scored data, giving a measure of the effectiveness of the model applied to this data. • Diff — Since only one iteration of the algorithm is performed when scoring, this is always 0. • Timestamp — This is the day, date, hour, minute and second marking the end of the scoring processing. The Cluster Scoring report for Fast K-Means contains the Timestamp and Message columns, where the message indicates progress and a single iteration. Cluster Scoring - RESULTS - data On the Cluster Scoring dialog click RESULTS and then click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 78: Cluster Scoring > Results > Data Results data, if any, is displayed in a data grid. 154 Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Cluster Scoring If a table was created, a sample of rows is displayed here - the size determined by the setting specified by Maximum result rows to display in Tools-Preferences-Limits. The following table is built in the requested Output Database by the Cluster Scoring analysis. 
Note that the options selected affect the structure of the table. Those columns in bold below will comprise the Unique Primary Index (UPI). Also note that there may be repeated groups of columns, and that some columns will be generated only if specific options are selected. Table 51: Output Database (Built by the Cluster Scoring analysis) Name Type Definition Key User Defined One or more unique-key columns, which default to the index, defined in the table to be scored (i.e., in Selected Tables). The data type defaults to the same as the scored table, but can be changed via Primary Index Columns and Types. Probability (Default) FLOAT The probabilities that an observation or row belongs to each cluster if the Include Cluster Probability Scores option is selected. A column is created for each possible cluster, adding the cluster number to the prefix entered in the Column Prefix option. This prefix will be used for each column generated (one per cluster) that will be populated with the probability scores. Note that the prefix used will have sequential numbers, beginning with 1 and incrementing for each cluster appended to it. (By default, the Column Prefix is p, so p1, p2, p3, etc. will be generated). These columns may be excluded by not selecting the Include Cluster Probability Scores option, but if this is done the cluster membership number must be included. Cluster Number (Default) INTEGER The column in the output score table representing the cluster number to which an observation or row belongs can be set by the user. This column may be excluded by not selecting the Include Cluster Membership option, but if this is done the cluster probability scores must be included (see above). The name of the column defaults to Cluster Number, but this can be overwritten by entering another value in Column Name under the Include Cluster Membership option. This cannot have the same name as any of the index columns in the table being scored, and the name entered cannot exist as a column in the table being scored. When scoring a Fast K-Means model, the score table differs from that shown above. The first column identifies the cluster and has the default name Cluster Number. The next columns are the Index columns and the last columns are the Retain columns, if any. (A Fast K-Means score table does not contain probability columns.) Cluster Scoring - RESULTS - SQL With Fast K-Means scoring, the SQL is displayed whether or not the Generate SQL Only option is selected. The SQL generated is simply the call to td_analyze, followed by any requested postprocessing SQL. On the Cluster Scoring dialog click RESULTS and then click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Teradata Warehouse Miner User Guide - Volume 3 155 Chapter 2: Scoring Cluster Scoring Figure 79: Cluster Scoring > Results > SQL The SQL generated for scoring is returned here, but only if the Output - Storage option to Generate the SQL for this analysis, but do not execute it was selected. When SQL is displayed here, it may be selected and copied as desired. (Both right-click menu options and buttons to Select All and Copy are available). Tutorial - Cluster Scoring In this example, the same table is scored as was used to build the cluster analysis model. 
Parameterize a Cluster Score Analysis as follows: • Selected Table — twm_customer_analysis • Include Cluster Membership — Enabled • Column Name — Clusterno • Include Cluster Probability Scores — Enabled • Column Prefix — p • Result Table Name — twm_score_cluster_1 • Index Columns — cust_id Run the analysis, and click on Results when it completes. For this example, the Cluster Scoring Analysis generated the following pages. A single click on each page name populates Results with the item. Table 52: Clustering Progress Iteration Log Likelihood Diff Timestamp 1 -24.3 0 Tue Jun 12 15:41:58 2001 Table 53: Data cust_id p1 p2 p3 clusterno 1362509 .457 .266 .276 1 1362573 1.12E-22 1 0 2 1362589 6E-03 5.378E-03 .989 3 1362693 8.724E-03 8.926E-03 .982 3 1362716 3.184E-03 3.294E-03 .994 3 156 Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Tree Scoring Table 53: Data cust_id p1 p2 p3 clusterno 1362822 .565 .132 .303 1 1363017 7.267E-02 .927 1.031E-18 2 1363078 3.598E-03 3.687E-03 .993 3 1363438 2.366E-03 2.607E-03 .995 3 1363465 .115 5.923E-02 .826 3 … … … … … … … … … … … … … … … Tree Scoring After building a model a means of deploying it is required to allow scoring of new data sets. The way in which Teradata Warehouse Miner deploys a decision tree model is via SQL. A series of SQL statements is generated from the metadata model that describes the decision tree. The SQL uses CASE statements to classify the predicted value. Here is an example of a statement: SELECT CASE WHEN(subset1 expression) THEN ‘Buy’ WHEN(subset2 expression) THEN ‘Do not Buy’ END FROM tablename; Note that Tree Scoring applies a Decision Tree model to a data set that has the same columns as those used in building the model (with the exception that the scoring input table need not include the predicted or dependent variable column unless model evaluation is requested). A number of scoring options including model evaluation and profiling rulesets are provided on the analysis parameters panel of the Tree Scoring analysis. Initiate Tree Scoring After generating a Decision Tree analysis (as described in “Decision Trees” on page 39) use the following procedure to initiate Tree Scoring: Teradata Warehouse Miner User Guide - Volume 3 157 Chapter 2: Scoring Tree Scoring 1 Click on the Add New Analysis icon in the toolbar: Figure 80: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Scoring under Categories and then under Analyses double-click on Tree Scoring: Figure 81: Add New Analysis > Scoring > Tree Scoring 3 This will bring up the Tree Scoring dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Tree Scoring - INPUT - Data Selection On the Tree Scoring dialog click on INPUT and then click on data selection: 158 Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Tree Scoring Figure 82: Tree Scoring > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). 
In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). 2 Select Columns From a Single Table • Available Databases — All available source databases that have been added on the Connection Properties dialog. • Available Tables — The tables available for scoring are listed in this window, though all may not strictly qualify; the input table to be scored must contain the same column names used in the original analysis. • Available Columns — The columns available for scoring are listed in this window. • Selected Columns • Index Columns — Note that the Selected Columns window is actually a split window for specifying Index and/or Retain columns if desired. If a table is specified as input, the primary index of the table is defaulted here, but can be changed. If a view is specified as input, an index must be provided. • Retain Columns — Other columns within the table being scored can be appended to the scored table, by specifying them here. Columns specified in Index Columns may not be specified here. 3 Select Model Analysis 4 Select from the list an existing Decision Tree analysis on which to run the scoring. The Decision Tree analysis must exist in the same project as the Decision Tree Scoring analysis. Tree Scoring - INPUT - Analysis Parameters On the Tree Scoring dialog click on INPUT and then click on analysis parameters: Teradata Warehouse Miner User Guide - Volume 3 159 Chapter 2: Scoring Tree Scoring Figure 83: Tree Scoring > Input > Analysis Parameters On this screen select: • Scoring Method • Score — Option to create a score table only. • Evaluate — Option to perform model evaluation only. Not available for Decision Tree models built using the Regression Trees option. • Evaluate and Score — Option to create a score table and perform model evaluation. Not available for Decision Tree models built using the Regression Trees option. • Scoring Options • Use Dependent variable for predicted value column name — Option to use the exact same column name as the dependent variable when the model is scored. This is the default option. • Predicted Value Column Name — If above option is not checked, then enter here the name of the column in the score table which contains the estimated value of the dependent variable. • Include Confidence Factor — If this option is checked then the confidence factor will be added to the output table. The Confidence Factor is a measure of how “confident” the model is that it can predict the correct score for a record that falls into a particular leaf node based on the training data the model was built from. Example: If a leaf node contained 10 observations and 9 of them predict Buy and the other record predicts Do Not Buy, then the model built will have a confidence factor of .9, or be 90% sure of predicting the right value for a record that falls into that leaf node of the model. If the Include validation table option was selected when the decision tree model was built, additional information is provided in the scored table and/or results depending on the scoring option selected. If Score Only is selected, a recalculated confidence factor based on the original validation table is included in the scored output table. 
If Evaluate Only is selected, a confusion matrix based on the selected table to score is added to the results. If Evaluate and Score is selected, then a confusion matrix based on the selected table to score is added to the results and a recalculated confidence factor based on the selected table to score is included in the scored output table. • 160 Targeted Confidence (Binary Outcome Only) — Models built with a predicted variable that has only 2 outcomes can add a targeted confidence value to the output table. The outcomes of the above example were 9 Buys and 1 Do Not Buy at that particular node and if the target value was set to Buy, .9 is the targeted confidence. However if it is desired to target the Do Not Buy outcome by setting the value to Do Not Buy, then any record falling into this leaf of the tree would get a targeted confidence of .1 or 10%. Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Tree Scoring If the Include validation table option was selected when the decision tree model was built, additional information is provided in a manner similar to that for the Include Confidence Factor option described above. • Targeted Value — The value for the binary targeted confidence. Note that Include Confidence Factor and Targeted Confidence are mutually exclusive options, so that only one of the two may be selected. • Create Profiling Tables — If this option is selected, additional tables are created to profile the leaf nodes in the tree and to link scored rows to the leaf nodes that they correspond to. To do this, a node ID field is added to the scored output table and two additional tables are built to describe the leaf nodes. One table contains confidence factor or targeted confidence (if requested) and prediction information (named by appending “_1” to the scored output table name), and the other contains the rules corresponding to each leaf node (named by appending “_2” to the scored output table name). Note however that selection of the option to Create Profiling Tables is ignored if the Evaluate scoring method or the output option to Generate the SQL for this analysis but do not execute it is selected. It is also ignored if the analysis is being refreshed by a Refresh analysis that requests the creation of a stored procedure. Tree Scoring - OUTPUT On the Tree Scoring dialog click on OUTPUT: Figure 84: Tree Scoring > Output On this screen select: • Output Table • Database name — The name of the database. • Table name — The name of the scored output table to be created. • Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected • Create output table using the MULTISET keyword — This option is not enabled for scored output tables; the MULTISET keyword is not used. • Advertise Output — The Advertise Output option “advertises” output (including Profiling Tables if requested) by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Teradata Warehouse Miner User Guide - Volume 3 161 Chapter 2: Scoring Tree Scoring Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. 
• Generate the SQL for this analysis, but do not execute it — If this option is selected the analysis will only generate SQL, returning it and terminating immediately. Note: To create a stored procedure to score this model, use the Refresh analysis. Run the Tree Scoring Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Tree Scoring The results of running the Teradata Warehouse Miner Decision Tree Scoring Analysis include a variety of statistical reports on the scored model. All of these results are outlined below. Tree Scoring - RESULTS - Reports On the Tree Scoring dialog click RESULTS and then click on reports (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 85: Tree Scoring > Results > Reports • Decision Tree Score Report • Resulting Scored Table Name — This is the name given the table with the scored values of the decision tree model. • Number of Rows in Scored Table — This is the number of rows in the scored decision tree table. • Confusion Matrix — A N x (N+2) (for N outcomes of the dependent variable) confusion matrix is given with the following format: Table 54: Confusion Matrix Actual ‘0’ Actual ‘1’ … Actual ‘N’ Correct Incorrect Predicted ‘0’ # correct ‘0’ Predictions # incorrect‘1’ Predictions … # incorrect ‘N’ Predictions Total Correct ‘0’ Predictions Total Incorrect ‘0’ Predictions Predicted ‘1’ # incorrect‘0’ Predictions # correct ‘1’ Predictions … # incorrect ‘N’ Predictions Total Correct ‘1’ Predictions Total Incorrect ‘1’ Predictions 162 Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Tree Scoring Table 54: Confusion Matrix Actual ‘0’ Actual ‘1’ … Actual ‘N’ Correct Incorrect … … … … … … … Predicted ‘N’ # incorrect‘0’ Predictions # incorrect ‘1’ Predictions … # correct ‘N’ Predictions Total Correct ‘N’ Predictions Total Incorrect ‘N’ Predictions • Cumulative Lift Table — The Cumulative Lift Table demonstrates how effective the model is in estimating the dependent variable. It is produced using deciles based on the probability values. Note that the deciles are labeled such that 1 is the highest decile and 10 is the lowest, based on the probability values calculated by logistic regression. The information in this report however is best viewed in the Lift Chart produced as a graph. Note that this is only valid for binary dependent variables. • Decile — The deciles in the report are based on the probability values predicted by the model. Note that 1 is the highest decile and 10 is the lowest. That is, decile 1 contains data on the 10% of the observations with the highest estimated probabilities that the dependent variable is 1. • Count — This column contains the count of observations in the decile. • Response — This column contains the count of observations in the decile where the actual value of the dependent variable is 1. • Pct Response — This column contains the percentage of observations in the decile where the actual value of the dependent variable is 1. • Pct Captured Response — This column contains the percentage of responses in the decile over all the responses in any decile. 
• Lift — The lift value is the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations. For example, if 10% of the observations overall have a dependent variable with value 1, and 20% of the observations in decile 1 have a dependent variable with value 1, then the lift value within decile 1 is 2.0, meaning that the model gives a “lift” that is better than chance alone by a factor of two in predicting response values of 1 within this decile. • Cumulative Response — This is a cumulative measure of Response, from decile 1 to this decile. • Cumulative Pct Response — This is a cumulative measure of Pct Response, from decile 1 to this decile. • Cumulative Pct Captured Response — This is a cumulative measure of Pct Captured Response, from decile 1 to this decile. • Cumulative Lift — This is a cumulative measure of Lift, from decile 1 to this decile. Tree Scoring - RESULTS - Data On the Tree Scoring dialog click RESULTS and then click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Teradata Warehouse Miner User Guide - Volume 3 163 Chapter 2: Scoring Tree Scoring Figure 86: Tree Scoring > Results > Data Results data, if any, is displayed in a data grid. If a table was created, a sample of rows is displayed here - the size determined by the setting specified by Maximum result rows to display in Tools-Preferences-Limits. The following table is built in the requested Output Database by the Decision Tree Scoring analysis. Note that the options selected affect the structure of the table. Those columns in bold below will comprise the Primary Index. Also note that there may be repeated groups of columns, and that some columns will be generated only if specific options are selected. Table 55: Output Database table (Built by the Decision Tree Scoring analysis) Name Type Definition Key User Defined One or more key columns, which default to the index, defined in the table to be scored (i.e., in Selected Table). The data type defaults to the same as the scored table, but can be changed via Primary Index Columns. <app_var> User Defined One or more columns as selected under Retain Columns. <dep_var > User Defined The predicted value of the dependent variable. The name used defaults to the Dependent Variable specified when the tree was built. If Use Dependent variable for predicted value column name is not selected, then an appropriate column name must be entered and is used here. The data type used is the same as the Dependent Variable. _tm_node_id FLOAT When the Create profiling tables option is selected this column is included to link each row with a particular leaf node in the decision tree and thereby with a specific set of rules. _tm_target, or FLOAT One of two measures that are mutually exclusive. If the Include Confidence Factor option is selected, _tm_confidence will be generated and populated with Confidence Factors - a measure of how “confident” the model is that it can predict the correct score for a record that falls into a particular leaf node based on the training data the model was built from. (Default) _tm_confidence If the Targeted Confidence (Binary Outcome Only) option is selected, then _tm_ target will be generated and populated with Targeted Confidences for models built with a predicted value that has only 2 outcomes. 
The Targeted confidence is a measure of how confident the model is that it can predict the correct score for a particular leaf node based upon a user-specified Target Value. For example, if a particular decision node had an outcome of 9 “Buys” and 1 “Do Not Buy” at that particular node, setting the Target Value to “Buy” would generate a .9, or 90%, targeted confidence. If instead the Target Value is set to “Do Not Buy”, then any record falling into this leaf of the tree would get a targeted confidence of .1, or 10%.
Table 55: Output Database table (Built by the Decision Tree Scoring analysis) (continued)
Name Type Definition
_tm_recalc_target, or _tm_recalc_confidence FLOAT Recalculated versions of the confidence factor or targeted confidence factor based on the original validation table when Score Only is selected, or based on the selected table to score when Evaluate and Score is selected.
The following table is built in the requested Output Database by the Decision Tree Scoring analysis when the Create profiling tables option is selected. (It is named by appending “_1” to the scored result table name).
Table 56: Output Database table (Built by the Decision Tree Scoring analysis) - Create profiling tables option selected (“_1” appended)
Name Type Definition
_tm_node_id FLOAT This column identifies a particular leaf node in the decision tree.
_tm_target, or _tm_confidence FLOAT The confidence factor or targeted confidence factor for this leaf node, as described above for the scored output table.
_tm_prediction VARCHAR(n) The predicted value of the dependent variable at this leaf node.
The following table is built in the requested Output Database by the Decision Tree Scoring analysis when the Create profiling tables option is selected. (It is named by appending “_2” to the scored result table name).
Table 57: Output Database table (Built by the Decision Tree Scoring analysis) - Create profiling tables option selected (“_2” appended)
Name Type Definition
_tm_node_id FLOAT This column identifies a particular leaf node in the decision tree.
_tm_sequence_id FLOAT An integer from 1 to n to order the rules associated with a leaf node.
_tm_rule VARCHAR(n) A rule for inclusion in the ruleset for this leaf node in the decision tree (rules are joined with a logical AND).
Tree Scoring - RESULTS - Lift Graph
On the Tree Scoring dialog click RESULTS and then click on lift graph (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed):
Figure 87: Tree Scoring > Results > Lift Graph
This chart displays the information in the Cumulative Lift Table. This is the same graph described in “Results - Logistic Regression” on page 134 as Lift Chart, but applied to possibly new data.
Tree Scoring - RESULTS - SQL
On the Tree Scoring dialog click RESULTS and then click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed):
Figure 88: Tree Scoring > Results > SQL
The SQL generated for scoring is returned here, but only if the Score Method was set to Score on the Input - Analysis Parameters tab, and if the Output - Storage option to Generate the SQL for this analysis, but do not execute it was selected. When SQL is displayed here, it may be selected and copied as desired. (Both right-click menu options and buttons to Select All and Copy are available).
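Because the profiling tables share the _tm_node_id column with the scored output table, they can be joined to show, for each scored row, the leaf node it fell into and the rules that define that leaf. The query below is only an illustrative sketch, not SQL generated by Teradata Warehouse Miner; it assumes the tutorial naming used in the next section (a scored table named twm_score_tree_1, built with the Create Profiling Tables option so that the ruleset table twm_score_tree_1_2 exists) and a dependent variable column named cc_acct.

    -- Sketch only: list each scored row with the rules of the leaf node that
    -- produced its prediction. Table and column names are assumptions taken
    -- from the tutorial, not output of any particular run.
    SELECT s.cust_id,
           s.cc_acct,          -- predicted value of the dependent variable
           s._tm_node_id,      -- leaf node assigned to this row
           r._tm_sequence_id,  -- ordering of the rules within the leaf's ruleset
           r._tm_rule          -- one rule; the rules for a leaf are ANDed together
    FROM   twm_score_tree_1 s
    JOIN   twm_score_tree_1_2 r
           ON s._tm_node_id = r._tm_node_id
    ORDER  BY s.cust_id, r._tm_sequence_id;

A similar join against the “_1” table returns the confidence factor or targeted confidence and the predicted value recorded for each leaf node.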
Tutorial - Tree Scoring In this example, the same table is scored as was used to build the decision tree model, as a matter of convenience. Typically, this would not be done unless the contents of the table changed since the model was built. Parameterize a Decision Tree Scoring Analysis as follows: • Selected Tables — twm_customer_analysis • Scoring Method — Evaluate and Score • Use the name of the dependent variable as the predicted value column name — Enabled • Targeted Confidence(s) - For binary outcome only — Enabled • Targeted Value — 1 • Result Table Name — twm_score_tree_1 • Primary Index Columns — cust_id Run the analysis, and click on Results when it completes. For this example, the Decision Tree Scoring Analysis generated the following pages. A single click on each page name populates Results with the item. Table 58: Decision Tree Model Scoring Report 166 Resulting Scored Table Name score_tree_1 Number of Rows in Scored File 747 Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Tree Scoring Table 59: Confusion Matrix Actual Non-Response Actual Response Correct Incorrect Predicted 0 340/45.52% 0/0.00% 340/45.52% 0/0.00% Predicted 1 32/4.28% 375/50.20% 375/50.20% 32/4.28% Cumulativ e Lift Table 60: Cumulative Lift Table Captured Response (%) Lift Cumulativ e Response Cumulativ e Response (%) Cumulativ e Captured Response (%) Decile Count Response Response (%) 1 5 5.00 100.00 1.33 1.99 5.00 100.00 1.33 1.99 2 0 0.00 0.00 0.00 0.00 5.00 100.00 1.33 1.99 3 0 0.00 0.00 0.00 0.00 5.00 100.00 1.33 1.99 4 0 0.00 0.00 0.00 0.00 5.00 100.00 1.33 1.99 5 0 0.00 0.00 0.00 0.00 5.00 100.00 1.33 1.99 6 402 370.00 92.04 98.67 1.83 375.00 92.14 100.00 1.84 7 0 0.00 0.00 0.00 0.00 375.00 92.14 100.00 1.84 8 0 0.00 0.00 0.00 0.00 375.00 92.14 100.00 1.84 9 0 0.00 0.00 0.00 0.00 375.00 92.14 100.00 1.84 10 340 0.00 0.00 0.00 0.00 375.00 50.20 100.00 1.00 Table 61: Data cust_id cc_acct _tm_target 1362480 1 0.92 1362481 0 0 1362484 1 0.92 1362485 0 0 1362486 1 0.92 … … … Lift Graph Additionally, you can click on Lift Chart to view the contents of the Lift Table graphically. Teradata Warehouse Miner User Guide - Volume 3 167 Chapter 2: Scoring Factor Scoring Factor Scoring Factor analysis is designed primarily for the purpose of discovering the underlying structure or meaning in a set of variables and to facilitate their reduction to a fewer number of variables called factors or components. The first goal is facilitated by finding the factor loadings that describe the variables in a data set in terms of a linear combination of factors. The second goal is facilitated by finding a description for the factors as linear combinations of the original variables they describe. These are sometimes called factor measurements or scores. After computing the factor loadings, computing factor scores might seem like an afterthought, but it is somewhat more involved than that. Teradata Warehouse Miner does automate the process however based on the model information stored in metadata results tables, computing factor scores directly in the database by dynamically generating and executing SQL. Note that Factor Scoring computes factor scores for a data set that has the same columns as those used in performing the selected Factor Analysis. When scoring is performed, a table is created including index (key) columns, optional “retain” columns, and factor scores for each row in the input table being scored. 
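As noted above, factor scores are written as linear combinations of the original variables (after centering and, when the model was built on a correlation matrix, scaling, as described below), and the combination is executed as generated SQL in the database. The fragment below is a hedged sketch of that idea only; the means, standard deviations and weights shown are invented placeholders rather than coefficients from a real factor model, and the actual generated SQL contains one such weighted term per variable for every factor in the model.

    -- Hypothetical sketch of a single generated factor-score expression. The
    -- numeric means, standard deviations and weights are placeholders only.
    SELECT cust_id,
           ((income     - 27000.0) / 21000.0) * 0.42
         + ((age        -    38.0) /    12.0) * 0.31
         + ((avg_sv_bal -   950.0) /  1800.0) * 0.18 AS factor1
    FROM   twm_customer_analysis;

In practice the Factor Scoring analysis builds and runs this SQL for you, writing the factor1 ... factorn columns into the result table together with the index and any retain columns.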
Scoring is performed differently depending on the type of factor analysis that was performed, whether principal components (PCA), principal axis factors (PAF) or maximum likelihood factors (MLF). Further, scoring is affected by whether or not the factor analysis included a rotation. Also, input data is centered based on the mean value of each variable, and if the factor analysis was performed on a correlation matrix, input values are each divided by the standard deviation of the variable in order to normalize to unit length variance. When scoring a table using a PCA factor analysis model, the scores can be calculated directly without estimation, even if an orthogonal rotation was performed. When scoring using a PAF or MLF model, or a PCA model with an oblique rotation, a unique solution does not exist and cannot be directly solved for (a condition known as the indeterminacy of factor measurements). There are many techniques however for estimating factor measurements, and the technique used by Teradata Warehouse Miner is known as estimation by regression. This technique involves regressing each factor on the original variables in the factor analysis model using linear regression techniques. It gives an accurate solution in the “least-squared error” sense but it typically introduces some degree of dependence or correlation in the computed factor scores. A final word about the independence or orthogonality of factor scores is appropriate here. It was pointed out earlier that factor loadings are orthogonal using the techniques offered by Teradata Warehouse Miner unless an oblique rotation is performed. Factor scores however will not necessarily be orthogonal for principal axis factors and maximum likelihood factors and with oblique rotations since scores are estimated by regression. This is a subtle distinction that is an easy source of confusion. That is, the new variables or factor scores created by a factor analysis, expressed as a linear combination of the original variables, are not necessarily independent of each other, even if the factors themselves are. The user may measure their independence however by using the Matrix and Export Matrix functions to build a correlation matrix from the factor score table once it is built. 168 Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Factor Scoring Initiate Factor Scoring After generating a Factor Analysis (as described in “Factor Analysis” on page 62) use the following procedure to initiate Factor Scoring: 1 Click on the Add New Analysis icon in the toolbar: Figure 89: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Scoring under Categories and then under Analyses double-click on Factor Scoring: Figure 90: Add New Analysis > Scoring > Factor Scoring 3 This will bring up the Factor Scoring dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Factor Scoring - INPUT - Data Selection On the Factor Scoring dialog click on INPUT and then click on data selection: Teradata Warehouse Miner User Guide - Volume 3 169 Chapter 2: Scoring Factor Scoring Figure 91: Factor Scoring > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. 
By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). 2 3 Select Columns From a Single Table • Available Databases — All available source databases that have been added on the Connection Properties dialog. • Available Tables — The tables available for scoring are listed in this window, though all may not strictly qualify; the input table to be scored must contain the same column names used in the original analysis. • Available Columns — The columns available for scoring are listed in this window. • Selected Columns • Index Columns — Note that the Selected Columns window is actually a split window for specifying Index and/or Retain columns if desired. If a table is specified as input, the primary index of the table is defaulted here, but can be changed. If a view is specified as input, an index must be provided. • Retain Columns — Other columns within the table being scored can be appended to the scored table, by specifying them here. Columns specified in Index Columns may not be specified here. Select Model Analysis Select from the list an existing Factor Analysis analysis on which to run the scoring. The Factor Analysis analysis must exist in the same project as the Factor Scoring analysis. Factor Scoring - INPUT - Analysis Parameters On the Factor Scoring dialog click on INPUT and then click on analysis parameters: 170 Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Factor Scoring Figure 92: Factor Scoring > Input > Analysis Parameters On this screen select: • Scoring Method • Score — Option to create a score table only. • Evaluate — Option to perform model evaluation only. • Evaluate and Score — Option to create a score table and perform model evaluation. • Factor Names — The names of the factor columns in the created table of scores are optional parameters if scoring is selected. The default names of the factor columns are factor1, factor2 ... factorn. Factor Scoring - OUTPUT On the Factor Scoring dialog click on OUTPUT: Figure 93: Factor Scoring > Output On this screen select: • Output Table • Database name — The name of the database. • Table name — The name of the scored output table to be created-required only if a scoring option is selected. • Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected • Create output table using the MULTISET keyword — This option is not enabled for scored output tables; the MULTISET keyword is not used. • Advertise Output — The Advertise Output option “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. 
It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
• Generate the SQL for this analysis, but do not execute it — If this option is selected the analysis will only generate SQL, returning it and terminating immediately. Note: To create a stored procedure to score this model, use the Refresh analysis.
Run the Factor Scoring Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Factor Scoring
The results of running the Teradata Warehouse Miner Factor Analysis Scoring/Evaluation Analysis include a variety of statistical reports on the scored model. All of these results are outlined below.
Factor Scoring - RESULTS - reports
On the Factor Scoring dialog click RESULTS and then click on reports (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed):
Figure 94: Factor Scoring > Results > Reports
• Resulting Scored Table — Name of the scored table - equivalent to Result Table Name.
• Number of Rows in Scored Table — Number of rows in the Resulting Scored Table.
• Evaluation — Model evaluation for factor analysis consists of computing the standard error of estimate for each variable based on working backwards and re-estimating their values using the scored factors. Estimated values of the original data are made using the factor scoring equation $\hat{Y} = XC^{T}$, where $\hat{Y}$ is the estimated raw data, $X$ is the scored data, and $C$ is the factor pattern matrix or rotated factor pattern matrix if rotation was included in the model. The standard error of estimate for each variable $y$ in the original data $Y$ is then given by $\sqrt{\sum (y - \hat{y})^{2} / (n - p)}$, where each $\hat{y}$ is the estimated value of the variable $y$, $n$ is the number of observations and $p$ is the number of factors.
Factor Scoring - RESULTS - Data
On the Factor Scoring dialog click RESULTS and then click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed):
Figure 95: Factor Scoring > Results > Data
Results data, if any, is displayed in a data grid. If a table was created, a sample of rows is displayed here - the size determined by the setting specified by Maximum result rows to display in Tools-Preferences-Limits. The following table is built in the requested Output Database by Factor Scoring. Note that the options selected affect the structure of the table. Those columns in bold below will comprise the Unique Primary Index (UPI). Also note that there may be repeated groups of columns, and that some columns will be generated only if specific options are selected.
Table 62: Output Database table (Built by Factor Scoring)
Name Type Definition
Key User Defined One or more unique-key columns which default to the index defined in the table to be scored (i.e., in Selected Tables). The data type defaults to the same as the scored table, but can be changed via Primary Index Columns.
<app_var> User Defined One or more columns as selected under Retain Columns. The data type defaults to the same as that within the appended table, but can be changed via Columns Types (for appended columns).
Factorx FLOAT A column generated for each scored factor. The names of the factor columns in the created table of scores are optional parameters if scoring is selected. The default names of the factor columns are factor1, factor2, ... factorn, unless Factor Names are specified. (Default) Factor Scoring - RESULTS - SQL On the Factor Scoring dialog click RESULTS and then click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 96: Factor Scoring > Results > SQL Teradata Warehouse Miner User Guide - Volume 3 173 Chapter 2: Scoring Factor Scoring The SQL generated for scoring is returned here, but only if the Score Method was set to Score on the Input - Analysis Parameters tab, and if the Output - Storage option to Generate the SQL for this analysis, but do not execute it was selected. When SQL is displayed here, it may be selected and copied as desired. (Both right-click menu options and buttons to Select All and Copy are available). Tutorial - Factor Scoring In this example, the same table is scored as was used to build the factor analysis model. Parameterize a Factor Analysis Scoring Analysis as follows: • Selected Table — twm_customer_analysis • Evaluate and Score — Enabled • Factor Names • Factor1 • Factor2 • Factor3 • Factor4 • Factor5 • Factor6 • Factor7 • Result Table Name — twm_score_factor_1 • Index Columns — cust_id Run the analysis, and click on Results when it completes. For this example, the Factor Analysis Scoring/Evaluation function generated the following pages. A single click on each page name populates Results with the item. Table 63: Factor Analysis Score Report Resulting Scored Table <result_db >.score_factor_1 Number of Rows in Scored Table 747 Table 64: Evaluation 174 Variable Name Standard Error of Estimate income 0.4938 age 0.5804 years_with_bank 0.5965 nbr_children 0.6180 female 0.8199 Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Linear Scoring Table 64: Evaluation Variable Name Standard Error of Estimate single 0.3013 married 0.3894 separated 0.4687 ccacct 0.6052 ckacct 0.5660 svacct 0.5248 avg_cc_bal 0.4751 avg_ck_bal 0.6613 avg_sv_bal 0.7166 avg_cc_tran_amt 0.8929 avg_cc_tran_cnt 0.5174 avg_ck_tran_amt 0.3563 avg_ck_tran_cnt 0.7187 avg_sv_tran_amt 0.4326 avg_sv_tran_cnt 0.6967 cc_rev 0.3342 Table 65: Data cust_id factor1 factor2 factor3 factor4 factor5 factor6 factor7 1362480 1.43 -0.28 1.15 -0.50 -0.31 -0.05 1.89 1362481 -1.03 -1.37 0.57 -0.08 -0.60 -0.39 -0.55 ... ... ... ... ... ... ... ... Linear Scoring Once a linear regression model has been built, it can be used to “score” new data, that is, to estimate the value of the dependent variable in the model using data for which its value may not be known. Scoring is performed using the values of the b-coefficients in the linear regression model and the names of the independent variable columns they correspond to. Other information needed includes the table name(s) in which the data resides, the new table to be created, and primary index information for the new table. The result of scoring a linear regression model will be a new table containing primary index columns and an estimate of the Teradata Warehouse Miner User Guide - Volume 3 175 Chapter 2: Scoring Linear Scoring dependent variable, optionally including a residual value for each row, calculated as the difference between the estimated value and the actual value of the dependent variable. 
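To make the preceding description concrete, the following is a rough sketch of the kind of SELECT that linear scoring generates: the model constant plus each b-coefficient multiplied by its independent variable column. The constant, the coefficient values and the particular variables are invented placeholders, and the table name simply follows the tutorial later in this section; the actual SQL is produced from the model metadata.

    -- Hypothetical sketch of generated linear scoring SQL; the constant, the
    -- coefficients and the chosen variables are placeholders, not a real model.
    SELECT cust_id,
           4.75
           + 0.0012 * income
           + 0.8500 * avg_cc_tran_cnt
           - 0.0300 * age AS cc_rev   -- estimate of the dependent variable
    FROM   twm_customer_analysis;

When Evaluate and Score is requested, the generated SQL also compares this estimate with the actual value of the dependent variable to populate the residual column described above.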
(The option to include the residual value is available only when model evaluation is requested). Note that Linear Scoring applies a Linear Regression model to a data set that has the same columns as those used in building the model (with the exception that the scoring input table need not include the predicted or dependent variable column unless model evaluation is requested).
Linear Regression Model Evaluation
Linear regression model evaluation begins with scoring a table that includes the actual values of the dependent variable. The standard error of estimate for the model is calculated and reported and may be compared to the standard error of estimate reported when the model was built. The standard error of estimate is calculated as the square root of the average squared residual value over all the observations, i.e. $\sqrt{\sum (y - \hat{y})^{2} / (n - p - 1)}$, where $y$ is the actual value of the dependent variable, $\hat{y}$ is its predicted value, $n$ is the number of observations, and $p$ is the number of independent variables (substituting $n - p$ in the denominator if there is no constant term).
Initiate Linear Scoring
After generating a Linear Regression analysis (as described in “Linear Regression” on page 92) use the following procedure to initiate Linear Regression Scoring:
1 Click on the Add New Analysis icon in the toolbar:
Figure 97: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Scoring under Categories and then under Analyses double-click on Linear Scoring:
Figure 98: Add New Analysis > Scoring > Linear Scoring
3 This will bring up the Linear Scoring dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
Linear Scoring - INPUT - Data Selection
On the Linear Scoring dialog click on INPUT and then click on data selection:
Figure 99: Linear Scoring > Input > Data Selection
On this screen select:
1 Select Input Source
Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view).
2 Select Columns From a Single Table
• Available Databases — All available source databases that have been added on the Connection Properties dialog.
• Available Tables — The tables available for scoring are listed in this window, though all may not strictly qualify; the input table to be scored must contain the same column names used in the original analysis.
• Available Columns — The columns available for scoring are listed in this window.
• Selected Columns • Index Columns — Note that the Selected Columns window is actually a split window for specifying Index and/or Retain columns if desired. If a table is specified as input, the primary index of the table is defaulted here, but can be changed. If a view is specified as input, an index must be provided. • Retain Columns — Other columns within the table being scored can be appended to the scored table, by specifying them here. Columns specified in Index Columns may not be specified here. Select Model Analysis Select from the list an existing Linear Regression analysis on which to run the scoring. The Linear Regression analysis must exist in the same project as the Linear Scoring analysis. Linear Scoring - INPUT - Analysis Parameters On the Linear Scoring dialog click on INPUT and then click on analysis parameters: Figure 100: Linear Scoring > Input > Analysis Parameters On this screen select: • Scoring Method • Score — Option to create a score table only. • Evaluate — Option to perform model evaluation only. • Evaluate and Score — Option to create a score table and perform model evaluation. • Scoring Options • Use Dependent variable for predicted value column name — Option to use the exact same column name as the dependent variable when the model is scored. This is the default option. • 178 Predicted Value Column Name — If above option is not checked, then enter here the name of the column in the score table which contains the estimated value of the dependent variable. Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Linear Scoring • Residual Column Name — If Evaluate and Score is requested, enter the name of the column that will contain the residual values of the evaluation. This column will be populated with the difference between the estimated value and the actual value of the dependent variable. Linear Scoring - OUTPUT On the Linear Scoring dialog click on OUTPUT: Figure 101: Linear Scoring > Output On this screen select: • Output Table • Database name — The name of the database. • Table name — The name of the scored output table to be created-required only if a scoring option is selected. • Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected • Create output table using the MULTISET keyword — This option is not enabled for scored output tables; the MULTISET keyword is not used. • Advertise Output — The Advertise Output option “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate the SQL for this analysis, but do not execute it — If this option is selected the analysis will only generate SQL, returning it and terminating immediately.Hint: To create a stored procedure to score this model, use the Refresh analysis. Note: To create a stored procedure to score this model, use the Refresh analysis. Run the Linear Scoring Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. 
To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Linear Scoring
The results of running the Linear Regression Scoring/Evaluation analysis include a variety of statistical reports on the scored model. All of these results are outlined below.
Linear Scoring - RESULTS - reports
On the Linear Scoring dialog click RESULTS and then click on reports (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed):
Figure 102: Linear Scoring > Results > Reports
• Resulting Scored Table — Name of the scored table - equivalent to Result Table Name.
• Number of Rows in Scored Table — Number of rows in the Resulting Scored Table.
• Evaluation
• Minimum Absolute Error
• Maximum Absolute Error
• Average Absolute Error
The term ‘error’ in the evaluation of a linear regression model refers to the difference between the value of the dependent variable predicted by the model and the actual value in a training set of data (data where the value of the dependent variable is known). Considering the absolute value of the error (changing negative differences to positive differences) provides a measure of the magnitude of the error in the model, which is a more useful measure of the model’s accuracy. With this introduction, the terms minimum, maximum and average absolute error have the usual meanings when calculated over all the observations in the input or scored table.
• Standard Error of Estimate
The standard error of estimate is calculated as the square root of the average squared residual value over all the observations, i.e. $\sqrt{\sum (y - \hat{y})^{2} / (n - p - 1)}$, where $y$ is the actual value of the dependent variable, $\hat{y}$ is its predicted value, $n$ is the number of observations, and $p$ is the number of independent variables (substitute $n - p$ in the denominator if there is no constant term).
Linear Scoring - RESULTS - data
On the Linear Scoring dialog click RESULTS and then click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed):
Figure 103: Linear Scoring > Results > Data
Results data, if any, is displayed in a data grid. If a table was created, a sample of rows is displayed here - the size determined by the setting specified by Maximum result rows to display in Tools-Preferences-Limits. The following table is built in the requested Output Database by Linear Regression scoring. Note that the options selected affect the structure of the table. Those columns in bold below will comprise the Unique Primary Index (UPI). Also note that there may be repeated groups of columns, and that some columns will be generated only if specific options are selected.
Table 66: Output Database table (Built by Linear Regression scoring)
Name Type Definition
Key User Defined One or more unique-key columns which default to the index defined in the table to be scored (i.e., in Selected Tables). The data type defaults to the same as the scored table, but can be changed via Primary Index Columns.
<app_var> User Defined One or more columns as selected under Retain Columns.
<dep_var> FLOAT The predicted value of the dependent variable. The name used defaults to the Dependent Variable specified when the model was built.
If Use Dependent variable for predicted value column name is not selected, then an appropriate column name must be entered here. FLOAT The residual values of the evaluation, the difference between the estimated value and the actual value of the dependent variable. This is generated only if the Evaluate or Evaluate and Score options are selected. The name defaults to “Residual” unless it is overwritten by the user. (Default) Residual (Default) Linear Scoring - RESULTS - SQL On the Linear Scoring dialog click RESULTS and then click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 104: Linear Scoring > Results > SQL Teradata Warehouse Miner User Guide - Volume 3 181 Chapter 2: Scoring Linear Scoring The SQL generated for scoring is returned here, but only if the Score Method was set to Score on the Input - Analysis Parameters tab, and if the Output - Storage option to Generate the SQL for this analysis, but do not execute it was selected. When SQL is displayed here, it may be selected and copied as desired. (Both right-click menu options and buttons to Select All and Copy are available). Tutorial - Linear Scoring In this example, the same table is scored as was used to build the linear model, as a matter of convenience. Typically, this would not be done unless the contents of the table changed since the model was built. In the case of this example, the Standard Error of Estimate can be seen to be exactly the same, 10.445, that it was when the model was built (see “Tutorial - Linear Regression” on page 113). Parameterize a Linear Regression Scoring Analysis as follows: • Selected Table — twm_customer_analysis • Evaluate and Score — Enabled • Use dependent variable for predicted value column name — Enabled • Residual column name — Residual • Result Table Name — twm_score_linear_1 • Index Columns — cust_id Run the analysis, and click on Results when it completes. For this example, the Linear Regression Scoring/Evaluation Analysis generated the following pages. A single click on each page name populates Results with the item. Table 67: Linear Regression Reports Resulting Scored Table <result_db>.score_linear_1 Number of Rows in Scored Table 747 Table 68: Evaluation Minimum Absolute Error 0.0056 Maximum Absolute Error 65.7775 Average Absolute Error 7.2201 Standard Error of Estimate 10.4451 Table 69: Data 182 cust_id cc_rev Residual 1362480 59.188 15.812 1362481 3.412 -3.412 Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Logistic Scoring Table 69: Data cust_id cc_rev Residual 1362484 12.254 -.254 1362485 28.272 1.728 1362486 -9.026E-02 9.026E-02 1362487 14.325 -1.325 1362488 -5.105 5.105 1362489 69.738 12.262 1362492 53.368 .632 1362496 -5.876 5.876 … … … … … … … … … Logistic Scoring Once a logistic regression model has been built, it can be used to “score” new data, that is, to estimate the value of the dependent variable in the model using data for which its value may not be known. Scoring is performed using the values of the b-coefficients in the logistic regression model and the names of the independent variable column names they correspond to. This information resides in the results metadata stored in the Teradata database by Teradata Warehouse Miner. Other information needed includes the table name in which the data resides, the new table to be created, and primary index information for the new table. Scoring a logistic regression model requires some steps beyond those required in scoring a linear regression model. 
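For reference, the probability written to the score table takes the standard logistic form applied to the linear combination of the b-coefficients and independent variable values mentioned above (the precise formulation is given with the Logistic Regression algorithm earlier in this volume):

$P(y = 1 \mid x_1, \ldots, x_p) = \dfrac{1}{1 + e^{-(b_0 + b_1 x_1 + \cdots + b_p x_p)}}$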
The result of scoring a logistic regression model will be a new table containing primary index columns, the probability that the dependent variable is 1 (representing the response value) rather than 0 (representing the non-response value), and/or an estimate of the dependent variable, either 0 or 1, based on a user specified threshold value. For example, if the threshold value is 0.5, then a value of 1 is estimated if the probability value is greater than or equal to 0.5. The probability is based on the logistic regression functions given earlier. The user can achieve different results based on the threshold value applied to the probability. The model evaluation tables described below can be used to determine what this threshold value should be. Note that Logistic Scoring applies a Logistic Regression model to a data set that has the same columns as those used in building the model (with the exception that the scoring input table Teradata Warehouse Miner User Guide - Volume 3 183 Chapter 2: Scoring Logistic Scoring need not include the predicted or dependent variable column unless model evaluation is requested). Logistic Regression Model Evaluation The same model evaluation that is available when building a Logistic Regression model is also available when scoring it, including the following report tables. Prediction Success Table The prediction success table is computed using only probabilities and not estimates based on a threshold value. Using an input table that contains known values for the dependent variable, the sum of the probability values x and 1 – x , which correspond to the probability that the predicted value is 1 or 0 respectively, are calculated separately for rows with actual values of 1 and 0. This produces a report table such as that shown below. Table 70: Prediction Success Table Estimate Response Estimate Non-Response Actual Total Actual Response 306.5325 68.4675 375.0000 Actual Non-Response 69.0115 302.9885 372.0000 Estimated Total 375.5440 371.4560 747.0000 An interesting and useful feature of this table is that it is independent of the threshold value that is used in scoring to determine which probabilities correspond to an estimate of 1 and 0 respectively. This is possible because the entries in the “Estimate Response” column are the sums of the probabilities x that the outcome is 1, summed separately over the rows where the actual outcome is 1 and 0 and then totaled. Similarly, the entries in the “Estimate NonResponse” column are the sums of the probabilities 1 – x that the outcome is 0. Multi-Threshold Success Table This table provides values similar to those in the prediction success table, but instead of summing probabilities, the estimated values based on a threshold value are summed instead. Rather than just one threshold however, several thresholds ranging from a user specified low to high value are displayed in user specified increments. This allows the user to compare several success scenarios using different threshold values, to aid in the choice of an ideal threshold. It might be supposed that the ideal threshold value would be the one that maximizes the number of correctly classified observations. However, subjective business considerations may be applied by looking at all of the success values. It may be that wrong predictions in one direction (say estimate 1 when the actual value is 0) may be more tolerable than in the other direction (estimate 0 when the actual value is 1). 
One may, for example, mind less overlooking fraudulent behavior than wrongly accusing someone of fraud. The following is an example of a logistic regression multi-threshold success table. 184 Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Logistic Scoring Table 71: Logistic Regression Multi-Threshold Success table Threshold Probability Actual Response, Estimate Response Actual Response, Estimate NonResponse Actual Non-Response, Estimate Response Actual Non-Response, Estimate NonResponse 0.0000 375 0 372 0 0.0500 375 0 326 46 0.1000 374 1 231 141 0.1500 372 3 145 227 0.2000 367 8 93 279 0.2500 358 17 59 313 0.3000 354 21 46 326 0.3500 347 28 38 334 0.4000 338 37 32 340 0.4500 326 49 27 345 0.5000 318 57 27 345 0.5500 304 71 26 346 0.6000 296 79 24 348 0.6500 287 88 22 350 0.7000 279 96 21 351 0.7500 270 105 19 353 0.8000 258 117 18 354 0.8500 245 130 16 356 0.9000 222 153 12 360 0.9500 187 188 10 362 Cumulative Lift Table The Cumulative Lift Table is produced for deciles based on the probability values. Note that the deciles are labeled such that 1 is the highest decile and 10 is the lowest, based on the probability values calculated by logistic regression. Within each decile, the following measures are given: 1 count of “response” values 2 count of observations 3 percentage response (percentage of response values within the decile) 4 captured response (percentage of responses over all response values) Teradata Warehouse Miner User Guide - Volume 3 185 Chapter 2: Scoring Logistic Scoring 5 lift value (percentage response / expected response, where the expected response is the percentage of responses over all observations) 6 cumulative versions of each of the measures above The following is an example of a logistic regression Cumulative Lift Table. Table 72: Logistic Regression Cumulative Lift Table Cumulative Response Cumulative Response (%) Cumulative Captured Response (%) Cumulative Lift Decile Count Response Response (%) Captured Response (%) 1 74.0000 73.0000 98.6486 19.4667 1.9651 73.0000 98.6486 19.4667 1.9651 2 75.0000 69.0000 92.0000 18.4000 1.8326 142.0000 95.3020 37.8667 1.8984 3 75.0000 71.0000 94.6667 18.9333 1.8858 213.0000 95.0893 56.8000 1.8942 4 74.0000 65.0000 87.8378 17.3333 1.7497 278.0000 93.2886 74.1333 1.8583 5 75.0000 63.0000 84.0000 16.8000 1.6733 341.0000 91.4209 90.9333 1.8211 6 75.0000 23.0000 30.6667 6.1333 0.6109 364.0000 81.2500 97.0667 1.6185 7 74.0000 8.0000 10.8108 2.1333 0.2154 372.0000 71.2644 99.2000 1.4196 8 75.0000 2.0000 2.6667 0.5333 0.0531 374.0000 62.6466 99.7333 1.2479 9 75.0000 1.0000 1.3333 0.2667 0.0266 375.0000 55.8036 100.0000 1.1116 10 75.0000 0.0000 0.0000 0.0000 0.0000 375.0000 50.2008 100.0000 1.0000 Lift Initiate Logistic Scoring After generating a Logistic Regression analysis (as described in “Logistic Regression” on page 120) use the following procedure to initiate Logistic Scoring: 1 Click on the Add New Analysis icon in the toolbar: Figure 105: Add New Analysis from toolbar 2 186 In the resulting Add New Analysis dialog box, click on Scoring under Categories and then under Analyses double-click on Logistic Scoring: Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Logistic Scoring Figure 106: Add New Analysis > Scoring > Logistic Scoring 3 This will bring up the Logistic Scoring dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. 
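A summary along the lines of the Cumulative Lift Table shown above can also be computed directly from a scored table once it exists. The query below is an illustrative sketch only, not the SQL that the analysis itself runs; it assumes the tutorial names used at the end of this chapter, namely a scored table score_logistic_1 keyed by cust_id with a Probability column, and a 0/1 response column cc_acct available in the input table twm_customer_analysis.

    -- Sketch only: per-decile response counts and response percentage from a
    -- scored table; deciles are formed on the scored probability, with decile 1
    -- holding the highest probabilities. All object names are assumptions.
    SELECT s.decile,
           COUNT(*)                           AS decile_count,
           SUM(t.cc_acct)                     AS response_count,
           100.0 * SUM(t.cc_acct) / COUNT(*)  AS pct_response
    FROM (
           SELECT cust_id,
                  10 - QUANTILE(10, Probability) AS decile
           FROM   score_logistic_1
         ) s
    JOIN   twm_customer_analysis t
           ON t.cust_id = s.cust_id
    GROUP  BY s.decile
    ORDER  BY s.decile;

Dividing pct_response by the overall response rate gives the lift value for each decile, and running totals over the deciles give the cumulative columns.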
Logistic Scoring - INPUT - Data Selection On the Logistic Scoring dialog click on INPUT and then click on data selection: Figure 107: Logistic Scoring > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). 2 Select Columns From a Single Table Teradata Warehouse Miner User Guide - Volume 3 187 Chapter 2: Scoring Logistic Scoring 3 • Available Databases — All available source databases that have been added on the Connection Properties dialog. • Available Tables — The tables available for scoring are listed in this window, though all may not strictly qualify; the input table to be scored must contain the same column names used in the original analysis. • Available Columns — The columns available for scoring are listed in this window. • Selected Columns • Index Columns — Note that the Selected Columns window is actually a split window for specifying Index and/or Retain columns if desired. If a table is specified as input, the primary index of the table is defaulted here, but can be changed. If a view is specified as input, an index must be provided. • Retain Columns — Other columns within the table being scored can be appended to the scored table, by specifying them here. Columns specified in Index Columns may not be specified here. Select Model Analysis Select from the list an existing Logistic Regression analysis on which to run the scoring. The Logistic Regression analysis must exist in the same project as the Logistic Scoring analysis. Logistic Scoring - INPUT - Analysis Parameters On the Logistic Scoring dialog click on INPUT and then click on analysis parameters: Figure 108: Logistic Scoring > Input > Analysis Parameters On this screen select: • Scoring Method • Score — Option to create a score table only. • Evaluate — Option to perform model evaluation only. • Evaluate and Score — Option to create a score table and perform model evaluation. • Scoring Options • Include Probability Score Column — Inclusion of a column in the score table that contains the probability between 0 and 1 that the value of the dependent variable is 1 is an optional parameter when scoring is selected. The default is to include a probability score column in the created score table. (Either the probability score or the estimated value or both must be requested when scoring). • 188 Column Name — Column name containing the probability between 0 and 1 that the value of the dependent variable is 1. Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Logistic Scoring • Include Estimate from Threshold Column — Inclusion of a column in the score table that contains the estimated value of the dependent variable is an option when scoring is selected. 
The default is to include an estimated value column in the created score table. (Either the probability score or the estimated value or both must be requested when scoring). • Column Name — Column name containing the estimated value of the dependent variable. • Threshold Default — The threshold value is a value between 0 and 1 that determines which probabilities result in an estimated value of 0 or 1. For example, with a threshold value of 0.3, probabilities of 0.3 or greater yield an estimated value of 1, while probabilities less than 0.3 yield an estimated value of 0. The threshold option is valid only if the Include Estimate option has been requested and scoring is selected. If the Include Estimate option is requested but the threshold value is not specified, a default threshold value of 0.5 is used. • Evaluation Options • Prediction Success Table — Creates a prediction success table using sums of probabilities rather than estimates based on a threshold value. The default value is to include the Prediction Success Table. (This only applies if evaluation is requested). • Multi-Threshold Success Table — This table provides values similar to those in the prediction success table, but based on a range of threshold values, thus allowing the user to compare success scenarios using different threshold values. The default value is to include the multi-threshold success table. (This only applies if evaluation is requested). • Threshold Begin • Threshold End • Threshold Increment Specifies the threshold values to be used in the multi-threshold success table. If the computed probability is greater than or equal to a threshold value, that observation is assigned a 1 rather than a 0. Default values are 0, 1 and .05 respectively. • Cumulative Lift Table — Produce a cumulative lift table for deciles based on probability values. The default value is to include the cumulative lift table. (This only applies if evaluation is requested). Logistic Scoring - OUTPUT On the Logistic Scoring dialog click on OUTPUT: Figure 109: Logistic Scoring > Output On this screen select: Teradata Warehouse Miner User Guide - Volume 3 189 Chapter 2: Scoring Logistic Scoring • Output Table • Database name — The name of the database. • Table name — The name of the scored output table to be created-required only if a scoring option is selected. • Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected. • Create output table using the MULTISET keyword — This option is not enabled for scored output tables; the MULTISET keyword is not used. • Advertise Output — The Advertise Output option “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate the SQL for this analysis, but do not execute it — If this option is selected the analysis will only generate SQL, returning it and terminating immediately. Note: To create a stored procedure to score this model, use the Refresh analysis. Run the Logistic Scoring Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. 
To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Logistic Scoring The results of running the Logistic Scoring analysis include a variety of statistical reports on the scored model, and if selected, a Lift Chart. All of these results are outlined below. It is important to note that although a response value other than 1 may have been indicated when the Logistic Regression model was built, the Logistic Regression Scoring analysis will always use the value 1 as the response value, and the value 0 for the non-response value(s). Logistic Scoring - RESULTS - Reports On the Logistic Scoring dialog click RESULTS and then click on reports (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): 190 Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Logistic Scoring Figure 110: Logistic Scoring > Results > Reports • Resulting Scored Table — Name of the scored table - equivalent to Result Table Name. • Number of Rows in Scored Table — Number of rows in the Resulting Scored Table. • Prediction Success Table — This is the same report described in “Results - Logistic Regression” on page 134, but applied to possibly new data. • Multi-Threshold Success Table — This is the same report described in “Results - Logistic Regression” on page 134, but applied to possibly new data. • Cumulative Lift Table — This is the same report described in “Results - Logistic Regression” on page 134, but applied to possibly new data. Logistic Scoring - RESULTS - Data On the Logistic Scoring dialog click RESULTS and then click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 111: Logistic Scoring > Results > Data Results data, if any, is displayed in a data grid. If a table was created, a sample of rows is displayed here - the size determined by the setting specified by Maximum result rows to display in Tools-Preferences-Limits. The following table is built in the requested Output Database by Logistic Regression scoring. Note that the options selected affect the structure of the table. Those columns in bold below will comprise the Unique Primary Index (UPI). Also note that there may be repeated groups of columns, and that some columns will be generated only if specific options are selected. Table 73: Output Database table (Built by Logistic Regression scoring) Name Type Definition Key User Defined One or more unique-key columns which default to the index defined in the table to be scored (i.e., in Selected Table). The data type defaults to the same as the scored table, but can be changed via Primary Index Columns. <app_var> User Defined One or more columns as selected under Retain Columns. Teradata Warehouse Miner User Guide - Volume 3 191 Chapter 2: Scoring Logistic Scoring Table 73: Output Database table (Built by Logistic Regression scoring) Name Type Definition Probability FLOAT A probability between 0 and 1 that the value of the dependent variable is 1. The name used defaults to “Probability” unless an appropriate column name is entered. Generated only if Include Probability Score Column is selected. The default is to not include a probability score column in the created score table. (Either the probability score or the estimated value or both must be requested when scoring). FLOAT The estimated value of the dependent variable,. 
The default is to not include an estimated value column in the created score table. Generated only if Include Estimate from Threshold Column is selected. (Either the probability score or the estimated value or both must be requested when scoring). (Default) Estimate (Default) Logistic Scoring - RESULTS - Lift Graph On the Logistic Scoring dialog click RESULTS and then click on lift graph (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 112: Logistic Scoring > Results > Lift Graph This chart displays the information in the Cumulative Lift Table. This is the same graph described in “Results - Logistic Regression” on page 134 as Lift Chart, but applied to possibly new data. Logistic Scoring - RESULTS - SQL On the Logistic Scoring dialog click RESULTS and then click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 113: Logistic Scoring > Results > SQL The SQL generated for scoring is returned here, but only if the Score Method was set to Score on the Input - Analysis Parameters tab, and if the Output - Storage option to Generate the SQL for this analysis, but do not execute it was selected. When SQL is displayed here, it may be selected and copied as desired. (Both right-click menu options and buttons to Select All and Copy are available). 192 Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Logistic Scoring Tutorial - Logistic Scoring In this example, the same table is scored as was used to build the logistic regression model, as a matter of convenience. Typically, this would not be done unless the contents of the table changed since the model was built. Parameterize a Logistic Regression Scoring Analysis as follows: • Selected Table — twm_customer_analysis • Evaluate and Score — Enabled • Include Probability Score Column — Enabled • Column Name — Probability • Include Estimate from Threshold Column — Enabled • Column Name — Estimate • Threshold Default — 0.35 • Prediction Success Table — Enabled • Multi-Threshold Success Table — Enabled • Threshold Begin — 0 • Threshold End — 1 • Threshold Increment — 0.05 • Cumulative Lift Table — Enabled • Result Table Name — score_logistic_1 • Index Columns — cust_id Run the analysis, and click on Results when it completes. For this example, the Logistic Regression Scoring/Evaluation Analysis generated the following pages. A single click on each page name populates Results with the item. 
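As background for reading the Estimate column and the Multi-Threshold Success Table shown below, the following sketch illustrates how a threshold turns probabilities into 0/1 estimates and how the per-threshold counts are formed. It is illustrative Python with made-up data (the lists and the function name are not part of the product); Teradata Warehouse Miner performs this work in generated SQL against the scored table, and the Prediction Success Table is instead built from sums of probabilities rather than thresholded counts.

def multi_threshold_success(actual, probability, begin=0.0, end=1.0, increment=0.05):
    # For each threshold, estimate 1 when probability >= threshold, otherwise 0,
    # then count the four actual/estimate combinations, as in the Multi-Threshold
    # Success Table (thresholds run from begin up to, but not including, end).
    rows = []
    steps = int(round((end - begin) / increment))
    for i in range(steps):
        t = round(begin + i * increment, 10)
        est = [1 if p >= t else 0 for p in probability]
        rows.append((t,
                     sum(a == 1 and e == 1 for a, e in zip(actual, est)),   # actual response, estimate response
                     sum(a == 1 and e == 0 for a, e in zip(actual, est)),   # actual response, estimate non-response
                     sum(a == 0 and e == 1 for a, e in zip(actual, est)),   # actual non-response, estimate response
                     sum(a == 0 and e == 0 for a, e in zip(actual, est))))  # actual non-response, estimate non-response
    return rows

# Made-up sample observations, not rows from twm_customer_analysis
actual = [1, 0, 1, 0, 1, 1, 0, 0]
probability = [0.92, 0.40, 0.73, 0.10, 0.55, 0.31, 0.62, 0.05]
for row in multi_threshold_success(actual, probability):
    print(row)

With the tutorial settings (Threshold Begin 0, Threshold End 1, Threshold Increment 0.05), this yields one row per threshold from 0.00 through 0.95, as in the Multi-Threshold Success Table below.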
Table 74: Logistic Regression Model Scoring Report Resulting Scored Table <result_db>.score_logistic_1 Number of Rows in Scored Table 747 Table 75: Prediction Success Table Estimate Response Estimate Non-Response Actual Total Actual Response 304.58 / 40.77% 70.42 / 9.43% 375.00 / 50.20% Actual Non-Response 70.41 / 9.43% 301.59 / 40.37% 372.00 / 49.80% Estimated Total 374.99 / 50.20% 372.01 / 49.80% 747.00 / 100.00% Teradata Warehouse Miner User Guide - Volume 3 193 Chapter 2: Scoring Logistic Scoring Table 76: Multi-Threshold Success Table Threshold Probability Actual Response, Estimate Response Actual Response, Estimate NonResponse Actual Non-Response, Estimate Response Actual Non-Response, Estimate NonResponse 0.0000 375 0 372 0 0.0500 375 0 353 19 0.1000 374 1 251 121 0.1500 373 2 152 220 0.2000 369 6 90 282 0.2500 361 14 58 314 0.3000 351 24 37 335 0.3500 344 31 29 343 0.4000 329 46 29 343 0.4500 318 57 28 344 0.5000 313 62 24 348 0.5500 305 70 23 349 0.6000 291 84 23 349 0.6500 286 89 21 351 0.7000 276 99 20 352 0.7500 265 110 20 352 0.8000 253 122 20 352 0.8500 243 132 16 356 0.9000 229 146 13 359 0.9500 191 184 11 361 Lift Table 77: Cumulative Lift Table Cumulative Response Cumulative Response (%) Cumulative Captured Response (%) Cumulative Lift Decile Count Response Response (%) Captured Response (%) 1 74.0000 73.0000 98.6486 19.4667 1.9651 73.0000 98.6486 19.4667 1.9651 2 75.0000 69.0000 92.0000 18.4000 1.8326 142.0000 95.3020 37.8667 1.8984 3 75.0000 71.0000 94.6667 18.9333 1.8858 213.0000 95.0893 56.8000 1.8942 194 Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Logistic Scoring Table 77: Cumulative Lift Table Cumulative Response Cumulative Response (%) Cumulative Captured Response (%) Cumulative Lift Decile Count Response Response (%) Captured Response (%) 4 74.0000 65.0000 87.8378 17.3333 1.7497 278.0000 93.2886 74.1333 1.8583 5 75.0000 66.0000 88.0000 17.6000 1.7530 344.0000 92.2252 91.7333 1.8371 6 75.0000 24.0000 32.0000 6.4000 0.6374 368.0000 82.1429 98.1333 1.6363 7 74.0000 4.0000 5.4054 1.0667 0.1077 372.0000 71.2644 99.2000 1.4196 8 73.0000 2.0000 2.7397 0.5333 0.0546 374.0000 62.8571 99.7333 1.2521 9 69.0000 1.0000 1.4493 0.2667 0.0289 375.0000 56.4759 100.0000 1.1250 10 83.0000 0.0000 0.0000 0.0000 0.0000 375.0000 50.2008 100.0000 1.0000 Lift Table 78: Data cust_id Probability Estimate 1362480 1.00 1 1362481 0.08 0 1362484 1.00 1 1362485 0.14 0 1362486 0.66 1 1362487 0.86 1 1362488 0.07 0 1362489 1.00 1 1362492 0.29 0 1362496 0.35 1 … ... ... Lift Graph By default, the Lift Graph displays the cumulative measure of the percentage of observations in the decile where the actual value of the dependent variable is 1, from decile 1 to this decile (Cumulative, %Response). Teradata Warehouse Miner User Guide - Volume 3 195 Chapter 2: Scoring Logistic Scoring Figure 114: Logistic Scoring Tutorial: Lift Graph 196 Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Logistic Scoring Teradata Warehouse Miner User Guide - Volume 3 197 Chapter 2: Scoring Logistic Scoring 198 Teradata Warehouse Miner User Guide - Volume 3 CHAPTER 3 Statistical Tests What’s In This Chapter This chapter applies only to an instance of Teradata Warehouse Miner operating on a Teradata database. 
For more information, see these subtopics:
1 “Overview” on page 199
2 “Parametric Tests” on page 204
3 “Binomial Tests” on page 228
4 “Kolmogorov-Smirnov Tests” on page 241
5 “Tests Based on Contingency Tables” on page 270
6 “Rank Tests” on page 283
Overview
Teradata Warehouse Miner contains both parametric and nonparametric statistical tests from the classical statistics literature, as well as more recently developed tests. In addition, “group by” variables make it possible to statistically analyze data groups defined by selected variables having specific values. In this way, multiple tests can be conducted at once to provide a profile of customer data showing hidden clues about customer behavior. In simplified terms, statistical inference allows us to find out whether the outcome of an experiment could have happened by accident, or whether it is extremely unlikely to have happened by chance. Of course, a very well designed experiment would have outcomes that are clearly different and require no statistical test. Unfortunately, noisy experimental outcomes are common in nature, and statistical inference is required to get the answer. It doesn’t matter whether the data come from an experiment we designed or from a retail database: questions can be asked of the data, and statistical inference can provide the answer. What is statistical inference? It is a process of drawing conclusions about parameters of a statistical distribution. There are three principal approaches to statistical inference. One type of statistical inference is Bayesian estimation, where conclusions are based upon posterior judgments about the parameter given an experimental outcome. A second type is based on the likelihood approach, in which all conclusions are inferred from the likelihood function of the parameter given an experimental outcome. A third type of inference is hypothesis testing, which includes both nonparametric and parametric inference. For nonparametric inference, estimators concerning the distribution function are independent of the specific mathematical form of the distribution function. Parametric inference, by contrast, involves estimators that assume the distribution function has a particular mathematical form, most often the normal distribution. Parametric tests are based on the sampling distribution of a particular statistic. Given knowledge of the underlying distribution of a variable, how the statistic is distributed in multiple equal-size samples can be predicted. The statistical tests provided in Teradata Warehouse Miner are solely those of the hypothesis testing type, both parametric and nonparametric. Hypothesis tests generally belong to one of five classes:
1 parametric tests, including the class of t-tests and F-tests, assuming normality of data populations
2 nonparametric tests of the binomial type
3 nonparametric tests of the chi-square type, based on contingency tables
4 nonparametric tests based on ranks
5 nonparametric tests of the Kolmogorov-Smirnov type
Within each class of tests there exist many variants, some of which have risen to the level of being named for their authors. Often tests have multiple names due to different originators. The tests may be applied to data in different ways, such as on one sample, two samples or multiple samples. The specific hypothesis of the test may be two-tailed, upper-tailed or lower-tailed.
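For example, the decision described above amounts to comparing a p-value against the threshold probability (alpha). The following is a minimal illustrative sketch in Python using a standard normal test statistic; the alpha value and function name are assumptions for the example, not part of Teradata Warehouse Miner, which performs these comparisons in generated SQL. The a/p/n call mirrors the convention used in the statistical test result tables (a=accept, p=reject positive, n=reject negative).

from scipy.stats import norm

def p_value_call(z, alpha=0.05, tail="two"):
    # Return the p-value and an a/p/n call for a standard normal test statistic z.
    if tail == "two":
        p = 2 * norm.sf(abs(z))   # two-tailed
    elif tail == "upper":
        p = norm.sf(z)            # upper-tailed
    else:
        p = norm.cdf(z)           # lower-tailed
    if p >= alpha:
        return p, "a"
    return p, ("p" if z > 0 else "n")

print(p_value_call(2.17))                  # p-value about 0.03: reject ('p')
print(p_value_call(2.17, tail="upper"))    # p-value about 0.015: reject ('p')
print(p_value_call(-1.20))                 # p-value about 0.23: accept ('a')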
Hypothesis tests vary depending on the assumptions made in the context of the experiment, and care must be exercised that they are valid in the particular context of the data to be examined. For example, is it a fair assumption that the variables are normally distributed? The choice of which test to apply will depend on the answer to this question. Failure to exercise proper judgment in which test to apply may result in false alarms, where the null hypothesis is rejected incorrectly, or misses, where the null hypothesis is accepted improperly. Note: Identity columns (i.e., columns defined with the attribute “GENERATED … AS IDENTITY”), cannot be analyzed by many of the statistical test functions and should therefore generally be avoided. Summary of Tests Parametric Tests Tests include the T-test, the F(1-way), F(2-way with equal Sample Size), F(3-way with equal Sample Size), and the F(2-way with unequal Sample Size). The two-sample t-test checks if two population means are equal. The ANOVA or F test determines if significant differences exist among treatment means or interactions. It’s a preliminary test that indicates if further analysis of the relationship among treatment means is warranted. 200 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Overview Tests of the Binomial Type These tests include the Binomial test and Sign test. The data for a binomial test is assumed to come from n independent trials, and have outcomes in either of two classes. The binomial test reports whether the probability that the outcome is of the first class is a particular p_value, p*, usually ½. Tests Based on Contingency Tables - Chi Square Type Tests include the Chi Square and Median test. The Chi Square Test determines whether the probabilities observed from data in a RxC contingency table are the same or different. Additional statistics provided are Phi coefficient, Cramer’s V, Likelihood Ratio Chi Square, Continuity-Adjusted Chi-Square, and Contingency Coefficient The Median test is a special case of the chi-square test with fixed marginal totals, testing whether several samples came from populations with the same median. Tests of the Kolmogorov-Smirnov Type These tests include the Kolmogorov-Smirnov and Lilliefors tests for goodness of fit to a particular distribution (normal), the Shapiro-Wilk and D'Agostino-Pearson tests of normality, and the Smirnov test of equality of two distributions. Tests Based on Ranks Tests include the MannWhitney test for 2 independent samples, Kruskal-Wallis test for k independent samples, Wilcoxon Signed Ranks test, and Friedman test. The Friedman test is an extension of the sign test for several independent samples. It is a test for treatment differences in a randomized, complete block design. Additional statistics provided are Kendall’s Coefficient of Concordance (W) and Spearman’s Rho. Data Requirements The following chart summarizes how the Statistical Test functions handle various types of input. Those cases with the note “should be normal numeric” will give warnings for any type of input that is not standard numeric (i.e., for character data, dates, big integers or decimals, etc.). 
In the table below, cat is an abbreviation for categorical, num for numeric and bignum for big integers or decimals: Table 79: Statistical Test functions handling of input Test Input Columns Tests Return Results With Note Median column of interest cat, num, date, bignum can be anything Median columns cat, num, date, bignum can be anything Median group by columns cat, num, date, bignum can be anything Teradata Warehouse Miner User Guide - Volume 1 201 Chapter 3: Statistical Tests Overview Table 79: Statistical Test functions handling of input Test Input Columns Tests Return Results With Note Chi Square 1st columns cat, num, date, bignum can be anything (limit of 2000 distinct value pairs) Chi Square 2nd columns cat, num, date, bignum can be anything Mann Whitney column of interest cat, num, date, bignum can be anything Mann Whitney columns cat, num, date, bignum can be anything Mann Whitney group by columns cat, num, date, bignum can be anything Wilcoxon 1st column num, date, bignum should be normal numeric Wilcoxon 2nd column num, date, bignum should be normal numeric Wilcoxon group by columns cat, num, date, bignum can be anything Friedman column of interest num should be normal numeric Friedman treatment column special count requirements Friedman block column special count requirements Friedman group by columns cat, num, date, bignum can be anything F(n)way column of interest num should be normal numeric F(n)way columns cat, num, date, bignum can be anything F(n)way group by columns cat, num, date, bignum can be anything F(2)way ucc column of interest num should be normal numeric F(2)way ucc columns cat, num, date, bignum can be anything F(2)way ucc group by columns cat, num, date, bignum can be anything T Paired 1st column num should be normal numeric T Paired 2nd column num, date, bignum should be normal numeric T Paired group by columns cat, num, date, bignum can be anything 202 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Overview Table 79: Statistical Test functions handling of input Test Input Columns Tests Return Results With Note T Unpaired 1st column num should be normal numeric T Unpaired 2nd column num, date, bignum should be normal numeric T Unpaired group by columns cat, num, date, bignum can be anything T Unpaired w ind 1st column num should be normal numeric T Unpaired w ind indicator column cat, num, date, bignum can be anything T Unpaired w ind group by columns cat, num, date, bignum can be anything Kolmogorov-Smirnov column of interest num, date, bignum should be normal numeric Kolmogorov-Smirnov group by columns cat, num, date, bignum can be anything Lilliefors column of interest num, date, bignum should be normal numeric Lilliefors group by columns cat, num, bignum can be anything but date Shapiro-Wilk column of interest num, date, bignum should be normal numeric Shapiro-Wilk group by columns cat, num, date, bignum can be anything D'Agostino-Pearson column of interest num should be normal numeric D'Agostino-Pearson group by columns cat, num, bignum can be anything but date Smirnov column of interest cat, num, date, bignum should be normal numeric Smirnov columns must be 2 distinct values must be 2 distinct values Smirnov group by columns cat, num, bignum can be anything but date Binomial 1st column num, date, bignum should be normal numeric Binomial 2nd column num, date, bignum should be normal numeric Binomial group by columns cat, num, date, bignum can be anything Sign 1st column num, bignum should be normal numeric Sign group by 
columns cat, num, date, bignum can be anything
Parametric Tests
Parametric Tests are a class of statistical test which requires particular assumptions about the data. These often include that the observations are independent and normally distributed. A researcher may want to verify the assumption of normality before using a parametric test; any of the four goodness-of-fit and normality tests provided, such as the Kolmogorov-Smirnov test for normality, can be used to determine whether use of one of the parametric tests is appropriate.
Two Sample T-Test for Equal Means
For the paired t test, a one-to-one correspondence must exist between values in both samples. The test is whether the paired values have mean differences that are not significantly different from zero. It assumes the differences are identically distributed normal random variables, and that they are independent. The unpaired t test is similar, but there is no correspondence between values of the samples. It assumes that within each sample, values are identically distributed normal random variables, and that the two samples are independent of each other. The two sample sizes may be equal or unequal. Variances of both samples may be assumed to be equal (homoscedastic) or unequal (heteroscedastic). In both cases, the null hypothesis is that the population means are equal. Test output is a p-value which, compared to the threshold, determines whether the null hypothesis should be rejected. Two methods of data selection are available for the unpaired t test: the first, “T Unpaired”, simply selects the columns with the two unpaired datasets, some of which may be NULL. The second, “T Unpaired with Indicator”, selects the column of interest and a second indicator column that determines to which group the first variable belongs. If the indicator variable is negative or zero, it is assigned to the first group; if it is positive, it is assigned to the second group. The two sample t test for unpaired data is defined as shown below (though calculated differently in the SQL):
Table 80: Two sample t tests for unpaired data
H0: μ1 = μ2
Ha: μ1 ≠ μ2
Test Statistic: T = (Ȳ1 − Ȳ2) / √(s1²/N1 + s2²/N2)
where N1 and N2 are the sample sizes, Ȳ1 and Ȳ2 are the sample means, and s1² and s2² are the sample variances.
Initiate a Two Sample T-Test
Use the following procedure to initiate a new T-Test in Teradata Warehouse Miner:
1 Click on the Add New Analysis icon in the toolbar: Figure 115: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Parametric Tests: Figure 116: Add New Analysis > Statistical Tests > Parametric Tests
3 This will bring up the Parametric Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
T-Test - INPUT - Data Selection
On the Parametric Tests dialog click on INPUT and then click on data selection: Figure 117: T-Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input.
By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the parametric tests available (F(n-way), F(2-way with unequal cell counts, T Paired, T Unpaired, T Unpaired with Indicator). Select “T Paired”, “T Unpaired”, or “T Unpaired with Indicator”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as First Column, Second Column or Group By Columns. Make sure you have the correct portion of the window highlighted. 206 • First Column — The column that specifies the first variable for the Parametric Test analysis. • Second Column (or Indicator Column) — The column that specifies the second variable for the Parametric Test analysis. Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests (Or the column that determines to which group the first variable belongs. If negative or zero, it will be assigned to the first group; if it is positive, it will be assigned to the second group). • Group By Columns — The column(s) that specifies the variable(s) whose distinct value combinations will categorize the data, so a separate test is performed on each category. T-Test - INPUT - Analysis Parameters On the Parametric Tests dialog click on INPUT and then click on analysis parameters: Figure 118: T-Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. • Equal Variance — Check this box if the “equal variance” assumption is to be used. Default is “unequal variance”. T-Test - OUTPUT On the Parametric Tests dialog click on OUTPUT: Figure 119: T-Test > Output On this screen select: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. 
• Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure Teradata Warehouse Miner User Guide - Volume 1 207 Chapter 3: Statistical Tests Parametric Tests here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the T-Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - T-Test Analysis The results of running the T-Test analysis include a table with a row for each group-by variable requested, as well as the SQL to perform the statistical analysis. All of these results are outlined below. T-Test - RESULTS - SQL On the Parametric Tests dialog click on RESULTS and then click on SQL: 208 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests Figure 120: T-Test > Results > SQL The series of SQL statements comprise the T-Test analysis. It is always returned, and is the only item returned when the Generate SQL Without Executing option is used. T-Test - RESULTS - Data On the Parametric Tests dialog click on RESULTS and then click on data: Figure 121: T-Test > Results > Data The output table is generated by the T-Test analysis for each group-by variable combination. Output Columns - T-Test Analysis The following table is built in the requested Output Database by the T-Test analysis. Any group-by columns will comprise the Unique Primary Index (UPI). Table 81: Output Database table Name Type Definition D_F INTEGER Degrees of Freedom for the group-by values selected. 
T Float The computed value of the T statistic TTestPValue Float The probability associated with the T statistic TTestCallP Char The TTest result: a=accept, p=reject (positive), n=reject(negative) Tutorial - T-Test In this example, a T-Test analysis of type T-Paired is performed on the fictitious banking data to analyze account usage. Parameterize a Parametric Test analysis as follows: • Available Tables — twm_customer_analysis • Statistical Test Style — T Paired • First Column — avg_cc_bal • Second Column — avg_sv_bal Teradata Warehouse Miner User Guide - Volume 1 209 Chapter 3: Statistical Tests Parametric Tests • Group By Columns — age, gender • Analysis Parameters • Threshold Probability — 0.05 • Equal Variance — true (checked) Run the analysis and click on Results when it completes. For this example, the Parametric Test analysis generated the following page. The paired t-test was computed on average credit card balance vs. average savings balance, by gender and age. Ages over 33 were excluded for brevity. Results were sorted by age and gender in the listing below. The tests shows whether the paired values have mean differences which are not significantly different from zero for each gender-age combination. A ‘p’ means the difference was significantly different from zero. An ‘a’ means the difference was insignificant. The SQL is available for viewing but not listed below. Table 82: T-Test gender age D_F TTestPValue T TTestCallP_0.05 F 13 7 0.01 3.99 p M 13 6 0.13 1.74 a F 14 5 0.10 2.04 a M 14 8 0.04 2.38 p F 15 18 0.01 3.17 p M 15 12 0.04 2.29 p F 16 9 0.00 4.47 p M 16 8 0.04 2.52 p F 17 13 0.00 4.68 p M 17 6 0.01 3.69 p F 18 9 0.00 6.23 p M 18 9 0.02 2.94 p F 19 9 0.01 3.36 p M 19 6 0.03 2.92 p F 22 3 0.21 1.57 a M 22 3 0.11 2.25 a F 23 3 0.34 1.13 a M 23 3 0.06 2.88 a F 25 4 0.06 2.59 a F 26 5 0.08 2.22 a 210 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests Table 82: T-Test gender age D_F TTestPValue T TTestCallP_0.05 F 27 5 0.09 2.12 a F 28 4 0.06 2.68 a M 28 4 0.03 3.35 p F 29 4 0.06 2.54 a M 29 5 0.16 1.65 a F 30 8 0.00 4.49 p M 30 5 0.01 4.25 p F 31 5 0.04 2.69 p M 31 6 0.05 2.52 p F 32 5 0.05 2.50 a M 32 6 0.10 1.98 a F 33 9 0.01 3.05 p M 33 4 0.09 2.27 a F-Test - N-Way • F-Test/Analysis of Variance — One Way, Equal or Unequal Sample Size • F-Test/Analysis of Variance — Two Way, Equal Sample Size • F-Test/Analysis of Variance — Three Way, Equal Sample Size The ANOVA or F test determines if significant differences exist among treatment means or interactions. It’s a preliminary test that indicates if further analysis of the relationship among treatment means is warranted. If the null hypothesis of no difference among treatments is accepted, the test result implies factor levels and response are unrelated, so the analysis is terminated. When the null hypothesis is rejected, the analysis is usually continued to examine the nature of the factor-level effects. Examples are: • Tukey’s Method — tests all possible pairwise differences of means • Scheffe’s Method — tests all possible contrasts at the same time • Bonferroni’s Method — tests, or puts simultaneous confidence intervals around a preselected group of contrasts The N-way F-Test is designed to execute within groups defined by the distinct values of the group-by variables (GBV's), the same as most of the other nonparametric tests. Two or more treatments must exist in the data within the groups defined by the distinct GBV values. 
Given a column of interest (dependent variable), one or more input columns (independent variables) and optionally one or more group-by columns (all from the same input table), an FTest is produced. The N-Way ANOVA tests whether a set of sample means are all equal (the Teradata Warehouse Miner User Guide - Volume 1 211 Chapter 3: Statistical Tests Parametric Tests null hypothesis). Output is a p-value which when compared to the user’s threshold, determines whether the null hypothesis should be rejected. Initiate an N-Way F-Test Use the following procedure to initiate a new F-Test analysis in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 122: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Parametric Tests: Figure 123: Add New Analysis > Statistical Tests > Parametric Tests 3 This will bring up the Parametric Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. F-Test (N-Way) - INPUT - Data Selection On the Parametric Tests dialog click on INPUT and then click on data selection: 212 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests Figure 124: F-Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the parametric tests available (T, F(n-way), F(2-way with unequal cell counts) Select “F(n-way)”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Columns or Group By Columns. Make sure you have the correct portion of the window highlighted. • Column of Interest — The column that specifies the dependent variable for the Ftest analysis. • Columns — The column(s) that specifies the independent variable(s) for the F-test analysis. Selection of one column will generate a 1-Way F-test, two columns a 2Way F-test, and three columns a 3-Way F-test. 
Do not select over three columns because the 4-way, 5-way, etc. F-tests are not implemented in the version of TWM. Teradata Warehouse Miner User Guide - Volume 1 213 Chapter 3: Statistical Tests Parametric Tests Warning: For this test, equal cell counts are required for the 2 and 3 way tests. • Group By Columns — The column(s) that specifies the variable(s) whose distinct value combinations will categorize the data, so a separate test is performed on each category. F-Test (N-Way) - INPUT - Analysis Parameters On the Parametric Tests dialog click on INPUT and then click on analysis parameters: Figure 125: F-Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. F-Test - OUTPUT On the Parametric Tests dialog click on OUTPUT: Figure 126: F-Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: 214 • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the F-Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. 
To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - F-Test Analysis The results of running the F-test analysis include a table with a row for each group-by variable requested, as well as the SQL to perform the statistical analysis. All of these results are outlined below. F-Test - RESULTS - SQL On the Parametric Tests dialog click on RESULTS and then click on SQL: Figure 127: F-Test > Results > SQL The series of SQL statements comprise the F-test Analysis. It is always returned, and is the only item returned when the Generate SQL Without Executing option is used. Teradata Warehouse Miner User Guide - Volume 1 215 Chapter 3: Statistical Tests Parametric Tests F-Test - RESULTS - data On the Parametric Tests dialog click on RESULTS and then click on data: Figure 128: F-Test > Results > data The output table is generated by the F-test Analysis for each group-by variable combination. Output Columns - F-Test Analysis The particular result table returned will depend on whether the test is 1-way, 2-way or 3-way, and is built in the requested Output Database by the F-test analysis. If group-by columns are present, they will comprise the Unique Primary Index (UPI). Otherwise DF will be the UPI. Table 83: Output Columns - 1-Way F-Test Analysis Name Type Definition DF INTEGER Degrees of Freedom for the Variable DFErr INTEGER Degrees of Freedom for Error F Float The computed value of the F statistic FPValue Float The probability associated with the F statistic FPText Char If not NULL, the probability is less than the smallest or more than the largest table value FCallP Char The F-Test result: a=accept, p=reject (positive), n=reject(negative) Table 84: Output Columns - 2-Way F-Test Analysis Name Type Definition DF INTEGER Degrees of Freedom for the model Fmodel Float The computed value of the F statistic for the model DFErr INTEGER Degrees of Freedom for Error term DF_1 INTEGER Degrees of Freedom for first variable F1 Float The computed value of the F statistic for the first variable DF_2 INTEGER Degrees of Freedom for second variable F2 Float The computed value of the F statistic for the second variable 216 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests Table 84: Output Columns - 2-Way F-Test Analysis Name Type Definition DF_12 INTEGER Degrees of Freedom for interaction F12 Float The computed value of the F statistic for interaction Fmodel_PValue Float The probability associated with the F statistic for the model Fmodel_PText Char If not NULL, the probability is less than the smallest or more than the largest table value Fmodel_CallP_0.05 Char The F test result: a=accept, p=reject for the model F1_PValue Float The probability associated with the F statistic for the first variable F1_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F1_callP_0.05 Char The F test result: a=accept, p=reject for the first variable F2_PValue Float The probability associated with the F statistic for the second variable F2_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F2_callP_0.05 Char The F test result: a=accept, p=reject for the second variable F12_PValue Float The probability associated with the F statistic for the interaction F12_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F12_callP_0.05 
Char The F test result: a=accept, p=reject for the interaction Table 85: Output Columns - 3-Way F-Test Analysis Name Type Definition DF INTEGER Degrees of Freedom for the model Fmodel Float The computed value of the F statistic for the model DFErr INTEGER Degrees of Freedom for Error term DF_1 INTEGER Degrees of Freedom for first variable F1 Float The computed value of the F statistic for the first variable DF_2 INTEGER Degrees of Freedom for second variable F2 Float The computed value of the F statistic for the second variable DF_3 INTEGER Degrees of Freedom for third variable F3 Float The computed value of the F statistic for the third variable DF_12 INTEGER Degrees of Freedom for interaction of v1 and v2 Teradata Warehouse Miner User Guide - Volume 1 217 Chapter 3: Statistical Tests Parametric Tests Table 85: Output Columns - 3-Way F-Test Analysis Name Type Definition F12 Float The computed value of the F statistic for interaction of v1 and v2 DF_13 INTEGER Degrees of Freedom for interaction of v1 and v3 F13 Float The computed value of the F statistic for interaction of v1 and v3 DF_23 INTEGER Degrees of Freedom for interaction of v2 and v3 F23 Float The computed value of the F statistic for interaction of v2 and v3 DF_123 INTEGER Degrees of Freedom for three-way interaction of v1, v2, and v3 F123 Float The computed value of the F statistic for three-way interaction of v1, v2 and v3 Fmodel_PValue Float The probability associated with the F statistic for the model Fmodel_PText Char If not NULL, the probability is less than the smallest or more than the largest table value Fmodel_callP_0.05 Char The F test result: a=accept, p=reject for the model F1_PValue Float The probability associated with the F statistic for the first variable F1_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F1_callP_0.05 Char The F test result: a=accept, p=reject for the first variable F2_PValue Float The probability associated with the F statistic for the second variable F2_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F2_callP_0.05 Char The F test result: a=accept, p=reject for the second variable F3_PValue Float The probability associated with the F statistic for the third variable F3_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F3_callP_0.05 Char The F test result: a=accept, p=reject for the third variable F12_PValue Float The probability associated with the F statistic for the interaction of v1 and v2 F12_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F12_callP_0.05 Char The F test result: a=accept, p=reject for the interaction of v1 and v2 F13_PValue Float The probability associated with the F statistic for the interaction of v1 and v3 F13_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F13_callP_0.05 Char The F test result: a=accept, p=reject for the interaction of v1 and v3 218 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests Table 85: Output Columns - 3-Way F-Test Analysis Name Type Definition F23_PValue Float The probability associated with the F statistic for the interaction of v2 and v3 F23_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F23_callP_0.05 Char The F test result: a=accept, p=reject for the interaction of v2 and v3 
F123_PValue Float The probability associated with the F statistic for the three-way interaction of v1, v2 and v3 F123_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F123_callP_0.05 Char The F test result: a=accept, p=reject for the three-way interaction of v1, v2 and v3 Tutorial - One-Way F-Test Analysis In this example, an F-test analysis is performed on the fictitious banking data to analyze income by gender. Parameterize an F-Test analysis as follows: • Available Tables — twm_customer • Column of Interest — income • Columns — gender • Group By Columns — years_with_bank, nbr_children • Analysis Parameters • Threshold Probability — 0.01 Run the analysis and click on Results when it completes. For this example, the F-Test analysis generated the following page. The F-Test was computed on income over gender for every combination of years_with_bank and nbr_children. Results were sorted by years_with_bank and nbr_children in the listing below. The tests shows whether significant differences exist in income for males and females, and does so separately for each value of years_with_bank and nbr_children. A ‘p’ means the difference was significant, and an ‘a’ means it was not significant. If the field is null, it indicates there was insufficient data for the test. The SQL is available for viewing but not listed below. Table 86: F-Test (one-way) years_with_bank nbr_children DF DFErr F FPValue FPText FCallP_0.01 0 0 1 53 0.99 0.25 >0.25 a 0 1 1 8 1.87 0.22 a 0 2 1 10 1.85 0.22 a Teradata Warehouse Miner User Guide - Volume 1 219 Chapter 3: Statistical Tests Parametric Tests Table 86: F-Test (one-way) years_with_bank nbr_children DF DFErr F FPValue FPText FCallP_0.01 0 3 1 6 0.00 0.25 >0.25 a 0 4 1 0 0 5 0 0 1 0 1 55 0.00 0.25 >0.25 a 1 1 1 6 0.00 0.25 >0.25 a 1 2 1 14 0.00 0.25 >0.25 a 1 3 1 2 0.50 0.25 >0.25 a 1 4 0 0 1 5 0 0 2 0 1 55 0.82 0.25 >0.25 a 2 1 1 14 1.54 0.24 2 2 1 14 0.07 0.25 >0.25 a 2 3 1 1 0.30 0.25 >0.25 a 2 4 0 0 2 5 0 0 3 0 1 49 0.05 0.25 >0.25 a 3 1 1 9 1.16 0.25 >0.25 a 3 2 1 10 0.06 0.25 >0.25 a 3 3 1 6 16.90 0.01 3 4 1 1 4.50 0.25 3 5 0 0 4 0 1 52 1.84 0.20 4 1 1 10 0.54 0.25 4 2 1 6 2.38 0.20 a 4 3 0 0 4 4 0 0 4 5 0 1 5 0 1 46 4.84 0.04 a 5 1 1 15 0.48 0.25 5 2 1 10 3.51 0.09 220 a p >0.25 a a >0.25 >0.25 a a a Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests Table 86: F-Test (one-way) years_with_bank nbr_children DF DFErr F FPValue FPText FCallP_0.01 5 3 1 2 2.98 0.24 5 4 0 0 6 0 1 46 0.01 0.25 6 1 1 14 3.67 0.08 6 2 1 15 0.13 0.25 6 3 0 0 6 5 0 0 7 0 1 41 4.99 0.03 7 1 1 8 0.01 0.25 >0.25 a 7 2 1 4 0.13 0.25 >0.25 a 7 3 1 2 0.04 0.25 >0.25 a 7 5 0 1 8 0 1 23 0.50 0.25 >0.25 a 8 1 1 7 0.38 0.25 >0.25 a 8 2 1 6 0.09 0.25 >0.25 a 8 3 1 0 8 5 0 0 9 0 1 26 0.07 0.25 >0.25 a 9 1 1 3 3.11 0.20 9 2 1 1 0.09 0.25 >0.25 a 9 3 1 1 0.12 0.25 >0.25 a a >0.25 a a >0.25 a a a F-Test/Analysis of Variance - Two Way Unequal Sample Size The ANOVA or F test determines if significant differences exist among treatment means or interactions. It’s a preliminary test that indicates if further analysis of the relationship among treatment means is warranted. If the null hypothesis of no difference among treatments is accepted, the test result implies factor levels and response are unrelated, so the analysis is terminated. When the null hypothesis is rejected, the analysis is usually continued to examine the nature of the factor-level effects. 
Examples are: • Tukey’s Method — tests all possible pairwise differences of means • Scheffe’s Method — tests all possible contrasts at the same time • Bonferroni’s Method — tests, or puts simultaneous confidence intervals around a preselected group of contrasts Teradata Warehouse Miner User Guide - Volume 1 221 Chapter 3: Statistical Tests Parametric Tests The 2-way Unequal Sample Size F-Test is designed to execute on the entire dataset. No group-by parameter is provided for this test, but if such a test is desired, multiple tests must be run on pre-prepared datasets with group-by variables in each as different constants. Two or more treatments must exist in the data within the dataset. (Note that this test will create a temporary work table in the Result Database and drop it at the end of processing, even if the Output option to “Store the tabular output of this analysis in the database” is not selected). Given a table name of tabulated values, an F-Test is produced. The N-Way ANOVA tests whether a set of sample means are all equal (the null hypothesis). Output is a p-value which when compared to the user’s threshold, determines whether the null hypothesis should be rejected. Initiate a 2-Way F-Test with Unequal Cell Counts Use the following procedure to initiate a new F-Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 129: Add New Analysis from toolbar 2 222 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Parametric Tests: Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests Figure 130: Add New Analysis > Statistical Tests > Parametric Tests 3 This will bring up the Parametric Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. F-Test (Unequal Cell Counts) - INPUT - Data Selection On the Parametric Tests dialog click on INPUT and then click on data selection: Figure 131: F-Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. Note that if an analysis is selected it must be one that creates a table or view for output since a volatile table cannot be processed with this Statistical Test Style. Teradata Warehouse Miner User Guide - Volume 1 223 Chapter 3: Statistical Tests Parametric Tests 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the parametric tests available (T, F(n-way), F(2-way with unequal cell counts). 
Select “F(2-way with unequal cell counts)”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, First Column or Second Column. Make sure you have the correct portion of the window highlighted. • Column of Interest — The column that specifies the dependent variable for the Ftest analysis. • First Column — The column that specifies the first independent variable for the Ftest analysis. • Second Column — The column that specifies the second independent variable for the F-test analysis. F-Test - INPUT - Analysis Parameters On the Parametric Tests dialog click on INPUT and then click on analysis parameters: Figure 132: F-Test > Input > Analysis Parameters On this screen enter or select: • Processing Options 224 • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. • First Column Values — Use the selection wizard to choose any or all of the values of the first independent variable to be used in the analysis. • Second Column Values — Use the selection wizard to choose any or all of the values of the second independent variable to be used in the analysis. Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests F-Test - OUTPUT On the Parametric Tests dialog click on OUTPUT: Figure 133: F-Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. 
• Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Teradata Warehouse Miner User Guide - Volume 1 225 Chapter 3: Statistical Tests Parametric Tests Run the F-Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - F-Test Analysis The results of running the F-test analysis include a table with a single row, as well as the SQL to perform the statistical analysis. All of these results are outlined below. F-Test - RESULTS - SQL On the Parametric Tests dialog click on RESULTS and then click on SQL: Figure 134: F-Test > Results > SQL The series of SQL statements comprise the F-test Analysis. It is always returned, and is the only item returned when the Generate SQL Without Executing option is used. F-Test - RESULTS - data On the Parametric Tests dialog click on RESULTS and then click on data: Figure 135: F-Test > Results > data The output table is generated by the F-test Analysis for each group-by variable combination. Output Columns - F-Test Analysis The result table returned is built in the requested Output Database by the F-test analysis. DF will be the UPI. Table 87: Output Columns - 2-Way F-Test Analysis Name Type Definition DF INTEGER Degrees of Freedom for the model 226 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests Table 87: Output Columns - 2-Way F-Test Analysis Name Type Definition Fmodel Float The computed value of the F statistic for the model DFErr INTEGER Degrees of Freedom for Error term DF_1 INTEGER Degrees of Freedom for first variable F1 Float The computed value of the F statistic for the first variable DF_2 INTEGER Degrees of Freedom for second variable F2 Float The computed value of the F statistic for the second variable DF_12 INTEGER Degrees of Freedom for interaction F12 Float The computed value of the F statistic for interaction Fmodel_PValue Float The probability associated with the F statistic for the model Fmodel_PText Char If not NULL, the probability is less than the smallest or more than the largest table value Fmodel_CallP_0.05 Char The F test result: a=accept, p=reject for the model F1_PValue Float The probability associated with the F statistic for the first variable F1_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F1_callP_0.05 Char The F test result: a=accept, p=reject for the first variable F2_PValue Float The probability associated with the F statistic for the second variable F2_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F2_callP_0.05 Char The F test result: a=accept, p=reject for the second variable F12_PValue Float The probability associated with the F statistic for the interaction F12_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F12_callP_0.05 Char The F test result: a=accept, p=reject for the interaction Tutorial - Two-Way 
Unequal Cell Count F-Test Analysis

In this example, an F-test analysis is performed on the fictitious banking data to analyze income by years_with_bank and marital_status. Parameterize an F-Test analysis as follows:
• Available Tables — twm_customer
• Column of Interest — income
• First Column — years_with_bank
• Second Column — marital_status
• Analysis Parameters
  • Threshold Probability — 0.05
  • First Column Values — 0, 1, 2, 3, 4, 5, 6, 7
  • Second Column Values — 1, 2, 3, 4

Run the analysis and click on Results when it completes. For this example, the F-Test analysis generated the results below. The F-Test was computed on income over years_with_bank and marital_status. The test shows whether significant differences exist in income for years_with_bank by marital_status. The first column, years_with_bank, is represented by F1. The second column, marital_status, is represented by F2. The interaction term is F12. A ‘p’ means the difference was significant, and an ‘a’ means it was not significant. If the field is null, it indicates there was insufficient data for the test. The SQL is available for viewing but not listed below. The results show that there are no significant differences in income for different values of years_with_bank or for the interaction of years_with_bank and marital_status. There was a highly significant (p<0.001) difference in income for different values of marital_status. The overall model difference was significant at a level better than 0.001.

Table 88: F-Test (Two-way Unequal Cell Count) (Part 1)
DF   Fmodel   DFErr   DF_1   F1     DF_2   F2      DF_12   F12
31   3.76     631     7      0.93   3      29.02   21      1.09

Table 89: F-Test (Two-way Unequal Cell Count) (Part 2)
Fmodel_PValue   Fmodel_PText   Fmodel_CallP_0.05   F1_PValue   F1_PText   F1_CallP_0.05
0.001           <0.001         p                   0.25        >0.25      a

Table 90: F-Test (Two-way Unequal Cell Count) (Part 3)
F2_PValue   F2_PText   F2_CallP_0.05   F12_PValue   F12_PText   F12_CallP_0.05
0.001       <0.001     p               0.25         >0.25       a

Binomial Tests

The data for a binomial test are assumed to come from n independent trials, each with an outcome in one of two classes. The other assumption is that the probability of each outcome is the same on every trial, designated p. The values of the outcome could come directly from the data, where the value is always one of two kinds. More commonly, however, the test is applied to the sign of the difference between two values. If the probability is 0.5, this is the oldest of all nonparametric tests, and is called the ‘sign test’. Where the sign of the difference between two values is used, the binomial test reports whether the probability that the sign is positive is a particular p_value, p*.

Binomial/Z-test

Output for each unique set of values of the group-by variables (GBVs) is a p-value which, when compared to the user’s choice of alpha (the probability threshold), determines whether the null hypothesis (p=p*, p<=p*, or p>p*) should be rejected for that GBV set. Though both binomial and Z-test results are provided for all N, the approximate value obtained from the Z-test (nP) is appropriate when N is large. For values of N over 100, only the Z-test is performed. Otherwise, the value bP returned is the p_value of the one-tailed or two-tailed test, depending on the user’s choice.
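The arithmetic behind bP and nP can be illustrated outside the product. The sketch below is not the SQL that Teradata Warehouse Miner generates; it is a minimal Python illustration, assuming SciPy is available, of counting sign differences, discarding exact matches (the default), and computing both the exact binomial p-value (bP) and the large-N normal approximation (nP). The paired values are invented purely to show the call pattern.

import math
from scipy.stats import binomtest, norm

def binomial_sign_summary(first, second, p_star=0.5, single_tail=True):
    # Sign of the difference between the two columns; exact matches are discarded.
    diffs = [a - b for a, b in zip(first, second)]
    n_pos = sum(1 for d in diffs if d > 0)
    n_neg = sum(1 for d in diffs if d < 0)
    n = n_pos + n_neg
    alternative = "greater" if single_tail else "two-sided"
    # Exact binomial p-value (bP in the output)
    bp = binomtest(n_pos, n, p=p_star, alternative=alternative).pvalue
    # Normal (Z) approximation (nP), appropriate when N is large; the guide
    # relies on the Z-test alone when N is over 100.
    z = (n_pos - n * p_star) / math.sqrt(n * p_star * (1.0 - p_star))
    np_value = norm.sf(z) if single_tail else 2.0 * norm.sf(abs(z))
    return {"N": n, "NPos": n_pos, "NNeg": n_neg, "bP": bp, "nP": np_value}

# Invented paired balances, only to show the call shape
print(binomial_sign_summary([310.0, 95.5, 240.0, 80.0], [120.0, 95.5, 60.0, 150.0]))

With N this small only the exact value bP is meaningful; the approximate nP becomes the relevant figure as N grows.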
Initiate a Binomial Test Use the following procedure to initiate a new Binomial in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 136: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Binomial Tests: Teradata Warehouse Miner User Guide - Volume 1 229 Chapter 3: Statistical Tests Binomial Tests Figure 137: Add New Analysis > Statistical Tests > Binomial Tests 3 This will bring up the Binomial Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Binomial Tests - INPUT - Data Selection On the Binomial Tests dialog click on INPUT and then click on data selection: Figure 138: Binomial Tests > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. 230 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Binomial Tests 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the binomial tests available (Binomial, Sign). Select “Binomial”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as First Column, Second Column or Group By Columns. Make sure you have the correct portion of the window highlighted. • First Column — The column that specifies the first variable for the Binomial Test analysis. • Second Column — The column that specifies the second variable for the Binomial Test analysis. • Group By Columns — The column(s) that specifies the variable(s) whose distinct value combinations will categorize the data, so a separate test is performed on each category. Binomial Tests - INPUT - Analysis Parameters On the Binomial Tests dialog click on INPUT and then click on analysis parameters: Figure 139: Binomial Tests > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. • Single Tail — Check this box if the Binomial Test is to be single tailed. 
Default is twotailed. • Binomial Probability — If the binomial test is not ½, enter the probability desired. Default is 0.5. Teradata Warehouse Miner User Guide - Volume 1 231 Chapter 3: Statistical Tests Binomial Tests • Exact Matches Comparison Criterion — Check the button to specify how exact matches are to be handled. Default is they are discarded. Other options are to include them with negative count, or with positive count. Binomial Tests - OUTPUT On the Binomial Tests dialog click on OUTPUT: Figure 140: Binomial Tests > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: 232 • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Binomial Tests • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the Binomial Sign Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Binomial Test The results of running the Binomial analysis include a table with a row for each group-by variable requested, as well as the SQL to perform the statistical analysis. All of these results are outlined below. 
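As a rough picture of the results table just described (one row per group-by value, with N, NPos, NNeg, BP and the accept/reject call), the following sketch groups a pandas DataFrame and applies SciPy’s exact binomial test per group. It is only an illustration under those assumptions, not the SQL the analysis generates; the miniature data set is invented, and the real output also distinguishes an ‘n’ call for rejection in the negative direction, which this sketch omits.

import pandas as pd
from scipy.stats import binomtest

def binomial_results_table(df, first, second, group_by, alpha=0.05):
    rows = []
    for key, grp in df.groupby(group_by):
        diffs = grp[first] - grp[second]
        n_pos = int((diffs > 0).sum())
        n_neg = int((diffs < 0).sum())          # exact matches discarded
        n = n_pos + n_neg
        bp = binomtest(n_pos, n, p=0.5, alternative="greater").pvalue
        call = "p" if bp < alpha else "a"       # simplified: a=accept, p=reject
        rows.append({group_by: key, "N": n, "NPos": n_pos, "NNeg": n_neg,
                     "BP": round(bp, 4), f"BinomialCallP_{alpha}": call})
    return pd.DataFrame(rows)

# Hypothetical miniature of the banking data, only to show the output shape
demo = pd.DataFrame({"gender": ["F", "F", "M", "M", "M"],
                     "avg_sv_bal": [310.0, 95.5, 240.0, 80.0, 55.0],
                     "avg_ck_bal": [120.0, 95.5, 60.0, 150.0, 20.0]})
print(binomial_results_table(demo, "avg_sv_bal", "avg_ck_bal", "gender"))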
Binomial Tests - RESULTS - SQL On the Binomial Tests dialog click on RESULTS and then click on SQL: Figure 141: Binomial Tests > Results > SQL The series of SQL statements comprise the Binomial Analysis. It is always returned, and is the only item returned when the Generate SQL Without Executing option is used. Binomial Tests - RESULTS - data On the Binomial Tests dialog click on RESULTS and then click on data: Figure 142: Binomial Tests > Results > data The output table is generated by the Binomial Analysis for each group-by variable combination. Output Columns - Binomial Tests The following table is built in the requested Output Database by the Binomial analysis. Any group-by columns will comprise the Unique Primary Index (UPI), otherwise the UPI will be “N”. Teradata Warehouse Miner User Guide - Volume 1 233 Chapter 3: Statistical Tests Binomial Tests Table 91: Output Database table (Built by the Binomial Analysis) Name Type Definition N INTEGER Total count of value pairs NPos INTEGER Count of positive value differences NNeg INTEGER Count of negative value differences BP FLOAT The Binomial Probability BinomialCallP Char The Binomial result: a=accept, p=reject (positive), n=reject(negative) Tutorial - Binomial Tests Analysis In this example, an Binomial analysis is performed on the fictitious banking data to analyze account usage. Parameterize the Binomial analysis as follows: • Available Tables — twm_customer_analysis • First Column — avg_sv_bal • Second Column — avg_ck_bal • Group By Columns — gender • Analysis Parameters • Threshold Probability — 0.05 • Single Tail — true • Binomial Probability — 0.5 • Exact Matches — discarded Run the analysis and click on Results when it completes. For this example, the Binomial analysis generated the following. The Binomial was computed on average savings balance (column 1) vs. average check account balance (column 2), by gender. The test is a Z Test since N>100, and Z is 3.29 (not in answer set) so the one-sided test of the null hypothesis that p is ½ is rejected as shown in the table below. Table 92: Binomial Test Analysis (Table 1) gender N NPos NNeg BP BinomialCallP_0.05 F 366 217 149 0.0002 p M 259 156 103 0.0005 p Rerunning the test with parameter binomial probability set to 0.6 gives a different result: the one-sided test of the null hypothesis that p is 0.6 is accepted as shown in the table below. 234 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Binomial Tests Table 93: Binomial Test Analysis (Table 2) gender N NPos NNeg BP BinomialCallP_0.05 F 366 217 149 0.3909 a M 259 156 103 0.4697 a Binomial Sign Test For the sign test, one column is selected and the test is whether the value is positive or not positive. Initiate a Binomial Sign Test Use the following procedure to initiate a new Binomial Sign Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 143: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Binomial Tests: Teradata Warehouse Miner User Guide - Volume 1 235 Chapter 3: Statistical Tests Binomial Tests Figure 144: Add New Analysis > Statistical Tests > Binomial Tests 3 This will bring up the Binomial Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. 
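For the sign test just described (a single column, positive versus not positive), the same exact binomial machinery applies. A minimal sketch, again assuming SciPy and using invented data; zeros are counted with the negatives here, matching the NNeg definition given later in this section.

from scipy.stats import binomtest

def sign_test(values, alpha=0.05, single_tail=True):
    n_pos = sum(1 for v in values if v > 0)
    n_neg = sum(1 for v in values if v <= 0)    # negative or zero values
    n = n_pos + n_neg
    alternative = "greater" if single_tail else "two-sided"
    bp = binomtest(n_pos, n, p=0.5, alternative=alternative).pvalue
    return {"N": n, "NPos": n_pos, "NNeg": n_neg, "BP": round(bp, 6),
            "call": "p" if bp < alpha else "a"}

# e.g. a 0/1 indicator column such as "female" in the tutorial later in this section
print(sign_test([1, 0, 1, 1, 0, 1, 0, 1]))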
Binomial Sign Test - INPUT - Data Selection On the Binomial Tests dialog click on INPUT and then click on data selection: Figure 145: Binomial Sign Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. 236 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Binomial Tests 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the binomial tests available (Binomial, Sign). Select “Sign”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. • Column/Group By Columns — Note that the Selected Columns window is actually a split window; you can insert columns as Column, or Group By Columns. Make sure you have the correct portion of the window highlighted. • Column — The column that specifies the first variable for the Binomial Test analysis. • Group By Columns — The column(s) that specifies the variable(s) whose distinct value combinations will categorize the data, so a separate test is performed on each category. Binomial Sign Test - INPUT - Analysis Parameters On the Binomial Tests dialog click on INPUT and then click on analysis parameters: Figure 146: Binomial Sign Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. • Single Tail — Check this box if the Binomial Test is to be single tailed. Default is twotailed. Binomial Sign Test - OUTPUT On the Binomial Tests dialog click on OUTPUT: Teradata Warehouse Miner User Guide - Volume 1 237 Chapter 3: Statistical Tests Binomial Tests Figure 147: Binomial Sign Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. 
• Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the Binomial Sign Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: 238 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Binomial Tests • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Binomial Sign Test Analysis The results of running the Binomial Sign analysis include a table with a row for each groupby variable requested, as well as the SQL to perform the statistical analysis. All of these results are outlined below. Binomial Sign Test - RESULTS - SQL On the Binomial Tests dialog click on RESULTS and then click on SQL: Figure 148: Binomial Sign Test > Results > SQL The series of SQL statements comprise the Binomial Sign Analysis. It is always returned, and is the only item returned when the Generate SQL Without Executing option is used. Binomial Sign Test - RESULTS - data On the Binomial Tests dialog click on RESULTS and then click on data: Figure 149: Binomial Sign Test > Results > data The output table is generated by the Binomial Sign Analysis for each group-by variable combination. Output Columns - Binomial Sign Analysis The following table is built in the requested Output Database by the Binomial analysis. Any group-by columns will comprise the Unique Primary Index (UPI), otherwise the UPI will be “N”. 
Table 94: Binomial Sign Analysis: Output Columns Name Type Definition N INTEGER Total count of value pairs Teradata Warehouse Miner User Guide - Volume 1 239 Chapter 3: Statistical Tests Binomial Tests Table 94: Binomial Sign Analysis: Output Columns Name Type Definition NPos INTEGER Count of positive values NNeg INTEGER Count of negative or zero values BP FLOAT The Binomial Probability BinomialCallP Char The Binomial Sign result: a=accept, p=reject (positive), n=reject(negative) Tutorial - Binomial Sign Analysis In this example, a Binomial analysis is performed on the fictitious banking data to analyze account usage. Parameterize the Binomial analysis as follows: • Available Tables — twm_customer_analysis • Column — female • Group By Columns — years_with_bank • Analysis Parameters • Threshold Probability — 0.05 • Single Tail — true Run the analysis and click on Results when it completes. For this example, the Binomial Sign analysis generated the following. The Binomial was computed on the Boolean variable “female” by years_with_bank. The one-sided test of the null hypothesis that p is ½ accepted for all cases except years_with_bank=2 as shown in the table below. Table 95: Tutorial - Binomial Sign Analysis years_with_bank N NPos NNeg BP BinomialCallP_0.05 0 88 51 37 0.08272 a 1 87 48 39 0.195595 a 2 94 57 37 0.024725 p 3 86 46 40 0.295018 a 4 78 39 39 0.545027 a 5 82 46 36 0.160147 a 6 83 46 37 0.19 a 7 65 36 29 0.22851 a 8 45 26 19 0.185649 a 9 39 23 16 0.168392 a 240 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Kolmogorov-Smirnov Tests Tests of the Kolmogorov-Smirnov Type are based on statistical procedures which use maximum vertical distance between functions as a measure of function similarity. Two empirical distribution functions are mapped against each other, or a single empirical function is mapped against a hypothetical (e.g. Normal) distribution. Conclusions are then drawn about the likelihood the two distributions are the same. Kolmogorov-Smirnov Test (One Sample) The Kolmogorov-Smirnov (one-sample) test determines whether a dataset matches a particular distribution (for this test, the normal distribution). The test has the advantage of making no assumption about the distribution of data. (Non-parametric and distribution free) Note that this generality comes at some cost: other tests (e.g. the Student's t-test) may be more sensitive if the data meet the requirements of the test. The Kolmogorov-Smirnov test is generally less powerful than the tests specifically designed to test for normality. This is especially true when the mean and variance are not specified in advance for the KolmogorovSmirnov test, which then becomes conservative. Further, the Kolmogorov-Smirnov test will not indicate the type of nonnormality, e.g. whether the distribution is skewed or heavy-tailed. Examination of the skewness and kurtosis, and of the histogram, boxplot, and normal probability plot for the data may show why the data failed the Kolmogorov-Smirnov test. In this test, the user can specify group-by variables (GBV's) so a separate test will be done for every unique set of values of the GBV's. 
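The maximum-vertical-distance idea is easy to see outside the product. The sketch below uses SciPy’s kstest against a normal distribution whose mean and standard deviation are specified in advance; it is an illustration only (the tool computes the test in SQL), and the skewed “income” sample is generated rather than taken from the demonstration data.

import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10.5, sigma=0.6, size=500)   # made-up, right-skewed "income"

# Hypothesized normal with mean and standard deviation given up front.
# If they were instead estimated from the data, the Lilliefors variant
# described later in this chapter would be the appropriate adjustment.
mu, sigma = 40000.0, 25000.0
res = kstest(income, norm(loc=mu, scale=sigma).cdf)

alpha = 0.05
call = "p" if res.pvalue < alpha else "a"
print(f"Klm={res.statistic:.4f}  KlmPValue={res.pvalue:.4f}  call={call}")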
Initiate a Kolmogorov-Smirnov Test Use the following procedure to initiate a new Kolmogorov-Smirnov Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 150: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Kolmogorov-Smirnov Tests: Teradata Warehouse Miner User Guide - Volume 1 241 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Figure 151: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests 3 This will bring up the Kolmogorov-Smirnov Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Kolmogorov-Smirnov Test - INPUT - Data Selection On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on data selection: Figure 152: Kolmogorov-Smirnov Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. 242 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the Tests of the Kolmogorov-Smirnov Type available (Kolmogorov-Smirnov, Lilliefors, Shapiro-Wilk, D'Agostino-Pearson, Smirnov). Select “Kolmogorov-Smirnov”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Group By Columns. Make sure you have the correct portion of the window highlighted. • Column of Interest — The column that specifies the numeric variable to be tested for normality. • Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category. 
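The introduction to this test notes that examining skewness and kurtosis can show why a column of interest fails a normality test. A small follow-up check of that kind might look like the sketch below (SciPy assumed, data invented); it is a diagnostic aid, not part of the analysis the tool runs.

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)
income = rng.lognormal(mean=10.5, sigma=0.6, size=500)

print("skewness:", round(float(skew(income)), 3))             # > 0 suggests right skew
print("excess kurtosis:", round(float(kurtosis(income)), 3))  # > 0 suggests heavy tails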
Kolmogorov-Smirnov Test - INPUT - Analysis Parameters On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on analysis parameters: Figure 153: Kolmogorov-Smirnov Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. Kolmogorov-Smirnov Test - OUTPUT On the Kolmogorov-Smirnov Tests dialog click on OUTPUT: Teradata Warehouse Miner User Guide - Volume 1 243 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Figure 154: Kolmogorov-Smirnov Test > Output On this screen select: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the Kolmogorov-Smirnov Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: 244 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Kolmogorov-Smirnov Test The results of running the Kolmogorov-Smirnov Test analysis include a table with a row for each separate Kolmogorov-Smirnov test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below. 
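Before turning to the result screens, the shape of that output (one Kolmogorov-Smirnov row per distinct group-by value) can be sketched with pandas and SciPy. This is an illustration under those assumptions, not the generated SQL; the demonstration frame is invented, and because each group’s mean and standard deviation are estimated from the data, the plain KS p-values shown here are conservative (the Lilliefors variant later in this chapter is the usual adjustment).

import numpy as np
import pandas as pd
from scipy.stats import kstest, norm

def ks_by_group(df, column, group_by, alpha=0.05):
    rows = []
    for key, grp in df.groupby(group_by):
        x = grp[column].dropna().to_numpy()
        res = kstest(x, norm(loc=x.mean(), scale=x.std(ddof=1)).cdf)
        rows.append({group_by: key, "Klm": res.statistic, "M": len(x),
                     "KlmPValue": res.pvalue,
                     f"KlmCallP_{alpha}": "p" if res.pvalue < alpha else "a"})
    return pd.DataFrame(rows)

rng = np.random.default_rng(0)
demo = pd.DataFrame({"years_with_bank": rng.integers(0, 4, 300),
                     "income": rng.lognormal(10.5, 0.5, 300)})
print(ks_by_group(demo, "income", "years_with_bank"))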
Kolmogorov-Smirnov Test - RESULTS - SQL

On the Kolmogorov-Smirnov Test dialog click on RESULTS and then click on SQL:
Figure 155: Kolmogorov-Smirnov Test > Results > SQL
The series of SQL statements comprise the Kolmogorov-Smirnov Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used.

Kolmogorov-Smirnov Test - RESULTS - data

On the Kolmogorov-Smirnov Test dialog click on RESULTS and then click on data:
Figure 156: Kolmogorov-Smirnov Test > Results > Data
The output table is generated by the Analysis for each separate Kolmogorov-Smirnov test on all distinct-value group-by variables.

Output Columns - Kolmogorov-Smirnov Test Analysis

The following table is built in the requested Output Database by the Kolmogorov-Smirnov test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise Klm will be the UPI.

Table 96: Output Database table (Built by the Kolmogorov-Smirnov test analysis)
Name            Type      Definition
Klm             Float     Kolmogorov-Smirnov Value
M               INTEGER   Count
KlmPValue       Float     The probability associated with the Kolmogorov-Smirnov statistic
KlmPText        Char      Text description if P is outside table range
KlmCallP_0.05   Char      The Kolmogorov-Smirnov result: a=accept, p=reject

Tutorial - Kolmogorov-Smirnov Test Analysis

In this example, a Kolmogorov-Smirnov test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Kolmogorov-Smirnov Test analysis as follows:
• Available Tables — twm_customer_analysis
• Column of Interest — income
• Group By Columns — years_with_bank
• Analysis Parameters
  • Threshold Probability — 0.05

Run the analysis and click on Results when it completes. For this example, the Kolmogorov-Smirnov Test analysis generated the following table. The Kolmogorov-Smirnov Test was computed for each distinct value of the group-by variable “years_with_bank”. Results were sorted by years_with_bank. The tests show customer incomes with years_with_bank of 1, 5, 6, 7, 8, and 9 were normally distributed and those with 0, 2, 3, and 4 were not. A ‘p’ means significantly nonnormal and an ‘a’ means accept the null hypothesis of normality. The SQL is available for viewing but not listed below.

Table 97: Kolmogorov-Smirnov Test
years_with_bank   Klm           M    KlmPValue     KlmPText   KlmCallP_0.05
0                 0.159887652   88   0.019549995              p
1                 0.118707332   87   0.162772589              a
2                 0.140315991   94   0.045795894              p
3                 0.15830739    86   0.025080666              p
4                 0.999999      78   0.01          <0.01      p
5                 0.138336567   82   0.080579955              a
6                 0.127171093   83   0.127653475              a
7                 0.135147555   65   0.172828265              a
8                 0.184197592   45   0.084134345              a
9                 0.109205054   39   0.20          >0.20      a

Lilliefors Test

The Lilliefors test determines whether a dataset matches a particular distribution, and is identical to the Kolmogorov-Smirnov test except that a conversion to Z-scores is made. The Lilliefors test is therefore a modification of the Kolmogorov-Smirnov test. The Lilliefors test computes the Lilliefors statistic and checks its significance. Exact tables of the quantiles of the test statistic were computed from random numbers in computer simulations. The computed value of the test statistic is compared with the quantiles of the statistic.
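A rough illustration of this procedure, assuming the lilliefors helper in statsmodels (which standardizes the data with the sample mean and variance and takes the p-value from simulated quantile tables); the data are generated, and this is not the SQL the analysis produces.

import numpy as np
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(7)
income = rng.lognormal(mean=10.5, sigma=0.6, size=200)   # invented, skewed sample

stat, p_value = lilliefors(income, dist="norm")
call = "p" if p_value < 0.05 else "a"
print(f"Lilliefors={stat:.4f}  p={p_value:.4f}  call={call}")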
When the test is for the normal distribution, the null hypothesis is that the distribution function is normal with unspecified mean and variance. The alternative hypothesis is that the distribution function is nonnormal. The empirical distribution of X is compared with a normal distribution with the same mean and variance as X. It is similar to the Kolmogorov-Smirnov test, but it adjusts for the fact that the parameters of the normal distribution are estimated from X rather than specified in advance. In this test, the user can specify group-by variables (GBV's) so a separate test will be done for every unique set of values of the GBV's. Initiate a Lilliefors Test Use the following procedure to initiate a new Lilliefors Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 157: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Kolmogorov-Smirnov Tests: Teradata Warehouse Miner User Guide - Volume 1 247 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Figure 158: Add New Analysis> Statistical Tests > Kolmogorov-Smirnov Tests 3 This will bring up the Kolmogorov-Smirnov Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Lilliefors Test - INPUT - Data Selection On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on data selection: Figure 159: Lillefors Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. 248 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the Tests of the Kolmogorov-Smirnov Type available (Kolmogorov-Smirnov, Lilliefors, Shapiro-Wilk, D'Agostino-Pearson, Smirnov). Select “Lilliefors”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Group By Columns. Make sure you have the correct portion of the window highlighted. 
• Column of Interest — The column that specifies the numeric variable to be tested for normality. • Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category. Lilliefors Test - INPUT - Analysis Parameters On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on analysis parameters: Figure 160: Lillefors Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. Lilliefors Test - OUTPUT On the Kolmogorov-Smirnov Tests dialog click on OUTPUT: Teradata Warehouse Miner User Guide - Volume 1 249 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Figure 161: Lillefors Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the Lilliefors Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. 
To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard

Results - Lilliefors Test Analysis

The results of running the Lilliefors Test analysis include a table with a row for each separate Lilliefors test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below.

Lilliefors Test - RESULTS - SQL

On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on SQL:
Figure 162: Lilliefors Test > Results > SQL
The series of SQL statements comprise the Lilliefors Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used.

Lilliefors Test - RESULTS - Data

On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on data:
Figure 163: Lilliefors Test > Results > Data
The output table is generated by the Analysis for each separate Lilliefors test on all distinct-value group-by variables.

Output Columns - Lilliefors Test Analysis

The following table is built in the requested Output Database by the Lilliefors test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise Lilliefors will be the UPI.

Table 98: Lilliefors Test Analysis: Output Columns
Name                   Type      Definition
Lilliefors             Float     Lilliefors Value
M                      INTEGER   Count
LillieforsPValue       Float     The probability associated with the Lilliefors statistic
LillieforsPText        Char      Text description if P is outside table range
LillieforsCallP_0.05   Char      The Lilliefors result: a=accept, p=reject

Tutorial - Lilliefors Test Analysis

In this example, a Lilliefors test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Lilliefors Test analysis as follows:
• Available Tables — twm_customer_analysis
• Column of Interest — income
• Group By Columns — years_with_bank
• Analysis Parameters
  • Threshold Probability — 0.05

Run the analysis and click on Results when it completes. For this example, the Lilliefors Test analysis generated the following table. The Lilliefors Test was computed for each distinct value of the group-by variable “years_with_bank”. Results were sorted by years_with_bank. The tests show customer incomes were not normally distributed for any value of years_with_bank except 9. A ‘p’ means significantly nonnormal and an ‘a’ means accept the null hypothesis of normality.
Note: The SQL is available for viewing but not listed below.

Table 99: Lilliefors Test
years_with_bank   Lilliefors    M    LillieforsPValue   LillieforsPText   LillieforsCallP_0.05
0                 0.166465166   88   0.01               <0.01             p
1                 0.123396019   87   0.01               <0.01             p
2                 0.146792366   94   0.01               <0.01             p
3                 0.156845809   86   0.01               <0.01             p
4                 0.192756959   78   0.01               <0.01             p
5                 0.144308699   82   0.01               <0.01             p
6                 0.125268495   83   0.01               <0.01             p
7                 0.141128127   65   0.01               <0.01             p
8                 0.191869596   45   0.01               <0.01             p
9                 0.111526787   39   0.20               >0.20             a

Shapiro-Wilk Test

The Shapiro-Wilk W test is designed to detect departures from normality without requiring that the mean or variance of the hypothesized normal distribution be specified in advance.
It is considered to be one of the best omnibus tests of normality. The function is based on the approximations and code given by Royston (1982a, b). It can be used in samples as large as 2,000 or as small as 3. Royston (1982b) gives approximations and tabled values that can be used to compute the coefficients, and obtains the significance level of the W statistic. Small values of W are evidence of departure from normality. This test has done very well in comparison studies with other goodness of fit tests. In general, either the Shapiro-Wilk or D'Agostino-Pearson test is a powerful overall test for normality. As omnibus tests, however, they will not indicate the type of nonnormality, e.g. whether the distribution is skewed as opposed to heavy-tailed (or both). Examination of the calculated skewness and kurtosis, and of the histogram, boxplot, and normal probability plot for the data may provide clues as to why the data failed the Shapiro-Wilk or D'AgostinoPearson test. The standard algorithm for the Shapiro-Wilk test only applies to sample sizes from 3 to 2000. For larger sample sizes, a different normality test should be used. The test statistic is based on the Kolmogorov-Smirnov statistic for a normal distribution with the same mean and variance as the sample mean and variance. Initiate a Shapiro-Wilk Test Use the following procedure to initiate a new Shapiro-Wilk Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 164: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Kolmogorov-Smirnov Tests: Teradata Warehouse Miner User Guide - Volume 1 253 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Figure 165: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests 3 This will bring up the Kolmogorov-Smirnov Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Shapiro-Wilk Test - INPUT - Data Selection On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on data selection: Figure 166: Shapiro-Wilk Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. 254 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. 
• Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the Tests of the Kolmogorov-Smirnov Type available (Kolmogorov-Smirnov, Lilliefors, Shapiro-Wilk, D'Agostino-Pearson, Smirnov). Select “Shapiro-Wilk”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Group By Columns. Make sure you have the correct portion of the window highlighted. • Column of Interest — The column that specifies the numeric variable to be tested for normality. • Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category. Shapiro-Wilk Test - INPUT - Analysis Parameters On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on analysis parameters: Figure 167: Shapiro-Wilk Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. Shapiro-Wilk Test - OUTPUT On the Kolmogorov-Smirnov Tests dialog click on OUTPUT: Teradata Warehouse Miner User Guide - Volume 1 255 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Figure 168: Shapiro-Wilk Test > Output On this screen select: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. 
• Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the Shapiro-Wilk Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: 256 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Shapiro-Wilk Analysis The results of running the Shapiro-Wilk Test analysis include a table with a row for each separate Shapiro-Wilk test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below. Shapiro-Wilk Test - RESULTS - SQL On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on SQL: Figure 169: Shapiro-Wilk Test > Results > SQL The series of SQL statements comprise the Shapiro-Wilk Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used. Shapiro-Wilk Test - RESULTS - data On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on data: Figure 170: Shapiro-Wilk Test > Results > data The output table is generated for each separate Shapiro-Wilk test on all distinct-value groupby variables. Output Columns - Shapiro-Wilk Test Analysis The following table is built in the requested Output Database by the Shapiro-Wilk test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise Shw will be the UPI. Table 100: Shapiro-Wilk Test Analysis: Output Columns Name Type Definition Shw Float Shapiro-Wilk Value N INTEGER Count Teradata Warehouse Miner User Guide - Volume 1 257 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Table 100: Shapiro-Wilk Test Analysis: Output Columns Name Type Definition ShapiroWilkPValue Float The probability associated with the Shapiro-Wilk statistic ShapiroWilkPText Char Text description if P is outside table range ShapiroWilkCallP_0.05 Char The Shapiro-Wilk result: a=accept, p=reject Tutorial - Shapiro-Wilk Test Analysis In this example, a Shapiro-Wilk test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Shapiro-Wilk Test analysis as follows: • Available Tables — twm_customer_analysis • Column of Interest — income • Group By Columns — years_with_bank • Analysis Parameters • Threshold Probability — 0.05 Run the analysis and click on Results when it completes. For this example, the Shapiro-Wilk Test analysis generated the following table. The Shapiro-Wilk Test was computed for each distinct value of the group by variable “years_with_bank”. Results were sorted by years_ with_bank. The tests show customer all incomes were not normally distributed. ‘p’ means significantly nonnormal and an ‘a’ means accept the null hypothesis of normality. Note: The SQL is available for viewing but not listed below. 
Table 101: Shapiro-Wilk Test
years_with_bank  Shw          N   ShapiroWilkPValue  ShapiroWilkPText  ShapiroWilkCallP_0.05
0                0.84919004   88  0.000001                             p
1                0.843099681  87  0.000001                             p
2                0.831069533  94  0.000001                             p
3                0.838965439  86  0.000001                             p
4                0.707924134  78  0.000001                             p
5                0.768444329  82  0.000001                             p
6                0.855276885  83  0.000001                             p
7                0.827399691  65  0.000001                             p
8                0.863932178  45  0.01               <0.01             p
9                0.930834522  39  0.029586304                          p

D'Agostino and Pearson Test
In general, either the Shapiro-Wilk or D'Agostino-Pearson test is a powerful overall test for normality. These tests are designed to detect departures from normality without requiring that the mean or variance of the hypothesized normal distribution be specified in advance. Though these tests cannot indicate the type of non-normality, they tend to be more powerful than the Kolmogorov-Smirnov test. The D'Agostino-Pearson K-squared statistic has approximately a chi-squared distribution with 2 df when the population is normally distributed.

Initiate a D'Agostino and Pearson Test
Use the following procedure to initiate a new D'Agostino and Pearson Test in Teradata Warehouse Miner:
1 Click on the Add New Analysis icon in the toolbar:
Figure 171: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Kolmogorov-Smirnov Tests:
Figure 172: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests
3 This will bring up the Kolmogorov-Smirnov Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections.

D'Agostino and Pearson Test - INPUT - Data Selection
On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on data selection:
Figure 173: D'Agostino and Pearson Test > Input > Data Selection
On this screen select:
1 Select Input Source
Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement.
2 Select Columns From a Single Table
• Available Databases (or Analyses) — These are the databases (or analyses) available to be processed.
• Available Tables — These are the tables and views that are available to be processed.
• Available Columns — These are the columns within the table/view that are available for processing.
3 Select Statistical Test Style
These are the Tests of the Kolmogorov-Smirnov Type available (Kolmogorov-Smirnov, Lilliefors, Shapiro-Wilk, D'Agostino-Pearson, Smirnov). Select “D'Agostino and Pearson”.
4 Select Optional Columns • 260 Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Group By Columns. Make sure you have the correct portion of the window highlighted. • Column of Interest — The column that specifies the numeric variable to be tested for normality. • Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category. D'Agostino and Pearson Test - INPUT - Analysis Parameters On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on analysis parameters: Figure 174: D'Agostino and Pearson Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. D'Agostino and Pearson Test - OUTPUT On the Kolmogorov-Smirnov Tests dialog click on OUTPUT: Figure 175: D'Agostino and Pearson Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. Teradata Warehouse Miner User Guide - Volume 1 261 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. 
It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
• Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed.

Run the D'Agostino and Pearson Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard

Results - D'Agostino and Pearson Test Analysis
The results of running the D'Agostino and Pearson Test analysis include a table with a row for each separate D'Agostino and Pearson test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below.

D'Agostino and Pearson Test - RESULTS - SQL
On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on SQL:
Figure 176: D'Agostino and Pearson Test > Results > SQL
The series of SQL statements comprises the D'Agostino and Pearson Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used.

D'Agostino and Pearson Test - RESULTS - data
On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on data:
Figure 177: D'Agostino and Pearson Test > Results > data
The output table is generated by the Analysis for each separate D'Agostino and Pearson test on all distinct-value group-by variables.

Output Columns - D'Agostino and Pearson Test Analysis
The following table is built in the requested Output Database by the D'Agostino and Pearson test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise T will be the UPI.

Table 102: D'Agostino and Pearson Test Analysis: Output Columns
Name           Type   Definition
T              Float  K-Squared statistic
Zkurtosis      Float  Z of kurtosis
Zskew          Float  Z of skewness
ChiPValue      Float  The probability associated with the K-Squared statistic
ChiPText       Char   Text description if P is outside table range
ChiCallP_0.05  Char   The D'Agostino-Pearson result: a=accept, p=reject

Tutorial - D'Agostino and Pearson Test Analysis
In this example, a D'Agostino and Pearson test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a D'Agostino and Pearson Test analysis as follows:
• Available Tables — twm_customer_analysis
• Column of Interest — income
• Group By Columns — years_with_bank
• Analysis Parameters
  • Threshold Probability — 0.05
Run the analysis and click on Results when it completes. For this example, the D'Agostino and Pearson Test analysis generated the table shown below. The D'Agostino and Pearson Test was computed for each distinct value of the group-by variable “years_with_bank”, and results were sorted by years_with_bank. The tests show that customer incomes were not normally distributed except those for years_with_bank = 9: ‘p’ means significantly non-normal and ‘a’ means accept the null hypothesis of normality. The SQL is available for viewing but not listed below.
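As with the Shapiro-Wilk example, the computation can be illustrated outside the database. The sketch below is not the SQL that Teradata Warehouse Miner generates; it assumes Python with pandas and scipy (scipy.stats.normaltest implements the D'Agostino-Pearson K-squared test) and a hypothetical twm_customer_analysis.csv export of the input table.

# Illustrative sketch only: a client-side D'Agostino-Pearson test per group,
# not the SQL generated by Teradata Warehouse Miner.
import pandas as pd
from scipy import stats

df = pd.read_csv("twm_customer_analysis.csv")   # hypothetical export of the input table

threshold = 0.05
rows = []
for group, values in df.groupby("years_with_bank")["income"]:
    x = values.dropna()
    k2, p = stats.normaltest(x)                 # K-squared statistic (chi-squared, 2 df)
    zskew, _ = stats.skewtest(x)                # Z of skewness
    zkurt, _ = stats.kurtosistest(x)            # Z of kurtosis
    rows.append({
        "years_with_bank": group,
        "T": k2, "Zkurtosis": zkurt, "Zskew": zskew,
        "ChiPValue": p,
        "Call": "p" if p < threshold else "a",  # p=reject normality, a=accept
    })
print(pd.DataFrame(rows).sort_values("years_with_bank"))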
Table 103: D'Agostino and Pearson Test
years_with_bank  T         Zkurtosis  Zskew    ChiPValue  ChiPText  ChiCallP_0.05
0                29.05255  2.71261    4.65771  0.0001     <0.0001   p
1                34.18025  3.30609    4.82183  0.0001     <0.0001   p
2                30.71123  2.78588    4.79062  0.0001     <0.0001   p
3                32.81104  3.06954    4.83621  0.0001     <0.0001   p
4                82.01928  5.72010    7.02137  0.0001     <0.0001   p
5                62.36861  4.91949    6.17796  0.0001     <0.0001   p
6                24.80241  2.40521    4.36089  0.0001     <0.0001   p
7                17.72275  1.83396    3.78937  0.00019              p
8                6.55032   -0.23415   2.54863  0.03992              p
9                3.32886   -0.68112   1.69261  0.20447              a

Smirnov Test
The Smirnov test (also known as the “two-sample Kolmogorov-Smirnov test”) checks whether two datasets have significantly different distributions. The test has the advantage of making no assumption about the distribution of the data (it is non-parametric and distribution-free). Note that this generality comes at some cost: other tests (e.g. the Student's t-test) may be more sensitive if the data meet the requirements of the test.

Initiate a Smirnov Test
Use the following procedure to initiate a new Smirnov Test in Teradata Warehouse Miner:
1 Click on the Add New Analysis icon in the toolbar:
Figure 178: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Kolmogorov-Smirnov Tests:
Figure 179: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests
3 This will bring up the Kolmogorov-Smirnov Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections.

Smirnov Test - INPUT - Data Selection
On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on data selection:
Figure 180: Smirnov Test > Input > Data Selection
On this screen select:
1 Select Input Source
Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement.
2 Select Columns From a Single Table
• Available Databases (or Analyses) — These are the databases (or analyses) available to be processed.
• Available Tables — These are the tables and views that are available to be processed.
• Available Columns — These are the columns within the table/view that are available for processing.
3 Select Statistical Test Style
These are the Tests of the Kolmogorov-Smirnov Type available (Kolmogorov-Smirnov, Lilliefors, Shapiro-Wilk, D'Agostino-Pearson, Smirnov). Select “Smirnov”.
4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Columns, Group By Columns. Make sure you have the correct portion of the window highlighted. 266 • Column of Interest — The column that specifies the numeric variable to be tested for normality. • Columns — The column specifying the 2-category variable that identifies the distribution to which the column of interest belongs. • Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category. Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Smirnov Test - INPUT - Analysis Parameters On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on analysis parameters: Figure 181: Smirnov Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. Smirnov Test - OUTPUT On the Kolmogorov-Smirnov Tests dialog click on OUTPUT: Figure 182: Smirnov Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). Teradata Warehouse Miner User Guide - Volume 1 267 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. 
It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
• Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed.

Run the Smirnov Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard

Results - Smirnov Test Analysis
The results of running the Smirnov Test analysis include a table with a row for each separate Smirnov test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below.

Smirnov Test - RESULTS - SQL
On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on SQL:
Figure 183: Smirnov Test > Results > SQL
The series of SQL statements comprises the Smirnov Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used.

Smirnov Test - RESULTS - data
On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on data:
Figure 184: Smirnov Test > Results > data
The output table is generated by the Analysis for each separate Smirnov test on all distinct-value group-by variables.

Output Columns - Smirnov Test Analysis
The following table is built in the requested Output Database by the Smirnov test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise M will be the UPI.

Table 104: Smirnov Test Analysis: Output Columns
Name               Type     Definition
M                  Integer  Number of first distribution observations
N                  Integer  Number of second distribution observations
D                  Float    D Statistic
SmirnovPValue      Float    The probability associated with the D statistic
SmirnovPText       Char     Text description if P is outside table range
SmirnovCallP_0.01  Char     The Smirnov result: a=accept, p=reject

Tutorial - Smirnov Test Analysis
In this example, a Smirnov test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Smirnov Test analysis as follows:
• Available Tables — twm_customer
• Column of Interest — income
• Columns — gender
• Group By Columns — years_with_bank
• Analysis Parameters
  • Threshold Probability — 0.05
Run the analysis and click on Results when it completes. For this example, the Smirnov Test analysis generated the table shown below. The Smirnov Test was computed for each distinct value of the group-by variable “years_with_bank”, and results were sorted by years_with_bank. The tests show that the distributions of incomes of males and females were different for all values of years_with_bank: ‘p’ means the two distributions differ significantly and ‘a’ means accept the null hypothesis that they are the same. The SQL is available for viewing but not listed below.
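The same comparison can be sketched outside the database with the two-sample Kolmogorov-Smirnov routine in scipy. This is an illustration only, not the SQL that Teradata Warehouse Miner generates; it assumes pandas and scipy and a hypothetical twm_customer.csv export of the input table, and its D statistic and p-values follow scipy's conventions rather than the tabulated ones.

# Illustrative sketch only: a client-side two-sample Kolmogorov-Smirnov (Smirnov)
# test per group-by value, not the SQL generated by Teradata Warehouse Miner.
import pandas as pd
from scipy import stats

df = pd.read_csv("twm_customer.csv")            # hypothetical export of the input table

threshold = 0.05
rows = []
for group, chunk in df.groupby("years_with_bank"):
    # "gender" plays the role of the 2-category column that splits the two samples
    samples = [g["income"].dropna() for _, g in chunk.groupby("gender")]
    if len(samples) != 2:
        continue                                 # the Smirnov test needs exactly two samples
    d, p = stats.ks_2samp(samples[0], samples[1])
    rows.append({
        "years_with_bank": group,
        "M": len(samples[0]), "N": len(samples[1]),
        "D": d,
        "SmirnovPValue": p,
        "Call": "p" if p < threshold else "a",   # p=distributions differ, a=accept
    })
print(pd.DataFrame(rows).sort_values("years_with_bank"))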
Table 105: Smirnov Test
years_with_bank  M   N   D            SmirnovPValue  SmirnovPText  SmirnovCallP_0.01
0                37  51  1.422949567  0.000101                     p
1                39  48  1.371667516  0.000103                     p
2                37  57  1.465841724  0.000101                     p
3                40  46  1.409836326  0.000105                     p
4                39  39  1.397308541  0.000146                     p
5                36  46  1.309704108  0.000105                     p
6                37  46  1.287964978  0.000104                     p
7                29  36  1.336945293  0.000112                     p
8                19  26  1.448297864  0.00011                      p
9                16  23  1.403341724  0.000101                     p

Tests Based on Contingency Tables
Tests Based on Contingency Tables are based on an array or matrix of numbers which represent counts or frequencies. The tests basically evaluate the matrix to detect if there is a nonrandom pattern of frequencies.

Chi Square Test
The most common application for chi-square is in comparing observed counts of particular cases to the expected counts. For example, a random sample of people would contain m males and f females, but usually we would not find exactly m=½N and f=½N. We could use the chi-squared test to determine if the difference were significant enough to rule out the 50/50 hypothesis.
The Chi Square Test determines whether the probabilities observed from data in an RxC contingency table are the same or different. The null hypothesis is that the probabilities observed are the same. Output is a p-value which, when compared to the user’s threshold, determines whether the null hypothesis should be rejected.

Other Calculated Measures of Association
• Phi coefficient — The Phi coefficient is a measure of the degree of association between two binary variables, and represents the correlation between two dichotomous variables. It is based on adjusting chi-square significance to factor out sample size, and is the same as the Pearson correlation for two dichotomous variables.
• Cramer’s V — Cramer's V is used to examine the association between two categorical variables when there is more than a 2 X 2 contingency (e.g., 2 X 3). In these more complex designs, phi is not appropriate, but Cramer's statistic is. Cramer's V represents the association or correlation between two variables. Cramer's V is the most popular of the chi-square-based measures of nominal association, designed so that the attainable upper limit is always 1.
• Likelihood Ratio Chi Square — Likelihood ratio chi-square is an alternative to test the hypothesis of no association of columns and rows in nominal-level tabular data. It is based on maximum likelihood estimation, and involves the ratio between the observed and the expected frequencies, whereas the ordinary chi-square test involves the difference between the two. This is a more recent version of chi-square and is directly related to loglinear analysis and logistic regression.
• Continuity-Adjusted Chi-Square — The continuity-adjusted chi-square statistic for 2 × 2 tables is similar to the Pearson chi-square, except that it is adjusted for the continuity of the chi-square distribution. The continuity-adjusted chi-square is most useful for small sample sizes. The use of the continuity adjustment is controversial; this chi-square test is more conservative, and more like Fisher's exact test, when your sample size is small. As the sample size increases, the statistic becomes more and more like the Pearson chi-square.
• Contingency Coefficient — The contingency coefficient is an adjustment to phi coefficient, intended for tables larger than 2-by-2. It is always less than 1 and approaches 1.0 only for large tables. The larger the contingency coefficient, the stronger the association. Recommended only for 5-by-5 tables or larger, for smaller tables it underestimates level of association. Initiate a Chi Square Test Use the following procedure to initiate a new Chi Square Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 185: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Tests Based on Contingency Tables: Teradata Warehouse Miner User Guide - Volume 1 271 Chapter 3: Statistical Tests Tests Based on Contingency Tables Figure 186: Add New Analysis > Statistical Tests > Tests Based on Contingency Tables 3 This will bring up the Tests Based on Contingency Tables dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Chi Square Test - INPUT - Data Selection On the Tests Based on Contingency Tables dialog click on INPUT and then click on data selection: Figure 187: Chi Square Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the 272 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Tests Based on Contingency Tables name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the Tests Based on Contingency Tables available (Chi Square, Median). Select “Chi Square”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. • First Columns/Second Columns — Note that the Selected Columns window is actually a split window; you can insert columns as First Columns, Second Columns. Make sure you have the correct portion of the window highlighted. • First Columns — The set of columns that specifies the first of a pair of variables for Chi Square analysis. • Second Columns — The set of columns that specifies the second of a pair of variables for Chi Square analysis. 
Each combination of the first and second variables will generate a separate Chi Square test. (Limitation: to avoid excessively long execution, the number of combinations is limited to 100, and unless the product of the number of distinct values of each pair is 2000 or less, the calculation will be skipped.) Note: Group-By Columns are not available in the Chi Square Test. Chi Square Test - INPUT - Analysis Parameters On the Tests Based on Contingency Tables dialog click on INPUT and then click on analysis parameters: Figure 188: Chi Square Test > Input > Analysis Parameters On this screen enter or select: • Processing Options Teradata Warehouse Miner User Guide - Volume 1 273 Chapter 3: Statistical Tests Tests Based on Contingency Tables • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. Chi Square Test - OUTPUT On the Tests Based on Contingency Tables dialog click on OUTPUT: Figure 189: Chi Square Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: 274 • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Tests Based on Contingency Tables • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the Chi Square Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. 
To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard

Results - Chi Square Analysis
The results of running the Chi Square Test analysis include a table with a row for each separate Chi Square test on all pairs of selected variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below.

Chi Square Test - RESULTS - SQL
On the Tests Based on Contingency Tables dialog click on RESULTS and then click on SQL:
Figure 190: Chi Square Test > Results > SQL
The series of SQL statements comprises the Chi Square Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used.

Chi Square Test - RESULTS - data
On the Tests Based on Contingency Tables dialog click on RESULTS and then click on data:
Figure 191: Chi Square Test > Results > data
The output table is generated by the Analysis for each separate Chi Square test on all pairs of selected variables.

Output Columns - Chi Square Test Analysis
The following table is built in the requested Output Database by the Chi Square test analysis. Column1 will be the Unique Primary Index (UPI).

Table 106: Chi Square Test Analysis: Output Columns
Name           Type     Definition
column1        Char     First of pair of variables
column2        Char     Second of pair of variables
Chisq          Float    Chi Square Value
DF             INTEGER  Degrees of Freedom
Z              Float    Z Score
CramersV       Float    Cramer’s V
PhiCoeff       Float    Phi coefficient
LlhChiSq       Float    Likelihood Ratio Chi Square
ContAdjChiSq   Float    Continuity-Adjusted Chi-Square
ContinCoeff    Float    Contingency Coefficient
ChiPValue      Float    The probability associated with the Chi Square statistic
ChiPText       Char     Text description if P is outside table range
ChiCallP_0.05  Char     The Chi Square result: a=accept, p=reject (positive), n=reject (negative)

Tutorial - Chi Square Test Analysis
In this example, a Chi Square test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Chi Square Test analysis as follows:
• Available Tables — twm_customer_analysis
• First Columns — female, single
• Second Columns — svacct, ccacct, ckacct
• Analysis Parameters
  • Threshold Probability — 0.05
Run the analysis and click on Results when it completes. For this example, the Chi Square Test analysis generated the tables shown below. The Chi Square Test was computed on all combinations of pairs of the two sets of variables, and results were sorted by column1 and column2. The tests show that the probabilities observed are the same for three pairs of variables and different for three other pairs: ‘p’ means significantly different and ‘a’ means not significantly different. The SQL is available for viewing but not listed below.
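For orientation, the same pairwise test and the associated measures of association can be sketched outside the database. The function below is an illustration only, not the SQL generated by Teradata Warehouse Miner; it assumes pandas, numpy and scipy, a hypothetical twm_customer_analysis.csv export of the input table, and textbook formulas for phi, Cramer's V and the contingency coefficient.

# Illustrative sketch only: a client-side chi-square test of independence for one
# pair of columns plus related association measures, not the generated SQL.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("twm_customer_analysis.csv")   # hypothetical export of the input table

def chi_square_pair(data, col1, col2, threshold=0.05):
    table = pd.crosstab(data[col1], data[col2])               # RxC contingency table
    chi2, p, dof, _ = stats.chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    r, c = table.shape
    phi = np.sqrt(chi2 / n)                                    # phi coefficient
    cramers_v = np.sqrt(chi2 / (n * (min(r, c) - 1)))          # Cramer's V
    conting = np.sqrt(chi2 / (chi2 + n))                       # contingency coefficient
    llh, _, _, _ = stats.chi2_contingency(table, correction=False,
                                          lambda_="log-likelihood")  # likelihood ratio chi-square
    adj, _, _, _ = stats.chi2_contingency(table, correction=True)    # continuity-adjusted (2x2)
    return {"column1": col1, "column2": col2, "Chisq": chi2, "DF": dof,
            "CramersV": cramers_v, "PhiCoeff": phi, "LlhChiSq": llh,
            "ContAdjChiSq": adj, "ContinCoeff": conting,
            "ChiPValue": p, "Call": "p" if p < threshold else "a"}

# One row per combination of First Columns and Second Columns, as in the tutorial
pairs = [(a, b) for a in ("female", "single") for b in ("ccacct", "ckacct", "svacct")]
print(pd.DataFrame([chi_square_pair(df, a, b) for a, b in pairs]))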
Table 107: Chi Square Test (Part 1)
column1  column2  Chisq      DF  Z            CramersV     PhiCoeff     LlhChiSq
female   ccacct   3.2131312  1   1.480358596  0.065584911  0.065584911  3.21543611
female   ckacct   8.2389731  1   2.634555949  0.105021023  0.105021023  8.23745744
female   svacct   3.9961257  1   1.716382791  0.073140727  0.073140727  3.98861957
single   ccacct   6.9958187  1   2.407215881  0.096774063  0.096774063  7.01100739
single   ckacct   0.6545145  1   0.191899245  0.02960052   0.02960052   0.65371179
single   svacct   1.5387084  1   0.799100586  0.045385576  0.045385576  1.53297321

Table 108: Chi Square Test (Part 2)
column1  column2  ContAdjChiSq  ContinCoeff  ChiPValue    ChiPText  ChiCallP_0.05
female   ccacct   2.954339388   0.065444311  0.077657185            a
female   ckacct   7.817638955   0.10444661   0.004512106            p
female   svacct   3.697357526   0.072945873  0.046729867            p
single   ccacct   6.600561728   0.096324066  0.00854992             p
single   ckacct   0.536617115   0.029587561  0.25         >0.25     a
single   svacct   1.35045989    0.045338905  0.226624385            a

Median Test
The Median test is a special case of the chi-square test with fixed marginal totals. It tests whether several samples came from populations with the same median. The null hypothesis is that all samples have the same median. The median test is applied to data in similar cases as the ANOVA for independent samples, but when:
1 the data are markedly non-normally distributed,
2 the measurement scale of the dependent variable is ordinal (not interval or ratio), or
3 the data sample is too small.
Note: The Median test is a less powerful non-parametric test than alternative rank tests due to the fact that the dependent variable is dichotomized at the median. Because this technique tends to discard most of the information inherent in the data, it is less often used. Frequencies are evaluated by a simple 2 x 2 contingency table, so it becomes simply a 2 x 2 chi square test of independence with 1 DF.
Given k independent samples of numeric values, a Median test is produced for each set of unique values of the group-by variables (GBV's), if any, testing whether all the populations have the same median. Output for each set of unique values of the GBV's is a p-value, which when compared to the user’s threshold, determines whether the null hypothesis should be rejected for the unique set of values of the GBV's. For more than 2 samples, this is sometimes called the Brown-Mood test.

Initiate a Median Test
Use the following procedure to initiate a new Median Test in Teradata Warehouse Miner:
1 Click on the Add New Analysis icon in the toolbar:
Figure 192: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Tests Based on Contingency Tables:
Figure 193: Add New Analysis > Statistical Tests > Tests Based On Contingency Tables
3 This will bring up the Tests Based on Contingency Tables dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
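Before stepping through the dialogs, the Median (Brown-Mood) test described above can be sketched outside the database for reference. This is an illustration only, not the SQL that Teradata Warehouse Miner generates; it assumes pandas and scipy (scipy.stats.median_test implements Mood's median test) and a hypothetical twm_customer.csv export of the input table, using the columns from the tutorial later in this section.

# Illustrative sketch only: a client-side Median (Brown-Mood) test per group-by
# value, not the SQL generated by Teradata Warehouse Miner.
import pandas as pd
from scipy import stats

df = pd.read_csv("twm_customer.csv")            # hypothetical export of the input table

threshold = 0.01
rows = []
for group, chunk in df.groupby("years_with_bank"):
    # one sample of income per marital_status category
    samples = [g["income"].dropna().to_numpy() for _, g in chunk.groupby("marital_status")]
    chisq, p, grand_median, counts = stats.median_test(*samples)
    rows.append({
        "years_with_bank": group,
        "ChiSq": chisq,
        "DF": len(samples) - 1,
        "MedianPValue": p,
        "Call": "p" if p < threshold else "a",   # p=medians differ, a=same median
    })
print(pd.DataFrame(rows).sort_values("years_with_bank"))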
Median Test - INPUT - Data Selection On the Tests Based on Contingency Tables dialog click on INPUT and then click on data selection: 278 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Tests Based on Contingency Tables Figure 194: Median Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the Tests Based on Contingency Tables available (Chi Square, Median). Select “Median”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Columns and Group By Columns. Make sure you have the correct portion of the window highlighted. • Column of Interest — The numeric dependent variable for Median analysis. • Columns — The set of categorical independent variables for Median analysis. • Group By Columns — The column(s) that specifies the variable(s) whose distinct value combinations will categorize the data, so a separate test is performed on each category. Teradata Warehouse Miner User Guide - Volume 1 279 Chapter 3: Statistical Tests Tests Based on Contingency Tables Median Test - INPUT - Analysis Parameters On the Tests Based on Contingency Tables dialog click on INPUT and then click on analysis parameters: Figure 195: Median Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. Median Test - OUTPUT On the Tests Based on Contingency Tables dialog click on OUTPUT: Figure 196: Median Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: 280 • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. 
• Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Tests Based on Contingency Tables • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the Median Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Median Analysis The results of running the Median Test analysis include a table with a row for each separate Median test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below. Median Test - RESULTS - SQL On the Tests Based on Contingency Tables dialog click on RESULTS and then click on SQL: Figure 197: Median Test > Results > SQL The series of SQL statements comprise the Median Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used. Median Test - RESULTS - data On the Tests Based on Contingency Tables dialog click on RESULTS and then click on data: Teradata Warehouse Miner User Guide - Volume 1 281 Chapter 3: Statistical Tests Tests Based on Contingency Tables Figure 198: Median Test > Results > data The output table is generated by the Analysis for each group-by variable combination. Output Columns - Median Test Analysis The following table is built in the requested Output Database by the Median Test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise ChiSq will be the UPI. 
Table 109: Median Test Analysis: Output Columns
Name              Type     Definition
Chisq             Float    Chi Square Value
DF                INTEGER  Degrees of Freedom
MedianPValue      Float    The probability associated with the Chi Square statistic
MedianPText       Char     Text description if P is outside table range
MedianCallP_0.01  Char     The Chi Square result: a=accept, p=reject (positive), n=reject (negative)

Tutorial - Median Test Analysis
In this example, a Median test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Median Test analysis as follows:
• Available Tables — twm_customer
• Column of Interest — income
• Columns — marital_status
• Group By Columns — years_with_bank
• Analysis Parameters
  • Threshold Probability — 0.01
Run the analysis and click on Results when it completes. For this example, the Median Test analysis generated the following table. The Median Test was computed on income over marital_status by years_with_bank, and results were sorted by years_with_bank. The tests show that values came from populations with the same median where MedianCallP_0.01 = ‘a’ (accept the null hypothesis) and from populations with different medians where it is ‘p’ (reject the null hypothesis). The SQL is available for viewing but not listed below.

Table 110: Median Test
years_with_bank  ChiSq        DF  MedianPValue  MedianPText  MedianCallP_0.01
0                12.13288563  3   0.007361344                p
1                12.96799683  3   0.004848392                p
2                13.12480388  3   0.004665414                p
3                8.504645761  3   0.038753824                a
4                4.458333333  3   0.225502846                a
5                15.81395349  3   0.001527445                p
6                4.531466733  3   0.220383974                a
7                11.35971787  3   0.009950322                p
8                2.855999742  3   0.25          >0.25        a
9                2.23340311   3   0.25          >0.25        a

Rank Tests
Tests Based on Ranks use the ranks of the data rather than the data itself to calculate statistics. Therefore the data must have at least an ordinal scale of measurement. If data are non-numeric but ordinal and ranked, these rank tests may be the most powerful tests available. Even numeric variables which meet the requirements of parametric tests, such as independent, randomly distributed normal variables, can be efficiently analyzed by these tests. These rank tests are valid for variables which are continuous, discrete, or a mixture of both. Types of Rank tests supported by Teradata Warehouse Miner include:
• Mann-Whitney/Kruskal-Wallis
• Mann-Whitney/Kruskal-Wallis (Independent Tests)
• Wilcoxon Signed Rank
• Friedman

Mann-Whitney/Kruskal-Wallis Test
The selection of which test to execute is automatically based on the number of distinct values of the independent variable. The Mann-Whitney is used for two groups, the Kruskal-Wallis for three or more groups. A special version of the Mann-Whitney/Kruskal-Wallis test performs a separate, independent test for each independent variable, and displays the result of each test with its accompanying column name. Under the primary version of the Mann-Whitney/Kruskal-Wallis test, all independent variable value combinations are used, often forcing the Kruskal-Wallis test, since the number of value combinations exceeds two. When a variable which has more than two distinct values is included in the set of independent variables, then the Kruskal-Wallis test is performed for all variables. Since Kruskal-Wallis is a generalization of Mann-Whitney, the Kruskal-Wallis results are valid for all the variables, including two-valued ones.
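This selection logic can be illustrated outside the database. The sketch below is not the SQL that Teradata Warehouse Miner generates; it assumes pandas and scipy and a hypothetical twm_customer.csv export of the input table, and simply switches between the Mann-Whitney and Kruskal-Wallis routines based on how many groups the independent variable defines.

# Illustrative sketch only: Mann-Whitney for two groups, Kruskal-Wallis for three
# or more, mirroring the selection rule described above; not the generated SQL.
import pandas as pd
from scipy import stats

df = pd.read_csv("twm_customer.csv")            # hypothetical export of the input table

def rank_test(data, dependent, independent):
    """Choose Mann-Whitney or Kruskal-Wallis from the number of distinct groups."""
    samples = [g[dependent].dropna().to_numpy() for _, g in data.groupby(independent)]
    if len(samples) == 2:
        u, p = stats.mannwhitneyu(samples[0], samples[1], alternative="two-sided")
        return {"test": "Mann-Whitney", "statistic": u, "p": p}
    h, p = stats.kruskal(*samples)
    return {"test": "Kruskal-Wallis", "statistic": h, "p": p}

# gender has two distinct values -> Mann-Whitney; marital_status has four -> Kruskal-Wallis
print(rank_test(df, "income", "gender"))
print(rank_test(df, "income", "marital_status"))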
In the discussion below, both types of Mann-Whitney/Kruskal-Wallis are referred to as MannWhitney/Kruskal-Wallis tests, since the only difference is the way the independent variable is treated. The Mann-Whitney test, AKA Wilcoxon Two Sample Test, is the nonparametric analog of the 2-sample t test. It is used to compare two independent groups of sampled data, and tests whether they are from the same population or from different populations (i.e., whether the samples have the same distribution function). Unlike the parametric t-test, this nonparametric test makes no assumptions about the distribution of the data (e.g., normality). It is to be used as an alternative to the independent group t-test, when the assumption of normality or equality of variance is not met. Like many non-parametric tests, it uses the ranks of the data rather than the data itself to calculate the U statistic. But since the Mann-Whitney test makes no distribution assumption, it is less powerful than the t-test. On the other hand, the Mann-Whitney is more powerful than the t-test when parametric assumptions are not met. Another advantage is that it will provide the same results under any monotonic transformation of the data so the results of the test are more generalizable. The Mann-Whitney is used when the independent variable is nominal or ordinal and the dependent variable is ordinal (or treated as ordinal). The main assumption is that the variable on which the 2 groups are to be compared is continuously distributed. This variable may be non-numeric, and if so, is converted to a rank based on alphanumeric precedence. The null hypothesis is that both samples have the same distribution. The alternative hypotheses are that the distributions differ from each other in either direction (two-tailed test), or in a specific direction (upper-tailed or lower-tailed tests). Output is a p-value, which when compared to the user’s threshold, determines whether the null hypothesis should be rejected. Given one or more columns (independent variables) whose values define two independent groups of sampled data, and a column (dependent variable) whose distribution is of interest from the same input table, the Mann-Whitney test is performed for each set of unique values of the group-by variables (GBV's), if any. The Kruskal-Wallis test is the nonparametric analog of the one-way analysis of variance or Ftest used to compare three or more independent groups of sampled data. When there are only two groups, it reduces to the Mann-Whitney test (above). The Kruskal-Wallis test tests whether multiple samples of data are from the same population or from different populations (i.e., whether the samples have the same distribution function). Unlike the parametric independent group ANOVA (one way ANOVA), this non-parametric test makes no assumptions about the distribution of the data (e.g., normality). Since this test does not make a distributional assumption, it is not as powerful as ANOVA. Given k independent samples of numeric values, a Kruskal-Wallis test is produced for each set of unique values of the GBV's, testing whether all the populations are identical. This test variable may be non-numeric, and if so, is converted to a rank based on alphanumeric precedence. The null hypothesis is that all samples have the same distribution. The alternative hypotheses are that the distributions differ from each other. 
Output for each unique set of 284 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Rank Tests values of the GBV's is a statistic H, and a p-value, which when compared to the user’s threshold, determines whether the null hypothesis should be rejected for the unique set of values of the GBV's. Initiate a Mann-Whitney/Kruskal-Wallis Test Use the following procedure to initiate a new Mann-Whitney/Kruskal-Wallis Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 199: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Rank Tests: Figure 200: Add New Analysis > Statistical Tests > Rank Tests 3 This will bring up the Rank Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Teradata Warehouse Miner User Guide - Volume 1 285 Chapter 3: Statistical Tests Rank Tests Mann-Whitney/Kruskal-Wallis Test - INPUT - Data Selection On the Ranks Tests dialog click on INPUT and then click on data selection: Figure 201: Mann-Whitney/Kruskal-Wallis Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the Tests Based on Ranks available (Mann-Whitney/Kruskal-Wallis, MannWhitney/Kruskal-Wallis Independent Tests, Wilcoxon, Friedman). Select “MannWhitney/Kruskal-Wallis” or Mann-Whitney/Kruskal-Wallis Independent Tests. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Columns or Group By Columns. Make sure you have the correct portion of the window highlighted. • 286 Column of Interest — The column that specifies the dependent variable to be tested. Note that this variable may be non-numeric, but if so, will be converted to a rank based on alphanumeric precedence. 
Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Rank Tests • Columns — The columns that specify the independent variables, categorizing the data. • Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category. Mann-Whitney/Kruskal-Wallis Test - INPUT - Analysis Parameters On the Rank Tests dialog click on INPUT and then click on analysis parameters: Figure 202: Mann-Whitney/Kruskal-Wallis Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. • Single Tail — Select the box if single tailed test is desired (default is two-tailed). The single-tail option is only valid if the test is Mann-Whitney. Mann-Whitney/Kruskal-Wallis Test - OUTPUT On the Rank Tests dialog click on OUTPUT: Figure 203: Mann-Whitney/Kruskal-Wallis Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure Teradata Warehouse Miner User Guide - Volume 1 287 Chapter 3: Statistical Tests Rank Tests here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the Mann-Whitney/Kruskal-Wallis Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. 
Run the Mann-Whitney/Kruskal-Wallis Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Mann-Whitney/Kruskal-Wallis Test Analysis
The results of running the Mann-Whitney/Kruskal-Wallis Test analysis include a table with a row for each separate Mann-Whitney/Kruskal-Wallis test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. In the case of Mann-Whitney/Kruskal-Wallis Independent Tests, the results will be displayed with a separate row for each independent variable column-name. All of these results are outlined below.
Mann-Whitney/Kruskal-Wallis Test - RESULTS - SQL
On the Rank Tests dialog click on RESULTS and then click on SQL:
Figure 204: Mann-Whitney/Kruskal-Wallis Test > Results > SQL
The series of SQL statements comprises the Mann-Whitney/Kruskal-Wallis Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used.
Mann-Whitney/Kruskal-Wallis Test - RESULTS - data
On the Rank Tests dialog click on RESULTS and then click on data:
Figure 205: Mann-Whitney/Kruskal-Wallis Test > Results > data
The output table is generated by the Analysis for each separate Mann-Whitney/Kruskal-Wallis test on all distinct-value group-by variables.
Output Columns - Mann-Whitney/Kruskal-Wallis Test Analysis
The following table is built in the requested Output Database by the Mann-Whitney/Kruskal-Wallis test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise Z will be the UPI. In the case of Mann-Whitney/Kruskal-Wallis Independent Tests, the additional column _twm_independent_variable will contain the column-name of the independent variable for each separate test.
Table for Mann-Whitney (if two groups)
Table 111: Table for Mann-Whitney (if two groups)
Name                     Type     Definition
Z                        Float    Mann-Whitney Z Value
MannWhitneyPValue        Float    The probability associated with the Mann-Whitney/Kruskal-Wallis statistic
MannWhitneyCallP_0.01    Char     The Mann-Whitney/Kruskal-Wallis result: a=accept, p=reject
Table 112: Table for Kruskal-Wallis (if more than two groups)
Name                     Type     Definition
Z                        Float    Kruskal-Wallis Z Value
ChiSq                    Float    Kruskal-Wallis Chi Square Statistic
DF                       Integer  Degrees of Freedom
KruskalWallisPValue      Float    The probability associated with the Kruskal-Wallis statistic
KruskalWallisPText       Char     The text description of probability if out of table range
KruskalWallisCallP_0.01  Char     The Kruskal-Wallis result: a=accept, p=reject
Tutorial 1 - Mann-Whitney Test Analysis
In this example, a Mann-Whitney test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Mann-Whitney Test analysis as follows:
• Available Tables — twm_customer
• Column of Interest — income
• Columns — gender (2 distinct values -> Mann-Whitney test)
• Group By Columns — years_with_bank
• Analysis Parameters
• Threshold Probability — 0.01
• Single Tail — false (default)
Run the analysis and click on Results when it completes. For this example, the Mann-Whitney Test analysis generated the following table. The Mann-Whitney Test was computed for each distinct value of the group by variable "years_with_bank". Results were sorted by years_with_bank. The tests show that customer incomes by gender were from the same population for all values of years_with_bank (an 'a' means accept the null hypothesis). The SQL is available for viewing but not listed below.
Table 113: Mann-Whitney Test
years_with_bank  Z        MannWhitneyPValue  MannWhitneyCallP_0.01
0                -0.0127  0.9896             a
1                -0.2960  0.7672             a
2                -0.4128  0.6796             a
3                -0.6970  0.4858             a
4                -1.8088  0.0705             a
5                -2.2541  0.0242             a
6                -0.8683  0.3854             a
7                -1.7074  0.0878             a
8                -0.8617  0.3887             a
9                -0.4997  0.6171             a
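As background for reading the Z and ChiSq columns, the usual large-sample forms of these statistics are shown below (standard textbook expressions, as in Conover, Practical Nonparametric Statistics, listed in Appendix A). They are given for orientation only; the SQL generated by the analysis may additionally apply tie corrections, so values can differ slightly. With two groups of sizes n_1 and n_2, N = n_1 + n_2, and W the sum of the ranks of the first group when both groups are ranked together:

  z = \frac{W - n_1 (N+1)/2}{\sqrt{n_1 n_2 (N+1)/12}}

With k > 2 groups, rank sums R_j and group sizes n_j:

  H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1)

which is referred to a chi-square distribution on k - 1 degrees of freedom.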
Tutorial 2 - Kruskal-Wallis Test Analysis
In this example, a Kruskal-Wallis test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Kruskal-Wallis Test analysis as follows:
• Available Tables — twm_customer
• Column of Interest — income
• Columns — marital_status (4 distinct values -> Kruskal-Wallis test)
• Group By Columns — years_with_bank
• Analysis Parameters
• Threshold Probability — 0.01
• Single Tail — false (default)
Run the analysis and click on Results when it completes. For this example, the Kruskal-Wallis Test analysis generated the following table. The test was computed for each distinct value of the group by variable "years_with_bank". Results were sorted by years_with_bank. The tests show that customer incomes by marital_status were from the same population for years_with_bank 4, 6, 8 and 9. Those with years_with_bank 0-3, 5 and 7 were from different populations for each marital status. An 'n' or 'p' means significant and an 'a' means accept the null hypothesis. The SQL is available for viewing but not listed below.
Table 114: Kruskal-Wallis Test
years_with_bank  Z        ChiSq    DF  KruskalWallisPValue  KruskalWallisPText  KruskalWallisCallP_0.01
0                3.5507   20.3276  3   0.0002                                   p
1                4.0049   24.5773  3   0.0001               <0.0001             p
2                3.3103   18.2916  3   0.0004                                   p
3                3.0994   16.6210  3   0.0009                                   p
4                1.5879   7.5146   3   0.0596                                   a
5                4.3667   28.3576  3   0.0001               <0.0001             p
6                2.1239   10.2056  3   0.0186                                   a
7                3.2482   17.7883  3   0.0005                                   p
8                0.1146   2.6303   3   0.25                 >0.25               a
9                -0.1692  2.0436   3   0.25                 >0.25               a
Tutorial 3 - Mann-Whitney Independent Tests Analysis
In this example, a Mann-Whitney Independent Tests analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Mann-Whitney Independent Tests analysis as follows:
• Available Tables — twm_customer_analysis
• Column of Interest — income
• Columns — gender, ccacct, ckacct, svacct
• Group By Columns — (none)
• Analysis Parameters
• Threshold Probability — 0.05
• Single Tail — false (default)
Run the analysis and click on Results when it completes. For this example, the Mann-Whitney Independent Tests analysis generated the following table. The Mann-Whitney Test was computed separately for each independent variable. The tests show that customer incomes by gender and by svacct were from different populations, and that customer incomes by ckacct and by ccacct were from identical populations. The SQL is available for viewing but not listed below.
Table 115: Mann-Whitney Test
_twm_independent_variable  Z            MannWhitneyPValue  MannWhitneyCallP_0.05
gender                     -3.00331351  0.002673462        n
svacct                     -3.37298401  0.000743646        n
ckacct                     -1.92490664  0.05422922         a
ccacct                     1.764991014  0.077563672        a
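To connect Tutorial 3 with the normal approximation shown after Tutorial 1, the sketch below computes a rough z value for one of the independent columns (ccacct) directly in SQL. It is a hypothetical illustration only: it assumes ccacct is a 0/1 indicator in the twm_source demonstration database, uses RANK() rather than midranks, and ignores tie corrections, so it will not exactly reproduce the values in Table 115.

  -- Illustrative only: approximate Mann-Whitney z for income split by ccacct.
  SELECT (rank_sum - n1 * (n1 + n2 + 1) / 2.0)
         / SQRT(n1 * n2 * (n1 + n2 + 1) / 12.0) AS z_approx
  FROM (
    SELECT SUM(CASE WHEN ccacct = 1 THEN income_rank END) AS rank_sum,
           SUM(CASE WHEN ccacct = 1 THEN 1 ELSE 0 END)    AS n1,
           SUM(CASE WHEN ccacct = 0 THEN 1 ELSE 0 END)    AS n2
    FROM (SELECT ccacct,
                 RANK() OVER (ORDER BY income) AS income_rank
          FROM "twm_source"."twm_customer_analysis") AS ranked
  ) AS sums;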
Wilcoxon Signed Ranks Test
The Wilcoxon Signed Ranks Test is an alternative analogous to the t-test for correlated samples. The correlated-samples t-test makes assumptions about the data and can be properly applied only if certain assumptions are met:
1 The scale of measurement has the properties of an equal-interval scale.
2 Differences between paired values are randomly selected from the source population.
3 The source population has a normal distribution.
If any of these assumptions are invalid, the t-test for correlated samples should not be used. The most common violation is a scale of measurement that lacks equal-interval properties, for example when the measures come from a rating scale. When data within two correlated samples fail to meet one or another of the assumptions of the t-test, an appropriate non-parametric alternative is the Wilcoxon Signed-Rank Test, a test based on ranks. Assumptions for this test are:
1 The distribution of difference scores is symmetric (which implies an equal-interval scale).
2 Difference scores are mutually independent.
3 Difference scores have the same mean.
The original measures are replaced with ranks, resulting in an analysis of only the ordinal relationships. The signed ranks are organized and summed, giving a number, W. When the numbers of positive and negative signs are about equal (i.e., there is no tendency in either direction), the value of W will be near zero, and the null hypothesis will be supported. A large positive or negative sum indicates a consistent tendency in one direction, that is, a difference between the paired samples in the specified direction.
Given a table name and the names of paired numeric columns, a Wilcoxon test is produced. The Wilcoxon tests whether a sample comes from a population with a specific mean or median. The null hypothesis is that the samples come from populations with the same mean or median. The alternative hypothesis is that the samples come from populations with different means or medians (two-tailed test), or that in addition the difference is in a specific direction (upper-tailed or lower-tailed tests). Output is a p-value, which when compared to the user's threshold, determines whether the null hypothesis should be rejected.
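For reference, when there are n nonzero paired differences d_i, with R_i the rank of |d_i|, the sum of signed ranks W described above is commonly converted to a large-sample z statistic as follows. This is a textbook form shown only for orientation; the SQL generated by the analysis may differ in details such as tie handling and the treatment of zero differences.

  W = \sum_{i=1}^{n} \operatorname{sgn}(d_i)\, R_i, \qquad
  z \approx \frac{W}{\sqrt{n(n+1)(2n+1)/6}}

Under the null hypothesis z is approximately standard normal, which is the basis for the reported p-value.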
Initiate a Wilcoxon Signed Ranks Test
Use the following procedure to initiate a new Wilcoxon Signed Ranks Test in Teradata Warehouse Miner:
1 Click on the Add New Analysis icon in the toolbar:
Figure 206: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Rank Tests:
Figure 207: Add New Analysis > Statistical Tests > Rank Tests
3 This will bring up the Rank Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
Wilcoxon Signed Ranks Test - INPUT - Data Selection
On the Rank Tests dialog click on INPUT and then click on data selection:
Figure 208: Wilcoxon Signed Ranks Test > Input > Data Selection
On this screen select:
1 Select Input Source
Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement.
2 Select Columns From a Single Table
• Available Databases (or Analyses) — These are the databases (or analyses) available to be processed.
• Available Tables — These are the tables and views that are available to be processed.
• Available Columns — These are the columns within the table/view that are available for processing.
3 Select Statistical Test Style
These are the Tests Based on Ranks available (Mann-Whitney/Kruskal-Wallis, Wilcoxon, Friedman). Select "Wilcoxon".
4 Select Optional Columns
• Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can insert columns as First Column, Second Column, Group By Columns. Make sure you have the correct portion of the window highlighted.
• First Column — The column that specifies the variable from the first sample.
• Second Column — The column that specifies the variable from the second sample.
• Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category.
Wilcoxon Signed Ranks Test - INPUT - Analysis Parameters
On the Rank Tests dialog click on INPUT and then click on analysis parameters:
Figure 209: Wilcoxon Signed Ranks Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
• Threshold Probability — Enter the "alpha" probability below which to reject the null hypothesis.
• Single Tail — Select the box if a single-tailed test is desired (default is two-tailed). The single-tail option is only valid if the test is Mann-Whitney.
• Include Zero — The "include zero" option generates a variant of the Wilcoxon in which zero differences are included with the positive count. The default "discard zero" option is the true Wilcoxon.
Wilcoxon Signed Ranks Test - OUTPUT
On the Rank Tests dialog click on OUTPUT:
Figure 210: Wilcoxon Signed Ranks Test > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified:
• Database Name — The database where the output table will be saved.
• Output Name — The table name that the output will be saved under.
• Output Type — The output type must be table when storing Statistical Test output in the database.
• Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here.
This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis.
• Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu.)
• Create output table using fallback keyword — The Fallback keyword will be used to create the table.
• Create output table using multiset keyword — The Multiset keyword will be used to create the table.
• Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature "advertises" output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis.
• Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
• Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed.
Run the Wilcoxon Signed Ranks Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Wilcoxon Analysis
The results of running the Wilcoxon Signed Ranks Test analysis include a table with a row for each separate Wilcoxon Signed Ranks Test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below.
Wilcoxon Signed Ranks Test - RESULTS - SQL
On the Rank Tests dialog click on RESULTS and then click on SQL:
Figure 211: Wilcoxon Signed Ranks Test > Results > SQL
The series of SQL statements comprises the Wilcoxon Signed Ranks Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used.
Wilcoxon Signed Ranks Test - RESULTS - data
On the Rank Tests dialog click on RESULTS and then click on data:
Figure 212: Wilcoxon Signed Ranks Test > Results > data
The output table is generated by the Analysis for each separate Wilcoxon Signed Ranks Test on all distinct-value group-by variables.
Output Columns - Wilcoxon Signed Ranks Test Analysis
The following table is built in the requested Output Database by the Wilcoxon Signed Ranks Test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise Z_ will be the UPI.
Table 116: Wilcoxon Signed Ranks Test Analysis: Output Columns
Name                Type     Definition
N                   Integer  Variable count
Z_                  Float    Wilcoxon Z Value
WilcoxonPValue      Float    The probability associated with the Wilcoxon statistic
WilcoxonCallP_0.05  Char     The Wilcoxon result: a=accept, p or n=reject
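The Z_ value reported above is derived from the sum of signed ranks of the paired differences. As a rough illustration of that intermediate quantity (not the SQL the analysis generates), the following query computes a signed-rank sum of avg_ck_bal minus avg_sv_bal per years_with_bank group in the demonstration data, discarding zero differences as the default option does. It uses RANK() for simplicity, whereas a proper implementation uses midranks for tied absolute differences; the twm_source database name is an assumption.

  -- Illustrative only: signed-rank sum W per years_with_bank group.
  SELECT years_with_bank,
         COUNT(*) AS n,
         SUM(CASE WHEN diff > 0 THEN abs_rank ELSE -abs_rank END) AS w
  FROM (
    SELECT years_with_bank,
           avg_ck_bal - avg_sv_bal AS diff,
           RANK() OVER (PARTITION BY years_with_bank
                        ORDER BY ABS(avg_ck_bal - avg_sv_bal)) AS abs_rank
    FROM "twm_source"."twm_customer_analysis"
    WHERE avg_ck_bal - avg_sv_bal <> 0
  ) AS d
  GROUP BY years_with_bank
  ORDER BY years_with_bank;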
Teradata Warehouse Miner User Guide - Volume 1 297 Chapter 3: Statistical Tests Rank Tests Table 116: Wilcoxon Signed Ranks Test Analysis: Output Columns Name Type Definition N Integer variable count Z_ Float Mann-Whitney Z Value WilcoxonPValue Float The probability associated with the Wilcoxon statistic WilcoxonCallP_0.05 Char The Wilcoxon result: a=accept, p or n=reject Tutorial - Wilcoxon Test Analysis In this example, a Wilcoxon test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Wilcoxon Test analysis as follows: • Available Tables — twm_customer_analysis • First Column — avg_ck_bal • Second Column — avg_sv_bal • Group By Columns — years_with_bank • Analysis Parameters • Threshold Probability — 0.05 • Single Tail — false (default) • Include Zero — false (default) Run the analysis and click on Results when it completes. For this example, the Wilcoxon Test analysis generated the following table. The Wilcoxon Test was computed for each distinct value of the group by variable “gender”. The tests show the samples of avg_ck_bal and avg_ sv_bal came from populations with the same mean or median for customers with years_with_ bank of 0, 4-9, and from populations with different means or medians for those with years_ with_bank of 1-3. An ‘n’ or ‘p’ means significant and an ‘a’ means accept the null hypothesis. The SQL is available for viewing but not listed below. Table 117: Wilcoxon Test years_with_bank N Z_ WilcoxonPValue WilcoxonCallP_0.05 0 75 -1.77163 0.07639 a 1 77 -3.52884 0.00042 n 2 83 -2.94428 0.00324 n 3 69 -2.03882 0.04145 n 4 69 -0.56202 0.57412 a 5 67 -1.95832 0.05023 a 6 65 -1.25471 0.20948 a 7 48 -0.44103 0.65921 a 298 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Rank Tests Table 117: Wilcoxon Test years_with_bank N Z_ WilcoxonPValue WilcoxonCallP_0.05 8 39 -1.73042 0.08363 a 9 33 -1.45623 0.14539 a Friedman Test with Kendall's Coefficient of Concordance & Spearman's Rho The Friedman test is an extension of the sign test for several independent samples. It is analogous to the 2-way Analysis of Variance, but depends only on the ranks of the observations, so it is like a 2-way ANOVA on ranks. The Friedman test should not be used for only three treatments due to lack of power, and is best for six or more treatments. It is a test for treatment differences in a randomized, complete block design. Data consists of b mutually independent k-variate random variables called blocks. The Friedman assumptions are that the data in these blocks are mutually independent, and that within each block, observations are ordinally rankable according to some criterion of interest. A Friedman Test is produced using rank scores and the F table, though alternative implementations call it the Friedman Statistic and use the chi-square table. Note that when all of the treatments are not applied to each block, it is an incomplete block design. The requirements of the Friedman test are not met under these conditions, and other tests such as the Durban test should be applied. In addition to the Friedman statistics, Kendall’s Coefficient of Concordance (W) is produced, as well as Spearman’s Rho. Kendall's coefficient of concordance can range from 0 to 1. The higher its value, the stronger the association. W is 1.0 if all treatments receive the same rankness in all blocks, and 0 if there is “perfect disagreement” among blocks. Spearman's rho is a measure of the linear relationship between two variables. 
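For orientation, the quantities reported by this analysis are related by standard formulas (textbook forms, as in Conover; the SQL generated by the analysis may handle ties differently). With b blocks, k treatments, and R_j the sum of ranks for treatment j across blocks:

  \chi_F^2 = \frac{12}{b\,k(k+1)} \sum_{j=1}^{k} R_j^2 - 3\,b\,(k+1), \qquad
  W = \frac{\chi_F^2}{b\,(k-1)}, \qquad
  F = \frac{(b-1)\,W}{1-W}

The F statistic is referred to an F distribution on k - 1 and (b - 1)(k - 1) degrees of freedom; this is the rank-score F form mentioned above. Implementations that use the chi-square table instead refer chi-square_F to a chi-square distribution on k - 1 degrees of freedom.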
Initiate a Friedman Test
Use the following procedure to initiate a new Friedman Test in Teradata Warehouse Miner:
1 Click on the Add New Analysis icon in the toolbar:
Figure 213: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Rank Tests:
Figure 214: Add New Analysis > Statistical Tests > Rank Tests
3 This will bring up the Rank Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
Friedman Test - INPUT - Data Selection
On the Rank Tests dialog click on INPUT and then click on data selection:
Figure 215: Friedman Test > Input > Data Selection
On this screen select:
1 Select Input Source
Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement.
2 Select Columns From a Single Table
• Available Databases (or Analyses) — These are the databases (or analyses) available to be processed.
• Available Tables — These are the tables and views that are available to be processed.
• Available Columns — These are the columns within the table/view that are available for processing.
3 Select Statistical Test Style
These are the Tests Based on Ranks available (Mann-Whitney/Kruskal-Wallis, Wilcoxon, Friedman). Select "Friedman".
4 Select Optional Columns
• Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Treatment Column, Block Column, Group By Columns. Make sure you have the correct portion of the window highlighted.
• Column of Interest — The column that specifies the dependent variable to be analyzed.
• Treatment Column — The column that specifies the independent categorical variable representing treatments within blocks.
• Block Column — The column that specifies the variable representing blocks, or independent experimental groups.
Warning: Equal cell counts are required for all Treatment Column x Block Column pairs. Division by zero may occur in the case of unequal cell counts. (A quick way to check the counts is shown after this list.)
• Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category.
Warning: Equal cell counts are required for all Treatment Column x Block Column pairs within each group. Division by zero may occur in the case of unequal cell counts.
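As referenced in the warning above, it can be worth confirming the cell counts before running the test. A simple check against the demonstration data (treatment gender, block marital_status) might look like the following; the twm_source database name is an assumption, and the columns should be replaced with your own treatment and block columns.

  -- Illustrative check: one row per treatment x block cell with its count.
  SELECT gender, marital_status, COUNT(*) AS cell_count
  FROM "twm_source"."twm_customer_analysis"
  GROUP BY gender, marital_status
  ORDER BY gender, marital_status;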
Friedman Test - INPUT - Analysis Parameters
On the Rank Tests dialog click on INPUT and then click on analysis parameters:
Figure 216: Friedman Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
• Threshold Probability — Enter the "alpha" probability below which to reject the null hypothesis.
Friedman Test - OUTPUT
On the Rank Tests dialog click on OUTPUT:
Figure 217: Friedman Test > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified:
• Database Name — The database where the output table will be saved.
• Output Name — The table name that the output will be saved under.
• Output Type — The output type must be table when storing Statistical Test output in the database.
• Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis.
• Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu.)
• Create output table using fallback keyword — The Fallback keyword will be used to create the table.
• Create output table using multiset keyword — The Multiset keyword will be used to create the table.
• Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature "advertises" output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis.
• Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
• Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed.
Run the Friedman Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the analysis.
To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Friedman Test Analysis
The results of running the Friedman Test analysis include a table with a row for each separate Friedman Test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below.
Friedman Test - RESULTS - SQL
On the Rank Tests dialog click on RESULTS and then click on SQL:
Figure 218: Friedman Test > Results > SQL
The series of SQL statements comprises the Friedman Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used.
Friedman Test - RESULTS - data
On the Rank Tests dialog click on RESULTS and then click on data:
Figure 219: Friedman Test > Results > data
The output table is generated by the Analysis for each separate Friedman Test on all distinct-value group-by variables.
Output Columns - Friedman Test Analysis
The following table is built in the requested Output Database by the Friedman Test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise Kendalls_W will be the UPI.
Table 118: Friedman Test Analysis: Output Columns
Name                   Type     Definition
Kendalls_W             Float    Kendall's W
Average_Spearmans_Rho  Float    Average Spearman's Rho
DF_1                   Integer  Degrees of Freedom for Treatments
DF_2                   Integer  Degrees of Freedom for Blocks
F                      Float    2-Way ANOVA F Statistic on ranks
FriedmanPValue         Float    The probability associated with the Friedman statistic
FriedmanPText          Char     The text description of probability if out of table range
FriedmanCallP_0.05     Char     The Friedman result: a=accept, p or n=reject
Tutorial - Friedman Test Analysis
In this example, a Friedman test analysis is performed on the fictitious banking data to analyze account usage. If the data does not have equal cell counts in the treatment x block cells, stratified sampling can be used to identify the smallest count, and then produce a temporary table which can be analyzed. The first step is to identify the smallest count with a Free Form SQL analysis (or two Variable Creation analyses) with SQL such as the following (be sure to set the database in the FROM clause to that containing the demonstration data tables):
SELECT MIN("_twm_N") AS smallest_count
FROM (
  SELECT marital_status, gender, COUNT(*) AS "_twm_N"
  FROM "twm_source"."twm_customer_analysis"
  GROUP BY "marital_status", "gender"
) AS "T0";
The second step is to use a Sample analysis with stratified sampling to create the temporary table with equal cell counts. The value 18 used in the stratified Sizes/Fractions parameter below corresponds to the smallest_count returned from above.
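As an aside, a work table with a fixed number of rows per cell can also be sketched directly in SQL using ROW_NUMBER with a QUALIFY clause, as below. This is only an illustration of the idea: ordering by cust_id makes the selection deterministic rather than random, so the Sample analysis described next remains the appropriate way to draw a proper stratified random sample. The value 18 and the table and column names follow the tutorial; the table is created in the login (default) database here.

  -- Illustrative only: keep 18 rows per gender x marital_status cell.
  CREATE MULTISET TABLE Twm_Friedman_Worktable AS (
    SELECT cust_id, gender, marital_status, income
    FROM "twm_source"."twm_customer_analysis"
    QUALIFY ROW_NUMBER() OVER (PARTITION BY gender, marital_status
                               ORDER BY cust_id) <= 18
  ) WITH DATA;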
Parameterize a Sample Analysis called Friedman Work Table Setup as follows:
Input Options:
• Available Tables — TWM_CUSTOMER_ANALYSIS
• Selected Columns and Aliases
• TWM_CUSTOMER_ANALYSIS.cust_id
• TWM_CUSTOMER_ANALYSIS.gender
• TWM_CUSTOMER_ANALYSIS.marital_status
• TWM_CUSTOMER_ANALYSIS.income
Analysis Parameters:
• Sample Style — Stratified
• Stratified Sample Options
• Create a separate sample for each fraction/size — Enabled
• Stratified Conditions
• gender='f' and marital_status='1'
• gender='f' and marital_status='2'
• gender='f' and marital_status='3'
• gender='f' and marital_status='4'
• gender='m' and marital_status='1'
• gender='m' and marital_status='2'
• gender='m' and marital_status='3'
• gender='m' and marital_status='4'
• Sizes/Fractions — 18 (use the same value for all conditions)
Output Options:
• Store the tabular output of this analysis in the database — Enabled
• Table Name — Twm_Friedman_Worktable
Finally, parameterize a Friedman Test analysis as follows:
Input Options:
• Select Input Source — Analysis
• Available Analyses — Friedman Work Table Setup
• Available Tables — Twm_Friedman_Worktable
• Select Statistical Test Style — Friedman
• Column of Interest — income
• Treatment Column — gender
• Block Column — marital_status
Analysis Parameters:
• Threshold Probability — 0.05
Run the analysis and click on Results when it completes. For this example, the Friedman Test analysis generated the following table. (Note that results may vary due to the use of sampling in creating the input table Twm_Friedman_Worktable.) The test shows that the analysis of income by treatment (male vs. female) differences is significant at better than the 0.001 probability level. An 'n' or 'p' means significant and an 'a' means accept the null hypothesis. The SQL is available for viewing but not listed below.
Table 119: Friedman Test
Kendalls_W   Average_Spearmans_Rho  DF_1  DF_2  F            FriedmanPValue  FriedmanPText  FriedmanCallP_0.001
0.763196925  0.773946177            1     71    228.8271876  0.001           <0.001         p
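As a quick consistency check on these values, assuming the reported F is the rank-score F form shown earlier and taking b - 1 = DF_2 = 71 (DF_2 being the block degrees of freedom, with k = 2 treatments):

  F = \frac{(b-1)\,W}{1-W} = \frac{71 \times 0.763196925}{1 - 0.763196925} \approx 228.83

which agrees with the F column above to rounding.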
APPENDIX A
References
1 Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, I., Fast Discovery of Association Rules. In Advances in Knowledge Discovery and Data Mining, 1996, eds. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy. Menlo Park, AAAI Press/The MIT Press.
2 Agresti, A. (1990) Categorical Data Analysis. Wiley, New York.
3 Arabie, P., Hubert, L., and DeSoete, G., Clustering and Classification, World Scientific, 1996
4 Belsley, D.A., Kuh, E., and Welsch, R.E. (1980) Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New York.
5 Bradley, P., Fayyad, U. and Reina, C., Scaling EM Clustering to Large Databases, Microsoft Research Technical Report MSR-TR-98-35, 1998
6 Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. Classification and Regression Trees. Wadsworth, Belmont, 1984.
7 Conover, W.J. Practical Nonparametric Statistics, 3rd Edition
8 Cox, D.R. and Hinkley, D.V. (1974) Theoretical Statistics. Chapman & Hall/CRC, New York.
9 D'Agostino, R.B. (1971) An omnibus test of normality for moderate and large size samples, Biometrika, 58, 341-348
10 D'Agostino, R.B. and Stephens, M.A., eds. Goodness-of-fit Techniques, 1986. New York: Dekker.
11 D'Agostino, R., Belanger, A., and D'Agostino, R. Jr., A Suggestion for Using Powerful and Informative Tests of Normality, American Statistician, 1990, Vol. 44, No. 4
12 Finn, J.D. (1974) A General Model for Multivariate Analysis. Holt, Rinehart and Winston, New York.
13 Harman, H.H. (1976) Modern Factor Analysis. University of Chicago Press, Chicago.
14 Harter, H.L. and Owen, D.B., eds, Selected Tables in Mathematical Statistics, Vol. 1. Providence, Rhode Island: American Mathematical Society.
15 Hosmer, D.W. and Lemeshow, S. (1989) Applied Logistic Regression. Wiley, New York.
16 Jennrich, R.I., and Sampson, P.F. (1966) Rotation For Simple Loadings. Psychometrika, Vol. 31, No. 3.
17 Johnson, R.A. and Wichern, D.W. (1998) Applied Multivariate Statistical Analysis, 4th Edition. Prentice Hall, New Jersey.
18 Kachigan, S.K. (1991) Multivariate Statistical Analysis. Radius Press, New York.
19 Kaiser, Henry F. (1958) The Varimax Criterion For Analytic Rotation In Factor Analysis. Psychometrika, Vol. 23, No. 3.
20 Kass, G.V. (1979) An Exploratory Technique for Investigating Large Quantities of Categorical Data, Applied Statistics (1980) 29, No. 2, pp. 119-127
21 Kaufman, L. and Rousseeuw, P., Finding Groups in Data, J Wiley & Sons, 1990
22 Kennedy, W.J. and Gentle, J.E. (1980) Statistical Computing. Marcel Dekker, New York.
23 Kleinbaum, D.G. and Kupper, L.L. (1978) Applied Regression Analysis and Other Multivariable Methods. Duxbury Press, North Scituate, Massachusetts.
24 Maddala, G.S. (1983) Limited-Dependent and Qualitative Variables In Econometrics. Cambridge University Press, Cambridge, United Kingdom.
25 Maindonald, J.H. (1984) Statistical Computation. Wiley, New York.
26 McCullagh, P.M. and Nelder, J.A. (1989) Generalized Linear Models, 2nd Edition. Chapman & Hall/CRC, New York.
27 McLachlan, G.J. and Krishnan, T., The EM Algorithm and Extensions, J Wiley & Sons, 1997
28 Menard, S. (1995) Applied Logistic Regression Analysis, Sage, Thousand Oaks
29 Mulaik, S.A. (1972) The Foundations of Factor Analysis. McGraw-Hill, New York.
30 Neter, J., Kutner, M.H., Nachtsheim, C.J., and Wasserman, W. (1996) Applied Linear Statistical Models, 4th Edition. WCB/McGraw-Hill, New York.
31 NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/, 2005.
32 Nocedal, J. and Wright, S.J. (1999) Numerical Optimization. Springer-Verlag, New York.
33 Orchestrate/OSH Component User's Guide Vol II, Analytics Library, Chapter 2: Introduction to Data Mining. Torrent Systems, Inc., 1997.
34 Ordonez, C. and Cereghini, P. (2000) SQLEM: Fast Clustering in SQL using the EM Algorithm. SIGMOD Conference 2000: 559-570
35 Ordonez, C. (2004) Programming the K-means clustering algorithm in SQL. KDD 2004: 823-828
36 Ordonez, C. (2004) Horizontal aggregations for building tabular data sets. DMKD 2004: 35-42
37 Pagano, M. and Gauvreau, K., Principles of Biostatistics, 2nd Edition
38 Peduzzi, P.N., Hardy, R.J., and Holford, T.R. (1980) A Stepwise Variable Selection Procedure for Nonlinear Regression Models. Biometrics 36, 511-516.
39 Pregibon, D. (1981) Logistic Regression Diagnostics. Annals of Statistics, Vol. 9, No. 4, 705-724.
40 PROPHET StatGuide, BBN Corporation, 1996.
41 Quinlan, J.R. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, 1993.
42 Roweis, S. and Ghahramani, Z., A Unifying Review of Linear Gaussian Models, Journal of Neural Computation, 1999
43 Royston, J.P., An Extension of Shapiro and Wilk's W Test for Normality to Large Samples, Applied Statistics, 1982, 31, No. 2, pp. 115-124
44 Royston, J.P., Algorithm AS 177: Expected normal order statistics (exact and approximate), 1982, Applied Statistics, 31, 161-165.
45 Royston, J.P., Algorithm AS 181: The W Test for Normality, 1982, Applied Statistics, 31, 176-180.
46 Royston, J.P., A Remark on Algorithm AS 181: The W Test for Normality, 1995, Applied Statistics, 44, 547-551.
47 Rubin, Donald B., and Thayer, Dorothy T. (1982) EM Algorithms For ML Factor Analysis. Psychometrika, Vol. 47, No. 1.
48 Shapiro, S.S. and Francia, R.S. (1972) An approximate analysis of variance test for normality, Journal of the American Statistical Association, 67, 215-216
49 SPSS 7.5 Statistical Algorithms Manual, SPSS Inc., Chicago.
50 SYSTAT 9: Statistics I. (1999) SPSS Inc., Chicago.
51 Takahashi, T. (2005) Getting Started: International Character Sets and the Teradata Database, Teradata Corporation, 541-0004068-C02
52 Tatsuoka, M.M. (1971) Multivariate Analysis: Techniques For Educational and Psychological Research. Wiley, New York.
53 Tatsuoka, M.M. (1974) Selected Topics in Advanced Statistics, Classification Procedures, Institute for Personality and Ability Testing, 1974
54 Teradata Database SQL Functions, Operators, Expressions, and Predicates, Release 15.0, B035-1145-015A, March 2014
55 Teradata Warehouse Miner Model Manager User Guide, B035-2303-106A, October 2016
56 Teradata Warehouse Miner Release Definition, B035-2494-106C, October 2016
57 Teradata Warehouse Miner User Guide, Volume 1 Introduction and Profiling, B035-2300-106A, October 2016
58 Teradata Warehouse Miner User Guide, Volume 2 ADS Generation, B035-2301-106A, October 2016
59 Teradata Warehouse Miner User Guide, Volume 3 Analytic Functions, B035-2302-106A, October 2016
60 Wendorf, Craig A., Manuals for Univariate and Multivariate Statistics, © 1997, Revised 2004-03-12, UWSP, 2005
61 Wilkinson, L., Blank, G., and Gruber, C. (1996) Desktop Data Analysis SYSTAT. Prentice Hall, New Jersey.