Teradata Warehouse Miner User Guide - Volume 3
Analytic Functions
Release 5.3.4
B035-2302-093A
September 2013

The product or products described in this book are licensed products of Teradata Corporation or its affiliates.

Teradata, BYNET, DBC/1012, DecisionCast, DecisionFlow, DecisionPoint, Eye logo design, InfoWise, Meta Warehouse, MyCommerce, SeeChain, SeeCommerce, SeeRisk, Teradata Warehouse Miner, Teradata Source Experts, WebAnalyst, and You’ve Never Seen Your Business Like This Before are trademarks or registered trademarks of Teradata Corporation or its affiliates.

Adaptec and SCSISelect are trademarks or registered trademarks of Adaptec, Inc. AMD Opteron and Opteron are trademarks of Advanced Micro Devices, Inc. BakBone and NetVault are trademarks or registered trademarks of BakBone Software, Inc. EMC, PowerPath, SRDF, and Symmetrix are registered trademarks of EMC Corporation. GoldenGate is a trademark of GoldenGate Software, Inc. Hewlett-Packard and HP are registered trademarks of Hewlett-Packard Company. Intel, Pentium, and XEON are registered trademarks of Intel Corporation. IBM, CICS, DB2, MVS, RACF, Tivoli, and VM are registered trademarks of International Business Machines Corporation. Linux is a registered trademark of Linus Torvalds. LSI and Engenio are registered trademarks of LSI Corporation. Microsoft, Active Directory, Windows, Windows NT, Windows Server, Windows Vista, Visual Studio and Excel are either registered trademarks or trademarks of Microsoft Corporation in the United States or other countries. Novell and SUSE are registered trademarks of Novell, Inc., in the United States and other countries. QLogic and SANbox are trademarks or registered trademarks of QLogic Corporation. SAS, SAS/C and Enterprise Miner are trademarks or registered trademarks of SAS Institute Inc. SPSS is a registered trademark of SPSS Inc. STATISTICA and StatSoft are trademarks or registered trademarks of StatSoft, Inc. SPARC is a registered trademark of SPARC International, Inc. Sun Microsystems, Solaris, Sun, and Sun Java are trademarks or registered trademarks of Sun Microsystems, Inc., in the United States and other countries. Symantec, NetBackup, and VERITAS are trademarks or registered trademarks of Symantec Corporation or its affiliates in the United States and other countries. Unicode is a collective membership mark and a service mark of Unicode, Inc. UNIX is a registered trademark of The Open Group in the United States and other countries. Other product and company names mentioned herein may be the trademarks of their respective owners.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN “AS-IS” BASIS, WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OF IMPLIED WARRANTIES, SO THE ABOVE EXCLUSION MAY NOT APPLY TO YOU. IN NO EVENT WILL TERADATA CORPORATION BE LIABLE FOR ANY INDIRECT, DIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS OR LOST SAVINGS, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

The information contained in this document may contain references or cross-references to features, functions, products, or services that are not announced or available in your country. Such references do not imply that Teradata Corporation intends to announce such features, functions, products, or services in your country.
Please consult your local Teradata Corporation representative for those features, functions, products, or services available in your country.

Information contained in this document may contain technical inaccuracies or typographical errors. Information may be changed or updated without notice. Teradata Corporation may also make improvements or changes in the products or services described in this information at any time without notice.

To maintain the quality of our products and services, we would like your comments on the accuracy, clarity, organization, and value of this document. Please e-mail: [email protected]

Any comments or materials (collectively referred to as “Feedback”) sent to Teradata Corporation will be deemed non-confidential. Teradata Corporation will have no obligation of any kind with respect to Feedback and will be free to use, reproduce, disclose, exhibit, display, transform, create derivative works of, and distribute the Feedback and derivative works thereof without limitation on a royalty-free basis. Further, Teradata Corporation will be free to use any ideas, concepts, know-how, or techniques contained in such Feedback for any purpose whatsoever, including developing, manufacturing, or marketing products or services incorporating Feedback.

Copyright © 1999-2013 by Teradata Corporation. All Rights Reserved.

Preface

Purpose

This volume describes how to use the modeling, scoring and statistical test features of the Teradata Warehouse Miner product. Teradata Warehouse Miner is a set of Microsoft .NET interfaces and a multi-tier User Interface that together help users understand the quality of data residing in a Teradata database, create analytic data sets, and build and score analytic models directly in the Teradata database.

Audience

This manual is written for users of Teradata Warehouse Miner, who should be familiar with Teradata SQL, the operation and administration of the Teradata RDBMS system, and statistical techniques. They should also be familiar with the Microsoft Windows operating environment and standard Microsoft Windows operating techniques.

Revision Record

The following table lists a history of releases where this guide has been revised:

Release      Date        Description
TWM 5.3.3    06/30/12    Maintenance Release
TWM 5.3.2    06/01/11    Maintenance Release
TWM 5.3.1    06/30/10    Maintenance Release
TWM 5.3.0    10/30/09    Feature Release
TWM 5.2.2    02/05/09    Maintenance Release
TWM 5.2.1    12/15/08    Maintenance Release
TWM 5.2.0    05/31/08    Feature Release
TWM 5.1.1    01/23/08    Maintenance Release
TWM 5.1.0    07/12/07    Feature Release
TWM 5.0.1    11/16/06    Maintenance Release
TWM 5.0.0    09/22/06    Major Release

How This Manual Is Organized

This manual is organized and presents information as follows:

• Chapter 1: “Analytic Algorithms” describes how to use the Teradata Warehouse Miner Multivariate Statistics and Machine Learning algorithms. This includes the Linear Regression, Logistic Regression, Factor Analysis, Decision Tree, Clustering, Association Rules and Neural Network algorithms.
• Chapter 2: “Scoring” describes how to use the scoring analyses for the Teradata Warehouse Miner Multivariate Statistics and Machine Learning algorithms. Scoring is available for Linear Regression, Logistic Regression, Factor Analysis, Decision Trees, Clustering and Neural Networks.
• Chapter 3: “Statistical Tests” describes how to use the Teradata Warehouse Miner Statistical Tests. This includes Binomial, Kolmogorov-Smirnov, Parametric, Rank, and Contingency Table-based tests.

Conventions Used In This Manual

The following typographical conventions are used in this guide:

Convention    Description
Italic        Titles (especially screen names/titles); new terms introduced for emphasis
Monospace     Code samples; output
ALL CAPS      Acronyms
Bold          Important terms or concepts; GUI items (screen items, especially something you will click on or highlight when following a procedure)

Related Documents

Related Teradata documentation and other sources of information are available from: http://www.info.teradata.com

Additional technical information on data warehousing and other topics is available from: http://www.teradata.com/t/resources

Support Information

Services, support and training information is available from: http://www.teradata.com/services-support

Table of Contents

Preface
    Purpose
    Audience
    Revision Record
    How This Manual Is Organized
    Conventions Used In This Manual
    Related Documents
    Support Information

Chapter 1: Analytic Algorithms
    Overview
    Association Rules
        Overview
        Initiate an Association Analysis
        Association - INPUT - Data Selection
        Association - INPUT - Analysis Parameters
        Association - INPUT - Expert Options
        Association - OUTPUT
        Run the Association Analysis
        Results - Association Analysis
        Tutorial - Association Analysis
    Cluster Analysis
        Overview
        Options - Cluster Analysis
        Success Analysis - Cluster Analysis
        Using the TWM Cluster Analysis
        Optimizing Performance of Clustering
        Initiate a Cluster Analysis
        Cluster - INPUT - Data Selection
        Cluster - INPUT - Analysis Parameters
        Cluster - INPUT - Expert Options
        Cluster - OUTPUT
        Run the Cluster Analysis
        Results - Cluster Analysis
        Tutorial - Cluster Analysis
    Decision Trees
        Overview
        Initiate a Decision Tree Analysis
        Decision Tree - INPUT - Data Selection
        Decision Tree - INPUT - Analysis Parameters
        Decision Tree - INPUT - Expert Options
        Run the Decision Tree Analysis
        Results - Decision Tree
        Tutorial - Decision Tree
    Factor Analysis
        Overview
        Initiate a Factor Analysis
        Factor - INPUT - Data Selection
        Factor - INPUT - Analysis Parameters
        Run the Factor Analysis
        Results - Factor Analysis
        Tutorial - Factor Analysis
    Linear Regression
        Overview
        Initiate a Linear Regression Function
        Linear Regression - INPUT - Data Selection
        Linear Regression - INPUT - Analysis Parameters
        Linear Regression - OUTPUT
        Run the Linear Regression
        Results - Linear Regression
        Tutorial - Linear Regression
    Logistic Regression
        Overview
        Initiate a Logistic Regression Function
        Logistic Regression - INPUT - Data Selection
        Logistic Regression - INPUT - Analysis Parameters
        Logistic Regression - INPUT - Expert Options
        Logistic Regression - OUTPUT
        Run the Logistic Regression
        Results - Logistic Regression
        Tutorial - Logistic Regression
    Neural Networks
        Overview
        Initiate a Neural Networks Function
        Neural Networks - INPUT - Data Selection
        Neural Networks - INPUT - Network Types
        Neural Networks - INPUT - Network Parameters
        Neural Networks - INPUT - MLP Activation Functions
        Neural Networks - INPUT - Sampling
        Run the Neural Networks
        Results - Neural Networks
        Tutorial - Neural Networks

Chapter 2: Scoring
    Overview
    Cluster Scoring
        Initiate Cluster Scoring
        Cluster Scoring - INPUT - Data Selection
        Cluster Scoring - INPUT - Analysis Parameters
        Cluster Scoring - OUTPUT
        Run the Cluster Scoring Analysis
        Results - Cluster Scoring
        Tutorial - Cluster Scoring
    Tree Scoring
        Initiate Tree Scoring
        Tree Scoring - INPUT - Data Selection
        Tree Scoring - INPUT - Analysis Parameters
        Tree Scoring - OUTPUT
        Run the Tree Scoring Analysis
        Results - Tree Scoring
        Tutorial - Tree Scoring
    Factor Scoring
        Initiate Factor Scoring
        Factor Scoring - INPUT - Data Selection
        Factor Scoring - INPUT - Analysis Parameters
        Factor Scoring - OUTPUT
        Run the Factor Scoring Analysis
        Results - Factor Scoring
        Tutorial - Factor Scoring
    Linear Scoring
        Initiate Linear Scoring
        Linear Scoring - INPUT - Data Selection
        Linear Scoring - INPUT - Analysis Parameters
        Linear Scoring - OUTPUT
        Run the Linear Scoring Analysis
        Results - Linear Scoring
        Tutorial - Linear Scoring
    Logistic Scoring
        Initiate Logistic Scoring
        Logistic Scoring - INPUT - Data Selection
        Logistic Scoring - INPUT - Analysis Parameters
        Logistic Scoring - OUTPUT
        Run the Logistic Scoring Analysis
        Results - Logistic Scoring
        Tutorial - Logistic Scoring
    Neural Networks Scoring
        Initiate Neural Networks Scoring
        Neural Networks Scoring - INPUT - Data Selection
        Neural Networks Scoring - OUTPUT
        Run the Neural Networks Scoring Analysis
        Results - Neural Networks Scoring
        Neural Networks Scoring Tutorial

Chapter 3: Statistical Tests
    Overview
        Summary of Tests
        Data Requirements
    Parametric Tests
        Two Sample T-Test for Equal Means
        F-Test - N-Way
        F-Test/Analysis of Variance - Two Way Unequal Sample Size
    Binomial Tests
        Binomial/Ztest
        Binomial Sign Test
    Kolmogorov-Smirnov Tests
        Kolmogorov-Smirnov Test (One Sample)
        Lilliefors Test
        Shapiro-Wilk Test
        D'Agostino and Pearson Test
        Smirnov Test
    Tests Based on Contingency Tables
        Chi Square Test
        Median Test
    Rank Tests
        Mann-Whitney/Kruskal-Wallis Test
        Wilcoxon Signed Ranks Test
        Friedman Test with Kendall's Coefficient of Concordance & Spearman's Rho

Appendix A: References

List of Figures

Figure 1: Add New Analysis from toolbar
Figure 2: Add New Analysis dialog
Figure 3: Association > Input > Data Selection
Figure 4: Association > Input > Analysis Parameters
Figure 5: Association: X to X
Figure 6: Association Combinations pane
Figure 7: Association > Input > Expert Options
Figure 8: Association > Output
Figure 9: Association > Results > SQL
Figure 10: Association > Results > Data
Figure 11: Association > Results > Graph
Figure 12: Association Graph Selector
Figure 13: Association Graph
Figure 14: Association Graph: Tutorial
Figure 15: Add New Analysis from toolbar
Figure 16: Add New Analysis dialog
Figure 17: Clustering > Input > Data Selection
Figure 18: Clustering > Input > Analysis Parameters
Figure 19: Clustering > Input > Expert Options
Figure 20: Cluster > OUTPUT
Figure 21: Clustering > Results > Reports
Figure 22: Clustering > Results > Sizes Graph
Figure 23: Clustering > Results > Similarity Graph
Figure 24: Clustering Analysis Tutorial: Sizes Graph
Figure 25: Clustering Analysis Tutorial: Similarity Graph
Figure 26: Add New Analysis from toolbar
Figure 27: Add New Analysis dialog
Figure 28: Decision Tree > Input > Data Selection
Figure 29: Decision Tree > Input > Analysis Parameters
Figure 30: Decision Tree > Input > Expert Options
Figure 31: Tree Browser
Figure 32: Tree Browser menu: Small Navigation Tree
Figure 33: Tree Browser menu: Zoom Tree
Figure 34: Tree Browser menu: Print
Figure 35: Text Tree
Figure 36: Rules List
Figure 37: Counts and Distributions
Figure 38: Tree Pruning menu
Figure 39: Tree Pruning Menu > Prune Selected Branch
Figure 40: Tree Pruning menu (All Options Enabled)
Figure 41: Decision Tree Graph: Previously Pruned Tree
Figure 42: Decision Tree Graph: Predicate
Figure 43: Decision Tree Graph: Lift
Figure 44: Decision Tree Graph Tutorial: Browser
Figure 45: Decision Tree Graph Tutorial: Lift
Figure 46: Decision Tree Graph Tutorial: Browser
Figure 47: Add New Analysis from toolbar
Figure 48: Add New Analysis dialog
Figure 49: Factor Analysis > Input > Data Selection
Figure 50: Factor Analysis > Input > Analysis Parameters
Figure 51: Factor Analysis > Results > Reports
Figure 52: Factor Analysis > Results > Pattern Graph
Figure 53: Factor Analysis > Results > Scree Plot
Figure 54: Factor Analysis Tutorial: Scree Plot
Figure 55: Add New Analysis from toolbar
Figure 56: Add New Analysis dialog
Figure 57: Linear Regression > Input > Data Selection
Figure 58: Linear Regression > Input > Analysis Parameters
Figure 59: Linear Regression > OUTPUT
Figure 60: Linear Regression Tutorial: Linear Weights Graph
Figure 61: Linear Regression Tutorial: Scatter Plot (2d)
Figure 62: Linear Regression Tutorial: Scatter Plot (3d)
Figure 63: Add New Analysis from toolbar
Figure 64: Add New Analysis dialog
Figure 65: Logistic Regression > Input > Data Selection
Figure 66: Logistic Regression > Input > Analysis Parameters
Figure 67: Logistic Regression > Input > Expert Options
Figure 68: Logistic Regression > OUTPUT
Figure 69: Logistic Regression Tutorial: Logistic Weights Graph
Figure 70: Logistic Regression Tutorial: Lift Chart
Figure 71: Single Neuron System (schematic)
Figure 72: Parametric Model vs. Non-Parametric Model (schematic)
Figure 73: Fully connected MLP2 neural network with three inputs (schematic)
Figure 74: MLP vs. RBF neural networks in two dimensional input data (schematic)
Figure 75: RBF Neural Network with three inputs (schematic)
Figure 76: Neural Network Training with early stopping (schematic)
Figure 77: Add New Analysis from toolbar
Figure 78: Add New Analysis dialogue
Figure 79: Neural Network > Input > Data Selection
Figure 80: Neural Network > Input > Network Types
Figure 81: Neural Network > Input > Network Parameters
Figure 82: Neural Network > Input > MLP Activation Functions
Figure 83: Neural Network > Input > Sampling
Figure 84: Neural Network Tutorial 1: Data Selection Tab
Figure 85: Neural Network Tutorial 1: Network Types Tab
Figure 86: Neural Network Tutorial 1: Network Parameters Tab
Figure 87: Neural Network Tutorial 1: MLP Activation Functions Tab
Figure 88: Neural Network Tutorial 1: Sampling tab
Figure 89: Neural Networks Tutorial 1: Results tab - Reports button
Figure 90: Neural Networks Tutorial 1: Reports - Neural Network Summary
Figure 91: Neural Networks Tutorial 1: Reports - Correlation Coefficients
Figure 92: Neural Networks Tutorial 1: Reports - Data Statistics
Figure 93: Neural Networks Tutorial 1: Reports - Weights
Figure 94: Neural Networks Tutorial 1: Reports - Sensitivity Analysis
Figure 95: Neural Networks Tutorial 1: Results tab - Graph button
Figure 96: Neural Networks Tutorial 1: Graph - Histogram
Figure 97: Neural Networks Tutorial 1: Graph - Target Output
Figure 98: Neural Networks Tutorial 1: Graph - X, Y and Z
Figure 99: Neural Networks Tutorial 2: Data Selection tab
Figure 100: Neural Networks Tutorial 2: Network Types tab
Figure 101: Neural Networks Tutorial 2: Network Parameters tab
Figure 102: Neural Networks Tutorial 2: MLP Activation Functions tab
Figure 103: Neural Networks Tutorial 2: Sampling tab
Figure 104: Neural Networks Tutorial 2: Results tab - Reports button
Figure 105: Neural Networks Tutorial 2: Results - Neural Network Summary
Figure 106: Neural Networks Tutorial 2: Reports - Data Statistics
Figure 107: Neural Networks Tutorial 2: Reports - Weights
Figure 108: Neural Networks Tutorial 2: Reports - Sensitivity Analysis
Figure 109: Neural Networks Tutorial 2: Reports - Confusion Matrix
Figure 110: Neural Networks Tutorial 2: Reports - Classification Summary
Figure 111: Neural Networks Tutorial 2: Reports - Confidence Levels
Figure 112: Neural Networks Tutorial 2: Results tab - Graph button
Figure 113: Neural Networks Tutorial 2: Graph - Histogram
Figure 114: Neural Networks Tutorial 2: Graph - Income vs. Age
Figure 115: Neural Networks Tutorial 2: Graph - Lift Charts
Figure 116: Add New Analysis from toolbar
Figure 117: Add New Analysis > Scoring > Cluster Scoring
Figure 118: Add New Analysis > Input > Data Selection
Figure 119: Add New Analysis > Input > Analysis Parameters
Figure 120: Cluster Scoring > Output
Figure 121: Cluster Scoring > Results > Reports
Figure 122: Cluster Scoring > Results > Data
Figure 123: Cluster Scoring > Results > SQL
Figure 124: Add New Analysis from toolbar
Figure 125: Add New Analysis > Scoring > Tree Scoring
Figure 126: Tree Scoring > Input > Data Selection
Figure 127: Tree Scoring > Input > Analysis Parameters
Figure 128: Tree Scoring > Output
Figure 129: Tree Scoring > Results > Reports
Figure 130: Tree Scoring > Results > Data
Figure 131: Tree Scoring > Results > Lift Graph
Figure 132: Tree Scoring > Results > SQL
Figure 133: Add New Analysis from toolbar
Figure 134: Add New Analysis > Scoring > Factor Scoring
Figure 135: Factor Scoring > Input > Data Selection
Figure 136: Factor Scoring > Input > Analysis Parameters
Figure 137: Factor Scoring > Output
Figure 138: Factor Scoring > Results > Reports
Figure 139: Factor Scoring > Results > Data
Figure 140: Factor Scoring > Results > SQL
Figure 141: Add New Analysis from toolbar
Figure 142: Add New Analysis > Scoring > Linear Scoring
Figure 143: Linear Scoring > Input > Data Selection
Figure 144: Linear Scoring > Input > Analysis Parameters
Figure 145: Linear Scoring > Output
Figure 146: Linear Scoring > Results > Reports
Figure 147: Linear Scoring > Results > Data
Figure 148: Linear Scoring > Results > SQL
Figure 149: Add New Analysis from toolbar
Figure 150: Add New Analysis > Scoring > Logistic Scoring
Figure 151: Logistic Scoring > Input > Data Selection
Figure 152: Logistic Scoring > Input > Analysis Parameters
Figure 153: Logistic Scoring > Output
Figure 154: Logistic Scoring > Results > Reports
Figure 155: Logistic Scoring > Results > Data
Figure 156: Logistic Scoring > Results > Lift Graph
Figure 157: Logistic Scoring > Results > SQL
Figure 158: Logistic Scoring Tutorial: Lift Graph
Figure 159: Add New Analysis from toolbar
Figure 160: Add New Analysis > Scoring > Neural Net Scoring
Figure 161: Neural Networks Scoring > Input > Data Selection
Figure 162: Neural Networks Scoring > Output
Figure 163: Neural Networks Scoring > Results > Reports
Figure 164: Neural Networks Scoring > Results > Data
Figure 165: Neural Networks Scoring > Results > SQL
Figure 166: Neural Networks Scoring Tutorial: Report
Figure 167: Neural Networks Scoring Tutorial: Data
Figure 168: Add New Analysis from toolbar
Figure 169: Add New Analysis > Statistical Tests > Parametric Tests
Figure 170: T-Test > Input > Data Selection
Figure 171: T-Test > Input > Analysis Parameters
Figure 172: T-Test > Output
Figure 173: T-Test > Results > SQL
Figure 174: T-Test > Results > Data
Figure 175: Add New Analysis from toolbar
Figure 176: Add New Analysis > Statistical Tests > Parametric Tests
Figure 177: F-Test > Input > Data Selection
Figure 178: F-Test > Input > Analysis Parameters
Figure 179: F-Test > Output
Figure 180: F-Test > Results > SQL
Figure 181: F-Test > Results > data
Figure 182: Add New Analysis from toolbar
Figure 183: Add New Analysis > Statistical Tests > Parametric Tests
Figure 184: F-Test > Input > Data Selection
Figure 185: F-Test > Input > Analysis Parameters
Figure 186: F-Test > Output
Figure 187: F-Test > Results > SQL
Figure 188: F-Test > Results > data
Figure 189: Add New Analysis from toolbar
Figure 190: Add New Analysis > Statistical Tests > Binomial Tests
Figure 191: Binomial Tests > Input > Data Selection
Figure 192: Binomial Tests > Input > Analysis Parameters
Figure 193: Binomial Tests > Output
Figure 194: Binomial Tests > Results > SQL
Figure 195: Binomial Tests > Results > data
Figure 196: Add New Analysis from toolbar
Figure 197: Add New Analysis > Statistical Tests > Binomial Tests
Figure 198: Binomial Sign Test > Input > Data Selection
Figure 199: Binomial Sign Test > Input > Analysis Parameters
Figure 200: Binomial Sign Test > Output
Figure 201: Binomial Sign Test > Results > SQL
Figure 202: Binomial Sign Test > Results > data
Figure 203: Add New Analysis from toolbar
Figure 204: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests
Figure 205: Kolmogorov-Smirnov Test > Input > Data Selection
Figure 206: Kolmogorov-Smirnov Test > Input > Analysis Parameters
Figure 207: Kolmogorov-Smirnov Test > Output
Figure 208: Kolmogorov-Smirnov Test > Results > SQL
Figure 209: Kolmogorov-Smirnov Test > Results > Data
Figure 210: Add New Analysis from toolbar
Figure 211: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests
Figure 212: Lillefors Test > Input > Data Selection
Figure 213: Lillefors Test > Input > Analysis Parameters
Figure 214: Lillefors Test > Output
Figure 215: Lillefors Test > Results > SQL
Figure 216: Lillefors Test > Results > Data
Figure 217: Add New Analysis from toolbar
Figure 218: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests
Figure 219: Shapiro-Wilk Test > Input > Data Selection
Figure 220: Shapiro-Wilk Test > Input > Analysis Parameters
Figure 221: Shapiro-Wilk Test > Output
Figure 222: Shapiro-Wilk Test > Results > SQL
Figure 223: Shapiro-Wilk Test > Results > data
Figure 224: Add New Analysis from toolbar
Figure 225: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests
Figure 226: D'Agostino and Pearson Test > Input > Data Selection
Figure 227: D'Agostino and Pearson Test > Input > Analysis Parameters
Figure 228: D'Agostino and Pearson Test > Output
Figure 229: D'Agostino and Pearson Test > Results > SQL
Figure 230: D'Agostino and Pearson Test > Results > data
Figure 231: Add New Analysis from toolbar
Figure 232: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests
Figure 233: Smirnov Test > Input > Data Selection
Figure 234: Smirnov Test > Input > Analysis Parameters
Figure 235: Smirnov Test > Output
Figure 236: Smirnov Test > Results > SQL
Figure 237: Smirnov Test > Results > data
Figure 238: Add New Analysis from toolbar
Figure 239: Add New Analysis > Statistical Tests > Tests Based on Contingency Tables
Figure 240: Chi Square Test > Input > Data Selection
Figure 241: Chi Square Test > Input > Analysis Parameters
Figure 242: Chi Square Test > Output
Figure 243: Chi Square Test > Results > SQL
Figure 244: Chi Square Test > Results > data
Figure 245: Add New Analysis from toolbar
Figure 246: Add New Analysis > Statistical Tests > Tests Based On Contingency Tables
Figure 247: Median Test > Input > Data Selection
Figure 248: Median Test > Input > Analysis Parameters
Figure 249: Median Test > Output
Figure 250: Median Test > Results > SQL
Figure 251: Median Test > Results > data
Figure 252: Add New Analysis from toolbar
Figure 253: Add New Analysis > Statistical Tests > Rank Tests
Figure 254: Mann-Whitney/Kruskal-Wallis Test > Input > Data Selection
Figure 255: Mann-Whitney/Kruskal-Wallis Test > Input > Analysis Parameters
Figure 256: Mann-Whitney/Kruskal-Wallis Test > Output
Figure 257: Mann-Whitney/Kruskal-Wallis Test > Results > SQL
Figure 258: Mann-Whitney/Kruskal-Wallis Test > Results > data
Figure 259: Add New Analysis from toolbar
Figure 260: Add New Analysis > Statistical Tests > Rank Tests
Figure 261: Wilcoxon Signed Ranks Test > Input > Data Selection
Figure 262: Wilcoxon Signed Ranks Test > Input > Analysis Parameters
Figure 263: Wilcoxon Signed Ranks Test > Output
Figure 264: Wilcoxon Signed Ranks Test > Results > SQL
Figure 265: Wilcoxon Signed Ranks Test > Results > data
Figure 266: Add New Analysis from toolbar
Figure 267: Add New Analysis > Statistical Tests > Rank Tests
Figure 268: Friedman Test > Input > Data Selection
Figure 269: Friedman Test > Input > Analysis Parameters
Figure 270: Friedman Test > Output
Figure 271: Friedman Test > Results > SQL
Figure 272: Friedman Test > Results > data

List of Tables

Table 1: Three-Level Hierarchy Table
Table 2: Association Combinations output table
Table 3: Tutorial - Association Analysis Data
Table 4: test_ClusterResults
Table 5: test_ClusterColumns
Table 6: Progress
Table 7: Solution
Table 8: Confusion Matrix Format
Table 9: Decision Tree Report
Table 10: Variables: Dependent
Table 11: Variables: Independent
Table 12: Confusion Matrix
Table 13: Cumulative Lift Table
Table 14: Prime Factor Loadings report (Example)
Table 15: Prime Factor Variables report (Example)
Table 16:
Table 17: Factor Analysis Report
Table 18: Execution Summary
Table 19: Eigenvalues
Table 20: Principal Component Loadings
Table 21: Factor Variance to Total Variance Ratio
Table 22: Variance Explained By Factors
Table 23: Difference
Table 24: Prime Factor Variables
Table 25: Eigenvalues of Unit Scaled X'X
Table 26: Condition Indices
Table 27:
Table 28: Near Dependency report (example)
Table 29:
Table 30: Linear Regression Report
Table 31: Regression vs. Residual
Table 32: Execution Status
Table 33: Variables
Table 34: Out
Table 35: Model Assessment
Table 36: Columns In (Part 1)
Table 37: Columns In (Part 2)
110 Table 38: Columns In (Part 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 Table 39: Columns Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 Table 40: Logistic Regression - OUTPUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 Table 41: Logistic Regression Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 Table 42: Execution Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 Table 43: Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 Table 44: Columns Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 Table 45: Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 Table 46: Columns Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 Table 47: Prediction Success Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 Table 48: Multi-Threshold Success Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 Table 49: Cumulative Lift Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 Table 50: Neuron Activation Functions for hidden/output neurons available in SANN. . . 149 Table 51: Output Database (Built by the Cluster Scoring analysis) . . . . . . . . . . . . . . . . . . 207 Table 52: Clustering Progress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 Table 53: Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 Table 54: Confusion Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 Table 55: Output Database table (Built by the Decision Tree Scoring analysis) . . . . . . . . 216 Table 56: Output Database table (Built by the Decision Tree Scoring analysis) - Create profiling tables option selected (“_1” appended) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 Table 57: Output Database table (Built by the Decision Tree Scoring analysis) - Create profiling tables option selected (“_2” appended) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 Table 58: Decision Tree Model Scoring Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 Table 59: Confusion Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 Table 60: Cumulative Lift Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 Table 61: Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 Table 62: Output Database table (Built by Factor Scoring) . . . . . . . . . . . . . . . . . . . . . . . . 225 Table 63: Factor Analysis Score Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 Table 64: Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . 227 Table 65: Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 Table 66: Output Database table (Built by Linear Regression scoring) . . . . . . . . . . . . . . . 234 xxii Teradata Warehouse Miner User Guide - Volume 3 List of Tables Table 67: Linear Regression Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 Table 68: Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 Table 69: Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 Table 70: Prediction Success Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236 Table 71: Logistic Regression Multi-Threshold Success table . . . . . . . . . . . . . . . . . . . . . . 237 Table 72: Logistic Regression Cumulative Lift Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238 Table 73: Output Database table (Built by Logistic Regression scoring) . . . . . . . . . . . . . . 244 Table 74: Logistic Regression Model Scoring Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246 Table 75: Prediction Success Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246 Table 76: Multi-Threshold Success Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 Table 77: Cumulative Lift Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 Table 78: Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248 Table 79: Output Database table (Built by Neural Networks scoring). . . . . . . . . . . . . . . . . 254 Table 80: Statistical Test functions handling of input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259 Table 81: Two sample t tests for unpaired data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262 Table 82: Output Database table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267 Table 83: T-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268 Table 84: Output Columns - 1-Way F-Test Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274 Table 85: Output Columns - 2-Way F-Test Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274 Table 86: Output Columns - 3-Way F-Test Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275 Table 87: F-Test (one-way) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278 Table 88: Output Columns - 2-Way F-Test Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285 Table 89: F-Test (Two-way Unequal Cell Count) (Part 1). . . . . . . . . . . . . . . . . . . . . . . . . . 286 Table 90: F-Test (Two-way Unequal Cell Count) (Part 2). . . . . . . . . . . . . . . . . . . . . . . . . . 286 Table 91: F-Test (Two-way Unequal Cell Count) (Part 3). . . . . . . . . . . . . . . . . . . . . . . . . . 286 Table 92: Output Database table (Built by the Binomial Analysis) . . . . . . . . . . . . . . . . . . . 292 Table 93: Binomial Test Analysis (Table 1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
292 Table 94: Binomial Test Analysis (Table 2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293 Table 95: Binomial Sign Analysis: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298 Table 96: Tutorial - Binomial Sign Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298 Table 97: Output Database table (Built by the Kolmogorov-Smirnov test analysis) . . . . . . 304 Table 98: Kolmogorov-Smirnov Test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304 Table 99: Lilliefors Test Analysis: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310 Table 100: Lilliefors Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310 Table 101: Shapiro-Wilk Test Analysis: Output Columns. . . . . . . . . . . . . . . . . . . . . . . . . . 316 Table 102: Shapiro-Wilk Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316 Teradata Warehouse Miner User Guide - Volume 3 xxiii List of Tables Table 103: D'Agostino and Pearson Test Analysis: Output Columns . . . . . . . . . . . . . . . . . 322 Table 104: D'Agostino and Pearson Test: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . 322 Table 105: Smirnov Test Analysis: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328 Table 106: Smirnov Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328 Table 107: Chi Square Test Analysis: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 Table 108: Chi Square Test (Part 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336 Table 109: Chi Square Test (Part 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336 Table 110: Median Test Analysis: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341 Table 111: Median Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342 Table 112: Table for Mann-Whitney (if two groups) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348 Table 113: Table for Kruskal-Wallis (if more than two groups). . . . . . . . . . . . . . . . . . . . . 349 Table 114: Mann-Whitney Test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349 Table 115: Kruskal-Wallis Test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350 Table 116: Mann-Whitney Test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351 Table 117: Wilcoxon Signed Ranks Test Analysis: Output Columns. . . . . . . . . . . . . . . . . 357 Table 118: Wilcoxon Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357 Table 119: Friedman Test Analysis: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363 Table 120: Friedman Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
CHAPTER 1
Analytic Algorithms

What's In This Chapter

For more information, see these subtopics:
1 "Overview" on page 1
2 "Association Rules" on page 2
3 "Cluster Analysis" on page 19
4 "Decision Trees" on page 36
5 "Factor Analysis" on page 58
6 "Linear Regression" on page 86
7 "Logistic Regression" on page 114
8 "Neural Networks" on page 141

Overview

Teradata Warehouse Miner contains several analytic algorithms from both the traditional statistics and machine learning disciplines. These algorithms pertain to the exploratory data analysis (EDA) and model-building phases of the data mining process. Along with these algorithms, Teradata Warehouse Miner contains corresponding model scoring and evaluation functions that pertain to the model evaluation and deployment phases of the data mining process. A brief summary of the algorithms offered may be given as follows:

• Linear Regression — Linear regression can be used to predict or estimate the value of a continuous numeric data element based upon a linear combination of other numeric data elements present for each observation.
• Logistic Regression — Logistic regression can be used to predict or estimate a two-valued variable based upon other numeric data elements present for each observation.
• Factor Analysis — Factor analysis is a collective term for a family of techniques. In general, factor analysis can be used to identify, quantify, and re-specify the common and unique sources of variability in a set of numeric variables. One of its many applications allows an analytical modeler to reduce the number of numeric variables needed to describe a collection of observations by creating new variables, called factors, as linear combinations of the original variables.
• Decision Trees — Decision trees, or rule induction, can be used to predict or estimate the value of a multi-valued variable based upon other categorical and continuous numeric data elements by building decision rules and presenting them graphically in the shape of a tree, based upon splits on specific data values.
• Clustering — Cluster analysis can be used to form multiple groups of observations, such that each group contains observations that are very similar to one another, based upon values of multiple numeric data elements.
• Association Rules — Generate association rules and various measures of frequency, relationship and statistical significance associated with these rules. These rules can be general, or have a dimension of time associated with them.
• Neural Networks — Neural Networks can be used to build a Regression model for predicting one or more continuous variables or a Classification model for predicting one or more categorical variables, using either a Multi-Layer Perceptron or a Radial Basis Function network.

Note: Neural Networks are available only with the product TWM Neural Networks Add-in Powered by STATISTICA.

Association Rules

Overview

Association Rules are measurements on groups of observations or transactions that contain items of some kind. These measurements seek to describe the relationships between the items in the groups, such as the frequency of occurrence of items together in a group or the probability that items occur in a group given that other specific items are in that group.
The nature of items and groups in association analysis and the meaning of the relationships between items in a group will depend on the nature of the data being studied. For example, the items may be products purchased and the groups the market baskets in which they were purchased. (This is generally called market basket analysis). Another example is that items may be accounts opened and the groups the customers that opened the accounts. This type of association analysis is useful in a cross-sell application to determine what products and services to sell with other products and services. Obviously the possibilities are endless when it comes to the assignment of meaning to items and groups in business and scientific transactions or observations.

Rules

What does an association analysis produce and what types of measurements does it include? An association analysis produces association rules and various measures of frequency, relationship and statistical significance associated with these rules. Association rules are of the form

$$X_1, X_2, \ldots, X_n \rightarrow Y_1, Y_2, \ldots, Y_m$$

where X_1, X_2, ..., X_n is a set of n items that appear in a group along with a set of m items Y_1, Y_2, ..., Y_m in the same group. For example, if checking, savings and credit card accounts are owned by a customer, then the customer will also own a certificate of deposit (CD) with a certain frequency. Relationship means that, for example, owning a specific account or set of accounts (antecedent) is associated with ownership of one or more other specific accounts (consequent). Association rules, in and of themselves, do not warrant inferences of causality; however, they may point to relationships among items or events that could be studied further using other analytical techniques which are more appropriate for determining the structure and nature of causalities that may exist.

Measures

The four measurements made for association rules are support, confidence, lift and Z score.

Support

Support is a measure of the generality of an association rule, and is literally the percentage (a value between 0 and 1) of groups that contain all of the items referenced in the rule. More formally, in the association rule defined as L → R, L represents the items given to occur together (the Left side or antecedent), and R represents the items that occur with them as a result (the Right side or consequent). Support can actually be applied to a single item or a single side of an association rule, as well as to an entire rule. The support of an item is simply the percentage of groups containing that item.

Given the previous example of banking product ownership, let L be defined as the number of customers who own the set of products on the left side and let R be defined as the number of customers who own the set of products on the right side. Further, let LR be the number of customers who own all products in the association rule (note that this notation does not mean L times R), and let N be defined as the total number of customers under consideration. The support of L, R and the association rule are given by:

$$\mathrm{Sup}(L) = \frac{L}{N} \qquad \mathrm{Sup}(R) = \frac{R}{N} \qquad \mathrm{Sup}(L \rightarrow R) = \frac{LR}{N}$$

Let's say for example that out of 10 customers, 6 of them have a checking account, 5 have a savings account, and 4 have both. If L is (checking) and R is (savings), then Sup(L) is .6, Sup(R) is .5 and Sup(L → R) is .4.
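To make these definitions concrete, here is a minimal Python sketch (an illustration only, not the SQL that Teradata Warehouse Miner generates) that counts groups for the 10-customer example above; the transaction list and the count_groups helper are assumptions made up for the example.

```python
from collections import defaultdict

# (group, item) pairs: 10 customers in total, 6 with checking,
# 5 with savings, 4 with both; customers 8-10 own neither product.
transactions = [
    (1, "checking"), (1, "savings"),
    (2, "checking"), (2, "savings"),
    (3, "checking"), (3, "savings"),
    (4, "checking"), (4, "savings"),
    (5, "checking"), (6, "checking"),
    (7, "savings"),
]
groups = defaultdict(set)
for cust, item in transactions:
    groups[cust].add(item)
N = 10  # total customers under consideration

def count_groups(itemset):
    """Number of groups that contain every item in itemset."""
    return sum(1 for items in groups.values() if itemset <= items)

L = count_groups({"checking"})               # 6
R = count_groups({"savings"})                # 5
LR = count_groups({"checking", "savings"})   # 4
print(L / N, R / N, LR / N)                  # Sup(L)=0.6, Sup(R)=0.5, Sup(L->R)=0.4
```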
Confidence

Confidence is the probability of R occurring in an item group given that L is in the item group. It is calculated as:

$$\mathrm{Conf}(L \rightarrow R) = \frac{\mathrm{Sup}(L \rightarrow R)}{\mathrm{Sup}(L)}$$

Another way of expressing the measure confidence is as the percentage of groups containing L that also contain R. This gives the following equivalent calculation for confidence:

$$\mathrm{Conf}(L \rightarrow R) = \frac{LR}{L}$$

Using the previous example of banking product ownership once again, the confidence that checking account ownership implies savings account ownership is 4/6.

The expected value of an association rule is the number of customers that are expected to have both L and R if there is no relationship between L and R. (To say that there is no relationship between L and R means that customers who have L are neither more likely nor less likely to have R than are customers who do not have L). The equation for the expected value of the association rule is:

$$\mathrm{E\_LR} = \frac{L \cdot R}{N}$$

An equivalent formula for the expected value of the association rule is:

$$\mathrm{E\_LR} = \mathrm{Sup}(L) \cdot \mathrm{Sup}(R) \cdot N$$

Again using the previous example, the expected value of the number of customers with checking and savings is calculated as 6 * 5 / 10 or 3.

The expected confidence of a rule is the confidence that would result if there were no relationship between L and R. This simply equals the percentage of customers that own R, since if owning L has no effect on owning R, then it would be expected that the percentage of L's that own R would be the same as the percentage of the entire population that own R. The following equation computes expected confidence:

$$\mathrm{E\_Conf} = \frac{R}{N} = \mathrm{Sup}(R)$$

From the previous example, the expected confidence that checking implies savings is given by 5/10.

Lift

Lift measures how much the probability of R is increased by the presence of L in an item group. A lift of 1 indicates there are exactly as many occurrences of R as expected; thus, the presence of L neither increases nor decreases the likelihood of R occurring. A lift of 5 indicates that the presence of L implies that it is 5 times more likely for R to occur than would otherwise be expected. A lift of 0.5 indicates that when L occurs, it is one half as likely that R will occur. Lift can be calculated as follows:

$$\mathrm{Lift}(L \rightarrow R) = \frac{LR}{\mathrm{E\_LR}}$$

From another viewpoint, lift measures the ratio of the actual confidence to the expected confidence, and can be calculated equivalently as either of the following:

$$\mathrm{Lift}(L \rightarrow R) = \frac{\mathrm{Conf}(L \rightarrow R)}{\mathrm{E\_Conf}} \qquad \mathrm{Lift}(L \rightarrow R) = \frac{\mathrm{Conf}(L \rightarrow R)}{\mathrm{Sup}(R)}$$

The lift associated with the previous example of "checking implies savings" is 4/3.
Z score

Z score measures how statistically different the actual result is from the expected result. A Z score of zero corresponds to the situation where the actual number equals the expected. A Z score of 1 means that the actual number is 1 standard deviation greater than expected. A Z score of -3.0 means that the actual number is 3 standard deviations less than expected. As a rule of thumb, a Z score greater than 3 (or less than -3) indicates a statistically significant result, which means that a difference that large between the actual result and the expected is very unlikely to be due to chance.

A Z score attempts to help answer the question of how confident you can be about the observed relationship between L and R, but does not directly indicate the magnitude of the relationship. It is interesting to note that a negative Z score indicates a negative association. These are rules L → R where ownership of L decreases the likelihood of owning R.

The following equation calculates a measure of the difference between the expected number of customers that have both L and R, if there is no relationship between L and R, and the actual number of customers that have both L and R. (It can be derived starting with either the formula for the standard deviation of the sampling distribution of proportions or the formula for the standard deviation of a binomial variable).

$$\mathrm{Zscore}(L \rightarrow R) = \frac{LR - \mathrm{E\_LR}}{\sqrt{\mathrm{E\_LR}\left(1 - \dfrac{\mathrm{E\_LR}}{N}\right)}}$$

or equivalently:

$$\mathrm{Zscore}(L \rightarrow R) = \frac{N \cdot \mathrm{Sup}(L \rightarrow R) - N \cdot \mathrm{Sup}(L) \cdot \mathrm{Sup}(R)}{\sqrt{N \cdot \mathrm{Sup}(L) \cdot \mathrm{Sup}(R)\left(1 - \mathrm{Sup}(L) \cdot \mathrm{Sup}(R)\right)}}$$

The mean value is E_LR, and the actual value is LR. The standard deviation is calculated with SQRT (E_LR * (1 - E_LR/N)). From the previous example, the expected value is 6 * 5 / 10, so the mean value is 3. The actual value is calculated knowing that savings and checking accounts are owned by 4 out of 10 customers. The standard deviation is SQRT(3*(1-3/10)) or 1.449. The Z score is therefore (4 - 3) / 1.449 = .690.
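The worked numbers for confidence, expected value, lift and Z score can be reproduced with a few lines of Python. This is purely an illustrative calculation under the formulas above, not part of the product.

```python
import math

# Counts from the worked example: N customers, L own checking,
# R own savings, LR own both.
N, L, R, LR = 10, 6, 5, 4

sup_l, sup_r = L / N, R / N
confidence = LR / L                      # 4/6, about 0.667
e_lr = sup_l * sup_r * N                 # expected co-ownership count = 3
e_conf = sup_r                           # expected confidence = 0.5
lift = confidence / e_conf               # 4/3, about 1.333
zscore = (LR - e_lr) / math.sqrt(e_lr * (1 - e_lr / N))  # about 0.690

print(round(confidence, 3), round(lift, 3), round(zscore, 3))  # 0.667 1.333 0.69
```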
Interpreting Measures

None of the measures described above are "best"; they all measure slightly different things. In the discussion below, product ownership association analysis is used as an example for purposes of illustration.

First look at confidence, which measures the strength of an association: what percent of L customers also own R? Many people will sort associations by confidence and consider the highest confidence rules to be the best. However, there are several other factors to consider. One factor to consider is that a rule may apply to very few customers, so is not very useful. This is what support measures, the generality of the rule, or how often it applies. Thus a rule L → R might have a confidence of 70%, but if that is just 7 out of 100 customers, it has very low support and is not very useful. Another shortcoming of confidence is that by itself it does not tell you whether owning L "changes" the likelihood of owning R, which is probably the more important piece of information. For example, if 20% of the customers own R, then a rule L → R (20% of those with L also own R) may have high confidence but is really providing no information, because customers that own L have the same rate of ownership of R as the entire population does. What is probably really wanted is to find the products L for which the confidence of L → R is significantly greater than 20%. This is what lift measures, the difference between the actual confidence and the expected confidence.

However, lift, like confidence, is much less meaningful when very small numbers are involved; that is, when the support is low. If the expected number is 2 and there are actually 8 customers with product R, then the lift is an impressive 4.0. But because of the small numbers involved, the association rule is likely of limited use, and might even have occurred by chance. This is where the Z score comes in.

For a rule L → R, confidence indicates the likelihood that R is owned given that L is owned. Lift indicates how much owning L increases or decreases the probability of the ownership of R, and Z score measures how trustworthy the observed difference between the actual and expected ownership is relative to what could be observed due to chance alone. For example, for a rule L → R, if it is expected to have 10,000 customers with both L and R, and there are actually 11,000, the lift would be only 1.1, but the Z score would be very high, because such a large difference could not be due to chance. Thus, a large Z score and small lift means there definitely is an effect, but it is small. A large lift and small Z means there appears to be a large effect, but it might not be real.

A possible strategy then is given here as an illustration, but the exact strategy and threshold values will depend on the nature of each business problem addressed with association analysis. The full set of rules produced by an association analysis is often too large to examine in detail. First, prune out rules that have low Z scores. Try throwing out rules with a Z score of less than 2, if not 3, 4 or 5. However, there is little reason to focus in on rules with extremely high Z scores. Next, filter according to support and lift. Setting a limit on the Z score will not remove rules with low support or with low lift that involve common products. Where to set the support threshold depends on what products are of interest and performance considerations. Where to set the lift threshold is not really a technical question, but a question of preference as to how large a lift is useful from a business perspective. A lift of 1.5 for L → R means that customers that own L are 50% more likely to own R than among the overall population. If a value of 1.5 does not yield interesting results, then set the threshold higher.

Sequence Analysis

Sequence analysis is a form of association analysis where the items in an association rule are considered to have a time ordering associated with them. By default, when sequence analysis is requested, left side items are assumed to have "occurred" before right side items, and in fact the items on each side of an association rule, left or right, are also time ordered within themselves. If we use in a sequence analysis the more full notation for an association rule L → R, namely X_1, X_2, ..., X_m → Y_1, Y_2, ..., Y_n, then we are asserting that not only do the X items precede the Y items, but X_1 precedes X_2, and so on up to X_m, which precedes Y_1, which precedes Y_2, and so on up to Y_n.

It is important to note here that if a strict ordering of items in a sequence analysis is either not desired or not possible for some reason (such as multiple purchases on the same day), an option is provided to relax the strict ordering. With relaxed sequence analysis, all items on the left must still precede all items on the right of a sequence rule, but the items on the left and the items on the right are not time ordered amongst themselves. (When the rules are presented, the items in each rule are ordered by name for convenience).

Lift and Z score are calculated differently for sequence analysis than for association analysis. Recall that the expected value of the association rule, E_LR, is given by Sup(L) * Sup(R) * N for a non-sequence association analysis.
For example, if L occurs half the time and R occurs half the time, then if L and R are independent of each other it can be expected that L and R will occur together one-fourth of the time. But this does not take into account the fact that with sequence analysis, the correct ordering can only be expected to happen some percentage of the time if L and R are truly independent of each other. Interestingly, this expected percentage of independent occurrence of correct ordering is calculated the same for strictly ordered and relaxed ordered sequence analysis. With m items on the left and n on the right, the probability of correct ordering is given by:

$$\frac{m!\,n!}{(m + n)!}$$

Note that this is the inverse of the combinatorial analysis formula for the number of permutations of m + n objects grouped such that m are alike and n are alike.

In the case of strictly ordered sequence analysis, the applicability of the formula just given for the probability of correct ordering can be explained as follows. There are clearly m + n objects in the rule, and saying that m are alike and n are alike corresponds to restricting the permutations to those that preserve the ordering of the m items on the left side and the n items on the right side of the rule. That is, all of the orderings of the items on a side other than the correct ordering fall out as being the same permutation.

The logic of the formula given for the probability of correct ordering is perhaps easier to see in the case of relaxed ordering. Since there are m + n items in the rule there are (m + n)! possible orderings of the items. Out of these, there are m! ways the left items can be ordered and n! ways the right items can be ordered while insuring that the m items on the left precede the n items on the right, so there are m!n! valid orderings out of the (m + n)! possible.

The "probability of correct ordering" factor described above has a direct effect on the calculation of lift and Z score. Lift is effectively divided by this factor, such that a factor of one half results in doubling the lift and increasing the Z score as well. The resulting lift and Z score for sequence analysis must be interpreted cautiously however, since the assumptions made in calculating the independent probability of correct ordering are quite broad. For example, it is assumed that all combinations of ordering are equally likely to occur, and the amount of time between occurrences is completely ignored. To give the user more control over the calculation of lift and Z score for a sequence analysis, an option is provided to set the "probability of correct ordering" factor to a constant value if desired. Setting it to 1 for example effectively ignores this factor in the calculation of E_LR and therefore in lift and Z score.
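A small sketch of this adjustment, assuming hypothetical counts (the non-sequence expected value of 3 and actual co-occurrence count of 4 are borrowed from the earlier banking example purely for the arithmetic):

```python
from math import factorial

def ordering_probability(m, n):
    """Probability that the m left items precede the n right items (and, under
    strict ordering, that each side is internally in order) when all (m + n)!
    orderings are equally likely: m!n!/(m + n)!."""
    return factorial(m) * factorial(n) / factorial(m + n)

p = ordering_probability(2, 1)     # 2-to-1 rule: factor is 2!1!/3! = 1/3
e_lr = 3 * p                       # expected count in the correct order = 1.0
lift = 4 / e_lr                    # lift is effectively divided by the factor
print(p, e_lr, lift)               # 0.333..., 1.0, 4.0
```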
Initiate an Association Analysis

Use the following procedure to initiate a new Association analysis in Teradata Warehouse Miner:

1 Click on the Add New Analysis icon in the toolbar:
Figure 1: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Association:
Figure 2: Add New Analysis dialog
3 This will bring up the Association dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.

Association - INPUT - Data Selection

On the Association dialog click on INPUT and then click on data selection:
Figure 3: Association > Input > Data Selection

On this screen select:

1 Select Input Source
Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). For more information refer to "INPUT Tab" on page 73 of the Teradata Warehouse Miner User Guide (Volume 1).

2 Select Columns From a Single Table
• Available Databases (or Analyses) — These are the databases (or analyses) available to be processed.
• Available Tables — These are the tables and views that are available to be processed.
• Available Columns — These are the columns within the table/view that are available for the Association analysis.
• Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Group, Item, or Sequence columns. Make sure you have the correct portion of the window highlighted.
• Group Column — The column that specifies the group for the Association analysis. This column should specify observations or transactions that contain items of some kind.
• Item Column — The column that specifies the items to be analyzed in the Association analysis. The relationship of these items within the group will be described by the Association analysis.
• Sequence Column — The column that specifies the sequence of items in the Association analysis. This column should have a time-ordering relationship with the items associated with it.

Association - INPUT - Analysis Parameters

On the Association dialog click on INPUT and then click on analysis parameters:
Figure 4: Association > Input > Analysis Parameters

On this screen select:

• Association Combinations — In this window specify one or more association combinations in the format of "X TO Y" where the sum of X and Y must not exceed a total of 10. First select an "X TO Y" combination from the drop-down lists:
Figure 5: Association: X to X
Then click the Add button to add this combination to the window. Repeat for as many combinations as needed:
Figure 6: Association Combinations pane
If needed, remove a combination by highlighting it in the window and then clicking on the Remove button.
• Processing Options
• Perform All Steps — Execute the entire Association/Sequence Analysis, regardless of result sets generated from a previous execution.
• Perform Support Calculation Only — In order to determine the minimum support value to use, the user may choose to only build the single-item support table by using this option, making it possible to stop and examine the table before proceeding.
• Recalculate Final Affinities Only — Rebuild just the final association tables using support tables from a previous run, provided that intermediate work tables were not dropped (see the Drop All Support Tables After Execution option below).
• Auto-Calculate group count — By default, the algorithm automatically determines the actual input count.
• Force Group Count To — If the Auto-Calculate group count is disabled, this option can be used to fix the number of groups, overriding the actual input count. This is useful in conjunction with the Reduced Input Options, to set the group count to the group count in the original data set, rather than the reduced input data set.
• Drop All Support Tables After Execution — Normally, the Association analysis temporarily builds the support tables, dropping them prior to termination. If, for performance reasons, it is desired to use the Recalculate Final Affinities Only option, this option can be disabled so that this clean-up of support tables does not happen.
• Minimum Support — The minimum Support value that the association must have in order to be reported. Using this option reduces the input data - this can be saved for further processing using the Reduced Input Options. Using this option also invokes list-wise deletion, automatically removing from processing (and from the reduced input data) all rows containing a null Group, Item or Sequence column.
• Minimum Confidence — The minimum Confidence value that the association must have in order to be reported.
• Minimum Lift — The minimum Lift value that the association must have in order to be reported.
• Minimum Z-Score — The minimum absolute Z-Score value that the association must have in order to be reported. (A sketch of how these four minimum-value thresholds prune the reported rules follows this list.)
• Sequence Options — If a column is specified with the Sequence Column option, then the following two Sequence Options are enabled. Note that Sequence Analysis is not available when Hierarchy Information is specified:
• Use Relaxed Ordering — With this option, the items on each side of the association rule may be in any sequence provided all the left items (antecedents) precede all the right items (consequents).
• Auto-Calculate Ordering Probability — Sequence analysis option to let the algorithm calculate the "probability of correct ordering" according to the principles described in "Sequence Analysis" on page 6. (Note that the following option to set "Ordering Probability" to a chosen value is only available if this option is unchecked).
• Ordering Probability — Sequence analysis option to set probability of correct ordering to a non-zero constant value between 0 and 1. Setting it to 1 effectively ignores this principle in calculating lift and Z-score.
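As a rough illustration of how the four minimum-value thresholds prune the rules that are reported, the following Python sketch filters a handful of candidate rules held as dictionaries; the first two rule values are taken loosely from the tutorial output later in this section, and the third rule is hypothetical.

```python
rules = [
    {"left": "B", "right": "V", "support": 0.225, "confidence": 0.454, "lift": 1.249, "zscore": 2.499},
    {"left": "A", "right": "E", "support": 0.807, "confidence": 0.941, "lift": 1.027, "zscore": 1.095},
    {"left": "X", "right": "Y", "support": 0.001, "confidence": 0.050, "lift": 0.900, "zscore": 0.200},
]

# Threshold values used in the tutorial later in this section
min_support, min_confidence, min_lift, min_zscore = 0.0, 0.1, 1.0, 1.0

reported = [
    r for r in rules
    if r["support"] >= min_support
    and r["confidence"] >= min_confidence
    and r["lift"] >= min_lift
    and abs(r["zscore"]) >= min_zscore   # the Z-score threshold is applied to the absolute value
]
print([(r["left"], r["right"]) for r in reported])   # [('B', 'V'), ('A', 'E')]
```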
Association - INPUT - Expert Options

On the Association dialog click on INPUT and then click on expert options:
Figure 7: Association > Input > Expert Options

On this screen select:

• Where Conditions — An SQL WHERE clause may be specified here to provide further input filtering for only those groups or items that you are interested in. This works exactly like the Expert Options for the Descriptive Statistics, Transformation and Data Reorganization functions - only the condition itself is entered here. Using this option reduces the input data set - this can be saved for further processing using the Reduced Input Options. Using this option also invokes list-wise deletion, automatically removing from processing (and from the reduced input data) all rows containing a null Group, Item or Sequence column.
• Include Hierarchy Table — A hierarchy lookup table may be specified to convert input items on both the left and right sides of the association rule to a higher level in a hierarchy if desired. Note that the column in the hierarchy table corresponding to the items in the input table must not contain repeated values, so effectively the items in the input table must match the lowest level in the hierarchy table. The following is an example of a three-level hierarchy table compatible with Association analysis, provided the input table matches up with the column ITEM1. (A sketch of the item roll-up this table implies appears after this list.)

Table 1: Three-Level Hierarchy Table

| ITEM1 | ITEM2 | ITEM3 | DESC1 | DESC2 | DESC3 |
|-------|-------|-------|----------|------------|---------|
| A | P | Y | Savings | Passbook | Deposit |
| B | P | Y | Checking | Passbook | Deposit |
| C | W | Z | Atm | Electronic | Access |
| D | S | X | Charge | Short | Credit |
| E | T | Y | CD | Term | Deposit |
| F | T | Y | IRA | Term | Deposit |
| G | L | X | Mortgage | Long | Credit |
| H | L | X | Equity | Long | Credit |
| I | S | X | Auto | Short | Credit |
| J | W | Z | Internet | Electronic | Access |

Using this option reduces the input data set - this can be saved for further processing using the Reduced Input Options. Using this option also invokes list-wise deletion, automatically removing from processing (and from the reduced input data) all rows containing a null Group, Item or Sequence column. The following columns in the hierarchy table must be specified with this option.
• Item Column — The name of the column that can be joined to the column specified by the Item Column option on the Select Column tab to look up the associated Hierarchy.
• Hierarchy Column — The name of the column with the Hierarchy values.
• Include Description Table — For reporting purposes, a descriptive name or label can be given to the items processed during the Association/Sequence Analysis.
• Item ID Column — The name of the column that can be joined to the column specified by the Item Column option on the Select Column tab (or Hierarchy Column option on the Hierarchies tab if hierarchy information is also specified) to look up the description.
• Item Description Column — The name of the column with the descriptive values.
• Include Left Side Lookup Table — A focus products table may be specified to process only those items that are of interest on the left side of the association.
• Left Side Identifier Column — The name of the column where the Focus Products values exist for the left side of the association.
• Include Right Side Lookup Table — A focus products table may be specified to process only those items that are of interest on the right side of the association.
• Right Side Identifier Column — The name of the column where the Focus Products values exist for the right side of the association.
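The roll-up that the hierarchy table implies can be illustrated with a short Python sketch; the input rows and the choice of ITEM2 as the target level are assumptions for the example, and in the product the equivalent lookup is performed in the generated SQL against the hierarchy table.

```python
# ITEM1 -> ITEM2 mapping taken from the rows of Table 1
item1_to_item2 = {
    "A": "P", "B": "P", "C": "W", "D": "S", "E": "T",
    "F": "T", "G": "L", "H": "L", "I": "S", "J": "W",
}

# Hypothetical (group, item) rows from the input table, keyed at the lowest level (ITEM1)
input_rows = [(101, "A"), (101, "B"), (102, "C"), (102, "E")]

# Each input item is rolled up to the requested hierarchy level before rules are computed
rolled_up = [(grp, item1_to_item2[item]) for grp, item in input_rows]
print(rolled_up)   # [(101, 'P'), (101, 'P'), (102, 'W'), (102, 'T')]
```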
Association - OUTPUT

On the Association dialog click on OUTPUT:
Figure 8: Association > Output

On this screen select:

• Output Tables
• Database Name — The database where the Association analysis builds temporary and permanent tables during the analysis. This defaults to the Result Database.
• Table Names — Assign a table name for each displayed combination.
• Advertise Output — The Advertise Output option "advertises" each output table (including the Reduced Input Table, if saved) by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to "Advertise Output" on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)).
• Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
• Reduced Input Options — A reduced input set, based upon the minimum support value specified, a product hierarchy or input filtering via a WHERE clause, can be saved and used as input to a subsequent Association/Sequence analysis as follows:
• Save Reduced Input Table — Check box to specify to the analysis that the reduced input table should be saved.
• Database Name — The database name where the reduced input table will be saved.
• Table Name — The table name that the reduced input table will be saved under.
• Generate SQL, but do not Execute it — Generate the Association or Sequence Analysis SQL, but do not execute it - the set of queries is returned with the analysis results.

Run the Association Analysis

After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard

Results - Association Analysis

The results of running the Association analysis include a table for each association pair requested, as well as the SQL to perform the association or sequence analysis. All of these results are outlined below.

Association - RESULTS - SQL

On the Association dialog click on RESULTS and then click on SQL:
Figure 9: Association > Results > SQL

The series of SQL statements that comprise the Association/Sequence Analysis are displayed here.

Association - RESULTS - data

On the Association dialog click on RESULTS and then click on data:
Figure 10: Association > Results > Data

Results data, if any, is displayed in a data grid as described in "RESULTS Tab" on page 80 of the Teradata Warehouse Miner User Guide (Volume 1). An output table is generated for each item pair specified in the Association Combinations option. Each table generated has the columns listed below.

Table 2: Association Combinations output table
• ITEMXOFY (user defined; defaults to the data type of the Item Column) — Two or more columns will be generated, depending upon the number of Association Combinations. Together, these form the UPI of the result table. The value for X in the column name is 1 through the number of item pairs specified. The value for Y in the column name is the sum of the number of items specified. For example, specifying Left and Right Association Combinations of <1, 1> will produce two columns: ITEM1OF2 and ITEM2OF2. Specifying <1, 2> will result in three columns: ITEM1OF3, ITEM2OF3 and ITEM3OF3. The data type is the same as the Item Column.
• LSUPPORT (DECIMAL(18,5)) — The Support of the left-side item or antecedent only.
• RSUPPORT (DECIMAL(18,5)) — The Support of the right-side item or consequent only.
• SUPPORT (DECIMAL(18,5)) — The Support of the association (i.e. antecedent and consequent together).
• CONFIDENCE (DECIMAL(18,5)) — The Confidence of the association.
• LIFT (DECIMAL(15,5)) — The Lift of the association.
• ZSCORE (DECIMAL(15,5)) — The Z-Score of the association.
Association - RESULTS - graph

On the Association dialog click on RESULTS and then click on graph:
Figure 11: Association > Results > Graph

For 1-to-1 Associations, a tile map is available as described below. (No graph is available for combinations other than 1-to-1).

• Graph Options — Two selectors with a Reference Table display underneath are used to make association selections to graph. For example, the following selections produced the graph below.
Figure 12: Association Graph Selector
The Graph Options display has the following selectors:
a Select item 1 of 2 from this table, then click button. The first step is to select the left-side or antecedent items to graph associations for by clicking or dragging the mouse just to the left of the row numbers displayed. Note that the accumulated minimum and maximum values of the measures checked just above the display are given in this table. (The third column, "Item2of2 count", is a count of the number of associations that are found in the result table for this left-side item). Once the selections are made, click the big button between the selectors.
b Select from these tables to populate graph. The second step is to select once again the desired left-side or antecedent items by clicking or dragging the mouse just to the left of the row numbers displayed under the general header "Item 1 of 2" in the left-hand portion of selector 2. Note that as "Item 1 of 2" items are selected, "Item 2 of 2" right-side or consequent items are automatically selected in the right-hand portion of selector 2. Here the accumulated minimum and maximum values of the measures checked just above this display are given in the trailing columns of the table. (The third column, "Item1of2 count", is a count of the number of associations that are found in the result table for this right-side item when limited to associations involving the left-side items selected in step 1). The corresponding associations are automatically highlighted in the Reference Table below. An alternative second step is to directly select one or more "Item 2 of 2" items in the right-hand portion of selector 2. The corresponding associations (again, limited to the left-side items selected in the first step) are then highlighted in the Reference Table below.
• Reference Table — This table displays the rows from the result table that correspond to the selections made above in step 1, highlighting the rows corresponding to the selections made in step 2.
• (Row Number) — A sequential numbering of the rows in this display.
• Item 1 of 2 — Left item or antecedent in the association rule.
• Item 2 of 2 — Right item or consequent in the association rule.
• LSupport — The left-hand item Support, calculated as the percentage (a value between 0 and 1) of groups that contain the left-hand item referenced in the association rule.
• RSupport — The right-hand item Support, calculated as the percentage (a value between 0 and 1) of groups that contain the right-hand item referenced in the association rule.
• Support — The Support, which is a measure of the generality of an association rule, calculated as the percentage (a value between 0 and 1) of groups that contain all of the items referenced in the rule.
• Confidence — The Confidence, defined as the probability of the right-hand item occurring in an item group given that the left-hand item is in the item group.
• Lift — The Lift, which measures how much the probability of the existence of the right-hand item is increased by the presence of the left-hand item in a group.
• ZScore — The Z score value, a measure of how statistically different the actual result is from the expected result.
• Show Graph — A tile map is displayed when the "show graph" tab is selected, provided that valid "graph options" selections have been made. The example below corresponds to the graph options selected in the example above.
Figure 13: Association Graph
The tiles are color coded in the gradient specified on the right-hand side. Clicking on any tile brings up all statistics associated with that association, and highlights the two items in the association. Radio buttons above the upper right hand corner of the tile map can be used to select the measure to color code in the tiles, that is either Support, Lift or Zscore.

Tutorial - Association Analysis

In this example, an Association analysis is performed on the fictitious banking data to analyze channel usage. Parameterize an Association analysis as follows:
• Available Tables — twm_credit_tran
• Group Column — cust_id
• Item Column — channel
• Association Combinations
• Left — 1
• Right — 1
• Processing Options
• Perform All Steps — Enabled
• Minimum Support — 0
• Minimum Confidence — 0.1
• Minimum Lift — 1
• Minimum Z-Score — 1
• Where Clause Text — channel <> ' ' (i.e. channel is not equal to a single blank)
• Output Tables
• 1 to 1 Table Name — twm_tutorials_assoc

Run the analysis, and click on Results when it completes. For this example, the Association analysis generated the following pages. The SQL is not shown for brevity.

Table 3: Tutorial - Association Analysis Data

| ITEM1OF2 | ITEM2OF2 | LSUPPORT | RSUPPORT | SUPPORT | CONFIDENCE | LIFT | ZSCORE |
|----------|----------|----------|----------|---------|------------|---------|---------|
| A | E | 0.85777 | 0.91685 | 0.80744 | 0.94132 | 1.02669 | 1.09511 |
| B | K | 0.49672 | 0.35667 | 0.21007 | 0.42291 | 1.18572 | 1.84235 |
| B | V | 0.49672 | 0.36324 | 0.22538 | 0.45374 | 1.24915 | 2.49894 |
| C | K | 0.67177 | 0.35667 | 0.26477 | 0.39414 | 1.10506 | 1.26059 |
| C | V | 0.67177 | 0.36324 | 0.27133 | 0.4039 | 1.11194 | 1.35961 |
| E | A | 0.91685 | 0.85777 | 0.80744 | 0.88067 | 1.0267 | 1.09511 |
| K | B | 0.35667 | 0.49672 | 0.21007 | 0.58898 | 1.18574 | 1.84235 |
| K | C | 0.35667 | 0.67177 | 0.26477 | 0.74234 | 1.10505 | 1.26059 |
| K | V | 0.35667 | 0.36324 | 0.1663 | 0.46626 | 1.28361 | 2.33902 |
| V | B | 0.36324 | 0.49672 | 0.22538 | 0.62047 | 1.24913 | 2.49894 |
| V | C | 0.36324 | 0.67177 | 0.27133 | 0.74697 | 1.11194 | 1.35961 |
| V | K | 0.36324 | 0.35667 | 0.1663 | 0.45782 | 1.2836 | 2.33902 |
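As a quick, purely illustrative cross-check of these results, the measures within any row of Table 3 are related by the formulas given earlier in this chapter; for the B to V row:

```python
# Values from the B -> V row of Table 3
lsupport, rsupport, support, confidence = 0.49672, 0.36324, 0.22538, 0.45374

print(support / lsupport)               # ~0.4537, the CONFIDENCE column
print(confidence / rsupport)            # ~1.2491, the LIFT column
print(support / (lsupport * rsupport))  # ~1.2491, the LIFT column again
```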
When the tile map displays, perform the following additional steps:
a Click on the bottommost tile. (Hovering over this tile displays the item names K and V).
b Try selecting different measures at the top right of the tile map. (Zscore is initially selected).

Figure 14: Association Graph: Tutorial

Cluster Analysis

Overview

The task of modeling multidimensional data sets encompasses a variety of statistical techniques, including that of 'cluster analysis'. Cluster analysis is a statistical process for identifying homogeneous groups of data objects. It is based on unsupervised machine learning and is crucial in data mining. Due to the massive sizes of databases today, the implementation of any clustering algorithm must be scalable enough to complete the analysis in a practicable amount of time, and must operate on large volumes of data with many variables. Typical statistical clustering algorithms do not work well with large databases due to memory limitations and the execution times required. The advantage of the cluster analysis algorithm in Teradata Warehouse Miner is that it enables scalable data mining operations directly within the Teradata RDBMS. This is achieved by performing the data-intensive aspects of the algorithm using dynamically generated SQL, while low-intensity processing is performed in Teradata Warehouse Miner. A second key design feature is that model application, or scoring, is performed by generating and executing SQL based on information about the model saved in metadata result tables. A third key design feature is the use of the Expectation Maximization or EM algorithm, a particularly sound statistical processing technique. Its simplicity makes possible a purely SQL-based implementation that might not otherwise be feasible with other optimization techniques. And finally, the Gaussian mixture model gives a probabilistic approach to cluster assignment, allowing observations to be assigned probabilities of inclusion in each cluster. The clustering is based on a simplified form of generalized distance in which the variables are assumed to be independent, equivalent to Euclidean distances on standardized measures.

Preprocessing - Cluster Analysis

Some preprocessing of the input data by the user may be necessary. Any categorical data to be clustered must first be converted to design-coded numeric variables. Since null data values may bias or invalidate the analysis, they may be replaced, or the listwise deletion option selected to exclude rows with any null values in the preprocessing phase. Teradata Warehouse Miner automatically builds a single input table from the requested columns of the requested input table. If the user requests more than 30 input columns, the data is unpivoted, with additional rows added for the column values. Through this mechanism, any number of columns within a table may be analyzed, and the SQL optimized for a particular Teradata server capability.

Expectation Maximization Algorithm

The clustering algorithm requires specification of the desired number of clusters. After preprocessing, an initialization step determines seed values for the clusters, and clustering is then performed based on conditional probability and maximum likelihood principles, using the EM algorithm to converge on cluster assignments that yield the maximum likelihood value.
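The paragraphs that follow define the Gaussian distance function and the Mahalanobis distance used in the expectation step. Purely as an illustration of how these pieces fit together, the following sketch expresses one EM pass for a Gaussian mixture with independent variables (diagonal covariance) in Python. The function name, parameters and seeding shown are hypothetical; this is not the product's SQL-based implementation, which runs inside the Teradata RDBMS.

```python
# Conceptual sketch only (not the product's SQL-based implementation): EM for a
# Gaussian mixture in which variables are assumed independent (diagonal covariance).
import numpy as np

def em_diagonal_gmm(x, k, max_iter=50, epsilon=0.001, min_var=1e-10):
    """x: (rows, variables) data matrix; k: requested number of clusters."""
    x = np.asarray(x, dtype=float)
    m, n = x.shape
    rng = np.random.default_rng(0)
    means = x[rng.choice(m, size=k, replace=False)]    # random-row seeding
    variances = np.tile(x.var(axis=0), (k, 1))         # per-cluster diagonal variances
    priors = np.full(k, 1.0 / k)                       # "prior probabilities" (weights)
    prev_loglik = -np.inf

    for iteration in range(max_iter):
        # Expectation step: probability of membership of each row in each cluster.
        # With a diagonal covariance matrix the Mahalanobis distance reduces to a
        # variance-weighted (standardized) squared Euclidean distance.
        var = np.maximum(variances, min_var)
        d2 = ((x[:, None, :] - means[None, :, :]) ** 2 / var[None, :, :]).sum(axis=2)
        log_gauss = -0.5 * (n * np.log(2 * np.pi) + np.log(var).sum(axis=1) + d2)
        log_weighted = np.log(priors) + log_gauss
        log_norm = np.logaddexp.reduce(log_weighted, axis=1, keepdims=True)
        resp = np.exp(log_weighted - log_norm)          # membership probabilities

        # Maximization step: probabilistically weighted priors, means and variances.
        nk = resp.sum(axis=0)
        priors = nk / m
        means = (resp.T @ x) / nk[:, None]
        variances = (resp.T @ (x ** 2)) / nk[:, None] - means ** 2

        # Log likelihood: sum over rows of the log of each row's mixture probability.
        loglik = float(log_norm.sum())
        if loglik - prev_loglik < epsilon:
            break
        prev_loglik = loglik
    return priors, means, variances, loglik
```

In this sketch the min_var floor plays the same role as the Minimum Variance Exponent expert option described later, preventing a zero variance (a constant within a cluster) from making the calculation fail.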
In a Gaussian Mixture (GM) model, it is assumed that the variables being modeled are members of a normal (Gaussian) probability distribution. For each cluster, a maximum likelihood equation can be constructed indicating the probability that a randomly selected observation from that cluster would look like a particular observation. A maximum likelihood rule for classification would assign this observation to the cluster with the highest likelihood value. In computing these probabilities, conditional probabilities use the relative sizes of the clusters as prior probabilities to compute a probability of membership of each row in each cluster. Rows are reassigned to clusters with probabilistic weighting, after units of distance have been transformed to units of standard deviation of the standard normal distribution via the Gaussian distance function:

p_{mo} = (2\pi)^{-n/2} \, |R|^{-1/2} \, \exp\!\left( -\frac{d_{mo}^{2}}{2} \right)

Where:
• p is dimensioned 1 by 1 and is the probability of membership of a point in a cluster
• d is dimensioned 1 by 1 and is the Mahalanobis Distance
• n is dimensioned 1 by 1 and is the number of variables
• R is dimensioned n by n and is the cluster variance/covariance matrix

The Gaussian Distance Function translates distance into a probability of membership under this probabilistic model. Intermediate results are saved in Teradata tables after each iteration, so the algorithm may be stopped at any point and the latest results viewed, or a new clustering process begun at this point. These results consist of cluster means, variances and prior probabilities.

Expectation Step

Means, variances and frequencies of the rows assigned to each cluster are first calculated. A covariance inverse matrix is then constructed using these variances, with all non-diagonal elements assumed to be zero. This simplification is tantamount to the assumption that the variables are independent. It improves performance by making the number of calculations proportional to the number of variables rather than to its square. Row distances to the mean of each cluster are calculated using the Mahalanobis Distance (MD) metric:

d_{mo}^{2} = \sum_{i=1}^{n} \sum_{j=1}^{n} (x_{mi} - c_{oi}) \, (R^{-1})_{ij} \, (x_{mj} - c_{oj})

Where:
• m is the number of rows
• n is the number of variables
• o is the number of clusters
• d is dimensioned m by o and is the Mahalanobis Distance from a row to a cluster
• x is dimensioned m by n and is the data
• c is dimensioned 1 by n and is a cluster centroid
• R is dimensioned n by n and is the cluster variance/covariance matrix

Mahalanobis Distance is a rescaled, unitless form of the data used to identify outlying data points. Independent variables may be thought of as defining a multidimensional space in which each observation can be plotted, and means ("centroids") for each independent variable may also be plotted. Mahalanobis distance is the distance of each observation from the centroid, defined over variables that may be dependent. In the special case where the variables are independent or uncorrelated, it is equivalent to simple Euclidean distance. In the default GM model, separate covariance matrices are maintained, conforming to the specifications of a pure maximum likelihood rule model. The EM algorithm works by performing the expectation and maximization steps iteratively until the log-likelihood value converges (i.e.
changes less than a default or specified epsilon value), or until a maximum specified number of iterations has been performed. The loglikelihood value is the sum over all rows of the natural log of the probabilities associated with each cluster assignment. Although the EM algorithm is guaranteed to converge, it is possible it may converge slowly for comparatively random data, or it may converge to a local maximum rather than a global one. Maximization Step The row is assigned to the nearest cluster with a probabilistic weighting for the GM model, or with certainty for the K-Means model. Options - Cluster Analysis K-Means Option With the K-Means option, rows are reassigned to clusters by associating each to the closest cluster centroid using the shortest distance. Data points are assumed to belong to only one cluster, and the determination is considered a ‘hard assignment’. After the distances are computed from a given point to each cluster centroid, the point is assigned to the cluster whose center is nearest to the point. On the next iteration, the point’s value is used to redefine that cluster’s mean and variance. This is in contrast to the default Gaussian option, wherein rows are reassigned to clusters with probabilistic weighting, after units of distance have been transformed to units of standard deviation via the Gaussian distance function. Also with the K-means option, the variables' distances to cluster centroids are calculated by summing, without any consideration of the variances, resulting effectively in the use of unnormalized Euclidean distances. This implies that variables with large variances will have a greater influence over the cluster definition than those with small variances. Therefore, a typical preparatory step to conducting a K-means cluster analysis is to standardize all of the numeric data to be clustered using the Z-score transformation function in Teradata Warehouse Miner. K-means analyses of data that are not standardized typically produce results that: (a) are dominated by variables with large variances, and (b) virtually or totally ignore variables with small variances during cluster formation. Alternatively, the Rescale function could be used to normalize all numeric data, with a lower boundary of zero and an upper boundary of one. Normalizing the data prior to clustering gives all the variables equal weight. Poisson Option The Poisson option is designed to be applied to data containing mixtures of Poissondistributed variables. The data is first normalized so all variables have the same means and variances, allowing the calculation of the distance metric without biasing the result in favor of larger-magnitude variables. The EM algorithm is then applied with a probability metric based on the likelihood function of the Poisson distribution function. As in the Gaussian Mixture Model option, rows are assigned to the nearest cluster with a probabilistic weighting. At the 22 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Cluster Analysis end of the EM iteration, the data is unnormalized and saved as a potential result, until or unless replaced by the next iteration. Average Mode - Minimum Generalized Distance Within the GM model, a special “average mode” option is provided, using the minimum generalized distance rule. With this option, a single covariance matrix is used for all clusters, rather than using an individual covariance matrix for each cluster. 
A weighted average of the covariance matrices is constructed for use in the succeeding iteration. Automatic Scaling of Likelihood Values When a large number of variables are input to the cluster analysis module, likelihood values can become prohibitively small. The algorithm automatically scales these values to avoid loss of precision, without invalidating the results in any way. The expert option ‘Scale Factor Exponent (s)’ may be used to bypass this feature by using a specific value, e.g. 10s, to multiply the probabilities. Continue Option The continue option allows clustering to be resumed where it left off by starting with the cluster centroid, variance and probability values of the last complete iteration saved in the metadata tables. Success Analysis - Cluster Analysis If the log-likelihood value converges, and the requested number of clusters is obtained with significant probabilities, then the clustering analysis may be considered to have been successful. If the log-likelihood value declines, indicating convergence has completed, the iterations stop. On occasion, warning messages may indicate constants within one or more clusters. Using the TWM Cluster Analysis Sampling Large Database Tables as a Starting Method It may be most effective to use the sample parameter to begin the analysis of extremely large databases. The execution times will be much faster, and an approximate result obtained that can be used as a starting point, as described above. Results may be compared using the loglikelihood value, where the largest value indicates the best clustering fit, in terms of maximum likelihood. Since local maxima may result from a particular EM clustering analysis, multiple executions from different samples may produce a seed that ultimately yields the best log-likelihood value. Clustering and Data Problems Common data problems for cluster analysis include insufficient rows provided for the number of clusters requested, and constants in the data resulting in singular covariance matrices. When these problems occur, warning messages and recommendations are provided. An option for dealing with null values during processing is described below. Teradata Warehouse Miner User Guide - Volume 3 23 Chapter 1: Analytic Algorithms Cluster Analysis Additionally, Teradata errors may occur for non-normalized data having more than 15 digits of significance. In this case, a preprocessing step of either multiplying (for small numbers) or dividing (for large numbers) by a constant value may rectify overflow and underflow conditions. The clusters will remain the same as all this does is change the unit of measure. Clustering and Constants in Data When one or more of the variables included in the clustering analysis have only a few values, these values may be singled out and included in particular clusters as constants. This is most likely when the number of clusters sought is large. When this happens, the covariance matrix becomes singular and cannot be inverted, since some of the variances are zero. A feature is provided in the cluster algorithm to improve the chance of success under these conditions, by limiting how close to zero the variance may be set, e.g. 10-3. The default value is 10-10. If the log-likelihood values increase for a number of iterations and then start decreasing, it is likely due to the clustering algorithm having found clusters where selected variables are all the same value (a constant), so the cluster variance is zero. 
Changing the minimum variance exponent value to a larger value may reduce the effect of these constants, allowing the other variables to converge to a higher log-likelihood value. Clustering and Null Values The presence of null values in the data may result in clusters that differ from those that would have resulted from zero or numeric values. Since null data values may bias or invalidate the analysis, they should be replaced or the column eliminated. Alternatively, the listwise deletion option can be selected to exclude rows with any null values in the preprocessing phase. Optimizing Performance of Clustering Parallel execution of SQL is an important feature of the cluster analysis algorithm in Teradata Warehouse Miner as well as Teradata. The number of variables to cluster in parallel is determined by the ‘width’ parameter. The optimum value of width will depend on the size of the Teradata system, its memory size, and so forth. Experience has shown that when a large number of variables are clustered on, the optimum value of width ranges from 20-25. The width value is dynamically set to the lesser of the specified Width option (default = 25) and the number of columns, but can never exceed 118. If SQL errors indicate insufficient memory, reducing the width parameter may alleviate the problem. Initiate a Cluster Analysis Use the following procedure to initiate a new Cluster analysis in Teradata Warehouse Miner: 24 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Cluster Analysis 1 Click on the Add New Analysis icon in the toolbar: Figure 15: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Clustering: Figure 16: Add New Analysis dialog 3 This will bring up the Clustering dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Cluster - INPUT - Data Selection On the Clustering dialog click on INPUT and then click on data selection: Teradata Warehouse Miner User Guide - Volume 3 25 Chapter 1: Analytic Algorithms Cluster Analysis Figure 17: Clustering > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 Select Columns From a Single Table • Available Databases (or Analyses) — All the databases (or analyses) that are available for the Clustering analysis. • Available Tables — All the tables within the Source Database that are available for the Clustering analysis. 
• Available Columns — Within the selected table or matrix, all columns which are available for the Clustering analysis. • Selected Columns — Columns must be of numeric type. Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Cluster - INPUT - Analysis Parameters On the Clustering dialog click on INPUT and then click on analysis parameters: Figure 18: Clustering > Input > Analysis Parameters 26 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Cluster Analysis On this screen select: • Clustering Algorithm • Gaussian — Cluster the data using a Gaussian Mixture Model as described above. This is the default Algorithm. • Poisson — Cluster the data using a Poisson Mixture Model as described above. • K-Means — Cluster the data using the K-Means Model as described above. • Number of clusters — Enter the number of clusters before executing the cluster analysis. • Convergence Criterion — For the Gaussian and Poisson Mixture Models, clustering is stopped when the log-likelihood increases less than this amount. The default value is 0.001. K-Means, on the other hand, does not use this criterion as clustering stops when the distances of all points to each cluster have not changed from the previous iteration. In other words, when the assignment of rows to clusters has not changed from the previous iteration, clustering has converged. • Maximum Iterations — Clustering is stopped after this maximum number of iterations has occurred. The default value is 50. • Remove Null Values (using Listwise deletion) — This option eliminates all rows from processing that contain any null input columns. The default is enabled. • Include Variable Importance Evaluation reports — Report shows resultant log-likelihood when each variable is successively dropped out of the clustering calculations. The most important variable will be listed next to the most negative log-likelihood value; the least important variable will be listed with the least negative value. • Continue Execution (instead of starting over) — Previous execution results are used as seed values for starting clustering. Cluster - INPUT - Expert Options On the Clustering dialog click on INPUT and then click on expert options: Figure 19: Clustering > Input > Expert Options On this screen select: • Width — Number of variables to process in parallel (dependent on system limits) • Input Sample Fraction — Fraction of input dataset to cluster on. • Scale Factor Exponent — If nonzero “s” is entered, this option overrides automatic scaling, scaling by 10s. • Minimum Probability Exponent — If “e” is entered, the Clustering analysis uses 10e as smallest nonzero number in SQL calculations. Teradata Warehouse Miner User Guide - Volume 3 27 Chapter 1: Analytic Algorithms Cluster Analysis • Minimum Variance Exponent — If “v” is entered, the Clustering analysis uses 10v as the minimum variance in SQL calculations. • Use single cluster covariance — Simplified model that uses the same covariance table for all clusters. • Use Random Seeding — When enabled (default) this option seeds the initial clustering answer matrix by randomly selecting a row for each cluster as the seed. This method is the most commonly used type of seeding for all other clustering systems, according to the literature. 
The byproduct of using this new method is that slightly different solutions will be provided by successive clustering runs, and convergence may be quicker because fewer iterations may be required. • Seed Sample Percentage — If Use Random Seeding is disabled, the previous seeding method of Teradata Warehouse Miner Clustering, where every row is assigned to one of the clusters, and then averages used as the seeds. Enter a percentage (1-100) of the input dataset to use as the starting seed. Cluster - OUTPUT On the Clustering dialog click on OUTPUT: Figure 20: Cluster > OUTPUT On this screen select: • Store the variables table of this analysis in the database — Check this box to store the variables table of this analysis in two tables in the database, one for cluster columns and one for cluster results. • Database Name — The name of the database to create the output tables in. • Output Table Prefix — The prefix of the output tables. (For example, if test is entered here, tables test_ClusterColumns and test_ClusterResults will be created). • Advertise Output — The Advertise Output option "advertises" output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to the “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. By way of an example, the tutorial example with prefix test yields table test_ ClusterResults: 28 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Cluster Analysis Table 4: test_ClusterResults column_ix cluster_id priors m v 1 1 0.0692162138434691 -2231.95933518596 7306685.95957656 1 2 0.403625379654599 -947.132576882845 846532.221977884 1 3 0.527158406501931 -231.599917701351 105775.923364194 2 1 0.0692162138434691 3733.31923440023 18669805.3968291 2 2 0.403625379654599 1293.34863525092 1440668.11504453 2 3 0.527158406501931 231.817911577847 102307.594966697 3 1 0.0692162138434691 3725.87257974281 18930649.6488828 3 2 0.403625379654599 632.603945909026 499736.882919713 3 3 0.527158406501931 163.869611182736 57426.9984808451 and test_ClusterColumns: Table 5: test_ClusterColumns table_name column_name column_alias column_order index_flag variable_type twm_ customer_ analysis avg_cc_bal avg_cc_bal 1 0 1 twm_ customer_ analysis avg_ck_bal avg_ck_bal 2 0 1 twm_ customer_ analysis avg_sv_bal avg_sv_bal 3 0 1 If Database Name is twm_results and Output Table Prefix is test, these tables are defined respectively as: CREATE SET TABLE twm_results.test_ClusterResults ( column_ix INTEGER, cluster_id INTEGER, priors FLOAT, m FLOAT, v FLOAT) UNIQUE PRIMARY INDEX ( column_ix ,cluster_id ); CREATE SET TABLE twm_results.test_ClusterColumns ( table_name VARCHAR(30) CHARACTER SET UNICODE NOT CASESPECIFIC, column_name VARCHAR(30) CHARACTER SET UNICODE NOT CASESPECIFIC, Teradata Warehouse Miner User Guide - Volume 3 29 Chapter 1: Analytic Algorithms Cluster Analysis column_alias VARCHAR(100) CHARACTER SET UNICODE NOT CASESPECIFIC, column_order SMALLINT, index_flag SMALLINT, variable_type INTEGER) UNIQUE PRIMARY INDEX ( table_name ,column_name ); Run the Cluster Analysis After setting 
parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Cluster Analysis The results of running the Cluster analysis include a variety of statistical reports, a similarity/ dissimilarity graph, as well as a cluster size and distance measure graph. All of these results are outlined below. Cluster - RESULTS - reports On the Clustering dialog click on RESULTS and then click on reports: Figure 21: Clustering > Results > Reports Clustering Progress • Iteration — This represents the number of the step in the Expectation Maximization clustering algorithm as it seeks to converge on a solution maximizing the log likelihood function. • Log Likelihood — This is the log likelihood value calculated at the end of this step in the Expectation Maximization clustering algorithm. It does not appear when the K-Means option is used. • Diff — This is simply the difference in the log likelihood value between this and the previous step in the modeling process, starting with 0 at the end of the first step. It does not appear when the K-Means option is used. • Timestamp — This is the day, date, hour, minute and second marking the end of this step in processing. 30 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Cluster Analysis Importance of Variables This report is available when the Include Variable Importance Evaluation Report option is enabled on the Expert Options tab. • Col — The column number in the order the input columns were requested. • Name — Name of the column being clustered. • Log Likelihood — This is the log likelihood value calculated if this variable was removed from the clustering solution. Clustering Solution • Col — This is the column number in the order the input columns were requested. • Table_Name — The name of the table associated with this input column. • Column_Name — The name of the input column used in performing the cluster analysis. • Cluster_Id — The cluster number that this data applies to, from 1 to the number of clusters requested. • Weight — This is the so-called “prior probability” that an observation would belong to this cluster, based on the percentage of observations belonging to this cluster at this stage. • Mean — When the Gaussian Mixture Model algorithm is selected, Mean is the weighted average of this column or variable amongst all the observations, where the weight used is the probability of inclusion in this cluster. When the K-Means algorithm is selected, Mean is the average value of this column or variable amongst the observations assigned to this cluster at this iteration of the algorithm. • Variance — When the Gaussian Mixture Model algorithm is selected, Variance is the weighted variance of this variable amongst all the observations, where the weight used is the probability of inclusion in this cluster. When the K-Means algorithm is selected, Variance is the variance of this variable amongst the observations assigned to this cluster at this iteration. (Variance is the square of a variable’s standard deviation, measuring in some sense how its value varies from one observation to the next). 
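The Weight, Mean and Variance values in the Clustering Solution report are the same quantities saved in the cluster results metadata, so together they determine the probability of membership of a new observation in each cluster under the independent-variable Gaussian model. As an informal illustration only (the analysis itself scores by generating SQL, and the function name, data structure and numbers below are hypothetical), the calculation could be sketched as follows.

```python
# Illustrative sketch: using a saved clustering solution (prior, mean and variance
# per column and cluster, as reported in the Clustering Solution report) to compute
# probabilities of cluster membership for one row. Names and numbers are hypothetical;
# the product performs scoring by generating SQL.
import math

def membership_probabilities(row, solution):
    """row: {column_name: value}; solution: {cluster_id: {"prior": w, "params": {column: (mean, variance)}}}."""
    log_weighted = {}
    for cluster_id, cluster in solution.items():
        # log of: prior * product over columns of the univariate Gaussian density
        log_p = math.log(cluster["prior"])
        for column, (mean, variance) in cluster["params"].items():
            x = row[column]
            log_p += -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)
        log_weighted[cluster_id] = log_p
    # Normalize so the membership probabilities across clusters sum to 1.
    max_log = max(log_weighted.values())
    expd = {k: math.exp(v - max_log) for k, v in log_weighted.items()}
    total = sum(expd.values())
    return {k: v / total for k, v in expd.items()}

# Hypothetical two-cluster, two-variable solution (values are for illustration only).
solution = {
    1: {"prior": 0.4, "params": {"avg_cc_bal": (-1900.0, 3.5e6), "avg_ck_bal": (2200.0, 9.7e6)}},
    2: {"prior": 0.6, "params": {"avg_cc_bal": (-320.0, 1.8e5), "avg_ck_bal": (570.0, 6.6e5)}},
}
print(membership_probabilities({"avg_cc_bal": -500.0, "avg_ck_bal": 800.0}, solution))
```

In this sketch the "prior" entries correspond to the Weight column of the report, and the per-column mean and variance pairs correspond to the Mean and Variance columns.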
Cluster - RESULTS - sizes graph On the Clustering dialog click on RESULTS and then click on sizes graph: Figure 22: Clustering > Results > Sizes Graph The Sizes (and Distances) graph plots the mean values of a pair of variables at a time, indicating the clusters by color and number label, and the standard deviations (square root of the variance) by the size of the ellipse surrounding the mean point, using the same colorcoding. Roughly speaking, this graph depicts the separation of the clusters with respect to pairs of model variables. The following options are available: Teradata Warehouse Miner User Guide - Volume 3 31 Chapter 1: Analytic Algorithms Cluster Analysis • Non-Normalized — The default value to show the clusters without any normalization. • Normalized — With the Normalized option, cluster means are divided by the largest absolute mean and the size of the circle based on the variance is divided by the largest absolute variance. • Variables • Available — The variables that were input into the Clustering Analysis. • Selected — The variables that will be shown on the Size and Distances graph. Two variables are required to be entered here. • Clusters • Available — A list of clusters generated in the clustering solution. • Selected — The clusters that will be shown on the Size and Distances graph. Up to twelve clusters can be selected to be shown on the Size and Distances graph. • Zoom In — While holding down the left mouse button on the Size and Distances graph, drag a lasso around the area that you desire to magnify. Release the mouse button for the zoom to take place. This can be repeated until the desired level of magnification is achieved. • Zoom Out — Hit the “Z” key, or toggle the Graph Options tab to go back to the original magnification level. Cluster - RESULTS - similarity graph On the Clustering dialog click on RESULTS and then click on similarity graph: Figure 23: Clustering > Results > Similarity Graph The Similarity graph allows plotting the means and variances of up to twelve clusters and twelve variables at one time. The cluster means, i.e. the mean values of the variables for the data points assigned to the cluster, are displayed with values varying along the x-axis. A different line parallel to the x-axis is used for each variable. The normalized variances are displayed for each variable by color-coding, and the clusters are identified by number next to the point graphed. Roughly speaking, the more spread out the points on the graph, the more differentiated the clusters are. The following options are available: • Non-Normalized — The default value to show the clusters without any normalization. • Normalized — With the Normalized option, the cluster mean is divided by the largest absolute mean. • Variables • 32 Available — The variables that were input into the Clustering Analysis. Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Cluster Analysis • Selected — The variables that will be shown on the Similarity graph. Up to twelve variables can be entered here. selected to be shown on the Similarity graph • Clusters • Available — A list of clusters generated in the clustering solution. • Selected — The clusters that will be shown on the Similarity graph. Up to twelve clusters can be selected to be shown on the Similarity graph. Tutorial - Cluster Analysis In this example, Gaussian Mixture Model cluster analysis is performed on 3 variables giving the average credit, checking and savings balances of customers, yielding a requested 3 clusters. 
Note that since Clustering in Teradata Warehouse Miner is non-deterministic, the results may vary from these, or from execution to execution. Parameterize a Cluster analysis as follows: • Selected Tables and Columns • twm_customer_analysis.avg_cc_bal • twm_customer_analysis.avg_ck_bal • twm_customer_analysis.avg_sv_bal • Number of Clusters — 3 • Algorithm — Gaussian Mixture Model • Convergence Criterion — 0.1 • Use Listwise deletion to eliminate null values — Enabled Run the analysis and click on Results when it completes. For this example, the Clustering Analysis generated the following pages. Note that since Clustering is non-deterministic, results may vary. A single click on each page name populates the page with the item. Table 6: Progress Iteration Log Likelihood Diff Timestamp 1 -25.63 0 3:05 PM 2 -25.17 .46 3:05 PM 3 -24.89 .27 3:05 PM 4 -24.67 .21 3:05 PM 5 -24.42 .24 3:05 PM 6 -24.33 .09 3:06 PM Teradata Warehouse Miner User Guide - Volume 3 33 Chapter 1: Analytic Algorithms Cluster Analysis Table 7: Solution Col Table_Name Column_Name Cluster_Id Weight Mean Variance 1 twm_customer_analysis avg_cc_bal 1 .175 -1935.576 3535133.504 2 twm_customer_analysis avg_ck_bal 1 .175 2196.395 9698027.496 3 twm_customer_analysis avg_sv_bal 1 .175 674.72 825983.51 1 twm_customer_analysis avg_cc_bal 2 .125 -746.095 770621.296 2 twm_customer_analysis avg_ck_bal 2 .125 948.943 1984536.299 3 twm_customer_analysis avg_sv_bal 2 .125 2793.892 11219857.457 1 twm_customer_analysis avg_cc_bal 3 .699 -323.418 175890.376 2 twm_customer_analysis avg_ck_bal 3 .699 570.259 661100.56 3 twm_customer_analysis avg_sv_bal 3 .699 187.507 63863.503 Sizes Graph By default, the following graph will be displayed. This parameterization includes: • Non-Normalized — Enabled • Variables Selected • avg_cc_bal • avg_ck_bal • Clusters Selected 34 • Cluster 1 • Cluster 2 • Cluster 3 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Cluster Analysis Figure 24: Clustering Analysis Tutorial: Sizes Graph Similarity Graph By default, the following graph will be displayed. This parameterization includes: • Non-Normalized — Enabled • Variables Selected • avg_cc_bal • avg_ck_bal • avg_sv_bal • Clusters Selected • Cluster 1 • Cluster 2 • Cluster 3 Teradata Warehouse Miner User Guide - Volume 3 35 Chapter 1: Analytic Algorithms Decision Trees Figure 25: Clustering Analysis Tutorial: Similarity Graph Decision Trees Overview Decision tree models are most commonly used for classification. What is a classification model or classifier? It is simply a model for predicting a categorical variable, that is a variable that assumes one of a predetermined set of values. These values can be either nominal or ordinal, though ordinal variables are typically treated the same as nominal ones in these models. (An example of a nominal variable is single, married and divorced marital status, while an example of an ordinal or ordered variable is low, medium and high temperature). It is the ability of decision trees to not only predict the value of a categorical variable, but to directly use categorical variables as input or predictor variables that is perhaps their principal advantage. Decision trees are by their very nature also well suited to deal with large numbers of input variables, handle a mixture of data types and handle data that is not homogeneous, i.e. the variables do not have the same interrelationships throughout the data space. 
They also provide insight into the structure of the data space and the meaning of a model, a result at times as important as the accuracy of a model. It should be noted that a variation of decision trees called regression trees can be used to build regression models rather than classification models, enjoying the same benefits just described. Most of the upcoming discussion is geared toward classification trees with regression trees described separately. What are Decision Trees? What does a decision tree model look like? It first of all has a root node, which is associated with all of the data in the training set used to build the tree. Each node in the tree is either a decision node or a leaf node, which has no further connected nodes. A decision node 36 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees represents a split in the data based on the values of a single input or predictor variable. A leaf node represents a subset of the data that has a particular value of the predicted variable, i.e. the resulting class of the predicted variable. A measure of accuracy is also associated with the leaf nodes of the tree. The first issue in building a tree is the decision as to how data should be split at each decision node in the tree. The second issue is when to stop splitting each decision node and make it a leaf. And finally, what class should be assigned to each leaf node. In practice, researchers have found that it is usually best to let a tree grow as big as it needs to and then prune it back at the end to reduce its complexity and increase its interpretability. Once a decision tree model is built it can be used to score or classify new data. If the new data includes the values of the predicted variable it can be used to measure the effectiveness of the model. Typically though scoring is performed in order to create a new table containing key fields and the predicted value or class identifier. Decision Trees in Teradata Warehouse Miner Teradata Warehouse Miner provides decision trees for classification models and regression models. They are built largely on the techniques described in [Breiman, Friedman, Olshen and Stone] and [Quinlan]. As such, splits using the Gini diversity index, regression or information gain ratio are provided. Pruning is also provided, using either the Gini diversity index or gain ratio technique. In addition to a summary report, a graphical tree browser is provided when a model is built, displaying the model either as a tree or a set of rules. Finally, a scoring function is provided to score and/or evaluate a decision tree model. The scoring function can also be used to simply generate the scoring SQL for later use. A number of additional options are provided when building or scoring a decision tree model. One of these options is whether or not to bin numeric variables during the tree building process. Another involves including recalculated confidence measures at each leaf node in a tree based on a validation table, supplementing confidence measures based on the training data used to build the tree. Finally, at the time of scoring, a table profiling the leaf nodes in the tree can be requested, at the same time each scored row is linked with a leaf node and corresponding rule set. Decision Tree SQL Generation A key part to the design of the Teradata Warehouse Miner Decision Trees is SQL generation. 
In order to avoid having to extract all of the data from the RDBMS, the product generates SQL statements that return sufficient statistics. Before model building begins, SQL is generated to give a better understanding of the attributes and the predicted variable. For each attribute, the algorithm must determine its cardinality and obtain all possible values of the predicted variable, with their associated counts, from all of the observations. This information helps to initialize structures in memory for later use in the building process.

The driving SQL behind the entire building process is a statement that builds a contingency table from the data. A contingency table is an m x n matrix with m rows corresponding to the distinct values of an attribute and n columns corresponding to the distinct values of the predicted variable. The Teradata Warehouse Miner Decision Tree algorithms can quickly generate the contingency table over massive numbers of data rows and columns. This contingency table query allows the program to gather the sufficient statistics needed for the algorithms to do their calculations. Since the query returns the counts of the N distinct values of the dependent variable, a WHERE clause is simply added to this SQL when building a contingency table on a subset of the data rather than on the whole table. The WHERE clause expression defines the subset of data, which is the path down the tree identifying the node that is a candidate to be split.

Each type of decision tree uses a different method to compute which attribute is the best choice to split a given subset of data upon. Each type of decision tree is considered in turn in what follows. In the course of describing each algorithm, the following notation is used:
1 t denotes a node
2 j denotes the learning classes
3 J denotes the number of classes
4 s denotes a split
5 N(t) denotes the number of cases within a node t
6 p(j|t) is the proportion of class j learning samples in node t
7 \varphi denotes an impurity function, a symmetric function with maximum value at (1/J, 1/J, \ldots, 1/J) and with \varphi(1, 0, \ldots, 0) = \varphi(0, 1, \ldots, 0) = \cdots = \varphi(0, 0, \ldots, 1) = 0
8 t_i denotes a subnode i of t
9 i(t) denotes the node impurity measure
10 t_L and t_R are the left and right split nodes of t

Splitting on Information Gain Ratio

Information theory is the basic underlying idea in this type of decision tree. Splits on categorical variables are made on each individual value. Splits on continuous variables are made at one point in an ordered list of the actual values, that is, a binary split is introduced right on a particular value.
• Define the "info" at node t as the entropy:

\text{info}(t) = -\sum_{j} p(j|t) \log_{2} p(j|t)

• Suppose t is split into subnodes t_1, t_2, \ldots by predictor X. Define:

\text{info}_{X}(t) = \sum_{i} \frac{N(t_i)}{N(t)} \, \text{info}(t_i)

\text{Gain}(X) = \text{info}(t) - \text{info}_{X}(t)

\text{Split info}(X) = -\sum_{i} \frac{N(t_i)}{N(t)} \log_{2} \frac{N(t_i)}{N(t)}

\text{Gain ratio}(X) = \frac{\text{Gain}(X)}{\text{Split info}(X)}

Once the gain ratios have been computed, the attribute with the highest gain ratio is used to split the data. Each subset then goes through this process until the observations are all of one class or a stopping criterion is met, such as the requirement that each node contain at least 2 observations. For a detailed description of this type of decision tree see [Quinlan].
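A contingency table of the kind described above contains everything needed to evaluate a candidate split. The following sketch, using hypothetical function names, shows how the gain ratio formulas above could be computed from such a table; it is an illustration only and not the SQL that Teradata Warehouse Miner generates.

```python
# Illustrative sketch: information gain ratio of a candidate split, computed from a
# contingency table whose rows are attribute values and whose columns are classes.
import numpy as np

def entropy(class_counts):
    """info(t) = -sum_j p(j|t) * log2 p(j|t), ignoring empty classes."""
    class_counts = np.asarray(class_counts, dtype=float)
    total = class_counts.sum()
    p = class_counts[class_counts > 0] / total
    return float(-(p * np.log2(p)).sum())

def gain_ratio(contingency):
    """contingency[i, j] = count of class j among rows with attribute value i."""
    contingency = np.asarray(contingency, dtype=float)
    n_t = contingency.sum()                        # N(t)
    info_t = entropy(contingency.sum(axis=0))      # info(t) over all classes
    n_ti = contingency.sum(axis=1)                 # N(t_i) per attribute value
    weights = n_ti / n_t
    info_x = sum(w * entropy(row) for w, row in zip(weights, contingency))
    gain = info_t - info_x                         # Gain(X)
    split_info = entropy(n_ti)                     # -sum_i (N(t_i)/N(t)) log2(N(t_i)/N(t))
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical example: a 3-valued attribute versus a binary dependent variable.
table = [[40, 10],
         [15, 15],
         [ 5, 15]]
print(round(gain_ratio(table), 4))
```

Dividing the gain by the split info penalizes attributes with many distinct values, which would otherwise tend to look artificially attractive under raw information gain.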
Splitting on Gini Diversity Index

Node impurity is the idea behind Gini diversity index split selection. Node impurity is measured with an impurity function of the class proportions at the node:

i(t) = \varphi\big(p(1|t), p(2|t), \ldots, p(J|t)\big)

Maximum impurity arises when the classes to be predicted are equally distributed. As in the heads and tails example, impurity is highest if half the total is heads and the other half is tails. On the other hand, if there were only tails in a certain sample, the impurity would be 0. The Gini index uses the following formula for its calculation of impurity:

i(t) = 1 - \sum_{j} p(j|t)^{2}

For a determination of the goodness of a split, the following formula is used:

\Delta i(s, t) = i(t) - p_{L} \, i(t_{L}) - p_{R} \, i(t_{R})

where t_L and t_R are the left and right subnodes of t, and p_L and p_R are the probabilities of being in those subnodes. For a detailed description of this type of tree see [Breiman, Friedman, Olshen and Stone].

Regression Trees

Teradata Warehouse Miner provides regression tree models that are built largely on the techniques described in [Breiman, Friedman, Olshen and Stone]. Like classification trees, regression trees utilize SQL in order to extract only the necessary information from the RDBMS instead of extracting all the data from the table. An m x 3 table is returned from the database, with m rows corresponding to the distinct values of an attribute, each followed by the SUM and SQUARED SUM of the predicted variable and the total number of rows having that attribute value. Using the formula:

\sum_{n} (y_{n} - \bar{y})^{2}

the sum of squares for any particular node, starting with the root node containing all the data, is calculated first. The regression tree is built by iteratively splitting nodes, picking for each node the split that maximizes the decrease in the within-node sum of squares of the tree. Splitting stops if the minimum number of observations in a node is reached or if all of the predicted variable values are the same. The value predicted for a leaf node is simply the average of all the predicted variable values that fall into that leaf during model building.

Chaid Trees

CHAID trees utilize the chi-squared significance test as a means of partitioning data. Independent variables are tested by looping through the values and merging categories that have the least significant difference from one another and are still below the merging significance level parameter (default .05). Once all independent variables have been optimally merged, the one with the highest significance is chosen for the split, the data is subdivided, and the process is repeated on the subsets of the data. The splitting stops when the significance goes above the splitting significance level (default .05). For a detailed description of this type of tree see [Kass].

Decision Tree Pruning

With algorithms such as those described above, a model often overfits the data. One way of correcting this is to prune the model from the leaves up: wherever combining leaves does not increase the error rate, they are joined into a new leaf. A simple example may be given as follows. If there is nothing but random data for the attributes and the class is set to predict "heads" 75% of the time and "tails" 25% of the time, the result will be an overfit model that doesn't predict the outcome well.
Just by looking it can be seen that instead of a built up model with many leaves, the model could just predict “heads” and it would be correct 75% of the time, whereas over fitting usually does much worse in such a case. Teradata Warehouse Miner provides pruning according to the gain ratio and Gini diversity index pruning techniques. It is possible to combine different splitting and pruning techniques, however when pruning a regression tree the Gini diversity index technique must be used. Decision Trees and NULL Values NULL values are handled by listwise deletion. This means that if there are NULL values in any variables (independent and dependent) then that row where a NULL exists will be removed from the model building process. NULL values in scoring, however, are handled differently. Unlike in tree building where listwise deletion is used, scoring can sometimes handle rows that have NULL values in some of the independent variables. The only time a row will not get scored is if a decision node that the row is being tested on has a NULL value for that decision. For instance, if the first split in a tree is “age < 50,” only rows that don’t have a NULL value for age will pass down further in the tree. This row could have a NULL value in the income variable. But since this decision is on age, the NULL will have no impact at this split and the row will continue down the 40 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees branches until a leaf is reached or it has a NULL value in a variable used in another decision node. Initiate a Decision Tree Analysis Use the following procedure to initiate a new Decision Tree analysis in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 26: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Decision Tree: Figure 27: Add New Analysis dialog 3 This will bring up the Decision Tree dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Decision Tree - INPUT - Data Selection On the Decision Tree dialog click on INPUT and then click on data selection: Teradata Warehouse Miner User Guide - Volume 3 41 Chapter 1: Analytic Algorithms Decision Trees Figure 28: Decision Tree > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 
2 Select Columns From a Single Table • Available Databases (or Analyses) — All the databases (or analyses) that are available for the Decision Tree analysis. • Available Tables — All the tables that are available for the Decision Tree analysis. • Available Columns — Within the selected table or matrix, all columns that are available for the Decision Tree analysis. • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can either insert columns as Dependent or Independent columns. Make sure you have the correct portion of the window highlighted. 42 • Independent — These may be of numeric or character type. • Dependent — The dependent variable column is the column whose value is being predicted. It is selected from the Available Variables in the selected table. When Gain Ratio or Gini Index are selected as the Tree Splitting criteria, this is treated as a categorical variable with distinct values, in keeping with the nature of classification trees. Note that in this case an error will occur if the Dependent Variable has more than 50 distinct values. When Regression Trees is selected as the Tree Splitting criteria, this is treated as a continuous variable. In this case it must contain only numeric values. Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees Decision Tree - INPUT - Analysis Parameters On the Decision Tree dialog click on INPUT and then click on analysis parameters: Figure 29: Decision Tree > Input > Analysis Parameters On this screen select: • Splitting Options • Splitting Method • Gain Ratio — Option to use the Gain Ratio splitting criteria. • Gini Index — Option to use the Gini Index splitting criteria. • Chaid — Option to use the Chaid splitting criteria. When using this option you are also given the opportunity to change the merging or splitting Chaid Significance Levels. • Regression Trees — Option to use the Regression splitting criteria as outlined above. • Minimum Split Count — This option determines how far the splitting of the decision tree will go. Unless a node is pure (meaning it has only observations with the same dependent value) it will split if each branch that can come off this node will contain at least this many observations. The default is a minimum of 2 cases for each branch. • Maximum Nodes — If the nodes in the tree are equal to or exceed this value while splitting a certain level of the tree, the algorithm stops the tree growing after completing this level and returns the tree built so far. The default is 10000 nodes. • Maximum Depth — Another method of stopping the tree is to specify the maximum depth the tree may grow to. This option will stop the algorithm if the tree being built has this many levels. The default is 100 levels. • Chaid Significance Levels • • Merging — Independent variables are tested by looping through the values and merging categories that have the least significant difference from one another and also are still below this merging significance level parameter (default .05). • Splitting — Once all independent variables have been optimally merged the one with the highest significance is chosen for the split, the data is subdivided, and the process is repeated on the subsets of the data. 
The splitting stops when the significance goes above this splitting significance level parameter (default .05). Bin Numeric Variables — Option to automatically Bincode the continuous independent variables. Continuous data is separated into one hundred bins when this option is selected. If the variable has less than one hundred distinct values, this option is ignored. Teradata Warehouse Miner User Guide - Volume 3 43 Chapter 1: Analytic Algorithms Decision Trees • • Include Validation Table — A supplementary table may be utilized in the modeling process to validate the effectiveness of the model on a separate set of observations. If specified, this table is used to calculate a second set of confidence or targeted confidence factors. These recalculated confidence factors may be viewed in the tree browser and/or added to the scored table when scoring the resultant model. When Include Validation Table is selected, a separate validation table is required. • Database — The name of the database to look in for the validation table - by default, this is the source database. • Table — The name of the validation table to use for recalculating confidence or targeted confidence factors. Include Lift Table — Option to generate a Cumulative Lift Table in the report to demonstrates how effective the model is in estimating the dependent variable. Valid for binary dependent variables only. • Response Value — An optional response value can be specified for the dependent variable that will represent the response value. Note that all other dependent variable values will be considered a non-response value. Values — Bring up the Decision Tree values wizard to help in specifying the response value. • Pruning Options • • Pruning Method — Pull-down list with the following values: • Gain Ratio — Option to use the Gain Ratio pruning criteria as outlined above. • Gini Index — Option to use the Gini Index pruning criteria as outlined above. • None — Option to not prune the resultant decision tree. Gini Test Table — When Gini Index pruning is selected as the pruning method, a separate Test table is required. • Database — The name of the database to look for the Test table - by default, this is the source database. • Table — The name of the table to use for test purposes during the Gini Pruning process. Decision Tree - INPUT - Expert Options On the Decision Tree dialog click on INPUT and then click on expert options: Figure 30: Decision Tree > Input > Expert Options • Performance 44 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees • Maximum amount of data for in-memory processing — By default, 2 MB of data can be processed in memory for the tree. This can be increased here. For smaller data sets, this option may be preferable over the SQL version of the decision tree. Run the Decision Tree Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Decision Tree The results of running the Decision Tree analysis include a variety of statistical reports as well as a Graphic and Textual Tree browser. All of these results are outlined below. Decision Tree Reports • Total observations — This is the number of observations in the training data set used to build the tree. 
More precisely, this is the number of rows in the input table after any rows have been excluded for containing a null value in a column selected as an independent or dependent variable. • Nodes before pruning — This is the number of nodes in the tree, including the root node, before it is pruned back in the second stage of the tree-building process. • Nodes after pruning — This is the number of nodes in the tree, including the root node, after it is pruned back in the second stage of the tree-building process. • Total nodes — This is the number of nodes in the tree, including the root node, when either pruning is not requested or doesn’t remove any nodes. • Model Accuracy — This is the percentage of observations in the training data set that the tree accurately predicts the value of the dependent variable for. Variables • Independent Variables — A list of all the independent variables that made it into the decision tree model. • Dependent Variable — The dependent variable that the tree was built to predict. Confusion Matrix A N x (N+2) (for N outcomes of the dependent variable) confusion matrix is given with the following format: Teradata Warehouse Miner User Guide - Volume 3 45 Chapter 1: Analytic Algorithms Decision Trees Table 8: Confusion Matrix Format Actual ‘0’ Actual ‘1’ … Actual ‘N’ Correct Incorrect Predicted ‘0’ # correct ‘0’ Predictions # incorrect‘1’ Predictions … # incorrect ‘N’ Predictions Total Correct ‘0’ Predictions Total Incorrect ‘0’ Predictions Predicted ‘1’ # incorrect‘0’ Predictions # correct ‘1’ Predictions … # incorrect ‘N’ Predictions Total Correct ‘1’ Predictions Total Incorrect ‘1’ Predictions … … … … … … … Predicted ‘N’ # incorrect‘0’ Predictions # incorrect ‘1’ Predictions … # correct ‘N’ Predictions Total Correct ‘N’ Predictions Total Incorrect ‘N’ Predictions Validation Matrix When the Include validation table option is selected, a validation matrix similar to the confusion matrix is produced based on the data in the validation table rather than the input table. Cumulative Lift Table The Cumulative Lift Table demonstrates how effective the model is in estimating the dependent variable. It is produced using deciles based on the probability values. Note that the deciles are labeled such that 1 is the highest decile and 10 is the lowest, based on the probability values calculated by logistic regression. The information in this report however is best viewed in the Lift Chart produced as a graph. Note that this is only valid for binary dependent variables. • Decile — The deciles in the report are based on the probability values predicted by the model. Note that 1 is the highest decile and 10 is the lowest. That is, decile 1 contains data on the 10% of the observations with the highest estimated probabilities that the dependent variable is 1. • Count — This column contains the count of observations in the decile. • Response — This column contains the count of observations in the decile where the actual value of the dependent variable is 1. • Pct Response — This column contains the percentage of observations in the decile where the actual value of the dependent variable is 1. • Pct Captured Response — This column contains the percentage of responses in the decile over all the responses in any decile. • Lift — The lift value is the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations. 
For example, if 10% of the observations overall have a dependent variable with value 1, and 20% of the observations in decile 1 have a dependent variable with value 1, then the lift value within decile 1 is 2.0, meaning that the model gives a “lift” that is better than chance alone by a factor of two in predicting response values of 1 within this decile. 46 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees • Cumulative Response — This is a cumulative measure of Response, from decile 1 to this decile. • Cumulative Pct Response — This is a cumulative measure of Pct Response, from decile 1 to this decile. • Cumulative Pct Captured Response — This is a cumulative measure of Pct Captured Response, from decile 1 to this decile. • Cumulative Lift — This is a cumulative measure of Lift, from decile 1 to this decile. Decision Tree Graphs The Decision Tree Analysis can display either a graphical and textual representation of the decision tree model, as well as a lift chart. Options are available to display decisions for any node in the graphical or textual tree, as well as the counts and distribution of the dependent variable. Additionally, manual pruning of the decision tree model is supported. Tree Browser Figure 31: Tree Browser When Tree Browser is selected, two frames are shown: the upper frame gives a condensed view to aid in navigating through the detailed tree in the lower frame. Set options by rightclicking on either frame to select from the following menu: • Small Navigation Tree — Under Small Navigation Tree, the options are: Teradata Warehouse Miner User Guide - Volume 3 47 Chapter 1: Analytic Algorithms Decision Trees Figure 32: Tree Browser menu: Small Navigation Tree • Zoom — This option allows you to scale down the navigation tree so that more of it will appear within the window. A slider bar is provided so you can select from a range of new sizes while previewing the effect on the navigation tree. The slider bar can also be used to bring the navigation tree back up to a larger dimension after it has been reduced in size: Figure 33: Tree Browser menu: Zoom Tree • Show Extents Box/Hide Extents Box — With this option a box is drawn around the nodes in the upper frame corresponding to the nodes displayed in the lower frame. The box can be dragged and dropped over segments of the small tree, automatically positioning the identical area in the detailed tree within the lower frame. Once set, the option changes to allow hiding the box. • Hide Navigation Tree/Show Navigation Tree — With this option the upper frame is made to disappear (or reappear) in order to give more room to the lower frame that contains the details of the tree. • Show Confidence Factors/Show Targeted Confidence — The Confidence Factor is a measure of how “confident” the model is that it can predict the correct score for a record that falls into a particular leaf node based on the training data the model was built from. For example, if a leaf node contained 10 observations and 9 of them predict Buy and the other record predicts Do Not Buy, then the model built will have a confidence factor of .9, or 90% sure of predicting the right value for a record that falls into that leaf node of the model. Models built with a predicted variable that has only 2 outcomes can display a Targeted Confidence value rather than a confidence factor. 
If the outcomes were 9 Buys and 1 Do Not Buy at a particular node and if the target value was set to Buy, .9 is the targeted 48 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees confidence. However if it is desired to target the Do Not Buy outcome by setting the value to Do Not Buy, then any record falling into this leaf of the tree would get a targeted confidence of .1 or 10%. This option also controls whether Recalculated Confidence Factors or Recalculated Targeted Confidence factors are displayed in the case when the Include validation table option is selected. • Node Detail — The Node Detail feature can be used to copy the entire rule set for a particular node to the Windows Clipboard for use in other applications. • Print Figure 34: Tree Browser menu: Print • Large Tree — Allows you to print the entire tree diagram. This will be printed in pages, with the total number of pages reported before they are printed. (A page will also be printed showing how the tree was mapped into individual pages). If All Pages is selected the entire tree will be printed, across multiple pages if necessary. If Current Browser Page is selected then only that portion of the tree which is viewable will be printed in WYSIWYG fashion. • Small Tree — The entire navigation tree, showing the overall structure of the tree diagram without node labels or statistics, can be printed in pages. (The fewest possible pages will be printed if the navigation tree is reduced as small as possible before printing the small tree). The total number of pages needed to print the smaller tree will be reported before they are sent to the printer). • Save — Currently, the Tree Browser only supports the creation of Bitmaps. If Tree Text is currently selected, the entire tree will be saved. If Tree Browser is selected, only the portion of the tree that is viewable will be saved in WYSIWYG fashion. The lower frame shows the details of the decision tree in a graphical manner. The graphical representation of the tree consists of the following objects: • Root Node — The box at the top of the tree shows the total number of observations or rows used in building the tree after any rows have been removed for containing null values. • Intermediate Node — The boxes representing intermediate nodes in the tree contain the following information. • Decision — Condition under which data passes through this node. • N — Count of number of observations or rows passing through this node. • % — Percentage of observations or rows passing through this node. • Leaf Node — The boxes representing leaf nodes in the tree contain the following information. Teradata Warehouse Miner User Guide - Volume 3 49 Chapter 1: Analytic Algorithms Decision Trees • Decision — Condition under which data passes to this node. • N — Count of number of observations or rows passing to this node. • % — Percentage of observations or rows passing to this node. • CF — Confidence factor • TF — Targeted confidence factor, alternative to CF display • RCF — Recalculated confidence factor based on validation table (if requested) • RTF — Recalculated targeted confidence factor based on validation table (if requested) Text Tree When Tree Text is selected, the diagram represents the decisions made by the tree as a hierarchical structure of rules as follows: Figure 35: Text Tree The first rule corresponds to the root node of the tree. 
The rules corresponding to leaves in the tree are distinguished by an arrow drawn as ‘-->’, followed by a predicted value of the dependent variable. Rules List On both the Tree Browser and Text Tree, passing the mouse over a node or rule results in a hyperlink indication. When Rules List is enabled, clicking on the hyperlink results in a popup displaying all rules leading to that node or decision as follows: 50 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees Figure 36: Rules List Note that the Node Detail, as described above, can be used to copy the Rules List to the Windows Clipboard. Counts and Distributions On both the Tree Browser and Text Tree, passing the mouse over a node or rule results in a hyperlink indication. When Counts and Distributions is enabled, clicking on the hyperlink results in a pop-up displaying the Count/Distribution of the dependent variable at that node as follows. Note that the Counts and Distribution option is only enabled when the dependent variable is multinomial. For regression trees this is not valid, and it is shown directly on the node or rule for binary trees. Figure 37: Counts and Distributions Note that the Node Detail, as described above, can be used to copy the Counts and Distribution list to the Windows Clipboard. Teradata Warehouse Miner User Guide - Volume 3 51 Chapter 1: Analytic Algorithms Decision Trees Tree Pruning On both the Tree Browser and Text Tree, passing the mouse over a node or rule results in a hyperlink indication. When Tree Pruning is enabled, the following menu appears: Figure 38: Tree Pruning menu Clicking on a node or rule highlights the node and all subnodes, indicating which portion of the tree will be pruned. Additionally, the Prune Selected Branch option becomes enabled as follows: Figure 39: Tree Pruning Menu > Prune Selected Branch Clicking on Prune Selected Branch will convert the highlighted node to a leaf node, and all subnode will disappear. When this is done, the other two Tree Pruning options become enabled: Figure 40: Tree Pruning menu (All Options Enabled) Click on Undo Last Prune, to revert back to the original tree, or the previously pruned tree if Prune Selected Branch was done multiple times. Click on Save Pruned Tree to save the tree to XML. This will be saved in metadata and can be rescored in a future release. 52 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees After a tree is manually pruned and saved to metadata using the Save Pruned Tree option, it can be reopened and viewed in the Tree Browser and, if desired, pruned further. (All additional prunes must be re-saved to metadata). A previously pruned tree will be labeled to distinguish it from a tree that has not been manually pruned: Figure 41: Decision Tree Graph: Previously Pruned Tree “More >>” On both the Tree Browser and Text Tree, if Gini Index has been selected for Tree Splitting, large surrogate splits may occur. If a surrogate split is proceeded by “more >>”, the entire surrogate split can be displayed in a separate pop-up screen by clicking on the node and/or rule as follows: Figure 42: Decision Tree Graph: Predicate Lift Chart This graph displays the statistic in the Cumulative Lift Table, with the following options: • Non-Cumulative • % Response — This column contains the percentage of observations in the decile where the actual value of the dependent variable is 1. 
• % Captured Response — This column contains the percentage of responses in the decile over all the responses in any decile. • Lift — The lift value is the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations. For example, if 10% of the observations overall have a dependent variable with value 1, and 20% of the observations in decile 1 have a dependent variable with value 1, then the lift value within decile 1 is 2.0, meaning that the model gives a “lift” that is better than chance alone by a factor of two in predicting response values of 1 within this decile. • Cumulative Teradata Warehouse Miner User Guide - Volume 3 53 Chapter 1: Analytic Algorithms Decision Trees • % Response — This is a cumulative measure of the percentage of observations in the decile where the actual value of the dependent variable is 1, from decile 1 to this decile. • % Captured Response — This is a cumulative measure of the percentage of responses in the decile over all the responses in any decile, from decile 1 to this decile. • Cumulative Lift — This is a cumulative measure of the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations, from decile 1 to this decile. Any combination of options can be displayed as follows: Figure 43: Decision Tree Graph: Lift Tutorial - Decision Tree In this example a standard Gain Ratio tree was built to predict credit card ownership ccacct based on 20 numeric and categorical input variables. Notice that the tree initially built contained 100 nodes but was pruned back to only 11, counting the root node. This yielded not only a relatively simple tree structure, but also Model Accuracy of 95.72% on this training data. Parameterize a Decision Tree as follows: • Available Tables — twm_customer_analysis • Dependent Variable — ccacct • Independent Variables 54 • income • age • years_with_bank Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees • nbr_children • gender • marital_status • city_name • state_code • female • single • married • separated • ckacct • svacct • avg_ck_bal • avg_sv_bal • avg_ck_tran_amt • avg_ck_tran_cnt • avg_sv_tran_amt • avg_sv_tran_cnt • Tree Splitting — Gain Ratio • Minimum Split Count — 2 • Maximum Nodes — 1000 • Maximum Depth — 10 • Bin Numeric Variables — Disabled • Pruning Method — Gain Ratio • Include Lift Table — Enabled • Response Value — 1 Run the analysis and click on Results when it completes. For this example, the Decision Tree Analysis generated the following pages. A single click on each page name populates the page with the item. 
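As an aside, cumulative lift figures like those in the Cumulative Lift Table shown below can be reproduced outside the product from a table of scored probabilities. The following is a minimal pandas sketch using synthetic data; the scored data frame, its column names and the random values are illustrative assumptions only, not the product's scoring output.

```python
import numpy as np
import pandas as pd

# Hypothetical scored data: one row per observation, with the model's estimated
# probability that the dependent variable equals 1 and the actual value.
rng = np.random.default_rng(0)
prob = rng.random(1000)
scored = pd.DataFrame({"prob": prob,
                       "actual": (rng.random(1000) < prob).astype(int)})

# Decile 1 = highest estimated probabilities, decile 10 = lowest,
# matching the labeling convention described above.
scored["decile"] = pd.qcut(scored["prob"].rank(method="first", ascending=False),
                           10, labels=range(1, 11))

overall_rate = scored["actual"].mean()   # expected response over all observations
lift = (scored.groupby("decile", observed=True)
              .agg(count=("actual", "size"), response=("actual", "sum")))
lift["pct_response"] = 100.0 * lift["response"] / lift["count"]
lift["pct_captured"] = 100.0 * lift["response"] / lift["response"].sum()
lift["lift"] = (lift["pct_response"] / 100.0) / overall_rate
lift["cum_lift"] = (lift["response"].cumsum() / lift["count"].cumsum()) / overall_rate
print(lift)
```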
Table 9: Decision Tree Report Total observations 747 Nodes before pruning 33 Nodes after pruning 11 Model Accuracy 95.72% Teradata Warehouse Miner User Guide - Volume 3 55 Chapter 1: Analytic Algorithms Decision Trees Table 10: Variables: Dependent Dependent Variable ccacct Table 11: Variables: Independent Independent Variables income ckacct avg_sv_bal avg_sv_tran_cnt Table 12: Confusion Matrix Actual Non-Response Actual Response Correct Incorrect Predicted 0 340 / 45.52% 0 / 0.00% 340 / 45.52% 0 / 0.00% Predicted 1 32 / 4.28% 375 / 50.20% 375 / 50.20% 32 / 4.28% Table 13: Cumulative Lift Table Cumulative Response Cumulative Response (%) Cumulative Captured Response (%) Cumulative Lift Decile Count Response Response (%) Captured Response (%) 1 5.00 5.00 100.00 1.33 1.99 5.00 100.00 1.33 1.99 2 0.00 0.00 0.00 0.00 0.00 5.00 100.00 1.33 1.99 3 0.00 0.00 0.00 0.00 0.00 5.00 100.00 1.33 1.99 4 0.00 0.00 0.00 0.00 0.00 5.00 100.00 1.33 1.99 5 0.00 0.00 0.00 0.00 0.00 5.00 100.00 1.33 1.99 6 402.00 370.00 92.04 98.67 1.83 375.00 92.14 100.00 1.84 7 0.00 0.00 0.00 0.00 0.00 375.00 92.14 100.00 1.84 8 0.00 0.00 0.00 0.00 0.00 375.00 92.14 100.00 1.84 9 0.00 0.00 0.00 0.00 0.00 375.00 92.14 100.00 1.84 10 340.00 0.00 0.00 0.00 0.00 375.00 50.20 100.00 1.00 56 Lift Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Decision Trees Graphs Tree Browser is displayed as follows: Figure 44: Decision Tree Graph Tutorial: Browser Select the Text Tree radio to view the rules in textual format: Figure 45: Decision Tree Graph Tutorial: Lift Additionally, you can click on Lift Chart to view the contents of the Lift Table graphically. Teradata Warehouse Miner User Guide - Volume 3 57 Chapter 1: Analytic Algorithms Factor Analysis Figure 46: Decision Tree Graph Tutorial: Browser Factor Analysis Overview Consider a data set with a number of correlated numeric variables that is to be used in some type of analysis, such as linear regression or cluster analysis. Or perhaps it is desired to understand customer behavior in a fundamental way, by discovering hidden structure and meaning in data. Factor analysis can be used to reduce a number of correlated numeric variables into a lesser number of variables called factors. These new variables or factors should hopefully be conceptually meaningful if the second goal just mentioned is to be achieved. Meaningful factors not only give insight into the dynamics of a business, but they also make any models built using these factors more explainable, which is generally a requirement for a useful analytic model. There are two fundamental types of factor analysis, principal components and common factors. Teradata Warehouse Miner offers principal components, maximum likelihood common factors and principal axis factors, which is a restricted form of common factor analysis. The product also offers factor rotations, both orthogonal and oblique, as postprocessing for any of these three types of models. Finally, as with all other models, automatic factor model scoring is offered via dynamically generated SQL. Before using the Teradata Warehouse Miner Factor Analysis module, the user must first build a data reduction matrix using the Build Matrix function. The matrix must include all of the input variables to be used in the factor analysis. The user can base the analysis on either a covariance or correlation matrix, thus working with either centered and unscaled data, or centered and normalized data (i.e. unit variance). 
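As background to this choice of matrix, the arithmetic for turning a raw sums-of-squares-and-cross-products (SSCP) matrix into a covariance or correlation matrix can be sketched as follows. This is an illustration only, assuming a plain SSCP matrix plus column sums and a row count; the extended cross-products format actually stored by the Build Matrix function is richer, and the conversion is performed automatically as described next.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # hypothetical input: rows x variables
n = X.shape[0]

sscp = X.T @ X                           # raw sums of squares and cross products
col_sums = X.sum(axis=0)

mean = col_sums / n
cov = (sscp - n * np.outer(mean, mean)) / (n - 1)    # centered, unscaled data
d = np.sqrt(np.diag(cov))
corr = cov / np.outer(d, d)                          # centered and normalized (unit variance)
```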
Teradata Warehouse Miner automatically converts the extended cross-products matrix stored in metadata results tables by the Build 58 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis Matrix function into the desired covariance or correlation matrix. The choice will affect the scaling of resulting factor measures and factor scores. The primary source of information and formulae in this section is [Harman]. Principal Components Analysis The goal of principal components analysis (PCA) is to account for the maximum amount of the original data’s variance in the principal components created. Each of the original variables can be expressed as a linear combination of the new principal components. Each principal component in its turn, from the first to the last, accounts for a maximum amount of the remaining sum of the variances of the original variables. This allows some of the later components to be discarded and only the reduced set of components accounting for the desired amount of total variance to be retained. If all the components were to be retained, then all of the variance would be explained. A principal components solution has many desirable properties. First, the new components are independent of each other, that is, uncorrelated in statistical terminology or orthogonal in the terminology of linear algebra. Further, the principal components can be calculated directly, yielding a unique solution. This is true also of principal component scores, which can be calculated directly from the solution and are also inherently orthogonal or independent of each other. Principal Axis Factors The next step toward the full factor analysis model is a technique known as principal axis factors (PAF), or sometimes also called iterated principal axis factors, or just principal factors. The principal factors model is a blend of the principal components model described earlier and the full common factor model. In the common factor model, each of the original variables is described in terms of certain underlying or common factors, as well as a unique factor for that variable. In principal axis factors however, each variable is described in terms of common factors without a unique factor. Unlike a principal components model for which there is a unique solution, a principal axis factor model consists of estimated factors and scores. As with principal components, the derived factors are orthogonal or independent of each other. The same is not necessarily true of the scores however. (Refer to “Factor Scores” on page 61 for more information). Maximum Likelihood Common Factors The goal of common factors or classical factor analysis is to account in the new factors for the maximum amount of covariance or correlation in the original input variables. In the common factor model, each of the original input variables is expressed in terms of hypothetical common factors plus a unique factor accounting for the remaining variance in that variable. The user must specify the desired number of common factors to look for in the model. This type of model represents factor analysis in the fullest sense. Teradata Warehouse Miner offers maximum likelihood factors (MLF) for estimating common factors, using expectation maximization or EM as the method to determine the maximum likelihood solution. A potential benefit of common factor analysis is that it may reduce the original set of variables into fewer factors than would principal components analysis. 
It may also produce Teradata Warehouse Miner User Guide - Volume 3 59 Chapter 1: Analytic Algorithms Factor Analysis new variables that have more fundamental meaning. A drawback is that factors can only be estimated using iterative techniques requiring more computation, as there is no unique solution to the common factor analysis model. This is true also of common factor scores, which must likewise be estimated. As with principal components and principal axis factors, the derived factors are orthogonal or independent of each other, but in this case by design (Teradata Warehouse Miner utilizes a technique to insure this). The same is not necessarily true of the factor scores however. (Refer to “Factor Scores” on page 61 for more information). These three types of factor analysis then give the data analyst the choice of modeling the original variables in their entirety (principal components), modeling them with hypothetical common factors alone (principal axis factors), or modeling them with both common factors and unique factors (maximum likelihood common factors). Factor Rotations Whatever technique is chosen to compute principal components or common factors, the new components or factors may not have recognizable meaning. Correlations will be calculated between the new factors and the original input variables, which presumably have business meaning to the data analyst. But factor-variable correlations may not possess the subjective quality of simple structure. The idea behind simple structure is to express each component or factor in terms of fewer variables that are highly correlated with the factor (or vice versa), with the remaining variables largely uncorrelated with the factor. This makes it easier to understand the meaning of the components or factors in terms of the variables. Factor rotations of various types are offered to allow the data analyst to attempt to find simple structure and hence meaning in the new components or factors. Orthogonal rotations maintain the independence of the components or factors while aligning them differently with the data to achieve a particular simple structure goal. Oblique rotations relax the requirement for factor independence while more aggressively seeking better data alignment. Teradata Warehouse Miner offers several options for both orthogonal and oblique rotations. Factor Loadings The term factor loadings is sometimes used to refer to the coefficients of the linear combinations of factors that make up the original variables in a factor analysis model. The appropriate term for this however is the factor pattern. A factor loadings matrix is sometimes also assumed to indicate the correlations between the factors and the original variables, for which the appropriate term is factor structure. The good news is that whenever factors are mutually orthogonal or independent of each other, the factor pattern P and the factor structure S are the same. They are related by the equation S = PQ where Q is the matrix of correlations between factors. In the case of principal components analysis, factor loadings are labeled as component loadings and represent both factor pattern and structure. For other types of analysis, loadings are labeled as factor pattern but indicate structure also, unless a separate structure matrix is also given (as is the case after oblique rotations, described later). 
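The relation S = PQ is easy to verify numerically. In the sketch below the pattern matrix and factor correlation matrix are made-up illustrative values; it simply shows that with uncorrelated factors (Q equal to the identity) the structure and pattern coincide, while with correlated factors they differ.

```python
import numpy as np

# Illustrative pattern matrix P (variables x factors) and factor correlations Q.
P = np.array([[0.8, 0.1],
              [0.7, 0.2],
              [0.1, 0.9]])
Q_orthogonal = np.eye(2)                  # uncorrelated factors
Q_oblique = np.array([[1.0, 0.4],
                      [0.4, 1.0]])        # correlated factors, e.g. after an oblique rotation

S_orthogonal = P @ Q_orthogonal           # equals P: pattern and structure are the same
S_oblique = P @ Q_oblique                 # differs from P: both matrices must be reported
```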
Keeping the above caveats in mind, the component loadings, pattern or structure matrix is interpreted for its structure properties in order to understand the meaning of each new factor variable. When the analysis is based on a correlation matrix, the loadings, pattern or structure can be interpreted as a correlation matrix with the columns corresponding to the factors and the rows corresponding to the original variables. Like all correlations, the values range in absolute value from 0 to 1, with the higher values representing a stronger correlation or relationship between the variables and factors. By looking at these values, the user gets an idea of the meaning represented by each factor. Teradata Warehouse Miner stores these so-called factor loadings and other related values in metadata result tables to make them available for scoring.

Factor Scores
In order to use a factor as a variable, it must be assigned a value called a factor score for each row or observation in the data. A factor score is actually a linear combination of the original input variables (without a constant term), and the coefficients associated with the original variables are called factor weights. Teradata Warehouse Miner provides a scoring function that calculates these weights and creates a table of new factor score variables using dynamically generated SQL. The ability to automatically generate factor scores, regardless of the factor analysis or rotation options used, is one of the most powerful features of the Teradata Warehouse Miner factor analysis module.

Principal Components
As mentioned earlier in the introduction, the goal of principal components analysis (PCA) is to account for the maximum amount of the original data's variance in the independent principal components created. It was also stated that each of the original variables is expressed as a linear combination of the new principal components, and that each principal component in its turn, from the first to the last, accounts for a maximum amount of the remaining sum of the variances of the original variables. These results are achieved by first finding the eigenvalues and eigenvectors of the covariance or correlation matrix of the input variables to be modeled. Although not ordinarily thought of in this way, when analyzing v numeric columns in a table in a relational database, one is in some sense working in a v-dimensional vector space corresponding to these columns. Back at the beginning of the previous century, when principal components analysis was developed, this was no small task. Today, however, math library routines are available to perform these computations very efficiently. Although the mathematical solution to finding principal components will not be derived here, it is helpful to state the following definition: a square matrix A has an eigenvalue λ and an eigenvector x if Ax = λx. Further, a v x v square symmetric matrix A has v pairs of eigenvalues and eigenvectors, (λ1, e1), (λ2, e2), …, (λv, ev). It is further true that the eigenvectors can be found so that they have unit length and are mutually orthogonal, i.e. independent or uncorrelated, making them unique. To return to the point at hand, the principal component loadings that are being sought are actually the covariance or correlation matrix eigenvectors just described, multiplied by the square root of their respective eigenvalues.
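A minimal numpy sketch of this computation, assuming the covariance or correlation matrix has already been formed, might look like the following. It is an illustration of the mathematics, not the product's implementation.

```python
import numpy as np

def principal_component_loadings(R):
    """R: covariance or correlation matrix of the input variables.
    Returns eigenvalues (largest first) and the corresponding component loadings,
    i.e. the eigenvectors scaled by the square roots of their eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(R)      # symmetric matrix; returned in ascending order
    order = np.argsort(eigvals)[::-1]         # reorder from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(eigvals)     # column j scaled by sqrt(eigenvalue j)
    return eigvals, loadings

# Small illustrative correlation matrix
R = np.array([[1.0, 0.6, 0.2],
              [0.6, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
vals, loads = principal_component_loadings(R)
```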
The step left out up to now, however, is the reduction of these principal component loadings to a number fewer than the variables present at the start. This can be achieved by first ordering the eigenvalues, and their corresponding eigenvectors, from largest to smallest, and then throwing away those eigenvalues below a minimum threshold value, such as 1.0. An alternative technique is to retain a desired number of the largest components regardless of the magnitude of the eigenvalues. Teradata Warehouse Miner provides both of these options to the user. The user may further optionally request that the signs of the principal component loadings be inverted if there are more minus signs than positive ones. This is purely cosmetic and does not affect the solution in a substantive way. However, if signs are reversed, this must be kept in mind when attempting to interpret or assign conceptual meaning to the factors. A final point worth noting is that the eigenvalues themselves turn out to be the variance accounted for by each principal component, allowing the computation of several variance related measures and some indication of the effectiveness of the principal components model.

Principal Axis Factors
In order to talk about principal axis factors (PAF), the term communality must first be introduced. In the common factor model, each original variable x is thought of as a combination of common factors and a unique factor. The variance of x can then also be thought of as being composed of a common portion and a unique portion, that is, Var(x) = c² + u². It is the common portion of the variance of x that is called the communality of x, that is, the variance that the variable has in common, through the common factors, with all the other variables. In the algorithm for principal axis factors described below it is of interest both to make an initial estimate of the communality of each variable, and to calculate the actual communality for the variables in a factor model with uncorrelated factors. One method of making an initial estimate of the communality of each variable is to take the largest correlation of that variable with respect to the other variables. The preferred method, however, is to calculate its squared multiple correlation coefficient with respect to all of the other variables taken as a whole. This is the technique used by Teradata Warehouse Miner. The multiple correlation coefficient is a measure of the overall linear association of one variable with several other variables, that is, the correlation between a variable and the best-fitting linear combination of the other variables. The square of this value has the useful property of being a lower bound for the communality. Once a factor model is built, the actual communality of a variable is simply the sum of the squares of its factor loadings, i.e. hj² = fj1² + fj2² + … + fjr², summed over the r retained factors. With the idea of communality thus in place, it is straightforward to describe the principal axis factors algorithm. Begin by estimating the communality of each variable and replacing this value in the appropriate position in the diagonal of the correlation or covariance matrix being factored. Then a principal components solution is found in the usual manner, as described earlier. As before, the user has the option of specifying either a fixed number of desired factors or a minimum eigenvalue by which to reduce the number of factors in the solution.
Finally, the new communalities are calculated as the sum of the squared factor loadings, and these values are substituted into the correlation or covariance matrix. This process is repeated until the communalities change by only a small amount. 62 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis Through its use of communality estimates, the principal axis factor method attempts to find independent common factors that account for the covariance or correlation between the original variables in the model, while ignoring the effect of unique factors. It is then possible to use the factor loadings matrix to reproduce the correlation or covariance matrix and compare this to the original as a way of assessing the effectiveness of the model. The reproduced correlation or covariance matrix is simply the factor loadings matrix times its transpose, i.e. CCT. The user may optionally request that the signs of the factor loadings be inverted if there are more minus signs than positive ones. This is purely cosmetic and does not affect the solution in a substantive way. However, if signs are reversed, this must be kept in mind when attempting to interpret or assign meaning to the factors. Maximum Likelihood Factors As mentioned earlier, the common factor model attempts to find both common and unique factors explaining the covariance or correlations amongst a set of variables. That is, an attempt is made to find a factor pattern C and a uniqueness matrix R such that a covariance or correlation matrix S can be modeled as S = CCT + R. To do this, it is necessary to utilize the principle of maximum likelihood based on the assumption that the data comes from a multivariate normal distribution. Due to dealing with the distribution function of the elements of a covariance matrix it is necessary to use the Wishart distribution in order to derive the likelihood equation. The optimization technique used then to maximize the likelihood of a solution for C and R is the Expectation Maximization or EM technique. This technique, often used in the replacement of missing data, is the same basic technique used in Teradata Warehouse Miner’s cluster analysis algorithm. Some key points regarding this technique are described below. Beginning with a correlation or covariance matrix S as with our other factor techniques, a principal components solution is first derived as an initial estimate for the factor pattern matrix C, with the initial estimate for the uniqueness matrix R taken simply as S - CCT. Then the maximum likelihood solution is iteratively found, yielding a best estimate of C and R. In order then to assess the effectiveness of the model, the correlation or covariance matrix S is compared to the reproduced matrix CCT - R. It should be pointed out that when using the maximum likelihood solution the user must first specify the number of common factors f to produce in the model. The software will not automatically determine what this value should be or determine it based on a threshold value. Also, an internal adjustment is made to the final factor pattern matrix C to make the factors orthogonal, something that is automatically true of the other factor solutions. Finally, the user may optionally request that the signs of a factor in the matrix C be inverted if there are more minus signs than positive ones. This is purely cosmetic and does not affect the solution in a substantive way. 
However, if signs are reversed, this must be kept in mind when attempting to interpret or assign meaning to the factors. Factor Rotations Teradata Warehouse Miner offers a number of techniques for rotating factors in order to find the elusive quality of simple structure described earlier. These may optionally be used in combination with any of the factor techniques offered in the product. When a rotation is performed, both the rotated matrix and the rotation matrix is reported, as well as the Teradata Warehouse Miner User Guide - Volume 3 63 Chapter 1: Analytic Algorithms Factor Analysis reproduced correlation or covariance matrix after rotation. As before with the factor solutions themselves, the user may optionally request that the signs of a factor in the rotated factor or components matrix be inverted if there are more minus signs than positive ones. This is purely cosmetic and does not affect the solution in a substantive way. Orthogonal rotations First consider orthogonal rotations, that is, rotations of a factor matrix A that result in a rotated factor matrix B by way of an orthogonal transformation matrix T, i.e. B = AT. Remember that the nice thing about orthogonal rotations on a factor matrix is that the resulting factors scores are uncorrelated, a desirable property when the factors are going to be used in subsequent regression, cluster or other type of analysis. But how is simple structure obtained? As described earlier, the idea behind simple structure is to express each component or factor in terms of fewer variables that are highly correlated with the factor, with the remaining variables not so correlated with the factor. The two most famous mathematical criteria for simple factor structure are the quartimax and varimax criteria. Simply put, the varimax criterion seeks to simplify the structure of columns or factors in the factor loading matrix, whereas the quartimax criterion seeks to simplify the structure of the rows or variables in the factor loading matrix. Less simply put, the varimax criterion seeks to maximize the variance of the squared loadings across the variables for all factors. The quartimax criterion seeks to maximize the variance of the squared loadings across the factors for all variables. The solution to either optimization problem is mathematically quite involved, though in principle it is based on fundamental techniques of linear algebra, differential calculus, and the use of the popular Newton-Raphson iterative technique for finding the roots of equations. Regardless of the criterion used, rotations are performed on normalized loadings, that is prior to rotating, the rows of the factor loading matrix are set to unit length by dividing each element by the square root of the communality for that variable. The rows are un-normalized back to the original length after the rotation is performed. This has been found to improve results, particularly for the varimax method. Fortunately both the quartimax and varimax criteria can be expressed in terms of the same equation containing a constant value that is 0 for quartimax and 1 for varimax. The orthomax criterion is then obtained simply by setting this constant, call it gamma, to any desired value, equamax corresponds to setting this constant to half the number of factors, and parsimax is given by setting the value of gamma to v(f-1) / (v+f+2) where v is the number of variables and f is the number of factors. 
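A minimal numpy sketch of the orthomax family of orthogonal rotations, including the row normalization just described, is shown below. Setting gamma to 0.0 gives the quartimax criterion and 1.0 gives varimax; the function name, convergence test and iteration limits are illustrative assumptions, not the product's implementation.

```python
import numpy as np

def orthomax_rotation(A, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonally rotate a loading matrix A (variables x factors).
    gamma: 0.0 = quartimax, 1.0 = varimax, f/2 = equamax, v(f-1)/(v+f+2) = parsimax."""
    v, f = A.shape
    # Normalize rows to unit length (divide by the square root of each communality)
    h = np.sqrt((A ** 2).sum(axis=1))
    A_n = A / h[:, None]
    T = np.eye(f)
    obj_old = 0.0
    for _ in range(max_iter):
        B = A_n @ T
        # Gradient of the orthomax simple-structure criterion
        G = A_n.T @ (B ** 3 - (gamma / v) * B @ np.diag((B ** 2).sum(axis=0)))
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt                      # nearest orthogonal transformation
        obj = s.sum()
        if obj < obj_old * (1.0 + tol): # stop when the criterion no longer improves
            break
        obj_old = obj
    # Un-normalize rows back to their original lengths after rotation
    B = (A_n @ T) * h[:, None]
    return B, T
```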
Oblique rotations As mentioned earlier, oblique rotations relax the requirement for factor independence that exists with orthogonal rotations, while more aggressively seeking better data alignment. Teradata Warehouse Miner uses a technique known as the indirect oblimin method. As with orthogonal rotations, there is a common equation for the oblique simple structure criterion that contains a constant that can be set for various effects. A value of 0 for this constant, call it gamma, yields the quartimin solution, which is the most oblique solution of those offered. A value of 1 yields the covarimin solution, the least oblique case. And a value of 0.5 yields the 64 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis biquartimin solution, a compromise between the two. A solution known as orthomin can be achieved by setting the value of gamma to any desired positive value. One of the distinctions of a factor solution that incorporates an oblique rotation is that the factor loadings must be thought of in terms of two different matrices, the factor pattern P matrix and the factor structure matrix S. These are related by the equation S = PQ where Q is the matrix of correlations between factors. Obviously if the factors are not correlated, as in an un-rotated solution or after an orthogonal rotation, then Q is the identity matrix and the structure and pattern matrix are the same. The result of an oblique rotation must include both the pattern matrix that describes the common factors and the structure matrix of correlations between the factors and original variables. As with orthogonal rotations, oblique rotations are performed on normalized loadings that are restored to their original size after rotation. A unique characteristic of the indirect oblimin method of rotation is that it is performed on a reference structure based on the normals of the original factor space. There is no inherent value in this, but is in fact just a side effect of the technique. It means however that an oblique rotation results in a reference factor pattern, structure and rotation matrix that is then converted back into the original factor space as the final primary factor pattern, structure and rotation matrix. Data Quality Reports The same data quality reports optionally available for linear regression are also available when performing Factor Analysis. Prime Factor Reports Prime Factor Loadings This report provides a specially sorted presentation of the factor loadings. Like the standard report of factor loadings, the rows represent the variables and the columns represent the factors. In this case, however, each variable is associated with the factor for which it has the largest loading as an absolute value. The variables having factor 1 as the prime factor are listed first, in descending order of the loading with factor 1. Then the variables having factor 2 as the prime factor are listed, continuing on until all the variables are listed. It is possible that not all factors will appear in the Prime Factor column, but all the variables will be listed once and only once with all their factor loadings. Note that in the special case after an oblique rotation has been performed in the factor analysis, the report is based on the factor structure matrix and not the factor pattern matrix, since the structure matrix values represent the correlations between the variables and the factors. The following is an example of a Prime Factor Loadings report. 
Table 14: Prime Factor Loadings report (Example) Variable Prime Factor Factor 1 Factor 2 Factor 3 income Factor 1 .8229 -1.1675E-02 .1353 revenue Factor 1 .8171 .4475 2.3336E-02 Teradata Warehouse Miner User Guide - Volume 3 65 Chapter 1: Analytic Algorithms Factor Analysis Table 14: Prime Factor Loadings report (Example) Variable Prime Factor Factor 1 Factor 2 Factor 3 single Factor 1 -.7705 .4332 .1554 age Factor 1 .7348 -4.5584E-02 1.0212E-02 cust_years Factor 2 .5158 .6284 .1577 purchases Factor 2 .5433 -.5505 -.254 female Factor 3 -4.1177E-02 .3366 -.9349 Prime Factor Variables The Prime Factor Variables report is closely related to the Prime Factor Loadings report. It associates variables with their prime factors and possibly other factors if a threshold percent or loading value is specified. It provides a simple presentation, without numbers, of the relationships between factors and the variables that contribute to them. If a threshold percent of 1.0 is used, only prime factor relationships are reported. A threshold percentage of less than 1.0 indicates that if the loading for a particular factor is equal to or above this percentage of the loading for the variable's prime factor, then an association is made between the variable and this factor as well. When the variable is associated with a factor other than its prime factor, the variable name is given in parentheses. A threshold loading value may alternately be used to determine the associations between variables and factors. In this case, it is possible that a variable may not appear in the report, depending on the threshold value and the loading values. However, if the option to reverse signs was enabled, positive values may actually represent inverse relationships between factors and original variables. Deselecting this option in a second run and examining factor loading results will provide the true nature (directions) of relationships among variables and factors. The following is an example of a Prime Factor Variables report. Table 15: Prime Factor Variables report (Example) 66 Factor 1 Factor 2 Factor 3 income cust_years female revenue purchases * single * * age * * (purchases) * * Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis Prime Factor Variables with Loadings The Prime Factor Variables with Loadings is functionally the same as the Prime Factor Variables report except that the actual loading values determining the associations between the variables and factors are also given. The magnitude of the loading gives some idea of the relative strength of the relationship and the sign indicates whether or not it is an inverse relationship (a negative sign indicates an inverse relationship in the values, i.e. a negative correlation). The following is an example of a Prime Factor Variables with Loadings report. Table 16: Factor Variable Loading Factor 1 income .8229 Factor 1 revenue .8171 Factor 1 single -.7705 Factor 1 age .7348 Factor 1 (purchases) .5433 Factor 2 cust_years .6284 Factor 2 purchases -.5505 Factor 3 female -.9349 Missing Data Null values for columns in a factor analysis can adversely affect results. It is recommended that the listwise deletion option be used when building the SSCP matrix with the Build Matrix function. This ensures that any row for which one of the columns is null will be left out of the matrix computations completely. Additionally, the Recode transformation function can be used to build a new column, substituting a fixed known value for null. 
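For readers preparing data outside the product, the two approaches just described, listwise deletion and recoding nulls to a known value, can be sketched in pandas as follows; the file name and the particular columns are hypothetical.

```python
import pandas as pd

# Hypothetical extract of the analysis table and the columns selected for factoring.
df = pd.read_csv("twm_customer_analysis.csv")
factor_columns = ["income", "age", "avg_ck_bal", "avg_sv_bal"]

# Listwise deletion: drop any row with a null in a selected column, so the row
# is left out of the matrix computations completely.
listwise = df.dropna(subset=factor_columns)

# Alternative: recode nulls to a fixed known value in a new column before
# building the matrix, analogous to the Recode transformation.
df["avg_sv_bal_recoded"] = df["avg_sv_bal"].fillna(0.0)
```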
Initiate a Factor Analysis Use the following procedure to initiate a new Factor Analysis in Teradata Warehouse Miner: Teradata Warehouse Miner User Guide - Volume 3 67 Chapter 1: Analytic Algorithms Factor Analysis 1 Click on the Add New Analysis icon in the toolbar: Figure 47: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Factor Analysis: Figure 48: Add New Analysis dialog 3 This will bring up the Factor Analysis dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Factor - INPUT - Data Selection On the Factor Analysis dialog click on INPUT and then click on data selection: Figure 49: Factor Analysis > Input > Data Selection 68 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis On this screen select: 1 Select Input Source Users may select between different sources of input, Table, Matrix or Analysis. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. (In this case a matrix will be dynamically built and discarded when the algorithm completes execution). By selecting the Input Source Matrix the user may can select from available matrices created by the Build Matrix function. This has the advantage that the matrix selected for input is available for further analysis after completion of the algorithm, perhaps selecting a different subset of columns from the matrix. By selecting the Input Source Analysis the user can select directly from the output of another analysis of qualifying type in the current project. (In this case a matrix will be dynamically built and discarded when the algorithm completes execution). Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 Select Columns From One Table • Available Databases (only for Input Source equal to Table) — All the databases that are available for the Factor Analysis. • Available Matrices (only for Input Source equal to Matrix) — When the Input Source is Matrix, a matrix must first be built by the user with the Build Matrix function before Factor Analysis can be performed. Select the matrix that summarizes the data to be analyzed. (The matrix must have been built with more rows than columns selected or the Factor Analysis will produce a singular matrix, causing a failure). • Available Analyses (only for Input Source equal to Analysis) — All the analyses that are available for the Factor Analysis. • Available Tables (only for Input Source equal to Table or Analysis) — All the tables that are available for the Factor Analysis. • Available Columns — All the columns that are available for the Factor Analysis. 
• Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. The algorithm requires that the selected columns be of numeric type (or contain numbers in character format). Factor - INPUT - Analysis Parameters On the Factor Analysis dialog click on INPUT and then click on analysis parameters: Teradata Warehouse Miner User Guide - Volume 3 69 Chapter 1: Analytic Algorithms Factor Analysis Figure 50: Factor Analysis > Input > Analysis Parameters On this screen select: • General Options • • Analysis method • Principal Components (PCA) — As described above. This is the default method. • Principal Axis Factors (PAF) — As described above. • Maximum Likelihood Factors (MLF) — As described above. Convergence Method • Minimum Eigenvalue PCA — minimum eigenvalue to include in principal components (default 1.0) PAF — minimum eigenvalue to include in factor loadings (default 0.0) MLF — option does not apply (N/A) • • • • 70 Number of Factors — The user may request a specific number of factors as an alternative to using the minimum eigenvalue option for PCA and PAF. Number of factors is however required for MLF. The number of factors requested must not exceed the number of requested variables. Convergence Criterion • PCA — convergence criterion does not apply • PAF — iteration continues until maximum communality change does not exceed convergence criterion • MLF — iteration continues until maximum change in the square root of uniqueness values does not exceed convergence criterion Maximum Iterations • PCA — maximum iterations does not apply (N/A) • PAF — the algorithm stops if the maximum iterations is exceeded (default 100) • MLF — the algorithm stops if the maximum iterations is exceeded (default 1000) Matrix Type — The product automatically converts the extended cross-products matrix stored in metadata results tables by the Build Matrix function into the desired covariance or correlation matrix. The choice will affect the scaling of resulting factor measures and factor scores. • Correlation — Build a correlation matrix as input to Factor Analysis. This is the default option. • Covariance — Build a covariance matrix as input to Factor Analysis. • Invert signs if majority of matrix values are negative (checkbox) — You may Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis optionally request that the signs of factor loadings and related values be changed if there are more minus signs than positive ones. This is purely cosmetic and does not affect the solution in a substantive way. Default is enabled. • Rotation Options • Rotation Method • None — No factor rotation is performed. This is the default option. • Varimax — Gamma in rotation equation fixed at 1.0. The varimax criterion seeks to simplify the structure of columns or factors in the factor loading matrix • Quartimax — Gamma in rotation equation fixed at 0.0. the quartimax criterion seeks to simplify the structure of the rows or variables in the factor loading matrix • Equamax — Gamma in rotation equation fixed at f / 2. • Parsimax — Gamma in rotation equation fixed at v(f-1) / (v+f+2). • Orthomax — Gamma in rotation equation set by user. • Quartimin — Gamma in rotation equation fixed at 0.0. Provides the most oblique rotation. • Biquartimin — Gamma in rotation equation fixed at 0.5. • Covarimin — Gamma in rotation equation fixed at 1.0. 
Provides the least oblique rotation. • Orthomin — Gamma in rotation equation set by user. • Report Options • Variable Statistics — This report gives the mean value and standard deviation of each variable in the model based on the derived SSCP matrix. • Near Dependency — This report lists collinear variables or near dependencies in the data based on the derived SSCP matrix. • Condition Index Threshold — Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The one that involves this parameter is the occurrence of a large condition index value associated with a specially constructed principal factor. If a factor has a condition index greater than this parameter’s value, it is a candidate for the Near Dependency report. A default value of 30 is used as a rule of thumb. • Variance Proportion Threshold — Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The one that involves this parameter is when two or more variables have a variance proportion greater than this threshold value for a factor with a high condition index. Another way of saying this is that a ‘suspect’ factor accounts for a high proportion of the variance of two or more variables. This parameter defines what a high proportion of variance is. A default value of 0.5 is used as a rule of thumb. • Collinearity Diagnostics Report — This report provides the details behind the Near Dependency report, consisting of the “Eigenvalues of Unit Scaled X’X”, “Condition Indices” and “Variance Proportions” tables. • Factor Loading Reports • Factor Variables Report Teradata Warehouse Miner User Guide - Volume 3 71 Chapter 1: Analytic Algorithms Factor Analysis • Factor Variables with Loadings Report • Display Variables Using • Threshold percent • Threshold loading — A threshold percentage of less than 1.0 indicates that if the loading for a particular factor is equal or above this percentage of the loading for the variable's prime factor, then an association is made between the variable and this factor as well. A threshold loading value may alternatively be used. Run the Factor Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Factor Analysis The results of running the Factor Analysis include a factor patterns graph, a scree plot (unless MLF was specified), and a variety of statistic reports. All of these results are outlined below. Factor Analysis - RESULTS - Reports On the Factor Analysis dialog, click on RESULTS and then click on reports (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 51: Factor Analysis > Results > Reports Data Quality Reports • Variable Statistics — If selected on the Results Options tab, this report gives the mean value and standard deviation of each variable in the model based on the SSCP matrix provided as input. • Near Dependency — If selected on the Results Options tab, this report lists collinear variables or near dependencies in the data based on the SSCP matrix provided as input. Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The first is the occurrence of a large condition index value associated with a specially constructed principal factor. 
If a factor has a condition index greater than the parameter specified on the Results Option tab, it is a candidate for the Near Dependency report. The other is when two or more variables have a variance proportion greater than a threshold value for a factor with a high condition index. Another way of 72 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis saying this is that a ‘suspect’ factor accounts for a high proportion of the variance of two or more variables. The parameter to defines what a high proportion of variance is also set on the Results Option tab. A default value of 0.5. • Detailed Collinearity Diagnostics — If selected on the Results Options tab, this report provides the details behind the Near Dependency report, consisting of the following tables. • Eigenvalues of Unit Scaled X'X — Report of the eigenvalues of all variables scaled so that each variable adds up to 1 when summed over all the observations or rows. In order to calculate the singular values of X (the rows of X are the observations), the mathematically equivalent square root of the eigenvalues of XTX are computed instead for practical reasons • Condition Indices — The condition index of each eigenvalue, calculated as the square root of the ratio of the largest eigenvalue to the given eigenvalue, a value always 1 or greater. • Variance Proportions — The variance decomposition of these eigenvalues is computed using the eigenvalues together with the eigenvectors associated with them. The result is a matrix giving, for each variable, the proportion of variance associated with each eigenvalue. Principal Component Analysis report • Number of Variables — This is the number of variables to be factored, taken from the matrix that is input to the algorithm. Note that there are no dependent or independent variables in a factor analysis model. • Minimum Eigenvalue — The minimum value of a factor’s associated eigenvalue, determining whether or not to include the factor in the final model. This field is not displayed if the Number of Factors option is used to determine the number of factors retained. • Number of Factors — This value reflects the number of factors retained in the final factor analysis model. If the Number of Factors option is explicitly set by the user to determine the number of factors, then this reported value reflects the value set by the user. Otherwise, it reflects the number of factors resulting from applying the Minimum Eigenvalue option. • Matrix Type (cor/cov) — This value reflects the type of input matrix requested by the user, either correlation (cor) or covariance (cov). • Rotation (none/orthogonal/oblique) — This value reflects the type of rotation, if any, requested by the user, either none, orthogonal, or oblique. • Gamma — This value is a coefficient in the rotation equation that reflects the type of rotation requested, if any, and in some cases is explicitly set by the user. Gamma is determined as follows. 
• Orthogonal rotations: • Varimax — (gamma in rotation equation fixed at 1.0) • Quartimax — (gamma in rotation equation fixed at 0.0) • Equamax — (gamma in rotation equation fixed at f / 2)* Teradata Warehouse Miner User Guide - Volume 3 73 Chapter 1: Analytic Algorithms Factor Analysis • Parsimax — (gamma in rotation equation fixed at v(f-1) / (v+f+2))* • Orthomax — (gamma in rotation equation set by user) * where v is the number of variables and f is the number of factors • Oblique rotations • Quartimin — (gamma in rotation equation fixed at 0.0) • Biquartimin — (gamma in rotation equation fixed at 0.5) • Covarimin — (gamma in rotation equation fixed at 1.0) • Orthomin — (gamma in rotation equation set by user) Principal Axis Factors report • Number of Variables — This is the number of variables to be factored, taken from the matrix that is input to the algorithm. Note that there are no dependent or independent variables in a factor analysis model. • Minimum Eigenvalue — The minimum value of a factor’s associated eigenvalue, determining whether or not to include the factor in the final model. This field is not displayed if the Number of Factors option is used to determine the number of factors retained. • Number of Factors — This value reflects the number of factors retained in the final factor analysis model. If the Number of Factors option is explicitly set by the user to determine the number of factors, then this reported value reflects the value set by the user. Otherwise, it reflects the number of factors resulting from applying the Minimum Eigenvalue option. • Maximum Iterations — This is the maximum number of iterations requested by the user. • Convergence Criterion — This is the value requested by the user as the convergence criterion such that iteration continues until the maximum change in the square root of uniqueness values does not exceed this value. • Rotation (none/orthogonal/oblique) — This value reflects the type of rotation, if any, requested by the user, either none, orthogonal, or oblique. • Gamma — This value is a coefficient in the rotation equation that reflects the type of rotation requested, if any, and in some cases is explicitly set by the user. Gamma is determined as follows. • Orthogonal rotations • Varimax — (gamma in rotation equation fixed at 1.0) • Quartimax — (gamma in rotation equation fixed at 0.0) • Equamax — (gamma in rotation equation fixed at f / 2)* • Parsimax — (gamma in rotation equation fixed at v(f-1) / (v+f+2))* • Orthomax — (gamma in rotation equation set by user) * where v is the number of variables and f is the number of factors • 74 Oblique rotations • Quartimin — (gamma in rotation equation fixed at 0.0) • Biquartimin — (gamma in rotation equation fixed at 0.5) Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis • Covarimin — (gamma in rotation equation fixed at 1.0) • Orthomin — (gamma in rotation equation set by user) Maximum Likelihood (EM) Factor Analysis report • Number of Variables — This is the number of variables to be factored, taken from the matrix that is input to the algorithm. Note that there are no dependent or independent variables in a factor analysis model. • Number of Observations — This is the number of observations in the data used to build the matrix that is input to the algorithm. • Number of Factors — This reflects the number of factors requested by the user for the factor analysis model. • Maximum Iterations — This is the maximum number of iterations requested by the user. 
(The actual number of iterations used is reflected in the Total Number of Iterations field further down in the report). • Convergence Criterion — This is the value requested by the user as the convergence criterion such that iteration continues until the maximum change in the square root of uniqueness values does not exceed this value. (It should be noted that convergence is based on uniqueness values rather than maximum likelihood values, something that is done strictly for practical reasons based on experimentation). • Matrix Type (cor/cov) — This value reflects the type of input matrix requested by the user, either correlation (cor) or covariance (cov). • Rotation (none/orthogonal/oblique) — This value reflects the type of rotation, if any, requested by the user, either none, orthogonal, or oblique. • Gamma — This value is a coefficient in the rotation equation that reflects the type of rotation requested, if any, and in some cases is explicitly set by the user. Gamma is determined as follows. • Orthogonal rotations • Varimax — (gamma in rotation equation fixed at 1.0) • Quartimax — (gamma in rotation equation fixed at 0.0) • Equamax — (gamma in rotation equation fixed at f / 2)* • Parsimax — (gamma in rotation equation fixed at v(f-1) / (v+f+2))* • Orthomax — (gamma in rotation equation set by user) * where v is the number of variables and f is the number of factors • Oblique rotations • Quartimin — (gamma in rotation equation fixed at 0.0) • Biquartimin — (gamma in rotation equation fixed at 0.5) • Covarimin — (gamma in rotation equation fixed at 1.0) • Orthomin — (gamma in rotation equation set by user) • Total Number of Iterations — This value is the number of iterations that the algorithm performed to converge on a maximum likelihood solution. Teradata Warehouse Miner User Guide - Volume 3 75 Chapter 1: Analytic Algorithms Factor Analysis • Final Average Likelihood — This is the final value of the average likelihood over all the observations represented in the input matrix. • Change in Avg Likelihood — This is the final change, from the previous to the final iteration, in value of the average likelihood over all the observations represented in the input matrix. • Maximum Change in Sqrt (uniqueness) — The algorithm calculates a uniqueness value for each factor each time it iterates, and keeps track of how much the positive square root of each of these values changes from one iteration to the next. The maximum change in this value is given here, and it is of interest because it is used to determine convergence of the model. (Refer to “Final Uniqueness Values” on page 78 for an explanation of these values in the common factor model). Max Change in Sqrt (Communality) For Each Iteration This report, printed for Principal Axis Factors only, and only if the user requests the Report Output option Long, shows the progress of the algorithm in converging on a solution. It does this by showing, at each iteration, the maximum change in the positive square root of the communality of each of the variables. The communality of a variable is that portion of its variance that can be attributed to the common factors. Simply put, when the communality values for all of the variables stop changing sufficiently, the algorithm stops. Matrix to be Factored The correlation or covariance matrix to be factored is printed out only if the user requests the Report Output option Long. Only the lower triangular portion of this symmetric matrix is reported and output is limited to at most 100 rows for expediency. 
(If it is necessary to view the entire matrix, the Get Matrix function with the Export to File option is recommended).
Initial Communality Estimates
This report is produced only for Principal Axis Factors and Maximum Likelihood Factors. The communality of a variable is that portion of its variance that can be attributed to the common factors, excluding uniqueness. The initial communality estimates for each variable are made by calculating the squared multiple correlation coefficient of each variable with respect to the other variables taken together.
Final Communality Estimates
This report is produced only for Principal Axis Factors and Maximum Likelihood Factors. The communality of a variable is that portion of its variance that can be attributed to the common factors, excluding uniqueness. The final communality estimates for each variable are computed as:
$h_j^2 = \sum_{k=1}^{r} f_{jk}^2$
i.e. as the sum of the squares of the factor loadings for each variable.
Eigenvalues
These are the resulting eigenvalues of the principal component or principal axis factor solution, in descending order. At this stage, there are as many eigenvalues as input variables since the number of factors has not been reduced yet.
Eigenvectors
These are the resulting eigenvectors of the principal components or principal axis factor solution, in descending order. At this stage, there are as many eigenvectors as input variables since the number of factors has not been reduced yet. Eigenvectors are printed out only if the user requests the Report Output option Long.
Principal Component Loadings (Principal Components)
This matrix of values, which is variables by factors in size, represents both the factor pattern and factor structure, i.e. the linear combination of factors for each variable and the correlations between factors and variables (provided Matrix Type is Correlation). The number of factors has been reduced to meet the minimum eigenvalue or number of factors requested, but the output does not reflect any factor rotations that may have been requested. This output table contains the raw data used in the Prime Factor Reports, which are probably better to use for interpreting results. If the user requested a Matrix Type of Correlation, the principal component loadings can be interpreted as the correlations between the original variables and the newly created factors. An absolute value approaching 1 indicates that a variable is contributing strongly to a particular factor.
Factor Pattern (Principal Axis Factors)
This matrix of values, which is variables by factors in size, represents both the factor pattern and factor structure, i.e. the linear combination of factors for each variable and the correlations between factors and variables (provided Matrix Type is Correlation). The number of factors has been reduced to meet the minimum eigenvalue or number of factors requested, but the output does not reflect any factor rotations that may have been requested. This output table contains the raw data used in the Prime Factor Reports, which are probably better to use for interpreting results. If the user requested a Matrix Type of Correlation, the factor pattern can be interpreted as the correlations between the original variables and the newly created factors. An absolute value approaching 1 indicates that a variable is contributing strongly to a particular factor.
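The relationship between a loadings (factor pattern) matrix, the communalities and the variance explained by each factor can be illustrated with a small sketch outside the product. This is not Teradata Warehouse Miner code; the loadings matrix below is a hypothetical example, and it assumes NumPy is available.

import numpy as np

# Hypothetical loadings: 4 variables (rows) by 2 retained factors (columns)
loadings = np.array([
    [0.85,  0.10],
    [0.78, -0.05],
    [0.15,  0.90],
    [0.20,  0.82],
])

# Final communality estimate for each variable: sum of squared loadings across its factors
communalities = (loadings ** 2).sum(axis=1)

# Variance explained by each factor: sum of squared loadings down each factor's column
variance_explained = (loadings ** 2).sum(axis=0)

print("Communalities:", communalities)
print("Variance explained by factors:", variance_explained)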
Factor Pattern (Maximum Likelihood Factors)
This matrix of values, which is variables by factors in size, represents both the factor pattern and factor structure, i.e. the linear combination of factors for each variable and the correlations between factors and variables (provided Matrix Type is Correlation). The number of factors has been fixed at the number of factors requested. The output at this stage does not reflect any factor rotations that may have been requested. This output table contains the raw data used in the Prime Factor Reports, which are probably better to use for interpreting results. If the user requested a Matrix Type of Correlation, the factor pattern can be interpreted as the correlations between the original variables and the newly created factors. An absolute value approaching 1 indicates that a variable is contributing strongly to a particular factor.
Variance Explained by Factors
This report provides the amount of variance in all of the original variables taken together that is accounted for by each factor. For Principal Components and Principal Axis Factor solutions, the variance is the same as the eigenvalues calculated for the solution. In general however, and for Maximum Likelihood Factor solutions in particular, the variance is the sum of the squared loadings for each factor. (After an oblique rotation, if the factors are correlated, there is an interaction term that must also be added in based on the loadings and the correlations between factors. A separate report entitled Contributions of Rotated Factors To Variance is provided if an oblique rotation is performed).
• Factor Variance — This column shows the actual amount of variance in the original variables accounted for by each factor.
• Percent of Total Variance — This column shows the percentage of the total variance in the original variables accounted for by each factor.
• Cumulative Percent — This column shows the cumulative percentage of the total variance in the original variables accounted for by Factor 1 through each subsequent factor in turn.
Factor Variance to Total Variance Ratio
This is simply the ratio of the variance explained by all the factors to the total variance in the original data.
Condition Indices of Components
The condition index of a principal component or principal factor is the square root of the ratio of the largest eigenvalue to the eigenvalue associated with that component or factor. This report is provided for Principal Components and Principal Axis Factors only.
Final Uniqueness Values
The common factor model seeks to find a factor pattern C and a uniqueness matrix R such that a covariance or correlation matrix S can be modeled as $S = CC^T + R$. The uniqueness matrix is a diagonal matrix, so there is a single uniqueness value for each variable in the model. The theory behind the uniqueness value of a variable is that the variance of each variable can be expressed as the sum of its communality and uniqueness, that is the variance of the jth variable is given by:
$s_j^2 = h_j^2 + u_j^2$
This report is provided for Maximum Likelihood Factors only.
Reproduced Matrix Based on Loadings
The results of a factor analysis can be used to reproduce or approximate the original correlation or covariance matrix used to build the factor analysis model. This is done to evaluate the effectiveness of the model in accounting for the variance in the original data.
For 78 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis Principal Components and Principal Axis Factors the reproduced matrix is simply the loadings matrix times its transpose. For Maximum Likelihood Factors it is the loadings matrix times its transpose plus the uniqueness matrix. This report is provided only when Long is selected as the Output Option. Difference Between Original and Reproduced cor/cov Matrix This report gives the differences between the original correlation or covariance matrix values used in the factor analysis and the Reproduced Matrix Based on Loadings. (In the case of Principal Axis Factors, the reproduced matrix is compared to the original matrix with the initial communality estimates placed in the diagonal of the matrix). This report is provided only when Long is selected as the Output Option. Absolute Difference This report summarizes the absolute value of the differences between the original correlation or covariance matrix values used in the factor analysis and the Reproduced Matrix Based on Loadings. • Mean — This is the average absolute difference in correlation or covariance over the entire matrix. • Standard Deviation — This is the standard deviation of the absolute differences in correlation or covariance over the entire matrix. • Minimum — This is the minimum absolute difference in correlation or covariance over the entire matrix. • Maximum — This is the maximum absolute difference in correlation or covariance over the entire matrix. Rotated Loading Matrix This report of the factor loadings (pattern) after rotation is given only after orthogonal rotations. Rotated Structure This report of the factor structure after rotation is given only after oblique rotations. Note that after an oblique rotation the rotated structure matrix is usually different from the rotated pattern matrix. Rotated Pattern This report of the factor pattern after rotation is given only after oblique rotations. Note that after an oblique rotation the rotated pattern matrix is usually different from the rotated structure matrix. Rotation Matrix After rotating the factor pattern matrix P to get the rotated matrix PR, the rotation matrix T is also produced such that PR = PT. However, after an oblique rotation the rotation matrix obeys the following equation: PR = P(TT)-1. This report is provided only when Long is selected as the Output Option. Teradata Warehouse Miner User Guide - Volume 3 79 Chapter 1: Analytic Algorithms Factor Analysis Variance Explained by Rotated Factors This is the same report as Variance Explained by Factors except that it is based on the rotated factor loadings. Comparison of the two reports can show the effects of rotation on the effectiveness of the model. After an oblique rotation, another report is produced called the Contributions of Rotated Factors to Variance to show both the contributions of individual factors and the contributions of factor interactions to the explanation of the variance in the original variables analyzed. Rotated Factor Variance to Total Variance Ratio This is the same report as Factor Variance to Total Variance Ratio except that it is based on the rotated factor loadings. Comparison of the two reports can show the effects of rotation on the effectiveness of the model. Correlations Among Rotated Factors After an oblique rotation the factors are generally no longer orthogonal or uncorrelated with each other. 
This report is a standard Pearson product-moment correlation matrix treating the rotated factors as new variables. Values range from 0, indicating no correlation, to -1 or +1, indicating maximum correlation (a negative correlation indicates that two factors vary in opposite directions with respect to each other). This report is provided only after an oblique rotation is performed.
Contributions of Rotated Factors to Variance
In general, the variance of the original variables explained by a factor is the sum of the squared loadings for the factor. But after an oblique rotation the factors may be correlated, so additional interaction terms between the factors must be considered in computing the explained variance reported in the Variance Explained by Rotated Factors report. The contributions of factors to variance may be characterized as direct contributions:
$V_p = \sum_{j=1}^{n} b_{jp}^2$
and joint contributions:
$V_{pq} = \sum_{j=1}^{n} 2 r_{T_p T_q} b_{jp} b_{jq}$
where p and q vary by factors with p < q, j varies by variables, and r is the correlation between factors. The Contributions of Rotated Factors to Variance report displays direct contributions along the diagonal and joint contributions off the diagonal. This report is provided only after an oblique rotation is performed.
Factor Analysis - RESULTS - Pattern Graph
On the Factor Analysis dialog, click on RESULTS and then click on pattern graph (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed):
Figure 52: Factor Analysis > Results > Pattern Graph
The Factor Analysis Pattern Graph plots the final factor pattern values for up to twelve variables, two factors at a time. These factor pattern values are the coefficients in the linear combination of factors that comprise each variable. When the Analysis Type is Principal Components, these pattern values are referred to as factor loadings. When the Matrix Type is Correlation, the values of these coefficients are standardized to be between -1 and 1 (if Covariance, they are not). Unless an oblique rotation has been performed, these values also represent the factor structure, i.e. the correlation between a factor and a variable. The following options are available:
• Variables
• Available — A list of all variables that were input to the Factor Analysis.
• Selected — A list of the variables (up to 12) that will be displayed on the Factor Patterns graph.
• Factors
• Available — A list of all factors generated by the Factor Analysis.
• Selected — The selected two factors that will be displayed on the Factor Patterns graph.
Factor Analysis - RESULTS - Scree Plot
Unless MLF was specified, a scree plot is generated. On the Factor Analysis dialog, click on RESULTS and then click on scree plot (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed):
Figure 53: Factor Analysis > Results > Scree Plot
A definition of the word scree is a heap of stones or rocky debris, such as at the bottom of a hill. So in a scree plot the object is to find where the plotted points flatten out, in order to determine how many Principal Component or Principal Axis factors should be retained in the factor analysis model (the scree plot does not apply to Maximum Likelihood factor analysis). The plot shows the eigenvalues of each factor in descending order from left to right.
Since the eigenvalues represent the amount of variance in the original variables is explained by the factors, when the eigenvalues flatten out in the plot, the factors they represent add less and less to the effectiveness of the model. Tutorial - Factor Analysis In this example, principal components analysis is performed on a correlation matrix for 21 numeric variables. This reduces the variables to 7 factors using a minimum eigenvalue of 1. The Scree Plot supports limiting the number of factors to 7 by showing how the eigenvalues (and thus the explained variance) level off at 7 or above. Parameterize a Factor Analysis as follows: • Available Matrices — Customer_Analysis_Matrix • Selected Variables • income • age • years_with_bank • nbr_children • female • single • married • separated • ccacct • ckacct • svacct • avg_cc_bal • avg_ck_bal • avg_sv_bal • avg_cc_tran_amt • avg_cc_tran_cnt • avg_ck_tran_amt • avg_ck_tran_cnt • avg_sv_tran_amt • avg_sv_tran_cnt • cc_rev • Analysis Method — Principal Components • Matrix Type — Correlation • Minimum Eigenvalue — 1 • Invert signs if majority of matrix values are negative — Enabled • Rotation Options — None 82 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis • Factor Variables — Enabled • Threshold Percent — 1 • Long Report — Not enabled Run the analysis, and click on Results when it completes. For this example, the Factor Analysis generated the following pages. A single click on each page name populates the Results page with the item. Table 17: Factor Analysis Report Number of Variables 21 Minimum Eigenvalue 1 Number of Factors 7 Matrix Type Correlation Rotation None Table 18: Execution Summary 6/20/2004 1:55:02 PM Getting Matrix 6/20/2004 1:55:02 PM Principal Components Analysis Running...x 6/20/2004 1:55:02 PM Creating Report Table 19: Eigenvalues Factor 1 4.292 Factor 2 2.497 Factor 3 1.844 Factor 4 1.598 Factor 5 1.446 Factor 6 1.254 Factor 7 1.041 (Factor 8) .971 (Factor 9) .926 (Factor 10) .871 (Factor 11) .741 (Factor 12) .693 (Factor 13) .601 (Factor 14) .504 Teradata Warehouse Miner User Guide - Volume 3 83 Chapter 1: Analytic Algorithms Factor Analysis Table 19: Eigenvalues (Factor 15) .437 (Factor 16) .347 (Factor 17) .34 (Factor 18) .253 (Factor 19) .151 (Factor 20) .123 (Factor 21) 7.01E-02 Table 20: Principal Component Loadings Variable Name Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 Factor 6 Factor 7 age 0.2876 -0.4711 0.1979 0.2615 0.2975 0.3233 -0.2463 avg_cc_bal -0.7621 0.0131 0.1628 -0.1438 0.3508 -0.1550 -0.0300 avg_cc_tran_amt 0.3716 -0.0318 -0.1360 0.0543 -0.1975 0.0100 0.0971 avg_cc_tran_cnt 0.4704 0.0873 -0.4312 0.5592 -0.0241 0.0133 0.0782 avg_ck_bal 0.5778 0.0527 -0.0981 -0.4598 0.0735 -0.0123 -0.0542 avg_ck_tran_amt 0.7698 0.0386 -0.0929 -0.4535 0.2489 0.0585 0.0190 avg_ck_tran_cnt 0.3127 0.1180 -0.1619 -0.1114 0.5435 0.1845 0.0884 avg_sv_bal 0.3785 0.3084 0.4893 0.0186 -0.0768 -0.0630 0.0517 avg_sv_tran_amt 0.4800 0.4351 0.5966 0.1456 -0.0155 0.0272 0.1281 avg_sv_tran_cnt 0.2042 0.3873 0.4931 0.1144 0.2420 0.0884 -0.0646 cc_rev 0.8377 -0.0624 -0.1534 0.0691 -0.3800 0.1036 0.0081 ccacct 0.2025 0.5213 0.4007 0.3021 0.0499 -0.1988 0.1733 ckacct 0.4007 0.1496 -0.4215 0.5497 0.1127 -0.0818 -0.0086 female -0.0209 0.1165 -0.1357 0.3119 0.1887 -0.2228 -0.3438 income 0.6992 -0.2888 0.1353 -0.2987 -0.2684 0.0733 0.0310 married 0.0595 -0.7702 0.2674 0.2434 0.1945 0.0873 0.2768 nbr_children 0.2560 -0.4477 0.1238 -0.0895 -0.0739 -0.5642 0.0898 separated 0.3030 0.0692 0.0545 -0.0666 
-0.0796 -0.5089 -0.6425 single -0.2902 0.7648 -0.3004 -0.2010 -0.2120 0.2527 0.0360 svacct 0.4365 0.1616 -0.2592 -0.1705 0.6336 -0.1071 0.0318 years_with_bank 0.0362 -0.0966 0.2120 0.0543 -0.0668 0.5507 -0.5299 84 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Factor Analysis Variance Table 21: Factor Variance to Total Variance Ratio .665 Table 22: Variance Explained By Factors Factor Variance Percent of Total Variance Cumulative Percent Condition Indices Factor 1 4.2920 20.4383 20.4383 1.0000 Factor 2 2.4972 11.8914 32.3297 1.3110 Factor 3 1.8438 8.7800 41.1097 1.5257 Factor 4 1.5977 7.6082 48.7179 1.6390 Factor 5 1.4462 6.8869 55.6048 1.7227 Factor 6 1.2544 5.9735 61.5782 1.8497 Factor 7 1.0413 4.9586 66.5369 2.0302 Table 23: Difference Mean Standard Deviation Minimum Maximum 0.0570 0.0866 0.0000 0.7909 Table 24: Prime Factor Variables Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 Factor 6 Factor 7 cc_rev married avg_sv_tran_amt avg_cc_tran_cnt svacct nbr_children separated avg_ck_tran_amt single avg_sv_tran_cnt ckacct avg_ck_tran_cnt years_with_bank female avg_cc_bal ccacct avg_sv_bal * * * * income age * * * * * avg_ck_bal * * * * * * avg_cc_tran_amt * * * * * * Pattern Graph By default, the first twelve variables input to the Factor Analysis, and the first two factors generated, are displayed on the Factor Patterns graph: Teradata Warehouse Miner User Guide - Volume 3 85 Chapter 1: Analytic Algorithms Linear Regression Scree Plot On the scree plot, all possible factors are shown. In this case, only factors with an eigenvalue greater than 1 were generated by the Factor Analysis: Figure 54: Factor Analysis Tutorial: Scree Plot Linear Regression Overview Linear regression is one of the oldest and most fundamental types of analysis in statistics. The British scientist Sir Francis Galton originally developed it in the latter part of the 19th century. The term “regression” derives from the nature of his original study in which he found that the children of both tall and short parents tend to “revert” or “regress” toward average heights. [Neter] It has also been associated with the work of Gauss and Legendre who used linear models in working with astronomical data. Linear regression is thought of today as a special case of generalized linear models, which also includes models such as logit models (logistic regression), log-linear models and multinomial response models. [McCullagh] Why build a linear regression model? It is after all one of the simplest types of models that can be built. Why not start out with a more sophisticated model such as a decision tree or a neural network model? One reason is that if a simpler model will suffice, it is better than an unnecessarily complex model. Another reason is to learn about the relationships between a set of observed variables. Is there in fact a linear relationship between each of the observed variables and the variable to predict? Which variables help in predicting the target dependent variable? If a linear relationship does not exist, is there another type of relationship that does? 86 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression By transforming a variable, say by taking its exponent or log or perhaps squaring it, and then building a linear regression model, these relationships can hopefully be seen. In some cases, it may even be possible to create an essentially non-linear model using linear regression by transforming the data first. 
In fact, one of the many sophisticated forms of regression, called piecewise linear regression, was designed specifically to build nonlinear models of nonlinear phenomena. Finally, in spite of being a relatively simple type of model, there is a rich set of statistics available to explore the nature of any linear regression model built.
Multiple Linear Regression
Multiple linear regression analysis attempts to predict, or estimate, the value of a dependent variable as a linear combination of independent variables, usually with a constant term included. That is, it attempts to find the b-coefficients in the following equation in order to best predict the value of the dependent variable y based on the independent variables x1 to xn.
$\hat{y} = b_0 + b_1 x_1 + \cdots + b_n x_n$
The best values of the coefficients are defined to be the values that minimize the sum of squared error values:
$\sum (y - \hat{y})^2$
over all the observations. Note that this requires that the actual value of y be known for each observation, in order to contrast it with the predicted value $\hat{y}$. This technique is called “least-squared errors.” It turns out that the b-coefficient values to minimize the sum of squared errors can be solved using a little calculus and linear algebra. It is worth spending just a little more effort in describing this technique in order to explain how Teradata Warehouse Miner performs linear regression analysis. It also introduces the concept of a cross-products matrix and its relatives the covariance matrix and the correlation matrix that are so important in multivariate statistical analysis.
In order to minimize the sum of squared errors, the equation for the sum of squared errors is expanded using the equation for the estimated y value, and then the partial derivatives of this equation with respect to each b-coefficient are derived and set equal to 0. (This is done in order to find the minimum with respect to all of the coefficient values). This leads to n simultaneous equations in n unknowns, which are commonly referred to as the normal equations. For example:
$\sum (1 \cdot 1)\, b_0 + \sum (1 \cdot x_1)\, b_1 + \sum (1 \cdot x_2)\, b_2 = \sum (1 \cdot y)$
$\sum (x_1 \cdot 1)\, b_0 + \sum x_1^2\, b_1 + \sum (x_1 x_2)\, b_2 = \sum (x_1 y)$
$\sum (x_2 \cdot 1)\, b_0 + \sum (x_2 x_1)\, b_1 + \sum x_2^2\, b_2 = \sum (x_2 y)$
The equations above have been presented in a way that gives a hint to how they can be solved using matrix algebra, i.e. by first computing the extended Sum-of-Squares-and-Cross-Products (SSCP) matrix for the constant 1 and the variables x1, x2 and y. By doing this one gets all of the terms in the equation. Teradata Warehouse Miner offers the Build Matrix function to build the SSCP matrix directly in the Teradata database using generated SQL. The linear regression module then reads this matrix from metadata results tables and performs the necessary calculations to solve for the least-squares b-coefficients. Therefore, that part of constructing a linear regression algorithm that requires access to the detail data is simply the building of the extended SSCP matrix (i.e. include the constant 1 as the first variable), and the rest is calculated on the client machine.
There is however much more to linear regression analysis than building a model, i.e. calculating the least-squares values of the b-coefficients. Other aspects such as model diagnostics, stepwise model selection and scoring are described below.
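This two-step flow — accumulate an extended SSCP matrix once over the detail data, then solve the normal equations from that matrix — can be sketched outside the product. The sketch below is illustrative only and is not the product's implementation; the data and variable names are made up, and it assumes NumPy is available.

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(scale=0.1, size=50)

# Extended design matrix: the constant 1 as the first "variable", then x1, x2 and y
Z = np.column_stack([np.ones_like(x1), x1, x2, y])

# Extended sum-of-squares-and-cross-products (SSCP) matrix
sscp = Z.T @ Z

# The normal equations use the X'X block and the X'y column of the SSCP matrix
XtX = sscp[:3, :3]
Xty = sscp[:3, 3]
b = np.linalg.solve(XtX, Xty)   # least-squares b0, b1, b2

print(b)                        # approximately [2.0, 1.5, -0.7]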
Model Diagnostics
One of the advantages in using a statistical modeling technique such as linear regression (as opposed to a machine learning technique, for example) is the ability to compute rigorous, well-understood measurements of the effectiveness of the model. Most of these measurements are based upon a huge body of work in the areas of probability and probability theory.
Goodness of fit
Several model diagnostics are provided to give an assessment of the effectiveness of the overall model. One of these is called the residual sums of squares or sum of squared errors RSS, which is simply the sum of the squared differences between the dependent variable y estimated by the model and the actual value of y, over all of the rows:
$RSS = \sum (y - \hat{y})^2$
Now suppose a similar measure was created based on a naive estimate of y, namely the mean value $\bar{y}$:
$TSS = \sum (y - \bar{y})^2$
often called the total sums of squares about the mean. Then, a measure of the improvement of the fit given by the linear regression model is given by:
$R^2 = \frac{TSS - RSS}{TSS}$
This is called the squared multiple correlation coefficient R2, which has a value between 0 and 1, with 1 indicating the maximum improvement in fit over estimating y naively with the mean value of y. The multiple correlation coefficient R is actually the correlation between the real y values and the values predicted based on the independent x variables, sometimes written $R_{y \cdot x_1 x_2 \ldots x_n}$, which is calculated here simply as the positive square root of the R2 value. A variation of this measure adjusted for the number of observations and independent variables in the model is given by the adjusted R2 value:
$\bar{R}^2 = 1 - \frac{n - 1}{n - p - 1}\,(1 - R^2)$
where n is the number of observations and p is the number of independent variables (substitute n-p in the denominator if there is no constant term). The numerator in the equation for R2, namely TSS - RSS, is sometimes called the due-to-regression sums of squares or DRS. Another way of looking at this is that the total unexplained variation about the mean TSS is equal to the variation due to regression DRS plus the unexplained residual variation RSS. This leads to an equation sometimes known as the fundamental equation of regression analysis:
$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$
which is the same as saying that TSS = DRS + RSS. From these values a statistical test called an F-test can be made to determine if all the x variables taken together explain a significant amount of variation in y. This test is carried out on the F-ratio given by:
$F = \frac{meanDRS}{meanRSS}$
The values meanDRS and meanRSS are calculated by dividing DRS and RSS by their respective degrees of freedom (p for DRS and n-p-1 for RSS).
Standard errors and confidence intervals
Measurements are made of the standard deviation of the sampling distribution of each b-coefficient value, and from this, estimates of a confidence interval for each of the coefficients are made. For example, if one of the coefficients has a value of 6, and a 95% confidence interval of 5 to 7, it can be said that the true population coefficient is contained in this interval, with a confidence coefficient of 95%. In other words, if repeated samples were taken of the same size from the population, then 95% of the intervals like the one constructed here would contain the true value for the population coefficient.
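The goodness-of-fit and coefficient diagnostics described above can be reproduced on a small example outside the product. The sketch below is illustrative only and is not the product's implementation; the generated data and names are made up, and it assumes NumPy and SciPy are available.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 100, 2                                   # n observations, p independent variables
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)           # least-squares b-coefficients
resid = y - X @ b
RSS = resid @ resid                             # residual sums of squares
TSS = ((y - y.mean()) ** 2).sum()               # total sums of squares about the mean
R2 = (TSS - RSS) / TSS
adj_R2 = 1 - (n - 1) / (n - p - 1) * (1 - R2)
F = ((TSS - RSS) / p) / (RSS / (n - p - 1))     # meanDRS / meanRSS

s2 = RSS / (n - p - 1)                          # estimated residual variance
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t_stat = b / se                                 # T (Wald) statistic for each coefficient
t_crit = stats.t.ppf(0.975, n - p - 1)
lower, upper = b - t_crit * se, b + t_crit * se # 95% confidence intervals

print(R2, adj_R2, F)
print(t_stat, lower, upper)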
Another set of useful statistics is calculated as the ratio of each b-coefficient value to its standard error. This statistic is sometimes called a T-statistic or Wald statistic. Along with its associated t-distribution probability value, it can be used to assess the statistical significance of this term in the model.
Standardized coefficients
The least-squares estimates of the b-coefficients are converted to so-called beta-coefficients or standardized coefficients to give a model in terms of the z-scores of the independent variables. That is, the entire model is recast to use standardized values of the variables and the coefficients are recomputed accordingly. Standardized values cast each variable into units measuring the number of standard deviations away from the mean value for that variable. The advantage of doing this is that the values of the coefficients are scaled equivalently so that their relative importance in the model can be more easily seen. Otherwise the coefficient for a variable such as income would be difficult to compare to a variable such as age or the number of years an account has been open.
Incremental R-squared
It is possible to calculate the value R2 incrementally by considering the cumulative contributions of x variables added to the model one at a time, namely $R_{y \cdot x_1}$, $R_{y \cdot x_1 x_2}$, ..., $R_{y \cdot x_1 x_2 \ldots x_n}$. These are called incremental R2 values, and they give a measure of how much the addition of each x variable contributes to explaining the variation in y in the observations. This points out the fact that the order in which the independent x variables are specified in creating the model is important.
Multiple Correlation Coefficients
Another measure that can be computed for each independent variable in the model is the squared multiple correlation coefficient with respect to the other independent variables in the model taken together. These values range from 0 to 1, with 0 indicating a lack of correlation and 1 indicating the maximum correlation. Multiple correlation coefficients are sometimes presented in related forms such as variance inflation factors or tolerances. A variance inflation factor is given by the formula:
$V_k = \frac{1}{1 - R_k^2}$
where Vk is the variance inflation factor and Rk2 is the squared multiple correlation coefficient for the kth independent variable. Tolerance is given by the formula Tk = 1 - Rk2, where Tk is the tolerance of the kth independent variable and Rk2 is as before. These values may be of limited value as indicators of possible collinearity or near dependencies among variables in the case of high correlation values, but the absence of high correlation values does not necessarily indicate the absence of collinearity problems. Further, multiple correlation coefficients are unable to distinguish between several near dependencies should they exist. The reader is referred to [Belsley, Kuh and Welsch] for more information on collinearity diagnostics, as well as to the upcoming section on the subject.
Data Quality Reports
A variety of data quality reports are available with the Teradata Warehouse Miner Linear Regression algorithm.
Reports include: 90 1 Constant Variables 2 Variable Statistics 3 Detailed Collinearity Diagnostics • Eigenvalues of Unit Scaled X'X • Condition Indices • Variance Proportions Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression 4 Near Dependency Constant Variables Before attempting to build a model the algorithm checks to see if any variables in the model have a constant value. This check is based on the standard deviation values derived from the SSCP matrix input to the algorithm. If a variable with a constant value, i.e. a standard deviation of zero, is detected, the algorithm stops and notifies the user while producing a Constant Variables Table report. After reading this report, the user may then remove the variables in the report from the model and execute the algorithm again. It is possible that a variable may appear in the Constant Variables Table report that does not actually have a constant value in the data. This can happen when a column has extremely large values that are close together in value. In this case the standard deviation will appear to be zero due to precision loss and will be rejected as a constant column. The remedy for this is to re-scale the values in the column prior to building a matrix or doing the analysis. The ZScore or the Rescale transformation functions may be used for this purpose. Variable Statistics The user may optionally request that a Variables Statistics Report be provided, giving the mean value and standard deviation of each variable in the model based on the SSCP matrix provided as input. Detailed Collinearity Diagnostics One of the conditions that can lead to a poor linear regression model is when the independent variables in the model are not independent of each other, that is, when they are collinear (highly correlated) with one another. Collinearity can be loosely defined as a condition where one variable is nearly a linear combination of one or more other variables, sometimes also called a near dependency. This leads to an ill conditioned matrix of variables. Teradata Warehouse Miner provides an optional Detailed Collinearity Diagnostics report using a specialized technique described in [Belsley, Kuh and Welsch]. This technique involves performing a singular value decomposition of the independent x variables in the model in order to measure collinearity. The analysis proceeds roughly as follows. In order to put all variables on an equal footing, the data is scaled so that each variable adds up to 1 when summed over all the observations or rows. In order to calculate the singular values of X (the rows of X are the observations), the mathematically equivalent square root of the eigenvalues of XTX are computed instead for practical reasons. The condition index of each eigenvalue is calculated as the square root of the ratio of the largest eigenvalue to the given eigenvalue, a value always 1 or greater. The variance decomposition of these eigenvalues is computed using the eigenvalues together with the eigenvectors associated with them. The result is a matrix giving, for each variable, the proportion of variance associated with each eigenvalue. Large condition indices indicate a probable near dependency. A value of 10 may indicate a weak dependency, values of 15 to 30 may be considered a borderline dependency, above 30 worth investigating further, and above 100, a potentially damaging collinearity. 
As a rule of thumb, an eigenvalue with a condition index greater than 30 and an associated variance proportion of greater than 50% with two or more model variables implies that a collinearity Teradata Warehouse Miner User Guide - Volume 3 91 Chapter 1: Analytic Algorithms Linear Regression problem exists. (The somewhat subjective conclusions described here and the experiments they are based on are described in detail in [Belsley, Kuh and Welsch]). An example of the Detailed Collinearity Diagnostics report is given below. Table 25: Eigenvalues of Unit Scaled X'X Factor 1 5.2029 Factor 2 .8393 Factor 3 .5754 Factor 4 .3764 Factor 5 4.1612E-03 Factor 6 1.8793E-03 Factor 7 2.3118E-08 Table 26: Condition Indices Factor 1 1 Factor 2 2.4898 Factor 3 3.007 Factor 4 3.718 Factor 5 35.3599 Factor 6 52.6169 Factor 7 15001.8594 Table 27: Variable Name Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 Factor 6 Factor 7 CONSTANT 1.3353E-09 1.0295E-08 1.3781E-09 1.6797E-08 1.1363E-11 2.1981E-07 1 cust_id 1.3354E-09 1.0296E-08 1.3782E-09 1.6799E-08 1.1666E-11 2.2068E-07 1 income 2.3079E-04 1.8209E-03 1.6879E-03 1.1292E-03 .9951 4.4773E-06 1.2957E-05 age 1.0691E-04 1.9339E-04 9.321E-05 1.7896E-03 1.56E-05 .9963 1.4515E-03 children 2.9943E-03 4.4958E-02 .2361 1.6499E-03 3.6043E-04 .713 9.1708E-04 combo1 2.3088E-04 1.8703E-03 1.6658E-03 1.1339E-03 .995 1.0973E-04 2.3525E-05 combo2 1.4002E-04 3.1477E-05 4.4942E-05 5.0407E-03 4.7784E-06 .9935 1.2583E-03 92 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression Near Dependency In addition to or in place of the Detailed Collinearity Diagnostics report, the user may optionally request a Near Dependency report based on the automated application of the specialized criteria used in the aforementioned report. Requesting the Near Dependency report greatly simplifies the search for collinear variables or near dependencies in the data. The user may specify the threshold value for the condition index (by default 30) and the variance proportion (by default 0.5) such that a near dependency is reported. That is, if two or more variables have a variance proportion greater than the variance proportion threshold, for a condition index with value greater than the condition index threshold, the variables involved in the near dependency are reported along with their variance proportions, their means and their standard deviations. Near dependencies are reported in descending order based on their condition index value, and variables contributing to a near dependency are reported in descending order based on their variance proportion. The following is an example of a Near Dependency report. Table 28: Near Dependency report (example) Variable Name Factor Condition Index Variance Proportion Mean Standard Deviation CONSTANT 7 15001.8594 1 * * cust_id 7 15001.8594 1 1362987.891 293.5012 age 6 52.6169 .9963 33.744 22.3731 combo2 6 52.6169 .9935 25.733 23.4274 children 6 52.6169 .713 .534 1.0029 income 5 35.3599 .9951 16978.026 21586.8442 combo1 5 35.3599 .995 33654.602 43110.862 Stepwise Linear Regression Automated stepwise regression analysis is a technique to aid in regression model selection. That is, it helps in deciding which independent variables to include in a regression model. If there are only two or three independent variables under consideration, one could try all possible models. 
But since there are 2^k - 1 models that can be built from k variables, this quickly becomes impractical as the number of variables increases (32 variables yield more than 4 billion models!). The automated stepwise procedures described below can provide insight into the variables that should be included in a regression model. It is not recommended that stepwise procedures be the sole deciding factor in the makeup of a model. For one thing, these techniques are not guaranteed to produce the best results. And sometimes, variables should be included because of certain descriptive or intuitive qualities, or excluded for subjective reasons. Therefore an element of human decision-making is recommended to produce a model with useful business application.
Forward-Only Stepwise Linear Regression
The forward only procedure consists solely of forward steps as described below, starting without any independent x variables in the model. Forward steps are continued until no variables can be added to the model.
Forward Stepwise Linear Regression
The forward stepwise procedure is a combination of the forward and backward steps described below, starting without any independent x variables in the model. One forward step is followed by one backward step, and these single forward and backward steps are alternated until no variables can be added or removed.
Backward-Only Stepwise Linear Regression
The backward only procedure consists solely of backward steps as described below, starting with all of the independent x variables in the model. Backward steps are continued until no variables can be removed from the model.
Backward Stepwise Linear Regression
The backward stepwise procedure is a combination of the backward and forward steps as described below, starting with all of the independent x variables in the model. One backward step is followed by one forward step, and these single backward and forward steps are alternated until no variables can be added or removed.
Stepwise Linear Regression - Forward Step
Each forward step seeks to add the independent variable xi that will best contribute to explaining the variance in the dependent variable y. In order to do this a quantity called the partial F statistic must be computed for each xi variable that can be added to the model. A quantity called the extra sums of squares is first calculated as follows: ESS = “DRS with xi” - “DRS w/o xi”, where DRS is the Regression Sums of squares or “due-to-regression sums of squares”. Then, the partial F statistic is given by F(xi) = ESS(xi) / meanRSS(xi), where meanRSS is the Residual Mean Square. Each forward step then consists of adding the variable with the largest partial F statistic providing it is greater than the criterion to enter value. An equivalent alternative to using the partial F statistic is to use the probability or P-value associated with the T-statistic mentioned earlier under model diagnostics. The t statistic is the ratio of the b-coefficient to its standard error. Teradata Warehouse Miner offers both alternatives as an option. When the P-value is used, a forward step consists of adding the variable with the smallest P-value providing it is less than the criterion to enter. In this case, if more than one variable has a P-value of 0, the variable with the largest F statistic is entered.
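As an illustration of a single forward step driven by the partial F statistic, the following sketch compares candidate variables and returns the best one only if it clears the criterion to enter. It is not the product's implementation; the function and variable names are made up, and it assumes NumPy is available.

import numpy as np

def rss_and_drs(X, y):
    # Least-squares fit of y on X (X already includes the constant column);
    # returns the residual sums of squares and the due-to-regression sums of squares.
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    rss = resid @ resid
    tss = ((y - y.mean()) ** 2).sum()
    return rss, tss - rss

def forward_step(X_in, candidates, y, criterion_to_enter=3.84):
    # X_in: columns currently in the model (including the constant);
    # candidates: dict of name -> column for variables not yet in the model.
    _, drs_without = rss_and_drs(X_in, y)
    best_name, best_f = None, -np.inf
    for name, col in candidates.items():
        X_try = np.column_stack([X_in, col])
        rss_with, drs_with = rss_and_drs(X_try, y)
        mean_rss = rss_with / (X_try.shape[0] - X_try.shape[1])   # residual mean square
        partial_f = (drs_with - drs_without) / mean_rss           # ESS / meanRSS
        if partial_f > best_f:
            best_name, best_f = name, partial_f
    return (best_name, best_f) if best_f > criterion_to_enter else None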
Stepwise Linear Regression - Backward Step Each backward step seeks to remove the independent variable xi that least contributes to explaining the variance in the dependent variable y. The partial F statistic is calculated for each independent x variable in the model. If the smallest value is less than the criterion to remove, it is removed. As with forward steps, an option is provided to use the probability or P-value associated with the T-statistic, that is, the ratio of the b-coefficient to its standard error. In this case all the 94 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression probabilities or P-values are calculated for the variables currently in the model at one time, and the one with the largest P-value is removed if it is greater than the criterion to remove. Linear Regression and Missing Data Null values for columns in a linear regression analysis can adversely affect results. It is recommended that the listwise deletion option be used when building the input matrix with the Build Matrix function. This ensures that any row for which one of the columns is null will be left out of the matrix computations completely. Another strategy is to use the Recoding transformation function to build a new column, substituting a fixed known value for null values. Yet another option is to use one of the analytic algorithms in Teradata Warehouse Miner to estimate replacement values for null values. This technique is often called missing value imputation. Initiate a Linear Regression Function Use the following procedure to initiate a new Linear Regression analysis in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 55: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Linear Regression: Teradata Warehouse Miner User Guide - Volume 3 95 Chapter 1: Analytic Algorithms Linear Regression Figure 56: Add New Analysis dialog 3 This will bring up the Linear Regression dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Linear Regression - INPUT - Data Selection On the Linear Regression dialog click on INPUT and then click on data selection: Figure 57: Linear Regression > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input, Table, Matrix or Analysis. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. (In this case a matrix will be dynamically built and discarded when the algorithm completes execution). By selecting the Input Source Matrix the user may can select from available matrices created by the Build Matrix function. This has the advantage that the matrix selected for input is available for further analysis after completion of the algorithm, perhaps selecting a different subset of columns from the matrix. By selecting the Input Source Analysis the user can select directly from the output of another analysis of qualifying type in the current project. (In this case a matrix will be dynamically built and discarded when the algorithm completes execution). Analyses that 96 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). 
In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 Select Columns From One Table • Available Databases (only for Input Source equal to Table) — All the databases which are available for the Linear Regression analysis. • Available Matrices (only for Input Source equal to Matrix) — When the Input source is Matrix, a matrix must first be built with the Build Matrix function before linear regression can be performed. Select the matrix that summarizes the data to be analyzed. (The matrix must have been built with more rows than selected columns or the Linear Regression analysis will produce a singular matrix, causing a failure). • Available Analyses (only for Input Source equal to Analysis) — All the analyses that are available for the Linear Regression analysis. • Available Tables (only for Input Source equal to Table or Analysis) — All the tables that are available for the Linear Regression analysis. • Available Columns — All the columns that are available for the Linear Regression analysis. • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can either insert columns as Dependent or Independent columns. Make sure you have the correct portion of the window highlighted. The Dependent variable column is the column whose value is being predicted by the linear regression model. The algorithm requires that the Dependent and Independent columns must be of numeric type (or contain numbers in character format). Linear Regression - INPUT - Analysis Parameters On the Linear Regression dialog click on INPUT and then click on analysis parameters: Figure 58: Linear Regression > Input > Analysis Parameters On this screen select: • Regression Options Teradata Warehouse Miner User Guide - Volume 3 97 Chapter 1: Analytic Algorithms Linear Regression • Include Constant — This option specifies that the linear regression model should include a constant term. With a constant, the linear equation can be thought of as: ŷ = b 0 + b 1 x 1 + + b n x n Without a constant, the equation changes to: ŷ = b 1 x 1 + + b n x n • Stepwise Options — The Linear Regression analysis can use the stepwise technique to automatically determine a variable’s importance (or lack there of) to a particular model. If selected, the algorithm is performed repeatedly with various combinations of independent variable columns to attempt to arrive at a final “best” model. The stepwise options are: Step Direction — (Selecting “None” turns off the Stepwise option). • • Forward Only — Option to add qualifying independent variables one at a time. • Forward — Option for independent variables being added one at a time to an empty model, possibly removing a variable after a variable is added. • Backward Only — Option to remove independent variables one at a time. 
• Backward — Option for variables being removed from an initial model containing all of the independent variables, possibly adding a variable after a variable is removed. Step Method • F Statistic — Option to choose the partial F test statistic (F statistic) as the basis for adding or removing model variables. • P-value — Option to choose the probability associated with the T-statistic (Pvalue) as the basis for adding or removing model variables. • Criterion to Enter • Criterion to Remove — If the step method is to use the F statistic, then an independent variable is only added to the model if the F statistic is greater than the criterion to enter and removed if it is less than the criterion to remove. When the F statistic is used, the default for each is 3.84. If the step method is to use the P-value, then an independent variable is added to the model if the P-value is less than the criterion to enter and removed if it is greater than the criterion to remove. When the P-value is used, the default for each is 0.05. The default F statistic criteria of 3.84 corresponds to a P-value of 0.05. These default values are provided with the assumption that the input variables are somewhat correlated. If this is not the case, a lower F statistic or higher P-value criteria can be used. Also, a higher F statistic or lower P value can be specified if more stringent criteria are desired for including variables in a model. • Report Options — Statistical diagnostics can be taken on each variable during the execution of the Linear Regression Analysis. These diagnostics include: • 98 Variable Statistics — This report gives the mean value and standard deviation of each variable in the model based on the SSCP matrix provided as input. Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression • Near Dependency — This report lists collinear variables or near dependencies in the data based on the SSCP matrix provided as input. Condition Index Threshold — Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The one that involves this parameter is the occurrence of a large condition index value associated with a specially constructed principal factor. If a factor has a condition index greater than this parameter’s value, it is a candidate for the Near Dependency report. A default value of 30 is used as a rule of thumb. Variance Proportion Threshold — Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The one that involves this parameter is when two or more variables have a variance proportion greater than this threshold value for a factor with a high condition index. Another way of saying this is that a ‘suspect’ factor accounts for a high proportion of the variance of two or more variables. This parameter defines what a high proportion of variance is. A default value of 0.5 is used as a rule of thumb. • Detailed Collinearity Diagnostics — This report provides the details behind the Near Dependency report, consisting of the “Eigenvalues of Unit Scaled X’X”, “Condition Indices” and “Variance Proportions” tables. Linear Regression - OUTPUT On the Linear Regression dialog click on OUTPUT: Figure 59: Linear Regression > OUTPUT On this screen select: • Store the variables table of this analysis in the database — Check this box to store the model variables table of this analysis in in the database. • Database Name — The name of the database to create the output table in. 
• Output Table Name — The name of the output table.
• Advertise Output — The Advertise Output option "advertises" output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)).
• Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
By way of an example, the tutorial example creates the following output table:
Table 29: Linear Regression - OUTPUT
Column Name | B Coefficient | Standard Error | T Statistic | P-Value | Lower | Upper | Standard Coefficient | Incremental R-Squared | SqMultiCorrCoef(1-Tolerance)
nbr_children | 0.8994 | 0.3718 | 2.4187 | 0.0158 | 0.1694 | 1.6294 | 0.0331 | 0.8787 | 0.1312
years_with_bank | 0.2941 | 0.1441 | 2.0404 | 0.0417 | 0.0111 | 0.5771 | 0.0263 | 0.8794 | 0.0168
avg_sv_tran_cnt | -0.7746 | 0.2777 | -2.7887 | 0.0054 | -1.3198 | -0.2293 | -0.0360 | 0.8779 | 0.0207
avg_cc_bal | -0.0174 | 0.0004 | -41.3942 | 0.0000 | -0.0182 | -0.0166 | -0.6382 | 0.7556 | 0.3135
ckacct | 10.2793 | 0.8162 | 12.5947 | 0.0000 | 8.6770 | 11.8815 | 0.1703 | 0.8732 | 0.1073
income | 0.0005 | 0.0000 | 24.5414 | 0.0000 | 0.0005 | 0.0005 | 0.3777 | 0.8462 | 0.3110
married | -4.3056 | 0.8039 | -5.3558 | 0.0000 | -5.8838 | -2.7273 | -0.0718 | 0.8766 | 0.0933
(Constant) | -6.4640 | 0.9749 | -6.6301 | 0.0000 | -8.3780 | -4.5500 | 0.0000 | 0.0000 | N/A
If Database Name is twm_results and Output Table Name is test, the output table is defined as:
CREATE SET TABLE twm_results.test (
"Column Name" VARCHAR(30) CHARACTER SET UNICODE NOT CASESPECIFIC,
"B Coefficient" FLOAT,
"Standard Error" FLOAT,
"T Statistic" FLOAT,
"P-Value" FLOAT,
"Lower" FLOAT,
"Upper" FLOAT,
"Standard Coefficient" FLOAT,
"Incremental R-Squared" FLOAT,
"SqMultiCorrCoef(1-Tolerance)" FLOAT)
UNIQUE PRIMARY INDEX ( "Column Name" );
Run the Linear Regression
After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Linear Regression
The results of running the Teradata Warehouse Miner Linear Regression analysis include a variety of statistical reports on the individual variables and the generated model, as well as bar charts displaying coefficients and T-statistics. All of these results are outlined below.
Linear Regression - RESULTS
On the Linear Regression dialog, click on RESULTS (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed) to view results. Result options are as follows:
Linear Regression Reports
Data Quality Reports
• Variable Statistics — If selected on the Results Options tab, this report gives the mean value and standard deviation of each variable in the model based on the SSCP matrix provided as input.
• Near Dependency — If selected on the Results Options tab, this report lists collinear variables or near dependencies in the data based on the SSCP matrix provided as input. Entries in the Near Dependency report are triggered by two conditions occurring simultaneously.
The first is the occurrence of a large condition index value associated with a specially constructed principal factor. If a factor has a condition index greater than the parameter specified on the Results Option tab, it is a candidate for the Near Dependency report. The other is when two or more variables have a variance proportion greater than a threshold value for a factor with a high condition index. Another way of saying this is that a ‘suspect’ factor accounts for a high proportion of the variance of two or more variables. The parameter to defines what a high proportion of variance is also set on the Results Option tab. A default value of 0.5. • Detailed Collinearity Diagnostics — If selected on the Results Options tab, this report provides the details behind the Near Dependency report, consisting of the following tables. • Eigenvalues of Unit Scaled X'X — Report of the eigenvalues of all variables scaled so that each variable adds up to 1 when summed over all the observations or rows. In order to calculate the singular values of X (the rows of X are the observations), the mathematically equivalent square root of the eigenvalues of XTX are computed instead for practical reasons. • Condition Indices — The condition index of each eigenvalue, calculated as the square root of the ratio of the largest eigenvalue to the given eigenvalue, a value always 1 or greater. • Variance Proportions — The variance decomposition of these eigenvalues is computed using the eigenvalues together with the eigenvectors associated with them. The result is a matrix giving, for each variable, the proportion of variance associated with each eigenvalue. Linear Regression Step N (Stepwise-only) • Linear Regression Model Assessment Teradata Warehouse Miner User Guide - Volume 3 101 Chapter 1: Analytic Algorithms Linear Regression • Squared Multiple Correlation Coefficient (R-squared) — This is the same value calculated for the Linear Regression report, but it is calculated here for the model as it stands at this step. The closer to 1 its value is, the more effective the model. • Standard Error of Estimate — This is the same value calculated for the Linear Regression report, but it is calculated here for the model as it stands at this step. • In Report — This report contains the same fields as the Variables in Model report (described below) with the addition of the following field. • F Stat — F Stat is the partial F statistic for this variable in the model, which may be used to decide its inclusion in the model. A quantity called the extra sums of squares is first calculated as follows: ESS = “DRS with x” - “DRS w/o”, where DRS is the Regression Sums of squares or “due-to-regression sums of squares”. Then the partial F statistic is given by F(xi) = ESS(xi) / meanRSS(xi), where meanRSS is the Residual Mean Square. • Out Report • Independent Variable — This is an independent variable not included in the model at this step. • P-Value — This is the probability associated with the T-statistic associated with each variable not in, or excluded from, the model, as described for the Variables in Model report as T Stat and P-Value. (Note that it is not the P-Value associated with F Stat). When the P-Value is used for step decisions, a forward step consists of adding the variable with the smallest P-value providing it is less than the criterion to enter. 
For backward steps, all the probabilities or P-values are calculated for the variables currently in the model at one time, and the one with the largest P-value is removed if it is greater than the criterion to remove. • F Stat — F Stat is the partial F statistic for this variable in the model, which may be used to decide its inclusion in the model. A quantity called the extra sums of squares is first calculated as follows: ESS = “DRS with xi” - “DRS w/o xi”, where DRS is the Regression Sums of squares or “due-to-regression sums of squares”. Then the partial F statistic is given by F(xi) = ESS(xi) / meanRSS(xi), where meanRSS is the Residual Mean Square. • Partial Correlation — The partial correlation coefficient for a variable not in the model is based on the square root of a measure called the coefficient of partial determination, which represents the marginal contribution of the variable to a model that doesn’t include the variable. (Here, contribution to the model means reduction in the unexplained variation of the dependent variable). The formula for the partial correlation of the ith independent variable in the linear regression model built from all the independent variables is given by: Ri = DRS – NDRS ----------------------------------RSS where DRS is the Regression Sums of squares for the model including those variables currently in the model, NDRS is the Regression Sums of squares for the current model 102 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression without the ith variable, and RSS is the Residual Sums of squares for the current model. Linear Regression Model • Total Observations — This is the number of rows originally summarized in the SSCP matrix that the linear regression analysis is based on. The number of observations reflects the row count after any rows were eliminated by listwise deletion (recommended) when the matrix was built. • Total Sums of squares — The so-called Total Sums of squares is given by the 2 equation TSS = y – y where y is the dependent variable that is being predicted and y is its mean value. The Total Sums of squares is sometimes also called the total sums of squares about the mean. Of particular interest is its relation to the “due-to-regression sums of squares” and the “residual sums of squares” given by TSS = DRS + RSS. This is a shorthand form of what is sometimes known as the fundamental equation of regression analysis: y – y 2 = ŷ – y 2 = y – ŷ 2 where y is the dependent variable, y is its mean value and ŷ is its predicted value. • Multiple Correlation Coefficient (R) — The multiple correlation coefficient R is the correlation between the real dependent variable y values and the values predicted based on the independent x variables, sometimes written R y x1 x2 xn , which is calculated in Teradata Warehouse Miner simply as the positive square root of the Squared Multiple Correlation Coefficient (R2) value. • Squared Multiple Correlation Coefficient (R-squared) — The squared multiple correlation coefficient R2 is a measure of the improvement of the fit given by the linear regression model over estimating the dependent variable y naïvely with the mean value of y. It is given by: TSS – RSS 2 R = ---------------------------TSS where TSS is the Total Sums of squares and RSS is the Residual Sums of squares. It has a value between 0 and 1, with 1 indicating the maximum improvement in fit over estimating y naïvely with the mean value of y. 
• Adjusted R-squared — The adjusted R2 value is a variation of the Squared Multiple Correlation Coefficient (R2) that has been adjusted for the number of observations and independent variables in the model. Its formula is given by: n–1 2 2 R = 1 – --------------------- 1 – R n–p–1 where n is the number of observations and p is the number of independent variables (substitute n-p in the denominator if there is no constant term). • Standard Error of Estimate — The standard error of estimate is calculated as the square root of the average squared residual value over all the observations, i.e. Teradata Warehouse Miner User Guide - Volume 3 103 Chapter 1: Analytic Algorithms Linear Regression y – ŷ -------------------------- 2 n–p–1 where y is the actual value of the dependent variable, ŷ is its predicted value, n is the number of observations, and p is the number of independent variables (substitute n-p in the denominator if there is no constant term). • Regression Sums of squares — This is the “due-to-regression sums of squares” or DRS referred to in the description of the Total Sums of squares, where it is pointed out that TSS = DRS + RSS. It is also the middle term in what is sometimes known as the fundamental equation of regression analysis: y – y 2 = ŷ – y 2 = y – ŷ 2 where y is the dependent variable, is its mean value and is its predicted value. • Regression Degrees of Freedom — The Regression Degrees of Freedom is equal to the number of independent variables in the linear regression model. It is used in the calculation of the Regression Mean-Square. • Regression Mean-Square — The Regression Mean-Square is simply the Regression Sums of squares divided by the Regression Degrees of Freedom. This value is also the numerator in the calculation of the Regression F Ratio. • Regression F Ratio — A statistical test called an F-test is made to determine if all the independent x variables taken together explain a statistically significant amount of variation in the dependent variable y. This test is carried out on the F-ratio given by meanDRS F = -------------------------meanRSS where meanDRS is the Regression Mean-Square and meanRSS is the Residual MeanSquare. A large value of the F Ratio means that the model as a whole is statistically significant. (The easiest way to assess the significance of this term in the model is to check if the associated Regression P-Value is less than 0.05. However, the critical value of the F Ratio could be looked up in an F distribution table. This value is very roughly in the range of 1 to 3, depending on the number of observations and variables). • Regression P-value — This is the probability or P-value associated with the statistical test on the Regression F Ratio. This statistical F-test is made to determine if all the independent x variables taken together explain a statistically significant amount of variation in the dependent variable y. A value close to 0 indicates that they do. (The hypothesis being tested or null hypothesis is that the coefficients in the model are all zero except the constant term, i.e. all the corresponding independent variables together contribute nothing to the model. The P-value in this case is the probability that the null hypothesis is true and the given F statistic has the value it has or smaller. A right tail test on the F distribution is performed with a 5% significance level used by convention. If the P-value is less than the significance level, i.e. less than 0.05, the null hypothesis should be rejected, i.e. 
the coefficients taken together are significant and not all 0). 104 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression • Residual Sums of squares — The residual sums of squares or sum of squared errors RSS is simply the sum of the squared differences between the dependent variable estimated by the model and the actual value of y, over all of the rows: RSS = y – ŷ 2 • Residual Degrees of Freedom — The Residual Degrees of Freedom is given by n-p-1 where n is the number of observations and p is the number of independent variables (or np if there is no constant term). It is used in the calculation of the Residual Mean-Square. • Residual Mean-Square — The Residual Mean-Square is simply the Residual Sums of squares divided by the Residual Degrees of Freedom. This value is also the denominator in the calculation of the Regression F Ratio. Linear Regression Variables in Model Report • Dependent Variable — The dependent variable is the variable being predicted by the linear regression model. • Independent Variable — Each independent variable in the model is listed along with accompanying measures. Unless the user deselects the option Include Constant on the Regression Options tab of the input dialog, the first independent variable listed is CONSTANT, a fixed value representing the constant term in the linear regression model. • B Coefficient — Linear regression attempts to find the b-coefficients in the equation ŷ = b 0 + b 1 x 1 + b n x n in order to best predict the value of the dependent variable y based on the independent variables x1 to xn. The best values of the coefficients are defined to be the values that minimize the sum of squared error values y – ŷ 2 over all the observations. • Standard Error — This is the standard error of the B Coefficient term of the linear regression model, a measure of how accurate the B Coefficient term is over all the observations used to build the model. It is the basis for estimating a confidence interval for the B Coefficient value. • T Statistic — The T-statistic is the ratio of a B Coefficient value to its standard error (Std Error). Along with the associated t-distribution probability value or P-value, it can be used to assess the statistical significance of this term in the linear model. (The easiest way to assess the significance of this term in the model is to check if the Pvalue is less than 0.05. However, one could look up the critical T Stat value in a two-tailed T distribution table with probability .95 and degrees of freedom roughly the number of observations minus the number of variables. This would show that for all practical purposes, if the absolute value of T Stat is greater than 2 the model term is statistically significant). • P-value — This is the t-distribution probability value associated with the T-statistic (T Stat), that is, the ratio of the b-coefficient value to its standard error (Std Error). It can be used to assess the statistical significance of this term in the linear model. A value close to 0 implies statistical significance and means this term in the model is important. Teradata Warehouse Miner User Guide - Volume 3 105 Chapter 1: Analytic Algorithms Linear Regression (The hypothesis being tested or null hypothesis is that the coefficient in the model is actually zero, i.e. the corresponding independent variable contributes nothing to the model. 
The P-value in this case is the probability that the null hypothesis is true and the given T-statistic has the absolute value it has or smaller. A two-tailed test on the tdistribution is performed with a 5% significance level used by convention. If the P-value is less than the significance level, i.e. less than 0.05, the null hypothesis should be rejected, i.e. the coefficient is statistically significant and not 0). • Squared Multiple Correlation Coefficient (R-squared) — The Squared Multiple Correlation Coefficient (Rk2) is a measure of the correlation of this, the kth variable with respect to the other independent variables in the model taken together. (This measure should not be confused with the R2 measure of the same name that applies to the model taken as a whole). The value ranges from 0 to 1 with 0 indicating a lack of correlation and 1 indicating the maximum correlation. It is not calculated for the constant term in the model. Multiple correlation coefficients are sometimes presented in related forms such as variance inflation factors or tolerances. The variance inflation factor is given by the formula: 1 V k = --------------21 – Rk where Vk is the variance inflation factor and Rk2 is the squared multiple correlation coefficient for the kth independent variable. Tolerance is given by the 2 formula T k = 1 – R k where Tk is the tolerance of the kth independent variable and Rk2 is as before. (Refer to the section Multiple Correlation Coefficients for details on the limitations of using this measure to detect collinearity problems in the data). • Lower — Lower is the lower value in the confidence interval for this coefficient and is based on its standard error value. For example, if the coefficient has a value of 6 and a confidence interval of 5 to 7, it means that according to the normal error distribution assumptions of the model, there is a 95% probability that the true population value of the coefficient is actually between 5 and 7. • Upper — Upper is the upper value in the confidence interval for this coefficient based on its standard error value. For example, if the coefficient has a value of 6 and a confidence interval of 5 to 7, it means that according to the normal error distribution assumptions of the model, there is a 95% probability that the true population value of the coefficient is actually between 5 and 7. • Standard Coefficient — Standardized coefficients, sometimes called beta-coefficients, express the linear model in terms of the z-scores or standardized values of the independent variables. Standardized values cast each variable into units measuring the number of standard deviations away from the mean value for that variable. The advantage of examining standardized coefficients is that they are scaled equivalently, so that their relative importance in the model can be more easily seen. 106 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression • Incremental R-squared — It is possible to calculate the model’s Squared Multiple Correlation value incrementally by considering the cumulative contributions of x variables added to the model one at a time, namely R y x R y x x R y x x x . 1 1 2 1 2 n These are called Incremental R2 values, and they give a measure of how much the addition of each x variable contributes to explaining the variation in y in the observations. Linear Regression Graphs The Linear Regression Analysis can display the coefficients and/or T-statistics of the resultant model. 
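Although the guide does not spell out the arithmetic, the Standard Coefficient values described above are conventionally obtained by rescaling each B Coefficient with the ratio of the standard deviation of its independent variable to the standard deviation of the dependent variable. The sketch below shows that textbook relationship only; the standard-deviation figures are hypothetical placeholders (in practice the values reported by the Variable Statistics report would be used), and the function name is not part of the product.

```python
# Minimal sketch of the usual beta-coefficient relationship:
#   beta_k = b_k * (std dev of x_k) / (std dev of y)
# The standard deviations below are hypothetical; real values would come
# from the Variable Statistics report. The constant term is omitted
# because it has no standardized coefficient.

def standardized_coefficients(b_coeffs, x_std, y_std):
    return {name: b * (x_std[name] / y_std) for name, b in b_coeffs.items()}

b_coeffs = {"avg_cc_bal": -0.0174, "income": 0.0005}   # B Coefficients from the model
x_std = {"avg_cc_bal": 1100.0, "income": 22600.0}      # hypothetical standard deviations
y_std = 30.0                                           # hypothetical std dev of cc_rev

print(standardized_coefficients(b_coeffs, x_std, y_std))
```

Because these rescaled values share a common scale, their relative magnitudes indicate which variables carry the most weight in the model, which is what the Weights Graph described next visualizes.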
Weights Graph This graph displays the relative magnitudes of the standardized coefficients and/or the Tstatistic associated with each standardized coefficient in the linear regression model. The sign, positive or negative, is portrayed by the colors red or blue respectively. The user may scroll to the left or right to see all the variables in the model. The T-statistic is the ratio of the coefficient value to its standard error, so the larger its value the more reliable the value of the coefficient is. The following options are available on the Graphics Options tab on the Linear Weights graph: • Graph Type — The following can be graphed by the Linear Weights Graph • T Statistic — Display the T Statistics on the bar chart. • Standardized Coefficient — Display the Standardized Coefficients on the bar chart. • Vertical Axis — The user may request multiple vertical axes in order to display separate coefficient values that are orders of magnitude different from the rest of the values. If the coefficients are of roughly the same magnitude, this option is grayed out. • Single — Display the Standardized Coefficients or T Statistics on single axis on the bar chart. • Multiple — Display the Standardized Coefficients or T Statistics on dual axes on the bar chart. Tutorial - Linear Regression Parameterize a Linear Regression Analysis as follows: • Available Matrices — Customer_Analysis_Matrix • Dependent Variable — cc_rev • Independent Variables • income — age • years_with_bank — nbr_children • female — single • married — separated • ccacct — ckacct • svacct — avg_cc_bal • avg_ck_bal — avg_sv_bal Teradata Warehouse Miner User Guide - Volume 3 107 Chapter 1: Analytic Algorithms Linear Regression • avg_cc_tran_amt — avg_cc_tran_cnt • avg_ck_tran_amt — avg_ck_tran_cnt • avg_sv_tran_amt — avg_sv_tran_cnt • Include Constant — Enabled • Step Direction — Forward • Step Method — F Statistic • Criterion to Enter — 3.84 • Criterion to Remove — 3.84 Run the analysis, and click on Results when it completes. For this example, the Linear Regression Analysis generated the following pages. A single click on each page name populates Results with the item. Table 30: Linear Regression Report Total Observations: 747 Total Sum of Squares: 6.69E5 Multiple Correlation Coefficient (R): 0.9378 Squared Multiple Correlation Coefficient (1-Tolerance): 0.8794 Adjusted R-Squared: 0.8783 Standard Error of Estimate: 1.04E1 Table 31: Regression vs. Residual Sum of Squares Degrees of Freedom Mean-Square F Ratio P-value Regression 5.88E5 7 8.40E4 769.8872 0.0000 Residual 8.06E4 739 1.09E2 N/A N/A Table 32: Execution Status 108 6/20/2004 2:07:28 PM Getting Matrix 6/20/2004 2:07:28 PM Stepwise Regression Running... 
6/20/2004 2:07:28 PM Step 0 Complete 6/20/2004 2:07:28 PM Step 1 Complete 6/20/2004 2:07:28 PM Step 2 Complete 6/20/2004 2:07:28 PM Step 3 Complete 6/20/2004 2:07:28 PM Step 4 Complete 6/20/2004 2:07:28 PM Step 5 Complete Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression Table 32: Execution Status 6/20/2004 2:07:28 PM Step 6 Complete 6/20/2004 2:07:28 PM Step 7 Complete 6/20/2004 2:07:29 PM Creating Report Table 33: Variables Column Name B Standard Coefficient Error T Statistic P-value Lower Upper Standard Incremental Coefficient R Squared Multiple Correlation Coefficient (1Tolerance) (Constant) -6.4640 0.9749 -6.6301 0.0000 -8.3780 -4.5500 0.0000 0.0000 N/A avg_cc_bal -0.0174 0.0004 -41.3942 0.0000 -0.0182 -0.0166 -0.6382 0.7556 0.3135 income 0.0005 0.0000 24.5414 0.0000 0.0005 0.0005 0.3777 0.8462 0.3110 ckacct 10.2793 0.8162 12.5947 0.0000 8.6770 11.8815 0.1703 0.8732 0.1073 married -4.3056 0.8039 -5.3558 0.0000 -5.8838 -2.7273 -0.0718 0.8766 0.0933 avg_sv_ tran_cnt -0.7746 0.2777 -2.7887 0.0054 -1.3198 -0.2293 -0.0360 0.8779 0.0207 nbr_ children 0.8994 0.3718 2.4187 0.0158 0.1694 1.6294 0.0331 0.8787 0.1312 years_with_ 0.2941 bank 0.1441 2.0404 0.0417 0.0111 0.5771 0.0263 0.8794 0.0168 Step 0 Table 34: Out Independent Variable P-value F Stat age 0.0000 19.7680 avg_cc_bal 0.0000 2302.7983 avg_cc_tran_amt 0.0000 69.5480 avg_cc_tran_cnt 0.0000 185.3197 avg_ck_bal 0.0000 116.5094 avg_ck_tran_amt 0.0000 271.3578 avg_ck_tran_cnt 0.0002 13.9152 avg_sv_bal 0.0000 37.8598 Teradata Warehouse Miner User Guide - Volume 3 109 Chapter 1: Analytic Algorithms Linear Regression Table 34: Out Independent Variable P-value F Stat avg_sv_tran_amt 0.0000 76.1104 avg_sv_tran_cnt 0.7169 0.1316 ccacct 0.1754 1.8399 ckacct 0.0000 105.5843 female 0.5404 0.3751 income 0.0000 647.3239 married 0.8937 0.0179 nbr_children 0.0000 30.2315 separated 0.0000 28.7618 single 0.0000 17.1850 svacct 0.0001 15.7289 years_with_bank 0.1279 2.3235 Step 1 Table 35: Model Assessment Squared Multiple Correlation Coefficient (1-Tolerance) 0.7556 Standard Error of Estimate 14.8111 Table 36: Columns In (Part 1) Independent Variable B Coefficient Standard Error T Statistic P-value avg_cc_bal -0.0237 0.0005 -47.9875 0.0000 Independent Variable B Coefficient Lower Upper F Stat avg_cc_bal -0.0237 -0.0247 -0.0227 2302.7983 Table 37: Columns In (Part 2) 110 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression Table 38: Columns In (Part 3) Independent Variable B Coefficient Standard Coefficient Squared Multiple Correlation Coefficient (1-Tolerance) avg_cc_bal -0.0237 -0.8692 0.0000 Incremental R2 0.7556 Table 39: Columns Out Independent Variable P-value F Stat Partial Correlation age 0.0539 3.7287 0.0708 avg_cc_tran_amt 0.0000 27.4695 0.1921 avg_cc_tran_cnt 0.2346 1.4153 0.0436 avg_ck_bal 0.0000 17.1826 0.1520 avg_ck_tran_amt 0.0000 94.9295 0.3572 avg_ck_tran_cnt 0.4712 0.5198 0.0264 avg_sv_bal 0.0083 6.9952 0.0970 avg_sv_tran_amt 0.0164 5.7848 0.0882 avg_sv_tran_cnt 0.1314 2.2807 0.0554 ccacct 0.8211 0.0512 0.0083 ckacct 0.0000 41.3084 0.2356 female 0.3547 0.8575 0.0340 income 0.0000 438.7799 0.7680 married 0.4812 0.4967 0.0258 nbr_children 0.0000 30.4645 0.2024 separated 0.0004 12.8680 0.1315 single 0.0024 9.3169 0.1119 svacct 0.0862 2.9523 0.0630 years_with_bank 0.3407 0.9090 0.0350 Linear Weights Graph By default, the Linear Weights graph displays the relative magnitudes of the T-statistic associated with each coefficient in the linear regression 
model: Teradata Warehouse Miner User Guide - Volume 3 111 Chapter 1: Analytic Algorithms Linear Regression Figure 60: Linear Regression Tutorial: Linear Weights Graph Select the Graphics Options tab and change the Graph Type to Standardized Coefficient to view the standardized coefficient values. Although not generated automatically, a Scatter Plot is useful for analyzing the model built with the Linear Regression analysis. As an example, a scatter plot is brought up to look at the dependent variable (“cc_rev”), with the first two independent variables that made it into the model (“avg_cc_bal,” “income”). Create a new Scatter Plot analysis, and pick these three variables in the Selected Tables and Columns option. The results are shown first in two dimensions (avg_cc_bal and cc_rev), and then with all three: 112 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Linear Regression Figure 61: Linear Regression Tutorial: Scatter Plot (2d) Figure 62: Linear Regression Tutorial: Scatter Plot (3d) Teradata Warehouse Miner User Guide - Volume 3 113 Chapter 1: Analytic Algorithms Logistic Regression Logistic Regression Overview In many types of regression problems, the response variable or dependent variable to be predicted has only two possible outcomes. For example, will the customer buy the product in response to the promotion or not? Is the transaction fraudulent or not? Will the customer close their account or not? There are many examples of business problems with only two possible outcomes. Unfortunately the linear regression model comes up short in finding solutions to this type of problem. It is worth trying to understand what these shortcomings are and how the logistic regression model is an improvement when predicting a two-valued response variable. When the response variable y has only two possible values, which may be coded as a 0 and 1, the expected value of yi, E(yi), is actually the probability that the value will be 1. The error term for a linear regression model for a two-valued response function also has only two possible values, so it doesn't have a normal distribution or constant variance over the values of the independent variables. Finally, the regression model can produce a value that doesn't fall within the necessary constraint of 0 to 1. What would be better would be to compute a continuous probability function between 0 and 1. In order to achieve this continuous probability function, the usual linear regression expression b0 + b1x1 + ... + bnxn is transformed using a function called a logit transformation function. This function is an example of a sigmoid function, so named because it looks like a sigma or 's' when plotted. It is of course the logit transformation function that gives rise to the term logistic regression. The type of logistic regression model that Teradata Warehouse Miner supports is one with a two-valued dependent variable, referred to as a binary logit model. However, Teradata Warehouse Miner is capable of coding values for the dependent variable so that the user is not required to code their dependent variable to two distinct values. The user can choose which values to represent as the response value (i.e. 1 or TRUE) and all other will be treated as nonresponse values (i.e. 0 or FALSE). Even though values other than 1 and 0 are supported in the dependent variable, throughout this section the dependent variable response value is represented as 1 and the non-response value as 0 for ease of reading. 
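As a concrete illustration of the response coding just described, the short sketch below maps a hypothetical dependent column onto the 1/0 values used internally. The column values and the chosen response value are invented for the example and are not taken from the tutorial data.

```python
# Sketch of the response coding described above: the value chosen as the
# "response" becomes 1 and every other non-null value becomes 0.
# The observed values and the chosen response value are hypothetical.

def code_response(values, response_value):
    return [1 if v == response_value else 0 for v in values if v is not None]

observed = ["Y", "N", "N", "Y", None, "N"]   # hypothetical dependent column
print(code_response(observed, "Y"))          # [1, 0, 0, 1, 0] -- the null row is dropped
```

Nulls are skipped here simply to mirror the listwise deletion behavior discussed later under “Logistic Regression and Missing Data.”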
The primary sources of information and formulae in this section are [Hosmer] and [Neter]. Logit model The logit transformation function is chosen because of its mathematical power and simplicity, and because it lends an intuitive understanding to the coefficients eventually created in the model. The following equations describe the logistic regression model, with x being the probability that the dependent variable is 1, and g(x) being the logit transformation: b +b x ++b x n n e 0 1 x x = -------------------------------------------------b + b x + + bn xn 1+e 0 1 x 114 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Logistic Regression x g x = ln -------------------- = b 0 + b 1 x 1 + b n x n 1 – x Notice that the logit transformation g(x) has linear parameters (b-values) and may be continuous with unrestricted range. Using these functions, a binomial error distribution is found with y = x + . The solution to a logistic regression model is to find the b-values that “best” predict the dichotomous y variable based on the values of the numeric x variables. Maximum likelihood In linear regression analysis it is possible to use a least-squares approach to finding the best bvalues in the linear regression equation. The least-squared error approach leads to a set of n normal equations in n unknowns that can be solved for directly. But that approach does not work here for logistic regression. Suppose any b-values are selected and the question is asked what is the likelihood that they match the logistic distribution defined, using statistical principles and the assumption that errors have a normal probability distribution. This technique of picking the most likely b-values that match the observed data is known as a maximum likelihood solution. In the case of linear regression, a maximum likelihood solution turns out to be mathematically equivalent to a least squares solution. But here maximum likelihood must be used directly. For convenience, compute the natural logarithm of the likelihood function so that it is possible to convert the product of likelihood’s into a sum, which is easier to work with. The log likelihood equation for a given vector B of b-values with v x-variables is given by: n n yi B'X – ln 1 + exp B'X ln L b 0 b v = i=1 i=1 where B’X = b0 + b1x1 + ... + bvxv. By differentiating this equation with respect to the constant term b0 and with respect to the variable terms bi, the likelihood equations are derived: n yi – xi = 0 i=1 and n xi yi – xi = 0 i=1 where Teradata Warehouse Miner User Guide - Volume 3 115 Chapter 1: Analytic Algorithms Logistic Regression exp B'X x i = --------------------------------1 + exp B'X The log likelihood equation is not linear in the unknown b-value parameters, so it must be solved using non-linear optimization techniques described below. Computational technique Unlike with linear regression, logistic regression calculations cannot be based on an SSCP matrix. Teradata Warehouse Miner therefore dynamically generates SQL to perform the calculations required to solve the model, produce model diagnostics, produce success tables, and to score new data with a model once it is built. However, to enhance performance with small data sets, Teradata Warehouse Miner provides an optional in-memory calculation feature (that is also helpful when one of the stepwise options is used). This feature selects the data into the client system’s memory if it will fit into a user-specified maximum memory amount. 
The maximum amount of memory in megabytes to use is specified on the expert options tab of the analysis input screen. The user can adjust this value according to their workstation and network requirements. Setting this amount to zero will disable the feature. Teradata Warehouse Miner offers two optimization techniques for logistic regression, the default method of iteratively reweighted least squares (RLS), equivalent to the Gauss-Newton technique, and the quasi-Newton method of Broyden-Fletcher-Goldfarb-Shanno (BFGS). The RLS method is considerably faster than the BFGS method unless there are a large number of columns (RLS grows in complexity roughly as the square of the number of columns). Having a choice between techniques can be useful for more than performance reasons however, since there may be cases where one or the other technique has better convergence properties. You may specify your choice of technique, or allow Teradata Warehouse Miner to automatically select it for you. With the automatic option the program will select RLS if there are less than 35 independent variable columns; otherwise it will select BFGS. Logistic Regression Model Diagnostics Logistic regression has counterparts to many of the same model diagnostics available with linear regression. In a similar manner to linear regression, these diagnostics provide a mathematically sound way to evaluate a model built with logistic regression. Standard errors and statistics As is the case with linear regression, measurements are made of the standard error associated with each b-coefficient value. Similarly, the T-statistic or Wald statistic as it is also called, is calculated for each b-coefficient as the ratio of the b-coefficient value to its standard error. Along with its associated t-distribution probability value, it can be used to assess the statistical significance of this term in the model. The computation of the standard errors of the coefficients is based on a matrix called the information matrix or Hessian matrix. This matrix is the matrix of second order partial derivatives of the log likelihood function with respect to all possible pairs of the coefficient values. The formula for the “j, k” element of the information matrix is: 116 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Logistic Regression 2 n LB ------------------ = – x ik x ik i 1 – i B j B k i–1 where exp B'X x i = --------------------------------1 + exp B'X Unlike the case with linear regression, confidence intervals are not computed directly on the standard error values, but on something called the odds ratios, described below. Odds ratios and confidence intervals In linear regression, the meaning of each b-coefficient in the model can be thought of as the amount the dependent y variable changes when the corresponding independent x variable changes by 1. Because of the logit transformation, however, the meaning of each b-coefficient in a logistic regression model is not so clear. In a logistic regression model, the increase of an x variable by 1 implies a change in the odds that the outcome y variable will be 1 rather than 0. Looking back at the formula for the logit response function: x g x = ln -------------------- = b 0 + + b n x n 1 – x it is evident that the response function is actually the log of the odds that the response is 1, where x is the probability that the response is 1 and 1 – x is the probability that the response is 0. Now suppose that one of the x variables, say xj, varies by 1. 
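To make the preceding description of standard errors concrete, the sketch below computes the T-statistic (Wald statistic) as the ratio of a b-coefficient to its standard error. The two input numbers are illustrative only; in the product they come from the fitted model and from the information matrix described above.

```python
# Sketch of the T (Wald) statistic described above: b-coefficient / standard error.
# The coefficient and standard error are illustrative values, not product output.

def t_statistic(b, std_err):
    return b / std_err

b, se = 0.4657, 0.2365
print(t_statistic(b, se))   # about 1.97
```

In the tutorial output table shown later, the Wald Statistic column appears to be the square of this ratio, while the T Statistic column is the ratio itself.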
Then the response function will vary by bj. This can be written as g(x0...xj + 1...xn) - g(x0...xj...xn) = bj. But it could also be written as: ln odds j + 1 ln odds j + 1 – ln odds j = ------------------------------- = b j odds j Therefore odds j + 1 -------------------- = exp b j odds j the formula for the odds ratio of the coefficient bj . By taking the exponent of a b-coefficient, one gets the odds ratio that is the factor by which the odds change due to a unit increase in xj. Because this odds ratio is the value that has more meaning, confidence intervals are calculated on odds ratios for each of the coefficients rather than on the coefficients themselves. The confidence interval is computed based on a 95% confidence level and a twotailed normal distribution. Teradata Warehouse Miner User Guide - Volume 3 117 Chapter 1: Analytic Algorithms Logistic Regression Logistic Regression Goodness of fit In linear regression one of the key measures associated with goodness of fit is the residual sums of squares RSS. An analogous measure for logistic regression is a statistic sometimes called the deviance. Its value is based on the ratio of the likelihood of a given model to the likelihood of a perfectly fitted or saturated model and is given by D = -2ln(ModelLH / SatModelLH). This can be rewritten D=-2LM + 2LS in terms of the model log likelihood and the saturated model log likelihood. Looking at the data as a set of n independent Bernoulli observations, LS is actually 0, so that D = -2LM. Two models can be contrasted by taking the difference between their deviance values, which leads to a statistic G = D1 - D2 = -2(L1 - L2). This is similar to the numerator in the partial F test in linear regression, the extra sums of squares or ESS mentioned in the section on linear regression. In order to get an assessment of the utility of the independent model terms taken as a whole, the deviance difference statistic is calculated for the model with a constant term only versus the model with all variables fitted. This statistic is then G = -2(L0 - LM). LM is calculated using the log likelihood formula given earlier. L0, the log likelihood of the constant only model with n observations is given by: L 0 = y ln y + n – y ln n – y – n ln n G follows a chi-square distribution with “variables minus one” degrees of freedom, and as such provides a probability value to test whether all the x-term coefficients should in fact be zero. Finally, there are a number of pseudo R-squared values that have been suggested in the literature. These are not truly speaking goodness of fit measures, but can nevertheless be useful in assessing the model. Teradata Warehouse Miner provides one such measure suggested by McFadden as (L0 - LM) / L0. [Agresti] Logistic Regression Data Quality Reports The same data quality reports optionally available for linear regression are also available when performing logistic regression. Since an SSCP matrix is not used in the logistic regression algorithm, additional internal processing is needed to produce data quality reports, especially for the Near Dependency report and the Detailed Collinearity Diagnostics report. Stepwise Logistic Regression Automated stepwise regression procedures are available for logistic regression to aid in model selection just as they are for linear regression. The procedures are in fact very similar to those described for linear regression. As such an attempt will be made to highlight the similarities and differences in the descriptions below. 
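Before turning to the stepwise procedures, the odds-ratio and confidence-interval calculations described earlier in this section can be summarized in a short sketch. The 1.96 multiplier reflects the two-tailed 95% normal interval mentioned above; the coefficient and standard error are illustrative values only, and the exact interval construction used by the product is assumed rather than documented here.

```python
# Sketch of the odds ratio and its 95% confidence interval, assuming the
# common construction exp(b +/- 1.96 * SE). Inputs are illustrative only.
import math

def odds_ratio_with_ci(b, std_err, z=1.96):
    return math.exp(b), math.exp(b - z * std_err), math.exp(b + z * std_err)

ratio, lower, upper = odds_ratio_with_ci(0.4657, 0.2365)
print(ratio, lower, upper)   # roughly 1.59, 1.00, 2.53
```

An odds ratio near 1, with a confidence interval that straddles 1, suggests that a unit increase in the corresponding variable changes the odds of a response very little.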
As is the case with stepwise linear regression, the automated stepwise procedures described below can provide insight into the variables that should be included in a logistic regression model. An element of human decision-making however is recommended in order to produce a model with useful business application. 118 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Logistic Regression Forward-Only Stepwise Logistic Regression The forward only procedure consists solely of forward steps as described below, starting without any independent x variables in the model. Forward steps are continued until no variables can be added to the model. Forward Stepwise Logistic Regression The forward stepwise procedure is a combination of the forward and backward steps always done in pairs, as described below, starting without any independent x variables in the model. One forward step is always followed by one backward step, and these single forward and backward steps are alternated until no variables can be added or removed. Additional checks are made after each step to see if the same variables exist in the model as existed after a previous step in the same direction. When this condition is detected in both the forward and backward directions the algorithm will also terminate. Backward-Only Stepwise Logistic Regression The backward only procedure consists solely of backward steps as described below, starting with all of the independent x variables in the model. Backward steps are continued until no variables can be removed from the model. Backward Stepwise Logistic Regression The backward stepwise procedure is a combination of the backward and forward steps always done in pairs, as described below, starting with all of the independent x variables in the model. One backward step is followed by one forward step, and these single backward and forward steps are alternated until no variables can be added or removed. Additional checks are made after each step to see if the same variables exist in the model as existed after a previous step in the same direction. When this condition is detected in both the backward and forward directions the algorithm will also terminate. Stepwise Logistic Regression - Forward step In stepwise linear regression the partial F statistic, or the analogous T-statistic probability value, is computed separately for each variable outside the model, adding each of them into the model one at a time. The analogous procedure for logistic regression would consist of computing the likelihood ratio statistic G, described in the Goodness of Fit section, for each variable outside the model, selecting the variable that results in the largest G value when added to the model. In the case of logistic regression however this becomes an expensive proposition because the solution of the model for each variable requires another iterative maximum likelihood solution, contrasted to the more rapidly achieved closed form solution available in linear regression. What is needed is a statistic that can be calculated without requiring an additional maximum likelihood solution. Teradata Warehouse Miner uses such a statistic proposed by Peduzzi, Hardy and Holford that they call a W statistic. This statistic is comparatively inexpensive to compute for each variable outside the model and is therefore expedient to use as a criterion for selecting a variable to add to the model. 
The W statistic is assumed to follow a chi square distribution with one degree of freedom due to its similarity to other statistics, and it gives evidence of behaving similarly to the likelihood ratio statistic. Therefore, the variable with the smallest chi square probability or P-value associated with its W statistic is added to the Teradata Warehouse Miner User Guide - Volume 3 119 Chapter 1: Analytic Algorithms Logistic Regression model in a forward step if the P-value is less than the criterion to enter. If more than one variable has a P-value of 0, then the variable with the largest W statistic is entered. For more information, refer to [Peduzzi, Hardy and Holford]. Stepwise Logistic Regression - Backward step Each backward step seeks to remove those variables that have statistical significance below a certain level. This is done by first fitting the model with the currently selected variables, including the calculation of the probability or P-value associated with the T-statistic for each variable, which is the ratio of the b-coefficient to its standard error. The variable with the largest P-value is removed if it is greater than the criterion to remove. Logistic Regression and Missing Data Null values for columns in a logistic regression analysis can adversely affect results, so Teradata Warehouse Miner ensures that listwise deletion is effectively performed with logistic regression. This ensures that any row for which one of the independent or dependent variable columns is null will be left out of computations completely. Additionally, the Recode transformation function can be used to build a new column, substituting a fixed known value for null. Initiate a Logistic Regression Function Use the following procedure to initiate a new Logistic Regression analysis in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 63: Add New Analysis from toolbar 2 120 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Logistic Regression: Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Logistic Regression Figure 64: Add New Analysis dialog 3 This will bring up the Logistic Regression dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Logistic Regression - INPUT - Data Selection On the Logistic Regression dialog click on INPUT and then click on data selection: Figure 65: Logistic Regression > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). 
For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). Teradata Warehouse Miner User Guide - Volume 3 121 Chapter 1: Analytic Algorithms Logistic Regression 2 Select Columns From a Single Table • Available Databases (or Analyses) — All the databases (or analyses) that are available for the Logistic Regression analysis. • Available Tables — All the tables that are available for the Logistic Regression analysis. • Available Columns — Within the selected table or matrix, all columns which are available for the Logistic Regression analysis. • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can either insert columns as Dependent or Independent columns. Make sure you have the correct portion of the window highlighted. The Dependent variable column is the column whose value is being predicted by the logistic regression model. The algorithm requires that the Independent columns must be of numeric type (or contain numbers in character format). The Dependent column may be of any type. Logistic Regression - INPUT - Analysis Parameters On the Logistic Regression dialog click on INPUT and then click on analysis parameters: Figure 66: Logistic Regression > Input > Analysis Parameters On this screen select: • Regression Options • Convergence Criterion — The algorithm continues to repeatedly estimate the model coefficient values until either the difference in the log likelihood function from one iteration to the next is less than or equal to the convergence criterion or the maximum iterations is reached. Default value is 0.001. • Maximum iterations — The algorithm stops iterating if the maximum iterations is reached. The default value is 100. • Response Value — The value of the dependent variable that will represent the response value. All other dependent variable values will be considered a non-response value. • Include Constant Term (checkbox) — This option specifies that the logistic regression model should include a constant term. With a constant, the logistic equation can be thought of as: 122 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Logistic Regression b +b x ++b x n n e 0 1 x x = -------------------------------------------------b + b x + + bn xn 1+e 0 1 x x g x = ln -------------------- = b 0 + b 1 x 1 + b n x n 1 – x Without a constant, the equation changes to: b x ++b x n n e 1 x x = ----------------------------------------b x + + bn xn 1+e 1 x x g x = ln -------------------- = b 1 x 1 + b n x n 1 – x The default value is to include the constant term. • Stepwise Options — If selected, the algorithm is performed repeatedly with various combinations of independent variable columns to attempt to arrive at a final “best” model. The default is to not use Stepwise Regression. • Step Direction — (Selecting “None” turns off the Stepwise option). • Forward — Option for independent variables being added one at a time to an empty model, possibly removing a variable after a variable is added. • Forward Only — Option for qualifying independent variables being added one at a time. • Backward — Option for removing variables from an initial model containing all of the independent variables, possibly adding a variable after a variable is removed. 
• Backward Only — Option for independent variables being removed one at a time. • Criterion to Enter — An independent variable is only added to the model if its W statistic chi-square P-value is less than the specified criterion to enter. The default value is 0.05. • Criterion to Remove — An independent variable is only removed if its T-statistic Pvalue is greater than the specified criterion to remove. The default value is 0.05 for each. • Report Options • Prediction Success Table — Creates a prediction success table using sums of probabilities rather than estimates based on a threshold value. The default is to generate the prediction success table. • Multi-Threshold Success Table — This table provides values similar to those in the prediction success table, but based on a range of threshold values, thus allowing the user to compare success scenarios using different threshold values. The default is to generate the multi-threshold Success table. • Threshold Begin • Threshold End • Threshold Increment — Specifies the threshold values to be used in the multi- Teradata Warehouse Miner User Guide - Volume 3 123 Chapter 1: Analytic Algorithms Logistic Regression threshold success table. If the computed probability is greater than or equal to a threshold value, that observation is assigned a 1 rather than a 0. Default values are 0, 1 and .05 respectively. • Cumulative Lift Table — Produce a cumulative lift table for deciles based on probability values. The default is to generate the Cumulative Lift table. • (Data Quality Reports) — These are the same data quality reports provided for Linear Regression and Factor Analysis. However, in the case of Logistic Regression, the “Sums of squares and Cross Products” or SSCP matrix is not readily available since it is not input to the algorithm, so it is derived dynamically by the algorithm. If there are a large number of independent variables in the model it may be more efficient to use the Build Matrix function to build and save the matrix and the Linear Regression function to produce the Data Quality Reports listed below. • Variable Statistics — This report gives the mean value and standard deviation of each variable in the model based on the derived SSCP matrix. • Near Dependency — This report lists collinear variables or near dependencies in the data based on the derived SSCP matrix. • • Condition Index Threshold — Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The one that involves this parameter is the occurrence of a large condition index value associated with a specially constructed principal factor. If a factor has a condition index greater than this parameter’s value, it is a candidate for the Near Dependency report. A default value of 30 is used as a rule of thumb. • Variance Proportion Threshold — Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The one that involves this parameter is when two or more variables have a variance proportion greater than this threshold value for a factor with a high condition index. Another way of saying this is that a ‘suspect’ factor accounts for a high proportion of the variance of two or more variables. This parameter defines what a high proportion of variance is. A default value of 0.5 is used as a rule of thumb. 
Detailed Collinearity Diagnostics — This report provides the details behind the Near Dependency report, consisting of the “Eigenvalues of Unit Scaled X’X”, “Condition Indices” and “Variance Proportions” tables. Logistic Regression - INPUT - Expert Options On the Logistic Regression dialog click on INPUT and then click on expert options: Figure 67: Logistic Regression > Input > Expert Options On this screen select: 124 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Logistic Regression • Optimization Method • Automatic — The program selects Reweighted Least Squares (RLS) unless there are 35 or more independent variable columns, in which case Quasi-Newton BFGS is selected instead. This is the default option. • Quasi-Newton (BFGS) — The user may explicitly request this optimization technique attributed to Broyden-Fletcher-Goldfarb-Shanno. Quasi-Newton methods do not require a Hessian matrix of second partial derivatives of the objective function to be calculated explicitly, saving time in some situations. • Reweighted Least Squares (RLS) — The user may explicitly request this optimization technique equivalent to the Gauss-Newton method. It involves computing a matrix very similar to a Hessian matrix but is typically the fastest technique for logistic regression. • Performance • Maximum amount of data for in-memory processing — Enter a number of megabytes. • Use multiple threads when applicable — This flag indicates that multiple SQL statements may be executed simultaneously, up to 5 simultaneous executions as needed. It only applies when not processing in memory, and only to certain processing performed in SQL. Where and when multi-threading is used is dependent on the number of columns and the Optimization Method selected (but both RLS and BFGS can potentially make some use of multi-threading). Logistic Regression - OUTPUT On the Logistic Regression dialog click on OUTPUT: Figure 68: Logistic Regression > OUTPUT On this screen select: • Store the variables table of this analysis in the database — Check this box to store the model variables table of this analysis in in the database. • Database Name — The name of the database to create the output table in. • Output Table Name — The name of the output table. • Advertise Output — The Advertise Output option "advertises" output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases Teradata Warehouse Miner User Guide - Volume 3 125 Chapter 1: Analytic Algorithms Logistic Regression tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. 
By way of an example, the tutorial example produces the following output table: Table 40: Logistic Regression - OUTPUT Standardi zed Coefficie nt Column Name B Coefficie nt years_ with_ bank 0.044251 4.914916 0.026929 0.906555 0.831242 0.988692 0.098102 11 2.216961 39 5 6 0.053055 0.144717 98 14 1 avg_sv_ tran_cnt 0.213310 31.22951 3.22526 1.192052 7 5.588337 E-08 avg_sv_ tran_amt 0.030762 0.003824 64.70387 8.043871 3.552714 1.03124 34 32 E-15 ckacct 0.465670 0.236528 3.876044 1.968767 0.049353 1.593081 1.002084 2.53263 2 8 13 avg_ck_ tran_cnt 0.009613 5.608763 0.018127 0.977489 0.959244 0.996082 0.022767 534 2.368283 26 7 1 3 0.059032 0.179196 57 84 4 married 0.233367 7.115234 -2.66744 0.622493 6 9 Standard Error Wald Statistic T Statistic P-Value Odds Ratio Lower Upper Partial R 0.303597 0.199861 0.461178 6 2 0.168006 0.914416 5 2 1.02354 1.038999 0.246071 2.061767 7 0.042563 0.127321 37 6 0.007810 0.536604 0.339634 0.847807 519 5 2 3 0.070282 0.171455 51 6 (Constan 0.273292 18.84624 1.614427 t) 1.186426 9 4.341225 E-05 avg_sv_ bal 0.003125 0.000559 31.16868 5.582892 3.323695 1.00313 305 8004 E-08 1.00203 1.004231 0.167831 2.625869 3 If Database Name is twm_results and Output Table Name is test, the output table is defined as: CREATE SET TABLE twm_results.test ( "Column Name" VARCHAR(30) CHARACTER SET UNICODE NOT CASESPECIFIC, "B Coefficient" FLOAT, "Standard Error" FLOAT, "Wald Statistic" FLOAT, "T Statistic" FLOAT, "P-Value" FLOAT, "Odds Ratio" FLOAT, "Lower" FLOAT, "Upper" FLOAT, "Partial R" FLOAT, "Standardized Coefficient" FLOAT) UNIQUE PRIMARY INDEX ( "Column Name" ); 126 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Logistic Regression Run the Logistic Regression After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Logistic Regression The results of running the Teradata Warehouse Miner Linear Regression analysis include a variety of statistical reports on the individual variables and generated model as well as barcharts displaying coefficients and T-statistics. All of these results are outlined below. The title of this report is preceded by the name of the technique that was used to build the model either Reweighted Least Squares Logistic Regression or Quasi-Newton (BFGS) Logistic Regression. On the Logistic Regression dialog, click on RESULTS (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed) to view results. Result options are as follows: • Data Quality Reports • Variable Statistics — If selected on the Results Options tab, this report gives the mean value and standard deviation of each variable in the model based on the SSCP matrix provided as input. • Near Dependency — If selected on the Results Options tab, this report lists collinear variables or near dependencies in the data based on the SSCP matrix provided as input. Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The first is the occurrence of a large condition index value associated with a specially constructed principal factor. If a factor has a condition index greater than the parameter specified on the Results Option tab, it is a candidate for the Near Dependency report. 
The other is when two or more variables have a variance proportion greater than a threshold value for a factor with a high condition index. Another way of saying this is that a ‘suspect’ factor accounts for a high proportion of the variance of two or more variables. The parameter that defines what constitutes a high proportion of variance is also set on the Results Options tab; its default value is 0.5.
• Detailed Collinearity Diagnostics — If selected on the Results Options tab, this report provides the details behind the Near Dependency report, consisting of the following tables.
• Eigenvalues of Unit Scaled X'X — Report of the eigenvalues of all variables scaled so that each variable adds up to 1 when summed over all the observations or rows. In order to calculate the singular values of X (the rows of X are the observations), the mathematically equivalent square roots of the eigenvalues of X'X are computed instead for practical reasons.
• Condition Indices — The condition index of each eigenvalue, calculated as the square root of the ratio of the largest eigenvalue to the given eigenvalue, a value always 1 or greater.
• Variance Proportions — The variance decomposition of these eigenvalues is computed using the eigenvalues together with the eigenvectors associated with them. The result is a matrix giving, for each variable, the proportion of variance associated with each eigenvalue.
• Logistic Regression Step N (Stepwise-only)
• In Report — This report is the same as the Variables in Model report, but it is provided for each step during stepwise logistic regression based on the variables currently in the model at each step.
• Out Report
• Column Name — The independent variable excluded from the model.
• W Statistic — The W Statistic is a specialized statistic designed to determine the best variable to add to a model without calculating a maximum likelihood solution for each variable outside the model. The W statistic is assumed to follow a chi-square distribution with one degree of freedom due to its similarity to other statistics, and it gives evidence of behaving similarly to the likelihood ratio statistic. For more information, refer to [Peduzzi, Hardy and Holford].
• Chi Sqr P-value — The W statistic is assumed to follow a chi-square distribution on one degree of freedom due to its similarity to other statistics, and it gives evidence of behaving similarly to the likelihood ratio statistic. Therefore, the variable with the smallest chi-square probability or P-value associated with its W statistic is added to the model in a forward step if the P-value is less than the criterion to enter.
• Logistic Regression Model
• Total Observations — This is the number of rows in the table that the logistic regression analysis is based on. The number of observations reflects the row count after any rows were eliminated by listwise deletion (due to one of the variables being null).
• Total Iterations — The number of iterations used by the non-linear optimization algorithm in maximizing the log likelihood function.
• Initial Log Likelihood — The initial log likelihood is the log likelihood of the constant-only model and is given only when the constant is included in the model. The formula for the initial log likelihood is given by:
L0 = y ln(y) + (n - y) ln(n - y) - n ln(n)
where n is the number of observations and y is the number of observations whose dependent variable equals the response value.
• Final Log Likelihood — This is the value of the log likelihood function after the last iteration.
• Likelihood Ratio Test G Statistic — Deviance, given by D = -2LM, where LM is the log likelihood of the logistic regression model, is a measure analogous to the residual sum of squares RSS in a linear regression model. In order to assess the utility of the independent terms taken as a whole in the logistic regression model, the deviance difference statistic G is calculated for the model with a constant term only versus the model with all variables fitted. This statistic is then G = -2(L0 - LM), where L0 is the log likelihood of a model containing only a constant. The G statistic, like the deviance D, is an example of a likelihood ratio test statistic.
• Chi-Square Degrees of Freedom — The G Statistic follows a chi-square distribution with “variables minus one” degrees of freedom (counting the constant term as a variable, that is, the number of independent variables in the model). This field then is the degrees of freedom for the G Statistic’s chi-square test.
• Chi-Square Value — This is the chi-square random variable value for the Likelihood Ratio Test G Statistic. This can be used to test whether all the independent variable coefficients should be 0. Examining the field Chi-square Probability is however the easiest way to assess this test.
• Chi-Square Probability — This is the chi-square probability value for the Likelihood Ratio Test G Statistic. It can be used to test whether all the independent variable coefficients should be 0. That is, the probability that a chi-square distributed variable would have the value G or greater is the probability associated with having all 0 coefficients. The null hypothesis that all the terms should be 0 can be rejected if this probability is sufficiently small, say less than 0.05.
• McFadden's Pseudo R-Squared — To mimic the Squared Multiple Correlation Coefficient (R2) in a linear regression model, the researcher McFadden suggested this measure given by (L0 - LM) / L0, where L0 is the log likelihood of a model containing only a constant and LM is the log likelihood of the logistic regression model. Although it is not, strictly speaking, a goodness of fit measure, it can be useful in assessing a logistic regression model. (Experience shows that the value of this statistic tends to be less than the R2 value it mimics. In fact, values between 0.20 and 0.40 are quite satisfactory.)
• Dependent Variable Name — Column chosen as the dependent variable.
• Dependent Variable Response Values — The response value chosen for the dependent variable on the Regression Options tab.
• Dependent Variable Distinct Values — The number of distinct values that the dependent variable takes on.
• Logistic Regression Variables in Model report
• Column Name — This is the name of the independent variable in the model or CONSTANT for the constant term.
• B Coefficient — The b-coefficient is the coefficient in the logistic regression model for this variable. The following equations describe the logistic regression model, with π(x) being the probability that the dependent variable is 1, and g(x) being the logit transformation:
π(x) = exp(b0 + b1x1 + ... + bnxn) / (1 + exp(b0 + b1x1 + ... + bnxn))
g(x) = ln(π(x) / (1 - π(x))) = b0 + b1x1 + ... + bnxn
• Standard Error — The standard error of a b-coefficient in the logistic regression model is a measure of its expected accuracy.
It is analogous to the standard error of a coefficient in a linear regression model.
• Wald Statistic — The Wald statistic is calculated as the square of the T-statistic (T Stat) described below. The T-statistic is calculated for each b-coefficient as the ratio of the b-coefficient value to its standard error.
• T Statistic — In a manner analogous to linear regression, the T-statistic is calculated for each b-coefficient as the ratio of the b-coefficient value to its standard error. Along with its associated t-distribution probability value, it can be used to assess the statistical significance of this term in the model.
• P-value — This is the t-distribution probability value associated with the T-statistic (T Stat), that is, the ratio of the b-coefficient value (B Coef) to its standard error (Std Error). It can be used to assess the statistical significance of this term in the logistic regression model. A value close to 0 implies statistical significance and means this term in the model is important. (The null hypothesis is that the coefficient equals zero; the P-value is the probability of observing a coefficient estimate this far from zero by chance alone if that null hypothesis were true. The smaller the P-value, the stronger the evidence for rejecting the null hypothesis that the coefficient is actually equal to zero; in other words, the stronger the evidence that the coefficient is different from zero.)
• Odds Ratio — The odds ratio for an independent variable in the model is calculated by exponentiating the b-coefficient. The odds ratio is the factor by which the odds of the dependent variable being 1 change due to a unit increase in this independent variable.
• Lower — Because of the intuitive meaning of the odds ratio, confidence intervals for coefficients in the model are calculated on odds ratios rather than on the coefficients themselves. The confidence interval is computed based on a 95% confidence level and a two-tailed normal distribution. “Lower” is the lower range of this confidence interval.
• Upper — Because of the intuitive meaning of the odds ratio, confidence intervals for coefficients in the model are calculated on odds ratios rather than on the coefficients themselves. The confidence interval is computed based on a 95% confidence level and a two-tailed normal distribution. “Upper” is the upper range of this confidence interval.
• Partial R — The Partial R statistic is calculated for each b-coefficient value as:
Partial R = sign(bi) √((wi - 2) / (-2 L0))
where bi is the b-coefficient and wi is the Wald Statistic of the ith independent variable, while L0 is the initial log likelihood of the model. (Note that if wi <= 2 then Partial R is set to 0.) This statistic provides a measure of the relative importance of each variable in the model. It is calculated only when the constant term is included in the model. [SPSS]
• Standardized Coefficient — The estimated standardized coefficient is calculated for each b-coefficient value as:
bi σi / (π / √3)
where bi is the b-coefficient, σi is the standard deviation of the ith independent variable, and π / √3 is the standard deviation of the standard logistic distribution.
This calculation only provides an estimate of the standardized coefficients since it uses a constant value for the logistic distribution without regard to the actual distribution of the dependent variable in the model. [Menard] • Prediction Success Table — The prediction success table is computed using only probabilities and not estimates based on a threshold value. Using an input table that contains known values for the dependent variable, the sum of the probability values x and 1 – x , which correspond to the probability that the predicted value is 1 or 0 respectively, are calculated separately for rows with actual value of 1 and 0. Refer to the Model Evaluation section for more information. • Estimate Response — The entries in the “Estimate Response” column are the sums of the probabilities x that the outcome is 1, summed separately over the observations where the actual outcome is 1 and 0 and then totaled. (Note that this is independent of the threshold value that is used in scoring to determine which probabilities correspond to an estimate of 1 and 0 respectively). • Estimate Non-Response — The entries in the “Estimate Non-Response” column are the sums of the probabilities 1 – x that the outcome is 0, summed separately over the observations where the actual outcome is 1 and 0 and then totaled. (Note that this is independent of the threshold value that is used in scoring to determine which probabilities correspond to an estimate of 1 and 0 respectively). • Actual Total — The entries in this column are the sums of the entries in the Estimate Response and Estimate Non-Response columns, across the rows in the Prediction Success Table. But in fact this turns out to be the number of actual 0’s and 1’s and total observations in the training data. • Actual Response — The entries in the “Actual Response” row correspond to the observations in the data where the actual value of the dependent variable is 1. • Actual Non-Response — The entries in the “Actual Non-Response” row correspond to the observations in the data where the actual value of the dependent variable is 0. • Estimated Total — The entries in this row are the sums of the entries in the Actual Response and Actual Non-Response rows, down the columns in the Prediction Success Table. This turns out to be the sum of the probabilities of estimated 0’s and 1’s and total observations in the model. • Multi-Threshold Success Table — This table provides values similar to those in the prediction success table, but instead of summing probabilities, the estimated values based Teradata Warehouse Miner User Guide - Volume 3 131 Chapter 1: Analytic Algorithms Logistic Regression on a threshold value are summed instead. Rather than just one threshold however, several thresholds ranging from a user specified low to high value are displayed in user specified increments. This allows the user to compare several success scenarios using different threshold values, to aid in the choice of an ideal threshold. Refer to the Model Evaluation section for more information. • Threshold Probability — This column gives various incremental values of the probability at or above which an observation is to have an estimated value of 1 for the dependent variable. For example, at a threshold of 0.5, a response value of 1 is estimated if the probability predicted by the logistic regression model is greater than or equal to 0.5. The user may request the starting, ending and increment values for these thresholds. 
• Actual Response, Estimate Response — This column corresponds to the number of observations for which the model estimated a value of 1 for the dependent variable and the actual value of the dependent variable is 1. • Actual Response, Estimate Non-Response — This column corresponds to the number of observations for which the model estimated a value of 0 for the dependent variable but the actual value of the dependent variable is 1, a “false negative” error case for the model. • Actual Non-Response, Estimate Response — This column corresponds to the number of observations for which the model estimated a value of 1 for the dependent variable but the actual value of the dependent variable is 0, a “false positive” error case for the model. • Actual Non-Response, Estimate Non-Response — This column corresponds to the number of observations for which the model estimated a value of 0 for the dependent variable and the actual value of the dependent variable is 0. • Cumulative Lift Table — The Cumulative Lift Table demonstrates how effective the model is in estimating the dependent variable. It is produced using deciles based on the probability values. Note that the deciles are labeled such that 1 is the highest decile and 10 is the lowest, based on the probability values calculated by logistic regression. The information in this report however is best viewed in the Lift Chart produced as a graph under a logistic regression analysis. 132 • Decile — The deciles in the report are based on the probability values predicted by the model. Note that 1 is the highest decile and 10 is the lowest. That is, decile 1 contains data on the 10% of the observations with the highest estimated probabilities that the dependent variable is 1. • Count — This column contains the count of observations in the decile. • Response — This column contains the count of observations in the decile where the actual value of the dependent variable is 1. • Response (%) — This column contains the percentage of observations in the decile where the actual value of the dependent variable is 1. • Captured Response (%) — This column contains the percentage of responses in the decile over all the responses in any decile. Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Logistic Regression • Lift — The lift value is the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations. For example, if 10% of the observations overall have a dependent variable with value 1, and 20% of the observations in decile 1 have a dependent variable with value 1, then the lift value within decile 1 is 2.0, meaning that the model gives a “lift” that is better than chance alone by a factor of two in predicting response values of 1 within this decile. • Cumulative Response — This is a cumulative measure of Response, from decile 1 to this decile. • Cumulative Response (%) — This is a cumulative measure of Pct Response, from decile 1 to this decile. • Cumulative Captured Response (%) — This is a cumulative measure of Pct Captured Response, from decile 1 to this decile. • Cumulative Lift — This is a cumulative measure of Lift, from decile 1 to this decile. Logistic Regression Graphs The Logistic Regression Analysis can display bar chars for the T-statistics, Wald Statistics, Log Odds Ratios, Partial R and Estimated Standard Coefficients of the resultant model. In addition, a Lift Chart in deciles is generated. 
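Before looking at the individual graphs, here is a brief, hypothetical sketch of how the Cumulative Lift Table fields described above relate to one another. It is not the product's implementation; the function name and field names are illustrative. It simply ranks scored probabilities, splits the observations into deciles (decile 1 holding the highest probabilities) and accumulates counts, responses and lift.

import numpy as np

def cumulative_lift_table(probabilities, actuals, n_bins=10):
    """Illustrative lift-table computation from scored probabilities and
    known 0/1 outcomes."""
    probabilities = np.asarray(probabilities, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    order = np.argsort(-probabilities)            # highest probability first
    actuals = actuals[order]
    bins = np.array_split(np.arange(len(actuals)), n_bins)
    overall_rate = actuals.mean()                 # expected response rate
    total_responses = actuals.sum()
    rows, cum_count, cum_resp = [], 0, 0
    for decile, idx in enumerate(bins, start=1):
        count = len(idx)
        resp = actuals[idx].sum()
        cum_count += count
        cum_resp += resp
        rows.append({
            "decile": decile,
            "count": count,
            "response": int(resp),
            "response_pct": 100.0 * resp / count,
            "captured_pct": 100.0 * resp / total_responses,
            "lift": (resp / count) / overall_rate,
            "cum_response_pct": 100.0 * cum_resp / cum_count,
            "cum_captured_pct": 100.0 * cum_resp / total_responses,
            "cum_lift": (cum_resp / cum_count) / overall_rate,
        })
    return rows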
Logistic Weights Graph This graph displays the relative magnitudes of the T-statistics, Wald Statistics, Log Odds Ratios, Partial R and Estimated Standard Coefficients associated with each variable in the logistic regression model. The sign, positive or negative, is portrayed by the colors red or blue respectively. The user may scroll to the left or right to see all the variables associated statistics in the model. The following options are available on the Graphics Options tab on the Logistic Weights graph: • Graph Type — The following can be graphed by the Linear Weights Graph • Vertical Axis — The user may request multiple vertical axes in order to display separate coefficient values that are orders of magnitude different from the rest of the values. If the coefficients are of roughly the same magnitude, this option is grayed out. • Single — Display the selected statistics on single axis on the bar chart. • Multiple — Display the selected statistics on dual axes on the bar chart. Lift Chart This graph displays the statistics in the Cumulative Lift Table, with the following options: • Non-Cumulative • % Response — This column contains the percentage of observations in the decile where the actual value of the dependent variable is 1. • % Captured Response — This column contains the percentage of responses in the decile over all the responses in any decile. Teradata Warehouse Miner User Guide - Volume 3 133 Chapter 1: Analytic Algorithms Logistic Regression • Lift — The lift value is the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations. For example, if 10% of the observations overall have a dependent variable with value 1, and 20% of the observations in decile 1 have a dependent variable with value 1, then the lift value within decile 1 is 2.0, meaning that the model gives a “lift” that is better than chance alone by a factor of two in predicting response values of 1 within this decile. • Cumulative • % Response — This is a cumulative measure of the percentage of observations in the decile where the actual value of the dependent variable is 1, from decile 1 to this decile. • % Captured Response — This is a cumulative measure of the percentage of responses in the decile over all the responses in any decile, from decile 1 to this decile. • Cumulative Lift — This is a cumulative measure of the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations, from decile 1 to this decile. Tutorial - Logistic Regression The following is an example of using the stepwise feature of Logistic Regression analysis. The stepwise feature adds extra processing steps to the analysis; that is, normal Logistic Regression processing is a subset of the output shown below. In this example, ccacct (has credit card, 0 or 1) is being predicted in terms of 16 independent variables, from income to avg_sv_tran_cnt. The forward stepwise process determines that only 7 out of the original 16 input variables should be used in the model. These include avg_sv_tran_amt (average amount of savings transactions), avg_sv_tran_cnt (average number of savings transactions per month), avg_sv_bal (average savings account balance), married, years_with_bank, avg_ck_ tran_cnt (average number of checking transactions per month), and ckacct (has checking account, 0 or 1). 
Step 0 shows that all of the original 16 independent variables are excluded from the model, the starting point for forward stepwise regression. In Step 1, the Model Assessment report shows that the variable avg_sv_tran_amt added to the model, along with the constant term, with all other variables still excluded from the model. For the sake of brevity, Steps 2 through 6 are not shown. Then in Step 7, the variable ckacct is the last variable added to the model. At this point the stepwise algorithm stops because there are no more variables qualifying to be added or removed from the model, and the Reweighted Least Squares Logistic Regression and Variables in Model reports are given, just as they would be if these variables were analyzed without stepwise requested. Finally the Prediction Success Table, Multi-Threshold Success Table, and Cumulative Lift Table are given, as requested, to complete the analysis. Parameterize a Logistic Regression Analysis as follows: • Available Table — twm_customer_analysis • Dependent Variable — cc_acct 134 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Logistic Regression • Independent Variables • income — age • years_with_bank — nbr_children • female — single • married — separated • ckacct — svacct • avg_ck_bal — avg_sv_bal • avg_ck_tran_amt — avg_ck_tran_cnt • avg_sv_tran_amt — avg_sv_tran_cnt • Convergence Criterion — 0.001 • Maximum Iterations — 100 • Response Value — 1 • Include Constant — Enabled • Prediction Success Table — Enabled • Multi-Threshold Success Table — Enabled • Threshold Begin — 0 • Threshold End — 1 • Threshold Increment — 0.05 • Cumulative Lift Table — Enabled • Use Stepwise Regression — Enabled • Criterion to Enter — 0.05 • Criterion to Remove — 0.05 • Direction — Forward • Optimization Type — Automatic Run the analysis, and click on Results when it completes. For this example, the Logistic Regression Analysis generated the following pages. A single click on each page name populates Results with the item. Table 41: Logistic Regression Report Total Observations: 747 Total Iterations: 9 Initial Log Likelihood: -517.7749 Final Log Likelihood: -244.4929 Likelihood Ratio Test G Statistic: 546.5641 Chi-Square Degrees of Freedom: 7.0000 Chi-Square Value: 14.0671 Chi-Square Probability: 0.0000 Teradata Warehouse Miner User Guide - Volume 3 135 Chapter 1: Analytic Algorithms Logistic Regression Table 41: Logistic Regression Report McFadden's Pseudo R-Squared: 0.5278 Dependent Variable: ccacct Dependent Response Value: 1 Total Distinct Values: 2 Table 42: Execution Summary 136 6/20/2004 2:19:02 PM Stepwise Logistic Regression Running. 
6/20/2004 2:19:03 PM Step 0 Complete 6/20/2004 2:19:03 PM Step 1 Complete 6/20/2004 2:19:03 PM Step 2 Complete 6/20/2004 2:19:03 PM Step 3 Complete 6/20/2004 2:19:03 PM Step 4 Complete 6/20/2004 2:19:04 PM Step 5 Complete 6/20/2004 2:19:04 PM Step 6 Complete 6/20/2004 2:19:04 PM Step 7 Complete 6/20/2004 2:19:04 PM Log Likelihood: -517.78094387828 6/20/2004 2:19:04 PM Log Likelihood: -354.38456690558 6/20/2004 2:19:04 PM Log Likelihood: -287.159936852895 6/20/2004 2:19:04 PM Log Likelihood: -258.834546711159 6/20/2004 2:19:04 PM Log Likelihood: -247.445356552554 6/20/2004 2:19:04 PM Log Likelihood: -244.727173470081 6/20/2004 2:19:04 PM Log Likelihood: -244.49467692232 6/20/2004 2:19:04 PM Log Likelihood: -244.492882024522 6/20/2004 2:19:04 PM Log Likelihood: -244.492881920691 6/20/2004 2:19:04 PM Computing Multi-Threshold Success Table 6/20/2004 2:19:06 PM Computing Prediction Success Table 6/20/2004 2:19:06 PM Computing Cumulative Lift Table 6/20/2004 2:19:07 PM Creating Report Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Logistic Regression Table 43: Variables Column Name B Standard Coefficient Error Wald Statistic T Statistic P-Value Odds Ratio Lower Upper Partial R Standardized Coefficient (Constant) -1.1864 0.2733 18.8462 -4.3412 0.0000 N/A N/A N/A N/A N/A avg_sv_ tran_amt 0.0308 0.0038 64.7039 8.0439 0.0000 1.0312 1.0235 1.0390 0.2461 2.0618 avg_sv_ tran_cnt -1.1921 0.2133 31.2295 -5.5883 0.0000 0.3036 0.1999 0.4612 -0.1680 -0.9144 avg_sv_bal 0.0031 0.0006 31.1687 5.5829 0.0000 1.0031 1.0020 1.0042 0.1678 2.6259 married -0.6225 0.2334 7.1152 -2.6674 0.0078 0.5366 0.3396 0.8478 -0.0703 -0.1715 years_with_ -0.0981 bank 0.0443 4.9149 -2.2170 0.0269 0.9066 0.8312 0.9887 -0.0531 -0.1447 avg_ck_ tran_cnt -0.0228 0.0096 5.6088 -2.3683 0.0181 0.9775 0.9592 0.9961 -0.0590 -0.1792 ckacct 0.4657 0.2365 3.8760 1.9688 0.0494 1.5931 1.0021 2.5326 0.0426 0.1273 Step 0 Table 44: Columns Out Column Name W Statistic Chi-Square P-Value age 1.9521 0.1624 avg_ck_bal 0.5569 0.4555 avg_ck_tran_amt 1.6023 0.2056 avg_ck_tran_cnt 0.0844 0.7714 avg_sv_bal 85.5070 0.0000 avg_sv_tran_amt 233.7979 0.0000 avg_sv_tran_cnt 44.0510 0.0000 ckacct 21.8407 0.0000 female 3.2131 0.0730 income 1.9877 0.1586 married 19.6058 0.0000 nbr_children 5.1128 0.0238 separated 5.5631 0.0183 single 6.9958 0.0082 svacct 7.4642 0.0063 Teradata Warehouse Miner User Guide - Volume 3 137 Chapter 1: Analytic Algorithms Logistic Regression Table 44: Columns Out Column Name W Statistic Chi-Square P-Value years_with_bank 3.0069 0.0829 Step 1 Table 45: Variables Column Name B Standard Coefficient Error Wald Statistic T Statistic P-Value Odds Ratio avg_sv_ tran_amt 0.0201 193.2455 13.9013 0.0000 1.0203 1.0174 1.0232 0.4297 0.0014 Lower Upper Partial R Standardized Coefficient 1.3445 Table 46: Columns Out 138 Column Name W Statistic Chi-Square P-Value age 3.4554 0.0630 avg_ck_bal 0.4025 0.5258 avg_ck_tran_amt 0.3811 0.5370 avg_ck_tran_cnt 11.3612 0.0007 avg_sv_bal 46.6770 0.0000 avg_sv_tran_cnt 134.8091 0.0000 ckacct 7.8238 0.0052 female 2.4111 0.1205 income 5.2143 0.0224 married 7.7743 0.0053 nbr_children 2.6647 0.1026 separated 3.9342 0.0473 single 2.7417 0.0978 svacct 2.0405 0.1532 years_with_bank 13.2617 0.0003 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Logistic Regression Step 2-7 Table 47: Prediction Success Table Estimate Response Estimate Non-Response Actual Total Actual Response 304.5868 70.4132 375.0000 Actual Non-Response 70.4133 301.5867 372.0000 Actual Total 
375.0000 372.0000 747.0000 Table 48: Multi-Threshold Success Table Threshold Probability Actual Response, Estimate Response Actual Response, Estimate Non-Response Actual Non-Response, Estimate Response Actual Non-Response, Estimate Non-Response 0 375 0 372 0 .05 375 0 353 19 .1 374 1 251 121 .15 373 2 152 220 .2 369 6 90 282 .25 361 14 58 314 .3 351 24 37 335 .35 344 31 29 343 .4 329 46 29 343 .45 318 57 28 344 .5 313 62 24 348 .55 305 70 23 349 .6 291 84 23 349 .65 286 89 21 351 .7 276 99 20 352 .75 265 110 20 352 .8 253 122 20 352 .85 243 132 16 356 .9 229 146 13 359 .95 191 184 11 361 Teradata Warehouse Miner User Guide - Volume 3 139 Chapter 1: Analytic Algorithms Logistic Regression Table 49: Cumulative Lift Table Cumulative Response (%) Cumulative Captured Response (%) Cumulative Lift Decile Count Response Response (%) Captured Response (%) 1 74.0000 73.0000 98.6486 19.4667 1.9651 73.0000 98.6486 19.4667 1.9651 2 75.0000 69.0000 92.0000 18.4000 1.8326 142.0000 95.3020 37.8667 1.8984 3 75.0000 71.0000 94.6667 18.9333 1.8858 213.0000 95.0893 56.8000 1.8942 4 74.0000 65.0000 87.8378 17.3333 1.7497 278.0000 93.2886 74.1333 1.8583 5 75.0000 66.0000 88.0000 17.6000 1.7530 344.0000 92.2252 91.7333 1.8371 6 75.0000 24.0000 32.0000 6.4000 0.6374 368.0000 82.1429 98.1333 1.6363 7 74.0000 4.0000 5.4054 1.0667 0.1077 372.0000 71.2644 99.2000 1.4196 8 73.0000 2.0000 2.7397 0.5333 0.0546 374.0000 62.8571 99.7333 1.2521 9 69.0000 1.0000 1.4493 0.2667 0.0289 375.0000 56.4759 100.0000 1.1250 10 83.0000 0.0000 0.0000 0.0000 0.0000 375.0000 50.2008 100.0000 1.0000 Lift Cumulative Response Logistic Weights Graph By default, the Logistic Weights graph displays the relative magnitudes of the T-statistic associated with each coefficient in the logistic regression model: Figure 69: Logistic Regression Tutorial: Logistic Weights Graph 140 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks Select the Graphics Options tab and change the Graph Type to Wald Statistic, Log Odds Ratio, Partial R or Estimated Standardized Coefficient to view those statistical measures respectively Lift Chart By default, the Lift Chart displays the cumulative measure of the percentage of observations in the decile where the actual value of the dependent variable is 1, from decile 1 to this decile (Cumulative, %Response): Figure 70: Logistic Regression Tutorial: Lift Chart Neural Networks Overview Note: The material in this overview was contributed by StatSoft®, Inc. Over the past two decades there has been an explosion of interest in neural networks. It started with the successful application of this powerful technique across a wide range of problem domains, in areas as diverse as finance, medicine, engineering, geology and even physics. The sweeping success of neural networks over almost every other statistical technique can be attributed to its power, versatility and ease of use. Neural networks are very sophisticated modeling and prediction-making techniques capable of modeling extremely complex functions and data relationships. The ability to learn by examples is one of the many features of neural networks which enables the user to model data and establish accurate rules governing the underlying relationship Teradata Warehouse Miner User Guide - Volume 3 141 Chapter 1: Analytic Algorithms Neural Networks between various data attributes. The neural network user gathers representative data, and then invokes training algorithms which can automatically learn the structure of the data. 
Although the user does need to have some heuristic knowledge of how to select and prepare data, the appropriate neural network, and interpret the results, the level of user knowledge needed to successfully apply neural networks is much lower than that needed in most traditional statistical tools and techniques. The neural network algorithms can be hidden behind a welldesigned and intelligent computer program which takes the user from start to finish with just a few clicks. Using neural networks Neural networks have a remarkable ability to derive and extract meaning, rules and trends from complicated, noisy and imprecise data. They can be used to extract patterns and detect trends that are governed by complicated mathematical functions too difficult, if not impossible, to model using analytic or parametric techniques. One of the abilities of neural networks is to accurately predict data that was not part of the training dataset, a process known as generalization. Given these characteristics with their broad applicability, neural networks are suitable for applications of real world problems in research and science, business and industry. Below are some examples where neural networks have been successfully applied: • Signal processing • Process control • Robotics • Classification • Data preprocessing • Pattern recognition • Image and speech analysis • Medical diagnostics and monitoring • Stock market and forecasting • Loan or credit solicitations The biological inspiration Neural networks are also intuitively appealing, since their principles are based on crude and low-level models of biological neural information processing systems. These have led to the development of more intelligent computer systems that can be used in statistical and data analysis tasks. Neural networks emerged out of research in artificial intelligence, inspired by attempts to mimic the fault-tolerance and “capacity to learn” of biological neural systems by modeling the low-level structure of the brain (see Patterson, 1996). The brain is principally composed of over ten billion neurons, massively interconnected with thousands of interconnects per neuron. Each neuron is a specialized cell that can create, propagate and receive electrochemical signals. Like any biological cell, neurons have a body, a branching input structure called dendrites and a branching output structure known as axons. The axons of one cell connect to the dendrites of another via a synapse. When a neuron is activated, it fires an electrochemical signal along the axon. This signal crosses the synapses to 142 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks thousands of other neurons, which in turn may fire, thus propagating the signal over the entire neural system (e.g. the biological brain). A neuron fires only if the total signal received at the cell body from the dendrites exceeds a certain level known as threshold. Though a single neuron accomplishes no meaningful task on its own, when the efforts of a large number of them are combined together, the results become quite dramatic: they can create or achieve various and extremely complex cognitive tasks such as learning and even consciousness. Thus, from a very large number of extremely simple processing units the brain manages to perform extremely complex tasks. 
Of course, there is a great deal of complexity in the brain that has not been discussed here, but it is interesting that artificial neural networks can achieve remarkable results using a model not much more complex than this. The basic mathematical model The following figure shows a Schematic of a single neuron system. The inputs x send signals to the neuron, at which a weighted sum of the signals is obtained and further transformed using a mathematical function f. Figure 71: Single Neuron System (schematic) Here we consider the simplest form of artificial neural networks, with a single neuron with a number of inputs and one (for simplicity) output. Although a more realistic artificial network typically consists of many more neurons, this model sheds light on the basics of this technology. The neuron receives signals from many sources. This source usually comes from the data referred to as input variables x or inputs for short. The inputs are received from a connection that has a certain strength, known as weights. The strength of a weight is represented by a number. The larger the value of a weight w, the stronger is its incoming signal and the more influential the corresponding input is. Upon receiving the signals, a weighted sum of the inputs is formed to compose the activation function f (“activation”) of the neuron. The neuron activation is a mathematical function which converts the weighted sum of the signals to form the output of the neuron. Thus: output = f(w1x1 + ... + wdxd) The outputs of the neuron are actually predictions of the single neuron model for a variable in the data which is referred to as the target t. It is believed that there is a relationship between the inputs x and the targets t; it is the task of the neural network to model this relationship by Teradata Warehouse Miner User Guide - Volume 3 143 Chapter 1: Analytic Algorithms Neural Networks relating the inputs to the targets via a suitable mathematical function which can be learned from examples in the data. Feedforward neural networks The artificially simple (“toy”) model discussed above is the simplest neural network model one can construct. This model is used to explain some of the basic functionality and principles of neural networks as well as to describe the individual neuron. As mentioned above, however, a single neuron cannot perform a meaningful task on its own. Instead many interconnected neurons are needed to achieve any specific goal. This takes us to considering more neural network architectures which are used in practical applications. The next question is “how should neurons be connected together?” If a network is to be of any use, there must be inputs (which carry the values of variables of interest in the outside world) and outputs (which form predictions, or control signals). Inputs and outputs correspond to sensory and motor nerves such as those coming from the eyes and leading to the hands. However, there also can be hidden neurons that play an internal role in the network. The input, hidden, and output neurons need to be connected together. The key issue here is feedback (Haykin, 1994). A simple network has a feedforward structure: signals flow from inputs, forwards through any hidden units, eventually reaching the output units. Such a structure has stable behavior and fault tolerance. Feedforward neural networks are by far the most useful in solving real problems and therefore are the most widely used. See Bishop 1995 for more information on various neural networks types and architectures. 
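As an illustration only (not part of the product), the single-neuron model described above, output = f(w1x1 + ... + wdxd), and its layer-by-layer extension can be sketched in a few lines of Python. The names neuron_output and feedforward are invented for this example, and the choice of tanh as the activation is just one possibility.

import numpy as np

def neuron_output(weights, inputs, f=np.tanh):
    """Single neuron: a weighted sum of the inputs passed through an
    activation function f."""
    return f(np.dot(weights, inputs))

def feedforward(x, layers, f=np.tanh):
    """Minimal feedforward pass: 'layers' is a list of (W, b) pairs, one per
    hidden/output layer; each layer's activations feed the next layer."""
    a = np.asarray(x, dtype=float)
    for W, b in layers:
        a = f(W @ a + b)    # weighted sum plus bias, then activation
    return a

Executing the network layer by layer in this way is exactly the sequential evaluation described for feedforward architectures in the text that follows.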
A typical feedforward network has neurons arranged in a distinct layered topology. Generally, the input layer simply serves to introduce the values of the input variables. The hidden and output layer neurons are each connected to all of the units in the preceding layer. Again, it is possible to define networks that are partially connected to only some units in the preceding layer. However, for most applications, fully connected networks are better, and this is the type of network supported by STATISTICA Automatic Neural Networks (SANN). When the network is executed, the input variable values are placed in the input units, and then the hidden and output layer units are progressively executed in sequential order. Each of them calculates its activation value by taking the weighted sum of the outputs of the units in the preceding layer. The activation value is passed through the activation function to produce the output of the neuron. When the entire network has been executed, the neurons of the output layer act as the output of the entire network. Neural network tasks Like most statistical models, neural networks are capable of performing three major tasks including regression, and classification. Regression tasks are concerned with relating a number of input variables x with set of continuous outcomes t (target variables). By contrast, classification tasks assign class memberships to a categorical target variable given a set of input values. In the next section we will consider regression in more details. Regression and the family of nonparametric (black-box) tools One of the most straightforward and perhaps simplest approach to statistical inference is to assume that the data can be modeled using a closed functional form which can contain a 144 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks number of adjustable parameters (weights) which can be estimated so the model can provide us with the best explanation of the data in hand. For example, consider a regression problem in which we are modeling or approximating a single target variable t as a linear function of an input variable x. The mathematical function used to model such relationship is simply given by a linear transformation f with two parameters, namely the intercept a and slope b: t = f(x) = a + bx Our task is to find suitable values for a and b which relates an input x to the variable t. This problem is known as the linear regression. Another example of parametric regression is the quadratic problem where the input output relationship is described by the quadratic form: t = f(x) = a + bx2 The following schematic shows the difference between parametric and nonparametric models. In parametric models the input-target relationship is described by a mathematical function of closed form. By contrast, in nonparametric models, the input-target relationship is governed by an approximator (like a neural network) which cannot be represented by a standard mathematical function. Figure 72: Parametric Model vs. Non-Parametric Model (schematic) The examples above belong to the category of the so-called parametric methods. They strictly rely on the assumption that t is related to x in a priori known way, or can be sufficiently approximated by a closed mathematical from, e.g. a line or a quadratic function. Once the mathematical function is chosen, all we have to do is to adjust the parameters of the assumed model so they best approximate (predict) t given an instance of x. 
By contrast, non-parametric models generally make no assumptions regarding the relationship of x and t. In other words they assume that the true underlying function governing the relationship between x and t is not known a priori, hence the term “black box”. Instead, they attempt to discover a mathematical function (which often does not have a closed form) that can approximate the representation of x & t sufficiently well. The most popular examples of non-parametric models are polynomial functions with adaptable parameters, and Teradata Warehouse Miner User Guide - Volume 3 145 Chapter 1: Analytic Algorithms Neural Networks indeed neural networks. Since no closed form for the relationship between x and t is assumed, the non-parametric method must be sufficiently flexible to be able to model a wide spectrum of functional relationships. The higher the order of a polynomial, for example, the more flexible the model is. Similarly, the more neurons a neural network has the stronger the model becomes. Parametric models enjoy the advantage of being easy to use and having outputs which are easy to interpret. On the other hand, they suffer from the disadvantage of limited flexibility. Consequently, their usefulness strictly depends on how well the assumed input-target relationship survives the test of reality. Unfortunately many of the real world problems do not simply lend themselves to a closed form and the parametric representation may often prove too restrictive. No wonder then that statisticians and engineers often consider using nonparametric models, especially neural networks, as alternatives to parametric methods. Neural networks and classification tasks Neural networks, like most statistical tools, can also be used to tackle classification problems. By contrast to regression problems, a neural network classifier assigns class membership to an input x. For example, if the input set has three categories {A, B, C}, a neural network assigns each and every input to one of the three classes. The class membership information is carried in the target variable t. For that reason, in a classification analysis the target variable must always be categorical. A variable is categorical if (a) it can only assume discrete values which (b) cannot be numerically arranged (ranked). For example, a target variable with {MALE, FEMALE} is two- state categorical variable. A target variable with date values, however, is not truly categorical since values can be ranked (arranged according in numerical order). The multilayer perceptron neural networks The following figure shows a schematic diagram of a fully connected MLP2 neural network with three inputs, four hidden units (neurons) and 3 outputs. Note that the hidden and output layers have a bias term. Bias is a neuron which emits signals with strength 1. 146 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks Figure 73: Fully connected MLP2 neural network with three inputs (schematic) Multilayer perceptrons MLP is perhaps the most popular network architecture in use today, credited originally to Rumelhart and McClelland (1986) and discussed at length in most neural network textbooks (Bishop, 1995). Each neuron performs a weighted sum of its inputs and passes it through a transfer function f to produce its output. For each neural layer in an MLP network there is also a bias term. A bias is a neuron with its activation function permanently set to 1. 
Just as in other neurons, a bias connects to the neurons in the layer above via a weight which is often called a threshold. The neurons and biases are arranged in a layered feedforward topology. The network thus has a simple interpretation as a form of input-output model, with the weights and thresholds as the free (adjustable) parameters of the model. Such networks can model functions of nearly arbitrary complexity, with the number of layers and the number of units in each layer determining the function complexity. Important issues in Multilayer Perceptrons design include specification of the number of hidden layers and the number of units in these layers (Bishop, 1995). Others include the choice of activation functions and methods of training. The following schematic shows the difference between MLP and RBF neural networks in two dimensional input data. One way to separate the clusters of inputs is to draw appropriate planes separating the various classes from one another. This method is used by MLP networks. An alternative approach is to fit each class of input data with a Gaussian basis function. Teradata Warehouse Miner User Guide - Volume 3 147 Chapter 1: Analytic Algorithms Neural Networks Figure 74: MLP vs. RBF neural networks in two dimensional input data (schematic) The radial basis function neural networks The following figure shows a schematic diagram of an RBF neural network with three inputs, four radial basis functions and 3 outputs. Note that in contrast to MLP networks, it is only the output units which have a bias term. Figure 75: RBF Neural Network with three inputs (schematic) Another type of neural network architecture used by SANN is known as Radial Basis Functions (RBF). RBF networks are perhaps the most popular type of neural networks after MLPs. In many ways, RBF is similar to MLP networks. First of all they too have unidirectional feedforward connections and every neuron is fully connected to the units in the layer above. The neurons are arranged in a layered feedforward topology. Nonetheless, RBF neural networks models are fundamentally different in the way they model the input-target relationship. While MLP networks model the input-target relationship in one stage, an RBF network partitions this learning process into two distinct and independent stages. In the first 148 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks stage and with the aid of the hidden layer neurons known as radial basis functions, the RBF network models the probability distribution of the input data. In the second stage, it learns how to relate an input data x to a target variable t. Note that unlike MLP networks, the bias term in an RBF neural network connects to the output neurons only. In other words, RBF networks do not have a bias term connecting the inputs to the radial basis units. In the rest of this document we will refer to both weights and thresholds as weights for short unless it is necessary to make a distinction. Like MLP, the activation function of the inputs is taken to be the identity. The signals from these inputs are passed to each radial basis unit in the hidden layer and the Euclidean distance between the input and a prototype vector is calculated for each neuron. This prototype vector is taken to be the location of the basis function in the space of the input data. Each neuron in the output layer performs a weighted sum of its inputs and passes it through a transfer function to produce its output. 
Therefore, unlike an MLP network, an RBF network has two types of parameters: (1) the location and radial spread of the basis functions and (2) the weights which connect these basis functions to the output units.

Activation functions

As mentioned above, a multilayer perceptron MLP is a feedforward neural network architecture with unidirectional full connections between successive layers. This does not, however, uniquely determine the properties of a network. In addition to the network architecture, the neurons of a network have activation functions which transform the incoming signals from the neurons of the previous layer using a mathematical function. The type of this function represents the activation function itself and can profoundly influence the network performance. Thus it is very important to choose a type of activation function for the neurons of a neural network.

Input neurons usually have no activation function. In other words, they use the identity function, which means the input signals are not transformed at all. Instead they are combined in a weighted sum (weighted by the input-hidden layer weights) and passed on to the neurons in the layer above (usually called the hidden layer). For an MLP with two layers (MLP2) it is recommended that you use the tanh (hyperbolic) function, although other types are also possible such as the logistic sigmoid and exponential functions. The output neuron activation functions are in most cases set to identity, but this may vary from task to task. For example, in classification tasks it is set to softmax (Bishop 1995) while for regression problems they are set to identity (together with the choice of tanh for the hidden neurons).

The set of neuron activation functions for the hidden and output neurons available in SANN is given in the table below:

Table 50: Neuron Activation Functions for hidden/output neurons available in SANN
• Identity — Definition: a. Range: (-∞, +∞). The activation of the neuron is passed on directly as the output.
• Logistic sigmoid — Definition: 1 / (1 + e^(-a)). Range: (0, 1). An S-shaped curve.
• Hyperbolic tangent — Definition: (e^a - e^(-a)) / (e^a + e^(-a)). Range: (-1, 1). A sigmoid curve similar to the logistic function. Often performs better than the logistic function because of its symmetry. Ideal for multilayer perceptrons, particularly the hidden layers.
• Exponential — Definition: e^(-a). Range: (0, +∞). The negative exponential function.
• Sine — Definition: sin(a). Range: [0, 1]. Possibly useful if recognizing radially distributed data. Not used by default.
• Softmax — Definition: exp(ai) / Σj exp(aj). Range: (0, 1). Mainly used for (but not restricted to) classification tasks. Useful for constructing neural networks with normalized multiple outputs, which makes it particularly suitable for creating neural network classifiers with probabilistic outputs.
• Gaussian — Definition: (1 / (σ√(2π))) exp(-(x - μ)² / (2σ²)). Range: (0, 1). This type of isotropic Gaussian activation function is solely used by the hidden units of an RBF neural network, which are also known as radial basis functions. The location (also known as prototype vectors) and spread parameters are equivalent to the input-hidden layer weights of an MLP neural network.

Selecting the input variables

The number of input and output units is defined by the problem. The target (predicted dependent variable) is believed to depend on the inputs and so its choice is clear.
Not so when it comes to selecting the inputs (independent variables). There may be some uncertainty about which inputs to use. Using a sufficient number of correct inputs is a matter of great 150 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks importance in predictive modeling (i.e. relating a target variable to a set of inputs) and indeed all forms of statistical analysis. By including irrelevant inputs, for example, one may inadvertently damage the performance of the neural network. On the other hand, a dataset with an insufficient number of inputs may never be accurately modeled by a neural network. Neural network complexity The complexity of a neural network (and the two layer perceptron MLP2 & Radial Basis Functions) is measured by the number of neurons in the hidden layer. The more neurons in a neural network, the greater the flexibility and complexity of the system. Flexible neural networks can be used to approximate any function complexity which relates the input-target variables. Thus in order to model a dataset, it is important to have a sufficiently flexible neural network with enough neurons in the hidden layer. The optimal number of neurons depends on the problem domain, but it is generally related to the number of inputs. Training of neural networks Once a neural network architecture is selected (i.e. neural network type), activation functions, etc., the remaining adjustable parameters of the model are the weights connecting the inputs to the hidden neurons and the hidden neurons to the output neurons. The process of adjusting these parameters so the network can approximate the underlying functional relationship between the inputs x and the targets t is known as “training”. It is in this process that the neural network learns to model the data by example. Although there are various methods to train neural networks, implementing most of them involve numeric algorithms which can complete the task in a finite number of iterations. The need for these iterative algorithms is primarily due to the highly nonlinear nature of neural network models for which a closed form solution is most often unavailable. An iterative training algorithm gradually adjusts the weights of the neural network so that for any given input data x the neural network can produce an output which is as close as possible to t. Weights initialization Because training neural networks require an iterative algorithm in which the weights are adjusted, the weights must first be initialized to reasonable starting values. This may sometimes affect not only the quality of the solution, but also the time needed to prepare the network (training). It is important that you initialize the weights using small weight values so at that at the start of training the network operates in a linear mode, and then let it increase the values of its weights to fit the data accurately enough SANN provides you with two random methods for initializing weights using the normal and uniform distributions. The normal method initializes the weights using normally distributed values, within a range whose mean is zero and standard deviation equal to one. On the other hand, the uniform method assigns weight values in the range 0 and 1. Neural Network training - learning by examples A neural network on its own cannot be used for making predictions unless it is trained on some examples known as training data. 
Neural Network training - learning by examples

A neural network on its own cannot be used for making predictions unless it is trained on some examples known as training data. The training data usually consists of input-target pairs which are presented one by one to the network during training. You may view the input instances as “questions” and the target values as “answers”. Therefore, each time a neural network is presented with an input-target pair it is effectively told the answer for a given question. Nonetheless, at each instance of this presentation the neural network is required to make a guess using the current state (i.e. value) of the weights, and its performance is assessed using a criterion known as the error function. If the performance is not adequate, the network weights are adjusted to produce the right (or a more nearly correct) answer compared to the previous attempt. In general, this learning process is noisy to some extent (i.e. the network answers may sometimes be more accurate in the previous cycle of training compared to the current one), but on average the errors reduce in size as the network learning improves. The adjustment of the weights is usually carried out using a training algorithm, which, like a teacher, teaches the neural network how to adapt its weights in order to make better predictions for each input-target pair in the dataset. The above steps are known as training. Algorithmically, training is carried out using the following sequence of steps:

1 Present the network with an input-target pair.
2 Compute the predictions of the network for the targets.
3 Use the error function to calculate the difference between the predictions (output) of the network and the target values.
4 Continue with steps 1 and 2 until all input-target pairs are presented to the network.
5 Use the training algorithm to adjust the weights of the network so that it gives better predictions for each and every input-target pair.

Note that steps 1-5 form one training cycle or iteration. The number of cycles needed to train a neural network model is not known a priori, but can be determined as part of the training process. Repeat steps 1 to 5 for a number of training cycles or iterations until the network starts producing sufficiently accurate outputs (i.e. outputs which are close enough to the targets given their input values). A typical neural network training process consists of hundreds of cycles.
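To make the training cycle concrete, the sketch below runs steps 1 through 5 on a toy linear model, with a plain gradient descent update standing in for the training algorithm. It is an illustration under those simplifying assumptions, not the algorithm used by SANN.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))               # input patterns ("questions")
    t = X @ np.array([0.5, -1.0, 2.0])         # targets ("answers")
    w = rng.normal(scale=0.1, size=3)          # small initial weights
    lr = 0.01                                  # learning rate

    for cycle in range(500):                   # repeat training cycles
        total_error = 0.0
        for x_i, t_i in zip(X, t):             # steps 1 and 4: present each pair
            y_i = x_i @ w                      # step 2: network prediction
            err = y_i - t_i                    # step 3: difference from the target
            total_error += err ** 2
            w -= lr * err * x_i                # step 5: adjust the weights
        if total_error < 1e-6:                 # outputs close enough to the targets
            break

    print(cycle, round(total_error, 8))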
The error function

As discussed above, the error function is used to evaluate the performance of a neural network during training. It is like an examiner who assesses the performance of a student. The error function measures how close the network predictions are to the targets and, hence, how much weight adjustment should be applied by the training algorithm in each iteration. Thus the error function is the eyes and ears of the training algorithm as to how well a network performs given its current state of training (and hence how much adjustment should be made to the value of its weights). All error functions used for training neural networks must provide some sort of distance measure between the targets and predictions at the location of the inputs. One common approach is to use the sum-of-squares error function. In this case the network learns a discriminant function. The sum-of-squares error is simply given by the sum of squared differences between the targets and predictions over the entire training set. Thus:

E_SOS = Σ_{i=1}^{N} (y_i - t_i)²

where N is the number of training cases, t_i is the target value of the ith data case, and y_i is the network prediction (output) for that case. It is clear that the bigger the difference between the predictions of the network and the targets, the higher the error value, which means more weight adjustment is needed by the training algorithm.

The sum-of-squares error function is primarily used for regression analysis, but it can also be used in classification tasks. Nonetheless, a true neural network classifier must have an error function other than sum-of-squares, namely the cross entropy error function. It is with the use of this error function, together with the softmax output activation function, that we can interpret the outputs of a neural network as class membership probabilities. The cross entropy error function is given by:

E_CE = - Σ_{i=1}^{N} t_i ln(y_i / t_i)

which assumes that the target variables are derived from a multinomial distribution. This is in contrast to the sum-of-squares error, which models the distribution of the targets as a normal probability density function.

The training algorithm

Neural networks are highly nonlinear tools which are usually trained using iterative techniques. The most recommended techniques for training neural networks are the BFGS and Scaled Conjugate Gradient algorithms (see Bishop 1995). These methods perform significantly better than the more traditional algorithms such as Gradient Descent, but they are, generally speaking, more memory intensive and computationally demanding. Nonetheless, these techniques may require a smaller number of iterations to train a neural network given their fast convergence rate and more intelligent search criterion.

Training multilayer perceptron neural networks

SANN provides several options for training MLP neural networks. These include BFGS, Scaled Conjugate Gradient and Gradient Descent.

Training radial basis function neural networks

The method used to train radial basis function networks is fundamentally different from that employed for MLPs. This is mainly due to the nature of RBF networks, whose hidden neurons (basis functions) form a Gaussian mixture model which estimates the probability density of the input data (see Bishop 1995). For RBF networks with linear output activation functions the training process involves two stages. In the first stage we fix the location and radial spread of the basis functions using the input data (no targets are considered at this stage). In the second stage we fix the weights connecting the radial basis functions to the output neurons. For an identity output activation function this second stage of training involves a simple matrix inversion. Thus it is exact and does not require an iterative process. This linear training, however, holds only when the error function is sum-of-squares and the output activation functions are the identity. If these requirements are not met, as in the case of the cross-entropy error function and output activation functions other than the identity, we have to resort to an iterative algorithm, e.g. BFGS, to fit the hidden-output layer weights in order to complete the training of the RBF neural network.
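The two error functions defined above can be written directly in Python. The sketch below is illustrative only; the helper names are hypothetical, and a small constant is added to avoid taking the logarithm of zero.

    import numpy as np

    def sum_of_squares_error(y, t):
        # E_SOS: sum over all training cases of (y_i - t_i)^2
        return np.sum((y - t) ** 2)

    def cross_entropy_error(y, t, eps=1e-12):
        # E_CE = -sum_i t_i * ln(y_i / t_i), with y the softmax outputs and
        # t the 1-of-N coded targets; eps guards against log(0).
        return -np.sum(t * np.log((y + eps) / (t + eps)))

    # Regression example: predictions vs. continuous targets.
    print(sum_of_squares_error(np.array([1.9, 3.2]), np.array([2.0, 3.0])))

    # Classification example: softmax outputs vs. a 1-of-N coded target.
    print(cross_entropy_error(np.array([0.7, 0.2, 0.1]),
                              np.array([1.0, 0.0, 0.0])))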
Generalization and performance

The performance of a neural network is measured by how well it can predict unseen data (an unseen dataset is one not used during training). This is known as generalization. The issue of generalization is actually one of the major concerns when training neural networks: the tendency to overfit the training data is accompanied by difficulty in predicting new data. While one can always fine-tune (overfit) a sufficiently large and flexible neural network to achieve a perfect fit (viz. zero training error), the real issue is how to construct a network which is capable of predicting new data well. As it turns out, there is a relation between overfitting the training data and poor generalization. Thus, when training neural networks, one must take the issue of performance and generalization into account.

Test data and early stopping

The following figure shows a schematic of neural network training with early stopping. The network is repeatedly trained for a number of cycles so long as the test error is decreasing. When the test error starts to increase, training is halted.

Figure 76: Neural Network Training with early stopping (schematic)

There are several techniques for combatting the problem of overfitting and tackling the generalization issue. The most popular ones involve the use of test data. Test data is a holdout sample which is never used in training. Instead it is used as a means of checking how well a network makes progress in modeling the input-target relationship as training continues. Most work on assessing performance in neural modeling concentrates on approaches to test data. A neural network is optimized using a training set, while a separate test set is used to halt training in order to mitigate overfitting. The process of halting neural network training to prevent overfitting and improve generalization ability is known as “early stopping”. This technique slightly modifies the training algorithm to:

1 Present the network with an input-target pair from the training set.
2 Compute the predictions of the network for the targets.
3 Use the error function to calculate the difference between the predictions (output) of the network and the target values.
4 Continue with steps 1 and 2 until all input-target pairs from the training set are presented to the network.
5 Use the training algorithm to adjust the weights of the network so that it gives better predictions for each and every input-target pair.
6 Pass the entire test set to the network, make predictions and compute the value of the network test error.
7 Compare the test error with the one from the previous iteration. If the error keeps decreasing, continue training; otherwise stop training.

Note that the number of cycles needed to train a neural network model with test data and early stopping may vary. In theory, we would continue training the network for as many cycles as needed so long as the test error is on the decrease.
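The sketch below illustrates the early stopping rule in steps 6 and 7 on a toy linear regression problem, with a batch gradient descent pass standing in for steps 1 through 5. It is a simplified illustration under those assumptions, not the TWM training loop.

    import numpy as np

    rng = np.random.default_rng(1000)
    X = rng.normal(size=(150, 4))
    t = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.2, size=150)

    # Hold out a test sample that is never used to adjust the weights.
    X_train, t_train = X[:100], t[:100]
    X_test, t_test = X[100:], t[100:]

    w = rng.normal(scale=0.1, size=4)
    lr, prev_test_error = 0.01, np.inf

    for cycle in range(10000):
        # Steps 1-5: one pass over the training set (batch form for brevity).
        w -= lr * X_train.T @ (X_train @ w - t_train) / len(t_train)
        # Step 6: pass the entire test set through the network and compute its error.
        test_error = np.sum((X_test @ w - t_test) ** 2)
        # Step 7: stop as soon as the test error stops decreasing.
        if test_error >= prev_test_error:
            break
        prev_test_error = test_error

    print("stopped after cycle", cycle, "with test error", round(test_error, 4))

In practice the test error is usually allowed to fluctuate for a few cycles before stopping; the single-comparison rule above is kept deliberately simple.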
Validation data

Sometimes the test data alone may not be sufficient proof of the good generalization ability of a trained neural network. For example, it is entirely possible that a good performance on the test sample is actually just a coincidence. To make sure that this is not the case, another set of data known as the validation sample is often used. Just like the test sample, a validation sample is never used for training the neural network. Instead it is used at the end of training as an extra check on the performance of the model. If the performance of the network is found to be consistently good on both the test and validation samples, then it is reasonable to assume that the network generalizes well on unseen data.

Regularization

Besides the use of test data for early stopping, another technique frequently used for improving the generalization of neural networks is known as regularization. The method involves adding a term to the error function which penalizes (discourages) large weight values. One of the most common choices of regularization is known as weight decay (Bishop 1995). Weight decay works by modifying the network's error function to penalize large weights by adding an additional term E_w (the same applies to the cross-entropy error function):

E = E_SOS + E_w,  where  E_w = (λ/2) wᵀw

where λ is the weight decay constant and w are the network weights (biases excluded). The larger λ, the more the weights are penalized. Consequently, too large a weight decay constant may damage network performance by encouraging underfitting, and experimentation is generally needed to determine an appropriate weight decay factor for a particular problem domain. The generalization ability of the network can depend crucially on the decay constant. One approach to choosing the decay constant is to train several networks with different amounts of decay and estimate the generalization error for each, then choose the decay constant that minimizes the estimated generalization error. The above form encourages the development of smaller weights, which tends to reduce the problem of overfitting by limiting the ability of the network to form large curvature, thereby potentially improving the generalization performance of the network. The result is a network which compromises between performance and weight size. It should be noted that the basic weight decay model above might not always be the most suitable way of imposing regularization. A fundamental consideration with weight decay is that different weight groups in the network usually require different decay constants. Although this may be problem dependent, it is often the case that certain groups of weights in the network require different scale values for effective modeling of the data. An example is the input-hidden and hidden-output weights. Therefore, SANN uses separate weight decay values for regularizing these two groups of weights.
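The penalty term itself is a one-liner; the sketch below shows the regularized error with separate decay constants for the input-hidden and hidden-output weight groups. The numbers and names are illustrative examples only, not SANN defaults.

    import numpy as np

    def weight_decay_penalty(weights, decay):
        # E_w = (decay / 2) * w'w over one group of weights (biases excluded).
        return 0.5 * decay * np.sum(weights ** 2)

    rng = np.random.default_rng(0)
    w_input_hidden = rng.normal(size=(13, 8))     # input-hidden weights
    w_hidden_output = rng.normal(size=(8, 1))     # hidden-output weights

    e_sos = 12.7                                  # error from the data fit (example value)
    decay_hidden, decay_output = 0.001, 0.01      # separate decay constants per group

    # Regularized error: data term plus one penalty per weight group.
    e_total = (e_sos
               + weight_decay_penalty(w_input_hidden, decay_hidden)
               + weight_decay_penalty(w_hidden_output, decay_output))
    print(round(e_total, 3))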
Pre and post processing of data

All neurons in a neural network take numeric input and produce numeric output. The activation function of a neural unit can accept input values in any range but produces output in a strictly limited range. Although the input can be in any range, there is a saturation effect, so the unit is only sensitive to inputs within a fairly limited range. Consider, for example, the logistic function. In this case the output is in the range (0, 1), and the input is sensitive in a range not much larger than (-1, +1). Thus, for a wide range of input values outside (-1, +1), the output of a logistic neuron is approximately the same. This saturation effect can severely limit the ability of a network to capture the underlying input-target relationship. The above problem can be solved by limiting the numerical range of the original input and target variables. This process is known as scaling, one of the most commonly used forms of preprocessing. SANN scales the input and target variables using linear transformations such that the original minimum and maximum of each and every variable is mapped to the range (0, 1). There are other important reasons for standardization of the variables. One is related to weight decay: standardizing the inputs and targets will usually make weight decay regularization more effective. Other reasons include the original variable scaling and units of measure. It is often the case that variables in the original dataset have substantially different ranges (i.e. different variances). This may have to do with the units of measurement or simply the nature of the variables themselves. The numeric range of a variable, however, may not be a good indication of the importance of that variable.

Predicting future data and deployment

A fully trained neural network can be used for making predictions on any future data with variables which are thought to have been generated by the same underlying relations and processes as the original set used to train the model. The ability to generalize is an important feature of neural networks, and the process of using neural networks for making predictions on future data is known as deployment. SANN generated models can be saved and re-deployed later using the Predictive Model Markup Language (PMML). (See “Neural Networks Scoring” on page 249 for more information.) There is one issue which needs consideration, however, when deploying neural network models. One should not present a neural network model with input values differing significantly from those used to train the network. This is known as extrapolation, which is generally unwise and unsafe.

Recommended textbooks

• Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford: Oxford University Press.
• Carling, A. (1992). Introducing Neural Networks. Wilmslow, UK: Sigma Press.
• Fausett, L. (1994). Fundamentals of Neural Networks. New York: Prentice Hall.
• Haykin, S. (1994). Neural Networks: A Comprehensive Foundation. New York: Macmillan Publishing.
• Patterson, D. (1996). Artificial Neural Networks. Singapore: Prentice Hall.
• Ripley, B.D. (1996). Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.

Initiate a Neural Networks Function

Use the following procedure to initiate a new Neural Networks analysis in Teradata Warehouse Miner:

1 Click on the Add New Analysis icon in the toolbar:

Figure 77: Add New Analysis from toolbar

2 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Neural Networks:

Figure 78: Add New Analysis dialog

3 This will bring up the Neural Networks dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.

Neural Networks - INPUT - Data Selection

On the Neural Networks dialog click on INPUT and then click on data selection:

Figure 79: Neural Network > Input > Data Selection

On this screen select:

1 Select Input Source

Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner.
By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 158 Select Columns From a Single Table Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks • Available Databases (or Analyses) — All the databases (or analyses) available for input to Neural Networks. • Available Tables — All the tables available for the Neural Networks analysis. • Available Columns — All the columns available for the Neural Networks analysis. • Neural Network Style — Select the type of analysis to perform: regression or classification. This will constrain the types of variables selectable as input or output (independent or dependent) to conform to the respective type of Neural Networks analysis. • • Regression — Select this option when your dependent (output) variables of interest are continuous in nature (e.g., weight, temperature, height, length, etc.). • Classification — Select this option when your dependent (output) variables of interest are categorical in nature (e.g., gender). Note that for classification analysis, you can only specify one target. Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a triply split window. You can insert columns as Dependent or Continuous Independent or Categorical Independent columns. Make sure you have the correct portion of the window highlighted. The Dependent variable columns are the columns whose values are being predicted by the Neural Networks model. The user may choose to treat a categorical variable as either categorical or continuous but a continuous variable must be treated as continuous only. Therefore the Independent Categorical columns cannot be of numeric type. The Independent Continuous columns and the Dependent columns may be of any type. Neural Networks - INPUT - Network Types On the Neural Networks dialog click on INPUT and then click on Network Types: Figure 80: Neural Network > Input > Network Types Network Types Use these options to specify the type of network (MLP or RBF). For each selected type, you can also specify a range for the complexity (i.e., minimum and maximum number of hidden units) of the neural network models to be tried by the Automatic Network Search (ANS). Specify the complexity of networks to be tested in terms of a range of figures for the number of hidden units. Specifying the number of hidden units exactly (i.e., by setting the minimum equal to the maximum) may be beneficial if you know, or have good cause to suspect, the optimal number. In this case, it allows the Automatic Network Search (ANS) to concentrate its search algorithms. 
The wider the range of hidden units in the ANS search, the better the Teradata Warehouse Miner User Guide - Volume 3 159 Chapter 1: Analytic Algorithms Neural Networks chances of finding an optimal model. On the other hand, if you know the optimal complexity, simply set the minimum and maximum number of hidden units to that value. MLP Select the MLP check box to include multilayer perceptron networks in the network search. The multilayer perceptron is the most common form of network. It requires iterative training, which may be quite slow, but the networks are quite compact, execute quickly once trained, and in most problems yield better results than the other types of networks. • Minimum Hidden Units — Specify the minimum number of hidden units to be tried by the Automatic Network Search (ANS) when using MLP networks. • Maximum Hidden Units — Specify the maximum number of hidden units to be tried by the Automatic Network Search (ANS) when using MLP networks. RBF Select the RBF check box to include radial basis function networks in the network search. Radial basis function networks tend to be slower and larger than Multilayer Perceptrons, and often have a relatively inferior performance, but they train extremely quickly when they use the identity output activation functions. They are also usually less effective than multilayer perceptrons if you have a large number of input variables (they are more sensitive to the inclusion of unnecessary inputs). • Minimum Hidden Units — Specify the minimum number of hidden units to be tried by the Automatic Network Search (ANS) when using RBF networks. • Maximum Hidden Units — Specify the maximum number of hidden units to be tried by the Automatic Network Search (ANS) when using RBF networks. Neural Networks - INPUT - Network Parameters On the Neural Networks dialog click on INPUT and then click on network parameters: Figure 81: Neural Network > Input > Network Parameters On this screen select: • Network Options 160 • Networks to Train — Use this option to specify how many networks the Automatic Network Search (ANS) should perform. The larger the number of networks trained the more detailed is the search carried out by the ANS. It is recommended that you set the value for this option as large as possible depending on your hardware speed and resources. • Networks to Retain — Specifies how many of the neural networks tested by the Automatic Network Search (ANS) should be retained (for testing, and then insertion Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks into the current network set). Networks with the best performance (i.e., best correlation fit for regression and classification rate for classification analysis) will be retained. • Error Functions • SOS — Select SOS to generate networks using the sum of squares error function. This option is available for both regression and classification analysis. • Cross Entropy — Select Cross entropy to generate networks using the cross entropy error function. Such networks perform maximum likelihood optimization, assuming that the data is drawn from the multinomial family of distributions. Together with the use of Softmax as output activation functions this supports a direct probabilistic interpretation of network outputs as probabilities. This error function is only available for classification, not for regression. Note: When the Cross Entropy error function is used to train a neural network, the output activation functions will always be of type softmax. 
• Weight Decay — This parameter specifies the use of weight decay regularization, promoting the development of smaller weights. This tends to reduce the problem of overfitting, thereby potentially improving generalization performance of the network. Weight decay works by modifying the network's error function to penalize large weights, resulting in an error function that compromises between performance and weight size. Because too large a weight decay term may damage network performance unacceptably, experimentation is often needed to determine an appropriate weight decay factor in a particular problem domain. • Output Layer — Select this option to apply weight decay regularization to the hiddenoutput layer weights. Specify the minimum and maximum weight decay value for the output layer weights. • Hidden Layer — Select this option to apply weight decay regularization to the inputhidden layer weights (Not applicable to RBF networks). When minimum and maximum weight decay values are selected for the hidden layer weights, the ANS will search for the best weight parameters within the specified range. Neural Networks - INPUT - MLP Activation Functions On the Neural Networks dialog click on INPUT and then click on MLP activation functions: Figure 82: Neural Network > Input > MLP Activation Functions • Hidden Neurons — This is the set of activation functions available to be used for the hidden layer for MLP neural networks. For RBF neural networks, however. only Gaussian activation functions are allowed. • Identity — Uses the identity function. With this function, the activation level is passed on directly as the output. Teradata Warehouse Miner User Guide - Volume 3 161 Chapter 1: Analytic Algorithms Neural Networks • Logistic — Uses the logistic sigmoid function. This is an S-shaped (sigmoid) curve, with output in the range (0,1). • Tanh — Uses the hyperbolic tangent function (recommended), is a symmetric Sshaped (sigmoid) function with output in range (-1, +1), and often performs better than the logistic sigmoid function because of its symmetry. • Exp — Uses the exponential activation function. • Sin — Uses the standard sine activation function. Since it may be useful only when data is radially distributed, it is not selected by default. • Output Neurons — This is the set of activation functions available to be used for the outputs for MLP neural networks. For RBF neural networks, however, only the identity function is allowed. When the error is “Cross Entropy”, MLP and RBF output activation functions are softmax. • Identity — Uses the identity function (recommended). With this function, the activation level is passed on directly as the output. • Logistic — Uses the logistic sigmoid function. This is an S-shaped (sigmoid) curve, with output in the range (0,1). • Tanh — Uses the hyperbolic tangent function, is a symmetric S-shaped (sigmoid) function with output in range (-1, +1), and often performs better than the logistic sigmoid function because of its symmetry. • Exp — Uses the exponential activation function. • Sin — Uses the standard sine activation function. This may be useful only when the data is radially distributed. Therefore it is not selected by default. Neural Networks - INPUT - Sampling On the Neural Networks dialog click on INPUT and then click on sampling: Figure 83: Neural Network > Input > Sampling • Teradata Sampling — Specify the fraction of the data table to sample from the Teradata table to pass on to the Automatic Network Search engine. 
• Neural Networks Sampling — The performance of a neural network is measured by how well it generalizes to unseen data (how well it predicts data not used during training). The issue of generalization is one of the major concerns when training neural networks. When the training data have been overfit (fit so completely that even the random noise within a particular data set is reproduced), it is difficult for the network to make accurate predictions using new data. A way to reduce this problem is to split the data into two (or three) subsets: a training sample, a testing sample and a validation sample. These samples can then be used to (1) train the network, (2) cross verify (or test) the performance of the training algorithms as they run, and (3) perform an final validation test to determine how well the network predicts “new” data. The assignment of the cases to the subsets can be 162 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks done randomly or based upon a special subset variable in the data set. Cases will be randomly assigned to subsets based on specified percentages with the total percentage summing to more than 0 and less or equal 100. To not split the data into subsets, enter a 100 in the Train sample size (%) field. Note, however, that the use of the test sample is strongly recommended to aid with training the networks. • Train Sample Size — Specify the percent of valid cases to use in the training sample. Default is 80%. • Test Sample Size — Randomly assign cases to a test sample, specifying the percentage of cases to use. Default is 20%. • Validation Sample Size — Randomly assign cases to a validation sample, specifying the percentage of cases to use. Default is 0%. • Seed for Sampling — The positive integer used as the seed for a random number generator that produces the random sub samples from the data. By changing the seed you can end up with different data cases in the train, test and validation samples (for each new analysis). Default is 1000. Run the Neural Networks After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Neural Networks The results of running the Teradata Warehouse Miner Neural Networks analysis include a variety of statistical reports on the individual variables and generated model as well as barcharts. All of these results are outlined below. On the Neural Networks dialog, click on RESULTS (note that the RESULTS tab will be grayedout/disabled until after the analysis is completed) to view results. Result options are as follows: Neural Network Reports • Neural Network Summary — For categorical analyses, this report gives the model name, training performance, test performance, training algorithm, error function, type of hidden activation, and type of output activation. For regression analysis, it gives the model name, training performance, test performance, training error, test error, training algorithm, error function, type of hidden activation, and type of output activation. • Correlation Coefficients (Regression only) — This report is a spreadsheet of correlation coefficients for each model. 
Teradata Warehouse Miner User Guide - Volume 3 163 Chapter 1: Analytic Algorithms Neural Networks • Data Statistics — This report contains some statistics (minimum, maximum, mean and standard deviation) of the input and target variables for training, testing, and validation samples. • Weights and Thresholds — This report is a spreadsheet of weights and thresholds for each model. • Sensitivity Analysis — A sensitivity analysis is displayed for each model in a spreadsheet. Sensitivity analysis rates the importance of the models' input variables. • Confusion Matrix (Classification only) — A confusion matrix is displayed for each model in a spreadsheet. This is a detailed breakdown of misclassifications. The observed class is displayed at the top of the matrix, and the predicted class down the side; each cell contains a number showing how many cases that were actually of the given observed class were assigned by the model to the given predicted class. In a perfectly performing model, all the cases are counted in the leading diagonal. • Classification Summary (Classification only) — A classification summary is displayed for each model in a spreadsheet. This gives the total number of observations in each class of the target, the number of correct and incorrect predictions for each class, and the percentage of correct and incorrect predictions for each class. This information is provided for each network. • Confidence (Classification only) — A confidence matrix is displayed for each model in a spreadsheet. Confidence levels will be displayed for each model. Pointwise Sensitivity Analysis (Regression with no categorical inputs only) • Pointwise Sensitivity — Generates a separate spreadsheet of model sensitivities for each model. Model sensitivities are values that indicate how sensitive the output of a neural network is to a given input at a particular location of the input. These sensitivity values are actual first-order derivatives evaluated at specific centile points for each input. For each input the derivative is taken with respect to the target at ten evenly spaced locations with the observed minimum and maximum values serving as end points. Other input variables are set to their respective means during this calculation. A separate spreadsheet is also generated for each dependent (target) variable as well. Note this option is available only for regression analyses with no categorical inputs. Neural Network Graphs Note: For more than one target (dependent variable), only the first will be available in graphs. • X-axis/Y-axis/Z-axis — Use the X-axis, Y-axis, and Z-axis list boxes to select a quantity to plot on the respective axis. Available graph types are dependent on the number of selections you make with these list boxes. For example, if you want to generate a 3D surface plot, you must select a value in each list box. By default, values are selected in the X-axis and Y-axis list box enabling you to create histograms (for the X-axis variables) or 2D scatter plots (for the X-axis and Y-axis variables). • Target — Select Target to plot the target that has been selected in the Target variable list on the selected axis. 164 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks • Output — Select Outputs to plot the output (or predicted value) of the target that has been selected in the Target variable list on the selected axis. • Residual — Select Residual to plot the residual value (for the selected target variable) on the axis. (regression only) • Std. 
Residual — Select Std. Residual to plot the standardized residual value (for the selected target variable) on the axis. (regression only). • Abs. Residual — Select Abs. Residual to plot the absolute value of the residual (for the selected target variable) on the axis. (regression only) • Square Residual — Select Square Residual to plot the squared residual value (for the selected target variable) on the axis. (regression only) • Accuracy — Select Accuracy to plot the accuracy (incorrect or correct) of the prediction on the selected axis. (classification only) • Conf — For classification type analyses, the confidence level for each category of the target can be selected and plotted on the axis. • Input variables — Each input variable is listed by name and is also available for selection. • Histogram of X — Click the Histogram of X button to generate a histogram of the quantity selected in the X-axis list box. When there is more than one active network, individual histograms will be generated for each network, when applicable. For example, if you select Residual in the X-axis box, then click Histograms of X, a histogram will be generated for each of the networks in the Active neural networks grid. • X and Y — Click the X and Y button to generate a 2D scatter plot of the variables selected in the X-axis and Y-axis list boxes. When there is more than one active network, a multiple scatter plot will be generated that plots the selected values for all networks, where applicable. For example, if you select Target in the X-axis box and Output in the Yaxis box, then click the X and Y button, only one scatter plot will be generated. It will contain a Target by Output plot for each of the active networks. • X, Y and Z — Click the X, Y and Z button generate a 3D surface plot of the variables selected in the X-axis, Y-axis, and Z-axis list boxes. When there is more than one active network, an individual surface plot will be generated for each network, when applicable. For example, if there are three active networks, a surface plot will be generated for each network. Lift Charts (Classification models only). These charts may be used to evaluate and compare the utility of the model for predicting the different categories or classes for the categorical target variable. Select the option that specifies the type of chart and the scaling for the chart you wish to compute. • Category — Select the response category for which to compute the gains and/or lift charts. You can chose to produce lift charts for a single or all categories. • Gains chart — Select this option button to compute a gains chart, which shows the percent of observations correctly classified into the chosen category (see Category of response) when taking the top x percent of cases from the sorted (by classification probabilities) data file. For example, this chart can show that by taking the top 20 percent (shown on the Teradata Warehouse Miner User Guide - Volume 3 165 Chapter 1: Analytic Algorithms Neural Networks x-axis) of cases classified into the respective category with the greatest certainty (maximum classification probability), you would correctly classify almost 80 percent of all cases (as shown on the vertical y-axis of the plot) belonging to that category in the population. In this plot, the baseline random classification (selection of cases) would yield a straight line (from the lower-left to the upper-right corner), which can serve as a comparison to gauge the utility of the respective models for classification. 
• Lift chart (resp %) — Select this option to compute a lift chart where the vertical y-axis is scaled in terms of the percent of all cases belonging to the respective category. As in the gains chart, the x-axis denotes the respective top x percent of cases from the sorted (by classification probabilities) data file. • Lift chart (lift value) — Select this option to compute a lift chart where the vertical y-axis is scaled in terms of the lift value, expressed as the multiple of the baseline random selection model. For example, this chart can show that by taking the top 20 percent (shown on the x-axis) of cases classified into the respective category with the greatest certainty (maximum classification probability), you would end up with a sample that has almost 4 times as many cases belonging to the respective category when compared to the baseline random selection (classification) model. • Cumulative — Show in the chosen lift and gains charts the cumulative percentages, lift values, etc. Clear this check box to show the simple (non-cumulative) values. • Lift Graphs — Creates the chart according to the options above. Tutorial - Neural Networks Tutorial 1: Performing Regression with Fictitious Banking Data For this example, we will use twm_customer_analysis, a fictitious banking dataset. The following is an example of using neural networks analysis regression. Here, cc_rev (credit card revenue) is predicted in terms of 20 independent variables, some continuous and some categorical. Starting the Analysis After connecting to the appropriate Teradata database: 1 166 Starting from the Input > Data Selection menu, select “Table” as Input Source, teraminer as the database, twm_customer_analysis as the input table and “Regression” as Neural Network Style. As Categorical Dependent Columns, select cc_rev, as Continuous Independent Columns select variables income, age, years_with_bank, nbr_children, avg_ cc_bal, avg_ck_bal, avg_sv_bal, avg_cc_tran_amt, avg_cc_tran_cnt, avg_ck_tran_amt, avg_ck_tran_cnt, avg_sv_tran_amt, and avg_sv_tran_cnt, and as Categorical Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks Independent Columns select variables female, single, married, separated, ccact, ckacct, and svacct. Figure 84: Neural Network Tutorial 1: Data Selection Tab 2 Next, select the network types tab, which also includes network complexity, or number of hidden units, and click the button, “Load Defaults for Editing”. Teradata Warehouse Miner User Guide - Volume 3 167 Chapter 1: Analytic Algorithms Neural Networks Figure 85: Neural Network Tutorial 1: Network Types Tab The ANS can be configured to train both multilayer perceptron (MLP) networks and radial basis functions (RBF) networks. The multilayer perceptron is the most common form of network. It requires iterative training, which may be relatively slow, but the networks are quite compact, execute quickly once trained, and in most problems yield better results than the other types of networks. Radial basis function networks tend to be larger and hence slower than multilayer perceptrons, and often have a relatively inferior performance, but they train extremely quickly provided they use SOS error and linear output activation functions. They are also usually less effective than multilayer perceptrons if you have a large number of input variables (they are more sensitive to the inclusion of unnecessary inputs). 
Note that RBF networks are not appropriate for models that contain categorical inputs, so the RBF option is not available when categorical inputs are included in the model, as in this case. • 168 Network complexity (number of hidden units). One particular issue to which you need to pay attention is the number of hidden units (network complexity). For example, if you run ANS several times without producing any good networks, you may want to consider increasing the range of network complexity tried by ANS. Alternatively, if you believe that a certain number of neurons is optimal for your problem, you may then exclude the complexity factor from the ANS algorithm by simply setting the Min. hidden units equal to the Max. hidden units. This way you will help the ANS to concentrate on other network parameters in its search for the best network architecture and specifications, which unlike the number of hidden units, you do not know a priori. Note that network complexity is set separately for each network type. Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks 3 Next, select the network parameters tab, which also includes error function and weight decay. Figure 86: Neural Network Tutorial 1: Network Parameters Tab • Networks to train, networks to retain. The number of networks that are to be trained and retained can be modified. You can specify any number of networks you may want to generate (the only limits are the resources on your machine) and choose to retain any number of them when training is over. If you want to retain all the models you train, set the value in Networks to train equal to the Networks to retain. However, often it is better to set the number of networks to retain to a smaller value than the number of networks to train. This will result in TWM Neural Networks retaining a subset of those networks that perform best on the data set. The ANS is a search algorithm that helps you create and test neural networks for your data analysis and prediction problems. It designs a number of networks to solve the problem, copies these into the current network set, and then selects those networks that perform best. For that reason, it is recommended that you set the value in Networks to train to as high as possible, even though it may take some time for TWM Neural Networks to complete the computation for data sets with many variables and data cases and/or networks with a large number of hidden units. This configures TWM Neural Networks to thoroughly search the space of network architectures and configurations and select the best for modeling the training data. Teradata Warehouse Miner User Guide - Volume 3 169 Chapter 1: Analytic Algorithms Neural Networks 4 • Error function. Specifies the error function to be used in training the networks. Because the analysis is regression, the SOS (sum-of-squares) error is the only option available since Cross Entropy is exclusively used for classification tasks. • Weight Decay. Use the default selection for Weight decay, which specifies use of weight decay regularization. • Use weight decay (hidden layer). Use the default selections for output layer weight decay, including minimum and maximum values. • Use weight decay (output layer). Use the default selections for hidden layer weight decay, including minimum and maximum values. Next, select the MLP Activation Functions tab. • MLP activation functions. This is a list of activation functions available for hidden and output layers of MLP networks. 
Figure 87: Neural Network Tutorial 1: MLP Activation Functions Tab Although most of the default configurations for the ANS are calculated from properties of the data, it is sometimes necessary to change these configurations to something other than the default. For example, you may want the search algorithm to include the sine function (not included by default) as a possible hidden and/or output activation function. This might prove useful when your data are radially distributed. Alternatively, sometimes you might know (from previous experience) that networks with tanh hidden activations might not do so well for your particular data set. In this case you can simply exclude this choice of activation function by clearing the Tanh check box. 170 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks You can specify activation functions for hidden and output neurons in a MLP network. These options do not apply to RBF networks. Note that you can also restrict the ANS from searching for best hidden and/or output activation functions by selecting only one option among the many that are available. For example, if you set the choice of hidden activations to Logistic, the ANS will then produce networks with this type of activation function only. Generally speaking, however, you should only restrict the ANS search parameters when you have a logical reason to do so. Unless you have a priori information about your data, you should make the ANS search parameters (for any network property) as wide as possible. 5 Finally, select the Sampling tab. Figure 88: Neural Network Tutorial 1: Sampling tab The performance of a neural network is measured by how well it generalizes to unseen data (i.e., how well it predicts data that was not used during training). The issue of generalization is actually one of the major concerns when training neural networks. When the training data have been overfit (i.e., been fit so completely that even the random noise within the particular data set is reproduced), it is difficult for the network to make accurate predictions using new data. One way to combat this problem is to split the data into two (or three) subsets: a training sample, a testing sample and a validation sample. These samples can then be used to (1) train the network, (2) cross verify (or test) the performance of the training algorithms as they run, and (3) perform an final validation test to determine how well the network predicts “new” data. Teradata Warehouse Miner User Guide - Volume 3 171 Chapter 1: Analytic Algorithms Neural Networks • Enable Teradata Sampling. The most efficient method of sampling a large dataset is the use of Teradata’s internal sampling function. Checking this box and entering the sampling fraction will automatically select the desired proportion of the target dataset to pass on to the Neural Network sampling option below. • Neural Networks Sampling. Percentages for each of the training, test, and validation sets are specifiable in this window. The seed for sampling may also be changed from its default value of 1000. Automatic Network Building TWM provides a neural network search/building strategy which automatically generates your models, Automatic Network Search (ANS). ANS creates neural networks with various settings and configurations with minimal user effort. ANS first creates a number of networks which solve the problem and then chooses the best networks representing the relationship between the input and target variables. 
1 Click the execute button. This will trigger the Neural networks training. Training progress will be shown at the bottom of the screen. Once the training is completed, the TWM Neural Networks - Results button will become visible. 2 Click the Results button to show the Reports and Graph Screen. Figure 89: Neural Networks Tutorial 1: Results tab - Reports button 172 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks • Reviewing the results. The neural network summary enables you to quickly compare the training and testing performance for each of the selected networks and provides additional summary information about each model including the algorithm used in training, the error function and activation functions used for the hidden and output layers. When a Validation subset is specified (on the Input Sampling tab), performance for that subset is also displayed in the Neural Network Summary. Figure 90: Neural Networks Tutorial 1: Reports - Neural Network Summary • Correlation Coefficients. Click this button to view the correlation coefficients of the networks. Teradata Warehouse Miner User Guide - Volume 3 173 Chapter 1: Analytic Algorithms Neural Networks Figure 91: Neural Networks Tutorial 1: Reports - Correlation Coefficients • 174 Data statistics. Click this button to create a spreadsheet containing some statistics (minimum, maximum, mean and standard deviation) of the input and target variables for the training, testing, and validation samples. Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks Figure 92: Neural Networks Tutorial 1: Reports - Data Statistics • Weights. Click the Weights and Thresholds button to display a spreadsheet of weights and thresholds for each model in the Active neural networks grid. Teradata Warehouse Miner User Guide - Volume 3 175 Chapter 1: Analytic Algorithms Neural Networks Figure 93: Neural Networks Tutorial 1: Reports - Weights • Sensitivity Analysis. Sensitivity Analysis gives you some information about the relative importance of the variables used in a neural network. In sensitivity analysis, TWM Neural Networks tests how the neural network outputs would change should a particular input variable were to change. There are two types of sensitivity analysis in TWM Neural Networks, namely local and global sensitivities. The local sensitivity measures how sensitive the output of a neural network is to a particular value of an input variable. The larger the change the more influential that input is. By contrast, the global sensitivity measures the average (global) importance of the network outputs with respect to the individual inputs. Sensitivity analysis actually measures only the importance of variables in the context of a particular neural model. Variables usually exhibit various forms of interdependency and redundancy. If several variables are correlated, then the training algorithm may arbitrarily choose some combination of them and the sensitivities may reflect this, giving inconsistent results between different networks. It is usually best to run sensitivity analysis on a number of networks, and to draw conclusions only from consistent results. Nonetheless, sensitivity analysis is useful in helping you to understand how important variables are. 176 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks Figure 94: Neural Networks Tutorial 1: Reports - Sensitivity Analysis • Predictions. See “Neural Networks Scoring” on page 249. 
• Graphs. Next, click on the Graph tab. The options on this tab enable you to create histograms, 2D scatter plots, and 3D surface plots using targets, predictions, residuals and inputs. Teradata Warehouse Miner User Guide - Volume 3 177 Chapter 1: Analytic Algorithms Neural Networks Figure 95: Neural Networks Tutorial 1: Results tab - Graph button For example, you can review the distribution of the target variable cc_rev. Select Target in the X-axis list box and click the Histograms of X button. 178 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks Figure 96: Neural Networks Tutorial 1: Graph - Histogram For a scatter plot of target vs. output, select target in the X-axis list and output in the Yaxis list, and click the X and Y button. Teradata Warehouse Miner User Guide - Volume 3 179 Chapter 1: Analytic Algorithms Neural Networks Figure 97: Neural Networks Tutorial 1: Graph - Target Output In the Select Networks to Graph window, when multiple networks are selected by checkmarks in the model name column, the scatter plots of all the selected networks are overlaid. This enables you to compare the values for all networks. Similarly, three dimensional graphs may be generated of variables relationships by selecting variables for X, Y, and Z axes and clicking the “X, Y and Z” button. 180 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks Figure 98: Neural Networks Tutorial 1: Graph - X, Y and Z Unique models, best models, good models. If you have not worked with neural networks for building predictive models, it is important to remember that these are “general learning algorithms,” not statistical estimation techniques. That means that the models that are generated may not necessarily be the best models that could be found, nor is there a single best model. In practice, you will often find several models that appear of nearly identical quality. Each model can be regarded, in this case, as a unique solution. Note that even models with the same number of hidden units, hidden and output activation function, etc., may actually have different predictions and hence different performance. This is due to the nature of neural networks as highly nonlinear models capable of producing multiple solutions for the same problem. Tutorial 2: Performing Classification with Fictitious Banking Data For this example, we will use twm_customer_analysis, a fictitious banking dataset. The following is an example of using neural networks analysis classification. Here, ccacct (has credit card, 0 or 1) is predicted in terms of 16 independent variables, from income to avg_sv_ tran_cnt. Starting the Analysis After connecting to the appropriate Teradata database, from the Input/Data Selection menu: 1 Select “Table” as Input Source, teraminer as the database, twm_customer_analysis as the input table and “Classification” as Neural Network Style. As Categorical Dependent Columns, select ccact, as Continuous Independent Columns select variables income, age, Teradata Warehouse Miner User Guide - Volume 3 181 Chapter 1: Analytic Algorithms Neural Networks years_with_bank, nbr_children, avg_ck_bal, avg_sv_bal, avg_ck_tran_amt, avg_ck_ tran_cnt, avg_sv_tran_amt, and avg_sv_tran_cnt, and as Categorical Independent Columns select variables female, single, married, separated, ckacct, and svacct. 
Figure 99: Neural Networks Tutorial 2: Data Selection tab 2 182 Next, select the network types tab, which also includes network complexity, or number of hidden units, and click the button, “Load Defaults for Editing”. Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks Figure 100: Neural Networks Tutorial 2: Network Types tab The ANS can be configured to train both multilayer perceptron (MLP) networks and radial basis functions (RBF) networks. The multilayer perceptron is the most common form of network. It requires iterative training, which may be relatively slow, but the networks are quite compact, execute quickly once trained, and in most problems yield better results than the other types of networks. Radial basis function networks tend to be larger and hence slower than multilayer perceptrons, and often have a relatively inferior performance, but they train extremely quickly provided they use SOS error and linear output activation functions. They are also usually less effective than multilayer perceptrons if you have a large number of input variables (they are more sensitive to the inclusion of unnecessary inputs). Note that RBF networks are not appropriate for models that contain categorical inputs, so the RBF option is not available when categorical inputs are included in the model, as in this case. • Network complexity (number of hidden units). One particular issue to which you need to pay attention is the number of hidden units (network complexity). For example, if you run ANS several times without producing any good networks, you may want to consider increasing the range of network complexity tried by ANS. Alternatively, if you believe that a certain number of neurons is optimal for your problem, you may then exclude the complexity factor from the ANS algorithm by simply setting the Min. hidden units equal to the Max. hidden units. This way you will help the ANS to concentrate on other network parameters in its search for the best network architecture and specifications, which unlike the number of hidden units, you do not know a priori. Note that network complexity is set separately for each network type. Teradata Warehouse Miner User Guide - Volume 3 183 Chapter 1: Analytic Algorithms Neural Networks 3 Next, select the network parameters tab, which also includes error function and weight decay. Figure 101: Neural Networks Tutorial 2: Network Parameters tab • Networks to train, networks to retain. The number of networks that are to be trained and retained can be modified. You can specify any number of networks you may want to generate (the only limits are the resources on your machine) and choose to retain any number of them when training is over. If you want to retain all the models you train, set the value in Networks to train equal to the Networks to retain. However, often it is better to set the number of networks to retain to a smaller value than the number of networks to train. This will result in TWM Neural Networks retaining a subset of those networks that perform best on the data set. The ANS is a search algorithm that helps you create and test neural networks for your data analysis and prediction problems. It designs a number of networks to solve the problem, copies these into the current network set, and then selects those networks that perform best. 
For that reason, it is recommended that you set the value in Networks to train to as high as possible, even though it may take some time for TWM Neural Networks to complete the computation for data sets with many variables and data cases and/or networks with a large number of hidden units. This configures TWM Neural Networks to thoroughly search the space of network architectures and configurations and select the best for modeling the training data. 184 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks 4 • Error function. Specifies the error function to be used in training a network. Because the analysis is classification, the default is either SOS or Cross Entropy. In such cases, both error functions will be tried, and the best chosen by ANS. • Weight Decay. Use the default selection for Weight decay, which specifies use of weight decay regularization. • Use weight decay (hidden layer). Use the default selections for output layer weight decay, including minimum and maximum values. • Use weight decay (output layer). Use the default selections for hidden layer weight decay, including minimum and maximum values. Next, select the MLP Activation Functions tab. Figure 102: Neural Networks Tutorial 2: MLP Activation Functions tab • MLP activation functions. This is a list of activation functions available for hidden and output layers of MLP networks. Although most of the default configurations for the ANS are calculated from properties of the data, it is sometimes necessary to change these configurations to something other than the default. For example, you may want the search algorithm to include the sine function (not included by default) as a possible hidden and/or output activation function. This might prove useful when your data are radially distributed. Alternatively, sometimes you might know (from previous experience) that networks with tanh hidden activations might not do so well for your particular data set. In this Teradata Warehouse Miner User Guide - Volume 3 185 Chapter 1: Analytic Algorithms Neural Networks case you can simply exclude this choice of activation function by clearing the Tanh check box. You can specify activation functions for hidden and output neurons in a MLP network. These options do not apply to RBF networks. Note that you can also restrict the ANS from searching for best hidden and/or output activation functions by selecting only one option among the many that are available. For example, if you set the choice of hidden activations to Logistic, the ANS will then produce networks with this type of activation function only. Generally speaking, however, you should only restrict the ANS search parameters when you have a logical reason to do so. Unless you have a priori information about your data, you should make the ANS search parameters (for any network property) as wide as possible. 5 Finally, select the Sampling tab. Figure 103: Neural Networks Tutorial 2: Sampling tab The performance of a neural network is measured by how well it generalizes to unseen data (i.e., how well it predicts data that was not used during training). The issue of generalization is actually one of the major concerns when training neural networks. When the training data have been overfit (i.e., been fit so completely that even the random noise within the particular data set is reproduced), it is difficult for the network to make accurate predictions using new data. 
One way to combat this problem is to split the data into two (or three) subsets: a training sample, a testing sample and a validation sample. These samples can then be used to (1) train the network, (2) cross verify (or test) the performance of the training algorithms as they run, and (3) perform a final validation test to determine how well the network predicts “new” data.

• Enable Teradata Sampling. The most efficient method of sampling a large dataset is the use of Teradata’s internal sampling function. Checking this box and entering the sampling fraction will automatically select the desired proportion of the target dataset to pass on to the Neural Network sampling option below.

• Neural Networks Sampling. Percentages for each of the training, test, and validation sets are specifiable in this window. The seed for sampling may also be changed from its default value of 1000.

Automatic Network Building

TWM provides a neural network search/building strategy which automatically generates your models, Automatic Network Search (ANS). ANS creates neural networks with various settings and configurations with minimal user effort. ANS first creates a number of networks which solve the problem and then chooses the best networks representing the relationship between the input and target variables.

1 Click the execute button. This will trigger the neural network training. Training progress will be shown at the bottom of the screen. Once the training is completed, the TWM Neural Networks - Results button will become visible.

2 Click the results button to show the Reports and Graph screen.

Figure 104: Neural Networks Tutorial 2: Results tab - Reports button

• Reviewing the results. The neural network summary enables you to quickly compare the training and testing performance for each of the selected networks and provides additional summary information about each model, including the algorithm used in training, the error function, and the activation functions used for the hidden and output layers. When a Validation subset is specified (on the Input Sampling tab), performance for that subset is also displayed in the Neural Network Summary.

Figure 105: Neural Networks Tutorial 2: Results - Neural Network Summary

• Data statistics. Click this button to create a spreadsheet containing some statistics (minimum, maximum, mean and standard deviation) of the input and target variables for the training, testing, and validation samples.

Figure 106: Neural Networks Tutorial 2: Reports - Data Statistics

• Weights. Click the Weights and Thresholds button to display a spreadsheet of weights and thresholds for each model in the Active neural networks grid.

Figure 107: Neural Networks Tutorial 2: Reports - Weights

• Sensitivity Analysis. Sensitivity Analysis gives you some information about the relative importance of the variables used in a neural network. In sensitivity analysis, TWM Neural Networks tests how the neural network would cope if each of its input variables were unavailable.
TWM Neural Networks has facilities to automatically compensate for missing values (for classification analysis, casewise deletion of missing data is used). In sensitivity analysis, the data set is submitted to the network repeatedly, with each variable in turn treated as missing, and the resulting network error is recorded. If an important variable is removed in this fashion, the error will increase a great deal; if an unimportant variable is removed, the error will not increase very much. The spreadsheet shows, for each selected model, the ratio of the network error with a given input omitted to the network error with the input available. It also shows the rank order of these ratios for each input, which puts the input variables into order of importance. If the ratio is 1 or less, the network actually performs better if the variable is omitted entirely - a sure sign that it should be pruned from the network. We tend to interpret the sensitivities as indicating the relative importance of variables. However, they actually measure only the importance of variables in the context of a particular neural model. Variables usually exhibit various forms of interdependency and redundancy. If several variables are correlated, then the training algorithm may arbitrarily choose some combination of them and the sensitivities may reflect this, giving inconsistent results between different networks. It is usually best to run sensitivity analysis on a number of networks, and to draw conclusions only from Teradata Warehouse Miner User Guide - Volume 3 191 Chapter 1: Analytic Algorithms Neural Networks consistent results. Nonetheless, sensitivity analysis is extremely useful in helping you to understand how important variables are. Figure 108: Neural Networks Tutorial 2: Reports - Sensitivity Analysis • 192 Confusion matrix. Click the Confusion matrix button to generate a confusion matrix and classification summary for the categorical target. A confusion matrix gives a detailed breakdown of misclassifications. The observed class is displayed at the top of the matrix, and the predicted class down the side; each cell contains a number showing how many cases that were actually of the given observed class were assigned by the model to the given predicted class. In a perfectly performing model, all the cases are counted in the leading diagonal. A classification summary gives the total number of observations in each class of the target, the number of correct and incorrect predictions for each class, and the percentage of correct and incorrect predictions for each class. This information is provided for each active network. Note this option is only available for classification. Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks Figure 109: Neural Networks Tutorial 2: Reports - Confusion Matrix Teradata Warehouse Miner User Guide - Volume 3 193 Chapter 1: Analytic Algorithms Neural Networks Classification Summary Figure 110: Neural Networks Tutorial 2: Reports - Classification Summary • Confidence levels. Click the Confidence button to display a spreadsheet of confidence levels for each case. Confidence levels will be displayed for each model. Note that this option is only available for classification problems. 194 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks Figure 111: Neural Networks Tutorial 2: Reports - Confidence Levels • Predictions. See “Neural Networks Scoring” on page 249. • Graphs. Next, click on the Graph tab. 
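The same breakdown that the Confusion matrix button reports can also be tabulated directly from any scored table with a simple aggregation. The statement below is only an illustrative sketch and not SQL produced by TWM; the database, table and column names (mydb.nn_scored, ccacct for the observed class and ccacct_pred for the predicted class) are hypothetical:

    -- Count cases for every observed/predicted class combination;
    -- the diagonal cells (observed = predicted) are the correct classifications.
    SELECT ccacct AS observed_class,
           ccacct_pred AS predicted_class,
           COUNT(*) AS case_count
    FROM mydb.nn_scored
    GROUP BY 1, 2
    ORDER BY 1, 2;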
The options on this tab enable you to create histograms, 2D scatter plots, and 3D surface plots using targets, predictions, residuals and inputs. Teradata Warehouse Miner User Guide - Volume 3 195 Chapter 1: Analytic Algorithms Neural Networks Figure 112: Neural Networks Tutorial 2: Results tab - Graph button For example, you can review the distribution of the target variable ccact. Select Target in the X-axis list box and click the Histograms of X button. • 196 To review histograms of model accuracy (number correct, number incorrect), select Accuracy in the X-axis list box, and click the Histogram of X button. Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks Figure 113: Neural Networks Tutorial 2: Graph - Histogram • For a scatter plot of income vs. age, select income in the X-axis list and age in the Yaxis list, and click the X and Y button. Teradata Warehouse Miner User Guide - Volume 3 197 Chapter 1: Analytic Algorithms Neural Networks Figure 114: Neural Networks Tutorial 2: Graph - Income vs. Age • In the Select Networks to Graph window, when multiple networks are selected by checkmarks in the model name column, the scatter plots of all the selected networks are overlaid. This enables you to compare the values for all networks. • Similarly, three dimensional graphs may be generated of variables relationships by selecting variables for X, Y, and Z axes and clicking the “X, Y and Z” button. • Lift Charts. Lift Graphs may be generated of three different types, and may be either cumulative or non-cumulative. Choose the variable’s value or “All” and click the Lift Graphs button to see that variable value’s lift, or that of all variable values. For example: 198 Teradata Warehouse Miner User Guide - Volume 3 Chapter 1: Analytic Algorithms Neural Networks Figure 115: Neural Networks Tutorial 2: Graph - Lift Charts • Unique models, best models, good models. If you have not worked with neural networks for building predictive models, it is important to remember that these are “general learning algorithms,” not statistical estimation techniques. That means that the models that are generated may not necessarily be the best models that could be found, nor is there a single best model. In practice, you will often find several models that appear of nearly identical quality. Each model can be regarded, in this case, as a unique solution. Note that even models with the same number of hidden units, hidden and output activation function, etc., may actually have different predictions and hence different performance. This is due to the nature of neural networks as highly nonlinear models capable of producing multiple solutions for the same problem. Teradata Warehouse Miner User Guide - Volume 3 199 Chapter 1: Analytic Algorithms Neural Networks 200 Teradata Warehouse Miner User Guide - Volume 3 Chapter 2: Scoring Overview CHAPTER 2 Scoring What’s In This Chapter For more information, see these subtopics: 1 “Overview” on page 201 2 “Cluster Scoring” on page 201 3 “Tree Scoring” on page 209 4 “Factor Scoring” on page 219 5 “Linear Scoring” on page 228 6 “Logistic Scoring” on page 236 7 “Neural Networks Scoring” on page 249 Overview Model scoring in Teradata Warehouse Miner is performed entirely through generated SQL, executed in the database (although PMML based scoring generally requires that certain supplied User Defined Functions be installed beforehand). 
A scoring analysis is provided for every Teradata Warehouse Miner algorithm that produces a predictive model (thus excluding the Association Rules algorithm). Scoring applies a predictive model to a data set that has the same columns as those used in building the model, with the exception that the scoring input table need not always include the predicted or dependent variable column for those models that utilize one. In fact, the dependent variable column is required only when model evaluation is requested in the Tree Scoring, Linear Scoring and Logistic Scoring analyses. Cluster Scoring Scoring a table is the assignment of each row to a cluster. In the Gaussian Mixture model, the “maximum probability rule” is used to assign the row to the cluster for which its conditional probability is the largest. The model also assigns relative probabilities of each cluster to the row, so the soft assignment of a row to more than one cluster can be obtained. When scoring is requested, the selected table is scored against centroids/variances from the selected Clustering analysis. After a single iteration, each row is assigned to one of the previously defined clusters, together with the probability of membership. The row to cluster assignment is based on the largest probability. Teradata Warehouse Miner User Guide - Volume 1 201 Chapter 2: Scoring Cluster Scoring The Cluster Scoring analysis scores an input table that contains the same columns that were used to perform the selected Clustering Analysis. The implicit assumption in doing this is that the underlying population distributions are the same. When scoring is requested, the specified table is scored against the centroids and variances obtained in the selected Clustering analysis. Only a single iteration is required before the new scored table is produced. After clusters have been identified by their centroids and variances, the scoring engine identifies to which cluster each row belongs. The Gaussian Mixture model permits multiple cluster memberships, with scoring showing the probability of membership to each cluster. In addition, the highest probability is used to assign the row absolutely to a cluster. The resulting score table consists of the index (key) columns, followed by probabilities for each cluster membership, followed by the assigned cluster number (the cluster with the highest probability of membership). Initiate Cluster Scoring After generating a Cluster analysis (as described in “Cluster Analysis” on page 19), use the following procedure to initiate Cluster Scoring: 1 Click on the Add New Analysis icon in the toolbar: Figure 116: Add New Analysis from toolbar 2 202 In the resulting Add New Analysis dialog box, click on Scoring under Categories and then under Analyses double-click on Cluster Scoring: Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Cluster Scoring Figure 117: Add New Analysis > Scoring > Cluster Scoring 3 This will bring up the Cluster Scoring dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Cluster Scoring - INPUT - Data Selection On the Factor Scoring dialog click on INPUT and then click on data selection: Figure 118: Add New Analysis > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. 
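To make the “maximum probability rule” concrete, the assigned cluster is simply the one whose probability column holds the largest value for that row. The statement below is only a simplified sketch of the idea, not the SQL that the Cluster Scoring analysis actually generates; it assumes a hypothetical three-cluster score table mydb.cluster_scored with probability columns p1, p2 and p3 and key column cust_id:

    -- Hard assignment: pick the cluster with the largest membership probability.
    SELECT cust_id, p1, p2, p3,
           CASE
             WHEN p1 >= p2 AND p1 >= p3 THEN 1
             WHEN p2 >= p1 AND p2 >= p3 THEN 2
             ELSE 3
           END AS clusterno
    FROM mydb.cluster_scored;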
By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). Teradata Warehouse Miner User Guide - Volume 1 203 Chapter 2: Scoring Cluster Scoring 2 3 Select Columns From a Single Table • Available Databases — All available source databases that have been added through Connection Properties. • Available Tables — The tables available for scoring are listed in this window, though all may not strictly qualify; the input table to be scored must contain the same column names used in the original analysis. • Available Columns — The columns available for scoring are listed in this window. • Selected Columns • Index Columns — Note that the Selected Columns window is actually a split window for specifying Index and/or Retain columns if desired. If a table is specified as input, the primary index of the table is defaulted here, but can be changed. If a view is specified as input, an index must be provided. • Retain Columns — Other columns within the table being scored can be appended to the scored table, by specifying them here. Columns specified in Index Columns may not be specified here. Select Model Analysis — Select from the list an existing Cluster analysis on which to run the scoring. The Cluster analysis must exist in the same project as the Cluster Scoring analysis. Cluster Scoring - INPUT - Analysis Parameters On the Cluster Scoring dialog click on INPUT and then click on analysis parameters: Figure 119: Add New Analysis > Input > Analysis Parameters On this screen select: • Score Options • Include Cluster Membership — The name of the column in the output score table representing the cluster number to which an observation or row belongs can be set by the user. This column may be excluded by un-checking the selection box, but if this is done the cluster probability scores must be included. • • 204 Column Name — Name of the column that will be populated with the cluster numbers. Note that this can not have the same name as any of the columns in the table being scored. Include Cluster Probability Scores — The prefix of the name of the columns in the output score table representing the probabilities that an observation or row belongs to each cluster can be set by the user. A column is created for each possible cluster, adding the cluster number to this prefix (for example, p1, p2, p3). These columns may Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Cluster Scoring be excluded by un-checking the selection box, but if this is done the cluster membership number must be included. • Column Prefix — A prefix for each column generated (one per cluster) that will be populated with the probability scores. Note that the prefix used will have sequential numbers, beginning with 1 and incrementing for each cluster, appended to it. 
If the resultant column conflicts with a column in the table to be scored, an error will occur. Cluster Scoring - OUTPUT On the Cluster Scoring dialog click on OUTPUT: Figure 120: Cluster Scoring > Output On this screen select: • Output Table • Database name — The name of the database. • Table name — The name of the scored output table to be created-required only if a scoring option is selected. • Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected • Create output table using the MULTISET keyword — This option is not enabled for scored output tables; the MULTISET keyword is not used. • Advertise Output — The Advertise Output option “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate the SQL for this analysis, but do not execute it — If this option is selected the analysis will only generate SQL, returning it and terminating immediately. Note: To create a stored procedure to score this model, use the Refresh analysis. Run the Cluster Scoring Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: Teradata Warehouse Miner User Guide - Volume 1 205 Chapter 2: Scoring Cluster Scoring • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Cluster Scoring The results of running the Teradata Warehouse Miner Cluster Scoring Analysis include a variety of statistical reports on the scored model. All of these results are outlined below. Cluster Scoring - RESULTS - reports On the Cluster Scoring dialog click RESULTS and then click on reports (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 121: Cluster Scoring > Results > Reports • Clustering Scoring Report • Iteration — When scoring, the algorithm performs only one iteration, so this value is always 1. • Log Likelihood — This is the log likelihood value calculated using the scored data, giving a measure of the effectiveness of the model applied to this data. • Diff — Since only one iteration of the algorithm is performed when scoring, this is always 0. • Timestamp — This is the day, date, hour, minute and second marking the end of scoring processing. Cluster Scoring - RESULTS - data On the Cluster Scoring dialog click RESULTS and then click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 122: Cluster Scoring > Results > Data Results data, if any, is displayed in a data grid as described in the “RESULTS Tab” on page 80 of the Teradata Warehouse Miner User Guide (Volume 1). 206 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Cluster Scoring If a table was created, a sample of rows is displayed here - the size determined by the setting specified by Maximum result rows to display in Tools-Preferences-Limits. 
The following table is built in the requested Output Database by the Cluster Scoring analysis. Note that the options selected affect the structure of the table. Those columns in bold below will comprise the Unique Primary Index (UPI). Also note that there may be repeated groups of columns, and that some columns will be generated only if specific options are selected. Table 51: Output Database (Built by the Cluster Scoring analysis) Name Type Definition Key User Defined One or more unique-key columns, which default to the index, defined in the table to be scored (i.e. in Selected Tables). The data type defaults to the same as the scored table, but can be changed via Primary Index Columns and Types. Px (Default) FLOAT The probabilities that an observation or row belongs to each cluster if the Include Cluster Probability Scores option is selected. A column is created for each possible cluster, adding the cluster number to the prefix entered in the Column Prefix option. This prefix will be used for each column generated (one per cluster) that will be populated with the probability scores. Note that the prefix used will have sequential numbers, beginning with 1 and incrementing for each cluster, appended to it. (By default, the Column Prefix is p, so p1, p2, p3, etc. will be generated). These columns may be excluded by not selecting the Include Cluster Probability Scores option, but if this is done the cluster membership number must be included. Clusterno (Default) INTEGER The column in the output score table representing the cluster number to which an observation or row belongs can be set by the user. This column may be excluded by not selecting the Include Cluster Membership option, but if this is done the cluster probability scores must be included (see above). The name of the column defaults to “clusterno”, but this can be overwritten by entering another value in Column Name under the Include Cluster Membership option. Note that this can not have the same name as any of the index columns in the table being scored. The name entered can not exist as a column in the table being scored. Cluster Scoring - RESULTS - SQL On the Cluster Scoring dialog click RESULTS and then click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 123: Cluster Scoring > Results > SQL Teradata Warehouse Miner User Guide - Volume 1 207 Chapter 2: Scoring Cluster Scoring The SQL generated for scoring is returned here, but only if the Output - Storage option to Generate the SQL for this analysis, but do not execute it was selected. When SQL is displayed here, it may be selected and copied as desired. (Both right-click menu options and buttons to Select All and Copy are available). Tutorial - Cluster Scoring In this example, the same table is scored as was used to build the cluster analysis model. Parameterize a Cluster Score Analysis as follows: • Selected Table — twm_customer_analysis • Include Cluster Membership — Enabled • Column Name — Clusterno • Include Cluster Probability Scores — Enabled • Column Prefix — p • Result Table Name — twm_score_cluster_1 • Index Columns — cust_id Run the analysis, and click on Results when it completes. For this example, the Cluster Scoring Analysis generated the following pages. A single click on each page name populates Results with the item. 
Table 52: Clustering Progress Iteration Log Likelihood Diff Timestamp 1 -24.3 0 Tue Jun 12 15:41:58 2001 Table 53: Data cust_id p1 p2 p3 clusterno 1362509 .457 .266 .276 1 1362573 1.12E-22 1 0 2 1362589 6E-03 5.378E-03 .989 3 1362693 8.724E-03 8.926E-03 .982 3 1362716 3.184E-03 3.294E-03 .994 3 1362822 .565 .132 .303 1 1363017 7.267E-02 .927 1.031E-18 2 1363078 3.598E-03 3.687E-03 .993 3 1363438 2.366E-03 2.607E-03 .995 3 1363465 .115 5.923E-02 .826 3 208 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Tree Scoring Table 53: Data cust_id p1 p2 p3 clusterno … … … … … … … … … … … … … … … Tree Scoring After building a model a means of deploying it is required to allow scoring of new data sets. The way in which Teradata Warehouse Miner deploys a decision tree model is via SQL. A series of SQL statements is generated from the metadata model that describes the decision tree. The SQL uses CASE statements to classify the predicted value. Here is an example of a statement: SELECT CASE WHEN(subset1 expression) THEN ‘Buy’ WHEN(subset2 expression) THEN ‘Do not Buy’ END FROM tablename; Note that Tree Scoring applies a Decision Tree model to a data set that has the same columns as those used in building the model (with the exception that the scoring input table need not include the predicted or dependent variable column unless model evaluation is requested). A number of scoring options including model evaluation and profiling rulesets are provided on the analysis parameters panel of the Tree Scoring analysis. Initiate Tree Scoring After generating a Decision Tree analysis (as described in “Decision Trees” on page 36) use the following procedure to initiate Tree Scoring: 1 Click on the Add New Analysis icon in the toolbar: Figure 124: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Scoring under Categories and then under Analyses double-click on Tree Scoring: Teradata Warehouse Miner User Guide - Volume 1 209 Chapter 2: Scoring Tree Scoring Figure 125: Add New Analysis > Scoring > Tree Scoring 3 This will bring up the Tree Scoring dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Tree Scoring - INPUT - Data Selection On the Tree Scoring dialog click on INPUT and then click on data selection: Figure 126: Tree Scoring > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 
210 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Tree Scoring 2 Select Columns From a Single Table • Available Databases — All available source databases that have been added through Connection Properties. • Available Tables — The tables available for scoring are listed in this window, though all may not strictly qualify; the input table to be scored must contain the same column names used in the original analysis. • Available Columns — The columns available for scoring are listed in this window. • Selected Columns • Index Columns — Note that the Selected Columns window is actually a split window for specifying Index and/or Retain columns if desired. If a table is specified as input, the primary index of the table is defaulted here, but can be changed. If a view is specified as input, an index must be provided. • Retain Columns — Other columns within the table being scored can be appended to the scored table, by specifying them here. Columns specified in Index Columns may not be specified here. 3 Select Model Analysis 4 Select from the list an existing Decision Tree analysis on which to run the scoring. The Decision Tree analysis must exist in the same project as the Decision Tree Scoring analysis. Tree Scoring - INPUT - Analysis Parameters On the Tree Scoring dialog click on INPUT and then click on analysis parameters: Figure 127: Tree Scoring > Input > Analysis Parameters On this screen select: • Scoring Method • Score — Option to create a score table only. • Evaluate — Option to perform model evaluation only. Not available for Decision Tree models built using the Regression Trees option. • Evaluate and Score — Option to create a score table and perform model evaluation. Not available for Decision Tree models built using the Regression Trees option. • Scoring Options • Use Dependent variable for predicted value column name — Option to use the exact same column name as the dependent variable when the model is scored. This is the default option. Teradata Warehouse Miner User Guide - Volume 1 211 Chapter 2: Scoring Tree Scoring • Predicted Value Column Name — If above option is not checked, then enter here the name of the column in the score table which contains the estimated value of the dependent variable. • Include Confidence Factor — If this option is checked then the confidence factor will be added to the output table. The Confidence Factor is a measure of how “confident” the model is that it can predict the correct score for a record that falls into a particular leaf node based on the training data the model was built from. Example: If a leaf node contained 10 observations and 9 of them predict Buy and the other record predicts Do Not Buy, then the model built will have a confidence factor of .9, or be 90% sure of predicting the right value for a record that falls into that leaf node of the model. If the Include validation table option was selected when the decision tree model was built, additional information is provided in the scored table and/or results depending on the scoring option selected. If Score Only is selected, a recalculated confidence factor based on the original validation table is included in the scored output table. If Evaluate Only is selected, a confusion matrix based on the selected table to score is added to the results. 
If Evaluate and Score is selected, then a confusion matrix based on the selected table to score is added to the results and a recalculated confidence factor based on the selected table to score is included in the scored output table. • Targeted Confidence (Binary Outcome Only) — Models built with a predicted variable that has only 2 outcomes can add a targeted confidence value to the output table. The outcomes of the above example were 9 Buys and 1 Do Not Buy at that particular node and if the target value was set to Buy, .9 is the targeted confidence. However if it is desired to target the Do Not Buy outcome by setting the value to Do Not Buy, then any record falling into this leaf of the tree would get a targeted confidence of .1 or 10%. If the Include validation table option was selected when the decision tree model was built, additional information is provided in a manner similar to that for the Include Confidence Factor option described above. • Targeted Value — The value for the binary targeted confidence. Note that Include Confidence Factor and Targeted Confidence are mutually exclusive options, so that only one of the two may be selected. • Create Profiling Tables — If this option is selected, additional tables are created to profile the leaf nodes in the tree and to link scored rows to the leaf nodes that they correspond to. To do this, a node ID field is added to the scored output table and two additional tables are built to describe the leaf nodes. One table contains confidence factor or targeted confidence (if requested) and prediction information (named by appending “_1” to the scored output table name), and the other contains the rules corresponding to each leaf node (named by appending “_2” to the scored output table name). Note however that selection of the option to Create Profiling Tables is ignored if the Evaluate scoring method or the output option to Generate the SQL for this analysis but do not execute it is selected. It is also ignored if the analysis is being refreshed by a Refresh analysis that requests the creation of a stored procedure. 212 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Tree Scoring Tree Scoring - OUTPUT On the Tree Scoring dialog click on OUTPUT: Figure 128: Tree Scoring > Output On this screen select: • Output Table • Database name — The name of the database. • Table name — The name of the scored output table to be created. • Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected • Create output table using the MULTISET keyword — This option is not enabled for scored output tables; the MULTISET keyword is not used. • Advertise Output — The Advertise Output option “advertises” output (including Profiling Tables if requested) by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. 
• Generate the SQL for this analysis, but do not execute it — If this option is selected the analysis will only generate SQL, returning it and terminating immediately. Note: To create a stored procedure to score this model, use the Refresh analysis. Run the Tree Scoring Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Tree Scoring The results of running the Teradata Warehouse Miner Decision Tree Scoring Analysis include a variety of statistical reports on the scored model. All of these results are outlined below. Teradata Warehouse Miner User Guide - Volume 1 213 Chapter 2: Scoring Tree Scoring Tree Scoring - RESULTS - Reports On the Tree Scoring dialog click RESULTS and then click on reports (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 129: Tree Scoring > Results > Reports • Decision Tree Score Report • Resulting Scored Table Name — This is the name given the table with the scored values of the decision tree model. • Number of Rows in Scored Table — This is the number of rows in the scored decision tree table. • Confusion Matrix — A N x (N+2) (for N outcomes of the dependent variable) confusion matrix is given with the following format: Table 54: Confusion Matrix Actual ‘0’ Actual ‘1’ … Actual ‘N’ Correct Incorrect Predicted ‘0’ # correct ‘0’ Predictions # incorrect‘1’ Predictions … # incorrect ‘N’ Predictions Total Correct ‘0’ Predictions Total Incorrect ‘0’ Predictions Predicted ‘1’ # incorrect‘0’ Predictions # correct ‘1’ Predictions … # incorrect ‘N’ Predictions Total Correct ‘1’ Predictions Total Incorrect ‘1’ Predictions … … … … … … … Predicted ‘N’ # incorrect‘0’ Predictions # incorrect ‘1’ Predictions … # correct ‘N’ Predictions Total Correct ‘N’ Predictions Total Incorrect ‘N’ Predictions • Cumulative Lift Table — The Cumulative Lift Table demonstrates how effective the model is in estimating the dependent variable. It is produced using deciles based on the probability values. Note that the deciles are labeled such that 1 is the highest decile and 10 is the lowest, based on the probability values calculated by logistic regression. The information in this report however is best viewed in the Lift Chart produced as a graph. Note that this is only valid for binary dependent variables. 214 • Decile — The deciles in the report are based on the probability values predicted by the model. Note that 1 is the highest decile and 10 is the lowest. That is, decile 1 contains data on the 10% of the observations with the highest estimated probabilities that the dependent variable is 1. • Count — This column contains the count of observations in the decile. • Response — This column contains the count of observations in the decile where the actual value of the dependent variable is 1. Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Tree Scoring • Pct Response — This column contains the percentage of observations in the decile where the actual value of the dependent variable is 1. • Pct Captured Response — This column contains the percentage of responses in the decile over all the responses in any decile. 
• Lift — The lift value is the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations. For example, if 10% of the observations overall have a dependent variable with value 1, and 20% of the observations in decile 1 have a dependent variable with value 1, then the lift value within decile 1 is 2.0, meaning that the model gives a “lift” that is better than chance alone by a factor of two in predicting response values of 1 within this decile. • Cumulative Response — This is a cumulative measure of Response, from decile 1 to this decile. • Cumulative Pct Response — This is a cumulative measure of Pct Response, from decile 1 to this decile. • Cumulative Pct Captured Response — This is a cumulative measure of Pct Captured Response, from decile 1 to this decile. • Cumulative Lift — This is a cumulative measure of Lift, from decile 1 to this decile. Tree Scoring - RESULTS - Data On the Tree Scoring dialog click RESULTS and then click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 130: Tree Scoring > Results > Data Results data, if any, is displayed in a data grid as described in the “RESULTS Tab” on page 80 of the Teradata Warehouse Miner User Guide (Volume 1). If a table was created, a sample of rows is displayed here - the size determined by the setting specified by Maximum result rows to display in Tools-Preferences-Limits. The following table is built in the requested Output Database by the Decision Tree Scoring analysis. Note that the options selected affect the structure of the table. Those columns in bold below will comprise the Primary Index. Also note that there may be repeated groups of columns, and that some columns will be generated only if specific options are selected. Teradata Warehouse Miner User Guide - Volume 1 215 Chapter 2: Scoring Tree Scoring Table 55: Output Database table (Built by the Decision Tree Scoring analysis) Name Type Definition Key User Defined One or more key columns, which default to the index, defined in the table to be scored (i.e. in Selected Table). The data type defaults to the same as the scored table, but can be changed via Primary Index Columns. <app_var> User Defined One or more columns as selected under Retain Columns. <dep_var > User Defined The predicted value of the dependent variable. The name used defaults to the Dependent Variable specified when the tree was built. If Use Dependent variable for predicted value column name is not selected, then an appropriate column name must be entered and is used here. The data type used is the same as the Dependent Variable. _tm_node_id FLOAT When the Create profiling tables option is selected this column is included to link each row with a particular leaf node in the decision tree and thereby with a specific set of rules. _tm_target, or FLOAT One of two measures that are mutually exclusive. If the Include Confidence Factor option is selected, _tm_confidence will be generated and populated with Confidence Factors - a measure of how “confident” the model is that it can predict the correct score for a record that falls into a particular leaf node based on the training data the model was built from. 
(Default) _tm_confidence

If the Targeted Confidence (Binary Outcome Only) option is selected, then _tm_target will be generated and populated with Targeted Confidences for models built with a predicted value that has only 2 outcomes. The Targeted Confidence is a measure of how confident the model is that it can predict the correct score for a particular leaf node based upon a user-specified Target Value. For example, if a particular decision node had an outcome of 9 “Buys” and 1 “Do Not Buy” at that particular node, setting the Target Value to “Buy” would generate a .9 or 90% targeted confidence. However, if it is desired to set the Target Value to “Do Not Buy”, then any record falling into this leaf of the tree would get a targeted confidence of .1 or 10%.

_tm_recalc_target, or _tm_recalc_confidence  FLOAT  Recalculated versions of the confidence factor or targeted confidence factor based on the original validation table when Score Only is selected, or based on the selected table to score when Evaluate and Score is selected.

The following table is built in the requested Output Database by the Decision Tree Scoring analysis when the Create profiling tables option is selected. (It is named by appending “_1” to the scored result table name).

Table 56: Output Database table (Built by the Decision Tree Scoring analysis) - Create profiling tables option selected (“_1” appended)
Name  Type  Definition
_tm_node_id  FLOAT  This column identifies a particular leaf node in the decision tree.
_tm_target, or _tm_confidence  FLOAT  The confidence factor or targeted confidence factor for this leaf node, as described above for the scored output table.
_tm_prediction  VARCHAR(n)  The predicted value of the dependent variable at this leaf node.

The following table is built in the requested Output Database by the Decision Tree Scoring analysis when the Create profiling tables option is selected. (It is named by appending “_2” to the scored result table name).

Table 57: Output Database table (Built by the Decision Tree Scoring analysis) - Create profiling tables option selected (“_2” appended)
Name  Type  Definition
_tm_node_id  FLOAT  This column identifies a particular leaf node in the decision tree.
_tm_sequence_id  FLOAT  An integer from 1 to n to order the rules associated with a leaf node.
_tm_rule  VARCHAR(n)  A rule for inclusion in the ruleset for this leaf node in the decision tree (rules are joined with a logical AND).

Tree Scoring - RESULTS - Lift Graph

On the Tree Scoring dialog click RESULTS and then click on lift graph (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed):

Figure 131: Tree Scoring > Results > Lift Graph

This chart displays the information in the Cumulative Lift Table. This is the same graph described in “Results - Logistic Regression” on page 127 as Lift Chart, but applied to possibly new data.

Tree Scoring - RESULTS - SQL

On the Tree Scoring dialog click RESULTS and then click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed):

Figure 132: Tree Scoring > Results > SQL

The SQL generated for scoring is returned here, but only if the Score Method was set to Score on the Input - Analysis Parameters tab, and if the Output - Storage option to Generate the SQL for this analysis, but do not execute it was selected. When SQL is displayed here, it may be selected and copied as desired.
(Both right-click menu options and buttons to Select All and Copy are available).

Tutorial - Tree Scoring

In this example, the same table is scored as was used to build the decision tree model, as a matter of convenience. Typically, this would not be done unless the contents of the table had changed since the model was built. Parameterize a Decision Tree Scoring Analysis as follows:
• Selected Tables — twm_customer_analysis
• Scoring Method — Evaluate and Score
• Use the name of the dependent variable as the predicted value column name — Enabled
• Targeted Confidence(s) - For binary outcome only — Enabled
• Targeted Value — 1
• Result Table Name — twm_score_tree_1
• Primary Index Columns — cust_id

Run the analysis, and click on Results when it completes. For this example, the Decision Tree Scoring Analysis generated the following pages. A single click on each page name populates Results with the item.

Table 58: Decision Tree Model Scoring Report
Resulting Scored Table Name  score_tree_1
Number of Rows in Scored File  747

Table 59: Confusion Matrix
             Actual Non-Response  Actual Response  Correct      Incorrect
Predicted 0  340/45.52%           0/0.00%          340/45.52%   0/0.00%
Predicted 1  32/4.28%             375/50.20%       375/50.20%   32/4.28%

Table 60: Cumulative Lift Table
Decile  Count  Response  Response (%)  Captured Response (%)  Lift  Cumulative Response  Cumulative Response (%)  Cumulative Captured Response (%)  Cumulative Lift
1       5      5.00      100.00        1.33                   1.99  5.00                 100.00                   1.33                              1.99
2       0      0.00      0.00          0.00                   0.00  5.00                 100.00                   1.33                              1.99
3       0      0.00      0.00          0.00                   0.00  5.00                 100.00                   1.33                              1.99
4       0      0.00      0.00          0.00                   0.00  5.00                 100.00                   1.33                              1.99
5       0      0.00      0.00          0.00                   0.00  5.00                 100.00                   1.33                              1.99
6       402    370.00    92.04         98.67                  1.83  375.00               92.14                    100.00                            1.84
7       0      0.00      0.00          0.00                   0.00  375.00               92.14                    100.00                            1.84
8       0      0.00      0.00          0.00                   0.00  375.00               92.14                    100.00                            1.84
9       0      0.00      0.00          0.00                   0.00  375.00               92.14                    100.00                            1.84
10      340    0.00      0.00          0.00                   0.00  375.00               50.20                    100.00                            1.00

Table 61: Data
cust_id  cc_acct  _tm_target
1362480  1        0.92
1362481  0        0
1362484  1        0.92
1362485  0        0
1362486  1        0.92
…        …        …

Lift Graph

Additionally, you can click on Lift Chart to view the contents of the Lift Table graphically.

Factor Scoring

Factor analysis is designed primarily for the purpose of discovering the underlying structure or meaning in a set of variables and to facilitate their reduction to a smaller number of variables called factors or components. The first goal is facilitated by finding the factor loadings that describe the variables in a data set in terms of a linear combination of factors. The second goal is facilitated by finding a description for the factors as linear combinations of the original variables they describe. These are sometimes called factor measurements or scores. After computing the factor loadings, computing factor scores might seem like an afterthought, but it is somewhat more involved than that. Teradata Warehouse Miner does automate the process, however, based on the model information stored in metadata results tables, computing factor scores directly in the database by dynamically generating and executing SQL.
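Because each factor score is a linear combination of the centered (and, for a correlation-matrix model, standardized) input variables, the generated scoring SQL ultimately reduces to one weighted-sum expression per factor. The statement below is purely illustrative and is not TWM output; the coefficients, means, standard deviations and the choice of only two input columns are invented for the sketch:

    -- One hypothetical factor score built as a weighted sum of standardized inputs.
    SELECT cust_id,
           0.42 * ((income - 25000) / 18000)
         + 0.37 * ((age - 38) / 12) AS factor1
    FROM mydb.twm_customer_analysis;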
Teradata Warehouse Miner User Guide - Volume 1 219 Chapter 2: Scoring Factor Scoring Note that Factor Scoring computes factor scores for a data set that has the same columns as those used in performing the selected Factor Analysis. When scoring is performed, a table is created including index (key) columns, optional “retain” columns, and factor scores for each row in the input table being scored. Scoring is performed differently depending on the type of factor analysis that was performed, whether principal components (PCA), principal axis factors (PAF) or maximum likelihood factors (MLF). Further, scoring is affected by whether or not the factor analysis included a rotation. Also, input data is centered based on the mean value of each variable, and if the factor analysis was performed on a correlation matrix, input values are each divided by the standard deviation of the variable in order to normalize to unit length variance. When scoring a table using a PCA factor analysis model, the scores can be calculated directly without estimation, even if an orthogonal rotation was performed. When scoring using a PAF or MLF model, or a PCA model with an oblique rotation, a unique solution does not exist and cannot be directly solved for (a condition known as the indeterminacy of factor measurements). There are many techniques however for estimating factor measurements, and the technique used by Teradata Warehouse Miner is known as estimation by regression. This technique involves regressing each factor on the original variables in the factor analysis model using linear regression techniques. It gives an accurate solution in the “least-squared error” sense but it typically introduces some degree of dependence or correlation in the computed factor scores. A final word about the independence or orthogonality of factor scores is appropriate here. It was pointed out earlier that factor loadings are orthogonal using the techniques offered by Teradata Warehouse Miner unless an oblique rotation is performed. Factor scores however will not necessarily be orthogonal for principal axis factors and maximum likelihood factors and with oblique rotations since scores are estimated by regression. This is a subtle distinction that is an easy source of confusion. That is, the new variables or factor scores created by a factor analysis, expressed as a linear combination of the original variables, are not necessarily independent of each other, even if the factors themselves are. The user may measure their independence however by using the Matrix and Export Matrix function of the product to build a correlation matrix from the factor score table once it is built. Initiate Factor Scoring After generating a Factor Analysis (as described in “Factor Analysis” on page 58) use the following procedure to initiate Factor Scoring: 220 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Factor Scoring 1 Click on the Add New Analysis icon in the toolbar: Figure 133: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Scoring under Categories and then under Analyses double-click on Factor Scoring: Figure 134: Add New Analysis > Scoring > Factor Scoring 3 This will bring up the Factor Scoring dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. 
Factor Scoring - INPUT - Data Selection On the Factor Scoring dialog click on INPUT and then click on data selection: Teradata Warehouse Miner User Guide - Volume 1 221 Chapter 2: Scoring Factor Scoring Figure 135: Factor Scoring > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 3 Select Columns From a Single Table • Available Databases — All available source databases that have been added through Connection Properties. • Available Tables — The tables available for scoring are listed in this window, though all may not strictly qualify; the input table to be scored must contain the same column names used in the original analysis. • Available Columns — The columns available for scoring are listed in this window. • Selected Columns • Index Columns — Note that the Selected Columns window is actually a split window for specifying Index and/or Retain columns if desired. If a table is specified as input, the primary index of the table is defaulted here, but can be changed. If a view is specified as input, an index must be provided. • Retain Columns — Other columns within the table being scored can be appended to the scored table, by specifying them here. Columns specified in Index Columns may not be specified here. Select Model Analysis Select from the list an existing Factor Analysis analysis on which to run the scoring. The Factor Analysis analysis must exist in the same project as the Factor Scoring analysis. Factor Scoring - INPUT - Analysis Parameters On the Factor Scoring dialog click on INPUT and then click on analysis parameters: 222 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Factor Scoring Figure 136: Factor Scoring > Input > Analysis Parameters On this screen select: • Scoring Method • Score — Option to create a score table only. • Evaluate — Option to perform model evaluation only. • Evaluate and Score — Option to create a score table and perform model evaluation. • Factor Names — The names of the factor columns in the created table of scores are optional parameters if scoring is selected. The default names of the factor columns are factor1, factor2 ... factorn. Factor Scoring - OUTPUT On the Factor Scoring dialog click on OUTPUT: Figure 137: Factor Scoring > Output On this screen select: • Output Table • Database name — The name of the database. • Table name — The name of the scored output table to be created-required only if a scoring option is selected. 
• Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected.
• Create output table using the MULTISET keyword — This option is not enabled for scored output tables; the MULTISET keyword is not used.
• Advertise Output — The Advertise Output option “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)).
• Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
• Generate the SQL for this analysis, but do not execute it — If this option is selected the analysis will only generate SQL, returning it and terminating immediately.
Note: To create a stored procedure to score this model, use the Refresh analysis.
Run the Factor Scoring Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Factor Scoring
The results of running the Teradata Warehouse Miner Factor Analysis Scoring/Evaluation Analysis include a variety of statistical reports on the scored model. All of these results are outlined below.
Factor Scoring - RESULTS - reports
On the Factor Scoring dialog click RESULTS and then click on reports (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed):
Figure 138: Factor Scoring > Results > Reports
• Resulting Scored Table — Name of the scored table - equivalent to Result Table Name.
• Number of Rows in Scored Table — Number of rows in the Resulting Scored Table.
• Evaluation — Model evaluation for factor analysis consists of computing the standard error of estimate for each variable by working backwards and re-estimating its values using the scored factors. Estimated values of the original data are computed using the factor scoring equation Ŷ = XCᵀ, where Ŷ is the estimated raw data, X is the scored data, and C is the factor pattern matrix (or the rotated factor pattern matrix if rotation was included in the model). The standard error of estimate for each variable y in the original data Y is then given by:
√( Σ(y − ŷ)² / (n − p) )
where each ŷ is the estimated value of the corresponding y, n is the number of observations and p is the number of factors.
Factor Scoring - RESULTS - Data
On the Factor Scoring dialog click RESULTS and then click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed):
Figure 139: Factor Scoring > Results > Data
Results data, if any, is displayed in a data grid as described in “RESULTS Tab” on page 80 of the Teradata Warehouse Miner User Guide (Volume 1).
If a table was created, a sample of rows is displayed here - the size determined by the setting specified by Maximum result rows to display in Tools-Preferences-Limits. The following table is built in the requested Output Database by Factor Scoring. Note that the options selected affect the structure of the table. Those columns in bold below will comprise the Unique Primary Index (UPI). Also note that there may be repeated groups of columns, and that some columns will be generated only if specific options are selected. Table 62: Output Database table (Built by Factor Scoring) Name Type Definition Key User Defined One or more unique-key columns which default to the index defined in the table to be scored (i.e. in Selected Tables). The data type defaults to the same as the scored table, but can be changed via Primary Index Columns. <app_var> User Defined One or more columns as selected under Retain Columns. The data type defaults to the same as that within the appended table, but can be changed via Columns Types (for appended columns). Factorx FLOAT A column generated for each scored factor. The names of the factor columns in the created table of scores are optional parameters if scoring is selected. The default names of the factor columns are factor1, factor2, ... factorn, unless Factor Names are specified. (Default) Factor Scoring - RESULTS - SQL On the Factor Scoring dialog click RESULTS and then click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Teradata Warehouse Miner User Guide - Volume 1 225 Chapter 2: Scoring Factor Scoring Figure 140: Factor Scoring > Results > SQL The SQL generated for scoring is returned here, but only if the Score Method was set to Score on the Input - Analysis Parameters tab, and if the Output - Storage option to Generate the SQL for this analysis, but do not execute it was selected. When SQL is displayed here, it may be selected and copied as desired. (Both right-click menu options and buttons to Select All and Copy are available). Tutorial - Factor Scoring In this example, the same table is scored as was used to build the factor analysis model. Parameterize a Factor Analysis Scoring Analysis as follows: • Selected Table — twm_customer_analysis • Evaluate and Score — Enabled • Factor Names • Factor1 • Factor2 • Factor3 • Factor4 • Factor5 • Factor6 • Factor7 • Result Table Name — twm_score_factor_1 • Index Columns — cust_id Run the analysis, and click on Results when it completes. For this example, the Factor Analysis Scoring/Evaluation function generated the following pages. A single click on each page name populates Results with the item. Table 63: Factor Analysis Score Report 226 Resulting Scored Table <result_db >.score_factor_1 Number of Rows in Scored Table 747 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Factor Scoring Table 64: Evaluation Variable Name Standard Error of Estimate income 0.4938 age 0.5804 years_with_bank 0.5965 nbr_children 0.6180 female 0.8199 single 0.3013 married 0.3894 separated 0.4687 ccacct 0.6052 ckacct 0.5660 svacct 0.5248 avg_cc_bal 0.4751 avg_ck_bal 0.6613 avg_sv_bal 0.7166 avg_cc_tran_amt 0.8929 avg_cc_tran_cnt 0.5174 avg_ck_tran_amt 0.3563 avg_ck_tran_cnt 0.7187 avg_sv_tran_amt 0.4326 avg_sv_tran_cnt 0.6967 cc_rev 0.3342 Table 65: Data cust_id factor1 factor2 factor3 factor4 factor5 factor6 factor7 1362480 1.43 -0.28 1.15 -0.50 -0.31 -0.05 1.89 1362481 -1.03 -1.37 0.57 -0.08 -0.60 -0.39 -0.55 ... ... ... ... ... ... ... ... 
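The per-variable standard errors of estimate in Table 64 come from the evaluation arithmetic described under Factor Scoring - RESULTS - reports: the standardized input data is re-estimated from the factor scores and the (possibly rotated) factor pattern matrix, and the squared residuals are averaged over n − p. The NumPy sketch below illustrates that calculation on invented arrays; the shapes, random data and variable names are purely illustrative and this is not the SQL that Teradata Warehouse Miner generates.

    import numpy as np

    # Illustrative arrays only: in practice Y would hold the standardized input
    # columns, X the factor columns from the scored table (e.g. twm_score_factor_1),
    # and C the (possibly rotated) factor pattern matrix from the Factor Analysis.
    rng = np.random.default_rng(0)
    C = rng.normal(size=(21, 7))                          # variables x factors
    X = rng.normal(size=(747, 7))                         # observations x factors (scores)
    Y = X @ C.T + rng.normal(scale=0.5, size=(747, 21))   # "original" standardized data

    # Re-estimate the raw data from the scores, then compute the per-variable
    # standard error of estimate with n - p in the denominator.
    Y_hat = X @ C.T
    n, p = X.shape
    see = np.sqrt(((Y - Y_hat) ** 2).sum(axis=0) / (n - p))
    print(np.round(see, 4))                               # one value per original variable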
Linear Scoring
Once a linear regression model has been built, it can be used to “score” new data, that is, to estimate the value of the dependent variable in the model using data for which its value may not be known. Scoring is performed using the values of the b-coefficients in the linear regression model and the names of the independent variable columns they correspond to. Other information needed includes the table name(s) in which the data resides, the new table to be created, and primary index information for the new table. The result of scoring a linear regression model will be a new table containing primary index columns and an estimate of the dependent variable, optionally including a residual value for each row, calculated as the difference between the estimated value and the actual value of the dependent variable. (The option to include the residual value is available only when model evaluation is requested).
Note that Linear Scoring applies a Linear Regression model to a data set that has the same columns as those used in building the model (with the exception that the scoring input table need not include the predicted or dependent variable column unless model evaluation is requested).
Linear Regression - Model Evaluation
Linear regression model evaluation begins with scoring a table that includes the actual values of the dependent variable. The standard error of estimate for the model is calculated and reported and may be compared to the standard error of estimate reported when the model was built. The standard error of estimate is calculated as the square root of the average squared residual value over all the observations, i.e.
√( Σ(y − ŷ)² / (n − p − 1) )
where y is the actual value of the dependent variable, ŷ is its predicted value, n is the number of observations, and p is the number of independent variables (substituting n − p in the denominator if there is no constant term).
Initiate Linear Scoring
After generating a Linear Regression analysis (as described in “Linear Regression” on page 86) use the following procedure to initiate Linear Regression Scoring:
1 Click on the Add New Analysis icon in the toolbar:
Figure 141: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Scoring under Categories and then under Analyses double-click on Linear Scoring:
Figure 142: Add New Analysis > Scoring > Linear Scoring
3 This will bring up the Linear Scoring dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
Linear Scoring - INPUT - Data Selection
On the Linear Scoring dialog click on INPUT and then click on data selection:
Figure 143: Linear Scoring > Input > Data Selection
On this screen select:
1 Select Input Source
Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh).
In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 3 Select Columns From a Single Table • Available Databases — All available source databases that have been added through Connection Properties. • Available Tables — The tables available for scoring are listed in this window, though all may not strictly qualify; the input table to be scored must contain the same column names used in the original analysis. • Available Columns — The columns available for scoring are listed in this window. • Selected Columns • Index Columns — Note that the Selected Columns window is actually a split window for specifying Index and/or Retain columns if desired. If a table is specified as input, the primary index of the table is defaulted here, but can be changed. If a view is specified as input, an index must be provided. • Retain Columns — Other columns within the table being scored can be appended to the scored table, by specifying them here. Columns specified in Index Columns may not be specified here. Select Model Analysis Select from the list an existing Linear Regression analysis on which to run the scoring. The Linear Regression analysis must exist in the same project as the Linear Scoring analysis. Linear Scoring - INPUT - Analysis Parameters On the Linear Scoring dialog click on INPUT and then click on analysis parameters: 230 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Linear Scoring Figure 144: Linear Scoring > Input > Analysis Parameters On this screen select: • Scoring Method • Score — Option to create a score table only. • Evaluate — Option to perform model evaluation only. • Evaluate and Score — Option to create a score table and perform model evaluation. • Scoring Options • Use Dependent variable for predicted value column name — Option to use the exact same column name as the dependent variable when the model is scored. This is the default option. • • Predicted Value Column Name — If above option is not checked, then enter here the name of the column in the score table which contains the estimated value of the dependent variable. Residual Column Name — If Evaluate and Score is requested, enter the name of the column that will contain the residual values of the evaluation. This column will be populated with the difference between the estimated value and the actual value of the dependent variable. Linear Scoring - OUTPUT On the Linear Scoring dialog click on OUTPUT: Figure 145: Linear Scoring > Output On this screen select: • Output Table • Database name — The name of the database. • Table name — The name of the scored output table to be created-required only if a scoring option is selected. • Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected Teradata Warehouse Miner User Guide - Volume 1 231 Chapter 2: Scoring Linear Scoring • Create output table using the MULTISET keyword — This option is not enabled for scored output tables; the MULTISET keyword is not used. 
• Advertise Output — The Advertise Output option “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)).
• Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
• Generate the SQL for this analysis, but do not execute it — If this option is selected the analysis will only generate SQL, returning it and terminating immediately.
Note: To create a stored procedure to score this model, use the Refresh analysis.
Run the Linear Scoring Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Linear Scoring
The results of running the Linear Regression Scoring/Evaluation analysis include a variety of statistical reports on the scored model. All of these results are outlined below.
Linear Scoring - RESULTS - reports
On the Linear Scoring dialog click RESULTS and then click on reports (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed):
Figure 146: Linear Scoring > Results > Reports
• Resulting Scored Table — Name of the scored table - equivalent to Result Table Name.
• Number of Rows in Scored Table — Number of rows in the Resulting Scored Table.
• Evaluation
• Minimum Absolute Error
• Maximum Absolute Error
• Average Absolute Error
The term ‘error’ in the evaluation of a linear regression model refers to the difference between the value of the dependent variable predicted by the model and the actual value in a training set of data (data where the value of the dependent variable is known). Considering the absolute value of the error (changing negative differences to positive differences) provides a measure of the magnitude of the error in the model, which is a more useful measure of the model’s accuracy. With this introduction, the terms minimum, maximum and average absolute error have the usual meanings when calculated over all the observations in the input or scored table.
• Standard Error of Estimate
The standard error of estimate is calculated as the square root of the average squared residual value over all the observations, i.e.
√( Σ(y − ŷ)² / (n − p − 1) )
where y is the actual value of the dependent variable, ŷ is its predicted value, n is the number of observations, and p is the number of independent variables (substitute n − p in the denominator if there is no constant term).
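To make the four evaluation measures above concrete, the short sketch below computes them from a pair of arrays holding actual and predicted values. The numbers and the count of independent variables are invented for the example; the product performs the equivalent arithmetic in generated SQL.

    import numpy as np

    # Invented example: y holds actual dependent-variable values from the table
    # being evaluated, y_hat the model's predictions, and p the number of
    # independent variables in a model that includes a constant term.
    y     = np.array([59.0, 3.4, 12.3, 28.1, 0.1, 14.2])
    y_hat = np.array([43.2, 6.8, 12.6, 26.4, 0.0, 15.5])
    p = 3
    n = len(y)

    resid = y - y_hat
    abs_err = np.abs(resid)
    print("Minimum Absolute Error    ", abs_err.min())
    print("Maximum Absolute Error    ", abs_err.max())
    print("Average Absolute Error    ", abs_err.mean())
    print("Standard Error of Estimate",
          np.sqrt((resid ** 2).sum() / (n - p - 1)))   # use n - p if no constant term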
Linear Scoring - RESULTS - data On the Linear Scoring dialog click RESULTS and then click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 147: Linear Scoring > Results > Data Results data, if any, is displayed in a data grid as described in the “RESULTS Tab” on page 80 of the Teradata Warehouse Miner User Guide (Volume 1). If a table was created, a sample of rows is displayed here - the size determined by the setting specified by Maximum result rows to display in Tools-Preferences-Limits. The following table is built in the requested Output Database by Linear Regression scoring. Note that the options selected affect the structure of the table. Those columns in bold below will comprise the Unique Primary Index (UPI). Also note that there may be repeated groups of columns, and that some columns will be generated only if specific options are selected. Teradata Warehouse Miner User Guide - Volume 1 233 Chapter 2: Scoring Linear Scoring Table 66: Output Database table (Built by Linear Regression scoring) Name Type Definition Key User Defined One or more unique-key columns which default to the index defined in the table to be scored (i.e. in Selected Tables). The data type defaults to the same as the scored table, but can be changed via Primary Index Columns. <app_var> User Defined One or more columns as selected under Retain Columns. <dep_var> FLOAT The predicted value of the dependent variable. The name used defaults to the Dependent Variable specified when the model was built. If Use Dependent variable for predicted value column name is not selected, then an appropriate column name must be entered here. FLOAT The residual values of the evaluation, the difference between the estimated value and the actual value of the dependent variable. This is generated only if the Evaluate or Evaluate and Score options are selected. The name defaults to “Residual” unless it is overwritten by the user. (Default) Residual (Default) Linear Scoring - RESULTS - SQL On the Linear Scoring dialog click RESULTS and then click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 148: Linear Scoring > Results > SQL The SQL generated for scoring is returned here, but only if the Score Method was set to Score on the Input - Analysis Parameters tab, and if the Output - Storage option to Generate the SQL for this analysis, but do not execute it was selected. When SQL is displayed here, it may be selected and copied as desired. (Both right-click menu options and buttons to Select All and Copy are available). Tutorial - Linear Scoring In this example, the same table is scored as was used to build the linear model, as a matter of convenience. Typically, this would not be done unless the contents of the table changed since the model was built. In the case of this example, the Standard Error of Estimate can be seen to be exactly the same, 10.445, that it was when the model was built (see “Tutorial - Linear Regression” on page 107). Parameterize a Linear Regression Scoring Analysis as follows: • Selected Table — twm_customer_analysis • Evaluate and Score — Enabled • Use dependent variable for predicted value column name — Enabled 234 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Linear Scoring • Residual column name — Residual • Result Table Name — twm_score_linear_1 • Index Columns — cust_id Run the analysis, and click on Results when it completes. 
For this example, the Linear Regression Scoring/Evaluation Analysis generated the following pages. A single click on each page name populates Results with the item. Table 67: Linear Regression Reports Resulting Scored Table <result_db>.score_linear_1 Number of Rows in Scored Table 747 Table 68: Evaluation Minimum Absolute Error 0.0056 Maximum Absolute Error 65.7775 Average Absolute Error 7.2201 Standard Error of Estimate 10.4451 Table 69: Data cust_id cc_rev Residual 1362480 59.188 15.812 1362481 3.412 -3.412 1362484 12.254 -.254 1362485 28.272 1.728 1362486 -9.026E-02 9.026E-02 1362487 14.325 -1.325 1362488 -5.105 5.105 1362489 69.738 12.262 1362492 53.368 .632 1362496 -5.876 5.876 … … … … … … … … … Teradata Warehouse Miner User Guide - Volume 1 235 Chapter 2: Scoring Logistic Scoring Logistic Scoring Once a logistic regression model has been built, it can be used to “score” new data, that is, to estimate the value of the dependent variable in the model using data for which its value may not be known. Scoring is performed using the values of the b-coefficients in the logistic regression model and the names of the independent variable column names they correspond to. This information resides in the results metadata stored in the Teradata database by Teradata Warehouse Miner. Other information needed includes the table name in which the data resides, the new table to be created, and primary index information for the new table. Scoring a logistic regression model requires some steps beyond those required in scoring a linear regression model. The result of scoring a logistic regression model will be a new table containing primary index columns, the probability that the dependent variable is 1 (representing the response value) rather than 0 (representing the non-response value), and/or an estimate of the dependent variable, either 0 or 1, based on a user specified threshold value. For example, if the threshold value is 0.5, then a value of 1 is estimated if the probability value is greater than or equal to 0.5. The probability is based on the logistic regression functions given earlier. The user can achieve different results based on the threshold value applied to the probability. The model evaluation tables described below can be used to determine what this threshold value should be. Note that Logistic Scoring applies a Logistic Regression model to a data set that has the same columns as those used in building the model (with the exception that the scoring input table need not include the predicted or dependent variable column unless model evaluation is requested). Logistic Regression Model Evaluation The same model evaluation that is available when building a Logistic Regression model is also available when scoring it, including the following report tables. Prediction success table The prediction success table is computed using only probabilities and not estimates based on a threshold value. Using an input table that contains known values for the dependent variable, the sum of the probability values x and 1 – x , which correspond to the probability that the predicted value is 1 or 0 respectively, are calculated separately for rows with actual values of 1 and 0. This produces a report table such as that shown below. 
Table 70: Prediction Success Table Estimate Response Estimate Non-Response Actual Total Actual Response 306.5325 68.4675 375.0000 Actual Non-Response 69.0115 302.9885 372.0000 Estimated Total 375.5440 371.4560 747.0000 236 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Logistic Scoring An interesting and useful feature of this table is that it is independent of the threshold value that is used in scoring to determine which probabilities correspond to an estimate of 1 and 0 respectively. This is possible because the entries in the “Estimate Response” column are the sums of the probabilities x that the outcome is 1, summed separately over the rows where the actual outcome is 1 and 0 and then totaled. Similarly, the entries in the “Estimate NonResponse” column are the sums of the probabilities 1 – x that the outcome is 0. Multi-threshold Success Table This table provides values similar to those in the prediction success table, but instead of summing probabilities, the estimated values based on a threshold value are summed instead. Rather than just one threshold however, several thresholds ranging from a user specified low to high value are displayed in user specified increments. This allows the user to compare several success scenarios using different threshold values, to aid in the choice of an ideal threshold. It might be supposed that the ideal threshold value would be the one that maximizes the number of correctly classified observations. However, subjective business considerations may be applied by looking at all of the success values. It may be that wrong predictions in one direction (say estimate 1 when the actual value is 0) may be more tolerable than in the other direction (estimate 0 when the actual value is 1). One may, for example, mind less overlooking fraudulent behavior than wrongly accusing someone of fraud. The following is an example of a logistic regression multi-threshold success table. Table 71: Logistic Regression Multi-Threshold Success table Threshold Probability Actual Response, Estimate Response Actual Response, Estimate NonResponse Actual Non-Response, Estimate Response Actual Non-Response, Estimate NonResponse 0.0000 375 0 372 0 0.0500 375 0 326 46 0.1000 374 1 231 141 0.1500 372 3 145 227 0.2000 367 8 93 279 0.2500 358 17 59 313 0.3000 354 21 46 326 0.3500 347 28 38 334 0.4000 338 37 32 340 0.4500 326 49 27 345 0.5000 318 57 27 345 0.5500 304 71 26 346 0.6000 296 79 24 348 Teradata Warehouse Miner User Guide - Volume 1 237 Chapter 2: Scoring Logistic Scoring Table 71: Logistic Regression Multi-Threshold Success table Threshold Probability Actual Response, Estimate Response Actual Response, Estimate NonResponse Actual Non-Response, Estimate Response Actual Non-Response, Estimate NonResponse 0.6500 287 88 22 350 0.7000 279 96 21 351 0.7500 270 105 19 353 0.8000 258 117 18 354 0.8500 245 130 16 356 0.9000 222 153 12 360 0.9500 187 188 10 362 Cumulative Lift table The Cumulative Lift Table is produced for deciles based on the probability values. Note that the deciles are labeled such that 1 is the highest decile and 10 is the lowest, based on the probability values calculated by logistic regression. 
Within each decile, the following measures are given: 1 count of “response” values 2 count of observations 3 percentage response (percentage of response values within the decile) 4 captured response (percentage of responses over all response values) 5 lift value (percentage response / expected response, where the expected response is the percentage of responses over all observations) 6 cumulative versions of each of the measures above The following is an example of a logistic regression Cumulative Lift Table. Table 72: Logistic Regression Cumulative Lift Table Cumulative Response Cumulative Response (%) Cumulative Captured Response (%) Cumulative Lift Decile Count Response Response (%) Captured Response (%) 1 74.0000 73.0000 98.6486 19.4667 1.9651 73.0000 98.6486 19.4667 1.9651 2 75.0000 69.0000 92.0000 18.4000 1.8326 142.0000 95.3020 37.8667 1.8984 3 75.0000 71.0000 94.6667 18.9333 1.8858 213.0000 95.0893 56.8000 1.8942 4 74.0000 65.0000 87.8378 17.3333 1.7497 278.0000 93.2886 74.1333 1.8583 5 75.0000 63.0000 84.0000 16.8000 1.6733 341.0000 91.4209 90.9333 1.8211 6 75.0000 23.0000 30.6667 6.1333 0.6109 364.0000 81.2500 97.0667 1.6185 238 Lift Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Logistic Scoring Table 72: Logistic Regression Cumulative Lift Table Cumulative Response Cumulative Response (%) Cumulative Captured Response (%) Cumulative Lift Decile Count Response Response (%) Captured Response (%) 7 74.0000 8.0000 10.8108 2.1333 0.2154 372.0000 71.2644 99.2000 1.4196 8 75.0000 2.0000 2.6667 0.5333 0.0531 374.0000 62.6466 99.7333 1.2479 9 75.0000 1.0000 1.3333 0.2667 0.0266 375.0000 55.8036 100.0000 1.1116 10 75.0000 0.0000 0.0000 0.0000 0.0000 375.0000 50.2008 100.0000 1.0000 Lift Initiate Logistic Scoring After generating a Logistic Regression analysis (as described in “Logistic Regression” on page 114) use the following procedure to initiate Logistic Scoring: 1 Click on the Add New Analysis icon in the toolbar: Figure 149: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Scoring under Categories and then under Analyses double-click on Logistic Scoring: Teradata Warehouse Miner User Guide - Volume 1 239 Chapter 2: Scoring Logistic Scoring Figure 150: Add New Analysis > Scoring > Logistic Scoring 3 This will bring up the Logistic Scoring dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Logistic Scoring - INPUT - Data Selection On the Logistic Scoring dialog click on INPUT and then click on data selection: Figure 151: Logistic Scoring > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). 
For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 240 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Logistic Scoring 2 3 Select Columns From a Single Table • Available Databases — All available source databases that have been added through Connection Properties. • Available Tables — The tables available for scoring are listed in this window, though all may not strictly qualify; the input table to be scored must contain the same column names used in the original analysis. • Available Columns — The columns available for scoring are listed in this window. • Selected Columns • Index Columns — Note that the Selected Columns window is actually a split window for specifying Index and/or Retain columns if desired. If a table is specified as input, the primary index of the table is defaulted here, but can be changed. If a view is specified as input, an index must be provided. • Retain Columns — Other columns within the table being scored can be appended to the scored table, by specifying them here. Columns specified in Index Columns may not be specified here. Select Model Analysis Select from the list an existing Logistic Regression analysis on which to run the scoring. The Logistic Regression analysis must exist in the same project as the Logistic Scoring analysis. Logistic Scoring - INPUT - Analysis Parameters On the Logistic Scoring dialog click on INPUT and then click on analysis parameters: Figure 152: Logistic Scoring > Input > Analysis Parameters On this screen select: • Scoring Method • Score — Option to create a score table only. • Evaluate — Option to perform model evaluation only. • Evaluate and Score — Option to create a score table and perform model evaluation. • Scoring Options • Include Probability Score Column — Inclusion of a column in the score table that contains the probability between 0 and 1 that the value of the dependent variable is 1 is an optional parameter when scoring is selected. The default is to include a probability score column in the created score table. (Either the probability score or the estimated value or both must be requested when scoring). • Column Name — Column name containing the probability between 0 and 1 that the Teradata Warehouse Miner User Guide - Volume 1 241 Chapter 2: Scoring Logistic Scoring value of the dependent variable is 1. • Include Estimate from Threshold Column — Inclusion of a column in the score table that contains the estimated value of the dependent variable is an option when scoring is selected. The default is to include an estimated value column in the created score table. (Either the probability score or the estimated value or both must be requested when scoring). • Column Name — Column name containing the estimated value of the dependent variable. • Threshold Default — The threshold value is a value between 0 and 1 that determines which probabilities result in an estimated value of 0 or 1. For example, with a threshold value of 0.3, probabilities of 0.3 or greater yield an estimated value of 1, while probabilities less than 0.3 yield an estimated value of 0. The threshold option is valid only if the Include Estimate option has been requested and scoring is selected. If the Include Estimate option is requested but the threshold value is not specified, a default threshold value of 0.5 is used. • Evaluation Options • Prediction Success Table — Creates a prediction success table using sums of probabilities rather than estimates based on a threshold value. 
The default value is to include the Prediction Success Table. (This only applies if evaluation is requested). • Multi-Threshold Success Table — This table provides values similar to those in the prediction success table, but based on a range of threshold values, thus allowing the user to compare success scenarios using different threshold values. The default value is to include the multi-threshold success table. (This only applies if evaluation is requested). • Threshold Begin • Threshold End • Threshold Increment Specifies the threshold values to be used in the multi-threshold success table. If the computed probability is greater than or equal to a threshold value, that observation is assigned a 1 rather than a 0. Default values are 0, 1 and .05 respectively. • Cumulative Lift Table — Produce a cumulative lift table for deciles based on probability values. The default value is to include the cumulative lift table. (This only applies if evaluation is requested). Logistic Scoring - OUTPUT On the Logistic Scoring dialog click on OUTPUT: Figure 153: Logistic Scoring > Output 242 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Logistic Scoring On this screen select: • Output Table • Database name — The name of the database. • Table name — The name of the scored output table to be created-required only if a scoring option is selected. • Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected. • Create output table using the MULTISET keyword — This option is not enabled for scored output tables; the MULTISET keyword is not used. • Advertise Output — The Advertise Output option “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate the SQL for this analysis, but do not execute it — If this option is selected the analysis will only generate SQL, returning it and terminating immediately. Note: To create a stored procedure to score this model, use the Refresh analysis. Run the Logistic Scoring Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Logistic Scoring The results of running the Logistic Scoring analysis include a variety of statistical reports on the scored model, and if selected, a Lift Chart. All of these results are outlined below. It is important to note that although a response value other than 1 may have been indicated when the Logistic Regression model was built, the Logistic Regression Scoring analysis will always use the value 1 as the response value, and the value 0 for the non-response value(s). 
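Before looking at the reports, it may help to see how the probability score, the threshold estimate and the Multi-Threshold Success Table relate to one another. The Python sketch below tallies the four success counts for thresholds from 0 to 1 in steps of 0.05, mirroring the default Threshold Begin, End and Increment settings; the probability and outcome arrays are invented and the variable names are hypothetical.

    import numpy as np

    # Invented data: prob holds the model's probability scores, actual the known
    # 0/1 outcomes for the rows being evaluated.
    rng = np.random.default_rng(0)
    prob = rng.random(20)
    actual = (rng.random(20) < prob).astype(int)

    # Estimate from a single threshold: probability >= threshold yields 1.
    threshold = 0.35
    estimate = (prob >= threshold).astype(int)
    print("estimates at", threshold, "->", estimate.tolist())

    # Multi-threshold success counts, one row per threshold value.
    for t in np.arange(0.0, 1.0, 0.05):
        est = (prob >= t).astype(int)
        rr = int(((actual == 1) & (est == 1)).sum())   # actual response, estimate response
        rn = int(((actual == 1) & (est == 0)).sum())   # actual response, estimate non-response
        nr = int(((actual == 0) & (est == 1)).sum())   # actual non-response, estimate response
        nn = int(((actual == 0) & (est == 0)).sum())   # actual non-response, estimate non-response
        print(f"{t:0.2f}  {rr:3d} {rn:3d} {nr:3d} {nn:3d}")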
Logistic Scoring - RESULTS - Reports On the Logistic Scoring dialog click RESULTS and then click on reports (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Teradata Warehouse Miner User Guide - Volume 1 243 Chapter 2: Scoring Logistic Scoring Figure 154: Logistic Scoring > Results > Reports • Resulting Scored Table — Name of the scored table - equivalent to Result Table Name. • Number of Rows in Scored Table — Number of rows in the Resulting Scored Table. • Prediction Success Table — This is the same report described in “Results - Logistic Regression” on page 127, but applied to possibly new data. • Multi-Threshold Success Table — This is the same report described in “Results - Logistic Regression” on page 127, but applied to possibly new data. • Cumulative Lift Table — This is the same report described in “Results - Logistic Regression” on page 127, but applied to possibly new data. Logistic Scoring - RESULTS - Data On the Logistic Scoring dialog click RESULTS and then click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 155: Logistic Scoring > Results > Data Results data, if any, is displayed in a data grid as described in “RESULTS Tab” on page 80 of the Teradata Warehouse Miner User Guide (Volume 1). If a table was created, a sample of rows is displayed here - the size determined by the setting specified by Maximum result rows to display in Tools-Preferences-Limits. The following table is built in the requested Output Database by Logistic Regression scoring. Note that the options selected affect the structure of the table. Those columns in bold below will comprise the Unique Primary Index (UPI). Also note that there may be repeated groups of columns, and that some columns will be generated only if specific options are selected. Table 73: Output Database table (Built by Logistic Regression scoring) Name Type Definition Key User Defined One or more unique-key columns which default to the index defined in the table to be scored (i.e. in Selected Table). The data type defaults to the same as the scored table, but can be changed via Primary Index Columns. <app_var> User Defined One or more columns as selected under Retain Columns. 244 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Logistic Scoring Table 73: Output Database table (Built by Logistic Regression scoring) Name Type Definition Probability FLOAT A probability between 0 and 1 that the value of the dependent variable is 1. The name used defaults to “Probability” unless an appropriate column name is entered. Generated only if Include Probability Score Column is selected. The default is to not include a probability score column in the created score table. (Either the probability score or the estimated value or both must be requested when scoring). FLOAT The estimated value of the dependent variable,. The default is to not include an estimated value column in the created score table. Generated only if Include Estimate from Threshold Column is selected. (Either the probability score or the estimated value or both must be requested when scoring). (Default) Estimate (Default) Logistic Scoring - RESULTS - Lift Graph On the Logistic Scoring dialog click RESULTS and then click on lift graph (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 156: Logistic Scoring > Results > Lift Graph This chart displays the information in the Cumulative Lift Table. 
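The lift graph described next, like the Cumulative Lift Table shown earlier, is derived from deciles of the probability score, with decile 1 holding the highest probabilities. A rough pandas sketch of that derivation follows; the data is invented and the column names are hypothetical.

    import numpy as np
    import pandas as pd

    # Invented example data: probability scores and actual 0/1 outcomes.
    rng = np.random.default_rng(1)
    df = pd.DataFrame({"prob": rng.random(747)})
    df["actual"] = (rng.random(747) < df["prob"]).astype(int)

    # Decile 1 = highest probabilities, decile 10 = lowest.
    df = df.sort_values("prob", ascending=False).reset_index(drop=True)
    df["decile"] = (df.index * 10 // len(df)) + 1

    overall_rate = df["actual"].mean()
    g = df.groupby("decile")["actual"].agg(count="size", response="sum")
    g["response_pct"] = 100 * g["response"] / g["count"]
    g["captured_pct"] = 100 * g["response"] / g["response"].sum()
    g["lift"] = g["response_pct"] / (100 * overall_rate)

    # Cumulative versions of each measure, accumulated from decile 1 down.
    g["cum_response"] = g["response"].cumsum()
    g["cum_response_pct"] = 100 * g["cum_response"] / g["count"].cumsum()
    g["cum_captured_pct"] = g["captured_pct"].cumsum()
    g["cum_lift"] = g["cum_response_pct"] / (100 * overall_rate)
    print(g.round(4))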
This is the same graph described in “Results - Logistic Regression” on page 127 as Lift Chart, but applied to possibly new data. Logistic Scoring - RESULTS - SQL On the Logistic Scoring dialog click RESULTS and then click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 157: Logistic Scoring > Results > SQL The SQL generated for scoring is returned here, but only if the Score Method was set to Score on the Input - Analysis Parameters tab, and if the Output - Storage option to Generate the SQL for this analysis, but do not execute it was selected. When SQL is displayed here, it may be selected and copied as desired. (Both right-click menu options and buttons to Select All and Copy are available). Teradata Warehouse Miner User Guide - Volume 1 245 Chapter 2: Scoring Logistic Scoring Tutorial - Logistic Scoring In this example, the same table is scored as was used to build the logistic regression model, as a matter of convenience. Typically, this would not be done unless the contents of the table changed since the model was built. Parameterize a Logistic Regression Scoring Analysis as follows: • Selected Table — twm_customer_analysis • Evaluate and Score — Enabled • Include Probability Score Column — Enabled • Column Name — Probability • Include Estimate from Threshold Column — Enabled • Column Name — Estimate • Threshold Default — 0.35 • Prediction Success Table — Enabled • Multi-Threshold Success Table — Enabled • Threshold Begin — 0 • Threshold End — 1 • Threshold Increment — 0.05 • Cumulative Lift Table — Enabled • Result Table Name — score_logistic_1 • Index Columns — cust_id Run the analysis, and click on Results when it completes. For this example, the Logistic Regression Scoring/Evaluation Analysis generated the following pages. A single click on each page name populates Results with the item. 
Table 74: Logistic Regression Model Scoring Report Resulting Scored Table <result_db>.score_logistic_1 Number of Rows in Scored Table 747 Table 75: Prediction Success Table Estimate Response Estimate Non-Response Actual Total Actual Response 304.58 / 40.77% 70.42 / 9.43% 375.00 / 50.20% Actual Non-Response 70.41 / 9.43% 301.59 / 40.37% 372.00 / 49.80% Estimated Total 374.99 / 50.20% 372.01 / 49.80% 747.00 / 100.00% 246 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Logistic Scoring Table 76: Multi-Threshold Success Table Threshold Probability Actual Response, Estimate Response Actual Response, Estimate NonResponse Actual Non-Response, Estimate Response Actual Non-Response, Estimate NonResponse 0.0000 375 0 372 0 0.0500 375 0 353 19 0.1000 374 1 251 121 0.1500 373 2 152 220 0.2000 369 6 90 282 0.2500 361 14 58 314 0.3000 351 24 37 335 0.3500 344 31 29 343 0.4000 329 46 29 343 0.4500 318 57 28 344 0.5000 313 62 24 348 0.5500 305 70 23 349 0.6000 291 84 23 349 0.6500 286 89 21 351 0.7000 276 99 20 352 0.7500 265 110 20 352 0.8000 253 122 20 352 0.8500 243 132 16 356 0.9000 229 146 13 359 0.9500 191 184 11 361 Lift Table 77: Cumulative Lift Table Cumulative Response Cumulative Response (%) Cumulative Captured Response (%) Cumulative Lift Decile Count Response Response (%) Captured Response (%) 1 74.0000 73.0000 98.6486 19.4667 1.9651 73.0000 98.6486 19.4667 1.9651 2 75.0000 69.0000 92.0000 18.4000 1.8326 142.0000 95.3020 37.8667 1.8984 3 75.0000 71.0000 94.6667 18.9333 1.8858 213.0000 95.0893 56.8000 1.8942 Teradata Warehouse Miner User Guide - Volume 1 247 Chapter 2: Scoring Logistic Scoring Table 77: Cumulative Lift Table Cumulative Response Cumulative Response (%) Cumulative Captured Response (%) Cumulative Lift Decile Count Response Response (%) Captured Response (%) 4 74.0000 65.0000 87.8378 17.3333 1.7497 278.0000 93.2886 74.1333 1.8583 5 75.0000 66.0000 88.0000 17.6000 1.7530 344.0000 92.2252 91.7333 1.8371 6 75.0000 24.0000 32.0000 6.4000 0.6374 368.0000 82.1429 98.1333 1.6363 7 74.0000 4.0000 5.4054 1.0667 0.1077 372.0000 71.2644 99.2000 1.4196 8 73.0000 2.0000 2.7397 0.5333 0.0546 374.0000 62.8571 99.7333 1.2521 9 69.0000 1.0000 1.4493 0.2667 0.0289 375.0000 56.4759 100.0000 1.1250 10 83.0000 0.0000 0.0000 0.0000 0.0000 375.0000 50.2008 100.0000 1.0000 Lift Table 78: Data cust_id Probability Estimate 1362480 1.00 1 1362481 0.08 0 1362484 1.00 1 1362485 0.14 0 1362486 0.66 1 1362487 0.86 1 1362488 0.07 0 1362489 1.00 1 1362492 0.29 0 1362496 0.35 1 … ... ... Lift Graph By default, the Lift Graph displays the cumulative measure of the percentage of observations in the decile where the actual value of the dependent variable is 1, from decile 1 to this decile (Cumulative, %Response). 248 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Neural Networks Scoring Figure 158: Logistic Scoring Tutorial: Lift Graph Neural Networks Scoring Neural Networks Scoring is implemented by saving each of the retained models from a Neural Networks Analysis in PMML. After the user selects a particular model to score, the scoring is the same as PMML scoring. (Refer to “PMML Scoring” on page 295 of the Teradata Warehouse Miner User Guide (Volume 2)). 
Initiate Neural Networks Scoring Use the following procedure to initiate a Neural Networks Scoring analysis: 1 Click on the Add New Analysis icon in the toolbar: Figure 159: Add New Analysis from toolbar Teradata Warehouse Miner User Guide - Volume 1 249 Chapter 2: Scoring Neural Networks Scoring 2 In the resulting Add New Analysis dialog box, click on Scoring under Categories and then under Analyses double-click on Neural Net Scoring: Figure 160: Add New Analysis > Scoring > Neural Net Scoring 3 This will bring up the Neural Networks Scoring dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Neural Networks Scoring - INPUT - Data Selection On the Neural Networks Scoring dialog click on INPUT and then click on data selection: 250 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Neural Networks Scoring Figure 161: Neural Networks Scoring > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view). For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 Select Columns From a Single Table • Available Databases — All available source databases that have been added through Connection Properties. • Available Tables — The tables available for scoring are listed in this window, though all may not strictly qualify; the input table to be scored must contain the same column names used in the original analysis. • Available Columns — The columns available for scoring are listed in this window. Teradata Warehouse Miner User Guide - Volume 1 251 Chapter 2: Scoring Neural Networks Scoring • 3 Selected Columns • Index Columns — Note that the Selected Columns window is actually a split window for specifying Index and/or Retain columns if desired. If a table is specified as input, the primary index of the table is defaulted here, but can be changed. If a view is specified as input, an index must be provided. • Retain Columns — Other columns within the table being scored can be appended to the scored table, by specifying them here. Columns specified in Index Columns may not be specified here. Select Model Analysis Select from the list an existing Neural Networks analysis on which to run the scoring. The Neural Networks analysis must exist in the same project as the Neural Networks Scoring analysis. 4 Select Model Choose a particular Neural Network model from the above analysis to be used for scoring. 
(Note that when a saved analysis with a valid model is first loaded into the project space its models are embedded in the analysis and the displayed models reflect the analysis the model was originally built from, even if it resided on another client machine). Neural Networks Scoring - OUTPUT On the Neural Networks Scoring dialog click on OUTPUT and then storage: Figure 162: Neural Networks Scoring > Output On this screen select: • Output Table 252 • Database name — The name of the database. • Table name — The name of the scored output table to be created. • Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected • Create output table using the MULTISET keyword — This option is not enabled for scored output tables; the MULTISET keyword is not used. • Advertise Output — The Advertise Output option “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Neural Networks Scoring • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate the SQL for this analysis, but do not execute it — If this option is checked the SQL to score this model will be generated but not executed. • Maximum SQL statement size allowed (default 64000) — The SQL statements generated will not exceed this maximum value in characters. • Generate as a stored procedure with name — (This option is no longer available. To create a stored procedure to score this model, use the Refresh analysis and select this analysis as the analysis to be refreshed). Note: To create a stored procedure to score this model, use the Refresh analysis. Run the Neural Networks Scoring Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Neural Networks Scoring The results of running the Teradata Warehouse Miner Neural Networks Scoring Analysis include the following outlined below. NEURAL NETWORKS Scoring - RESULTS - Reports On the NEURAL NETWORKS Scoring dialog click RESULTS and then click on reports (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 163: Neural Networks Scoring > Results > Reports • NEURALNET Score Report • Resulting Scored Table Name — This is the name given the table with the scored values of the model. • Number of Rows in Scored Table — This is the number of rows in the scored table. 
Teradata Warehouse Miner User Guide - Volume 1 253 Chapter 2: Scoring Neural Networks Scoring NEURAL NETWORKS Scoring - RESULTS - Data On the NEURAL NETWORKS Scoring dialog click RESULTS and then click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): Figure 164: Neural Networks Scoring > Results > Data Results data, if any, is displayed in a data grid as described in “RESULTS Tab” on page 80 of the Teradata Warehouse Miner User Guide (Volume 1). After clicking the “Load” button, a sample of rows from the scored table will be displayed here - the size determined by the setting specified by Maximum result rows to display in ToolsPreferences-Limits. By default, the index of the table being scored as well as the dependent column prediction are in the scored table - additional columns as specified in the OUTPUT panel may be displayed as well. The following table is built in the requested Output Database by the NEURAL NETWORKS Scoring analysis. Note that the options selected affect the structure of the table. Those columns in bold below will comprise the Primary Index. Also note that there may be repeated groups of columns, and that some columns will be generated only if specific options are selected. Table 79: Output Database table (Built by Neural Networks scoring) Name Type Definition <app_var> User Defined One or more columns as selected under Retain Columns. <dep_var > User Defined The predicted value of the dependent variable. The name used defaults to the neuron number. The data type used is the same as the Dependent Variable. FLOAT If any additional probability output is requested on the OUTPUT panel, it will be displayed using the name provided in the PMML model. (Default) P_<dep_var><value> NEURAL NETWORKS Scoring - RESULTS - SQL On the NEURAL NETWORKS Scoring dialog click RESULTS and then click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed): 254 Teradata Warehouse Miner User Guide - Volume 1 Chapter 2: Scoring Neural Networks Scoring Figure 165: Neural Networks Scoring > Results > SQL The SQL generated for scoring is returned here, but only if the Output - Storage option to Generate the SQL for this analysis, but do not execute it was selected. When SQL is displayed here, it may be selected and copied as desired. (Both right-click menu options and buttons to Select All and Copy are available). Neural Networks Scoring Tutorial 1 Create a new Neural Networks analysis to score named Neural Networks2 equivalent to Neural Networks Tutorial 2: Performing Classification with Fictitious Banking Data. 2 Create a new Neural Net Scoring analysis named Neural Net Scoring2. Parameterize this Neural Net Scoring analysis as follows: • Available Tables — twm_customer_analysis • Select Model Analysis — Neural Networks2 • Selected Model — (choose one of the models from the pull-down window) • Index Columns — cust_id Parameterize the output as follows: • Result Table Name — twm_score_neural_net2 Run the analysis, and click on Results when it completes. For this example, the Neural Scoring Analysis generated the following results. Teradata Warehouse Miner User Guide - Volume 1 255 Chapter 2: Scoring Neural Networks Scoring Figure 166: Neural Networks Scoring Tutorial: Report The predicted value of the dependent variable, ccacct, is displayed for each cust_id, as shown below (after sorting). Note that results may vary depending on the random element in the construction of neural network models. 
Figure 167: Neural Networks Scoring Tutorial: Data (Note that the scoring SQL could only be displayed if the Output option to Generate the SQL for this analysis, but do not execute it were selected, which was not the case in this example). 256 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Overview CHAPTER 3 Statistical Tests What’s In This Chapter For more information, see these subtopics: 1 “Overview” on page 257 2 “Parametric Tests” on page 261 3 “Binomial Tests” on page 286 4 “Kolmogorov-Smirnov Tests” on page 299 5 “Tests Based on Contingency Tables” on page 329 6 “Rank Tests” on page 342 Overview Teradata Warehouse Miner contains both parametric and nonparametric statistical tests from the classical statistics literature, as well as more recently developed tests. In addition, “group by” variables permit the ability to statistically analyze data groups defined by selected variables having specific values. In this way, multiple tests can be conducted at once to provide a profile of customer data showing hidden clues about customer behavior. In simplified terms, what statistical inference allows us to do is to find out whether the outcome of an experiment could have happened by accident, or if it is extremely unlikely to have happened by chance. Of course a very well designed experiment would have outcomes which are clearly different, and require no statistical test. Unfortunately, in nature noisy outcomes of experiments are common, and statistical inference is required to get the answer. It doesn’t matter whether our data come from an experiment we designed, or from a retail database. Questions can be asked of the data, and statistical inference can provide the answer. What is statistical inference? It is a process of drawing conclusions about parameters of a statistical distribution. In summary, there are three principal approaches to statistical inference. One type of statistical inference is Bayesian estimation, where conclusions are based upon posterior judgments about the parameter given an experimental outcome. A second type is based on the likelihood approach, in which all conclusions are inferred from the likelihood function of the parameter given an experimental outcome. A third type of inference is hypothesis testing, which includes both nonparametric and parametric inference. For nonparametric inference, estimators concerning the distribution function are independent of the specific mathematical form of the distribution function. Parametric inference, by contrast, involves estimators about the distribution function that assumes a particular mathematical form, most often the normal distribution. Parametric tests are based on the Teradata Warehouse Miner User Guide - Volume 1 257 Chapter 3: Statistical Tests Overview sampling distribution of a particular statistic. Given knowledge of the underlying distribution of a variable, how the statistic is distributed in multiple equal-size samples can be predicted. The statistical tests provided in Teradata Warehouse Miner are solely those of the hypothesis testing type, both parametric and nonparametric. Hypothesis tests generally belong to one of five classes: 1 parametric tests including the class of t-tests and F-tests assuming normality of data populations 2 nonparametric tests of the binomial type 3 nonparametric tests of the chi square type, based on contingency tables. 
4 nonparametric tests based on ranks
5 nonparametric tests of the Kolmogorov-Smirnov type
Within each class of tests there exist many variants, some of which have risen to the level of being named for their authors. Often tests have multiple names due to different originators. The tests may be applied to data in different ways, such as on one sample, two samples or multiple samples. The specific hypothesis of the test may be two-tailed, upper-tailed or lower-tailed. Hypothesis tests vary depending on the assumptions made in the context of the experiment, and care must be exercised that they are valid in the particular context of the data to be examined. For example, is it a fair assumption that the variables are normally distributed? The choice of which test to apply will depend on the answer to this question. Failure to exercise proper judgement in choosing which test to apply may result in false alarms, where the null hypothesis is rejected incorrectly, or misses, where the null hypothesis is accepted improperly.
Note: Identity columns, i.e. columns defined with the attribute “GENERATED … AS IDENTITY”, cannot be analyzed by many of the statistical test functions and should therefore generally be avoided.
Summary of Tests
Parametric Tests
Tests include the T-test, the F(1-way), F(2-way with equal Sample Size), F(3-way with equal Sample Size), and the F(2-way with unequal Sample Size). The two-sample t-test checks if two population means are equal. The ANOVA or F test determines if significant differences exist among treatment means or interactions. It’s a preliminary test that indicates if further analysis of the relationship among treatment means is warranted.
Tests of the Binomial Type
These tests include the Binomial test and Sign test. The data for a binomial test is assumed to come from n independent trials, and have outcomes in either of two classes. The binomial test reports whether the probability that the outcome is of the first class is a particular p_value, p*, usually ½.
Tests Based on Contingency Tables - Chi Square Type
Tests include the Chi Square and Median test. The Chi Square Test determines whether the probabilities observed from data in an RxC contingency table are the same or different. Additional statistics provided are the Phi coefficient, Cramer’s V, Likelihood Ratio Chi Square, Continuity-Adjusted Chi-Square, and Contingency Coefficient. The Median test is a special case of the chi-square test with fixed marginal totals, testing whether several samples came from populations with the same median.
Tests of the Kolmogorov-Smirnov Type
These tests include the Kolmogorov-Smirnov and Lilliefors tests for goodness of fit to a particular distribution (normal), the Shapiro-Wilk and D'Agostino-Pearson tests of normality, and the Smirnov test of equality of two distributions.
Tests Based on Ranks
Tests include the Mann-Whitney test for 2 independent samples, the Kruskal-Wallis test for k independent samples, the Wilcoxon Signed Ranks test, and the Friedman test. The Friedman test is an extension of the sign test for several independent samples. It is a test for treatment differences in a randomized, complete block design. Additional statistics provided are Kendall’s Coefficient of Concordance (W) and Spearman’s Rho.
Data Requirements
The following chart summarizes how the Statistical Test functions handle various types of input.
Those cases with the note “should be normal numeric” will give warnings for any type of input that is not standard numeric, i.e. for character data, dates, big integers or decimals, etc. (In the table below, cat is an abbreviation for categorical, num for numeric and bignum for big integers or decimals.)
Table 80: Statistical Test functions handling of input
Test | Input Columns | Tests Return Results With | Note
Median | column of interest | cat, num, date, bignum | can be anything
Median | columns | cat, num, date, bignum | can be anything
Median | group by columns | cat, num, date, bignum | can be anything
Chi Square | 1st columns | cat, num, date, bignum | can be anything (limit of 2000 distinct value pairs)
Chi Square | 2nd columns | cat, num, date, bignum | can be anything
Mann Whitney | column of interest | cat, num, date, bignum | can be anything
Mann Whitney | columns | cat, num, date, bignum | can be anything
Mann Whitney | group by columns | cat, num, date, bignum | can be anything
Wilcoxon | 1st column | num, date, bignum | should be normal numeric
Wilcoxon | 2nd column | num, date, bignum | should be normal numeric
Wilcoxon | group by columns | cat, num, date, bignum | can be anything
Friedman | column of interest | num | should be normal numeric
Friedman | treatment column | special count requirements
Friedman | block column | special count requirements
Friedman | group by columns | cat, num, date, bignum | can be anything
F(n)way | column of interest | num | should be normal numeric
F(n)way | columns | cat, num, date, bignum | can be anything
F(n)way | group by columns | cat, num, date, bignum | can be anything
F(2)way ucc | column of interest | num | should be normal numeric
F(2)way ucc | columns | cat, num, date, bignum | can be anything
F(2)way ucc | group by columns | cat, num, date, bignum | can be anything
T Paired | 1st column | num | should be normal numeric
T Paired | 2nd column | num, date, bignum | should be normal numeric
T Paired | group by columns | cat, num, date, bignum | can be anything
T Unpaired | 1st column | num | should be normal numeric
T Unpaired | 2nd column | num, date, bignum | should be normal numeric
T Unpaired | group by columns | cat, num, date, bignum | can be anything
T Unpaired w ind | 1st column | num | should be normal numeric
T Unpaired w ind | indicator column | cat, num, date, bignum | can be anything
T Unpaired w ind | group by columns | cat, num, date, bignum | can be anything
Kolmogorov-Smirnov | column of interest | num, date, bignum | should be normal numeric
Kolmogorov-Smirnov | group by columns | cat, num, date, bignum | can be anything
Lilliefors | column of interest | num, date, bignum | should be normal numeric
Lilliefors | group by columns | cat, num, bignum | can be anything but date
Shapiro-Wilk | column of interest | num, date, bignum | should be normal numeric
Shapiro-Wilk | group by columns | cat, num, date, bignum | can be anything
D'Agostino-Pearson | column of interest | num | should be normal numeric
D'Agostino-Pearson | group by columns | cat, num, bignum | can be anything but date
Smirnov | column of interest | cat, num, date, bignum | should be normal numeric
Smirnov | columns | must be 2 distinct values | must be 2 distinct values
Smirnov | group by columns | cat, num, bignum | can be anything but date
Binomial | 1st column | num, date, bignum | should be normal numeric
Binomial | 2nd column | num, date, bignum | should be normal numeric
Binomial | group by columns | cat, num, date, bignum | can be anything
Sign | 1st column | num, bignum | should be normal numeric
Sign | group by columns | cat, num, date, bignum | can be anything
Parametric Tests
Parametric tests are a class of statistical tests that require particular assumptions about the data. These often include that the observations are independent and normally distributed. A researcher may want to verify the assumption of normality before using a parametric test. Any of the four normality tests provided and described below, such as the Kolmogorov-Smirnov test for normality, can be used to determine whether use of one of the parametric tests is appropriate.
Two Sample T-Test for Equal Means
For the paired t test, a one-to-one correspondence must exist between values in both samples. The test is whether paired values have mean differences which are not significantly different from zero. It assumes differences are identically distributed normal random variables, and that they are independent.
The unpaired t test is similar, but there is no correspondence between values of the samples. It assumes that within each sample, values are identically distributed normal random variables, and that the two samples are independent of each other. The two sample sizes may be equal or unequal. Variances of both samples may be assumed to be equal (homoscedastic) or unequal (heteroscedastic). In both cases, the null hypothesis is that the population means are equal. Test output is a p-value which, compared to the threshold, determines whether the null hypothesis should be rejected.
Two methods of data selection are available for the unpaired t test. The first, “T Unpaired”, simply selects the columns containing the two unpaired datasets, some values of which may be NULL. The second, “T Unpaired with Indicator”, selects the column of interest and a second indicator column which determines to which group the first variable belongs. If the indicator variable is negative or zero, the value is assigned to the first group; if it is positive, it is assigned to the second group.
The two sample t test for unpaired data is defined as shown below (though calculated differently in the SQL):
Table 81: Two sample t tests for unpaired data
H_0: \mu_1 = \mu_2
H_a: \mu_1 \neq \mu_2
Test Statistic: T = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{s_1^2 / N_1 + s_2^2 / N_2}}
where N_1 and N_2 are the sample sizes, \bar{Y}_1 and \bar{Y}_2 are the sample means, and s_1^2 and s_2^2 are the sample variances.
Initiate a Two Sample T-Test
Use the following procedure to initiate a new T-Test in Teradata Warehouse Miner:
1 Click on the Add New Analysis icon in the toolbar:
Figure 168: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Parametric Tests:
Figure 169: Add New Analysis > Statistical Tests > Parametric Tests
3 This will bring up the Parametric Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
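For reference only, the following sketch shows how the unpaired (unequal variance) statistic defined above could be assembled from ordinary SQL aggregates; it is not the SQL that the analysis itself generates. It assumes the demonstration table twm_customer with income as the measurement column and gender (values 'F' and 'M') as the grouping value.

-- Welch form of the two sample t statistic:
-- T = (mean1 - mean2) / SQRT(var1/n1 + var2/n2)
SELECT (s1.mean_val - s2.mean_val)
       / SQRT(s1.var_val / s1.n + s2.var_val / s2.n) AS t_stat
FROM (SELECT AVG(CAST(income AS FLOAT)) AS mean_val,
             VAR_SAMP(CAST(income AS FLOAT)) AS var_val,
             COUNT(*) AS n
      FROM twm_customer
      WHERE gender = 'F') s1
CROSS JOIN
     (SELECT AVG(CAST(income AS FLOAT)) AS mean_val,
             VAR_SAMP(CAST(income AS FLOAT)) AS var_val,
             COUNT(*) AS n
      FROM twm_customer
      WHERE gender = 'M') s2;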
T-Test - INPUT - Data Selection On the Parametric Tests dialog click on INPUT and then click on data selection: Teradata Warehouse Miner User Guide - Volume 1 263 Chapter 3: Statistical Tests Parametric Tests Figure 170: T-Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the parametric tests available (F(n-way), F(2-way with unequal cell counts, T Paired, T Unpaired, T Unpaired with Indicator). Select “T Paired”, “T Unpaired”, or “T Unpaired with Indicator”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as First Column, Second Column or Group By Columns. Make sure you have the correct portion of the window highlighted. 264 • First Column — The column that specifies the first variable for the Parametric Test analysis. • Second Column (or Indicator Column) — The column that specifies the second variable for the Parametric Test analysis. Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests (Or the column that determines to which group the first variable belongs. If negative or zero, it will be assigned to the first group; if it is positive, it will be assigned to the second group). • Group By Columns — The column(s) that specifies the variable(s) whose distinct value combinations will categorize the data, so a separate test is performed on each category. T-Test - INPUT - Analysis Parameters On the Parametric Tests dialog click on INPUT and then click on analysis parameters: Figure 171: T-Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. • Equal Variance — Check this box if the “equal variance” assumption is to be used. Default is “unequal variance”. 
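To make the “T Paired” style and its group-by behavior concrete, the paired statistic can be written with SQL aggregates over the per-row differences. The sketch below is illustrative only and is not the SQL the analysis generates; it assumes the tutorial table twm_customer_analysis with paired columns avg_cc_bal and avg_sv_bal and group-by columns gender and age.

-- Paired t statistic per group: T = AVG(d) / (STDDEV_SAMP(d) / SQRT(n)),
-- where d is the per-row difference and the degrees of freedom are n - 1.
SELECT gender,
       age,
       COUNT(*) - 1 AS d_f,
       AVG(CAST(avg_cc_bal - avg_sv_bal AS FLOAT))
         / (STDDEV_SAMP(CAST(avg_cc_bal - avg_sv_bal AS FLOAT)) / SQRT(COUNT(*))) AS t_stat
FROM twm_customer_analysis
WHERE avg_cc_bal IS NOT NULL
  AND avg_sv_bal IS NOT NULL
GROUP BY gender, age;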
T-Test - OUTPUT On the Parametric Tests dialog click on OUTPUT: Figure 172: T-Test > Output On this screen select: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure Teradata Warehouse Miner User Guide - Volume 1 265 Chapter 3: Statistical Tests Parametric Tests here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. (For more information, please refer to “Stored Procedure Support” on page 109 of the Teradata Warehouse Miner User Guide (Volume 1)). • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the T-Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - T-Test Analysis The results of running the T-Test analysis include a table with a row for each group-by variable requested, as well as the SQL to perform the statistical analysis. All of these results are outlined below. T-Test - RESULTS - SQL On the Parametric Tests dialog click on RESULTS and then click on SQL: 266 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests Figure 173: T-Test > Results > SQL The series of SQL statements comprise the T-Test analysis. 
It is always returned, and is the only item returned when the Generate SQL Without Executing option is used. T-Test - RESULTS - Data On the Parametric Tests dialog click on RESULTS and then click on data: Figure 174: T-Test > Results > Data The output table is generated by the T-Test analysis for each group-by variable combination. Output Columns - T-Test Analysis The following table is built in the requested Output Database by the T-Test analysis. Any group-by columns will comprise the Unique Primary Index (UPI). Table 82: Output Database table Name Type Definition D_F INTEGER Degrees of Freedom for the group-by values selected. T Float The computed value of the T statistic TTestPValue Float The probability associated with the T statistic TTestCallP Char The TTest result: a=accept, p=reject (positive), n=reject(negative) Tutorial - T-Test In this example, a T-Test analysis of type T-Paired is performed on the fictitious banking data to analyze account usage. Parameterize a Parametric Test analysis as follows: • Available Tables — twm_customer_analysis • Statistical Test Style — T Paired • First Column — avg_cc_bal • Second Column — avg_sv_bal Teradata Warehouse Miner User Guide - Volume 1 267 Chapter 3: Statistical Tests Parametric Tests • Group By Columns — age, gender • Analysis Parameters • Threshold Probability — 0.05 • Equal Variance — true (checked) Run the analysis and click on Results when it completes. For this example, the Parametric Test analysis generated the following page. The paired t-test was computed on average credit card balance vs. average savings balance, by gender and age. Ages over 33 were excluded for brevity. Results were sorted by age and gender in the listing below. The tests shows whether the paired values have mean differences which are not significantly different from zero for each gender-age combination. A ‘p’ means the difference was significantly different from zero. An ‘a’ means the difference was insignificant. The SQL is available for viewing but not listed below. Table 83: T-Test gender age D_F TTestPValue T TTestCallP_0.05 F 13 7 0.01 3.99 p M 13 6 0.13 1.74 a F 14 5 0.10 2.04 a M 14 8 0.04 2.38 p F 15 18 0.01 3.17 p M 15 12 0.04 2.29 p F 16 9 0.00 4.47 p M 16 8 0.04 2.52 p F 17 13 0.00 4.68 p M 17 6 0.01 3.69 p F 18 9 0.00 6.23 p M 18 9 0.02 2.94 p F 19 9 0.01 3.36 p M 19 6 0.03 2.92 p F 22 3 0.21 1.57 a M 22 3 0.11 2.25 a F 23 3 0.34 1.13 a M 23 3 0.06 2.88 a F 25 4 0.06 2.59 a F 26 5 0.08 2.22 a 268 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests Table 83: T-Test gender age D_F TTestPValue T TTestCallP_0.05 F 27 5 0.09 2.12 a F 28 4 0.06 2.68 a M 28 4 0.03 3.35 p F 29 4 0.06 2.54 a M 29 5 0.16 1.65 a F 30 8 0.00 4.49 p M 30 5 0.01 4.25 p F 31 5 0.04 2.69 p M 31 6 0.05 2.52 p F 32 5 0.05 2.50 a M 32 6 0.10 1.98 a F 33 9 0.01 3.05 p M 33 4 0.09 2.27 a F-Test - N-Way • F-Test/Analysis of Variance — One Way, Equal or Unequal Sample Size • F-Test/Analysis of Variance — Two Way, Equal Sample Size • F-Test/Analysis of Variance — Three Way, Equal Sample Size The ANOVA or F test determines if significant differences exist among treatment means or interactions. It’s a preliminary test that indicates if further analysis of the relationship among treatment means is warranted. If the null hypothesis of no difference among treatments is accepted, the test result implies factor levels and response are unrelated, so the analysis is terminated. 
When the null hypothesis is rejected, the analysis is usually continued to examine the nature of the factor-level effects. Examples are:
• Tukey’s Method — tests all possible pairwise differences of means
• Scheffe’s Method — tests all possible contrasts at the same time
• Bonferroni’s Method — tests, or puts simultaneous confidence intervals around, a preselected group of contrasts
The N-way F-Test is designed to execute within groups defined by the distinct values of the group-by variables (GBV's), the same as most of the other nonparametric tests. Two or more treatments must exist in the data within the groups defined by the distinct GBV values. Given a column of interest (dependent variable), one or more input columns (independent variables) and optionally one or more group-by columns (all from the same input table), an F-Test is produced. The N-Way ANOVA tests whether a set of sample means are all equal (the null hypothesis). Output is a p-value which, when compared to the user’s threshold, determines whether the null hypothesis should be rejected.
Initiate an N-Way F-Test
Use the following procedure to initiate a new F-Test analysis in Teradata Warehouse Miner:
1 Click on the Add New Analysis icon in the toolbar:
Figure 175: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Parametric Tests:
Figure 176: Add New Analysis > Statistical Tests > Parametric Tests
3 This will bring up the Parametric Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
F-Test (N-Way) - INPUT - Data Selection
On the Parametric Tests dialog click on INPUT and then click on data selection:
Figure 177: F-Test > Input > Data Selection
On this screen select:
1 Select Input Source
Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1).
2 Select Columns From a Single Table
• Available Databases (or Analyses) — These are the databases (or analyses) available to be processed.
• Available Tables — These are the tables and views that are available to be processed.
• Available Columns — These are the columns within the table/view that are available for processing.
3 Select Statistical Test Style
These are the parametric tests available (T, F(n-way), F(2-way with unequal cell counts)). Select “F(n-way)”.
4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Columns or Group By Columns. Make sure you have the correct portion of the window highlighted. • Column of Interest — The column that specifies the dependent variable for the Ftest analysis. • Columns — The column(s) that specifies the independent variable(s) for the F-test analysis. Selection of one column will generate a 1-Way F-test, two columns a 2Way F-test, and three columns a 3-Way F-test. Do not select over three columns because the 4-way, 5-way, etc. F-tests are not implemented in the version of TWM. Teradata Warehouse Miner User Guide - Volume 1 271 Chapter 3: Statistical Tests Parametric Tests Warning: For this test, equal cell counts are required for the 2 and 3 way tests. • Group By Columns — The column(s) that specifies the variable(s) whose distinct value combinations will categorize the data, so a separate test is performed on each category. F-Test (N-Way) - INPUT - Analysis Parameters On the Parametric Tests dialog click on INPUT and then click on analysis parameters: Figure 178: F-Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. F-Test - OUTPUT On the Parametric Tests dialog click on OUTPUT: Figure 179: F-Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: 272 • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. (For more information, please refer to “Stored Procedure Support” on page 109 of the Teradata Warehouse Miner User Guide (Volume 1)). Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. 
This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the F-Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - F-Test Analysis The results of running the F-test analysis include a table with a row for each group-by variable requested, as well as the SQL to perform the statistical analysis. All of these results are outlined below. F-Test - RESULTS - SQL On the Parametric Tests dialog click on RESULTS and then click on SQL: Figure 180: F-Test > Results > SQL Teradata Warehouse Miner User Guide - Volume 1 273 Chapter 3: Statistical Tests Parametric Tests The series of SQL statements comprise the F-test Analysis. It is always returned, and is the only item returned when the Generate SQL Without Executing option is used. F-Test - RESULTS - data On the Parametric Tests dialog click on RESULTS and then click on data: Figure 181: F-Test > Results > data The output table is generated by the F-test Analysis for each group-by variable combination. Output Columns - F-Test Analysis The particular result table returned will depend on whether the test is 1-way, 2-way or 3-way, and is built in the requested Output Database by the F-test analysis. If group-by columns are present, they will comprise the Unique Primary Index (UPI). Otherwise DF will be the UPI. 
Table 84: Output Columns - 1-Way F-Test Analysis Name Type Definition DF INTEGER Degrees of Freedom for the Variable DFErr INTEGER Degrees of Freedom for Error F Float The computed value of the F statistic FPValue Float The probability associated with the F statistic FPText Char If not NULL, the probability is less than the smallest or more than the largest table value FCallP Char The F-Test result: a=accept, p=reject (positive), n=reject(negative) Table 85: Output Columns - 2-Way F-Test Analysis Name Type Definition DF INTEGER Degrees of Freedom for the model Fmodel Float The computed value of the F statistic for the model DFErr INTEGER Degrees of Freedom for Error term DF_1 INTEGER Degrees of Freedom for first variable F1 Float The computed value of the F statistic for the first variable 274 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests Table 85: Output Columns - 2-Way F-Test Analysis Name Type Definition DF_2 INTEGER Degrees of Freedom for second variable F2 Float The computed value of the F statistic for the second variable DF_12 INTEGER Degrees of Freedom for interaction F12 Float The computed value of the F statistic for interaction Fmodel_PValue Float The probability associated with the F statistic for the model Fmodel_PText Char If not NULL, the probability is less than the smallest or more than the largest table value Fmodel_CallP_0.05 Char The F test result: a=accept, p=reject for the model F1_PValue Float The probability associated with the F statistic for the first variable F1_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F1_callP_0.05 Char The F test result: a=accept, p=reject for the first variable F2_PValue Float The probability associated with the F statistic for the second variable F2_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F2_callP_0.05 Char The F test result: a=accept, p=reject for the second variable F12_PValue Float The probability associated with the F statistic for the interaction F12_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F12_callP_0.05 Char The F test result: a=accept, p=reject for the interaction Table 86: Output Columns - 3-Way F-Test Analysis Name Type Definition DF INTEGER Degrees of Freedom for the model Fmodel Float The computed value of the F statistic for the model DFErr INTEGER Degrees of Freedom for Error term DF_1 INTEGER Degrees of Freedom for first variable F1 Float The computed value of the F statistic for the first variable DF_2 INTEGER Degrees of Freedom for second variable F2 Float The computed value of the F statistic for the second variable DF_3 INTEGER Degrees of Freedom for third variable Teradata Warehouse Miner User Guide - Volume 1 275 Chapter 3: Statistical Tests Parametric Tests Table 86: Output Columns - 3-Way F-Test Analysis Name Type Definition F3 Float The computed value of the F statistic for the third variable DF_12 INTEGER Degrees of Freedom for interaction of v1 and v2 F12 Float The computed value of the F statistic for interaction of v1 and v2 DF_13 INTEGER Degrees of Freedom for interaction of v1 and v3 F13 Float The computed value of the F statistic for interaction of v1 and v3 DF_23 INTEGER Degrees of Freedom for interaction of v2 and v3 F23 Float The computed value of the F statistic for interaction of v2 and v3 DF_123 INTEGER Degrees of Freedom for three-way interaction of v1, v2, and v3 F123 Float 
The computed value of the F statistic for three-way interaction of v1, v2 and v3 Fmodel_PValue Float The probability associated with the F statistic for the model Fmodel_PText Char If not NULL, the probability is less than the smallest or more than the largest table value Fmodel_callP_0.05 Char The F test result: a=accept, p=reject for the model F1_PValue Float The probability associated with the F statistic for the first variable F1_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F1_callP_0.05 Char The F test result: a=accept, p=reject for the first variable F2_PValue Float The probability associated with the F statistic for the second variable F2_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F2_callP_0.05 Char The F test result: a=accept, p=reject for the second variable F3_PValue Float The probability associated with the F statistic for the third variable F3_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F3_callP_0.05 Char The F test result: a=accept, p=reject for the third variable F12_PValue Float The probability associated with the F statistic for the interaction of v1 and v2 F12_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F12_callP_0.05 Char The F test result: a=accept, p=reject for the interaction of v1 and v2 F13_PValue Float The probability associated with the F statistic for the interaction of v1 and v3 276 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests Table 86: Output Columns - 3-Way F-Test Analysis Name Type Definition F13_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F13_callP_0.05 Char The F test result: a=accept, p=reject for the interaction of v1 and v3 F23_PValue Float The probability associated with the F statistic for the interaction of v2 and v3 F23_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F23_callP_0.05 Char The F test result: a=accept, p=reject for the interaction of v2 and v3 F123_PValue Float The probability associated with the F statistic for the three-way interaction of v1, v2 and v3 F123_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F123_callP_0.05 Char The F test result: a=accept, p=reject for the three-way interaction of v1, v2 and v3 Tutorial - One-Way F-Test Analysis In this example, an F-test analysis is performed on the fictitious banking data to analyze income by gender. Parameterize an F-Test analysis as follows: • Available Tables — twm_customer • Column of Interest — income • Columns — gender • Group By Columns — years_with_bank, nbr_children • Analysis Parameters • Threshold Probability — 0.01 Run the analysis and click on Results when it completes. For this example, the F-Test analysis generated the following page. The F-Test was computed on income over gender for every combination of years_with_bank and nbr_children. Results were sorted by years_with_bank and nbr_children in the listing below. The tests shows whether significant differences exist in income for males and females, and does so separately for each value of years_with_bank and nbr_children. A ‘p’ means the difference was significant, and an ‘a’ means it was not significant. 
If the field is null, it indicates there was insufficient data for the test. The SQL is available for viewing but not listed below. Teradata Warehouse Miner User Guide - Volume 1 277 Chapter 3: Statistical Tests Parametric Tests Table 87: F-Test (one-way) years_with_bank nbr_children DF DFErr F FPValue FPText FCallP_0.01 0 0 1 53 0.99 0.25 >0.25 a 0 1 1 8 1.87 0.22 a 0 2 1 10 1.85 0.22 a 0 3 1 6 0.00 0.25 >0.25 a 0 4 1 0 0 5 0 0 1 0 1 55 0.00 0.25 >0.25 a 1 1 1 6 0.00 0.25 >0.25 a 1 2 1 14 0.00 0.25 >0.25 a 1 3 1 2 0.50 0.25 >0.25 a 1 4 0 0 1 5 0 0 2 0 1 55 0.82 0.25 >0.25 a 2 1 1 14 1.54 0.24 2 2 1 14 0.07 0.25 >0.25 a 2 3 1 1 0.30 0.25 >0.25 a 2 4 0 0 2 5 0 0 3 0 1 49 0.05 0.25 >0.25 a 3 1 1 9 1.16 0.25 >0.25 a 3 2 1 10 0.06 0.25 >0.25 a 3 3 1 6 16.90 0.01 3 4 1 1 4.50 0.25 3 5 0 0 4 0 1 52 1.84 0.20 4 1 1 10 0.54 0.25 4 2 1 6 2.38 0.20 4 3 0 0 4 4 0 0 278 a p >0.25 a a >0.25 a a Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests Table 87: F-Test (one-way) years_with_bank nbr_children DF DFErr F FPValue 4 5 0 1 5 0 1 5 1 5 FPText FCallP_0.01 46 4.84 0.04 1 15 0.48 0.25 2 1 10 3.51 0.09 a 5 3 1 2 2.98 0.24 a 5 4 0 0 6 0 1 46 0.01 0.25 6 1 1 14 3.67 0.08 6 2 1 15 0.13 0.25 6 3 0 0 6 5 0 0 7 0 1 41 4.99 0.03 7 1 1 8 0.01 0.25 >0.25 a 7 2 1 4 0.13 0.25 >0.25 a 7 3 1 2 0.04 0.25 >0.25 a 7 5 0 1 8 0 1 23 0.50 0.25 >0.25 a 8 1 1 7 0.38 0.25 >0.25 a 8 2 1 6 0.09 0.25 >0.25 a 8 3 1 0 8 5 0 0 9 0 1 26 0.07 0.25 >0.25 a 9 1 1 3 3.11 0.20 9 2 1 1 0.09 0.25 >0.25 a 9 3 1 1 0.12 0.25 >0.25 a a >0.25 >0.25 a a a >0.25 a a a F-Test/Analysis of Variance - Two Way Unequal Sample Size The ANOVA or F test determines if significant differences exist among treatment means or interactions. It’s a preliminary test that indicates if further analysis of the relationship among treatment means is warranted. If the null hypothesis of no difference among treatments is accepted, the test result implies factor levels and response are unrelated, so the analysis is Teradata Warehouse Miner User Guide - Volume 1 279 Chapter 3: Statistical Tests Parametric Tests terminated. When the null hypothesis is rejected, the analysis is usually continued to examine the nature of the factor-level effects. Examples are: • Tukey’s Method — tests all possible pairwise differences of means • Scheffe’s Method — tests all possible contrasts at the same time • Bonferroni’s Method — tests, or puts simultaneous confidence intervals around a preselected group of contrasts The 2-way Unequal Sample Size F-Test is designed to execute on the entire dataset. No group-by parameter is provided for this test, but if such a test is desired, multiple tests must be run on pre-prepared datasets with group-by variables in each as different constants. Two or more treatments must exist in the data within the dataset. (Note that this test will create a temporary work table in the Result Database and drop it at the end of processing, even if the Output option to “Store the tabular output of this analysis in the database” is not selected). Given a table name of tabulated values, an F-Test is produced. The N-Way ANOVA tests whether a set of sample means are all equal (the null hypothesis). Output is a p-value which when compared to the user’s threshold, determines whether the null hypothesis should be rejected. 
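As a point of reference, the quantity being tested in both the n-way and unequal cell count styles is a ratio of between-treatment to within-treatment mean squares. The sketch below computes the simplest, one-way F statistic directly with SQL aggregates; it is illustrative only and is not the SQL generated by the analysis. It assumes the demonstration table twm_customer with income as the column of interest and gender as the single treatment column.

-- One-way ANOVA: F = ((SStotal - SSwithin) / (k - 1)) / (SSwithin / (N - k))
SELECT (t.sst - g.ssw) / (g.k - 1) / (g.ssw / (t.n - g.k)) AS f_stat,
       g.k - 1 AS df_treatment,
       t.n - g.k AS df_error
FROM (SELECT COUNT(*) AS n,
             VAR_POP(CAST(income AS FLOAT)) * COUNT(*) AS sst
      FROM twm_customer) t
CROSS JOIN
     (SELECT COUNT(*) AS k,
             SUM(ssw_grp) AS ssw
      FROM (SELECT gender,
                   VAR_POP(CAST(income AS FLOAT)) * COUNT(*) AS ssw_grp
            FROM twm_customer
            GROUP BY gender) grp_stats) g;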
Initiate a 2-Way F-Test with Unequal Cell Counts Use the following procedure to initiate a new F-Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 182: Add New Analysis from toolbar 2 280 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Parametric Tests: Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests Figure 183: Add New Analysis > Statistical Tests > Parametric Tests 3 This will bring up the Parametric Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. F-Test (Unequal Cell Counts) - INPUT - Data Selection On the Parametric Tests dialog click on INPUT and then click on data selection: Figure 184: F-Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. Note that if an analysis is selected it must be one that creates a table or view for output since a volatile table cannot be processed with this Statistical Test Style. For more Teradata Warehouse Miner User Guide - Volume 1 281 Chapter 3: Statistical Tests Parametric Tests information, about referencing an analysis for input refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the parametric tests available (T, F(n-way), F(2-way with unequal cell counts). Select “F(2-way with unequal cell counts)”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, First Column or Second Column. Make sure you have the correct portion of the window highlighted. • Column of Interest — The column that specifies the dependent variable for the Ftest analysis. • First Column — The column that specifies the first independent variable for the Ftest analysis. • Second Column — The column that specifies the second independent variable for the F-test analysis. 
F-Test - INPUT - Analysis Parameters On the Parametric Tests dialog click on INPUT and then click on analysis parameters: Figure 185: F-Test > Input > Analysis Parameters On this screen enter or select: • Processing Options 282 • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. • First Column Values — Use the selection wizard to choose any or all of the values of the first independent variable to be used in the analysis. Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests • Second Column Values — Use the selection wizard to choose any or all of the values of the second independent variable to be used in the analysis. F-Test - OUTPUT On the Parametric Tests dialog click on OUTPUT: Figure 186: F-Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. (For more information, please refer to “Stored Procedure Support” on page 109 of the Teradata Warehouse Miner User Guide (Volume 1)). • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Teradata Warehouse Miner User Guide - Volume 1 283 Chapter 3: Statistical Tests Parametric Tests Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. 
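For readers unfamiliar with the fallback and multiset options listed above, the selected keywords simply appear in the CREATE TABLE statement that builds the result table. A minimal, purely illustrative sketch is shown below; the database name is a placeholder and the column list is abbreviated from the output table described in the following section.

-- Illustrative only: result table created with both keyword options selected.
CREATE MULTISET TABLE mydb.ftest_ucc_output, FALLBACK
  (DF INTEGER,
   Fmodel FLOAT,
   DFErr INTEGER,
   DF_1 INTEGER,
   F1 FLOAT,
   DF_2 INTEGER,
   F2 FLOAT)
UNIQUE PRIMARY INDEX (DF);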
Run the F-Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - F-Test Analysis The results of running the F-test analysis include a table with a single row, as well as the SQL to perform the statistical analysis. All of these results are outlined below. F-Test - RESULTS - SQL On the Parametric Tests dialog click on RESULTS and then click on SQL: Figure 187: F-Test > Results > SQL The series of SQL statements comprise the F-test Analysis. It is always returned, and is the only item returned when the Generate SQL Without Executing option is used. F-Test - RESULTS - data On the Parametric Tests dialog click on RESULTS and then click on data: Figure 188: F-Test > Results > data The output table is generated by the F-test Analysis for each group-by variable combination. Output Columns - F-Test Analysis The result table returned is built in the requested Output Database by the F-test analysis. DF will be the UPI. 284 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Parametric Tests Table 88: Output Columns - 2-Way F-Test Analysis Name Type Definition DF INTEGER Degrees of Freedom for the model Fmodel Float The computed value of the F statistic for the model DFErr INTEGER Degrees of Freedom for Error term DF_1 INTEGER Degrees of Freedom for first variable F1 Float The computed value of the F statistic for the first variable DF_2 INTEGER Degrees of Freedom for second variable F2 Float The computed value of the F statistic for the second variable DF_12 INTEGER Degrees of Freedom for interaction F12 Float The computed value of the F statistic for interaction Fmodel_PValue Float The probability associated with the F statistic for the model Fmodel_PText Char If not NULL, the probability is less than the smallest or more than the largest table value Fmodel_CallP_0.05 Char The F test result: a=accept, p=reject for the model F1_PValue Float The probability associated with the F statistic for the first variable F1_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F1_callP_0.05 Char The F test result: a=accept, p=reject for the first variable F2_PValue Float The probability associated with the F statistic for the second variable F2_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F2_callP_0.05 Char The F test result: a=accept, p=reject for the second variable F12_PValue Float The probability associated with the F statistic for the interaction F12_PText Char If not NULL, the probability is less than the smallest or more than the largest table value F12_callP_0.05 Char The F test result: a=accept, p=reject for the interaction Tutorial - Two-Way Unequal Cell Count F-Test Analysis In this example, an F-test analysis is performed on the fictitious banking data to analyze income by years_with_bank and marital_status. 
Parameterize an F-Test analysis as follows:
• Available Tables — twm_customer
• Column of Interest — income
• First Column — years_with_bank
• Second Column — marital_status
• Analysis Parameters
• Threshold Probability — 0.05
• First Column Values — 0, 1, 2, 3, 4, 5, 6, 7
• Second Column Values — 1, 2, 3, 4
Run the analysis and click on Results when it completes. For this example, the F-Test analysis generated the following page. The F-Test was computed on income over years_with_bank and marital_status. The test shows whether significant differences exist in income for years_with_bank by marital_status. The first column, years_with_bank, is represented by F1. The second column, marital_status, is represented by F2. The interaction term is F12. A ‘p’ means the difference was significant, and an ‘a’ means it was not significant. If the field is null, it indicates there was insufficient data for the test. The SQL is available for viewing but not listed below.
The results show that there are no significant differences in income for different values of years_with_bank or the interaction term for years_with_bank and marital_status. There was a highly significant (p<0.001) difference in income for different values of marital status. The overall model difference was significant at a level better than 0.001.
Table 89: F-Test (Two-way Unequal Cell Count) (Part 1)
DF | Fmodel | DFErr | DF_1 | F1 | DF_2 | F2 | DF_12 | F12
31 | 3.76 | 631 | 7 | 0.93 | 3 | 29.02 | 21 | 1.09
Table 90: F-Test (Two-way Unequal Cell Count) (Part 2)
Fmodel_PValue | Fmodel_PText | Fmodel_CallP_0.05 | F1_PValue | F1_PText | F1_CallP_0.05
0.001 | <0.001 | p | 0.25 | >0.25 | a
Table 91: F-Test (Two-way Unequal Cell Count) (Part 3)
F2_PValue | F2_PText | F2_CallP_0.05 | F12_PValue | F12_PText | F12_CallP_0.05
0.001 | <0.001 | p | 0.25 | >0.25 | a
Binomial Tests
The data for a binomial test is assumed to come from n independent trials, and have outcomes in either of two classes. The other assumption is that the probability of each outcome of each trial is the same, designated p. The values of the outcome could come directly from the data, where the value is always one of two kinds. More commonly, however, the test is applied to the sign of the difference between two values. If the probability is 0.5, this is the oldest of all nonparametric tests, and is called the ‘sign test’. Where the sign of the difference between two values is used, the binomial test reports whether the probability that the sign is positive is a particular p_value, p*.
Binomial/Ztest
Output for each unique set of values of the group-by variables (GBV's) is a p-value which, when compared to the user’s choice of alpha, the probability threshold, determines whether the null hypothesis (p=p*, p<=p*, or p>p*) should be rejected for the GBV set. Though both binomial and Z-test results are provided for all N, the approximate value obtained from the Z-test (nP) is appropriate when N is large. For values of N over 100, only the Z-test is performed. Otherwise, the value bP returned is the p_value of the one-tailed or two-tailed test, depending on the user’s choice.
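To make the large-sample form concrete, the Z statistic underlying the Z-test can be computed from the sign counts alone. The following sketch is illustrative rather than the SQL the analysis generates; it assumes the tutorial table twm_customer_analysis, compares avg_sv_bal against avg_ck_bal, discards exact matches, and tests p* = 0.5 with no continuity correction.

-- NPos counts positive differences (first column greater than second);
-- Z = (NPos - N*p) / SQRT(N * p * (1 - p)) with p = 0.5.
SELECT gender,
       COUNT(*) AS n,
       SUM(CASE WHEN avg_sv_bal > avg_ck_bal THEN 1 ELSE 0 END) AS npos,
       (SUM(CASE WHEN avg_sv_bal > avg_ck_bal THEN 1 ELSE 0 END) - COUNT(*) * 0.5)
         / SQRT(COUNT(*) * 0.25) AS z_stat
FROM twm_customer_analysis
WHERE avg_sv_bal <> avg_ck_bal
GROUP BY gender;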
Initiate a Binomial Test Use the following procedure to initiate a new Binomial in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 189: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Binomial Tests: Teradata Warehouse Miner User Guide - Volume 1 287 Chapter 3: Statistical Tests Binomial Tests Figure 190: Add New Analysis > Statistical Tests > Binomial Tests 3 This will bring up the Binomial Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Binomial Tests - INPUT - Data Selection On the Binomial Tests dialog click on INPUT and then click on data selection: Figure 191: Binomial Tests > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis 288 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Binomial Tests that is ordinarily produced by a Select statement. For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the binomial tests available (Binomial, Sign). Select “Binomial”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as First Column, Second Column or Group By Columns. Make sure you have the correct portion of the window highlighted. • First Column — The column that specifies the first variable for the Binomial Test analysis. • Second Column — The column that specifies the second variable for the Binomial Test analysis. • Group By Columns — The column(s) that specifies the variable(s) whose distinct value combinations will categorize the data, so a separate test is performed on each category. 
Binomial Tests - INPUT - Analysis Parameters On the Binomial Tests dialog click on INPUT and then click on analysis parameters: Figure 192: Binomial Tests > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. • Single Tail — Check this box if the Binomial Test is to be single tailed. Default is twotailed. Teradata Warehouse Miner User Guide - Volume 1 289 Chapter 3: Statistical Tests Binomial Tests • Binomial Probability — If the binomial test is not ½, enter the probability desired. Default is 0.5. • Exact Matches Comparison Criterion — Check the button to specify how exact matches are to be handled. Default is they are discarded. Other options are to include them with negative count, or with positive count. Binomial Tests - OUTPUT On the Binomial Tests dialog click on OUTPUT: Figure 193: Binomial Tests > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: 290 • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. (For more information, please refer to “Stored Procedure Support” on page 109 of the Teradata Warehouse Miner User Guide (Volume 1)). • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Binomial Tests and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. 
The SQL will be available to be viewed.

Run the Binomial Test Analysis

After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard

Results - Binomial Test

The results of running the Binomial analysis include a table with a row for each group-by variable combination requested, as well as the SQL to perform the statistical analysis. All of these results are outlined below.

Binomial Tests - RESULTS - SQL

On the Binomial Tests dialog click on RESULTS and then click on SQL:
Figure 194: Binomial Tests > Results > SQL
The series of SQL statements comprises the Binomial Analysis. It is always returned, and is the only item returned when the Generate SQL Without Executing option is used.

Binomial Tests - RESULTS - data

On the Binomial Tests dialog click on RESULTS and then click on data:
Figure 195: Binomial Tests > Results > data
The output table is generated by the Binomial Analysis for each group-by variable combination.

Output Columns - Binomial Tests

The following table is built in the requested Output Database by the Binomial analysis. Any group-by columns will comprise the Unique Primary Index (UPI); otherwise the UPI will be “N”.

Table 92: Output Database table (Built by the Binomial Analysis)
Name | Type | Definition
N | INTEGER | Total count of value pairs
NPos | INTEGER | Count of positive value differences
NNeg | INTEGER | Count of negative value differences
BP | FLOAT | The Binomial Probability
BinomialCallP | Char | The Binomial result: a=accept, p=reject (positive), n=reject (negative)

Tutorial - Binomial Tests Analysis

In this example, a Binomial analysis is performed on the fictitious banking data to analyze account usage. Parameterize the Binomial analysis as follows:
• Available Tables — twm_customer_analysis
• First Column — avg_sv_bal
• Second Column — avg_ck_bal
• Group By Columns — gender
• Analysis Parameters
  • Threshold Probability — 0.05
  • Single Tail — true
  • Binomial Probability — 0.5
  • Exact Matches — discarded

Run the analysis and click on Results when it completes. For this example, the Binomial analysis generated the following. The Binomial test was computed on average savings balance (column 1) vs. average checking account balance (column 2), by gender. The test is a Z Test since N > 100, and Z is 3.29 (not in the answer set), so the one-sided test of the null hypothesis that p is ½ is rejected, as shown in the table below.

Table 93: Binomial Test Analysis (Table 1)
gender | N | NPos | NNeg | BP | BinomialCallP_0.05
F | 366 | 217 | 149 | 0.0002 | p
M | 259 | 156 | 103 | 0.0005 | p

Rerunning the test with the binomial probability parameter set to 0.6 gives a different result: the one-sided test of the null hypothesis that p is 0.6 is accepted, as shown in the table below.

Table 94: Binomial Test Analysis (Table 2)
gender | N | NPos | NNeg | BP | BinomialCallP_0.05
F | 366 | 217 | 149 | 0.3909 | a
M | 259 | 156 | 103 | 0.4697 | a

Binomial Sign Test

For the sign test, one column is selected and the test is whether the value is positive or not positive.
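The sign test just described reduces to the same binomial calculation applied to a single column: count how many values are positive, treat negative or zero values as the other outcome, and compare the split to the hypothesized probability. A minimal Python sketch follows; it is an illustration only, not the SQL the analysis generates, and the single-tail alternative is an assumption for the example.

from scipy.stats import binomtest

# Sketch only: one column, positive versus not positive.
def sign_test(values, p=0.5, alpha=0.05):
    n = len(values)
    n_pos = sum(1 for v in values if v > 0)                     # NPos
    bp = binomtest(n_pos, n, p, alternative="greater").pvalue   # single-tail BP
    return {"N": n, "NPos": n_pos, "NNeg": n - n_pos,           # NNeg counts <= 0
            "BP": bp, "Call": "p" if bp < alpha else "a"}

With N = 94 and NPos = 57, as in one row of the sign-test tutorial later in this section, this one-sided calculation gives a p-value of roughly 0.025, in line with the BP value reported there.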
Initiate a Binomial Sign Test Use the following procedure to initiate a new Binomial Sign Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 196: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Binomial Tests: Teradata Warehouse Miner User Guide - Volume 1 293 Chapter 3: Statistical Tests Binomial Tests Figure 197: Add New Analysis > Statistical Tests > Binomial Tests 3 This will bring up the Binomial Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Binomial Sign Test - INPUT - Data Selection On the Binomial Tests dialog click on INPUT and then click on data selection: Figure 198: Binomial Sign Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis 294 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Binomial Tests that is ordinarily produced by a Select statement. For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the binomial tests available (Binomial, Sign). Select “Sign”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. • Column/Group By Columns — Note that the Selected Columns window is actually a split window; you can insert columns as Column, or Group By Columns. Make sure you have the correct portion of the window highlighted. • Column — The column that specifies the first variable for the Binomial Test analysis. • Group By Columns — The column(s) that specifies the variable(s) whose distinct value combinations will categorize the data, so a separate test is performed on each category. Binomial Sign Test - INPUT - Analysis Parameters On the Binomial Tests dialog click on INPUT and then click on analysis parameters: Figure 199: Binomial Sign Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. 
• Single Tail — Check this box if the Binomial Test is to be single tailed. Default is twotailed. Binomial Sign Test - OUTPUT On the Binomial Tests dialog click on OUTPUT: Teradata Warehouse Miner User Guide - Volume 1 295 Chapter 3: Statistical Tests Binomial Tests Figure 200: Binomial Sign Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: 296 • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. (For more information, please refer to “Stored Procedure Support” on page 109 of the Teradata Warehouse Miner User Guide (Volume 1)). • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Binomial Tests Run the Binomial Sign Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Binomial Sign Test Analysis The results of running the Binomial Sign analysis include a table with a row for each groupby variable requested, as well as the SQL to perform the statistical analysis. All of these results are outlined below. 
Binomial Sign Test - RESULTS - SQL On the Binomial Tests dialog click on RESULTS and then click on SQL: Figure 201: Binomial Sign Test > Results > SQL The series of SQL statements comprise the Binomial Sign Analysis. It is always returned, and is the only item returned when the Generate SQL Without Executing option is used. Binomial Sign Test - RESULTS - data On the Binomial Tests dialog click on RESULTS and then click on data: Figure 202: Binomial Sign Test > Results > data The output table is generated by the Binomial Sign Analysis for each group-by variable combination. Output Columns - Binomial Sign Analysis The following table is built in the requested Output Database by the Binomial analysis. Any group-by columns will comprise the Unique Primary Index (UPI), otherwise the UPI will be “N”. Teradata Warehouse Miner User Guide - Volume 1 297 Chapter 3: Statistical Tests Binomial Tests Table 95: Binomial Sign Analysis: Output Columns Name Type Definition N INTEGER Total count of value pairs NPos INTEGER Count of positive values NNeg INTEGER Count of negative or zero values BP FLOAT The Binomial Probability BinomialCallP Char The Binomial Sign result: a=accept, p=reject (positive), n=reject(negative) Tutorial - Binomial Sign Analysis In this example, a Binomial analysis is performed on the fictitious banking data to analyze account usage. Parameterize the Binomial analysis as follows: • Available Tables — twm_customer_analysis • Column — female • Group By Columns — years_with_bank • Analysis Parameters • Threshold Probability — 0.05 • Single Tail — true Run the analysis and click on Results when it completes. For this example, the Binomial Sign analysis generated the following. The Binomial was computed on the Boolean variable “female” by years_with_bank. The one-sided test of the null hypothesis that p is ½ accepted for all cases except years_with_bank=2 as shown in the table below. Table 96: Tutorial - Binomial Sign Analysis years_with_bank N NPos NNeg BP BinomialCallP_0.05 0 88 51 37 0.08272 a 1 87 48 39 0.195595 a 2 94 57 37 0.024725 p 3 86 46 40 0.295018 a 4 78 39 39 0.545027 a 5 82 46 36 0.160147 a 6 83 46 37 0.19 a 7 65 36 29 0.22851 a 8 45 26 19 0.185649 a 9 39 23 16 0.168392 a 298 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Kolmogorov-Smirnov Tests Tests of the Kolmogorov-Smirnov Type are based on statistical procedures which use maximum vertical distance between functions as a measure of function similarity. Two empirical distribution functions are mapped against each other, or a single empirical function is mapped against a hypothetical (e.g. Normal) distribution. Conclusions are then drawn about the likelihood the two distributions are the same. Kolmogorov-Smirnov Test (One Sample) The Kolmogorov-Smirnov (one-sample) test determines whether a dataset matches a particular distribution (for this test, the normal distribution). The test has the advantage of making no assumption about the distribution of data. (Non-parametric and distribution free) Note that this generality comes at some cost: other tests (e.g. the Student's t-test) may be more sensitive if the data meet the requirements of the test. The Kolmogorov-Smirnov test is generally less powerful than the tests specifically designed to test for normality. This is especially true when the mean and variance are not specified in advance for the KolmogorovSmirnov test, which then becomes conservative. 
Further, the Kolmogorov-Smirnov test will not indicate the type of nonnormality, e.g. whether the distribution is skewed or heavy-tailed. Examination of the skewness and kurtosis, and of the histogram, boxplot, and normal probability plot for the data may show why the data failed the Kolmogorov-Smirnov test. In this test, the user can specify group-by variables (GBV's) so a separate test will be done for every unique set of values of the GBV's. Initiate a Kolmogorov-Smirnov Test Use the following procedure to initiate a new Kolmogorov-Smirnov Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 203: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Kolmogorov-Smirnov Tests: Teradata Warehouse Miner User Guide - Volume 1 299 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Figure 204: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests 3 This will bring up the Kolmogorov-Smirnov Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Kolmogorov-Smirnov Test - INPUT - Data Selection On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on data selection: Figure 205: Kolmogorov-Smirnov Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis 300 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests that is ordinarily produced by a Select statement. For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the Tests of the Kolmogorov-Smirnov Type available (Kolmogorov-Smirnov, Lilliefors, Shapiro-Wilk, D'Agostino-Pearson, Smirnov). Select “Kolmogorov-Smirnov”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Group By Columns. Make sure you have the correct portion of the window highlighted. 
• Column of Interest — The column that specifies the numeric variable to be tested for normality. • Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category. Kolmogorov-Smirnov Test - INPUT - Analysis Parameters On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on analysis parameters: Figure 206: Kolmogorov-Smirnov Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. Kolmogorov-Smirnov Test - OUTPUT On the Kolmogorov-Smirnov Tests dialog click on OUTPUT: Teradata Warehouse Miner User Guide - Volume 1 301 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Figure 207: Kolmogorov-Smirnov Test > Output On this screen select: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: 302 • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. (For more information, please refer to “Stored Procedure Support” on page 109 of the Teradata Warehouse Miner User Guide (Volume 1)). • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. 
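Before walking through the Run and Results steps, it may help to see what the one-sample statistic measures. The sketch below uses scipy to compute the maximum vertical distance between the empirical distribution of a column and a normal distribution whose mean and standard deviation are taken from the sample. It is an independent illustration, not the SQL the analysis generates, and its p-value will not exactly match the table-based value the product reports; the conservatism when parameters are estimated from the data is the caveat discussed earlier.

import numpy as np
from scipy.stats import kstest

# Sketch only: one-sample Kolmogorov-Smirnov test against a normal
# distribution with the sample's own mean and standard deviation.
def ks_normality(x):
    x = np.asarray(x, dtype=float)
    result = kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
    return result.statistic, result.pvalue   # roughly comparable to Klm and KlmPValue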
Run the Kolmogorov-Smirnov Test Analysis

After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard

Results - Kolmogorov-Smirnov Test

The results of running the Kolmogorov-Smirnov Test analysis include a table with a row for each separate Kolmogorov-Smirnov test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below.

Kolmogorov-Smirnov Test - RESULTS - SQL

On the Kolmogorov-Smirnov Test dialog click on RESULTS and then click on SQL:
Figure 208: Kolmogorov-Smirnov Test > Results > SQL
The series of SQL statements comprises the Kolmogorov-Smirnov Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used.

Kolmogorov-Smirnov Test - RESULTS - data

On the Kolmogorov-Smirnov Test dialog click on RESULTS and then click on data:
Figure 209: Kolmogorov-Smirnov Test > Results > Data
The output table is generated by the Analysis for each separate Kolmogorov-Smirnov test on all distinct-value group-by variables.

Output Columns - Kolmogorov-Smirnov Test Analysis

The following table is built in the requested Output Database by the Kolmogorov-Smirnov test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise Klm will be the UPI.

Table 97: Output Database table (Built by the Kolmogorov-Smirnov test analysis)
Name | Type | Definition
Klm | Float | Kolmogorov-Smirnov Value
M | INTEGER | Count
KlmPValue | Float | The probability associated with the Kolmogorov-Smirnov statistic
KlmPText | Char | Text description if P is outside table range
KlmCallP_0.05 | Char | The Kolmogorov-Smirnov result: a=accept, p=reject

Tutorial - Kolmogorov-Smirnov Test Analysis

In this example, a Kolmogorov-Smirnov test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Kolmogorov-Smirnov Test analysis as follows:
• Available Tables — twm_customer_analysis
• Column of Interest — income
• Group By Columns — years_with_bank
• Analysis Parameters
  • Threshold Probability — 0.05

Run the analysis and click on Results when it completes. For this example, the Kolmogorov-Smirnov Test analysis generated the following table. The Kolmogorov-Smirnov Test was computed for each distinct value of the group-by variable “years_with_bank”. Results were sorted by years_with_bank. The tests show that customer incomes for years_with_bank of 1, 5, 6, 7, 8 and 9 were normally distributed, and those for 0, 2, 3 and 4 were not. A ‘p’ means significantly nonnormal and an ‘a’ means accept the null hypothesis of normality. The SQL is available for viewing but not listed below.
Table 98: Kolmogorov-Smirnov Test years_with_bank Klm M KlmPValue 0 0.159887652 88 0.019549995 p 1 0.118707332 87 0.162772589 a 2 0.140315991 94 0.045795894 p 3 0.15830739 86 0.025080666 p 4 0.999999 78 0.01 5 0.138336567 82 0.080579955 a 6 0.127171093 83 0.127653475 a 7 0.135147555 65 0.172828265 a 8 0.184197592 45 0.084134345 a 304 KlmPText <0.01 KlmCallP_0.05 p Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Table 98: Kolmogorov-Smirnov Test years_with_bank Klm M KlmPValue KlmPText KlmCallP_0.05 9 0.109205054 39 0.20 >0.20 a Lilliefors Test The Lilliefors test determines whether a dataset matches a particular distribution, and is identical to the Kolmogorov-Smirnov test except that conversion to Z-scores is made. The Lilliefors test is therefore a modification of the Kolmogorov-Smirnov test. The Lilliefors test computes the Lilliefors statistic and checks its significance. Exact tables of the quantiles of the test statistic were computed from random numbers in computer simulations. The computed value of the test statistic is compared with the quantiles of the statistic. When the test is for the normal distribution, the null hypothesis is that the distribution function is normal with unspecified mean and variance. The alternative hypothesis is that the distribution function is nonnormal. The empirical distribution of X is compared with a normal distribution with the same mean and variance as X. It is similar to the Kolmogorov-Smirnov test, but it adjusts for the fact that the parameters of the normal distribution are estimated from X rather than specified in advance. In this test, the user can specify group-by variables (GBV's) so a separate test will be done for every unique set of values of the GBV's. Initiate a Lilliefors Test Use the following procedure to initiate a new Lilliefors Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 210: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Kolmogorov-Smirnov Tests: Teradata Warehouse Miner User Guide - Volume 1 305 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Figure 211: Add New Analysis> Statistical Tests > Kolmogorov-Smirnov Tests 3 This will bring up the Kolmogorov-Smirnov Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Lilliefors Test - INPUT - Data Selection On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on data selection: Figure 212: Lillefors Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). 
In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis 306 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests that is ordinarily produced by a Select statement. For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the Tests of the Kolmogorov-Smirnov Type available (Kolmogorov-Smirnov, Lilliefors, Shapiro-Wilk, D'Agostino-Pearson, Smirnov). Select “Lilliefors”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Group By Columns. Make sure you have the correct portion of the window highlighted. • Column of Interest — The column that specifies the numeric variable to be tested for normality. • Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category. Lilliefors Test - INPUT - Analysis Parameters On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on analysis parameters: Figure 213: Lillefors Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. Lilliefors Test - OUTPUT On the Kolmogorov-Smirnov Tests dialog click on OUTPUT: Teradata Warehouse Miner User Guide - Volume 1 307 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Figure 214: Lillefors Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: 308 • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. (For more information, please refer to “Stored Procedure Support” on page 109 of the Teradata Warehouse Miner User Guide (Volume 1)). • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. 
It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Run the Lilliefors Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Lilliefors Test Analysis The results of running the Lilliefors Test analysis include a table with a row for each separate Lilliefors test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below. Lilliefors Test - RESULTS - SQL On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on SQL: Figure 215: Lillefors Test > Results > SQL The series of SQL statements comprise the Lilliefors Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used. Lilliefors Test - RESULTS - Data On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on data: Figure 216: Lillefors Test > Results > Data The output table is generated by the Analysis for each separate Lilliefors test on all distinctvalue group-by variables. Output Columns - Lilliefors Test Analysis The following table is built in the requested Output Database by the Lilliefors test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise Lilliefors will be the UPI. 
Table 99: Lilliefors Test Analysis: Output Columns
Name | Type | Definition
Lilliefors | Float | Lilliefors Value
M | INTEGER | Count
LillieforsPValue | Float | The probability associated with the Lilliefors statistic
LillieforsPText | Char | Text description if P is outside table range
LillieforsCallP_0.05 | Char | The Lilliefors result: a=accept, p=reject

Tutorial - Lilliefors Test Analysis

In this example, a Lilliefors test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Lilliefors Test analysis as follows:
• Available Tables — twm_customer_analysis
• Column of Interest — income
• Group By Columns — years_with_bank
• Analysis Parameters
  • Threshold Probability — 0.05

Run the analysis and click on Results when it completes. For this example, the Lilliefors Test analysis generated the following table. The Lilliefors Test was computed for each distinct value of the group-by variable “years_with_bank”. Results were sorted by years_with_bank. The tests show that customer incomes were not normally distributed for any value of years_with_bank except 9. A ‘p’ means significantly nonnormal and an ‘a’ means accept the null hypothesis of normality.
Note: The SQL is available for viewing but not listed below.

Table 100: Lilliefors Test
years_with_bank | Lilliefors | M | LillieforsPValue | LillieforsPText | LillieforsCallP_0.05
0 | 0.166465166 | 88 | 0.01 | <0.01 | p
1 | 0.123396019 | 87 | 0.01 | <0.01 | p
2 | 0.146792366 | 94 | 0.01 | <0.01 | p
3 | 0.156845809 | 86 | 0.01 | <0.01 | p
4 | 0.192756959 | 78 | 0.01 | <0.01 | p
5 | 0.144308699 | 82 | 0.01 | <0.01 | p
6 | 0.125268495 | 83 | 0.01 | <0.01 | p
7 | 0.141128127 | 65 | 0.01 | <0.01 | p
8 | 0.191869596 | 45 | 0.01 | <0.01 | p
9 | 0.111526787 | 39 | 0.20 | >0.20 | a

Shapiro-Wilk Test

The Shapiro-Wilk W test is designed to detect departures from normality without requiring that the mean or variance of the hypothesized normal distribution be specified in advance. It is considered to be one of the best omnibus tests of normality. The function is based on the approximations and code given by Royston (1982a, b). It can be used in samples as large as 2,000 or as small as 3. Royston (1982b) gives approximations and tabled values that can be used to compute the coefficients, and obtains the significance level of the W statistic. Small values of W are evidence of departure from normality. This test has done very well in comparison studies with other goodness-of-fit tests.

In general, either the Shapiro-Wilk or D'Agostino-Pearson test is a powerful overall test for normality. As omnibus tests, however, they will not indicate the type of nonnormality, e.g. whether the distribution is skewed as opposed to heavy-tailed (or both). Examination of the calculated skewness and kurtosis, and of the histogram, boxplot, and normal probability plot for the data may provide clues as to why the data failed the Shapiro-Wilk or D'Agostino-Pearson test.

The standard algorithm for the Shapiro-Wilk test only applies to sample sizes from 3 to 2,000. For larger sample sizes a different normality test should be used, such as one whose test statistic is based on the Kolmogorov-Smirnov statistic for a normal distribution with the same mean and variance as the sample mean and variance.
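The Lilliefors results above and the Shapiro-Wilk W statistic just introduced can both be cross-checked outside the database with common Python libraries. The sketch below is only an illustration under that assumption: statsmodels and scipy are not part of the product, and their table-based p-values will not exactly match the values produced by the generated SQL.

import numpy as np
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import lilliefors

# Sketch only: two normality checks on a single numeric sample.
def normality_checks(x, alpha=0.05):
    x = np.asarray(x, dtype=float)
    lf_stat, lf_p = lilliefors(x, dist="norm")   # KS statistic with estimated mean/variance
    w_stat, w_p = shapiro(x)                     # Shapiro-Wilk W, for 3 <= n <= 2000
    return {"Lilliefors": lf_stat, "LillieforsPValue": lf_p,
            "LillieforsCall": "p" if lf_p < alpha else "a",
            "Shw": w_stat, "ShapiroWilkPValue": w_p,
            "ShapiroWilkCall": "p" if w_p < alpha else "a"}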
Initiate a Shapiro-Wilk Test Use the following procedure to initiate a new Shapiro-Wilk Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 217: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Kolmogorov-Smirnov Tests: Teradata Warehouse Miner User Guide - Volume 1 311 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Figure 218: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests 3 This will bring up the Kolmogorov-Smirnov Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Shapiro-Wilk Test - INPUT - Data Selection On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on data selection: Figure 219: Shapiro-Wilk Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis 312 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests that is ordinarily produced by a Select statement. For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the Tests of the Kolmogorov-Smirnov Type available (Kolmogorov-Smirnov, Lilliefors, Shapiro-Wilk, D'Agostino-Pearson, Smirnov). Select “Shapiro-Wilk”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Group By Columns. Make sure you have the correct portion of the window highlighted. • Column of Interest — The column that specifies the numeric variable to be tested for normality. • Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category. 
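As with the other tests in this chapter, the Group By Columns selection means one test per distinct value combination of the grouping variables. The following sketch illustrates that splitting in Python, assuming a pandas data frame df holding the input table; it is an illustration only, not the generated SQL, and the column names in the example call come from the tutorial later in this section.

import pandas as pd
from scipy.stats import shapiro

# Sketch only: one Shapiro-Wilk test per distinct value of the grouping column.
def shapiro_by_group(df, column_of_interest, group_by, alpha=0.05):
    rows = []
    for key, grp in df.groupby(group_by):
        w, p = shapiro(grp[column_of_interest].dropna())
        rows.append({group_by: key, "Shw": w, "ShapiroWilkPValue": p,
                     "Call": "p" if p < alpha else "a"})
    return pd.DataFrame(rows)

# For example: shapiro_by_group(df, "income", "years_with_bank")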
Shapiro-Wilk Test - INPUT - Analysis Parameters On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on analysis parameters: Figure 220: Shapiro-Wilk Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. Shapiro-Wilk Test - OUTPUT On the Kolmogorov-Smirnov Tests dialog click on OUTPUT: Teradata Warehouse Miner User Guide - Volume 1 313 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Figure 221: Shapiro-Wilk Test > Output On this screen select: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: 314 • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. (For more information, please refer to “Stored Procedure Support” on page 109 of the Teradata Warehouse Miner User Guide (Volume 1)). • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Run the Shapiro-Wilk Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. 
To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard

Results - Shapiro-Wilk Analysis

The results of running the Shapiro-Wilk Test analysis include a table with a row for each separate Shapiro-Wilk test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below.

Shapiro-Wilk Test - RESULTS - SQL

On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on SQL:
Figure 222: Shapiro-Wilk Test > Results > SQL
The series of SQL statements comprises the Shapiro-Wilk Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used.

Shapiro-Wilk Test - RESULTS - data

On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on data:
Figure 223: Shapiro-Wilk Test > Results > data
The output table is generated for each separate Shapiro-Wilk test on all distinct-value group-by variables.

Output Columns - Shapiro-Wilk Test Analysis

The following table is built in the requested Output Database by the Shapiro-Wilk test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise Shw will be the UPI.

Table 101: Shapiro-Wilk Test Analysis: Output Columns
Name | Type | Definition
Shw | Float | Shapiro-Wilk Value
N | INTEGER | Count
ShapiroWilkPValue | Float | The probability associated with the Shapiro-Wilk statistic
ShapiroWilkPText | Char | Text description if P is outside table range
ShapiroWilkCallP_0.05 | Char | The Shapiro-Wilk result: a=accept, p=reject

Tutorial - Shapiro-Wilk Test Analysis

In this example, a Shapiro-Wilk test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Shapiro-Wilk Test analysis as follows:
• Available Tables — twm_customer_analysis
• Column of Interest — income
• Group By Columns — years_with_bank
• Analysis Parameters
  • Threshold Probability — 0.05

Run the analysis and click on Results when it completes. For this example, the Shapiro-Wilk Test analysis generated the following table. The Shapiro-Wilk Test was computed for each distinct value of the group-by variable “years_with_bank”. Results were sorted by years_with_bank. The tests show that customer incomes were not normally distributed in any group. A ‘p’ means significantly nonnormal and an ‘a’ means accept the null hypothesis of normality.
Note: The SQL is available for viewing but not listed below.

Table 102: Shapiro-Wilk Test
years_with_bank | Shw | N | ShapiroWilkPValue | ShapiroWilkPText | ShapiroWilkCallP_0.05
0 | 0.84919004 | 88 | 0.000001 | | p
1 | 0.843099681 | 87 | 0.000001 | | p
2 | 0.831069533 | 94 | 0.000001 | | p
3 | 0.838965439 | 86 | 0.000001 | | p
4 | 0.707924134 | 78 | 0.000001 | | p
5 | 0.768444329 | 82 | 0.000001 | | p
6 | 0.855276885 | 83 | 0.000001 | | p
7 | 0.827399691 | 65 | 0.000001 | | p
8 | 0.863932178 | 45 | 0.01 | <0.01 | p
9 | 0.930834522 | 39 | 0.029586304 | | p

D'Agostino and Pearson Test

In general, either the Shapiro-Wilk or D'Agostino-Pearson test is a powerful overall test for normality. These tests are designed to detect departures from normality without requiring that the mean or variance of the hypothesized normal distribution be specified in advance.
Though these tests cannot indicate the type of nonnormality, they tend to be more powerful than the Kolmogorov-Smirnov test. The D'Agostino-Pearson Ksquared statistic has approximately a chi-squared distribution with 2 df when the population is normally distributed. Initiate a D'Agostino and Pearson Test Use the following procedure to initiate a new D'Agostino and Pearson Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 224: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Kolmogorov-Smirnov Tests: Teradata Warehouse Miner User Guide - Volume 1 317 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Figure 225: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests 3 This will bring up the Kolmogorov-Smirnov Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. D'Agostino and Pearson Test - INPUT - Data Selection On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on data selection: Figure 226: D'Agostino and Pearson Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis 318 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests that is ordinarily produced by a Select statement. For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the Tests of the Kolmogorov-Smirnov Type available (Kolmogorov-Smirnov, Lilliefors, Shapiro-Wilk, D'Agostino-Pearson, Smirnov). Select “D'Agostino and Pearson”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Group By Columns. Make sure you have the correct portion of the window highlighted. • Column of Interest — The column that specifies the numeric variable to be tested for normality. 
• Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category. D'Agostino and Pearson Test - INPUT - Analysis Parameters On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on analysis parameters: Figure 227: D'Agostino and Pearson Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. Teradata Warehouse Miner User Guide - Volume 1 319 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests D'Agostino and Pearson Test - OUTPUT On the Kolmogorov-Smirnov Tests dialog click on OUTPUT: Figure 228: D'Agostino and Pearson Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: 320 • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. (For more information, please refer to “Stored Procedure Support” on page 109 of the Teradata Warehouse Miner User Guide (Volume 1)). • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. 
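The output table described in the next section reports the K-squared statistic T together with Zskew and Zkurtosis components. Assuming the usual D'Agostino-Pearson construction, in which T is the sum of the squared skewness and kurtosis Z statistics, these quantities can be reproduced outside the database with scipy. The sketch below is an illustration of that construction, not the SQL the analysis generates, and its p-value may differ slightly from the product's table-based value.

import numpy as np
from scipy.stats import skewtest, kurtosistest, chi2

# Sketch only: the D'Agostino-Pearson K-squared test and its components.
def dagostino_pearson(x, alpha=0.05):
    x = np.asarray(x, dtype=float)
    z_skew, _ = skewtest(x)            # Z of skewness
    z_kurt, _ = kurtosistest(x)        # Z of kurtosis
    t = z_skew**2 + z_kurt**2          # approximately chi-squared with 2 df under normality
    p = chi2.sf(t, df=2)               # same p-value as scipy.stats.normaltest(x)
    return {"T": t, "Zskew": z_skew, "Zkurtosis": z_kurt,
            "ChiPValue": p, "Call": "p" if p < alpha else "a"}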
Run the D'Agostino and Pearson Test Analysis

After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard

Results - D'Agostino and Pearson Test Analysis

The results of running the D'Agostino and Pearson Test analysis include a table with a row for each separate D'Agostino and Pearson test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below.

D'Agostino and Pearson Test - RESULTS - SQL

On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on SQL:
Figure 229: D'Agostino and Pearson Test > Results > SQL
The series of SQL statements comprises the D'Agostino and Pearson Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used.

D'Agostino and Pearson Test - RESULTS - data

On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on data:
Figure 230: D'Agostino and Pearson Test > Results > data
The output table is generated by the Analysis for each separate D'Agostino and Pearson test on all distinct-value group-by variables.

Output Columns - D'Agostino and Pearson Test Analysis

The following table is built in the requested Output Database by the D'Agostino and Pearson test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise T will be the UPI.

Table 103: D'Agostino and Pearson Test Analysis: Output Columns
Name | Type | Definition
T | Float | K-Squared statistic
Zkurtosis | Float | Z of kurtosis
Zskew | Float | Z of Skewness
ChiPValue | Float | The probability associated with the K-Squared statistic
ChiPText | Char | Text description if P is outside table range
ChiCallP_0.05 | Char | The D'Agostino-Pearson result: a=accept, p=reject

Tutorial - D'Agostino and Pearson Test Analysis

In this example, a D'Agostino and Pearson test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a D'Agostino and Pearson Test analysis as follows:
• Available Tables — twm_customer_analysis
• Column of Interest — income
• Group By Columns — years_with_bank
• Analysis Parameters
  • Threshold Probability — 0.05

Run the analysis and click on Results when it completes. For this example, the D'Agostino and Pearson Test analysis generated the following table. The D'Agostino and Pearson Test was computed for each distinct value of the group-by variable “years_with_bank”. Results were sorted by years_with_bank. The tests show that customer incomes were not normally distributed except for years_with_bank = 9. A ‘p’ means significantly nonnormal and an ‘a’ means accept the null hypothesis of normality. The SQL is available for viewing but not listed below.
Table 104: D'Agostino and Pearson Test: Output Columns years_with_bank T Zkurtosis Zskew ChiPValue ChiPText ChiCallP_0.05 0 29.05255 2.71261 4.65771 0.0001 <0.0001 p 1 34.18025 3.30609 4.82183 0.0001 <0.0001 p 2 30.71123 2.78588 4.79062 0.0001 <0.0001 p 3 32.81104 3.06954 4.83621 0.0001 <0.0001 p 322 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Table 104: D'Agostino and Pearson Test: Output Columns years_with_bank T Zkurtosis Zskew ChiPValue ChiPText ChiCallP_0.05 4 82.01928 5.72010 7.02137 0.0001 <0.0001 p 5 62.36861 4.91949 6.17796 0.0001 <0.0001 p 6 24.80241 2.40521 4.36089 0.0001 <0.0001 p 7 17.72275 1.83396 3.78937 0.00019 p 8 6.55032 -0.23415 2.54863 0.03992 p 9 3.32886 -0.68112 1.69261 0.20447 a Smirnov Test The Smirnov test (aka “two-sample Kolmogorov-Smirnov test”) checks whether two datasets have a significantly different distribution. The tests have the advantage of making no assumption about the distribution of data. (non-parametric and distribution free). Note that this generality comes at some cost: other tests (e.g. the Student's t-test) may be more sensitive if the data meet the requirements of the test. Initiate a Smirnov Test Use the following procedure to initiate a new Smirnov Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 231: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Kolmogorov-Smirnov Tests: Teradata Warehouse Miner User Guide - Volume 1 323 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Figure 232: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests 3 This will bring up the Kolmogorov-Smirnov Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Smirnov Test - INPUT - Data Selection On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on data selection: Figure 233: Smirnov Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis 324 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests that is ordinarily produced by a Select statement. For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. 
• Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the Tests of the Kolmogorov-Smirnov Type available (Kolmogorov-Smirnov, Lilliefors, Shapiro-Wilk, D'Agostino-Pearson, Smirnov). Select “Smirnov”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Columns, Group By Columns. Make sure you have the correct portion of the window highlighted. • Column of Interest — The column that specifies the numeric variable to be tested for normality. • Columns — The column specifying the 2-category variable that identifies the distribution to which the column of interest belongs. • Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category. Smirnov Test - INPUT - Analysis Parameters On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on analysis parameters: Figure 234: Smirnov Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. Teradata Warehouse Miner User Guide - Volume 1 325 Chapter 3: Statistical Tests Kolmogorov-Smirnov Tests Smirnov Test - OUTPUT On the Kolmogorov-Smirnov Tests dialog click on OUTPUT: Figure 235: Smirnov Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: 326 • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. (For more information, please refer to “Stored Procedure Support” on page 109 of the Teradata Warehouse Miner User Guide (Volume 1)). • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. 
This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the Smirnov Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Smirnov Test Analysis The results of running the Smirnov Test analysis include a table with a row for each separate Smirnov test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below. Smirnov Test - RESULTS - SQL On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on SQL: Figure 236: Smirnov Test > Results > SQL The series of SQL statements comprises the Smirnov Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used. Smirnov Test - RESULTS - data On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on data: Figure 237: Smirnov Test > Results > data The output table is generated by the Analysis for each separate Smirnov test on all distinct-value group-by variables. Output Columns - Smirnov Test Analysis The following table is built in the requested Output Database by the Smirnov test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise M will be the UPI.
Table 105: Smirnov Test Analysis: Output Columns
Name Type Definition
M Integer Number of first distribution observations
N Integer Number of second distribution observations
D Float D Statistic (the largest absolute difference between the empirical distribution functions of the two samples)
SmirnovPValue Float The probability associated with the D statistic
SmirnovPText Char Text description if P is outside table range
SmirnovCallP_0.01 Char The Smirnov result: a=accept, p=reject
Tutorial - Smirnov Test Analysis In this example, a Smirnov test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Smirnov Test analysis as follows: • Available Tables — twm_customer • Column of Interest — income • Columns — gender • Group By Columns — years_with_bank • Analysis Parameters • Threshold Probability — 0.05 Run the analysis and click on Results when it completes. For this example, the Smirnov Test analysis generated the following table. The Smirnov Test was computed for each distinct value of the group by variable “years_with_bank”. Results were sorted by years_with_bank. The tests show that the distributions of incomes of males and females were different for all values of years_with_bank.
Here a ‘p’ means the two distributions are significantly different and an ‘a’ means accept the null hypothesis that the distributions are the same. The SQL is available for viewing but not listed below.
Table 106: Smirnov Test (the SmirnovPText column is blank in every row)
years_with_bank M N D SmirnovPValue SmirnovPText SmirnovCallP_0.01
0 37 51 1.422949567 0.000101 p
1 39 48 1.371667516 0.000103 p
2 37 57 1.465841724 0.000101 p
3 40 46 1.409836326 0.000105 p
4 39 39 1.397308541 0.000146 p
5 36 46 1.309704108 0.000105 p
6 37 46 1.287964978 0.000104 p
7 29 36 1.336945293 0.000112 p
8 19 26 1.448297864 0.00011 p
9 16 23 1.403341724 0.000101 p
Tests Based on Contingency Tables Tests Based on Contingency Tables are based on an array or matrix of numbers which represent counts or frequencies. The tests basically evaluate the matrix to detect whether there is a nonrandom pattern of frequencies. Chi Square Test The most common application for chi-square is in comparing observed counts of particular cases to the expected counts. For example, a random sample of people would contain m males and f females, but usually we would not find exactly m=½N and f=½N. We could use the chi-squared test to determine if the difference were significant enough to rule out the 50/50 hypothesis. The Chi Square Test determines whether the probabilities observed from data in an RxC contingency table are the same or different. The null hypothesis is that the probabilities observed are the same. Output is a p-value which, when compared to the user’s threshold, determines whether the null hypothesis should be rejected. Other Calculated Measures of Association • Phi coefficient — The Phi coefficient is a measure of the degree of association between two binary variables, and represents the correlation between two dichotomous variables. It is based on adjusting chi-square significance to factor out sample size, and is the same as the Pearson correlation for two dichotomous variables. • Cramer’s V — Cramer’s V is used to examine the association between two categorical variables when there is more than a 2 X 2 contingency (e.g., 2 X 3). In these more complex designs, phi is not appropriate, but Cramer’s statistic is. Cramer’s V represents the association or correlation between two variables. Cramer’s V is the most popular of the chi-square-based measures of nominal association, designed so that the attainable upper limit is always 1. • Likelihood Ratio Chi Square — Likelihood ratio chi-square is an alternative test of the hypothesis of no association of columns and rows in nominal-level tabular data. It is based on maximum likelihood estimation, and involves the ratio between the observed and the expected frequencies, whereas the ordinary chi-square test involves the difference between the two. This is a more recent version of chi-square and is directly related to log-linear analysis and logistic regression. • Continuity-Adjusted Chi-Square — The continuity-adjusted chi-square statistic for 2 × 2 tables is similar to the Pearson chi-square, except that it is adjusted for the continuity of the chi-square distribution. The continuity-adjusted chi-square is most useful for small sample sizes.
The use of the continuity adjustment is controversial; this chi-square test is more conservative, and more like Fisher's exact test, when your sample size is small. As the sample size increases, the statistic becomes more and more like the Pearson chisquare. • Contingency Coefficient — The contingency coefficient is an adjustment to phi coefficient, intended for tables larger than 2-by-2. It is always less than 1 and approaches 1.0 only for large tables. The larger the contingency coefficient, the stronger the association. Recommended only for 5-by-5 tables or larger, for smaller tables it underestimates level of association. Initiate a Chi Square Test Use the following procedure to initiate a new Chi Square Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 238: Add New Analysis from toolbar 2 330 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Tests Based on Contingency Tables: Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Tests Based on Contingency Tables Figure 239: Add New Analysis > Statistical Tests > Tests Based on Contingency Tables 3 This will bring up the Tests Based on Contingency Tables dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Chi Square Test - INPUT - Data Selection On the Tests Based on Contingency Tables dialog click on INPUT and then click on data selection: Figure 240: Chi Square Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis Teradata Warehouse Miner User Guide - Volume 1 331 Chapter 3: Statistical Tests Tests Based on Contingency Tables that is ordinarily produced by a Select statement. For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the Tests Based on Contingency Tables available (Chi Square, Median). Select “Chi Square”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. 
• First Columns/Second Columns — Note that the Selected Columns window is actually a split window; you can insert columns as First Columns, Second Columns. Make sure you have the correct portion of the window highlighted. • First Columns — The set of columns that specifies the first of a pair of variables for Chi Square analysis. • Second Columns — The set of columns that specifies the second of a pair of variables for Chi Square analysis. Each combination of the first and second variables will generate a separate Chi Square test. (Limitation: to avoid excessively long execution, the number of combinations is limited to 100, and unless the product of the number of distinct values of each pair is 2000 or less, the calculation will be skipped.) Note: Group-By Columns are not available in the Chi Square Test. Chi Square Test - INPUT - Analysis Parameters On the Tests Based on Contingency Tables dialog click on INPUT and then click on analysis parameters: Figure 241: Chi Square Test > Input > Analysis Parameters On this screen enter or select: • Processing Options 332 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Tests Based on Contingency Tables • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. Chi Square Test - OUTPUT On the Tests Based on Contingency Tables dialog click on OUTPUT: Figure 242: Chi Square Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. (For more information, please refer to “Stored Procedure Support” on page 109 of the Teradata Warehouse Miner User Guide (Volume 1)). • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). 
• Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Teradata Warehouse Miner User Guide - Volume 1 333 Chapter 3: Statistical Tests Tests Based on Contingency Tables Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the Chi Square Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Chi Square Analysis The results of running the Chi Square Test analysis include a table with a row for each separate Chi Square test on all pairs of selected variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below. Chi Square Test - RESULTS - SQL On the Tests Based on Contingency Tables dialog click on RESULTS and then click on SQL: Figure 243: Chi Square Test > Results > SQL The series of SQL statements comprise the Chi Square Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used. Chi Square Test - RESULTS - data On the Tests Based on Contingency Tables dialog click on RESULTS and then click on data: Figure 244: Chi Square Test > Results > data The output table is generated by the Analysis for each separate Chi Square test on all pairs of selected variables 334 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Tests Based on Contingency Tables Output Columns - Chi Square Test Analysis The following table is built in the requested Output Database by the Chi Square test analysis. Column1 will be the Unique Primary Index (UPI). Table 107: Chi Square Test Analysis: Output Columns Name Type Definition column1 Char First of pair of variables column2 Char Second of pair of variables Chisq Float Chi Square Value DF INTEGER Degrees of Freedom Z Float Z Score CramersV Float § Cramer’s V PhiCoeff Float § Phi coefficient LlhChiSq Float Likelihood Ratio Chi Square ContAdjChiSq Float § Continuity-Adjusted Chi-Square ContinCoeff Float § Contingency Coefficient ChiPValue Float The probability associated with the Chi Square statistic ChiPText Char Text description if P is outside table range ChiCallP_0.05 Char The Chi Square result: a=accept, p=reject (positive), n=reject(negative) Tutorial - Chi Square Test Analysis In this example, a Chi Square test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Chi Square Test analysis as follows: • Available Tables — twm_customer_analysis • First Columns — female, single • Second Columns — svacct, ccacct, ckacct • Analysis Parameters • Threshold Probability — 0.05 Run the analysis and click on Results when it completes. For this example, the Chi Square Test analysis generated the following table. The Chi Square Test was computed on all combinations of pairs of the two sets of variables. Results were sorted by column1 and column2. The tests shows that probabilities observed are the same for three pairs of variables and different for three other pairs. 
A ‘p’ means significantly different and an ‘a’ means not significantly different. The SQL is available for viewing but not listed below. Teradata Warehouse Miner User Guide - Volume 1 335 Chapter 3: Statistical Tests Tests Based on Contingency Tables Table 108: Chi Square Test (Part 1) column1 column2 Chisq DF Z CramersV PhiCoeff LlhChiSq female ccacct 3.2131312 1 1.480358596 0.065584911 0.065584911 3.21543611 female ckacct 8.2389731 1 2.634555949 0.105021023 0.105021023 8.23745744 female svacct 3.9961257 1 1.716382791 0.073140727 0.073140727 3.98861957 single ccacct 6.9958187 1 2.407215881 0.096774063 0.096774063 7.01100739 single ckacct 0.6545145 1 0.191899245 0.02960052 0.02960052 0.65371179 single svacct 1.5387084 1 0.799100586 0.045385576 0.045385576 1.53297321 Table 109: Chi Square Test (Part 2) column1 column2 ContAdjChiSq ContinCoeff ChiPValue ChiPText female ccacct 2.954339388 0.065444311 0.077657185 a female ckacct 7.817638955 0.10444661 0.004512106 p female svacct 3.697357526 0.072945873 0.046729867 p single ccacct 6.600561728 0.096324066 0.00854992 p single ckacct 0.536617115 0.029587561 0.25 single svacct 1.35045989 0.045338905 0.226624385 >0.25 ChiCallP_0.05 a a Median Test The Median test is a special case of the chi-square test with fixed marginal totals. It tests whether several samples came from populations with the same median. The null hypothesis is that all samples have the same median. The median test is applied for data in similar cases as for the ANOVA for independent samples, but when: 1 the data are either importantly non-normally distributed 2 the measurement scale of the dependent variable is ordinal (not interval or ratio) 3 or the data sample is too small. Note: The Median test is a less powerful non-parametric test than alternative rank tests due to the fact the dependent variable is dichotomized at the median. Because this technique tends to discard most of the information inherent in the data, it is less often used. Frequencies are evaluated by a simple 2 x 2 contingency table, so it becomes simply a 2 x 2 chi square test of independence with 1 DF. Given k independent samples of numeric values, a Median test is produced for each set of unique values of the group-by variables (GBV's), if any, testing whether all the populations have the same median. Output for each set of unique values of the GBV's is a p-value, which 336 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Tests Based on Contingency Tables when compared to the user’s threshold, determines whether the null hypothesis should be rejected for the unique set of values of the GBV's. For more than 2 samples, this is sometimes called the Brown-Mood test. Initiate a Median Test Use the following procedure to initiate a new Median Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 245: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Tests Based on Contingency Tables: Figure 246: Add New Analysis > Statistical Tests > Tests Based On Contingency Tables 3 This will bring up the Tests Based on Contingency Tables dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. 
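As background for the Median test described above (standard textbook form, not a quotation of the generated SQL), the counts of values above and not above the pooled median in each of the k samples form a 2 x k contingency table that is evaluated with the usual Pearson chi-square statistic:
  ChiSq = sum over all cells of (Observed - Expected)² / Expected
with (2 - 1) x (k - 1) = k - 1 degrees of freedom, which reduces to the single degree of freedom mentioned above when there are only two samples.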
Teradata Warehouse Miner User Guide - Volume 1 337 Chapter 3: Statistical Tests Tests Based on Contingency Tables Median Test - INPUT - Data Selection On the Tests Based on Contingency Tables dialog click on INPUT and then click on data selection: Figure 247: Median Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the Tests Based on Contingency Tables available (Chi Square, Median). Select “Median”. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Columns and Group By Columns. Make sure you have the correct portion of the window highlighted. 338 • Column of Interest — The numeric dependent variable for Median analysis. • Columns — The set of categorical independent variables for Median analysis. Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Tests Based on Contingency Tables • Group By Columns — The column(s) that specifies the variable(s) whose distinct value combinations will categorize the data, so a separate test is performed on each category. Median Test - INPUT - Analysis Parameters On the Tests Based on Contingency Tables dialog click on INPUT and then click on analysis parameters: Figure 248: Median Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. Median Test - OUTPUT On the Tests Based on Contingency Tables dialog click on OUTPUT: Figure 249: Median Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: • Database Name — The database where the output table will be saved. • Output Name — The table name that the output will be saved under. 
• Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. (For more information, please refer to “Stored Procedure Support” on page 109 of the Teradata Warehouse Miner User Guide (Volume 1)). Teradata Warehouse Miner User Guide - Volume 1 339 Chapter 3: Statistical Tests Tests Based on Contingency Tables • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the Median Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Median Analysis The results of running the Median Test analysis include a table with a row for each separate Median test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below. Median Test - RESULTS - SQL On the Tests Based on Contingency Tables dialog click on RESULTS and then click on SQL: Figure 250: Median Test > Results > SQL 340 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Tests Based on Contingency Tables The series of SQL statements comprise the Median Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used. Median Test - RESULTS - data On the Tests Based on Contingency Tables dialog click on RESULTS and then click on data: Figure 251: Median Test > Results > data The output table is generated by the Analysis for each group-by variable combination. 
Output Columns - Median Test Analysis The following table is built in the requested Output Database by the Median Test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise ChiSq will be the UPI. Table 110: Median Test Analysis: Output Columns Name Type Definition Chisq Float Chi Square Value DF INTEGER Degrees of Freedom MedianPValue Float The probability associated with the Chi Square statistic MedianPText Char Text description if P is outside table range MedianCallP_0.01 Char The Chi Square result: a=accept, p=reject (positive), n=reject(negative) Tutorial - Median Test Analysis In this example, a Median test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Median Test analysis as follows: • Available Tables — twm_customer • Column of Interest — income • Columns — marital_status • Group By Columns — years_with_bank • Analysis Parameters • Threshold Probability — 0.01 Run the analysis and click on Results when it completes. For this example, the Median Test analysis generated the following table. The Median Test was computed on income over marital_status by years_with_bank. Teradata Warehouse Miner User Guide - Volume 1 341 Chapter 3: Statistical Tests Rank Tests Results were sorted by years_with_bank. The tests shows that values came from populations with the same median where MedianCallP_0.01 = ‘a’ (accept null hypothesis) and from populations with different medians where it is ‘p’ (reject null hypothesis). The SQL is available for viewing but not listed below. Table 111: Median Test years_with_bank ChiSq DF MedianPValue MedianPText MedianCallP_0.01 0 12.13288563 3 0.007361344 p 1 12.96799683 3 0.004848392 p 2 13.12480388 3 0.004665414 p 3 8.504645761 3 0.038753824 a 4 4.458333333 3 0.225502846 a 5 15.81395349 3 0.001527445 p 6 4.531466733 3 0.220383974 a 7 11.35971787 3 0.009950322 p 8 2.855999742 3 0.25 >0.25 a 9 2.23340311 3 0.25 >0.25 a Rank Tests Tests Based on Ranks use the ranks of the data rather than the data itself to calculate statistics. Therefore the data must have at least an ordinal scale of measurement. If data are nonnumeric but ordinal and ranked, these rank tests may be the most powerful tests available. Even numeric variables which meet the requirements of parametric tests, such as independent, randomly distributed normal variables, can be efficiently analyzed by these tests. These rank tests are valid for variables which are continuous, discrete, or a mixture of both. Types of Rank tests supported by Teradata Warehouse Miner include: • Mann-Whitney/Kruskal-Wallis • Mann-Whitney/Kruskal-Wallis (Independent Tests) • Wilcoxon Signed Rank • Friedman Mann-Whitney/Kruskal-Wallis Test The selection of which test to execute is automatically based on the number of distinct values of the independent variable. The Mann-Whitney is used for two groups, the Kruskal-Wallis for three or more groups. 342 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Rank Tests A special version of the Mann-Whitney/Kruskal-Wallis test performs a separate, independent test for each independent variable, and displays the result of each test with its accompanying column name. Under the primary version of the Mann-Whitney/Kruskal-Wallis test, all independent variable value combinations are used, often forcing the Kruskal-Wallis test, since the number of value combinations exceeds two. 
When a variable which has more than two distinct values is included in the set of independent variables, then the Kruskal-Wallis test is performed for all variables. Since Kruskal-Wallis is a generalization of Mann-Whitney, the Kruskal-Wallis results are valid for all the variables, including two-valued ones. In the discussion below, both types of Mann-Whitney/Kruskal-Wallis are referred to as MannWhitney/Kruskal-Wallis tests, since the only difference is the way the independent variable is treated. The Mann-Whitney test, AKA Wilcoxon Two Sample Test, is the nonparametric analog of the 2-sample t test. It is used to compare two independent groups of sampled data, and tests whether they are from the same population or from different populations, i.e. whether the samples have the same distribution function. Unlike the parametric t-test, this non-parametric test makes no assumptions about the distribution of the data (e.g., normality). It is to be used as an alternative to the independent group t-test, when the assumption of normality or equality of variance is not met. Like many non-parametric tests, it uses the ranks of the data rather than the data itself to calculate the U statistic. But since the Mann-Whitney test makes no distribution assumption, it is less powerful than the t-test. On the other hand, the MannWhitney is more powerful than the t-test when parametric assumptions are not met. Another advantage is that it will provide the same results under any monotonic transformation of the data so the results of the test are more generalizable. The Mann-Whitney is used when the independent variable is nominal or ordinal and the dependent variable is ordinal (or treated as ordinal). The main assumption is that the variable on which the 2 groups are to be compared is continuously distributed. This variable may be non-numeric, and if so, is converted to a rank based on alphanumeric precedence. The null hypothesis is that both samples have the same distribution. The alternative hypotheses are that the distributions differ from each other in either direction (two-tailed test), or in a specific direction (upper-tailed or lower-tailed tests). Output is a p-value, which when compared to the user’s threshold, determines whether the null hypothesis should be rejected. Given one or more columns (independent variables) whose values define two independent groups of sampled data, and a column (dependent variable) whose distribution is of interest from the same input table, the Mann-Whitney test is performed for each set of unique values of the group-by variables (GBV's), if any. The Kruskal-Wallis test is the nonparametric analog of the one-way analysis of variance or Ftest used to compare three or more independent groups of sampled data. When there are only two groups, it reduces to the Mann-Whitney test (above). The Kruskal-Wallis test tests whether multiple samples of data are from the same population or from different populations, i.e. whether the samples have the same distribution function. Unlike the parametric independent group ANOVA (one way ANOVA), this non-parametric test makes no assumptions about the distribution of the data (e.g., normality). Since this test does not make a distributional assumption, it is not as powerful as ANOVA. 
Teradata Warehouse Miner User Guide - Volume 1 343 Chapter 3: Statistical Tests Rank Tests Given k independent samples of numeric values, a Kruskal-Wallis test is produced for each set of unique values of the GBV's, testing whether all the populations are identical. This test variable may be non-numeric, and if so, is converted to a rank based on alphanumeric precedence. The null hypothesis is that all samples have the same distribution. The alternative hypotheses are that the distributions differ from each other. Output for each unique set of values of the GBV's is a statistic H, and a p-value, which when compared to the user’s threshold, determines whether the null hypothesis should be rejected for the unique set of values of the GBV's. Initiate a Mann-Whitney/Kruskal-Wallis Test Use the following procedure to initiate a new Mann-Whitney/Kruskal-Wallis Test in Teradata Warehouse Miner: 1 Click on the Add New Analysis icon in the toolbar: Figure 252: Add New Analysis from toolbar 2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Rank Tests: Figure 253: Add New Analysis > Statistical Tests > Rank Tests 344 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Rank Tests 3 This will bring up the Rank Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections. Mann-Whitney/Kruskal-Wallis Test - INPUT - Data Selection On the Ranks Tests dialog click on INPUT and then click on data selection: Figure 254: Mann-Whitney/Kruskal-Wallis Test > Input > Data Selection On this screen select: 1 Select Input Source Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1). 2 3 Select Columns From a Single Table • Available Databases (or Analyses) — These are the databases (or analyses) available to be processed. • Available Tables — These are the tables and views that are available to be processed. • Available Columns — These are the columns within the table/view that are available for processing. Select Statistical Test Style These are the Tests Based on Ranks available (Mann-Whitney/Kruskal-Wallis, MannWhitney/Kruskal-Wallis Independent Tests, Wilcoxon, Friedman). Select “MannWhitney/Kruskal-Wallis” or Mann-Whitney/Kruskal-Wallis Independent Tests. 4 Select Optional Columns • Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. 
Teradata Warehouse Miner User Guide - Volume 1 345 Chapter 3: Statistical Tests Rank Tests Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Columns or Group By Columns. Make sure you have the correct portion of the window highlighted. • Column of Interest — The column that specifies the dependent variable to be tested. Note that this variable may be non-numeric, but if so, will be converted to a rank based on alphanumeric precedence. • Columns — The columns that specify the independent variables, categorizing the data. • Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category. Mann-Whitney/Kruskal-Wallis Test - INPUT - Analysis Parameters On the Rank Tests dialog click on INPUT and then click on analysis parameters: Figure 255: Mann-Whitney/Kruskal-Wallis Test > Input > Analysis Parameters On this screen enter or select: • Processing Options • Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis. • Single Tail — Select the box if single tailed test is desired (default is two-tailed). The single-tail option is only valid if the test is Mann-Whitney. Mann-Whitney/Kruskal-Wallis Test - OUTPUT On the Rank Tests dialog click on OUTPUT: Figure 256: Mann-Whitney/Kruskal-Wallis Test > Output On this screen select the following options if desired: • Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified: • 346 Database Name — The database where the output table will be saved. Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Rank Tests • Output Name — The table name that the output will be saved under. • Output Type — The output type must be table when storing Statistical Test output in the database. • Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. (For more information, please refer to “Stored Procedure Support” on page 109 of the Teradata Warehouse Miner User Guide (Volume 1)). • Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu). • Create output table using fallback keyword — Fallback keyword will be used to create the table • Create output table using multiset keyword — Multiset keyword will be used to create the table • Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. 
(For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1)). • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. • Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed. Run the Mann-Whitney/Kruskal-Wallis Test Analysis After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either: • Click the Run icon on the toolbar, or • Select Run <project name> on the Project menu, or • Press the F5 key on your keyboard Results - Mann-Whitney/Kruskal-Wallis Test Analysis The results of running the Mann-Whitney/Kruskal-Wallis Test analysis include a table with a row for each separate Mann-Whitney/Kruskal-Wallis test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. In the case of Mann-Whitney/ Teradata Warehouse Miner User Guide - Volume 1 347 Chapter 3: Statistical Tests Rank Tests Kruskal-Wallis Independent Tests, the results will be displayed with a separate row for each independent variable column-name. All of these results are outlined below. Mann-Whitney/Kruskal-Wallis Test - RESULTS - SQL On the Rank Tests dialog click on RESULTS and then click on SQL: Figure 257: Mann-Whitney/Kruskal-Wallis Test > Results > SQL The series of SQL statements comprise the Mann-Whitney/Kruskal-Wallis Test Analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used. Mann-Whitney/Kruskal-Wallis Test - RESULTS - data On the Rank Tests dialog click on RESULTS and then click on data: Figure 258: Mann-Whitney/Kruskal-Wallis Test > Results > data The output table is generated by the Analysis for each separate Mann-Whitney/KruskalWallis test on all distinct-value group-by variables. Output Columns - Mann-Whitney/Kruskal-Wallis Test Analysis The following table is built in the requested Output Database by the Mann-Whitney/KruskalWallis test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise Z will be the UPI. In the case of Mann-Whitney/Kruskal-Wallis Independent Tests, the additional column _twm_independent_variable will contain the column-name of the independent variable for each separate test. 
Table for Mann-Whitney (if two groups) Table 112: Table for Mann-Whitney (if two groups) Name Type Definition Z Float Mann-Whitney Z Value 348 Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Rank Tests Table 112: Table for Mann-Whitney (if two groups) Name Type Definition MannWhitneyPValue Float The probability associated with the Mann-Whitney/Kruskal-Wallis statistic MannWhitneyCallP_0.01 Char The Mann-Whitney/Kruskal-Wallis result: a=accept, p=reject Table 113: Table for Kruskal-Wallis (if more than two groups) Name Type Definition Z Float Kruskal-Wallis Z Value ChiSq Float Kruskal-Wallis Chi Square Statistic DF Integer Degrees of Freedom KruskalWallisPValue Float The probability associated with the Kruskal-Wallis statistic KruskalWallisPText Char The text description of probability if out of table range KruskalWallisCallP_0.01 Char The Kruskal-Wallis result: a=accept, p=reject Tutorial 1 - Mann-Whitney Test Analysis In this example, a Mann-Whitney test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Mann-Whitney Test analysis as follows: • Available Tables — twm_customer • Column of Interest — income • Columns — gender (2 distinct values -> Mann-Whitney test) • Group By Columns — years_with_bank • Analysis Parameters • Threshold Probability — 0.01 • Single Tail — false (default) Run the analysis and click on Results when it completes. For this example, the MannWhitney Test analysis generated the following table. The Mann-Whitney Test was computed for each distinct value of the group by variable “years_with_bank”. Results were sorted by years_with_bank. The tests show that customer incomes by gender were from the same population for all values of years_with_bank (an ‘a’ means accept the null hypothesis). The SQL is available for viewing but not listed below. Table 114: Mann-Whitney Test years_with_bank Z MannWhitneyPValue MannWhitneyCallP_0.01 0 -0.0127 0.9896 a 1 -0.2960 0.7672 a Teradata Warehouse Miner User Guide - Volume 1 349 Chapter 3: Statistical Tests Rank Tests Table 114: Mann-Whitney Test years_with_bank Z MannWhitneyPValue MannWhitneyCallP_0.01 2 -0.4128 0.6796 a 3 -0.6970 0.4858 a 4 -1.8088 0.0705 a 5 -2.2541 0.0242 a 6 -0.8683 0.3854 a 7 -1.7074 0.0878 a 8 -0.8617 0.3887 a 9 -0.4997 0.6171 a Tutorial 2 - Kruskal-Wallis Test Analysis In this example, a Kruskal-Wallis test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Kruskal-Wallis Test analysis as follows: • Available Tables — twm_customer • Column of Interest — income • Columns — marital_status (4 distinct values -> Kruskal-Wallis test) • Group By Columns — years_with_bank • Analysis Parameters • Threshold Probability — 0.01 • Single Tail — false (default) Run the analysis and click on Results when it completes. For this example, the Kruskal-Wallis Test analysis generated the following table. The test was computed for each distinct value of the group by variable “years_with_bank”. Results were sorted by years_with_bank. The tests shows customer incomes by marital_status were from the same population for years_with_ bank 4, 6, 8 and 9. Those with years_with_bank 0-3, 5 and 7 were from different populations for each marital status. An ‘n’ or ‘p’ means significant and an ‘a’ means accept the null hypothesis. The SQL is available for viewing but not listed below. 
Table 115: Kruskal-Wallis Test years_with_bank Z ChiSq DF KruskalWallisPValue 0 3.5507 20.3276 3 0.0002 1 4.0049 24.5773 3 0.0001 2 3.3103 18.2916 3 0.0004 p 3 3.0994 16.6210 3 0.0009 p 4 1.5879 7.5146 3 0.0596 a 350 KruskalWallisPText KruskalWallisCallP_0.01 p <0.0001 p Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Rank Tests Table 115: Kruskal-Wallis Test years_with_bank Z ChiSq DF KruskalWallisPValue KruskalWallisPText KruskalWallisCallP_0.01 5 4.3667 28.3576 3 0.0001 <0.0001 p 6 2.1239 10.2056 3 0.0186 a 7 3.2482 17.7883 3 0.0005 p 8 0.1146 2.6303 3 0.25 >0.25 a 9 -0.1692 2.0436 3 0.25 >0.25 a Tutorial 3 - Mann-Whitney Independent Tests Analysis In this example, a Mann-Whitney Independent Tests analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Mann-Whitney Independent Tests analysis as follows: • Available Tables — twm_customer_analysis • Column of Interest — income • Columns — gender, ccacct, ckacct, svacct • Group By Columns • Analysis Parameters • Threshold Probability — 0.05 • Single Tail — false (default) Run the analysis and click on Results when it completes. For this example, the MannWhitney Independent Tests analysis generated the following table. The Mann-Whitney Test was computed separately for each independent variable. The tests show that customer incomes by gender and by svacct were from different populations, and that customer incomes by ckacct and by ccacct were from identical populations. The SQL is available for viewing but not listed below. Table 116: Mann-Whitney Test _twm_independent Z MannWhitneyPValue MannWhitneyCallP_0.05 gender -3.00331351 0.002673462 n svacct -3.37298401 0.000743646 n ckacct -1.92490664 0.05422922 a ccacct 1.764991014 0.077563672 a Wilcoxon Signed Ranks Test The Wilcoxon Signed Ranks Test is an alternative analogous to the t-test for correlated samples. The correlated-samples t-test makes assumptions about the data, and can be properly applied only if certain assumptions are met: Teradata Warehouse Miner User Guide - Volume 1 351 Chapter 3: Statistical Tests Rank Tests 1 the scale of measurement has the properties of an equal-interval scale 2 differences between paired values are randomly selected from the source population 3 The source population has a normal distribution. If any of these assumptions are invalid, the t-test for correlated samples should not be used. Of cases where these assumptions are unmet, the most common are those where the scale of measurement fails to have equal-interval scale properties, e.g. a case in which the measures are from a rating scale. When data within two correlated samples fail to meet one or another of the assumptions of the t-test, an appropriate non-parametric alternative is the Wilcoxon Signed-Rank Test, a test based on ranks. Assumptions for this test are: 1 The distribution of difference scores is symmetric (implies equal interval scale) 2 difference scores are mutually independent 3 difference scores have the same mean The original measures are replaced with ranks resulting in analysis only of the ordinal relationships. The signed ranks are organized and summed, giving a number, W. When the numbers of positive and negative signs are about equal, i.e. there is no tendency in either direction, the value of W will be near zero, and the null hypothesis will be supported. Positive or negative sums indicate there is a tendency for the ranks to have significance so there is a difference in the cases in the specified direction. 
Given a table name and names of paired numeric columns, a Wilcoxon test is produced. The null hypothesis is that the paired samples come from populations with the same mean or median. The alternative hypothesis is that the samples come from populations with different means or medians (two-tailed test), or additionally that the difference is in a specific direction (upper-tailed or lower-tailed tests). Output is a p-value which, when compared to the user’s threshold, determines whether the null hypothesis should be rejected.

Initiate a Wilcoxon Signed Ranks Test
Use the following procedure to initiate a new Wilcoxon Signed Ranks Test in Teradata Warehouse Miner:
1 Click on the Add New Analysis icon in the toolbar:
Figure 259: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Rank Tests:
Figure 260: Add New Analysis > Statistical Tests > Rank Tests
3 This will bring up the Rank Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections.

Wilcoxon Signed Ranks Test - INPUT - Data Selection
On the Rank Tests dialog click on INPUT and then click on data selection:
Figure 261: Wilcoxon Signed Ranks Test > Input > Data Selection
On this screen select:
1 Select Input Source
Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however, the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1).
2 Select Columns From a Single Table
• Available Databases (or Analyses) — These are the databases (or analyses) available to be processed.
• Available Tables — These are the tables and views that are available to be processed.
• Available Columns — These are the columns within the table/view that are available for processing.
3 Select Statistical Test Style
These are the Tests Based on Ranks available (Mann-Whitney/Kruskal-Wallis, Wilcoxon, Friedman). Select “Wilcoxon”.
4 Select Optional Columns
• Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can insert columns as First Column, Second Column or Group By Columns. Make sure you have the correct portion of the window highlighted.
• First Column — The column that specifies the variable from the first sample.
• Second Column — The column that specifies the variable from the second sample.
• Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so that a separate test is performed on each category.

Wilcoxon Signed Ranks Test - INPUT - Analysis Parameters
On the Rank Tests dialog click on INPUT and then click on analysis parameters:
Figure 262: Wilcoxon Signed Ranks Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
• Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis.
• Single Tail — Select the box if a single-tailed test is desired (default is two-tailed). The single-tail option is only valid if the test is Mann-Whitney.
• Include Zero — The “include zero” option generates a variant of the Wilcoxon in which zero differences are included with the positive count. The default “discard zero” option is the true Wilcoxon.

Wilcoxon Signed Ranks Test - OUTPUT
On the Rank Tests dialog click on OUTPUT:
Figure 263: Wilcoxon Signed Ranks Test > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified:
• Database Name — The database where the output table will be saved.
• Output Name — The table name that the output will be saved under.
• Output Type — The output type must be table when storing Statistical Test output in the database.
• Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. (For more information, please refer to “Stored Procedure Support” on page 109 of the Teradata Warehouse Miner User Guide (Volume 1).)
• Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu.)
• Create output table using fallback keyword — The FALLBACK keyword will be used to create the table.
• Create output table using multiset keyword — The MULTISET keyword will be used to create the table.
• Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1).)
• Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
• Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed.

Run the Wilcoxon Signed Ranks Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard.

Results - Wilcoxon Analysis
The results of running the Wilcoxon Signed Ranks Test analysis include a table with a row for each separate Wilcoxon Signed Ranks Test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below.

Wilcoxon Signed Ranks Test - RESULTS - SQL
On the Rank Tests dialog click on RESULTS and then click on SQL:
Figure 264: Wilcoxon Signed Ranks Test > Results > SQL
The series of SQL statements comprises the Wilcoxon Signed Ranks Test analysis. It is always returned, and is the only item returned when the Generate SQL without Executing option is used.

Wilcoxon Signed Ranks Test - RESULTS - data
On the Rank Tests dialog click on RESULTS and then click on data:
Figure 265: Wilcoxon Signed Ranks Test > Results > data
The output table is generated by the Analysis for each separate Wilcoxon Signed Ranks Test on all distinct-value group-by variables.

Output Columns - Wilcoxon Signed Ranks Test Analysis
The following table is built in the requested Output Database by the Wilcoxon Signed Ranks Test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise Z_ will be the UPI.

Table 117: Wilcoxon Signed Ranks Test Analysis: Output Columns
Name                Type     Definition
N                   Integer  Variable count
Z_                  Float    Wilcoxon Z Value
WilcoxonPValue      Float    The probability associated with the Wilcoxon statistic
WilcoxonCallP_0.05  Char     The Wilcoxon result: a=accept, p or n=reject
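For orientation only, an output table of this shape created with the Multiset and Fallback options and the tutorial's group-by column might be declared roughly as follows. This is an illustrative sketch rather than the DDL actually generated by the analysis; the database name, table name and character width are assumptions, and the group-by column serves as the Unique Primary Index as noted above.

CREATE MULTISET TABLE mydb.twm_wilcoxon_output, FALLBACK        -- MULTISET and FALLBACK options as selected on the OUTPUT screen
  ( years_with_bank        INTEGER,                             -- group-by column, one row per distinct value
    N                      INTEGER,                             -- variable count
    Z_                     FLOAT,                               -- Wilcoxon Z value
    WilcoxonPValue         FLOAT,                               -- probability associated with the statistic
    "WilcoxonCallP_0.05"   CHAR(1)                              -- a = accept, p or n = reject
  )
UNIQUE PRIMARY INDEX ( years_with_bank );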
Tutorial - Wilcoxon Test Analysis
In this example, a Wilcoxon test analysis is performed on the fictitious banking data to analyze account usage. Parameterize a Wilcoxon Test analysis as follows:
• Available Tables — twm_customer_analysis
• First Column — avg_ck_bal
• Second Column — avg_sv_bal
• Group By Columns — years_with_bank
• Analysis Parameters
• Threshold Probability — 0.05
• Single Tail — false (default)
• Include Zero — false (default)
Run the analysis and click on Results when it completes. For this example, the Wilcoxon Test analysis generated the following table. The Wilcoxon Test was computed for each distinct value of the group by variable “years_with_bank”. The tests show that the samples of avg_ck_bal and avg_sv_bal came from populations with the same mean or median for customers with years_with_bank of 0 and 4-9, and from populations with different means or medians for those with years_with_bank of 1-3. An ‘n’ or ‘p’ means significant and an ‘a’ means accept the null hypothesis. The SQL is available for viewing but not listed below.

Table 118: Wilcoxon Test
years_with_bank  N   Z_        WilcoxonPValue  WilcoxonCallP_0.05
0                75  -1.77163  0.07639         a
1                77  -3.52884  0.00042         n
2                83  -2.94428  0.00324         n
3                69  -2.03882  0.04145         n
4                69  -0.56202  0.57412         a
5                67  -1.95832  0.05023         a
6                65  -1.25471  0.20948         a
7                48  -0.44103  0.65921         a
8                39  -1.73042  0.08363         a
9                33  -1.45623  0.14539         a

Friedman Test with Kendall's Coefficient of Concordance & Spearman's Rho
The Friedman test is an extension of the sign test to several treatments observed within independent blocks. It is analogous to the 2-way Analysis of Variance, but depends only on the ranks of the observations, so it is like a 2-way ANOVA on ranks. The Friedman test should not be used for only three treatments due to lack of power, and is best for six or more treatments. It is a test for treatment differences in a randomized, complete block design. The data consist of b mutually independent k-variate random variables called blocks. The Friedman assumptions are that the data in these blocks are mutually independent, and that within each block, observations are ordinally rankable according to some criterion of interest. A Friedman Test is produced using rank scores and the F table, though alternative implementations call it the Friedman Statistic and use the chi-square table. Note that when all of the treatments are not applied to each block, it is an incomplete block design. The requirements of the Friedman test are not met under these conditions, and other tests such as the Durbin test should be applied.
In addition to the Friedman statistics, Kendall’s Coefficient of Concordance (W) is produced, as well as Spearman’s Rho. Kendall's coefficient of concordance can range from 0 to 1; the higher its value, the stronger the association. W is 1.0 if all treatments receive the same rank in all blocks, and 0 if there is “perfect disagreement” among blocks. Spearman's rho is a measure of the linear relationship between two variables. It differs from Pearson's correlation only in that the computations are done after the numbers are converted to ranks. Spearman’s Rho equals 1 if there is perfect agreement among rankings; disagreement causes rho to be less than 1, sometimes becoming negative.
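For reference, one commonly used formulation of these statistics, which may differ in details such as tie handling from the product's implementation, is the following, where b is the number of blocks, k the number of treatments, and R_j the sum of the within-block ranks of treatment j:

W = \frac{12 \sum_{j=1}^{k} \left( R_j - \frac{b(k+1)}{2} \right)^2}{b^2 \, k \, (k^2 - 1)},
\qquad
\bar{\rho}_s = \frac{bW - 1}{b - 1},
\qquad
F = \frac{(b - 1) \, W}{1 - W}

In this form F is referred to the F distribution with approximately (k - 1) and (b - 1)(k - 1) degrees of freedom, corresponding to the treatment and block degrees of freedom reported in the output columns described later.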
Initiate a Friedman Test
Use the following procedure to initiate a new Friedman Test in Teradata Warehouse Miner:
1 Click on the Add New Analysis icon in the toolbar:
Figure 266: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Rank Tests:
Figure 267: Add New Analysis > Statistical Tests > Rank Tests
3 This will bring up the Rank Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections.

Friedman Test - INPUT - Data Selection
On the Rank Tests dialog click on INPUT and then click on data selection:
Figure 268: Friedman Test > Input > Data Selection
On this screen select:
1 Select Input Source
Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however, the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. For more information, refer to “INPUT Tab” on page 73 of the Teradata Warehouse Miner User Guide (Volume 1).
2 Select Columns From a Single Table
• Available Databases (or Analyses) — These are the databases (or analyses) available to be processed.
• Available Tables — These are the tables and views that are available to be processed.
• Available Columns — These are the columns within the table/view that are available for processing.
3 Select Statistical Test Style
These are the Tests Based on Ranks available (Mann-Whitney/Kruskal-Wallis, Wilcoxon, Friedman). Select “Friedman”.
4 Select Optional Columns
• Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Treatment Column, Block Column or Group By Columns. Make sure you have the correct portion of the window highlighted.
• Column of Interest — The column that specifies the dependent variable to be analyzed.
• Treatment Column — The column that specifies the independent categorical variable representing treatments within blocks.
• Block Column — The column that specifies the variable representing blocks, or independent experimental groups.
Warning: Equal cell counts are required for all Treatment Column x Block Column pairs. Division by zero may occur in the case of unequal cell counts.
• Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so that a separate test is performed on each category.
Warning: Equal cell counts are required for all Treatment Column x Block Column pairs within each group. Division by zero may occur in the case of unequal cell counts.

Friedman Test - INPUT - Analysis Parameters
On the Rank Tests dialog click on INPUT and then click on analysis parameters:
Figure 269: Friedman Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
• Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis.

Friedman Test - OUTPUT
On the Rank Tests dialog click on OUTPUT:
Figure 270: Friedman Test > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata table populated with the results of the analysis. Once enabled, the following three fields must be specified:
• Database Name — The database where the output table will be saved.
• Output Name — The table name that the output will be saved under.
• Output Type — The output type must be table when storing Statistical Test output in the database.
• Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. (For more information, please refer to “Stored Procedure Support” on page 109 of the Teradata Warehouse Miner User Guide (Volume 1).)
• Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu.)
• Create output table using fallback keyword — The FALLBACK keyword will be used to create the table.
• Create output table using multiset keyword — The MULTISET keyword will be used to create the table.
• Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. (For more information, refer to “Advertise Output” on page 112 of the Teradata Warehouse Miner User Guide (Volume 1).)
• Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Databases tab of the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
• Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed.

Run the Friedman Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard.

Results - Friedman Test Analysis
The results of running the Friedman Test analysis include a table with a row for each separate Friedman Test on all distinct-value group-by variables, as well as the SQL to perform the statistical analysis. All of these results are outlined below.

Friedman Test - RESULTS - SQL
On the Rank Tests dialog click on RESULTS and then click on SQL:
Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise Kendalls_ W will be the UPI. Table 119: Friedman Test Analysis: Output Columns Name Type Definition Kendalls_W Float Kendall's W Average_Spearmans_Rho Float Average Spearman's Rho DF_1 Integer Degrees of Freedom for Treatments DF_2 Integer Degrees of Freedom for Blocks F Float 2-Way ANOVA F Statistic on ranks FriedmanPValue Float The probability associated with the Friedman statistic FriedmanPText Char The text description of probability if out of table range FriedmanCallP_0.05 Char The Friedman result: a=accept, p or n=reject Teradata Warehouse Miner User Guide - Volume 1 363 Chapter 3: Statistical Tests Rank Tests Tutorial - Friedman Test Analysis In this example, a Friedman test analysis is performed on the fictitious banking data to analyze account usage. If the data does not have equal cell counts in the treatment x block cells, stratified sampling can be used to identify the smallest count, and then produce a temporary table which can be analyzed. The first step is to identify the smallest count with a Free Form SQL analysis (or two Variable Creation analyses) with SQL such as the following (be sure to set the database in the FROM clause to that containing the demonstration data tables): SELECT MIN("_twm_N") AS smallest_count FROM ( SELECT marital_status ,gender ,COUNT(*) AS "_twm_N" FROM "twm_source"."twm_customer_analysis" GROUP BY "marital_status", "gender" ) AS "T0"; The second step is to use a Sample analysis with stratified sampling to create the temporary table with equal cell counts. The value 18 used in the stratified Sizes/Fractions parameter below corresponds to the smallest_count returned from above. Parameterize a Sample Analysis called Friedman Work Table Setup as follows: Input Options: • Available Tables — TWM_CUSTOMER_ANALYSIS • Selected Columns and Aliases • TWM_CUSTOMER_ANALYSIS.cust_id • TWM_CUSTOMER_ANALYSIS.gender • TWM_CUSTOMER_ANALYSIS.marital_status • TWM_CUSTOMER_ANALYSIS.income Analysis Parameters: • Sample Style — Stratified • Stratified Sample Options • Create a separate sample for each fraction/size — Enabled • Stratified Conditions 364 • gender='f' and marital_status='1' • gender='f' and marital_status='2' • gender='f' and marital_status='3' • gender='f' and marital_status='4' • gender='m' and marital_status='1' • gender='m' and marital_status='2' • gender='m' and marital_status='3' Teradata Warehouse Miner User Guide - Volume 1 Chapter 3: Statistical Tests Rank Tests • gender='m' and marital_status='4' • Sizes/Fractions — 18 (use the same value for all conditions) Output Options: • Store the tabular output of this analysis in the database — Enabled • Table Name — Twm_Friedman_Worktable Finally, Parameterize a Friedman Test analysis as follows: Input Options: • Select Input Source — Analysis • Available Analyses — Friedman Work Table Setup • Available Tables — Twm_Friedman_Worktable • Select Statistical Test Style — Friedman • Column of Interest — income • Treatment Column — gender • Block Column — marital_status Analysis Parameters: • Analysis Parameters • Threshold Probability — 0.05 Run the analysis and click on Results when it completes. For this example, the Friedman Test analysis generated the following table. (Note that results may vary due to the use of sampling in creating the input table Twm_Friedman_Worktable). The test shows that analysis of income by treatment (male vs. female) differences is significant at better than the 0.001 probability level. 
An ‘n’ or ‘p’ means significant and an ‘a’ means accept the null hypothesis. The SQL is available for viewing but not listed below.

Table 120: Friedman Test
Kendalls_W   Average_Spearmans_Rho  DF_1  DF_2  F            FriedmanPValue  FriedmanPText  FriedmanCallP_0.001
0.763196925  0.773946177            1     71    228.8271876  0.001           <0.001         p
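Because equal cell counts are required for every Treatment Column x Block Column pair, a quick hand-written check such as the following can be run against the work table before the Friedman analysis; the database name used here is a placeholder for wherever the Sample analysis stored Twm_Friedman_Worktable. Every returned row should show the same cell_count (18 in this tutorial).

SELECT gender
      ,marital_status
      ,COUNT(*) AS cell_count            -- one row per treatment x block cell
FROM "mydb"."Twm_Friedman_Worktable"
GROUP BY gender, marital_status
ORDER BY gender, marital_status;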