Teradata Warehouse Miner
User Guide - Volume 3
Analytic Functions
Release 5.4.2
B035-2302-106A
October 2016
The product or products described in this book are licensed products of Teradata Corporation or its affiliates.
Teradata, BYNET, DBC/1012, DecisionCast, DecisionFlow, DecisionPoint, Eye logo design, InfoWise, Meta Warehouse, MyCommerce,
SeeChain, SeeCommerce, SeeRisk, Teradata Warehouse Miner, Teradata Source Experts, WebAnalyst, and You’ve Never Seen Your Business
Like This Before are trademarks or registered trademarks of Teradata Corporation or its affiliates.
Adaptec and SCSISelect are trademarks or registered trademarks of Adaptec, Inc.
AMD Opteron and Opteron are trademarks of Advanced Micro Devices, Inc.
BakBone and NetVault are trademarks or registered trademarks of BakBone Software, Inc.
Cloudera and the Cloudera logo are trademarks of Cloudera, Inc.
This software contains material under license from DUNDAS SOFTWARE LTD., which is ©1994-1999 DUNDAS SOFTWARE LTD., all
rights reserved.
EMC, PowerPath, SRDF, and Symmetrix are registered trademarks of EMC Corporation.
GoldenGate is a trademark of GoldenGate Software, Inc.
Hewlett-Packard and HP are registered trademarks of Hewlett-Packard Company.
Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other
countries.
Intel, Pentium, and XEON are registered trademarks of Intel Corporation.
IBM, CICS, DB2, MVS, RACF, Tivoli, and VM are registered trademarks of International Business Machines Corporation.
Linux is a registered trademark of Linus Torvalds.
LSI and Engenio are registered trademarks of LSI Corporation.
MapR, MapR Heatmap, Direct Access NFS, Distributed NameNode HA, Direct Shuffle and Lockless Storage Services are all trademarks of
MapR Technologies, Inc.
Microsoft, Active Directory, Windows, Windows NT, Windows Server, Windows Vista, Visual Studio and Excel are either registered trademarks
or trademarks of Microsoft Corporation in the United States or other countries.
MongoDB, Mongo, and the leaf logo are registered trademarks of MongoDB, Inc.
Novell and SUSE are registered trademarks of Novell, Inc., in the United States and other countries.
QLogic and SANbox are trademarks or registered trademarks of QLogic Corporation.
SAS, SAS/C and Enterprise Miner are trademarks or registered trademarks of SAS Institute Inc.
SPSS is a registered trademark of SPSS Inc.
STATISTICA and StatSoft are trademarks or registered trademarks of StatSoft, Inc.
SPARC is a registered trademark of SPARC International, Inc.
Sun Microsystems, Solaris, Sun, and Sun Java are trademarks or registered trademarks of Sun Microsystems, Inc., in the United States and
other countries.
Symantec, NetBackup, and VERITAS are trademarks or registered trademarks of Symantec Corporation or its affiliates in the United States
and other countries.
Unicode is a collective membership mark and a service mark of Unicode, Inc.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Other product and company names mentioned herein may be the trademarks of their respective owners.
THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN “AS-IS” BASIS, WITHOUT
WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT. SOME JURISDICTIONS
DO NOT ALLOW THE EXCLUSION OF IMPLIED WARRANTIES, SO THE ABOVE EXCLUSION MAY NOT APPLY
TO YOU. IN NO EVENT WILL TERADATA CORPORATION BE LIABLE FOR ANY INDIRECT, DIRECT, SPECIAL,
INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS OR LOST SAVINGS, EVEN IF
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
The information contained in this document may contain references or cross-references to features, functions, products, or services that are
not announced or available in your country. Such references do not imply that Teradata Corporation intends to announce such features, functions,
products, or services in your country. Please consult your local Teradata Corporation representative for those features, functions, products, or
services available in your country.
Information contained in this document may contain technical inaccuracies or typographical errors. Information may be changed or updated
without notice. Teradata Corporation may also make improvements or changes in the products or services described in this information at any
time without notice.
To maintain the quality of our products and services, we would like your comments on the accuracy, clarity, organization, and value of this
document. Please e-mail: [email protected]
Any comments or materials (collectively referred to as “Feedback”) sent to Teradata Corporation will be deemed non-confidential. Teradata
Corporation will have no obligation of any kind with respect to Feedback and will be free to use, reproduce, disclose, exhibit, display, transform,
create derivative works of, and distribute the Feedback and derivative works thereof without limitation on a royalty-free basis. Further, Teradata
Corporation will be free to use any ideas, concepts, know-how, or techniques contained in such Feedback for any purpose whatsoever, including
developing, manufacturing, or marketing products or services incorporating Feedback.
Copyright © 1999-2016 by Teradata Corporation. All Rights Reserved.
Preface
Purpose
This volume describes how to use the modeling, scoring, and statistical test features of the
Teradata Warehouse Miner product. Teradata Warehouse Miner is a set of Microsoft .NET
interfaces and a multi-tier user interface that together help you understand the quality of data
residing in a Teradata database, create analytic data sets, and build and score analytic models
directly in the Teradata database.
Audience
This manual is written for users of Teradata Warehouse Miner, who should be familiar with
Teradata SQL, the operation and administration of the Teradata RDBMS, and statistical
techniques. They should also be familiar with the Microsoft Windows operating environment
and standard Windows operating techniques.
This manual only applies to Teradata Warehouse Miner when operating on a Teradata
database.
Revision Record
The following table lists the releases in which this guide has been revised:
Release      Date        Description
TWM 5.4.2    10/31/16    Maintenance Release
TWM 5.4.1    01/08/16    Maintenance Release
TWM 5.4.0    07/31/15    Feature Release
TWM 5.3.5    06/19/14    Maintenance Release
TWM 5.3.4    09/10/13    Maintenance Release
TWM 5.3.3    06/30/12    Maintenance Release
TWM 5.3.2    06/01/11    Maintenance Release
TWM 5.3.1    06/30/10    Maintenance Release
TWM 5.3.0    10/30/09    Feature Release
TWM 5.2.2    02/05/09    Maintenance Release
TWM 5.2.1    12/15/08    Maintenance Release
TWM 5.2.0    05/31/08    Feature Release
TWM 5.1.1    01/23/08    Maintenance Release
TWM 5.1.0    07/12/07    Feature Release
TWM 5.0.1    11/16/06    Maintenance Release
TWM 5.0.0    09/22/06    Major Release
How This Manual Is Organized
This manual is organized as follows:
• Chapter 1: “Analytic Algorithms” — describes how to use the Teradata Warehouse Miner
Multivariate Statistics and Machine Learning Algorithms. This includes Linear
Regression, Logistic Regression, Factor Analysis, Decision Trees, Clustering and
Association Rules.
• Chapter 2: “Scoring” — describes how to use the Teradata Warehouse Miner Multivariate
Statistics and Machine Learning Algorithms scoring analyses. Scoring is available for
Linear Regression, Logistic Regression, Factor Analysis, Decision Trees and Clustering.
• Chapter 3: “Statistical Tests” — describes how to use Teradata Warehouse Miner
Statistical Tests. This includes Binomial, Kolmogorov-Smirnov, Parametric, Rank, and
Contingency Tables-based tests.
Conventions Used In This Manual
The following typographical conventions are used in this guide:
Convention    Description
Italic        Titles (especially screen names and titles); new terms introduced for emphasis
Monospace     Code samples and output
ALL CAPS      Acronyms
Bold          Important terms or concepts
GUI Item      A screen item, especially one you click or highlight while following a procedure
This document provides information for operations on both Teradata and Aster systems. In
some cases, certain information will only apply to either a Teradata or an Aster system.
“Teradata Only” and “Aster Only” markers are distributed throughout this document in order
to identify Teradata-specific and Aster-specific content, respectively.
Teradata Only
The following marker denotes information that only applies to a Teradata system:
While the following signals the conclusion of Teradata-specific content:
Aster Only
The following marker denotes information that only applies to an Aster system:
While the following signals the conclusion of Aster-specific content:
Related Documents
Related Teradata documentation and other sources of information are available from:
http://www.info.teradata.com
Additional technical information on data warehousing and other topics is available from:
http://www.teradata.com/t/resources
Support Information
Services, support and training information is available from:
http://www.teradata.com/services-support
Table of Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Audience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Revision Record . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
How This Manual Is Organized . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Conventions Used In This Manual. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Related Documents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Support Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Chapter 1: Analytic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Initiate an Association Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Association - INPUT - Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Association - INPUT - Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Association - INPUT - Expert Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Association - OUTPUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Run the Association Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Results - Association Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Tutorial - Association Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Options - Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Using the TWM Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Success Analysis - Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Optimizing Performance of Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Initiate a Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Cluster - INPUT - Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Cluster - INPUT - Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Cluster - INPUT - Expert Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Cluster - OUTPUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Run the Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Results - Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Tutorial - Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Initiate a Decision Tree Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Decision Tree - INPUT - Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Decision Tree - INPUT - Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Decision Tree - INPUT - Expert Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Run the Decision Tree Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Results - Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Tutorial - Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Factor Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Initiate a Factor Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Factor - INPUT - Data Selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Factor - INPUT - Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Factor Analysis - OUTPUT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Run the Factor Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Results - Factor Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Tutorial - Factor Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Initiate a Linear Regression Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Linear Regression - INPUT - Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Linear Regression - INPUT - Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Linear Regression - OUTPUT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Run the Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Results - Linear Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Tutorial - Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Initiate a Logistic Regression Function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Logistic Regression - INPUT - Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Logistic Regression - INPUT - Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Logistic Regression - INPUT - Expert Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Logistic Regression - OUTPUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Run the Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Results - Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Tutorial - Logistic Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Chapter 2: Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Cluster Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Initiate Cluster Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Cluster Scoring - INPUT - Data Selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Cluster Scoring - INPUT - Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Cluster Scoring - OUTPUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Run the Cluster Scoring Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Results - Cluster Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Tutorial - Cluster Scoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Tree Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Initiate Tree Scoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Tree Scoring - INPUT - Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Tree Scoring - INPUT - Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Tree Scoring - OUTPUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Run the Tree Scoring Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Results - Tree Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Tutorial - Tree Scoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Factor Scoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Initiate Factor Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Factor Scoring - INPUT - Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Factor Scoring - INPUT - Analysis Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Factor Scoring - OUTPUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Run the Factor Scoring Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Results - Factor Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Tutorial - Factor Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Linear Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Linear Regression Model Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Initiate Linear Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Linear Scoring - INPUT - Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Linear Scoring - INPUT - Analysis Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Linear Scoring - OUTPUT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Run the Linear Scoring Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Results - Linear Scoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Tutorial - Linear Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Logistic Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Logistic Regression Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Prediction Success Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Multi-Threshold Success Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Cumulative Lift Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Initiate Logistic Scoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Logistic Scoring - INPUT - Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Logistic Scoring - INPUT - Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Logistic Scoring - OUTPUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Run the Logistic Scoring Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Results - Logistic Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Tutorial - Logistic Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Chapter 3: Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Summary of Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Data Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Parametric Tests. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Two Sample T-Test for Equal Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
F-Test - N-Way . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
F-Test/Analysis of Variance - Two Way Unequal Sample Size. . . . . . . . . . . . . . . . . . 221
Binomial Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
Binomial/Ztest. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Binomial Sign Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Kolmogorov-Smirnov Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Kolmogorov-Smirnov Test (One Sample) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Lilliefors Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Shapiro-Wilk Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
D'Agostino and Pearson Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
Smirnov Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
Tests Based on Contingency Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Chi Square Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Median Test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Rank Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Mann-Whitney/Kruskal-Wallis Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Wilcoxon Signed Ranks Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
Friedman Test with Kendall's Coefficient of Concordance & Spearman's Rho . . . . . . 299
Appendix A: References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
List of Figures
Figure 1: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Figure 2: Add New Analysis dialog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Figure 3: Association > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Figure 4: Association > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Figure 5: Association: X to X. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Figure 6: Association Combinations pane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Figure 7: Association > Input > Expert Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Figure 8: Association > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Figure 9: Association > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Figure 10: Association > Results > Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Figure 11: Association > Results > Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Figure 12: Association Graph Selector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Figure 13: Association Graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Figure 14: Association Graph: Tutorial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Figure 15: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Figure 16: Add New Analysis dialog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Figure 17: Clustering > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Figure 18: Clustering > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Figure 19: Clustering > Input > Expert Options. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Figure 20: Cluster > OUTPUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Figure 21: Clustering > Results > Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Figure 22: Clustering > Results > Sizes Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Figure 23: Clustering > Results > Similarity Graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Figure 24: Clustering Analysis Tutorial: Sizes Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Figure 25: Clustering Analysis Tutorial: Similarity Graph . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Figure 26: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Figure 27: Add New Analysis dialog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Figure 28: Decision Tree > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Figure 29: Decision Tree > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Figure 30: Decision Tree > Input > Expert Options. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Figure 31: Tree Browser. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Figure 32: Tree Browser menu: Small Navigation Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Figure 33: Tree Browser menu: Zoom Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Figure 34: Tree Browser menu: Print . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Figure 35: Text Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Figure 36: Rules List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Figure 37: Counts and Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Figure 38: Tree Pruning menu. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Figure 39: Tree Pruning Menu > Prune Selected Branch . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Figure 40: Tree Pruning menu (All Options Enabled) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Figure 41: Decision Tree Graph: Previously Pruned Tree . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Figure 42: Decision Tree Graph: Predicate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Figure 43: Decision Tree Graph: Lift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Figure 44: Decision Tree Graph Tutorial: Browser. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Figure 45: Decision Tree Graph Tutorial: Lift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Figure 46: Decision Tree Graph Tutorial: Browser. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Figure 47: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Figure 48: Add New Analysis dialog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Figure 49: Factor Analysis > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Figure 50: Factor Analysis > Input > Analysis Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . 74
Figure 51: Factor Analysis > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Figure 52: Factor Analysis > Results > Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Figure 53: Factor Analysis > Results > Pattern Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Figure 54: Factor Analysis > Results > Scree Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Figure 55: Factor Analysis Tutorial: Scree Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Figure 56: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Figure 57: Add New Analysis dialog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Figure 58: Linear Regression > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Figure 59: Linear Regression > Input > Analysis Parameters. . . . . . . . . . . . . . . . . . . . . . . 103
Figure 60: Linear Regression > OUTPUT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Figure 61: Linear Regression Tutorial: Linear Weights Graph. . . . . . . . . . . . . . . . . . . . . . 118
Figure 62: Linear Regression Tutorial: Scatter Plot (2d) . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Figure 63: Linear Regression Tutorial: Scatter Plot (3d) . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Figure 64: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Figure 65: Add New Analysis dialog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Figure 66: Logistic Regression > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Figure 67: Logistic Regression > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . 129
Figure 68: Logistic Regression > Input > Expert Options. . . . . . . . . . . . . . . . . . . . . . . . . . 131
Figure 69: Logistic Regression > OUTPUT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Figure 70: Logistic Regression Tutorial: Logistic Weights Graph . . . . . . . . . . . . . . . . . . . 147
Figure 71: Logistic Regression Tutorial: Lift Chart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Figure 72: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Figure 73: Add New Analysis > Scoring > Cluster Scoring . . . . . . . . . . . . . . . . . . . . . . . . 151
Figure 74: Add New Analysis > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Figure 75: Add New Analysis > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . 152
Figure 76: Cluster Scoring > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Figure 77: Cluster Scoring > Results > Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Figure 78: Cluster Scoring > Results > Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Figure 79: Cluster Scoring > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Figure 80: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Figure 81: Add New Analysis > Scoring > Tree Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Figure 82: Tree Scoring > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Figure 83: Tree Scoring > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Figure 84: Tree Scoring > Output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Figure 85: Tree Scoring > Results > Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Figure 86: Tree Scoring > Results > Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Figure 87: Tree Scoring > Results > Lift Graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Figure 88: Tree Scoring > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Figure 89: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Figure 90: Add New Analysis > Scoring > Factor Scoring . . . . . . . . . . . . . . . . . . . . . . . . . 170
Figure 91: Factor Scoring > Input > Data Selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Figure 92: Factor Scoring > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Figure 93: Factor Scoring > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Figure 94: Factor Scoring > Results > Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Figure 95: Factor Scoring > Results > Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Figure 96: Factor Scoring > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Figure 97: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Figure 98: Add New Analysis > Scoring > Linear Scoring . . . . . . . . . . . . . . . . . . . . . . . . . 177
Figure 99: Linear Scoring > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Figure 100: Linear Scoring > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 179
Figure 101: Linear Scoring > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Figure 102: Linear Scoring > Results > Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Figure 103: Linear Scoring > Results > Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Figure 104: Linear Scoring > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Figure 105: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Figure 106: Add New Analysis > Scoring > Logistic Scoring . . . . . . . . . . . . . . . . . . . . . . 188
Figure 107: Logistic Scoring > Input > Data Selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Figure 108: Logistic Scoring > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . 189
Figure 109: Logistic Scoring > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Figure 110: Logistic Scoring > Results > Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Figure 111: Logistic Scoring > Results > Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Figure 112: Logistic Scoring > Results > Lift Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Figure 113: Logistic Scoring > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Figure 114: Logistic Scoring Tutorial: Lift Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Figure 115: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Figure 116: Add New Analysis > Statistical Tests > Parametric Tests. . . . . . . . . . . . . . . . 205
Figure 117: T-Test > Input > Data Selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Figure 118: T-Test > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Figure 119: T-Test > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Figure 120: T-Test > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Figure 121: T-Test > Results > Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Figure 122: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
Figure 123: Add New Analysis > Statistical Tests > Parametric Tests. . . . . . . . . . . . . . . . 212
Figure 124: F-Test > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Figure 125: F-Test > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
Figure 126: F-Test > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
Figure 127: F-Test > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Figure 128: F-Test > Results > data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
Figure 129: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
Figure 130: Add New Analysis > Statistical Tests > Parametric Tests. . . . . . . . . . . . . . . . 223
Figure 131: F-Test > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Figure 132: F-Test > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Figure 133: F-Test > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Figure 134: F-Test > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
Figure 135: F-Test > Results > data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
Figure 136: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Figure 137: Add New Analysis > Statistical Tests > Binomial Tests . . . . . . . . . . . . . . . . . 230
Figure 138: Binomial Tests > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Figure 139: Binomial Tests > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . 231
Figure 140: Binomial Tests > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
Figure 141: Binomial Tests > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
Figure 142: Binomial Tests > Results > data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
Figure 143: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Figure 144: Add New Analysis > Statistical Tests > Binomial Tests . . . . . . . . . . . . . . . . . 236
Figure 145: Binomial Sign Test > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Figure 146: Binomial Sign Test > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . 237
Figure 147: Binomial Sign Test > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
Figure 148: Binomial Sign Test > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
Figure 149: Binomial Sign Test > Results > data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
Figure 150: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Figure 151: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests . . . . . . . 242
Figure 152: Kolmogorov-Smirnov Test > Input > Data Selection. . . . . . . . . . . . . . . . . . . . 242
Figure 153: Kolmogorov-Smirnov Test > Input > Analysis Parameters . . . . . . . . . . . . . . . 243
Figure 154: Kolmogorov-Smirnov Test > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Figure 155: Kolmogorov-Smirnov Test > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Figure 156: Kolmogorov-Smirnov Test > Results > Data . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Figure 157: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Figure 158: Add New Analysis> Statistical Tests > Kolmogorov-Smirnov Tests . . . . . . . 248
Figure 159: Lilliefors Test > Input > Data Selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
Figure 160: Lilliefors Test > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Figure 161: Lilliefors Test > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
Figure 162: Lilliefors Test > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Figure 163: Lilliefors Test > Results > Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Figure 164: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Figure 165: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests . . . . . . . 254
Figure 166: Shapiro-Wilk Test > Input > Data Selection. . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Figure 167: Shapiro-Wilk Test > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . 255
Figure 168: Shapiro-Wilk Test > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
Figure 169: Shapiro-Wilk Test > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
Figure 170: Shapiro-Wilk Test > Results > data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
Figure 171: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
Figure 172: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests . . . . . . . 259
Figure 173: D'Agostino and Pearson Test > Input > Data Selection . . . . . . . . . . . . . . . . . . 260
Figure 174: D'Agostino and Pearson Test > Input > Analysis Parameters . . . . . . . . . . . . . 261
Figure 175: D'Agostino and Pearson Test > Output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
Figure 176: D'Agostino and Pearson Test > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . 263
Figure 177: D'Agostino and Pearson Test > Results > data . . . . . . . . . . . . . . . . . . . . . . . . 263
Figure 178: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
Figure 179: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests. . . . . . . 265
Figure 180: Smirnov Test > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
Figure 181: Smirnov Test > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 267
Figure 182: Smirnov Test > Output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
Figure 183: Smirnov Test > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Figure 184: Smirnov Test > Results > data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
Figure 185: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
Figure 186: Add New Analysis > Statistical Tests > Tests Based on Contingency Tables 272
Figure 187: Chi Square Test > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Figure 188: Chi Square Test > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . 273
Figure 189: Chi Square Test > Output. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
Figure 190: Chi Square Test > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Figure 191: Chi Square Test > Results > data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Figure 192: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
Figure 193: Add New Analysis > Statistical Tests > Tests Based On Contingency Tables 278
Figure 194: Median Test > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Figure 195: Median Test > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 280
Figure 196: Median Test > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
Figure 197: Median Test > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Figure 198: Median Test > Results > data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
Figure 199: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Figure 200: Add New Analysis > Statistical Tests > Rank Tests . . . . . . . . . . . . . . . . . . . . 285
Figure 201: Mann-Whitney/Kruskal-Wallis Test > Input > Data Selection . . . . . . . . . . . . 286
Figure 202: Mann-Whitney/Kruskal-Wallis Test > Input > Analysis Parameters . . . . . . . 287
Figure 203: Mann-Whitney/Kruskal-Wallis Test > Output. . . . . . . . . . . . . . . . . . . . . . . . . 287
Figure 204: Mann-Whitney/Kruskal-Wallis Test > Results > SQL . . . . . . . . . . . . . . . . . . 289
Figure 205: Mann-Whitney/Kruskal-Wallis Test > Results > data . . . . . . . . . . . . . . . . . . . 289
Figure 206: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
Figure 207: Add New Analysis > Statistical Tests > Rank Tests . . . . . . . . . . . . . . . . . . . . 294
Figure 208: Wilcoxon Signed Ranks Test > Input > Data Selection. . . . . . . . . . . . . . . . . . 294
Figure 209: Wilcoxon Signed Ranks Test > Input > Analysis Parameters . . . . . . . . . . . . . 295
Figure 210: Wilcoxon Signed Ranks Test > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
Figure 211: Wilcoxon Signed Ranks Test > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . 297
Figure 212: Wilcoxon Signed Ranks Test > Results > data . . . . . . . . . . . . . . . . . . . . . . . . 297
Figure 213: Add New Analysis from toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
Figure 214: Add New Analysis > Statistical Tests > Rank Tests. . . . . . . . . . . . . . . . . . . . . 300
Figure 215: Friedman Test > Input > Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Figure 216: Friedman Test > Input > Analysis Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 302
Figure 217: Friedman Test > Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
Figure 218: Friedman Test > Results > SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
Figure 219: Friedman Test > Results > data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
List of Tables
Table 1: Three-Level Hierarchy Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Table 2: Association Combinations output table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Table 3: Tutorial - Association Analysis Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Table 4: test_ClusterResults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Table 5: test_ClusterColumns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Table 6: Progress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Table 7: Solution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Table 8: Confusion Matrix Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Table 9: Decision Tree Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Table 10: Variables: Dependent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Table 11: Variables: Independent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Table 12: Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Table 13: Cumulative Lift Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Table 14: Prime Factor Loadings report (Example) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Table 15: Prime Factor Variables report (Example). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Table 16: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Table 17: my_factor_reports_ tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Table 18: Factor Analysis Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Table 19: Execution Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Table 20: Eigenvalues. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Table 21: Principal Component Loadings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Table 22: Factor Variance to Total Variance Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Table 23: Variance Explained By Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Table 24: Difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Table 25: Prime Factor Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Table 26: Eigenvalues of Unit Scaled X'X . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Table 27: Condition Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Table 28: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Table 29: Near Dependency report (example) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Table 30: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Table 31: Linear Regression Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Table 32: Regression vs. Residual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Table 33: Execution Status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Table 34: Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Table 35: Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Table 36: Model Assessment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Table 37: Columns In (Part 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Table 38: Columns In (Part 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Table 39: Columns In (Part 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Table 40: Columns Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Table 41: Logistic Regression - OUTPUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Table 42: Logistic Regression Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Table 43: Execution Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Table 44: Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Table 45: Columns Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Table 46: Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Table 47: Columns Out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Table 48: Prediction Success Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Table 49: Multi-Threshold Success Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Table 50: Cumulative Lift Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Table 51: Output Database (Built by the Cluster Scoring analysis) . . . . . . . . . . . . . . . . . . 155
Table 52: Clustering Progress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Table 53: Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Table 54: Confusion Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Table 55: Output Database table (Built by the Decision Tree Scoring analysis) . . . . . . . . 164
Table 56: Output Database table (Built by the Decision Tree Scoring analysis) - Create profiling tables option
selected (“_1” appended) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Table 57: Output Database table (Built by the Decision Tree Scoring analysis) - Create profiling tables option
selected (“_2” appended) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Table 58: Decision Tree Model Scoring Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Table 59: Confusion Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Table 60: Cumulative Lift Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Table 61: Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Table 62: Output Database table (Built by Factor Scoring) . . . . . . . . . . . . . . . . . . . . . . . . 174
Table 63: Factor Analysis Score Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Table 64: Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Table 65: Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Table 66: Output Database table (Built by Linear Regression scoring) . . . . . . . . . . . . . . . 182
Table 67: Linear Regression Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Table 68: Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Table 69: Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Table 70: Prediction Success Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Table 71: Logistic Regression Multi-Threshold Success table . . . . . . . . . . . . . . . . . . . . . . 185
Table 72: Logistic Regression Cumulative Lift Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Table 73: Output Database table (Built by Logistic Regression scoring) . . . . . . . . . . . . . . 192
Table 74: Logistic Regression Model Scoring Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Table 75: Prediction Success Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Table 76: Multi-Threshold Success Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Table 77: Cumulative Lift Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Table 78: Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Table 79: Statistical Test functions handling of input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Table 80: Two sample t tests for unpaired data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Table 81: Output Database table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Table 82: T-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Table 83: Output Columns - 1-Way F-Test Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
Table 84: Output Columns - 2-Way F-Test Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
Table 85: Output Columns - 3-Way F-Test Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
Table 86: F-Test (one-way) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Table 87: Output Columns - 2-Way F-Test Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
Table 88: F-Test (Two-way Unequal Cell Count) (Part 1). . . . . . . . . . . . . . . . . . . . . . . . . . 228
Table 89: F-Test (Two-way Unequal Cell Count) (Part 2). . . . . . . . . . . . . . . . . . . . . . . . . . 228
Table 90: F-Test (Two-way Unequal Cell Count) (Part 3). . . . . . . . . . . . . . . . . . . . . . . . . . 228
Table 91: Output Database table (Built by the Binomial Analysis) . . . . . . . . . . . . . . . . . . . 234
Table 92: Binomial Test Analysis (Table 1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
Table 93: Binomial Test Analysis (Table 2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Table 94: Binomial Sign Analysis: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
Table 95: Tutorial - Binomial Sign Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Table 96: Output Database table (Built by the Kolmogorov-Smirnov test analysis) . . . . . . 245
Table 97: Kolmogorov-Smirnov Test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Table 98: Lilliefors Test Analysis: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Table 99: Lilliefors Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
Table 100: Shapiro-Wilk Test Analysis: Output Columns. . . . . . . . . . . . . . . . . . . . . . . . . . 257
Table 101: Shapiro-Wilk Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Table 102: D'Agostino and Pearson Test Analysis: Output Columns . . . . . . . . . . . . . . . . . 263
Table 103: D'Agostino and Pearson Test: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . 264
Table 104: Smirnov Test Analysis: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
Table 105: Smirnov Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Table 106: Chi Square Test Analysis: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
Table 107: Chi Square Test (Part 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
Table 108: Chi Square Test (Part 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Table 109: Median Test Analysis: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
Table 110: Median Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Table 111: Table for Mann-Whitney (if two groups) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Table 112: Table for Kruskal-Wallis (if more than two groups). . . . . . . . . . . . . . . . . . . . . 290
Table 113: Mann-Whitney Test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
Table 114: Kruskal-Wallis Test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
Table 115: Mann-Whitney Test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
Table 116: Wilcoxon Signed Ranks Test Analysis: Output Columns. . . . . . . . . . . . . . . . . 298
Table 117: Wilcoxon Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
Table 118: Friedman Test Analysis: Output Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
Table 119: Friedman Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
CHAPTER 1
Analytic Algorithms
What’s In This Chapter
This chapter applies only to an instance of Teradata Warehouse Miner operating on a Teradata
database.
For more information, see these subtopics:
1. “Overview” on page 1
2. “Association Rules” on page 2
3. “Cluster Analysis” on page 20
4. “Decision Trees” on page 39
5. “Factor Analysis” on page 62
6. “Linear Regression” on page 92
7. “Logistic Regression” on page 120
Overview
Teradata Warehouse Miner contains several analytic algorithms from both the traditional
statistics and machine learning disciplines. These algorithms pertain to the exploratory data
analysis (EDA) and model-building phases of the data mining process. Along with these
algorithms, Teradata Warehouse Miner contains corresponding model scoring and evaluation
functions that pertain to the model evaluation and deployment phases of the data mining
process. A brief summary of the algorithms offered may be given as follows:
• Linear Regression — Linear regression can be used to predict or estimate the value of a
continuous numeric data element based upon a linear combination of other numeric data
elements present for each observation.
• Logistic Regression — Logistic regression can be used to predict or estimate a two-valued
variable based upon other numeric data elements present for each observation.
• Factor Analysis — Factor analysis is a collective term for a family of techniques. In
general, Factor analysis can be used to identify, quantify, and re-specify the common and
unique sources of variability in a set of numeric variables. One of its many applications
allows an analytical modeler to reduce the number of numeric variables needed to
describe a collection of observations by creating new variables, called factors, as linear
combinations of the original variables.
• Decision Trees — Decision trees, or rule induction, can be used to predict or estimate the
value of a multi-valued variable based upon other categorical and continuous numeric
data elements by building decision rules and presenting them graphically in the shape of a
tree, based upon splits on specific data values.
• Clustering — Cluster analysis can be used to form multiple groups of observations, such
that each group contains observations that are very similar to one another, based upon
values of multiple numeric data elements.
• Association Rules — Generate association rules and various measures of frequency,
relationship and statistical significance associated with these rules. These rules can be
general, or have a time dimension associated with them.
Association Rules
Overview
Association Rules are measurements on groups of observations or transactions that contain
items of some kind. These measurements seek to describe the relationships between the items
in the groups, such as the frequency of occurrence of items together in a group or the
probability that items occur in a group given that other specific items are in that group. The
nature of items and groups in association analysis and the meaning of the relationships
between items in a group will depend on the nature of the data being studied. For example,
the items may be products purchased and the groups the market baskets in which they were
purchased. (This is generally called market basket analysis). Another example is that items
may be accounts opened and the groups the customers that opened the accounts. This type of
association analysis is useful in a cross-sell application to determine what products and
services to sell with other products and services. Obviously the possibilities are endless when
it comes to the assignment of meaning to items and groups in business and scientific
transactions or observations.
Rules
What does an association analysis produce and what types of measurements does it include?
An association analysis produces association rules and various measures of frequency,
relationship and statistical significance associated with these rules. Association rules are of
the form (X1, X2, ..., Xn) → (Y1, Y2, ..., Ym), where (X1, X2, ..., Xn) is a set of n items
that appear in a group along with a set of m items (Y1, Y2, ..., Ym) in the same group. For
example, if checking, saving and credit card accounts are owned by a customer, then the
customer will also own a certificate of deposit (CD) with a certain frequency. Relationship
means that, for example, owning a specific account or set of accounts (antecedent) is
associated with ownership of one or more other specific accounts (consequent). Association
rules, in and of themselves, do not warrant inferences of causality, however they may point to
relationships among items or events that could be studied further using other analytical
techniques which are more appropriate for determining the structure and nature of causalities
that may exist.
Measures
The four measurements made for association rules are support, confidence, lift and Z score.
Support
Support is a measure of the generality of an association rule, and is literally the percentage (a
value between 0 and 1) of groups that contain all of the items referenced in the rule. More
formally, in the association rule defined as L → R, L represents the items given to occur
together (the Left side or antecedent), and R represents the items that occur with them as a
result (the Right side or consequent). Support can actually be applied to a single item or a
single side of an association rule, as well as to an entire rule. The support of an item is simply
the percentage of groups containing that item.
Given the previous example of banking product ownership, let L be defined as the number of
customers who own the set of products on the left side and let R be defined as the number of
customers who own the set of products on the right side. Further, let LR be the number of
customers who own all products in the association rule (note that this notation does not mean
L times R), and let N be defined as the total number of customers under consideration. The
support of L, R and the association rule are given by:
L
Sup  L  = ---N
R
Sup  R  = ---N
LR
Sup  L  R  = -------N
Let’s say for example that out of 10 customers, 6 of them have a checking account, 5 have a
savings account, and 4 have both. If L is (checking) and R is (savings), then Sup  L  is
.6, Sup  R  is .5 and Sup  L  R  is .4.
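The support calculations can be illustrated with a short Python sketch based on the worked example above (the counts and variable names here are illustrative only and are not part of the product):

# Illustrative only: support measures from the worked example above.
N = 10   # total customers
L = 6    # customers owning the left-side item (checking)
R = 5    # customers owning the right-side item (savings)
LR = 4   # customers owning both items

sup_L = L / N        # 0.6
sup_R = R / N        # 0.5
sup_rule = LR / N    # 0.4, the support of the rule L -> R

print(sup_L, sup_R, sup_rule)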
Confidence
Confidence is the probability of R occurring in an item group given that L is in the item
group. It is calculated as:

Conf(L → R) = Sup(L → R) / Sup(L)

Another way of expressing the measure confidence is as the percentage of groups containing
L that also contain R. This gives the following equivalent calculation for confidence:

Conf(L → R) = LR / L

Using the previous example of banking product ownership once again, the confidence that
checking account ownership implies savings account ownership is 4/6.
The expected value of an association rule is the number of customers that are expected to
have both L and R if there is no relationship between L and R. (To say that there is no
relationship between L and R means that customers who have L are neither more likely nor
less likely to have R than are customers who do not have L). The equation for the expected
value of the association rule is:

E_LR = (L * R) / N

An equivalent formula for the expected value of the association rule is:

E_LR = Sup(L) * Sup(R) * N

Again using the previous example, the expected value of the number of customers with
checking and savings is calculated as 6 * 5 / 10 or 3.

The expected confidence of a rule is the confidence that would result if there were no
relationship between L and R. This simply equals the percentage of customers that own R,
since if owning L has no effect on owning R, then it would be expected that the percentage of
L’s that own R would be the same as the percentage of the entire population that own R. The
following equation computes expected confidence:

E_Conf = R / N = Sup(R)

From the previous example, the expected confidence that checking implies savings is given
by 5/10.
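As a minimal sketch, the confidence and expected-value calculations for the same example can be expressed in a few lines of Python (the names and counts are illustrative only):

# Illustrative only: confidence and expected values for the example above.
N, L, R, LR = 10, 6, 5, 4

conf = LR / L                    # 4/6, Conf(L -> R)
e_lr = (L / N) * (R / N) * N     # 3.0, expected joint count E_LR
e_conf = R / N                   # 0.5, expected confidence E_Conf

print(conf, e_lr, e_conf)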
Lift
Lift measures how much the probability of R is increased by the presence of L in an item
group. A lift of 1 indicates there are exactly as many occurrences of R as expected; thus, the
presence of L neither increases nor decreases the likelihood of R occurring. A lift of 5
indicates that the presence of L implies that it is 5 times more likely for R to occur than would
otherwise be expected. A lift of 0.5 indicates that when L occurs, it is one half as likely that R
will occur. Lift can be calculated as follows:
Lift(L → R) = LR / E_LR

From another viewpoint, lift measures the ratio of the actual confidence to the expected
confidence, and can be calculated equivalently as either of the following:

Lift(L → R) = Conf(L → R) / E_Conf

Lift(L → R) = Conf(L → R) / Sup(R)
The lift associated with the previous example of “checking implies savings” is 4/3.
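A minimal Python sketch of the lift calculation for the same example (illustrative values only) follows:

# Illustrative only: lift for the example above (4/3).
N, L, R, LR = 10, 6, 5, 4
e_lr = L * R / N                  # expected joint count

lift = LR / e_lr                  # actual joint count over expected
lift_alt = (LR / L) / (R / N)     # Conf(L -> R) / Sup(R), the same value

print(lift, lift_alt)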
Z score
Z score measures how statistically different the actual result is from the expected result. A Z
score of zero corresponds to the situation where the actual number equals the expected. A Z
score of 1 means that the actual number is 1 standard deviation greater than expected. A Z
score of -3.0 means that the actual number is 3 standard deviations less than expected. As a
rule of thumb, a Z score greater than 3 (or less than -3) indicates a statistically significant
result, which means that a difference that large between the actual result and the expected is
very unlikely to be due to chance. A Z score attempts to help answer the question of how
confident you can be about the observed relationship between L and R, but does not directly
indicate the magnitude of the relationship. It is interesting to note that a negative Z score
indicates a negative association. These are rules L → R where ownership of L decreases the
likelihood of owning R.
The following equation calculates a measure of the difference between the expected number
of customers that have both L and R, if there is no relationship between L and R, and the
actual number of customers that have both L and R. (It can be derived starting with either the
formula for the standard deviation of the sampling distribution of proportions or the formula
for the standard deviation of a binomial variable).

Zscore(L → R) = (LR - E_LR) / SQRT(E_LR * (1 - E_LR / N))

or equivalently:

Zscore(L → R) = (N * Sup(LR) - N * Sup(L) * Sup(R)) / SQRT(N * Sup(L) * Sup(R) * (1 - Sup(L) * Sup(R)))
The mean value is E_LR, and the actual value is LR. The standard deviation is calculated
with SQRT (E_LR * (1 - E_LR/N)). From the previous example, the expected value is 6 * 5 /
10, so the mean value is 3. The actual value is calculated knowing that savings and checking
accounts are owned by 4 out of 10 customers. The standard deviation is SQRT(3*(1-3/10)) or
1.449. The Z score is therefore (4 - 3) / 1.449 = .690.
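The worked Z score can be reproduced with a short Python sketch (the values come from the example above; the names are illustrative only):

# Illustrative only: Z score for the example above (about .690).
from math import sqrt

N, L, R, LR = 10, 6, 5, 4
e_lr = L * R / N                         # expected joint count, 3.0
std_dev = sqrt(e_lr * (1 - e_lr / N))    # about 1.449
z_score = (LR - e_lr) / std_dev          # about 0.690

print(round(z_score, 3))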
Interpreting Measures
None of the measures described above are “best”; they all measure slightly different things. In
the discussion below, product ownership association analysis is used as an example for
purposes of illustration. First look at confidence, which measures the strength of an
association: what percent of L customers also own R? Many people will sort associations by
confidence and consider the highest confidence rules to be the best. However, there are
several other factors to consider.
One factor to consider is that a rule may apply to very few customers, so is not very useful.
This is what support measures, the generality of the rule, or how often it applies. Thus a
rule L → R might have a confidence of 70%, but if that is just 7 out of 100 customers, it has
very low support and is not very useful. Another shortcoming of confidence is that by itself it
does not tell you whether owning L “changes” the likelihood of owning R, which is probably
the more important piece of information. For example, if 20% of the customers own R, then a
rule L → R (20% of those with L also own R) may have high confidence but is really
providing no information, because customers that own L have the same rate of ownership of
R as the entire population does. What is probably really wanted is to find the products L for
which the confidence of L → R is significantly greater than 20%. This is what lift measures,
the difference between the actual confidence and the expected confidence.
However, lift, like confidence, is much less meaningful when very small numbers are
involved; that is, when the support is low. If the expected number is 2 and there are actually 8
customers with product R, then the lift is an impressive 400. But because of the small
numbers involved, the association rule is likely of limited use, and might even have occurred
by chance. This is where the Z score comes in. For a rule L → R, confidence indicates the
likelihood that R is owned given that L is owned. Lift indicates how much owning L increases
or decreases the probability of the ownership of R, and Z score measures how trustworthy the
observed difference between the actual and expected ownership is relative to what could be
observed due to chance alone. For example, for a rule L → R, if it is expected to have
10,000 customers with both L and R, and there are actually 11,000, the lift would be only 1.1,
but the Z score would be very high, because such a large difference could not be due to
chance. Thus, a large Z score and small lift means there definitely is an effect, but it is small.
A large lift and small Z means there appears to be a large effect, but it might not be real.
A possible strategy then is given here as an illustration, but the exact strategy and threshold
values will depend on the nature of each business problem addressed with association
analysis. The full set of rules produced by an association analysis is often too large to
examine in detail. First, prune out rules that have low Z scores. Try throwing out rules with a
Z score of less than 2, if not 3, 4 or 5. However, there is little reason to focus in on rules with
extremely high Z scores. Next, filter according to support and lift. Setting a limit on the Z
score will not remove rules with low support or with low lift that involve common products.
Where to set the support threshold depends on what products are of interest and performance
considerations. Where to set the lift threshold is not really a technical question, but a question
of preference as to how large a lift is useful from a business perspective. A lift of 1.5
for L → R means that customers that own L are 50% more likely to own R than among the
overall population. If a value of 1.5 does not yield interesting results, then set the threshold
higher.
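A rough Python sketch of this pruning strategy is shown below; the rule tuples, field order and threshold values are purely illustrative and should be adjusted to the business problem at hand:

# Illustrative only: prune rules by Z score, then filter by support and lift.
rules = [
    # (left, right, support, confidence, lift, zscore) - made-up values
    ("checking", "savings", 0.40, 0.67, 1.33, 0.69),
    ("checking", "cd",      0.05, 0.21, 1.80, 3.50),
]

def keep(rule, min_z=3.0, min_support=0.02, min_lift=1.5):
    _, _, support, _, lift, zscore = rule
    return abs(zscore) >= min_z and support >= min_support and lift >= min_lift

interesting = [r for r in rules if keep(r)]
print(interesting)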
Sequence Analysis
Sequence analysis is a form of association analysis where the items in an association rule are
considered to have a time ordering associated with them. By default, when sequence analysis
is requested, left side items are assumed to have “occurred” before right side items, and in
fact the items on each side of an association rule, left or right, are also time ordered within
themselves. If we use in a sequence analysis the fuller notation for an association rule
L → R, namely (X1, X2, ..., Xm) → (Y1, Y2, ..., Yn), then we are asserting that not
only do the X items precede the Y items, but X1 precedes X2, and so on up through Xm, which
precedes Y1, which precedes Y2, and so on up through Yn.
It is important to note here that if a strict ordering of items in a sequence analysis is either not
desired or not possible for some reason (such as multiple purchases on the same day), an
option is provided to relax the strict ordering. With relaxed sequence analysis, all items on the
left must still precede all items on the right of a sequence rule, but the items on the left and the
items on the right are not time ordered amongst themselves. (When the rules are presented,
the items in each rule are ordered by name for convenience).
Lift and Z score are calculated differently for sequence analysis than for association analysis.
Recall that the expected value of the association rule, E_LR, is given by Sup (L) * Sup (R) *
N for a non-sequence association analysis. For example, if L occurs half the time and R
occurs half the time, then if L and R are independent of each other it can be expected that L
and R will occur together one-fourth of the time. But this does not take into account the fact
that with sequence analysis, the correct ordering can only be expected to happen some
percentage of the time if L and R are truly independent of each other. Interestingly, this
expected percentage of independent occurrence of correct ordering is calculated the same for
strictly ordered and relaxed ordered sequence analysis. With m items on the left and n on the
right, the probability of correct ordering is given by “m!n!/(m + n)!”. Note that this is the
inverse of the combinatorial analysis formula for the number of permutations of m + n objects
grouped such that m are alike and n are alike.
In the case of strictly ordered sequence analysis, the applicability of the formula just given for
the probability of correct ordering can be explained as follows. There are clearly m + n
objects in the rule, and saying that m are alike and n are alike corresponds to restricting the
permutations to those that preserve the ordering of the m items on the left side and the n items
on the right side of the rule. That is, all of the orderings of the items on a side other than the
correct ordering fall out as being the same permutation. The logic of the formula given for the
probability of correct ordering is perhaps easier to see in the case of relaxed ordering. Since
there are m + n items in the rule, there are (m + n)! possible orderings of the items. Out of
these, there are m! ways the left items can be ordered and n! ways the right items can be
ordered while ensuring that the m items on the left precede the n items on the right, so there
are m!n! valid orderings out of the (m + n)! possible.
The “probability of correct ordering” factor described above has a direct effect on the
calculation of lift and Z score. Lift is effectively divided by this factor, such that a factor of
one half results in doubling the lift and increasing the Z score as well. The resulting lift and Z
score for sequence analysis must be interpreted cautiously however since the assumptions
made in calculating the independent probability of correct ordering are quite broad. For
example, it is assumed that all combinations of ordering are equally likely to occur, and the
amount of time between occurrences is completely ignored. To give the user more control
over the calculation of lift and Z score for a sequence analysis, an option is provided to set the
“probability of correct ordering” factor to a constant value if desired. Setting it to 1 for
example effectively ignores this factor in the calculation of E_LR and therefore in lift and Z
score.
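The “probability of correct ordering” factor and its effect on the expected count and lift can be sketched in Python as follows (the counts are made up for illustration; this is not the product’s internal code):

# Illustrative only: ordering probability m!n!/(m + n)! for a sequence rule
# and its effect on the expected count and lift.
from math import factorial

def ordering_probability(m, n):
    return factorial(m) * factorial(n) / factorial(m + n)

p_order = ordering_probability(1, 1)     # one item on each side -> 1/2

N, L, R, LR = 10, 6, 5, 2                # made-up sequence counts
e_lr = (L / N) * (R / N) * N * p_order   # expected count adjusted for ordering
lift = LR / e_lr                         # the unadjusted lift divided by p_order

print(p_order, e_lr, lift)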
Initiate an Association Analysis
Use the following procedure to initiate a new Association analysis in Teradata Warehouse
Miner:
1. Click on the Add New Analysis icon in the toolbar:
Figure 1: Add New Analysis from toolbar
2. In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Association:
Figure 2: Add New Analysis dialog
3. This will bring up the Association dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
Association - INPUT - Data Selection
On the Association dialog click on INPUT and then click on data selection:
Figure 3: Association > Input > Data Selection
On this screen select:
1. Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis. (Note that since this analysis cannot
select from a volatile input table, Available Analyses will contain only those qualifying
analyses that create an output table or view).
2. Select Columns From a Single Table
• Available Databases (or Analyses) — These are the databases (or analyses) available to
be processed.
• Available Tables — These are the tables and views that are available to be processed.
• Available Columns — These are the columns within the table/view that are available
for the Association analysis.
• Selected Columns — Select columns by highlighting and then either dragging and
dropping into the Selected Columns window, or click on the arrow button to move
highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can either insert
columns as Group, Item, or Sequence columns. Make sure you have the correct portion
of the window highlighted.
• Group Column — The column that specifies the group for the Association analysis.
This column should specify observations or transactions that contain items of some
kind.
• Item Column — The column that specifies the items to be analyzed in the
Association analysis. The relationship of these items within the group will be
described by the Association analysis.
• Sequence Column — The column that specifies the sequence of items in the
Association analysis. This column should have a time ordering relationship with
the item associated with them.
Association - INPUT - Analysis Parameters
On the Association dialog click on INPUT and then click on analysis parameters:
Figure 4: Association > Input > Analysis Parameters
On this screen select:
• Association Combinations — In this window specify one or more association
combinations in the format of “X TO Y” where the sum of X and Y must not exceed a total
of 10. First select an “X TO Y” combination from the drop-down lists:
Figure 5: Association: X to X
Then click the Add button to add this combination to the window. Repeat for as many
combinations as needed:
Figure 6: Association Combinations pane
If needed, remove a combination by highlighting it in the window and then clicking on the
Remove button.
• Processing Options
• Perform All Steps — Execute the entire Association/Sequence Analysis, regardless of
result sets generated from a previous execution.
• Perform Support Calculation Only — In order to determine the minimum support value
to use, the user may choose to only build the single-item support table by using this
option, making it possible to stop and examine the table before proceeding.
• Recalculate Final Affinities Only — Rebuild just the final association tables using
support tables from a previous run provided that intermediate work tables were not
dropped (see Drop All Support Tables After Execution option below).
• Auto-Calculate group count — By default, the algorithm automatically determines the
actual input count.
• Force Group Count To — If the Auto-Calculate group count is disabled, this option
can be used to fix the number of groups, overriding the actual input count. This is
useful in conjunction with the Reduced Input Options, to set the group count to the
group count in the original data set, rather than the reduced input data set.
• Drop All Support Tables After Execution — Normally, the Association analysis
temporarily builds the support tables, dropping them prior to termination. If, for
performance reasons, it is desired to use the Recalculate Final Affinities Only option,
this option can be disabled so that this clean-up of support tables does not happen.
• Minimum Support — The minimum Support value that the association must have in
order to be reported. Using this option reduces the input data - this can be saved for
further processing using the Reduced Input Options. Using this option also invokes
list-wise deletion, automatically removing from processing (and from the reduced
input data) all rows containing a null Group, Item or Sequence column.
• Minimum Confidence — The minimum Confidence value that the association must
have in order to be reported.
• Minimum Lift — The minimum Lift value that the association must have in order to be
reported.
• Minimum Z-Score — The minimum absolute Z-Score value that the association must
have in order to be reported.
• Sequence Options — If a column is specified with the Sequence Column option, then the
following two Sequence Options are enabled. Note that Sequence Analysis is not
available when Hierarchy Information is specified:
• Use Relaxed Ordering — With this option, the items on each side of the association
rule may be in any sequence provided all the left items (antecedents) precede all the
right items (precedents).
• Auto-Calculate Ordering Probability — Sequence analysis option to let the algorithm
calculate the “probability of correct ordering” according to the principles described in
“Sequence Analysis” on page 7. (Note that the following option to set “Ordering
Probability” to a chosen value is only available if this option is unchecked).
• Ordering Probability — Sequence analysis option to set probability of correct ordering
to a non-zero constant value between 0 and 1. Setting it to 1 effectively ignores this
principle in calculating lift and Z-score.
Association - INPUT - Expert Options
On the Association dialog click on INPUT and then click on expert options:
Figure 7: Association > Input > Expert Options
On this screen select:
• Where Conditions — An SQL WHERE clause may be specified here to provide further
input filtering for only those groups or items that you are interested in. This works exactly
like the Expert Options for the Descriptive Statistics, Transformation and Data
Reorganization functions - only the condition itself is entered here.
Using this option reduces the input data set - this can be saved for further processing using
the Reduced Input Options. Using this option also invokes list-wise deletion, automatically
removing from processing (and from the reduced input data) all rows containing a null
Group, Item or Sequence column.
• Include Hierarchy Table — A hierarchy lookup table may be specified to convert input
items on both the left and right sides of the association rule to a higher level in a hierarchy
if desired. Note that the column in the hierarchy table corresponding to the items in the
input table must not contain repeated values, so effectively the items in the input table
must match the lowest level in the hierarchy table. The following is an example of a three-level hierarchy table compatible with Association analysis, provided the input table
matches up with the column ITEM1.
Table 1: Three-Level Hierarchy Table

ITEM1   ITEM2   ITEM3   DESC1      DESC2        DESC3
A       P       Y       Savings    Passbook     Deposit
B       P       Y       Checking   Passbook     Deposit
C       W       Z       Atm        Electronic   Access
D       S       X       Charge     Short        Credit
E       T       Y       CD         Term         Deposit
F       T       Y       IRA        Term         Deposit
G       L       X       Mortgage   Long         Credit
H       L       X       Equity     Long         Credit
I       S       X       Auto       Short        Credit
J       W       Z       Internet   Electronic   Access
Using this option reduces the input data set - this can be saved for further processing using
the Reduced Input Options. Using this option also invokes list-wise deletion, automatically
removing from processing (and from the reduced input data) all rows containing a null
Group, Item or Sequence column.
The following columns in the hierarchy table must be specified with this option.
• Item Column — The name of the column that can be joined to the column specified by
the Item Column option on the Select Column tab to look up the associated Hierarchy.
• Hierarchy Column — The name of the column with the Hierarchy values.
• Include Description Table — For reporting purposes, a descriptive name or label can be
given to the items processed during the Association/Sequence Analysis.
• Item ID Column — The name of the column that can be joined to the column specified
by the Item Column option on the Select Column tab (or Hierarchy Column option on
the Hierarchies tab if hierarchy information is also specified) to look up the
description.
• Item Description Column — The name of the column with the descriptive values.
• Include Left Side Lookup Table — A focus products table may be specified to process only
those items that are of interest on the left side of the association.
• Left Side Identifier Column — The name of the column where the Focus Products
values exist for the left side of the association.
• Include Right Side Lookup Table — A focus products table may be specified to process
only those items that are of interest on the right side of the association.
• Right Side Identifier Column — The name of the column where the Focus Products
values exist for the right side of the association.
Association - OUTPUT
On the Association dialog click on OUTPUT:
Figure 8: Association > Output
On this screen select:
• Output Tables
• Database Name — The database where the Association analysis builds temporary and
permanent tables during the analysis. This defaults to the Result Database.
• Table Names — Assign a table name for each displayed combination.
• Advertise Output — The Advertise Output option “advertises” each output table
(including the Reduced Input Table, if saved) by inserting information into one or
more of the Advertise Output metadata tables according to the type of analysis and the
options selected in the analysis.
• Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that
may be used to categorize or describe the output.
• Reduced Input Options — A reduced input set, based upon the minimum support value
specified, a product hierarchy or input filtering via a WHERE clause, can be saved and
used as input to a subsequent Association/Sequence analysis as follows:
• Save Reduced Input Table — Check box to specify to the analysis that the reduced
input table should be saved.
• Database Name — The database name where the reduced input table will be saved.
• Table Name — The table name that the reduced input table will be saved under.
• Generate SQL, but do not Execute it — Generate the Association or Sequence Analysis
SQL, but do not execute it - the set of queries are returned with the analysis results.
Run the Association Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Association Analysis
The results of running the Association analysis include a table for each association pair
requested, as well as the SQL to perform the association or sequence analysis. All of these
results are outlined below.
Association - RESULTS - SQL
On the Association dialog click on RESULTS and then click on SQL:
Figure 9: Association > Results > SQL
The series of SQL statements that comprise the Association/Sequence Analysis are displayed
here.
Association - RESULTS - data
On the Association dialog click on RESULTS and then click on data:
Figure 10: Association > Results > Data
Results data, if any, is displayed in a data grid.
An output table is generated for each item pair specified in the Association Combinations
option. Each table generated has the form specified below:
Table 2: Association Combinations output table

Name         Type             Definition
ITEMXOFY     User Defined     Two or more columns will be generated, depending upon the
             (default: the    number of Association Combinations. Together, these form the
             data type of     UPI of the result table. The value for X in the column name
             the Item         is 1 through the number of item pairs specified. The value
             Column)          for Y in the column name is the sum of the number of items
                              specified. For example, specifying Left and Right Association
                              Combinations of <1, 1> will produce two columns: ITEM1OF2,
                              ITEM2OF2. Specifying <1, 2> will result in three columns:
                              ITEM1OF3, ITEM2OF3 and ITEM3OF3. The data type is the same
                              as the Item Column.
LSUPPORT     DECIMAL(18,5)    The Support of the left-side item or antecedent only.
RSUPPORT     DECIMAL(18,5)    The Support of the right-side item or consequent only.
SUPPORT      DECIMAL(18,5)    The Support of the association (i.e., antecedent and
                              consequent together).
CONFIDENCE   DECIMAL(18,5)    The Confidence of the association.
LIFT         DECIMAL(15,5)    The Lift of the association.
ZSCORE       DECIMAL(15,5)    The Z-Score of the association.
Association - RESULTS - graph
On the Association dialog click on RESULTS and then click on graph:
Figure 11: Association > Results > Graph
For 1-to-1 Associations, a tile map is available as described below. (No graph is available for
combinations other than 1-to-1).
• Graph Options — Two selectors with a Reference Table display underneath are used to
make association selections to graph. For example, the following selections produced the
graph below.
Figure 12: Association Graph Selector
The Graph Options display has the following selectors:
a. Select item 1 of 2 from this table, then click button.
The first step is to select the left-side or antecedent items to graph associations for by
clicking or dragging the mouse just to the left of the row numbers displayed. Note that
the accumulated minimum and maximum values of the measures checked just above
the display are given in this table. (The third column, “Item2of2 count” is a count of
the number of associations that are found in the result table for this left-side item).
Once the selections are made, click the big button between the selectors.
b. Select from these tables to populate graph.
The second step is to select once again the desired left-side or antecedent items by
clicking or dragging the mouse just to the left of the row numbers displayed under the
general header “Item 1 of 2” in the left-hand portion of selector 2. Note that as “Item 1
of 2” items are selected, “Item 2 of 2” right-side or consequent items are automatically
selected in the right-hand portion of selector 2. Here the accumulated minimum and
maximum values of the measures checked just above this display are given in the
trailing columns of the table. (The third column “Item1of2 count” is a count of the
number of associations that are found in the result table for this right-side item when
limited to associations involving the left-side items selected in step 1). The
corresponding associations are automatically highlighted in the Reference Table
below.
An alternative second step is to directly select one or more “Item 2 of 2” items in the
right-hand portion of selector 2. The corresponding associations (again, limited to the
left-side items selected in the first step) are then highlighted in the Reference Table
below.
• Reference Table — This table displays the rows from the result table that correspond to the
selections made above in step 1, highlighting the rows corresponding to the selections
made in step 2.
• (Row Number) — A sequential numbering of the rows in this display.
• Item 1 of 2 — Left item or antecedent in the association rule.
• Item 2 of 2 — Right item or consequent in the association rule.
• LSupport — The left-hand item Support, calculated as the percentage (a value between
0 and 1) of groups that contain the left-hand item referenced in the association rule.
• RSupport — The right-hand item Support, calculated as the percentage (a value
between 0 and 1) of groups that contain the right-hand item referenced in the
association rule.
• Support — The Support, which is a measure of the generality of an association rule.
Calculated as the percentage (a value between 0 and 1) of groups that contain all of the
items referenced in the rule.
• Confidence — The Confidence defined as the probability of the right-hand item
occurring in an item group given that the left-hand item is in the item group.
• Lift — The Lift, which measures how much the probability of the existence of the
right-hand item is increased by the presence of the left-hand item in a group.
• ZScore — The Z score value, a measure of how statistically different the actual result
is from the expected result.
• Show Graph — A tile map is displayed when the “show graph” tab is selected, provided
that valid “graph options” selections have been made. The example below corresponds to
the graph options selected in the example above.
Figure 13: Association Graph
The tiles are color coded in the gradient specified on the right-hand side. Clicking on any
tile brings up all statistics associated with that association, and highlights the two items in
the association. Radio buttons above the upper right hand corner of the tile map can be
used to select the measure to color code in the tiles, that is either Support, Lift or Zscore.
Tutorial - Association Analysis
In this example, an Association analysis is performed on the fictitious banking data to analyze
channel usage. Parameterize an Association analysis as follows:
• Available Tables — twm_credit_tran
• Group Column — cust_id
• Item Column — channel
• Association Combinations
• Left — 1
• Right — 1
• Processing Options
• Perform All Steps — Enabled
• Minimum Support — 0
• Minimum Confidence — 0.1
• Minimum Lift — 1
• Minimum Z-Score — 1
• Where Clause Text — channel <> ‘ ‘ (i.e., channel is not equal to a single blank)
• Output Tables
• 1 to 1 Table Name — twm_tutorials_assoc
Run the analysis, and click on Results when it completes. For this example, the Association
analysis generated the following pages. The SQL is not shown for brevity.
Table 3: Tutorial - Association Analysis Data

ITEM1OF2   ITEM2OF2   LSUPPORT   RSUPPORT   SUPPORT   CONFIDENCE   LIFT      ZSCORE
A          E          0.85777    0.91685    0.80744   0.94132      1.02669   1.09511
B          K          0.49672    0.35667    0.21007   0.42291      1.18572   1.84235
B          V          0.49672    0.36324    0.22538   0.45374      1.24915   2.49894
C          K          0.67177    0.35667    0.26477   0.39414      1.10506   1.26059
C          V          0.67177    0.36324    0.27133   0.4039       1.11194   1.35961
E          A          0.91685    0.85777    0.80744   0.88067      1.0267    1.09511
K          B          0.35667    0.49672    0.21007   0.58898      1.18574   1.84235
K          C          0.35667    0.67177    0.26477   0.74234      1.10505   1.26059
K          V          0.35667    0.36324    0.1663    0.46626      1.28361   2.33902
V          B          0.36324    0.49672    0.22538   0.62047      1.24913   2.49894
V          C          0.36324    0.67177    0.27133   0.74697      1.11194   1.35961
V          K          0.36324    0.35667    0.1663    0.45782      1.2836    2.33902
Click on Graph Options and perform the following steps:
1. Select all data in selector 1 under the “Item 1 of 2” heading.
2. Click on the large button between selectors 1 and 2.
3. Select all data in selector 2 under the “Item 1 of 2” heading.
4. Click on the show graph tab.
When the tile map displays, perform the following additional steps:
a. Click on the bottom most tile. (Hovering over this tile will display the item names K and V).
b. Try selecting different measures at the top right of the tile map. (Zscore will initially be selected).
Figure 14: Association Graph: Tutorial
Cluster Analysis
Overview
The task of modeling multidimensional data sets encompasses a variety of statistical
techniques, including that of ‘cluster analysis’. Cluster analysis is a statistical process for
identifying homogeneous groups of data objects. It is based on unsupervised machine
learning and is crucial in data mining. Due to the massive sizes of databases today,
implementation of any clustering algorithm must be scalable to complete analysis within a
practicable amount of time, and must operate on large volumes of data with many variables.
Typical clustering statistical algorithms do not work well with large databases due to memory
limitations and execution times required.
The advantage of the cluster analysis algorithm in Teradata Warehouse Miner is that it
enables scalable data mining operations directly within the Teradata RDBMS. This is
achieved by performing the data intensive aspects of the algorithm using dynamically
generated SQL, while low-intensity processing is performed in Teradata Warehouse Miner. A
second key design feature is that model application or scoring is performed by generating and
executing SQL based on information about the model saved in metadata result tables. A third
key design feature is the use of the Expectation Maximization or EM algorithm, a particularly
sound statistical processing technique. Its simplicity makes possible a purely SQL-based
implementation that might not otherwise be feasible with other optimization techniques. And
finally, the Gaussian mixture model gives a probabilistic approach to cluster assignment,
allowing observations to be assigned probabilities for inclusion in each cluster. The clustering
is based on a simplified form of generalized distance in which the variables are assumed to be
independent, equivalent to Euclidean distances on standardized measures.
While this section primarily introduces Gaussian Mixture Model clustering, variations of this
technique are described in the next section. In particular, the Fast K-Means clustering option
uses a quite different technique: a stored procedure and a table operator that process the data
more directly in the database for a considerable performance boost.
Preprocessing - Cluster Analysis
Some preprocessing of the input data by the user may be necessary. Any categorical data to be
clustered must first be converted to design-coded numeric variables. Since null data values
may bias or invalidate the analysis, they may be replaced, or the listwise deletion option
selected to exclude rows with any null values in the preprocessing phase.
Teradata Warehouse Miner automatically builds a single input table from the requested
columns of the requested input table. If the user requests more than 30 input columns, the data
is unpivoted with additional rows added for the column values. Through this mechanism, any
number of columns within a table may be analyzed, and the SQL optimized for a particular
Teradata server capability.
Expectation Maximization Algorithm
The clustering algorithm requires specification of the desired number of clusters. After
preprocessing, an initialization step determines seed values for the clusters, and clustering is
then performed based on conditional probability and maximum likelihood principles using
the EM algorithm to converge on cluster assignments that yield the maximum likelihood
value.
In a Gaussian Mixture (GM) model, it is assumed that the variables being modeled are
members of a normal (Gaussian) probability distribution. For each cluster, a maximum
likelihood equation can be constructed indicating the probability that a randomly selected
observation from that cluster would look like a particular observation. A maximum likelihood
rule for classification would assign this observation to the cluster with the highest likelihood
value. In the computation of these probabilities, conditional probabilities use the relative size
of clusters and prior probabilities, to compute a probability of membership of each row to
each cluster. Rows are reassigned to clusters with probabilistic weighting, after units of
distance have been transformed to units of standard deviation of the standard normal
distribution via the Gaussian distance function:
$$p_{mo} = (2\pi)^{-n/2}\,\lvert R \rvert^{-1/2}\,\exp\!\left(-\frac{d_{mo}^{2}}{2}\right)$$
Where:
• p is dimensioned 1 by 1 and is the probability of membership of a point to a cluster
• d is dimensioned 1 by 1 and is the Mahalanobis Distance
• n is dimensioned 1 by 1 and is the number of variables
• R is dimensioned n by n and is the cluster variance/covariance matrix
The Gaussian Distance Function translates distance into a probability of membership under
this probabilistic model. Intermediate results are saved in Teradata tables after each iteration,
so the algorithm may be stopped at any point and the latest results viewed, or a new clustering
process begun at this point. These results consist of cluster means, variances and prior
probabilities.
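As an illustration with hypothetical values, suppose a cluster is defined on n = 2 independent variables with variances 4 and 9, so that the diagonal covariance matrix has |R| = 36, and a row lies at squared distance d² = 2 from the cluster. The Gaussian distance function then gives:

$$p = (2\pi)^{-2/2}\,(36)^{-1/2}\,\exp(-2/2) = \frac{1}{2\pi}\cdot\frac{1}{6}\cdot e^{-1} \approx 0.0098$$

so larger distances and larger cluster variances both lower the membership probability.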
Expectation Step
Means, variances and frequencies of rows assigned by cluster are first calculated. A
covariance inverse matrix is then constructed using these variances, with all non-diagonals
assumed to be zero. This simplification is tantamount to the assumption that the variables are
independent. Performance is improved thereby, allowing the number of calculations to be
proportional to the number of variables, rather than its square. Row distances to the mean of
each cluster are calculated using a Mahalanobis Distance (MD) metric:
$$d_{o}^{2} = \sum_{i=1}^{n}\sum_{j=1}^{n}\,(x_{i} - c_{oi})\,R_{ij}^{-1}\,(x_{j} - c_{oj})$$
Where:
• m is the number of rows
• n is the number of variables
• o is the number of clusters
• d is dimensioned n by o and is the Mahalanobis Distance from a row to a cluster
• x is dimensioned m by n and is the data
• c is dimensioned 1 by n and are the cluster centroids
• R is dimensioned n by n and is the cluster variance/covariance matrix
Mahalanobis Distance is a rescaled unitless data form used to identify outlying data points.
Independent variables may be thought of as defining a multidimensional space in which each
observation can be plotted. Means (“centroids”) for each independent variable may also be
plotted. Mahalanobis distance is the distance of each observation from its centroid, defined by
variables that may be dependent. In the special case where variables are independent or
uncorrelated, it is equivalent to the simple Euclidean distance. In the default GM model,
separate covariance matrices are maintained, conforming to the specifications of a pure
maximum likelihood rule model.
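Because the covariance matrix used here is diagonal, the distance reduces to a sum of squared standardized differences. As a hypothetical illustration, a row with values (3, 10) measured against a cluster with centroid (1, 4) and variances (4, 9) has:

$$d^{2} = \frac{(3-1)^{2}}{4} + \frac{(10-4)^{2}}{9} = 1 + 4 = 5$$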
The EM algorithm works by performing the expectation and maximization steps iteratively
until the log-likelihood value converges (i.e., changes less than a default or specified epsilon
value), or until a maximum specified number of iterations has been performed. The log-likelihood value is the sum over all rows of the natural log of the probabilities associated with
each cluster assignment. Although the EM algorithm is guaranteed to converge, it is possible
it may converge slowly for comparatively random data, or it may converge to a local
maximum rather than a global one.
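Written out for a mixture model of this kind (a sketch of the usual form, using the cluster weights and membership probabilities defined earlier rather than the product's exact internal notation), the log-likelihood after an iteration is:

$$LL = \sum_{m} \ln\!\left(\sum_{o} w_{o}\,p_{mo}\right)$$

where w_o is the prior probability (weight) of cluster o and p_mo is the membership probability of row m in cluster o.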
Maximization Step
The row is assigned to the nearest cluster with a probabilistic weighting for the GM model, or
with certainty for the K-Means model.
Options - Cluster Analysis
K-Means Option
With the K-Means option, rows are reassigned to clusters by associating each to the closest
cluster centroid using the shortest distance. Data points are assumed to belong to only one
cluster, and the determination is considered a ‘hard assignment’. After the distances are
computed from a given point to each cluster centroid, the point is assigned to the cluster
whose center is nearest to the point. On the next iteration, the point’s value is used to redefine
that cluster’s mean and variance. This is in contrast to the default Gaussian option, wherein
rows are reassigned to clusters with probabilistic weighting, after units of distance have been
transformed to units of standard deviation via the Gaussian distance function.
Also with the K-means option, the variables' distances to cluster centroids are calculated by
summing, without any consideration of the variances, resulting effectively in the use of
unnormalized Euclidean distances. This implies that variables with large variances will have
a greater influence over the cluster definition than those with small variances. Therefore, a
typical preparatory step to conducting a K-means cluster analysis is to standardize all of the
numeric data to be clustered using the Z-score transformation function in Teradata Warehouse
Miner. K-means analyses of data that are not standardized typically produce results that: (a)
are dominated by variables with large variances, and (b) virtually or totally ignore variables
with small variances during cluster formation. Alternatively, the Rescale function could be
used to normalize all numeric data, with a lower boundary of zero and an upper boundary of
one. Normalizing the data prior to clustering gives all the variables equal weight.
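The Z-score transformation function mentioned above generates this standardization for you. Purely to illustrate the arithmetic, an equivalent hand-written query might look like the following sketch, in which the database, table and column names (twm_work.my_input, cust_id, avg_cc_bal, avg_ck_bal) are hypothetical:

/* Sketch only: standardize two numeric columns before K-Means clustering. */
CREATE VIEW twm_work.my_input_z AS
SELECT t.cust_id,
       (t.avg_cc_bal - s.mean_cc) / s.sd_cc AS avg_cc_bal_z,
       (t.avg_ck_bal - s.mean_ck) / s.sd_ck AS avg_ck_bal_z
FROM twm_work.my_input t
CROSS JOIN
 (SELECT AVG(avg_cc_bal) AS mean_cc, STDDEV_SAMP(avg_cc_bal) AS sd_cc,
         AVG(avg_ck_bal) AS mean_ck, STDDEV_SAMP(avg_ck_bal) AS sd_ck
  FROM twm_work.my_input) s;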
Fast K-Means Option
The Fast K-Means option provides a dramatic performance improvement over the K-Means
option. When selected, the options on the analysis parameters tab are altered and the options
on the expert options tab are not available. With Fast K-Means, the options include the
following:
• The Number of Clusters, Convergence Criterion and Maximum Iterations are provided as
before.
• The option to remove null values using list-wise deletion is not offered; it is performed automatically.
• The Variable Importance Evaluation Reports are not offered.
• The Cluster Definitions Database and Table names are supplied by you. This table stores
the model and the scoring module processes it. It can also be used to continue execution
starting with the cluster definitions in this table, rather than using random starting clusters.
• An Advertise Option option is provided for the Cluster Definitions table.
The Fast K-Means algorithm creates an output table structured differently than other
clustering algorithms. The output table is converted into the style used by other algorithms
(that is, viewed as a report and graphed in the usual manner). If the conversion is not possible,
you can view the cluster definitions in the new style as data, along with the progress report.
Note: Install the td_analyze external stored procedure and the tda_kmeans table operator
called by the stored procedure in the database where the TWM metadata tables reside. You
can use the Install or Uninstall UDF’s option under the Teradata Warehouse Miner start
program item, selecting the option to Install TD_Analyze UDFs.
Poisson Option
The Poisson option is designed to be applied to data containing mixtures of
Poisson-distributed variables. The data is first normalized so all variables have the same
means and variances, allowing the calculation of the distance metric without biasing the result
in favor of larger-magnitude variables. The EM algorithm is then applied with a probability
metric based on the likelihood function of the Poisson distribution function. As in the
Gaussian Mixture Model option, rows are assigned to the nearest cluster with a probabilistic
weighting. At the end of the EM iteration, the data is unnormalized and saved as a potential
result, until or unless replaced by the next iteration.
Average Mode - Minimum Generalized Distance
Within the GM model, a special “average mode” option is provided, using the minimum
generalized distance rule. With this option, a single covariance matrix is used for all clusters,
rather than using an individual covariance matrix for each cluster. A weighted average of the
covariance matrices is constructed for use in the succeeding iteration.
Automatic Scaling of Likelihood Values
When a large number of variables are input to the cluster analysis module, likelihood values
can become prohibitively small. The algorithm automatically scales these values to avoid loss
of precision, without invalidating the results in any way. The expert option ‘Scale Factor
Exponent (s)’ may be used to bypass this feature by using a specific value, e.g., 10^s, to
multiply the probabilities.
Continue Option
The Continue Option allows clustering to be resumed where it left off by starting with the
cluster centroid, variance and probability values of the last complete iteration saved in the
metadata tables or output tables as requested on the Output Panel. Specifically, if the
Continue Option is selected and output tables are specified and exist, the information in the
output tables is used to restart processing. If output tables do not exist, then the model in
metadata is used to restart processing.
Note: If requested, the output tables are updated for each iteration of the algorithm and can,
therefore, provide a degree of recovery.
In the case of the Fast K-Means algorithm, however, the Continue Option depends on locating
the Cluster Definition table named on the analysis parameters tab, which is effectively the
model for this algorithm variation. The Cluster Definition table also is updated for each
iteration of the algorithm and can, therefore, provide a degree of recovery.
There is a special case of the Continue Option in which the Fast K-Means algorithm is used to start processing and, if processing terminates successfully, a Gaussian Mixture Model clustering then continues from that result.
Note: With Fast K-Means, the output tables are built only at the end of processing and not after each iteration of the algorithm.
To do this, request output tables on the Fast K-Means analysis and request the same tables as output tables on a Gaussian Mixture Model analysis with the Continue Option also selected.
Using the TWM Cluster Analysis
This section recommends parameter settings and techniques that apply primarily to the
Gaussian Mixture Model.
Sampling Large Database Tables as a Starting Method
It may be most effective to use the sample parameter to begin the analysis of extremely large
databases. The execution times are much faster and an approximate result obtained that can
be used as a starting point, as described above. Results may be compared using the log-likelihood value, where the largest value indicates the best clustering fit, in terms of
maximum likelihood. Because local maxima may result from a particular EM clustering
analysis, multiple executions from different samples may produce a seed that ultimately
yields the best log-likelihood value.
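The sampling itself is handled by the Input Sample Fraction expert option described later. Conceptually it corresponds to clustering a random subset such as the one returned by the following sketch, where the table name is hypothetical:

/* Sketch only: approximate 10% random sample of the input table. */
SELECT *
FROM twm_work.my_input SAMPLE 0.10;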
Clustering and Data Problems
Common data problems for cluster analysis include insufficient rows provided for the number
of clusters requested, and constants in the data resulting in singular covariance matrices.
When these problems occur, warning messages and recommendations are provided. An
option for dealing with null values during processing is described below.
Additionally, Teradata errors may occur for non-normalized data having more than 15 digits
of significance. In this case, a preprocessing step of either multiplying (for small numbers) or
dividing (for large numbers) by a constant value may rectify overflow and underflow
conditions. The clusters will remain the same, since this only changes the unit of measure.
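As a sketch of such a preprocessing step (view and column names are hypothetical), a column with very large values can be divided by a constant, and a column with very small values multiplied by a constant, before the view is used as the clustering input:

/* Sketch only: change the unit of measure to avoid overflow/underflow. */
CREATE VIEW twm_work.my_input_scaled AS
SELECT cust_id,
       big_balance / 1000000.0 AS big_balance_millions,
       tiny_rate   * 1000      AS tiny_rate_x1000
FROM twm_work.my_input;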
Clustering and Constants in Data
When one or more of the variables included in the clustering analysis have only a few values,
these values may be singled out and included in particular clusters as constants. This is most
likely when the number of clusters sought is large. When this happens, the covariance matrix
becomes singular and cannot be inverted, since some of the variances are zero. A feature is
provided in the cluster algorithm to improve the chance of success under these conditions, by
limiting how close to zero the variance may be set, e.g., 10^-3. The default value is 10^-10. If the
log-likelihood values increase for a number of iterations and then start decreasing, it is likely
due to the clustering algorithm having found clusters where selected variables are all the same
value (a constant), so the cluster variance is zero. Changing the minimum variance exponent
value to a larger value may reduce the effect of these constants, allowing the other variables
to converge to a higher log-likelihood value.
Clustering and Null Values
The presence of null values in the data may result in clusters that differ from those that would
have resulted from zero or numeric values. Since null data values may bias or invalidate the
analysis, they should be replaced or the column eliminated. Alternatively, the listwise
deletion option can be selected to exclude rows with any null values in the preprocessing
phase.
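For example, nulls can be replaced in a view built over the input table, or rows containing nulls can be excluded, as in the following sketch with hypothetical names (the listwise deletion option performs the equivalent of the second query automatically, and choosing a replacement value is itself a modeling decision):

/* Replace nulls in one column with a constant (here zero)... */
SELECT cust_id, COALESCE(avg_sv_bal, 0) AS avg_sv_bal
FROM twm_work.my_input;

/* ...or exclude rows that contain a null in any selected column. */
SELECT cust_id, avg_cc_bal, avg_sv_bal
FROM twm_work.my_input
WHERE avg_cc_bal IS NOT NULL
  AND avg_sv_bal IS NOT NULL;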
Stop Execution of a Clustering or Cluster Scoring Analysis
Analyses can be terminated prior to normal completion by highlighting the name and clicking
Stop on the Toolbar or by right-clicking the analysis name and selecting the Stop option.
Typically, this results in a Cancelled status and a Cancelled message during execution.
However, it can result in a Failed status and an error message, such as “The transaction was
aborted by the user,” particularly when using the Fast K-Means algorithm.
Success Analysis - Cluster Analysis
If the log-likelihood value converges and the requested number of clusters is obtained with
significant probabilities, then the clustering analysis can be considered successful. If the log-likelihood value declines, indicating convergence is complete, the iterations stop.
Occasionally, warning messages can indicate constants within one or more clusters.
Optimizing Performance of Clustering
Parallel execution of SQL is an important feature of the cluster analysis algorithm in Teradata
Warehouse Miner as well as Teradata. The number of variables to cluster in parallel is
determined by the ‘width’ parameter. The optimum value of width will depend on the size of
the Teradata system, its memory size, and so forth. Experience has shown that when a large
number of variables are clustered on, the optimum value of width ranges from 20-25. The
width value is dynamically set to the lesser of the specified Width option (default = 25) and
the number of columns, but can never exceed 118. If SQL errors indicate insufficient
memory, reducing the width parameter may alleviate the problem.
Initiate a Cluster Analysis
Use the following procedure to initiate a new Cluster analysis in Teradata Warehouse Miner:
1 Click on the Add New Analysis icon in the toolbar:
Figure 15: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Clustering:
Figure 16: Add New Analysis dialog
3 This will bring up the Clustering dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
Cluster - INPUT - Data Selection
On the Clustering dialog click on INPUT and then click on data selection:
Figure 17: Clustering > Input > Data Selection
On this screen select:
1 Select Input Source
Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view).
2 Select Columns From a Single Table
• Available Databases (or Analyses) — All the databases (or analyses) that are available for the Clustering analysis.
• Available Tables — All the tables within the Source Database that are available for the Clustering analysis.
• Available Columns — Within the selected table or matrix, all columns which are available for the Clustering analysis.
• Selected Columns — Columns must be of numeric type. For Fast K-Means, selected columns may not contain leading or trailing spaces and may not contain a separator character '|' if scoring of the model is ever published. Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.
Cluster - INPUT - Analysis Parameters
On the Clustering dialog click on INPUT and then click on analysis parameters:
Figure 18: Clustering > Input > Analysis Parameters
On this screen select:
• Clustering Algorithm
• Gaussian — Cluster the data using a Gaussian Mixture Model as described above. This is the default Algorithm.
• K-Means — Cluster the data using the K-Means Model as described above.
• Fast K-Means — Cluster the data using a high-performing version of the K-Means Model.
• Poisson — Cluster the data using a Poisson Mixture Model as described above.
• Number of clusters — Enter the number of clusters before executing the cluster analysis.
• Convergence Criterion — For the Gaussian and Poisson Mixture Models, clustering stops
when the log-likelihood increases less than this amount. The default value is 0.001. Fast
K-Means uses this field as a threshold for cluster changes based on a different formula.
Generic K-Means, on the other hand, does not use this criterion as clustering stops when
the distances of all points to each cluster have not changed from the previous iteration. In
other words, when the assignment of rows to clusters has not changed from the previous
iteration, clustering has converged.
• Maximum Iterations — Clustering is stopped after this maximum number of iterations has
occurred. The default value is 50.
• Remove Null Values (using Listwise deletion) — This option eliminates all rows from
processing that contain any null input columns. The default is enabled. Fast K-Means
always performs Listwise deletion — it is not an option.
• Include Variable Importance Evaluation reports — Report shows resultant log-likelihood
when each variable is successively dropped out of the clustering calculations. The most
important variable will be listed next to the most negative log-likelihood value; the least
important variable will be listed with the least negative value. Fast K-Means does not
offer this option.
• Cluster Definitions Database and Table — Applies only to the Fast K-Means algorithm.
This table holds the model information and is used when continuing a previous run or
when scoring. An option is also provided to Advertise Output with an optional Advertise
Note.
• Generate SQL Only — Applies only to the Fast K-Means algorithm. This option, if
checked, generates the SQL call statement of the external stored procedure td_analyze but
does not execute it. The SQL can be viewed on the Results > SQL tab.
• Continue Execution (instead of starting over) — Previous execution results are used as seed
values for starting clustering.
Cluster - INPUT - Expert Options
This screen does not apply to the Fast K-Means algorithm.
On the Clustering dialog click on INPUT and then click on expert options:
Figure 19: Clustering > Input > Expert Options
On this screen select:
• Width — Number of variables to process in parallel (dependent on system limits)
• Input Sample Fraction — Fraction of input dataset to cluster on.
• Scale Factor Exponent — If nonzero “s” is entered, this option overrides automatic
scaling, scaling by 10^s.
• Minimum Probability Exponent — If “e” is entered, the Clustering analysis uses 10^e as the smallest nonzero number in SQL calculations.
• Minimum Variance Exponent — If “v” is entered, the Clustering analysis uses 10^v as the minimum variance in SQL calculations.
• Use single cluster covariance — Simplified model that uses the same covariance table for
all clusters.
• Use Random Seeding — When enabled (default) this option seeds the initial clustering
answer matrix by randomly selecting a row for each cluster as the seed. This method is the
most commonly used type of seeding for all other clustering systems, according to the
literature. The byproduct of using this new method is that slightly different solutions will
be provided by successive clustering runs, and convergence may be quicker because
fewer iterations may be required.
• Seed Sample Percentage — If Use Random Seeding is disabled, the previous seeding method of Teradata Warehouse Miner Clustering is used, in which every row is assigned to one of the clusters and the averages are then used as the seeds. Enter a percentage (1-100) of the input dataset to use as the starting seed.
Cluster - OUTPUT
This screen does not apply to the Fast K-Means algorithm.
On the Clustering dialog, click on OUTPUT:
Figure 20: Cluster > OUTPUT
On this screen select:
• Store the variables table of this analysis in the database — Check this box to store the
variables table of this analysis in two tables in the database, one for cluster columns and
one for cluster results.
• Database Name — The name of the database to create the output tables in.
• Output Table Prefix — The prefix of the output tables. (For example, if test is entered here,
tables test_ClusterColumns and test_ClusterResults will be created).
• Advertise Output — The Advertise Output option “advertises” output by inserting
information into one or more of the Advertise Output metadata tables according to the
type of analysis and the options selected in the analysis.
• Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that may be
used to categorize or describe the output.
By way of an example, the tutorial example with prefix test yields table test_ClusterResults:
Table 4: test_ClusterResults

column_ix  cluster_id  priors              m                  v
1          1           0.0692162138434691  -2231.95933518596  7306685.95957656
1          2           0.403625379654599   -947.132576882845  846532.221977884
1          3           0.527158406501931   -231.599917701351  105775.923364194
2          1           0.0692162138434691  3733.31923440023   18669805.3968291
2          2           0.403625379654599   1293.34863525092   1440668.11504453
2          3           0.527158406501931   231.817911577847   102307.594966697
3          1           0.0692162138434691  3725.87257974281   18930649.6488828
3          2           0.403625379654599   632.603945909026   499736.882919713
3          3           0.527158406501931   163.869611182736   57426.9984808451
and test_ClusterColumns:
Table 5: test_ClusterColumns

table_name             column_name  column_alias  column_order  index_flag  variable_type
twm_customer_analysis  avg_cc_bal   avg_cc_bal    1             0           1
twm_customer_analysis  avg_ck_bal   avg_ck_bal    2             0           1
twm_customer_analysis  avg_sv_bal   avg_sv_bal    3             0           1
If Database Name is twm_results and Output Table Prefix is test, these tables are defined
respectively as:
CREATE SET TABLE twm_results.test_ClusterResults
(
column_ix INTEGER,
cluster_id INTEGER,
priors FLOAT,
m FLOAT,
v FLOAT)
UNIQUE PRIMARY INDEX ( column_ix ,cluster_id );
CREATE SET TABLE twm_results.test_ClusterColumns
(
table_name VARCHAR(30) CHARACTER SET UNICODE NOT CASESPECIFIC,
column_name VARCHAR(30) CHARACTER SET UNICODE NOT CASESPECIFIC,
column_alias VARCHAR(100) CHARACTER SET UNICODE NOT
CASESPECIFIC,
column_order SMALLINT,
index_flag SMALLINT,
variable_type INTEGER)
UNIQUE PRIMARY INDEX ( table_name ,column_name );
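Once these tables exist, they can be queried directly. The following sketch lists each cluster's weight, mean and variance by variable name; it assumes, as in the tutorial output above, that column_ix in test_ClusterResults corresponds to column_order in test_ClusterColumns:

/* Sketch only: report the clustering solution from the stored output tables. */
SELECT c.column_alias,
       r.cluster_id,
       r.priors AS weight,
       r.m      AS mean_value,
       r.v      AS variance_value
FROM twm_results.test_ClusterResults r
JOIN twm_results.test_ClusterColumns c
  ON r.column_ix = c.column_order
ORDER BY r.cluster_id, c.column_order;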
Run the Cluster Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Cluster Analysis
The results of running the Cluster analysis include a variety of statistical reports, a similarity/
dissimilarity graph, as well as a cluster size and distance measure graph. All of these results
are outlined below.
Cluster - RESULTS - reports
On the Clustering dialog click on RESULTS and then click on reports:
Figure 21: Clustering > Results > Reports
Clustering Progress
• Iteration — This represents the number of the step in the Expectation Maximization
clustering algorithm as it seeks to converge on a solution maximizing the log likelihood
function.
• Log Likelihood — This is the log likelihood value calculated at the end of this step in the
Expectation Maximization clustering algorithm. It does not appear when the K-Means
option is used.
• Diff — This is simply the difference in the log likelihood value between this and the
previous step in the modeling process, starting with 0 at the end of the first step. It does
not appear when the K-Means option is used.
• Timestamp — This is the day, date, hour, minute and second marking the end of this step
in processing.
Clustering Progress for Fast K-Means
The Clustering Progress report for Fast K-Means contains the Timestamp and Message
columns. The Message column contains information such as processing phase and iterations.
When using the Fast K-Means algorithm, the Clustering Solution report is derived from the
Cluster Definitions table. If the Clustering Solution report cannot be successfully derived, the
Cluster Definitions table can be viewed on the Results-Data tab as an alternative (although it
does not include all of the same information, such as variance).
The Importance of Variables report does not apply to Fast K-Means.
Importance of Variables
This report is available when the Include Variable Importance Evaluation Report option is
enabled on the Expert Options tab.
• Col — The column number in the order the input columns were requested.
• Name — Name of the column being clustered.
• Log Likelihood — This is the log likelihood value calculated if this variable was removed
from the clustering solution.
Clustering Solution
• Col — This is the column number in the order the input columns were requested.
• Table_Name — The name of the table associated with this input column.
• Column_Name — The name of the input column used in performing the cluster analysis.
• Cluster_Id — The cluster number that this data applies to, from 1 to the number of clusters
requested.
• Weight — This is the so-called “prior probability” that an observation would belong to
this cluster, based on the percentage of observations belonging to this cluster at this stage.
• Mean — When the Gaussian Mixture Model algorithm is selected, Mean is the weighted
average of this column or variable amongst all the observations, where the weight used is
the probability of inclusion in this cluster. When the K-Means algorithm is selected, Mean
is the average value of this column or variable amongst the observations assigned to this
cluster at this iteration of the algorithm.
• Variance — When the Gaussian Mixture Model algorithm is selected, Variance is the
weighted variance of this variable amongst all the observations, where the weight used is
the probability of inclusion in this cluster. When the K-Means algorithm is selected,
Variance is the variance of this variable amongst the observations assigned to this cluster
at this iteration. (Variance is the square of a variable’s standard deviation, measuring in
some sense how its value varies from one observation to the next).
Cluster - RESULTS - sizes graph
On the Clustering dialog click on RESULTS and then click on sizes graph:
Figure 22: Clustering > Results > Sizes Graph
The Sizes (and Distances) graph plots the mean values of a pair of variables at a time,
indicating the clusters by color and number label, and the standard deviations (square root of
the variance) by the size of the ellipse surrounding the mean point, using the same color-coding. Roughly speaking, this graph depicts the separation of the clusters with respect to
pairs of model variables. The following options are available:
• Non-Normalized — The default value to show the clusters without any normalization.
• Normalized — With the Normalized option, cluster means are divided by the largest
absolute mean and the size of the circle based on the variance is divided by the largest
absolute variance.
• Variables
• Available — The variables that were input into the Clustering Analysis.
• Selected — The variables that will be shown on the Size and Distances graph. Two variables are required to be entered here.
• Clusters
• Available — A list of clusters generated in the clustering solution.
• Selected — The clusters that are shown on the Size and Distances graph. Up to twelve clusters can be selected to be shown on the Size and Distances graph.
• Zoom In — While holding down the left mouse button on the Size and Distances graph,
drag a lasso around the area that you desire to magnify. Release the mouse button for the
zoom to take place. This can be repeated until the desired level of magnification is
achieved.
• Zoom Out — Hit the “Z” key, or toggle the Graph Options tab to go back to the original
magnification level.
Cluster - RESULTS - data
When clustering with the Fast K-Means algorithm, select this tab to display the cluster means
and solution progress reports. The cluster means display contains a subset of the information
shown on the Solution report and is intended as a backup in case the Solution report cannot be
produced. If the project is saved and reopened, the data is not displayed as it is with the other
tabs.
Note: The other clustering algorithms do not use this display.
Cluster - RESULTS - SQL
When clustering with the Fast K-Means algorithm, select this tab to display the SQL
generated by the algorithm, consisting of the call to the td_analyze external stored procedure.
Note: The other clustering algorithms do not use this display.
Cluster - RESULTS - similarity graph
On the Clustering dialog click on RESULTS and then click on similarity graph:
Figure 23: Clustering > Results > Similarity Graph
The Similarity graph allows plotting the means and variances of up to twelve clusters and
twelve variables at one time. The cluster means (i.e., the mean values of the variables for the
data points assigned to the cluster) are displayed with values varying along the x-axis. A
different line parallel to the x-axis is used for each variable. The normalized variances are
displayed for each variable by color-coding, and the clusters are identified by number next to
the point graphed. Roughly speaking, the more spread out the points on the graph, the more
differentiated the clusters are. The following options are available:
• Non-Normalized — The default value to show the clusters without any normalization.
• Normalized — With the Normalized option, the cluster mean is divided by the largest
absolute mean.
• Variables
• Available — The variables that were input into the Clustering Analysis.
• Selected — The variables that will be shown on the Similarity graph. Up to twelve variables can be selected to be shown on the Similarity graph.
• Clusters
• Available — A list of clusters generated in the clustering solution.
• Selected — The clusters that will be shown on the Similarity graph. Up to twelve clusters can be selected to be shown on the Similarity graph.
Tutorial - Cluster Analysis
In this example, Gaussian Mixture Model cluster analysis is performed on 3 variables giving
the average credit, checking and savings balances of customers, yielding a requested 3
clusters. Note that since Clustering in Teradata Warehouse Miner is non-deterministic, the
results may vary from these, or from execution to execution.
Parameterize a Cluster analysis as follows:
• Selected Tables and Columns
• twm_customer_analysis.avg_cc_bal
• twm_customer_analysis.avg_ck_bal
• twm_customer_analysis.avg_sv_bal
• Number of Clusters — 3
• Algorithm — Gaussian Mixture Model
• Convergence Criterion — 0.1
• Use Listwise deletion to eliminate null values — Enabled
Run the analysis and click on Results when it completes. For this example, the Clustering
Analysis generated the following pages. Note that since Clustering is non-deterministic,
results may vary. A single click on each page name populates the page with the item.
Table 6: Progress

Iteration  Log Likelihood  Diff  Timestamp
1          -25.63          0     3:05 PM
2          -25.17          .46   3:05 PM
3          -24.89          .27   3:05 PM
4          -24.67          .21   3:05 PM
5          -24.42          .24   3:05 PM
6          -24.33          .09   3:06 PM
Table 7: Solution

Col  Table_Name             Column_Name  Cluster_Id  Weight  Mean       Variance
1    twm_customer_analysis  avg_cc_bal   1           .175    -1935.576  3535133.504
2    twm_customer_analysis  avg_ck_bal   1           .175    2196.395   9698027.496
3    twm_customer_analysis  avg_sv_bal   1           .175    674.72     825983.51
1    twm_customer_analysis  avg_cc_bal   2           .125    -746.095   770621.296
2    twm_customer_analysis  avg_ck_bal   2           .125    948.943    1984536.299
3    twm_customer_analysis  avg_sv_bal   2           .125    2793.892   11219857.457
1    twm_customer_analysis  avg_cc_bal   3           .699    -323.418   175890.376
2    twm_customer_analysis  avg_ck_bal   3           .699    570.259    661100.56
3    twm_customer_analysis  avg_sv_bal   3           .699    187.507    63863.503
Sizes Graph
By default, the following graph will be displayed.
This parameterization includes:
• Non-Normalized — Enabled
• Variables Selected
• avg_cc_bal
• avg_ck_bal
• Clusters Selected
• Cluster 1
• Cluster 2
• Cluster 3
Figure 24: Clustering Analysis Tutorial: Sizes Graph
Similarity Graph
By default, the following graph will be displayed. This parameterization includes:
• Non-Normalized — Enabled
• Variables Selected
• avg_cc_bal
• avg_ck_bal
• avg_sv_bal
• Clusters Selected
• Cluster 1
• Cluster 2
• Cluster 3
Figure 25: Clustering Analysis Tutorial: Similarity Graph
Decision Trees
Overview
Decision tree models are most commonly used for classification. What is a classification
model or classifier? It is simply a model for predicting a categorical variable, that is a variable
that assumes one of a predetermined set of values. These values can be either nominal or
ordinal, though ordinal variables are typically treated the same as nominal ones in these
models. (An example of a nominal variable is single, married and divorced marital status,
while an example of an ordinal or ordered variable is low, medium and high temperature). It
is the ability of decision trees to not only predict the value of a categorical variable, but to
directly use categorical variables as input or predictor variables that is perhaps their principal
advantage. Decision trees are by their very nature also well suited to deal with large numbers
of input variables, handle a mixture of data types and handle data that is not homogeneous
(i.e., the variables do not have the same interrelationships throughout the data space). They
also provide insight into the structure of the data space and the meaning of a model, a result at
times as important as the accuracy of a model. It should be noted that a variation of decision
trees called regression trees can be used to build regression models rather than classification
models, enjoying the same benefits just described. Most of the upcoming discussion is geared
toward classification trees with regression trees described separately.
What are Decision Trees?
What does a decision tree model look like? It first of all has a root node, which is associated
with all of the data in the training set used to build the tree. Each node in the tree is either a
decision node or a leaf node, which has no further connected nodes. A decision node
represents a split in the data based on the values of a single input or predictor variable. A leaf
node represents a subset of the data that has a particular value of the predicted variable (i.e.,
the resulting class of the predicted variable). A measure of accuracy is also associated with
the leaf nodes of the tree.
The first issue in building a tree is the decision as to how data should be split at each decision
node in the tree. The second issue is when to stop splitting each decision node and make it a
leaf. And finally, what class should be assigned to each leaf node. In practice, researchers
have found that it is usually best to let a tree grow as big as it needs to and then prune it back
at the end to reduce its complexity and increase its interpretability.
Once a decision tree model is built it can be used to score or classify new data. If the new data
includes the values of the predicted variable it can be used to measure the effectiveness of the
model. Typically, though, scoring is performed in order to create a new table containing key
fields and the predicted value or class identifier.
Decision Trees in Teradata Warehouse Miner
Teradata Warehouse Miner provides decision trees for classification models and regression
models. They are built largely on the techniques described in [Breiman, Friedman, Olshen
and Stone] and [Quinlan]. As such, splits using the Gini diversity index, regression or
information gain ratio are provided. Pruning is also provided, using either the Gini diversity
index or gain ratio technique. In addition to a summary report, a graphical tree browser is
provided when a model is built, displaying the model either as a tree or a set of rules. Finally,
a scoring function is provided to score and/or evaluate a decision tree model. The scoring
function can also be used to simply generate the scoring SQL for later use.
A number of additional options are provided when building or scoring a decision tree model.
One of these options is whether or not to bin numeric variables during the tree building
process. Another involves including recalculated confidence measures at each leaf node in a
tree based on a validation table, supplementing confidence measures based on the training
data used to build the tree. Finally, at the time of scoring, a table profiling the leaf nodes in the
tree can be requested, at the same time each scored row is linked with a leaf node and
corresponding rule set.
Decision Tree SQL Generation
A key part to the design of the Teradata Warehouse Miner Decision Trees is SQL generation.
In order to avoid having to extract all of the data from the RDBMS, the product generates
SQL statements to return sufficient statistics. Before the model building begins, SQL is
generated to give a better understanding of the attributes and the predicted variable. For each
attribute, the algorithm must determine its cardinality and get all possible values of the
predicted variable and the counts associated with it from all of the observations. This
information helps to initialize some structures in memory for later use in the building process.
The driving SQL behind the entire building process is a SQL statement that makes it possible
to build a contingency table from the data. A contingency table is an m x n matrix that has m
rows corresponding to the distinct values of an attribute by n columns that correspond to the
predicted variable’s distinct values. The Teradata Warehouse Miner Decision Tree algorithms
can quickly generate the contingency table on massive amounts of data rows and columns.
This contingency table query allows the program to gather the sufficient statistics needed for
the algorithms to do their calculations. Since this consists of the counts of the N distinct
values of the dependent variable, a WHERE clause is simply added to this SQL when
building a contingency table on a subset of the data instead of the data in the whole table. The
WHERE clause expression in the statement helps define the subset of data which is the path
down the tree that defines which node is a candidate to be split.
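A sketch of this kind of query is shown below with hypothetical table, column and split values; the SQL the product generates differs in detail, but the shape is a grouped count of a candidate attribute against the predicted variable, restricted to the node's subset of the data:

/* Sketch only: contingency-table counts for one candidate split at one node. */
SELECT marital_status,        /* candidate splitting attribute */
       churn_flag,            /* predicted (dependent) variable */
       COUNT(*) AS cnt
FROM twm_work.my_training_table
WHERE age < 50                /* path condition from an earlier split */
GROUP BY marital_status, churn_flag;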
Each type of decision tree uses a different method to compute which attribute is the best
choice to split a given subset of data upon. Each type of decision tree is considered in turn in
what follows. In the course of describing each algorithm, the following notation is used:
1 t denotes a node
2 j denotes the learning classes
3 J denotes the number of classes
4 s denotes a split
5 N(t) denotes the number of cases within a node t
6 p(j|t) is the proportion of class j learning samples in node t
7 An impurity function φ is a symmetric function with maximum value φ(J^-1, J^-1, …, J^-1) and φ(1, 0, …, 0) = φ(0, 1, …, 0) = … = φ(0, 0, …, 1) = 0
8 t_i denotes a subnode i of t
9 i(t) denotes node impurity measure
10 t_L and t_R are the left and right split nodes of t
Splitting on Information Gain Ratio
Information theory is the basic underlying idea in this type of decision tree. Splits on
categorical variables are made on each individual value. Splits on continuous variables are
made at one point in an ordered list of the actual values; that is, a binary split is introduced directly on a particular value.
• Define the “info” at node t as the entropy:
$$\mathrm{info}(t) = -\sum_{j} p(j\mid t)\,\log_{2} p(j\mid t)$$
• Suppose t is split into subnodes t_1, t_2, … by predictor X. Define:
$$\mathrm{info}_{X}(t) = \sum_{i} \frac{N(t_{i})}{N(t)}\,\mathrm{info}(t_{i})$$
$$\mathrm{Gain}(X) = \mathrm{info}(t) - \mathrm{info}_{X}(t)$$
$$\mathrm{Split\ info}(X) = -\sum_{i} \frac{N(t_{i})}{N(t)}\,\log_{2}\!\left(\frac{N(t_{i})}{N(t)}\right)$$
$$\mathrm{Gain\ ratio}(X) = \frac{\mathrm{Gain}(X)}{\mathrm{Split\ info}(X)}$$
Once the gain ratios have been computed the attribute with the highest gain ratio is used to
split the data. Then each subset goes through this process until the observations are all of one
class or a stopping criterion is met such as each node must contain at least 2 observations.
For a detailed description of this type of decision tree see [Quinlan].
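As a worked example with hypothetical counts, suppose a node t contains 10 observations, 5 of class A and 5 of class B, and predictor X splits it into t_1 (6 observations: 5 A, 1 B) and t_2 (4 observations: all B). Then:

$$\mathrm{info}(t) = 1,\qquad \mathrm{info}(t_1) \approx 0.650,\qquad \mathrm{info}(t_2) = 0$$
$$\mathrm{info}_{X}(t) = 0.6(0.650) + 0.4(0) = 0.390,\qquad \mathrm{Gain}(X) = 1 - 0.390 = 0.610$$
$$\mathrm{Split\ info}(X) = -(0.6\log_{2}0.6 + 0.4\log_{2}0.4) \approx 0.971,\qquad \mathrm{Gain\ ratio}(X) \approx \frac{0.610}{0.971} \approx 0.63$$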
Splitting on Gini Diversity Index
Node impurity is the idea behind the Gini diversity index split selection. To measure node
impurity, use the formula:
$$i(t) = \phi(p(t)) \ge 0$$
Maximum impurity arises when there is an equal distribution of the class that is to be
predicted. As in the heads and tails example, impurity is highest if half the total is heads and
the other half is tails. On the other hand, if there were only tails in a certain sample the
impurity would be 0.
The Gini index uses the following formula for its calculation of impurity:
$$i(t) = 1 - \sum_{j} p(j\mid t)^{2}$$
For a determination of the goodness of a split, the following formula is used:
$$\Delta i(s,t) = i(t) - p_{L}\,i(t_{L}) - p_{R}\,i(t_{R})$$
where t_L and t_R are the left and right subnodes of t and p_L and p_R are the probabilities of being in those subnodes.
For a detailed description of this type of tree see [Breiman, Friedman, Olshen and Stone].
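Using the same hypothetical split as in the gain ratio example above (10 observations split 6/4, with class counts 5 A and 1 B on the left and 0 A and 4 B on the right), the Gini calculation gives:

$$i(t) = 1 - (0.5^{2} + 0.5^{2}) = 0.5,\qquad i(t_{L}) = 1 - \left(\left(\tfrac{5}{6}\right)^{2} + \left(\tfrac{1}{6}\right)^{2}\right) \approx 0.278,\qquad i(t_{R}) = 0$$
$$\Delta i(s,t) = 0.5 - 0.6(0.278) - 0.4(0) \approx 0.333$$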
Regression Trees
Teradata Warehouse Miner provides regression tree models that are built largely on the
techniques described in [Breiman, Friedman, Olshen and Stone].
Like classification trees, regression trees utilize SQL in order to extract only the necessary
information from the RDBMS instead of extracting all the data from the table. An m x 3 table
is returned from the database that has m rows corresponding to the distinct values of an
attribute followed by the SUM and SQUARED SUM of the predicted variable and the total
number of rows having that attribute value.
Using the formula:
$$\sum_{n}\left(y_{n} - \operatorname{avg}(y)\right)^{2}$$
the sum of squares for any particular node starting with the root node of all the data is
calculated first. The regression tree is built by iteratively splitting nodes and picking the split
for that node which will maximize the decrease in the within-node sum of squares of the tree.
Splitting stops if the minimum number of observations in a node is reached or if all of the
predicted variable values are the same.
The value to predict for a leaf node is simply the average of all the predicted values that fall
into that leaf during model building.
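A sketch of the sufficient-statistics query described above is shown below with hypothetical names; the within-node sum of squares for any grouping can then be derived as sum_y_sq - (sum_y * sum_y) / cnt:

/* Sketch only: per-value sum, squared sum and count of the predicted variable. */
SELECT years_with_bank          AS attr_value,
       SUM(income)              AS sum_y,
       SUM(income * income)     AS sum_y_sq,
       COUNT(*)                 AS cnt
FROM twm_work.my_training_table
WHERE avg_ck_bal < 1000          /* path condition for the current node */
GROUP BY years_with_bank;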
Chaid Trees
CHAID trees utilize the chi squared significance test as a means of partitioning data.
Independent variables are tested by looping through the values and merging categories that
have the least significant difference from one another and also are still below the merging
significance level parameter (default .05). Once all independent variables have been
optimally merged the one with the highest significance is chosen for the split, the data is
subdivided, and the process is repeated on the subsets of the data. The splitting stops when the
significance goes above the splitting significance level (default .05).
For a detailed description of this type of tree see [Kass].
Decision Tree Pruning
Many times with algorithms such as those described above, a model overfits the data. One way of correcting this is to prune the model from the leaves up. In situations where the error rate of the leaves does not increase when they are combined, they are joined into a new leaf.
A simple example may be given as follows. If there is nothing but random data for the attributes and the class is set to predict “heads” 75% of the time and “tails” 25% of the time, the result will be an overfit model that does not predict the outcome well. Just by inspection it can be seen that instead of a built-up model with many leaves, the model could simply predict “heads” and it would be correct 75% of the time, whereas the overfit model usually does much worse in such a case.
Teradata Warehouse Miner provides pruning according to the gain ratio and Gini diversity
index pruning techniques. It is possible to combine different splitting and pruning techniques; however, when pruning a regression tree the Gini diversity index technique must be used.
Decision Trees and NULL Values
NULL values are handled by listwise deletion. This means that if there are NULL values in
any variables (independent and dependent) then that row where a NULL exists will be
removed from the model building process.
NULL values in scoring, however, are handled differently. Unlike in tree building where
listwise deletion is used, scoring can sometimes handle rows that have NULL values in some
of the independent variables. The only time a row will not get scored is if a decision node that
the row is being tested on has a NULL value for that decision. For instance, if the first split in
a tree is “age < 50,” only rows that don’t have a NULL value for age will pass down further in
the tree. This row could have a NULL value in the income variable. But since this decision is
on age, the NULL will have no impact at this split and the row will continue down the
branches until a leaf is reached or it has a NULL value in a variable used in another decision
node.
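The following sketch suggests why this is so. It is a hypothetical hand-written fragment in the spirit of generated scoring SQL, not the actual output of the scoring function: a row with a null income is still classified because income is never tested on its path, while a row with a null age falls through every branch condition and receives a null score.

/* Sketch only: one small tree expressed as a CASE expression. */
SELECT cust_id,
       CASE
         WHEN age < 50 AND marital_status = 'single'  THEN 1   /* leaf: predicted class 1 */
         WHEN age < 50 AND marital_status <> 'single' THEN 0
         WHEN age >= 50                               THEN 0
         ELSE NULL                                              /* age is null: row not scored */
       END AS predicted_class
FROM twm_work.my_score_table;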
Initiate a Decision Tree Analysis
Use the following procedure to initiate a new Decision Tree analysis in Teradata Warehouse
Miner:
1 Click on the Add New Analysis icon in the toolbar:
Figure 26: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Decision Tree:
Figure 27: Add New Analysis dialog
3 This will bring up the Decision Tree dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
Decision Tree - INPUT - Data Selection
On the Decision Tree dialog click on INPUT and then click on data selection:
Figure 28: Decision Tree > Input > Data Selection
On this screen select:
1 Select Input Source
Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view).
2 Select Columns From a Single Table
• Available Databases (or Analyses) — All the databases (or analyses) that are available for the Decision Tree analysis.
• Available Tables — All the tables that are available for the Decision Tree analysis.
• Available Columns — Within the selected table or matrix, all columns that are available for the Decision Tree analysis.
• Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can either insert columns as Dependent or Independent columns. Make sure you have the correct portion of the window highlighted.
• Independent — These may be of numeric or character type.
• Dependent — The dependent variable column is the column whose value is being predicted. It is selected from the Available Variables in the selected table. When Gain Ratio or Gini Index are selected as the Tree Splitting criteria, this is treated as a categorical variable with distinct values, in keeping with the nature of classification trees. Note that in this case an error will occur if the Dependent Variable has more than 50 distinct values. When Regression Trees is selected as the Tree Splitting criteria, this is treated as a continuous variable. In this case it must contain only numeric values.
Decision Tree - INPUT - Analysis Parameters
On the Decision Tree dialog click on INPUT and then click on analysis parameters:
Figure 29: Decision Tree > Input > Analysis Parameters
On this screen select:
• Splitting Options
• Splitting Method
• Gain Ratio — Option to use the Gain Ratio splitting criteria.
• Gini Index — Option to use the Gini Index splitting criteria.
• Chaid — Option to use the Chaid splitting criteria. When using this option you are also given the opportunity to change the merging or splitting Chaid Significance Levels.
• Regression Trees — Option to use the Regression splitting criteria as outlined above.
• Gain Ratio Extreme — Option to use the Gain Ratio splitting criteria using a stored procedure and table operator that process the data more directly in the database for better resource utilization.
Note: When using this option, confirm that the td_analyze external stored procedure and the tda_dt_calc table operator are installed in the database where the TWM metadata tables reside. This can be performed using the Install or Uninstall UDF's option under the Teradata Warehouse Miner start program item, selecting the option to Install TD_Analyze UDFs.
• Minimum Split Count — This option determines how far the splitting of the decision tree will go. Unless a node is pure (meaning it has only observations with the same dependent value) it will split if each branch that can come off this node will contain at least this many observations. The default is a minimum of 2 cases for each branch.
• Maximum Nodes — (This option is not available when using the Gain Ratio Extreme splitting method.) If the nodes in the tree are equal to or exceed this value while splitting a certain level of the tree, the algorithm stops the tree growing after completing this level and returns the tree built so far. The default is 10000 nodes.
• Maximum Depth — Another method of stopping the tree is to specify the maximum depth the tree may grow to. This option will stop the algorithm if the tree being built has this many levels. The default is 100 levels.
• Chaid Significance Levels — (These options are only available when using the Chaid splitting method.)
• Merging — Independent variables are tested by looping through the values and merging categories that have the least significant difference from one another and also are still below this merging significance level parameter (default .05).
• Splitting — Once all independent variables have been optimally merged the one with the highest significance is chosen for the split, the data is subdivided, and the process is repeated on the subsets of the data. The splitting stops when the significance goes above this splitting significance level parameter (default .05).
• Bin Numeric Variables — Option to automatically Bincode the continuous independent variables. Continuous data is separated into one hundred bins when this option is selected. If the variable has less than one hundred distinct values, this option is ignored.
• Include Validation Table — (This option is not available when using the Gain Ratio Extreme splitting method.) A supplementary table may be utilized in the modeling process to validate the effectiveness of the model on a separate set of observations. If specified, this table is used to calculate a second set of confidence or targeted confidence factors. These recalculated confidence factors are viewed in the tree browser and/or added to the scored table when scoring the resultant model. When Include Validation Table is selected, a separate validation table is required.
• Database — The name of the database to look in for the validation table - by default, this is the source database.
• Table — The name of the validation table to use for recalculating confidence or targeted confidence factors.
• Include Lift Table — (This option is not available when using the Gain Ratio Extreme splitting method.) Option to generate a Cumulative Lift Table in the report to demonstrate how effective the model is in estimating the dependent variable. Valid for binary dependent variables only.
• Response Value — An optional value of the dependent variable that is treated as the response. Note that all other dependent variable values will be considered non-response values.
• Values — Bring up the Decision Tree values wizard to help in specifying the response value.
• Pruning Options
• Pruning Method — Pull-down list with the following values:
• Gain Ratio — Option to use the Gain Ratio pruning criteria as outlined above.
• Gini Index — (This option is not available when using the Gain Ratio Extreme splitting method.) Option to use the Gini Index pruning criteria as outlined above.
• None — Option to not prune the resultant decision tree.
• Gini Test Table — (This option does not apply when using the Gain Ratio Extreme splitting method.) When Gini Index pruning is selected as the pruning method, a separate Test table is required.
• Database — The name of the database to look in for the Test table; by default, this is the source database.
• Table — The name of the table to use for test purposes during the Gini Pruning process.
Decision Tree - INPUT - Expert Options
On the Decision Tree dialog click on INPUT and then click on expert options:
Figure 30: Decision Tree > Input > Expert Options
• Performance
• Maximum amount of data for in-memory processing — (This option does not apply when using the Gain Ratio Extreme splitting method.) By default, 2 MB of data can be processed in memory for the tree. This limit can be increased here. For smaller data sets, in-memory processing may be preferable to the SQL version of the decision tree.
Run the Decision Tree Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Decision Tree
The results of running the Decision Tree analysis include a variety of statistical reports as well
as a Graphic and Textual Tree browser. All of these results are outlined below.
Note: Not all Results features are provided when using the Gain Ratio Extreme splitting
method, including a lift table and graph, a validation matrix, and some of the information
displayed by decision tree graph nodes.
Decision Tree Reports
• Total observations — This is the number of observations in the training data set used to
build the tree. More precisely, this is the number of rows in the input table after any rows
have been excluded for containing a null value in a column selected as an independent or
dependent variable.
• Nodes before pruning — This is the number of nodes in the tree, including the root node,
before it is pruned back in the second stage of the tree-building process.
• Nodes after pruning — This is the number of nodes in the tree, including the root node,
after it is pruned back in the second stage of the tree-building process.
• Total nodes — This is the number of nodes in the tree, including the root node, when pruning is either not requested or does not remove any nodes.
• Model Accuracy — This is the percentage of observations in the training data set for which the tree correctly predicts the value of the dependent variable.
Variables
• Independent Variables — A list of all the independent variables that made it into the
decision tree model.
• Dependent Variable — The dependent variable that the tree was built to predict.
Confusion Matrix
An N x (N+2) confusion matrix (for N outcomes of the dependent variable) is given with the following format:
Table 8: Confusion Matrix Format

|               | Actual ‘0’                  | Actual ‘1’                  | … | Actual ‘N’                  | Correct                       | Incorrect                       |
| Predicted ‘0’ | # correct ‘0’ Predictions   | # incorrect ‘1’ Predictions | … | # incorrect ‘N’ Predictions | Total Correct ‘0’ Predictions | Total Incorrect ‘0’ Predictions |
| Predicted ‘1’ | # incorrect ‘0’ Predictions | # correct ‘1’ Predictions   | … | # incorrect ‘N’ Predictions | Total Correct ‘1’ Predictions | Total Incorrect ‘1’ Predictions |
| …             | …                           | …                           | … | …                           | …                             | …                               |
| Predicted ‘N’ | # incorrect ‘0’ Predictions | # incorrect ‘1’ Predictions | … | # correct ‘N’ Predictions   | Total Correct ‘N’ Predictions | Total Incorrect ‘N’ Predictions |
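For readers who want to reproduce this layout outside the product, the following Python sketch (not part of Teradata Warehouse Miner; all names and data are illustrative) tallies an N x (N+2) confusion matrix and the corresponding Model Accuracy from arrays of actual and predicted values.

```python
import numpy as np

def confusion_matrix(actual, predicted, labels):
    """Rows are predicted values, columns are actual values,
    followed by Correct and Incorrect totals per predicted value."""
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    counts = np.zeros((n, n), dtype=int)
    for a, p in zip(actual, predicted):
        counts[index[p], index[a]] += 1
    correct = np.diag(counts)                    # correct predictions per row
    incorrect = counts.sum(axis=1) - correct     # remaining predictions in the row
    accuracy = correct.sum() / counts.sum()      # Model Accuracy in the report
    return np.column_stack([counts, correct, incorrect]), accuracy

# Binary example: 0 = non-response, 1 = response
actual    = [0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 0, 1, 1, 0, 0, 1, 1]
matrix, accuracy = confusion_matrix(actual, predicted, labels=[0, 1])
print(matrix)
print(f"Model Accuracy: {accuracy:.2%}")
```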
Validation Matrix
When the Include validation table option is selected, a validation matrix similar to the
confusion matrix is produced based on the data in the validation table rather than the input
table.
Cumulative Lift Table
The Cumulative Lift Table demonstrates how effective the model is in estimating the dependent variable. It is produced using deciles based on the probability values predicted by the model. Note that the deciles are labeled such that 1 is the highest decile and 10 is the lowest. The information in this report, however, is best viewed in the Lift Chart produced as a graph. Note that this is only valid for binary dependent variables.
• Decile — The deciles in the report are based on the probability values predicted by the
model. Note that 1 is the highest decile and 10 is the lowest. That is, decile 1 contains data
on the 10% of the observations with the highest estimated probabilities that the dependent
variable is 1.
• Count — This column contains the count of observations in the decile.
• Response — This column contains the count of observations in the decile where the
actual value of the dependent variable is 1.
• Pct Response — This column contains the percentage of observations in the decile where
the actual value of the dependent variable is 1.
• Pct Captured Response — This column contains the percentage of responses in the decile
over all the responses in any decile.
• Lift — The lift value is the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations. For example, if 10% of the observations overall have a dependent variable with value 1, and 20% of the observations in decile 1 have a dependent variable with value 1, then the lift value within decile 1 is 2.0, meaning that the model gives a “lift” that is better than chance alone by a factor of two in predicting response values of 1 within this decile. (A small computational sketch of these measures follows this list.)
• Cumulative Response — This is a cumulative measure of Response, from decile 1 to this
decile.
• Cumulative Pct Response — This is a cumulative measure of Pct Response, from decile 1
to this decile.
• Cumulative Pct Captured Response — This is a cumulative measure of Pct Captured
Response, from decile 1 to this decile.
• Cumulative Lift — This is a cumulative measure of Lift, from decile 1 to this decile.
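To make the decile calculations concrete, here is a simplified Python sketch of how these lift measures could be derived from predicted probabilities and actual binary responses. It is an illustration only, not the SQL that Teradata Warehouse Miner generates, and the function and variable names are assumptions for the example.

```python
import numpy as np

def cumulative_lift(prob, response, n_bins=10):
    """Rank observations by predicted probability (decile 1 = highest) and
    compute response, captured response, lift and their cumulative values."""
    prob = np.asarray(prob, dtype=float)
    response = np.asarray(response, dtype=int)
    order = np.argsort(-prob)                     # highest probabilities first
    deciles = np.array_split(order, n_bins)       # decile 1 .. decile n_bins
    overall_rate = response.mean()                # expected response rate
    total_resp = max(response.sum(), 1)           # avoid division by zero
    cum_count = cum_resp = 0
    rows = []
    for d, idx in enumerate(deciles, start=1):
        count, resp = len(idx), response[idx].sum()
        pct_resp = resp / count if count else 0.0
        lift = pct_resp / overall_rate if overall_rate else 0.0
        cum_count += count
        cum_resp += resp
        cum_lift = (cum_resp / cum_count) / overall_rate if overall_rate else 0.0
        rows.append((d, count, resp,
                     100 * pct_resp,               # Pct Response
                     100 * resp / total_resp,      # Pct Captured Response
                     lift,
                     cum_resp,                     # Cumulative Response
                     100 * cum_resp / cum_count,   # Cumulative Pct Response
                     100 * cum_resp / total_resp,  # Cumulative Pct Captured Response
                     cum_lift))
    return rows

# Example usage with made-up scores and responses
for row in cumulative_lift(np.random.rand(1000), np.random.binomial(1, 0.3, 1000)):
    print(row)
```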
Decision Tree Graphs
The Decision Tree Analysis can display both a graphical and a textual representation of the decision tree model, as well as a lift chart. Options are available to display decisions for any node in the graphical or textual tree, as well as the counts and distribution of the dependent variable. Additionally, manual pruning of the decision tree model is supported.
Tree Browser
Figure 31: Tree Browser
When Tree Browser is selected, two frames are shown: the upper frame gives a condensed view to aid in navigating through the detailed tree in the lower frame. Set options by right-clicking on either frame to select from the following menu:
• Small Navigation Tree — Under Small Navigation Tree, the options are:
Figure 32: Tree Browser menu: Small Navigation Tree
• Zoom — This option allows you to scale down the navigation tree so that more of it will appear within the window. A slider bar is provided so you can select from a range of new sizes while previewing the effect on the navigation tree. The slider bar can also be used to bring the navigation tree back up to a larger dimension after it has been reduced in size:
Figure 33: Tree Browser menu: Zoom Tree
• Show Extents Box/Hide Extents Box — With this option a box is drawn around the nodes in the upper frame corresponding to the nodes displayed in the lower frame. The box can be dragged and dropped over segments of the small tree, automatically positioning the identical area in the detailed tree within the lower frame. Once set, the option changes to allow hiding the box.
• Hide Navigation Tree/Show Navigation Tree — With this option the upper frame is made to disappear (or reappear) in order to give more room to the lower frame that contains the details of the tree.
• Show Confidence Factors/Show Targeted Confidence — The Confidence Factor is a
measure of how “confident” the model is that it can predict the correct score for a record
that falls into a particular leaf node based on the training data the model was built from.
For example, if a leaf node contained 10 observations and 9 of them predict Buy and the
other record predicts Do Not Buy, then the model built will have a confidence factor of .9,
or 90% sure of predicting the right value for a record that falls into that leaf node of the
model.
Models built with a predicted variable that has only 2 outcomes can display a Targeted
Confidence value rather than a confidence factor. If the outcomes were 9 Buys and 1 Do
Not Buy at a particular node and if the target value was set to Buy, .9 is the targeted
confidence. However if it is desired to target the Do Not Buy outcome by setting the value
to Do Not Buy, then any record falling into this leaf of the tree would get a targeted
confidence of .1 or 10%.
This option also controls whether Recalculated Confidence Factors or Recalculated Targeted Confidence factors are displayed in the case when the Include validation table option is selected. (A small numeric sketch of these confidence calculations follows this menu list.)
• Node Detail — The Node Detail feature can be used to copy the entire rule set for a
particular node to the Windows Clipboard for use in other applications.
• Print
Figure 34: Tree Browser menu: Print
• Large Tree — Allows you to print the entire tree diagram. This will be printed in pages, with the total number of pages reported before they are printed. (A page will also be printed showing how the tree was mapped into individual pages.) If All Pages is selected the entire tree will be printed, across multiple pages if necessary. If Current Browser Page is selected then only that portion of the tree which is viewable will be printed in WYSIWYG fashion.
• Small Tree — The entire navigation tree, showing the overall structure of the tree diagram without node labels or statistics, can be printed in pages. (The fewest possible pages will be printed if the navigation tree is reduced as small as possible before printing the small tree.) The total number of pages needed to print the smaller tree will be reported before they are sent to the printer.
• Save — Currently, the Tree Browser only supports the creation of Bitmaps. If Tree Text is
currently selected, the entire tree will be saved. If Tree Browser is selected, only the
portion of the tree that is viewable will be saved in WYSIWYG fashion.
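As promised above, here is a small numeric sketch of the Confidence Factor and Targeted Confidence calculations in Python. The data and names are hypothetical; it is meant only to mirror the 9-out-of-10 example, not the product's internal code.

```python
from collections import Counter

def leaf_confidence(leaf_outcomes, target=None):
    """Confidence Factor: proportion of the majority outcome in a leaf node.
    Targeted Confidence: proportion of the chosen target outcome."""
    counts = Counter(leaf_outcomes)
    total = sum(counts.values())
    majority_outcome, majority_count = counts.most_common(1)[0]
    confidence_factor = majority_count / total
    targeted = counts.get(target, 0) / total if target is not None else None
    return majority_outcome, confidence_factor, targeted

# A leaf containing 9 "Buy" observations and 1 "Do Not Buy" observation
leaf = ["Buy"] * 9 + ["Do Not Buy"]
print(leaf_confidence(leaf))                        # ('Buy', 0.9, None)
print(leaf_confidence(leaf, target="Do Not Buy"))   # ('Buy', 0.9, 0.1)
```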
The lower frame shows the details of the decision tree in a graphical manner. The
graphical representation of the tree consists of the following objects:
• Root Node — The box at the top of the tree shows the total number of observations or
rows used in building the tree after any rows have been removed for containing null
values.
• Intermediate Node — The boxes representing intermediate nodes in the tree contain the
following information.
• Decision — Condition under which data passes through this node.
• N — Number of observations or rows passing through this node.
• % — Percentage of observations or rows passing through this node.
• Leaf Node — The boxes representing leaf nodes in the tree contain the following
information.
• Decision — Condition under which data passes to this node.
• N — Number of observations or rows passing to this node.
• % — Percentage of observations or rows passing to this node.
• CF — Confidence factor
• TF — Targeted confidence factor, alternative to CF display
• RCF — Recalculated confidence factor based on validation table (if requested)
• RTF — Recalculated targeted confidence factor based on validation table (if requested)
Text Tree
When Tree Text is selected, the diagram represents the decisions made by the tree as a
hierarchical structure of rules as follows:
Figure 35: Text Tree
The first rule corresponds to the root node of the tree. The rules corresponding to leaves in the
tree are distinguished by an arrow drawn as ‘-->’, followed by a predicted value of the
dependent variable.
Rules List
On both the Tree Browser and Text Tree, passing the mouse over a node or rule results in a
hyperlink indication. When Rules List is enabled, clicking on the hyperlink results in a popup
displaying all rules leading to that node or decision as follows:
Figure 36: Rules List
Note that the Node Detail, as described above, can be used to copy the Rules List to the
Windows Clipboard.
Counts and Distributions
On both the Tree Browser and Text Tree, passing the mouse over a node or rule results in a
hyperlink indication. When Counts and Distributions is enabled, clicking on the hyperlink
results in a pop-up displaying the Count/Distribution of the dependent variable at that node as
follows. Note that the Counts and Distribution option is only enabled when the dependent variable is multinomial. For regression trees this option does not apply, and for binary trees the information is shown directly on the node or rule.
Figure 37: Counts and Distributions
Note that the Node Detail, as described above, can be used to copy the Counts and
Distribution list to the Windows Clipboard.
Tree Pruning
On both the Tree Browser and Text Tree, passing the mouse over a node or rule results in a
hyperlink indication. When Tree Pruning is enabled, the following menu appears:
Figure 38: Tree Pruning menu
Clicking on a node or rule highlights the node and all subnodes, indicating which portion of
the tree will be pruned. Additionally, the Prune Selected Branch option becomes enabled as
follows:
Figure 39: Tree Pruning Menu > Prune Selected Branch
Clicking on Prune Selected Branch will convert the highlighted node to a leaf node, and all subnodes will disappear. When this is done, the other two Tree Pruning options become enabled:
Figure 40: Tree Pruning menu (All Options Enabled)
Click on Undo Last Prune to revert to the original tree, or to the previously pruned tree if Prune Selected Branch was applied multiple times. Click on Save Pruned Tree to save the tree to XML. This will be saved in metadata and can be rescored in a future release.
After a tree is manually pruned and saved to metadata using the Save Pruned Tree option, it
can be reopened and viewed in the Tree Browser and, if desired, pruned further. (All
additional prunes must be re-saved to metadata). A previously pruned tree will be labeled to
distinguish it from a tree that has not been manually pruned:
Figure 41: Decision Tree Graph: Previously Pruned Tree
“More >>”
On both the Tree Browser and Text Tree, if Gini Index has been selected for Tree Splitting, large surrogate splits may occur. If a surrogate split is preceded by “more >>”, the entire surrogate split can be displayed in a separate pop-up screen by clicking on the node and/or rule as follows:
Figure 42: Decision Tree Graph: Predicate
Lift Chart
This graph displays the statistics in the Cumulative Lift Table, with the following options:
• Non-Cumulative
• % Response — This column contains the percentage of observations in the decile where the actual value of the dependent variable is 1.
• % Captured Response — This column contains the percentage of responses in the decile over all the responses in any decile.
• Lift — The lift value is the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations. For example, if 10% of the observations overall have a dependent variable with value 1, and 20% of the observations in decile 1 have a dependent variable with value 1, then the lift value within decile 1 is 2.0, meaning that the model gives a “lift” that is better than chance alone by a factor of two in predicting response values of 1 within this decile.
• Cumulative
• % Response — This is a cumulative measure of the percentage of observations in the decile where the actual value of the dependent variable is 1, from decile 1 to this decile.
• % Captured Response — This is a cumulative measure of the percentage of responses in the decile over all the responses in any decile, from decile 1 to this decile.
• Cumulative Lift — This is a cumulative measure of the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations, from decile 1 to this decile.
Any combination of options can be displayed as follows:
Figure 43: Decision Tree Graph: Lift
Tutorial - Decision Tree
In this example a standard Gain Ratio tree was built to predict credit card ownership ccacct based on 20 numeric and categorical input variables. Notice that the tree initially built contained 33 nodes but was pruned back to only 11, counting the root node. This yielded not only a relatively simple tree structure, but also a Model Accuracy of 95.72% on this training data.
Parameterize a Decision Tree as follows:
• Available Tables — twm_customer_analysis
• Dependent Variable — ccacct
• Independent Variables
• income
• age
• years_with_bank
• nbr_children
• gender
• marital_status
• city_name
• state_code
• female
• single
• married
• separated
• ckacct
• svacct
• avg_ck_bal
• avg_sv_bal
• avg_ck_tran_amt
• avg_ck_tran_cnt
• avg_sv_tran_amt
• avg_sv_tran_cnt
• Tree Splitting — Gain Ratio
• Minimum Split Count — 2
• Maximum Nodes — 1000
• Maximum Depth — 10
• Bin Numeric Variables — Disabled
• Pruning Method — Gain Ratio
• Include Lift Table — Enabled
• Response Value — 1
Run the analysis and click on Results when it completes. For this example, the Decision Tree
Analysis generated the following pages. A single click on each page name populates the page
with the item.
Table 9: Decision Tree Report

| Total observations   | 747    |
| Nodes before pruning | 33     |
| Nodes after pruning  | 11     |
| Model Accuracy       | 95.72% |
Table 10: Variables: Dependent

| Dependent Variable | ccacct |

Table 11: Variables: Independent

| Independent Variables |
| income                |
| ckacct                |
| avg_sv_bal            |
| avg_sv_tran_cnt       |
Table 12: Confusion Matrix

|             | Actual Non-Response | Actual Response | Correct      | Incorrect  |
| Predicted 0 | 340 / 45.52%        | 0 / 0.00%       | 340 / 45.52% | 0 / 0.00%  |
| Predicted 1 | 32 / 4.28%          | 375 / 50.20%    | 375 / 50.20% | 32 / 4.28% |
Table 13: Cumulative Lift Table

| Decile | Count  | Response | Response (%) | Captured Response (%) | Lift | Cumulative Response | Cumulative Response (%) | Cumulative Captured Response (%) | Cumulative Lift |
| 1      | 5.00   | 5.00     | 100.00       | 1.33                  | 1.99 | 5.00                | 100.00                  | 1.33                             | 1.99            |
| 2      | 0.00   | 0.00     | 0.00         | 0.00                  | 0.00 | 5.00                | 100.00                  | 1.33                             | 1.99            |
| 3      | 0.00   | 0.00     | 0.00         | 0.00                  | 0.00 | 5.00                | 100.00                  | 1.33                             | 1.99            |
| 4      | 0.00   | 0.00     | 0.00         | 0.00                  | 0.00 | 5.00                | 100.00                  | 1.33                             | 1.99            |
| 5      | 0.00   | 0.00     | 0.00         | 0.00                  | 0.00 | 5.00                | 100.00                  | 1.33                             | 1.99            |
| 6      | 402.00 | 370.00   | 92.04        | 98.67                 | 1.83 | 375.00              | 92.14                   | 100.00                           | 1.84            |
| 7      | 0.00   | 0.00     | 0.00         | 0.00                  | 0.00 | 375.00              | 92.14                   | 100.00                           | 1.84            |
| 8      | 0.00   | 0.00     | 0.00         | 0.00                  | 0.00 | 375.00              | 92.14                   | 100.00                           | 1.84            |
| 9      | 0.00   | 0.00     | 0.00         | 0.00                  | 0.00 | 375.00              | 92.14                   | 100.00                           | 1.84            |
| 10     | 340.00 | 0.00     | 0.00         | 0.00                  | 0.00 | 375.00              | 50.20                   | 100.00                           | 1.00            |
Graphs
Tree Browser is displayed as follows:
Figure 44: Decision Tree Graph Tutorial: Browser
Select the Text Tree radio to view the rules in textual format:
Figure 45: Decision Tree Graph Tutorial: Lift
Additionally, you can click on Lift Chart to view the contents of the Lift Table graphically.
Figure 46: Decision Tree Graph Tutorial: Browser
Factor Analysis
Overview
Consider a data set with a number of correlated numeric variables that is to be used in some
type of analysis, such as linear regression or cluster analysis. Or perhaps it is desired to
understand customer behavior in a fundamental way, by discovering hidden structure and
meaning in data. Factor analysis can be used to reduce a number of correlated numeric
variables into a lesser number of variables called factors. These new variables or factors
should hopefully be conceptually meaningful if the second goal just mentioned is to be
achieved. Meaningful factors not only give insight into the dynamics of a business, but they
also make any models built using these factors more explainable, which is generally a
requirement for a useful analytic model.
There are two fundamental types of factor analysis, principal components and common
factors. Teradata Warehouse Miner offers principal components, maximum likelihood
common factors and principal axis factors, which is a restricted form of common factor
analysis. The product also offers factor rotations, both orthogonal and oblique, as postprocessing for any of these three types of models. Finally, as with all other models, automatic
factor model scoring is offered via dynamically generated SQL.
Before using the Teradata Warehouse Miner Factor Analysis module, the user must first build
a data reduction matrix using the Build Matrix function. The matrix must include all of the
input variables to be used in the factor analysis. The user can base the analysis on either a
covariance or correlation matrix, thus working with either centered and unscaled data, or
centered and normalized data (i.e., unit variance). Teradata Warehouse Miner automatically
converts the extended cross-products matrix stored in metadata results tables by the Build
Matrix function into the desired covariance or correlation matrix. The choice will affect the
scaling of resulting factor measures and factor scores.
The primary source of information and formulae in this section is [Harman].
Principal Components Analysis
The goal of principal components analysis (PCA) is to account for the maximum amount of
the original data’s variance in the principal components created. Each of the original variables
can be expressed as a linear combination of the new principal components. Each principal
component in its turn, from the first to the last, accounts for a maximum amount of the
remaining sum of the variances of the original variables. This allows some of the later
components to be discarded and only the reduced set of components accounting for the
desired amount of total variance to be retained. If all the components were to be retained, then
all of the variance would be explained.
A principal components solution has many desirable properties. First, the new components
are independent of each other, that is, uncorrelated in statistical terminology or orthogonal in
the terminology of linear algebra. Further, the principal components can be calculated
directly, yielding a unique solution. This is true also of principal component scores, which can
be calculated directly from the solution and are also inherently orthogonal or independent of
each other.
Principal Axis Factors
The next step toward the full factor analysis model is a technique known as principal axis
factors (PAF), or sometimes also called iterated principal axis factors, or just principal
factors. The principal factors model is a blend of the principal components model described
earlier and the full common factor model. In the common factor model, each of the original
variables is described in terms of certain underlying or common factors, as well as a unique
factor for that variable. In principal axis factors however, each variable is described in terms
of common factors without a unique factor.
Unlike a principal components model for which there is a unique solution, a principal axis
factor model consists of estimated factors and scores. As with principal components, the
derived factors are orthogonal or independent of each other. The same is not necessarily true
of the scores however. (Refer to “Factor Scores” on page 65 for more information).
Maximum Likelihood Common Factors
The goal of common factors or classical factor analysis is to account in the new factors for the
maximum amount of covariance or correlation in the original input variables. In the common
factor model, each of the original input variables is expressed in terms of hypothetical
common factors plus a unique factor accounting for the remaining variance in that variable.
The user must specify the desired number of common factors to look for in the model. This
type of model represents factor analysis in the fullest sense. Teradata Warehouse Miner offers
maximum likelihood factors (MLF) for estimating common factors, using expectation
maximization or EM as the method to determine the maximum likelihood solution.
A potential benefit of common factor analysis is that it may reduce the original set of
variables into fewer factors than would principal components analysis. It may also produce
Teradata Warehouse Miner User Guide - Volume 3
63
Chapter 1: Analytic Algorithms
Factor Analysis
new variables that have more fundamental meaning. A drawback is that factors can only be
estimated using iterative techniques requiring more computation, as there is no unique
solution to the common factor analysis model. This is true also of common factor scores,
which must likewise be estimated.
As with principal components and principal axis factors, the derived factors are orthogonal or independent of each other, but in this case by design (Teradata Warehouse Miner utilizes a technique to ensure this). The same is not necessarily true of the factor scores however. (Refer to “Factor Scores” on page 65 for more information).
These three types of factor analysis then give the data analyst the choice of modeling the
original variables in their entirety (principal components), modeling them with hypothetical
common factors alone (principal axis factors), or modeling them with both common factors
and unique factors (maximum likelihood common factors).
Factor Rotations
Whatever technique is chosen to compute principal components or common factors, the new
components or factors may not have recognizable meaning. Correlations will be calculated
between the new factors and the original input variables, which presumably have business
meaning to the data analyst. But factor-variable correlations may not possess the subjective
quality of simple structure. The idea behind simple structure is to express each component or
factor in terms of fewer variables that are highly correlated with the factor (or vice versa),
with the remaining variables largely uncorrelated with the factor. This makes it easier to
understand the meaning of the components or factors in terms of the variables.
Factor rotations of various types are offered to allow the data analyst to attempt to find simple
structure and hence meaning in the new components or factors. Orthogonal rotations
maintain the independence of the components or factors while aligning them differently with
the data to achieve a particular simple structure goal. Oblique rotations relax the requirement
for factor independence while more aggressively seeking better data alignment. Teradata
Warehouse Miner offers several options for both orthogonal and oblique rotations.
Factor Loadings
The term factor loadings is sometimes used to refer to the coefficients of the linear
combinations of factors that make up the original variables in a factor analysis model. The
appropriate term for this however is the factor pattern. A factor loadings matrix is sometimes
also assumed to indicate the correlations between the factors and the original variables, for
which the appropriate term is factor structure. The good news is that whenever factors are
mutually orthogonal or independent of each other, the factor pattern P and the factor structure
S are the same. They are related by the equation S = PQ where Q is the matrix of correlations
between factors.
In the case of principal components analysis, factor loadings are labeled as component
loadings and represent both factor pattern and structure. For other types of analysis, loadings
are labeled as factor pattern but indicate structure also, unless a separate structure matrix is
also given (as is the case after oblique rotations, described later).
Keeping the above caveats in mind, the component loadings, pattern or structure matrix is
interpreted for its structure properties in order to understand the meaning of each new factor
64
Teradata Warehouse Miner User Guide - Volume 3
Chapter 1: Analytic Algorithms
Factor Analysis
variable. When the analysis is based on a correlation matrix, the loadings, pattern or structure
can be interpreted as a correlation matrix with the columns corresponding to the factors and
the rows corresponding to the original variables. Like all correlations, the values range in
absolute value from 0 to 1 with the higher values representing a stronger correlation or
relationship between the variables and factors. By looking at these values, the user gets an
idea of the meaning represented by each factor. Teradata Warehouse Miner stores these so-called factor loadings and other related values in metadata result tables to make them available for scoring.
Factor Scores
In order to use a factor as a variable, it must be assigned a value called a factor score for each
row or observation in the data. A factor score is actually a linear combination of the original
input variables (without a constant term), and the coefficients associated with the original
variables are called factor weights. Teradata Warehouse Miner provides a scoring function
that calculates these weights and creates a table of new factor score variables using
dynamically generated SQL. The ability to automatically generate factor scores, regardless of
the factor analysis or rotation options used, is one of the most powerful features of the
Teradata Warehouse Miner factor analysis module.
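As a rough illustration of this idea, the following Python sketch applies a hypothetical factor weight matrix to a few observations. In the product the equivalent computation is generated as SQL, so the weights and data here are assumptions for the example only.

```python
import numpy as np

# Hypothetical factor weights: one row per input variable, one column per factor.
weights = np.array([[0.42, -0.05],
                    [0.38,  0.11],
                    [-0.07, 0.55]])

# A few observations of the three input variables, prepared consistently with
# the matrix type (covariance or correlation) used to build the factor model.
X = np.array([[ 1.2, -0.3,  0.8],
              [-0.4,  0.9, -1.1]])

# Each factor score is a linear combination of the inputs (no constant term).
scores = X @ weights
print(scores)   # one row per observation, one column per factor
```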
Principal Components
As mentioned earlier in the introduction, the goal of principal components analysis (PCA) is
to account for the maximum amount of the original data’s variance in the independent
principal components created. It was also stated that each of the original variables is
expressed as a linear combination of the new principal components, and that each principal
component in its turn, from the first to the last, accounts for a maximum amount of the
remaining sum of the variances of the original variables. These results are achieved by first
finding the eigenvalues and eigenvectors of the covariance or correlation matrix of the input
variables to be modeled. Although not ordinarily thought of in this way, when analyzing v
numeric columns in a table in a relational database, one is in some sense working in a v-dimensional vector space corresponding to these columns. Back at the beginning of the previous century when principal components analysis was developed, this was no small task. Today, however, math library routines are available to perform these computations very efficiently.
Although it won’t be attempted here to derive the mathematical solution to finding principal components, it might be helpful to state the following definition (i.e., that a square matrix A has an eigenvalue λ and an eigenvector x if Ax = λx). Further, a v x v square symmetric matrix A has v pairs of eigenvalues and eigenvectors, (λ1, e1), (λ2, e2), …, (λv, ev). It is further true that eigenvectors can be found so that they have unit length and are mutually orthogonal (i.e., independent or uncorrelated), making them unique.
To return to the point at hand, the principal component loadings that are being sought are
actually the covariance or correlation matrix eigenvectors just described multiplied by the
square root of their respective eigenvalues. The step left out up to now however is the
reduction of these principal component loadings to a number fewer than the variables present
at the start. This can be achieved by first ordering the eigenvalues, and their corresponding
eigenvectors, from maximum to minimum in descending order, and then by throwing away
those eigenvalues below a minimum threshold value, such as 1.0. An alternative technique is
to retain a desired number of the largest components regardless of the magnitude of the
eigenvalues. Teradata Warehouse Miner provides both of these options to the user. The user
may further optionally request that the signs of the principal component loadings be inverted
if there are more minus signs than positive ones. This is purely cosmetic and does not affect
the solution in a substantive way. However, if signs are reversed, this must be kept in mind
when attempting to interpret or assign conceptual meaning to the factors.
A final point worth noting is that the eigenvalues themselves turn out to be the variance
accounted for by each principal component, allowing the computation of several variance
related measures and some indication of the effectiveness of the principal components model.
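The following Python sketch illustrates these steps on a small correlation matrix: an eigen-decomposition, ordering of eigenvalues, a minimum-eigenvalue cutoff, and loadings formed as eigenvectors scaled by the square roots of their eigenvalues. It is a simplified illustration under those assumptions, not the product's implementation.

```python
import numpy as np

def pca_loadings(R, min_eigenvalue=1.0):
    """Principal component loadings of a correlation (or covariance) matrix R."""
    eigenvalues, eigenvectors = np.linalg.eigh(R)       # symmetric eigen-decomposition
    order = np.argsort(eigenvalues)[::-1]               # largest eigenvalue first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    keep = eigenvalues >= min_eigenvalue                 # retain only "large" components
    loadings = eigenvectors[:, keep] * np.sqrt(eigenvalues[keep])
    explained = eigenvalues[keep] / eigenvalues.sum()    # variance accounted for
    return loadings, eigenvalues[keep], explained

# Hypothetical correlation matrix for three variables
R = np.array([[1.0, 0.6, 0.2],
              [0.6, 1.0, 0.1],
              [0.2, 0.1, 1.0]])
loadings, eigenvalues, explained = pca_loadings(R)
print(loadings)
print(eigenvalues, explained)
```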
Principal Axis Factors
In order to talk about principal axis factors (PAF) the term communality must first be introduced. In the common factor model, each original variable x is thought of as a combination of common factors and a unique factor. The variance of x can then also be thought of as being composed of a common portion and a unique portion, that is, Var(x) = σc² + σu². It is the common portion of the variance of x that is called the communality of x, that is, the variance that the variable has in common through the common factors with all the other variables.
In the algorithm for principal axis factors described below it is of interest to both make an
initial estimate of the communality of each variable, and to calculate the actual communality
for the variables in a factor model with uncorrelated factors. One method of making an initial
estimate of the communality of each variable is to take the largest correlation of that variable
with respect to the other variables. The preferred method however is to calculate its squared
multiple correlation coefficient with respect to all of the other variables taken as a whole. This
is the technique used by Teradata Warehouse Miner. The multiple correlation coefficient is a
measure of the overall linear association of one variable with several other variables, that is,
the correlation between a variable and the best-fitting linear combination of the other
variables. The square of this value has the useful property of being a lower bound for the
communality. Once a factor model is built, the actual communality of a variable is simply the
sum of the squares of the factor loadings, i.e.
hj² = fj1² + fj2² + … + fjr²

where fjk is the loading of variable j on factor k and r is the number of factors.
With the idea of communality thus in place it is straightforward to describe the principal axis
factors algorithm. Begin by estimating the communality of each variable and replacing this
value in the appropriate position in the diagonal of the correlation or covariance matrix being
factored. Then a principal components solution is found in the usual manner, as described
earlier. As before, the user has the option of specifying either a fixed number of desired
factors or a minimum eigenvalue by which to reduce the number of factors in the solution.
Finally, the new communalities are calculated as the sum of the squared factor loadings, and
these values are substituted into the correlation or covariance matrix. This process is repeated
until the communalities change by only a small amount.
Through its use of communality estimates, the principal axis factor method attempts to find
independent common factors that account for the covariance or correlation between the
original variables in the model, while ignoring the effect of unique factors. It is then possible
to use the factor loadings matrix to reproduce the correlation or covariance matrix and
compare this to the original as a way of assessing the effectiveness of the model. The
reproduced correlation or covariance matrix is simply the factor loadings matrix times its
transpose (i.e., CCT). The user may optionally request that the signs of the factor loadings be
inverted if there are more minus signs than positive ones. This is purely cosmetic and does not
affect the solution in a substantive way. However, if signs are reversed, this must be kept in
mind when attempting to interpret or assign meaning to the factors.
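The iteration just described can be sketched in a few lines of Python. This is an illustrative version only, assuming a correlation matrix R and using squared multiple correlations for the initial communality estimates as the text describes; it is not the product's actual routine.

```python
import numpy as np

def principal_axis_factors(R, n_factors, max_iter=100, tol=1e-4):
    """Iterated principal axis factoring of a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    # Initial communalities: squared multiple correlations, 1 - 1 / (R^-1)_ii,
    # a lower bound for the communality of each variable.
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    loadings = None
    for _ in range(max_iter):
        R_reduced = R.copy()
        np.fill_diagonal(R_reduced, h2)                   # communalities on the diagonal
        eigval, eigvec = np.linalg.eigh(R_reduced)
        top = np.argsort(eigval)[::-1][:n_factors]         # keep the largest factors
        loadings = eigvec[:, top] * np.sqrt(np.clip(eigval[top], 0.0, None))
        new_h2 = (loadings ** 2).sum(axis=1)               # recomputed communalities
        converged = np.max(np.abs(new_h2 - h2)) < tol
        h2 = new_h2
        if converged:
            break
    reproduced = loadings @ loadings.T                     # compare with R to assess fit
    return loadings, h2, reproduced
```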
Maximum Likelihood Factors
As mentioned earlier, the common factor model attempts to find both common and unique
factors explaining the covariance or correlations amongst a set of variables. That is, an
attempt is made to find a factor pattern C and a uniqueness matrix R such that a covariance or
correlation matrix S can be modeled as S = CC^T + R. To do this, it is necessary to utilize the
principle of maximum likelihood based on the assumption that the data comes from a
multivariate normal distribution. Due to dealing with the distribution function of the elements
of a covariance matrix it is necessary to use the Wishart distribution in order to derive the
likelihood equation. The optimization technique used then to maximize the likelihood of a
solution for C and R is the Expectation Maximization or EM technique. This technique, often
used in the replacement of missing data, is the same basic technique used in Teradata
Warehouse Miner’s cluster analysis algorithm. Some key points regarding this technique are
described below.
Beginning with a correlation or covariance matrix S as with our other factor techniques, a principal components solution is first derived as an initial estimate for the factor pattern matrix C, with the initial estimate for the uniqueness matrix R taken simply as S - CC^T. Then the maximum likelihood solution is iteratively found, yielding a best estimate of C and R. In order then to assess the effectiveness of the model, the correlation or covariance matrix S is compared to the reproduced matrix CC^T + R.
It should be pointed out that when using the maximum likelihood solution the user must first
specify the number of common factors f to produce in the model. The software will not
automatically determine what this value should be or determine it based on a threshold value.
Also, an internal adjustment is made to the final factor pattern matrix C to make the factors
orthogonal, something that is automatically true of the other factor solutions. Finally, the user
may optionally request that the signs of a factor in the matrix C be inverted if there are more
minus signs than positive ones. This is purely cosmetic and does not affect the solution in a
substantive way. However, if signs are reversed, this must be kept in mind when attempting to
interpret or assign meaning to the factors.
Factor Rotations
Teradata Warehouse Miner offers a number of techniques for rotating factors in order to find
the elusive quality of simple structure described earlier. These may optionally be used in
combination with any of the factor techniques offered in the product. When a rotation is
performed, both the rotated matrix and the rotation matrix are reported, as well as the
reproduced correlation or covariance matrix after rotation. As before with the factor solutions
themselves, the user may optionally request that the signs of a factor in the rotated factor or
components matrix be inverted if there are more minus signs than positive ones. This is
purely cosmetic and does not affect the solution in a substantive way.
Orthogonal rotations
First consider orthogonal rotations, that is, rotations of a factor matrix A that result in a
rotated factor matrix B by way of an orthogonal transformation matrix T (i.e., B = AT).
Remember that the nice thing about orthogonal rotations on a factor matrix is that the
resulting factor scores are uncorrelated, a desirable property when the factors are going to be
used in subsequent regression, cluster or other type of analysis. But how is simple structure
obtained?
As described earlier, the idea behind simple structure is to express each component or factor
in terms of fewer variables that are highly correlated with the factor, with the remaining
variables not so correlated with the factor. The two most famous mathematical criteria for
simple factor structure are the quartimax and varimax criteria. Simply put, the varimax
criterion seeks to simplify the structure of columns or factors in the factor loading matrix,
whereas the quartimax criterion seeks to simplify the structure of the rows or variables in the
factor loading matrix. Less simply put, the varimax criterion seeks to maximize the variance
of the squared loadings across the variables for all factors. The quartimax criterion seeks to
maximize the variance of the squared loadings across the factors for all variables. The
solution to either optimization problem is mathematically quite involved, though in principle
it is based on fundamental techniques of linear algebra, differential calculus, and the use of
the popular Newton-Raphson iterative technique for finding the roots of equations.
Regardless of the criterion used, rotations are performed on normalized loadings, that is prior
to rotating, the rows of the factor loading matrix are set to unit length by dividing each
element by the square root of the communality for that variable. The rows are unnormalized
back to the original length after the rotation is performed. This has been found to improve
results, particularly for the varimax method.
Fortunately both the quartimax and varimax criteria can be expressed in terms of the same equation containing a constant value that is 0 for quartimax and 1 for varimax. The orthomax criterion is then obtained simply by setting this constant, call it gamma, to any desired value; equamax corresponds to setting this constant to half the number of factors; and parsimax is given by setting the value of gamma to v(f-1) / (v+f+2), where v is the number of variables and f is the number of factors.
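The shared form of these criteria is easiest to see in code. The sketch below is an illustrative Python implementation of the common iterative SVD formulation of the orthomax family (gamma = 1 gives varimax, 0 quartimax, f/2 equamax); for brevity it omits the Kaiser row normalization described above and is not the product's own routine.

```python
import numpy as np

def orthomax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonal rotation of a factor loading matrix.
    gamma = 1.0 gives varimax, 0.0 quartimax, f/2 equamax."""
    A = np.asarray(loadings, dtype=float)
    p, f = A.shape
    T = np.eye(f)                                  # accumulated rotation matrix
    objective = 0.0
    for _ in range(max_iter):
        B = A @ T                                  # currently rotated loadings
        # Gradient of the orthomax criterion with respect to the rotation.
        G = A.T @ (B ** 3 - (gamma / p) * B @ np.diag((B ** 2).sum(axis=0)))
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt                                 # nearest orthogonal transformation
        new_objective = s.sum()
        if new_objective < objective * (1.0 + tol):
            break                                  # criterion no longer improving
        objective = new_objective
    return A @ T, T                                # rotated loadings and rotation matrix
```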
Oblique rotations
As mentioned earlier, oblique rotations relax the requirement for factor independence that
exists with orthogonal rotations, while more aggressively seeking better data alignment.
Teradata Warehouse Miner uses a technique known as the indirect oblimin method. As with
orthogonal rotations, there is a common equation for the oblique simple structure criterion
that contains a constant that can be set for various effects. A value of 0 for this constant, call it
gamma, yields the quartimin solution, which is the most oblique solution of those offered. A
value of 1 yields the covarimin solution, the least oblique case. And a value of 0.5 yields the
biquartimin solution, a compromise between the two. A solution known as orthomin can be
achieved by setting the value of gamma to any desired positive value.
One of the distinctions of a factor solution that incorporates an oblique rotation is that the
factor loadings must be thought of in terms of two different matrices, the factor pattern P
matrix and the factor structure matrix S. These are related by the equation S = PQ where Q is
the matrix of correlations between factors. Obviously if the factors are not correlated, as in an
unrotated solution or after an orthogonal rotation, then Q is the identity matrix and the
structure and pattern matrix are the same. The result of an oblique rotation must include both
the pattern matrix that describes the common factors and the structure matrix of correlations
between the factors and original variables.
As with orthogonal rotations, oblique rotations are performed on normalized loadings that are
restored to their original size after rotation. A unique characteristic of the indirect oblimin
method of rotation is that it is performed on a reference structure based on the normals of the
original factor space. There is no inherent value in this, but it is in fact just a side effect of the
technique. It means however that an oblique rotation results in a reference factor pattern,
structure and rotation matrix that is then converted back into the original factor space as the
final primary factor pattern, structure and rotation matrix.
Data Quality Reports
The same data quality reports optionally available for linear regression are also available
when performing Factor Analysis.
Prime Factor Reports
Prime Factor Loadings
This report provides a specially sorted presentation of the factor loadings. Like the standard
report of factor loadings, the rows represent the variables and the columns represent the
factors. In this case, however, each variable is associated with the factor for which it has the
largest loading as an absolute value. The variables having factor 1 as the prime factor are
listed first, in descending order of the loading with factor 1. Then the variables having factor
2 as the prime factor are listed, continuing on until all the variables are listed. It is possible
that not all factors will appear in the Prime Factor column, but all the variables will be listed
once and only once with all their factor loadings.
Note that in the special case after an oblique rotation has been performed in the factor
analysis, the report is based on the factor structure matrix and not the factor pattern matrix,
since the structure matrix values represent the correlations between the variables and the
factors.
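The sorting rule behind this report can be expressed in a few lines of Python. The sketch below uses a subset of the loadings from the example report that follows and simply assigns each variable to the factor with the largest absolute loading, then orders the rows by prime factor and descending loading.

```python
import numpy as np

variables = ["income", "revenue", "single", "age"]
loadings = np.array([[ 0.8229, -0.011675, 0.1353],
                     [ 0.8171,  0.4475,   0.023336],
                     [-0.7705,  0.4332,   0.1554],
                     [ 0.7348, -0.045584, 0.010212]])

prime = np.argmax(np.abs(loadings), axis=1)            # prime factor per variable
order = sorted(range(len(variables)),
               key=lambda i: (prime[i], -abs(loadings[i, prime[i]])))
for i in order:
    print(variables[i], f"Factor {prime[i] + 1}", loadings[i])
```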
The following is an example of a Prime Factor Loadings report.
Table 14: Prime Factor Loadings report (Example)

| Variable   | Prime Factor | Factor 1    | Factor 2    | Factor 3   |
| income     | Factor 1     | .8229       | -1.1675E-02 | .1353      |
| revenue    | Factor 1     | .8171       | .4475       | 2.3336E-02 |
| single     | Factor 1     | -.7705      | .4332       | .1554      |
| age        | Factor 1     | .7348       | -4.5584E-02 | 1.0212E-02 |
| cust_years | Factor 2     | .5158       | .6284       | .1577      |
| purchases  | Factor 2     | .5433       | -.5505      | -.254      |
| female     | Factor 3     | -4.1177E-02 | .3366       | -.9349     |
Prime Factor Variables
The Prime Factor Variables report is closely related to the Prime Factor Loadings report. It
associates variables with their prime factors and possibly other factors if a threshold percent
or loading value is specified. It provides a simple presentation, without numbers, of the
relationships between factors and the variables that contribute to them.
If a threshold percent of 1.0 is used, only prime factor relationships are reported. A threshold
percentage of less than 1.0 indicates that if the loading for a particular factor is equal to or
above this percentage of the loading for the variable's prime factor, then an association is
made between the variable and this factor as well. When the variable is associated with a
factor other than its prime factor, the variable name is given in parentheses. A threshold
loading value may alternatively be used to determine the associations between variables and
factors. In this case, it is possible that a variable may not appear in the report, depending on
the threshold value and the loading values. However, if the option to reverse signs was
enabled, positive values may actually represent inverse relationships between factors and
original variables. Deselecting this option in a second run and examining factor loading
results will provide the true nature (directions) of relationships among variables and factors.
The following is an example of a Prime Factor Variables report.
Table 15: Prime Factor Variables report (Example)

| Factor 1    | Factor 2   | Factor 3 |
| income      | cust_years | female   |
| revenue     | purchases  | *        |
| single      | *          | *        |
| age         | *          | *        |
| (purchases) | *          | *        |
Prime Factor Variables with Loadings
The Prime Factor Variables with Loadings report is functionally the same as the Prime Factor
Variables report except that the actual loading values determining the associations between
the variables and factors are also given. The magnitude of the loading gives some idea of the
relative strength of the relationship and the sign indicates whether or not it is an inverse
relationship. A negative sign indicates an inverse relationship in the values (i.e., a negative
correlation).
The following is an example of a Prime Factor Variables with Loadings report.
Table 16: Prime Factor Variables with Loadings report (Example)

| Factor   | Variable    | Loading |
| Factor 1 | income      | .8229   |
| Factor 1 | revenue     | .8171   |
| Factor 1 | single      | -.7705  |
| Factor 1 | age         | .7348   |
| Factor 1 | (purchases) | .5433   |
| Factor 2 | cust_years  | .6284   |
| Factor 2 | purchases   | -.5505  |
| Factor 3 | female      | -.9349  |
Missing Data
Null values for columns in a factor analysis can adversely affect results. It is recommended
that the listwise deletion option be used when building the SSCP matrix with the Build Matrix
function. This ensures that any row for which one of the columns is null will be left out of the
matrix computations completely. Additionally, the Recode transformation function can be
used to build a new column, substituting a fixed known value for null.
Initiate a Factor Analysis
Use the following procedure to initiate a new Factor Analysis in Teradata Warehouse Miner:
1 Click on the Add New Analysis icon in the toolbar:
Figure 47: Add New Analysis from toolbar
2 In the resulting Add New Analysis dialog box, click on Analytics under Categories and then under Analyses double-click on Factor Analysis:
Figure 48: Add New Analysis dialog
3 This will bring up the Factor Analysis dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
Factor - INPUT - Data Selection
On the Factor Analysis dialog click on INPUT and then click on data selection:
Figure 49: Factor Analysis > Input > Data Selection
On this screen select:
1 Select Input Source
Users may select between different sources of input, Table, Matrix or Analysis. By
selecting the Input Source Table the user can select from available databases, tables (or
views) and columns in the usual manner. (In this case a matrix will be dynamically built
and discarded when the algorithm completes execution). By selecting the Input Source
Matrix the user may can select from available matrices created by the Build Matrix
function. This has the advantage that the matrix selected for input is available for further
analysis after completion of the algorithm, perhaps selecting a different subset of columns
from the matrix.
By selecting the Input Source Analysis the user can select directly from the output of
another analysis of qualifying type in the current project. (In this case a matrix will be
dynamically built and discarded when the algorithm completes execution). Analyses that
may be selected from directly include all of the Analytic Data Set (ADS) and
Reorganization analyses (except Refresh). In place of Available Databases the user may
select from Available Analyses, while Available Tables then contains a list of all the output
tables that will eventually be produced by the selected Analysis. (Note that since this
analysis cannot select from a volatile input table, Available Analyses will contain only
those qualifying analyses that create an output table or view).
2 Select Columns From One Table
• Available Databases (only for Input Source equal to Table) — All the databases that are available for the Factor Analysis.
• Available Matrices (only for Input Source equal to Matrix) — When the Input Source is Matrix, a matrix must first be built by the user with the Build Matrix function before Factor Analysis can be performed. Select the matrix that summarizes the data to be analyzed. (The matrix must have been built with more rows than columns selected or the Factor Analysis will produce a singular matrix, causing a failure).
• Available Analyses (only for Input Source equal to Analysis) — All the analyses that are available for the Factor Analysis.
• Available Tables (only for Input Source equal to Table or Analysis) — All the tables that are available for the Factor Analysis.
• Available Columns — All the columns that are available for the Factor Analysis.
• Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or clicking on the arrow button to move highlighted columns into the Selected Columns window. The algorithm requires that the selected columns be of numeric type (or contain numbers in character format).
Factor - INPUT - Analysis Parameters
On the Factor Analysis dialog click on INPUT and then click on analysis parameters:
Figure 50: Factor Analysis > Input > Analysis Parameters
On this screen select:
• General Options
• Analysis method
• Principal Components (PCA) — As described above. This is the default method.
• Principal Axis Factors (PAF) — As described above.
• Maximum Likelihood Factors (MLF) — As described above.
• Convergence Method
• Minimum Eigenvalue
PCA — minimum eigenvalue to include in principal components (default 1.0)
PAF — minimum eigenvalue to include in factor loadings (default 0.0)
MLF — option does not apply (N/A)
• Number of Factors — The user may request a specific number of factors as an alternative to using the minimum eigenvalue option for PCA and PAF. Number of factors is however required for MLF. The number of factors requested must not exceed the number of requested variables.
• Convergence Criterion
• PCA — convergence criterion does not apply
• PAF — iteration continues until maximum communality change does not exceed convergence criterion
• MLF — iteration continues until maximum change in the square root of uniqueness values does not exceed convergence criterion
• Maximum Iterations
• PCA — maximum iterations does not apply (N/A)
• PAF — the algorithm stops if the maximum iterations is exceeded (default 100)
• MLF — the algorithm stops if the maximum iterations is exceeded (default 1000)
• Matrix Type — The product automatically converts the extended cross-products matrix stored in metadata results tables by the Build Matrix function into the desired covariance or correlation matrix. The choice will affect the scaling of resulting factor measures and factor scores.
• Correlation — Build a correlation matrix as input to Factor Analysis. This is the default option.
• Covariance — Build a covariance matrix as input to Factor Analysis.
• Invert signs if majority of matrix values are negative (checkbox) — You may optionally request that the signs of factor loadings and related values be changed if there are more minus signs than positive ones. This is purely cosmetic and does not affect the solution in a substantive way. Default is enabled.
• Rotation Options
• Rotation Method
• None — No factor rotation is performed. This is the default option.
• Varimax — Gamma in rotation equation fixed at 1.0. The varimax criterion seeks to simplify the structure of columns or factors in the factor loading matrix.
• Quartimax — Gamma in rotation equation fixed at 0.0. The quartimax criterion seeks to simplify the structure of the rows or variables in the factor loading matrix.
• Equamax — Gamma in rotation equation fixed at f / 2.
• Parsimax — Gamma in rotation equation fixed at v(f-1) / (v+f+2).
• Orthomax — Gamma in rotation equation set by user.
• Quartimin — Gamma in rotation equation fixed at 0.0. Provides the most oblique rotation.
• Biquartimin — Gamma in rotation equation fixed at 0.5.
• Covarimin — Gamma in rotation equation fixed at 1.0. Provides the least oblique rotation.
• Orthomin — Gamma in rotation equation set by user.
• Report Options
  • Variable Statistics — This report gives the mean value and standard deviation of each variable in the model based on the derived SSCP matrix.
  • Near Dependency — This report lists collinear variables or near dependencies in the data based on the derived SSCP matrix.
    • Condition Index Threshold — Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The one that involves this parameter is the occurrence of a large condition index value associated with a specially constructed principal factor. If a factor has a condition index greater than this parameter's value, it is a candidate for the Near Dependency report. A default value of 30 is used as a rule of thumb.
    • Variance Proportion Threshold — Entries in the Near Dependency report are triggered by two conditions occurring simultaneously. The one that involves this parameter is when two or more variables have a variance proportion greater than this threshold value for a factor with a high condition index. Another way of saying this is that a 'suspect' factor accounts for a high proportion of the variance of two or more variables. This parameter defines what a high proportion of variance is. A default value of 0.5 is used as a rule of thumb.
  • Collinearity Diagnostics Report — This report provides the details behind the Near Dependency report, consisting of the "Eigenvalues of Unit Scaled X'X", "Condition Indices" and "Variance Proportions" tables.
  • Factor Loading Reports
    • Factor Variables Report
    • Factor Variables with Loadings Report
    • Display Variables Using
      • Threshold percent — A threshold percentage of less than 1.0 indicates that if the loading for a particular factor is equal to or above this percentage of the loading for the variable's prime factor, then an association is made between the variable and this factor as well.
      • Threshold loading — A threshold loading value may alternatively be used in place of a threshold percentage.
Factor Analysis - OUTPUT
On the Factor Analysis dialog, click on OUTPUT:
Figure 51: Factor Analysis > Output
On this screen select:
• Store the Factor Loadings/Weights/Statistics reports as tables in the database — Check this
box to store the following reports, if selected, as tables in the database.
  • Factor Loadings
  • Factor Variables With Loadings
  • Factor Weights
  • Variable Statistics
• Database Name — The name of the database to create the output tables in.
• Output Table Prefix — The prefix to each of the output table names.
For example, if my_factor_reports_ is entered here, the tables produced will be named as
follows:
Table 17: my_factor_reports_ tables

Report                           Table Name
Factor Loadings                  my_factor_reports_FL
Factor Variables With Loadings   my_factor_reports_FV
Factor Weights                   my_factor_reports_FW
Variable Statistics              my_factor_reports_VS
The contents of the tables will match the contents of the reports except that there will be
no fixed ordering of the rows (unless an ORDER BY clause is used when selecting from
them).
• Advertise Output — The Advertise Output option “advertises” output by inserting
information into one or more of the Advertise Output metadata tables according to the
type of analysis and the options selected in the analysis. Note that this check box will be
disabled if the Always Advertise option is selected on the Connection Properties dialog,
because in this case advertising will be automatic.
Advertise Output information may be viewed using the Advertise Maintenance dialog
available from the Tools menu, from where the definition and contents of these tables may
also be viewed. For more information, refer to Advertise Output.
• Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that may be
used to categorize or describe the output.
Run the Factor Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Factor Analysis
The results of running the Factor Analysis include a factor patterns graph, a scree plot (unless
MLF was specified), and a variety of statistical reports. All of these results are outlined below.
Factor Analysis - RESULTS - Reports
On the Factor Analysis dialog, click on RESULTS and then click on reports (note that the
RESULTS tab will be grayed-out/disabled until after the analysis is completed):
Figure 52: Factor Analysis > Results > Reports
Data Quality Reports
• Variable Statistics — If selected on the Results Options tab, this report gives the mean
value and standard deviation of each variable in the model based on the SSCP matrix
provided as input.
• Near Dependency — If selected on the Results Options tab, this report lists collinear
variables or near dependencies in the data based on the SSCP matrix provided as input.
Entries in the Near Dependency report are triggered by two conditions occurring
simultaneously. The first is the occurrence of a large condition index value associated
with a specially constructed principal factor. If a factor has a condition index greater than
the parameter specified on the Results Options tab, it is a candidate for the Near
Dependency report. The other is when two or more variables have a variance proportion
greater than a threshold value for a factor with a high condition index. Another way of
saying this is that a 'suspect' factor accounts for a high proportion of the variance of two
or more variables. The parameter that defines what constitutes a high proportion of variance
is also set on the Results Options tab, with a default value of 0.5.
• Detailed Collinearity Diagnostics — If selected on the Results Options tab, this report
provides the details behind the Near Dependency report, consisting of the following
tables.
  • Eigenvalues of Unit Scaled X'X — Report of the eigenvalues of all variables scaled so that each variable adds up to 1 when summed over all the observations or rows. In order to calculate the singular values of X (the rows of X are the observations), the mathematically equivalent square root of the eigenvalues of X'X are computed instead for practical reasons.
  • Condition Indices — The condition index of each eigenvalue, calculated as the square root of the ratio of the largest eigenvalue to the given eigenvalue, a value always 1 or greater.
  • Variance Proportions — The variance decomposition of these eigenvalues is computed using the eigenvalues together with the eigenvectors associated with them. The result is a matrix giving, for each variable, the proportion of variance associated with each eigenvalue.
Principal Component Analysis report
• Number of Variables — This is the number of variables to be factored, taken from the
matrix that is input to the algorithm. Note that there are no dependent or independent
variables in a factor analysis model.
• Minimum Eigenvalue — The minimum value of a factor’s associated eigenvalue,
determining whether or not to include the factor in the final model. This field is not
displayed if the Number of Factors option is used to determine the number of factors
retained.
• Number of Factors — This value reflects the number of factors retained in the final factor
analysis model. If the Number of Factors option is explicitly set by the user to determine
the number of factors, then this reported value reflects the value set by the user.
Otherwise, it reflects the number of factors resulting from applying the Minimum
Eigenvalue option.
• Matrix Type (cor/cov) — This value reflects the type of input matrix requested by the user,
either correlation (cor) or covariance (cov).
• Rotation (none/orthogonal/oblique) — This value reflects the type of rotation, if any,
requested by the user, either none, orthogonal, or oblique.
• Gamma — This value is a coefficient in the rotation equation that reflects the type of
rotation requested, if any, and in some cases is explicitly set by the user. Gamma is
determined as follows.
• Orthogonal rotations
  • Varimax — (gamma in rotation equation fixed at 1.0)
  • Quartimax — (gamma in rotation equation fixed at 0.0)
  • Equamax — (gamma in rotation equation fixed at f / 2)*
  • Parsimax — (gamma in rotation equation fixed at v(f-1) / (v+f+2))*
  • Orthomax — (gamma in rotation equation set by user)
  * where v is the number of variables and f is the number of factors
• Oblique rotations
  • Quartimin — (gamma in rotation equation fixed at 0.0)
  • Biquartimin — (gamma in rotation equation fixed at 0.5)
  • Covarimin — (gamma in rotation equation fixed at 1.0)
  • Orthomin — (gamma in rotation equation set by user)
Principal Axis Factors report
• Number of Variables — This is the number of variables to be factored, taken from the
matrix that is input to the algorithm. Note that there are no dependent or independent
variables in a factor analysis model.
• Minimum Eigenvalue — The minimum value of a factor’s associated eigenvalue,
determining whether or not to include the factor in the final model. This field is not
displayed if the Number of Factors option is used to determine the number of factors
retained.
• Number of Factors — This value reflects the number of factors retained in the final factor
analysis model. If the Number of Factors option is explicitly set by the user to determine
the number of factors, then this reported value reflects the value set by the user.
Otherwise, it reflects the number of factors resulting from applying the Minimum
Eigenvalue option.
• Maximum Iterations — This is the maximum number of iterations requested by the user.
• Convergence Criterion — This is the value requested by the user as the convergence
criterion such that iteration continues until the maximum change in the square root of
uniqueness values does not exceed this value.
• Rotation (none/orthogonal/oblique) — This value reflects the type of rotation, if any,
requested by the user, either none, orthogonal, or oblique.
• Gamma — This value is a coefficient in the rotation equation that reflects the type of
rotation requested, if any, and in some cases is explicitly set by the user. Gamma is
determined as follows.
• Orthogonal rotations
  • Varimax — (gamma in rotation equation fixed at 1.0)
  • Quartimax — (gamma in rotation equation fixed at 0.0)
  • Equamax — (gamma in rotation equation fixed at f / 2)*
  • Parsimax — (gamma in rotation equation fixed at v(f-1) / (v+f+2))*
  • Orthomax — (gamma in rotation equation set by user)
  * where v is the number of variables and f is the number of factors
• Oblique rotations
  • Quartimin — (gamma in rotation equation fixed at 0.0)
  • Biquartimin — (gamma in rotation equation fixed at 0.5)
  • Covarimin — (gamma in rotation equation fixed at 1.0)
  • Orthomin — (gamma in rotation equation set by user)
Maximum Likelihood (EM) Factor Analysis report
• Number of Variables — This is the number of variables to be factored, taken from the
matrix that is input to the algorithm. Note that there are no dependent or independent
variables in a factor analysis model.
• Number of Observations — This is the number of observations in the data used to build the
matrix that is input to the algorithm.
• Number of Factors — This reflects the number of factors requested by the user for the
factor analysis model.
• Maximum Iterations — This is the maximum number of iterations requested by the user.
(The actual number of iterations used is reflected in the Total Number of Iterations field
further down in the report).
• Convergence Criterion — This is the value requested by the user as the convergence
criterion such that iteration continues until the maximum change in the square root of
uniqueness values does not exceed this value. (It should be noted that convergence is
based on uniqueness values rather than maximum likelihood values, something that is
done strictly for practical reasons based on experimentation).
• Matrix Type (cor/cov) — This value reflects the type of input matrix requested by the user,
either correlation (cor) or covariance (cov).
• Rotation (none/orthogonal/oblique) — This value reflects the type of rotation, if any,
requested by the user, either none, orthogonal, or oblique.
• Gamma — This value is a coefficient in the rotation equation that reflects the type of
rotation requested, if any, and in some cases is explicitly set by the user. Gamma is
determined as follows.
• Orthogonal rotations
  • Varimax — (gamma in rotation equation fixed at 1.0)
  • Quartimax — (gamma in rotation equation fixed at 0.0)
  • Equamax — (gamma in rotation equation fixed at f / 2)*
  • Parsimax — (gamma in rotation equation fixed at v(f-1) / (v+f+2))*
  • Orthomax — (gamma in rotation equation set by user)
  * where v is the number of variables and f is the number of factors
• Oblique rotations
  • Quartimin — (gamma in rotation equation fixed at 0.0)
  • Biquartimin — (gamma in rotation equation fixed at 0.5)
  • Covarimin — (gamma in rotation equation fixed at 1.0)
  • Orthomin — (gamma in rotation equation set by user)
• Total Number of Iterations — This value is the number of iterations that the algorithm
performed to converge on a maximum likelihood solution.
• Final Average Likelihood — This is the final value of the average likelihood over all the
observations represented in the input matrix.
• Change in Avg Likelihood — This is the final change, from the previous to the final
iteration, in value of the average likelihood over all the observations represented in the
input matrix.
• Maximum Change in Sqrt (uniqueness) — The algorithm calculates a uniqueness value for
each variable each time it iterates, and keeps track of how much the positive square root of
each of these values changes from one iteration to the next. The maximum change in this
value is given here, and it is of interest because it is used to determine convergence of the
model. (Refer to “Final Uniqueness Values” on page 83 for an explanation of these values
in the common factor model).
Max Change in Sqrt (Communality) For Each Iteration
This report, printed for Principal Axis Factors only, and only if the user requests the Report
Output option Long, shows the progress of the algorithm in converging on a solution. It does
this by showing, at each iteration, the maximum change in the positive square root of the
communality of each of the variables. The communality of a variable is that portion of its
variance that can be attributed to the common factors. Simply put, when the communality
values for all of the variables stop changing sufficiently, the algorithm stops.
Matrix to be Factored
The correlation or covariance matrix to be factored is printed out only if the user requests the
Report Output option Long. Only the lower triangular portion of this symmetric matrix is
reported and output is limited to at most 100 rows for expediency. (If it is necessary to view
the entire matrix, the Get Matrix function with the Export to File option is recommended).
Initial Communality Estimates
This report is produced only for Principal Axis Factors and Maximum Likelihood Factors.
The communality of a variable is that portion of its variance that can be attributed to the
common factors, excluding uniqueness. The initial communality estimates for each variable
are made by calculating the squared multiple correlation coefficient of each variable with
respect to the other variables taken together.
Final Communality Estimates
This report is produced only for Principal Axis Factors and Maximum Likelihood Factors.
The communality of a variable is that portion of its variance that can be attributed to the
common factors, excluding uniqueness. The final communality estimates for each variable
are computed as:
$$h_j^2 = \sum_{k=1}^{r} f_{jk}^2$$

(i.e., as the sum of the squares of the factor loadings for each variable).
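As a rough illustration only (not the product's implementation), the following Python sketch computes final communality estimates from a hypothetical factor loading matrix by summing the squared loadings for each variable:

import numpy as np

# Hypothetical loading matrix: 4 variables (rows) by 2 retained factors (columns)
loadings = np.array([
    [0.81,  0.12],
    [0.75, -0.20],
    [0.10,  0.88],
    [0.05,  0.79],
])

# Final communality of each variable: sum of its squared loadings across the factors
communalities = (loadings ** 2).sum(axis=1)
print(communalities)  # one value per variable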
Eigenvalues
These are the resulting eigenvalues of the principal component or principal axis factor
solution, in descending order. At this stage, there are as many eigenvalues as input variables
since the number of factors has not been reduced yet.
Eigenvectors
These are the resulting eigenvectors of the principal components or principal axis factor
solution, in descending order. At this stage, there are as many eigenvectors as input variables
since the number of factors has not been reduced yet. Eigenvectors are printed out only if the
user requests the Report Output option Long.
Principal Component Loadings (Principal Components)
This matrix of values, which is variables by factors in size, represents both the factor pattern
and factor structure, i.e., the linear combination of factors for each variable and the
correlations between factors and variables (provided Matrix Type is Correlation). The number
of factors has been reduced to meet the minimum eigenvalue or number of factors requested,
but the output does not reflect any factor rotations that may have been requested.
This output table contains the raw data used in the Prime Factor Reports, which are probably
better to use for interpreting results. If the user requested a Matrix Type of Correlation, the
principal component loadings can be interpreted as the correlations between the original
variables and the newly created factors. An absolute value approaching 1 indicates that a
variable is contributing strongly to a particular factor.
Factor Pattern (Principal Axis Factors)
This matrix of values, which is variables by factors in size, represents both the factor pattern
and factor structure, i.e., the linear combination of factors for each variable and the
correlations between factors and variables (provided Matrix Type is Correlation). The number
of factors has been reduced to meet the minimum eigenvalue or number of factors requested,
but the output does not reflect any factor rotations that may have been requested.
This output table contains the raw data used in the Prime Factor Reports, which are probably
better to use for interpreting results. If the user requested a Matrix Type of Correlation, the
factor pattern can be interpreted as the correlations between the original variables and the
newly created factors. An absolute value approaching 1 indicates that a variable is
contributing strongly to a particular factor.
Factor Pattern (Maximum Likelihood Factors)
This matrix of values, which is variables by factors in size, represents both the factor pattern
and factor structure, i.e., the linear combination of factors for each variable and the
correlations between factors and variables (provided Matrix Type is Correlation). The number
of factors has been fixed at the number of factors requested. The output at this stage does not
reflect any factor rotations that may have been requested.
This output table contains the raw data used in the Prime Factor Reports, which are probably
better to use for interpreting results. If the user requested a Matrix Type of Correlation, the
factor pattern can be interpreted as the correlations between the original variables and the
newly created factors. An absolute value approaching 1 indicates that a variable is
contributing strongly to a particular factor.
Variance Explained by Factors
This report provides the amount of variance in all of the original variables taken together that
is accounted for by each factor. For Principal Components and Principal Axis Factor
solutions, the variance is the same as the eigenvalues calculated for the solution. In general
however, and for Maximum Likelihood Factor solutions in particular, the variance is the sum
of the squared loadings for each factor.
(After an oblique rotation, if the factors are correlated, there is an interaction term that must
also be added in based on the loadings and the correlations between factors. A separate report
entitled Contributions of Rotated Factors To Variance is provided if an oblique rotation is
performed).
• Factor Variance — This column shows the actual amount of variance in the original
variables accounted for by each factor.
• Percent of Total Variance — This column shows the percentage of the total variance in the
original variables accounted for by each factor.
• Cumulative Percent — This column shows the cumulative percentage of the total variance
in the original variables accounted for by Factor 1 through each subsequent factor in turn.
Factor Variance to Total Variance Ratio
This is simply the ratio of the variance explained by all the factors to the total variance in the
original data.
Condition Indices of Components
The condition index of a principal component or principal factor is the square root of the ratio
of the largest eigenvalue to the eigenvalue associated with that component or factor.
This report is provided for Principal Components and Principal Axis Factors only.
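To make the definition concrete, the following Python sketch (illustrative only) derives the condition indices from a descending list of eigenvalues, here using the eigenvalues retained in the tutorial later in this chapter:

import numpy as np

# Eigenvalues of a principal component solution, in descending order
eigenvalues = np.array([4.292, 2.497, 1.844, 1.598, 1.446, 1.254, 1.041])

# Condition index of each component: square root of (largest eigenvalue / eigenvalue)
condition_indices = np.sqrt(eigenvalues[0] / eigenvalues)
print(condition_indices)  # always 1.0 or greater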
Final Uniqueness Values
The common factor model seeks to find a factor pattern C and a uniqueness matrix R such
that a covariance or correlation matrix S can be modeled as $S = CC^T + R$. The uniqueness
matrix is a diagonal matrix, so there is a single uniqueness value for each variable in the
model. The theory behind the uniqueness value of a variable is that the variance of each
variable can be expressed as the sum of its communality and uniqueness, that is, the variance
of the jth variable is given by:

$$s_j^2 = h_j^2 + u_j^2$$
This report is provided for Maximum Likelihood Factors only.
Reproduced Matrix Based on Loadings
The results of a factor analysis can be used to reproduce or approximate the original
correlation or covariance matrix used to build the factor analysis model. This is done to
evaluate the effectiveness of the model in accounting for the variance in the original data. For
Principal Components and Principal Axis Factors the reproduced matrix is simply the
loadings matrix times its transpose. For Maximum Likelihood Factors it is the loadings
matrix times its transpose plus the uniqueness matrix.
This report is provided only when Long is selected as the Output Option.
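The following Python sketch, using hypothetical loading and uniqueness values, illustrates how the reproduced matrix can be formed from the loadings, with and without the uniqueness matrix; it is an illustration of the calculation described above, not the product's code:

import numpy as np

# Hypothetical loadings (variables x factors) and uniqueness values (one per variable)
loadings = np.array([[0.8,  0.1],
                     [0.7, -0.2],
                     [0.1,  0.9]])
uniqueness = np.array([0.3, 0.4, 0.2])

# Principal Components / Principal Axis Factors: loadings times its transpose
reproduced_pc = loadings @ loadings.T

# Maximum Likelihood Factors: add the diagonal uniqueness matrix
reproduced_mlf = loadings @ loadings.T + np.diag(uniqueness)

# The difference reports compare these to the original correlation or covariance matrix
print(reproduced_pc)
print(reproduced_mlf)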
Difference Between Original and Reproduced cor/cov Matrix
This report gives the differences between the original correlation or covariance matrix values
used in the factor analysis and the Reproduced Matrix Based on Loadings. (In the case of
Principal Axis Factors, the reproduced matrix is compared to the original matrix with the
initial communality estimates placed in the diagonal of the matrix).
This report is provided only when Long is selected as the Output Option.
Absolute Difference
This report summarizes the absolute value of the differences between the original correlation
or covariance matrix values used in the factor analysis and the Reproduced Matrix Based on
Loadings.
• Mean — This is the average absolute difference in correlation or covariance over the
entire matrix.
• Standard Deviation — This is the standard deviation of the absolute differences in
correlation or covariance over the entire matrix.
• Minimum — This is the minimum absolute difference in correlation or covariance over the
entire matrix.
• Maximum — This is the maximum absolute difference in correlation or covariance over
the entire matrix.
Rotated Loading Matrix
This report of the factor loadings (pattern) after rotation is given only after orthogonal
rotations.
Rotated Structure
This report of the factor structure after rotation is given only after oblique rotations. Note that
after an oblique rotation the rotated structure matrix is usually different from the rotated
pattern matrix.
Rotated Pattern
This report of the factor pattern after rotation is given after both orthogonal and oblique
rotations. Note that after an oblique rotation the rotated pattern matrix is usually different
from the rotated structure matrix.
Rotation Matrix
After rotating the factor pattern matrix P to get the rotated matrix $P_R$, the rotation matrix T is
also produced such that $P_R = PT$. However, after an oblique rotation the rotation matrix obeys
the following equation: $P_R = P(T^T)^{-1}$.
This report is provided only when Long is selected as the Output Option.
Variance Explained by Rotated Factors
This is the same report as Variance Explained by Factors except that it is based on the rotated
factor loadings. Comparison of the two reports can show the effects of rotation on the
effectiveness of the model.
After an oblique rotation, another report is produced called the Contributions of Rotated
Factors to Variance to show both the contributions of individual factors and the contributions
of factor interactions to the explanation of the variance in the original variables analyzed.
Rotated Factor Variance to Total Variance Ratio
This is the same report as Factor Variance to Total Variance Ratio except that it is based on
the rotated factor loadings. Comparison of the two reports can show the effects of rotation on
the effectiveness of the model.
Correlations Among Rotated Factors
After an oblique rotation the factors are generally no longer orthogonal or uncorrelated with
each other. This report is a standard Pearson product-moment correlation matrix treating the
rotated factors as new variables. Values range from 0 (no correlation) to -1 or +1 (maximum
correlation); a negative correlation indicates that two factors vary in opposite directions
with respect to each other.
This report is provided only after an oblique rotation is performed.
Contributions of Rotated Factors to Variance
In general, the variance of the original variables explained by a factor is the sum of the
squared loadings for the factor. But after an oblique rotation the factors may be correlated, so
additional interaction terms between the factors must be considered in computing the
explained variance reported in the Variance Explained by Rotated Factors report.
The contributions of factors to variance may be characterized as direct contributions:

$$V_p = \sum_{j=1}^{n} b_{jp}^2$$

and joint contributions:

$$V_{pq} = 2 r_{T_p T_q} \sum_{j=1}^{n} b_{jp} b_{jq}$$

where p and q vary by factors with p < q, j varies by variables, and r is the correlation
between factors. The Contributions of Rotated Factors to Variance report displays direct
contributions along the diagonal and joint contributions off the diagonal.
This report is provided only after an oblique rotation is performed.
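The following Python sketch, with hypothetical rotated loadings and factor correlations, computes the direct and joint contributions defined above, placing direct contributions on the diagonal and joint contributions off the diagonal:

import numpy as np

# Hypothetical rotated loadings b (variables x factors) and factor correlation matrix r
b = np.array([[0.8, 0.1],
              [0.6, 0.3],
              [0.1, 0.9]])
r = np.array([[1.0, 0.4],
              [0.4, 1.0]])  # correlations among the (oblique) rotated factors

n_factors = b.shape[1]
contrib = np.zeros((n_factors, n_factors))
for p in range(n_factors):
    # Direct contribution V_p = sum over j of b_jp squared (on the diagonal)
    contrib[p, p] = np.sum(b[:, p] ** 2)
    for q in range(p + 1, n_factors):
        # Joint contribution V_pq = 2 * r_pq * sum over j of b_jp * b_jq (off the diagonal)
        contrib[p, q] = contrib[q, p] = 2 * r[p, q] * np.sum(b[:, p] * b[:, q])
print(contrib)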
Factor Weights
A report of Factor Weights may be selected on the analysis parameters tab. Factor weights
are the coefficients that are multiplied by the variables in the factor model to determine the
value of each factor as a linear combination of input variables when scoring. (Using the
Factor Scoring analysis with Scoring Method equal to Score and output option Generate the
SQL for this analysis but do not execute it checked, it may be seen that the Factor Weights
report displays the same coefficients that are used when scoring a factor model). Whereas
factor loadings generally indicate the correlation between factors and model variables (i.e., in
the absence of an oblique rotation), factor weights can give an indication of the relative
contribution of each model variable to each new variable (factor).
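The scoring calculation itself is generated as SQL by the product; purely for illustration, the following Python sketch (with hypothetical weights and input values) shows how each factor score is formed as a linear combination of the input variables and the factor weights:

import numpy as np

# Hypothetical factor weights (variables x factors) and two scored rows of input values
weights = np.array([[ 0.45, -0.10],
                    [ 0.38,  0.05],
                    [-0.02,  0.52]])
rows = np.array([[ 0.3, -1.2, 0.8],
                 [-0.5,  0.7, 1.4]])   # one value per input variable in each row

# Each factor score is the weighted sum of the input variables for that row
scores = rows @ weights   # result: rows x factors
print(scores)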
Factor Analysis - RESULTS - Pattern Graph
On the Factor Analysis dialog, click on RESULTS and then click on pattern graph (note that
the RESULTS tab will be grayed-out/disabled until after the analysis is completed):
Figure 53: Factor Analysis > Results > Pattern Graph
The Factor Analysis Pattern Graph plots the final factor pattern values for up to twelve
variables, two factors at a time. These factor pattern values are the coefficients in the linear
combination of factors that comprise each variable. When the Analysis Type is Principal
Components, these pattern values are referred to as factor loadings. When the Matrix Type is
Correlation, the values of these coefficients are standardized to be between -1 and 1 (if
Covariance, they are not). Unless an oblique rotation has been performed, these values also
represent the factor structure (i.e., the correlation between a factor and a variable).
The following options are available:
• Variables
  • Available — A list of all variables that were input to the Factor Analysis.
  • Selected — A list of the variables (up to 12) that will be displayed on the Factor Patterns graph.
• Factors
  • Available — A list of all factors generated by the Factor Analysis.
  • Selected — The two selected factors that will be displayed on the Factor Patterns graph.
Factor Analysis - RESULTS - Scree Plot
Unless MLF was specified, a scree plot is generated. On the Factor Analysis dialog, click on
RESULTS and then click on scree plot (note that the RESULTS tab will be grayed-out/disabled
until after the analysis is completed):
Figure 54: Factor Analysis > Results > Scree Plot
A definition of the word scree is a heap of stones or rocky debris, such as at the bottom of a
hill. So in a scree plot the object is to find where the plotted points flatten out, in order to
determine how many Principal Component or Principal Axis factors should be retained in the
factor analysis model (the scree plot does not apply to Maximum Likelihood factor analysis).
The plot shows the eigenvalues of each factor in descending order from left to right. Since the
eigenvalues represent the amount of variance in the original variables that is explained by the
factors, when the eigenvalues flatten out in the plot, the factors they represent add less and
less to the effectiveness of the model.
Tutorial - Factor Analysis
In this example, principal components analysis is performed on a correlation matrix for 21
numeric variables. This reduces the variables to 7 factors using a minimum eigenvalue of 1.
The Scree Plot supports limiting the number of factors to 7 by showing how the eigenvalues
(and thus the explained variance) level off at 7 or above.
Parameterize a Factor Analysis as follows:
• Available Matrices — Customer_Analysis_Matrix
• Selected Variables
  • income
  • age
  • years_with_bank
  • nbr_children
  • female
  • single
  • married
  • separated
  • ccacct
  • ckacct
  • svacct
  • avg_cc_bal
  • avg_ck_bal
  • avg_sv_bal
  • avg_cc_tran_amt
  • avg_cc_tran_cnt
  • avg_ck_tran_amt
  • avg_ck_tran_cnt
  • avg_sv_tran_amt
  • avg_sv_tran_cnt
  • cc_rev
• Analysis Method — Principal Components
• Matrix Type — Correlation
• Minimum Eigenvalue — 1
• Invert signs if majority of matrix values are negative — Enabled
• Rotation Options — None
• Factor Variables — Enabled
• Threshold Percent — 1
• Long Report — Not enabled
Run the analysis, and click on Results when it completes. For this example, the Factor
Analysis generated the following pages. A single click on each page name populates the
Results page with the item.
Table 18: Factor Analysis Report

Number of Variables   21
Minimum Eigenvalue    1
Number of Factors     7
Matrix Type           Correlation
Rotation              None
Table 19: Execution Summary

6/20/2004 1:55:02 PM   Getting Matrix
6/20/2004 1:55:02 PM   Principal Components Analysis Running...
6/20/2004 1:55:02 PM   Creating Report
Table 20: Eigenvalues

Factor 1      4.292
Factor 2      2.497
Factor 3      1.844
Factor 4      1.598
Factor 5      1.446
Factor 6      1.254
Factor 7      1.041
(Factor 8)    .971
(Factor 9)    .926
(Factor 10)   .871
(Factor 11)   .741
(Factor 12)   .693
(Factor 13)   .601
(Factor 14)   .504
(Factor 15)   .437
(Factor 16)   .347
(Factor 17)   .34
(Factor 18)   .253
(Factor 19)   .151
(Factor 20)   .123
(Factor 21)   7.01E-02
Table 21: Principal Component Loadings

Variable Name     Factor 1   Factor 2   Factor 3   Factor 4   Factor 5   Factor 6   Factor 7
age                0.2876    -0.4711     0.1979     0.2615     0.2975     0.3233    -0.2463
avg_cc_bal        -0.7621     0.0131     0.1628    -0.1438     0.3508    -0.1550    -0.0300
avg_cc_tran_amt    0.3716    -0.0318    -0.1360     0.0543    -0.1975     0.0100     0.0971
avg_cc_tran_cnt    0.4704     0.0873    -0.4312     0.5592    -0.0241     0.0133     0.0782
avg_ck_bal         0.5778     0.0527    -0.0981    -0.4598     0.0735    -0.0123    -0.0542
avg_ck_tran_amt    0.7698     0.0386    -0.0929    -0.4535     0.2489     0.0585     0.0190
avg_ck_tran_cnt    0.3127     0.1180    -0.1619    -0.1114     0.5435     0.1845     0.0884
avg_sv_bal         0.3785     0.3084     0.4893     0.0186    -0.0768    -0.0630     0.0517
avg_sv_tran_amt    0.4800     0.4351     0.5966     0.1456    -0.0155     0.0272     0.1281
avg_sv_tran_cnt    0.2042     0.3873     0.4931     0.1144     0.2420     0.0884    -0.0646
cc_rev             0.8377    -0.0624    -0.1534     0.0691    -0.3800     0.1036     0.0081
ccacct             0.2025     0.5213     0.4007     0.3021     0.0499    -0.1988     0.1733
ckacct             0.4007     0.1496    -0.4215     0.5497     0.1127    -0.0818    -0.0086
female            -0.0209     0.1165    -0.1357     0.3119     0.1887    -0.2228    -0.3438
income             0.6992    -0.2888     0.1353    -0.2987    -0.2684     0.0733     0.0310
married            0.0595    -0.7702     0.2674     0.2434     0.1945     0.0873     0.2768
nbr_children       0.2560    -0.4477     0.1238    -0.0895    -0.0739    -0.5642     0.0898
separated          0.3030     0.0692     0.0545    -0.0666    -0.0796    -0.5089    -0.6425
single            -0.2902     0.7648    -0.3004    -0.2010    -0.2120     0.2527     0.0360
svacct             0.4365     0.1616    -0.2592    -0.1705     0.6336    -0.1071     0.0318
years_with_bank    0.0362    -0.0966     0.2120     0.0543    -0.0668     0.5507    -0.5299
Variance
Table 22: Factor Variance to Total Variance Ratio
.665
Table 23: Variance Explained By Factors

Factor     Variance   Percent of Total Variance   Cumulative Percent   Condition Indices
Factor 1   4.2920     20.4383                     20.4383              1.0000
Factor 2   2.4972     11.8914                     32.3297              1.3110
Factor 3   1.8438     8.7800                      41.1097              1.5257
Factor 4   1.5977     7.6082                      48.7179              1.6390
Factor 5   1.4462     6.8869                      55.6048              1.7227
Factor 6   1.2544     5.9735                      61.5782              1.8497
Factor 7   1.0413     4.9586                      66.5369              2.0302
Table 24: Difference

Mean     Standard Deviation   Minimum   Maximum
0.0570   0.0866               0.0000    0.7909
Table 25: Prime Factor Variables

Factor 1          Factor 2   Factor 3          Factor 4          Factor 5          Factor 6          Factor 7
cc_rev            married    avg_sv_tran_amt   avg_cc_tran_cnt   svacct            nbr_children      separated
avg_ck_tran_amt   single     avg_sv_tran_cnt   ckacct            avg_ck_tran_cnt   years_with_bank   female
avg_cc_bal        ccacct     avg_sv_bal        *                 *                 *                 *
income            age        *                 *                 *                 *                 *
avg_ck_bal        *          *                 *                 *                 *                 *
avg_cc_tran_amt   *          *                 *                 *                 *                 *
Pattern Graph
By default, the first twelve variables input to the Factor Analysis, and the first two factors
generated, are displayed on the Factor Patterns graph:
Scree Plot
On the scree plot, all possible factors are shown. In this case, only factors with an eigenvalue
greater than 1 were generated by the Factor Analysis:
Figure 55: Factor Analysis Tutorial: Scree Plot
Linear Regression
Overview
Linear regression is one of the oldest and most fundamental types of analysis in statistics. The
British scientist Sir Francis Galton originally developed it in the latter part of the 19th
century. The term “regression” derives from the nature of his original study in which he found
that the children of both tall and short parents tend to “revert” or “regress” toward average
heights. [Neter] It has also been associated with the work of Gauss and Legendre who used
linear models in working with astronomical data. Linear regression is thought of today as a
special case of generalized linear models, which also includes models such as logit models
(logistic regression), log-linear models and multinomial response models. [McCullagh]
Why build a linear regression model? It is after all one of the simplest types of models that
can be built. Why not start out with a more sophisticated model such as a decision tree? One
reason is that if a simpler model will suffice, it is better than an unnecessarily complex model.
Another reason is to learn about the relationships between a set of observed variables. Is there
in fact a linear relationship between each of the observed variables and the variable to
predict? Which variables help in predicting the target dependent variable? If a linear
relationship does not exist, is there another type of relationship that does? By transforming a
variable, say by taking its exponent or log or perhaps squaring it, and then building a linear
regression model, these relationships can hopefully be seen. In some cases, it may even be
possible to create an essentially non-linear model using linear regression by transforming the
data first. In fact, one of the many sophisticated forms of regression, called piecewise linear
regression, was designed specifically to build nonlinear models of nonlinear phenomena.
Finally, in spite of being a relatively simple type of model, there is a rich set of statistics
available to explore the nature of any linear regression model built.
Multiple Linear Regression
Multiple linear regression analysis attempts to predict, or estimate, the value of a dependent
variable as a linear combination of independent variables, usually with a constant term
included. That is, it attempts to find the b-coefficients in the following equation in order to
best predict the value of the dependent variable y based on the independent variables x1 to xn.
$$\hat{y} = b_0 + b_1 x_1 + \cdots + b_n x_n$$

The best values of the coefficients are defined to be the values that minimize the sum of squared error values

$$\sum (y - \hat{y})^2$$

over all the observations.

Note that this requires that the actual value of y be known for each observation, in order to
contrast it with the predicted value $\hat{y}$. This technique is called "least-squared errors." It
turns out that the b-coefficient values to minimize the sum of squared errors can be solved
using a little calculus and linear algebra. It is worth spending just a little more effort in
describing this technique in order to explain how Teradata Warehouse Miner performs linear
regression analysis. It also introduces the concept of a cross-products matrix and its relatives
the covariance matrix and the correlation matrix that are so important in multivariate
statistical analysis.
In order to minimize the sum of squared errors, the equation for the sum of squared errors is
expanded using the equation for the estimated y value, and then the partial derivatives of this
equation with respect to each b-coefficient are derived and set equal to 0. (This is done in
order to find the minimum with respect to all of the coefficient values). This leads to n
simultaneous equations in n unknowns, which are commonly referred to as the normal
equations. For example:

$$\left(\sum 1 \cdot 1\right) b_0 + \left(\sum 1 \cdot x_1\right) b_1 + \left(\sum 1 \cdot x_2\right) b_2 = \sum 1 \cdot y$$

$$\left(\sum x_1 \cdot 1\right) b_0 + \left(\sum x_1^2\right) b_1 + \left(\sum x_1 x_2\right) b_2 = \sum x_1 y$$

$$\left(\sum x_2 \cdot 1\right) b_0 + \left(\sum x_2 x_1\right) b_1 + \left(\sum x_2^2\right) b_2 = \sum x_2 y$$
The equations above have been presented in a way that gives a hint to how they can be solved
using matrix algebra (i.e., by first computing the extended Sum-of-Squares-and-Cross-Products
(SSCP) matrix for the constant 1 and the variables x1, x2 and y). By doing this one
gets all of the $\sum$ terms in the equation. Teradata Warehouse Miner offers the Build Matrix
function to build the SSCP matrix directly in the Teradata database using generated SQL. The
linear regression module then reads this matrix from metadata results tables and performs the
necessary calculations to solve for the least-squares b-coefficients. Therefore, that part of
constructing a linear regression algorithm that requires access to the detail data is simply the
building of the extended SSCP matrix (i.e., include the constant 1 as the first variable), and
the rest is calculated on the client machine.
There is however much more to linear regression analysis than building a model (i.e.,
calculating the least-squares values of the b-coefficients). Other aspects such as model
diagnostics, stepwise model selection and scoring are described below.
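To make the normal-equation approach concrete, here is a rough Python sketch over a small hypothetical data set: it forms the extended SSCP matrix for the constant 1, x1, x2 and y, and then solves for the b-coefficients. It illustrates the technique only; it is not the product's generated SQL or client-side code.

import numpy as np

# Hypothetical detail data: two independent variables and a dependent variable y
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 10.9])

# Extended matrix of variables, with the constant 1 as the first "variable"
X = np.column_stack([np.ones_like(x1), x1, x2, y])

# Extended SSCP matrix: contains every sum-of-products term in the normal equations
sscp = X.T @ X

# Split out the pieces of the normal equations and solve for the b-coefficients
A = sscp[:3, :3]           # sums of products among 1, x1, x2
c = sscp[:3, 3]            # sums of products of 1, x1, x2 with y
b = np.linalg.solve(A, c)
print(b)                   # b0 (constant), b1, b2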
Model Diagnostics
One of the advantages in using a statistical modeling technique such as linear regression (as
opposed to a machine learning technique, for example) is the ability to compute rigorous,
well-understood measurements of the effectiveness of the model. Most of these
measurements are based upon a huge body of work in the areas of probability and statistical
theory.
Goodness of fit
Several model diagnostics are provided to give an assessment of the effectiveness of the
overall model. One of these is called the residual sums of squares or sum of squared errors
RSS, which is simply the sum of the squared differences between the dependent variable y
estimated by the model and the actual value of y, over all of the rows:

$$RSS = \sum (y - \hat{y})^2$$
Now suppose a similar measure was created based on a naive estimate of y, namely the mean
value $\bar{y}$:

$$TSS = \sum (y - \bar{y})^2$$
often called the total sums of squares about the mean.
Then, a measure of the improvement of the fit given by the linear regression model is given
by:
$$R^2 = \frac{TSS - RSS}{TSS}$$
This is called the squared multiple correlation coefficient R², which has a value between 0
and 1, with 1 indicating the maximum improvement in fit over estimating y naively with the
mean value of y. The multiple correlation coefficient R is actually the correlation between the
real y values and the values predicted based on the independent x variables, sometimes
written $R_{y \cdot x_1 x_2 \cdots x_n}$, which is calculated here simply as the positive square root of the R²
value. A variation of this measure adjusted for the number of observations and independent
variables in the model is given by the adjusted R² value:

$$R_a^2 = 1 - \frac{n-1}{n-p-1}\left(1 - R^2\right)$$

where n is the number of observations and p is the number of independent variables
(substitute n-p in the denominator if there is no constant term).
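As an illustration only (with hypothetical actual and predicted values), the following Python sketch computes RSS, TSS, R² and the adjusted R² as defined above:

import numpy as np

# Hypothetical actual and predicted values of the dependent variable
y     = np.array([3.1, 3.9, 7.2, 7.8, 10.9])
y_hat = np.array([3.0, 4.2, 6.9, 8.1, 10.7])

n, p = len(y), 2                       # p = number of independent variables
rss = np.sum((y - y_hat) ** 2)         # residual sums of squares
tss = np.sum((y - y.mean()) ** 2)      # total sums of squares about the mean
r_squared = (tss - rss) / tss
adj_r_squared = 1 - (n - 1) / (n - p - 1) * (1 - r_squared)
print(r_squared, adj_r_squared)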
The numerator in the equation for R², namely TSS - RSS, is sometimes called the
due-to-regression sums of squares or DRS. Another way of looking at this is that the total
variation about the mean TSS is equal to the variation due to regression DRS plus
the unexplained residual variation RSS. This leads to an equation sometimes known as the
fundamental equation of regression analysis:
$$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$$
This is the same as saying that TSS = DRS + RSS. From these values a statistical test called
an F-test can be made to determine if all the x variables taken together explain a significant
amount of variation in y. This test is carried out on the F-ratio given by:
$$F = \frac{meanDRS}{meanRSS}$$
The values meanDRS and meanRSS are calculated by dividing DRS and RSS by their
respective degrees of freedom (p for DRS and n-p-1 for RSS).
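For illustration only, the following Python sketch computes the F-ratio from hypothetical sums of squares and model dimensions; the final line assumes SciPy is available for the F-distribution probability:

import numpy as np
from scipy.stats import f as f_dist    # assumption: SciPy is installed

# Hypothetical sums of squares and model dimensions
rss, tss = 0.5, 40.0                   # residual and total sums of squares
n, p = 25, 3                           # observations and independent variables

drs = tss - rss                        # due-to-regression sums of squares
mean_drs = drs / p                     # divide by degrees of freedom for DRS
mean_rss = rss / (n - p - 1)           # divide by degrees of freedom for RSS
f_ratio = mean_drs / mean_rss
p_value = f_dist.sf(f_ratio, p, n - p - 1)   # upper-tail probability
print(f_ratio, p_value)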
Standard errors and confidence intervals
Measurements are made of the standard deviation of the sampling distribution of each b-coefficient value, and from this, estimates of a confidence interval for each of the coefficients
are made. For example, if one of the coefficients has a value of 6, and a 95% confidence
interval of 5 to 7, it can be said that the true population coefficient is contained in this
interval, with a confidence coefficient of 95%. In other words, if repeated samples were taken
of the same size from the population, then 95% of the intervals like the one constructed here,
would contain the true value for the population coefficient.
Another set of useful statistics is calculated as the ratio of each b-coefficient value to its
standard error. This statistic is sometimes called a T-statistic or Wald statistic. Along with its
associated t-distribution probability value, it can be used to assess the statistical significance
of this term in the model.
Standardized coefficients
The least-squares estimates of the b-coefficients are converted to so-called beta-coefficients
or standardized coefficients to give a model in terms of the z-scores of the independent
variables. That is, the entire model is recast to use standardized values of the variables and the
coefficients are recomputed accordingly. Standardized values cast each variable into units
measuring the number of standard deviations away from the mean value for that variable. The
advantage of doing this is that the values of the coefficients are scaled equivalently so that
their relative importance in the model can be more easily seen. Otherwise the coefficient for a
variable such as income would be difficult to compare to a variable such as age or the number
of years an account has been open.
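For illustration only, the following Python sketch shows one common way of converting unstandardized b-coefficients into standardized (beta) coefficients by scaling each coefficient by the ratio of its variable's standard deviation to the dependent variable's standard deviation; all values are hypothetical and this is not necessarily the exact procedure used by the product:

import numpy as np

# Hypothetical unstandardized b-coefficients (excluding the constant term)
b = np.array([0.00012, 0.85])          # e.g., an income-like and a tenure-like variable
sd_x = np.array([21586.8, 4.2])        # standard deviations of the independent variables
sd_y = 3.5                             # standard deviation of the dependent variable

# Beta (standardized) coefficients: scale each b by sd(x_k) / sd(y)
beta = b * sd_x / sd_y
print(beta)                            # now directly comparable across variables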
Incremental R-squared
It is possible to calculate the value R2 incrementally by considering the cumulative
contributions of x variables added to the model one at a time, namely $R_{y \cdot x_1}$,
$R_{y \cdot x_1 x_2}$, ..., $R_{y \cdot x_1 x_2 \cdots x_n}$. These are called incremental R² values, and they give a measure
of how much the addition of each x variable contributes to explaining the variation in y in the
observations. This points out the fact that the order in which the independent x variables are
specified in creating the model is important.
Multiple Correlation Coefficients
Another measure that can be computed for each independent variable in the model is the
squared multiple correlation coefficient with respect to the other independent variables in the
model taken together. These values range from 0 to 1 with 0 indicating a lack of correlation
and 1 indicating the maximum correlation.
Multiple correlation coefficients are sometimes presented in related forms such as variance
inflation factors or tolerances. A variance inflation factor is given by the formula:
$$V_k = \frac{1}{1 - R_k^2}$$

where $V_k$ is the variance inflation factor and $R_k^2$ is the squared multiple correlation
coefficient for the kth independent variable. Tolerance is given by the formula $T_k = 1 - R_k^2$,
where $T_k$ is the tolerance of the kth independent variable and $R_k^2$ is as before.
These values may serve as limited indicators of possible collinearity or near
dependencies among variables when correlation values are high, but the absence of high
correlation values does not necessarily indicate the absence of collinearity problems. Further,
multiple correlation coefficients are unable to distinguish between several near dependencies
should they exist. The reader is referred to [Belsley, Kuh and Welsch] for more information
on collinearity diagnostics, as well as to the upcoming section on the subject.
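For illustration only, the following Python sketch converts hypothetical squared multiple correlation coefficients into tolerances and variance inflation factors using the formulas above:

import numpy as np

# Hypothetical squared multiple correlation of each independent variable
# with the other independent variables taken together
r_squared_k = np.array([0.10, 0.65, 0.98])

tolerance = 1 - r_squared_k            # T_k = 1 - R_k^2
vif = 1 / tolerance                    # V_k = 1 / (1 - R_k^2)
print(tolerance, vif)                  # large VIFs may hint at near dependencies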
Data Quality Reports
A variety of data quality reports are available with the Teradata Warehouse Miner Linear
Regression algorithm. Reports include:
1. Constant Variables
2. Variable Statistics
3. Detailed Collinearity Diagnostics
   • Eigenvalues of Unit Scaled X'X
   • Condition Indices
   • Variance Proportions
4. Near Dependency
Constant Variables
Before attempting to build a model the algorithm checks to see if any variables in the model
have a constant value. This check is based on the standard deviation values derived from the
SSCP matrix input to the algorithm. If a variable with a constant value (i.e., a standard
deviation of zero) is detected, the algorithm stops and notifies the user while producing a
Constant Variables Table report. After reading this report, the user may then remove the
variables in the report from the model and execute the algorithm again.
It is possible that a variable may appear in the Constant Variables Table report that does not
actually have a constant value in the data. This can happen when a column has extremely
large values that are close together in value. In this case the standard deviation will appear to
be zero due to precision loss and will be rejected as a constant column. The remedy for this is
to re-scale the values in the column prior to building a matrix or doing the analysis. The Z-Score or the Rescale transformation functions may be used for this purpose.
Variable Statistics
The user may optionally request that a Variable Statistics report be provided, giving the
mean value and standard deviation of each variable in the model based on the SSCP matrix
provided as input.
Detailed Collinearity Diagnostics
One of the conditions that can lead to a poor linear regression model is when the independent
variables in the model are not independent of each other, that is, when they are collinear
(highly correlated) with one another. Collinearity can be loosely defined as a condition where
one variable is nearly a linear combination of one or more other variables, sometimes also
called a near dependency. This leads to an ill conditioned matrix of variables.
Teradata Warehouse Miner provides an optional Detailed Collinearity Diagnostics report
using a specialized technique described in [Belsley, Kuh and Welsch]. This technique
involves performing a singular value decomposition of the independent x variables in the
model in order to measure collinearity.
The analysis proceeds roughly as follows. In order to put all variables on an equal footing, the
data is scaled so that each variable adds up to 1 when summed over all the observations or
rows. In order to calculate the singular values of X (the rows of X are the observations), the
mathematically equivalent square root of the eigenvalues of X'X are computed instead for
practical reasons. The condition index of each eigenvalue is calculated as the square root of
the ratio of the largest eigenvalue to the given eigenvalue, a value always 1 or greater. The
variance decomposition of these eigenvalues is computed using the eigenvalues together with
the eigenvectors associated with them. The result is a matrix giving, for each variable, the
proportion of variance associated with each eigenvalue.
Large condition indices indicate a probable near dependency. A value of 10 may indicate a
weak dependency, values of 15 to 30 may be considered a borderline dependency, above 30
worth investigating further, and above 100, a potentially damaging collinearity. As a rule of
thumb, an eigenvalue with a condition index greater than 30 and an associated variance
proportion of greater than 50% with two or more model variables implies that a collinearity
problem exists. (The somewhat subjective conclusions described here and the experiments
they are based on are described in detail in [Belsley, Kuh and Welsch]).
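A rough Python sketch of this computation, as described above, is shown below; the synthetic data, the unit scaling of columns, and the variance-proportion calculation are assumptions made for the example and are not the product's implementation.

import numpy as np

# Synthetic design matrix X with a deliberate near dependency between two columns
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([np.ones(50), x1, 2.0 * x1 + rng.normal(scale=0.05, size=50)])

# Unit-scale each column (so the squared entries of each column sum to 1)
Xs = X / np.sqrt((X ** 2).sum(axis=0))

# Eigen decomposition of X'X; the eigenvalues are the squared singular values of Xs
eigvals, eigvecs = np.linalg.eigh(Xs.T @ Xs)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Condition index of each eigenvalue
condition_indices = np.sqrt(eigvals.max() / eigvals)

# Variance proportions: for each variable, the share of variance tied to each eigenvalue
phi = (eigvecs ** 2) / eigvals                       # variables x eigenvalues
variance_proportions = phi / phi.sum(axis=1, keepdims=True)
print(condition_indices)
print(variance_proportions)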
An example of the Detailed Collinearity Diagnostics report is given below.
Table 26: Eigenvalues of Unit Scaled X'X

Factor 1   5.2029
Factor 2   .8393
Factor 3   .5754
Factor 4   .3764
Factor 5   4.1612E-03
Factor 6   1.8793E-03
Factor 7   2.3118E-08
Table 27: Condition Indices

Factor 1   1
Factor 2   2.4898
Factor 3   3.007
Factor 4   3.718
Factor 5   35.3599
Factor 6   52.6169
Factor 7   15001.8594
Table 28: Variance Proportions

Variable Name   Factor 1     Factor 2     Factor 3     Factor 4     Factor 5     Factor 6     Factor 7
CONSTANT        1.3353E-09   1.0295E-08   1.3781E-09   1.6797E-08   1.1363E-11   2.1981E-07   1
cust_id         1.3354E-09   1.0296E-08   1.3782E-09   1.6799E-08   1.1666E-11   2.2068E-07   1
income          2.3079E-04   1.8209E-03   1.6879E-03   1.1292E-03   .9951        4.4773E-06   1.2957E-05
age             1.0691E-04   1.9339E-04   9.321E-05    1.7896E-03   1.56E-05     .9963        1.4515E-03
children        2.9943E-03   4.4958E-02   .2361        1.6499E-03   3.6043E-04   .713         9.1708E-04
combo1          2.3088E-04   1.8703E-03   1.6658E-03   1.1339E-03   .995         1.0973E-04   2.3525E-05
combo2          1.4002E-04   3.1477E-05   4.4942E-05   5.0407E-03   4.7784E-06   .9935        1.2583E-03
Near Dependency
In addition to or in place of the Detailed Collinearity Diagnostics report, the user may
optionally request a Near Dependency report based on the automated application of the
specialized criteria used in the aforementioned report. Requesting the Near Dependency
report greatly simplifies the search for collinear variables or near dependencies in the data.
The user may specify the threshold value for the condition index (by default 30) and the
variance proportion (by default 0.5) such that a near dependency is reported. That is, if two or
more variables have a variance proportion greater than the variance proportion threshold, for
a condition index with value greater than the condition index threshold, the variables involved
in the near dependency are reported along with their variance proportions, their means and
their standard deviations. Near dependencies are reported in descending order based on their
condition index value, and variables contributing to a near dependency are reported in
descending order based on their variance proportion.
The following is an example of a Near Dependency report.
Table 29: Near Dependency report (example)

Variable Name   Factor   Condition Index   Variance Proportion   Mean          Standard Deviation
CONSTANT        7        15001.8594        1                     *             *
cust_id         7        15001.8594        1                     1362987.891   293.5012
age             6        52.6169           .9963                 33.744        22.3731
combo2          6        52.6169           .9935                 25.733        23.4274
children        6        52.6169           .713                  .534          1.0029
income          5        35.3599           .9951                 16978.026     21586.8442
combo1          5        35.3599           .995                  33654.602     43110.862
Stepwise Linear Regression
Automated stepwise regression analysis is a technique to aid in regression model selection.
That is, it helps in deciding which independent variables to include in a regression model. If
there are only two or three independent variables under consideration, one could try all
possible models. But since there are 2k - 1 models that can be built from k variables, this
quickly becomes impractical as the number of variables increases (32 variables yield more
than 4 billion models!).
The automated stepwise procedures described below can provide insight into the variables
that should be included in a regression model. It is not recommended that stepwise procedures
be the sole deciding factor in the makeup of a model. For one thing, these techniques are not
guaranteed to produce the best results. And sometimes, variables should be included because
of certain descriptive or intuitive qualities, or excluded for subjective reasons. Therefore an
element of human decision-making is recommended to produce a model with useful business
application.
Forward-Only Stepwise Linear Regression
The forward only procedure consists solely of forward steps as described below, starting
without any independent x variables in the model. Forward steps are continued until no
variables can be added to the model.
Forward Stepwise Linear Regression
The forward stepwise procedure is a combination of the forward and backward steps
described below, starting without any independent x variables in the model. One forward step
is followed by one backward step, and these single forward and backward steps are alternated
until no variables can be added or removed.
Backward-Only Stepwise Linear Regression
The backward only procedure consists solely of backward steps as described below, starting
with all of the independent x variables in the model. Backward steps are continued until no
variables can be removed from the model.
Backward Stepwise Linear Regression
The backward stepwise procedure is a combination of the backward and forward steps as
described below, starting with all of the independent x variables in the model. One backward
step is followed by one forward step, and these single backward and forward steps are
alternated until no variables can be added or removed.
Stepwise Linear Regression - Forward Step
Each forward step seeks to add the independent variable x that will best contribute to
explaining the variance in the dependent variable y. In order to do this a quantity called the
partial F statistic must be computed for each xi variable that can be added to the model. A
quantity called the extra sums of squares is first calculated as follows: ESS = “DRS with xi” - “DRS w/o xi”, where DRS is the Regression Sums of squares or “due-to-regression sums of
squares”. Then, the partial F statistic is given by F(xi) = ESS(xi) / meanRSS(xi), where
meanRSS is the Residual Mean Square. Each forward step then consists of adding the
variable with the largest partial F statistic providing it is greater than the criterion to enter
value.
An equivalent alternative to using the partial F statistic is to use the probability or P-value
associated with the T-statistic mentioned earlier under model diagnostics. The t statistic is the
ratio of the b-coefficient to its standard error. Teradata Warehouse Miner offers both
alternatives as an option. When the P-value is used, a forward step consists of adding the
variable with the smallest P-value providing it is less than the criterion to enter. In this case, if
more than one variable has a P-value of 0, the variable with the largest F statistic is entered.
Stepwise Linear Regression - Backward Step
Each backward step seeks to remove the independent variable xi that least contributes to
explaining the variance in the dependent variable y. The partial F statistic is calculated for
each independent x variable in the model. If the smallest value is less than the criterion to
remove, it is removed.
As with forward steps, an option is provided to use the probability or P-value associated with
the T-statistic, that is, the ratio of the b-coefficient to its standard error. In this case all the
probabilities or P-values are calculated for the variables currently in the model at one time,
and the one with the largest P-value is removed if it is greater than the criterion to remove.
Linear Regression and Missing Data
Null values for columns in a linear regression analysis can adversely affect results. It is
recommended that the listwise deletion option be used when building the input matrix with
the Build Matrix function. This ensures that any row for which one of the columns is null will
be left out of the matrix computations completely. Another strategy is to use the Recoding
transformation function to build a new column, substituting a fixed known value for null
values. Yet another option is to use one of the analytic algorithms in Teradata Warehouse
Miner to estimate replacement values for null values. This technique is often called missing
value imputation.
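Outside of Teradata Warehouse Miner, the two simpler strategies look like this in pandas; the column names and the substitute value of 0 are purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "income": [52000, None, 61000, 47000],
    "age":    [34, 29, None, 41],
    "cc_rev": [120.5, 80.0, 95.0, None],
})

# Listwise deletion: drop any row with a null in a modeling column.
model_cols = ["income", "age", "cc_rev"]
listwise = df.dropna(subset=model_cols)

# Recoding: substitute a fixed known value (here 0) for nulls in one column.
recoded = df.assign(income=df["income"].fillna(0))
```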
Initiate a Linear Regression Function
Use the following procedure to initiate a new Linear Regression analysis in Teradata
Warehouse Miner:
1
Click on the Add New Analysis icon in the toolbar:
Figure 56: Add New Analysis from toolbar
2
In the resulting Add New Analysis dialog box, click on Analytics under Categories and
then under Analyses double-click on Linear Regression:
Figure 57: Add New Analysis dialog
3
This will bring up the Linear Regression dialog in which you will enter INPUT and
OUTPUT options to parameterize the analysis as described in the following sections.
Linear Regression - INPUT - Data Selection
On the Linear Regression dialog click on INPUT and then click on data selection:
Figure 58: Linear Regression > Input > Data Selection
On this screen select:
1
Select Input Source
Users may select between different sources of input, Table, Matrix or Analysis. By
selecting the Input Source Table the user can select from available databases, tables (or
views) and columns in the usual manner. (In this case a matrix will be dynamically built
and discarded when the algorithm completes execution). By selecting the Input Source
Matrix the user can select from available matrices created by the Build Matrix
function. This has the advantage that the matrix selected for input is available for further
analysis after completion of the algorithm, perhaps selecting a different subset of columns
from the matrix.
By selecting the Input Source Analysis the user can select directly from the output of
another analysis of qualifying type in the current project. (In this case a matrix will be
dynamically built and discarded when the algorithm completes execution). Analyses that
may be selected from directly include all of the Analytic Data Set (ADS) and
Reorganization analyses (except Refresh). In place of Available Databases the user may
select from Available Analyses, while Available Tables then contains a list of all the output
tables that will eventually be produced by the selected Analysis. (Note that since this
analysis cannot select from a volatile input table, Available Analyses will contain only
those qualifying analyses that create an output table or view).
2
Select Columns From One Table
•
Available Databases (only for Input Source equal to Table) — All the databases which
are available for the Linear Regression analysis.
•
Available Matrices (only for Input Source equal to Matrix) — When the Input source is
Matrix, a matrix must first be built with the Build Matrix function before linear
regression can be performed. Select the matrix that summarizes the data to be
analyzed. (The matrix must have been built with more rows than selected columns or
the Linear Regression analysis will produce a singular matrix, causing a failure).
•
Available Analyses (only for Input Source equal to Analysis) — All the analyses that
are available for the Linear Regression analysis.
•
Available Tables (only for Input Source equal to Table or Analysis) — All the tables
that are available for the Linear Regression analysis.
•
Available Columns — All the columns that are available for the Linear Regression
analysis.
•
Selected Columns — Select columns by highlighting them and then either dragging and
dropping them into the Selected Columns window, or clicking on the arrow button to move
highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can either insert
columns as Dependent or Independent columns. Make sure you have the correct
portion of the window highlighted. The Dependent variable column is the column
whose value is being predicted by the linear regression model. The algorithm requires
that the Dependent and Independent columns must be of numeric type (or contain
numbers in character format).
Linear Regression - INPUT - Analysis Parameters
On the Linear Regression dialog click on INPUT and then click on analysis parameters:
Figure 59: Linear Regression > Input > Analysis Parameters
On this screen select:
• Regression Options
•
Include Constant — This option specifies that the linear regression model should
include a constant term. With a constant, the linear equation can be thought of as:
ŷ = b0 + b1x1 + … + bnxn
Without a constant, the equation changes to:
ŷ = b1x1 + … + bnxn
•
Stepwise Options — The Linear Regression analysis can use the stepwise technique to
automatically determine a variable’s importance (or lack thereof) to a particular
model. If selected, the algorithm is performed repeatedly with various combinations of
independent variable columns to attempt to arrive at a final “best” model. The
stepwise options are:
Step Direction — (Selecting “None” turns off the Stepwise option).
•
Forward Only — Option to add qualifying independent variables one at a time.
•
Forward — Option for independent variables being added one at a time to an
empty model, possibly removing a variable after a variable is added.
•
Backward Only — Option to remove independent variables one at a time.
•
Backward — Option for variables being removed from an initial model containing
all of the independent variables, possibly adding a variable after a variable is
removed.
Step Method
•
F Statistic — Option to choose the partial F test statistic (F statistic) as the basis for
adding or removing model variables.
•
P-value — Option to choose the probability associated with the T-statistic (P-value) as the basis for adding or removing model variables.
•
Criterion to Enter
•
Criterion to Remove — If the step method is to use the F statistic, then an independent
variable is only added to the model if the F statistic is greater than the criterion to enter
and removed if it is less than the criterion to remove. When the F statistic is used, the
default for each is 3.84.
If the step method is to use the P-value, then an independent variable is added to the
model if the P-value is less than the criterion to enter and removed if it is greater than
the criterion to remove. When the P-value is used, the default for each is 0.05.
The default F statistic criterion of 3.84 corresponds to a P-value of 0.05. These default
values are provided with the assumption that the input variables are somewhat
correlated. If this is not the case, a lower F statistic or higher P-value criterion can be
used. Also, a higher F statistic or lower P-value can be specified if more stringent
criteria are desired for including variables in a model.
•
Report Options — Statistical diagnostics can be taken on each variable during the
execution of the Linear Regression Analysis. These diagnostics include:
•
Variable Statistics — This report gives the mean value and standard deviation of
each variable in the model based on the SSCP matrix provided as input.
•
Near Dependency — This report lists collinear variables or near dependencies in
the data based on the SSCP matrix provided as input.
Condition Index Threshold — Entries in the Near Dependency report are triggered
by two conditions occurring simultaneously. The one that involves this parameter
is the occurrence of a large condition index value associated with a specially
constructed principal factor. If a factor has a condition index greater than this
parameter’s value, it is a candidate for the Near Dependency report. A default
value of 30 is used as a rule of thumb.
Variance Proportion Threshold — Entries in the Near Dependency report are
triggered by two conditions occurring simultaneously. The one that involves this
parameter is when two or more variables have a variance proportion greater than
this threshold value for a factor with a high condition index. Another way of
saying this is that a ‘suspect’ factor accounts for a high proportion of the variance
of two or more variables. This parameter defines what a high proportion of
variance is. A default value of 0.5 is used as a rule of thumb.
•
Detailed Collinearity Diagnostics — This report provides the details behind the
Near Dependency report, consisting of the “Eigenvalues of Unit Scaled X’X”,
“Condition Indices” and “Variance Proportions” tables.
Linear Regression - OUTPUT
On the Linear Regression dialog click on OUTPUT:
Figure 60: Linear Regression > OUTPUT
On this screen select:
• Store the variables table of this analysis in the database — Check this box to store the
model variables table of this analysis in the database.
• Database Name — The name of the database to create the output table in.
• Output Table Name — The name of the output table.
• Advertise Output — The Advertise Output option “advertises” output by inserting
information into one or more of the Advertise Output metadata tables according to the
type of analysis and the options selected in the analysis.
• Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that may be
used to categorize or describe the output.
By way of an example, the tutorial example creates the following output table:
Table 30:
Column Name | B Coefficient | Standard Error | T Statistic | P-Value | Lower | Upper | Standard Coefficient | Incremental R-Squared | SqMultiCorrCoef(1-Tolerance)
nbr_children | 0.8994 | 0.3718 | 2.4187 | 0.0158 | 0.1694 | 1.6294 | 0.0331 | 0.8787 | 0.1312
years_with_bank | 0.2941 | 0.1441 | 2.0404 | 0.0417 | 0.0111 | 0.5771 | 0.0263 | 0.8794 | 0.0168
avg_sv_tran_cnt | -0.7746 | 0.2777 | -2.7887 | 0.0054 | -1.3198 | -0.2293 | -0.036 | 0.8779 | 0.0207
avg_cc_bal | -0.0174 | 0.0004 | -41.3942 | 0 | -0.0182 | -0.0166 | -0.6382 | 0.7556 | 0.3135
ckacct | 10.2793 | 0.8162 | 12.5947 | 0 | 8.677 | 11.8815 | 0.1703 | 0.8732 | 0.1073
income | 0.0005 | 0 | 24.5414 | 0 | 0.0005 | 0.0005 | 0.3777 | 0.8462 | 0.311
married | -4.3056 | 0.8039 | -5.3558 | 0 | -5.8838 | -2.7273 | -0.0718 | 0.8766 | 0.0933
(Constant) | -6.464 | 0.9749 | -6.6301 | 0 | -8.378 | -4.55 | 0 | 0 |
If Database Name is twm_results and Output Table Name is test2, the output table is
defined as:
CREATE SET TABLE twm_results.test2
(
"Column Name" VARCHAR(30) CHARACTER SET UNICODE NOT
CASESPECIFIC,
"B Coefficient" FLOAT,
"Standard Error" FLOAT,
"T Statistic" FLOAT,
"P-Value" FLOAT,
"Lower" FLOAT,
"Upper" FLOAT,
"Standard Coefficient" FLOAT,
"Incremental R-Squared" FLOAT,
"SqMultiCorrCoef(1-Tolerance)" FLOAT)
UNIQUE PRIMARY INDEX ( "Column Name" );
Run the Linear Regression
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Linear Regression
The results of running the Teradata Warehouse Miner Linear Regression analysis include a
variety of statistical reports on the individual variables and generated model, as well as bar charts displaying coefficients and T-statistics. All of these results are outlined below.
Linear Regression - RESULTS
On the Linear Regression dialog, click on RESULTS (note that the RESULTS tab will be
grayed-out/disabled until after the analysis is completed) to view results. Result options are as
follows:
Linear Regression Reports
Data Quality Reports
• Variable Statistics — If selected on the Results Options tab, this report gives the mean
value and standard deviation of each variable in the model based on the SSCP matrix
provided as input.
• Near Dependency — If selected on the Results Options tab, this report lists collinear
variables or near dependencies in the data based on the SSCP matrix provided as input.
Entries in the Near Dependency report are triggered by two conditions occurring
simultaneously. The first is the occurrence of a large condition index value associated
with a specially constructed principal factor. If a factor has a condition index greater than
the parameter specified on the Results Option tab, it is a candidate for the Near
Dependency report. The other is when two or more variables have a variance proportion
greater than a threshold value for a factor with a high condition index. Another way of
saying this is that a ‘suspect’ factor accounts for a high proportion of the variance of two
or more variables. The parameter that defines what constitutes a high proportion of variance is
also set on the Results Options tab; its default value is 0.5.
• Detailed Collinearity Diagnostics — If selected on the Results Options tab, this report
provides the details behind the Near Dependency report, consisting of the following
tables.
•
Eigenvalues of Unit Scaled X'X — Report of the eigenvalues of all variables scaled so
that each variable adds up to 1 when summed over all the observations or rows. In
order to calculate the singular values of X (the rows of X are the observations), the
mathematically equivalent square roots of the eigenvalues of X'X are computed instead
for practical reasons.
•
Condition Indices — The condition index of each eigenvalue, calculated as the square
root of the ratio of the largest eigenvalue to the given eigenvalue, a value always 1 or
greater.
•
Variance Proportions — The variance decomposition of these eigenvalues is computed
using the eigenvalues together with the eigenvectors associated with them. The result
is a matrix giving, for each variable, the proportion of variance associated with each
eigenvalue.
Linear Regression Step N (Stepwise-only)
• Linear Regression Model Assessment
•
Squared Multiple Correlation Coefficient (R-squared) — This is the same value
calculated for the Linear Regression report, but it is calculated here for the model as it
stands at this step. The closer to 1 its value is, the more effective the model.
•
Standard Error of Estimate — This is the same value calculated for the Linear
Regression report, but it is calculated here for the model as it stands at this step.
• In Report — This report contains the same fields as the Variables in Model report
(described below) with the addition of the following field.
•
F Stat — F Stat is the partial F statistic for this variable in the model, which may be
used to decide its inclusion in the model. A quantity called the extra sums of squares is
first calculated as follows: ESS = “DRS with xi” - “DRS w/o xi”, where DRS is the
Regression Sums of squares or “due-to-regression sums of squares”. Then the partial F
statistic is given by F(xi) = ESS(xi) / meanRSS(xi), where meanRSS is the Residual
Mean Square.
• Out Report
•
Independent Variable — This is an independent variable not included in the model at
this step.
•
P-Value — This is the probability associated with the T-statistic associated with each
variable not in, or excluded from, the model, as described for the Variables in Model
report as T Stat and P-Value. (Note that it is not the P-Value associated with F Stat).
When the P-Value is used for step decisions, a forward step consists of adding the
variable with the smallest P-value providing it is less than the criterion to enter. For
backward steps, all the probabilities or P-values are calculated for the variables
currently in the model at one time, and the one with the largest P-value is removed if it
is greater than the criterion to remove.
•
F Stat — F Stat is the partial F statistic for this variable in the model, which may be
used to decide its inclusion in the model. A quantity called the extra sums of squares is
first calculated as follows: ESS = “DRS with xi” - “DRS w/o xi”, where DRS is the
Regression Sums of squares or “due-to-regression sums of squares”. Then the partial F
statistic is given by F(xi) = ESS(xi) / meanRSS(xi), where meanRSS is the Residual
Mean Square.
•
Partial Correlation — The partial correlation coefficient for a variable not in the model
is based on the square root of a measure called the coefficient of partial determination,
which represents the marginal contribution of the variable to a model that doesn’t
include the variable. (Here, contribution to the model means reduction in the
unexplained variation of the dependent variable).
The formula for the partial correlation of the ith independent variable in the linear
regression model built from all the independent variables is given by:
Ri = √( (DRS − NDRS) / RSS )
where DRS is the Regression Sums of squares for the model including those variables
currently in the model, NDRS is the Regression Sums of squares for the current model
without the ith variable, and RSS is the Residual Sums of squares for the current
model.
Linear Regression Model
• Total Observations — This is the number of rows originally summarized in the SSCP
matrix that the linear regression analysis is based on. The number of observations reflects
the row count after any rows were eliminated by listwise deletion (recommended) when
the matrix was built.
• Total Sums of squares — The so-called Total Sums of squares is given by the equation
TSS = Σ(y − ȳ)², where y is the dependent variable that is being predicted and ȳ is its mean
value. The Total Sums of squares is sometimes also called the total sums of squares about
the mean. Of particular interest is its relation to the “due-to-regression sums of squares”
and the “residual sums of squares” given by TSS = DRS + RSS. This is a shorthand form of
what is sometimes known as the fundamental equation of regression analysis:
Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²
where y is the dependent variable, ȳ is its mean value and ŷ is its predicted value. (A small
computational sketch of these model-level statistics appears after this list.)
• Multiple Correlation Coefficient (R) — The multiple correlation coefficient R is the
correlation between the real dependent variable y values and the values predicted based on
the independent x variables, sometimes written R(y·x1x2…xn), which is calculated in
Teradata Warehouse Miner simply as the positive square root of the Squared Multiple
Correlation Coefficient (R2) value.
• Squared Multiple Correlation Coefficient (R-squared) — The squared multiple correlation
coefficient R2 is a measure of the improvement of the fit given by the linear regression
model over estimating the dependent variable y naïvely with the mean value of y. It is
given by:
TSS – RSS
2
R = ---------------------------TSS
where TSS is the Total Sums of squares and RSS is the Residual Sums of squares. It has a
value between 0 and 1, with 1 indicating the maximum improvement in fit over estimating
y naïvely with the mean value of y.
• Adjusted R-squared — The adjusted R2 value is a variation of the Squared Multiple
Correlation Coefficient (R2) that has been adjusted for the number of observations and
independent variables in the model. Its formula is given by:
R²(adjusted) = 1 − ((n − 1) / (n − p − 1)) × (1 − R²)
where n is the number of observations and p is the number of independent variables
(substitute n-p in the denominator if there is no constant term).
• Standard Error of Estimate — The standard error of estimate is calculated as the square
root of the average squared residual value over all the observations, i.e.
√( Σ(y − ŷ)² / (n − p − 1) )
where y is the actual value of the dependent variable, ŷ is its predicted value, n is the
number of observations, and p is the number of independent variables (substitute n-p in
the denominator if there is no constant term).
• Regression Sums of squares — This is the “due-to-regression sums of squares” or DRS
referred to in the description of the Total Sums of squares, where it is pointed out that TSS
= DRS + RSS. It is also the middle term in what is sometimes known as the fundamental
equation of regression analysis:
Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²
where y is the dependent variable, ȳ is its mean value and ŷ is its predicted value.
• Regression Degrees of Freedom — The Regression Degrees of Freedom is equal to the
number of independent variables in the linear regression model. It is used in the
calculation of the Regression Mean-Square.
• Regression Mean-Square — The Regression Mean-Square is simply the Regression Sums
of squares divided by the Regression Degrees of Freedom. This value is also the
numerator in the calculation of the Regression F Ratio.
• Regression F Ratio — A statistical test called an F-test is made to determine if all the
independent x variables taken together explain a statistically significant amount of
variation in the dependent variable y. This test is carried out on the F-ratio given by
F = meanDRS / meanRSS
where meanDRS is the Regression Mean-Square and meanRSS is the Residual Mean-Square. A large value of the F Ratio means that the model as a whole is statistically
significant.
(The easiest way to assess the significance of this term in the model is to check if the
associated Regression P-Value is less than 0.05. However, the critical value of the F Ratio
could be looked up in an F distribution table. This value is very roughly in the range of 1
to 3, depending on the number of observations and variables).
• Regression P-value — This is the probability or P-value associated with the statistical test
on the Regression F Ratio. This statistical F-test is made to determine if all the
independent x variables taken together explain a statistically significant amount of
variation in the dependent variable y. A value close to 0 indicates that they do.
The hypothesis being tested or null hypothesis is that the coefficients in the model are all
zero except the constant term (i.e., all the corresponding independent variables together
contribute nothing to the model). The P-value in this case is the probability that the null
hypothesis is true and the given F statistic has the value it has or smaller. A right tail test
on the F distribution is performed with a 5% significance level used by convention. If the
P-value is less than the significance level (i.e., less than 0.05), the null hypothesis should
be rejected (i.e., the coefficients taken together are significant and not all 0).
• Residual Sums of squares — The residual sums of squares or sum of squared errors RSS
is simply the sum of the squared differences between the dependent variable estimated by
the model and the actual value of y, over all of the rows:
RSS = Σ(y − ŷ)²
• Residual Degrees of Freedom — The Residual Degrees of Freedom is given by n-p-1
where n is the number of observations and p is the number of independent variables (or n-p if there is no constant term). It is used in the calculation of the Residual Mean-Square.
• Residual Mean-Square — The Residual Mean-Square is simply the Residual Sums of
squares divided by the Residual Degrees of Freedom. This value is also the denominator
in the calculation of the Regression F Ratio.
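As referenced above, the model-level quantities in this report follow directly from the observed values y, the fitted values ŷ, the number of observations n and the number of independent variables p. The sketch below is illustrative rather than the product's implementation; it assumes a model with a constant term and that SciPy is available for the F-distribution tail probability.

```python
import numpy as np
from scipy import stats

def model_assessment(y, yhat, p):
    """Model-level linear regression diagnostics for a model with a constant term."""
    n = len(y)
    tss = ((y - y.mean()) ** 2).sum()            # total sums of squares
    rss = ((y - yhat) ** 2).sum()                # residual sums of squares
    drs = tss - rss                              # due-to-regression sums of squares
    r2 = (tss - rss) / tss                       # squared multiple correlation coefficient
    adj_r2 = 1.0 - ((n - 1) / (n - p - 1)) * (1.0 - r2)
    std_err_estimate = np.sqrt(rss / (n - p - 1))
    mean_drs = drs / p                           # regression mean-square
    mean_rss = rss / (n - p - 1)                 # residual mean-square
    f_ratio = mean_drs / mean_rss
    p_value = stats.f.sf(f_ratio, p, n - p - 1)  # right-tail test on the F distribution
    return dict(R2=r2, AdjR2=adj_r2, StdErr=std_err_estimate, F=f_ratio, P=p_value)
```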
Linear Regression Variables in Model Report
• Dependent Variable — The dependent variable is the variable being predicted by the linear
regression model.
• Independent Variable — Each independent variable in the model is listed along with
accompanying measures. Unless the user deselects the option Include Constant on the
Regression Options tab of the input dialog, the first independent variable listed is
CONSTANT, a fixed value representing the constant term in the linear regression model.
• B Coefficient — Linear regression attempts to find the b-coefficients in the
equation ŷ = b0 + b1x1 + … + bnxn in order to best predict the value of the dependent
variable y based on the independent variables x1 to xn. The best values of the coefficients
are defined to be the values that minimize the sum of squared error values Σ(y − ŷ)²
over all the observations. (A sketch computing these per-variable diagnostics appears
after this list.)
• Standard Error — This is the standard error of the B Coefficient term of the linear
regression model, a measure of how accurate the B Coefficient term is over all the
observations used to build the model. It is the basis for estimating a confidence interval
for the B Coefficient value.
• T Statistic — The T-statistic is the ratio of a B Coefficient value to its standard error (Std
Error). Along with the associated t-distribution probability value or P-value, it can be used
to assess the statistical significance of this term in the linear model.
(The easiest way to assess the significance of this term in the model is to check if the P-value is less than 0.05. However, one could look up the critical T Stat value in a two-tailed
T distribution table with probability .95 and degrees of freedom roughly the number of
observations minus the number of variables. This would show that for all practical
purposes, if the absolute value of T Stat is greater than 2 the model term is statistically
significant).
• P-value — This is the t-distribution probability value associated with the T-statistic (T
Stat), that is, the ratio of the b-coefficient value to its standard error (Std Error). It can be
used to assess the statistical significance of this term in the linear model. A value close to
0 implies statistical significance and means this term in the model is important.
The hypothesis being tested or null hypothesis is that the coefficient in the model is
actually zero (i.e., the corresponding independent variable contributes nothing to the
model). The P-value in this case is the probability that the null hypothesis is true and the
given T-statistic has the absolute value it has or smaller. A two-tailed test on the t-distribution is performed with a 5% significance level used by convention. If the P-value
is less than the significance level (i.e., less than 0.05), the null hypothesis should be
rejected (i.e., the coefficient is statistically significant and not 0).
• Squared Multiple Correlation Coefficient (R-squared) — The Squared Multiple Correlation
Coefficient (Rk2) is a measure of the correlation of this, the kth variable with respect to the
other independent variables in the model taken together. (This measure should not be
confused with the R2 measure of the same name that applies to the model taken as a
whole). The value ranges from 0 to 1 with 0 indicating a lack of correlation and 1
indicating the maximum correlation. It is not calculated for the constant term in the
model.
Multiple correlation coefficients are sometimes presented in related forms such as
variance inflation factors or tolerances. The variance inflation factor is given by the
formula:
Vk = 1 / (1 − Rk²)
where Vk is the variance inflation factor and Rk² is the squared multiple correlation
coefficient for the kth independent variable. Tolerance is given by the formula
Tk = 1 − Rk², where Tk is the tolerance of the kth independent variable and Rk²
is as before.
(Refer to the section Multiple Correlation Coefficients for details on the limitations of
using this measure to detect collinearity problems in the data).
• Lower — Lower is the lower value in the confidence interval for this coefficient and is
based on its standard error value. For example, if the coefficient has a value of 6 and a
confidence interval of 5 to 7, it means that according to the normal error distribution
assumptions of the model, there is a 95% probability that the true population value of the
coefficient is actually between 5 and 7.
• Upper — Upper is the upper value in the confidence interval for this coefficient based on
its standard error value. For example, if the coefficient has a value of 6 and a confidence
interval of 5 to 7, it means that according to the normal error distribution assumptions of
the model, there is a 95% probability that the true population value of the coefficient is
actually between 5 and 7.
• Standard Coefficient — Standardized coefficients, sometimes called beta-coefficients,
express the linear model in terms of the z-scores or standardized values of the independent
variables. Standardized values cast each variable into units measuring the number of
standard deviations away from the mean value for that variable. The advantage of
examining standardized coefficients is that they are scaled equivalently, so that their
relative importance in the model can be more easily seen.
• Incremental R-squared — It is possible to calculate the model’s Squared Multiple
Correlation value incrementally by considering the cumulative contributions of x
variables added to the model one at a time, namely R²(y·x1), R²(y·x1x2), …, R²(y·x1x2…xn).
These are called Incremental R2 values, and they give a measure of how much the addition
of each x variable contributes to explaining the variation in y in the observations.
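The per-variable quantities in this report can likewise be sketched from an ordinary least-squares fit. The function below is illustrative only (it assumes a constant term and uses SciPy for the t-distribution); it is not the SQL generated by the product.

```python
import numpy as np
from scipy import stats

def coefficient_diagnostics(X, y):
    """Per-variable diagnostics for an OLS model with a constant term (illustrative only)."""
    n, p = X.shape
    Xc = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ b
    mean_rss = (resid ** 2).sum() / (n - p - 1)
    cov_b = mean_rss * np.linalg.inv(Xc.T @ Xc)        # covariance of the b-coefficients
    se = np.sqrt(np.diag(cov_b))                       # standard errors
    t_stat = b / se
    p_value = 2 * stats.t.sf(np.abs(t_stat), n - p - 1)
    t_crit = stats.t.ppf(0.975, n - p - 1)
    lower, upper = b - t_crit * se, b + t_crit * se    # 95% confidence intervals
    # Standardized (beta) coefficients for the non-constant terms.
    beta = b[1:] * X.std(axis=0, ddof=1) / y.std(ddof=1)
    return b, se, t_stat, p_value, lower, upper, beta
```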
Linear Regression Graphs
The Linear Regression Analysis can display the coefficients and/or T-statistics of the resultant
model.
Weights Graph
This graph displays the relative magnitudes of the standardized coefficients and/or the T-statistic associated with each standardized coefficient in the linear regression model. The
sign, positive or negative, is portrayed by the colors red or blue respectively. The user may
scroll to the left or right to see all the variables in the model. The T-statistic is the ratio of the
coefficient value to its standard error, so the larger its value the more reliable the value of the
coefficient is.
The following options are available on the Graphics Options tab on the Linear Weights graph:
• Graph Type — The following can be graphed by the Linear Weights Graph
•
T Statistic — Display the T Statistics on the bar chart.
•
Standardized Coefficient — Display the Standardized Coefficients on the bar chart.
• Vertical Axis — The user may request multiple vertical axes in order to display separate
coefficient values that are orders of magnitude different from the rest of the values. If the
coefficients are of roughly the same magnitude, this option is grayed out.
•
Single — Display the Standardized Coefficients or T Statistics on single axis on the
bar chart.
•
Multiple — Display the Standardized Coefficients or T Statistics on dual axes on the
bar chart.
Tutorial - Linear Regression
Parameterize a Linear Regression Analysis as follows:
• Available Matrices — Customer_Analysis_Matrix
• Dependent Variable — cc_rev
• Independent Variables — income, age, years_with_bank, nbr_children, female, single, married, separated, ccacct, ckacct, svacct, avg_cc_bal, avg_ck_bal, avg_sv_bal, avg_cc_tran_amt, avg_cc_tran_cnt, avg_ck_tran_amt, avg_ck_tran_cnt, avg_sv_tran_amt, avg_sv_tran_cnt
• Include Constant — Enabled
• Step Direction — Forward
• Step Method — F Statistic
• Criterion to Enter — 3.84
• Criterion to Remove — 3.84
Run the analysis, and click on Results when it completes. For this example, the Linear
Regression Analysis generated the following pages. A single click on each page name
populates Results with the item.
Table 31: Linear Regression Report
Total Observations: 747
Total Sum of Squares: 6.69E5
Multiple Correlation Coefficient (R): 0.9378
Squared Multiple Correlation Coefficient (1-Tolerance): 0.8794
Adjusted R-Squared: 0.8783
Standard Error of Estimate: 1.04E1
Table 32: Regression vs. Residual
 | Sum of Squares | Degrees of Freedom | Mean-Square | F Ratio | P-value
Regression | 5.88E5 | 7 | 8.40E4 | 769.8872 | 0.0000
Residual | 8.06E4 | 739 | 1.09E2 | N/A | N/A
Table 33: Execution Status
6/20/2004 2:07:28 PM | Getting Matrix
6/20/2004 2:07:28 PM | Stepwise Regression Running...
6/20/2004 2:07:28 PM | Step 0 Complete
6/20/2004 2:07:28 PM | Step 1 Complete
6/20/2004 2:07:28 PM | Step 2 Complete
6/20/2004 2:07:28 PM | Step 3 Complete
6/20/2004 2:07:28 PM | Step 4 Complete
6/20/2004 2:07:28 PM | Step 5 Complete
6/20/2004 2:07:28 PM | Step 6 Complete
6/20/2004 2:07:28 PM | Step 7 Complete
6/20/2004 2:07:29 PM | Creating Report
Table 34: Variables
Column Name | B Coefficient | Standard Error | T Statistic | P-value | Lower | Upper | Standard Coefficient | Incremental R-Squared | Squared Multiple Correlation Coefficient (1-Tolerance)
(Constant) | -6.4640 | 0.9749 | -6.6301 | 0.0000 | -8.3780 | -4.5500 | 0.0000 | 0.0000 | N/A
avg_cc_bal | -0.0174 | 0.0004 | -41.3942 | 0.0000 | -0.0182 | -0.0166 | -0.6382 | 0.7556 | 0.3135
income | 0.0005 | 0.0000 | 24.5414 | 0.0000 | 0.0005 | 0.0005 | 0.3777 | 0.8462 | 0.3110
ckacct | 10.2793 | 0.8162 | 12.5947 | 0.0000 | 8.6770 | 11.8815 | 0.1703 | 0.8732 | 0.1073
married | -4.3056 | 0.8039 | -5.3558 | 0.0000 | -5.8838 | -2.7273 | -0.0718 | 0.8766 | 0.0933
avg_sv_tran_cnt | -0.7746 | 0.2777 | -2.7887 | 0.0054 | -1.3198 | -0.2293 | -0.0360 | 0.8779 | 0.0207
nbr_children | 0.8994 | 0.3718 | 2.4187 | 0.0158 | 0.1694 | 1.6294 | 0.0331 | 0.8787 | 0.1312
years_with_bank | 0.2941 | 0.1441 | 2.0404 | 0.0417 | 0.0111 | 0.5771 | 0.0263 | 0.8794 | 0.0168
Step 0
Table 35: Out
Independent Variable | P-value | F Stat
age | 0.0000 | 19.7680
avg_cc_bal | 0.0000 | 2302.7983
avg_cc_tran_amt | 0.0000 | 69.5480
avg_cc_tran_cnt | 0.0000 | 185.3197
avg_ck_bal | 0.0000 | 116.5094
avg_ck_tran_amt | 0.0000 | 271.3578
avg_ck_tran_cnt | 0.0002 | 13.9152
avg_sv_bal | 0.0000 | 37.8598
avg_sv_tran_amt | 0.0000 | 76.1104
avg_sv_tran_cnt | 0.7169 | 0.1316
ccacct | 0.1754 | 1.8399
ckacct | 0.0000 | 105.5843
female | 0.5404 | 0.3751
income | 0.0000 | 647.3239
married | 0.8937 | 0.0179
nbr_children | 0.0000 | 30.2315
separated | 0.0000 | 28.7618
single | 0.0000 | 17.1850
svacct | 0.0001 | 15.7289
years_with_bank | 0.1279 | 2.3235
Step 1
Table 36: Model Assessment
Squared Multiple Correlation Coefficient (1-Tolerance) | 0.7556
Standard Error of Estimate | 14.8111
Table 37: Columns In (Part 1)
Independent Variable | B Coefficient | Standard Error | T Statistic | P-value
avg_cc_bal | -0.0237 | 0.0005 | -47.9875 | 0.0000
Table 38: Columns In (Part 2)
Independent Variable | B Coefficient | Lower | Upper | F Stat | Incremental R2
avg_cc_bal | -0.0237 | -0.0247 | -0.0227 | 2302.7983 | 0.7556
Table 39: Columns In (Part 3)
Independent Variable | B Coefficient | Standard Coefficient | Squared Multiple Correlation Coefficient (1-Tolerance)
avg_cc_bal | -0.0237 | -0.8692 | 0.0000
Table 40: Columns Out
Independent Variable | P-value | F Stat | Partial Correlation
age | 0.0539 | 3.7287 | 0.0708
avg_cc_tran_amt | 0.0000 | 27.4695 | 0.1921
avg_cc_tran_cnt | 0.2346 | 1.4153 | 0.0436
avg_ck_bal | 0.0000 | 17.1826 | 0.1520
avg_ck_tran_amt | 0.0000 | 94.9295 | 0.3572
avg_ck_tran_cnt | 0.4712 | 0.5198 | 0.0264
avg_sv_bal | 0.0083 | 6.9952 | 0.0970
avg_sv_tran_amt | 0.0164 | 5.7848 | 0.0882
avg_sv_tran_cnt | 0.1314 | 2.2807 | 0.0554
ccacct | 0.8211 | 0.0512 | 0.0083
ckacct | 0.0000 | 41.3084 | 0.2356
female | 0.3547 | 0.8575 | 0.0340
income | 0.0000 | 438.7799 | 0.7680
married | 0.4812 | 0.4967 | 0.0258
nbr_children | 0.0000 | 30.4645 | 0.2024
separated | 0.0004 | 12.8680 | 0.1315
single | 0.0024 | 9.3169 | 0.1119
svacct | 0.0862 | 2.9523 | 0.0630
years_with_bank | 0.3407 | 0.9090 | 0.0350
Linear Weights Graph
By default, the Linear Weights graph displays the relative magnitudes of the T-statistic
associated with each coefficient in the linear regression model:
Figure 61: Linear Regression Tutorial: Linear Weights Graph
Select the Graphics Options tab and change the Graph Type to Standardized Coefficient to view
the standardized coefficient values.
Although not generated automatically, a Scatter Plot is useful for analyzing the model built
with the Linear Regression analysis. As an example, a scatter plot is brought up to look at the
dependent variable (“cc_rev”), with the first two independent variables that made it into the
model (“avg_cc_bal,” “income”). Create a new Scatter Plot analysis, and pick these three
variables in the Selected Tables and Columns option. The results are shown first in two
dimensions (avg_cc_bal and cc_rev), and then with all three:
Figure 62: Linear Regression Tutorial: Scatter Plot (2d)
Figure 63: Linear Regression Tutorial: Scatter Plot (3d)
Logistic Regression
Overview
In many types of regression problems, the response variable or dependent variable to be
predicted has only two possible outcomes. For example, will the customer buy the product in
response to the promotion or not? Is the transaction fraudulent or not? Will the customer close
their account or not? There are many examples of business problems with only two possible
outcomes. Unfortunately the linear regression model comes up short in finding solutions to
this type of problem. It is worth trying to understand what these shortcomings are and how the
logistic regression model is an improvement when predicting a two-valued response variable.
When the response variable y has only two possible values, which may be coded as a 0 and 1,
the expected value of yi, E(yi), is actually the probability that the value will be 1. The error
term for a linear regression model for a two-valued response function also has only two
possible values, so it doesn't have a normal distribution or constant variance over the values
of the independent variables. Finally, the regression model can produce a value that doesn't
fall within the necessary constraint of 0 to 1. What would be better would be to compute a
continuous probability function between 0 and 1. In order to achieve this continuous
probability function, the usual linear regression expression b0 + b1x1 + ... + bnxn is
transformed using a function called a logit transformation function. This function is an
example of a sigmoid function, so named because it looks like a sigma or 's' when plotted. It is
of course the logit transformation function that gives rise to the term logistic regression.
The type of logistic regression model that Teradata Warehouse Miner supports is one with a
two-valued dependent variable, referred to as a binary logit model. However, Teradata
Warehouse Miner is capable of coding values for the dependent variable so that the user is not
required to code their dependent variable to two distinct values. The user can choose which
values to represent as the response value (i.e., 1 or TRUE) and all others will be treated as non-response values (i.e., 0 or FALSE). Even though values other than 1 and 0 are supported in
the dependent variable, throughout this section the dependent variable response value is
represented as 1 and the non-response value as 0 for ease of reading.
The primary sources of information and formulae in this section are [Hosmer] and [Neter].
Logit model
The logit transformation function is chosen because of its mathematical power and simplicity,
and because it lends an intuitive understanding to the coefficients eventually created in the
model. The following equations describe the logistic regression model, with π(x) being the
probability that the dependent variable is 1, and g(x) being the logit transformation:
π(x) = e^(b0 + b1x1 + … + bnxn) / (1 + e^(b0 + b1x1 + … + bnxn))
g(x) = ln( π(x) / (1 − π(x)) ) = b0 + b1x1 + … + bnxn
Notice that the logit transformation g(x) has linear parameters (b-values) and may be
continuous with unrestricted range. Using these functions, a binomial error distribution is
found with y = π(x) + ε. The solution to a logistic regression model is to find the b-values
that “best” predict the dichotomous y variable based on the values of the numeric x variables.
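As a minimal illustration only, the two functions above can be written directly in NumPy; the function names pi and logit are illustrative, and b[0] is assumed to hold the constant term.

```python
import numpy as np

def pi(x, b):
    """P(y = 1) under the logistic model; b[0] is the constant term b0."""
    z = b[0] + np.dot(x, b[1:])              # b0 + b1*x1 + ... + bn*xn
    return np.exp(z) / (1.0 + np.exp(z))

def logit(p):
    """The logit transformation g = ln(p / (1 - p))."""
    return np.log(p / (1.0 - p))
```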
Maximum likelihood
In linear regression analysis it is possible to use a least-squares approach to finding the best b-values in the linear regression equation. The least-squared error approach leads to a set of n
normal equations in n unknowns that can be solved for directly. But that approach does not
work here for logistic regression. Suppose any b-values are selected and the question is asked
what is the likelihood that they match the logistic distribution defined, using statistical
principles and the assumption that errors have a normal probability distribution. This
technique of picking the most likely b-values that match the observed data is known as a
maximum likelihood solution. In the case of linear regression, a maximum likelihood solution
turns out to be mathematically equivalent to a least squares solution. But here maximum
likelihood must be used directly.
For convenience, compute the natural logarithm of the likelihood function so that it is
possible to convert the product of likelihoods into a sum, which is easier to work with. The
log likelihood equation for a given vector B of b-values with v x-variables is given by:
ln L(b0, …, bv) = Σ(i=1..n) yi(B'X) − Σ(i=1..n) ln(1 + exp(B'X))
where
B'X = b0 + b1x1 + ... + bvxv.
By differentiating this equation with respect to the constant term b0 and with respect to the
variable terms bi, the likelihood equations are derived:
Σ(i=1..n) (yi − π(xi)) = 0
and
Σ(i=1..n) xi (yi − π(xi)) = 0
where
π(xi) = exp(B'X) / (1 + exp(B'X))
The log likelihood equation is not linear in the unknown b-value parameters, so it must be
solved using non-linear optimization techniques described below.
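For illustration, the log likelihood and its gradient (the left-hand sides of the likelihood equations above) can be written directly from these formulas. The sketch below is not the product's implementation; it assumes X holds the x-variables without a constant column and y holds 0/1 responses.

```python
import numpy as np

def log_likelihood(b, X, y):
    """Log likelihood of the binary logit model; b[0] is the constant term."""
    z = b[0] + X @ b[1:]                         # B'X for every observation
    return np.sum(y * z) - np.sum(np.log1p(np.exp(z)))

def score(b, X, y):
    """Gradient of the log likelihood (the likelihood equations set this to zero)."""
    z = b[0] + X @ b[1:]
    p = 1.0 / (1.0 + np.exp(-z))
    resid = y - p
    return np.concatenate([[resid.sum()], X.T @ resid])
```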
Computational technique
Unlike with linear regression, logistic regression calculations cannot be based on an SSCP
matrix. Teradata Warehouse Miner therefore dynamically generates SQL to perform the
calculations required to solve the model, produce model diagnostics, produce success tables,
and to score new data with a model once it is built. However, to enhance performance with
small data sets, Teradata Warehouse Miner provides an optional in-memory calculation
feature (that is also helpful when one of the stepwise options is used). This feature selects the
data into the client system’s memory if it will fit into a user-specified maximum memory
amount. The maximum amount of memory in megabytes to use is specified on the expert
options tab of the analysis input screen. The user can adjust this value according to their
workstation and network requirements. Setting this amount to zero will disable the feature.
Teradata Warehouse Miner offers two optimization techniques for logistic regression, the
default method of iteratively reweighted least squares (RLS), equivalent to the Gauss-Newton
technique, and the quasi-Newton method of Broyden-Fletcher-Goldfarb-Shanno (BFGS). The
RLS method is considerably faster than the BFGS method unless there are a large number of
columns (RLS grows in complexity roughly as the square of the number of columns). Having
a choice between techniques can be useful for more than performance reasons however, since
there may be cases where one or the other technique has better convergence properties.
You may specify your choice of technique, or allow Teradata Warehouse Miner to
automatically select it for you. With the automatic option the program will select RLS if there
are less than 35 independent variable columns; otherwise it will select BFGS.
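A much-simplified sketch of the reweighted least squares idea (a Newton/Gauss-Newton iteration) is shown below. It is purely illustrative and ignores the in-memory versus SQL considerations, convergence safeguards and the automatic method selection described above; the function name and defaults are assumptions.

```python
import numpy as np

def irls_logit(X, y, iterations=25, tol=1e-8):
    """Fit a binary logit model by iteratively reweighted least squares."""
    Xc = np.column_stack([np.ones(len(y)), X])   # add the constant term
    b = np.zeros(Xc.shape[1])
    for _ in range(iterations):
        z = Xc @ b
        p = 1.0 / (1.0 + np.exp(-z))
        w = p * (1.0 - p)                        # working weights
        # Weighted least squares step: (X'WX) delta = X'(y - p)
        delta = np.linalg.solve(Xc.T @ (Xc * w[:, None]), Xc.T @ (y - p))
        b += delta
        if np.max(np.abs(delta)) < tol:
            break
    return b
```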
Logistic Regression Model Diagnostics
Logistic regression has counterparts to many of the same model diagnostics available with
linear regression. In a similar manner to linear regression, these diagnostics provide a
mathematically sound way to evaluate a model built with logistic regression.
Standard errors and statistics
As is the case with linear regression, measurements are made of the standard error associated
with each b-coefficient value. Similarly, the T-statistic or Wald statistic as it is also called, is
calculated for each b-coefficient as the ratio of the b-coefficient value to its standard error.
Along with its associated t-distribution probability value, it can be used to assess the
statistical significance of this term in the model.
The computation of the standard errors of the coefficients is based on a matrix called the
information matrix or Hessian matrix. This matrix is the matrix of second order partial
derivatives of the log likelihood function with respect to all possible pairs of the coefficient
values. The formula for the “j, k” element of the information matrix is:
∂²L(B) / ∂bj∂bk = − Σ(i=1..n) xij xik πi(1 − πi)
where
πi = π(xi) = exp(B'X) / (1 + exp(B'X))
Unlike the case with linear regression, confidence intervals are not computed directly on the
standard error values, but on something called the odds ratios, described below.
Odds ratios and confidence intervals
In linear regression, the meaning of each b-coefficient in the model can be thought of as the
amount the dependent y variable changes when the corresponding independent x variable
changes by 1. Because of the logit transformation, however, the meaning of each b-coefficient
in a logistic regression model is not so clear. In a logistic regression model, the increase of an
x variable by 1 implies a change in the odds that the outcome y variable will be 1 rather than
0.
Looking back at the formula for the logit response function:
g(x) = ln( π(x) / (1 − π(x)) ) = b0 + … + bnxn
it is evident that the response function is actually the log of the odds that the response is 1,
where π(x) is the probability that the response is 1 and 1 − π(x) is the probability that the
response is 0. Now suppose that one of the x variables, say xj, varies by 1. Then the response
function will vary by bj. This can be written as g(x0...xj + 1...xn) - g(x0...xj...xn) = bj. But it
could also be written as:
ln  odds j + 1 
ln  odds j + 1  – ln  odds j  = ------------------------------- = b j
odds j
Therefore
odds j + 1
-------------------- = exp  b j 
odds j
the formula for the odds ratio of the coefficient bj . By taking the exponent of a b-coefficient,
one gets the odds ratio that is the factor by which the odds change due to a unit increase in xj.
Because this odds ratio is the value that has more meaning, confidence intervals are
calculated on odds ratios for each of the coefficients rather than on the coefficients
themselves. The confidence interval is computed based on a 95% confidence level and a two-tailed normal distribution.
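Given the coefficients and their standard errors (for example from the sketch in the previous subsection), the odds ratios and their confidence intervals follow directly; the multiplier 1.96 below corresponds to a two-tailed 95% normal interval, and the function name is illustrative.

```python
import numpy as np

def odds_ratios_with_ci(b, se, z=1.96):
    """Odds ratios and 95% confidence intervals for each b-coefficient."""
    odds = np.exp(b)
    lower = np.exp(b - z * se)
    upper = np.exp(b + z * se)
    return odds, lower, upper
```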
Logistic Regression Goodness of fit
In linear regression one of the key measures associated with goodness of fit is the residual
sums of squares RSS. An analogous measure for logistic regression is a statistic sometimes
called the deviance. Its value is based on the ratio of the likelihood of a given model to the
likelihood of a perfectly fitted or saturated model and is given by D = -2ln(ModelLH /
SatModelLH). This can be rewritten D=-2LM + 2LS in terms of the model log likelihood and
the saturated model log likelihood. Looking at the data as a set of n independent Bernoulli
observations, LS is actually 0, so that D = -2LM. Two models can be contrasted by taking the
difference between their deviance values, which leads to a statistic G = D1 - D2 = -2(L1 - L2).
This is similar to the numerator in the partial F test in linear regression, the extra sums of
squares or ESS mentioned in the section on linear regression.
In order to get an assessment of the utility of the independent model terms taken as a whole,
the deviance difference statistic is calculated for the model with a constant term only versus
the model with all variables fitted. This statistic is then G = -2(L0 - LM). LM is calculated
using the log likelihood formula given earlier. L0, the log likelihood of the constant only
model with n observations is given by:
L0 = (Σy) ln(Σy) + (n − Σy) ln(n − Σy) − n ln(n)
G follows a chi-square distribution with “variables minus one” degrees of freedom, and as
such provides a probability value to test whether all the x-term coefficients should in fact be
zero.
Finally, there are a number of pseudo R-squared values that have been suggested in the
literature. These are not truly speaking goodness of fit measures, but can nevertheless be
useful in assessing the model. Teradata Warehouse Miner provides one such measure
suggested by McFadden as (L0 - LM) / L0. [Agresti]
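The deviance, the likelihood ratio statistic G and McFadden's pseudo R-squared can be sketched as follows, reusing a fitted-model log likelihood such as the one sketched earlier; this is illustrative only and assumes y holds 0/1 responses with at least one of each value.

```python
import numpy as np

def goodness_of_fit(log_likelihood_model, y):
    """Deviance, likelihood ratio statistic G and McFadden's pseudo R-squared."""
    n, ones = len(y), float(y.sum())
    l0 = ones * np.log(ones) + (n - ones) * np.log(n - ones) - n * np.log(n)
    lm = log_likelihood_model                    # log likelihood of the fitted model
    deviance = -2.0 * lm                         # D = -2 * LM (saturated log likelihood is 0)
    g = -2.0 * (l0 - lm)                         # constant-only model vs. fitted model
    mcfadden = (l0 - lm) / l0                    # McFadden pseudo R-squared
    return deviance, g, mcfadden
```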
Logistic Regression Data Quality Reports
The same data quality reports optionally available for linear regression are also available
when performing logistic regression. Since an SSCP matrix is not used in the logistic
regression algorithm, additional internal processing is needed to produce data quality reports,
especially for the Near Dependency report and the Detailed Collinearity Diagnostics report.
Stepwise Logistic Regression
Automated stepwise regression procedures are available for logistic regression to aid in
model selection just as they are for linear regression. The procedures are in fact very similar
to those described for linear regression. As such an attempt will be made to highlight the
similarities and differences in the descriptions below.
As is the case with stepwise linear regression, the automated stepwise procedures described
below can provide insight into the variables that should be included in a logistic regression
model. An element of human decision-making however is recommended in order to produce
a model with useful business application.
Forward-Only Stepwise Logistic Regression
The forward only procedure consists solely of forward steps as described below, starting
without any independent x variables in the model. Forward steps are continued until no
variables can be added to the model.
Forward Stepwise Logistic Regression
The forward stepwise procedure is a combination of the forward and backward steps always
done in pairs, as described below, starting without any independent x variables in the model.
One forward step is always followed by one backward step, and these single forward and
backward steps are alternated until no variables can be added or removed. Additional checks
are made after each step to see if the same variables exist in the model as existed after a
previous step in the same direction. When this condition is detected in both the forward and
backward directions the algorithm will also terminate.
Backward-Only Stepwise Logistic Regression
The backward only procedure consists solely of backward steps as described below, starting
with all of the independent x variables in the model. Backward steps are continued until no
variables can be removed from the model.
Backward Stepwise Logistic Regression
The backward stepwise procedure is a combination of the backward and forward steps always
done in pairs, as described below, starting with all of the independent x variables in the
model. One backward step is followed by one forward step, and these single backward and
forward steps are alternated until no variables can be added or removed. Additional checks
are made after each step to see if the same variables exist in the model as existed after a
previous step in the same direction. When this condition is detected in both the backward and
forward directions the algorithm will also terminate.
Stepwise Logistic Regression - Forward step
In stepwise linear regression the partial F statistic, or the analogous T-statistic probability
value, is computed separately for each variable outside the model, adding each of them into
the model one at a time. The analogous procedure for logistic regression would consist of
computing the likelihood ratio statistic G, described in the Goodness of Fit section, for each
variable outside the model, selecting the variable that results in the largest G value when
added to the model. In the case of logistic regression however this becomes an expensive
proposition because the solution of the model for each variable requires another iterative
maximum likelihood solution, contrasted to the more rapidly achieved closed form solution
available in linear regression.
What is needed is a statistic that can be calculated without requiring an additional maximum
likelihood solution. Teradata Warehouse Miner uses such a statistic proposed by Peduzzi,
Hardy and Holford that they call a W statistic. This statistic is comparatively inexpensive to
compute for each variable outside the model and is therefore expedient to use as a criterion
for selecting a variable to add to the model. The W statistic is assumed to follow a chi square
distribution with one degree of freedom due to its similarity to other statistics, and it gives
evidence of behaving similarly to the likelihood ratio statistic. Therefore, the variable with
the smallest chi square probability or P-value associated with its W statistic is added to the
model in a forward step if the P-value is less than the criterion to enter. If more than one
variable has a P-value of 0, then the variable with the largest W statistic is entered. For more
information, refer to [Peduzzi, Hardy and Holford].
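As an illustration of the selection rule only (the computation of the W statistic itself is internal to Teradata Warehouse Miner and is not shown), the following Python sketch assumes a hypothetical set of precomputed W statistics for the candidate columns and applies the chi-square test on one degree of freedom:

```python
# Illustrative sketch only: choosing the next variable to enter, given
# hypothetical precomputed W statistics for the candidate columns.
from scipy.stats import chi2

def select_variable_to_enter(w_stats, criterion_to_enter=0.05):
    """w_stats: dict mapping candidate column name to its W statistic."""
    # Chi-square P-value (one degree of freedom) for each W statistic.
    p_values = {col: chi2.sf(w, df=1) for col, w in w_stats.items()}
    # Smallest P-value wins; a tie at P-value 0 is broken by the largest W.
    best = min(p_values, key=lambda col: (p_values[col], -w_stats[col]))
    if p_values[best] < criterion_to_enter:
        return best, p_values[best]
    return None, None  # no candidate qualifies, so forward stepping stops

# Example with made-up W statistic values:
print(select_variable_to_enter({"income": 1.99, "avg_sv_bal": 85.51, "age": 1.95}))
```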
Stepwise Logistic Regression - Backward step
Each backward step seeks to remove those variables that have statistical significance below a
certain level. This is done by first fitting the model with the currently selected variables,
including the calculation of the probability or P-value associated with the T-statistic for each
variable, which is the ratio of the b-coefficient to its standard error. The variable with the
largest P-value is removed if it is greater than the criterion to remove.
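A comparable sketch of the removal rule, again purely illustrative; the coefficients, standard errors and degrees of freedom passed in are assumptions supplied by the caller rather than values produced here:

```python
# Illustrative sketch of one backward step: remove the variable whose
# T-statistic (b-coefficient / standard error) has the largest two-sided
# P-value, provided that P-value exceeds the criterion to remove.
from scipy.stats import t as t_dist

def select_variable_to_remove(coefficients, std_errors, df, criterion_to_remove=0.05):
    p_values = {col: 2.0 * t_dist.sf(abs(coefficients[col] / std_errors[col]), df)
                for col in coefficients}
    worst = max(p_values, key=p_values.get)
    if p_values[worst] > criterion_to_remove:
        return worst, p_values[worst]
    return None, None  # every variable clears the criterion; nothing is removed
```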
Logistic Regression and Missing Data
Null values for columns in a logistic regression analysis can adversely affect results, so
Teradata Warehouse Miner ensures that listwise deletion is effectively performed with logistic
regression. This ensures that any row for which one of the independent or dependent variable
columns is null will be left out of computations completely. Additionally, the Recode
transformation function can be used to build a new column, substituting a fixed known value
for null.
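The difference between the two treatments of nulls can be pictured with a small pandas sketch; the column names and substitute values here are hypothetical and do not come from the product:

```python
import pandas as pd

# Hypothetical model columns with some null values.
df = pd.DataFrame({"ccacct": [1, 0, 1, None],
                   "income": [52000, None, 61000, 48000],
                   "age":    [34, 45, None, 29]})

# Listwise deletion: any row with a null in a model column is left out.
model_columns = ["ccacct", "income", "age"]
listwise = df.dropna(subset=model_columns)

# Recode-style alternative: substitute a fixed known value for null first.
recoded = df.fillna({"income": 0, "age": 0})

print(len(df), len(listwise), len(recoded))   # 4 rows, 1 row, 4 rows
```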
Initiate a Logistic Regression Function
Use the following procedure to initiate a new Logistic Regression analysis in Teradata
Warehouse Miner:
1  Click on the Add New Analysis icon in the toolbar:
Figure 64: Add New Analysis from toolbar
2  In the resulting Add New Analysis dialog box, click on Analytics under Categories and
then under Analyses double-click on Logistic Regression:
Figure 65: Add New Analysis dialog
3  This will bring up the Logistic Regression dialog in which you will enter INPUT and
OUTPUT options to parameterize the analysis as described in the following sections.
Logistic Regression - INPUT - Data Selection
On the Logistic Regression dialog click on INPUT and then click on data selection:
Figure 66: Logistic Regression > Input > Data Selection
On this screen select:
1  Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis. (Note that since this analysis cannot
select from a volatile input table, Available Analyses will contain only those qualifying
analyses that create an output table or view).
2  Select Columns From a Single Table
•
Available Databases (or Analyses) — All the databases (or analyses) that are available
for the Logistic Regression analysis.
•
Available Tables — All the tables that are available for the Logistic Regression
analysis.
•
Available Columns — Within the selected table or matrix, all columns which are
available for the Logistic Regression analysis.
•
Selected Columns — Select columns by highlighting and then either dragging and
dropping into the Selected Columns window, or click on the arrow button to move
highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can either insert
columns as Dependent or Independent columns. Make sure you have the correct
portion of the window highlighted. The Dependent variable column is the column
whose value is being predicted by the logistic regression model. The algorithm
requires that the Independent columns must be of numeric type (or contain numbers in
character format). The Dependent column may be of any type.
Logistic Regression - INPUT - Analysis Parameters
On the Logistic Regression dialog click on INPUT and then click on analysis parameters:
Figure 67: Logistic Regression > Input > Analysis Parameters
On this screen select:
• Regression Options
•
Convergence Criterion — The algorithm continues to repeatedly estimate the model
coefficient values until either the difference in the log likelihood function from one
iteration to the next is less than or equal to the convergence criterion or the maximum
iterations is reached. Default value is 0.001.
•
Maximum iterations — The algorithm stops iterating if the maximum iterations is
reached. The default value is 100.
•
Response Value — The value of the dependent variable that will represent the
response value. All other dependent variable values will be considered a non-response
value.
•
Include Constant Term (checkbox) — This option specifies that the logistic regression
model should include a constant term.
With a constant, the logistic equation can be thought of as:
π(x) = e^(b0 + b1·x1 + … + bn·xn) / (1 + e^(b0 + b1·x1 + … + bn·xn))
g(x) = ln( π(x) / (1 − π(x)) ) = b0 + b1·x1 + … + bn·xn
Without a constant, the equation changes to:
π(x) = e^(b1·x1 + … + bn·xn) / (1 + e^(b1·x1 + … + bn·xn))
g(x) = ln( π(x) / (1 − π(x)) ) = b1·x1 + … + bn·xn
The default value is to include the constant term. (A small illustrative sketch of these
formulas appears at the end of this section.)
• Stepwise Options — If selected, the algorithm is performed repeatedly with various
combinations of independent variable columns to attempt to arrive at a final “best” model.
The default is to not use Stepwise Regression.
•
Step Direction — (Selecting “None” turns off the Stepwise option).
•
Forward — Option for independent variables being added one at a time to an
empty model, possibly removing a variable after a variable is added.
•
Forward Only — Option for qualifying independent variables being added one at a
time.
•
Backward — Option for removing variables from an initial model containing all of
the independent variables, possibly adding a variable after a variable is removed.
•
Backward Only — Option for independent variables being removed one at a time.
•
Criterion to Enter — An independent variable is only added to the model if its W
statistic chi-square P-value is less than the specified criterion to enter. The default
value is 0.05.
•
Criterion to Remove — An independent variable is only removed from the model if its
T-statistic P-value is greater than the specified criterion to remove. The default value is
0.05.
• Report Options
•
Prediction Success Table — Creates a prediction success table using sums of
probabilities rather than estimates based on a threshold value. The default is to
generate the prediction success table.
•
Multi-Threshold Success Table — This table provides values similar to those in the
prediction success table, but based on a range of threshold values, thus allowing the
user to compare success scenarios using different threshold values. The default is to
generate the multi-threshold Success table.
•
•
Threshold Begin
•
Threshold End
•
Threshold Increment — Specifies the threshold values to be used in the multi-threshold success table. If the computed probability is greater than or equal to a
threshold value, that observation is assigned a 1 rather than a 0. Default values are
0, 1 and .05 respectively.
Cumulative Lift Table — Produce a cumulative lift table for deciles based on
probability values. The default is to generate the Cumulative Lift table.
• (Data Quality Reports) — These are the same data quality reports provided for Linear
Regression and Factor Analysis. However, in the case of Logistic Regression, the “Sums
of squares and Cross Products” or SSCP matrix is not readily available since it is not input
to the algorithm, so it is derived dynamically by the algorithm. If there are a large number
of independent variables in the model it may be more efficient to use the Build Matrix
function to build and save the matrix and the Linear Regression function to produce the
Data Quality Reports listed below.
•
Variable Statistics — This report gives the mean value and standard deviation of each
variable in the model based on the derived SSCP matrix.
•
Near Dependency — This report lists collinear variables or near dependencies in the
data based on the derived SSCP matrix.
•
•
Condition Index Threshold — Entries in the Near Dependency report are triggered
by two conditions occurring simultaneously. The one that involves this parameter
is the occurrence of a large condition index value associated with a specially
constructed principal factor. If a factor has a condition index greater than this
parameter’s value, it is a candidate for the Near Dependency report. A default
value of 30 is used as a rule of thumb.
•
Variance Proportion Threshold — Entries in the Near Dependency report are
triggered by two conditions occurring simultaneously. The one that involves this
parameter is when two or more variables have a variance proportion greater than
this threshold value for a factor with a high condition index. Another way of
saying this is that a ‘suspect’ factor accounts for a high proportion of the variance
of two or more variables. This parameter defines what a high proportion of
variance is. A default value of 0.5 is used as a rule of thumb.
Detailed Collinearity Diagnostics — This report provides the details behind the Near
Dependency report, consisting of the “Eigenvalues of Unit Scaled X’X”, “Condition
Indices” and “Variance Proportions” tables.
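The following minimal Python sketch shows how the probability π(x) and logit g(x) described under Include Constant Term relate to the b-coefficients; the coefficient values used here are placeholders, not output of the product:

```python
import numpy as np

def predict_probability(x, b, constant=None):
    """pi(x): probability that the dependent variable equals the response value.

    x:        independent variable values x1..xn
    b:        b-coefficients b1..bn (same length as x)
    constant: b0 when the model includes a constant term, otherwise None
    """
    g = float(np.dot(b, x))           # logit g(x) = b1*x1 + ... + bn*xn
    if constant is not None:
        g += constant                 # ... plus b0 when a constant is included
    return 1.0 / (1.0 + np.exp(-g))   # equivalent to e^g / (1 + e^g)

# Placeholder coefficients, not values produced by the product:
print(predict_probability(x=[3.0, 120.0], b=[-0.098, 0.031], constant=-1.186))
```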
Logistic Regression - INPUT - Expert Options
On the Logistic Regression dialog click on INPUT and then click on expert options:
Figure 68: Logistic Regression > Input > Expert Options
On this screen select:
• Optimization Method
•
Automatic — The program selects Reweighted Least Squares (RLS) unless there are 35
or more independent variable columns, in which case Quasi-Newton BFGS is selected
instead. This is the default option.
•
Quasi-Newton (BFGS) — The user may explicitly request this optimization technique
attributed to Broyden-Fletcher-Goldfarb-Shanno. Quasi-Newton methods do not
require a Hessian matrix of second partial derivatives of the objective function to be
calculated explicitly, saving time in some situations.
•
Reweighted Least Squares (RLS) — The user may explicitly request this optimization
technique equivalent to the Gauss-Newton method. It involves computing a matrix
very similar to a Hessian matrix but is typically the fastest technique for logistic
regression.
• Performance
•
Maximum amount of data for in-memory processing — Enter a number of megabytes.
•
Use multiple threads when applicable — This flag indicates that multiple SQL
statements may be executed simultaneously, up to 5 simultaneous executions as
needed. It only applies when not processing in memory, and only to certain processing
performed in SQL. Where and when multi-threading is used is dependent on the
number of columns and the Optimization Method selected (but both RLS and BFGS
can potentially make some use of multi-threading).
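For orientation only, the reweighted least squares idea can be sketched as a textbook iteratively reweighted least squares update; this is not the product's internal implementation, and the design matrix and response vector are assumed to be supplied by the caller:

```python
import numpy as np

def irls_step(X, y, b):
    """One iteratively reweighted least squares update for logistic regression.

    X: (n, k) design matrix (include a column of ones for a constant term)
    y: (n,) vector of 0/1 responses
    b: (k,) current coefficient estimates
    """
    p = 1.0 / (1.0 + np.exp(-(X @ b)))        # current predicted probabilities
    p = np.clip(p, 1e-10, 1.0 - 1e-10)        # guard against zero weights
    w = p * (1.0 - p)                         # observation weights
    z = X @ b + (y - p) / w                   # working (adjusted) response
    XtW = X.T * w                             # X'W with W diagonal
    return np.linalg.solve(XtW @ X, XtW @ z)  # (X'WX)^-1 X'W z
```

Updates of this kind are repeated until the change in the log likelihood is no greater than the Convergence Criterion or the Maximum Iterations limit is reached, as described under Analysis Parameters.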
Logistic Regression - OUTPUT
On the Logistic Regression dialog click on OUTPUT:
Figure 69: Logistic Regression > OUTPUT
On this screen select:
• Store the variables table of this analysis in the database — Store the model variables table
of this analysis in the database.
• Database Name — Name of the database to create the output table in.
• Output Table Name — Name of the output table.
• Advertise Output — “Advertises” output by inserting information into one or more of the
Advertise Output metadata tables according to the type of analysis and the options
selected in the analysis.
• Advertise Note — Specify when the Advertise Output option is selected or when the
Always Advertise option is selected on the Connection Properties dialog. It is a free-form
text field of up to 30 characters that may be used to categorize or describe the output.
By way of example, the tutorial produces the following output table (values shown rounded
to four decimal places):
Table 41: Logistic Regression - OUTPUT
Column Name      B Coefficient  Standard Error  Wald Statistic  T Statistic  P-Value  Odds Ratio  Lower   Upper   Partial R  Standardized Coefficient
years_with_bank  -0.0981        0.0443          4.9149          -2.2170      0.0269   0.9066      0.8312  0.9887  -0.0531    -0.1447
avg_sv_tran_cnt  -1.1921        0.2133          31.2295         -5.5883      0.0000   0.3036      0.1999  0.4612  -0.1680    -0.9144
avg_sv_tran_amt  0.0308         0.0038          64.7039         8.0439       0.0000   1.0312      1.0235  1.0390  0.2461     2.0618
ckacct           0.4657         0.2365          3.8760          1.9688       0.0494   1.5931      1.0021  2.5326  0.0426     0.1273
avg_ck_tran_cnt  -0.0228        0.0096          5.6088          -2.3683      0.0181   0.9775      0.9592  0.9961  -0.0590    -0.1792
married          -0.6225        0.2334          7.1152          -2.6674      0.0078   0.5366      0.3396  0.8478  -0.0703    -0.1715
(Constant)       -1.1864        0.2733          18.8462         -4.3412      0.0000   N/A         N/A     N/A     N/A        N/A
avg_sv_bal       0.0031         0.0006          31.1687         5.5829       0.0000   1.0031      1.0020  1.0042  0.1678     2.6259
If Database Name is twm_results and Output Table Name is test, the output table is
defined as:
CREATE SET TABLE twm_results.test
(
"Column Name" VARCHAR(30) CHARACTER SET UNICODE NOT
CASESPECIFIC,
"B Coefficient" FLOAT,
"Standard Error" FLOAT,
"Wald Statistic" FLOAT,
"T Statistic" FLOAT,
"P-Value" FLOAT,
"Odds Ratio" FLOAT,
"Lower" FLOAT,
"Upper" FLOAT,
"Partial R" FLOAT,
"Standardized Coefficient" FLOAT)
UNIQUE PRIMARY INDEX ( "Column Name" );
Run the Logistic Regression
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Logistic Regression
The results of running the Teradata Warehouse Miner Logistic Regression analysis include a
variety of statistical reports on the individual variables and the generated model, as well as
bar charts displaying coefficients and T-statistics. All of these results are outlined below. The
title of this report is preceded by the name of the technique that was used to build the model,
either Reweighted Least Squares Logistic Regression or Quasi-Newton (BFGS) Logistic
Regression.
On the Logistic Regression dialog, click on RESULTS (note that the RESULTS tab will be
grayed-out/disabled until after the analysis is completed) to view results. Result options are as
follows:
• Data Quality Reports
•
Variable Statistics — If selected on the Results Options tab, this report gives the mean
value and standard deviation of each variable in the model based on the SSCP matrix
provided as input.
•
Near Dependency — If selected on the Results Options tab, this report lists collinear
variables or near dependencies in the data based on the SSCP matrix provided as
input. Entries in the Near Dependency report are triggered by two conditions
occurring simultaneously. The first is the occurrence of a large condition index value
associated with a specially constructed principal factor. If a factor has a condition
index greater than the parameter specified on the Results Options tab, it is a candidate
for the Near Dependency report. The other is when two or more variables have a
variance proportion greater than a threshold value for a factor with a high condition
index. Another way of saying this is that a ‘suspect’ factor accounts for a high
proportion of the variance of two or more variables. The parameter that defines what a
high proportion of variance is can also be set on the Results Options tab; its default value
is 0.5.
•
Detailed Collinearity Diagnostics — If selected on the Results Options tab, this report
provides the details behind the Near Dependency report, consisting of the following
tables.
•
Eigenvalues of Unit Scaled X'X — Report of the eigenvalues of all variables scaled
so that each variable adds up to 1 when summed over all the observations or rows.
In order to calculate the singular values of X (the rows of X are the observations),
the mathematically equivalent square roots of the eigenvalues of X’X are computed
instead for practical reasons.
•
Condition Indices — The condition index of each eigenvalue, calculated as the
square root of the ratio of the largest eigenvalue to the given eigenvalue, a value
always 1 or greater.
•
Variance Proportions — The variance decomposition of these eigenvalues is
computed using the eigenvalues together with the eigenvectors associated with
them. The result is a matrix giving, for each variable, the proportion of variance
associated with each eigenvalue.
• Logistic Regression Step N (Stepwise-only)
•
In Report — This report is the same as the Variables in Model report, but it is provided
for each step during stepwise logistic regression based on the variables currently in the
model at each step.
•
Out Report
•
Column Name — The independent variable excluded from the model.
•
W Statistic — The W Statistic is a specialized statistic designed to determine the
best variable to add to a model without calculating a maximum likelihood solution
for each variable outside the model. The W statistic is assumed to follow a chi
square distribution with one degree of freedom due to its similarity to other
statistics, and it gives evidence of behaving similarly to the likelihood ratio
statistic. For more information, refer to [Peduzzi, Hardy and Holford].
•
Chi Sqr P-value — The W statistic is assumed to follow a chi square distribution on
one degree of freedom due to its similarity to other statistics, and it gives evidence
of behaving similarly to the likelihood ratio statistic. Therefore, the variable with
the smallest chi square probability or P-value associated with its W statistic is
added to the model in a forward step if the P-value is less than the criterion to
enter.
• Logistic Regression Model
•
Total Observations — This is the number of rows in the table that the logistic
regression analysis is based on. The number of observations reflects the row count
after any rows were eliminated by listwise deletion (due to one of the variables being
null).
•
Total Iterations — The number of iterations used by the non-linear optimization
algorithm in maximizing the log likelihood function.
•
Initial Log Likelihood — The initial log likelihood is the log likelihood of the constant
only model and is given only when the constant is included in the model. The formula
for initial log likelihood is given by:
L0 = (Σy)·ln(Σy) + (n − Σy)·ln(n − Σy) − n·ln(n)
where n is the number of observations and Σy is the number of observations whose
dependent variable equals the response value.
•
Final Log Likelihood — This is the value of the log likelihood function after the last
iteration.
•
Likelihood Ratio Test G Statistic — Deviance, given by D = -2LM, where LM is the log
likelihood of the logistic regression model, is a measure analogous to the residual
sums of squares RSS in a linear regression model. In order to assess the utility of the
independent terms taken as a whole in the logistic regression model, the deviance
difference statistic G is calculated for the model with a constant term only versus the
model with all variables fitted. This statistic is then G = -2(L0 - LM), where L0 is the
log likelihood of a model containing only a constant. The G statistic, like the deviance
D, is an example of a likelihood ratio test statistic.
•
Chi-Square Degrees of Freedom — The G Statistic follows a chi-square distribution
with “variables minus one” degrees of freedom. This field then is the degrees of
freedom for the G Statistic’s chi-square test.
•
Chi-Square Value — This is the chi-square random variable value for the Likelihood
Ratio Test G Statistic. This can be used to test whether all the independent variable
coefficients should be 0. Examining the field Chi-square Probability is however the
easiest way to assess this test.
•
Chi-Square Probability — This is the chi-square probability value for the Likelihood
Ratio Test G Statistic. It can be used to test whether all the independent variable
coefficients should be 0. That is, the probability that a chi-square distributed variable
would have the value G or greater is the probability associated with having all 0
coefficients. The null hypothesis that all the terms should be 0 can be rejected if this
probability is sufficiently small, say less than 0.05.
•
McFadden's Pseudo R-Squared — To mimic the Squared Multiple Correlation
Coefficient (R2) in a linear regression model, the researcher McFadden suggested this
measure given by (L0 - LM) / L0 where L0 is the log likelihood of a model containing
only a constant and LM is the log likelihood of the logistic regression model. Although
it is not truly speaking a goodness of fit measure, it can be useful in assessing a logistic
regression model. (Experience shows that the value of this statistic tends to be less
than the R2 value it mimics. In fact, values between 0.20 and 0.40 are quite
satisfactory).
•
Dependent Variable Name — Column chosen as the dependent variable.
•
Dependent Variable Response Values — The response value chosen for the dependent
variable on the Regression Options tab.
•
Dependent Variable Distinct Values — The number of distinct values that the dependent
variable takes on.
• Logistic Regression Variables in Model report
•
Column Name — This is the name of the independent variable in the model or
CONSTANT for the constant term.
•
B Coefficient — The b-coefficient is the coefficient in the logistic regression model for
this variable. The following equations describe the logistic regression model, with π(x)
being the probability that the dependent variable is 1, and g(x) being the logit
transformation:
π(x) = e^(b0 + b1·x1 + … + bn·xn) / (1 + e^(b0 + b1·x1 + … + bn·xn))
g(x) = ln( π(x) / (1 − π(x)) ) = b0 + b1·x1 + … + bn·xn
•
Standard Error — The standard error of a b-coefficient in the logistic regression model
is a measure of its expected accuracy. It is analogous to the standard error of a
coefficient in a linear regression model.
•
Wald Statistic — The Wald statistic is calculated as the square of the T-statistic (T Stat)
described below. The T-statistic is calculated for each b-coefficient as the ratio of the
b-coefficient value to its standard error.
•
T Statistic — In a manner analogous to linear regression, the T-statistic is calculated
for each b-coefficient as the ratio of the b-coefficient value to its standard error. Along
with its associated t-distribution probability value, it can be used to assess the
statistical significance of this term in the model.
•
P-value — This is the t-distribution probability value associated with the T-statistic (T
Stat), that is, the ratio of the b-coefficient value (B Coef) to its standard error (Std
Error). It can be used to assess the statistical significance of this term in the logistic
regression model. A value close to 0 implies statistical significance and means this
term in the model is important.
The P-value represents the probability that the null hypothesis is true, that is the
observation of the estimated coefficient value is chance occurrence (i.e., the null
hypothesis is that the coefficient equals zero). The smaller the P-value, the stronger the
evidence for rejecting the null hypothesis that the coefficient is actually equal to zero.
In other words, the smaller the P-value, the larger the evidence that the coefficient is
different from zero.
•
Odds Ratio — The odds ratio for an independent variable in the model is calculated by
taking the exponent of the b-coefficient. The odds ratio is the factor by which the odds
of the dependent variable being 1 change due to a unit increase in this independent
variable.
•
Lower — Because of the intuitive meaning of the odds ratio, confidence intervals for
coefficients in the model are calculated on odds ratios rather than on the coefficients
themselves. The confidence interval is computed based on a 95% confidence level and
a two-tailed normal distribution. “Lower” is the lower range of this confidence
interval.
•
Upper — Because of the intuitive meaning of the odds ratio, confidence intervals for
coefficients in the model are calculated on odds ratios rather than on the coefficients
themselves. The confidence interval is computed based on a 95% confidence level and
a two-tailed normal distribution. “Upper” is the upper range of this confidence
interval.
•
Partial R — The Partial R statistic is calculated for each b-coefficient value as:
Sign(bi) · sqrt( (wi − 2) / (−2·L0) )
where bi is the b-coefficient and wi is the Wald Statistic of the ith independent variable,
while L0 is the initial log likelihood of the model. (Note that if wi <= 2 then Partial R
is set to 0). This statistic provides a measure of the relative importance of each
variable in the model. It is calculated only when the constant term is included in the
model. [SPSS]
•
Standardized Coefficient — The estimated standardized coefficient is calculated for
each b-coefficient value as:
(bi · σi) / (π / √3)
where bi is the b-coefficient, σi is the standard deviation of the ith independent
variable, and π/√3 is the standard deviation of the standard logistic distribution. This
calculation only provides an estimate of the standardized coefficients since it uses a
constant value for the logistic distribution without regard to the actual distribution of
the dependent variable in the model. [Menard]
• Prediction Success Table — The prediction success table is computed using only
probabilities and not estimates based on a threshold value. Using an input table that
contains known values for the dependent variable, the sums of the probability values π(x)
and 1 − π(x), which correspond to the probability that the predicted value is 1 or 0
respectively, are calculated separately for rows with actual value of 1 and 0. Refer to the
Model Evaluation section for more information.
•
Estimate Response — The entries in the “Estimate Response” column are the sums of
the probabilities π(x) that the outcome is 1, summed separately over the observations
where the actual outcome is 1 and 0 and then totaled. (Note that this is independent of
the threshold value that is used in scoring to determine which probabilities correspond
to an estimate of 1 and 0 respectively).
•
Estimate Non-Response — The entries in the “Estimate Non-Response” column are
the sums of the probabilities 1 − π(x) that the outcome is 0, summed separately over
the observations where the actual outcome is 1 and 0 and then totaled. (Note that this
is independent of the threshold value that is used in scoring to determine which
probabilities correspond to an estimate of 1 and 0 respectively).
•
Actual Total — The entries in this column are the sums of the entries in the Estimate
Response and Estimate Non-Response columns, across the rows in the Prediction
Success Table. But in fact this turns out to be the number of actual 0’s and 1’s and total
observations in the training data.
•
Actual Response — The entries in the “Actual Response” row correspond to the
observations in the data where the actual value of the dependent variable is 1.
•
Actual Non-Response — The entries in the “Actual Non-Response” row correspond to
the observations in the data where the actual value of the dependent variable is 0.
•
Estimated Total — The entries in this row are the sums of the entries in the Actual
Response and Actual Non-Response rows, down the columns in the Prediction
Success Table. This turns out to be the sum of the probabilities of estimated 0’s and 1’s
and total observations in the model.
• Multi-Threshold Success Table — This table provides values similar to those in the
prediction success table, but instead of summing probabilities, the estimated values based
on a threshold value are summed instead. Rather than just one threshold however, several
thresholds ranging from a user specified low to high value are displayed in user specified
increments. This allows the user to compare several success scenarios using different
threshold values, to aid in the choice of an ideal threshold. Refer to the Model Evaluation
section for more information.
•
Threshold Probability — This column gives various incremental values of the
probability at or above which an observation is estimated to have a value of 1 for the
dependent variable. For example, at a threshold of 0.5, a response value of 1 is
estimated if the probability predicted by the logistic regression model is greater than
or equal to 0.5. The user may request the starting, ending and increment values for
these thresholds.
•
Actual Response, Estimate Response — This column corresponds to the number of
observations for which the model estimated a value of 1 for the dependent variable
and the actual value of the dependent variable is 1.
•
Actual Response, Estimate Non-Response — This column corresponds to the number
of observations for which the model estimated a value of 0 for the dependent variable
but the actual value of the dependent variable is 1, a “false negative” error case for the
model.
•
Actual Non-Response, Estimate Response — This column corresponds to the number
of observations for which the model estimated a value of 1 for the dependent variable
but the actual value of the dependent variable is 0, a “false positive” error case for the
model.
•
Actual Non-Response, Estimate Non-Response — This column corresponds to the
number of observations for which the model estimated a value of 0 for the dependent
variable and the actual value of the dependent variable is 0.
• Cumulative Lift Table — The Cumulative Lift Table demonstrates how effective the model
is in estimating the dependent variable. It is produced using deciles based on the
probability values. Note that the deciles are labeled such that 1 is the highest decile and 10
is the lowest, based on the probability values calculated by logistic regression. The
information in this report however is best viewed in the Lift Chart produced as a graph
under a logistic regression analysis.
•
Decile — The deciles in the report are based on the probability values predicted by the
model. Note that 1 is the highest decile and 10 is the lowest. That is, decile 1 contains
data on the 10% of the observations with the highest estimated probabilities that the
dependent variable is 1.
•
Count — This column contains the count of observations in the decile.
•
Response — This column contains the count of observations in the decile where the
actual value of the dependent variable is 1.
•
Response (%) — This column contains the percentage of observations in the decile
where the actual value of the dependent variable is 1.
•
Captured Response (%) — This column contains the percentage of responses in the
decile over all the responses in any decile.
•
Lift — The lift value is the percentage response in the decile (Pct Response) divided by
the expected response, where the expected response is the percentage of response or
dependent 1-values over all observations. For example, if 10% of the observations
overall have a dependent variable with value 1, and 20% of the observations in decile
1 have a dependent variable with value 1, then the lift value within decile 1 is 2.0,
meaning that the model gives a “lift” that is better than chance alone by a factor of two
in predicting response values of 1 within this decile.
•
Cumulative Response — This is a cumulative measure of Response, from decile 1 to
this decile.
•
Cumulative Response (%) — This is a cumulative measure of Pct Response, from
decile 1 to this decile.
•
Cumulative Captured Response (%) — This is a cumulative measure of Pct Captured
Response, from decile 1 to this decile.
•
Cumulative Lift — This is a cumulative measure of Lift, from decile 1 to this decile.
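A minimal sketch of how the four counts in the Multi-Threshold Success Table can be derived from actual outcomes and predicted probabilities; the function name, arguments and threshold handling are illustrative assumptions rather than the product's implementation:

```python
import numpy as np

def multi_threshold_success(actual, prob, begin=0.0, end=1.0, increment=0.05):
    """actual: 0/1 outcomes; prob: predicted probabilities pi(x) per row."""
    actual = np.asarray(actual)
    prob = np.asarray(prob)
    rows = []
    for threshold in np.arange(begin, end, increment):
        estimate = (prob >= threshold).astype(int)  # 1 when pi(x) >= threshold
        rows.append((round(float(threshold), 2),
                     int(((actual == 1) & (estimate == 1)).sum()),   # response, estimate response
                     int(((actual == 1) & (estimate == 0)).sum()),   # response, estimate non-response
                     int(((actual == 0) & (estimate == 1)).sum()),   # non-response, estimate response
                     int(((actual == 0) & (estimate == 0)).sum())))  # non-response, estimate non-response
    return rows
```

The Prediction Success Table is the analogous calculation with the thresholded estimates replaced by the probabilities themselves, summing π(x) and 1 − π(x) separately over the actual responses and non-responses.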
Logistic Regression Graphs
The Logistic Regression Analysis can display bar charts for the T-statistics, Wald Statistics,
Log Odds Ratios, Partial R and Estimated Standard Coefficients of the resultant model. In
addition, a Lift Chart in deciles is generated.
Logistic Weights Graph
This graph displays the relative magnitudes of the T-statistics, Wald Statistics, Log Odds
Ratios, Partial R and Estimated Standard Coefficients associated with each variable in the
logistic regression model. The sign, positive or negative, is portrayed by the colors red or blue
respectively. The user may scroll to the left or right to see the statistics associated with each
variable in the model.
The following options are available on the Graphics Options tab on the Logistic Weights
graph:
Graph Type — The following can be graphed by the Logistic Weights Graph
•
Vertical Axis — The user may request multiple vertical axes in order to display
separate coefficient values that are orders of magnitude different from the rest of the
values. If the coefficients are of roughly the same magnitude, this option is grayed out.
•
Single — Display the selected statistics on single axis on the bar chart.
•
Multiple — Display the selected statistics on dual axes on the bar chart.
Lift Chart
This graph displays the statistics in the Cumulative Lift Table, with the following options:
• Non-Cumulative
•
% Response — This column contains the percentage of observations in the decile
where the actual value of the dependent variable is 1.
•
% Captured Response — This column contains the percentage of responses in the
decile over all the responses in any decile.
•
Lift — The lift value is the percentage response in the decile (Pct Response) divided by
the expected response, where the expected response is the percentage of response or
dependent 1-values over all observations. For example, if 10% of the observations
overall have a dependent variable with value 1, and 20% of the observations in decile
1 have a dependent variable with value 1, then the lift value within decile 1 is 2.0,
meaning that the model gives a “lift” that is better than chance alone by a factor of two
in predicting response values of 1 within this decile.
• Cumulative
•
% Response — This is a cumulative measure of the percentage of observations in the
decile where the actual value of the dependent variable is 1, from decile 1 to this
decile.
•
% Captured Response — This is a cumulative measure of the percentage of responses
in the decile over all the responses in any decile, from decile 1 to this decile.
•
Cumulative Lift — This is a cumulative measure of the percentage response in the
decile (Pct Response) divided by the expected response, where the expected response
is the percentage of response or dependent 1-values over all observations, from decile
1 to this decile.
Tutorial - Logistic Regression
The following is an example of using the stepwise feature of Logistic Regression analysis.
The stepwise feature adds extra processing steps to the analysis; that is, normal Logistic
Regression processing is a subset of the output shown below. In this example, ccacct (has
credit card, 0 or 1) is being predicted in terms of 16 independent variables, from income to
avg_sv_tran_cnt. The forward stepwise process determines that only 7 out of the original 16
input variables should be used in the model. These include avg_sv_tran_amt (average amount
of savings transactions), avg_sv_tran_cnt (average number of savings transactions per
month), avg_sv_bal (average savings account balance), married, years_with_bank, avg_ck_
tran_cnt (average number of checking transactions per month), and ckacct (has checking
account, 0 or 1).
Step 0 shows that all of the original 16 independent variables are excluded from the model,
the starting point for forward stepwise regression. In Step 1, the Model Assessment report
shows that the variable avg_sv_tran_amt is added to the model, along with the constant term,
with all other variables still excluded from the model. For the sake of brevity, Steps 2 through
6 are not shown. Then in Step 7, the variable ckacct is the last variable added to the model.
At this point the stepwise algorithm stops because there are no more variables qualifying to
be added or removed from the model, and the Reweighted Least Squares Logistic Regression
and Variables in Model reports are given, just as they would be if these variables were
analyzed without stepwise requested. Finally the Prediction Success Table, Multi-Threshold
Success Table, and Cumulative Lift Table are given, as requested, to complete the analysis.
Parameterize a Logistic Regression Analysis as follows:
• Available Table — twm_customer_analysis
• Dependent Variable — ccacct
• Independent Variables — income, age, years_with_bank, nbr_children, female, single,
married, separated, ckacct, svacct, avg_ck_bal, avg_sv_bal, avg_ck_tran_amt,
avg_ck_tran_cnt, avg_sv_tran_amt, avg_sv_tran_cnt
• Convergence Criterion — 0.001
• Maximum Iterations — 100
• Response Value — 1
• Include Constant — Enabled
• Prediction Success Table — Enabled
• Multi-Threshold Success Table — Enabled
•
Threshold Begin — 0
•
Threshold End — 1
•
Threshold Increment — 0.05
• Cumulative Lift Table — Enabled
• Use Stepwise Regression — Enabled
•
Criterion to Enter — 0.05
•
Criterion to Remove — 0.05
•
Direction — Forward
• Optimization Type — Automatic
Run the analysis, and click on Results when it completes. For this example, the Logistic
Regression Analysis generated the following pages. A single click on each page name
populates Results with the item.
Table 42: Logistic Regression Report
Total Observations:                 747
Total Iterations:                   9
Initial Log Likelihood:             -517.7749
Final Log Likelihood:               -244.4929
Likelihood Ratio Test G Statistic:  546.5641
Chi-Square Degrees of Freedom:      7.0000
Chi-Square Value:                   14.0671
Chi-Square Probability:             0.0000
McFadden's Pseudo R-Squared:        0.5278
Dependent Variable:                 ccacct
Dependent Response Value:           1
Total Distinct Values:              2
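The Likelihood Ratio Test G Statistic and McFadden's Pseudo R-Squared in the report above follow directly from the initial and final log likelihoods, as this small arithmetic check shows:

```python
L0 = -517.7749   # Initial Log Likelihood from the report above
LM = -244.4929   # Final Log Likelihood from the report above

G = -2 * (L0 - LM)           # Likelihood Ratio Test G Statistic
pseudo_r2 = (L0 - LM) / L0   # McFadden's Pseudo R-Squared

print(round(G, 4), round(pseudo_r2, 4))   # approximately 546.564 and 0.5278
```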
Table 43: Execution Summary
6/20/2004 2:19:02 PM  Stepwise Logistic Regression Running.
6/20/2004 2:19:03 PM  Step 0 Complete
6/20/2004 2:19:03 PM  Step 1 Complete
6/20/2004 2:19:03 PM  Step 2 Complete
6/20/2004 2:19:03 PM  Step 3 Complete
6/20/2004 2:19:03 PM  Step 4 Complete
6/20/2004 2:19:04 PM  Step 5 Complete
6/20/2004 2:19:04 PM  Step 6 Complete
6/20/2004 2:19:04 PM  Step 7 Complete
6/20/2004 2:19:04 PM  Log Likelihood: -517.78094387828
6/20/2004 2:19:04 PM  Log Likelihood: -354.38456690558
6/20/2004 2:19:04 PM  Log Likelihood: -287.159936852895
6/20/2004 2:19:04 PM  Log Likelihood: -258.834546711159
6/20/2004 2:19:04 PM  Log Likelihood: -247.445356552554
6/20/2004 2:19:04 PM  Log Likelihood: -244.727173470081
6/20/2004 2:19:04 PM  Log Likelihood: -244.49467692232
6/20/2004 2:19:04 PM  Log Likelihood: -244.492882024522
6/20/2004 2:19:04 PM  Log Likelihood: -244.492881920691
6/20/2004 2:19:04 PM  Computing Multi-Threshold Success Table
6/20/2004 2:19:06 PM  Computing Prediction Success Table
6/20/2004 2:19:06 PM  Computing Cumulative Lift Table
6/20/2004 2:19:07 PM  Creating Report
Table 44: Variables
Column Name      B Coefficient  Standard Error  Wald Statistic  T Statistic  P-Value  Odds Ratio  Lower   Upper   Partial R  Standardized Coefficient
(Constant)       -1.1864        0.2733          18.8462         -4.3412      0.0000   N/A         N/A     N/A     N/A        N/A
avg_sv_tran_amt  0.0308         0.0038          64.7039         8.0439       0.0000   1.0312      1.0235  1.0390  0.2461     2.0618
avg_sv_tran_cnt  -1.1921        0.2133          31.2295         -5.5883      0.0000   0.3036      0.1999  0.4612  -0.1680    -0.9144
avg_sv_bal       0.0031         0.0006          31.1687         5.5829       0.0000   1.0031      1.0020  1.0042  0.1678     2.6259
married          -0.6225        0.2334          7.1152          -2.6674      0.0078   0.5366      0.3396  0.8478  -0.0703    -0.1715
years_with_bank  -0.0981        0.0443          4.9149          -2.2170      0.0269   0.9066      0.8312  0.9887  -0.0531    -0.1447
avg_ck_tran_cnt  -0.0228        0.0096          5.6088          -2.3683      0.0181   0.9775      0.9592  0.9961  -0.0590    -0.1792
ckacct           0.4657         0.2365          3.8760          1.9688       0.0494   1.5931      1.0021  2.5326  0.0426     0.1273
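Several columns of the Variables report are simple transforms of the B Coefficient and Standard Error. Using the ckacct row above (small differences from the report come from rounding of the published values; the 1.96 factor is the two-tailed 95% normal quantile described earlier):

```python
import math

b, se = 0.4657, 0.2365            # ckacct row: B Coefficient, Standard Error

t_stat = b / se                   # T Statistic, approximately 1.969
wald = t_stat ** 2                # Wald Statistic, approximately 3.877
odds_ratio = math.exp(b)          # Odds Ratio, approximately 1.593
lower = math.exp(b - 1.96 * se)   # Lower bound of 95% interval, approximately 1.002
upper = math.exp(b + 1.96 * se)   # Upper bound of 95% interval, approximately 2.533
print(t_stat, wald, odds_ratio, lower, upper)
```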
Step 0
Table 45: Columns Out
Column Name       W Statistic  Chi-Square P-Value
age               1.9521       0.1624
avg_ck_bal        0.5569       0.4555
avg_ck_tran_amt   1.6023       0.2056
avg_ck_tran_cnt   0.0844       0.7714
avg_sv_bal        85.5070      0.0000
avg_sv_tran_amt   233.7979     0.0000
avg_sv_tran_cnt   44.0510      0.0000
ckacct            21.8407      0.0000
female            3.2131       0.0730
income            1.9877       0.1586
married           19.6058      0.0000
nbr_children      5.1128       0.0238
separated         5.5631       0.0183
single            6.9958       0.0082
svacct            7.4642       0.0063
years_with_bank   3.0069       0.0829
Step 1
Table 46: Variables
Column Name      B Coefficient  Standard Error  Wald Statistic  T Statistic  P-Value  Odds Ratio  Lower   Upper   Partial R  Standardized Coefficient
avg_sv_tran_amt  0.0201         0.0014          193.2455        13.9013      0.0000   1.0203      1.0174  1.0232  0.4297     1.3445
Table 47: Columns Out
Column Name       W Statistic  Chi-Square P-Value
age               3.4554       0.0630
avg_ck_bal        0.4025       0.5258
avg_ck_tran_amt   0.3811       0.5370
avg_ck_tran_cnt   11.3612      0.0007
avg_sv_bal        46.6770      0.0000
avg_sv_tran_cnt   134.8091     0.0000
ckacct            7.8238       0.0052
female            2.4111       0.1205
income            5.2143       0.0224
married           7.7743       0.0053
nbr_children      2.6647       0.1026
separated         3.9342       0.0473
single            2.7417       0.0978
svacct            2.0405       0.1532
years_with_bank   13.2617      0.0003
Step 2-7
Table 48: Prediction Success Table
                     Estimate Response  Estimate Non-Response  Actual Total
Actual Response      304.5868           70.4132                375.0000
Actual Non-Response  70.4133            301.5867               372.0000
Actual Total         375.0000           372.0000               747.0000
Table 49: Multi-Threshold Success Table
Threshold    Actual Response,    Actual Response,        Actual Non-Response,  Actual Non-Response,
Probability  Estimate Response   Estimate Non-Response   Estimate Response     Estimate Non-Response
0            375                 0                       372                   0
.05          375                 0                       353                   19
.1           374                 1                       251                   121
.15          373                 2                       152                   220
.2           369                 6                       90                    282
.25          361                 14                      58                    314
.3           351                 24                      37                    335
.35          344                 31                      29                    343
.4           329                 46                      29                    343
.45          318                 57                      28                    344
.5           313                 62                      24                    348
.55          305                 70                      23                    349
.6           291                 84                      23                    349
.65          286                 89                      21                    351
.7           276                 99                      20                    352
.75          265                 110                     20                    352
.8           253                 122                     20                    352
.85          243                 132                     16                    356
.9           229                 146                     13                    359
.95          191                 184                     11                    361
Table 50: Cumulative Lift Table
Decile  Count    Response  Response (%)  Captured Response (%)  Lift    Cumulative Response  Cumulative Response (%)  Cumulative Captured Response (%)  Cumulative Lift
1       74.0000  73.0000   98.6486       19.4667                1.9651  73.0000              98.6486                  19.4667                           1.9651
2       75.0000  69.0000   92.0000       18.4000                1.8326  142.0000             95.3020                  37.8667                           1.8984
3       75.0000  71.0000   94.6667       18.9333                1.8858  213.0000             95.0893                  56.8000                           1.8942
4       74.0000  65.0000   87.8378       17.3333                1.7497  278.0000             93.2886                  74.1333                           1.8583
5       75.0000  66.0000   88.0000       17.6000                1.7530  344.0000             92.2252                  91.7333                           1.8371
6       75.0000  24.0000   32.0000       6.4000                 0.6374  368.0000             82.1429                  98.1333                           1.6363
7       74.0000  4.0000    5.4054        1.0667                 0.1077  372.0000             71.2644                  99.2000                           1.4196
8       73.0000  2.0000    2.7397        0.5333                 0.0546  374.0000             62.8571                  99.7333                           1.2521
9       69.0000  1.0000    1.4493        0.2667                 0.0289  375.0000             56.4759                  100.0000                          1.1250
10      83.0000  0.0000    0.0000        0.0000                 0.0000  375.0000             50.2008                  100.0000                          1.0000
Logistic Weights Graph
By default, the Logistic Weights graph displays the relative magnitudes of the T-statistic
associated with each coefficient in the logistic regression model:
Figure 70: Logistic Regression Tutorial: Logistic Weights Graph
Select the Graphics Options tab and change the Graph Type to Wald Statistic, Log Odds Ratio,
Partial R or Estimated Standardized Coefficient to view those statistical measures respectively.
Lift Chart
By default, the Lift Chart displays the cumulative measure of the percentage of observations
in the decile where the actual value of the dependent variable is 1, from decile 1 to this decile
(Cumulative, %Response):
Figure 71: Logistic Regression Tutorial: Lift Chart
CHAPTER 2
Scoring
What’s In This Chapter
This chapter applies only to an instance of Teradata Warehouse Miner operating on a Teradata
database.
For more information, see these subtopics:
1  “Overview” on page 149
2  “Cluster Scoring” on page 149
3  “Tree Scoring” on page 157
4  “Factor Scoring” on page 168
5  “Linear Scoring” on page 176
6  “Logistic Scoring” on page 184
Overview
Model scoring in Teradata Warehouse Miner is performed entirely through generated SQL,
executed in the database (although PMML based scoring generally requires that certain
supplied User Defined Functions be installed beforehand). A scoring analysis is provided for
every Teradata Warehouse Miner algorithm that produces a predictive model (thus excluding
the Association Rules algorithm).
Scoring applies a predictive model to a data set that has the same columns as those used in
building the model, with the exception that the scoring input table need not always include the
predicted or dependent variable column for those models that utilize one. In fact, the
dependent variable column is required only when model evaluation is requested in the Tree
Scoring, Linear Scoring and Logistic Scoring analyses.
Cluster Scoring
Scoring a table is the assignment of each row to a cluster. In the Gaussian Mixture model, the
“maximum probability rule” is used to assign the row to the cluster for which its conditional
probability is the largest. The model also assigns relative probabilities of each cluster to the
row, so the soft assignment of a row to more than one cluster can be obtained.
When scoring is requested, the selected table is scored against centroids/variances from the
selected Clustering analysis. After a single iteration, each row is assigned to one of the
previously defined clusters, together with the probability of membership. The row to cluster
assignment is based on the largest probability.
The Cluster Scoring analysis scores an input table that contains the same columns that were
used to perform the selected Clustering Analysis. The implicit assumption in doing this is that
the underlying population distributions are the same. When scoring is requested, the specified
table is scored against the centroids and variances obtained in the selected Clustering
analysis. Only a single iteration is required before the new scored table is produced.
After clusters have been identified by their centroids and variances, the scoring engine
identifies to which cluster each row belongs. The Gaussian Mixture model permits multiple
cluster memberships, with scoring showing the probability of membership to each cluster. In
addition, the highest probability is used to assign the row absolutely to a cluster. The resulting
score table consists of the index (key) columns, followed by probabilities for each cluster
membership, followed by the assigned cluster number (the cluster with the highest probability
of membership).
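A minimal sketch of the maximum probability rule under the simplifying assumption of per-column (diagonal) variances; the centroids, variances and mixture weights shown are placeholders, and the product's Gaussian Mixture implementation may differ in detail:

```python
import numpy as np

def score_row(x, centroids, variances, weights):
    """Assign a row to the cluster with the largest conditional probability.

    x:         row of input column values
    centroids: (k, d) cluster centroids
    variances: (k, d) per-column variances for each cluster (diagonal covariance)
    weights:   (k,) mixture weights for the clusters
    Returns the relative cluster probabilities and the assigned cluster number.
    """
    x = np.asarray(x, dtype=float)
    densities = np.array([
        np.prod(np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v))
        for m, v in zip(np.asarray(centroids, float), np.asarray(variances, float))
    ])
    posterior = np.asarray(weights, float) * densities
    posterior = posterior / posterior.sum()        # relative probabilities per cluster
    return posterior, int(np.argmax(posterior)) + 1  # maximum probability rule

probabilities, cluster = score_row([1.2, 0.4],
                                   centroids=[[1.0, 0.5], [3.0, 2.0]],
                                   variances=[[0.2, 0.1], [0.5, 0.4]],
                                   weights=[0.6, 0.4])
print(probabilities, cluster)
```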
Initiate Cluster Scoring
After generating a Cluster analysis (as described in “Cluster Analysis” on page 20), use the
following procedure to initiate Cluster Scoring:
1  Click on the Add New Analysis icon in the toolbar:
Figure 72: Add New Analysis from toolbar
2  In the resulting Add New Analysis dialog box, click on Scoring under Categories and then
under Analyses double-click on Cluster Scoring:
Figure 73: Add New Analysis > Scoring > Cluster Scoring
3  This will bring up the Cluster Scoring dialog in which you will enter INPUT and
OUTPUT options to parameterize the analysis as described in the following sections.
Cluster Scoring - INPUT - Data Selection
On the Cluster Scoring dialog click on INPUT and then click on data selection:
Figure 74: Add New Analysis > Input > Data Selection
On this screen select:
1  Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis. (Note that since this analysis cannot
select from a volatile input table, Available Analyses will contain only those qualifying
analyses that create an output table or view).
2  Select Columns From a Single Table
•
Available Databases — All available source databases that have been added on the
Connection Properties dialog.
•
Available Tables — The tables available for scoring are listed in this window, though
all may not strictly qualify; the input table to be scored must contain the same column
names used in the original analysis.
•
Available Columns — The columns available for scoring are listed in this window.
•
Selected Columns — The Selected Columns window is a split window for specifying
Index and/or Retain columns.
•
Index Columns — If a table is specified as input, the primary index of the table is
defaulted here but can be changed. If a view is specified as input, an index must be
provided. When scoring a Fast K-Means model, any columns used to determine
clusters in the analysis being scored should not be specified as Index columns
when scoring, or a duplicate definition error can occur.
•
Retain Columns — Other columns within the table being scored can be appended
to the scored table by specifying them here. Columns specified in Index Columns
need not be specified here. None of the columns involved in Fast K-Means
clustering can contain leading or trailing spaces or, if publishing, a separator
character ' | '.
3  Select Model Analysis — Select from the list an existing Cluster analysis on which to run
the scoring. The Cluster analysis must exist in the same project as the Cluster Scoring
analysis.
Cluster Scoring - INPUT - Analysis Parameters
On the Cluster Scoring dialog click on INPUT and then click on analysis parameters:
Figure 75: Add New Analysis > Input > Analysis Parameters
On this screen select:
• Score Options
•
Include Cluster Membership — The name of the column in the output score table
representing the cluster number to which an observation or row belongs can be set by
the user. For Fast K-Means, this option must be checked and a column name specified.
For other model types, this column may be excluded by clearing the selection box, but
if this is done the cluster probability scores must be included.
•
Column Name — Name of the column that will be populated with the cluster
numbers; it cannot have the same name as any of the columns in the table being
scored.
•
Include Cluster Probability Scores — Specify the prefix of the name of the columns in
the output score table representing the probabilities that an observation or row belongs
to each cluster. A column is created for each possible cluster, adding the cluster
number to this prefix (for example, p1, p2, p3). To exclude these columns, clear the
selection box, but you must include the cluster membership number.
•
Column Prefix — Specify a prefix for each column generated (one per cluster) that
will be populated with the probability scores. The prefix includes sequential
numbers, beginning with 1 and incrementing for each cluster appended to it. If the
resultant column conflicts with a column in the table to be scored, an error occurs.
Cluster Scoring - OUTPUT
On the Cluster Scoring dialog click on OUTPUT:
Figure 76: Cluster Scoring > Output
On this screen select:
• Output Table
•
Database name — The name of the database.
•
Table name — The name of the scored output table to be created; required only if a
scoring option is selected.
•
Create output table using the FALLBACK keyword — If a table is selected, it will be
built with FALLBACK if this option is selected
•
Create output table using the MULTISET keyword — This option is not enabled for
scored output tables; the MULTISET keyword is not used.
•
Advertise Output — The Advertise Output option “advertises” output by inserting
information into one or more of the Advertise Output metadata tables according to the
type of analysis and the options selected in the analysis.
•
Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that
may be used to categorize or describe the output.
•
Generate the SQL for this analysis, but do not execute it — If this option is selected the
analysis will only generate SQL, returning it and terminating immediately.
Note: To create a stored procedure to score this model, use the Refresh analysis.
Run the Cluster Scoring Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Cluster Scoring
The results of running the Teradata Warehouse Miner Cluster Scoring Analysis include a
variety of statistical reports on the scored model. All of these results are outlined below.
Cluster Scoring - RESULTS - reports
On the Cluster Scoring dialog, click RESULTS and then click reports (note that the RESULTS
tab will be grayed-out/disabled until after the analysis is completed):
Figure 77: Cluster Scoring > Results > Reports
• Clustering Scoring Report
  • Iteration — When scoring, the algorithm performs only one iteration, so this value is always 1.
  • Log Likelihood — This is the log likelihood value calculated using the scored data, giving a measure of the effectiveness of the model applied to this data.
  • Diff — Since only one iteration of the algorithm is performed when scoring, this is always 0.
  • Timestamp — This is the day, date, hour, minute and second marking the end of the scoring processing.
The Cluster Scoring report for Fast K-Means contains the Timestamp and Message columns, where the message indicates progress and a single iteration.
Cluster Scoring - RESULTS - data
On the Cluster Scoring dialog click RESULTS and then click on data (note that the RESULTS
tab will be grayed-out/disabled until after the analysis is completed):
Figure 78: Cluster Scoring > Results > Data
Results data, if any, is displayed in a data grid.
If a table was created, a sample of rows is displayed here - the size determined by the setting
specified by Maximum result rows to display in Tools-Preferences-Limits.
The following table is built in the requested Output Database by the Cluster Scoring analysis. Note that the options selected affect the structure of the table. The Key column(s) listed first comprise the Unique Primary Index (UPI). Also note that there may be repeated groups of columns, and that some columns will be generated only if specific options are selected.

Table 51: Output Database table (Built by the Cluster Scoring analysis)
• Key (User Defined) — One or more unique-key columns, which default to the index defined in the table to be scored (i.e., in Selected Tables). The data type defaults to the same as the scored table, but can be changed via Primary Index Columns and Types.
• Probability columns (FLOAT; default) — The probabilities that an observation or row belongs to each cluster, if the Include Cluster Probability Scores option is selected. A column is created for each possible cluster, adding the cluster number to the prefix entered in the Column Prefix option. This prefix is used for each column generated (one per cluster) that is populated with the probability scores, with sequential numbers beginning with 1 and incrementing for each cluster appended to it. (By default, the Column Prefix is p, so p1, p2, p3, etc. will be generated.) These columns may be excluded by not selecting the Include Cluster Probability Scores option, but if this is done the cluster membership number must be included.
• Cluster Number (INTEGER; default) — The column in the output score table representing the cluster number to which an observation or row belongs. This column may be excluded by not selecting the Include Cluster Membership option, but if this is done the cluster probability scores must be included (see above). The name of the column defaults to Cluster Number, but this can be overwritten by entering another value in Column Name under the Include Cluster Membership option. The name cannot be the same as any of the index columns in the table being scored, and cannot exist as a column in the table being scored.
When scoring a Fast K-Means model, the score table differs from that shown above. The first
column identifies the cluster and has the default name Cluster Number. The next columns are
the Index columns and the last columns are the Retain columns, if any. (A Fast K-Means
score table does not contain probability columns.)
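To make the relationship between the probability columns and the cluster membership column concrete, the following is a minimal SQL sketch of how a cluster assignment could be derived from the probability columns of a three-cluster score table. The table and column names (twm_score_cluster_1, p1, p2, p3, cust_id, clusterno) follow the tutorial later in this section; the SQL actually generated by Teradata Warehouse Miner is more involved, so this illustrates only the shape of the output, not the generated statement itself.

-- Illustrative only: pick the cluster whose probability column is largest
-- for each scored row (three clusters assumed).
SELECT
    cust_id,
    p1, p2, p3,
    CASE
        WHEN p1 >= p2 AND p1 >= p3 THEN 1
        WHEN p2 >= p1 AND p2 >= p3 THEN 2
        ELSE 3
    END AS clusterno
FROM twm_score_cluster_1;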
Cluster Scoring - RESULTS - SQL
With Fast K-Means scoring, the SQL is displayed whether or not the Generate SQL Only
option is selected. The SQL generated is simply the call to td_analyze, followed by any
requested postprocessing SQL.
On the Cluster Scoring dialog click RESULTS and then click on SQL (note that the RESULTS
tab will be grayed-out/disabled until after the analysis is completed):
Figure 79: Cluster Scoring > Results > SQL
The SQL generated for scoring is returned here, but only if the Output - Storage option to
Generate the SQL for this analysis, but do not execute it was selected. When SQL is
displayed here, it may be selected and copied as desired. (Both right-click menu options and
buttons to Select All and Copy are available).
Tutorial - Cluster Scoring
In this example, the same table is scored as was used to build the cluster analysis model.
Parameterize a Cluster Score Analysis as follows:
• Selected Table — twm_customer_analysis
• Include Cluster Membership — Enabled
• Column Name — Clusterno
• Include Cluster Probability Scores — Enabled
• Column Prefix — p
• Result Table Name — twm_score_cluster_1
• Index Columns — cust_id
Run the analysis, and click on Results when it completes. For this example, the Cluster
Scoring Analysis generated the following pages. A single click on each page name populates
Results with the item.
Table 52: Clustering Progress

Iteration | Log Likelihood | Diff | Timestamp
1         | -24.3          | 0    | Tue Jun 12 15:41:58 2001
Table 53: Data

cust_id  | p1        | p2        | p3        | clusterno
1362509  | .457      | .266      | .276      | 1
1362573  | 1.12E-22  | 1         | 0         | 2
1362589  | 6E-03     | 5.378E-03 | .989      | 3
1362693  | 8.724E-03 | 8.926E-03 | .982      | 3
1362716  | 3.184E-03 | 3.294E-03 | .994      | 3
1362822  | .565      | .132      | .303      | 1
1363017  | 7.267E-02 | .927      | 1.031E-18 | 2
1363078  | 3.598E-03 | 3.687E-03 | .993      | 3
1363438  | 2.366E-03 | 2.607E-03 | .995      | 3
1363465  | .115      | 5.923E-02 | .826      | 3
…        | …         | …         | …         | …
Tree Scoring
After building a model, a means of deploying it is required so that new data sets can be scored. Teradata Warehouse Miner deploys a decision tree model via SQL: a series of SQL statements is generated from the metadata model that describes the decision tree. The SQL uses CASE statements to classify the predicted value. Here is an example of a statement:
SELECT CASE WHEN (subset1 expression) THEN 'Buy'
            WHEN (subset2 expression) THEN 'Do not Buy'
       END
FROM tablename;
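In a decision tree, each subset expression is typically the conjunction (AND) of the split conditions along the path from the root to a leaf node. As an illustration only, a generated statement for a two-leaf tree might take the following shape; the column names (cust_id, income, age, cc_acct) and split values are hypothetical, and the real generated SQL also carries index columns, retain columns and, optionally, confidence columns.

-- Hypothetical sketch of a generated scoring statement for a two-leaf tree.
SELECT
    cust_id,
    CASE
        WHEN income >= 40000 AND age <  35 THEN 'Buy'
        WHEN income <  40000 OR  age >= 35 THEN 'Do not Buy'
    END AS cc_acct
FROM twm_customer_analysis;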
Note that Tree Scoring applies a Decision Tree model to a data set that has the same columns
as those used in building the model (with the exception that the scoring input table need not
include the predicted or dependent variable column unless model evaluation is requested).
A number of scoring options including model evaluation and profiling rulesets are provided
on the analysis parameters panel of the Tree Scoring analysis.
Initiate Tree Scoring
After generating a Decision Tree analysis (as described in “Decision Trees” on page 39) use
the following procedure to initiate Tree Scoring:
1  Click on the Add New Analysis icon in the toolbar:
Figure 80: Add New Analysis from toolbar
2  In the resulting Add New Analysis dialog box, click on Scoring under Categories and then under Analyses double-click on Tree Scoring:
Figure 81: Add New Analysis > Scoring > Tree Scoring
3  This will bring up the Tree Scoring dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
Tree Scoring - INPUT - Data Selection
On the Tree Scoring dialog click on INPUT and then click on data selection:
Figure 82: Tree Scoring > Input > Data Selection
On this screen select:
1  Select Input Source
   Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view.)
2  Select Columns From a Single Table
   • Available Databases — All available source databases that have been added on the Connection Properties dialog.
   • Available Tables — The tables available for scoring are listed in this window, though all may not strictly qualify; the input table to be scored must contain the same column names used in the original analysis.
   • Available Columns — The columns available for scoring are listed in this window.
   • Selected Columns
     • Index Columns — Note that the Selected Columns window is actually a split window for specifying Index and/or Retain columns if desired. If a table is specified as input, the primary index of the table is defaulted here, but can be changed. If a view is specified as input, an index must be provided.
     • Retain Columns — Other columns within the table being scored can be appended to the scored table by specifying them here. Columns specified in Index Columns may not be specified here.
3  Select Model Analysis
   Select from the list an existing Decision Tree analysis on which to run the scoring. The Decision Tree analysis must exist in the same project as the Decision Tree Scoring analysis.
Tree Scoring - INPUT - Analysis Parameters
On the Tree Scoring dialog click on INPUT and then click on analysis parameters:
Figure 83: Tree Scoring > Input > Analysis Parameters
On this screen select:
• Scoring Method
  • Score — Option to create a score table only.
  • Evaluate — Option to perform model evaluation only. Not available for Decision Tree models built using the Regression Trees option.
  • Evaluate and Score — Option to create a score table and perform model evaluation. Not available for Decision Tree models built using the Regression Trees option.
• Scoring Options
  • Use Dependent variable for predicted value column name — Option to use the exact same column name as the dependent variable when the model is scored. This is the default option.
  • Predicted Value Column Name — If the above option is not checked, then enter here the name of the column in the score table which contains the estimated value of the dependent variable.
  • Include Confidence Factor — If this option is checked, the confidence factor will be added to the output table. The Confidence Factor is a measure of how “confident” the model is that it can predict the correct score for a record that falls into a particular leaf node based on the training data the model was built from.
    Example: If a leaf node contained 10 observations and 9 of them predict Buy and the other record predicts Do Not Buy, then the model built will have a confidence factor of .9, or be 90% sure of predicting the right value for a record that falls into that leaf node of the model.
    If the Include validation table option was selected when the decision tree model was built, additional information is provided in the scored table and/or results depending on the scoring option selected. If Score Only is selected, a recalculated confidence factor based on the original validation table is included in the scored output table. If Evaluate Only is selected, a confusion matrix based on the selected table to score is added to the results. If Evaluate and Score is selected, then a confusion matrix based on the selected table to score is added to the results and a recalculated confidence factor based on the selected table to score is included in the scored output table.
  • Targeted Confidence (Binary Outcome Only) — Models built with a predicted variable that has only 2 outcomes can add a targeted confidence value to the output table. The outcomes of the above example were 9 Buys and 1 Do Not Buy at that particular node, and if the target value was set to Buy, .9 is the targeted confidence. However, if it is desired to target the Do Not Buy outcome by setting the value to Do Not Buy, then any record falling into this leaf of the tree would get a targeted confidence of .1 or 10%.
    If the Include validation table option was selected when the decision tree model was built, additional information is provided in a manner similar to that for the Include Confidence Factor option described above.
  • Targeted Value — The value for the binary targeted confidence.
  Note that Include Confidence Factor and Targeted Confidence are mutually exclusive options, so that only one of the two may be selected.
  • Create Profiling Tables — If this option is selected, additional tables are created to profile the leaf nodes in the tree and to link scored rows to the leaf nodes that they correspond to. To do this, a node ID field is added to the scored output table and two additional tables are built to describe the leaf nodes. One table contains confidence factor or targeted confidence (if requested) and prediction information (named by appending “_1” to the scored output table name), and the other contains the rules corresponding to each leaf node (named by appending “_2” to the scored output table name).
    Note however that selection of the option to Create Profiling Tables is ignored if the Evaluate scoring method or the output option to Generate the SQL for this analysis but do not execute it is selected. It is also ignored if the analysis is being refreshed by a Refresh analysis that requests the creation of a stored procedure.
Tree Scoring - OUTPUT
On the Tree Scoring dialog click on OUTPUT:
Figure 84: Tree Scoring > Output
On this screen select:
• Output Table
•
Database name — The name of the database.
•
Table name — The name of the scored output table to be created.
•
Create output table using the FALLBACK keyword — If a table is selected, it will be
built with FALLBACK if this option is selected
•
Create output table using the MULTISET keyword — This option is not enabled for
scored output tables; the MULTISET keyword is not used.
•
Advertise Output — The Advertise Output option “advertises” output (including
Profiling Tables if requested) by inserting information into one or more of the
Advertise Output metadata tables according to the type of analysis and the options
selected in the analysis.
•
Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Teradata Warehouse Miner User Guide - Volume 3
161
Chapter 2: Scoring
Tree Scoring
Connection Properties dialog. It is a free-form text field of up to 30 characters that
may be used to categorize or describe the output.
•
Generate the SQL for this analysis, but do not execute it — If this option is selected
the analysis will only generate SQL, returning it and terminating immediately.
Note: To create a stored procedure to score this model, use the Refresh analysis.
Run the Tree Scoring Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Tree Scoring
The results of running the Teradata Warehouse Miner Decision Tree Scoring Analysis include
a variety of statistical reports on the scored model. All of these results are outlined below.
Tree Scoring - RESULTS - Reports
On the Tree Scoring dialog click RESULTS and then click on reports (note that the RESULTS
tab will be grayed-out/disabled until after the analysis is completed):
Figure 85: Tree Scoring > Results > Reports
• Decision Tree Score Report
  • Resulting Scored Table Name — This is the name given the table with the scored values of the decision tree model.
  • Number of Rows in Scored Table — This is the number of rows in the scored decision tree table.
• Confusion Matrix — An N x (N+2) confusion matrix (for N outcomes of the dependent variable) is given with the following format:
Table 54: Confusion Matrix

              | Actual '0'                  | Actual '1'                  | … | Actual 'N'                  | Correct                       | Incorrect
Predicted '0' | # correct '0' predictions   | # incorrect '1' predictions | … | # incorrect 'N' predictions | Total correct '0' predictions | Total incorrect '0' predictions
Predicted '1' | # incorrect '0' predictions | # correct '1' predictions   | … | # incorrect 'N' predictions | Total correct '1' predictions | Total incorrect '1' predictions
…             | …                           | …                           | … | …                           | …                             | …
Predicted 'N' | # incorrect '0' predictions | # incorrect '1' predictions | … | # correct 'N' predictions   | Total correct 'N' predictions | Total incorrect 'N' predictions
• Cumulative Lift Table — The Cumulative Lift Table demonstrates how effective the model is in estimating the dependent variable. It is produced using deciles based on the probability values. Note that the deciles are labeled such that 1 is the highest decile and 10 is the lowest, based on the probability values calculated by the model. The information in this report however is best viewed in the Lift Chart produced as a graph. Note that this is only valid for binary dependent variables. (A SQL sketch of the confusion matrix and decile lift computations follows these column descriptions.)
  • Decile — The deciles in the report are based on the probability values predicted by the model. Note that 1 is the highest decile and 10 is the lowest. That is, decile 1 contains data on the 10% of the observations with the highest estimated probabilities that the dependent variable is 1.
  • Count — This column contains the count of observations in the decile.
  • Response — This column contains the count of observations in the decile where the actual value of the dependent variable is 1.
  • Pct Response — This column contains the percentage of observations in the decile where the actual value of the dependent variable is 1.
  • Pct Captured Response — This column contains the percentage of responses in the decile over all the responses in any decile.
  • Lift — The lift value is the percentage response in the decile (Pct Response) divided by the expected response, where the expected response is the percentage of response or dependent 1-values over all observations. For example, if 10% of the observations overall have a dependent variable with value 1, and 20% of the observations in decile 1 have a dependent variable with value 1, then the lift value within decile 1 is 2.0, meaning that the model gives a “lift” that is better than chance alone by a factor of two in predicting response values of 1 within this decile.
  • Cumulative Response — This is a cumulative measure of Response, from decile 1 to this decile.
  • Cumulative Pct Response — This is a cumulative measure of Pct Response, from decile 1 to this decile.
  • Cumulative Pct Captured Response — This is a cumulative measure of Pct Captured Response, from decile 1 to this decile.
  • Cumulative Lift — This is a cumulative measure of Lift, from decile 1 to this decile.
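As referenced above, the following minimal SQL sketch shows the arithmetic behind the confusion matrix counts and a per-decile lift calculation, computed from the scored table joined back to the input table for the actual outcomes. The table and column names (twm_score_tree_1, twm_customer_analysis, cust_id, cc_acct as the 0/1 outcome, _tm_target as the model's probability-like score) follow the tutorial below and are assumptions; the reports produced by the analysis come from its own generated SQL, and the NTILE window function is assumed to be available in the database used.

-- Confusion matrix counts: predicted value from the scored table versus the
-- actual value carried in the input table.
SELECT s.cc_acct AS predicted, a.cc_acct AS actual, COUNT(*) AS n
FROM twm_score_tree_1 s
JOIN twm_customer_analysis a ON s.cust_id = a.cust_id
GROUP BY 1, 2
ORDER BY 1, 2;

-- Decile lift: rank rows into 10 groups by descending score, then compare the
-- response rate in each decile with the overall response rate.
SELECT
    decile,
    COUNT(*)                        AS cnt,
    SUM(actual)                     AS response,
    100.0 * SUM(actual) / COUNT(*)  AS pct_response,
    (1.0 * SUM(actual) / COUNT(*))
      / (SELECT 1.0 * SUM(cc_acct) / COUNT(*) FROM twm_customer_analysis) AS lift
FROM (
    SELECT a.cc_acct AS actual,
           NTILE(10) OVER (ORDER BY s._tm_target DESC) AS decile
    FROM twm_score_tree_1 s
    JOIN twm_customer_analysis a ON s.cust_id = a.cust_id
) d
GROUP BY decile
ORDER BY decile;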
Tree Scoring - RESULTS - Data
On the Tree Scoring dialog click RESULTS and then click on data (note that the RESULTS tab
will be grayed-out/disabled until after the analysis is completed):
Figure 86: Tree Scoring > Results > Data
Results data, if any, is displayed in a data grid.
If a table was created, a sample of rows is displayed here - the size determined by the setting
specified by Maximum result rows to display in Tools-Preferences-Limits.
The following table is built in the requested Output Database by the Decision Tree Scoring analysis. Note that the options selected affect the structure of the table. The Key column(s) listed first comprise the Primary Index. Also note that there may be repeated groups of columns, and that some columns will be generated only if specific options are selected.

Table 55: Output Database table (Built by the Decision Tree Scoring analysis)
• Key (User Defined) — One or more key columns, which default to the index defined in the table to be scored (i.e., in Selected Table). The data type defaults to the same as the scored table, but can be changed via Primary Index Columns.
• <app_var> (User Defined) — One or more columns as selected under Retain Columns.
• <dep_var> (User Defined) — The predicted value of the dependent variable. The name used defaults to the Dependent Variable specified when the tree was built. If Use Dependent variable for predicted value column name is not selected, then an appropriate column name must be entered and is used here. The data type used is the same as the Dependent Variable.
• _tm_node_id (FLOAT) — When the Create profiling tables option is selected this column is included to link each row with a particular leaf node in the decision tree and thereby with a specific set of rules.
• _tm_confidence or _tm_target (FLOAT; default) — Two mutually exclusive measures. If the Include Confidence Factor option is selected, _tm_confidence is generated and populated with Confidence Factors, a measure of how “confident” the model is that it can predict the correct score for a record that falls into a particular leaf node based on the training data the model was built from. If the Targeted Confidence (Binary Outcome Only) option is selected, _tm_target is generated and populated with Targeted Confidences for models built with a predicted value that has only 2 outcomes. The Targeted Confidence is a measure of how confident the model is that it can predict the correct score for a particular leaf node based upon a user-specified Target Value. For example, if a particular leaf node had an outcome of 9 “Buys” and 1 “Do Not Buy”, setting the Target Value to “Buy” would generate a .9 or 90% targeted confidence. However, if it is desired to set the Target Value to “Do Not Buy”, then any record falling into this leaf of the tree would get a targeted confidence of .1 or 10%.
• _tm_recalc_confidence or _tm_recalc_target (FLOAT) — Recalculated versions of the confidence factor or targeted confidence factor based on the original validation table when Score Only is selected, or based on the selected table to score when Evaluate and Score is selected.
The following table is built in the requested Output Database by the Decision Tree Scoring
analysis when the Create profiling tables option is selected. (It is named by appending “_1” to
the scored result table name).
Table 56: Output Database table (Built by the Decision Tree Scoring analysis) - Create profiling tables option selected (“_1” appended)
• _tm_node_id (FLOAT) — This column identifies a particular leaf node in the decision tree.
• _tm_confidence or _tm_target (FLOAT) — The confidence factor or targeted confidence factor for this leaf node, as described above for the scored output table.
• _tm_prediction (VARCHAR(n)) — The predicted value of the dependent variable at this leaf node.
The following table is built in the requested Output Database by the Decision Tree Scoring
analysis when the Create profiling tables option is selected. (It is named by appending “_2” to
the scored result table name).
Table 57: Output Database table (Built by the Decision Tree Scoring analysis) - Create profiling tables option selected (“_2” appended)
• _tm_node_id (FLOAT) — This column identifies a particular leaf node in the decision tree.
• _tm_sequence_id (FLOAT) — An integer from 1 to n to order the rules associated with a leaf node.
• _tm_rule (VARCHAR(n)) — A rule for inclusion in the ruleset for this leaf node in the decision tree (rules are joined with a logical AND).
Tree Scoring - RESULTS - Lift Graph
On the Tree Scoring dialog click RESULTS and then click on lift graph (note that the RESULTS
tab will be grayed-out/disabled until after the analysis is completed):
Figure 87: Tree Scoring > Results > Lift Graph
This chart displays the information in the Cumulative Lift Table. This is the same graph
described in “Results - Logistic Regression” on page 134 as Lift Chart, but applied to
possibly new data.
Tree Scoring - RESULTS - SQL
On the Tree Scoring dialog click RESULTS and then click on SQL (note that the RESULTS tab
will be grayed-out/disabled until after the analysis is completed):
Figure 88: Tree Scoring > Results > SQL
The SQL generated for scoring is returned here, but only if the Score Method was set to Score
on the Input - Analysis Parameters tab, and if the Output - Storage option to Generate the
SQL for this analysis, but do not execute it was selected. When SQL is displayed here, it may
be selected and copied as desired. (Both right-click menu options and buttons to Select All
and Copy are available).
Tutorial - Tree Scoring
In this example, the same table is scored as was used to build the decision tree model, as a
matter of convenience. Typically, this would not be done unless the contents of the table
changed since the model was built.
Parameterize a Decision Tree Scoring Analysis as follows:
• Selected Tables — twm_customer_analysis
• Scoring Method — Evaluate and Score
• Use the name of the dependent variable as the predicted value column name — Enabled
• Targeted Confidence(s) - For binary outcome only — Enabled
  • Targeted Value — 1
• Result Table Name — twm_score_tree_1
• Primary Index Columns — cust_id
Run the analysis, and click on Results when it completes. For this example, the Decision Tree
Scoring Analysis generated the following pages. A single click on each page name populates
Results with the item.
Table 58: Decision Tree Model Scoring Report

Resulting Scored Table Name    score_tree_1
Number of Rows in Scored File  747
Table 59: Confusion Matrix

            | Actual Non-Response | Actual Response | Correct    | Incorrect
Predicted 0 | 340/45.52%          | 0/0.00%         | 340/45.52% | 0/0.00%
Predicted 1 | 32/4.28%            | 375/50.20%      | 375/50.20% | 32/4.28%
Table 60: Cumulative Lift Table

Decile | Count | Response | Response (%) | Captured Response (%) | Lift | Cumulative Response | Cumulative Response (%) | Cumulative Captured Response (%) | Cumulative Lift
1      | 5     | 5.00     | 100.00       | 1.33                  | 1.99 | 5.00                | 100.00                  | 1.33                             | 1.99
2      | 0     | 0.00     | 0.00         | 0.00                  | 0.00 | 5.00                | 100.00                  | 1.33                             | 1.99
3      | 0     | 0.00     | 0.00         | 0.00                  | 0.00 | 5.00                | 100.00                  | 1.33                             | 1.99
4      | 0     | 0.00     | 0.00         | 0.00                  | 0.00 | 5.00                | 100.00                  | 1.33                             | 1.99
5      | 0     | 0.00     | 0.00         | 0.00                  | 0.00 | 5.00                | 100.00                  | 1.33                             | 1.99
6      | 402   | 370.00   | 92.04        | 98.67                 | 1.83 | 375.00              | 92.14                   | 100.00                           | 1.84
7      | 0     | 0.00     | 0.00         | 0.00                  | 0.00 | 375.00              | 92.14                   | 100.00                           | 1.84
8      | 0     | 0.00     | 0.00         | 0.00                  | 0.00 | 375.00              | 92.14                   | 100.00                           | 1.84
9      | 0     | 0.00     | 0.00         | 0.00                  | 0.00 | 375.00              | 92.14                   | 100.00                           | 1.84
10     | 340   | 0.00     | 0.00         | 0.00                  | 0.00 | 375.00              | 50.20                   | 100.00                           | 1.00
Table 61: Data

cust_id | cc_acct | _tm_target
1362480 | 1       | 0.92
1362481 | 0       | 0
1362484 | 1       | 0.92
1362485 | 0       | 0
1362486 | 1       | 0.92
…       | …       | …
Lift Graph
Additionally, you can click on Lift Chart to view the contents of the Lift Table graphically.
Factor Scoring
Factor analysis is designed primarily for the purpose of discovering the underlying structure
or meaning in a set of variables and to facilitate their reduction to a fewer number of variables
called factors or components. The first goal is facilitated by finding the factor loadings that
describe the variables in a data set in terms of a linear combination of factors. The second
goal is facilitated by finding a description for the factors as linear combinations of the original
variables they describe. These are sometimes called factor measurements or scores. After
computing the factor loadings, computing factor scores might seem like an afterthought, but it
is somewhat more involved than that. Teradata Warehouse Miner does, however, automate the process based on the model information stored in metadata results tables, computing factor
Note that Factor Scoring computes factor scores for a data set that has the same columns as
those used in performing the selected Factor Analysis. When scoring is performed, a table is
created including index (key) columns, optional “retain” columns, and factor scores for each
row in the input table being scored. Scoring is performed differently depending on the type of
factor analysis that was performed, whether principal components (PCA), principal axis
factors (PAF) or maximum likelihood factors (MLF). Further, scoring is affected by whether
or not the factor analysis included a rotation. Also, input data is centered based on the mean
value of each variable, and if the factor analysis was performed on a correlation matrix, input
values are each divided by the standard deviation of the variable in order to normalize to unit
length variance.
When scoring a table using a PCA factor analysis model, the scores can be calculated directly
without estimation, even if an orthogonal rotation was performed. When scoring using a PAF
or MLF model, or a PCA model with an oblique rotation, a unique solution does not exist and
cannot be directly solved for (a condition known as the indeterminacy of factor
measurements). There are many techniques however for estimating factor measurements, and
the technique used by Teradata Warehouse Miner is known as estimation by regression. This
technique involves regressing each factor on the original variables in the factor analysis
model using linear regression techniques. It gives an accurate solution in the “least-squared
error” sense but it typically introduces some degree of dependence or correlation in the
computed factor scores.
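To make the form of the scoring computation concrete, here is a minimal SQL sketch of how a single factor score could be computed as a linear combination of standardized input columns. The columns (income, age), the means, standard deviations and scoring coefficients are all hypothetical placeholders; the actual coefficients come from the model metadata, and the generated SQL handles every variable and factor in the model, including the standardization rules described above.

-- Hypothetical two-variable, one-factor sketch: each input is centered on its
-- mean and scaled by its standard deviation, then weighted by a scoring
-- coefficient taken from the factor model.
SELECT
    cust_id,
    0.42 * ((income - 25000.0) / 18000.0)
  + 0.31 * ((age    -    38.0) /    12.0) AS factor1
FROM twm_customer_analysis;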
A final word about the independence or orthogonality of factor scores is appropriate here. It
was pointed out earlier that factor loadings are orthogonal using the techniques offered by
Teradata Warehouse Miner unless an oblique rotation is performed. Factor scores however
will not necessarily be orthogonal for principal axis factors and maximum likelihood factors
and with oblique rotations since scores are estimated by regression. This is a subtle distinction
that is an easy source of confusion. That is, the new variables or factor scores created by a
factor analysis, expressed as a linear combination of the original variables, are not necessarily
independent of each other, even if the factors themselves are. The user may measure their
independence however by using the Matrix and Export Matrix functions to build a correlation
matrix from the factor score table once it is built.
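A quick spot check of this independence can also be made directly with SQL, since a CORR aggregate is commonly available (including in Teradata). The sketch below assumes the factor score table and column names used in the tutorial later in this section (twm_score_factor_1, factor1, factor2); values near zero indicate nearly orthogonal scores.

-- Pairwise correlation between two scored factors; repeat for other pairs.
SELECT CORR(factor1, factor2) AS corr_f1_f2
FROM twm_score_factor_1;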
Initiate Factor Scoring
After generating a Factor Analysis (as described in “Factor Analysis” on page 62) use the
following procedure to initiate Factor Scoring:
1  Click on the Add New Analysis icon in the toolbar:
Figure 89: Add New Analysis from toolbar
2  In the resulting Add New Analysis dialog box, click on Scoring under Categories and then under Analyses double-click on Factor Scoring:
Figure 90: Add New Analysis > Scoring > Factor Scoring
3  This will bring up the Factor Scoring dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
Factor Scoring - INPUT - Data Selection
On the Factor Scoring dialog click on INPUT and then click on data selection:
Figure 91: Factor Scoring > Input > Data Selection
On this screen select:
1  Select Input Source
   Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view.)
2  Select Columns From a Single Table
   • Available Databases — All available source databases that have been added on the Connection Properties dialog.
   • Available Tables — The tables available for scoring are listed in this window, though all may not strictly qualify; the input table to be scored must contain the same column names used in the original analysis.
   • Available Columns — The columns available for scoring are listed in this window.
   • Selected Columns
     • Index Columns — Note that the Selected Columns window is actually a split window for specifying Index and/or Retain columns if desired. If a table is specified as input, the primary index of the table is defaulted here, but can be changed. If a view is specified as input, an index must be provided.
     • Retain Columns — Other columns within the table being scored can be appended to the scored table by specifying them here. Columns specified in Index Columns may not be specified here.
3  Select Model Analysis
   Select from the list an existing Factor Analysis analysis on which to run the scoring. The Factor Analysis analysis must exist in the same project as the Factor Scoring analysis.
Factor Scoring - INPUT - Analysis Parameters
On the Factor Scoring dialog click on INPUT and then click on analysis parameters:
Figure 92: Factor Scoring > Input > Analysis Parameters
On this screen select:
• Scoring Method
  • Score — Option to create a score table only.
  • Evaluate — Option to perform model evaluation only.
  • Evaluate and Score — Option to create a score table and perform model evaluation.
  • Factor Names — The names of the factor columns in the created table of scores are optional parameters if scoring is selected. The default names of the factor columns are factor1, factor2 ... factorn.
Factor Scoring - OUTPUT
On the Factor Scoring dialog click on OUTPUT:
Figure 93: Factor Scoring > Output
On this screen select:
• Output Table
  • Database name — The name of the database.
  • Table name — The name of the scored output table to be created; required only if a scoring option is selected.
  • Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected.
  • Create output table using the MULTISET keyword — This option is not enabled for scored output tables; the MULTISET keyword is not used.
  • Advertise Output — The Advertise Output option “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis.
  • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
  • Generate the SQL for this analysis, but do not execute it — If this option is selected, the analysis will only generate SQL, returning it and terminating immediately.
Note: To create a stored procedure to score this model, use the Refresh analysis.
Run the Factor Scoring Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Factor Scoring
The results of running the Teradata Warehouse Miner Factor Analysis Scoring/Evaluation
Analysis include a variety of statistical reports on the scored model. All of these results are
outlined below.
Factor Scoring - RESULTS - reports
On the Factor Scoring dialog click RESULTS and then click on reports (note that the RESULTS
tab will be grayed-out/disabled until after the analysis is completed):
Figure 94: Factor Scoring > Results > Reports
• Resulting Scored Table — Name of the scored table; equivalent to Result Table Name.
• Number of Rows in Scored Table — Number of rows in the Resulting Scored Table.
• Evaluation — Model evaluation for factor analysis consists of computing the standard error of estimate for each variable based on working backwards and re-estimating its values using the scored factors. Estimated values of the original data are made using the factor scoring equation Ŷ = XCᵀ, where Ŷ is the estimated raw data, X is the scored data, and C is the factor pattern matrix (or rotated factor pattern matrix if rotation was included in the model). The standard error of estimate for each variable y in the original data Y is then given by:

  sqrt( Σ (y - ŷ)² / (n - p) )

  where each ŷ is the estimated value of the variable y, n is the number of observations and p is the number of factors. (A SQL sketch of this computation for a single variable follows.)
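As referenced above, the per-variable standard error of estimate can be reproduced from the score table once the variable is re-estimated as a linear combination of the scored factors and its loadings. The sketch below is illustrative only: the loadings (0.61, 0.22), the variable (income), the factor count (p = 2) and the join key are hypothetical placeholders, and it assumes the input value has been standardized in the same way the model was fitted.

-- Hypothetical check of the standard error of estimate for one variable,
-- re-estimating it from two scored factors and fixed (made-up) loadings.
SELECT SQRT(SUM(err * err) / (COUNT(*) - 2)) AS std_err_income
FROM (
    SELECT s.income - (f.factor1 * 0.61 + f.factor2 * 0.22) AS err
    FROM twm_customer_analysis s
    JOIN twm_score_factor_1 f
      ON s.cust_id = f.cust_id
) e;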
Factor Scoring - RESULTS - Data
On the Factor Scoring dialog click RESULTS and then click on data (note that the RESULTS
tab will be grayed-out/disabled until after the analysis is completed):
Figure 95: Factor Scoring > Results > Data
Results data, if any, is displayed in a data grid.
If a table was created, a sample of rows is displayed here - the size determined by the setting
specified by Maximum result rows to display in Tools-Preferences-Limits.
The following table is built in the requested Output Database by Factor Scoring. Note that the options selected affect the structure of the table. The Key column(s) listed first comprise the Unique Primary Index (UPI). Also note that there may be repeated groups of columns, and that some columns will be generated only if specific options are selected.

Table 62: Output Database table (Built by Factor Scoring)
• Key (User Defined) — One or more unique-key columns which default to the index defined in the table to be scored (i.e., in Selected Tables). The data type defaults to the same as the scored table, but can be changed via Primary Index Columns.
• <app_var> (User Defined) — One or more columns as selected under Retain Columns. The data type defaults to the same as that within the appended table, but can be changed via Column Types (for appended columns).
• Factorx (FLOAT; default) — A column generated for each scored factor. The names of the factor columns in the created table of scores are optional parameters if scoring is selected. The default names of the factor columns are factor1, factor2, ... factorn, unless Factor Names are specified.
Factor Scoring - RESULTS - SQL
On the Factor Scoring dialog click RESULTS and then click on SQL (note that the RESULTS
tab will be grayed-out/disabled until after the analysis is completed):
Figure 96: Factor Scoring > Results > SQL
The SQL generated for scoring is returned here, but only if the Score Method was set to Score
on the Input - Analysis Parameters tab, and if the Output - Storage option to Generate the
SQL for this analysis, but do not execute it was selected. When SQL is displayed here, it may
be selected and copied as desired. (Both right-click menu options and buttons to Select All
and Copy are available).
Tutorial - Factor Scoring
In this example, the same table is scored as was used to build the factor analysis model.
Parameterize a Factor Analysis Scoring Analysis as follows:
• Selected Table — twm_customer_analysis
• Evaluate and Score — Enabled
• Factor Names — Factor1, Factor2, Factor3, Factor4, Factor5, Factor6, Factor7
• Result Table Name — twm_score_factor_1
• Index Columns — cust_id
Run the analysis, and click on Results when it completes. For this example, the Factor
Analysis Scoring/Evaluation function generated the following pages. A single click on each
page name populates Results with the item.
Table 63: Factor Analysis Score Report

Resulting Scored Table          <result_db>.score_factor_1
Number of Rows in Scored Table  747
Table 64: Evaluation

Variable Name    | Standard Error of Estimate
income           | 0.4938
age              | 0.5804
years_with_bank  | 0.5965
nbr_children     | 0.6180
female           | 0.8199
single           | 0.3013
married          | 0.3894
separated        | 0.4687
ccacct           | 0.6052
ckacct           | 0.5660
svacct           | 0.5248
avg_cc_bal       | 0.4751
avg_ck_bal       | 0.6613
avg_sv_bal       | 0.7166
avg_cc_tran_amt  | 0.8929
avg_cc_tran_cnt  | 0.5174
avg_ck_tran_amt  | 0.3563
avg_ck_tran_cnt  | 0.7187
avg_sv_tran_amt  | 0.4326
avg_sv_tran_cnt  | 0.6967
cc_rev           | 0.3342
Table 65: Data

cust_id | factor1 | factor2 | factor3 | factor4 | factor5 | factor6 | factor7
1362480 | 1.43    | -0.28   | 1.15    | -0.50   | -0.31   | -0.05   | 1.89
1362481 | -1.03   | -1.37   | 0.57    | -0.08   | -0.60   | -0.39   | -0.55
...     | ...     | ...     | ...     | ...     | ...     | ...     | ...
Linear Scoring
Once a linear regression model has been built, it can be used to “score” new data, that is, to
estimate the value of the dependent variable in the model using data for which its value may
not be known. Scoring is performed using the values of the b-coefficients in the linear
regression model and the names of the independent variable columns they correspond to.
Other information needed includes the table name(s) in which the data resides, the new table
to be created, and primary index information for the new table. The result of scoring a linear
regression model will be a new table containing primary index columns and an estimate of the
dependent variable, optionally including a residual value for each row, calculated as the
difference between the estimated value and the actual value of the dependent variable. (The
option to include the residual value is available only when model evaluation is requested).
Note that Linear Scoring applies a Linear Regression model to a data set that has the same
columns as those used in building the model (with the exception that the scoring input table
need not include the predicted or dependent variable column unless model evaluation is
requested).
Linear Regression Model Evaluation
Linear regression model evaluation begins with scoring a table that includes the actual values
of the dependent variable. The standard error of estimate for the model is calculated and
reported and may be compared to the standard error of estimate reported when the model was
built. The standard error of estimate is calculated as the square root of the average squared
residual value over all the observations, i.e.
  sqrt( Σ (y - ŷ)² / (n - p - 1) )

where y is the actual value of the dependent variable, ŷ is its predicted value, n is the number of observations, and p is the number of independent variables (substituting n-p in the denominator if there is no constant term).
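To illustrate the shape of the SQL that scoring a linear model produces, here is a minimal sketch for a model with two independent variables. The b-coefficients (3.2, 0.0005, -0.8), the input columns (income, age) and the output column names are hypothetical placeholders; the real statement is generated from the coefficients stored in the model metadata and includes every term in the model. In the sketch, predicted_cc_rev stands in for the Predicted Value Column Name, while cc_rev is the actual dependent variable column in the input table.

-- Hypothetical linear scoring statement: constant plus weighted inputs, with a
-- residual column as produced when Evaluate and Score is requested.
SELECT
    cust_id,
    3.2 + 0.0005 * income - 0.8 * age            AS predicted_cc_rev,
    (3.2 + 0.0005 * income - 0.8 * age) - cc_rev AS residual
FROM twm_customer_analysis;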
Initiate Linear Scoring
After generating a Linear Regression analysis (as described in “Linear Regression” on
page 92) use the following procedure to initiate Linear Regression Scoring:
1  Click on the Add New Analysis icon in the toolbar:
Figure 97: Add New Analysis from toolbar
2  In the resulting Add New Analysis dialog box, click on Scoring under Categories and then under Analyses double-click on Linear Scoring:
Figure 98: Add New Analysis > Scoring > Linear Scoring
3  This will bring up the Linear Scoring dialog in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
Linear Scoring - INPUT - Data Selection
On the Linear Scoring dialog click on INPUT and then click on data selection:
Figure 99: Linear Scoring > Input > Data Selection
On this screen select:
1  Select Input Source
   Users may select between different sources of input. By selecting the Input Source Table the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Input Source Analysis however the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis. (Note that since this analysis cannot select from a volatile input table, Available Analyses will contain only those qualifying analyses that create an output table or view.)
2  Select Columns From a Single Table
   • Available Databases — All available source databases that have been added on the Connection Properties dialog.
   • Available Tables — The tables available for scoring are listed in this window, though all may not strictly qualify; the input table to be scored must contain the same column names used in the original analysis.
   • Available Columns — The columns available for scoring are listed in this window.
   • Selected Columns
     • Index Columns — Note that the Selected Columns window is actually a split window for specifying Index and/or Retain columns if desired. If a table is specified as input, the primary index of the table is defaulted here, but can be changed. If a view is specified as input, an index must be provided.
     • Retain Columns — Other columns within the table being scored can be appended to the scored table by specifying them here. Columns specified in Index Columns may not be specified here.
3  Select Model Analysis
   Select from the list an existing Linear Regression analysis on which to run the scoring. The Linear Regression analysis must exist in the same project as the Linear Scoring analysis.
Linear Scoring - INPUT - Analysis Parameters
On the Linear Scoring dialog click on INPUT and then click on analysis parameters:
Figure 100: Linear Scoring > Input > Analysis Parameters
On this screen select:
• Scoring Method
  • Score — Option to create a score table only.
  • Evaluate — Option to perform model evaluation only.
  • Evaluate and Score — Option to create a score table and perform model evaluation.
• Scoring Options
  • Use Dependent variable for predicted value column name — Option to use the exact same column name as the dependent variable when the model is scored. This is the default option.
  • Predicted Value Column Name — If the above option is not checked, then enter here the name of the column in the score table which contains the estimated value of the dependent variable.
  • Residual Column Name — If Evaluate and Score is requested, enter the name of the column that will contain the residual values of the evaluation. This column will be populated with the difference between the estimated value and the actual value of the dependent variable.
Linear Scoring - OUTPUT
On the Linear Scoring dialog click on OUTPUT:
Figure 101: Linear Scoring > Output
On this screen select:
• Output Table
  • Database name — The name of the database.
  • Table name — The name of the scored output table to be created; required only if a scoring option is selected.
  • Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected.
  • Create output table using the MULTISET keyword — This option is not enabled for scored output tables; the MULTISET keyword is not used.
  • Advertise Output — The Advertise Output option “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis.
  • Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
  • Generate the SQL for this analysis, but do not execute it — If this option is selected, the analysis will only generate SQL, returning it and terminating immediately.
Note: To create a stored procedure to score this model, use the Refresh analysis.
Run the Linear Scoring Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Linear Scoring
The results of running the Linear Regression Scoring/Evaluation analysis include a variety of
statistical reports on the scored model. All of these results are outlined below.
Linear Scoring - RESULTS - reports
On the Linear Scoring dialog click RESULTS and then click on reports (note that the RESULTS
tab will be grayed-out/disabled until after the analysis is completed):
Figure 102: Linear Scoring > Results > Reports
• Resulting Scored Table — Name of the scored table - equivalent to Result Table Name.
• Number of Rows in Scored Table — Number of rows in the Resulting Scored Table.
• Evaluation
  • Minimum Absolute Error
  • Maximum Absolute Error
  • Average Absolute Error
  The term ‘error’ in the evaluation of a linear regression model refers to the difference between the value of the dependent variable predicted by the model and the actual value in a training set of data (data where the value of the dependent variable is known). Considering the absolute value of the error (changing negative differences to positive differences) provides a measure of the magnitude of the error in the model, which is a more useful measure of the model’s accuracy. With this introduction, the terms minimum, maximum and average absolute error have the usual meanings when calculated over all the observations in the input or scored table.
  • Standard Error of Estimate
  The standard error of estimate is calculated as the square root of the average squared residual value over all the observations, i.e.

  sqrt( Σ (y - ŷ)² / (n - p - 1) )

  where y is the actual value of the dependent variable, ŷ is its predicted value, n is the number of observations, and p is the number of independent variables (substitute n-p in the denominator if there is no constant term). (A SQL sketch of these evaluation measures follows.)
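As referenced above, these evaluation measures are simple aggregates over the residuals. The sketch below recomputes them from a scored table that contains a Residual column, as produced by the Evaluate and Score option; the table and column names (twm_score_linear_1, Residual) follow the tutorial below, and the count of independent variables (7 in this illustration) is a made-up value that must match the actual model.

-- Evaluation measures recomputed from the residual column of a scored table.
-- The divisor COUNT(*) - 7 - 1 assumes a 7-variable model with a constant term.
SELECT
    MIN(ABS(Residual))                                  AS min_abs_error,
    MAX(ABS(Residual))                                  AS max_abs_error,
    AVG(ABS(Residual))                                  AS avg_abs_error,
    SQRT(SUM(Residual * Residual) / (COUNT(*) - 7 - 1)) AS std_err_of_estimate
FROM twm_score_linear_1;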
Linear Scoring - RESULTS - data
On the Linear Scoring dialog click RESULTS and then click on data (note that the RESULTS
tab will be grayed-out/disabled until after the analysis is completed):
Figure 103: Linear Scoring > Results > Data
Results data, if any, is displayed in a data grid.
If a table was created, a sample of rows is displayed here - the size determined by the setting
specified by Maximum result rows to display in Tools-Preferences-Limits.
The following table is built in the requested Output Database by Linear Regression scoring. Note that the options selected affect the structure of the table. The Key column(s) listed first comprise the Unique Primary Index (UPI). Also note that there may be repeated groups of columns, and that some columns will be generated only if specific options are selected.

Table 66: Output Database table (Built by Linear Regression scoring)
• Key (User Defined) — One or more unique-key columns which default to the index defined in the table to be scored (i.e., in Selected Tables). The data type defaults to the same as the scored table, but can be changed via Primary Index Columns.
• <app_var> (User Defined) — One or more columns as selected under Retain Columns.
• <dep_var> (FLOAT; default) — The predicted value of the dependent variable. The name used defaults to the Dependent Variable specified when the model was built. If Use Dependent variable for predicted value column name is not selected, then an appropriate column name must be entered here.
• Residual (FLOAT; default) — The residual values of the evaluation, the difference between the estimated value and the actual value of the dependent variable. This is generated only if the Evaluate or Evaluate and Score options are selected. The name defaults to “Residual” unless it is overwritten by the user.
Linear Scoring - RESULTS - SQL
On the Linear Scoring dialog click RESULTS and then click on SQL (note that the RESULTS
tab will be grayed-out/disabled until after the analysis is completed):
Figure 104: Linear Scoring > Results > SQL
The SQL generated for scoring is returned here, but only if the Score Method was set to Score
on the Input - Analysis Parameters tab, and if the Output - Storage option to Generate the
SQL for this analysis, but do not execute it was selected. When SQL is displayed here, it may
be selected and copied as desired. (Both right-click menu options and buttons to Select All
and Copy are available).
Tutorial - Linear Scoring
In this example, the same table is scored as was used to build the linear model, as a matter of
convenience. Typically, this would not be done unless the contents of the table changed since
the model was built. In the case of this example, the Standard Error of Estimate can be seen to
be exactly the same, 10.445, that it was when the model was built (see “Tutorial - Linear
Regression” on page 113).
Parameterize a Linear Regression Scoring Analysis as follows:
• Selected Table — twm_customer_analysis
• Evaluate and Score — Enabled
• Use dependent variable for predicted value column name — Enabled
• Residual column name — Residual
• Result Table Name — twm_score_linear_1
• Index Columns — cust_id
Run the analysis, and click on Results when it completes. For this example, the Linear
Regression Scoring/Evaluation Analysis generated the following pages. A single click on
each page name populates Results with the item.
Table 67: Linear Regression Reports

Resulting Scored Table          <result_db>.score_linear_1
Number of Rows in Scored Table  747
Table 68: Evaluation

Minimum Absolute Error      0.0056
Maximum Absolute Error      65.7775
Average Absolute Error      7.2201
Standard Error of Estimate  10.4451
Table 69: Data

cust_id | cc_rev     | Residual
1362480 | 59.188     | 15.812
1362481 | 3.412      | -3.412
1362484 | 12.254     | -.254
1362485 | 28.272     | 1.728
1362486 | -9.026E-02 | 9.026E-02
1362487 | 14.325     | -1.325
1362488 | -5.105     | 5.105
1362489 | 69.738     | 12.262
1362492 | 53.368     | .632
1362496 | -5.876     | 5.876
…       | …          | …
Logistic Scoring
Once a logistic regression model has been built, it can be used to “score” new data, that is, to
estimate the value of the dependent variable in the model using data for which its value may
not be known. Scoring is performed using the values of the b-coefficients in the logistic
regression model and the names of the independent variable column names they correspond
to. This information resides in the results metadata stored in the Teradata database by
Teradata Warehouse Miner. Other information needed includes the table name in which the
data resides, the new table to be created, and primary index information for the new table.
Scoring a logistic regression model requires some steps beyond those required in scoring a
linear regression model. The result of scoring a logistic regression model will be a new table
containing primary index columns, the probability that the dependent variable is 1
(representing the response value) rather than 0 (representing the non-response value), and/or
an estimate of the dependent variable, either 0 or 1, based on a user specified threshold value.
For example, if the threshold value is 0.5, then a value of 1 is estimated if the probability
value is greater than or equal to 0.5. The probability is based on the logistic regression
functions given earlier.
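As a rough illustration only, the calculation can be sketched in Python; the coefficient names, values and input row below are hypothetical, and the product itself performs the equivalent arithmetic in generated SQL:

import math

def score_logistic(row, b_coefficients, intercept, threshold=0.5):
    # Combine the model's b-coefficients with the row's independent-variable
    # values, then push the result through the logistic function.
    z = intercept + sum(b * row[col] for col, b in b_coefficients.items())
    probability = 1.0 / (1.0 + math.exp(-z))
    estimate = 1 if probability >= threshold else 0
    return probability, estimate

# Hypothetical model and data row, for illustration only.
probability, estimate = score_logistic(
    {"income": 40.0, "age": 35.0},
    {"income": 0.02, "age": -0.01},
    intercept=-0.5,
    threshold=0.5)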
The user can achieve different results based on the threshold value applied to the probability.
The model evaluation tables described below can be used to determine what this threshold
value should be.
Note that Logistic Scoring applies a Logistic Regression model to a data set that has the same
columns as those used in building the model (with the exception that the scoring input table
need not include the predicted or dependent variable column unless model evaluation is
requested).
Logistic Regression Model Evaluation
The same model evaluation that is available when building a Logistic Regression model is
also available when scoring it, including the following report tables.
Prediction Success Table
The prediction success table is computed using only probabilities and not estimates based on a threshold value. Using an input table that contains known values for the dependent variable, the sums of the probability values π(x) and 1 − π(x), which correspond to the probability that the predicted value is 1 or 0 respectively, are calculated separately for rows with actual values of 1 and 0. This produces a report table such as that shown below.
Table 70: Prediction Success Table

                    | Estimate Response | Estimate Non-Response | Actual Total
Actual Response     | 306.5325          | 68.4675               | 375.0000
Actual Non-Response | 69.0115           | 302.9885              | 372.0000
Estimated Total     | 375.5440          | 371.4560              | 747.0000
An interesting and useful feature of this table is that it is independent of the threshold value that is used in scoring to determine which probabilities correspond to an estimate of 1 and 0 respectively. This is possible because the entries in the “Estimate Response” column are the sums of the probabilities π(x) that the outcome is 1, summed separately over the rows where the actual outcome is 1 and 0 and then totaled. Similarly, the entries in the “Estimate Non-Response” column are the sums of the probabilities 1 − π(x) that the outcome is 0.
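A minimal Python sketch of this computation, assuming the scored rows are available in memory as (actual, probability) pairs rather than in the database:

def prediction_success(rows):
    # rows: iterable of (actual, probability) pairs, with actual in {0, 1}.
    # Each row contributes its probability mass to the "Estimate Response"
    # column and the remainder to the "Estimate Non-Response" column.
    table = {1: [0.0, 0.0], 0: [0.0, 0.0]}
    for actual, p in rows:
        table[actual][0] += p
        table[actual][1] += 1.0 - p
    # table[1] is the Actual Response row, table[0] the Actual Non-Response row.
    return table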
Multi-Threshold Success Table
This table provides values similar to those in the prediction success table, but instead of
summing probabilities, the estimated values based on a threshold value are summed instead.
Rather than just one threshold however, several thresholds ranging from a user specified low
to high value are displayed in user specified increments. This allows the user to compare
several success scenarios using different threshold values, to aid in the choice of an ideal
threshold.
It might be supposed that the ideal threshold value would be the one that maximizes the
number of correctly classified observations. However, subjective business considerations
may be applied by looking at all of the success values. It may be that wrong predictions in one
direction (say estimate 1 when the actual value is 0) may be more tolerable than in the other
direction (estimate 0 when the actual value is 1). One may, for example, mind less
overlooking fraudulent behavior than wrongly accusing someone of fraud.
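A simplified Python sketch of how the counts behind such a table could be produced, again assuming in-memory (actual, probability) pairs and a hypothetical threshold range:

def multi_threshold_success(rows, begin=0.0, end=0.95, increment=0.05):
    # rows: iterable of (actual, probability) pairs, with actual in {0, 1}.
    report = []
    threshold = begin
    while threshold <= end + 1e-9:
        counts = [0, 0, 0, 0]  # resp/resp, resp/non, non/resp, non/non
        for actual, p in rows:
            estimate = 1 if p >= threshold else 0
            if actual == 1:
                counts[0 if estimate == 1 else 1] += 1
            else:
                counts[2 if estimate == 1 else 3] += 1
        report.append((round(threshold, 4), counts))
        threshold += increment
    return report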
The following is an example of a logistic regression multi-threshold success table.
Table 71: Logistic Regression Multi-Threshold Success table

Threshold Probability | Actual Response, Estimate Response | Actual Response, Estimate Non-Response | Actual Non-Response, Estimate Response | Actual Non-Response, Estimate Non-Response
0.0000 | 375 | 0 | 372 | 0
0.0500 | 375 | 0 | 326 | 46
0.1000 | 374 | 1 | 231 | 141
0.1500 | 372 | 3 | 145 | 227
0.2000 | 367 | 8 | 93 | 279
0.2500 | 358 | 17 | 59 | 313
0.3000 | 354 | 21 | 46 | 326
0.3500 | 347 | 28 | 38 | 334
0.4000 | 338 | 37 | 32 | 340
0.4500 | 326 | 49 | 27 | 345
0.5000 | 318 | 57 | 27 | 345
0.5500 | 304 | 71 | 26 | 346
0.6000 | 296 | 79 | 24 | 348
0.6500 | 287 | 88 | 22 | 350
0.7000 | 279 | 96 | 21 | 351
0.7500 | 270 | 105 | 19 | 353
0.8000 | 258 | 117 | 18 | 354
0.8500 | 245 | 130 | 16 | 356
0.9000 | 222 | 153 | 12 | 360
0.9500 | 187 | 188 | 10 | 362
Cumulative Lift Table
The Cumulative Lift Table is produced for deciles based on the probability values. Note that
the deciles are labeled such that 1 is the highest decile and 10 is the lowest, based on the
probability values calculated by logistic regression. Within each decile, the following
measures are given:
1
count of “response” values
2
count of observations
3
percentage response (percentage of response values within the decile)
4
captured response (percentage of responses over all response values)
5
lift value (percentage response / expected response, where the expected response is the
percentage of responses over all observations)
6
cumulative versions of each of the measures above
The following is an example of a logistic regression Cumulative Lift Table.
Table 72: Logistic Regression Cumulative Lift Table

Decile | Count | Response | Response (%) | Captured Response (%) | Lift | Cumulative Response | Cumulative Response (%) | Cumulative Captured Response (%) | Cumulative Lift
1 | 74.0000 | 73.0000 | 98.6486 | 19.4667 | 1.9651 | 73.0000 | 98.6486 | 19.4667 | 1.9651
2 | 75.0000 | 69.0000 | 92.0000 | 18.4000 | 1.8326 | 142.0000 | 95.3020 | 37.8667 | 1.8984
3 | 75.0000 | 71.0000 | 94.6667 | 18.9333 | 1.8858 | 213.0000 | 95.0893 | 56.8000 | 1.8942
4 | 74.0000 | 65.0000 | 87.8378 | 17.3333 | 1.7497 | 278.0000 | 93.2886 | 74.1333 | 1.8583
5 | 75.0000 | 63.0000 | 84.0000 | 16.8000 | 1.6733 | 341.0000 | 91.4209 | 90.9333 | 1.8211
6 | 75.0000 | 23.0000 | 30.6667 | 6.1333 | 0.6109 | 364.0000 | 81.2500 | 97.0667 | 1.6185
7 | 74.0000 | 8.0000 | 10.8108 | 2.1333 | 0.2154 | 372.0000 | 71.2644 | 99.2000 | 1.4196
8 | 75.0000 | 2.0000 | 2.6667 | 0.5333 | 0.0531 | 374.0000 | 62.6466 | 99.7333 | 1.2479
9 | 75.0000 | 1.0000 | 1.3333 | 0.2667 | 0.0266 | 375.0000 | 55.8036 | 100.0000 | 1.1116
10 | 75.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 375.0000 | 50.2008 | 100.0000 | 1.0000
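For reference, the decile and lift arithmetic can be sketched roughly in Python, assuming an in-memory list of (actual, probability) pairs; the product performs the equivalent work in SQL:

def cumulative_lift(rows, deciles=10):
    # Rank by descending probability, split into deciles, and compute the
    # response, captured response and lift measures per decile and cumulatively.
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    n = len(ranked)
    total_responses = sum(actual for actual, _ in ranked)
    overall_rate = total_responses / n  # expected response rate
    report, cum_resp, cum_count = [], 0, 0
    for d in range(deciles):
        chunk = ranked[d * n // deciles:(d + 1) * n // deciles]
        resp = sum(actual for actual, _ in chunk)
        cum_resp += resp
        cum_count += len(chunk)
        report.append({
            "decile": d + 1,
            "count": len(chunk),
            "response": resp,
            "response_pct": 100.0 * resp / len(chunk),
            "captured_pct": 100.0 * resp / total_responses,
            "lift": (resp / len(chunk)) / overall_rate,
            "cumulative_lift": (cum_resp / cum_count) / overall_rate,
        })
    return report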
Initiate Logistic Scoring
After generating a Logistic Regression analysis (as described in “Logistic Regression” on
page 120) use the following procedure to initiate Logistic Scoring:
1
Click on the Add New Analysis icon in the toolbar:
Figure 105: Add New Analysis from toolbar
2
In the resulting Add New Analysis dialog box, click on Scoring under Categories and then
under Analyses double-click on Logistic Scoring:
Figure 106: Add New Analysis > Scoring > Logistic Scoring
3
This will bring up the Logistic Scoring dialog in which you will enter INPUT and
OUTPUT options to parameterize the analysis as described in the following sections.
Logistic Scoring - INPUT - Data Selection
On the Logistic Scoring dialog click on INPUT and then click on data selection:
Figure 107: Logistic Scoring > Input > Data Selection
On this screen select:
1
Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis. (Note that since this analysis cannot
select from a volatile input table, Available Analyses will contain only those qualifying
analyses that create an output table or view).
2
Select Columns From a Single Table
•
Available Databases — All available source databases that have been added on the
Connection Properties dialog.
•
Available Tables — The tables available for scoring are listed in this window, though
all may not strictly qualify; the input table to be scored must contain the same column
names used in the original analysis.
•
Available Columns — The columns available for scoring are listed in this window.
•
Selected Columns
•
Index Columns — Note that the Selected Columns window is actually a split
window for specifying Index and/or Retain columns if desired. If a table is
specified as input, the primary index of the table is defaulted here, but can be
changed. If a view is specified as input, an index must be provided.
•
Retain Columns — Other columns within the table being scored can be appended
to the scored table, by specifying them here. Columns specified in Index Columns
may not be specified here.
3
Select Model Analysis
Select from the list an existing Logistic Regression analysis on which to run the scoring.
The Logistic Regression analysis must exist in the same project as the Logistic Scoring
analysis.
Logistic Scoring - INPUT - Analysis Parameters
On the Logistic Scoring dialog click on INPUT and then click on analysis parameters:
Figure 108: Logistic Scoring > Input > Analysis Parameters
On this screen select:
• Scoring Method
•
Score — Option to create a score table only.
•
Evaluate — Option to perform model evaluation only.
•
Evaluate and Score — Option to create a score table and perform model evaluation.
• Scoring Options
•
Include Probability Score Column — Inclusion of a column in the score table that
contains the probability between 0 and 1 that the value of the dependent variable is 1 is
an optional parameter when scoring is selected. The default is to include a probability
score column in the created score table. (Either the probability score or the estimated
value or both must be requested when scoring).
•
Column Name — Column name containing the probability between 0 and 1 that the
value of the dependent variable is 1.
•
Include Estimate from Threshold Column — Inclusion of a column in the score table
that contains the estimated value of the dependent variable is an option when scoring
is selected. The default is to include an estimated value column in the created score
table. (Either the probability score or the estimated value or both must be requested
when scoring).
•
Column Name — Column name containing the estimated value of the dependent
variable.
•
Threshold Default — The threshold value is a value between 0 and 1 that
determines which probabilities result in an estimated value of 0 or 1. For example,
with a threshold value of 0.3, probabilities of 0.3 or greater yield an estimated
value of 1, while probabilities less than 0.3 yield an estimated value of 0. The
threshold option is valid only if the Include Estimate option has been requested and
scoring is selected. If the Include Estimate option is requested but the threshold
value is not specified, a default threshold value of 0.5 is used.
• Evaluation Options
•
Prediction Success Table — Creates a prediction success table using sums of
probabilities rather than estimates based on a threshold value. The default value is to
include the Prediction Success Table. (This only applies if evaluation is requested).
•
Multi-Threshold Success Table — This table provides values similar to those in the
prediction success table, but based on a range of threshold values, thus allowing the
user to compare success scenarios using different threshold values. The default value
is to include the multi-threshold success table. (This only applies if evaluation is
requested).
•
Threshold Begin
•
Threshold End
•
Threshold Increment
Specifies the threshold values to be used in the multi-threshold success table. If the
computed probability is greater than or equal to a threshold value, that observation
is assigned a 1 rather than a 0. Default values are 0, 1 and .05 respectively.
•
Cumulative Lift Table — Produce a cumulative lift table for deciles based on
probability values. The default value is to include the cumulative lift table. (This only
applies if evaluation is requested).
Logistic Scoring - OUTPUT
On the Logistic Scoring dialog click on OUTPUT:
Figure 109: Logistic Scoring > Output
On this screen select:
• Output Table
•
Database name — The name of the database.
•
Table name — The name of the scored output table to be created. Required only if a scoring option is selected.
•
Create output table using the FALLBACK keyword — If a table is selected, it will be
built with FALLBACK if this option is selected.
•
Create output table using the MULTISET keyword — This option is not enabled for
scored output tables; the MULTISET keyword is not used.
•
Advertise Output — The Advertise Output option “advertises” output by inserting
information into one or more of the Advertise Output metadata tables according to the
type of analysis and the options selected in the analysis.
•
Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that
may be used to categorize or describe the output.
•
Generate the SQL for this analysis, but do not execute it — If this option is selected the
analysis will only generate SQL, returning it and terminating immediately.
Note: To create a stored procedure to score this model, use the Refresh analysis.
Run the Logistic Scoring Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon
on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Logistic Scoring
The results of running the Logistic Scoring analysis include a variety of statistical reports on
the scored model, and if selected, a Lift Chart. All of these results are outlined below.
It is important to note that although a response value other than 1 may have been indicated
when the Logistic Regression model was built, the Logistic Regression Scoring analysis will
always use the value 1 as the response value, and the value 0 for the non-response value(s).
Logistic Scoring - RESULTS - Reports
On the Logistic Scoring dialog click RESULTS and then click on reports (note that the
RESULTS tab will be grayed-out/disabled until after the analysis is completed):
Figure 110: Logistic Scoring > Results > Reports
• Resulting Scored Table — Name of the scored table - equivalent to Result Table Name.
• Number of Rows in Scored Table — Number of rows in the Resulting Scored Table.
• Prediction Success Table — This is the same report described in “Results - Logistic
Regression” on page 134, but applied to possibly new data.
• Multi-Threshold Success Table — This is the same report described in “Results - Logistic
Regression” on page 134, but applied to possibly new data.
• Cumulative Lift Table — This is the same report described in “Results - Logistic
Regression” on page 134, but applied to possibly new data.
Logistic Scoring - RESULTS - Data
On the Logistic Scoring dialog click RESULTS and then click on data (note that the RESULTS
tab will be grayed-out/disabled until after the analysis is completed):
Figure 111: Logistic Scoring > Results > Data
Results data, if any, is displayed in a data grid.
If a table was created, a sample of rows is displayed here - the size determined by the setting
specified by Maximum result rows to display in Tools-Preferences-Limits.
The following table is built in the requested Output Database by Logistic Regression scoring.
Note that the options selected affect the structure of the table. Those columns in bold below
will comprise the Unique Primary Index (UPI). Also note that there may be repeated groups
of columns, and that some columns will be generated only if specific options are selected.
Table 73: Output Database table (Built by Logistic Regression scoring)

Key (User Defined): One or more unique-key columns which default to the index defined in the table to be scored (i.e., in Selected Table). The data type defaults to the same as the scored table, but can be changed via Primary Index Columns.
<app_var> (User Defined): One or more columns as selected under Retain Columns.
Probability (Default) (FLOAT): A probability between 0 and 1 that the value of the dependent variable is 1. The name used defaults to “Probability” unless an appropriate column name is entered. Generated only if Include Probability Score Column is selected. The default is to not include a probability score column in the created score table. (Either the probability score or the estimated value or both must be requested when scoring).
Estimate (Default) (FLOAT): The estimated value of the dependent variable. The default is to not include an estimated value column in the created score table. Generated only if Include Estimate from Threshold Column is selected. (Either the probability score or the estimated value or both must be requested when scoring).
Logistic Scoring - RESULTS - Lift Graph
On the Logistic Scoring dialog click RESULTS and then click on lift graph (note that the
RESULTS tab will be grayed-out/disabled until after the analysis is completed):
Figure 112: Logistic Scoring > Results > Lift Graph
This chart displays the information in the Cumulative Lift Table. This is the same graph
described in “Results - Logistic Regression” on page 134 as Lift Chart, but applied to
possibly new data.
Logistic Scoring - RESULTS - SQL
On the Logistic Scoring dialog click RESULTS and then click on SQL (note that the RESULTS
tab will be grayed-out/disabled until after the analysis is completed):
Figure 113: Logistic Scoring > Results > SQL
The SQL generated for scoring is returned here, but only if the Score Method was set to Score
on the Input - Analysis Parameters tab, and if the Output - Storage option to Generate the
SQL for this analysis, but do not execute it was selected. When SQL is displayed here, it may
be selected and copied as desired. (Both right-click menu options and buttons to Select All
and Copy are available).
Tutorial - Logistic Scoring
In this example, the same table is scored as was used to build the logistic regression model, as
a matter of convenience. Typically, this would not be done unless the contents of the table
changed since the model was built.
Parameterize a Logistic Regression Scoring Analysis as follows:
• Selected Table — twm_customer_analysis
• Evaluate and Score — Enabled
• Include Probability Score Column — Enabled
•
Column Name — Probability
• Include Estimate from Threshold Column — Enabled
•
Column Name — Estimate
•
Threshold Default — 0.35
• Prediction Success Table — Enabled
• Multi-Threshold Success Table — Enabled
•
Threshold Begin — 0
•
Threshold End — 1
•
Threshold Increment — 0.05
• Cumulative Lift Table — Enabled
• Result Table Name — score_logistic_1
• Index Columns — cust_id
Run the analysis, and click on Results when it completes. For this example, the Logistic
Regression Scoring/Evaluation Analysis generated the following pages. A single click on
each page name populates Results with the item.
Table 74: Logistic Regression Model Scoring Report

Resulting Scored Table: <result_db>.score_logistic_1
Number of Rows in Scored Table: 747

Table 75: Prediction Success Table

                    | Estimate Response | Estimate Non-Response | Actual Total
Actual Response     | 304.58 / 40.77%   | 70.42 / 9.43%         | 375.00 / 50.20%
Actual Non-Response | 70.41 / 9.43%     | 301.59 / 40.37%       | 372.00 / 49.80%
Estimated Total     | 374.99 / 50.20%   | 372.01 / 49.80%       | 747.00 / 100.00%
Table 76: Multi-Threshold Success Table

Threshold Probability | Actual Response, Estimate Response | Actual Response, Estimate Non-Response | Actual Non-Response, Estimate Response | Actual Non-Response, Estimate Non-Response
0.0000 | 375 | 0 | 372 | 0
0.0500 | 375 | 0 | 353 | 19
0.1000 | 374 | 1 | 251 | 121
0.1500 | 373 | 2 | 152 | 220
0.2000 | 369 | 6 | 90 | 282
0.2500 | 361 | 14 | 58 | 314
0.3000 | 351 | 24 | 37 | 335
0.3500 | 344 | 31 | 29 | 343
0.4000 | 329 | 46 | 29 | 343
0.4500 | 318 | 57 | 28 | 344
0.5000 | 313 | 62 | 24 | 348
0.5500 | 305 | 70 | 23 | 349
0.6000 | 291 | 84 | 23 | 349
0.6500 | 286 | 89 | 21 | 351
0.7000 | 276 | 99 | 20 | 352
0.7500 | 265 | 110 | 20 | 352
0.8000 | 253 | 122 | 20 | 352
0.8500 | 243 | 132 | 16 | 356
0.9000 | 229 | 146 | 13 | 359
0.9500 | 191 | 184 | 11 | 361
Table 77: Cumulative Lift Table

Decile | Count | Response | Response (%) | Captured Response (%) | Lift | Cumulative Response | Cumulative Response (%) | Cumulative Captured Response (%) | Cumulative Lift
1 | 74.0000 | 73.0000 | 98.6486 | 19.4667 | 1.9651 | 73.0000 | 98.6486 | 19.4667 | 1.9651
2 | 75.0000 | 69.0000 | 92.0000 | 18.4000 | 1.8326 | 142.0000 | 95.3020 | 37.8667 | 1.8984
3 | 75.0000 | 71.0000 | 94.6667 | 18.9333 | 1.8858 | 213.0000 | 95.0893 | 56.8000 | 1.8942
4 | 74.0000 | 65.0000 | 87.8378 | 17.3333 | 1.7497 | 278.0000 | 93.2886 | 74.1333 | 1.8583
5 | 75.0000 | 66.0000 | 88.0000 | 17.6000 | 1.7530 | 344.0000 | 92.2252 | 91.7333 | 1.8371
6 | 75.0000 | 24.0000 | 32.0000 | 6.4000 | 0.6374 | 368.0000 | 82.1429 | 98.1333 | 1.6363
7 | 74.0000 | 4.0000 | 5.4054 | 1.0667 | 0.1077 | 372.0000 | 71.2644 | 99.2000 | 1.4196
8 | 73.0000 | 2.0000 | 2.7397 | 0.5333 | 0.0546 | 374.0000 | 62.8571 | 99.7333 | 1.2521
9 | 69.0000 | 1.0000 | 1.4493 | 0.2667 | 0.0289 | 375.0000 | 56.4759 | 100.0000 | 1.1250
10 | 83.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 375.0000 | 50.2008 | 100.0000 | 1.0000
Table 78: Data

cust_id | Probability | Estimate
1362480 | 1.00 | 1
1362481 | 0.08 | 0
1362484 | 1.00 | 1
1362485 | 0.14 | 0
1362486 | 0.66 | 1
1362487 | 0.86 | 1
1362488 | 0.07 | 0
1362489 | 1.00 | 1
1362492 | 0.29 | 0
1362496 | 0.35 | 1
… | … | …
Lift Graph
By default, the Lift Graph displays the cumulative measure of the percentage of observations
in the decile where the actual value of the dependent variable is 1, from decile 1 to this decile
(Cumulative, %Response).
Figure 114: Logistic Scoring Tutorial: Lift Graph
CHAPTER 3
Statistical Tests
What’s In This Chapter
This chapter applies only to an instance of Teradata Warehouse Miner operating on a Teradata
database.
For more information, see these subtopics:
1
“Overview” on page 199
2
“Parametric Tests” on page 204
3
“Binomial Tests” on page 228
4
“Kolmogorov-Smirnov Tests” on page 241
5
“Tests Based on Contingency Tables” on page 270
6
“Rank Tests” on page 283
Overview
Teradata Warehouse Miner contains both parametric and nonparametric statistical tests from
the classical statistics literature, as well as more recently developed tests. In addition, “group
by” variables permit the ability to statistically analyze data groups defined by selected
variables having specific values. In this way, multiple tests can be conducted at once to
provide a profile of customer data showing hidden clues about customer behavior.
In simplified terms, what statistical inference allows us to do is to find out whether the
outcome of an experiment could have happened by accident, or if it is extremely unlikely to
have happened by chance. Of course a very well designed experiment would have outcomes
which are clearly different, and require no statistical test. Unfortunately, in nature noisy
outcomes of experiments are common, and statistical inference is required to get the answer.
It doesn’t matter whether our data come from an experiment we designed, or from a retail
database. Questions can be asked of the data, and statistical inference can provide the answer.
What is statistical inference? It is a process of drawing conclusions about parameters of a
statistical distribution. In summary, there are three principal approaches to statistical
inference. One type of statistical inference is Bayesian estimation, where conclusions are
based upon posterior judgments about the parameter given an experimental outcome. A
second type is based on the likelihood approach, in which all conclusions are inferred from
the likelihood function of the parameter given an experimental outcome. A third type of
inference is hypothesis testing, which includes both nonparametric and parametric inference.
For nonparametric inference, estimators concerning the distribution function are independent
of the specific mathematical form of the distribution function. Parametric inference, by
contrast, involves estimators about the distribution function that assumes a particular
mathematical form, most often the normal distribution. Parametric tests are based on the
sampling distribution of a particular statistic. Given knowledge of the underlying distribution
of a variable, how the statistic is distributed in multiple equal-size samples can be predicted.
The statistical tests provided in Teradata Warehouse Miner are solely those of the hypothesis
testing type, both parametric and nonparametric. Hypothesis tests generally belong to one of
five classes:
1
parametric tests including the class of t-tests and F-tests assuming normality of data
populations
2
nonparametric tests of the binomial type
3
nonparametric tests of the chi square type, based on contingency tables.
4
nonparametric tests based on ranks
5
nonparametric tests of the Kolmogorov-Smirnov type
Within each class of tests there exist many variants, some of which have risen to the level of
being named for their authors. Often tests have multiple names due to different originators.
The tests may be applied to data in different ways, such as on one sample, two samples or
multiple samples. The specific hypothesis of the test may be two-tailed, upper-tailed or lower-tailed.
Hypothesis tests vary depending on the assumptions made in the context of the experiment,
and care must be exercised that they are valid in the particular context of the data to be
examined. For example, is it a fair assumption that the variables are normally distributed?
The choice of which test to apply will depend on the answer to this question. Failure to
exercise proper judgment in which test to apply may result in false alarms, where the null
hypothesis is rejected incorrectly, or misses, where the null hypothesis is accepted
improperly.
Note: Identity columns (i.e., columns defined with the attribute “GENERATED … AS
IDENTITY”), cannot be analyzed by many of the statistical test functions and should
therefore generally be avoided.
Summary of Tests
Parametric Tests
Tests include the T-test, the F(1-way), F(2-way with equal Sample Size), F(3-way with equal
Sample Size), and the F(2-way with unequal Sample Size).
The two-sample t-test checks if two population means are equal.
The ANOVA or F test determines if significant differences exist among treatment means or
interactions. It’s a preliminary test that indicates if further analysis of the relationship among
treatment means is warranted.
Tests of the Binomial Type
These tests include the Binomial test and Sign test. The data for a binomial test is assumed to
come from n independent trials, and have outcomes in either of two classes. The binomial test
reports whether the probability that the outcome is of the first class equals a particular value, p*, usually ½.
Tests Based on Contingency Tables - Chi Square Type
Tests include the Chi Square and Median test.
The Chi Square Test determines whether the probabilities observed from data in a RxC
contingency table are the same or different. Additional statistics provided are Phi coefficient,
Cramer’s V, Likelihood Ratio Chi Square, Continuity-Adjusted Chi-Square, and Contingency
Coefficient.
The Median test is a special case of the chi-square test with fixed marginal totals, testing
whether several samples came from populations with the same median.
Tests of the Kolmogorov-Smirnov Type
These tests include the Kolmogorov-Smirnov and Lilliefors tests for goodness of fit to a
particular distribution (normal), the Shapiro-Wilk and D'Agostino-Pearson tests of normality,
and the Smirnov test of equality of two distributions.
Tests Based on Ranks
Tests include the Mann-Whitney test for 2 independent samples, Kruskal-Wallis test for k
independent samples, Wilcoxon Signed Ranks test, and Friedman test.
The Friedman test is an extension of the sign test for several independent samples. It is a test
for treatment differences in a randomized, complete block design. Additional statistics
provided are Kendall’s Coefficient of Concordance (W) and Spearman’s Rho.
Data Requirements
The following chart summarizes how the Statistical Test functions handle various types of
input. Those cases with the note “should be normal numeric” will give warnings for any type
of input that is not standard numeric (i.e., for character data, dates, big integers or decimals,
etc.).
In the table below, cat is an abbreviation for categorical, num for numeric and bignum for big
integers or decimals:
Table 79: Statistical Test functions handling of input

Test | Input Columns | Tests Return Results With | Note
Median | column of interest | cat, num, date, bignum | can be anything
Median | columns | cat, num, date, bignum | can be anything
Median | group by columns | cat, num, date, bignum | can be anything
Chi Square | 1st columns | cat, num, date, bignum | can be anything (limit of 2000 distinct value pairs)
Chi Square | 2nd columns | cat, num, date, bignum | can be anything
Mann Whitney | column of interest | cat, num, date, bignum | can be anything
Mann Whitney | columns | cat, num, date, bignum | can be anything
Mann Whitney | group by columns | cat, num, date, bignum | can be anything
Wilcoxon | 1st column | num, date, bignum | should be normal numeric
Wilcoxon | 2nd column | num, date, bignum | should be normal numeric
Wilcoxon | group by columns | cat, num, date, bignum | can be anything
Friedman | column of interest | num | should be normal numeric
Friedman | treatment column | special count requirements |
Friedman | block column | special count requirements |
Friedman | group by columns | cat, num, date, bignum | can be anything
F(n)way | column of interest | num | should be normal numeric
F(n)way | columns | cat, num, date, bignum | can be anything
F(n)way | group by columns | cat, num, date, bignum | can be anything
F(2)way ucc | column of interest | num | should be normal numeric
F(2)way ucc | columns | cat, num, date, bignum | can be anything
F(2)way ucc | group by columns | cat, num, date, bignum | can be anything
T Paired | 1st column | num | should be normal numeric
T Paired | 2nd column | num, date, bignum | should be normal numeric
T Paired | group by columns | cat, num, date, bignum | can be anything
T Unpaired | 1st column | num | should be normal numeric
T Unpaired | 2nd column | num, date, bignum | should be normal numeric
T Unpaired | group by columns | cat, num, date, bignum | can be anything
T Unpaired w ind | 1st column | num | should be normal numeric
T Unpaired w ind | indicator column | cat, num, date, bignum | can be anything
T Unpaired w ind | group by columns | cat, num, date, bignum | can be anything
Kolmogorov-Smirnov | column of interest | num, date, bignum | should be normal numeric
Kolmogorov-Smirnov | group by columns | cat, num, date, bignum | can be anything
Lilliefors | column of interest | num, date, bignum | should be normal numeric
Lilliefors | group by columns | cat, num, bignum | can be anything but date
Shapiro-Wilk | column of interest | num, date, bignum | should be normal numeric
Shapiro-Wilk | group by columns | cat, num, date, bignum | can be anything
D'Agostino-Pearson | column of interest | num | should be normal numeric
D'Agostino-Pearson | group by columns | cat, num, bignum | can be anything but date
Smirnov | column of interest | cat, num, date, bignum | should be normal numeric
Smirnov | columns | must be 2 distinct values | must be 2 distinct values
Smirnov | group by columns | cat, num, bignum | can be anything but date
Binomial | 1st column | num, date, bignum | should be normal numeric
Binomial | 2nd column | num, date, bignum | should be normal numeric
Binomial | group by columns | cat, num, date, bignum | can be anything
Sign | 1st column | num, bignum | should be normal numeric
Sign | group by columns | cat, num, date, bignum | can be anything
Parametric Tests
Parametric tests are a class of statistical tests that require particular assumptions about the data. These often include that the observations are independent and normally distributed. A researcher may want to verify the assumption of normality before using a parametric test. Any of the four normality tests provided (described later in this chapter), such as the Kolmogorov-Smirnov test, can be used to determine whether a parametric test is appropriate.
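For example, a quick check of normality on a small extracted sample could be made outside the database with SciPy (assumed to be available; the data values below are made up):

from scipy import stats

sample = [10.2, 9.8, 11.1, 10.5, 9.9, 10.7, 10.1, 9.6]
w_statistic, p_value = stats.shapiro(sample)
# A small p-value suggests the normality assumption is doubtful and a
# nonparametric test may be more appropriate.
normality_plausible = p_value >= 0.05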
Two Sample T-Test for Equal Means
For the paired t test, a one-to-one correspondence must exist between values in both samples.
The test is whether paired values have mean differences which are not significantly different
from zero. It assumes differences are identically distributed normal random variables, and
that they are independent.
The unpaired t test is similar, but there is no correspondence between values of the samples.
It assumes that within each sample, values are identically distributed normal random
variables, and that the two samples are independent of each other. The two sample sizes may
be equal or unequal. Variances of both samples may be assumed to be equal (homoscedastic)
or unequal (heteroscedastic). In both cases, the null hypothesis is that the population means
are equal. Test output is a p-value which compared to the threshold determines whether the
null hypothesis should be rejected.
Two methods of data selection are available for the unpaired t test. The first, “T Unpaired”, simply selects the columns with the two unpaired datasets, some of which may be NULL. The
second, “T Unpaired with Indicator”, selects the column of interest and a second indicator
column which determines to which group the first variable belongs. If the indicator variable is
negative or zero, it will be assigned to the first group; if it is positive, it will be assigned to the
second group.
The two sample t test for unpaired data is defined as shown below (though calculated
differently in the SQL):
Table 80: Two sample t tests for unpaired data

H0: μ1 = μ2
Ha: μ1 ≠ μ2
Test Statistic:
    T = (Ȳ1 − Ȳ2) / √( s1²/N1 + s2²/N2 )
where N1 and N2 are the sample sizes, Ȳ1 and Ȳ2 are the sample means, and s1² and s2² are the sample variances.
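The same statistic is easy to reproduce outside the database as a sanity check; a minimal Python sketch using SciPy (assumed available; the sample values are made up, and equal_var selects between the equal-variance and unequal-variance forms):

from scipy import stats

sample1 = [12.1, 9.8, 11.4, 10.7, 13.0]
sample2 = [8.9, 10.2, 9.5, 9.9, 8.7]
t_statistic, p_value = stats.ttest_ind(sample1, sample2, equal_var=False)
reject_null = p_value < 0.05  # compare against the chosen threshold probability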
Initiate a Two Sample T-Test
Use the following procedure to initiate a new T-Test in Teradata Warehouse Miner:
1
Click on the Add New Analysis icon in the toolbar:
Figure 115: Add New Analysis from toolbar
2
In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories
and then under Analyses double-click on Parametric Tests:
Figure 116: Add New Analysis > Statistical Tests > Parametric Tests
3
This will bring up the Parametric Tests dialog in which you will enter STATISTICAL
TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in
the following sections.
T-Test - INPUT - Data Selection
On the Parametric Tests dialog click on INPUT and then click on data selection:
Figure 117: T-Test > Input > Data Selection
On this screen select:
1
Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis, or it contains a single entry with the
name of the analysis under the label Volatile Table, representing the output of the analysis
that is ordinarily produced by a Select statement.
2
Select Columns From a Single Table
•
Available Databases (or Analyses) — These are the databases (or analyses) available to
be processed.
•
Available Tables — These are the tables and views that are available to be processed.
•
Available Columns — These are the columns within the table/view that are available
for processing.
3
Select Statistical Test Style
These are the parametric tests available (F(n-way), F(2-way with unequal cell counts), T Paired, T Unpaired, T Unpaired with Indicator). Select “T Paired”, “T Unpaired”, or “T Unpaired with Indicator”.
4
Select Optional Columns
•
Selected Columns — Select columns by highlighting and then either dragging and
dropping into the Selected Columns window, or click on the arrow button to move
highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can insert
columns as First Column, Second Column or Group By Columns. Make sure you have
the correct portion of the window highlighted.
•
First Column — The column that specifies the first variable for the Parametric Test
analysis.
•
Second Column (or Indicator Column) — The column that specifies the second
variable for the Parametric Test analysis.
(Or the column that determines to which group the first variable belongs. If
negative or zero, it will be assigned to the first group; if it is positive, it will be
assigned to the second group).
•
Group By Columns — The column(s) that specifies the variable(s) whose distinct
value combinations will categorize the data, so a separate test is performed on each
category.
T-Test - INPUT - Analysis Parameters
On the Parametric Tests dialog click on INPUT and then click on analysis parameters:
Figure 118: T-Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
•
Threshold Probability — Enter the “alpha” probability below which to reject the null
hypothesis.
•
Equal Variance — Check this box if the “equal variance” assumption is to be used.
Default is “unequal variance”.
T-Test - OUTPUT
On the Parametric Tests dialog click on OUTPUT:
Figure 119: T-Test > Output
On this screen select:
• Store the tabular output of this analysis in the database — Option to generate a Teradata
table populated with the results of the analysis. Once enabled, the following three fields
must be specified:
•
Database Name — The database where the output table will be saved.
•
Output Name — The table name that the output will be saved under.
•
Output Type — The output type must be table when storing Statistical Test output in
the database.
•
Stored Procedure — The creation of a stored procedure containing the SQL generated
for this analysis can be requested by entering the desired name of the stored procedure
here. This will result in the creation of a stored procedure in the user's login database
in place of the execution of the SQL generated by the analysis.
•
Procedure Comment — When an optional Procedure Comment is entered it is applied
to a requested Stored Procedure with an SQL Comment statement. It can be up to 255
characters in length and contain substitution parameters for the output category (Score,
ADS, Stats or Other), project name and/or analysis name (using the tags <Category>,
<Project> and <Analysis>, respectively). (Note that the default value of this field may
be set on the Defaults tab of the Preferences dialog, available from the Tools Menu).
•
Create output table using fallback keyword — Fallback keyword will be used to create
the table
•
Create output table using multiset keyword — Multiset keyword will be used to create
the table
•
Advertise Output — The Advertise Output option may be requested when creating a
table, view or procedure. This feature “advertises” output by inserting information into
one or more of the Advertise Output metadata tables according to the type of analysis
and the options selected in the analysis.
•
Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that
may be used to categorize or describe the output.
•
Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not
execute it. The SQL will be available to be viewed.
Run the T-Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon
on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - T-Test Analysis
The results of running the T-Test analysis include a table with a row for each group-by
variable requested, as well as the SQL to perform the statistical analysis. All of these results
are outlined below.
T-Test - RESULTS - SQL
On the Parametric Tests dialog click on RESULTS and then click on SQL:
Figure 120: T-Test > Results > SQL
The series of SQL statements that comprise the T-Test analysis is displayed here. It is always returned, and is the only item returned when the Generate SQL Without Executing option is used.
T-Test - RESULTS - Data
On the Parametric Tests dialog click on RESULTS and then click on data:
Figure 121: T-Test > Results > Data
The output table is generated by the T-Test analysis for each group-by variable combination.
Output Columns - T-Test Analysis
The following table is built in the requested Output Database by the T-Test analysis. Any
group-by columns will comprise the Unique Primary Index (UPI).
Table 81: Output Database table

D_F (INTEGER): Degrees of Freedom for the group-by values selected.
T (Float): The computed value of the T statistic.
TTestPValue (Float): The probability associated with the T statistic.
TTestCallP (Char): The TTest result: a=accept, p=reject (positive), n=reject (negative).
Tutorial - T-Test
In this example, a T-Test analysis of type T-Paired is performed on the fictitious banking data
to analyze account usage. Parameterize a Parametric Test analysis as follows:
• Available Tables — twm_customer_analysis
• Statistical Test Style — T Paired
• First Column — avg_cc_bal
• Second Column — avg_sv_bal
• Group By Columns — age, gender
• Analysis Parameters
•
Threshold Probability — 0.05
•
Equal Variance — true (checked)
Run the analysis and click on Results when it completes. For this example, the Parametric
Test analysis generated the following page. The paired t-test was computed on average credit
card balance vs. average savings balance, by gender and age. Ages over 33 were excluded for
brevity. Results were sorted by age and gender in the listing below. The tests shows whether
the paired values have mean differences which are not significantly different from zero for
each gender-age combination. A ‘p’ means the difference was significantly different from
zero. An ‘a’ means the difference was insignificant. The SQL is available for viewing but not
listed below.
Table 82: T-Test

gender | age | D_F | TTestPValue | T | TTestCallP_0.05
F | 13 | 7 | 0.01 | 3.99 | p
M | 13 | 6 | 0.13 | 1.74 | a
F | 14 | 5 | 0.10 | 2.04 | a
M | 14 | 8 | 0.04 | 2.38 | p
F | 15 | 18 | 0.01 | 3.17 | p
M | 15 | 12 | 0.04 | 2.29 | p
F | 16 | 9 | 0.00 | 4.47 | p
M | 16 | 8 | 0.04 | 2.52 | p
F | 17 | 13 | 0.00 | 4.68 | p
M | 17 | 6 | 0.01 | 3.69 | p
F | 18 | 9 | 0.00 | 6.23 | p
M | 18 | 9 | 0.02 | 2.94 | p
F | 19 | 9 | 0.01 | 3.36 | p
M | 19 | 6 | 0.03 | 2.92 | p
F | 22 | 3 | 0.21 | 1.57 | a
M | 22 | 3 | 0.11 | 2.25 | a
F | 23 | 3 | 0.34 | 1.13 | a
M | 23 | 3 | 0.06 | 2.88 | a
F | 25 | 4 | 0.06 | 2.59 | a
F | 26 | 5 | 0.08 | 2.22 | a
F | 27 | 5 | 0.09 | 2.12 | a
F | 28 | 4 | 0.06 | 2.68 | a
M | 28 | 4 | 0.03 | 3.35 | p
F | 29 | 4 | 0.06 | 2.54 | a
M | 29 | 5 | 0.16 | 1.65 | a
F | 30 | 8 | 0.00 | 4.49 | p
M | 30 | 5 | 0.01 | 4.25 | p
F | 31 | 5 | 0.04 | 2.69 | p
M | 31 | 6 | 0.05 | 2.52 | p
F | 32 | 5 | 0.05 | 2.50 | a
M | 32 | 6 | 0.10 | 1.98 | a
F | 33 | 9 | 0.01 | 3.05 | p
M | 33 | 4 | 0.09 | 2.27 | a
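If the same data were exported from the database, the tutorial's analysis could be approximated outside Teradata Warehouse Miner with pandas and SciPy; this is only a rough sketch (the file name is hypothetical, and it does not reproduce the product's distinction between positive and negative rejections):

import pandas as pd
from scipy import stats

df = pd.read_csv("twm_customer_analysis.csv")  # hypothetical export of the table
for (gender, age), group in df.groupby(["gender", "age"]):
    if len(group) < 2:
        continue  # a paired t-test needs at least two pairs
    t, p = stats.ttest_rel(group["avg_cc_bal"], group["avg_sv_bal"])
    call = "p" if p < 0.05 else "a"  # simplified version of TTestCallP_0.05
    print(gender, age, len(group) - 1, round(p, 2), round(t, 2), call)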
F-Test - N-Way
• F-Test/Analysis of Variance — One Way, Equal or Unequal Sample Size
• F-Test/Analysis of Variance — Two Way, Equal Sample Size
• F-Test/Analysis of Variance — Three Way, Equal Sample Size
The ANOVA or F test determines if significant differences exist among treatment means or
interactions. It’s a preliminary test that indicates if further analysis of the relationship among
treatment means is warranted. If the null hypothesis of no difference among treatments is
accepted, the test result implies factor levels and response are unrelated, so the analysis is
terminated. When the null hypothesis is rejected, the analysis is usually continued to examine
the nature of the factor-level effects. Examples are:
• Tukey’s Method — tests all possible pairwise differences of means
• Scheffe’s Method — tests all possible contrasts at the same time
• Bonferroni’s Method — tests, or puts simultaneous confidence intervals around a preselected group of contrasts
The N-way F-Test is designed to execute within groups defined by the distinct values of the
group-by variables (GBV's), the same as most of the other nonparametric tests. Two or more
treatments must exist in the data within the groups defined by the distinct GBV values.
Given a column of interest (dependent variable), one or more input columns (independent variables) and optionally one or more group-by columns (all from the same input table), an F-Test is produced. The N-Way ANOVA tests whether a set of sample means are all equal (the null hypothesis). Output is a p-value which, when compared to the user's threshold, determines whether the null hypothesis should be rejected.
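As a point of comparison, a one-way version of the test can be sketched outside the database with SciPy (assumed available; the three treatment groups below are made-up data):

from scipy import stats

group_a = [21.0, 23.5, 19.8, 22.1]
group_b = [25.2, 26.1, 24.8, 25.9]
group_c = [20.3, 21.7, 22.0, 19.9]
f_statistic, p_value = stats.f_oneway(group_a, group_b, group_c)
reject_null = p_value < 0.05  # compare against the threshold probability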
Initiate an N-Way F-Test
Use the following procedure to initiate a new F-Test analysis in Teradata Warehouse Miner:
1
Click on the Add New Analysis icon in the toolbar:
Figure 122: Add New Analysis from toolbar
2
In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories
and then under Analyses double-click on Parametric Tests:
Figure 123: Add New Analysis > Statistical Tests > Parametric Tests
3
This will bring up the Parametric Tests dialog in which you will enter STATISTICAL
TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in
the following sections.
F-Test (N-Way) - INPUT - Data Selection
On the Parametric Tests dialog click on INPUT and then click on data selection:
Figure 124: F-Test > Input > Data Selection
On this screen select:
1
Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis, or it contains a single entry with the
name of the analysis under the label Volatile Table, representing the output of the analysis
that is ordinarily produced by a Select statement.
2
Select Columns From a Single Table
•
Available Databases (or Analyses) — These are the databases (or analyses) available to
be processed.
•
Available Tables — These are the tables and views that are available to be processed.
•
Available Columns — These are the columns within the table/view that are available
for processing.
3
Select Statistical Test Style
These are the parametric tests available (T, F(n-way), F(2-way with unequal cell counts)). Select “F(n-way)”.
4
Select Optional Columns
•
Selected Columns — Select columns by highlighting and then either dragging and
dropping into the Selected Columns window, or click on the arrow button to move
highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can insert
columns as Column of Interest, Columns or Group By Columns. Make sure you have
the correct portion of the window highlighted.
•
Column of Interest — The column that specifies the dependent variable for the F-test analysis.
•
Columns — The column(s) that specify the independent variable(s) for the F-test analysis. Selection of one column will generate a 1-Way F-test, two columns a 2-Way F-test, and three columns a 3-Way F-test. Do not select more than three columns, because 4-way, 5-way, etc. F-tests are not implemented in this version of TWM.
Warning:
For this test, equal cell counts are required for the 2 and 3 way tests.
• Group By Columns — The column(s) that specifies the variable(s) whose distinct
value combinations will categorize the data, so a separate test is performed on each
category.
F-Test (N-Way) - INPUT - Analysis Parameters
On the Parametric Tests dialog click on INPUT and then click on analysis parameters:
Figure 125: F-Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
•
Threshold Probability — Enter the “alpha” probability below which to reject the null
hypothesis.
F-Test - OUTPUT
On the Parametric Tests dialog click on OUTPUT:
Figure 126: F-Test > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata
table populated with the results of the analysis. Once enabled, the following three fields
must be specified:
•
Database Name — The database where the output table will be saved.
•
Output Name — The table name that the output will be saved under.
•
Output Type — The output type must be table when storing Statistical Test output in
the database.
•
Stored Procedure — The creation of a stored procedure containing the SQL generated
for this analysis can be requested by entering the desired name of the stored procedure
here. This will result in the creation of a stored procedure in the user's login database
in place of the execution of the SQL generated by the analysis.
•
Procedure Comment — When an optional Procedure Comment is entered it is applied
to a requested Stored Procedure with an SQL Comment statement. It can be up to 255
characters in length and contain substitution parameters for the output category (Score,
ADS, Stats or Other), project name and/or analysis name (using the tags <Category>,
<Project> and <Analysis>, respectively). (Note that the default value of this field may
be set on the Defaults tab of the Preferences dialog, available from the Tools Menu).
•
Create output table using fallback keyword — Fallback keyword will be used to create
the table
•
Create output table using multiset keyword — Multiset keyword will be used to create
the table
•
Advertise Output — The Advertise Output option may be requested when creating a
table, view or procedure. This feature “advertises” output by inserting information into
one or more of the Advertise Output metadata tables according to the type of analysis
and the options selected in the analysis.
•
Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that
may be used to categorize or describe the output.
•
Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not
execute it. The SQL will be available to be viewed.
Run the F-Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon
on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - F-Test Analysis
The results of running the F-test analysis include a table with a row for each group-by
variable requested, as well as the SQL to perform the statistical analysis. All of these results
are outlined below.
F-Test - RESULTS - SQL
On the Parametric Tests dialog click on RESULTS and then click on SQL:
Figure 127: F-Test > Results > SQL
The series of SQL statements that comprise the F-test analysis is displayed here. It is always returned, and is the only item returned when the Generate SQL Without Executing option is used.
F-Test - RESULTS - data
On the Parametric Tests dialog click on RESULTS and then click on data:
Figure 128: F-Test > Results > data
The output table is generated by the F-test Analysis for each group-by variable combination.
Output Columns - F-Test Analysis
The particular result table returned will depend on whether the test is 1-way, 2-way or 3-way,
and is built in the requested Output Database by the F-test analysis. If group-by columns are
present, they will comprise the Unique Primary Index (UPI). Otherwise DF will be the UPI.
Table 83: Output Columns - 1-Way F-Test Analysis

DF (INTEGER): Degrees of Freedom for the Variable
DFErr (INTEGER): Degrees of Freedom for Error
F (Float): The computed value of the F statistic
FPValue (Float): The probability associated with the F statistic
FPText (Char): If not NULL, the probability is less than the smallest or more than the largest table value
FCallP (Char): The F-Test result: a=accept, p=reject (positive), n=reject (negative)
Table 84: Output Columns - 2-Way F-Test Analysis

DF (INTEGER): Degrees of Freedom for the model
Fmodel (Float): The computed value of the F statistic for the model
DFErr (INTEGER): Degrees of Freedom for Error term
DF_1 (INTEGER): Degrees of Freedom for first variable
F1 (Float): The computed value of the F statistic for the first variable
DF_2 (INTEGER): Degrees of Freedom for second variable
F2 (Float): The computed value of the F statistic for the second variable
DF_12 (INTEGER): Degrees of Freedom for interaction
F12 (Float): The computed value of the F statistic for interaction
Fmodel_PValue (Float): The probability associated with the F statistic for the model
Fmodel_PText (Char): If not NULL, the probability is less than the smallest or more than the largest table value
Fmodel_CallP_0.05 (Char): The F test result: a=accept, p=reject for the model
F1_PValue (Float): The probability associated with the F statistic for the first variable
F1_PText (Char): If not NULL, the probability is less than the smallest or more than the largest table value
F1_callP_0.05 (Char): The F test result: a=accept, p=reject for the first variable
F2_PValue (Float): The probability associated with the F statistic for the second variable
F2_PText (Char): If not NULL, the probability is less than the smallest or more than the largest table value
F2_callP_0.05 (Char): The F test result: a=accept, p=reject for the second variable
F12_PValue (Float): The probability associated with the F statistic for the interaction
F12_PText (Char): If not NULL, the probability is less than the smallest or more than the largest table value
F12_callP_0.05 (Char): The F test result: a=accept, p=reject for the interaction
Table 85: Output Columns - 3-Way F-Test Analysis
Name               Type     Definition
DF                 INTEGER  Degrees of Freedom for the model
Fmodel             Float    The computed value of the F statistic for the model
DFErr              INTEGER  Degrees of Freedom for Error term
DF_1               INTEGER  Degrees of Freedom for first variable
F1                 Float    The computed value of the F statistic for the first variable
DF_2               INTEGER  Degrees of Freedom for second variable
F2                 Float    The computed value of the F statistic for the second variable
DF_3               INTEGER  Degrees of Freedom for third variable
F3                 Float    The computed value of the F statistic for the third variable
DF_12              INTEGER  Degrees of Freedom for interaction of v1 and v2
F12                Float    The computed value of the F statistic for interaction of v1 and v2
DF_13              INTEGER  Degrees of Freedom for interaction of v1 and v3
F13                Float    The computed value of the F statistic for interaction of v1 and v3
DF_23              INTEGER  Degrees of Freedom for interaction of v2 and v3
F23                Float    The computed value of the F statistic for interaction of v2 and v3
DF_123             INTEGER  Degrees of Freedom for three-way interaction of v1, v2, and v3
F123               Float    The computed value of the F statistic for three-way interaction of v1, v2 and v3
Fmodel_PValue      Float    The probability associated with the F statistic for the model
Fmodel_PText       Char     If not NULL, the probability is less than the smallest or more than the largest table value
Fmodel_callP_0.05  Char     The F test result: a=accept, p=reject for the model
F1_PValue          Float    The probability associated with the F statistic for the first variable
F1_PText           Char     If not NULL, the probability is less than the smallest or more than the largest table value
F1_callP_0.05      Char     The F test result: a=accept, p=reject for the first variable
F2_PValue          Float    The probability associated with the F statistic for the second variable
F2_PText           Char     If not NULL, the probability is less than the smallest or more than the largest table value
F2_callP_0.05      Char     The F test result: a=accept, p=reject for the second variable
F3_PValue          Float    The probability associated with the F statistic for the third variable
F3_PText           Char     If not NULL, the probability is less than the smallest or more than the largest table value
F3_callP_0.05      Char     The F test result: a=accept, p=reject for the third variable
F12_PValue         Float    The probability associated with the F statistic for the interaction of v1 and v2
F12_PText          Char     If not NULL, the probability is less than the smallest or more than the largest table value
F12_callP_0.05     Char     The F test result: a=accept, p=reject for the interaction of v1 and v2
F13_PValue         Float    The probability associated with the F statistic for the interaction of v1 and v3
F13_PText          Char     If not NULL, the probability is less than the smallest or more than the largest table value
F13_callP_0.05     Char     The F test result: a=accept, p=reject for the interaction of v1 and v3
F23_PValue         Float    The probability associated with the F statistic for the interaction of v2 and v3
F23_PText          Char     If not NULL, the probability is less than the smallest or more than the largest table value
F23_callP_0.05     Char     The F test result: a=accept, p=reject for the interaction of v2 and v3
F123_PValue        Float    The probability associated with the F statistic for the three-way interaction of v1, v2 and v3
F123_PText         Char     If not NULL, the probability is less than the smallest or more than the largest table value
F123_callP_0.05    Char     The F test result: a=accept, p=reject for the three-way interaction of v1, v2 and v3
Tutorial - One-Way F-Test Analysis
In this example, an F-test analysis is performed on the fictitious banking data to analyze
income by gender. Parameterize an F-Test analysis as follows:
• Available Tables — twm_customer
• Column of Interest — income
• Columns — gender
• Group By Columns — years_with_bank, nbr_children
• Analysis Parameters
•
Threshold Probability — 0.01
Run the analysis and click on Results when it completes. For this example, the F-Test analysis
generated the following page. The F-Test was computed on income over gender for every
combination of years_with_bank and nbr_children. Results were sorted by years_with_bank
and nbr_children in the listing below.
The test shows whether significant differences exist in income for males and females, and
does so separately for each value of years_with_bank and nbr_children. A ‘p’ means the
difference was significant, and an ‘a’ means it was not significant. If the field is null, it
indicates there was insufficient data for the test. The SQL is available for viewing but not
listed below.
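For orientation only, the F statistic and p-value in any one row of the table that follows could be approximated outside the database. The sketch below is a rough Python illustration (pandas and scipy are assumptions, not part of Teradata Warehouse Miner, and the data frame df is assumed to be an export of twm_customer); it is not the SQL that the analysis generates.

```python
import pandas as pd
from scipy import stats

# df is assumed to hold rows exported from twm_customer with columns
# income, gender, years_with_bank and nbr_children.
def one_way_ftest(df: pd.DataFrame, ywb: int, children: int, alpha: float = 0.01):
    cell = df[(df["years_with_bank"] == ywb) & (df["nbr_children"] == children)]
    groups = [g["income"].values for _, g in cell.groupby("gender") if len(g) > 1]
    if len(groups) < 2:
        return None  # insufficient data: the corresponding fields are NULL in the report
    f_stat, p_value = stats.f_oneway(*groups)
    call = "p" if p_value < alpha else "a"  # reject / accept at the chosen threshold
    return f_stat, p_value, call
```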
Table 86: F-Test (one-way)
years_with_bank  nbr_children  DF  DFErr  F      FPValue  FPText  FCallP_0.01
0                0             1   53     0.99   0.25     >0.25   a
0                1             1   8      1.87   0.22             a
0                2             1   10     1.85   0.22             a
0                3             1   6      0.00   0.25     >0.25   a
0                4             1   0
0                5             0   0
1                0             1   55     0.00   0.25     >0.25   a
1                1             1   6      0.00   0.25     >0.25   a
1                2             1   14     0.00   0.25     >0.25   a
1                3             1   2      0.50   0.25     >0.25   a
1                4             0   0
1                5             0   0
2                0             1   55     0.82   0.25     >0.25   a
2                1             1   14     1.54   0.24             a
2                2             1   14     0.07   0.25     >0.25   a
2                3             1   1      0.30   0.25     >0.25   a
2                4             0   0
2                5             0   0
3                0             1   49     0.05   0.25     >0.25   a
3                1             1   9      1.16   0.25     >0.25   a
3                2             1   10     0.06   0.25     >0.25   a
3                3             1   6      16.90  0.01             p
3                4             1   1      4.50   0.25     >0.25   a
3                5             0   0
4                0             1   52     1.84   0.20             a
4                1             1   10     0.54   0.25     >0.25   a
4                2             1   6      2.38   0.20             a
4                3             0   0
4                4             0   0
4                5             0   1
5                0             1   46     4.84   0.04             a
5                1             1   15     0.48   0.25     >0.25   a
5                2             1   10     3.51   0.09             a
5                3             1   2      2.98   0.24             a
5                4             0   0
6                0             1   46     0.01   0.25     >0.25   a
6                1             1   14     3.67   0.08             a
6                2             1   15     0.13   0.25     >0.25   a
6                3             0   0
6                5             0   0
7                0             1   41     4.99   0.03             a
7                1             1   8      0.01   0.25     >0.25   a
7                2             1   4      0.13   0.25     >0.25   a
7                3             1   2      0.04   0.25     >0.25   a
7                5             0   1
8                0             1   23     0.50   0.25     >0.25   a
8                1             1   7      0.38   0.25     >0.25   a
8                2             1   6      0.09   0.25     >0.25   a
8                3             1   0
8                5             0   0
9                0             1   26     0.07   0.25     >0.25   a
9                1             1   3      3.11   0.20             a
9                2             1   1      0.09   0.25     >0.25   a
9                3             1   1      0.12   0.25     >0.25   a
F-Test/Analysis of Variance - Two Way Unequal Sample Size
The ANOVA or F test determines if significant differences exist among treatment means or
interactions. It’s a preliminary test that indicates if further analysis of the relationship among
treatment means is warranted. If the null hypothesis of no difference among treatments is
accepted, the test result implies factor levels and response are unrelated, so the analysis is
terminated. When the null hypothesis is rejected, the analysis is usually continued to examine
the nature of the factor-level effects. Examples are:
• Tukey’s Method — tests all possible pairwise differences of means (an illustrative sketch follows this list)
• Scheffe’s Method — tests all possible contrasts at the same time
• Bonferroni’s Method — tests, or puts simultaneous confidence intervals around, a preselected group of contrasts
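For illustration only, a minimal Python sketch of the first of these follow-up methods is shown below; it uses statsmodels (an assumption, outside the product) on a data frame df assumed to be exported from the input table, with income as the response and marital_status as the treatment.

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# df is assumed to hold the exported input rows with columns
# income (response) and marital_status (treatment).
def tukey_followup(df: pd.DataFrame, alpha: float = 0.05):
    result = pairwise_tukeyhsd(endog=df["income"],
                               groups=df["marital_status"],
                               alpha=alpha)
    # One row per pairwise difference of treatment means, with an
    # accept/reject decision at the requested alpha.
    print(result.summary())
```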
The 2-way Unequal Sample Size F-Test is designed to execute on the entire dataset. No group-by parameter is provided for this test; if grouped tests are desired, multiple tests must be run on pre-prepared datasets, each restricted to a single constant value of the grouping variables. Two or more treatments must exist within the dataset. (Note that this test will create a temporary work table in the Result Database and drop it at the end of processing, even if the Output option to “Store the tabular output of this analysis in the database” is not selected).
Given a table name of tabulated values, an F-Test is produced. The N-Way ANOVA tests whether the sample means are all equal (the null hypothesis). Output is a p-value which, when compared to the user’s threshold, determines whether the null hypothesis should be rejected.
Initiate a 2-Way F-Test with Unequal Cell Counts
Use the following procedure to initiate a new F-Test in Teradata Warehouse Miner:
1
Click on the Add New Analysis icon in the toolbar:
Figure 129: Add New Analysis from toolbar
2
In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories
and then under Analyses double-click on Parametric Tests:
Figure 130: Add New Analysis > Statistical Tests > Parametric Tests
3
This will bring up the Parametric Tests dialog in which you will enter STATISTICAL
TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in
the following sections.
F-Test (Unequal Cell Counts) - INPUT - Data Selection
On the Parametric Tests dialog click on INPUT and then click on data selection:
Figure 131: F-Test > Input > Data Selection
On this screen select:
1
Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis.
Note that if an analysis is selected it must be one that creates a table or view for output
since a volatile table cannot be processed with this Statistical Test Style.
2
Select Columns From a Single Table
•
Available Databases (or Analyses) — These are the databases (or analyses) available to
be processed.
•
Available Tables — These are the tables and views that are available to be processed.
•
Available Columns — These are the columns within the table/view that are available
for processing.
3
Select Statistical Test Style
These are the parametric tests available (T, F(n-way), F(2-way with unequal cell counts)).
Select “F(2-way with unequal cell counts)”.
4
Select Optional Columns
•
Selected Columns — Select columns by highlighting and then either dragging and
dropping into the Selected Columns window, or click on the arrow button to move
highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can insert
columns as Column of Interest, First Column or Second Column. Make sure you have
the correct portion of the window highlighted.
•
Column of Interest — The column that specifies the dependent variable for the F-test analysis.
•
First Column — The column that specifies the first independent variable for the F-test analysis.
•
Second Column — The column that specifies the second independent variable for
the F-test analysis.
F-Test - INPUT - Analysis Parameters
On the Parametric Tests dialog click on INPUT and then click on analysis parameters:
Figure 132: F-Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
•
Threshold Probability — Enter the “alpha” probability below which to reject the null
hypothesis.
•
First Column Values — Use the selection wizard to choose any or all of the values of
the first independent variable to be used in the analysis.
•
Second Column Values — Use the selection wizard to choose any or all of the values
of the second independent variable to be used in the analysis.
F-Test - OUTPUT
On the Parametric Tests dialog click on OUTPUT:
Figure 133: F-Test > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata
table populated with the results of the analysis. Once enabled, the following three fields
must be specified:
•
Database Name — The database where the output table will be saved.
•
Output Name — The table name that the output will be saved under.
•
Output Type — The output type must be table when storing Statistical Test output in
the database.
•
Stored Procedure — The creation of a stored procedure containing the SQL generated
for this analysis can be requested by entering the desired name of the stored procedure
here. This will result in the creation of a stored procedure in the user's login database
in place of the execution of the SQL generated by the analysis.
•
Procedure Comment — When an optional Procedure Comment is entered it is applied
to a requested Stored Procedure with an SQL Comment statement. It can be up to 255
characters in length and contain substitution parameters for the output category (Score,
ADS, Stats or Other), project name and/or analysis name (using the tags <Category>,
<Project> and <Analysis>, respectively). (Note that the default value of this field may
be set on the Defaults tab of the Preferences dialog, available from the Tools Menu).
•
Create output table using fallback keyword — Fallback keyword will be used to create
the table
•
Create output table using multiset keyword — Multiset keyword will be used to create
the table
•
Advertise Output — The Advertise Output option may be requested when creating a
table, view or procedure. This feature “advertises” output by inserting information into
one or more of the Advertise Output metadata tables according to the type of analysis
and the options selected in the analysis.
•
Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that
may be used to categorize or describe the output.
•
Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not
execute it. The SQL will be available to be viewed.
Run the F-Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - F-Test Analysis
The results of running the F-test analysis include a table with a single row, as well as the SQL
to perform the statistical analysis. All of these results are outlined below.
F-Test - RESULTS - SQL
On the Parametric Tests dialog click on RESULTS and then click on SQL:
Figure 134: F-Test > Results > SQL
The series of SQL statements that comprises the F-test analysis is displayed here. It is always returned, and is the only item returned when the Generate SQL Without Executing option is used.
F-Test - RESULTS - data
On the Parametric Tests dialog click on RESULTS and then click on data:
Figure 135: F-Test > Results > data
This displays the output table generated by the F-test analysis; for this test style it contains a single row.
Output Columns - F-Test Analysis
The result table returned is built in the requested Output Database by the F-test analysis. DF
will be the UPI.
Table 87: Output Columns - 2-Way F-Test Analysis
Name               Type     Definition
DF                 INTEGER  Degrees of Freedom for the model
Fmodel             Float    The computed value of the F statistic for the model
DFErr              INTEGER  Degrees of Freedom for Error term
DF_1               INTEGER  Degrees of Freedom for first variable
F1                 Float    The computed value of the F statistic for the first variable
DF_2               INTEGER  Degrees of Freedom for second variable
F2                 Float    The computed value of the F statistic for the second variable
DF_12              INTEGER  Degrees of Freedom for interaction
F12                Float    The computed value of the F statistic for interaction
Fmodel_PValue      Float    The probability associated with the F statistic for the model
Fmodel_PText       Char     If not NULL, the probability is less than the smallest or more than the largest table value
Fmodel_CallP_0.05  Char     The F test result: a=accept, p=reject for the model
F1_PValue          Float    The probability associated with the F statistic for the first variable
F1_PText           Char     If not NULL, the probability is less than the smallest or more than the largest table value
F1_callP_0.05      Char     The F test result: a=accept, p=reject for the first variable
F2_PValue          Float    The probability associated with the F statistic for the second variable
F2_PText           Char     If not NULL, the probability is less than the smallest or more than the largest table value
F2_callP_0.05      Char     The F test result: a=accept, p=reject for the second variable
F12_PValue         Float    The probability associated with the F statistic for the interaction
F12_PText          Char     If not NULL, the probability is less than the smallest or more than the largest table value
F12_callP_0.05     Char     The F test result: a=accept, p=reject for the interaction
Tutorial - Two-Way Unequal Cell Count F-Test Analysis
In this example, an F-test analysis is performed on the fictitious banking data to analyze
income by years_with_bank and marital_status. Parameterize an F-Test analysis as follows:
• Available Tables — twm_customer
• Column of Interest — income
• First Column — years_with_bank
• Second Column — marital_status
• Analysis Parameters
•
Threshold Probability — 0.05
•
First Column Values — 0, 1, 2, 3, 4, 5, 6, 7
•
Second Column Values — 1, 2, 3, 4
Run the analysis and click on Results when it completes. For this example, the F-Test analysis
generated the following page. The F-Test was computed on income over years_with_bank
and marital_status.
The test shows whether significant differences exist in income for years_with_bank by
marital_status. The first column, years_with_bank, is represented by F1. The second column,
marital_status, is represented by F2. The interaction term is F12.
A ‘p’ means the difference was significant, and an ‘a’ means it was not significant. If the field
is null, it indicates there was insufficient data for the test. The SQL is available for viewing
but not listed below.
The results show that there are no significant differences in income for different values of
years_with_bank or the interaction term for years_with_bank and marital_status. There was a
highly significant (p<0.001) difference in income for different values of marital status. The
overall model difference was significant at a level better than 0.001.
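As a rough cross-check outside the database (an illustration, not the generated SQL), a comparable two-way analysis of variance with an interaction term could be run in Python with statsmodels on a data frame df assumed to be exported from twm_customer and filtered to the factor levels chosen above. The guide does not state which sums-of-squares decomposition the generated SQL uses; Type II is assumed here.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# df is assumed to hold rows exported from twm_customer with columns
# income, years_with_bank (0-7) and marital_status (1-4).
def two_way_anova(df: pd.DataFrame) -> pd.DataFrame:
    model = ols("income ~ C(years_with_bank) * C(marital_status)", data=df).fit()
    # Type II sums of squares: an assumption for this unbalanced design.
    return sm.stats.anova_lm(model, typ=2)
```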
Table 88: F-Test (Two-way Unequal Cell Count) (Part 1)
DF  Fmodel  DFErr  DF_1  F1    DF_2  F2     DF_12  F12
31  3.76    631    7     0.93  3     29.02  21     1.09

Table 89: F-Test (Two-way Unequal Cell Count) (Part 2)
Fmodel_PValue  Fmodel_PText  Fmodel_CallP_0.05  F1_PValue  F1_PText  F1_CallP_0.05
0.001          <0.001        p                  0.25       >0.25     a

Table 90: F-Test (Two-way Unequal Cell Count) (Part 3)
F2_PValue  F2_PText  F2_CallP_0.05  F12_PValue  F12_PText  F12_CallP_0.05
0.001      <0.001    p              0.25        >0.25      a
Binomial Tests
The data for a binomial test is assumed to come from n independent trials, and have outcomes
in either of two classes. The other assumption is that the probability of each outcome of each
trial is the same, designated p. The values of the outcome could come directly from the data,
where the value is always one of two kinds. More commonly, however, the test is applied to
the sign of the difference between two values. If the probability is 0.5, this is the oldest of all
nonparametric tests, and is called the ‘sign test’. Where the sign of the difference between two
values is used, the binomial test reports whether the probability that the sign is positive equals a particular hypothesized value, p*.
Binomial/Z-Test
Output for each unique set of values of the group-by variables (GBV's) is a p-value which
when compared to the user’s choice of alpha, the probability threshold, determines whether
the null hypothesis (p=p*, p<=p*, or p>p*) should be rejected for the GBV set. Though both binomial and Z-test results are provided for all N, the approximate value obtained from the Z-test (nP) is appropriate when N is large. For values of N over 100, only the Z-test is performed. Otherwise, the value bP returned is the p-value of the one-tailed or two-tailed test, depending on the user’s choice.
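A minimal Python sketch of this decision rule is shown below for illustration; scipy is an assumption, and so are the choice of the "greater" alternative for the single-tailed case and the omission of a continuity correction in the Z approximation. It is not the SQL the analysis generates.

```python
from scipy import stats

def binomial_call(n_pos: int, n_neg: int, p_star: float = 0.5,
                  alpha: float = 0.05, single_tail: bool = True) -> str:
    n = n_pos + n_neg
    if n > 100:
        # Normal (Z) approximation for large N, without continuity correction.
        z = (n_pos - n * p_star) / (n * p_star * (1.0 - p_star)) ** 0.5
        p_value = (1.0 - stats.norm.cdf(z)) if single_tail \
            else 2.0 * (1.0 - stats.norm.cdf(abs(z)))
    else:
        alternative = "greater" if single_tail else "two-sided"
        p_value = stats.binomtest(n_pos, n, p_star, alternative=alternative).pvalue
    return "p" if p_value < alpha else "a"  # reject / accept the null hypothesis
```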
Initiate a Binomial Test
Use the following procedure to initiate a new Binomial Test in Teradata Warehouse Miner:
1
Click on the Add New Analysis icon in the toolbar:
Figure 136: Add New Analysis from toolbar
2
In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories
and then under Analyses double-click on Binomial Tests:
Figure 137: Add New Analysis > Statistical Tests > Binomial Tests
3
This will bring up the Binomial Tests dialog in which you will enter STATISTICAL TEST
STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the
following sections.
Binomial Tests - INPUT - Data Selection
On the Binomial Tests dialog click on INPUT and then click on data selection:
Figure 138: Binomial Tests > Input > Data Selection
On this screen select:
1
Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis, or it contains a single entry with the
name of the analysis under the label Volatile Table, representing the output of the analysis
that is ordinarily produced by a Select statement.
2
Select Columns From a Single Table
•
Available Databases (or Analyses) — These are the databases (or analyses) available to
be processed.
•
Available Tables — These are the tables and views that are available to be processed.
•
Available Columns — These are the columns within the table/view that are available
for processing.
3
Select Statistical Test Style
These are the binomial tests available (Binomial, Sign). Select “Binomial”.
4
Select Optional Columns
•
Selected Columns — Select columns by highlighting and then either dragging and
dropping into the Selected Columns window, or click on the arrow button to move
highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can insert
columns as First Column, Second Column or Group By Columns. Make sure you have
the correct portion of the window highlighted.
•
First Column — The column that specifies the first variable for the Binomial Test
analysis.
•
Second Column — The column that specifies the second variable for the Binomial
Test analysis.
•
Group By Columns — The column(s) that specifies the variable(s) whose distinct
value combinations will categorize the data, so a separate test is performed on each
category.
Binomial Tests - INPUT - Analysis Parameters
On the Binomial Tests dialog click on INPUT and then click on analysis parameters:
Figure 139: Binomial Tests > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
•
Threshold Probability — Enter the “alpha” probability below which to reject the null
hypothesis.
•
Single Tail — Check this box if the Binomial Test is to be single-tailed. Default is two-tailed.
•
Binomial Probability — If the binomial probability to be tested is not ½, enter the desired probability.
Default is 0.5.
•
Exact Matches Comparison Criterion — Check the button to specify how exact matches
are to be handled. Default is they are discarded. Other options are to include them with
negative count, or with positive count.
Binomial Tests - OUTPUT
On the Binomial Tests dialog click on OUTPUT:
Figure 140: Binomial Tests > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata
table populated with the results of the analysis. Once enabled, the following three fields
must be specified:
•
Database Name — The database where the output table will be saved.
•
Output Name — The table name that the output will be saved under.
•
Output Type — The output type must be table when storing Statistical Test output in
the database.
•
Stored Procedure — The creation of a stored procedure containing the SQL generated
for this analysis can be requested by entering the desired name of the stored procedure
here. This will result in the creation of a stored procedure in the user's login database
in place of the execution of the SQL generated by the analysis.
•
Procedure Comment — When an optional Procedure Comment is entered it is applied
to a requested Stored Procedure with an SQL Comment statement. It can be up to 255
characters in length and contain substitution parameters for the output category (Score,
ADS, Stats or Other), project name and/or analysis name (using the tags <Category>,
<Project> and <Analysis>, respectively). (Note that the default value of this field may
be set on the Defaults tab of the Preferences dialog, available from the Tools Menu).
•
Create output table using fallback keyword — Fallback keyword will be used to create
the table
•
Create output table using multiset keyword — Multiset keyword will be used to create
the table
•
Advertise Output — The Advertise Output option may be requested when creating a
table, view or procedure. This feature “advertises” output by inserting information into
one or more of the Advertise Output metadata tables according to the type of analysis
and the options selected in the analysis.
•
Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that
may be used to categorize or describe the output.
•
Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not
execute it. The SQL will be available to be viewed.
Run the Binomial Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Binomial Test
The results of running the Binomial analysis include a table with a row for each group-by
variable requested, as well as the SQL to perform the statistical analysis. All of these results
are outlined below.
Binomial Tests - RESULTS - SQL
On the Binomial Tests dialog click on RESULTS and then click on SQL:
Figure 141: Binomial Tests > Results > SQL
The series of SQL statements that comprises the Binomial analysis is displayed here. It is always returned, and is the only item returned when the Generate SQL Without Executing option is used.
Binomial Tests - RESULTS - data
On the Binomial Tests dialog click on RESULTS and then click on data:
Figure 142: Binomial Tests > Results > data
This displays the output table generated by the Binomial analysis, with a row for each group-by variable combination.
Output Columns - Binomial Tests
The following table is built in the requested Output Database by the Binomial analysis. Any
group-by columns will comprise the Unique Primary Index (UPI), otherwise the UPI will be
“N”.
Table 91: Output Database table (Built by the Binomial Analysis)
Name           Type     Definition
N              INTEGER  Total count of value pairs
NPos           INTEGER  Count of positive value differences
NNeg           INTEGER  Count of negative value differences
BP             FLOAT    The Binomial Probability
BinomialCallP  Char     The Binomial result: a=accept, p=reject (positive), n=reject (negative)
Tutorial - Binomial Tests Analysis
In this example, a Binomial analysis is performed on the fictitious banking data to analyze
account usage. Parameterize the Binomial analysis as follows:
• Available Tables — twm_customer_analysis
• First Column — avg_sv_bal
• Second Column — avg_ck_bal
• Group By Columns — gender
• Analysis Parameters
•
Threshold Probability — 0.05
•
Single Tail — true
•
Binomial Probability — 0.5
•
Exact Matches — discarded
Run the analysis and click on Results when it completes. For this example, the Binomial
analysis generated the following. The Binomial was computed on average savings balance
(column 1) vs. average check account balance (column 2), by gender. The test is a Z-test since N>100, and Z is 3.29 (not in the answer set), so the one-sided test of the null hypothesis that p is ½ is rejected, as shown in the table below.
Table 92: Binomial Test Analysis (Table 1)
gender  N    NPos  NNeg  BP      BinomialCallP_0.05
F       366  217   149   0.0002  p
M       259  156   103   0.0005  p
Rerunning the test with parameter binomial probability set to 0.6 gives a different result: the
one-sided test of the null hypothesis that p is 0.6 is accepted as shown in the table below.
Table 93: Binomial Test Analysis (Table 2)
gender  N    NPos  NNeg  BP      BinomialCallP_0.05
F       366  217   149   0.3909  a
M       259  156   103   0.4697  a
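To tie these tables back to the illustrative binomial_call sketch given earlier in this chapter, the counts could be reproduced from an export of twm_customer_analysis roughly as follows (again an assumption for illustration, not the generated SQL):

```python
import pandas as pd

# df is assumed to hold rows exported from twm_customer_analysis with
# columns gender, avg_sv_bal and avg_ck_bal.
def sign_counts(df: pd.DataFrame) -> pd.DataFrame:
    kept = df[df["avg_sv_bal"] != df["avg_ck_bal"]]    # exact matches discarded
    positive = kept["avg_sv_bal"] > kept["avg_ck_bal"]
    return pd.DataFrame({
        "NPos": positive.groupby(kept["gender"]).sum(),
        "NNeg": (~positive).groupby(kept["gender"]).sum(),
    })
```

The accept/reject call in the last column of each table would then follow from binomial_call(NPos, NNeg, 0.5) for the first run and binomial_call(NPos, NNeg, 0.6) for the rerun.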
Binomial Sign Test
For the sign test, one column is selected and the test is whether the value is positive or not
positive.
Initiate a Binomial Sign Test
Use the following procedure to initiate a new Binomial Sign Test in Teradata Warehouse
Miner:
1
Click on the Add New Analysis icon in the toolbar:
Figure 143: Add New Analysis from toolbar
2
In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories
and then under Analyses double-click on Binomial Tests:
Figure 144: Add New Analysis > Statistical Tests > Binomial Tests
3
This will bring up the Binomial Tests dialog in which you will enter STATISTICAL TEST
STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the
following sections.
Binomial Sign Test - INPUT - Data Selection
On the Binomial Tests dialog click on INPUT and then click on data selection:
Figure 145: Binomial Sign Test > Input > Data Selection
On this screen select:
1
Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis, or it contains a single entry with the
name of the analysis under the label Volatile Table, representing the output of the analysis
that is ordinarily produced by a Select statement.
2
Select Columns From a Single Table
•
Available Databases (or Analyses) — These are the databases (or analyses) available to
be processed.
•
Available Tables — These are the tables and views that are available to be processed.
•
Available Columns — These are the columns within the table/view that are available
for processing.
3
Select Statistical Test Style
These are the binomial tests available (Binomial, Sign). Select “Sign”.
4
Select Optional Columns
•
Selected Columns — Select columns by highlighting and then either dragging and
dropping into the Selected Columns window, or click on the arrow button to move
highlighted columns into the Selected Columns window.
•
Column/Group By Columns — Note that the Selected Columns window is actually a
split window; you can insert columns as Column, or Group By Columns. Make sure
you have the correct portion of the window highlighted.
•
Column — The column that specifies the first variable for the Binomial Test
analysis.
•
Group By Columns — The column(s) that specifies the variable(s) whose distinct
value combinations will categorize the data, so a separate test is performed on each
category.
Binomial Sign Test - INPUT - Analysis Parameters
On the Binomial Tests dialog click on INPUT and then click on analysis parameters:
Figure 146: Binomial Sign Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
•
Threshold Probability — Enter the “alpha” probability below which to reject the null
hypothesis.
•
Single Tail — Check this box if the Binomial Test is to be single-tailed. Default is two-tailed.
Binomial Sign Test - OUTPUT
On the Binomial Tests dialog click on OUTPUT:
Figure 147: Binomial Sign Test > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata
table populated with the results of the analysis. Once enabled, the following three fields
must be specified:
•
Database Name — The database where the output table will be saved.
•
Output Name — The table name that the output will be saved under.
•
Output Type — The output type must be table when storing Statistical Test output in
the database.
•
Stored Procedure — The creation of a stored procedure containing the SQL generated
for this analysis can be requested by entering the desired name of the stored procedure
here. This will result in the creation of a stored procedure in the user's login database
in place of the execution of the SQL generated by the analysis.
•
Procedure Comment — When an optional Procedure Comment is entered it is applied
to a requested Stored Procedure with an SQL Comment statement. It can be up to 255
characters in length and contain substitution parameters for the output category (Score,
ADS, Stats or Other), project name and/or analysis name (using the tags <Category>,
<Project> and <Analysis>, respectively). (Note that the default value of this field may
be set on the Defaults tab of the Preferences dialog, available from the Tools Menu).
•
Create output table using fallback keyword — Fallback keyword will be used to create
the table
•
Create output table using multiset keyword — Multiset keyword will be used to create
the table
•
Advertise Output — The Advertise Output option may be requested when creating a
table, view or procedure. This feature “advertises” output by inserting information into
one or more of the Advertise Output metadata tables according to the type of analysis
and the options selected in the analysis.
•
Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that
may be used to categorize or describe the output.
•
Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not
execute it. The SQL will be available to be viewed.
Run the Binomial Sign Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Binomial Sign Test Analysis
The results of running the Binomial Sign analysis include a table with a row for each group-by variable requested, as well as the SQL to perform the statistical analysis. All of these
results are outlined below.
Binomial Sign Test - RESULTS - SQL
On the Binomial Tests dialog click on RESULTS and then click on SQL:
Figure 148: Binomial Sign Test > Results > SQL
The series of SQL statements that comprises the Binomial Sign analysis is displayed here. It is always returned, and is the only item returned when the Generate SQL Without Executing option is used.
Binomial Sign Test - RESULTS - data
On the Binomial Tests dialog click on RESULTS and then click on data:
Figure 149: Binomial Sign Test > Results > data
This displays the output table generated by the Binomial Sign analysis, with a row for each group-by variable combination.
Output Columns - Binomial Sign Analysis
The following table is built in the requested Output Database by the Binomial analysis. Any
group-by columns will comprise the Unique Primary Index (UPI), otherwise the UPI will be
“N”.
Table 94: Binomial Sign Analysis: Output Columns
Name           Type     Definition
N              INTEGER  Total count of value pairs
NPos           INTEGER  Count of positive values
NNeg           INTEGER  Count of negative or zero values
BP             FLOAT    The Binomial Probability
BinomialCallP  Char     The Binomial Sign result: a=accept, p=reject (positive), n=reject (negative)
Tutorial - Binomial Sign Analysis
In this example, a Binomial analysis is performed on the fictitious banking data to analyze
account usage. Parameterize the Binomial analysis as follows:
• Available Tables — twm_customer_analysis
• Column — female
• Group By Columns — years_with_bank
• Analysis Parameters
•
Threshold Probability — 0.05
•
Single Tail — true
Run the analysis and click on Results when it completes. For this example, the Binomial Sign
analysis generated the following. The Binomial was computed on the Boolean variable
“female” by years_with_bank. The one-sided test of the null hypothesis that p is ½ is accepted for all cases except years_with_bank=2, as shown in the table below.
Table 95: Tutorial - Binomial Sign Analysis
years_with_bank  N   NPos  NNeg  BP        BinomialCallP_0.05
0                88  51    37    0.08272   a
1                87  48    39    0.195595  a
2                94  57    37    0.024725  p
3                86  46    40    0.295018  a
4                78  39    39    0.545027  a
5                82  46    36    0.160147  a
6                83  46    37    0.19      a
7                65  36    29    0.22851   a
8                45  26    19    0.185649  a
9                39  23    16    0.168392  a
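For illustration, the per-group figures above could be approximated outside the database with the following Python sketch (pandas and scipy are assumptions, as is the use of the "greater" single-tailed alternative); it is not the SQL the analysis generates.

```python
import pandas as pd
from scipy import stats

# df is assumed to hold rows exported from twm_customer_analysis with
# columns female (0/1 indicator) and years_with_bank.
def sign_test_by_group(df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    rows = []
    for ywb, grp in df.groupby("years_with_bank"):
        n = len(grp)
        n_pos = int((grp["female"] > 0).sum())   # positive values
        bp = stats.binomtest(n_pos, n, 0.5, alternative="greater").pvalue
        rows.append({"years_with_bank": ywb, "N": n, "NPos": n_pos,
                     "NNeg": n - n_pos, "BP": bp,
                     "Call": "p" if bp < alpha else "a"})
    return pd.DataFrame(rows)
```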
Kolmogorov-Smirnov Tests
Tests of the Kolmogorov-Smirnov Type are based on statistical procedures which use
maximum vertical distance between functions as a measure of function similarity. Two
empirical distribution functions are mapped against each other, or a single empirical function
is mapped against a hypothetical (e.g. Normal) distribution. Conclusions are then drawn
about the likelihood the two distributions are the same.
Kolmogorov-Smirnov Test (One Sample)
The Kolmogorov-Smirnov (one-sample) test determines whether a dataset matches a particular distribution (for this test, the normal distribution). The test has the advantage of making no assumption about the distribution of data (it is non-parametric and distribution-free). Note that this generality comes at some cost: other tests (e.g. the Student's t-test) may be more sensitive if the data meet the requirements of the test. The Kolmogorov-Smirnov test is generally less powerful than the tests specifically designed to test for normality. This is especially true when the mean and variance are not specified in advance for the Kolmogorov-Smirnov test, which then becomes conservative. Further, the Kolmogorov-Smirnov test will not indicate the type of nonnormality, e.g. whether the distribution is skewed or heavy-tailed. Examination of the skewness and kurtosis, and of the histogram, boxplot, and normal probability plot for the data may show why the data failed the Kolmogorov-Smirnov test.
In this test, the user can specify group-by variables (GBV's) so a separate test will be done for
every unique set of values of the GBV's.
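As an aside, a rough Python analogue of this per-group test is sketched below (pandas and scipy are assumptions, not part of the product, and the data frame df is assumed to be an export of the selected input table); it is not the SQL the analysis generates.

```python
import pandas as pd
from scipy import stats

# df is assumed to hold the exported input rows with the column of interest
# and any group-by columns.
def ks_normality_by_group(df: pd.DataFrame, column: str, group_by: list,
                          alpha: float = 0.05) -> pd.DataFrame:
    rows = []
    for key, grp in df.groupby(group_by):
        x = grp[column].dropna()
        # One-sample test against a normal distribution whose mean and standard
        # deviation are estimated from the data (which, as noted above, makes
        # the test conservative).
        stat, p_value = stats.kstest(x, "norm", args=(x.mean(), x.std()))
        rows.append({"group": key, "Klm": stat, "M": len(x),
                     "KlmPValue": p_value,
                     "Call": "p" if p_value < alpha else "a"})
    return pd.DataFrame(rows)
```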
Initiate a Kolmogorov-Smirnov Test
Use the following procedure to initiate a new Kolmogorov-Smirnov Test in Teradata
Warehouse Miner:
1
Click on the Add New Analysis icon in the toolbar:
Figure 150: Add New Analysis from toolbar
2
In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories
and then under Analyses double-click on Kolmogorov-Smirnov Tests:
Figure 151: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests
3
This will bring up the Kolmogorov-Smirnov Tests dialog in which you will enter
STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis
as described in the following sections.
Kolmogorov-Smirnov Test - INPUT - Data Selection
On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on data selection:
Figure 152: Kolmogorov-Smirnov Test > Input > Data Selection
On this screen select:
1
Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis, or it contains a single entry with the
name of the analysis under the label Volatile Table, representing the output of the analysis
that is ordinarily produced by a Select statement.
2
Select Columns From a Single Table
•
Available Databases (or Analyses) — These are the databases (or analyses) available to
be processed.
•
Available Tables — These are the tables and views that are available to be processed.
•
Available Columns — These are the columns within the table/view that are available
for processing.
3
Select Statistical Test Style
These are the Tests of the Kolmogorov-Smirnov Type available (Kolmogorov-Smirnov,
Lilliefors, Shapiro-Wilk, D'Agostino-Pearson, Smirnov). Select “Kolmogorov-Smirnov”.
4
Select Optional Columns
•
Selected Columns — Select columns by highlighting and then either dragging and
dropping into the Selected Columns window, or click on the arrow button to move
highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can insert
columns as Column of Interest, Group By Columns. Make sure you have the correct
portion of the window highlighted.
•
Column of Interest — The column that specifies the numeric variable to be tested
for normality.
•
Group By Columns — The columns which specify the variables whose distinct
value combinations will categorize the data, so a separate test is performed on each
category.
Kolmogorov-Smirnov Test - INPUT - Analysis Parameters
On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on analysis
parameters:
Figure 153: Kolmogorov-Smirnov Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
•
Threshold Probability — Enter the “alpha” probability below which to reject the null
hypothesis.
Kolmogorov-Smirnov Test - OUTPUT
On the Kolmogorov-Smirnov Tests dialog click on OUTPUT:
Figure 154: Kolmogorov-Smirnov Test > Output
On this screen select:
• Store the tabular output of this analysis in the database — Option to generate a Teradata
table populated with the results of the analysis. Once enabled, the following three fields
must be specified:
•
Database Name — The database where the output table will be saved.
•
Output Name — The table name that the output will be saved under.
•
Output Type — The output type must be table when storing Statistical Test output in
the database.
•
Stored Procedure — The creation of a stored procedure containing the SQL generated
for this analysis can be requested by entering the desired name of the stored procedure
here. This will result in the creation of a stored procedure in the user's login database
in place of the execution of the SQL generated by the analysis.
•
Procedure Comment — When an optional Procedure Comment is entered it is applied
to a requested Stored Procedure with an SQL Comment statement. It can be up to 255
characters in length and contain substitution parameters for the output category (Score,
ADS, Stats or Other), project name and/or analysis name (using the tags <Category>,
<Project> and <Analysis>, respectively). (Note that the default value of this field may
be set on the Defaults tab of the Preferences dialog, available from the Tools Menu).
•
Create output table using fallback keyword — Fallback keyword will be used to create
the table
•
Create output table using multiset keyword — Multiset keyword will be used to create
the table
•
Advertise Output — The Advertise Output option may be requested when creating a
table, view or procedure. This feature “advertises” output by inserting information into
one or more of the Advertise Output metadata tables according to the type of analysis
and the options selected in the analysis.
•
Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that
may be used to categorize or describe the output.
•
Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not
execute it. The SQL will be available to be viewed.
Run the Kolmogorov-Smirnov Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Kolmogorov-Smirnov Test
The results of running the Kolmogorov-Smirnov Test analysis include a table with a row for
each separate Kolmogorov-Smirnov test on all distinct-value group-by variables, as well as
the SQL to perform the statistical analysis. All of these results are outlined below.
Kolmogorov-Smirnov Test - RESULTS - SQL
On the Kolmogorov-Smirnov Test dialog click on RESULTS and then click on SQL:
Figure 155: Kolmogorov-Smirnov Test > Results > SQL
The series of SQL statements that comprises the Kolmogorov-Smirnov Test analysis is displayed here. It is always returned, and is the only item returned when the Generate SQL without Executing option is used.
Kolmogorov-Smirnov Test - RESULTS - data
On the Kolmogorov-Smirnov Test dialog click on RESULTS and then click on data:
Figure 156: Kolmogorov-Smirnov Test > Results > Data
The output table is generated by the Analysis for each separate Kolmogorov-Smirnov test on
all distinct-value group-by variables.
Output Columns - Kolmogorov-Smirnov Test Analysis
The following table is built in the requested Output Database by the Kolmogorov-Smirnov
test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI).
Otherwise Klm will be the UPI.
Table 96: Output Database table (Built by the Kolmogorov-Smirnov test analysis)
Name           Type     Definition
Klm            Float    Kolmogorov-Smirnov Value
M              INTEGER  Count
KlmPValue      Float    The probability associated with the Kolmogorov-Smirnov statistic
KlmPText       Char     Text description if P is outside table range
KlmCallP_0.05  Char     The Kolmogorov-Smirnov result: a=accept, p=reject
Tutorial - Kolmogorov-Smirnov Test Analysis
In this example, a Kolmogorov-Smirnov test analysis is performed on the fictitious banking
data to analyze account usage. Parameterize a Kolmogorov-Smirnov Test analysis as follows:
• Available Tables — twm_customer_analysis
• Column of Interest — income
• Group By Columns — years_with_bank
• Analysis Parameters
•
Threshold Probability — 0.05
Run the analysis and click on Results when it completes. For this example, the Kolmogorov-Smirnov Test analysis generated the following table. The Kolmogorov-Smirnov Test was computed for each distinct value of the group-by variable “years_with_bank”. Results were sorted by years_with_bank. The test shows customer incomes with years_with_bank of 1, 5, 6, 7, 8, and 9 were normally distributed and those with 0, 2, and 3 were not. A ‘p’ means significantly nonnormal and an ‘a’ means accept the null hypothesis of normality. The SQL is available for viewing but not listed below.
Table 97: Kolmogorov-Smirnov Test
years_with_bank  Klm          M   KlmPValue    KlmPText  KlmCallP_0.05
0                0.159887652  88  0.019549995            p
1                0.118707332  87  0.162772589            a
2                0.140315991  94  0.045795894            p
3                0.15830739   86  0.025080666            p
4                0.999999     78  0.01         <0.01     p
5                0.138336567  82  0.080579955            a
6                0.127171093  83  0.127653475            a
7                0.135147555  65  0.172828265            a
8                0.184197592  45  0.084134345            a
9                0.109205054  39  0.20         >0.20     a
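Continuing the illustrative ks_normality_by_group sketch introduced at the start of this Kolmogorov-Smirnov discussion (the function, the exported data frame df and the use of pandas/scipy are all assumptions), the per-group report above could be approximated with a call such as:

```python
# Approximate the per-group normality calls on an export of twm_customer_analysis.
report = ks_normality_by_group(df, column="income",
                               group_by=["years_with_bank"], alpha=0.05)
print(report)
```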
Lilliefors Test
The Lilliefors test determines whether a dataset matches a particular distribution, and is
identical to the Kolmogorov-Smirnov test except that conversion to Z-scores is made. The
Lilliefors test is therefore a modification of the Kolmogorov-Smirnov test. The Lilliefors test
computes the Lilliefors statistic and checks its significance. Exact tables of the quantiles of
the test statistic were computed from random numbers in computer simulations. The
computed value of the test statistic is compared with the quantiles of the statistic.
When the test is for the normal distribution, the null hypothesis is that the distribution
function is normal with unspecified mean and variance. The alternative hypothesis is that the
distribution function is nonnormal. The empirical distribution of X is compared with a normal
distribution with the same mean and variance as X. It is similar to the Kolmogorov-Smirnov
test, but it adjusts for the fact that the parameters of the normal distribution are estimated from
X rather than specified in advance.
In this test, the user can specify group-by variables (GBV's) so a separate test will be done for
every unique set of values of the GBV's.
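For illustration only, statsmodels ships a Lilliefors implementation that mirrors this description; the sketch below (with pandas, statsmodels and the exported data frame df as assumptions) shows how it could be applied per group. It is not the SQL the analysis generates.

```python
import pandas as pd
from statsmodels.stats.diagnostic import lilliefors

# df is assumed to hold the exported input rows with the column of interest
# and any group-by columns.
def lilliefors_by_group(df: pd.DataFrame, column: str, group_by: list,
                        alpha: float = 0.05) -> pd.DataFrame:
    rows = []
    for key, grp in df.groupby(group_by):
        x = grp[column].dropna()
        # The sample is compared with a normal distribution whose parameters
        # are estimated from the sample itself, as the Lilliefors test requires.
        stat, p_value = lilliefors(x, dist="norm")
        rows.append({"group": key, "statistic": stat, "M": len(x),
                     "PValue": p_value,
                     "Call": "p" if p_value < alpha else "a"})
    return pd.DataFrame(rows)
```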
Initiate a Lilliefors Test
Use the following procedure to initiate a new Lilliefors Test in Teradata Warehouse Miner:
1
Click on the Add New Analysis icon in the toolbar:
Figure 157: Add New Analysis from toolbar
2
In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories
and then under Analyses double-click on Kolmogorov-Smirnov Tests:
Figure 158: Add New Analysis> Statistical Tests > Kolmogorov-Smirnov Tests
3
This will bring up the Kolmogorov-Smirnov Tests dialog in which you will enter
STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis
as described in the following sections.
Lilliefors Test - INPUT - Data Selection
On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on data selection:
Figure 159: Lilliefors Test > Input > Data Selection
On this screen select:
1
Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis, or it contains a single entry with the
name of the analysis under the label Volatile Table, representing the output of the analysis
that is ordinarily produced by a Select statement.
2
Select Columns From a Single Table
•
Available Databases (or Analyses) — These are the databases (or analyses) available to
be processed.
•
Available Tables — These are the tables and views that are available to be processed.
•
Available Columns — These are the columns within the table/view that are available
for processing.
3
Select Statistical Test Style
These are the Tests of the Kolmogorov-Smirnov Type available (Kolmogorov-Smirnov,
Lilliefors, Shapiro-Wilk, D'Agostino-Pearson, Smirnov). Select “Lilliefors”.
4
Select Optional Columns
•
Selected Columns — Select columns by highlighting and then either dragging and
dropping into the Selected Columns window, or click on the arrow button to move
highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can insert
columns as Column of Interest, Group By Columns. Make sure you have the correct
portion of the window highlighted.
•
Column of Interest — The column that specifies the numeric variable to be tested
for normality.
•
Group By Columns — The columns which specify the variables whose distinct
value combinations will categorize the data, so a separate test is performed on each
category.
Lilliefors Test - INPUT - Analysis Parameters
On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on analysis
parameters:
Figure 160: Lilliefors Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
•
Threshold Probability — Enter the “alpha” probability below which to reject the null
hypothesis.
Lilliefors Test - OUTPUT
On the Kolmogorov-Smirnov Tests dialog click on OUTPUT:
Figure 161: Lilliefors Test > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata
table populated with the results of the analysis. Once enabled, the following three fields
must be specified:
•
Database Name — The database where the output table will be saved.
•
Output Name — The table name that the output will be saved under.
•
Output Type — The output type must be table when storing Statistical Test output in
the database.
•
Stored Procedure — The creation of a stored procedure containing the SQL generated
for this analysis can be requested by entering the desired name of the stored procedure
here. This will result in the creation of a stored procedure in the user's login database
in place of the execution of the SQL generated by the analysis.
•
Procedure Comment — When an optional Procedure Comment is entered it is applied
to a requested Stored Procedure with an SQL Comment statement. It can be up to 255
characters in length and contain substitution parameters for the output category (Score,
ADS, Stats or Other), project name and/or analysis name (using the tags <Category>,
<Project> and <Analysis>, respectively). (Note that the default value of this field may
be set on the Defaults tab of the Preferences dialog, available from the Tools Menu).
•
Create output table using fallback keyword — Fallback keyword will be used to create
the table
•
Create output table using multiset keyword — Multiset keyword will be used to create
the table
•
Advertise Output — The Advertise Output option may be requested when creating a
table, view or procedure. This feature “advertises” output by inserting information into
one or more of the Advertise Output metadata tables according to the type of analysis
and the options selected in the analysis.
•
Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that
may be used to categorize or describe the output.
•
Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not
execute it. The SQL will be available to be viewed.
Run the Lilliefors Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Lilliefors Test Analysis
The results of running the Lilliefors Test analysis include a table with a row for each separate
Lilliefors test on all distinct-value group-by variables, as well as the SQL to perform the
statistical analysis. All of these results are outlined below.
Lilliefors Test - RESULTS - SQL
On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on SQL:
Figure 162: Lilliefors Test > Results > SQL
This series of SQL statements comprises the Lilliefors Test Analysis. It is always returned, and
is the only item returned when the Generate SQL without Executing option is used.
Lilliefors Test - RESULTS - Data
On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on data:
Figure 163: Lilliefors Test > Results > Data
The output table is generated by the Analysis for each separate Lilliefors test on all distinct-value group-by variables.
Output Columns - Lilliefors Test Analysis
The following table is built in the requested Output Database by the Lilliefors test analysis.
Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise Lilliefors
will be the UPI.
Table 98: Lilliefors Test Analysis: Output Columns

Name                  Type     Definition
Lilliefors            Float    Lilliefors Value
M                     INTEGER  Count
LillieforsPValue      Float    The probability associated with the Lilliefors statistic
LillieforsPText       Char     Text description if P is outside table range
LillieforsCallP_0.05  Char     The Lilliefors result: a=accept, p=reject
Tutorial - Lilliefors Test Analysis
In this example, a Lilliefors test analysis is performed on the fictitious banking data to
analyze account usage. Parameterize a Lilliefors Test analysis as follows:
• Available Tables — twm_customer_analysis
• Column of Interest — income
• Group By Columns — years_with_bank
• Analysis Parameters
•
Threshold Probability — 0.05
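For readers who want to sanity-check the grouped computation outside the product, the following Python sketch runs an analogous grouped Lilliefors test with pandas and statsmodels. It is illustrative only: it is not the SQL the analysis generates, the file name twm_customer_analysis.csv is a hypothetical export of the tutorial table, and statsmodels tabulates p-values slightly differently, so minor numeric differences from the table below are expected.

    # Illustrative cross-check only; not the SQL that Teradata Warehouse Miner generates.
    # "twm_customer_analysis.csv" is a hypothetical export of the tutorial table.
    import pandas as pd
    from statsmodels.stats.diagnostic import lilliefors

    df = pd.read_csv("twm_customer_analysis.csv")

    for group, subset in df.groupby("years_with_bank"):
        stat, pvalue = lilliefors(subset["income"], dist="norm")
        call = "p" if pvalue < 0.05 else "a"   # Threshold Probability = 0.05
        print(group, len(subset), round(stat, 6), round(pvalue, 4), call)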
Run the analysis and click on Results when it completes. For this example, the Lilliefors Test
analysis generated the following table. The Lilliefors Test was computed for each distinct
value of the group by variable “years_with_bank”. Results were sorted by years_with_bank.
The tests show that customer incomes were not normally distributed, except for years_with_bank = 9.
A ‘p’ means significantly nonnormal and an ‘a’ means accept the null hypothesis of normality.
Note: The SQL is available for viewing but not listed below.
Table 99: Lilliefors Test

years_with_bank  Lilliefors   M   LillieforsPValue  LillieforsPText  LillieforsCallP_0.05
0                0.166465166  88  0.01              <0.01            p
1                0.123396019  87  0.01              <0.01            p
2                0.146792366  94  0.01              <0.01            p
3                0.156845809  86  0.01              <0.01            p
4                0.192756959  78  0.01              <0.01            p
5                0.144308699  82  0.01              <0.01            p
6                0.125268495  83  0.01              <0.01            p
7                0.141128127  65  0.01              <0.01            p
8                0.191869596  45  0.01              <0.01            p
9                0.111526787  39  0.20              >0.20            a
Shapiro-Wilk Test
The Shapiro-Wilk W test is designed to detect departures from normality without requiring
that the mean or variance of the hypothesized normal distribution be specified in advance. It
is considered to be one of the best omnibus tests of normality. The function is based on the
approximations and code given by Royston (1982a, b). It can be used in samples as large as
2,000 or as small as 3. Royston (1982b) gives approximations and tabled values that can be
used to compute the coefficients, and obtains the significance level of the W statistic. Small
values of W are evidence of departure from normality. This test has done very well in
comparison studies with other goodness of fit tests.
In general, either the Shapiro-Wilk or D'Agostino-Pearson test is a powerful overall test for
normality. As omnibus tests, however, they will not indicate the type of nonnormality, e.g.
whether the distribution is skewed as opposed to heavy-tailed (or both). Examination of the
calculated skewness and kurtosis, and of the histogram, boxplot, and normal probability plot
for the data may provide clues as to why the data failed the Shapiro-Wilk or D'Agostino-Pearson test.
The standard algorithm for the Shapiro-Wilk test only applies to sample sizes from 3 to 2000.
For larger sample sizes, a different normality test should be used. The test statistic is based on
the Kolmogorov-Smirnov statistic for a normal distribution with the same mean and variance
as the sample mean and variance.
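As a point of reference only, the W statistic described above can be computed for a single sample with SciPy. This is a sketch on made-up data, assumed here purely for illustration; it is not the SQL generated by the analysis.

    # Illustrative only: Shapiro-Wilk W statistic and p-value for one sample.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    sample = rng.lognormal(mean=10.0, sigma=0.8, size=200)   # skewed, so W should be small

    w_stat, p_value = stats.shapiro(sample)
    print(f"W = {w_stat:.4f}, p = {p_value:.6f}")
    # Threshold Probability ("alpha") of 0.05: reject normality when p < 0.05.
    print("call =", "p" if p_value < 0.05 else "a")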
Initiate a Shapiro-Wilk Test
Use the following procedure to initiate a new Shapiro-Wilk Test in Teradata Warehouse
Miner:
1
Click on the Add New Analysis icon in the toolbar:
Figure 164: Add New Analysis from toolbar
2
In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories
and then under Analyses double-click on Kolmogorov-Smirnov Tests:
Figure 165: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests
3
This will bring up the Kolmogorov-Smirnov Tests dialog in which you will enter
STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis
as described in the following sections.
Shapiro-Wilk Test - INPUT - Data Selection
On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on data selection:
Figure 166: Shapiro-Wilk Test > Input > Data Selection
On this screen select:
1
Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis, or it contains a single entry with the
name of the analysis under the label Volatile Table, representing the output of the analysis
that is ordinarily produced by a Select statement.
2
Select Columns From a Single Table
•
Available Databases (or Analyses) — These are the databases (or analyses) available to
be processed.
•
Available Tables — These are the tables and views that are available to be processed.
•
Available Columns — These are the columns within the table/view that are available
for processing.
3
Select Statistical Test Style
These are the Tests of the Kolmogorov-Smirnov Type available (Kolmogorov-Smirnov,
Lilliefors, Shapiro-Wilk, D'Agostino-Pearson, Smirnov). Select “Shapiro-Wilk”.
4
Select Optional Columns
•
Selected Columns — Select columns by highlighting and then either dragging and
dropping into the Selected Columns window, or click on the arrow button to move
highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can insert
columns as Column of Interest, Group By Columns. Make sure you have the correct
portion of the window highlighted.
•
Column of Interest — The column that specifies the numeric variable to be tested
for normality.
•
Group By Columns — The columns which specify the variables whose distinct
value combinations will categorize the data, so a separate test is performed on each
category.
Shapiro-Wilk Test - INPUT - Analysis Parameters
On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on analysis
parameters:
Figure 167: Shapiro-Wilk Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
•
Threshold Probability — Enter the “alpha” probability below which to reject the null
hypothesis.
Shapiro-Wilk Test - OUTPUT
On the Kolmogorov-Smirnov Tests dialog click on OUTPUT:
Figure 168: Shapiro-Wilk Test > Output
On this screen select:
• Store the tabular output of this analysis in the database — Option to generate a Teradata
table populated with the results of the analysis. Once enabled, the following three fields
must be specified:
•
Database Name — The database where the output table will be saved.
•
Output Name — The table name that the output will be saved under.
•
Output Type — The output type must be table when storing Statistical Test output in
the database.
•
Stored Procedure — The creation of a stored procedure containing the SQL generated
for this analysis can be requested by entering the desired name of the stored procedure
here. This will result in the creation of a stored procedure in the user's login database
in place of the execution of the SQL generated by the analysis.
•
Procedure Comment — When an optional Procedure Comment is entered it is applied
to a requested Stored Procedure with an SQL Comment statement. It can be up to 255
characters in length and contain substitution parameters for the output category (Score,
ADS, Stats or Other), project name and/or analysis name (using the tags <Category>,
<Project> and <Analysis>, respectively). (Note that the default value of this field may
be set on the Defaults tab of the Preferences dialog, available from the Tools Menu).
•
Create output table using fallback keyword — Fallback keyword will be used to create
the table
•
Create output table using multiset keyword — Multiset keyword will be used to create
the table
•
Advertise Output — The Advertise Output option may be requested when creating a
table, view or procedure. This feature “advertises” output by inserting information into
one or more of the Advertise Output metadata tables according to the type of analysis
and the options selected in the analysis.
•
Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that
may be used to categorize or describe the output.
•
Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not
execute it. The SQL will be available to be viewed.
Run the Shapiro-Wilk Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Shapiro-Wilk Analysis
The results of running the Shapiro-Wilk Test analysis include a table with a row for each
separate Shapiro-Wilk test on all distinct-value group-by variables, as well as the SQL to
perform the statistical analysis. All of these results are outlined below.
Shapiro-Wilk Test - RESULTS - SQL
On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on SQL:
Figure 169: Shapiro-Wilk Test > Results > SQL
This series of SQL statements comprises the Shapiro-Wilk Test Analysis. It is always returned,
and is the only item returned when the Generate SQL without Executing option is used.
Shapiro-Wilk Test - RESULTS - data
On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on data:
Figure 170: Shapiro-Wilk Test > Results > data
The output table is generated for each separate Shapiro-Wilk test on all distinct-value group-by variables.
Output Columns - Shapiro-Wilk Test Analysis
The following table is built in the requested Output Database by the Shapiro-Wilk test
analysis. Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise
Shw will be the UPI.
Table 100: Shapiro-Wilk Test Analysis: Output Columns

Name                   Type     Definition
Shw                    Float    Shapiro-Wilk Value
N                      INTEGER  Count
ShapiroWilkPValue      Float    The probability associated with the Shapiro-Wilk statistic
ShapiroWilkPText       Char     Text description if P is outside table range
ShapiroWilkCallP_0.05  Char     The Shapiro-Wilk result: a=accept, p=reject
Tutorial - Shapiro-Wilk Test Analysis
In this example, a Shapiro-Wilk test analysis is performed on the fictitious banking data to
analyze account usage. Parameterize a Shapiro-Wilk Test analysis as follows:
• Available Tables — twm_customer_analysis
• Column of Interest — income
• Group By Columns — years_with_bank
• Analysis Parameters
•
Threshold Probability — 0.05
Run the analysis and click on Results when it completes. For this example, the Shapiro-Wilk
Test analysis generated the following table. The Shapiro-Wilk Test was computed for each
distinct value of the group by variable “years_with_bank”. Results were sorted by years_with_bank.
The tests show that customer incomes were not normally distributed.
‘p’ means significantly nonnormal and an ‘a’ means accept the null hypothesis of normality.
Note: The SQL is available for viewing but not listed below.
Table 101: Shapiro-Wilk Test

years_with_bank  Shw          N   ShapiroWilkPValue  ShapiroWilkPText  ShapiroWilkCallP_0.05
0                0.84919004   88  0.000001                             p
1                0.843099681  87  0.000001                             p
2                0.831069533  94  0.000001                             p
3                0.838965439  86  0.000001                             p
4                0.707924134  78  0.000001                             p
5                0.768444329  82  0.000001                             p
6                0.855276885  83  0.000001                             p
7                0.827399691  65  0.000001                             p
8                0.863932178  45  0.01               <0.01             p
9                0.930834522  39  0.029586304                          p
D'Agostino and Pearson Test
In general, either the Shapiro-Wilk or D'Agostino-Pearson test is a powerful overall test for
normality. These tests are designed to detect departures from normality without requiring that
the mean or variance of the hypothesized normal distribution be specified in advance. Though
these tests cannot indicate the type of nonnormality, they tend to be more powerful than the
Kolmogorov-Smirnov test.
The D'Agostino-Pearson Ksquared statistic has approximately a chi-squared distribution with
2 df when the population is normally distributed.
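For illustration, the same kind of omnibus statistic can be computed with SciPy, which combines the skewness and kurtosis Z scores into a K-squared value compared against a chi-squared distribution with 2 df. The sample data below is made up, and the snippet is independent of the SQL the analysis generates.

    # Illustrative only: D'Agostino-Pearson K-squared omnibus normality test.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sample = rng.exponential(scale=3.0, size=150)    # skewed, heavy-tailed data

    k2, p_value = stats.normaltest(sample)           # K-squared ~ chi-squared with 2 df
    z_skew = stats.skewtest(sample).statistic        # Z of skewness
    z_kurt = stats.kurtosistest(sample).statistic    # Z of kurtosis
    print(f"T = {k2:.3f}, Zskew = {z_skew:.3f}, Zkurtosis = {z_kurt:.3f}, p = {p_value:.5f}")
    print("call =", "p" if p_value < 0.05 else "a")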
Initiate a D'Agostino and Pearson Test
Use the following procedure to initiate a new D'Agostino and Pearson Test in Teradata
Warehouse Miner:
1
Click on the Add New Analysis icon in the toolbar:
Figure 171: Add New Analysis from toolbar
2
In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories
and then under Analyses double-click on Kolmogorov-Smirnov Tests:
Figure 172: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests
3
This will bring up the Kolmogorov-Smirnov Tests dialog in which you will enter
STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis
as described in the following sections.
D'Agostino and Pearson Test - INPUT - Data Selection
On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on data selection:
Figure 173: D'Agostino and Pearson Test > Input > Data Selection
On this screen select:
1
Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis, or it contains a single entry with the
name of the analysis under the label Volatile Table, representing the output of the analysis
that is ordinarily produced by a Select statement.
2
Select Columns From a Single Table
•
Available Databases (or Analyses) — These are the databases (or analyses) available to
be processed.
•
Available Tables — These are the tables and views that are available to be processed.
•
Available Columns — These are the columns within the table/view that are available
for processing.
3
Select Statistical Test Style
These are the Tests of the Kolmogorov-Smirnov Type available (Kolmogorov-Smirnov,
Lilliefors, Shapiro-Wilk, D'Agostino-Pearson, Smirnov). Select “D'Agostino and
Pearson”.
4
Select Optional Columns
•
Selected Columns — Select columns by highlighting and then either dragging and
dropping into the Selected Columns window, or click on the arrow button to move
highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can insert
columns as Column of Interest, Group By Columns. Make sure you have the correct
portion of the window highlighted.
•
Column of Interest — The column that specifies the numeric variable to be tested
for normality.
•
Group By Columns — The columns which specify the variables whose distinct
value combinations will categorize the data, so a separate test is performed on each
category.
D'Agostino and Pearson Test - INPUT - Analysis Parameters
On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on analysis
parameters:
Figure 174: D'Agostino and Pearson Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
•
Threshold Probability — Enter the “alpha” probability below which to reject the null
hypothesis.
D'Agostino and Pearson Test - OUTPUT
On the Kolmogorov-Smirnov Tests dialog click on OUTPUT:
Figure 175: D'Agostino and Pearson Test > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata
table populated with the results of the analysis. Once enabled, the following three fields
must be specified:
•
Database Name — The database where the output table will be saved.
•
Output Name — The table name that the output will be saved under.
•
Output Type — The output type must be table when storing Statistical Test output in
the database.
•
Stored Procedure — The creation of a stored procedure containing the SQL generated
for this analysis can be requested by entering the desired name of the stored procedure
here. This will result in the creation of a stored procedure in the user's login database
in place of the execution of the SQL generated by the analysis.
•
Procedure Comment — When an optional Procedure Comment is entered it is applied
to a requested Stored Procedure with an SQL Comment statement. It can be up to 255
characters in length and contain substitution parameters for the output category (Score,
ADS, Stats or Other), project name and/or analysis name (using the tags <Category>,
<Project> and <Analysis>, respectively). (Note that the default value of this field may
be set on the Defaults tab of the Preferences dialog, available from the Tools Menu).
•
Create output table using fallback keyword — Fallback keyword will be used to create
the table
•
Create output table using multiset keyword — Multiset keyword will be used to create
the table
•
Advertise Output — The Advertise Output option may be requested when creating a
table, view or procedure. This feature “advertises” output by inserting information into
one or more of the Advertise Output metadata tables according to the type of analysis
and the options selected in the analysis.
•
Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that
may be used to categorize or describe the output.
•
Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not
execute it. The SQL will be available to be viewed.
Run the D'Agostino and Pearson Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - D'Agostino and Pearson Test Analysis
The results of running the D'Agostino and Pearson Test analysis include a table with a row for
each separate D'Agostino and Pearson test on all distinct-value group-by variables, as well as
the SQL to perform the statistical analysis. All of these results are outlined below.
D'Agostino and Pearson Test - RESULTS - SQL
On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on SQL:
Figure 176: D'Agostino and Pearson Test > Results > SQL
This series of SQL statements comprises the D'Agostino and Pearson Test Analysis. It is
always returned, and is the only item returned when the Generate SQL without Executing
option is used.
D'Agostino and Pearson Test - RESULTS - data
On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on data:
Figure 177: D'Agostino and Pearson Test > Results > data
The output table is generated by the Analysis for each separate D'Agostino and Pearson test
on all distinct-value group-by variables.
Output Columns - D'Agostino and Pearson Test Analysis
The following table is built in the requested Output Database by the D'Agostino and Pearson
test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI).
Otherwise T will be the UPI.
Table 102: D'Agostino and Pearson Test Analysis: Output Columns

Name           Type   Definition
T              Float  K-Squared statistic
Zkurtosis      Float  Z of kurtosis
Zskew          Float  Z of Skewness
ChiPValue      Float  The probability associated with the K-Squared statistic
ChiPText       Char   Text description if P is outside table range
ChiCallP_0.05  Char   The D'Agostino-Pearson result: a=accept, p=reject
Tutorial - D'Agostino and Pearson Test Analysis
In this example, a D'Agostino and Pearson test analysis is performed on the fictitious banking
data to analyze account usage. Parameterize a D'Agostino and Pearson Test analysis as
follows:
• Available Tables — twm_customer_analysis
• Column of Interest — income
• Group By Columns — years_with_bank
• Analysis Parameters
•
Threshold Probability — 0.05
Run the analysis and click on Results when it completes. For this example, the D'Agostino
and Pearson Test analysis generated the following table. The D'Agostino and Pearson Test
was computed for each distinct value of the group by variable “years_with_bank”. Results
were sorted by years_with_bank. The tests show that customer incomes were not normally
distributed except those from years_with_bank = 9. ‘p’ means significantly nonnormal and an
‘a’ means accept the null hypothesis of normality. The SQL is available for viewing but not
listed below.
Table 103: D'Agostino and Pearson Test: Output Columns

years_with_bank  T         Zkurtosis  Zskew    ChiPValue  ChiPText  ChiCallP_0.05
0                29.05255  2.71261    4.65771  0.0001     <0.0001   p
1                34.18025  3.30609    4.82183  0.0001     <0.0001   p
2                30.71123  2.78588    4.79062  0.0001     <0.0001   p
3                32.81104  3.06954    4.83621  0.0001     <0.0001   p
4                82.01928  5.72010    7.02137  0.0001     <0.0001   p
5                62.36861  4.91949    6.17796  0.0001     <0.0001   p
6                24.80241  2.40521    4.36089  0.0001     <0.0001   p
7                17.72275  1.83396    3.78937  0.00019              p
8                6.55032   -0.23415   2.54863  0.03992              p
9                3.32886   -0.68112   1.69261  0.20447              a
Smirnov Test
The Smirnov test (aka “two-sample Kolmogorov-Smirnov test”) checks whether two datasets
have significantly different distributions. The test has the advantage of making no
assumption about the distribution of the data (it is non-parametric and distribution-free). Note that
this generality comes at some cost: other tests (e.g. the Student's t-test) may be more sensitive
if the data meet the requirements of the test.
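Purely as an illustration of the two-sample idea, the following sketch runs a Smirnov (two-sample Kolmogorov-Smirnov) test on two made-up samples with SciPy. It is not the SQL the analysis generates, and the p-value method may differ from the product's tabulation.

    # Illustrative only: two-sample (Smirnov) Kolmogorov-Smirnov test on made-up data.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    sample_a = rng.normal(loc=50_000, scale=12_000, size=80)   # e.g. incomes of one group
    sample_b = rng.normal(loc=60_000, scale=15_000, size=95)   # e.g. incomes of the other group

    d_stat, p_value = stats.ks_2samp(sample_a, sample_b)
    print(f"M = {sample_a.size}, N = {sample_b.size}, D = {d_stat:.4f}, p = {p_value:.6f}")
    print("call =", "p" if p_value < 0.05 else "a")   # p: the distributions differ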
Initiate a Smirnov Test
Use the following procedure to initiate a new Smirnov Test in Teradata Warehouse Miner:
1
Click on the Add New Analysis icon in the toolbar:
Figure 178: Add New Analysis from toolbar
2
In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories
and then under Analyses double-click on Kolmogorov-Smirnov Tests:
Figure 179: Add New Analysis > Statistical Tests > Kolmogorov-Smirnov Tests
3
This will bring up the Kolmogorov-Smirnov Tests dialog in which you will enter
STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis
as described in the following sections.
Smirnov Test - INPUT - Data Selection
On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on data selection:
Figure 180: Smirnov Test > Input > Data Selection
On this screen select:
1
Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis, or it contains a single entry with the
name of the analysis under the label Volatile Table, representing the output of the analysis
that is ordinarily produced by a Select statement.
2
Select Columns From a Single Table
•
Available Databases (or Analyses) — These are the databases (or analyses) available to
be processed.
•
Available Tables — These are the tables and views that are available to be processed.
•
Available Columns — These are the columns within the table/view that are available
for processing.
3
Select Statistical Test Style
These are the Tests of the Kolmogorov-Smirnov Type available (Kolmogorov-Smirnov,
Lilliefors, Shapiro-Wilk, D'Agostino-Pearson, Smirnov). Select “Smirnov”.
4
Select Optional Columns
•
Selected Columns — Select columns by highlighting and then either dragging and
dropping into the Selected Columns window, or click on the arrow button to move
highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can insert
columns as Column of Interest, Columns, Group By Columns. Make sure you have the
correct portion of the window highlighted.
•
Column of Interest — The column that specifies the numeric variable to be tested for
normality.
•
Columns — The column specifying the 2-category variable that identifies the
distribution to which the column of interest belongs.
•
Group By Columns — The columns which specify the variables whose distinct value
combinations will categorize the data, so a separate test is performed on each category.
Smirnov Test - INPUT - Analysis Parameters
On the Kolmogorov-Smirnov Tests dialog click on INPUT and then click on analysis
parameters:
Figure 181: Smirnov Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
•
Threshold Probability — Enter the “alpha” probability below which to reject the null
hypothesis.
Smirnov Test - OUTPUT
On the Kolmogorov-Smirnov Tests dialog click on OUTPUT:
Figure 182: Smirnov Test > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata
table populated with the results of the analysis. Once enabled, the following three fields
must be specified:
•
Database Name — The database where the output table will be saved.
•
Output Name — The table name that the output will be saved under.
•
Output Type — The output type must be table when storing Statistical Test output in
the database.
•
Stored Procedure — The creation of a stored procedure containing the SQL generated
for this analysis can be requested by entering the desired name of the stored procedure
here. This will result in the creation of a stored procedure in the user's login database
in place of the execution of the SQL generated by the analysis.
•
Procedure Comment — When an optional Procedure Comment is entered it is applied
to a requested Stored Procedure with an SQL Comment statement. It can be up to 255
characters in length and contain substitution parameters for the output category (Score,
ADS, Stats or Other), project name and/or analysis name (using the tags <Category>,
<Project> and <Analysis>, respectively). (Note that the default value of this field may
be set on the Defaults tab of the Preferences dialog, available from the Tools Menu).
•
Create output table using fallback keyword — Fallback keyword will be used to create
the table
•
Create output table using multiset keyword — Multiset keyword will be used to create
the table
•
Advertise Output — The Advertise Output option may be requested when creating a
table, view or procedure. This feature “advertises” output by inserting information into
one or more of the Advertise Output metadata tables according to the type of analysis
and the options selected in the analysis.
•
Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that
may be used to categorize or describe the output.
•
Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not
execute it. The SQL will be available to be viewed.
Run the Smirnov Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Smirnov Test Analysis
The results of running the Smirnov Test analysis include a table with a row for each separate
Smirnov test on all distinct-value group-by variables, as well as the SQL to perform the
statistical analysis. All of these results are outlined below.
Smirnov Test - RESULTS - SQL
On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on SQL:
Figure 183: Smirnov Test > Results > SQL
This series of SQL statements comprises the Smirnov Test Analysis. It is always returned, and
is the only item returned when the Generate SQL without Executing option is used.
Smirnov Test - RESULTS - data
On the Kolmogorov-Smirnov Tests dialog click on RESULTS and then click on data:
Figure 184: Smirnov Test > Results > data
The output table is generated by the Analysis for each separate Smirnov test on all distinct-value group-by variables.
Output Columns - Smirnov Test Analysis
The following table is built in the requested Output Database by the Smirnov test analysis.
Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise M will be
the UPI.
Table 104: Smirnov Test Analysis: Output Columns

Name               Type     Definition
M                  Integer  Number of first distribution observations
N                  Integer  Number of second distribution observations
D                  Float    D Statistic
SmirnovPValue      Float    The probability associated with the D statistic
SmirnovPText       Char     Text description if P is outside table range
SmirnovCallP_0.01  Char     The Smirnov result: a=accept, p=reject
Tutorial - Smirnov Test Analysis
In this example, a Smirnov test analysis is performed on the fictitious banking data to analyze
account usage. Parameterize a Smirnov Test analysis as follows:
• Available Tables — twm_customer
• Column of Interest — income
• Columns — gender
• Group By Columns — years_with_bank
• Analysis Parameters
•
Threshold Probability — 0.05
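As an optional cross-check outside the product, an analogous grouped comparison can be sketched in Python with pandas and SciPy. The file name twm_customer.csv is a hypothetical export of the tutorial table, and the p-value computation differs from the product's tabulated values, so the numbers will not match the table below exactly.

    # Illustrative only: grouped two-sample Smirnov test, income split by gender
    # within each years_with_bank group. "twm_customer.csv" is a hypothetical export.
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("twm_customer.csv")

    for group, subset in df.groupby("years_with_bank"):
        samples = [g["income"].to_numpy() for _, g in subset.groupby("gender")]
        if len(samples) != 2:
            continue                      # the test needs exactly two categories
        d_stat, p_value = stats.ks_2samp(samples[0], samples[1])
        call = "p" if p_value < 0.05 else "a"
        print(group, len(samples[0]), len(samples[1]),
              round(d_stat, 4), round(p_value, 6), call)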
Run the analysis and click on Results when it completes. For this example, the Smirnov Test
analysis generated the following table. The Smirnov Test was computed for each distinct
value of the group by variable “years_with_bank”. Results were sorted by years_with_bank.
The tests show that the distributions of incomes of males and females were different for all values of
years_with_bank. A ‘p’ means the two distributions differ significantly and an ‘a’ means accept the null
hypothesis that they are the same. The SQL is available for viewing but not listed below.
Table 105: Smirnov Test

years_with_bank  M   N   D            SmirnovPValue  SmirnovPText  SmirnovCallP_0.01
0                37  51  1.422949567  0.000101                     p
1                39  48  1.371667516  0.000103                     p
2                37  57  1.465841724  0.000101                     p
3                40  46  1.409836326  0.000105                     p
4                39  39  1.397308541  0.000146                     p
5                36  46  1.309704108  0.000105                     p
6                37  46  1.287964978  0.000104                     p
7                29  36  1.336945293  0.000112                     p
8                19  26  1.448297864  0.00011                      p
9                16  23  1.403341724  0.000101                     p
Tests Based on Contingency Tables
Tests Based on Contingency Tables are based on an array or matrix of numbers which
represent counts or frequencies. The tests evaluate the matrix to detect whether there is a
nonrandom pattern of frequencies.
Chi Square Test
The most common application for chi-square is in comparing observed counts of particular
cases to the expected counts. For example, a random sample of people would contain m males
and f females, but usually we would not find exactly m=½N and f=½N. We could use the chi-squared test to determine if the difference were significant enough to rule out the 50/50
hypothesis.
The Chi Square Test determines whether the probabilities observed from data in an RxC
contingency table are the same or different. The null hypothesis is that the probabilities observed
are the same. The output is a p-value which, when compared to the user’s threshold, determines
whether the null hypothesis should be rejected.
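As a minimal illustration of the test itself (not the SQL generated by the analysis), the Pearson chi-square statistic and p-value for a small contingency table of hypothetical counts can be computed with SciPy as follows.

    # Illustrative only: Pearson chi-square test on a small R x C contingency table.
    from scipy import stats

    # Hypothetical observed counts (rows = one binary variable, columns = the other).
    observed = [[180, 120],
                [150, 160]]

    chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)
    print(f"ChiSq = {chi2:.4f}, DF = {dof}, p = {p_value:.5f}")
    print("call =", "p" if p_value < 0.05 else "a")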
Other Calculated Measures of Association
• Phi coefficient — The Phi coefficient is a measure of the degree of association between
two binary variables, and represents the correlation between two dichotomous variables. It
is based on adjusting chi-square significance to factor out sample size, and is the same as
the Pearson correlation for two dichotomous variables.
• Cramer’s V — Cramer's V is used to examine the association between two categorical
variables when there is more than a 2 X 2 contingency (e.g., 2 X 3). In these more
complex designs, phi is not appropriate, but Cramer's statistic is. Cramer's V represents
the association or correlation between two variables. Cramer's V is the most popular of the
chi-square-based measures of nominal association, designed so that the attainable upper
limit is always 1.
• Likelihood Ratio Chi Square — Likelihood ratio chi-square is an alternative to test the
hypothesis of no association of columns and rows in nominal-level tabular data. It is based
on maximum likelihood estimation, and involves the ratio between the observed and the
expected frequencies, whereas the ordinary chi-square test involves the difference
between the two. This is a more recent version of chi-square and is directly related to log-linear analysis and logistic regression.
• Continuity-Adjusted Chi-Square — The continuity-adjusted chi-square statistic for 2 × 2
tables is similar to the Pearson chi-square, except that it is adjusted for the continuity of
the chi-square distribution. The continuity-adjusted chi-square is most useful for small
sample sizes. The use of the continuity adjustment is controversial; this chi-square test is
more conservative, and more like Fisher's exact test, when your sample size is small. As
the sample size increases, the statistic becomes more and more like the Pearson chi-square.
• Contingency Coefficient — The contingency coefficient is an adjustment to phi
coefficient, intended for tables larger than 2-by-2. It is always less than 1 and approaches
1.0 only for large tables. The larger the contingency coefficient, the stronger the
association. It is recommended only for 5-by-5 tables or larger; for smaller tables it
underestimates the level of association.
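The sketch below shows, for a hypothetical 2 x 2 table, how these measures relate to the basic chi-square statistic using the standard textbook formulas (Phi = sqrt(ChiSq/N), Cramer's V = sqrt(ChiSq/(N*(min(r,c)-1))), Contingency Coefficient = sqrt(ChiSq/(ChiSq+N))). It is illustrative only and does not reproduce the product's generated SQL.

    # Illustrative only: association measures derived from the chi-square of a 2 x 2 table.
    import math
    from scipy import stats

    observed = [[180, 120],            # hypothetical counts
                [150, 160]]
    n = sum(sum(row) for row in observed)
    r, c = len(observed), len(observed[0])

    chi2, p, dof, _ = stats.chi2_contingency(observed, correction=False)
    llh_chi2 = stats.chi2_contingency(observed, correction=False,
                                      lambda_="log-likelihood")[0]   # Likelihood Ratio Chi Square
    adj_chi2 = stats.chi2_contingency(observed, correction=True)[0]  # Continuity-Adjusted Chi-Square

    phi = math.sqrt(chi2 / n)                              # Phi coefficient
    cramers_v = math.sqrt(chi2 / (n * (min(r, c) - 1)))    # Cramer's V
    contin_coeff = math.sqrt(chi2 / (chi2 + n))            # Contingency coefficient

    print(f"ChiSq={chi2:.4f}  PhiCoeff={phi:.4f}  CramersV={cramers_v:.4f}")
    print(f"LlhChiSq={llh_chi2:.4f}  ContAdjChiSq={adj_chi2:.4f}  ContinCoeff={contin_coeff:.4f}")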
Initiate a Chi Square Test
Use the following procedure to initiate a new Chi Square Test in Teradata Warehouse Miner:
1
Click on the Add New Analysis icon in the toolbar:
Figure 185: Add New Analysis from toolbar
2
In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories
and then under Analyses double-click on Tests Based on Contingency Tables:
Figure 186: Add New Analysis > Statistical Tests > Tests Based on Contingency Tables
3
This will bring up the Tests Based on Contingency Tables dialog in which you will enter
STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis
as described in the following sections.
Chi Square Test - INPUT - Data Selection
On the Tests Based on Contingency Tables dialog click on INPUT and then click on data
selection:
Figure 187: Chi Square Test > Input > Data Selection
On this screen select:
1
Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis, or it contains a single entry with the
name of the analysis under the label Volatile Table, representing the output of the analysis
that is ordinarily produced by a Select statement.
2
Select Columns From a Single Table
•
Available Databases (or Analyses) — These are the databases (or analyses) available to
be processed.
•
Available Tables — These are the tables and views that are available to be processed.
•
Available Columns — These are the columns within the table/view that are available
for processing.
3
Select Statistical Test Style
These are the Tests Based on Contingency Tables available (Chi Square, Median). Select
“Chi Square”.
4
Select Optional Columns
•
Selected Columns — Select columns by highlighting and then either dragging and
dropping into the Selected Columns window, or click on the arrow button to move
highlighted columns into the Selected Columns window.
•
First Columns/Second Columns — Note that the Selected Columns window is actually
a split window; you can insert columns as First Columns, Second Columns. Make sure
you have the correct portion of the window highlighted.
•
First Columns — The set of columns that specifies the first of a pair of variables for
Chi Square analysis.
•
Second Columns — The set of columns that specifies the second of a pair of variables
for Chi Square analysis.
Each combination of the first and second variables will generate a separate Chi Square
test. (Limitation: to avoid excessively long execution, the number of combinations is
limited to 100, and unless the product of the number of distinct values of each pair is
2000 or less, the calculation will be skipped.)
Note: Group-By Columns are not available in the Chi Square Test.
Chi Square Test - INPUT - Analysis Parameters
On the Tests Based on Contingency Tables dialog click on INPUT and then click on analysis
parameters:
Figure 188: Chi Square Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
•
Threshold Probability — Enter the “alpha” probability below which to reject the null
hypothesis.
Chi Square Test - OUTPUT
On the Tests Based on Contingency Tables dialog click on OUTPUT:
Figure 189: Chi Square Test > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata
table populated with the results of the analysis. Once enabled, the following three fields
must be specified:
•
Database Name — The database where the output table will be saved.
•
Output Name — The table name that the output will be saved under.
•
Output Type — The output type must be table when storing Statistical Test output in
the database.
•
Stored Procedure — The creation of a stored procedure containing the SQL generated
for this analysis can be requested by entering the desired name of the stored procedure
here. This will result in the creation of a stored procedure in the user's login database
in place of the execution of the SQL generated by the analysis.
•
Procedure Comment — When an optional Procedure Comment is entered it is applied
to a requested Stored Procedure with an SQL Comment statement. It can be up to 255
characters in length and contain substitution parameters for the output category (Score,
ADS, Stats or Other), project name and/or analysis name (using the tags <Category>,
<Project> and <Analysis>, respectively). (Note that the default value of this field may
be set on the Defaults tab of the Preferences dialog, available from the Tools Menu).
•
Create output table using fallback keyword — Fallback keyword will be used to create
the table
•
Create output table using multiset keyword — Multiset keyword will be used to create
the table
•
Advertise Output — The Advertise Output option may be requested when creating a
table, view or procedure. This feature “advertises” output by inserting information into
one or more of the Advertise Output metadata tables according to the type of analysis
and the options selected in the analysis.
•
Advertise Note — An Advertise Note may be specified if desired when the Advertise
Output option is selected or when the Always Advertise option is selected on the
Connection Properties dialog. It is a free-form text field of up to 30 characters that
may be used to categorize or describe the output.
•
Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not
execute it. The SQL will be available to be viewed.
Run the Chi Square Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Chi Square Analysis
The results of running the Chi Square Test analysis include a table with a row for each
separate Chi Square test on all pairs of selected variables, as well as the SQL to perform the
statistical analysis. All of these results are outlined below.
Chi Square Test - RESULTS - SQL
On the Tests Based on Contingency Tables dialog click on RESULTS and then click on SQL:
Figure 190: Chi Square Test > Results > SQL
This series of SQL statements comprises the Chi Square Test Analysis. It is always returned,
and is the only item returned when the Generate SQL without Executing option is used.
Chi Square Test - RESULTS - data
On the Tests Based on Contingency Tables dialog click on RESULTS and then click on data:
Figure 191: Chi Square Test > Results > data
The output table is generated by the Analysis for each separate Chi Square test on all pairs of
selected variables.
Output Columns - Chi Square Test Analysis
The following table is built in the requested Output Database by the Chi Square test analysis.
Column1 will be the Unique Primary Index (UPI).
Table 106: Chi Square Test Analysis: Output Columns

Name           Type     Definition
column1        Char     First of pair of variables
column2        Char     Second of pair of variables
Chisq          Float    Chi Square Value
DF             INTEGER  Degrees of Freedom
Z              Float    Z Score
CramersV       Float    Cramer's V
PhiCoeff       Float    Phi coefficient
LlhChiSq       Float    Likelihood Ratio Chi Square
ContAdjChiSq   Float    Continuity-Adjusted Chi-Square
ContinCoeff    Float    Contingency Coefficient
ChiPValue      Float    The probability associated with the Chi Square statistic
ChiPText       Char     Text description if P is outside table range
ChiCallP_0.05  Char     The Chi Square result: a=accept, p=reject (positive), n=reject (negative)
Tutorial - Chi Square Test Analysis
In this example, a Chi Square test analysis is performed on the fictitious banking data to
analyze account usage. Parameterize a Chi Square Test analysis as follows:
• Available Tables — twm_customer_analysis
• First Columns — female, single
• Second Columns — svacct, ccacct, ckacct
• Analysis Parameters
•
Threshold Probability — 0.05
Run the analysis and click on Results when it completes. For this example, the Chi Square
Test analysis generated the following table. The Chi Square Test was computed on all
combinations of pairs of the two sets of variables. Results were sorted by column1 and
column2. The tests show that probabilities observed are the same for three pairs of variables
and different for three other pairs. A ‘p’ means significantly different and an ‘a’ means not
significantly different. The SQL is available for viewing but not listed below.
Table 107: Chi Square Test (Part 1)

column1  column2  Chisq      DF  Z            CramersV     PhiCoeff     LlhChiSq
female   ccacct   3.2131312  1   1.480358596  0.065584911  0.065584911  3.21543611
female   ckacct   8.2389731  1   2.634555949  0.105021023  0.105021023  8.23745744
female   svacct   3.9961257  1   1.716382791  0.073140727  0.073140727  3.98861957
single   ccacct   6.9958187  1   2.407215881  0.096774063  0.096774063  7.01100739
single   ckacct   0.6545145  1   0.191899245  0.02960052   0.02960052   0.65371179
single   svacct   1.5387084  1   0.799100586  0.045385576  0.045385576  1.53297321
Table 108: Chi Square Test (Part 2)

column1  column2  ContAdjChiSq  ContinCoeff  ChiPValue    ChiPText  ChiCallP_0.05
female   ccacct   2.954339388   0.065444311  0.077657185            a
female   ckacct   7.817638955   0.10444661   0.004512106            p
female   svacct   3.697357526   0.072945873  0.046729867            p
single   ccacct   6.600561728   0.096324066  0.00854992             p
single   ckacct   0.536617115   0.029587561  0.25         >0.25     a
single   svacct   1.35045989    0.045338905  0.226624385            a
Median Test
The Median test is a special case of the chi-square test with fixed marginal totals. It tests
whether several samples came from populations with the same median. The null hypothesis is
that all samples have the same median.
The median test is applied to data in similar cases as the ANOVA for independent
samples, but when:
1
the data are markedly non-normally distributed,
2
the measurement scale of the dependent variable is ordinal (not interval or ratio), or
3
the data sample is too small.
Note: The Median test is a less powerful non-parametric test than alternative rank tests because
the dependent variable is dichotomized at the median. Because this technique tends to
discard most of the information inherent in the data, it is less often used. Frequencies are
evaluated by a simple 2 x 2 contingency table, so it becomes simply a 2 x 2 chi square test of
independence with 1 DF.
Given k independent samples of numeric values, a Median test is produced for each set of
unique values of the group-by variables (GBV's), if any, testing whether all the populations
have the same median. Output for each set of unique values of the GBV's is a p-value, which
when compared to the user’s threshold, determines whether the null hypothesis should be
rejected for the unique set of values of the GBV's. For more than 2 samples, this is sometimes
called the Brown-Mood test.
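For illustration only, SciPy provides a median test of this form; the snippet below applies it to three made-up samples and is unrelated to the SQL the analysis generates.

    # Illustrative only: Brown-Mood median test for k independent samples.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    group_a = rng.normal(45_000, 9_000, size=60)     # hypothetical income samples,
    group_b = rng.normal(52_000, 11_000, size=70)    # e.g. one per marital_status value
    group_c = rng.normal(47_000, 10_000, size=55)

    chi2, p_value, grand_median, table = stats.median_test(group_a, group_b, group_c)
    print(f"ChiSq = {chi2:.4f}, p = {p_value:.5f}, grand median = {grand_median:.1f}")
    print("call =", "p" if p_value < 0.05 else "a")   # p: at least one median differs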
Initiate a Median Test
Use the following procedure to initiate a new Median Test in Teradata Warehouse Miner:
1
Click on the Add New Analysis icon in the toolbar:
Figure 192: Add New Analysis from toolbar
2
In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories
and then under Analyses double-click on Tests Based on Contingency Tables:
Figure 193: Add New Analysis > Statistical Tests > Tests Based On Contingency Tables
3
This will bring up the Tests Based on Contingency Tables dialog in which you will enter
STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis
as described in the following sections.
Median Test - INPUT - Data Selection
On the Tests Based on Contingency Tables dialog click on INPUT and then click on data
selection:
Figure 194: Median Test > Input > Data Selection
On this screen select:
1
Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis, or it contains a single entry with the
name of the analysis under the label Volatile Table, representing the output of the analysis
that is ordinarily produced by a Select statement.
2
Select Columns From a Single Table
•
Available Databases (or Analyses) — These are the databases (or analyses) available to
be processed.
•
Available Tables — These are the tables and views that are available to be processed.
•
Available Columns — These are the columns within the table/view that are available
for processing.
3
Select Statistical Test Style
These are the Tests Based on Contingency Tables available (Chi Square, Median). Select
“Median”.
4
Select Optional Columns
•
Selected Columns — Select columns by highlighting and then either dragging and
dropping into the Selected Columns window, or click on the arrow button to move
highlighted columns into the Selected Columns window.
Note: The Selected Columns window is actually a split window; you can insert
columns as Column of Interest, Columns and Group By Columns. Make sure you have
the correct portion of the window highlighted.
•
Column of Interest — The numeric dependent variable for Median analysis.
•
Columns — The set of categorical independent variables for Median analysis.
•
Group By Columns — The column(s) that specifies the variable(s) whose distinct
value combinations will categorize the data, so a separate test is performed on each
category.
Median Test - INPUT - Analysis Parameters
On the Tests Based on Contingency Tables dialog click on INPUT and then click on analysis
parameters:
Figure 195: Median Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
•
Threshold Probability — Enter the “alpha” probability below which to reject the null
hypothesis.
Median Test - OUTPUT
On the Tests Based on Contingency Tables dialog click on OUTPUT:
Figure 196: Median Test > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata
table populated with the results of the analysis. Once enabled, the following three fields
must be specified:
   •  Database Name — The database where the output table will be saved.
   •  Output Name — The table name that the output will be saved under.
   •  Output Type — The output type must be table when storing Statistical Test output in the database.
   •  Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis.
   •  Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu.)
   •  Create output table using fallback keyword — The fallback keyword will be used to create the table.
   •  Create output table using multiset keyword — The multiset keyword will be used to create the table.
   •  Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis.
   •  Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
   •  Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed.
Run the Median Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Median Analysis
The results of running the Median Test analysis include a table with a row for each separate
Median test on all distinct-value group-by variables, as well as the SQL to perform the
statistical analysis. All of these results are outlined below.
Median Test - RESULTS - SQL
On the Tests Based on Contingency Tables dialog click on RESULTS and then click on SQL:
Figure 197: Median Test > Results > SQL
This is the series of SQL statements that comprise the Median Test Analysis. It is always returned, and is
the only item returned when the Generate SQL without Executing option is used.
Median Test - RESULTS - data
On the Tests Based on Contingency Tables dialog click on RESULTS and then click on data:
Figure 198: Median Test > Results > data
The output table is generated by the Analysis for each group-by variable combination.
Output Columns - Median Test Analysis
The following table is built in the requested Output Database by the Median Test analysis.
Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise ChiSq
will be the UPI.
Table 109: Median Test Analysis: Output Columns

  Name              Type     Definition
  ChiSq             Float    Chi Square Value
  DF                Integer  Degrees of Freedom
  MedianPValue      Float    The probability associated with the Chi Square statistic
  MedianPText       Char     Text description if P is outside table range
  MedianCallP_0.01  Char     The Chi Square result: a=accept, p=reject (positive), n=reject (negative)
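To relate these columns to the underlying computation: in the usual formulation of the median test, each category's observations are split at the grand median, giving a 2 x k contingency table of above-median and not-above-median counts; the Chi Square value is then the usual sum of (observed - expected)^2 / expected over those cells, with k - 1 degrees of freedom. The following is a minimal illustrative sketch of that median split in ANSI-style SQL, written against the tutorial columns income, marital_status and years_with_bank in twm_customer and assuming the PERCENTILE_CONT aggregate is available. It is not the SQL that Teradata Warehouse Miner generates.

-- Illustrative sketch only: build the 2 x k median-split counts per group-by value.
WITH med AS (
  SELECT years_with_bank,
         PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY income) AS grand_median
  FROM twm_customer
  GROUP BY years_with_bank
)
SELECT c.years_with_bank,
       c.marital_status,
       SUM(CASE WHEN c.income >  m.grand_median THEN 1 ELSE 0 END) AS n_above_median,
       SUM(CASE WHEN c.income <= m.grand_median THEN 1 ELSE 0 END) AS n_not_above
FROM twm_customer c
JOIN med m
  ON m.years_with_bank = c.years_with_bank
GROUP BY c.years_with_bank, c.marital_status
ORDER BY c.years_with_bank, c.marital_status;

The chi-square statistic, its degrees of freedom and the resulting p-value and call column would then be formed from these counts.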
Tutorial - Median Test Analysis
In this example, a Median test analysis is performed on the fictitious banking data to analyze
account usage. Parameterize a Median Test analysis as follows:
• Available Tables — twm_customer
• Column of Interest — income
• Columns — marital_status
• Group By Columns — years_with_bank
• Analysis Parameters
   •  Threshold Probability — 0.01
Run the analysis and click on Results when it completes. For this example, the Median Test
analysis generated the following table. The Median Test was computed on income over
marital_status by years_with_bank.
Results were sorted by years_with_bank. The tests show that values came from populations
with the same median where MedianCallP_0.01 = ‘a’ (accept null hypothesis) and from
populations with different medians where it is ‘p’ (reject null hypothesis).
The SQL is available for viewing but not listed below.
Table 110: Median Test

  years_with_bank  ChiSq        DF  MedianPValue  MedianPText  MedianCallP_0.01
  0                12.13288563  3   0.007361344                p
  1                12.96799683  3   0.004848392                p
  2                13.12480388  3   0.004665414                p
  3                8.504645761  3   0.038753824                a
  4                4.458333333  3   0.225502846                a
  5                15.81395349  3   0.001527445                p
  6                4.531466733  3   0.220383974                a
  7                11.35971787  3   0.009950322                p
  8                2.855999742  3   0.25          >0.25        a
  9                2.23340311   3   0.25          >0.25        a
Rank Tests
Tests Based on Ranks use the ranks of the data rather than the data itself to calculate statistics.
Therefore the data must have at least an ordinal scale of measurement. If data are non-numeric but ordinal and ranked, these rank tests may be the most powerful tests available.
Even numeric variables which meet the requirements of parametric tests, such as
independent, randomly distributed normal variables, can be efficiently analyzed by these
tests. These rank tests are valid for variables which are continuous, discrete, or a mixture of
both.
Types of Rank tests supported by Teradata Warehouse Miner include:
• Mann-Whitney/Kruskal-Wallis
• Mann-Whitney/Kruskal-Wallis (Independent Tests)
• Wilcoxon Signed Rank
• Friedman
Mann-Whitney/Kruskal-Wallis Test
The selection of which test to execute is automatically based on the number of distinct values
of the independent variable. The Mann-Whitney is used for two groups, the Kruskal-Wallis
for three or more groups.
A special version of the Mann-Whitney/Kruskal-Wallis test performs a separate, independent
test for each independent variable, and displays the result of each test with its accompanying
column name. Under the primary version of the Mann-Whitney/Kruskal-Wallis test, all
independent variable value combinations are used, often forcing the Kruskal-Wallis test, since
the number of value combinations exceeds two. When a variable which has more than two
distinct values is included in the set of independent variables, then the Kruskal-Wallis test is
performed for all variables. Since Kruskal-Wallis is a generalization of Mann-Whitney, the
Kruskal-Wallis results are valid for all the variables, including two-valued ones. In the
discussion below, both types of Mann-Whitney/Kruskal-Wallis are referred to as Mann-Whitney/Kruskal-Wallis tests, since the only difference is the way the independent variable is
treated.
The Mann-Whitney test, also known as the Wilcoxon Two Sample Test, is the nonparametric analog of the
2-sample t test. It is used to compare two independent groups of sampled data, and tests
whether they are from the same population or from different populations (i.e., whether the
samples have the same distribution function). Unlike the parametric t-test, this nonparametric test makes no assumptions about the distribution of the data (e.g., normality). It is
to be used as an alternative to the independent group t-test, when the assumption of normality
or equality of variance is not met. Like many non-parametric tests, it uses the ranks of the
data rather than the data itself to calculate the U statistic. But since the Mann-Whitney test
makes no distribution assumption, it is less powerful than the t-test. On the other hand, the
Mann-Whitney is more powerful than the t-test when parametric assumptions are not met.
Another advantage is that it will provide the same results under any monotonic
transformation of the data so the results of the test are more generalizable.
The Mann-Whitney is used when the independent variable is nominal or ordinal and the
dependent variable is ordinal (or treated as ordinal). The main assumption is that the variable
on which the 2 groups are to be compared is continuously distributed. This variable may be
non-numeric, and if so, is converted to a rank based on alphanumeric precedence.
The null hypothesis is that both samples have the same distribution. The alternative
hypotheses are that the distributions differ from each other in either direction (two-tailed test),
or in a specific direction (upper-tailed or lower-tailed tests). Output is a p-value, which when
compared to the user’s threshold, determines whether the null hypothesis should be rejected.
Given one or more columns (independent variables) whose values define two independent
groups of sampled data, and a column (dependent variable) whose distribution is of interest
from the same input table, the Mann-Whitney test is performed for each set of unique values
of the group-by variables (GBV's), if any.
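As a rough illustration of the rank-sum mechanics behind the U and Z statistics (not the SQL that Teradata Warehouse Miner generates), the ANSI-style sketch below uses the tutorial columns income, gender and years_with_bank in twm_customer, assumes gender is coded 'F'/'M', gives tied values average ranks, and applies the simple normal approximation with no tie or continuity correction.

-- Illustrative sketch only: rank-sum U and normal-approximation Z per group-by value.
WITH ranked AS (
  SELECT years_with_bank, gender, income,
         RANK()    OVER (PARTITION BY years_with_bank ORDER BY income)
       + (COUNT(*) OVER (PARTITION BY years_with_bank, income) - 1) / 2.0 AS avg_rank
  FROM twm_customer
),
agg AS (
  SELECT years_with_bank,
         SUM(CASE WHEN gender = 'F' THEN avg_rank ELSE 0 END) AS r1,
         SUM(CASE WHEN gender = 'F' THEN 1        ELSE 0 END) AS n1,
         SUM(CASE WHEN gender <> 'F' THEN 1       ELSE 0 END) AS n2
  FROM ranked
  GROUP BY years_with_bank
)
SELECT years_with_bank,
       r1 - n1 * (n1 + 1) / 2.0                         AS u1,
       (r1 - n1 * (n1 + 1) / 2.0 - n1 * n2 / 2.0)
         / SQRT(n1 * n2 * (n1 + n2 + 1) / 12.0)         AS z
FROM agg
ORDER BY years_with_bank;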
The Kruskal-Wallis test is the nonparametric analog of the one-way analysis of variance or F-test used to compare three or more independent groups of sampled data. When there are only
two groups, it reduces to the Mann-Whitney test (above). The Kruskal-Wallis test tests
whether multiple samples of data are from the same population or from different populations
(i.e., whether the samples have the same distribution function). Unlike the parametric
independent group ANOVA (one way ANOVA), this non-parametric test makes no
assumptions about the distribution of the data (e.g., normality). Since this test does not make
a distributional assumption, it is not as powerful as ANOVA.
Given k independent samples of numeric values, a Kruskal-Wallis test is produced for each
set of unique values of the GBV's, testing whether all the populations are identical. This test
variable may be non-numeric, and if so, is converted to a rank based on alphanumeric
precedence. The null hypothesis is that all samples have the same distribution. The alternative
hypotheses are that the distributions differ from each other. Output for each unique set of
values of the GBV's is a statistic H, and a p-value, which when compared to the user’s
threshold, determines whether the null hypothesis should be rejected for the unique set of
values of the GBV's.
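The H statistic itself can be sketched directly from average ranks. The illustrative ANSI-style SQL below (again, not the SQL the product generates, and with no tie correction) assumes the tutorial columns income, marital_status and years_with_bank in twm_customer and computes H = 12 / (N(N+1)) * SUM(Rj^2 / nj) - 3(N+1), with k - 1 degrees of freedom, for each group-by value.

-- Illustrative sketch only: Kruskal-Wallis H and its degrees of freedom per group-by value.
WITH ranked AS (
  SELECT years_with_bank, marital_status, income,
         RANK()    OVER (PARTITION BY years_with_bank ORDER BY income)
       + (COUNT(*) OVER (PARTITION BY years_with_bank, income) - 1) / 2.0 AS avg_rank
  FROM twm_customer
),
grp AS (
  SELECT years_with_bank, marital_status,
         SUM(avg_rank) AS r_sum,
         COUNT(*)      AS n_j
  FROM ranked
  GROUP BY years_with_bank, marital_status
),
tot AS (
  SELECT years_with_bank, SUM(n_j) AS n_total, COUNT(*) AS k
  FROM grp
  GROUP BY years_with_bank
)
SELECT g.years_with_bank,
       12.0 / (t.n_total * (t.n_total + 1)) * SUM(g.r_sum * g.r_sum / g.n_j)
         - 3.0 * (t.n_total + 1)                        AS h_statistic,
       t.k - 1                                          AS df
FROM grp g
JOIN tot t ON t.years_with_bank = g.years_with_bank
GROUP BY g.years_with_bank, t.n_total, t.k
ORDER BY g.years_with_bank;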
Initiate a Mann-Whitney/Kruskal-Wallis Test
Use the following procedure to initiate a new Mann-Whitney/Kruskal-Wallis Test in Teradata
Warehouse Miner:
1  Click on the Add New Analysis icon in the toolbar:
   Figure 199: Add New Analysis from toolbar
2  In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Rank Tests:
   Figure 200: Add New Analysis > Statistical Tests > Rank Tests
3  This will bring up the Rank Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
Mann-Whitney/Kruskal-Wallis Test - INPUT - Data Selection
On the Rank Tests dialog click on INPUT and then click on data selection:
Figure 201: Mann-Whitney/Kruskal-Wallis Test > Input > Data Selection
On this screen select:
1  Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis, or it contains a single entry with the
name of the analysis under the label Volatile Table, representing the output of the analysis
that is ordinarily produced by a Select statement.
2  Select Columns From a Single Table
   •  Available Databases (or Analyses) — These are the databases (or analyses) available to be processed.
   •  Available Tables — These are the tables and views that are available to be processed.
   •  Available Columns — These are the columns within the table/view that are available for processing.
3  Select Statistical Test Style
   These are the Tests Based on Ranks available (Mann-Whitney/Kruskal-Wallis, Mann-Whitney/Kruskal-Wallis Independent Tests, Wilcoxon, Friedman). Select “Mann-Whitney/Kruskal-Wallis” or “Mann-Whitney/Kruskal-Wallis Independent Tests”.
4  Select Optional Columns
   •  Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.
      Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Columns or Group By Columns. Make sure you have the correct portion of the window highlighted.
   •  Column of Interest — The column that specifies the dependent variable to be tested. Note that this variable may be non-numeric, but if so, will be converted to a rank based on alphanumeric precedence.
   •  Columns — The columns that specify the independent variables, categorizing the data.
   •  Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category.
Mann-Whitney/Kruskal-Wallis Test - INPUT - Analysis Parameters
On the Rank Tests dialog click on INPUT and then click on analysis parameters:
Figure 202: Mann-Whitney/Kruskal-Wallis Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
   •  Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis.
   •  Single Tail — Select the box if a single-tailed test is desired (the default is two-tailed). The single-tail option is only valid if the test is Mann-Whitney.
Mann-Whitney/Kruskal-Wallis Test - OUTPUT
On the Rank Tests dialog click on OUTPUT:
Figure 203: Mann-Whitney/Kruskal-Wallis Test > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata
table populated with the results of the analysis. Once enabled, the following three fields
must be specified:
   •  Database Name — The database where the output table will be saved.
   •  Output Name — The table name that the output will be saved under.
   •  Output Type — The output type must be table when storing Statistical Test output in the database.
   •  Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure
here. This will result in the creation of a stored procedure in the user's login database
in place of the execution of the SQL generated by the analysis.
   •  Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu.)
   •  Create output table using fallback keyword — The fallback keyword will be used to create the table.
   •  Create output table using multiset keyword — The multiset keyword will be used to create the table.
   •  Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis.
   •  Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
   •  Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed.
Run the Mann-Whitney/Kruskal-Wallis Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Mann-Whitney/Kruskal-Wallis Test Analysis
The results of running the Mann-Whitney/Kruskal-Wallis Test analysis include a table with a
row for each separate Mann-Whitney/Kruskal-Wallis test on all distinct-value group-by
variables, as well as the SQL to perform the statistical analysis. In the case of Mann-Whitney/
Kruskal-Wallis Independent Tests, the results will be displayed with a separate row for each
independent variable column-name.
All of these results are outlined below.
Mann-Whitney/Kruskal-Wallis Test - RESULTS - SQL
On the Rank Tests dialog click on RESULTS and then click on SQL:
Figure 204: Mann-Whitney/Kruskal-Wallis Test > Results > SQL
This is the series of SQL statements that comprise the Mann-Whitney/Kruskal-Wallis Test Analysis. It is
always returned, and is the only item returned when the Generate SQL without Executing
option is used.
Mann-Whitney/Kruskal-Wallis Test - RESULTS - data
On the Rank Tests dialog click on RESULTS and then click on data:
Figure 205: Mann-Whitney/Kruskal-Wallis Test > Results > data
The output table is generated by the Analysis for each separate Mann-Whitney/Kruskal-Wallis test on all distinct-value group-by variables.
Output Columns - Mann-Whitney/Kruskal-Wallis Test Analysis
The following table is built in the requested Output Database by the Mann-Whitney/Kruskal-Wallis test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI).
Otherwise Z will be the UPI. In the case of Mann-Whitney/Kruskal-Wallis Independent Tests,
the additional column _twm_independent_variable will contain the column-name of the
independent variable for each separate test.
Table for Mann-Whitney (if two groups)
Table 111: Table for Mann-Whitney (if two groups)

  Name                   Type   Definition
  Z                      Float  Mann-Whitney Z Value
  MannWhitneyPValue      Float  The probability associated with the Mann-Whitney/Kruskal-Wallis statistic
  MannWhitneyCallP_0.01  Char   The Mann-Whitney/Kruskal-Wallis result: a=accept, p=reject
Table 112: Table for Kruskal-Wallis (if more than two groups)

  Name                     Type     Definition
  Z                        Float    Kruskal-Wallis Z Value
  ChiSq                    Float    Kruskal-Wallis Chi Square Statistic
  DF                       Integer  Degrees of Freedom
  KruskalWallisPValue      Float    The probability associated with the Kruskal-Wallis statistic
  KruskalWallisPText       Char     The text description of probability if out of table range
  KruskalWallisCallP_0.01  Char     The Kruskal-Wallis result: a=accept, p=reject
Tutorial 1 - Mann-Whitney Test Analysis
In this example, a Mann-Whitney test analysis is performed on the fictitious banking data to
analyze account usage. Parameterize a Mann-Whitney Test analysis as follows:
• Available Tables — twm_customer
• Column of Interest — income
• Columns — gender (2 distinct values -> Mann-Whitney test)
• Group By Columns — years_with_bank
• Analysis Parameters
   •  Threshold Probability — 0.01
   •  Single Tail — false (default)
Run the analysis and click on Results when it completes. For this example, the Mann-Whitney Test analysis generated the following table. The Mann-Whitney Test was computed
for each distinct value of the group by variable “years_with_bank”. Results were sorted by
years_with_bank. The tests show that customer incomes by gender were from the same
population for all values of years_with_bank (an ‘a’ means accept the null hypothesis). The
SQL is available for viewing but not listed below.
Table 113: Mann-Whitney Test

  years_with_bank  Z        MannWhitneyPValue  MannWhitneyCallP_0.01
  0                -0.0127  0.9896             a
  1                -0.2960  0.7672             a
  2                -0.4128  0.6796             a
  3                -0.6970  0.4858             a
  4                -1.8088  0.0705             a
  5                -2.2541  0.0242             a
  6                -0.8683  0.3854             a
  7                -1.7074  0.0878             a
  8                -0.8617  0.3887             a
  9                -0.4997  0.6171             a
Tutorial 2 - Kruskal-Wallis Test Analysis
In this example, a Kruskal-Wallis test analysis is performed on the fictitious banking data to
analyze account usage. Parameterize a Kruskal-Wallis Test analysis as follows:
• Available Tables — twm_customer
• Column of Interest — income
• Columns — marital_status (4 distinct values -> Kruskal-Wallis test)
• Group By Columns — years_with_bank
• Analysis Parameters
   •  Threshold Probability — 0.01
   •  Single Tail — false (default)
Run the analysis and click on Results when it completes. For this example, the Kruskal-Wallis
Test analysis generated the following table. The test was computed for each distinct value of
the group by variable “years_with_bank”. Results were sorted by years_with_bank. The tests
show that customer incomes by marital_status were from the same population for years_with_bank 4, 6, 8 and 9. Those with years_with_bank 0-3, 5 and 7 were from different populations
for each marital status. An ‘n’ or ‘p’ means significant and an ‘a’ means accept the null
hypothesis. The SQL is available for viewing but not listed below.
Table 114: Kruskal-Wallis Test

  years_with_bank  Z        ChiSq    DF  KruskalWallisPValue  KruskalWallisPText  KruskalWallisCallP_0.01
  0                3.5507   20.3276  3   0.0002                                   p
  1                4.0049   24.5773  3   0.0001               <0.0001             p
  2                3.3103   18.2916  3   0.0004                                   p
  3                3.0994   16.6210  3   0.0009                                   p
  4                1.5879   7.5146   3   0.0596                                   a
  5                4.3667   28.3576  3   0.0001               <0.0001             p
  6                2.1239   10.2056  3   0.0186                                   a
  7                3.2482   17.7883  3   0.0005                                   p
  8                0.1146   2.6303   3   0.25                 >0.25               a
  9                -0.1692  2.0436   3   0.25                 >0.25               a
Tutorial 3 - Mann-Whitney Independent Tests Analysis
In this example, a Mann-Whitney Independent Tests analysis is performed on the fictitious
banking data to analyze account usage. Parameterize a Mann-Whitney Independent Tests
analysis as follows:
• Available Tables — twm_customer_analysis
• Column of Interest — income
• Columns — gender, ccacct, ckacct, svacct
• Group By Columns
• Analysis Parameters
   •  Threshold Probability — 0.05
   •  Single Tail — false (default)
Run the analysis and click on Results when it completes. For this example, the Mann-Whitney Independent Tests analysis generated the following table. The Mann-Whitney Test
was computed separately for each independent variable. The tests show that customer
incomes by gender and by svacct were from different populations, and that customer incomes
by ckacct and by ccacct were from identical populations. The SQL is available for viewing
but not listed below.
Table 115: Mann-Whitney Test

  _twm_independent_variable  Z            MannWhitneyPValue  MannWhitneyCallP_0.05
  gender                     -3.00331351  0.002673462        n
  svacct                     -3.37298401  0.000743646        n
  ckacct                     -1.92490664  0.05422922         a
  ccacct                     1.764991014  0.077563672        a
Wilcoxon Signed Ranks Test
The Wilcoxon Signed Ranks Test is a nonparametric alternative to the t-test for correlated
samples. The correlated-samples t-test makes assumptions about the data, and can be properly
applied only if certain assumptions are met:
1  The scale of measurement has the properties of an equal-interval scale.
2  Differences between paired values are randomly selected from the source population.
3  The source population has a normal distribution.
If any of these assumptions are invalid, the t-test for correlated samples should not be used.
Of cases where these assumptions are unmet, the most common are those where the scale of
measurement fails to have equal-interval scale properties, e.g. a case in which the measures
are from a rating scale. When data within two correlated samples fail to meet one or another
of the assumptions of the t-test, an appropriate non-parametric alternative is the Wilcoxon
Signed-Rank Test, a test based on ranks. Assumptions for this test are:
1  The distribution of difference scores is symmetric (this implies an equal-interval scale).
2  Difference scores are mutually independent.
3  Difference scores have the same mean.
The original measures are replaced with ranks resulting in analysis only of the ordinal
relationships. The signed ranks are organized and summed, giving a number, W. When the
numbers of positive and negative signs are about equal (i.e., there is no tendency in either
direction), the value of W will be near zero, and the null hypothesis will be supported.
A markedly positive or negative sum indicates a tendency in one direction, and therefore a difference between the paired samples in the specified direction.
Given a table name and names of paired numeric columns, a Wilcoxon test is produced. The
Wilcoxon tests whether a sample comes from a population with a specific mean or median.
The null hypothesis is that the samples come from populations with the same mean or
median. The alternative hypothesis is that the samples come from populations with different
means or medians (two-tailed test), or that in addition the difference is in a specific direction
(upper-tailed or lower-tailed tests). Output is a p-value, which when compared to the user’s
threshold, determines whether the null hypothesis should be rejected.
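As an illustration of the signed-rank mechanics (not the SQL generated by Teradata Warehouse Miner), the ANSI-style sketch below uses the tutorial columns avg_ck_bal, avg_sv_bal and years_with_bank in twm_customer_analysis. It discards zero differences (the default behavior described below), gives tied absolute differences average ranks, and uses the simple normal approximation Z = (W+ - n(n+1)/4) / sqrt(n(n+1)(2n+1)/24) with no corrections.

-- Illustrative sketch only: W+ (sum of positive signed ranks) and its normal-approximation Z.
WITH diffs AS (
  SELECT years_with_bank,
         avg_ck_bal - avg_sv_bal      AS d,
         ABS(avg_ck_bal - avg_sv_bal) AS abs_d
  FROM twm_customer_analysis
  WHERE avg_ck_bal - avg_sv_bal <> 0
),
ranked AS (
  SELECT years_with_bank, d,
         RANK()    OVER (PARTITION BY years_with_bank ORDER BY abs_d)
       + (COUNT(*) OVER (PARTITION BY years_with_bank, abs_d) - 1) / 2.0 AS avg_rank
  FROM diffs
),
agg AS (
  SELECT years_with_bank,
         COUNT(*)                                      AS n,
         SUM(CASE WHEN d > 0 THEN avg_rank ELSE 0 END) AS w_plus
  FROM ranked
  GROUP BY years_with_bank
)
SELECT years_with_bank, n, w_plus,
       (w_plus - n * (n + 1) / 4.0)
         / SQRT(n * (n + 1) * (2 * n + 1) / 24.0)      AS z
FROM agg
ORDER BY years_with_bank;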
Initiate a Wilcoxon Signed Ranks Test
Use the following procedure to initiate a new Wilcoxon Signed Ranks Test in Teradata
Warehouse Miner:
1  Click on the Add New Analysis icon in the toolbar:
   Figure 206: Add New Analysis from toolbar
2  In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Rank Tests:
Figure 207: Add New Analysis > Statistical Tests > Rank Tests
3  This will bring up the Rank Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
Wilcoxon Signed Ranks Test - INPUT - Data Selection
On the Rank Tests dialog click on INPUT and then click on data selection:
Figure 208: Wilcoxon Signed Ranks Test > Input > Data Selection
On this screen select:
1  Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis, or it contains a single entry with the
name of the analysis under the label Volatile Table, representing the output of the analysis
that is ordinarily produced by a Select statement.
2  Select Columns From a Single Table
   •  Available Databases (or Analyses) — These are the databases (or analyses) available to be processed.
   •  Available Tables — These are the tables and views that are available to be processed.
   •  Available Columns — These are the columns within the table/view that are available for processing.
3  Select Statistical Test Style
   These are the Tests Based on Ranks available (Mann-Whitney/Kruskal-Wallis, Wilcoxon, Friedman). Select “Wilcoxon”.
4  Select Optional Columns
   •  Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.
      Note: The Selected Columns window is actually a split window; you can insert columns as First Column, Second Column or Group By Columns. Make sure you have the correct portion of the window highlighted.
   •  First Column — The column that specifies the variable from the first sample.
   •  Second Column — The column that specifies the variable from the second sample.
   •  Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category.
Wilcoxon Signed Ranks Test - INPUT - Analysis Parameters
On the Rank Tests dialog click on INPUT and then click on analysis parameters:
Figure 209: Wilcoxon Signed Ranks Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
   •  Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis.
   •  Single Tail — Select the box if a single-tailed test is desired (the default is two-tailed). The single-tail option is only valid if the test is Mann-Whitney.
   •  Include Zero — The “include zero” option generates a variant of the Wilcoxon in which zero differences are included with the positive count. The default “discard zero” option is the true Wilcoxon.
Wilcoxon Signed Ranks Test - OUTPUT
On the Rank Tests dialog click on OUTPUT:
Figure 210: Wilcoxon Signed Ranks Test > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata
table populated with the results of the analysis. Once enabled, the following three fields
must be specified:
   •  Database Name — The database where the output table will be saved.
   •  Output Name — The table name that the output will be saved under.
   •  Output Type — The output type must be table when storing Statistical Test output in the database.
   •  Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis.
   •  Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu.)
   •  Create output table using fallback keyword — The fallback keyword will be used to create the table.
   •  Create output table using multiset keyword — The multiset keyword will be used to create the table.
   •  Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis.
   •  Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
   •  Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed.
Run the Wilcoxon Signed Ranks Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Wilcoxon Analysis
The results of running the Wilcoxon Signed Ranks Test analysis include a table with a row for
each separate Wilcoxon Signed Ranks Test on all distinct-value group-by variables, as well as
the SQL to perform the statistical analysis. All of these results are outlined below.
Wilcoxon Signed Ranks Test - RESULTS - SQL
On the Rank Tests dialog click on RESULTS and then click on SQL:
Figure 211: Wilcoxon Signed Ranks Test > Results > SQL
This is the series of SQL statements that comprise the Wilcoxon Signed Ranks Test Analysis. It is
always returned, and is the only item returned when the Generate SQL without Executing
option is used.
Wilcoxon Signed Ranks Test - RESULTS - data
On the Rank Tests dialog click on RESULTS and then click on data:
Figure 212: Wilcoxon Signed Ranks Test > Results > data
The output table is generated by the Analysis for each separate Wilcoxon Signed Ranks Test
on all distinct-value group-by variables.
Output Columns - Wilcoxon Signed Ranks Test Analysis
The following table is built in the requested Output Database by the Wilcoxon Signed Ranks
Test analysis. Group By Variables, if present, will be the Unique Primary Index (UPI).
Otherwise Z_ will be the UPI.
Table 116: Wilcoxon Signed Ranks Test Analysis: Output Columns

  Name                Type     Definition
  N                   Integer  Variable count
  Z_                  Float    Wilcoxon Z Value
  WilcoxonPValue      Float    The probability associated with the Wilcoxon statistic
  WilcoxonCallP_0.05  Char     The Wilcoxon result: a=accept, p or n=reject
Tutorial - Wilcoxon Test Analysis
In this example, a Wilcoxon test analysis is performed on the fictitious banking data to
analyze account usage. Parameterize a Wilcoxon Test analysis as follows:
• Available Tables — twm_customer_analysis
• First Column — avg_ck_bal
• Second Column — avg_sv_bal
• Group By Columns — years_with_bank
• Analysis Parameters
   •  Threshold Probability — 0.05
   •  Single Tail — false (default)
   •  Include Zero — false (default)
Run the analysis and click on Results when it completes. For this example, the Wilcoxon Test
analysis generated the following table. The Wilcoxon Test was computed for each distinct
value of the group by variable “years_with_bank”. The tests show the samples of avg_ck_bal and avg_sv_bal came from populations with the same mean or median for customers with years_with_bank of 0 and 4-9, and from populations with different means or medians for those with years_with_bank of 1-3. An ‘n’ or ‘p’ means significant and an ‘a’ means accept the null hypothesis.
The SQL is available for viewing but not listed below.
Table 117: Wilcoxon Test

  years_with_bank  N   Z_        WilcoxonPValue  WilcoxonCallP_0.05
  0                75  -1.77163  0.07639         a
  1                77  -3.52884  0.00042         n
  2                83  -2.94428  0.00324         n
  3                69  -2.03882  0.04145         n
  4                69  -0.56202  0.57412         a
  5                67  -1.95832  0.05023         a
  6                65  -1.25471  0.20948         a
  7                48  -0.44103  0.65921         a
  8                39  -1.73042  0.08363         a
  9                33  -1.45623  0.14539         a
Friedman Test with Kendall's Coefficient of Concordance & Spearman's Rho
The Friedman test is an extension of the sign test to several related samples. It is
analogous to the 2-way Analysis of Variance, but depends only on the ranks of the
observations, so it is like a 2-way ANOVA on ranks.
The Friedman test should not be used for only three treatments due to lack of power, and is
best for six or more treatments. It is a test for treatment differences in a randomized, complete
block design. Data consists of b mutually independent k-variate random variables called
blocks. The Friedman assumptions are that the data in these blocks are mutually independent,
and that within each block, observations are ordinally rankable according to some criterion of
interest.
A Friedman Test is produced using rank scores and the F table, though alternative
implementations call it the Friedman Statistic and use the chi-square table. Note that when all
of the treatments are not applied to each block, it is an incomplete block design. The
requirements of the Friedman test are not met under these conditions, and other tests such as
the Durbin test should be applied.
In addition to the Friedman statistics, Kendall’s Coefficient of Concordance (W) is produced,
as well as Spearman’s Rho. Kendall's coefficient of concordance can range from 0 to 1. The
higher its value, the stronger the association. W is 1.0 if all treatments receive the same
ranking in all blocks, and 0 if there is “perfect disagreement” among blocks.
Spearman's rho is a measure of the linear relationship between two variables. It differs from
Pearson's correlation only in that the computations are done after the numbers are converted
to ranks. Spearman’s Rho equals 1 if there is perfect agreement among rankings;
disagreement causes rho to be less than 1, sometimes becoming negative.
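For the classic complete block design with exactly one observation per (block, treatment) cell, the Friedman chi-square and Kendall's W can be sketched directly from within-block ranks. The ANSI-style SQL below is an illustration only, written against a hypothetical table obs(block_id, treatment, y); it is not the SQL generated by Teradata Warehouse Miner, which uses the F-statistic formulation described above and supports replicated cells. Kendall's W is recovered here from the chi-square as W = chi-square / (b(k - 1)), where b is the number of blocks and k the number of treatments.

-- Illustrative sketch only: Friedman chi-square = 12/(b*k*(k+1)) * SUM(Rj*Rj) - 3*b*(k+1).
WITH ranked AS (
  SELECT block_id, treatment,
         RANK()    OVER (PARTITION BY block_id ORDER BY y)
       + (COUNT(*) OVER (PARTITION BY block_id, y) - 1) / 2.0 AS avg_rank
  FROM obs
),
trt AS (
  SELECT treatment, SUM(avg_rank) AS r_sum
  FROM ranked
  GROUP BY treatment
),
dims AS (
  SELECT COUNT(DISTINCT block_id) AS b, COUNT(DISTINCT treatment) AS k
  FROM obs
)
SELECT 12.0 / (d.b * d.k * (d.k + 1)) * SUM(t.r_sum * t.r_sum)
         - 3.0 * d.b * (d.k + 1)                                 AS friedman_chisq,
       (12.0 / (d.b * d.k * (d.k + 1)) * SUM(t.r_sum * t.r_sum)
         - 3.0 * d.b * (d.k + 1)) / (d.b * (d.k - 1))            AS kendalls_w
FROM trt t
CROSS JOIN dims d
GROUP BY d.b, d.k;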
Initiate a Friedman Test
Use the following procedure to initiate a new Friedman Test in Teradata Warehouse Miner:
1  Click on the Add New Analysis icon in the toolbar:
   Figure 213: Add New Analysis from toolbar
2  In the resulting Add New Analysis dialog box, click on Statistical Tests under Categories and then under Analyses double-click on Rank Tests:
   Figure 214: Add New Analysis > Statistical Tests > Rank Tests
3  This will bring up the Rank Tests dialog in which you will enter STATISTICAL TEST STYLE, INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
Friedman Test - INPUT - Data Selection
On the Rank Tests dialog click on INPUT and then click on data selection:
Figure 215: Friedman Test > Input > Data Selection
On this screen select:
1  Select Input Source
Users may select between different sources of input. By selecting the Input Source Table
the user can select from available databases, tables (or views) and columns in the usual
manner. By selecting the Input Source Analysis however the user can select directly from
the output of another analysis of qualifying type in the current project. Analyses that may
be selected from directly include all of the Analytic Data Set (ADS) and Reorganization
analyses (except Refresh). In place of Available Databases the user may select from
Available Analyses, while Available Tables then contains a list of all the output tables that
will eventually be produced by the selected Analysis, or it contains a single entry with the
name of the analysis under the label Volatile Table, representing the output of the analysis
that is ordinarily produced by a Select statement.
2  Select Columns From a Single Table
   •  Available Databases (or Analyses) — These are the databases (or analyses) available to be processed.
   •  Available Tables — These are the tables and views that are available to be processed.
   •  Available Columns — These are the columns within the table/view that are available for processing.
3  Select Statistical Test Style
   These are the Tests Based on Ranks available (Mann-Whitney/Kruskal-Wallis, Wilcoxon, Friedman). Select “Friedman”.
4  Select Optional Columns
   •  Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.
      Note: The Selected Columns window is actually a split window; you can insert columns as Column of Interest, Treatment Column, Block Column or Group By Columns. Make sure you have the correct portion of the window highlighted.
   •  Column of Interest — The column that specifies the dependent variable to be analyzed.
   •  Treatment Column — The column that specifies the independent categorical variable representing treatments within blocks.
   •  Block Column — The column that specifies the variable representing blocks, or independent experimental groups.
      Warning: Equal cell counts are required for all Treatment Column x Block Column pairs. Division by zero may occur in the case of unequal cell counts.
   •  Group By Columns — The columns which specify the variables whose distinct value combinations will categorize the data, so a separate test is performed on each category.
      Warning: Equal cell counts are required for all Treatment Column x Block Column pairs within each group. Division by zero may occur in the case of unequal cell counts.
Friedman Test - INPUT - Analysis Parameters
On the Rank Tests dialog click on INPUT and then click on analysis parameters:
Figure 216: Friedman Test > Input > Analysis Parameters
On this screen enter or select:
• Processing Options
   •  Threshold Probability — Enter the “alpha” probability below which to reject the null hypothesis.
Friedman Test - OUTPUT
On the Rank Tests dialog click on OUTPUT:
Figure 217: Friedman Test > Output
On this screen select the following options if desired:
• Store the tabular output of this analysis in the database — Option to generate a Teradata
table populated with the results of the analysis. Once enabled, the following three fields
must be specified:
   •  Database Name — The database where the output table will be saved.
   •  Output Name — The table name that the output will be saved under.
   •  Output Type — The output type must be table when storing Statistical Test output in the database.
   •  Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure
here. This will result in the creation of a stored procedure in the user's login database
in place of the execution of the SQL generated by the analysis.
   •  Procedure Comment — When an optional Procedure Comment is entered it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags <Category>, <Project> and <Analysis>, respectively). (Note that the default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools Menu.)
   •  Create output table using fallback keyword — The fallback keyword will be used to create the table.
   •  Create output table using multiset keyword — The multiset keyword will be used to create the table.
   •  Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis.
   •  Advertise Note — An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
   •  Generate SQL, but do not Execute it — Generate the Statistical Test SQL, but do not execute it. The SQL will be available to be viewed.
Run the Friedman Test Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the
analysis. To run the analysis you can either:
• Click the Run icon on the toolbar, or
• Select Run <project name> on the Project menu, or
• Press the F5 key on your keyboard
Results - Friedman Test Analysis
The results of running the Friedman Test analysis include a table with a row for each separate
Friedman Test on all distinct-value group-by variables, as well as the SQL to perform the
statistical analysis. All of these results are outlined below.
Friedman Test - RESULTS - SQL
On the Rank Tests dialog click on RESULTS and then click on SQL:
Figure 218: Friedman Test > Results > SQL
This is the series of SQL statements that comprise the Friedman Test Analysis. It is always returned, and is the only
item returned when the Generate SQL without Executing option is used.
Friedman Test - RESULTS - data
On the Rank Tests dialog click on RESULTS and then click on data:
Figure 219: Friedman Test > Results > data
The output table is generated by the Analysis for each separate Friedman Test on all distinct-value group-by variables.
Output Columns - Friedman Test Analysis
The following table is built in the requested Output Database by the Friedman Test analysis.
Group By Variables, if present, will be the Unique Primary Index (UPI). Otherwise Kendalls_W will be the UPI.
Table 118: Friedman Test Analysis: Output Columns

  Name                   Type     Definition
  Kendalls_W             Float    Kendall's W
  Average_Spearmans_Rho  Float    Average Spearman's Rho
  DF_1                   Integer  Degrees of Freedom for Treatments
  DF_2                   Integer  Degrees of Freedom for Blocks
  F                      Float    2-Way ANOVA F Statistic on ranks
  FriedmanPValue         Float    The probability associated with the Friedman statistic
  FriedmanPText          Char     The text description of probability if out of table range
  FriedmanCallP_0.05     Char     The Friedman result: a=accept, p or n=reject
Tutorial - Friedman Test Analysis
In this example, a Friedman test analysis is performed on the fictitious banking data to
analyze account usage. If the data does not have equal cell counts in the treatment x block cells, the smallest cell count can be identified and stratified sampling used to produce a temporary table with equal cell counts for analysis. The first step is to identify the smallest count with a
Free Form SQL analysis (or two Variable Creation analyses) with SQL such as the following
(be sure to set the database in the FROM clause to that containing the demonstration data
tables):
SELECT
  MIN("_twm_N") AS smallest_count
FROM
  (
    SELECT
      marital_status
      ,gender
      ,COUNT(*) AS "_twm_N"
    FROM "twm_source"."twm_customer_analysis"
    GROUP BY "marital_status", "gender"
  ) AS "T0";
The second step is to use a Sample analysis with stratified sampling to create the temporary
table with equal cell counts. The value 18 used in the stratified Sizes/Fractions parameter
below corresponds to the smallest_count returned from above.
Parameterize a Sample Analysis called Friedman Work Table Setup as follows:
Input Options:
• Available Tables — TWM_CUSTOMER_ANALYSIS
• Selected Columns and Aliases
   •  TWM_CUSTOMER_ANALYSIS.cust_id
   •  TWM_CUSTOMER_ANALYSIS.gender
   •  TWM_CUSTOMER_ANALYSIS.marital_status
   •  TWM_CUSTOMER_ANALYSIS.income
Analysis Parameters:
• Sample Style — Stratified
• Stratified Sample Options
• Create a separate sample for each fraction/size — Enabled
• Stratified Conditions
   •  gender='f' and marital_status='1'
   •  gender='f' and marital_status='2'
   •  gender='f' and marital_status='3'
   •  gender='f' and marital_status='4'
   •  gender='m' and marital_status='1'
   •  gender='m' and marital_status='2'
   •  gender='m' and marital_status='3'
   •  gender='m' and marital_status='4'
• Sizes/Fractions — 18 (use the same value for all conditions)
Output Options:
• Store the tabular output of this analysis in the database — Enabled
• Table Name — Twm_Friedman_Worktable
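As a rough SQL equivalent of this Sample analysis (an illustration only, not the SQL the Sample analysis generates), the same 18-rows-per-cell work table could be built with ROW_NUMBER and Teradata's QUALIFY clause. Note that ordering by cust_id makes the selection deterministic rather than random; a random ordering expression would be needed to approximate true sampling.

-- Illustrative alternative sketch: 18 rows per gender x marital_status cell.
CREATE TABLE Twm_Friedman_Worktable AS (
  SELECT cust_id, gender, marital_status, income
  FROM "twm_source"."twm_customer_analysis"
  QUALIFY ROW_NUMBER()
          OVER (PARTITION BY gender, marital_status ORDER BY cust_id) <= 18
) WITH DATA;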
Finally, Parameterize a Friedman Test analysis as follows:
Input Options:
• Select Input Source — Analysis
• Available Analyses — Friedman Work Table Setup
• Available Tables — Twm_Friedman_Worktable
• Select Statistical Test Style — Friedman
• Column of Interest — income
• Treatment Column — gender
• Block Column — marital_status
Analysis Parameters:
   •  Threshold Probability — 0.05
Run the analysis and click on Results when it completes. For this example, the Friedman Test
analysis generated the following table. (Note that results may vary due to the use of sampling
in creating the input table Twm_Friedman_Worktable). The test shows that analysis of income
by treatment (male vs. female) differences is significant at better than the 0.001 probability
level. An ‘n’ or ‘p’ means significant and an ‘a’ means accept the null hypothesis. The SQL is
available for viewing but not listed below.
Table 119: Friedman Test

  Kendalls_W   Average_Spearmans_Rho  DF_1  DF_2  F            FriedmanPValue  FriedmanPText  FriedmanCallP_0.001
  0.763196925  0.773946177            1     71    228.8271876  0.001           <0.001         p