Teradata ISV Partner Technical Guide
Integration with Teradata
ISV Partner Technical Guide
June 2015
Organization Name:
Location:
Teradata Partners Integration Lab
17095 Via Del Campo, San Diego, CA 92127
PREFACE
Revision History:
Version   Author(s)              Date        Comments
A01       Partner Engineering    9/20/2005   Initial Revision
A02       Partner Engineering    9/22/2005   Minor Updates
A03       Partner Engineering    3/19/2009   Update
A04       Partner Engineering    7/20/2009   Minor Update
A05       Partner Engineering    9/3/2010    Update
A06       Partner Engineering    2/26/2013   Update
A07       Partner Engineering    1/2/2014    Update
A08       Partner Engineering    6/8/2015    Update and addition of FAQ material
Audience:
The audience for this document is Teradata ISV Partners.
Contents
Contents 3
1. Teradata Partners Program .............................................................................................. 5
Section 1.1 Teradata Database -- Introduction .......................................................................... 5
Section 1.2 Teradata Support ..................................................................................................... 6
Section 1.3 Teradata Partner Intelligence Network and the Teradata Education Network ....... 6
2. Teradata Basics ............................................................................................................... 7
Section 2.1 Unified Data Architecture ....................................................................................... 7
Section 2.2 Data Types .............................................................................................................. 8
Section 2.3 Primary Index......................................................................................................... 14
Section 2.4 NoPI Objects .......................................................................................................... 15
Section 2.5 Secondary Indexes ................................................................................................ 16
Section 2.6 Intermediate/Temporary Tables ............................................................................. 17
Section 2.7 Locking .................................................................................................................. 17
Section 2.8 Statistics ................................................................................................................. 19
2.8.1 Random AMP Sampling .............................................................................................. 20
2.8.2 Full Statistics Collection .............................................................................................. 22
2.8.3 Collection with the USING SAMPLE option .............................................................. 23
2.8.4 Collect Statistics Summary Option .............................................................................. 24
2.8.5 Summary: Teradata Statistics Collection ..................................................................... 25
2.8.6 New opportunities for statistics collection in Teradata 14.0[1] .................................... 26
2.8.7 Recommended Reading ............................................................................................... 28
Section 2.9 Stored Procedures .................................................................................................. 29
Section 2.10 User Defined Functions (UDF) ............................................................................ 32
UDFs are invoked by qualifying the database name where they are stored, e.g.
DBName.UDFname(), or, if stored in the special database called SYSLIB, without database name
qualification, e.g. UDFname(). ......................................................................... 33
Section 2.11 Table Operators .................................................................................................... 35
Section 2.12. QueryGrid ........................................................................................................... 37
Section 2.13 DBQL ................................................................................................................... 39
Section 2.14 Administrative Tips ............................................................................................. 43
3. Workload Management ................................................................................................. 45
Section 3.1 Workload Administration ...................................................................................... 45
4. Migrating to Teradata ................................................................................................... 46
Section 4.1 Utilities and Client Access ..................................................................................... 46
Section 4.1.1 Teradata Load/Unload Protocols & Products ..................................................... 46
Section 4.1.2 Input Data Sources with Scripting Tools ............................................................ 48
Section 4.1.3 Teradata Parallel Transporter .............................................................................. 48
Section 4.1.4 Restrictions & Other Techniques ........................................................................ 58
Section 4.2 Load Strategies & Architectural Options ............................................................... 61
Section 4.2.1 ETL Architectural Options ................................................................................. 61
Section 4.2.2 ISV ETL Tool Advantages vs. Teradata Tools ................................................... 62
Section 4.2.3 Load Strategies.................................................................................................... 62
Section 4.2.4 ETL Tool Integration .......................................................................................... 63
Section 4.3 Concurrency of Load and Unload Jobs .................................................................. 64
Section 4.4 Load comparisons ................................................................................................. 64
5. References ..................................................................................................................... 64
Section 5.1 SQL Examples ....................................................................................................... 64
Derived Tables ...................................................................................................................... 65
Recursive SQL ...................................................................................................................... 65
Sub Queries ........................................................................................................................... 66
Case Statement ...................................................................................................................... 67
Sum (SQL-99 Window Function) ......................................................................................... 69
Rank (SQL-99 Window Function)........................................................................................ 70
Fast Path Insert ...................................................................................................................... 71
Fast Path Delete .................................................................................................................... 72
Section 5.2 Set vs. Procedural................................................................................................... 72
Section 5.3 Statistics collection “cheat sheet” .......................................................................... 75
Section 5.4 Reserved words ...................................................................................................... 78
Section 5.5 Orange Books and Suggested Reading .................................................................. 78
1. Teradata Partners Program
Section 1.1 Teradata Database -- Introduction
This document is intended for ISV partners new to the Teradata partner program. There is a
wealth of documentation available on the database and client/utility software; in fact, the
majority of information contained in this document came from those sources.
ISVs need to understand that although Teradata is an ANSI-standard relational database
management system, it differs from other databases, and to leverage Teradata's strengths ISVs
will need to understand these differences in order to integrate effectively. It is primarily
those key differences and strengths that are pointed out in this document; this document is
intended as a quick-start guide, not a replacement for education or for the extensive user
documentation provided for Teradata.
The test environment Teradata normally recommends for partners is the Teradata Partner
Engineering lab in San Diego. For testing in this environment, you can connect to Teradata
servers in the lab via high-speed internet connections.
For some partners, the Teradata Partner Engineering lab will not be sufficient as it does not cover
all requirements. For example, the lab is not an applicable environment for performance testing.
While Teradata can accommodate some requests on a case-by-case basis, it is not a function of
the lab.
Therefore, if you decide to execute your testing on your own premises, the following test
environments are supported:
Teradata Database Support Matrix:
Operating System               TD 13.00   TD 13.10   TD 14.00   TD 14.10   TD 15.00   TD 15.10
Linux SLES 9                   X
Linux SLES 10                  X          X          X          X (SP3)    X (SP3)    X (SP3)
Linux SLES 11                                        X          X          X (SP1)    X (SP1)
MP-RAS                         X
Windows Server 2003 (32-bit)   X
Windows Server 2003 (64-bit)   X
Teradata Database client applications run on many operating systems. See Teradata Tools and
Utilities Supported Platforms and Product Versions, at: http://www.info.teradata.com
The Teradata Database can be installed on a Teradata or non-Teradata server. The minimum hardware
requirements for non-Teradata systems are described in the Field Installation Guide for Teradata
Node Software for Windows, and in the Field Installation Guide for Teradata Node Software for
Linux SLES9, SLES10 and SLES11 manuals. The latest guides can be found on
http://www.info.teradata.com/.
Section 1.2 Teradata Support
Partners that have an active partner support contract with us can submit incidents via T@YS
(http://tays.teradata.com). In addition, T@YS gives access to download drivers and patches and
search the knowledge repositories.
All incidents are submitted via T@YS online to our Customer Services (CS) organization. The CS
organization assumes the partner has a solid working level of Teradata knowledge. It is not in the
scope of the CS organization to educate the partner, to hand-hold a partner through an installation
of Teradata software or the resolution of an issue, or to provide general Teradata consulting advice.
To sign up for T@YS, submit your name, title, address, phone number and e-mail to your global
alliance support partner.
Section 1.3 Teradata Partner Intelligence Network and the
Teradata Education Network
The Teradata Partner Intelligence Network is the single most valuable and complete source of
information for Teradata ISV Partners. Partners that have an active partner support contract with
Teradata can access this Network at http://partnerintelligence.teradata.com.
All partners should receive an orientation session for the Teradata Partner Intelligence Network
as part of their Teradata Partner benefit package. This network is the one source for all the tools
and resources partners need to develop and promote their integrated solutions for Teradata.
Partners that have an active partner support contract with us also have access to the Teradata
Education Network (TEN) (http://www.teradata.com/ten). In order to sign up, submit your name,
title, address, phone number and e-mail to your global alliance support partner.
Access to TEN is free. Depending on the type of education selected, there may be a cost as
follows:
• Web Based Training - offered at no cost
• Recorded Virtual Classes - offered at no cost
• Live Virtual Classes - offered at a nominal fee per course
• Instructor Led Classes - offered at a discount
Of particular interest is “Teradata Essentials for Partners.” The primary focus of this four-day
technical class is to provide a foundational understanding of Teradata’s design and
implementation to Alliance Partners. The class is given in a lecture format and provides a
technical and detailed description of the following topics:
• Data Warehousing
• Teradata concepts, features, and functions
• Teradata physical database design - make the correct index selections by understanding
  distribution, access, and join techniques
• EXPLAIN plans, space utilization of tables and databases, join indexes, and hash indexes,
  discussed in relation to physical database design
• Teradata Application Utilities - BTEQ and TPT (Load, Update, Export, and Stream operators);
  details on when and how to use them
• Key Teradata features and utilities (up to Teradata 14.0)
For more training and education information, including a Partner Curriculum Map with
descriptions of the courses as well as additional recommended training, visit the “Education”
webpage on the Teradata PartnerIntelligence website.
http://partnerintelligence.teradata.com
2. Teradata Basics
Section 2.1 Unified Data Architecture
In October 2012, Teradata introduced the Teradata® Unified Data Environment™ and the
Teradata® Unified Data Architecture™ (UDA), which integrates the Teradata analytics platform, the
Teradata Aster discovery platform, and Hadoop technology.
In the UDA, data is intended to flow where it is most efficiently processed. In the case of
unstructured data with multiple formats, it can be quickly loaded in Hadoop as a low-cost
platform for landing, staging, and refining raw data in batch. Teradata’s partnership with
Hortonworks enables enterprises to use Hadoop to capture extensive volumes of historical data
and perform massive processing to refine the data with no need for expensive and specialized
knowledge and resources.
The data can then be leveraged in the Teradata Aster analytics discovery platform for real-time,
deep statistical analysis and discovery, allowing users to identify patterns and trends in the data.
The patented Aster SQL-MapReduce® parallel programming framework combines the analytic
power of MapReduce with the familiarity of SQL.
After analysis in the Aster environment, the relevant data can be routed to the Teradata Database,
integrating the data discovered by the Teradata Aster discovery platform with all of the existing
operational data, resulting in intelligence that can be leveraged across the enterprise.
Some of the benefits that can be realized by employing the UDA:
 Capture and refine data from a wide variety of sources.
 Perform multi-structured data preprocessing.
 Develop rapid analytics.
 Process embedded analytics, analyzing both relational and non-relational data.
 Produce semi-structured data as output, often with metadata and heuristic analysis.
 Solve new analytical workloads with reduced time to insight.
 Use massively parallel storage in Hadoop to efficiently retain data.
For more information on UDA, contact your Partner Integration Lab consultant, or visit
the PartnerIntelligence website at http://partnerintelligence.teradata.com.
Section 2.2 Data Types
These data types are derived from the Teradata Database 15.00 SQL Reference Manual.
Every data value belongs to an SQL data type. For example, when you define a column in a
CREATE TABLE statement, you must specify the data type of the column. The set of data
values that a column defines can belong to one of the following groups of data types:
• Array/VARRAY
• Byte and BLOB
• Character and CLOB
• DateTime
• Geospatial
• Interval
• JSON
• Numeric, including Number
• Parameter
• Period
• UDT
• XML
Array/VARRAY Data Type
A data type used for storing and accessing multidimensional data. The ARRAY data type can
store many values of the same specific data type in a sequential or matrix-like format. ARRAY
data types can be single or multi-dimensional.
Teradata Database supports a one-dimensional (1-D) ARRAY data type and a multidimensional
(n-D) ARRAY data type with up to five dimensions.
Single - The 1-D ARRAY type is defined as a variable-length ordered list of values of the same
data type. It has a maximum number of values that you specify when you create the ARRAY
type. You can access each element value in a 1-D ARRAY type using a numeric index value.
Multi-dimensional – an n-D ARRAY is a mapping from integer coordinates to an element type.
The n-D ARRAY type is defined as a variable-length ordered list of values of the same data type.
It has 2-5 dimensions, with a maximum number of values for each dimension, which you
specify when you create the ARRAY type.
You can also create an ARRAY data type using the VARRAY keyword and syntax for Oracle
compatibility.
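As a brief illustration (the type, table, and column names here are hypothetical, not taken from this guide), a 1-D ARRAY type might be created and used as follows; see the SQL Data Types and Literals manual for the authoritative syntax:

CREATE TYPE phone_list AS CHAR(12) ARRAY[5];      -- 1-D ARRAY type holding up to 5 values

CREATE TABLE customer
  (cust_id INTEGER NOT NULL,
   phones  phone_list)
PRIMARY INDEX (cust_id);

SELECT phones[1] FROM customer;                   -- reference an element by its numeric index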
Byte and BLOB Data Types
The BYTE, VARBYTE and BLOB data types are stored in the client system format – they are
never translated by Teradata Database. They store raw data as logical bit streams. For any
machine, BYTE, VARBYTE, and BLOB data is transmitted directly from the memory of the
client system. The sort order is logical, and values are compared as if they were n-byte, unsigned
binary integers suitable for digitized images. The following are examples of Byte data types.
BYTE - Represents a fixed-length binary string.
VARBYTE - Represents a variable-length binary string.
BLOB - Represents a large binary string of raw bytes. A binary large object (BLOB)
column can store binary objects, such as graphics, video clips, files, and
documents.
Character and CLOB Data Types
In general, CHARACTER, VARCHAR, and CLOB data types represent character data.
Character data is automatically translated between the client and the database. Its form-of-use is
determined by the client character set or session character set. The form of character data internal
to Teradata Database is determined by the server character set attribute of the column. The
following are examples of Character data types.
CHAR[(n)] - Represents a fixed length character string for Teradata Database internal
character storage.
VARCHAR (n) - Represents a variable length character string of length 0 to n for
Teradata Database internal character storage.
LONG VARCHAR - Specifies the longest permissible variable length character string
for Teradata Database internal character storage.
CLOB - Represents a large character string. A character large object (CLOB) column
can store character data, such as simple text, HTML, or XML documents.
Teradata does not support directly converting a character to its underlying ASCII integer value. To
accomplish this task, the CHAR2HEXINT function returns a hexadecimal representation of a
character, and there is also an ASCII() UDF in the Oracle UDF library.
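For example, a quick sketch (the exact output depends on the column type and session character set):

SELECT CHAR2HEXINT('A');    -- returns the hexadecimal representation of 'A', e.g. '41' for a LATIN string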
DateTime Data Types
DateTime values represent dates, times, and timestamps. Use the following SQL data types to
specify DateTime values.
DATE - Represents a date value that includes year, month, and day components.
TIME [WITH TIME ZONE] - Represents a time value that includes hour, minute, second,
fractional second, and [optional] time zone components.
TIMESTAMP [WITH TIME ZONE] - Represents a timestamp value that includes year,
month, day, hour, minute, second, fractional second, and [optional] time zone components.
Date and Time DML syntax can be tricky in Teradata. Below are a few common date/time
queries that have proven to be useful. Note that some of the output shown is dependent upon the
“Date/Time Format” parameter selected in the ODBC Setup Options.
a) Current Date and Time:
•  SELECT Current_Date;                                    Retrieves the system date (mm/dd/yyyy)
•  SELECT Current_Time;                                    Retrieves the system time (hh:mm:ss)
•  SELECT Current_TimeStamp;                               Retrieves the system timestamp

b) Timestamps:
•  SELECT CAST (Current_Timestamp AS DATE);                Extracts the date (mm/dd/yyyy)
•  SELECT CAST (Current_Timestamp AS TIME(6));             Extracts the time (hh:mm:ss)
•  SELECT Day_Of_Week
   FROM Sys_Calendar.Calendar
   WHERE Calendar_Date = CAST(Current_Timestamp AS DATE);  Uses the TD System Calendar to compute the day of week (#)
•  SELECT EXTRACT (DAY FROM Current_Timestamp);            Extracts the day of the month
•  SELECT EXTRACT (MONTH FROM Current_Timestamp);          Extracts the month
•  SELECT EXTRACT (YEAR FROM Current_Timestamp);           Extracts the year
•  SELECT CAST(Current_Timestamp AS DATE) - CAST (AnyTimestampCol AS DATE);
                                                           Computes the number of days between timestamps
•  SELECT Current_Timestamp - AnyTimestampColumn DAY(4) TO SECOND(6);
                                                           Computes the length of time between timestamps

c) Month:
•  SELECT CURRENT_DATE - EXTRACT (DAY FROM CURRENT_DATE) + 1;
                                                           Computes the first day of the month
•  SELECT Add_Months ((CURRENT_DATE - EXTRACT (DAY FROM CURRENT_DATE) + 1),1)-1;
                                                           Computes the last day of the month
•  SELECT Add_Months (Current_Date, 3);                    Adds 3 months to the current date
Geospatial Data Types
Geospatial information identifies the geographic location of features and boundaries on the
planet. Geospatial data types provide a way for applications to store, manage, retrieve,
manipulate, analyze, and display geographic information to interface with Teradata Database.
Teradata Database geospatial data types define methods that perform geometric calculations and
test for spatial relationships between two geospatial values. You can use geospatial types to
represent geometries having up to three dimensions. The following are examples of Geospatial
data types.
ST_Geometry -- A Teradata proprietary internal UDT that can represent any of the
following geospatial types:
•ST_Point: 0-dimensional geometry that represents a single location in
two-dimensional coordinate space.
•ST_LineString: 1-dimensional geometry usually stored as a sequence of
points with a linear interpolation between points.
•ST_Polygon: 2-dimensional geometry consisting of one exterior
boundary and zero or more interior boundaries, where each interior
boundary defines a hole.
•ST_GeomCollection: Collection of zero or more ST_Geometry values.
•ST_MultiPoint: 0-dimensional geometry collection where the elements
are restricted to ST_Point values.
•ST_MultiLineString: 1-dimensional geometry collection where the
elements are restricted to ST_LineString values.
•ST_MultiPolygon: 2-dimensional geometry collection where the
elements are restricted to ST_Polygon values.
•GeoSequence: Extension of ST_LineString that can contain tracking
information, such as time stamps, in addition to geospatial information.
The ST_Geometry type supports methods for performing geometric calculations.
MBR -- Teradata Database also provides a UDT called MBR that provides a way to
obtain the minimum bounding rectangle (MBR) of a geometry for tessellation purposes.
ST_Geometry defines a method called ST_MBR that returns the MBR of a geometry.
The ST_Geometry and MBR UDTs are defined in the SYSUDTLIB database.
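As a simple sketch (the table, column, and literal values are hypothetical), an ST_Geometry column can be populated from well-known text and filtered with a geospatial method:

CREATE TABLE store_location
  (store_id INTEGER NOT NULL,
   loc      ST_Geometry)
PRIMARY INDEX (store_id);

INSERT INTO store_location VALUES (1, 'POINT(-117.10 32.90)');            -- well-known text for an ST_Point

SELECT store_id
FROM store_location
WHERE loc.ST_Distance(NEW ST_Geometry('POINT(-117.00 33.00)')) < 0.5;     -- distance-based filter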
Interval Data Types
An interval value is a span of time. There are two mutually exclusive interval type categories.
Year- Month represents a time span that can include a number of years and months.
•INTERVAL YEAR
•INTERVAL YEAR TO MONTH
•INTERVAL MONTH
Day-Time represents a time span that can include a number of days, hours, minutes, or seconds.
•INTERVAL DAY
•INTERVAL DAY TO HOUR
•INTERVAL DAY TO MINUTE
•INTERVAL DAY TO SECOND
•INTERVAL HOUR
•INTERVAL HOUR TO MINUTE
•INTERVAL HOUR TO SECOND
•INTERVAL MINUTE
•INTERVAL MINUTE TO SECOND
•INTERVAL SECOND
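A few hedged examples of interval arithmetic (the table and column names in the last query are hypothetical):

SELECT CURRENT_TIMESTAMP + INTERVAL '36' HOUR;                    -- a point in time 36 hours from now
SELECT CURRENT_DATE + INTERVAL '3' MONTH;                         -- three months from today
SELECT (end_ts - start_ts) DAY(4) TO SECOND(6) FROM job_log;      -- elapsed time as a Day-Time interval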
JSON Data Type
JSON (JavaScript Object Notation) is a text-based data interchange format, often used in web
applications to transmit data. JSON has been widely adopted by web application developers
because, compared to XML, it is easier for humans to read and write and easier for machines to
parse and generate. JSON documents can be stored and processed in Teradata Database, which
can store JSON records either as JSON documents or shredded into relational format.
Teradata Database provides the following support for JSON data.
• Methods, functions, and stored procedures that operate on the JSON data type, such as parsing
and validation.
• Shredding functionality that allows you to extract values from JSON documents up to 16MB in
size and store the extracted data in relational format.
• Publishing functionality that allows you to publish the results of SQL queries in JSON format.
• Schema-less or dynamic schema with the ability to add a new attribute without changing the
schema. Data with new attributes is immediately available for querying. Rows without the new
column can be filtered out.
• Use existing join indexing structures on extracted portions of the JSON data type.
• Apply advanced analytics to JSON data.
• Functionality to convert an ST_Geometry object into a GeoJSON value and a GeoJSON value
into an ST_Geometry object.
• Support for JSON data of varying maximum length; JSON data can be internally compressed.
• Collect statistics on extracted portions of the JSON data type.
• Use standard SQL to query JSON data.
• JSONPath provides simple traversal and regular expressions with wildcards to filter and
navigate complex JSON documents.
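A minimal sketch of storing and querying JSON data (the table, column, and attribute names are assumptions, not from this guide):

CREATE TABLE web_orders
  (order_id INTEGER NOT NULL,
   doc      JSON(16000))
PRIMARY INDEX (order_id);

INSERT INTO web_orders VALUES (1, '{"cust":"Acme","total":42.50}');

SELECT doc.JSONExtractValue('$.cust')                                   -- extract a single attribute
FROM web_orders
WHERE CAST(doc.JSONExtractValue('$.total') AS DECIMAL(8,2)) > 10;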
Numeric Data Types
A numeric value is either an exact numeric number (integer or decimal) or an approximate
numeric number (floating point). The following are examples of Numeric data types.
BIGINT -- Represents a signed, binary integer value from
-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
INTEGER -- Represents a signed, binary integer value from -2,147,483,648 to
2,147,483,647.
SMALLINT -- Represents a signed binary integer value in the range -32768 to 32767.
BYTEINT -- Represents a signed binary integer value in the range -128 to 127.
REAL, DOUBLE PRECISION, FLOAT - Represents a value in sign/magnitude
form ranging from 2.226 x 10^-308 to 1.797 x 10^308.
DECIMAL [(n[,m])] and NUMERIC [(n[,m])] - Represents a decimal number of n
digits, with m of those n digits to the right of the decimal point.
NUMBER – Represents a numeric value with optional precision and scale limitations.
If the default data type used to aggregate numeric values is causing an overflow, a cast to
BIGINT may resolve the problem. For the differences between NUMBER, DECIMAL, FLOAT,
and BIGINT, see the SQL Data Types and Literals manual.
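For example, if a SUM over an INTEGER column overflows, casting the column before aggregating avoids the error (the table and column names are illustrative):

SELECT SUM(CAST(sale_qty AS BIGINT)) AS total_qty
FROM daily_sales;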
Parameter Data Types
These are data types that can be used only with input or result parameters in a function, method,
stored procedure, or external stored procedure. They include the following data types.
TD_ANYTYPE - An input or result parameter data type that is used in UDFs,
UDMs, and external stored procedures, and that can accept any system-defined
data type or user-defined type (UDT). The parameter attributes and return type are
determined at execution time.
VARIANT_TYPE - An input parameter data type that can be used to package
and pass in a varying number of parameters of varying data types to a UDF as a single
UDF input parameter.
Period Data Types
A period data type represents a set of contiguous time granules that extends from a beginning
bound up to, but not including, an ending bound. Use Period data types to represent time periods.
The following are examples of Period data types.
PERIOD(DATE) - Represents an anchored duration of DATE elements that
include year, month, and day components.
PERIOD(TIME[(n)][WITH TIME ZONE]) - Represents an anchored duration
of TIME elements that include hour, minute, second, fractional second,
and [optional] time zone components.
PERIOD(TIMESTAMP[(n)][WITH TIME ZONE]) - Represents an anchored
duration of TIMESTAMP elements that include year, month, day, hour,
minute, second, fractional second, and [optional] time zone components.
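A brief sketch of a PERIOD(DATE) column (the object names and dates are illustrative):

CREATE TABLE project
  (proj_id  INTEGER NOT NULL,
   duration PERIOD(DATE))
PRIMARY INDEX (proj_id);

INSERT INTO project VALUES (1, PERIOD(DATE '2015-01-01', DATE '2015-07-01'));

SELECT proj_id, BEGIN(duration), END(duration)                    -- bound functions
FROM project
WHERE duration OVERLAPS PERIOD(DATE '2015-06-01', DATE '2015-06-30');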
UDT Data Types
UDT data types are custom data types that you define to model the structure and behavior of data
that your application deals with. Teradata Database supports distinct and structured UDTs. The
following are examples of UDT data types.
Distinct - A UDT that is based on a single predefined data type, such as
INTEGER or VARCHAR.
Structured - A UDT that is a collection of one or more fields called attributes,
each of which is defined as a predefined data type or other UDT (which
allows nesting).
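For instance, a distinct UDT might be created and used as a column type (the names are illustrative):

CREATE TYPE euro AS DECIMAL(10,2) FINAL;          -- distinct UDT based on a single predefined type

CREATE TABLE price_list
  (item_id INTEGER NOT NULL,
   price   euro)
PRIMARY INDEX (item_id);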
XML Data Type
The XML data type allows you to store XML content in a compact binary form that preserves
the information set of the XML document, including the hierarchy information and type
information derived from XML validation. Document identity is preserved, as opposed to
XML shredding, which only extracts values out of the XML document.
Related Topics
For detailed information on data types, see the SQL Data Types and Literals, SQL Geospatial
Types, and Teradata XML manuals. Also refer to the SQL Functions, Operators,
Expressions, and Predicates manual for a list of data type conversion functions, including
support of Oracle data type conversion functions.
Section 2.3 Primary Index
A primary index is required for all Teradata database tables. If you do not assign a primary
index explicitly when you create a table, then the system assigns one automatically according to
the following rules:
Stage 1 - WHEN a primary key column is defined, but a primary index is not, THEN the system
selects the primary key column set to be the primary index and defines it as a UPI.

Stage 2 - WHEN neither a primary key nor a primary index is defined, THEN the system selects
the first column having a UNIQUE constraint to be the primary index and defines it as a UPI.

Stage 3 - WHEN no primary key, primary index, or uniquely constrained column is defined,
THEN the system selects the first column defined for the table to be the primary index. If the
first column defined in the table has a LOB data type, the CREATE TABLE operation aborts and
the system returns an error message. WHEN the table has only one column and its kind is
defined as SET, the index is defined as a UPI; for anything else, it is defined as a NUPI.
Use the CREATE TABLE statement to create primary indexes. Data accessed using a primary
index is always a one-AMP operation because a row and its primary index are stored together in
the same structure. This is true whether the primary index is unique or non-unique, and whether
it is partitioned or non-partitioned.
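For example (illustrative table definitions, not taken from this guide), a unique primary index (UPI) and a non-unique primary index (NUPI) are declared directly in the CREATE TABLE statement:

CREATE TABLE orders
  (order_id   INTEGER NOT NULL,
   cust_id    INTEGER,
   order_date DATE)
UNIQUE PRIMARY INDEX (order_id);        -- UPI: values must be unique; a row is located with a one-AMP operation

CREATE TABLE order_lines
  (order_id INTEGER NOT NULL,
   line_no  INTEGER,
   amount   DECIMAL(10,2))
PRIMARY INDEX (order_id);               -- NUPI: rows with the same order_id hash to the same AMP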
Purpose of the Primary Index
The primary index has four purposes:

• To define the distribution of the rows to the AMPs.
  With the exception of NoPI tables and column-partitioned tables and join indexes,
  Teradata Database distributes table rows across the AMPs on the hash of their primary
  index value. The determination of which hash bucket, and hence which AMP the row is
  to be stored on, is made solely on the value of the primary index.
  The choice of columns for the primary index affects how even this distribution is. An
  even distribution of rows to the AMPs is usually of critical importance in picking a
  primary index column set.

• To provide access to rows more efficiently than with a full-table scan.
  If the values for all the primary index columns are specified in the constraint clause of the
  DML statement, single-AMP access can be made to the rows using that primary index
  value.
  With a partitioned primary index, faster access is also possible when all the values of the
  partitioning columns are specified or if there is a constraint on the partitioning columns.
  Other retrievals might use a secondary index, a hash or join index, a full-table scan, or a
  mix of several different index types.

• To provide for efficient joins.
  If there is an equality join constraint on the primary index of a table, it may be possible to
  do a direct join to the table (that is, rows of the table might not have to be redistributed,
  spooled, and sorted prior to the join).

• To provide a means for efficient aggregations.
  If the GROUP BY key is on the primary index of a table, it is often possible to perform a
  more efficient aggregation.
The following restrictions apply to primary indexes:

• No more than one primary index can be defined on a table.
• No more than 64 columns can be specified in a primary index definition.
• You cannot include columns having XML, BLOB, BLOB-based UDT, CLOB, CLOB-based UDT,
  XML-based UDT, Period, ARRAY, VARRAY, VARIANT_TYPE, Geospatial, or JSON data types
  in any primary index definition.
• You cannot define a primary index for a NoPI table until Teradata Database 15.10.
• You cannot define a primary index for column-partitioned tables and join indexes until
  Teradata Database 15.10.
• Primary index columns cannot be defined on row-level security constraint columns.
• You cannot specify multivalue compression for primary index columns.
Section 2.4 NoPI Objects
Starting with Teradata Database 13.0, a table can be defined without a primary index. This
feature is referred to as the NoPI Table feature. A NoPI object is a table or join index that does
not have a primary index and always has a table kind of MULTISET. Without a PI, the hash
value as well as AMP ownership of a row is arbitrary. A NoPI table is internally treated as a hash
table; it is just that typically all the rows on one AMP will have the same hash bucket value.
The basic types of NoPI objects are:
• Nonpartitioned NoPI tables
• Column-partitioned tables and join indexes (these may also have row partitioning)
The chief purpose of NoPI tables is as staging tables. FastLoad can efficiently load data into
empty nonpartitioned NoPI staging tables because NoPI tables do not have the overhead of row
distribution among the AMPs and sorting the rows on the AMPs by rowhash.
Nonpartitioned NoPI tables are also critical to support Extended MultiLoad Protocol
(MLOADX). A nonpartitioned NoPI staging table is used for each MLOADX job.
NoPI tables are not intended to be used as tables queried directly by end users. They
exist primarily as a landing place for FastLoad or MultiLoad staging data.
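A minimal sketch of a NoPI staging table (the names are illustrative):

CREATE MULTISET TABLE stg_sales
  (sale_id INTEGER,
   amount  DECIMAL(10,2))
NO PRIMARY INDEX;                        -- rows are appended without hashing on a primary index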
For more information, please refer to the Teradata Database Design manual.
Section 2.5 Secondary Indexes
A secondary index is never required for Teradata Database tables, but they can sometimes
improve system performance, particularly in decision support environments. When column
demographics suggest their usefulness, the Optimizer selects secondary indexes to provide faster
set selection.
While secondary indexes are exceedingly useful for optimizing repetitive and standardized
queries, the Teradata Database is also highly optimized to perform full-table scans in parallel.
Because of the strength of full-table scan optimization in the Teradata Database, there is little
reason to be heavy-handed about assigning multiple secondary indexes to a table.
Secondary indexes are less frequently included in query plans by the Optimizer than the primary
index for the table being accessed.
You can create secondary indexes when you create the table via the CREATE TABLE statement,
or you can add them later using the CREATE INDEX statement.
Data access using a secondary index varies depending on whether the index is unique or non-unique.
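For example (the index, table, and column names are illustrative), a USI and a NUSI can be added after the table is created:

CREATE UNIQUE INDEX (order_number) ON orders;      -- USI: typically a two-AMP access path for exact lookups
CREATE INDEX idx_cust (cust_id) ON orders;         -- NUSI: AMP-local subtable used for set selection
DROP INDEX (cust_id) ON orders;                    -- drop an index that no longer earns its maintenance cost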
Restrictions on Secondary Indexes
The following restrictions apply to secondary indexes:
• Teradata Database tables can have up to a total of 32 secondary, hash, and join indexes.
• No more than 64 columns can be included in a secondary index definition.
• You can include UDT columns in a secondary index definition.
• You cannot include columns having XML, BLOB, CLOB, BLOB-based UDT, CLOB-based UDT,
  XML-based UDT, Period, JSON, ARRAY, or VARRAY data types in any secondary index
  definition.
• You can define a simple NUSI on a geospatial column, but you cannot include a column
  having a geospatial data type in a composite NUSI definition or in a USI definition.
• You can include row-level security columns in a secondary index definition.
• You cannot include the system-derived PARTITION or PARTITION#Ln columns in any
  secondary index definition.
Space Considerations for Secondary Indexes
Creating a secondary index causes the system to build a subtable to contain its index rows, thus
adding another set of rows that requires updating each time a table row is inserted, deleted, or
updated.
Secondary index subtables are also duplicated whenever a table is defined with FALLBACK, so
the maintenance overhead is effectively doubled.
When compression at the data block level is enabled for their primary table, secondary index
subtables are not compressed.
Section 2.6 Intermediate/Temporary Tables
In addition to derived tables, there are two Teradata table structures that can be used as
intermediate or temporary table types: Volatile tables and Global Temporary tables. Volatile
Tables are created in memory and last only for the duration of the session. Neither the definition
nor the contents of a volatile table persist across a system restart. There is no entry in the Data
Dictionary and no transaction logging. Global Temporary Tables also have no transaction
logging. Global Temporary Tables are entered in the data dictionary, but are not materialized for
the session until the table is loaded with data. Space usage is charged to login user temporary
space. Each user session can materialize as many as 2,000 global temporary tables at a time.
One should specify a primary index (PI) for a temp table where access or join by the PI is
anticipated. Not specifying a PI will cause a default to a NOPI table or a PI on the first column
of the table regardless of the data demographics. If you do not know in advance what the best PI
candidate will be, then specify a NoPI table to ensure even distribution of the rows across all
AMPs.
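The following sketch (the names are illustrative) shows both kinds of temporary tables; ON COMMIT PRESERVE ROWS keeps the rows for the remainder of the session rather than deleting them at the end of the transaction:

CREATE VOLATILE TABLE vt_sales
  (cust_id INTEGER,
   amount  DECIMAL(10,2))
PRIMARY INDEX (cust_id)
ON COMMIT PRESERVE ROWS;                 -- definition and contents disappear when the session ends

CREATE GLOBAL TEMPORARY TABLE gt_sales
  (cust_id INTEGER,
   amount  DECIMAL(10,2))
PRIMARY INDEX (cust_id)
ON COMMIT PRESERVE ROWS;                 -- definition persists in the dictionary; contents are materialized per session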
Section 2.7 Locking
The Teradata Database can lock several different types of resources in several different ways.
Most locks used on Teradata resources are obtained automatically. Users can override some
locks by making certain lock specifications, but cannot downgrade the severity of a lock;
Teradata Database only allows overrides when it can assure data integrity. The data integrity
requirement of a request decides the type of lock that the system uses.
A request for a locked resource by another user is queued until the process using the resource
releases its lock on that resource.
Lock Levels:
Three levels of database locking are provided:
Object Locked   Description
Database        Locks all objects in a database.
Table           Locks all rows in the table or view and any index and fallback subtables.
Row Hash        Locks the primary copy of a row and all rows that share the same hash code
                within the same table (primary and fallback rows, and secondary index
                subtable rows).
Levels of Lock Types

Lock Type   Description
Exclusive   The requester has exclusive rights to the locked resource. No other person can
            read from, write to, or access the locked resource in any way. Exclusive locks
            are applied only to databases or tables, never to rows.
Write       The requester has exclusive rights to the locked resource except for readers not
            concerned with data consistency (access lock readers).
Read        Several readers can hold read locks on a resource, during which the system
            permits no modification of that resource. Read locks ensure consistency during
            read operations such as those that occur during a SELECT statement.
Access      The requester is willing to accept minor inconsistencies of the data while
            accessing the database (an approximation is good enough). An access lock
            permits modifications on the underlying data while the SELECT operation is in
            progress.
The same information is shown in the following table:

                 Lock Currently Held
Lock Request     None       Access     Read       Write      Exclusive
Access           Granted    Granted    Granted    Granted    Queued
Read             Granted    Granted    Granted    Queued     Queued
Write            Granted    Granted    Queued     Queued     Queued
Exclusive        Granted    Queued     Queued     Queued     Queued
Automatic Database Lock Levels
The Teradata Database applies most of its locks automatically. The following table illustrates
how the Teradata Database applies different locks for various types of SQL statements.
Locking Level by Access Type

Type of SQL Statement           UPI/NUPI/USI     NUSI/Full Table Scan   Locking Mode
SELECT                          Row Hash         Table                  Read
UPDATE                          Row Hash         Table                  Write
DELETE                          Row Hash         Table                  Write
INSERT                          Row Hash         Not Applicable         Write
CREATE/DROP/MODIFY DATABASE     Database         Not Applicable         Exclusive
CREATE/DROP/ALTER TABLE         Table            Not Applicable         Exclusive
It is generally recommended that queries that do not require READ lock access use the SQL
clause LOCKING ROW FOR ACCESS, as shown in the example below. This allows non-blocked
access to tables that are being updated.
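For example, the locking modifier precedes the query (the table and column names are illustrative):

LOCKING ROW FOR ACCESS
SELECT order_id, order_status
FROM orders
WHERE cust_id = 1004;            -- reads without blocking, or being blocked by, concurrent writers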
In addition to the information listed here, refer to the orange book, “Understanding Oracle and
Teradata Transactions and Isolation Levels for Oracle Migrations” for a further understanding of
the Teradata database differences.
Section 2.8 Statistics
Good primary index selection and timely and appropriate statistics collection may very well be
the two most important factors in obtaining Teradata Database performance.
Over the last two decades, Teradata software releases have consistently provided improvements
and enhancements in the way statistics are collected, and then utilized by the cost-based Teradata
Optimizer. The Optimizer doesn’t perform a detailed evaluation of every possible query plan
(multiple joins could produce billions of possibilities). Instead, it uses sophisticated algorithms to
identify and select the most promising candidates for detailed evaluation, then picks what it
perceives as the best plan among those. The essential task of the optimizer is to produce the
optimal execution plan (the one with the lowest cost) from many possible plans. The basis on
which different plans are compared with each other is the cost which is derived from the
estimation of cardinalities of the temporary or intermediate relations, after an operation such as
19
selections, joins and projections. The estimations in Teradata are derived primarily from statistics
and random AMP samples. Accurate estimations are crucial to get optimal plans.
Providing statistical information for performance optimization is critical to optimal query plans,
but collecting statistics can prove difficult due to the demands it places on time and system
resources. Without full or all-AMP sampled statistics, query optimization must rely on
extrapolation and on dynamic AMP sample estimates of table cardinality.
Besides estimated cardinalities, dynamic AMP samples also collect a few other statistics, but far
fewer than are collected by a COLLECT STATISTICS request.
Statistics and demographics provide the Optimizer with information it uses to reformulate
queries in ways that permit it to produce the least costly access and join plans. The critical issues
you must evaluate when deciding whether to collect statistics are not whether query optimization
can or cannot occur in the face of inaccurate statistics, but the following pair of probing
questions.
• How accurate must the available statistics be in order to generate the best possible query plan?
• How poor a query plan are you willing to accept?
Different strategies can be used to attain the right balance between the need for statistics and the
demands of time and resources.
The main strategies for collecting statistics are: Random AMP Sampling and Full Sampling.
2.8.1 Random AMP Sampling
The optimizer builds an execution plan for each SQL statement that enters the parsing engine.
When no statistics have been collected, the system default is for the Optimizer to make a rough
estimate of a table’s demographics by using dynamic samples from one or more AMPs (one
being the default). These samples are collected automatically each time the table header is
accessed from disk and are embedded in the table header when it is placed in the dictionary
cache.
By default, the optimizer does the single AMP sampling to produce random AMP sample
demographics with some exceptions (volatile, sparse single table join indexes and aggregate join
indexes). By changing an internal field in the dbscontrol record called RandomAMPSampling, it
can be requested that sampling be performed on 2 AMPs, 5 AMPs, all AMPs on a node, or all
AMPs in the system.
When using these options, random sampling uses the same techniques as single-AMP random
AMP sampling, but more AMPs participate. Touching more AMPs may improve the quality of
the statistical information available during plan generation, particularly if rows are not evenly
distributed.
In Teradata Database 12.0 and higher releases, all-AMP sampling was enhanced to use an
efficient technique using “Last done Channel mechanism” which considerably reduces the
messaging overhead. This is used when all-AMP sampling is enabled in the dbscontrol or cost
profile but dbscontrol internal flag RowsSampling5 is set to 0 (which is the default). If set to
greater than 0, this flag causes the sampling logic to read the specified percentage of rows to
determine the number of distinct values for primary index.
Random AMP Sample Characteristics
• Estimates fewer statistics than COLLECT STATISTICS does.
Statistics estimated include the following for all columns.
• Cardinalities
• Average rows per value
For indexes only, the following additional statistics are estimated.
• Average rows per index
• Average size of the index per AMP
• Number of distinct values
• Extremely fast collection time, so it is not detectable.
• Stored in the file system data block descriptor for the table, not in interval histograms in the
Data Dictionary.
• Occurs automatically. Cannot be invoked by user.
• Automatically refreshed when batch table
INSERT … DELETE operations exceed a threshold of 10% of table cardinality.
Cardinality is not refreshed by individual INSERT or DELETE requests even if the sum
of their updates exceeds the 10% threshold.
• Cached with the data block descriptor.
• Not used for non-indexed selection criteria or indexed selection with non-equality conditions.
Best Use
• Good for cardinality estimates when there is little or no skew and the table has significantly
more rows than the number of AMPs in the system.
• Collects reliable statistics for NUSI columns when there is limited skew and the table has
significantly more rows than the number of AMPs in the system.
• Useful as a temporary fallback measure for columns and indexes on which you have not yet
decided whether to collect statistics or not. Dynamic AMP sampling provides a reasonable
fallback mechanism for supporting the optimization of newly devised ad hoc queries until you
understand where collected statistics are needed to support query plans for them. Teradata
Database stores cardinality estimates from dynamic AMP samples in the interval histogram for
estimating table growth even when complete, fresh statistics are available.
Pros and cons of random AMP sampling:

Pros:
• Provides row count information for all indexes, including the primary index.
• The row count of the primary index is the total number of table rows.
• The row count of a NUSI subtable is the number of distinct values of the NUSI columns.
• The estimated number of distinct values is used for single-table equality predicates, join
  cardinality, aggregate estimations, costing, etc.
• Can potentially eliminate the need to collect statistics on the indexes.
• Up-to-date information - usually the freshest available.
• The operation is performed automatically.

Cons:
• Works only with indexed columns.
• Single-AMP sampling may not be good enough for small tables and tables with non-uniform
  distribution on the primary index.
• Does not provide the number of nulls, skew information, or value ranges.
• For NUSIs, the estimated number of distinct values on a single AMP is assumed to be the
  total number of distinct values. This is true for highly non-unique columns but can cause
  distinct-value underestimation for fairly unique columns. On the other hand, it can cause
  overestimation for highly non-unique columns because of rowid spill over.
• Cannot estimate the number of distinct values for non-unique primary indexes.
• Single-table estimations can use this information only for equality conditions, assuming
  uniform distribution.
It is strongly recommended to contact Teradata Global Support Center (GSC) to assess the
impact of enabling all-AMP sampling on your configuration and to help change the
internal dbscontrol settings.
2.8.2 Full Statistics Collection
Generically defined, a histogram is a count of the number of occurrences, or cardinality, of a
particular category of data that fall into defined disjunct value range categories. These categories
are typically referred to as bins or buckets.
Issuing a COLLECT STATISTICS statement is the most complete method of gathering
demographic information about a column or an index. Teradata Database uses equal-height,
high-biased, and history interval histograms (a representation of a frequency distribution)
to represent the cardinalities and other statistical values and demographics of columns and
indexes for all-AMPs sampled statistics and for full-table statistics.
The greater the number of intervals in a histogram, the more accurately it can describe the
distribution of data by characterizing a smaller percentage of its composition per each interval.
Each interval histogram in the system is composed of a number of intervals (the default is 250
and the maximum is 500). A 500-interval histogram permits each interval to characterize roughly
0.2% of the data.
Because these statistics are kept in a persistent state, it is up to the administrator to keep collected
statistics fresh. It is common for many Teradata Warehouse sites to re-collect statistics on the
majority of their tables weekly, and on particularly volatile tables daily, if deemed necessary.
Full statistics Characteristics
• Collects all statistics for the data.
• Time consuming.
• Most accurate of the three methods of collecting statistics.
• Stored in interval histograms in the Data Dictionary.
Best Use
• Best choice for columns or indexes with highly skewed data values.
• Recommended for tables with fewer than 1,000 rows per AMP.
• Recommended for selection columns having a moderate to low number of distinct values.
• Recommended for most NUSIs, PARTITION columns, and other selection columns because
collection time on NUSIs is very fast.
• Recommended for all column sets or indexes where full statistics add value, and where
sampling does not provide satisfactory statistical estimates.
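For example (the object names are illustrative):

COLLECT STATISTICS ON orders COLUMN order_date;        -- full statistics on a selection column
COLLECT STATISTICS ON orders INDEX (order_id);         -- full statistics on an index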
2.8.3 Collection with the USING SAMPLE option
Collecting full statistics involves scanning the base table and performing a sort, sometimes a sort
on a large volume of data, to compute the number of occurrences for each distinct value. The
time and resources required to adequately collect statistics and keep them fresh can be
problematic, particularly with large data volumes.
Collecting statistics on a sample of the data reduces the resources required and the time to
perform statistics collection. However, the USING SAMPLE alternative was certainly not
designed to replace full statistics collection. It requires some careful analysis and planning to
determine under which conditions it will add benefit.
The quality of the statistics collected with full-table sampling is not guaranteed to be as good as
the quality of statistics collected on an entire table without sampling. Do not think of sampled
statistics as an alternative to collecting full-table statistics, but as an alternative to never, or rarely,
collecting statistics.
When you use sampled statistics rather than full-table statistics, you are trading time in exchange
for what are likely to be less accurate statistics. The underlying premise for using sampled
statistics is usually that sampled statistics are better than no statistics.
Do not confuse statistical sampling with the dynamic AMP samples (system default) that the
Optimizer collects when it has no statistics on which to base a query plan. Statistical samples
taken across all AMPs are likely to be much more accurate than dynamic AMP samples.
Sampled statistics are different from dynamic AMP samples in that you specify the percentage of
rows you want to sample explicitly in a COLLECT STATISTICS (Optimizer Form) request to
collect sampled statistics, while the number of AMPs from which dynamic AMP samples are
collected and the time when those samples are collected is determined by Teradata Database, not
by user choice. Sampled statistics produce a full set of collected statistics, while dynamic AMP
samples collect only a subset of the statistics that are stored in interval histograms.
Sampled Statistics Characteristics
• Collects all statistics for the data, but not by accessing all rows in the table.
• Significantly faster collection time than full statistics.
• Stored in interval histograms in the Data Dictionary.
Best Use
• Acceptable for columns or indexes that are highly singular; meaning that their number of
distinct values approaches the cardinality of the table.
• Recommended for unique columns, unique indexes, and for columns or indexes that are
highly singular. Experience suggests that sampled statistics are useful for very large
tables; meaning tables with tens of billions of rows.
• Not recommended for tables whose cardinality is less than 20 times the number of AMPs in the
system.
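For example (the object names are illustrative):

COLLECT STATISTICS USING SAMPLE ON orders COLUMN order_number;    -- samples rows instead of scanning the entire table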
2.8.4 Collect Statistics Summary Option
New in Teradata 14.0, table-level statistics known as "summary statistics" are collected
alongside the column or index statistics you request. Summary statistics do not cause their
own histogram to be built; rather, they create a short listing of facts about the table
undergoing collection that is held in the new DBC.StatsTbl. It is a very fast
operation. Summary stats report on things such as the table's row count, average block size, and
some metrics around block-level compression and (in the future) temperature. An example of
actual execution times is shown below, comparing regular column statistics collection against
summary statistics collection for the same large table. Time is reported in MM:SS:
COLLECT STATISTICS ON Items COLUMN I_ProductID;
Elapsed time (mm:ss): 9:55
COLLECT SUMMARY STATISTICS ON Items;
Elapsed time (mm:ss): 00:01
You can request summary statistics for a table, but even if you never do that, each individual
statistics collection statement causes summary stats to be gathered. For this reason, it is
recommended that you group your statistics collections against the same table into one statement,
in order to avoid even the small overhead involved in building summary stats repeatedly for the
same table within the same script.
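For example, reusing the Items table from above with an additional, hypothetical column, the collections can be grouped so the summary statistics are built only once:

COLLECT STATISTICS
   COLUMN (I_ProductID),
   COLUMN (I_Category),                  -- hypothetical column, shown for illustration
   COLUMN (I_ProductID, I_Category)
ON Items;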
There are several benefits in having summary statistics. One critical advantage is that the
optimizer now uses summary stats to get the most up-to-date row count from the table in order
to provide more accurate extrapolations. It no longer needs to depend on primary index or
PARTITION stats, as was the case in earlier releases, to perform good extrapolations when it
finds statistics on a table to be stale.
Here’s an example of what the most recent summary statistic for the Items table looks like:
SHOW SUMMARY STATISTICS VALUES ON Items;

COLLECT SUMMARY STATISTICS
    ON CAB.Items
VALUES
(
 /** TableLevelSummary **/
 /* Version           */ 5,
 /* NumOfRecords      */ 50,
 /* Reserved1         */ 0.000000,
 /* Reserved2         */ 0.000000,
 /* SummaryRecord[1]  */
 /* Temperature       */ 0,
 /* TimeStamp         */ TIMESTAMP '2011-12-29 13:30:46',
 /* NumOfAMPs         */ 160,
 /* OneAMPSampleEst   */ 5761783680,
 /* AllAMPSampleEst   */ 5759927040,
 /* RowCount          */ 5759985050,
 /* DelRowCount       */ 0,
 /* PhyRowCount       */ 5759927040,
 /* AvgRowsPerBlock   */ 81921.871617,
 /* AvgBlockSize      */ 65024.000000,
 /* BLCPctCompressed  */ 0.00,
 /* BLCBlkUcpuCost    */ 0.000000,
 /* BLCBlkURatio      */ 0.000000,
 /* RowSizeSampleEst  */ 148.000000,
 /* Reserved2         */ 0.000000,
 /* Reserved3         */ 0.000000,
 /* Reserved4         */ 0.000000
);
2.8.5 Summary: Teradata Statistics Collection
The decision between full-table and all-AMPs sampled statistics seems to be a simple one:
always collect full-table statistics, because they provide the best opportunity for producing
optimal query plans.
While the above statement may be true, the decision is not so easily made in a production
environment. Other factors must be taken into consideration, including the length of time
required to collect the statistics and the resource consumption the collection of full-table
statistics incurs while running other workloads on the system.
To resolve this, the benefits and drawbacks of each method must be considered. An excellent
information table comparing the three methods (Full Statistics, Sampled Statistics, Dynamic
AMP Samples) is provided in Chapter 2 of the SQL Request and Transaction Processing Release
14.0 manual, under the heading Relative Benefits of Collecting Full-Table and Sampled Statistics.
2.8.6 New opportunities for statistics collection in Teradata 14.0[1]
Teradata 14.0 offers some very helpful enhancements to the statistics collection process. This
section discusses a few of the key ones, with an explanation of how these enhancements can be
used to streamline your statistics collection process and help your statistics be more effective.
For more detail on these and other statistics collection enhancements, please read the orange
book titled Teradata 14.0 Statistics Enhancements, authored by Rama Korlapati, Teradata Labs.
New USING options add greater flexibility
In Teradata 14.0 you may optionally specify a USING clause within the collect statistics
statement. As an example, here are the 3 new USING options that are available in 14.0 with
parameters you might use:
. . . USING MAXINTERVALS 300
. . . USING MAXVALUELENGTH 50
. . . USING SAMPLE 10 PERCENT
MAXINTERVALS allows you to increase or decrease the number of intervals one statistic at a
time in the new version 5 statistics histogram. The default maximum number of intervals is
250. The valid range is 0 to 500. A larger number of intervals can be useful if you have
widespread skew on a column or index you are collecting statistics on, and you want
more individual high-row-count values to be represented in the histogram. Each statistics
interval highlights its single most popular value, which it designates as its "mode value," and lists
the number of rows that carry that value. By increasing the number of intervals, you will be
providing the optimizer an accurate row count for a greater number of popular values.
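For example, a badly skewed column might be given a larger interval count. The statement below
is a sketch only; the table and column names are hypothetical:
-- hypothetical names; raises the interval count for a skewed column
COLLECT STATISTICS
USING MAXINTERVALS 500
COLUMN ( o_orderstatus )
ON CAB.Orders;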
MAXVALUELENGTH lets you expand the length of the values contained in the histogram for
that statistic. The new default length is 25 bytes, when previously it was 16. If needed, you can
specify well over 1000 bytes for a maximum value length. No padding is done to the values in
the histogram, so only values that actually need that length will incur the space (which is why the
parameter is named MAXVALUELENGTH instead of VALUELENGTH). The 16-byte limit
on value sizes in earlier releases was always padded to full size. Even if your statistics value was
one character, you used the full 16 bytes to represent it.
Another improvement around value lengths stored in the histogram has to do with multicolumn
statistics. In earlier releases the 16 byte limit for values in the intervals was taken from the
beginning of the combined value string. In 14.0 each column within the statistic will be able to
represent its first 25 bytes in the histogram as the default, so no column will go without
representation in a multicolumn statistics histogram.
SAMPLE n PERCENT allows you to specify sampling at the individual statistics collection
level, rather than at the system level. This allows you to easily apply different levels of statistics
sampling to different columns and indexes.
Here's an example of how this USING syntax might look:
COLLECT STATISTICS
USING MAXVALUELENGTH 50
COLUMN ( P_NAME )
ON CAB.product;
Combining multiple collections in one statement
Statistic collection statements for the same table that share the same USING options, and that
request full statistics (as opposed to sampled), can now be grouped syntactically. In fact, it is
recommended that once you are on 14.0 you collect all such statistics on a table as one group.
The optimizer will then look for opportunities to overlap the collections, wherever
possible, reducing the time to perform the statistics collection and the resources it uses.
Here is an example:
The old way:
COLLECT STATISTICS COLUMN
(o_orderdatetime,o_orderID)
ON Orders;
COLLECT STATISTICS COLUMN
(o_orderdatetime)
ON Orders;
COLLECT STATISTICS COLUMN
(o_orderID)
ON Orders;
The new, recommended way:
COLLECT STATISTICS
COLUMN (o_orderdatetime,o_orderID)
, COLUMN (o_orderdatetime)
, COLUMN (o_orderID)
ON Orders;
This is particularly useful when the same column appears in single and also multicolumn
statistics, as in the example above. In those cases the optimizer will perform the most inclusive
collection first (o_orderdatetime,o_orderID), and then re-use the spool built for that step to
derive the statistics for the other two columns. Only a single table scan is required, instead of 3
table scans using the old approach.
Sometimes the optimizer will choose to perform separate collections (scans of the table) the first
time it sees a set of bundled statistics. But based on demographics it has available from the first
collection, it may come to understand that it can group future collections and use pre-aggregation
and rollup enhancements to satisfy them all in one scan.
Remember, however, to re-code your statistics collection statements when you move to 14.0
in order to realize these savings.
Automated statistics management (Teradata release 14.10 and above)
Description
• Identify and collect missing statistics
• Automates and provides intelligence to DBA tasks related to Optimizer Statistics
Collections where such tasks include:
• Identify and collect missing statistics needed for query optimization.
• Detect stale statistics and promptly refresh them.
• Identify and remove unused statistics from routine maintenance
• Prioritize list of pending collections such that important and stale statistics are given
precedence
• Execute needed collections in the background during scheduled time periods
• NEW Statistics Management Viewpoint portlet.
Benefits
• Automation of statistics collection/re-collection improves query and system performance.
• Automation of tasks greatly reduces the burden of statistics management from the DBA.
Note: With Teradata Software Release 15.0 and above, the Teradata Statistics Wizard is no
longer supported.
2.8.7 Recommended Reading
The subject of Teradata database statistics is far too complex and detailed to be summarily
defined or exhausted in this guidebook. There are many new statistics collection options with
Teradata Release 14.0, and also improvements to existing options. For example, one of the new
options in 14.0 is called SUMMARY. This is used to collect only the table-level statistical
information such as row count, average block size, average row size, etc. without the histogram
detail. This option can be used to provide up-to-date summary information to the optimizer in a
quick and efficient way. When the SUMMARY option is specified in a collect statistics statement,
no column or index specification is allowed. The following resources are recommended reading
to further your knowledge of statistics as they pertain to the Teradata Database.
SQL Request and Transaction Processing Release 14.0 manual. Excellent, technically detailed
information on different statistics collection strategies is provided in Chapter 2, along with
excellent explanations of how the optimizer uses statistics.
The following Teradata Orange Books:
Optimizer Cardinality Estimation Improvements Teradata Database 12.0 by Rama Korlapati
Teradata 14.0 Statistics Enhancements by Rama Korlapati
Statistics Extrapolations by Rama Korlapati
Collecting Statistics by Carrie Ballinger (written for Teradata release V2R6.2, but still a valuable
resource)
Anything written by Carrie Ballinger on the subject of Teradata statistics. Check out her
contributions to the Teradata Developers Exchange at: http://developer.teradata.com/
Including - New opportunities for statistics collection in Teradata 14.0 on Carrie’s Blog.
http://developer.teradata.com/blog/carrie/2012/08/new-opportunities-for-statistics-collection-in-teradata-14-0
Also on http://developer.teradata.com/
When is the right time to refresh statistics? - Part I (and Part II) by Marcio Moura
http://developer.teradata.com/blog/mtmoura/2009/12/when-is-the-right-time-to-refresh-statistics-part-i
Others
http://developer.teradata.com/tools/articles/easy-statistics-recommendations-statistics-wizard-feature
http://developer.teradata.com/database/articles/statistics-collection-recommendations-for-teradata-12
http://developer.teradata.com/blog/carrie/2012/04/teradata-13-10-statistics-collection-recommendations
Optimizer article by Alan Greenspan:
http://www.teradatamagazine.com/Article.aspx?id=12639
Section 2.9 Stored Procedures
Stored procedures are available in Teradata and allow for procedural manipulation of set or table
data. Advantages to using stored procedures include:
1) One set of code can be used many times by many users/clients
2) Stored procedures are stored as compiled object code, eliminating the need to
process raw SQL and SPL (Stored Procedure Language) for each request
3) Enforcement of business rules and standards
Stored procedures can be internal (SQL and/or SPL) or external (C, C++, Java in Teradata 12
and beyond) and are considered database objects. Internal and protected external stored
procedures are run by Parsing Engines (in other words, governed internally by Teradata).
Internal stored procedures are written in SQL and SPL whereas external stored procedures
cannot execute SQL Statements. External Stored procedures, however, can execute other stored
procedures providing an indirect method of executing SQL statements.
External stored procedures can also execute as a separate process/thread (outside of a Teradata
Parsing Engine) or directly within the database, depending on the protection mode used when the
stored procedure was created. Protected mode runs the procedure in a separate, insulated server
process under Teradata's control, while unprotected mode runs it directly within the database. The
tradeoff is that protected mode ensures that memory and other resources don't conflict with Teradata
but can negatively affect performance. Running in unprotected mode can provide better performance, but there is
risk of a potential resource conflict (memory usage/fault, using processing resources that would
be used by Teradata). If you are attempting to run a stored procedure and it’s very slow, one of
the first items to check is the protection mode that was selected when the procedure was created.
Example (Creating/Replacing a Stored Procedure)
REPLACE PROCEDURE sp_db.test_sp
(
IN in_parm INTEGER,
OUT return_parm CHAR(4)
)
BEGIN
   SELECT return_code
   INTO :return_parm
   FROM sp_db.table1
   WHERE table1.key_column = :in_parm;
END;
The example uses REPLACE PROCEDURE, which creates the procedure if it does not already exist and
replaces it if it does; CREATE PROCEDURE can also be used for the initial creation.
Note: If creating stored procedures using SQL Assistant, make sure that ‘Allow the Use of
ODBC SQL Extensions’ is checked (Menu – Tools/Options, Query tab). SQL Assistant will not
recognize the CREATE/REPLACE commands if this option is not checked.
Example (Altering a Stored Procedure)
ALTER PROCEDURE sp_db.test_sp LANGUAGE C EXECUTE NOT PROTECTED;
This modifies an existing external stored procedure (written in C) to run unprotected.
Example (Calling a Stored Procedure)
Stored procedures can be invoked with a CALL statement as part of a macro.
CREATE MACRO test_macro (returned_value CHAR(4)) AS
( CALL sp_db.test_sp(15467, :returned_value); );
Starting with Teradata V12, stored procedures can return result sets. Prior to V12, result sets can be stored in a
table (permanent or temp table) for access outside the stored procedure.
Error handling is also built into Teradata stored procedures through messaging facilities (Signal,
Resignal) and a host of available standard diagnostic variables. External stored procedures also
can use a debug trace facility that provides a means to store tracing information in a global
temporary trace table.
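As a minimal, hypothetical sketch of SPL error handling (sp_db.table1 and its key_column are reused
from the earlier example; the procedure name is invented), an exit handler can trap a SQL exception
and return a status to the caller:
REPLACE PROCEDURE sp_db.safe_insert
(
IN in_key INTEGER,
OUT out_status VARCHAR(30)
)
BEGIN
   /* runs if any SQL statement below raises an exception */
   DECLARE EXIT HANDLER FOR SQLEXCEPTION
      SET out_status = 'INSERT FAILED';
   INSERT INTO sp_db.table1 (key_column) VALUES (:in_key);
   SET out_status = 'INSERT OK';
END;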
It is important to find a balance when using stored procedures, especially when porting existing
stored procedures from another database. Teradata’s strength lies in its ability to process large
sets of data rather quickly. Using row-at-a-time processing such as cursors can cause slower
performance. In Teradata 13, recursive queries are allowed in stored procedures, allowing for a
set-based approach to many of the problems that cursors have been used to solve in the past.
Elements of XSP Body (with example)
#define SQL_TEXT Latin_Text
#include <sqltypes_td.h>
#include <string.h>
void xsp_getregion( VARCHAR_LATIN *region, char sqlstate[6])
{
char tmp_string[64];
if (strlen((const char *)region) > 4)
{
/* Strip off the first four characters */
strcpy(tmp_string, (char *)region);
strcpy((char *)region, &tmp_string[4]);
}
}
SP Commands
The main SQL statements for working with stored procedures are listed below; brief usage examples follow the list.
 SHOW PROCEDURE – Display procedure statements and comments.
 HELP PROCEDURE – Show procedure parameter names and types.
 ALTER PROCEDURE – Change attributes such as protected mode and storing of SPL; compile/recompile stored procedures.
 EXECUTE PROTECTED / EXECUTE NOT PROTECTED – Provides an execution option for fault isolation and performance.
 ATTRIBUTES clause – Display the transaction mode and platform the SP was created on.
 DROP PROCEDURE – Removes unwanted SPs. For an XSP, it is removed from the available SP library with a relink of the library.
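For illustration, a few of these commands as they might be run against the earlier example procedure:
SHOW PROCEDURE sp_db.test_sp;
HELP PROCEDURE sp_db.test_sp;
DROP PROCEDURE sp_db.test_sp;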
Handling SPs during migration
• Prior to Migration
> The Pre_upgrade_prep.pl script identifies SPs that will not recompile automatically.
• During Migration
> Qualified SPs and XSPs are recompiled automatically.
> SPs with no SPL and those that fail recompile are identified.
– These will have to be manually recreated/recompiled, respectively.
• SPs must be recompiled
> on new major releases of TD
> after cross-platform migration
Section 2.10 User Defined Functions (UDF)
User Defined Functions are programs or routines written in C/C++ or Java (Teradata V13) that allow
users to add extensions to the Teradata SQL language. They are classified by their input and output
parameter types:
Input Parameter Type   Output Parameter Type   Classification
Scalar                 Scalar                  User Defined Scalar Functions
Scalar                 Set                     User Defined Table Functions
Set                    Scalar                  User Defined Aggregate Functions
Set                    Set                     User Defined Table Operator
Scalar functions accept inputs from an argument list and return a single result value. Some
examples of built-in Teradata scalar functions include SUBSTR, ABS, and SQRT.
Scalar UDFs can also be written using SQL constructs. These are called SQL UDFs and they are
very limited. SQL commands cannot be issued in "SQL UDFs", so they are basically limited to
single statements using SQL functions. They have two main advantages: they can simplify SQL DML
statements by using the function call in place of long, convoluted logic, and they run faster than C
language UDFs.
Example:
REPLACE FUNCTION Power( M Float, E Float )
RETURNS FLOAT
LANGUAGE SQL
CONTAINS SQL
RETURNS NULL ON NULL INPUT
DETERMINISTIC
SQL SECURITY DEFINER
COLLATION INVOKER
INLINE TYPE 1
RETURN
CASE M WHEN 0 then 0 ELSE EXP ( LN ( M ) * E ) END
;
SELECT POWER(cast(2 as decimal(17,0))-cast(1 as decimal(17,0)),2);
SELECT POWER(cast(1234567890098765432 as bigint)-1,2);
These are not the equivalent of PL/SQL functions because they cannot issue SQL statements
themselves. For ideas on translating PL/SQL functions to Teradata, see
http://developer.teradata.com/blog/georgecoleman/2014/01/ordered-analytical-functions-translating-sql-functions-to-set-sql
Aggregate functions are similar to scalar functions except that they work on sets of data (created by
GROUP BY clauses in a SQL statement), processing one row at a time and returning a single result per
group. SUM, MIN, MAX, and AVG are examples of Teradata built-in aggregate functions.
Table functions return a table one row at a time and, unlike scalar and aggregate UDF's, cannot be
called in the places that system functions are called. Table functions require a different syntax
and are invoked similarly to a derived table:
INSERT INTO Sales_Table SELECT S.Store, S.Item, S.Quantity
FROM TABLE (Sales_Retrieve(9005)) As S;
The table function, Sales_Retrieve is passed the parameter of 9005. The results of
Sales_Retrieve will be packaged to match the 3 columns in the SELECT clause.
UDF’s can be an optimal solution for:
1. Additional SQL functions to enhance existing Teradata supplied functions. For example,
certain string manipulations may be common for a given application. It may make sense
to create UDF’s for those string manipulations; the rule is created once and can be used
by many. UDF’s can make porting from different databases easier (i.e. the Oracle
DECODE function) by coding a UDF to match the other database function. Recreating
the function (one code change) reduces the amount of SQL rewrite for an application that
may use the function many times (many code changes).
2. Complex algorithms for data mining, forecasting, business rules, and encryption
3. Analysis of non-traditional data types (i.e. image, text, audio, etc.)
4. XML string processing
5. ELT (Extract Transform Load) data validation
UDF's are invoked by qualifying the database name where they are stored, e.g.
DBName.UDFname(), or, if stored in the special database called SYSLIB, without database name
qualification, e.g. UDFname().
UDF’s versus Stored Procedures
 UDF's are invoked in a SQL DML statement whereas Stored Procedures must be invoked with an
explicit CALL statement.
 UDF's cannot modify data with INSERT, UPDATE, or DELETE statements and can only work with
data that is passed in as parameters.
 UDF's are written in C/C++ (Java in Teradata V13); Internal Stored Procedures are written in SPL
(Stored Procedure Language). External Stored Procedures are similar to UDF's in that they are
written in C/C++ and Java. However, a UDF cannot CALL a stored procedure, while a stored
procedure can invoke a UDF as part of a DML statement.
 UDF's run on the AMPs while Stored Procedures run under control of the Parsing Engines (PEs).
Starting with Teradata V2R6.0, some UDF's can run as part of the PEs.
 UDF's can only return a single value (except for Table UDF's).
 Stored procedures can handle multiple SQL exceptions whereas UDF's can only catch and pass one
value to the caller.
Protected versus Unprotected Mode
UDF’s can run in either protected or unprotected mode. When a UDF is first created, it is in
protected mode. An ALTER statement is used to switch the UDF from protected to unprotected
mode. In protected mode, the UDF is run by an AMP UDF Server Task (by default there are two
per AMP). Running under the Server Task creates overhead and can result in slower execution
times. To increase performance, the UDF can be run in unprotected mode which means it is run
directly by the AMP itself. Unprotected mode should be used only when the UDF has been fully
tested and deemed fail-safe. If the function fails in unprotected mode, a database restart is
possible since the AMP is no longer insulated from the function via the Server Task process.
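For example, a fully tested UDF can be switched to unprotected mode with an ALTER FUNCTION
statement (the function name below is hypothetical):
-- hypothetical function name
ALTER FUNCTION udf_db.my_decode EXECUTE NOT PROTECTED;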
Developing and distributing UDF’s: UDF’s support the concept of packages which allow
developers to create function suites or libraries (i.e. .DLL’s in Windows and .SO’s on UNIX)
that are easily deployable across systems.
Identifying Overloaded Functions
Teradata has the concept of the “FunctionName” and “SpecificName” in which the specific
name is qualified by the parameters. You can see the detail with:
select functionname (format 'x(20)'),
specificname (format 'x(20)'),
numparameters,
parameterdatatypes
from dbc.functions
where databasename='xxx'
and functionname like 'FL%'
order by 1;
Standard Teradata Functions can be found in database: TD_SYSFNLIB
The following is a query which can be used to identify functions and a date of last modification.
SELECT
dbase.databasename (Format 'x(20)') (Title 'DB Name')
, UDFInfo.FunctionName (Format 'x(20)') (Title 'Function')
,case
when UDFInfo.FunctionType='A' then 'Aggregate'
when UDFInfo.FunctionType='B' then 'Aggr Ordered Anal'
when UDFInfo.FunctionType='C' then 'Contract Func'
when UDFInfo.FunctionType='E' then 'Ext Stored Proc'
when UDFInfo.FunctionType='F' then 'Scalar'
when UDFInfo.FunctionType='H' then 'Method'
when UDFInfo.FunctionType='I' then 'Internal'
when UDFInfo.FunctionType='L' then 'Table Op'
when UDFInfo.FunctionType='R' then 'Table Function'
when UDFInfo.FunctionType='S' then 'Ordered Anal'
else 'Unknown'
end (varchar(17), Title 'Function//Type')
, CAST(TVM.LastAlterTimeStamp AS DATE) (FORMAT 'MMM-DD',Title 'Altered')
FROM DBC.UDFInfo, DBC.DBase, DBC.TVM
WHERE DBC.UDFInfo.DatabaseId = DBC.DBase.DatabaseId
AND DBC.UDFInfo.FunctionId = DBC.TVM.TVMId
ORDER BY 1,2,3,4;
Archive/Restore/Copy/Migrating Considerations
The following lists BAR-related operations and describes how they act on UDFs. It is very
important to understand that only database-level operations act on UDFs. There is no way to
selectively archive or restore a UDF (the same is true for stored procedures):
Dictionary Archive – UDF dictionary information (summary) is saved but not UDF source.
ALL AMP Archive – UDF dictionary information and UDF source code are archived.
Cluster Archive – UDF source code is archived.
Dictionary Restore/Copy – UDF dictionary information is restored (no UDF source code).
ALL AMP Restore/Copy – UDF dictionary and source code are restored/copied.
Cluster Restore/Copy – UDF source code is restored.
Section 2.11 Table Operators
A Teradata release 14.10 table operator is a type of function that can accept an arbitrary row
format and based on operation and input row type can produce an arbitrary output row format.
The name is derived from the concept of a database physical operator. A physical operator takes
as input one or more data streams and produces an output data stream. Examples of physical
operators are join, aggregate, scan, etc. The notion is that, from the perspective of the function
implementer (programmer), a new physical operator can be implemented that has complete
control of the Teradata parallel "step" processing. From the function user's (SQL writer's)
perspective, it is very analogous to the concept of a "FROM clause" table function.
Differences Between Table Functions and Table Operators
• The inputs and outputs for table operators are a set of rows (a table) and not columns. The
default format of a row is IndicData.
• In a table function, the row iterator is outside of the function and the iterator calls the function
for each input row. In the table operators, the operator writer is responsible for iterating over the
input and producing the output rows for further consumption. The table operator itself is called
only once. This reduces per row costs and provides more flexible read/write patterns.
System Defined Table Operators
A table operator can be system defined or user defined. Teradata release 14.10 introduces three
new system defined table operators:
 TD_UNPIVOT, which transforms columns into rows based on the syntax of the unpivot
expression.
 CalcMatrix, which calculates a Correlation, Covariance, or Sums of Squares and Cross
Products matrix.
 LOAD_FROM_HCATALOG, which is used for accessing the Hadoop file system.
Use Options
The table operator is always executed on the AMP within a return step (stpret in DBQL). This
implies that it can read from spool, base table, PPI partition, index structure, etc. and will always
write its output to spool. Some concepts related to operator execution follow.
If a HASH BY and/or a LOCAL ORDER BY is specified, the input data will always be spooled to
enforce the HASH BY geography and the LOCAL ORDER BY ordering within the AMP.
HASH BY can be used to assign rows to an AMP and LOCAL ORDER BY can be used to order
the rows within an AMP. You can specify either or both of the clauses independently.
If a PARTITION BY and ORDER BY is specified, the input data will always be spooled to
enforce the PARTITION BY grouping and ORDER BY ordering within the partition. You
can specify a PARTITION BY without an ORDER BY, but you cannot have an ORDER BY
without a PARTITION BY. Further, the table operator will be called once for each partition and
the row iterator will only cover the rows within the partition. In summary, a PARTITION is a
logical concept and one or more partitions may be assigned to an AMP, the same behavior as
with ordered analytic partitions.
The USING clause values are modeled as key-value pairs. You can define multiple key-value pairs
and a single key can have multiple values. The USING clause contains literal values and cannot
contain any expressions, DML, etc. Further, the values are handled by the syntaxer in a similar
manner to regular SQL literals. For example, {1, 1.0, '1'} are respectively passed to the table
operator as BYTEINT, DECIMAL(2,1), and VARCHAR(1) CHARACTER SET UNICODE values.
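As a rough sketch of an invocation (MyOperator, CAB.Sales, and the Threshold parameter are
hypothetical, and the exact clauses available depend on how the operator is written), the ON clause
supplies the input rows, HASH BY and LOCAL ORDER BY control row placement and ordering, and
USING passes key-value pairs:
SELECT *
FROM MyOperator (
   ON CAB.Sales
   HASH BY Store_ID
   LOCAL ORDER BY Sale_Date
   USING Threshold('100')   /* key-value pair passed to the operator */
) AS d;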
Section 2.12. QueryGrid
Teradata Database provides a means to connect to a remote system and retrieve or insert data
using SQL. This enables easy access to Hadoop data for the SQL user, without replicating the
data in the warehouse.
Note: With Teradata Software Release 15.0, SQL-H has been rebranded as Teradata QueryGrid:
Teradata Database to Hadoop. The existing connectors on Teradata 14.10 to TDH/HDP will
continue to be called SQL-H. The 14.10 SQL-H was released with Hortonworks Hadoop and is
certified to work with TDH 1.1.0.17, TDH/HDP 1.3.2, and TDH/HDP 2.1 (TDH = Teradata
Distribution for Hadoop, HDP = Hortonworks Data Platform).
The goal and vision of Teradata® QueryGrid™ is to make specialized processing engines,
including those in the Teradata Unified Data Architecture™, act as one solution from the user's
perspective. Teradata QueryGrid is the core enabling software, engineered to tightly link with
these processing engines to provide intelligent, transparent and seamless access to data and
processing. This family of intelligent connectors delivers bi-directional data movement and
push-down processing to enable the Teradata Database or the Teradata Aster Database systems to
work as a powerful orchestration layer.
As the role of analytics within organizations continues to grow, along with the number and types
of data sources and processing requirements, companies face increasing IT complexity. Much of
the complexity arises from the proliferation of non-integrated systems from different vendors,
each of which is designed for a specific analytic task.
This challenge is best addressed by the Teradata® Unified Data Architecture™, which enables
businesses to take advantage of new data sources, data types and processing requirements across
the Teradata Database, Teradata Aster Database, and open-source Apache™ Hadoop®. Teradata
QueryGrid™ optimizes and simplifies access to the systems and data within the Unified Data
Architecture and beyond to other source systems, including Oracle Database; delivering seamless
multi-system analytics to end-users.
This enabling solution orchestrates processing to present a unified analytical environment to the
business. It also provides fast, intelligent links between the systems to enhance processing and
data movement while leveraging the unique capabilities of each platform. Teradata Database
15.0 brings new capabilities to enable this virtual computing, building on existing features and
laying the groundwork for future enhancements.
Teradata QueryGrid is a powerful enabler of technologies within and beyond the Unified Data
Architecture that delivers seamless data access and localized processing. The QueryGrid adds a
single execution layer that orchestrates analyses across Teradata, Teradata Aster, Hadoop, and in
the future other databases and platforms. The analysis options include SQL queries, as well as
graph, MapReduce, R-based analytics, and other applications. Offering two-way, Infiniband
connectivity among data sources, the QueryGrid can execute sophisticated, multi-part analyses.
It empowers users to immediately and automatically access and benefit from all their data along
with a wide range of processing capabilities, all without IT intervention. This solution raises the
bar for enterprise analytics and gives companies a clear competitive advantage.
The vision, simply said, is that a business person connected to the Teradata Database or Aster
Database can submit a single SQL query that joins data together from one or more systems for
analysis. There’s no need to depend upon IT to extract data and load it into another machine. The
business person doesn’t have to care where the data is – they can simply combine relational
tables in Teradata with tables or flat files found in Hadoop on demand.
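As a rough sketch of what such a query might look like in Teradata 15.0 syntax, assuming a foreign
server object (here called hdp_server) has already been defined by the DBA and that web_clicks is a
Hive table exposed through it, local and Hadoop data can be joined in one statement. The object
names are hypothetical and the exact syntax should be confirmed against the QueryGrid documentation
for the release in use:
SELECT c.Customer_Name, w.Page_Views
FROM CAB.Customer c
JOIN web_clicks@hdp_server w      /* hypothetical Hive table reached through a foreign server */
  ON c.Customer_ID = w.Customer_ID;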
Teradata QueryGrid delivers some important benefits:
 It's easy to use, using existing SQL skills
 Allows standard ANSI SQL access to Hadoop data
 Low DBA labor moving and managing data between systems
 High performance: reducing hours to minutes means more accuracy and faster turnaround for demanding users
 Cross-system analytics
 Leverages Teradata/Aster strengths: security, workload management, system management
 Minimum data movement improves performance and reduces network use
 Moves the processing to the data
 Leverages existing BI tools and enables self-service
 Dynamically reads data from other database systems without requiring replication of the data into the local Teradata system
 Uses intelligent data access and filtering mechanisms that reduce the amount of rows that are transported over the wire to Teradata
 Provides scalable architecture to move and process foreign data
 Eliminates the details for connecting to foreign databases
 Provides data type mapping and data conversion
The Teradata approach to fabric-based computing also leverages these elements wherever
possible for seamlessly accessing data across the Teradata® Unified Data Architecture™:
 Teradata BYNET® V5 is built on InfiniBand technology. Perfected over 20 years of
massively parallel processing experience, BYNET V5 provides low-latency messaging
capabilities for maximum data access. This is accomplished by leveraging the inherent
scalability and integrity of InfiniBand to load-balance multiple fabrics, seamlessly
handling failover in the event of an interconnect failure.
 InfiniBand technology, a Teradata fabric, gains much of its resiliency from the
Mellanox-supplied InfiniBand switches, adapters and cables that are recognized as
industry-leading products for high-quality, fully interoperable enterprise switching systems.
Section 2.13 DBQL
Teradata Database Query Log (DBQL) is the primary data source for query monitoring and
evaluation of tuning techniques. Teradata DBQL provides a series of predefined tables that can
store, based on rules you specify, historical records of queries and their duration, performance,
and target activity. The DBQL data is accessed with SQL queries written against the DBQL
tables or views. DBQL is documented in the Database Administration manual for each release
of the Teradata database. A Database Administrator (DBA) typically performs the DBQL
logging configuration and runs the DBQL logging statements. The DBA grants access on DBQL
tables or views to application analysts so that analysts can measure and quantify their database
workloads.
The recommended leading practice DBQL configuration for Teradata environments is that all usage
should be logged at the Detail level with SQL and Objects. The only exception is database usage
comprised of strictly subsecond, known work, i.e. tactical applications. This subsecond, known
work is logged at the Summary level.
Logging with DBQL is best accomplished by one logging statement for each “accountstring” on
a Teradata system. A database user session is always associated with an account. The account
information is set with an “accountstring.” Accountstrings typically carry a workload group
name (“$M1$”), identifying information for an application (“WXYZ”), and expansion variables
for activity tracking (“&S&D&H” for session number, date, and hour).
An example of the recommended DBQL Detail logging statement using this example
accountstring is:
BEGIN QUERY LOGGING with SQL, OBJECTS LIMIT sqltext=0 ON ALL ACCOUNT =
‘$M1$WXYZ&S&D&H’;
This statement writes rows to the DBQLogTbl table. This table contains detailed information
including but not limited to CPU seconds used, I/O count, result row count, system clock times
for various portions of a query, and other query identifying information. This logging statement
writes query SQL to the DBQLSqlTbl table with no SQL text kept in the DBQLogTbl. Database
object access counts for a query are written to the DBQLObjTbl table.
An example of the recommended DBQL Summary logging statement for tactical applications
(only known subsecond work) using the example accountstring on Teradata V2R6 and later is:
BEGIN QUERY LOGGING LIMIT SUMMARY=10,50,100 CPUTIME ON ALL ACCOUNT =
‘$M1$WXYZ&S&D&H’;
This statement results in up to four rows written to the DBQLSummaryTbl in a 10 minute DBQL
logging interval. These rows summarize the query logging information for queries in hundredths
of a CPU second between 0 and 0.10 CPU second, 0.10 - 0.50 CPU second, 0.50 - 1 CPU
second, and over 1 CPU second.
For a well-behaved workload with occasional performance outliers, DBQL threshold logging can
be used. Threshold logging is not typically recommended but is available. An example of a
DBQL Threshold logging statement using the example accountstring is:
BEGIN QUERY LOGGING LIMIT THRESHOLD=100 CPUTIME AND SQLTEXT=10000
ON ALL ACCOUNT = ‘$M1$WXYZ&S&D&H’;
The CPUTIME threshold is expressed in hundredths of a CPU second. This statement logs all
queries over 1 CPU second to the DBQL Detail table with the first 10,000 characters of the SQL
statement also logged to the DBQL Detail table. For queries less than 1 CPU second, this DBQL
Threshold logging statement writes a query cumulative count by CPU seconds consumed for
each session as a separate row in DBQLSummaryTbl every 10 minutes. The
DBQLSummaryTbl will also contain I/O use and other query identifying information.
IMPORTANT NOTE: DBQL Threshold and Summary logging cause you to lose the ability to
log SQL and OBJECTS. Threshold and Summary logging are recommended only after the
workload is appropriately profiled using DBQL Detail logging data. Further, threshold logging
should be used in limited circumstances. Detail logging data, with SQL and OBJECTS, is
typically desired to ensure a full picture is gathered for analysis of queries performing outside
expected norms.
DBQL can be enabled on Teradata user names in place of accountstrings. This is typically done
when measuring performance tests run under a specific set of known user logins. DBQL logging
by accountstrings is the more flexible, production-oriented approach.
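For example, to apply the recommended Detail logging to a specific test login (the user name below
is hypothetical), the statement names the user instead of an account:
-- hypothetical user name
BEGIN QUERY LOGGING WITH SQL, OBJECTS LIMIT SQLTEXT=0 ON perf_test_user;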
If a maintenance process is required for storing DBQL data historically, the Download Center on
Teradata.com provides a DBQL Setup and Maintenance document with the table definitions,
macros, etc. to accomplish the data copy and the historical storage process. This is generally
implemented when it is decided that DBQL data should not be in the main or root Teradata DBC
database for more than a day. The DBQL data maintenance process from Teradata.com can be
implemented in under an hour.
Two additional logging statements are recommended in addition to the recommended Detail and
Summary logging statements previously mentioned. These additional DBQL logging statements
are used temporarily on a database user such as SYSTEMFE to dump the DBQL data buffers
before query analysis or during the maintenance process. Examples of this statement are:
BEGIN QUERY LOGGING ON SYSTEMFE;
END QUERY LOGGING ON SYSTEMFE;
DBQL buffers can retain data up to 10 minutes after a workload runs. On a lightly used Teradata
system, DBQL buffers may not flush for hours or days. Any DBQL configuration change causes
all DBQL data to be written to the DBQL tables. Use of these two additional statements ensures
that all data is available before an analysis of DBQL data is performed.
DBQL enhancements in Teradata version 14.10
• Descriptions
> Enhancements to provide more accurate and complete performance data. i.e.
collecting resource usage data for AMP steps
> Include CPU & IO usage all the way through an aborted query step (top requested
item from the PAC).
> Enhance DBQL logging for parallel AMP steps.
> Provide a new DBQL table to enable the logging of Database Lock information
– Enabler for capturing and storing database Lock information in a DBQL
table
• Benefit
> Provides a more powerful query analysis tool for developing, debugging, and
optimizing queries.
A new Database Query Logging (DBQL) option, called UTILITYINFO, captures information
for load, export, and Data Stream Architecture (DSA) utilities at the job level.
• New DBC DBQL table - DBQLUtilityTbl
• Supported utilities:
> FastLoad Protocol: FastLoad, TPT Load, and JDBC FastLoad
> MLOAD Protocol: MultiLoad and TPT Update
> MLOADX Protocol: TPT Update operator
> FastExport Protocol: FastExport, TPT Export, and JDBC FastExport
> DSA
> Example:
> BEGIN QUERY LOGGING WITH UTILITYINFO ON USER1;
– The UTILITYINFO option is turned off by default
DBQL enhancements in Teradata version 15.00
DBQL’s Parameterized Query Logging is a new logging feature added in Teradata release 15.0.
With earlier versions of Teradata, DBQL did not have the capability to collect the values of
the parameters for a parameterized query. With the PQL feature, parameter information along with
values is captured into a new DBQL table, DBC.DBQLParamTbl. This table holds all the
necessary parameter data in a BLOB column. This BLOB column can then be converted into a
JSON document using the internal fast path system function TD_SYSFNLIB.TD_DBQLParam().
This feature can be enabled for all users with the following statement.
Begin Query Logging with ParamInfo on all;
Now that the parameter values are captured, the same query can be replayed with the values in
place of the parameters. This will help customers isolate problem values supplied to parameters.
DBQL Queries
User Activity Query :
SELECT UserName (TITLE 'User', FORMAT 'X(8)'),
SessionID (TITLE 'Session', FORMAT 'ZZZZZZZ9'),
QueryID (TITLE 'Query//ID', FORMAT 'ZZZZZ9'),
StartTime (TITLE 'Start//Time', FORMAT 'HH:MI:SS'),
FirstRespTime (TITLE 'Response//Time', FORMAT 'HH:MI:SS'),
TotalIOCount (TITLE 'IO//Count', FORMAT 'ZZ9'),
AMPCPUTime (TITLE 'CPU', FORMAT 'ZZ9.99')
FROM DBC.DBQLogTbl
WHERE UserName='Tester'
ORDER BY 1,2,3;
Query Analysis:
select
queryid,
FirstStepTime,
EstResultRows (format 'zzz,zzz,zzz,zz9'),
NumResultRows(format 'zzz,zzz,zzz,zz9'),
EstProcTime (format 'zzz,zzz,zz9.999'),
AMPCPUTime (format 'zzz,zzz,zz9.999'),
TotalIOCount(format 'zzz,zzz,zzz,zz9'),
NumOfActiveAMPs(format 'zzz,zz9'),
MaxAmpCPUTime(format 'zzz,zzz,zz9.999'),
MinAmpCPUTime(format 'zzz,zzz,zz9.999'),
MaxAmpIO (format 'zzz,zzz,zzz,zz9'),
MinAmpIO (format 'zzz,zzz,zzz,zz9')
from dbc.dbqlogtbl
where username ='xxxx'
and cast(FirstStepTime as date) = '2013-02-13'
and NumOfActiveAMPs <> 0
and AMPCPUTime > 100
order by Firststeptime;
Errors that result in an abort and possibly a rollback:
sel distinct errorcode (format 'zzzzz9'),
errortext
from dbc.dbqlogtbl
where cast (collecttimestamp as date) = date
and errorcode in (2631, 2646, 2801, 3110, 3134, 3514, 3535, 3577) order by 1;
Top 10 (CPU):
SELECT StartTime
,QueryID
,Username
,StatementType
,AMPCPUTime
,rank () over (
order by AMPCPUTime DESC) as ranking
from DBC.QryLog
where cast(CollectTimeStamp as date) = date
qualify ranking <=10;
Section 2.14 Administrative Tips
The DBC.* views whose names end with "X" are all designed to show only what the current user
is allowed to see.
To list all databases that have tables that you can access:
select distinct databasename from dbc.tablesvX order by 1;
To list all the tables for these databases that you can access:
select databasename,tablename from dbc.tablesX order by 1,2;
Collations can be defined at the USER (modify user collation=x) or SESSION (set session
collation x) levels. Set session takes precedence. Each type of collation can be discovered by using
a query such as
SELECT CharType FROM DBC.ColumnsX WHERE ...
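For example (the user name is hypothetical), the two levels are set as follows, with the
session-level setting taking precedence:
-- hypothetical user name
MODIFY USER test_user AS COLLATION = ASCII;
SET SESSION COLLATION ASCII;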
If Referential Integrity is defined on the tables, you can get a list of the relationships with a query
like this:
select
trim(ParentDB) || '.' || trim(ParentTable) || '.' || trim(ParentKeyColumn) (char(32)) "Parent"
, trim(ChildDB) || '.' || trim(ChildTable) || '.' || trim(ChildKeyColumn) (char(32)) "Child"
from DBC.All_RI_ParentsX
order by IndexName
;
If you want to start with a particular parent table and build a hierarchical list, you might try this
recursive query:
with recursive RI_Hier( ParentDB, ParentTable ,ChildDB, ChildTable ,level
) as (
select ParentDB, ParentTable, ChildDB, ChildTable, 1
from DBC.All_RI_ParentsX
where ParentDB = <Parent-Database-Name> and ParentTable = <Parent-Table-Name>
union all
select child.ParentDB, child.ParentTable, child.ChildDB, child.ChildTable, RI_Hier.level+1
from RI_Hier
,DBC.All_RI_ParentsX child
where RI_Hier.ChildDB = child.ParentDB
and RI_Hier.ChildTable = child.ParentTable
)
select
trim(ParentDB) || '.' || trim(ParentTable) "Parent"
,trim(ChildDB) || '.' || trim(ChildTable) "Child"
,level
from RI_Hier
order by level, Parent, Child
;
There are a number of session-specific diagnostic features which can be very helpful under
specific situations. Use these by executing the statement(s) below in a query window prior to
diagnosing the query of interest. Note that these settings are only active for the current session.
When the session is terminated, the session parameter is cleared.
a)
DIAGNOSTIC HELPSTATS ON (NOT ON) FOR SESSION;
Using the EXPLAIN feature on a query in conjunction with the above session parameter
provides the user with statistics recommendations for the given query. While the list can be very
beneficial in helping identify missing statistics, not all of the recommendations may be
appropriate. Each of the recommended statistics should be evaluated individually for usefulness.
Using the example below, note that the optimizer is suggesting that statistics on columns
DatabaseName and TVMName would likely result in higher confidence factors in the query
plan. This is due to those columns being used in the WHERE condition of the query.
Query:
SELECT DatabaseName,TVMName,TableKind
FROM dbc.TVM T
,dbc.dbase D
WHERE
D.DatabaseId=T.DatabaseId
AND DatabaseName='?DBName'
AND TVMName='?TableName'
ORDER
BY 1,2;
Results (truncated):
BEGIN RECOMMENDED STATS ->
• "COLLECT STATISTICS dbc.dbase COLUMN (DATABASENAME)". (HighConf)
• "COLLECT STATISTICS dbc.TVM COLUMN (TVMNAME)". (HighConf)
<- END RECOMMENDED STATS
b)
DIAGNOSTIC VERBOSEEXPLAIN ON (NOT ON) FOR SESSION;
Teradata’s Verbose Explain feature provides additional query plan information above and
beyond that shown when using the regular Explain function. Specifically, more detailed
information regarding spool usage, hash distribution and join criteria are presented. Use the
above session parameter in conjunction with the EXPLAIN feature on a query of interest.
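A typical usage pattern is to turn the setting on and then EXPLAIN the query of interest in the same
session, for example:
DIAGNOSTIC VERBOSEEXPLAIN ON FOR SESSION;
EXPLAIN
SELECT DatabaseName, TVMName, TableKind
FROM dbc.TVM T, dbc.dbase D
WHERE D.DatabaseId = T.DatabaseId;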
3. Workload Management
Section 3.1 Workload Administration
Workload Management in Teradata is used to control system resource allocation to the various
workloads on the system. After installation of a Teradata partner application at a customer site,
Teradata has a number of tools used for workload administration. Teradata Active System
Management (TASM) is a grouping of products, including system tables and logs, that interact
with each other and a common data source. TASM consists of a number of products: Teradata
Workload Analyzer, Viewpoint Workload Monitor/Health, and Viewpoint Workload Designer.
TASM also includes features to capture and analyze Resource Usage and Teradata Database
Query Log (DBQL) statistics. There are a number of orange books, Teradata magazine articles,
and white papers addressing the capabilities of TASM.
Perhaps the best source of information for TASM is the Teradata University website at
https://university.teradata.com. There are a number of online courses and webcasts available on
the Teradata University site which offer a wealth of information on TASM and its component
products.
The Teradata Viewpoint portal is a framework where Web-based applications, known as
portlets, are displayed. IT professionals and business users can customize their portlets to
manage and monitor their Teradata systems using a Web browser.
Portlets enable users across an enterprise to customize tasks and display options to their
specific business needs. You can view current data, run queries, and make timely business
decisions, reducing the database administrator workload by allowing you to manage your
work independently. Portlets are added to a portal page using the Add Content screen. The
Teradata Viewpoint Administrator configures access to portlets based on your role.
4. Migrating to Teradata
Section 4.1 Utilities and Client Access
Teradata offers a complete set of tools and utilities that exploit the power of Teradata for
building, accessing, managing, and protecting the data warehouse. Teradata’s data acquisition
and integration (load and unload) tools are typically used by partners in the ETL space while
partners in the Business Intelligence and EAI spaces use the connectivity and interface tools
(ODBC, JDBC, CLIv2, .NET, OLE DB). For this discussion, we will focus primarily on the
“Load & Unload tools” and the “Connectivity and Interface Tools.”
One common reason that Partners want to integrate their products with Teradata is Teradata’s
ability to more efficiently work with, and process, larger amounts of data than the other
databases that the Partner is accustomed to working with. With this in mind, Partners should
consider the advantages and flexibility offered by the Teradata Parallel Transporter (and the TPT
API), which provides the greatest flexibility and throughput capability of all the Load/Unload
products.
Another relatively new option to consider for an ELT architecture is the ANSI Merge in
combination with NoPI tables, available as of Teradata Version 13.0. The ANSI Merge offers many
of the capabilities of the Teradata utilities (error tables, block-at-a-time optimization, etc.), and
the FastLoad into the NoPI target table is up to 50% faster than the FastLoad into a target table
with a Primary Index.
Both of these options are discussed in this section.
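As a minimal sketch of that pattern (the Sales and Sales_Stage tables and their columns are
hypothetical, and Sale_ID is assumed to be the primary index of the target), the staging table is
created without a primary index, loaded with the FastLoad protocol, and then applied to the target
with an ANSI MERGE:
-- hypothetical staging table; NO PRIMARY INDEX makes it a NoPI table
CREATE TABLE Sales_Stage
( Sale_ID  INTEGER,
  Store_ID INTEGER,
  Amount   DECIMAL(18,2) )
NO PRIMARY INDEX;

-- after FastLoading Sales_Stage, apply the rows to the target table
MERGE INTO Sales AS t
USING Sales_Stage AS s
  ON t.Sale_ID = s.Sale_ID
WHEN MATCHED THEN
  UPDATE SET Amount = s.Amount
WHEN NOT MATCHED THEN
  INSERT (Sale_ID, Store_ID, Amount)
  VALUES (s.Sale_ID, s.Store_ID, s.Amount);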
Many other options are presented here, and it is important to consider all of them to find the
methods that are best for a particular Partner’s needs.
Section 4.1.1 Teradata Load/Unload Protocols & Products
• SQL Protocol - Used for small amounts of data or continuous stream feeds.
• Use SQL protocol for
> DDL operations
– Use industry standard open APIs, BTEQ.
> Loading/unloading small amounts of data
– Use open APIs, BTEQ, and/or TPump protocol.
> Sending Insert/Select or mass update/delete/merge for an ELT scenario
– Use open APIs or BTEQ.
> Continuous data loading – use TPump protocol.
• Industry standard open APIs for SQL protocol
> Teradata ODBC Driver
> Teradata OLE DB Provider
> Teradata .Net Data Provider
> Teradata JDBC Driver
• Teradata tools that use SQL protocol
> TPump protocol
– TPump product or Parallel Transporter Stream Operator.
– TPump product - script-driven batch tool.
– Parallel Transporter Stream – use with TPT API or scripts.
– Optimizations for statement and data buffering along with reduced row locking
contention and SQL statement cache reuse on the Teradata Database.
– Indexes are retained and updated
> BTEQ
– Batch SQL and report writer tool that has basic import/export.
> Preprocessor2
– Used for embedding SQL in C application programs.
> CLIv2
– Lowest level API but not recommended due to added complexity
without much performance gain over using higher level interface
– Note: CLIv2 is used by Teradata tools like BTEQ, Teradata load tools
etc. One can run different protocols using CLIv2 (e.g., SQL, ARC for
backup, FastLoad, MultiLoad, FastExport, etc.) Only SQL protocol is
published.
• Bulk Data Load/Unload Protocols
FastLoad, MultiLoad, FastExport, and TPump are load protocols which can be executed
by stand-alone tools or Teradata Parallel Transporter.
– FastLoad Protocol – Bulk loading of empty tables.
– MultiLoad Protocol – Bulk Insert, Update, Upsert, & Delete.
– FastExport Protocol – Export data out of Teradata DB.
– TPump – SQL application for continuous loading uses a pure SQL interface.
TPump includes the best knowledge of how to write a high-performance load tool
using SQL Inserts – including multi-statement requests, buffering of data,
checkpoint/restart, etc. The Teradata Database does not know that the SQL Inserts
are coming from TPump, they are just like any other SQL requests.
> Load/Unload Products which use the load protocols
– Stand-alone Utilities (original tools)
• FastLoad, MultiLoad, TPump, FastExport.
• Separate tools & languages, script interface only.
– JDBC
• FastLoad & FastExport protocol for pure Java applications. This is
for applications that are pure Java and don’t want to wrap the TPT
API C++ interface with JNDI wrappers for performance and
portability reasons. JDBC Parameter Arrays can map to Teradata
iterated requests/FastLoad. Given an unconstrained network, JDBC
FastLoad may be three to 10 times faster than the corresponding
SQL PreparedStatement batched insert. FastLoad can be
transparently enabled via connection URL setting.
– Teradata Parallel Transporter (improved tools).
• Execute all load/unload protocols in one product with one scripting
language.
• Plug-in Operators: Load, Update, Stream, Export.
• Provides C++ API to protocols for ISV partners.
Section 4.1.2 Input Data Sources with Scripting Tools
• Flat files on disk and tape.
• Named Pipes.
• Access Modules
> Plug-ins to Teradata Load Tools to access various data sources.
> OLE DB Access Module to read data from any OLE DB Provider (e.g., all
databases).
> JMS Access Module to read data from JMS queues.
> WebSphere MQ Module to read data from MQ queues.
> Named Pipe Access Module to add checkpoint/restart to pipes.
> Custom – write your own to get data from anywhere.
• Teradata Parallel Transporter Operators
> ODBC Operator - reads data from any ODBC Driver (e.g., all databases).
Access Modules Example
Plug-in OLE DB Access Module to read data from any database
[Diagram: an Oracle database is read through the OLE DB Access Module by a Teradata load tool,
which loads the data into the Teradata Database.]
Section 4.1.3 Teradata Parallel Transporter
What is Teradata Parallel Transporter?
• Parallel Transporter is the new generation of Teradata Load/Unload utilities. It is the new
version of FastLoad, MultiLoad, TPump, and FastExport. Those load protocols were
combined into one tool and new features added.
• The FastLoad, FastExport, and MultiLoad protocols are client/server, there is code that
runs on the client and code that runs on the server. The Teradata Database server code is
unchanged, and the client part of the code was re-written.
• If you already know the legacy Stand-alone load tools, everything you have learned about
the four load tools still applies as far as features, parameters, limitations (e.g., number of
concurrent load jobs), etc. There is no learning curve for the protocols, just learn the new
language and the new features.
• Why was the client code rewritten? Benefits are one tool with one language, performance
improvement on large loads with parallel load streams when I/O is a bottleneck, ease of
use, & the TPT API (for better integration with partner load tools, no landing of
data, and parallel load streams).
• Most everything about load tools still applies
> Similar basic features, parameters, limitations (e.g., number of concurrent load
jobs), when to use, etc.
• Parallel Transporter performs the tasks of FastLoad, MultiLoad, TPump, and FastExport. In
addition to these functions it also provides:
– Common scripting language across all processes. This simplifies the writing of scripts and
makes it easier to do tasks such as “Export from this database, load into this database” in a
single script.
– Full parallelism on the client side (where the FastLoad and MultiLoad run). We now create
and use multiple threads on the client side allowing ETL tools to fan out in parallel on the
multi-CPU SMP box they run on.
– API for connecting ETL tools straight to the TPT Load/Unload Operators. This will simplify
integration with these tools and improve performance in some cases.
• Wizard that generates scripts from prompts for learning the script language –
supports only small subset of Parallel Transporter features.
• Command line interface (known as Easy Loader interface) that allows one to
create load jobs with a single command line. Supports a subset of features.
• With Teradata release 14.10, the MultiLoad protocol on the Teradata Database has been
extended. This extension is known as Extended MLOAD or MLOADX. The new
extension is implemented only for the Parallel Transporter Update Operator. At
execution time, the extension converts the TPT Update Operator script into an ELT
process if the target table of the utility has any of the following objects: unique
secondary indexes, join indexes, referential integrity, hash indexes, or triggers; it also
supports temporal tables. This extension eliminates the need for the system administrator
to drop and recreate the aforementioned objects just so that MultiLoad may be used.
Conversion from ETL to ELT will happen automatically; no change to the utility or its
script is necessary.
Note: The names -- FastLoad, MultiLoad, FastExport, TPump – refer to load protocols and
not products. Anywhere those load protocols are mentioned, Teradata Parallel Transporter
(TPT) can be substituted to run the load/unload protocols.
How does it work?
Parallel Transporter Architecture
[Diagram: data sources (databases, files, message queues) are read by data source operators.
User-written scripts and ETL tools drive the TPT infrastructure through the script parser, while
custom programs use the direct API. The Load, Update, Export, and Stream operators move the data
into the Teradata Database.]
Increased Throughput
[Diagram: a traditional utility job (one job per source, or one source at a time, read through
InMods or Access Modules into a single load utility and then into the Teradata Database) compared
with a Parallel Transporter job, in which multiple Read, Transform, and Load Operators process
Source1, Source2, and Source3 in parallel into the Teradata Database.]
• Traditional utilities are shown on the left and Parallel Transporter on the right.
• On the left, you must concatenate the 3 files into 1 or run 3 jobs.
• Terminology: Producer (Read) Operators, Consumer (Write) Operators, Transform Operators
(user-written), and independent ones like the DDL Operator.
• Operators flow data through data streams.
• If I/O is the bottleneck, all three files can be read in parallel – this generally benefits the
FastLoad and MultiLoad protocols.
• If the load utility is the bottleneck and has pegged the CPU, it can be scaled.
• Optional user Transform Operators (write C++ for simple transforms).
• The picture on the right is one job (all in the rectangle noted by the thin line).
• It is still always more throughput to run multiple jobs. TPT makes one job run faster by
internally leveraging parallel processes. It looks like one load job to the DBS.
• Uses more resources (memory, CPU, etc.) to gain throughput.
• If I/O or CPU is not the bottleneck, then scaling can reduce throughput by having to manage
multiple processes for no gain.
• All processes run asynchronously and in parallel (overlapped I/O & loading of the DBS).
• Note that the bottleneck with a TPump/Stream Operator job is usually Teradata (use Priority
Scheduler, etc.) and not the I/O.
• The TPump protocol can benefit in a scenario of sending multiple input files to a single
TPump job versus multiple TPump jobs, reducing table locking.
• Best performance is scalable performance. This is a scalable solution on the client to
match the scalability on the Teradata Database. Parallel processes read parallel input
streams, parallel load processes communicate with the Teradata Database through parallel
sessions (scaling across network bandwidth), and parallel PEs read data while parallel AMPs
apply data to target tables.
TPT Operators
Much of the functionality is provided by scalable, re-usable components called Operators:
• Operators can be combined to perform desired operations (load, update, filter, etc.).
• Operators & infrastructure are controlled through metadata.
• Scalable Data Connector – reads/writes data in parallel from/to files and other sources
(e.g. Named Pipes, TPT Filter Operator, etc.).
• Export, SQL Selector - read data from Teradata tables (Export uses Fast Export protocol
& Selector is similar to BTEQ Export).
• Load - inserts data into empty Teradata tables (FastLoad protocol).
• SQL Inserter - inserts data into existing tables (similar to BTEQ Import using SQL).
• Update - inserts, updates, deletes data in existing tables (uses MultiLoad protocol).
Starting with Teradata release 15.0, the TPT Update Operator can load LOBs.
• Stream - inserts, updates, deletes data in existing tables in continuous mode.
• Infrastructure reads script, parses, creates parallel plan, optimizes, launches, monitors,
and handles checkpoint/restart.
TPT Operators include:
> Load – FastLoad protocol.
> Update – MultiLoad protocol.
> Stream – TPump protocol.
> Export – FastExport protocol.
> SQL Inserter. Loads data, including large objects (LOBs), into a new or an
existing table using a single SQL protocol session.
> SQL Selector. Extracts data, including LOBs, from an existing table using a
single SQL protocol session.
> Open Database Connectivity (ODBC). Extracts data from external third-party
ODBC sources.
> Data Connector. Supports simultaneous, parallel reading of multiple data sources,
such as various types of files or queuing systems; also allows writing to external
data sources.
FastLoad, MultiLoad, and FastExport protocols are client/server protocols with a program that
runs on the client box talking with a proprietary interface to a program on the Teradata Database.
These closed, undocumented interfaces have been opened up with the Teradata Parallel
Transporter API.
With the FastLoad & MultiLoad protocols, there are two phases, acquisition and apply. In the
acquisition phase the client reads data as fast as it can and sends it to the database which puts the
data into temp tables. When the data read is exhausted, the client signals the database to start the
apply phase where the data in the temp tables is redistributed to the target tables in parallel.
Teradata Active System Management tools can throttle the Teradata load protocols (MultiLoad, FastLoad, & FastExport);
TPump must be treated like any other SQL application.
TPT API
TPT API is the interface for programmatically running the Teradata load protocols.
Specifically: FastLoad, MultiLoad, FastExport, and TPump. It enhances partnering with 3rd
party tools by:
 Proprietary load protocols become open
 Partners integrate faster and easier
 Partner tool has more control over entire load process
 Increased performance
TPT API is used by more than ETL vendors – BI vendors use the Export Operator to pull data
from the Teradata Database (TDAT) into their tools.
Not all vendors use API. Script interface has its place (e.g., TDE) & non-parallel tools can
create parallel streams using named pipes and the script version of TPT which has a parallel
infrastructure to read the multiple input files in parallel.
In addition, TPT API has the following characteristics:
• Opens up previously closed, proprietary load protocols.
• Partners integrate faster & easier with the C++ API than by generating script language.
• Can flow parallel streams of data into the load tools.
• No landing of data before loading.
• Partner tool gets complete control over the load process.
Parallel Transporter - Using API
[Figure: an application or ETL program uses the Direct API to read from data sources (Oracle, flat files, etc.) and drive the Load, Update, Export, and Stream operators into the Teradata Database.]
Integration with API
Before API – Integration with Scripting
[Figure: scripting-based integration — the ETL tool builds and writes a FastLoad script, writes source data to a file or named pipe, invokes the utility, and the utility reads the script, reads the data, loads the Teradata Database, and writes messages that the ETL tool must read back.]
1. Vendor ETL tool creates Teradata utility script & writes to file.
2. Vendor ETL tool reads source data & writes to intermediate file (lowers performance to
land data in intermediate file).
3. Vendor invokes Teradata utility (Teradata tool doesn’t know vendor tool called it).
4. Teradata tool reads script, reads file, loads data, writes messages to file.
5. Vendor ETL tool reads messages file and searches for errors, etc.
• Before the API, it was necessary to generate multiple script languages depending on
the tool, land data in a file or named pipe, and post-process error messages from a file.
With API - Integration Example
[Figure: with the API, the ETL tool passes FastLoad parameters and data buffers directly to the load protocol functions, which load the data into the Teradata Database and pass return codes and error messages back to the caller.]
• Vendor ETL tool passes script parameters to the API.
• Vendor ETL tool reads source data & passes data buffers to the API.
• Teradata tool loads the data and passes return codes and messages back to the caller.
 Using TPT API, the ETL tool reads data into a buffer and passes the buffer to an Operator
through the API (no landing of the data).
 Return codes and statistics are obtained through function calls in the API.
 No generating multiple script languages.
 The ETL tool has control over the load process (e.g., the ETL tool determines when a
checkpoint is kicked off instead of it being specified in the load script).
 The ETL tool can dynamically choose an operator (e.g. UPDATE vs. STREAM).
 Simpler, faster, higher performance.
Features and Benefits of TPT
Benefits
Three main benefits are performance (parallel input load streams), various ease of use
features, and the TPT API for ETL vendor integration.
• Performance – Improved Throughput
> Scalable performance on client load server with parallel instances and parallel
load streams.
> ETL vendors can scale load job across load servers.
> No landing the data with TPT API – in memory buffers.
• Ease of Use – When Using Scripts
Ease of use features apply when you are writing scripts. If you use an ETL vendor, then
the ETL tool will either call the TPT API or generate the appropriate script.
> Less time
– One tool & one scripting language.
– Easier to switch between load protocols.
– Unlimited symbol substitution.
– Load multiple input sources & combine data from dissimilar sources.
– Multiple job steps.
> Fewer scripts required
– Automatically load all files in a directory.
– Reduction of number of scripts.
– Teradata to Teradata export and load scenario.
– Don’t have to write code to generate scripts in multiple languages for
multiple tools – just pass buffers in memory
> Wizard to aid first-time script building.
• Improved ISV partner tool integration via Direct API
> Proprietary load protocols become open.
> 3rd Party partners integrate faster and easier.
> Partner-written programs can directly call load protocols.
In addition, TPT also has following benefits:
• Less training is involved to learn one scripting language.
• Symbol substitution use cases:
1. Supply test names at run time during testing and supply production names at run time
in production – the script doesn’t change.
2. Define multiple load operators in the script and specify at run time which load protocol
to use (easier to switch between protocols).
• Can load multiple input sources & sources can be completely different media and
completely different file sizes (selected data must be union compatible).
• Multiple job steps (e.g., DDL Operator to delete/allocate table followed by load step).
• Load all files in a directory and export from Teradata and load Teradata in one job.
• Wizard is for simple scripts only; not all features are included in the Wizard, and it is not
meant to be, or to replace, a 3rd party ETL tool.
• When you install TPT there is a “samples” directory; a good way to start is to grab a script
from there, modify it, and run it.
• TPT Operators run in memory as part of the ETL tool’s address space rather than
asynchronously without each other’s knowledge.
• Checkpoints are initiated by ETL tool and not by Teradata load tool.
• ETL tool gets return codes and operational metadata from function calls rather than
parsing output files.
• Application program (e.g., ETL tool) reads the source data and calls the API to pass:
1. Parameters based on the load protocol
2. Data
3. Get messages
• Only the four load/unload protocols are available through TPT API. No access modules
are available, since ETL vendors have the functionality of the Teradata access modules in
their products.
The following two diagrams depict the advantages of TPT over Stand-alone tools:
Stand-Alone Tools: No Parallel Input Streams
[Figure: the ETL tool launches multiple instances that read/transform data (e.g., from Oracle) in parallel, but must bring the parallel streams back into one file or pipe, because stand-alone FastLoad can read only one input stream into Teradata.]
• Stand-alone load tools can only read one input source
Parallel Input Streams Using API
[Figure: the ETL tool launches multiple instances that read/transform data (e.g., from Oracle) in parallel, and each instance passes its stream through the TPT API to a TPT Load operator instance, so Parallel Transporter reads the parallel streams directly into Teradata.]
• Application program (e.g., ETL tool) reads the source data and calls the API to pass:
1. Parameters to the load Operator based on the load protocol (e.g., number of
sessions to connect to Teradata, etc.)
2. Data
3. Function calls to get messages and statistics
ETL tool can flow parallel streams of data through TPT API to gain throughput for
large data loads.
Section 4.1.4 Restrictions & Other Techniques
• Some bulk loading restrictions (see reference manuals)
> FastLoad & MultiLoad – no join indexes, no foreign key references, no LOBs
> FastLoad – target table may have only a primary index (no secondary indexes)
> MultiLoad – no USIs
> Due to restrictions, engineering emphasis on ELT and improving Insert/Select,
ANSI merge, etc.
> ARC can’t move from new release to older release
• Other data movement techniques
> Use of UDFs (e.g., table functions), stored procedures, triggers, etc.
> Data Mover product is a shell on top of TPT API, ARC, and JDBC
Other data movement protocols
 Teradata Unity Data Mover - Data Mover enables you to define and edit jobs that copy
specified database objects, such as tables, users, views, macros, stored procedures, and statistics,
from one Teradata Database system to another Teradata Database system. Tables can be copied
between Teradata systems and Aster or Hadoop systems.
• Teradata Migration Accelerator (TMA) – TMA is a professional services tool that’s
designed to assist with migration projects by automating many of the common tasks
associated with them. Currently TMA supports Oracle, DB2 and SQL Server
migrations.
• ARC – Archive and restore.
> Moves data at the internal block level.
> Format only understood by the Teradata dump and restore tool so no data
transformation can be done.
> In addition to data protection, it is the fastest protocol for moving data from
Teradata to Teradata (e.g., upgrading machines).
> ARC Products:
– Backup with ISV products (NetVault, NetBackup, Tivoli).
– NPARC service uses ARC for system upgrades.
• Teradata Replication – changed data capture API.
Note: Teradata Replication Services using Golden Gate are not supported with Teradata
Software Release 15.0 and above.
> Currently undocumented, under development, may change.
> Intent is to document the API when it is finalized.
> GoldenGate is only company that currently uses CDC API.
> Lower data volumes than bulk loading protocols.
Summary: Products & Protocols
59
Load
Protocol ---->
Product
TPT
BTEQ
ODBC driver
JDBC driver
OLE DB provider
.NET Data Provider
TPump, TPT Stream
FastLoad, TPT Load
MultiLoad, TPT Update
FastExport, TPT Export
BAR & ARC scripts
Replication Services
(GoldenGate)
TDM (calls TPT API &
ARC)
Teradata Unity
SQL (Insert,
Update, etc.) FastLoad MultiLoad Tpump FastExport ARC
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
CDC (changed
data capture API)
for pulling data out
only
X
X
X
X
X
60
X
X
X
X
X
X
Section 4.2 Load Strategies & Architectural Options
Section 4.2.1 ETL Architectural Options
• Run utilities where source data is located
> Utilities run on mainframe, Windows, and UNIX.
> Mainframe is direct channel attached to Teradata.
• Load Server
> Dedicated server with high-speed connect to Teradata Database.
> Pull data from various platforms & sources
– ODBC databases, named pipes, JMS queues, Websphere MQ, etc.
– Use ETL tool to source data.
• Do ETL with:
> Teradata Utilities and/or Teradata Parallel Transporter.
> 3rd Party Products
– Vendor products work in conjunction with Teradata client software
(CLIv2, ODBC, MultiLoad, FastLoad, Teradata Parallel Transporter, etc.).
> Custom applications
– Utilize CLIv2, ODBC, JDBC, etc.
> Sometimes, a combination of the above.
• Extract, Load, & Transform (ELT) and ANSI Merge
> Load raw data (FastLoad protocol) to staging table and use SQL Insert/Select to
do transforms & load target table
– Some ISV ETL tools generate post-load SQL
– Recommended when pulling large amounts of data from Teradata &
reloading Teradata
– ANSI Merge is an option as of Teradata Version 12.0 and is much more
efficient than an insert/update scenario
– ANSI Merge in combination with NoPI tables is an option as of
Teradata Version 13.0
The ELT approach can also take advantage of the SQL bulk load operations that are available
within the Teradata Database. These operations not only support MERGE-INTO but also
enhance INSERT-SELECT and UPDATE-FROM. This enables primary, fallback and index data
processing with block-at-a-time optimization.
The Teradata bulk load operations also allow users to define their own error tables to handle
errors from operations on target tables. These are separate and different from the Update
operator’s error tables. Furthermore, the no primary index (NoPI) table feature also extends the
bulk load capabilities. By allowing NoPI tables, Teradata can load a staging table faster and
more efficiently.
Merge is ANSI-standard SQL syntax that can perform bulk operations on tables using the
extract, load and transform (ELT) approach. These operations merge data from one source table
into a target table for performing massive inserts, updates and upserts. So why use the merge
function instead of an insert-select, or when an update join will suffice? Better performance and
the added functionality of executing a bulk, SQL-based upsert.
Beginning with Teradata 13, the FastLoad target table can be a “No-PI” table. This type of table
will load data faster because it avoids the redistribution and sorting steps, but this only postpones
what eventually must be done during the merge process. The “merge target table” can have a
predefined error table assigned to it for trapping certain kinds of failures during the merge
process. There can be up to a 50% performance improvement when NoPI tables are used in a
FastLoad loading scenario.
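To make the staging-plus-merge pattern above concrete, the following is a minimal sketch rather than a definitive implementation: the table and column names (stg_orders, orders, order_id, and so on) are hypothetical, the target table is assumed to have PRIMARY INDEX (order_id) because MERGE requires an equality condition on the target's primary index in the ON clause, and the error-logging clause assumes an error table has been created for the target.

/* Hypothetical NoPI staging table -- rows land without hash redistribution or sorting. */
CREATE MULTISET TABLE stg_orders
( order_id   INTEGER
, order_amt  DECIMAL(18,2)
, order_date DATE
) NO PRIMARY INDEX;

/* Optional user-defined error table for the MERGE target. */
CREATE ERROR TABLE FOR orders;

/* After the bulk load (FastLoad/TPT Load) into stg_orders, apply the changes set-wise. */
MERGE INTO orders AS tgt
USING stg_orders AS src
  ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE
  SET order_amt  = src.order_amt
    , order_date = src.order_date
WHEN NOT MATCHED THEN INSERT
  (order_id, order_amt, order_date)
  VALUES (src.order_id, src.order_amt, src.order_date)
LOGGING ALL ERRORS WITH NO LIMIT;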
• Raw Data Movement for System Migrations
> Use the NPARC (Named Pipe ARC) service, which uses the ARC protocol in custom shell
scripts
> System upgrades require one-time movement of entire system
Section 4.2.2 ISV ETL Tool Advantages vs. Teradata Tools
• GUI interface
> Data workflows designed easily.
• Numerous data sources supported
> Drop down windows show multiple choices for data sources.
• Numerous data transformations on client box
> Numerous data sorting options.
> Data cleansing.
• Some generate SQL for ELT
> Generates SQL used to transform data in parallel Teradata Database engine.
• Central metadata repository for entire ETL process.
Section 4.2.3 Load Strategies
• Periodic Batch
> Overnight or other reasonable time slot
– ELT - FastLoad to staging table then Insert/Select to target.
– MultiLoad/TPump target table using Insert, Update, Upserts
– TPump protocol – row/hash lock.
– MultiLoad – Insert, Update, Upsert, Delete.
> Frequent ‘Mini’ Batch
– Stage for seconds or Minutes, then do Insert/Select.
– Use rotating batch jobs.
• Continuous – TPump protocol
> Data available immediately - row/hash lock.
• Changed data capture
> GoldenGate CDC interface to capture data from Teradata & copy to Teradata or
non-Teradata system.
> ISV tools - CDC data from other databases & load Teradata.
Section 4.2.4 ETL Tool Integration
1. Generate scripts for Teradata Stand-alone tools
> FastLoad, MultiLoad, TPump, and FastExport.
2. Generate scripts for Parallel Transporter (pre-API)
> Operators: Load, Update, Stream, and Export.
3. Integrate with Parallel Transporter thru API
> C++ calls directly to Operators.
• All three methods run same load protocols.
• Advantage of #2 & #3
> Parallel Transporter allows parallel input streams.
• Advantage of #3 over #2
> Vendor can integrate faster and easier.
> No landing of the data.
> Gives vendor tool more control over load process.
• All major ETL tool vendors integrate well with Parallel Transporter. Create a data flow
diagram with the ETL tool GUI and the ETL tool will internally generate the appropriate
Parallel Transporter calls.
• Not all tools will benefit from the API. Non-parallel vendors can get parallelism by
writing to multiple files/pipes and invoking the script interface, where the TPT
infrastructure can read the data in parallel.
• ISV tools that don’t have the ability to access various data sources can take advantage of
the plug-in Access Modules (ODBC, Named Pipes, JMS, etc.) by using the TPT script
interface since the API only supports the load/unload Operators.
• Teradata Decision Experts uses the script version of TPT and leverages the ODBC
Operator to pull data from Oracle while using the Load Operator to load the Teradata
Database.
• Advantages of TPT API are:
> parallel load streams.
> not landing the data in a file or pipe.
> Increased command and control for ISV product (e.g., ISV product controls
the checkpoint intervals, etc.).
Section 4.3 Concurrency of Load and Unload Jobs
If you do not use the Teradata Viewpoint Workload Designer portlet to throttle concurrent load
and unload jobs, the MaxLoadTasks and MaxLoadAWT fields of the DBS Control record determine
the combined number of FastLoad, MultiLoad, and FastExport jobs that the system allows to run
concurrently. The default is 5 concurrent jobs.
If you have the System Throttle (Category 2) rule enabled, even if there is no Utility Throttle
defined, the maximum number of jobs is controlled by the Teradata dynamic workload
management software and the value in MaxLoadTasks field is ignored.
For more information on changing the concurrent job limit from the default value, see
"MaxLoadTasks" and "MaxLoadAWT" in the chapter on DBS Control in the Utilities manual.
Section 4.4 Load comparisons
Teradata recommends using one of the Teradata TPT load operators rather than ODBC or JDBC
for loading and/or extracting more than a few thousand rows of data from Teradata. As
mentioned in the previous section, the Teradata TPT load operators – specifically Load and
Update – were specifically designed for loading large volumes of data into Teradata as rapidly as
possible, while ODBC and JDBC were not. This is simply because Teradata, although
it scales very effectively along every dimension (data, users, query volumes, etc.), is orders of
magnitude different in performance from other RDBMSs for row-at-a-time processing, and any ISV
that wants to leverage the strengths of Teradata needs to understand this key point.
There can be as much as a 10x difference between TPT Load/Update and ODBC parameter array
inserts. Not utilizing ODBC parameter arrays (single row at a time) can make this as much as a
100x difference.
5. References
Section 5.1 SQL Examples
Preferred processing architecture
Teradata is a powerful relational database engine that can perform complex processing against
large volumes of data. The preferred data processing architecture for a Teradata solution is one
that would have business questions/problems passed to the database via complex SQL statements
as opposed to selecting data from the database for processing elsewhere.
The following examples are meant to provide food for thought in how the ISV would approach
the integration with Teradata.
Derived Tables
Description
A derived table is obtained from one or more other tables through the results of a query. How
derived tables are implemented determines how or if performance is enhanced or not. For
example, one way of optimizing a query is to use derived tables to control how data from
different tables is accessed in joins. The use of a derived table in a SELECT forces the subquery
to create a spool file, which then becomes the derived table. Derived tables can then be treated in
the same way as base tables. Using derived tables avoids CREATE and DROP TABLE
statements for storing retrieved information and can assist in optimizing joins. The scope of a
derived table is only visible to the level of the SELECT statement calling the subquery.
Example (Simple – Derived Table w/ AVG function)
The following SELECT statement displays those employees whose salary is below their
departmental average. 'WORKERS' in this example is the derived table. (result: 119 rows)
SELECT EMPNO, NAME, DEPTNO, SALARY
FROM (SELECT AVG(SALARY), DEPTNO
FROM EMPLOYEE
GROUP BY DEPTNO) AS WORKERS (AVERAGE_SALARY, DEPTNUM),
EMPLOYEE
WHERE SALARY < AVERAGE_SALARY AND
DEPTNUM = DEPTNO
ORDER BY DEPTNO, SALARY DESC;
Recursive SQL
Description
Recursive queries are used for hierarchies of data, such as Bill of Materials, organizational
structures (department, sub-department, etc.), routes, forums of discussions (posting, response
and response to response) and document hierarchies.
Example
The following selects a row for each child-parent, child-grandparent, etc. relationship in a
recursive table
WITH RECURSIVE TEMP_TABLE (CHILD, PARENT) AS
( SELECT ROOT.CHILD, ROOT.CHILD
FROM CHILD_PARENT_HIERARCHY ROOT
UNION ALL
SELECT H1.CHILD, H2.PARENT
FROM TEMP_TABLE H2, CHILD_PARENT_HIERARCHY H1
WHERE H2.CHILD = H1.PARENT
)
SELECT CHILD, PARENT FROM TEMP_TABLE ORDER BY 1,2;
Sub Queries
Description
Subqueries permit a more sophisticated and detailed query of a database through the use of nested SELECT
statements, and hence eliminate the return of intermediate result sets to the client. There are subqueries
used in search conditions, and correlated subqueries, in which the inner query references columns of tables in the
enclosing, or containing, outer query. The expression 'correlated subquery' comes from the
explicit requirement for the use of correlation names (table aliases) in any correlated subquery in
which the same table is referenced in both the inner and outer query.
Subqueries in Search Conditions
Example (Simple - w/ IN statement)
The following SELECT statement displays all the transactions which include the partkeys
involved in orderkey=1. (result: 184 rows)
SELECT T1.P_NAME, T1.P_MFGR
FROM PRODUCT T1, ITEM T2
WHERE T1.P_PARTKEY = T2.L_PARTKEY
AND T1.P_PARTKEY IN
(SELECT L_PARTKEY
FROM ITEM
WHERE L_ORDERKEY = 1);
Example (Simple - w/ 'operators')
The following SELECT statement displays the employees with the highest salary and the most
years of experience. (result: 1 row)
SELECT EMPNO, NAME, DEPTNO, SALARY, YRSEXP
FROM EMPLOYEE
WHERE (SALARY, YRSEXP) >= ALL
(SELECT SALARY, YRSEXP
FROM EMPLOYEE);
Example (Simple – w/ ‘operator’ and AVG function)
The following SELECT statement displays every employee in the Employee table with a salary
that is greater than the average salary of all employees in the table. (result: 497 rows)
SELECT NAME, DEPTNO, SALARY
FROM EMPLOYEE
WHERE SALARY >
(SELECT AVG(SALARY)
FROM EMPLOYEE)
ORDER BY NAME;
Example (Complex – w/ OLAP function and Having clause)
The following SELECT statement uses a nested OLAP function and a HAVING clause to display
those Partkeys that appear in the top 10 percent of profitability in more than 10 Orderkeys.
(result: 486 rows)
SELECT L_PARTKEY, COUNT(L_ORDERKEY)
FROM
(SELECT L_ORDERKEY, L_PARTKEY, PROFIT, (QUANTILE(10, PROFIT))
AS PERCENTILE
FROM
(SELECT L_ORDERKEY, L_PARTKEY, (SUM(L_EXTENDEDPRICE)
- COUNT(L_QUANTITY) * 5) AS PROFIT
FROM CONTRACT, ITEM
WHERE CONTRACT.O_ORDERKEY = ITEM.L_ORDERKEY
GROUP BY 1,2) AS ITEMPROFIT
GROUP BY L_ORDERKEY
QUALIFY PERCENTILE = 0) AS TOPTENPERCENT
GROUP BY L_PARTKEY
HAVING COUNT(L_ORDERKEY) >= 10;
Correlated Subqueries
Example (Simple – Correlated Subquery)
The following SELECT statement displays employees who have highest salary in each
department.
(result: 876 rows)
SELECT *
FROM EMPLOYEE AS T1
WHERE SALARY =
(SELECT MAX(SALARY)
FROM EMPLOYEE AS T2
WHERE T1.DEPTNO = T2.DEPTNO);
Case Statement
Description
The CASE expression is used to return alternative values based on search conditions. There are
two forms of the CASE Expression:
 Valued CASE Expression
- Specify a SINGLE expression to test (equality).
- List the possible values for the test expression that return different results.
- CASE value_expression_1 WHEN value_expression_n THEN scalar_expression_n ELSE scalar_expression_m END
(the result is either scalar_expression_n or scalar_expression_m)
 Searched CASE Expression
- You do not specify an expression to test. You specify multiple, arbitrary search
conditions that can return different results.
- CASE WHEN search_condition_n THEN scalar_expression_n ELSE scalar_expression_m END
Example (Simple - Valued CASE)
The following SELECT statement displays only the total Manufacturer#2 retail price. (result: 1 row)
SELECT SUM(CASE P_MFGR WHEN 'Manufacturer#2' THEN P_RETAILPRICE
ELSE 0 END)
FROM PRODUCT
Example (Complex - Searched CASE)
The following SELECT statement displays the top product type(s) that has a 25% or higher
return rate. (result: 66 rows)
SELECT T1.P_TYPE,
SUM(CASE WHEN (T2.L_RETURNFLAG = 'R')
THEN (T2.L_QUANTITY) ELSE (0) END) AS "RETURNED",
SUM(CASE WHEN (T2.L_RETURNFLAG <> 'R')
THEN (T2.L_QUANTITY) ELSE (0) END) AS "NOT RETURNED",
(SUM(CASE WHEN (T2.L_RETURNFLAG = 'R')
THEN (T2.L_QUANTITY) ELSE (0) END)) /
(
(SUM(CASE WHEN (T2.L_RETURNFLAG = 'R')
THEN (T2.L_QUANTITY) ELSE (0) END)) +
(SUM(CASE WHEN (T2.L_RETURNFLAG <> 'R')
THEN (T2.L_QUANTITY) ELSE (0) END))
) * 100 AS "% RETURNED"
FROM PRODUCT T1, ITEM T2
WHERE T2.L_PARTKEY = T1.P_PARTKEY
GROUP BY 1
HAVING ("RETURNED" / ("RETURNED" + "NOT RETURNED")) >= .25
ORDER BY 4 DESC, 1 ASC
Example (Advanced - Searched CASE)
The following SELECT statement displays a report to show Month-To-Date, Year-To-Date,
Rolling 365 day, and Inception to Date Metrics. (result: 1 row)
SELECT CURRENT_DATE,
SUM(CASE WHEN L_SHIPDATE >
(((((CURRENT_DATE (FORMAT 'YYYY'))(CHAR(4))) || '-' ||
TRIM((EXTRACT(MONTH FROM CURRENT_DATE)) (FORMAT '99')) || '-01')))(DATE)
AND L_SHIPDATE < CURRENT_DATE
THEN L_EXTENDEDPRICE ELSE 0 END) AS MTD,
SUM(CASE WHEN L_SHIPDATE >
(((((CURRENT_DATE (FORMAT 'YYYY'))(CHAR(4))) || (('-01-01'))) (DATE)))
AND L_SHIPDATE < CURRENT_DATE
THEN L_EXTENDEDPRICE ELSE 0 END) AS YTD,
SUM(CASE WHEN L_SHIPDATE > CURRENT_DATE - 365
THEN L_EXTENDEDPRICE ELSE 0 END) AS ROLLING365,
SUM(CASE WHEN L_SHIPDATE < CURRENT_DATE
THEN L_EXTENDEDPRICE ELSE 0 END) AS ITD
FROM ITEM;
Sum (SQL-99 Window Function)
Description
Returns the cumulative, group, or moving sum of a value_expression, depending on how the
aggregation group in the SUM function is specified. Cumulative Sum, Group Sum and Moving
Sum syntax is similar, but slight differences are necessary to modify the type of sum desired.
SUM (cumulative): SUM(value_expression) OVER (PARTITION BY value_expression ORDER BY value_expression ASC|DESC ROWS UNBOUNDED PRECEDING)
SUM (group): SUM(value_expression) OVER (PARTITION BY value_expression ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
SUM (moving): SUM(value_expression) OVER (PARTITION BY value_expression ORDER BY value_expression ASC|DESC ROWS width PRECEDING)
Example 1 (Simple – Cumulative Sum)
The following SELECT statement displays the cumulative balance per Orderkey by Shipdate.
(result: 60175 rows)
SELECT
L_ORDERKEY, L_SHIPDATE, L_EXTENDEDPRICE,
SUM(L_EXTENDEDPRICE) OVER (PARTITION BY L_ORDERKEY
ORDER BY L_SHIPDATE ROWS
UNBOUNDED PRECEDING) AS BALANCE
FROM ITEM
ORDER BY L_ORDERKEY, L_SHIPDATE
Example 2 (Complex - Cumulative Sum)
The following SELECT statement displays projected Monthly and YTD Sums for the Extended
Price attribute based on Shipdate. (result: 84 rows)
SELECT T1.L_SHIPDATE (FORMAT 'YYYY-MM')(CHAR(7)), T1.MTD, T1.YTD
FROM (SELECT L_SHIPDATE,
SUM(L_EXTENDEDPRICE) OVER (PARTITION BY L_SHIPDATE (FORMAT 'YYYY/MM')(CHAR(7))
ORDER BY L_SHIPDATE ROWS UNBOUNDED PRECEDING) AS MTD,
SUM(L_EXTENDEDPRICE) OVER (PARTITION BY L_SHIPDATE (FORMAT 'YYYY')(CHAR(4))
ORDER BY L_SHIPDATE ROWS UNBOUNDED PRECEDING) AS YTD
FROM (SELECT L_SHIPDATE, SUM(ITEM.L_EXTENDEDPRICE) AS L_EXTENDEDPRICE
FROM ITEM
GROUP BY 1) AS T3
) AS T1,
(SELECT MAX(L_SHIPDATE) AS L_SHIPDATE
FROM ITEM
GROUP BY L_SHIPDATE (FORMAT 'YYYY-MM')(CHAR(7))
) AS T2
WHERE T1.L_SHIPDATE (FORMAT 'YYYY-MM') = T2.L_SHIPDATE (FORMAT 'YYYY-MM')
ORDER BY 1;
Rank (SQL-99 Window Function)
Description
Returns an ordered ranking of rows for the value_expression.
Example 1 (Simple - Rank)
The following SELECT statement ranks Clerks by Order Status based on Total Price. (result:
15000 rows)
SELECT O_CLERK, O_ORDERSTATUS, O_TOTALPRICE,
RANK() OVER (PARTITION BY O_ORDERSTATUS ORDER BY O_TOTALPRICE
DESC)
FROM CONTRACT
Example 2 (Complex - Rank)
The following SELECT statement ranks the Monthly Extended Price in descending order within
each year using Monthly and YTD Sums for the Extended Price attribute query above. (result: 84
rows)
SELECT T1.L_SHIPDATE (FORMAT 'YYYY-MM')(CHAR(7)), T1.MTD, T1.YTD,
RANK() OVER (PARTITION BY T1.L_SHIPDATE (FORMAT 'YYYY')(CHAR(4))
ORDER BY T1.MTD DESC)
FROM (SELECT L_SHIPDATE,
SUM(L_EXTENDEDPRICE) OVER (PARTITION BY L_SHIPDATE (FORMAT 'YYYY/MM')(CHAR(7))
ORDER BY L_SHIPDATE ROWS UNBOUNDED PRECEDING) AS MTD,
SUM(L_EXTENDEDPRICE) OVER (PARTITION BY L_SHIPDATE (FORMAT 'YYYY')(CHAR(4))
ORDER BY L_SHIPDATE ROWS UNBOUNDED PRECEDING) AS YTD
FROM (SELECT L_SHIPDATE, SUM(L_EXTENDEDPRICE) AS L_EXTENDEDPRICE
FROM ITEM
GROUP BY 1) AS T3
) AS T1,
(SELECT MAX(L_SHIPDATE) AS L_SHIPDATE
FROM ITEM
GROUP BY L_SHIPDATE (FORMAT 'YYYY-MM')(CHAR(7))
) AS T2
WHERE T1.L_SHIPDATE (FORMAT 'YYYY-MM') = T2.L_SHIPDATE (FORMAT 'YYYY-MM')
ORDER BY T1.L_SHIPDATE (FORMAT 'YYYY')(CHAR(4)), 4;
Fast Path Insert
Description
Inserting rows into an empty table requires only one entry in the transient journal for the entire
step, and rows are inserted at the block level rather than as individual rows. By "stacking" multiple
insert statements into one request, all insert steps in the request participate in the single fast path
insert. In BTEQ, statements are stacked by placing the semicolon (;) at the start of the line for each additional insert statement.
Example
The following stacked insert statements all participate in the fast path insert for an empty target table:
INSERT INTO TableA
SELECT *
FROM TableA1
;INSERT INTO TableA
SELECT *
FROM TableA2
;
Fast Path Delete
A fast path delete occurs when the optimizer recognizes an unqualified DELETE that IS NOT inside an
uncommitted unit of work. In this scenario, it will not journal the deleted rows in the transient journal. The
logic behind the decision is that, even if the system restarts, the transaction has been previously
authorized to run to completion - eliminate all the rows; therefore since the deleted rows will never need
to be "restored", the system skips the overhead of copying them into the transient journal.
Actually, all that DELETE ALL does is re-chain the internal data blocks onto the "data block free chain" - something that is not affected by the number of rows in the table.
If a transaction is created with succession of delete statements, it will not be able to take advantage of
this feature except (possibly) for the last delete. The reason for this is that, by definition, all statements
within a transaction must either succeed or fail. Teradata must be able to recover or undo the changes.
To do this, the transient journal must record the "before image" of all data modified by the transaction.
A transaction can be created in several ways:
o) All SQL statements inside a macro run as a single transaction.
o) A multi-statement request is also a transaction.
o) Statements executed in parallel (F9) within SQL Assistant operate as a transaction. Note that this is
not the case for the other execute, F5: "Execute the statements one step at a time".
Within a transaction, you can take advantage of the fast path delete if the following requirements are met:
o) In ANSI mode, the (unqualified) DELETE must be immediately followed by a COMMIT.
o) In Teradata mode (IMPLICIT), the (unqualified) DELETE must be the last (or only) statement in the
transaction.
o) In Teradata mode (EXPLICIT), the (unqualified) DELETE must be immediately followed by the END
TRANSACTION.
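As a minimal sketch of these rules (the table names load_control and stg_orders are illustrative only), the explicit Teradata-mode transaction below qualifies for the fast path because the unqualified DELETE is the last statement before END TRANSACTION, and the ANSI-mode form qualifies because the DELETE is immediately followed by COMMIT:

BT;
UPDATE load_control SET last_purge_dt = CURRENT_DATE;   /* earlier statements are journaled normally */
DELETE FROM stg_orders;                                 /* unqualified DELETE, last statement in the transaction */
ET;

/* ANSI mode equivalent */
DELETE FROM stg_orders;
COMMIT;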
Section 5.2 Set vs. Procedural
There are two ways to state an operation on data
– Set: “Find the Seven of Spades.”
– Procedural: “Look at each card; if one of them is the Seven of Spades then show it
to me (and, optionally, stop looking).”
Teradata can process set expressions (SQL) in parallel, but procedural statements result in serial,
row-at-a-time processing.
 Think in SETs of data -- not one row at a time.
 A single insert/update/delete on Teradata is painfully slow, compared to its parallel capabilities.
 Use the TPT to get data into the database.
 SQL is the only way to make Teradata use its parallelism, and most things can be done in SQL.
So, how would one approach this?
1 Locate an update DML statement
– Insert/Update/Delete
2 Trace each column involved back to its source in a table
– Change IF … ELSIF … ELSE ... ENDIF to
CASE […] WHEN … WHEN … ELSE … END
3 Locate the conditions under which the update occurs
– Change IF … to WHERE …
4 Repeat for each update DML statement
As an example, the following is what we would typically encounter in working with an Oracle
based application:
Cursor: select order.nr, order.amt into :order_nr, :order_amount from order;
while there_is_data do
fetch from Cursor;
if :rundate < current_date then
if :order_amount < 0 then
:OrdAmt = 0
else
:OrdAmt = :order_amount
endif;
insert into Orders values ( :order_nr, :OrdAmt );
endif;
...
enddo;
This is how you would actually want this executed in Teradata:
insert into Orders
select order.nr
,case when order.amt < 0 then 0
else order.amt
end
from order
where :rundate < current_date
;
A nested loop is usually a join, so why not do it in SQL? Sub-processes or nested processes must
be analyzed as nested loops and brought into SQL. Examples in code are multiple subroutines
within a package or program, functions in C or PL/SQL or whatever, and nested loops.
Another example:
cursor TXN_CUR: select Amt, AcctNum from Acct_Transactions
                where AcctNum = acct_num;
cursor ACCT_CUR: select AcctNum, acct_balance from Accts;
for acct_rec in ACCT_CUR
loop
    acct_num := acct_rec.AcctNum;
    acct_bal := acct_rec.acct_balance;
    for txn_rec in TXN_CUR
    loop
        acct_bal := acct_bal + txn_rec.Amt;
    end loop;
    UPDATE Accts
    set acct_balance = acct_bal
    where AcctNum = acct_num;
end loop;
And how it should actually be written:
UPDATE Accts
FROM (
SELECT AcctNum, SUM(Amt) FROM Acct_Transactions
GROUP BY AcctNum
) Txn_Sum ( acct, total_Amt )
SET acct_balance = acct_balance + Txn_Sum.total_Amt
WHERE Accts.AcctNum = Txn_Sum.acct ;
What to do about other complications?
 PL/SQL Functions and similar features
– If it accesses the database, you might be able to treat it like a sub-procedure and/or turn it into a derived table.
– If it just does logic, then it can be a CASE operation or a UDF.
 Processes nested several layers deep
– Will take a long time to sort out.
– Try to start from the original specs.
 When all else fails,
– You might be able to use Teradata SPL (not parallel)
– Or create an aggregate table.
An excellent source of information on this subject is George Coleman’s blog at
http://developer.teradata.com/blog/georgecoleman. It is actually a series of articles that goes
into more depth on this topic.
Section 5.3 Statistics collection “cheat sheet”
The following statistics collection recommendations are intended for sites that are on any of the
Teradata 13.10 software release levels. Most of these recommendations apply to releases earlier
than Teradata 13.10, however some may be specific to Teradata 13.10 only.
Collect Full Statistics
 Non-indexed columns used in predicates
 Single-column join constraints, if the column is not unique
 All NUSIs (but drop NUSIs that aren’t needed/used)
 USIs/UPIs only if used in non-equality predicates (range constraints)
 Most NUPIs (see below for a fuller discussion of NUPI statistics collection)
 Full statistics always need to be collected on relevant columns and indexes on small tables (less than 100 rows per AMP)
 PARTITION for all tables, whether partitioned or not. Collecting PARTITION statistics on tables that have grown supports more accurate statistics extrapolations, and collecting on PARTITION is an extremely quick operation (as long as the table is not over-partitioned). A sketch of typical COLLECT STATISTICS statements follows this list.
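The following is a minimal sketch of these recommendations as SQL, using the pre-Teradata 14.0 COLLECT STATISTICS syntax; the table and column names (Sales, Store_Id, Txn_Date, Receipt_Nbr) are hypothetical, and the INDEX form assumes a NUSI exists on Txn_Date.

COLLECT STATISTICS ON Sales COLUMN (Store_Id);                  /* non-indexed predicate column */
COLLECT STATISTICS ON Sales INDEX (Txn_Date);                   /* a NUSI */
COLLECT STATISTICS ON Sales COLUMN (PARTITION);                 /* recommended for every table */
COLLECT STATISTICS USING SAMPLE ON Sales COLUMN (Receipt_Nbr);  /* nearly-unique column (see USING SAMPLE below) */
COLLECT STATISTICS ON Sales COLUMN (Store_Id, Txn_Date);        /* multicolumn statistics (see below) */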
Can Rely on Dynamic AMP Sampling
 This is sometimes referred to as a random AMP sample since, in early releases of Teradata, a random AMP was picked for obtaining the sample; however, in current releases, the term dynamic AMP sample is more appropriate.
 USIs or UPIs if only used with equality predicates
 NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness (see Point #2 under “Other Considerations” below)
 See “Other Considerations” for additional points related to dynamic AMP sampling
Option to use USING SAMPLE
 Unique index columns
 Nearly-unique columns or indexes (any column which is over 95% unique is considered a nearly-unique column)
Collect Multicolumn Statistics
 Groups of columns that often appear together with equality predicates, if the first 16 bytes of the concatenated column values are sufficiently distinct. These statistics are used for single-table estimates.
 Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them. With no multicolumn statistics collected, the optimizer assumes complete independence among the column values. The more that the combination of actual values is correlated, the greater the value of collecting multicolumn statistics.
Other Considerations
 Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination are not chosen unless statistics have been collected on the relevant columns.
 NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that. A NUPI that is far off from being 75% unique (for example, it’s 90% unique, or on the other side, it’s 60% unique or less) benefits from having statistics collected, including a NUPI composed of multiple columns regardless of the length of the concatenated values. However, if it is close to being 75% unique, dynamic AMP samples are adequate. To determine what the uniqueness of a NUPI is before collecting statistics, you can issue this SQL statement:
EXPLAIN SELECT DISTINCT nupi-column FROM table;
 For a partitioned primary index table, it is recommended that you always collect statistics on:
o PARTITION. This tells the optimizer how many partitions are empty, and how many rows are in each partition. This statistic is used for optimizer costing.
o The partitioning column. This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
 For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI); a sketch of these statements follows this list:
o (PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
o (PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey-based merge join costing.
 Dynamic AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default). For small tables, with less than 25 rows per AMP, all-AMP sampling is done automatically. It is also the default for volatile tables and sparse join indexes. All-AMP sampling comes with these tradeoffs:
o Dynamic all-AMP sampling provides a more accurate row count estimate for a table with a NUPI. This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large), and the NUPI has an uneven distribution of values.
o Statistics extrapolation for any column in a table is triggered only when the optimizer detects that the table has grown. The growth is computed by comparing the current row count with the last row count known to the optimizer. If the default single-AMP dynamic sampling estimate of the current row count is not accurate (which can happen if the primary index is skewed), it is recommended to enable all-AMP sampling or re-collect PARTITION statistics.
o Parsing times for queries may increase when all AMPs are involved, as the queries that perform dynamic AMP sampling will have slightly more work to do. Note that dynamic AMP samples will stay in the dictionary cache until the periodic cache flush, or unless they are purged from the cache for some reason. Because they can be retrieved once and re-used multiple times, it is not expected that dynamic all-AMP sampling will cause additional overhead for all query executions.
 For temporal tables, follow all collection recommendations made above. However, statistics are currently not supported on BEGIN and END period types. That capability is planned for a future release.
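As a minimal sketch of the partitioned-table recommendations above, assuming a hypothetical table Sales_PPI with PRIMARY INDEX (Store_Id) and a partitioning expression on Txn_Date:

COLLECT STATISTICS ON Sales_PPI COLUMN (PARTITION);
COLLECT STATISTICS ON Sales_PPI COLUMN (Txn_Date);                      /* the partitioning column */
COLLECT STATISTICS ON Sales_PPI COLUMN (PARTITION, Store_Id);           /* (PARTITION, PI) */
COLLECT STATISTICS ON Sales_PPI COLUMN (PARTITION, Store_Id, Txn_Date); /* (PARTITION, PI, partitioning column) */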
These recommendations were compiled by: Carrie Ballinger, Rama Krishna Korlapati, Paul
Sinclair
Section 5.4 Reserved words
Please refer to Appendix B of the Teradata Database SQL Fundamentals manual for a list of
Restricted Words.
Section 5.5 Orange Books and Suggested Reading
The Teradata documentation manuals supplied with the installation and available at
www.info.teradata.com are quite readable. Of particular interest are the “Teradata Database
Design” and “Teradata Performance Management” manuals.
In addition to white papers and Teradata Magazine articles available at www.teradata.com,
Teradata Partners have access to the Teradata orange book library via Teradata at Your Service
at http://www.teradata.com/services-support/teradata-at-your-service. To access the orange
books, enter “orange book” in the text box under “Search Knowledge Repositories” and then
click “Search”.
The Teradata Developer Exchange website, http://developer.teradata.com/, offers a wide variety
of blogs, articles, and information related to Teradata. It also offers a download section.
Here’s a list of orange books that are of particular interest when migrating and tuning applications on
the Teradata database. In addition, there are a number of orange books, not listed here, that address
controlling and administering mixed workload environments and other subjects.
“Understanding Oracle and Teradata Transactions and Isolation Levels for Oracle Migrations.”
When migrating applications from Oracle to Teradata, the reduced isolation levels used by the
Oracle applications need to be understood before the applications can be ported or redesigned to
run on Teradata. This Orange Book will describe transaction boundaries, scheduling, and the
isolation levels available in Oracle and in Teradata. It will suggest possible solutions for coping
with incompatible isolation levels when migrating from Oracle to Teradata.
“ANSI MERGE Enhancements.” This is an overview of the support of full ANSI MERGE
syntax for set inserts, updates and upserts into tables. This includes an overview of the batch
error handling capabilities.
“Single-Level and Multilevel Partitioning.” Usage considerations, examples, recommendations,
and technical details for tables and noncompressed join indexes with single-level and multilevel
partitioned primary indexes.
“Collecting Statistics Teradata Database V2R62.” Having statistical information available is
critical to optimal query plans, but collecting statistics can involve time and resources. By
combining several different statistics gathering strategies, users of the Teradata database can find
the correct balance between good query plans and the time required to ensure adequate statistical
information is always available. Note: Although the title specifies Teradata V2R6.2, this orange
book is also applicable to Teradata V12 and beyond.
“Implementing Tactical Queries the Basics Teradata Database V2R61.” Tactical queries support
decision-making of an immediate nature within an active data warehouse environment. These
response time-sensitive queries often come with clear service level expectations. This orange book
addresses supporting tactical queries with the Teradata database. Note: Although the title
specifies Teradata V2R6.1, this orange book is also applicable to Teradata V12 and beyond.
“Feeding the Active Data Warehouse.” Active ingest is one of the first steps that needs to be
considered in evolving towards an active data warehouse. There are several proven approaches to
active ingest into a Teradata database. This orange book reviews the approaches, their pros and cons,
and implementation guidelines.
“Introduction to Materialized Views.” Materialized views are implemented as join indexes in
Teradata. Join indexes can be used to improve the performance of queries at the expense of
update performance and increased storage requirements.
“Reserved QueryBand Names for Use by Teradata, Customer and Partner Applications.” The
Teradata 12 feature known as Query Bands provides a means to set name/value pairs across
individual database connections at a session or transaction level to provide the database with
significant information about the connection’s originating source. This provides a mechanism for
Applications to collaborate with the underlying Teradata Database in order to provide for better
Workload Management, Prioritization and Accounting.
“Teradata Active System Management.” As DBAs and other support engineers attempt to
analyze, tune and manage their environment’s performance, these new features will greatly ease
that effort through centralizing management tasks under one domain, providing automation of
certain management tasks, improving visibility into management related details, and by
introducing management and monitoring by business driven, workload-centric goals.
“Stored Procedures Guide.” This Orange Book provides an overview of stored procedures and
some basic examples for getting started.
“User Defined Functions” and “Teradata Java User Defined Functions User’s Guide” are two
Orange Books for digging deeper into C/C++ and Java UDFs. Both guides explain the UDF
architecture and how a UDF is created and packaged for use by Teradata.
“K-means clustering and Teradata 14.10 table operators.” Article by Watzke on 17 Sep 2013,
Teradata Developer Exchange. http://developer.teradata.com/extensibility/articles/k-means-clustering-and-teradata-14-10-table-operators-0
Here are a few white papers and Teradata manuals that are of particular interest when migrating and
tuning applications on the Teradata database. A list of available white papers can be found at
http://www.teradata.com/t/resources.aspx?TaxonomyID=4533. The Teradata manuals are found
in the latest documentation set available at www.info.teradata.com.
“Implementing AJIs for ROLAP.” This white paper describes how to build and implement
ROLAP cubes on the Teradata database.
http://www.teradata.com/t/article.aspx?id=1644
“Teradata Database Queue Tables.” This white paper describes the Queue Tables feature that
was introduced in Teradata® Database V2R6. This feature enables new and improved ways of
building applications to support event processing use cases.
http://www.teradata.com/t/article.aspx?id=1660
“Oracle to Teradata Migration Technical Info.” This document is based on a true practical
experience in migrating data between Oracle and Teradata. This document contains a description
of the technicalities involved in the actual porting of the software and data; as well as some
templates and tools provided that are useful for projects of this nature. This is available on the
Teradata Partner Portal.