DATA WAREHOUSING

Why Data Warehousing?
Data warehousing is done mainly for reporting purposes. All the historical data is put into a Data Warehouse, which can be thought of as a very large database. Reports are then generated out of this Data Warehouse to analyze the business.

What is the difference between an Enterprise Data Warehouse (EDW) and a Data Mart?
An EDW contains all the information associated with the entire organization. For example, it will contain information about all the departments (say Finance, Human Resources, Marketing, Sales, etc.). A Data Mart, by contrast, contains only the data that is specific to one department (say, only Finance).

Data Warehousing Tools

ETL Tools
ETL stands for Extraction, Transformation and Loading. Tools that extract data from different data sources (SQL Server, Oracle, flat files, Sybase, etc.) into a Data Warehouse are known as ETL tools. Some popular ETL tools in the market are Informatica, Ab Initio and DataStage.

Reporting Tools
Reporting tools are used to generate reports out of the information (data) stored in the Data Warehouse. Some popular reporting tools in the market are Business Objects, Cognos and MicroStrategy.

Data Modeling
A Data Warehouse is based on fact and dimension tables. Establishing the relationships between the fact table(s) and dimension tables is called "data modeling". A fact table contains the numeric data needed in reports, e.g. revenue, sales, etc. The fact table also carries information about every dimension table it is related to: it holds all the dimension keys as foreign keys.

Data modeling is of two types:
1. Star Schema Design: Dimension tables surround the fact table. Data is in de-normalized form.
2. Snowflake Schema Design: Dimension tables surround the fact table. Data is in normalized form, and a dimension table may be further split into sub-dimension tables.

Informatica Tool Installation
1. Install Oracle.
2. Install Informatica Client Tools.
3. Install Informatica Server.
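Returning to the data modeling section above: a minimal star schema is easy to sketch in SQL. The sketch below uses Python's sqlite3 module; the table and column names (dim_department, fact_sales) are invented purely for illustration.

```python
import sqlite3

# Minimal star schema: one fact table holding numeric measures,
# with one foreign key per dimension table (names are invented).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_department (dept_key INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE fact_sales (
        sale_id  INTEGER PRIMARY KEY,
        dept_key INTEGER REFERENCES dim_department(dept_key),  -- dimension key as FK
        revenue  REAL                                          -- numeric measure
    );
""")
con.executemany("INSERT INTO dim_department VALUES (?, ?)",
                [(1, "Finance"), (2, "Sales")])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(10, 1, 500.0), (11, 2, 750.0), (12, 2, 250.0)])

# A typical report query: join the fact table to a dimension
# and aggregate the measure per dimension value.
report = con.execute("""
    SELECT d.dept_name, SUM(f.revenue)
    FROM fact_sales f JOIN dim_department d ON f.dept_key = d.dept_key
    GROUP BY d.dept_name ORDER BY d.dept_name
""").fetchall()
print(report)   # [('Finance', 500.0), ('Sales', 1000.0)]
```

In a snowflake variant, dim_department would itself be normalized into sub-dimension tables (for example a separate division table referenced by dept_key).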
While installing the Informatica Server, give the keys for all databases, give a name for the repository (along with a user name and password), give a TCP/IP port number (4001), and choose the Oracle version. The ODBC driver for Oracle is "Merant 32-bit for Oracle".

INFORMATICA

About PowerCenter and PowerMart
PowerMart and PowerCenter are Informatica's integrated suite of software products that deliver an open, scalable solution addressing the complete life cycle of data warehouse and analytic application development. Both PowerMart and PowerCenter combine the latest technology enhancements for reliably managing data repositories and delivering information resources in a timely, usable manner. The metadata repository coordinates and drives a variety of core functions, including extraction, transformation, loading, and management. The Informatica Server can extract large volumes of data from multiple platforms, handle complex transformations, and support high-speed loads. PowerMart and PowerCenter can simplify and accelerate the process of moving data warehouses from development to test to full production.

Software features differ between PowerMart and PowerCenter:

If You Are Using PowerCenter
With PowerCenter, you receive all product functionality, including the ability to register multiple servers, share metadata across repositories, and partition data. A PowerCenter license lets you create a single repository that you can configure as a global repository, the core component of a data warehouse. When this guide mentions a PowerCenter Server, it is referring to an Informatica Server with a PowerCenter license.

If You Are Using PowerMart
This version of PowerMart includes all features except distributed metadata, multiple registered servers, and data partitioning.
Also, the various options available with PowerCenter (such as PowerCenter Integration Server for BW, PowerConnect for IBM DB2, PowerConnect for IBM MQSeries, PowerConnect for SAP R/3, PowerConnect for Siebel, and PowerConnect for PeopleSoft) are not available with PowerMart. When this guide mentions a PowerMart Server, it is referring to an Informatica Server with a PowerMart license.

Informatica Client Tools:
1. Designer
2. Server Manager
3. Repository Manager

Informatica Server Processes:
1. Load Manager process
2. Data Transformation Manager (DTM) process

The Load Manager is the primary Informatica Server process. It performs the following tasks:
1. Manages session and batch scheduling.
2. Locks the session and reads session properties.
3. Reads the parameter file.
4. Expands the server and session variables and parameters.
5. Verifies permissions and privileges.
6. Validates source and target code pages.
7. Creates the session log file.
8. Creates the Data Transformation Manager (DTM) process, which executes the session.

DESIGNER
The Designer has five tools to help you build mappings and mapplets so you can specify how to move and transform data between sources and targets. The Designer helps you create source definitions, target definitions, and transformations to build your mappings. The Designer allows you to work with multiple tools at one time, and to work in multiple folders and repositories at the same time. It also includes windows so you can view folders, repository objects, and tasks.

Designer Tools
The Designer provides the following five tools:
Source Analyzer. Used to import or create source definitions for flat file (fixed-width and delimited), XML, COBOL, ERP, and relational sources (tables, views, and synonyms).
Warehouse Designer. Used to import or create target definitions.
Transformation Developer. Used to create reusable transformations.
Mapplet Designer. Used to create mapplets.
Mapping Designer. Used to create mappings.

What is a Transformation?
A transformation is a repository object that generates, modifies, or passes data. The Designer provides a set of transformations that perform specific functions. For example, an Aggregator transformation performs calculations on groups of data. Transformations in a mapping represent the operations the Informatica Server performs on the data. Data passes into and out of transformations through ports that you connect in a mapping or mapplet.

Transformations can be active or passive. An active transformation can change the number of rows that pass through it, such as a Filter transformation that removes rows that do not meet the configured filter condition. A passive transformation does not change the number of rows that pass through it, such as an Expression transformation that performs a calculation on data and passes all rows through the transformation.

Transformations can be connected to the data flow, or they can be unconnected. An unconnected transformation is not connected to other transformations in the mapping. It is called within another transformation, and returns a value to that transformation.

Table 8-1 provides a brief description of each transformation:

Table 8-1. Transformation Descriptions

Transformation               Type                                 Description
Advanced External Procedure  Active / Connected                   Calls a procedure in a shared library or in the COM layer of Windows NT.
Aggregator                   Active / Connected                   Performs aggregate calculations.
ERP Source Qualifier         Active / Connected                   Represents the rows that the Informatica Server reads from an ERP source when it runs a session.
Expression                   Passive / Connected                  Calculates a value.
External Procedure           Passive / Connected or Unconnected   Calls a procedure in a shared library or in the COM layer of Windows NT.
Filter                       Active / Connected                   Filters records.
Input                        Passive / Connected                  Defines mapplet input rows. Available only in the Mapplet Designer.
Joiner                       Active / Connected                   Joins records from different databases or flat file systems.
Lookup                       Passive / Connected or Unconnected   Looks up values.
Normalizer                   Active / Connected                   Normalizes records, including those read from COBOL sources.
Output                       Passive / Connected                  Defines mapplet output rows. Available only in the Mapplet Designer.
Rank                         Active / Connected                   Limits records to a top or bottom range.
Router                       Active / Connected                   Routes data into multiple transformations based on a group expression.
Sequence Generator           Passive / Connected                  Generates primary keys.
Source Qualifier             Active / Connected                   Represents the rows that the Informatica Server reads from a relational or flat file source when it runs a session.
Stored Procedure             Passive / Connected or Unconnected   Calls a stored procedure.
Update Strategy              Active / Connected                   Determines whether to insert, delete, update, or reject records.
XML Source Qualifier         Passive / Connected                  Represents the rows that the Informatica Server reads from an XML source when it runs a session.

Overview of Transformations

1. Aggregator
The Aggregator transformation allows you to perform aggregate calculations, such as averages and sums. The Aggregator transformation is unlike the Expression transformation in that you can use the Aggregator transformation to perform calculations on groups; the Expression transformation permits you to perform calculations on a row-by-row basis only. When using the transformation language to create aggregate expressions, you can use conditional clauses to filter records, providing more flexibility than the SQL language. The Informatica Server performs aggregate calculations as it reads, and stores the necessary group and row data in an aggregate cache. After you create a session that includes an Aggregator transformation, you can enable the session option Incremental Aggregation.
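The conditional-clause flexibility mentioned above can be sketched outside Informatica. The pure-Python sketch below groups rows and sums a measure only where a condition holds, roughly what an aggregate expression such as SUM(revenue, revenue > 0) does; the data and field names are invented for illustration.

```python
from collections import defaultdict

def aggregate(rows, group_key, value_key, condition):
    """Group rows and sum value_key only where condition holds
    (analogous to an Aggregator expression with a conditional clause)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]]  # ensure every group appears, even if all rows fail
        if condition(row):
            totals[row[group_key]] += row[value_key]
    return dict(totals)

rows = [
    {"dept": "Sales",   "revenue": 100.0},
    {"dept": "Sales",   "revenue": -20.0},   # excluded by the condition
    {"dept": "Finance", "revenue": 40.0},
]
result = aggregate(rows, "dept", "revenue", lambda r: r["revenue"] > 0)
print(result)   # {'Sales': 100.0, 'Finance': 40.0}
```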
When the Informatica Server performs incremental aggregation, it passes new source data through the mapping and uses historical cache data to perform new aggregation calculations incrementally.

2. Filter
The Filter transformation provides the means for filtering rows in a mapping. You pass all the rows from a source transformation through the Filter transformation, and then enter a filter condition for the transformation. All ports in a Filter transformation are input/output, and only rows that meet the condition pass through the Filter transformation.

In some cases, you need to filter data based on one or more conditions before writing it to targets. For example, if you have a human resources data warehouse containing information about current employees, you might want to filter out employees who are part-time and hourly. The mapping in Figure 18-1 passes the rows from a human resources table that contains employee data through a Filter transformation. The filter only passes rows for employees with salaries of $30,000 or higher.

3. Joiner
While a Source Qualifier transformation can join data originating from a common source database, the Joiner transformation joins two related heterogeneous sources residing in different locations or file systems. The combination of sources can be varied. You can use the following sources:
a) Two relational tables existing in separate databases
b) Two flat files in potentially different file systems
c) Two different ODBC sources
d) Two instances of the same XML source
e) A relational table and a flat file source
f) A relational table and an XML source

You use the Joiner transformation to join two sources with at least one matching port. The Joiner transformation uses a condition that matches one or more pairs of ports between the two sources. For example, you might want to join a flat file with in-house customer IDs and a relational database table that contains user-defined customer IDs.
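The flat-file-to-table join just described can be sketched without any staging table. Below, a CSV-style source (the detail) is joined in memory to rows from a relational table (the master) on the matching port cust_id; all data and names are invented for illustration.

```python
import csv, io, sqlite3

# Flat file source: in-house customer IDs (a CSV string stands in for the file).
flat_file = io.StringIO("cust_id,name\n1,Acme\n2,Globex\n")
detail_rows = list(csv.DictReader(flat_file))

# Relational source: user-defined customer IDs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (cust_id INTEGER, region TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "East"), (3, "West")])
master = {cid: region
          for cid, region in con.execute("SELECT cust_id, region FROM customers")}

# Normal (inner) join on the matching port cust_id: only rows with a
# match in the master source pass through.
joined = [
    {"cust_id": int(r["cust_id"]), "name": r["name"],
     "region": master[int(r["cust_id"])]}
    for r in detail_rows
    if int(r["cust_id"]) in master
]
print(joined)   # [{'cust_id': 1, 'name': 'Acme', 'region': 'East'}]
```

A Master Outer or Detail Outer join would instead keep the unmatched rows from one side and fill the missing ports with nulls.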
You could import the flat file into a temporary database table, and then perform the join in the database. However, if you use the Joiner transformation, there is no need to import or create temporary tables. If two relational sources contain keys, a Source Qualifier transformation can easily join the sources on those keys. Joiner transformations typically combine information from two different sources that do not have matching keys, such as flat file sources. The Joiner transformation allows you to join sources that contain binary data.

The Joiner transformation supports the following join types, which you set in the Properties tab:
1. Normal (default)
2. Master Outer
3. Detail Outer
4. Full Outer

4. Source Qualifier
When you add a relational or a flat file source definition to a mapping, you need to connect it to a Source Qualifier transformation. The Source Qualifier represents the records that the Informatica Server reads when it runs a session. You can use the Source Qualifier to perform the following tasks:
a) Join data originating from the same source database. You can join two or more tables with primary-foreign key relationships by linking the sources to one Source Qualifier.
b) Filter records when the Informatica Server reads source data. If you include a filter condition, the Informatica Server adds a WHERE clause to the default query.
c) Specify an outer join rather than the default inner join. If you include a user-defined join, the Informatica Server replaces the join information specified by the metadata in the SQL query.
d) Specify sorted ports. If you specify a number for sorted ports, the Informatica Server adds an ORDER BY clause to the default SQL query.
e) Select only distinct values from the source. If you choose Select Distinct, the Informatica Server adds a SELECT DISTINCT statement to the default SQL query.
f) Create a custom query to issue a special SELECT statement for the Informatica Server to read source data. For example, you might use a custom query to perform aggregate calculations or execute a stored procedure.

5. Stored Procedure
A Stored Procedure transformation is an important tool for populating and maintaining databases. Database administrators create stored procedures to automate time-consuming tasks that are too complicated for standard SQL statements. A stored procedure is a precompiled collection of Transact-SQL statements and optional flow control statements, similar to an executable script. Stored procedures are stored and run within the database. You can run a stored procedure with the EXECUTE SQL statement in a database client tool, just as you can run SQL statements. Unlike standard SQL, however, stored procedures allow user-defined variables, conditional statements, and other powerful programming features. Not all databases support stored procedures, and database implementations vary widely in their syntax.

You might use stored procedures to:
a) Drop and recreate indexes.
b) Check the status of a target database before moving records into it.
c) Determine if enough space exists in a database.
d) Perform a specialized calculation.

Database developers and programmers use stored procedures for various tasks within databases, since stored procedures allow greater flexibility than SQL statements. Stored procedures also provide the error handling and logging necessary for mission-critical tasks. Developers create stored procedures in the database using the client tools provided with the database. The stored procedure must exist in the database before you create a Stored Procedure transformation, and the stored procedure can exist in a source, target, or any database with a valid connection to the Informatica Server. You might use a stored procedure to perform a query or calculation that you would otherwise make part of a mapping.
For example, if you already have a well-tested stored procedure for calculating sales tax, you can perform that calculation through the stored procedure instead of recreating the same calculation in an Expression transformation.

6. Sequence Generator
The Sequence Generator transformation generates numeric values. You can use the Sequence Generator to create unique primary key values, replace missing primary keys, or cycle through a sequential range of numbers. The Sequence Generator transformation is a connected transformation. It contains two output ports that you can connect to one or more transformations. The Informatica Server generates a value each time a row enters a connected transformation, even if that value is not used. When NEXTVAL is connected to the input port of another transformation, the Informatica Server generates a sequence of numbers. When CURRVAL is connected to the input port of another transformation, the Informatica Server generates the NEXTVAL value plus one.

You can make a Sequence Generator reusable and use it in multiple mappings. You might reuse a Sequence Generator when you perform multiple loads to a single target. For example, if you have a large input file that you separate into three sessions running in parallel, you can use a Sequence Generator to generate primary key values. If you use different Sequence Generators, the Informatica Server might accidentally generate duplicate key values. Instead, you can use the same reusable Sequence Generator for all three sessions to provide a unique value for each target row.

7. Rank
The Rank transformation allows you to select only the top or bottom rank of data. You can use a Rank transformation to return the largest or smallest numeric value in a port or group. You can also use a Rank transformation to return the strings at the top or the bottom of a session sort order. During the session, the Informatica Server caches input data until it can perform the rank calculations.
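A top-N rank of this kind is easy to sketch in plain Python: heapq.nlargest plays the role of the cached rank calculation, keeping only the rows that fall within the top-3 rank on the measure. The salesperson data is invented for illustration.

```python
import heapq

# (name, sales) pairs; invented data.
salespeople = [
    ("Iyer", 91000), ("Chen", 124000), ("Okafor", 87000),
    ("Silva", 110000), ("Dubois", 98000),
]

# Pass through only the rows within the top-3 rank on the sales measure.
top3 = heapq.nlargest(3, salespeople, key=lambda row: row[1])
print(top3)   # [('Chen', 124000), ('Silva', 110000), ('Dubois', 98000)]
```

heapq.nsmallest would give the bottom rank instead, analogous to configuring the Rank transformation for the bottom range.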
The Rank transformation differs from the transformation functions MAX and MIN in that it allows you to select a group of top or bottom values, not just one value. For example, you can use Rank to select the top 10 salespersons in a given territory. Or, to generate a financial report, you might use a Rank transformation to identify the three departments with the lowest expenses in salaries and overhead. While the SQL language provides many functions designed to handle groups of data, identifying top or bottom strata within a set of rows is not possible using standard SQL functions.

You connect all ports representing the same row set to the transformation. Only the rows that fall within that rank, based on some measure you set when you configure the transformation, pass through the Rank transformation. You can also write expressions to transform data or perform calculations. Figure 22-1 shows a mapping that passes employee data from a human resources table through a Rank transformation. The Rank only passes the rows for the top 10 highest paid employees to the next transformation.

8. Lookup
Use a Lookup transformation in your mapping to look up data in a relational table, view, or synonym. Import a lookup definition from any relational database to which both the Informatica Client and Server can connect. You can use multiple Lookup transformations in a mapping. The Informatica Server queries the lookup table based on the lookup ports in the transformation. It compares Lookup transformation port values to lookup table column values based on the lookup condition. Use the result of the lookup to pass to other transformations and the target.

You can use the Lookup transformation to perform many tasks, including:
a) Get a related value. For example, if your source table includes an employee ID, but you want to include the employee name in your target table to make your summary data easier to read.
b) Perform a calculation. Many normalized tables include values used in a calculation, such as gross sales per invoice or sales tax, but not the calculated value (such as net sales).
c) Update slowly changing dimension tables. You can use a Lookup transformation to determine whether records already exist in the target.

You can configure the Lookup transformation to perform different types of lookups. You can configure the transformation to be connected or unconnected, cached or uncached:
a) Connected or unconnected. Connected and unconnected transformations receive input and send output in different ways.
b) Cached or uncached. Sometimes you can improve session performance by caching the lookup table. If you cache the lookup table, you can choose to use a dynamic or static cache. By default, the lookup cache remains static and does not change during the session. With a dynamic cache, the Informatica Server inserts rows into the cache during the session. Informatica recommends that you cache the target table as the lookup. This enables you to look up values in the target and insert them if they do not exist.

9. Expression
You can use the Expression transformation to calculate values in a single row before you write to the target. For example, you might need to adjust employee salaries, concatenate first and last names, or convert strings to numbers. You can use the Expression transformation to perform any non-aggregate calculations. You can also use the Expression transformation to test conditional statements before you output the results to target tables or other transformations.

Note: To perform calculations involving multiple rows, such as sums or averages, use the Aggregator transformation. Unlike the Expression transformation, the Aggregator allows you to group and sort data. For details, see Aggregator Transformation.

10. Router
A Router transformation is similar to a Filter transformation because both transformations allow you to use a condition to test data. A Filter transformation tests data for one condition and drops the rows of data that do not meet the condition. However, a Router transformation tests data for one or more conditions and gives you the option to route rows of data that do not meet any of the conditions to a default output group.

If you need to test the same input data based on multiple conditions, use a Router transformation in a mapping instead of creating multiple Filter transformations to perform the same task. The Router transformation is more efficient both when you design a mapping and when you run a session. For example, to test data based on three conditions, you only need one Router transformation instead of three Filter transformations. Likewise, when you use a Router transformation in a mapping, the Informatica Server processes the incoming data only once; when you use multiple Filter transformations, the Informatica Server processes the incoming data once for each transformation.

11. Update Strategy
When you design your data warehouse, you need to decide what type of information to store in targets. As part of your target table design, you need to determine whether to maintain all the historic data or just the most recent changes. For example, you might have a target table, T_CUSTOMERS, that contains customer data. When a customer address changes, you may want to save the original address in the table instead of updating that portion of the customer record. In this case, you would create a new record containing the updated address and preserve the original record with the old customer address. This illustrates how you might store historical information in a target table.
However, if you want the T_CUSTOMERS table to be a snapshot of current customer data, you would update the existing customer record and lose the original address.

The model you choose constitutes your update strategy: how to handle changes to existing records. In PowerMart and PowerCenter, you set your update strategy at two different levels:
a) Within a session. When you configure a session, you can instruct the Informatica Server either to treat all records in the same way (for example, treat all records as inserts), or to use instructions coded into the session mapping to flag records for different database operations.
b) Within a mapping. Within a mapping, you use the Update Strategy transformation to flag records for insert, delete, update, or reject.

SERVER MANAGER
The Informatica Server moves data from sources to targets based on mapping and session metadata stored in a repository.

What is a Mapping?
A mapping is a set of source and target definitions linked by transformation objects that define the rules for data transformation.

What is a Session?
A session is a set of instructions that describes how and when to move data from sources to targets. Use the Designer to import source and target definitions into the repository and to build mappings. Use the Server Manager to create and manage sessions and batches, and to monitor and stop the Informatica Server. When a session starts, the Informatica Server retrieves mapping and session metadata from the repository to extract data from the source, transform it, and load it into the target.

More About a Session
A session is a set of instructions that tells the Informatica Server how and when to move data from sources to targets. You create and maintain sessions in the Server Manager. When you create a session, you enter general information such as the session name, session schedule, and the Informatica Server to run the session.
You can also select options to execute pre-session shell commands, send post-session email, and FTP source and target files. Using session properties, you can also override parameters established in the mapping, such as source and target location, source and target type, error tracing levels, and transformation attributes. For details on server activity while executing a session, see Understanding the Server Architecture.

You can group sessions into a batch. The Informatica Server can run the sessions in a batch in sequential order, or start them concurrently. Some batch settings override session settings. Once you create a session, you can use either the Server Manager or the command line program pmcmd to start or stop the session. You can also use the Server Manager to monitor, edit, schedule, abort, copy, and delete the session.

What is a Batch?
Batches provide a way to group sessions for either serial or parallel execution by the Informatica Server. There are two types of batches:
a) Sequential. Runs sessions one after the other.
b) Concurrent. Runs sessions at the same time.

You might create a sequential batch if you have sessions with source-target dependencies that you want to run in a specific order. You might create a concurrent batch if you have several independent sessions you need scheduled at the same time; you can place them all in one batch, then schedule the batch as needed instead of scheduling each individual session. You can create, edit, start, schedule, and stop batches with the Server Manager. However, you cannot copy or abort batches. With pmcmd, you can start and stop batches.

REPOSITORY MANAGER
The Informatica repository is a relational database that stores information, or metadata, used by the Informatica Server and Client tools.
Metadata can include information such as mappings describing how to transform source data, sessions indicating when you want the Informatica Server to perform the transformations, and connect strings for sources and targets. The repository also stores administrative information such as usernames and passwords, permissions and privileges, and product version. You create and maintain the repository with the Repository Manager client tool. With the Repository Manager, you can also create folders to organize metadata and groups to organize users.

The Informatica repository is an integral part of a data mart. A data mart includes the following components:
a) Targets. The data mart includes one or more databases or flat file systems that store the information used for decision support.
b) A server engine. Every data mart needs some kind of server application that reads, transforms, and writes data to targets. In traditional data warehouses, this server application consists of COBOL or SQL code you write to perform these operations. In PowerMart and PowerCenter, you use a single server application that runs on UNIX or Windows NT to read, transform, and write data.
c) Metadata. Designing a data mart involves writing and storing a complex set of instructions. You need to know where to get data (sources), how to change it, and where to write the information (targets). PowerMart and PowerCenter call this set of instructions metadata. Each piece of metadata (for example, the description of a source table in an operational database) can contain comments about it.
d) A repository. The place where you store the metadata is called a repository. The more sophisticated your repository, the more complex and detailed metadata you can store in it. PowerMart and PowerCenter use a relational database as the repository.

IMPROVING MAPPING PERFORMANCE - TIPS

1. Aggregator Transformation
You can use the following guidelines to optimize the performance of an Aggregator transformation:
a) Use Sorted Input to decrease the use of aggregate caches. The Sorted Input option reduces the amount of data cached during the session and improves session performance. Use this option with the Source Qualifier Number of Sorted Ports option to pass sorted data to the Aggregator transformation.
b) Limit connected input/output or output ports. Limit the number of connected input/output or output ports to reduce the amount of data the Aggregator transformation stores in the data cache.
c) Filter before aggregating. If you use a Filter transformation in the mapping, place the transformation before the Aggregator transformation to reduce unnecessary aggregation.

2. Filter Transformation
The following tips can help filter performance:
a) Use the Filter transformation early in the mapping. To maximize session performance, keep the Filter transformation as close as possible to the sources in the mapping. Rather than passing rows that you plan to discard through the mapping, you can filter out unwanted data early in the flow of data from sources to targets.
b) Use the Source Qualifier to filter. The Source Qualifier transformation provides an alternate way to filter rows. Rather than filtering rows from within a mapping, the Source Qualifier transformation filters rows when read from a source. The main difference is that the Source Qualifier limits the row set extracted from a source, while the Filter transformation limits the row set sent to a target. Since a Source Qualifier reduces the number of rows used throughout the mapping, it provides better performance. However, the Source Qualifier only lets you filter rows from relational sources, while the Filter transformation filters rows from any type of source. Also, note that since the Source Qualifier filter runs in the database, you must make sure that its filter condition only uses standard SQL.
The Filter transformation, by contrast, can define a condition using any statement or transformation function that returns either a TRUE or FALSE value.

3. Joiner Transformation
The following tips can help improve session performance:
a) Perform joins in a database. Performing a join in a database is faster than performing a join in the session. In some cases this is not possible, such as joining tables from two different databases or flat file systems. If you want to perform a join in a database, you can use the following options: create a pre-session stored procedure to join the tables in the database before running the mapping, or use the Source Qualifier transformation to perform the join.
b) Designate as the master source the source with the smaller number of records. For optimal performance and disk storage, designate the source with the lower number of rows as the master source. With a smaller master source, the data cache is smaller and the search time is shorter.

4. Lookup Transformation
Use the following tips when you configure the Lookup transformation:
a) Add an index to the columns used in a lookup condition. If you have privileges to modify the database containing a lookup table, you can improve performance for both cached and uncached lookups. This is important for very large lookup tables. Since the Informatica Server needs to query, sort, and compare values in these columns, the index needs to include every column used in a lookup condition.
b) Place conditions with an equality operator (=) first. If a Lookup transformation specifies several conditions, you can improve lookup performance by placing all the conditions that use the equality operator first in the list of conditions that appear under the Condition tab.
c) Cache small lookup tables. Improve session performance by caching small lookup tables. The result of the lookup query and processing is the same, regardless of whether you cache the lookup table or not.
d) Join tables in the database: If the lookup table is on the same database as the source table in your mapping and caching is not feasible, join the tables in the source database rather than using a Lookup transformation.
e) Unselect the cache lookup option in the Lookup transformation if there is no lookup override. This improves session performance.

MAPPING VARIABLES
1. Go to the Mappings tab, click Parameters and Variables, and create a new port as follows:
Name: $$LastRunTime, Type: Variable, Datatype: date/time, Precision: 19, Scale: 0, Aggregation: Max.
Give it an initial value, for example 1/1/1900.
2. In an Expression transformation, create a variable port as follows:
SetLastRunTime (date/time) = SETVARIABLE ($$LastRunTime, SESSSTARTTIME)
3. Go to the Source Qualifier transformation, click the Properties tab, and in the Source Filter area enter the following expression:
UpdateDateTime (any date column from the source) >= '$$LastRunTime'
AND UpdateDateTime < '$$$SessStartTime'

Handle Nulls in DATE
iif(isnull(AgedDate),to_date('1/1/1900','MM/DD/YYYY'),trunc(AgedDate,'DAY'))

LOOKUP AND UPDATE STRATEGY EXPRESSION
First, declare a lookup condition in the Lookup transformation. For example:
EMPID_IN (column coming from source) = EMPID (column in target table)
Second, drag and drop these two columns into the Update Strategy transformation. Check the value coming from the source (EMPID_IN) against the column in the target table (EMPID). If both are equal, the record already exists in the target, so we update it (DD_UPDATE); otherwise we insert the record coming from the source into the target (DD_INSERT). The Update Strategy expression is:
IIF ((EMPID_IN = EMPID), DD_UPDATE, DD_INSERT)
NOTE: The Update Strategy expression should always be based on the primary keys of the target table.

EXPRESSION TRANSFORMATION
1. IIF (ISNULL (ServiceOrderDateValue1), TO_DATE ('1/1/1900','MM/DD/YYYY'), TRUNC (ServiceOrderDateValue1,'DAY'))
2. IIF (ISNULL (NpaNxxId1) or LENGTH (RTRIM (NpaNxxId1))=0 or TO_NUMBER (NpaNxxId1) <= 0,'UNK', NpaNxxId1)
3.
IIF (ISNULL (InstallMethodId),0,InstallMethodId)
DATE_DIFF(TRUNC(O_ServiceOrderDateValue),TRUNC(O_ServiceOrderDateValue), 'DD')

FILTER CONDITION
To pass only NOT NULL and NOT SPACES values through the transformation:
IIF (ISNULL(LENGTH(RTRIM(LTRIM(ADSLTN)))), 0, LENGTH(RTRIM(LTRIM(ADSLTN)))) > 0
Second filter condition (pass only NOT NULL rows through the filter):
iif(isnull(USER_NAME),FALSE,TRUE)

PERFORMANCE TIPS IN GENERAL
Most of the gains in performance derive from good database design, thorough query analysis, and appropriate indexing. The largest performance gains can be realized by establishing a good database design.
1. Update table statistics in the database.
SYBASE SYNTAX: update all statistics table_name
Adaptive Server's cost-based optimizer uses statistics about the tables, indexes, and columns named in a query to estimate query costs. It chooses the access method that it determines has the least cost. But this cost estimate cannot be accurate if the statistics are not accurate. Some statistics, such as the number of pages or rows in a table, are updated during query processing. Other statistics, such as the histograms on columns, are only updated when you run the update statistics command or when indexes are created. If a query performs slowly and you seek help from Technical Support or a Sybase newsgroup, one of the first questions you are likely to be asked is "Did you run update statistics?" You can use the optdiag command (in Sybase) to see when update statistics was last run for each column on which statistics exist.
NOTE: Running the update statistics commands requires system resources. Like other maintenance tasks, it should be scheduled at times when the load on the server is light. In particular, update statistics requires table scans or leaf-level scans of indexes, may increase I/O contention, may use the CPU to perform sorts, and uses the data and procedure caches.
Use of these resources can adversely affect queries running on the server if you run update statistics at times when usage is high. In addition, some update statistics commands require shared locks, which can block updates.
• Dropping an index does not drop the statistics for the index, since the optimizer can use column-level statistics to estimate costs even when no index exists. If you want to remove the statistics after dropping an index, you must explicitly delete them with delete statistics.
• Truncating a table does not delete the column-level statistics in sysstatistics. In many cases, tables are truncated and the same data is reloaded. Since truncate table does not delete the column-level statistics, there is no need to run update statistics after the table is reloaded if the data is the same. If you reload the table with data that has a different distribution of key values, you need to run update statistics.
• You can drop and re-create indexes without affecting the index statistics by specifying 0 for the number of steps in the with statistics clause of create index. This create index command does not affect the statistics in sysstatistics (in Sybase):
create index title_id_ix on titles (title_id) with statistics using 0 values
This allows you to re-create an index without overwriting statistics that have been edited with optdiag.
• If two users attempt to create an index on the same table, with the same columns, at the same time, one of the commands may fail due to an attempt to enter a duplicate key value in sysstatistics.
2. Create indexes on key fields, and keep index statistics up to date.
NOTE: If data modification performance is poor, you may have too many indexes. While indexes speed up select operations, they slow down data modifications.

ABOUT INDEXES
Indexes are the most important physical design element in improving database performance:
• Indexes help prevent table scans.
Instead of reading hundreds of data pages, a few index pages and data pages can satisfy many queries.
• For some queries, data can be retrieved from a nonclustered index without ever accessing the data rows.
• Clustered indexes can randomize data inserts, avoiding insert "hot spots" on the last page of a table.
• Indexes can help avoid sorts if the index order matches the order of columns in an order by clause.
In addition to their performance benefits, indexes can enforce the uniqueness of data. Indexes are database objects that can be created for a table to speed direct access to specific data rows. Indexes store the values of the key(s) named when the index was created, along with logical pointers to the data pages or to other index pages.
Adaptive Server (Sybase) provides two types of indexes:
• Clustered indexes, where the table data is physically stored in the order of the keys on the index:
  - For allpages-locked tables, rows are stored in key order on pages, and pages are linked in key order.
  - For data-only-locked tables, indexes are used to direct the storage of data on rows and pages, but strict key ordering is not maintained.
• Nonclustered indexes, where the storage order of data in the table is not related to the index keys.
You can create only one clustered index on a table because there is only one possible physical ordering of the data rows. You can create up to 249 nonclustered indexes per table. A table that has no clustered index is called a "heap".
3. Drop and re-create the indexes that hurt performance. Drop indexes (in a pre-session task) before inserting data, and re-create them (in a post-session task) after the data is inserted.
NOTE: With indexes, inserting data is slower. Drop indexes that hurt performance. If an application performs data modifications during the day and generates reports at night, you may want to drop some indexes in the morning and re-create them at night.
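The table-scan avoidance described under ABOUT INDEXES can be sketched in plain Python (the table and keys are hypothetical): a sorted key column can be probed with binary search, touching a handful of "index pages", while a scan touches every row.

```python
# Illustrative sketch: an index-style lookup via binary search vs. a table scan.

import bisect

rows = [{"title_id": i, "title": f"Book {i}"} for i in range(1000)]
index = [r["title_id"] for r in rows]        # sorted key column, like an index

def lookup_with_index(key):
    pos = bisect.bisect_left(index, key)     # O(log n) probes
    if pos < len(index) and index[pos] == key:
        return rows[pos]
    return None

def table_scan(key):
    scanned = 0
    for r in rows:                           # O(n): touches every row
        scanned += 1
        if r["title_id"] == key:
            return r, scanned
    return None, scanned

hit = lookup_with_index(742)
scan_hit, scanned = table_scan(742)
assert hit == scan_hit                       # same answer either way
assert scanned == 743                        # but the scan touched 743 rows
```

This is also why too many indexes slow inserts: every new row must be placed into each sorted structure, not just appended to the heap.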
Drop indexes during periods when frequent updates occur and rebuild them before periods when frequent selects occur.
4. You can also improve performance by:
• Using transaction log thresholds to automate log dumps and avoid running out of space.
• Using thresholds for space monitoring in data segments.
• Using partitions to speed the loading of data.
5. Tune the SQL query. You can use parallel hints in the SELECT statement of the SQL query. Also, place the table with the largest number of rows last when joining; in other words, use the table with the smallest number of rows as the master source. Queries that contain ORDER BY or GROUP BY clauses may benefit from creating an index on the ORDER BY or GROUP BY columns. Once you optimize the query, use the SQL override option to take full advantage of these modifications.
6. Register multiple servers. Performance can also be increased by registering multiple servers that point to the same repository.

Other Methods to Improve Performance
Optimizing the Target Database
If your session writes to a flat file target, you can optimize session performance by writing to a flat file target that is local to the Informatica Server. If your session writes to a relational target, consider performing the following tasks to increase performance:
• Drop indexes and key constraints.
• Increase checkpoint intervals.
• Use bulk loading.
• Use external loading.
• Turn off recovery.
• Increase database network packet size.
• Optimize Oracle target databases.

Dropping Indexes and Key Constraints
When you define key constraints or indexes in target tables, you slow the loading of data to those tables. To improve performance, drop indexes and key constraints before running your session. You can rebuild them after the session completes. If you decide to drop and rebuild indexes and key constraints on a regular basis, you can create pre- and post-load stored procedures to perform these operations each time you run the session.
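The pre-/post-session pattern above can be sketched as a small script that generates the DROP and CREATE statements. This is plain Python, not Informatica; the index names, tables, and columns are hypothetical.

```python
# Sketch: generate pre-session (drop) and post-session (re-create) SQL for a
# list of target-table indexes.

INDEXES = [
    ("idx_sales_date", "sales", "sale_date"),
    ("idx_sales_cust", "sales", "customer_id"),
]

def pre_session_sql(indexes):
    """Drop each index before the bulk load."""
    return [f"DROP INDEX {name}" for name, _, _ in indexes]

def post_session_sql(indexes):
    """Re-create each index after the load completes."""
    return [f"CREATE INDEX {name} ON {table} ({column})"
            for name, table, column in indexes]

assert pre_session_sql(INDEXES)[0] == "DROP INDEX idx_sales_date"
assert post_session_sql(INDEXES)[1] == "CREATE INDEX idx_sales_cust ON sales (customer_id)"
```

In practice the generated statements would run from pre- and post-load stored procedures, as the text describes.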
Note: To optimize performance, use constraint-based loading only if necessary.

Increasing Checkpoint Intervals
The Informatica Server slows each time it waits for the database to perform a checkpoint. To increase performance, consider increasing the database checkpoint interval. When you increase the database checkpoint interval, you increase the likelihood that the database performs checkpoints only as necessary, when the size of the database log file reaches its limit.

Bulk Loading on Sybase and Microsoft SQL Server
You can use bulk loading to improve the performance of a session that inserts a large amount of data into a Sybase or Microsoft SQL Server database. Configure bulk loading in the Targets dialog box in the session properties. When bulk loading, the Informatica Server bypasses the database log, which speeds performance. Without writing to the database log, however, the target database cannot perform rollback. As a result, the Informatica Server cannot perform recovery of the session. Therefore, you must weigh the importance of improved session performance against the ability to recover an incomplete session. If you have indexes or key constraints on your target tables and you want to enable bulk loading, you must drop the indexes and constraints before running the session. After the session completes, you can rebuild them. If you decide to use bulk loading with the session on a regular basis, you can create pre- and post-load stored procedures to drop and rebuild indexes and key constraints. For other databases, even if you configure the bulk loading option, the Informatica Server ignores the configured commit interval and commits as needed.

External Loading on Teradata, Oracle, and Sybase IQ
You can use the External Loader session option to integrate external loading with a session. If you have a Teradata target database, you can use the Teradata external loader utility to bulk load target files.
If your target database runs on Oracle, you can use the Oracle SQL*Loader utility to bulk load target files. When you load data to an Oracle database using a partitioned session, you can increase performance if you create the Oracle target table with the same number of partitions you use for the session. If your target database runs on Sybase IQ, you can use the Sybase IQ external loader utility to bulk load target files. If your Sybase IQ database is local to the Informatica Server on your UNIX system, you can increase performance by loading data to target tables directly from named pipes. Use pmconfig to enable the SybaseIQLocaltoPMServer option. When you enable this option, the Informatica Server loads data directly from named pipes rather than writing to a flat file for the Sybase IQ external loader.

Increasing Database Network Packet Size
You can increase the network packet size in the Informatica Server Manager to reduce a target bottleneck. For Sybase and Microsoft SQL Server, increase the network packet size to 8K to 16K. For Oracle, increase the network packet size in tnsnames.ora and listener.ora. If you increase the network packet size in the Informatica Server configuration, you also need to configure the database server network memory to accept larger packet sizes.

Optimizing Oracle Target Databases
If your target database is Oracle, you can optimize it by checking the storage clause, space allocation, and rollback segments. When you write to an Oracle database, check the storage clause for database objects. Make sure that tables use large initial and next values. The database should also store table and index data in separate tablespaces, preferably on different disks. When you write to Oracle target databases, the database uses rollback segments during loads. Make sure that the database stores rollback segments in appropriate tablespaces, preferably on different disks.
The rollback segments should also have appropriate storage clauses. You can optimize the Oracle target database by tuning the Oracle redo log. The Oracle database uses the redo log to log loading operations. Make sure that the redo log size and buffer size are optimal. You can view redo log properties in the init.ora file. If your Oracle instance is local to the Informatica Server, you can optimize performance by using the IPC protocol to connect to the Oracle database. You can set up the Oracle database connection in listener.ora and tnsnames.ora.

Improving Performance at the Mapping Level
Optimizing Datatype Conversions
Forcing the Informatica Server to make unnecessary datatype conversions slows performance. For example, if your mapping moves data from an Integer column to a Decimal column, then back to an Integer column, the unnecessary datatype conversions slow performance. Where possible, eliminate unnecessary datatype conversions from mappings. Some datatype conversions can improve system performance: use integer values in place of other datatypes when performing comparisons using Lookup and Filter transformations. For example, many databases store U.S. zip code information as a Char or Varchar datatype. If you convert your zip code data to an Integer datatype, the lookup database stores the zip code 94303-1234 as 943031234. This increases the speed of lookup comparisons based on zip code.

Optimizing Lookup Transformations
If a mapping contains a Lookup transformation, you can optimize the lookup. To increase performance, you can cache the lookup table, optimize the lookup condition, or index the lookup table.

Caching Lookups
If a mapping contains Lookup transformations, you might want to enable lookup caching. In general, you want to cache lookup tables that need less than 300 MB. When you enable caching, the Informatica Server caches the lookup table and queries the lookup cache during the session.
When this option is not enabled, the Informatica Server queries the lookup table on a row-by-row basis. You can increase performance by using a shared or persistent cache:
• Shared cache: You can share the lookup cache between multiple transformations. You can share an unnamed cache between transformations in the same mapping, and a named cache between transformations in the same or different mappings.
• Persistent cache: If you want to save and reuse the cache files, you can configure the transformation to use a persistent cache. Use this feature when you know the lookup table does not change between session runs. A persistent cache can improve performance because the Informatica Server builds the memory cache from the cache files instead of from the database.

Reducing the Number of Cached Rows
Use the Lookup SQL Override option to add a WHERE clause to the default SQL statement. This reduces the number of rows included in the cache.

Optimizing the Lookup Condition
If you include more than one lookup condition, place the conditions with an equal sign first to optimize lookup performance.

Indexing the Lookup Table
The Informatica Server needs to query, sort, and compare values in the lookup condition columns, so the index needs to include every column used in a lookup condition. You can improve performance for both cached and uncached lookups:
• Cached lookups: Improve performance by indexing the columns in the lookup ORDER BY. The session log contains the ORDER BY statement.
• Uncached lookups: Because the Informatica Server issues a SELECT statement for each row passing into the Lookup transformation, improve performance by indexing the columns in the lookup condition.

Improving Performance at the Repository Level
Tuning Repository Performance
The PowerMart and PowerCenter repository has more than 80 tables, and almost all of them use one or more indexes to speed up queries.
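Before moving on to repository tuning: the persistent lookup cache described above can be sketched in plain Python. This is not Informatica behaviour, only an analogy; the cache file name and lookup data are hypothetical.

```python
# Sketch of a persistent lookup cache: build the lookup dict once, save it to
# a cache file, and on later runs load the file instead of re-querying the
# database.

import os
import pickle
import tempfile

def query_lookup_table():
    """Stand-in for the database query that builds the lookup cache."""
    return {101: "Finance", 102: "Sales"}

def get_lookup_cache(cache_file):
    if os.path.exists(cache_file):           # persistent cache hit:
        with open(cache_file, "rb") as f:    # build from file, skip the database
            return pickle.load(f), "file"
    cache = query_lookup_table()             # first run: query the database
    with open(cache_file, "wb") as f:
        pickle.dump(cache, f)
    return cache, "database"

path = os.path.join(tempfile.mkdtemp(), "lookup.cache")
first, src1 = get_lookup_cache(path)         # first session run
second, src2 = get_lookup_cache(path)        # second session run
assert (src1, src2) == ("database", "file")  # second run avoided the database
assert first == second
```

As the text notes, this only pays off when the lookup table is known not to change between runs; a stale cache file returns stale lookup results.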
Most databases keep and use column distribution statistics to determine which index to use to execute SQL queries optimally. Database servers do not update these statistics continuously. In frequently used repositories, these statistics can become outdated very quickly, and SQL query optimizers may choose a less than optimal query plan. In large repositories, choosing a sub-optimal query plan can affect performance drastically; over time, the repository becomes slower and slower. To optimize SQL queries, update these statistics regularly. How often depends on how heavily the repository is used. Updating statistics is done table by table; the database administrator can create scripts to automate the task. You can use the following information to generate scripts that update distribution statistics.
Note: All PowerMart/PowerCenter repository table and index names begin with "OPB_".

Oracle Database
You can generate scripts to update distribution statistics for an Oracle repository.
To generate scripts for an Oracle repository:
1. Run the following queries:
select 'analyze table ', table_name, ' compute statistics;' from user_tables where table_name like 'OPB_%'
select 'analyze index ', INDEX_NAME, ' compute statistics;' from user_indexes where INDEX_NAME like 'OPB_%'
This produces output like the following:
analyze table OPB_ANALYZE_DEP compute statistics;
analyze table OPB_ATTR compute statistics;
analyze table OPB_BATCH_OBJECT compute statistics;
2. Save the output to a file.
3. Edit the file and remove all the header lines (the column headings and dashed underlines that precede the statements).
4. Run the file as an SQL script. This updates the repository table statistics.
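The script-generation step above can also be done outside the database. Here is a sketch in Python that, given a list of table names (hypothetical here), emits the same ANALYZE statements and skips anything that is not a repository table:

```python
# Sketch: generate "analyze table ... compute statistics;" statements for
# repository tables, which all begin with the OPB_ prefix.

def analyze_statements(table_names):
    return [f"analyze table {t} compute statistics;"
            for t in table_names if t.startswith("OPB_")]

tables = ["OPB_ANALYZE_DEP", "OPB_ATTR", "SOME_OTHER_TABLE"]
stmts = analyze_statements(tables)
assert stmts == [
    "analyze table OPB_ANALYZE_DEP compute statistics;",
    "analyze table OPB_ATTR compute statistics;",
]
```

Filtering on the "OPB_" prefix in code plays the same role as the LIKE 'OPB_%' predicate in the queries above.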
Microsoft SQL Server
You can generate scripts to update distribution statistics for a Microsoft SQL Server repository.
To generate scripts for a Microsoft SQL Server repository:
1. Run the following query:
select 'update statistics ', name from sysobjects where name like 'OPB_%'
This produces output like the following:
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT
2. Save the output to a file.
3. Edit the file and remove the header lines.
4. Add a "go" at the end of the file.
5. Run the file as a SQL script. This updates the repository table statistics.

Improving Performance at the Session Level
Optimizing the Session
Once you optimize your source database, target database, and mapping, you can focus on optimizing the session. You can perform the following tasks to improve overall performance:
• Run concurrent batches.
• Partition sessions.
• Reduce error tracing.
• Remove staging areas.
• Tune session parameters.
Table 19-1 lists the settings and values you can use to improve session performance:

Table 19-1. Session Tuning Parameters
Setting               Default Value             Suggested Minimum    Suggested Maximum
DTM Buffer Pool Size  12,000,000 bytes [12 MB]  6,000,000 bytes      128,000,000 bytes
Buffer block size     64,000 bytes [64 KB]      4,000 bytes          128,000 bytes
Index cache size      1,000,000 bytes           1,000,000 bytes      12,000,000 bytes
Data cache size       2,000,000 bytes           2,000,000 bytes      24,000,000 bytes
Commit interval       10,000 rows               N/A                  N/A
Decimal arithmetic    Disabled                  N/A                  N/A
Tracing Level         Normal                    Terse                N/A

How to Correct and Load Rejected Files When the Session Completes
During a session, the Informatica Server creates a reject file for each target instance in the mapping. If the writer or the target rejects data, the Informatica Server writes the rejected row into the reject file.
By default, the Informatica Server creates reject files in the directory named by the $PMBadFileDir server variable. The reject file and the session log contain information that helps you determine the cause of the reject. You can correct reject files and load them into relational targets using the Informatica reject loader utility. The reject loader also creates another reject file for the data that the writer or target rejects during the reject loading.
Complete the following tasks to load reject data into the target:
1. Locate the reject file.
2. Correct the bad data.
3. Run the reject loader utility.
NOTE: You cannot load rejected data into a flat file target.
After you locate a reject file, you can read it using a text editor that supports the reject file code page. Reject files contain rows of data rejected by the writer or the target database. Though the Informatica Server writes the entire row in the reject file, the problem generally centers on one column within the row. To help you determine which column caused the row to be rejected, the Informatica Server adds row and column indicators to give you more information about each column:
• Row indicator: The first column in each row of the reject file is the row indicator. The numeric indicator tells whether the row was marked for insert, update, delete, or reject.
• Column indicator: Column indicators appear after every column of data. The alphabetical character indicators tell whether the data was valid, overflow, null, or truncated.
The following sample reject file shows the row and column indicators:
3,D,1,D,,D,0,D,1094945255,D,0.00,D,-0.00,D
0,D,1,D,April,D,1997,D,1,D,-1364.22,D,-1364.22,D
0,D,1,D,April,D,2000,D,1,D,2560974.96,D,2560974.96,D
3,D,1,D,April,D,2000,D,0,D,0.00,D,0.00,D
0,D,1,D,August,D,1997,D,2,D,2283.76,D,4567.53,D
0,D,3,D,December,D,1999,D,1,D,273825.03,D,273825.03,D
0,D,1,D,September,D,1997,D,1,D,0.00,D,0.00,D

Row Indicators
The first column in the reject file is the row indicator.
The number listed as the row indicator tells the writer what to do with the row of data. Table 15-1 describes the row indicators in a reject file:

Table 15-1. Row Indicators in Reject File
Row Indicator   Meaning   Rejected By
0               Insert    Writer or target
1               Update    Writer or target
2               Delete    Writer or target
3               Reject    Writer

If a row indicator is 3, the writer rejected the row because an update strategy expression marked it for reject. If a row indicator is 0, 1, or 2, either the writer or the target database rejected the row. To narrow down the reason why rows marked 0, 1, or 2 were rejected, review the column indicators and consult the session log.

Column Indicators
After the row indicator is a column indicator, followed by the first column of data, and another column indicator. Column indicators appear after every column of data and define the type of the data preceding it. Table 15-2 describes the column indicators in a reject file:

Table 15-2. Column Indicators in Reject File
D - Valid data. Good data; the writer passes it to the target database. The target accepts it unless a database error occurs, such as finding a duplicate key.
O - Overflow. Numeric data exceeded the specified precision or scale for the column. Bad data, if you configured the mapping target to reject overflow or truncated data.
N - Null. The column contains a null value. Good data; the writer passes it to the target, which rejects it if the target database does not accept null values.
T - Truncated. String data exceeded a specified precision for the column, so the Informatica Server truncated it. Bad data, if you configured the mapping target to reject overflow or truncated data.

After you correct the target data in each of the reject files, append ".in" to each reject file you want to load into the target database. For example, after you correct the reject file t_AvgSales_1.bad, you can rename it t_AvgSales_1.bad.in.
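The reject-file layout above (row indicator first, then value/indicator pairs) can be sketched as a small parser. This is an illustrative reader, not an Informatica utility; it assumes the default comma delimiter.

```python
# Sketch: parse one reject-file line into its row indicator and the
# (value, column-indicator) pairs that follow it.

ROW_MEANING = {"0": "Insert", "1": "Update", "2": "Delete", "3": "Reject"}

def parse_reject_line(line, delimiter=","):
    fields = line.strip().split(delimiter)
    row_indicator = fields[0]                # first field: 0/1/2/3
    columns = []                             # remaining fields alternate
    for i in range(2, len(fields), 2):       # value, indicator, value, indicator...
        columns.append((fields[i], fields[i + 1]))
    return ROW_MEANING[row_indicator], columns

meaning, cols = parse_reject_line("3,D,1,D,April,D,2000,D,0,D,0.00,D,0.00,D")
assert meaning == "Reject"                   # marked for reject by update strategy
assert cols[1] == ("April", "D")             # each value carries its indicator
```

Scanning the indicator of each pair for O, N, or T is how you would locate the column that caused the rejection, as the tables above describe.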
After you correct the reject file and rename it to reject_file.in, you can use the reject loader to send those files through the writer to the target database. Use the reject loader utility from the command line to load rejected files into target tables. The syntax for reject loading differs between UNIX and Windows NT/2000 platforms.
Use the following syntax on UNIX:
pmrejldr pmserver.cfg [folder_name:]session_name
Use the following syntax on Windows NT/2000:
pmrejldr [folder_name:]session_name

Recovering Sessions
If you stop a session, or if an error causes a session to stop, refer to the session and error logs to determine the cause of failure. Correct the errors, then complete the session. The method you use to complete the session depends on the properties of the mapping, the session, and the Informatica Server configuration. Use one of the following methods to complete the session:
• Run the session again if the Informatica Server has not issued a commit.
• Truncate the target tables and run the session again if the session is not recoverable.
• Consider performing recovery if the Informatica Server has issued at least one commit.
When the Informatica Server starts a recovery session, it reads the OPB_SRVR_RECOVERY table and notes the row ID of the last row committed to the target database. The Informatica Server then reads all sources again and starts processing from the next row ID. For example, if the Informatica Server commits 10,000 rows before the session fails, when you run recovery, the Informatica Server bypasses the rows up to 10,000 and starts loading with row 10,001. The commit point may be different for source- and target-based commits. By default, Perform Recovery is disabled in the Informatica Server setup. You must enable recovery in the Informatica Server setup before you run a session so the Informatica Server can create and write entries in the OPB_SRVR_RECOVERY table.

Causes for Session Failure
• Reader errors:
Errors encountered by the Informatica Server while reading the source database or source files. Reader threshold errors can include alignment errors while running a session in Unicode mode.
• Writer errors: Errors encountered by the Informatica Server while writing to the target database or target files. Writer threshold errors can include key constraint violations, loading nulls into a NOT NULL field, and database trigger responses.
• Transformation errors: Errors encountered by the Informatica Server while transforming data. Transformation threshold errors can include conversion errors and any condition set up as an ERROR, such as null input.

Fatal Errors
A fatal error occurs when the Informatica Server cannot access the source, target, or repository. This can include loss of connection or target database errors, such as lack of database space to load data. If the session uses a Normalizer or Sequence Generator transformation, the Informatica Server cannot update the sequence values in the repository, and a fatal error occurs.
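The recovery behaviour described under Recovering Sessions (skip everything up to the last committed row ID, resume at the next row) can be sketched as ordinary code. This is an analogy in Python, not the server's implementation; the row data is hypothetical.

```python
# Sketch: re-read the source but skip all rows up to the recorded commit
# point, resuming the load at the next row ID.

def recover_load(source_rows, last_committed_row_id):
    """Yield only the (row_id, row) pairs after the commit point."""
    for row_id, row in enumerate(source_rows, start=1):
        if row_id <= last_committed_row_id:
            continue                         # already committed in the failed run
        yield row_id, row

rows = [f"row-{i}" for i in range(1, 15001)]
resumed = list(recover_load(rows, last_committed_row_id=10000))
assert resumed[0] == (10001, "row-10001")    # loading resumes at row 10,001
assert len(resumed) == 5000
```

Note that the real server still reads all sources again, as the text says; it only skips the writes, which is why recovery is cheaper than a full reload but not free.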